Machine Learning Engineer
Experience Level: Mid Level
Full Job Description
We are seeking a hands-on Machine Learning Engineer with 4-6 years of experience to join our team in Bengaluru, Karnataka, India. This role focuses on building and fine-tuning large language models (LLMs) and transformer-based models, tackling complex problems at the intersection of ML research and production systems.
You will be involved in the entire ML development lifecycle, including data preparation, model fine-tuning, evaluation, and optimization. A strong understanding of what drives model performance and how to systematically improve it through experimentation is key. Experience with LLM fine-tuning techniques (LoRA, QLoRA), RLHF pipelines, and comprehensive model evaluation is highly desirable. We are looking for an individual with strong ownership, initiative, and a passion for developing production-ready ML models that will impact thousands of developers worldwide.
What You'll Do:
ML Model Development & Optimization
- Design and implement end-to-end LLMOps pipelines for model training, fine-tuning, and evaluation.
- Fine-tune and customize LLMs (e.g., Llama, Mistral, Gemma) using full fine-tuning and PEFT techniques (LoRA, QLoRA) with tools such as Unsloth, Axolotl, and HuggingFace Transformers.
- Implement Reinforcement Learning from Human Feedback (RLHF) pipelines for model alignment and preference optimization.
- Design experiments for automated hyperparameter tuning, training strategies, and model selection.
- Prepare and validate training datasets, ensuring data quality, preprocessing, and format correctness.
- Build comprehensive model evaluation systems with custom metrics (BLEU, ROUGE, perplexity, accuracy) and develop synthetic data generation pipelines.
- Optimize model accuracy, token efficiency, and training performance through systematic experimentation.
- Design and maintain prompt engineering workflows with version control systems.
- Deploy models using vLLM with multi-adapter LoRA serving, hot-swapping, and basic optimizations like speculative decoding, continuous batching, and KV cache management.
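The PEFT techniques named above all rest on the same idea: freeze the base weight matrix W and train only a low-rank update, so the effective weight is W + (α/r)·B·A. A minimal numeric sketch in plain Python, with illustrative shapes and values rather than anything from a real model:

```python
# LoRA sketch: the effective weight is W + (alpha / r) * B @ A,
# where B (d x r) and A (r x k) are the only trainable matrices.
# All dimensions and values below are illustrative.

def matmul(X, Y):
    """Naive matrix multiply for small illustrative matrices."""
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]

def lora_effective_weight(W, A, B, alpha, r):
    """Return W + (alpha / r) * B @ A without modifying the frozen W."""
    scale = alpha / r
    delta = matmul(B, A)  # d x k low-rank update
    return [[W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

# Frozen 2x2 base weight, rank-1 adapter (r=1), alpha=2.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]   # d x r
A = [[0.5, 0.5]]     # r x k
W_eff = lora_effective_weight(W, A, B, alpha=2.0, r=1)
print(W_eff)  # [[2.0, 1.0], [2.0, 3.0]]
```

With rank r much smaller than the weight dimensions, the trainable parameter count drops from d·k to r·(d + k), which is why libraries like PEFT, Unsloth, and Axolotl can fine-tune large models on modest GPUs.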
ML Operations & Technical Leadership
- Set up ML-specific monitoring for model quality, drift detection, and performance tracking, with automated retraining triggers.
- Manage model versioning, artifact storage, lineage tracking, and reproducibility using experiment tracking tools.
- Debug production model issues and optimize cost-performance trade-offs for training and inference.
- Collaborate with infrastructure engineers on ML-specific compute requirements and deployment pipelines.
- Document model development processes and share knowledge through internal tech talks.
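The drift-detection responsibility above is often implemented by comparing a production feature or output distribution against a training-time baseline. One hedged sketch, using a Population Stability Index over pre-binned counts (the binning and the 0.2 threshold are common rules of thumb, not a prescribed approach):

```python
import math

def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between two binned distributions.

    PSI = sum over bins of (a_i - e_i) * ln(a_i / e_i), where e_i and
    a_i are bin proportions. A common rule of thumb treats PSI > 0.2
    as significant drift (e.g., a candidate retraining trigger).
    """
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_p = max(e / e_total, eps)  # clamp to avoid log(0)
        a_p = max(a / a_total, eps)
        total += (a_p - e_p) * math.log(a_p / e_p)
    return total

# Identical distributions -> PSI of 0 (no drift).
print(psi([100, 100, 100], [50, 50, 50]))  # 0.0
```

In a monitoring pipeline this would run on a schedule, with the PSI value logged alongside quality metrics and an alert or retraining job fired when it crosses the chosen threshold.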
Technical Skills & Experience:
We encourage you to apply if you meet some of these requirements and are eager to learn the rest.
- 4-6 years of hands-on experience in machine learning engineering or applied ML roles.
- Strong fine-tuning experience with modern LLMs, including practical knowledge of transformer architectures, attention mechanisms, and PEFT techniques (LoRA/QLoRA).
- Deep understanding of transformer model architectures and their modern variants (MoE, Grouped-Query Attention, Flash Attention, state space models).
- Production ML experience, including building and fine-tuning models for real-world applications.
- Proficiency in Python and ML frameworks such as PyTorch, HuggingFace Transformers, PEFT, and TRL, with hands-on experience in tools like Unsloth and Axolotl.
- Experience building model evaluation systems with metrics like BLEU, ROUGE, perplexity, and accuracy.
- Hands-on experience with prompt engineering, synthetic data generation, and data preprocessing pipelines.
- Basic deployment experience with vLLM, including multi-adapter serving, hot-swapping, and inference optimizations.
- Understanding of GPU computing concepts such as memory management, multi-GPU training, mixed precision, and gradient accumulation.
- Strong debugging skills for training failures, OOM errors, convergence issues, and data quality problems.
- Experience with model alignment techniques (RLHF, DPO) and implementing RLHF pipelines is highly desirable.
- Experience with distributed training (DeepSpeed, FSDP, DDP) is a plus.
- Knowledge of model quantization techniques (GPTQ, AWQ) and their impact on model quality is desirable.
- Prior experience with AWS SageMaker and experiment-tracking tools such as MLflow or Weights & Biases is a strong plus.
- Exposure to cloud platforms (AWS/GCP/Azure) for training workloads is beneficial.
- Familiarity with Docker containerization for reproducible training environments.
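Several of the evaluation skills listed above reduce to short formulas. Perplexity, for instance, is just the exponential of the mean negative log-likelihood per token. A minimal sketch with made-up log-probabilities:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean(log p(token))) over a sequence.

    Lower is better; a model that spreads probability uniformly over
    V choices at every step scores a perplexity of V.
    """
    n = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / n)

# A model assigning probability 1/4 to each of 4 tokens:
lp = [math.log(0.25)] * 4
print(perplexity(lp))  # ~4.0, matching a uniform 4-way distribution
```

In practice the per-token log-probabilities would come from a framework such as PyTorch or HuggingFace Transformers rather than being written by hand; the formula itself is the same.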
Preferred Attributes:
- High ownership, self-driven, and a bias for action.
- Strong strategic thinking and the ability to connect technical decisions to business impact.
- Excellent communication and mentoring skills.
- Thrives in ambiguous, fast-paced environments and early-stage startup cultures.
Why Join AION?
- Work directly with high-pedigree founders shaping technical and product strategy.
- Contribute to building the infrastructure powering the future of AI compute globally.
- Significant ownership and impact with equity reflective of your contributions.
- Competitive compensation, flexible work options, and wellness benefits.
If you are a machine learning engineer ready to lead ML-as-a-Service (MLaaS) architecture efforts and scale next-generation AI infrastructure, we encourage you to apply. Please include the following in your application:
- Your resume highlighting relevant projects and leadership experience.
- Links to products, code (GitHub), or demos you have built.
- A brief note explaining why AION’s mission excites you.
Company
AION
AION is pioneering a decentralized AI cloud platform designed for high-performance computing (HPC). We are transforming the future of compute by democratizing access and offering managed services, aim...