Amit Yadav

Verified Expert in Engineering

MLOps Developer

Lucknow, Uttar Pradesh, India

Toptal member since August 16, 2024

Expertise

Machine Learning, Artificial Intelligence, Deep Learning, TensorFlow, Python, Cloud Engineering, Docker, Grafana, LLM, MySQL, PyTorch, Kubernetes, Google Cloud Development, Computer Vision

Bio

Amit is a certified Google Cloud Platform professional cloud architect. He is a senior ML engineer and data scientist with years of experience in MLOps, infrastructure development, automation, and Python development. Skilled in AI research and analyst environments, Amit exhibits excellent organizational and problem-solving skills and works well in team environments.

Portfolio

Publicis Groupe

Machine Learning Operations (MLOps), Kubeflow, ML Pipelines...

General Mills

Artificial Intelligence (AI), Google Cloud Platform (GCP), Docker, Kubeflow...

Suzlon

Azure, Azure ML Studio, CI/CD Pipelines, Kubeflow, Kubernetes, Docker...

Experience

Machine Learning - 9 years
Model Monitoring - 8 years
ML Pipelines - 5 years
Large Language Models (LLMs) - 4 years
Kubernetes - 3 years
Kubeflow - 3 years
Milvus - 2 years
Large Language Model Operations (LLMOps) - 2 years

Preferred Environment

Kubeflow, ML Pipelines, Model Monitoring, Model Deployment, MySQL, Deep Learning, Machine Learning, TensorFlow, PyTorch, Python 3

The most amazing...

...thing I've developed is a conversational application that uses LLMs to complete and answer users' requests.

Work Experience

Machine Learning Operations (MLOps) Engineer

2023 - 2024

Publicis Groupe

Built and fine-tuned large language models (LLMs) with LoRA and generative AI. Developed the SAGE project using a RAG pipeline, LLMs, and prompt engineering. Deployed an MLOps pipeline with Kubeflow, TensorFlow Extended (TFX), and Docker.
Worked with cross-functional teams to integrate AI into the hybrid integration platform (HIP). Automated descriptive analysis using Python, GCP Composer, and unit tests. Designed the automated ML pipeline with Kubeflow, TFX, Docker, and BigQuery.
Automated 1D CNN pipelines for wind turbine anomaly detection, deployed with Azure CI/CD and Docker. Streamlined data preprocessing using Python scripts.
Developed the YOLO (you only look once) model for turbine blade defect detection; deployed in the cloud and integrated with drones.

Technologies: Machine Learning Operations (MLOps), Kubeflow, ML Pipelines, Large Language Models (LLMs), Generative Artificial Intelligence (GenAI), Kubernetes, Vertex AI, Azure, Google Cloud Platform (GCP), Python, Data Science, Google AI Platform, Agentic AI

MLOps Engineer (GCP)

2022 - 2023

General Mills

Designed and implemented a fully automated ML pipeline using Kubeflow, orchestrated with TensorFlow Extended. Containerized ML components using Docker and utilized BigQuery as the data warehouse for seamless data management.
Orchestrated data and ML pipelines with Airflow and GCP Composer as managed services for demand forecasting data pipelines. Leveraged BigQuery for data warehousing, ensuring efficient data flow and processing within the pipeline.
Implemented model serving and monitoring to detect model and data drift, ensuring the models' performance and accuracy remain optimal over time.
Utilized GCP Cloud Functions for event-driven data processing and orchestration tasks, enhancing pipeline flexibility and scalability.
Implemented real-time data ingestion and processing using Cloud Pub/Sub, enabling timely and efficient data flow through the ML pipelines.
Integrated Cloud Storage for scalable and durable storage of datasets and model artifacts, ensuring high availability and accessibility.

Technologies: Artificial Intelligence (AI), Google Cloud Platform (GCP), Docker, Kubeflow, ML Pipelines, BigQuery, Composer, TensorFlow Serving, TensorFlow, Vertex AI, GitHub Actions, CI/CD Pipelines, Google Cloud Functions, Firebase, Cloud Pub/Sub, Apache Airflow, Python, Data Science, Google AI Platform, XGBoost, Statistical Modeling

MLOps Engineer

2021 - 2022

Suzlon

Developed and automated deep learning 1D CNN model pipelines to detect anomalies in wind turbine gearboxes and main bearing failures. Implemented these pipelines in Azure CI/CD with Docker containers for seamless integration and deployment.
Developed and deployed machine learning and time series models to detect main bearing failures from oil sample data, enhancing predictive maintenance capabilities and orchestrating the pipeline.
Created automated data preprocessing tasks using Python scripts to streamline the data preparation workflow.
Designed and implemented reinforcement learning agents using DDPG (Deep Deterministic Policy Gradient) for optimizing real-time wind turbine torque control under varying wind conditions, balancing energy output with mechanical wear reduction.
Developed discrete action-space agents using Q-learning to model turbine fault mitigation and adaptive maintenance decision policies, enabling proactive interventions in simulated wind farm scenarios.

Technologies: Azure, Azure ML Studio, CI/CD Pipelines, Kubeflow, Kubernetes, Docker, Computer Vision, Machine Learning, Forecasting, MLflow, Databricks, Snowflake, Flask, Python, Azure SQL Databases, Data Science, Q-learning, DDPG, Reinforcement Learning, Random Forests, XGBoost, Statistical Modeling

Data Scientist

2020 - 2021

thyssenkrupp

Created CI/CD pipelines in Azure for machine learning models, ensuring efficient deployment and integration into production environments.
Developed and trained a Facebook Prophet model to forecast demand for car parts for the client “Volkswagen Portugal,” improving inventory management and planning.
Integrated unit tests in GitLab for MLOps CI/CD pipelines, ensuring robustness and reliability of the machine learning workflows. Developed and deployed a computer vision model using YOLOv5 to detect defects in conveyor belts in cement plants.
Developed discrete action-space agents using Q-learning to model turbine fault mitigation and adaptive maintenance decision policies, enabling proactive interventions in simulated wind farm scenarios.

Technologies: Forecasting, OpenCV, Amazon Forecast, Machine Learning Operations (MLOps), ETL, Docker, Azure Databricks, Azure, CI/CD Pipelines, MLflow, Machine Learning, Python, Data Science, Q-learning, Reinforcement Learning, Random Forests, XGBoost, Statistical Modeling

Data Analyst

2017 - 2019

Convergys

Performed predictive analytics on telecommunication data to forecast trends and improve customer retention strategies. Applied machine learning techniques, including data processing, supervised learning, and unsupervised learning.
Built a conversational chatbot to provide elementary information to telecommunication customers, utilizing NLP techniques for effective communication.
Fetched data from Amazon Redshift using SQL queries, ensuring seamless data retrieval for analysis and reporting.
Collected, interpreted, and analyzed large datasets to derive actionable insights for business decision-making.

Technologies: Amazon SageMaker, Amazon SageMaker Pipelines, Tableau, Data Lakes, Data Warehousing, Natural Language Processing (NLP), Machine Learning, Classification, Regression, Statistics, Python, Data Science, Random Forests, XGBoost, Statistical Modeling

Experience

Senior AI/MLOps & Platform Engineering

I played a key role in architecting and operating a production-grade Kubeflow platform on Azure Kubernetes Service (AKS), with deep focus on pipeline orchestration, execution internals, and performance optimization. I designed Kubeflow Pipelines by building reusable components, compiling them into pipeline Intermediate Representation (IR), and validating execution graphs orchestrated by Argo Workflows. I analyzed Argo-generated DAGs to optimize task dependencies, enable parallel execution, and tune retries and failure handling. I implemented pipeline-level optimizations, including step caching, controlled parallelism, and fine-grained resource requests and limits, to improve runtime efficiency and cluster utilization. On the infrastructure side, I worked directly with AKS to debug pod scheduling, image pull flows, service accounts, and runtime failures. I created and integrated Azure Container Registries (ACR) with GitHub Actions CI/CD to build, version, and push images consumed by Kubeflow tasks. I enabled secure multi-tenancy using Kubeflow Profiles mapped to Azure Active Directory (AAD) groups, providing namespace-level isolation and controlled access.

Multi-agent System for Charles Schwab | Google GenAI

Spearheaded the end-to-end development and deployment of a sophisticated multi-agent system for Google's client Charles Schwab, leveraging Google's cutting-edge AI technologies. The core of the project involved using the Agent Development Kit (ADK) to build intelligent agents powered by the Gemini 2.5 Pro model.

My contributions included:

Architecture and Development: Designed and implemented a complex multi-agent architecture, focusing on creating custom tools to empower agents with specific functionalities. I engineered robust systems for information sharing among sub-agents and handled persistent memory bank management and user session management to ensure contextual and coherent interactions.

Deployment and Evaluation: Successfully deployed the agent system on the Vertex AI Agent Engine platform, utilizing Cloud Run for scalable and efficient operation. A critical part of my role was conducting rigorous performance and accuracy assessments. I developed and executed a detailed evaluation strategy using both automated pytest frameworks and advanced Google GenAI evaluation techniques to validate agent effectiveness and reliability.

AutoFlow Agentic Framework

Developed an advanced agentic news digest system leveraging the Agent Auto Flow (AF2) framework. This project orchestrated multiple specialized AI agents—including a researcher, summarizer, critic, designer, and a GroupChatManager—to autonomously collaborate on creating personalized news digests.

The UserProxyAgent enabled users to clearly define their news preferences, initiating seamless agent collaboration. The ResearcherAgent employed web scraping and search tools to fetch targeted articles from sources like BBC and TechCrunch, while the SummarizerAgent distilled complex articles into concise summaries through Gemini LLM integration. Quality assurance was managed by the CriticAgent, ensuring summaries closely aligned with user-specified topics and standards. The DesignerAgent structured content into engaging digests, enhancing readability with organized layouts and optional visual elements.

My role encompassed the end-to-end development of agent interactions, integration of AF2’s dynamic communication tools, prompt engineering for specialized agents, and ensuring robust coordination via GroupChatManager. This significantly improved content personalization and operational efficiency.

LangGraph Agentic Framework on GCP

I developed a Gemini-powered agentic framework using Agent Engine and LangGraph to automate anomaly detection workflows from internal reports and structured data. The system comprises specialized agents (retriever, planner, synthesizer, validator, and UX writer), each designed to handle a distinct phase of the reasoning process. I used Gemini Pro for inference and Gemini Embeddings with ChromaDB for high-recall semantic retrieval. The agents operated via finite-state logic, enabling iterative reasoning, task decomposition, and result validation. I enriched prompts with metadata from BigQuery to contextualize anomalies across patient, device, or lab data. To support testing and feedback, I built a Streamlit interface to visualize agent interactions in real time. The system enhanced anomaly traceability, reduced false positives, and streamlined report generation in a regulated healthcare setting.

Torque Control Optimization in Wind Turbines Using DDPG

I developed a DDPG-based reinforcement learning agent to optimize torque control in wind turbines under dynamic wind conditions.

PROBLEM STATEMENT
Traditional torque control logic resulted in inefficiencies in energy output and mechanical wear. A learning-based adaptive control mechanism was needed.

APPROACH
• Designed a custom continuous-action Gym environment simulating torque-wind dynamics.
• Defined a reward function balancing power output vs. bearing stress.
• Trained DDPG agents to learn optimal torque application strategies under noisy wind profiles.
• Integrated the agent into a simulated turbine SCADA environment.

IMPACT
• Improved simulated energy output by 12%.
• Reduced mechanical stress indicators by 8%, as per synthetic benchmark testing.
• Paved the way for self-adjusting turbine control systems.
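
As an illustration only, the reward trade-off described in the approach (power output vs. bearing stress) could be sketched as below. This is not the project code: the function name, the normalization constants, and the stress weight are all hypothetical, and the Gym environment and DDPG training loop are not reproduced here.

```python
def torque_reward(power_kw, bearing_stress, rated_power_kw=2000.0,
                  stress_limit=1.0, stress_weight=0.5):
    """Reward = normalized power minus a weighted, normalized stress penalty.

    Every parameter name and default value is an illustrative assumption.
    """
    # Clip normalized power to [0, 1] so over-production is not rewarded further.
    power_term = max(0.0, min(power_kw / rated_power_kw, 1.0))
    # Stress is expressed relative to an assumed limit; it is not clipped above,
    # so exceeding the limit keeps costing reward.
    stress_term = max(0.0, bearing_stress / stress_limit)
    return power_term - stress_weight * stress_term


# At equal power, the higher-stress operating point earns the lower reward.
print(torque_reward(1500.0, 0.2))
print(torque_reward(1500.0, 0.8))
```

The single stress weight is the knob that trades energy output against mechanical wear; a real setup would likely tune it against simulation results.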

Fault Mitigation Strategy Using Q-learning in Simulated Wind Farms

I implemented discrete-action Q-learning agents to make intelligent maintenance and fault mitigation decisions in a wind farm environment.

PROBLEM STATEMENT
Turbine faults, such as overheating or gearbox failure, led to production loss. Existing logic-based mitigation was reactive.

APPROACH
• Modeled turbine behavior and maintenance schedules using discrete state-action pairs.
• Implemented Q-learning to learn optimal actions like shutdown, continue, or schedule inspection.
• Simulated multiple episodes with stochastic fault occurrences and repair costs.

INNOVATIONS
• Designed fault-specific state representations for Q-table optimization.
• Introduced reward shaping to penalize downtime and promote preventive actions.

RESULTS
• Reduced cumulative maintenance cost by around 20% in simulated runs.
• Increased uptime and aligned maintenance with fault trends.
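
A minimal sketch of the tabular Q-learning idea described above, using a toy three-state turbine model. The states, transition probabilities, rewards, and hyperparameters are invented for illustration and do not come from the project; only the update rule is the standard Q-learning rule.

```python
import random

STATES = ("healthy", "degraded", "failed")
ACTIONS = ("continue", "inspect", "shutdown")


def step(state, action, rng):
    """Toy transition model; every probability and cost here is made up."""
    if state == "failed":
        # Only a shutdown (repair) recovers a failed turbine.
        return ("healthy", -50.0) if action == "shutdown" else ("failed", -20.0)
    if action == "shutdown":
        return "healthy", -10.0   # lost production, turbine reset
    if action == "inspect":
        return "healthy", -2.0    # small cost, fault caught early
    # action == "continue": earn revenue but risk degradation or failure
    if state == "healthy":
        return ("degraded", 5.0) if rng.random() < 0.2 else ("healthy", 5.0)
    return ("failed", 5.0) if rng.random() < 0.5 else ("degraded", 5.0)


def train(episodes=2000, horizon=30, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(episodes):
        state = "healthy"
        for _ in range(horizon):
            # Epsilon-greedy action selection.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            next_state, reward = step(state, action, rng)
            best_next = max(q[(next_state, a)] for a in ACTIONS)
            # Standard Q-learning update.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q


q_table = train()
policy = {s: max(ACTIONS, key=lambda a: q_table[(s, a)]) for s in STATES}
print(policy)
```

The negative rewards for staying in the failed state play the role of the downtime penalty mentioned under reward shaping.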

Demand Forecasting Machine Learning Operations (MLOps) Orchestration

Developed a complete MLOps system using Kubeflow and Vertex AI on Google Cloud Platform. A GitLab repository held all the code and the project directory structure. The project directory contains a subdirectory for each stage or component of the MLOps system: data ingestion, data preprocessing, model training, model validation, model deployment, endpoint creation, model monitoring, pipeline upload, creation and scheduling of pipeline runs, and retraining on data drift.

The GitLab CI pipeline containerized each component with Docker, with stages such as Docker build and push to GCR, and was triggered automatically on a push to a branch and on merges into the main branch. In the Kubeflow pipeline orchestration system, the Kubeflow Pipelines domain-specific language (DSL) was used to define each component and the pipeline. Components within the pipeline reference these Docker images. Once created, the pipeline is submitted to the Kubeflow dashboard, where its runs can be viewed. Prometheus and Grafana dashboards monitor the model, and the retraining pipeline runs when model drift is detected.

Generative AI Application

I developed an innovative application that dynamically generates code based on user requests. This app integrates with multiple data sources, including PDFs, Excel sheets, comma-separated values (CSV) files, and tabular data stored in data warehouses.

To enhance the accuracy and relevance of the generated code, I implemented a retrieval-augmented generation (RAG) pipeline. This pipeline improves the model's performance by first building a contextual understanding of the user's request from the existing database. The context is then fed into the model, significantly boosting the accuracy and relevance of the responses. The application employs two fine-tuned large language models (LLMs): Llama and Generative Pre-trained Transformer 2 (GPT-2). These models were fine-tuned explicitly on objective data using low-rank adaptation (LoRA) adapters, which allow for more efficient and targeted learning. As a result, the product is highly effective at generating accurate and context-aware code snippets and solutions tailored to developers' needs, enhancing their productivity and streamlining the coding process.
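
The retrieve-then-prompt step of a RAG pipeline can be sketched roughly as follows. This is a simplified stand-in, not the application's code: a bag-of-words vector replaces a real embedding model, and the sample documents, function names, and prompt template are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in embedding: a bag-of-words count vector (an assumption for brevity)."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, documents, k=2):
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query, documents, k=2):
    """Prepend the retrieved context to the user's request."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, k))
    return f"Context:\n{context}\n\nRequest: {query}\nAnswer with code:"


docs = [
    "table sales has columns region month revenue",
    "table employees has columns name department salary",
    "the report pdf describes quarterly revenue by region",
]
print(build_prompt("total revenue by region", docs))
```

In a production setup the `embed` function would call an embedding model and the ranking would be delegated to a vector store; the prompt assembly stays conceptually the same.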

LLMOps Pipeline with vLLM Serving and Kubeflow on Azure Cluster

I led the design and implementation of a robust LLMOps pipeline to process and serve clinical documents and scientific reports with enterprise-grade scalability, compliance, and observability. Using Kubeflow Pipelines, I orchestrated the full ML workflow—including data ingestion, embedding generation with GPT embeddings, vector indexing in Milvus, and prompt-based querying via a retrieval-augmented generation (RAG) architecture using LangChain. I deployed a quantized LLaMA 2 model inside an AKS cluster using vLLM, enabling cost-effective and low-latency inference without relying on external APIs. I integrated ML Metadata (MLMD) and BigQuery for tracking model lineage, data flow, and usage analytics. Prompt traces and output monitoring were handled through Langfuse, while Presidio ensured data privacy by detecting and redacting PII. The pipeline was containerized with Docker, deployed to Kubernetes, and automated via GitHub Actions. This system accelerated document-backed decision-making for Roche’s clinical teams while ensuring compliance and traceability.

Predictive Forecasting & Industrial Operations Optimization

At Suzlon, I led predictive analytics initiatives for mission-critical wind turbine operations. I designed and deployed time-series and deep learning models (1D CNN and statistical forecasting with ARIMA and SARIMA) to detect main bearing and gearbox failures using SCADA and oil-sample data. By applying FFT and signal-processing techniques, I transformed raw vibration signals into actionable failure predictions, reducing unplanned downtime and optimizing maintenance cycles.
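
To show the kind of frequency-domain feature such signal processing yields, here is a small sketch that finds the dominant frequency of a synthetic vibration signal. It uses a naive discrete Fourier transform from the standard library rather than an optimized FFT, and the signal, sample rate, and function name are made up for the example.

```python
import cmath
import math


def dominant_frequency(signal, sample_rate_hz):
    """Return the frequency (Hz) of the strongest non-DC component via a naive DFT."""
    n = len(signal)
    best_bin, best_mag = 0, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(signal))
        if abs(coeff) > best_mag:
            best_bin, best_mag = k, abs(coeff)
    return best_bin * sample_rate_hz / n


rate = 64  # samples per second (synthetic)
samples = [
    math.sin(2 * math.pi * 8 * t / rate) + 0.3 * math.sin(2 * math.pi * 20 * t / rate)
    for t in range(rate)
]
print(dominant_frequency(samples, rate))  # the 8 Hz component dominates
```

A shift in such spectral peaks over time is one example of a feature that a downstream failure-prediction model could consume.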

I architected ETL workflows using Azure Data Factory and built scalable data pipelines across Databricks and Snowflake to process high-volume turbine telemetry. Models were containerized with Docker and deployed via Azure CI/CD for production-grade integration. I also built Suzlatics, a predictive monitoring platform visualizing turbine health metrics in near real-time.

This role strengthened my expertise in time-series forecasting, anomaly detection, constraint-based maintenance planning, and operational optimization in high-risk industrial environments.

Demand Forecasting & Cloud Data Pipelines | General Mills

At General Mills, I designed demand forecasting pipelines entirely on GCP using BigQuery, Vertex AI, Kubeflow, and Airflow (Composer). I built reproducible ML workflows for feature engineering, training, deployment, and monitoring, enabling accurate sales and supply chain forecasting.

I provisioned GKE clusters and containerized ML components for scalable deployment. Using BigQuery as a centralized warehouse, I developed data pipelines handling structured historical sales data and automated retraining workflows to adapt to seasonality and demand shifts.

I implemented model monitoring and drift-detection strategies to maintain forecast stability in production. CI/CD automation with GitHub Actions ensured reliable testing and deployment.
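
One common way to flag data drift is the population stability index (PSI). The sketch below is an illustration of that general technique, not a description of the checks actually used in this project; the sample data, bin count, and the 0.2 threshold (a widely quoted rule of thumb) are assumptions.

```python
import math


def psi(reference, live, bins=5, eps=1e-6):
    """Population stability index (PSI) between a reference and a live sample."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins
    if width == 0:
        width = 1.0  # degenerate case: all reference values identical

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    ref_p, live_p = proportions(reference), proportions(live)
    return sum((p_live - p_ref) * math.log(p_live / p_ref)
               for p_ref, p_live in zip(ref_p, live_p))


reference = list(range(100))            # made-up training-time feature values
shifted = [v + 50 for v in reference]   # made-up live values with a clear shift
DRIFT_THRESHOLD = 0.2                   # rule of thumb, an assumption here
print(psi(reference, shifted) > DRIFT_THRESHOLD)
```

A check like this could run on a schedule and trigger the retraining workflow when the index crosses the chosen threshold.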

This role deepened my expertise in cloud-native forecasting architectures, scalable ETL systems, and production-grade MLOps aligned with finance and operational decision-making.

Demand Forecasting for Car Parts | Volkswagen

At Thyssenkrupp, I worked as a Data Scientist on automotive demand forecasting for Volkswagen Portugal, focusing on building accurate and reliable time-series models to support production and financial planning. I developed forecasting models using Facebook Prophet, performing hyperparameter tuning to optimize seasonality adjustments, changepoint sensitivity, and holiday effects, thereby improving prediction accuracy across regions and demand cycles. BigQuery was used as the central cloud data warehouse to store historical sales, supplier, and operational datasets. I built SQL-based data transformations and preprocessing pipelines to clean, aggregate, and prepare large datasets for model training. To validate performance, I conducted backtesting and A/B comparisons between baseline statistical models and tuned Prophet models, measuring metrics such as MAPE and forecast variance. I also supported automated retraining workflows to ensure models adapted to demand shifts over time.
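
The backtesting comparison described above can be illustrated with a small rolling-origin backtest scored by MAPE. This is a sketch under stated assumptions: two simple baseline forecasters stand in for the baseline and tuned Prophet models, and the demand series and function names are invented.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent; zero actuals are skipped."""
    pairs = [(a, f) for a, f in zip(actual, forecast) if a != 0]
    if not pairs:
        raise ValueError("MAPE is undefined when every actual value is zero")
    return 100.0 * sum(abs((a - f) / a) for a, f in pairs) / len(pairs)


def rolling_backtest(series, forecaster, min_train=4):
    """One-step-ahead backtest: fit on series[:t], predict series[t]."""
    actual, predicted = [], []
    for t in range(min_train, len(series)):
        actual.append(series[t])
        predicted.append(forecaster(series[:t]))
    return mape(actual, predicted)


def naive_last(history):
    """Baseline: repeat the last observed value."""
    return history[-1]


def moving_average(history, window=3):
    """Alternative: mean of the last `window` observations."""
    return sum(history[-window:]) / window


demand = [100, 110, 105, 120, 125, 130, 128, 140]  # made-up monthly demand
print(rolling_backtest(demand, naive_last))
print(rolling_backtest(demand, moving_average))
```

Because each forecast only sees data before the point it predicts, the two scores are comparable out-of-sample estimates; swapping in a fitted model only requires a callable that maps a history to a one-step forecast.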

Education

2019 - 2019

Master's Degree in Artificial Intelligence

Aegis School of Business - Mumbai, India

Certifications

JUNE 2024 - JUNE 2026

Professional Cloud Architect

Google Cloud

Skills

Libraries/APIs

TensorFlow, XGBoost, PyTorch, OpenCV

Tools

Grafana, Google AI Platform, GCP Security, BigQuery, Composer, TensorFlow Serving, Apache Airflow, Azure ML Studio, Amazon SageMaker, Tableau, Azure Kubernetes Service (AKS), Windows ADK, ARIMA, SARIMA, Prophet ERP, Google Kubernetes Engine (GKE)

Languages

Python, Python 3, JavaScript, Snowflake, SQL

Platforms

Kubeflow, Vertex AI, Google Cloud Platform (GCP), Docker, Kubernetes, Azure, Firebase, Databricks, Cloud Run

Storage

Google Cloud, MySQL, Azure SQL Databases, Data Lakes, PostgreSQL

Frameworks

Multi-armed Bandits (MABs), Flask, Agentic Frameworks

Paradigms

ETL, Model Context Protocol (MCP)

Other

ML Pipelines, Model Monitoring, Model Deployment, Deep Learning, Machine Learning, Artificial Intelligence (AI), Machine Learning Operations (MLOps), Pipelines, Prometheus, Model Drift, Google Container Registry (GCR), Data Science, Random Forests, Statistical Modeling, Computer Vision, Big Data, Large Language Models (LLMs), Generative Artificial Intelligence (GenAI), Security, Endpoint Creation, Retrieval-augmented Generation (RAG), Llama 2, LoRA, Forecasting, Large Language Model Operations (LLMOps), Milvus, Q-learning, Agentic AI, Generative Pre-trained Transformer 2 (GPT-2), GitHub Actions, CI/CD Pipelines, Google Cloud Functions, Cloud Pub/Sub, MLflow, Amazon Forecast, Azure Databricks, Amazon SageMaker Pipelines, Data Warehousing, Natural Language Processing (NLP), Classification, Regression, Statistics, Autoflow, Gemini API, Prompt Engineering, Vector Search, ChromaDB, multimodel, AI Agents, Gemini, Google BigQuery, Multi-agent Systems, Vector Databases, Multistage LLM Chains, OpenAI GPT-4 API, DDPG, Reinforcement Learning, OpenAI, agent development kit, Multimodal GenAI, agent engine, agent evaluation, Agent Deployment, memory bank, A2A, SCADA, fb prophet, facebook prophet, Demand Forecasting, A/B Testing
