
Shivayogi Math Doddabasayya
Verified Expert in Product Management
Product Manager
Bengaluru, Karnataka, India
Toptal member since October 15, 2025
Shivayogi is a product leader with over 20 years of technology experience, including four years in AI/ML product development. He bridges the gap between cutting-edge AI capabilities and measurable business outcomes, transforming operational challenges into strategic advantages through intelligent automation. Shivayogi has led AI initiatives that significantly reduced operational costs, improved customer satisfaction by 28%, and automated 65% of repetitive workflows.
Expertise
- App Development
- Automation
- Cloud
- Continuous Deployment
- Continuous Integration (CI)
- Databases
- Kubernetes
- Python 3
Work Experience
Senior Lead
Kyndryl
- Modernized legacy applications for cloud deployment for a client that lacked automated CI/CD infrastructure, causing deployment delays and manual errors.
- Defined deployment success criteria with stakeholders: zero-downtime releases, automated rollbacks, and an 80% reduction in deployment time. Prioritized Amazon EKS over alternatives based on the client's existing cloud footprint and team skills.
- Established monitoring and observability as non-negotiables from day one. Reduced deployment time from hours to minutes. Enabled multiple daily releases vs. monthly deployment windows. Zero production incidents during the rollout phase.
Senior Lead
Kyndryl
- Led product strategy and execution for an intelligent automation platform. Focused on ticket classification first (highest ROI, fastest deployment) before expanding to auto-resolution.
- Worked with IT service managers and support teams to define success metrics: resolution time, SLA compliance, and agent productivity. Deployed in stages to build stakeholder confidence and refine based on real usage.
- Chose to build custom NLP models when off-the-shelf solutions couldn't meet accuracy requirements for enterprise workflows. Delivered 40% faster resolution times (from 6.8 hours to 4.1 hours), directly improving employee productivity.
- Integrated Kubernetes deployments with auto-scaling and secure access using IAM roles and ALB-based Ingress for production-ready environments. Delivered $1.8 million annual cost savings through automation of 65% of routine tickets.
- Shifted support teams from reactive firefighting to proactive problem-solving. The system processes over 5,000 tickets daily, with the capacity to scale three times without requiring additional headcount.
- Built a multi-database disaster recovery platform supporting MS SQL, MySQL, PostgreSQL, and Oracle. Delivered automated failover capabilities, reducing downtime by 85% and protecting $50+ million in business-critical data for enterprise clients.
Staff Engineer
McAfee
- Worked with security operations teams to understand their daily workflow pain points. Prioritized precision over recall—better to catch fewer threats accurately than flood teams with noise.
- Designed scalable ingestion pipelines on AWS and GCP for real-time data processing. Built iterative feedback loops to continuously improve model performance based on analyst input.
- Reduced false positive rate by 60%, allowing analysts to focus on genuine threats. Improved threat detection accuracy while decreasing investigation time. Solution became a key differentiator in client renewals.
- Leveraged Ansible for infrastructure automation across disaster recovery environments, orchestrating database deployments and configuration management. Reduced manual provisioning time by 70% and eliminated configuration drift across 500+ servers.
Lead
Wipro
- Collaborated with clinical stakeholders to define must-have compliance requirements vs. nice-to-have features. Balanced regulatory constraints with innovation—chose an architecture that allowed iterative AI capability additions.
- Enabled real-time clinical decision support for medical professionals. Achieved 100% regulatory compliance with HL7 v2.5.1 standards. Reduced image analysis time from hours to minutes.
Lead
Zynga
- Defined platform requirements with game development teams and operations.
- Prioritized plugin architecture for flexibility as new games launched. Measured success by infrastructure uptime, player experience metrics, and operational cost per user. Outcomes: Supported 10+ million concurrent sessions with 99.9% uptime.
- Reduced infrastructure management overhead by 50%. Enabled rapid deployment of new game titles without platform rewrites.
Project History
ML Model to Optimize Delivery Time & Predict Driver Demand
- 83% — Accuracy
- 75% — Demand Forecast
- 25% — Churn Reduction
A mid-sized food delivery platform faced critical operational challenges with its legacy hosted system, which resulted in poor customer retention and high costs. Their on-time delivery rate was 65%, meaning 35% of orders arrived late. This caused customer dissatisfaction and a 5% monthly churn rate, leading to $250,000 in monthly lost revenue from 100,000 orders.
PAIN POINTS
• Operational Inefficiency: Manual delivery partner assignment took 10-15 minutes per order, creating peak-hour bottlenecks.
• Inaccurate Time Predictions: Rule-based ETA ignored real-time traffic, weather, and demand, causing late deliveries.
• No Demand Forecasting: The platform couldn’t predict order volume, causing idle partners or delayed orders in various zones.
• Slow Deployment: The legacy system took 4-6 months to onboard new clients, compared to 8-12 weeks for SaaS competitors.
• Limited Scalability: The monolithic architecture restricted expansion and updates, requiring downtime and custom development.
CONTEXT AND DATA
• 15,000+ historical orders across 22 cities
• 65% on-time delivery rate vs. 75-80% industry standard
• 5% churn rate, or 500 customers lost per 10,000-customer base
• An average customer lifetime value of $500
• A customer acquisition cost of $20-50
This set the stage for building an ML-powered SaaS solution to improve prediction accuracy, automate operations, and reduce time-to-value.
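The context figures above are enough to reproduce the quoted monthly revenue loss. A minimal check, assuming the loss is computed as churned customers multiplied by lifetime value (an interpretation of the figures, not a formula stated in the profile):

```python
# Figures quoted in the context list above.
customer_base = 10_000       # reference customer base
monthly_churn_rate = 0.05    # 5% monthly churn
lifetime_value_usd = 500     # average customer lifetime value

# Assumption: monthly loss = churned customers x lifetime value.
customers_lost = round(customer_base * monthly_churn_rate)
monthly_loss_usd = customers_lost * lifetime_value_usd

print(f"Customers lost per month: {customers_lost}")
print(f"Monthly revenue loss: ${monthly_loss_usd:,}")
```

Under that assumption, the 500 lost customers and the $250,000 monthly loss quoted earlier are consistent with each other.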
5-STEP SOLUTION
- Analyzed 15,000+ orders from my IISc capstone, shadowed 5-6 operations managers, and reviewed support tickets. Discovered that rule-based systems ignored traffic and weather, yielding only a 65% on-time rate.
- Built 20+ input features: distance, real-time traffic from the Google Maps API, weather, delivery partner rating, time of day, and festivals. Used sin/cos transformations for cyclical patterns and created lag features for time trends.
- Tested linear regression, random forest, and XGBoost; XGBoost won. Used Optuna for hyperparameter tuning across 100+ combinations. Ran 5-fold cross-validation with time-based splits and achieved a mean absolute error of 4.2 minutes.
- Deployed a FastAPI microservice with an under-200-millisecond response, Redis caching with a 1-hour TTL (reducing monthly API costs from $500 to $200), and Kubernetes auto-scaling for rush hours. A/B-tested a gradual rollout from 20% to 100% and created a multi-layer fallback.
- Set up model drift tracking, weekly auto-retraining on Sundays with new data, MLflow version tracking for rollback, and a way for operations managers to flag bad predictions, so the system learns from mistakes and continuously improves.
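The feature engineering and validation steps above can be sketched in plain Python. This is an illustrative sketch only: the helper names (`cyclical_encode`, `lag_feature`, `time_based_splits`) and the sample values are hypothetical, and the expanding-window split is one common way to do time-based cross-validation, assumed here rather than taken from the project.

```python
import math

def cyclical_encode(value, period):
    """Map a periodic value (e.g. hour 0-23) onto the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

def lag_feature(series, lag):
    """Shift a series by `lag` steps; the first `lag` entries have no history."""
    return ([None] * lag + list(series))[:len(series)]

def time_based_splits(n_samples, n_folds):
    """Yield (train, test) index lists where training data always precedes test data."""
    fold_size = n_samples // (n_folds + 1)
    if fold_size == 0:
        raise ValueError("not enough samples for the requested number of folds")
    for k in range(1, n_folds + 1):
        train_end = k * fold_size
        test_end = train_end + fold_size if k < n_folds else n_samples
        yield list(range(train_end)), list(range(train_end, test_end))

# Hour 23 and hour 0 are adjacent in time, and their encodings are close too.
late_night = cyclical_encode(23, 24)
midnight = cyclical_encode(0, 24)
noon = cyclical_encode(12, 24)
print(math.dist(late_night, midnight) < math.dist(midnight, noon))

# Hypothetical hourly order counts; each row sees the previous hour's volume.
hourly_orders = [30, 42, 35, 50]
print(lag_feature(hourly_orders, 1))

# Five folds over 12 time-ordered samples; no fold trains on future data.
splits = list(time_based_splits(12, 5))
print(len(splits))
```

The sin/cos pair keeps 23:00 and 00:00 close together, which a raw hour column would not, and the split never lets a fold train on orders that happened after its test window.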
UNIQUE VALUE
• Real ML experience: 83% accuracy from the capstone project vs. 70% for the industry
• Cost-conscious: Redis saved $300 monthly per customer
• Reliability: 3-layer fallback never completely fails
• Business impact: 72% on-time rate, 30% churn reduction, $177,000 yearly savings
Achieved an 83% R² score predicting food delivery times using XGBoost with Optuna hyperparameter tuning, a 75% driver demand forecast by building an XGBoost classifier, and a 25% customer churn reduction through accurate delivery time estimates and proactive delay notifications.
RESULTS
- Delivery Performance
• 11% improvement in on-time delivery, from 65% to 72%
• 65% better prediction accuracy, with error falling from 12 minutes to 4.2 minutes
• 35% reduction in monthly customer complaints, from 120 to 78
- Business Metrics
• 30% drop in customer churn, from 5% to 3.5%, equivalent to $75,000 in yearly savings
• 50% cut in monthly delivery waste from $20,000 to $10,000, equivalent to $120,000 yearly savings
• 60% faster deployment from 4-6 months to 16 weeks
• $237,000 total annual savings per customer
- User Adoption
• Delivery app: 90% within 30 days
• Manager usage: 85% daily
• ML adoption: 75% within 60 days
- Technical Excellence
• 200-millisecond API response
• 99.9% uptime
• 83% R² score in model accuracy
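Several of the percentages in the results above are relative changes between a before and an after value. A small sketch of that arithmetic, using only the before/after figures quoted in the list (the `relative_change` helper and the dictionary keys are illustrative names):

```python
def relative_change(before, after):
    """Fractional change from `before` to `after`; negative means a decrease."""
    return (after - before) / before

# (before, after) pairs quoted in the results list.
metrics = {
    "on_time_rate_pct": (65, 72),
    "prediction_error_min": (12, 4.2),
    "monthly_complaints": (120, 78),
    "monthly_churn_pct": (5, 3.5),
    "monthly_delivery_waste_usd": (20_000, 10_000),
}

for name, (before, after) in metrics.items():
    print(f"{name}: {relative_change(before, after):+.0%}")

# Monthly waste savings annualized over 12 months.
annual_waste_savings_usd = (20_000 - 10_000) * 12
print(f"annual_waste_savings_usd: {annual_waste_savings_usd:,}")
```

Note that the on-time figure is a relative improvement (7 points on a base of 65), not a 7-percentage-point claim restated.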
LONG-TERM BENEFITS
• A data flywheel: more orders, better data, accurate predictions, happier customers, and more orders again
• An 83% ML accuracy that rule-based competitors can’t match
• 18% higher lifetime value
• A cloud that supports 10x growth
• Brand loyalty: the 72% on-time rate drives 25% repeat orders and 40% more referrals
Inefficient Information Retrieval from Educational Course Documentation
- 15% — Answer Accuracy Improvement
- 40% — Faster Retrieval Workflow
- 60% — Troubleshooting Time Reduction
- 50% — Perceived Response Time Boost
Students and instructors in the AI/MLOps program struggled to efficiently extract specific information from multiple assignment instruction PDFs and course materials.
PAIN POINTS
• Time-consuming Manual Search: Students spend 15-20 minutes manually reading multiple PDF documents using Ctrl+F keyword search to find specific assignment requirements, submission guidelines, or grading rubrics, reducing productive learning time.
• Keyword Search Limitations: Traditional PDF search tools cannot understand semantic relationships. For example, searching for “deployment instructions” won’t find related information about “Hugging Face Spaces setup,” even though they refer to the same concept, so 40% of relevant information is missed.
• Repetitive Instructor Queries: Teaching assistants receive 50-100 repetitive questions weekly about information already documented in assignment PDFs, consuming 10-15 hours that could’ve been spent on meaningful mentorship.
• Context Loss Across Documents: Students lack a unified query interface to search across all documents simultaneously.
• No Contextual Understanding: There were no direct answers; documents had to be manually read.
• Late Discovery of Requirements: Students often discover critical requirements—team activity, mentor presentation timing, attendance rules—buried in the middle or end of PDFs only after starting work, leading to 30% of submissions having format or process errors.
As a solution, I developed an intelligent document query system using LangChain and LangGraph to orchestrate retrieval workflows with vector embeddings, semantic search, and LLMs.
- Ingestion: Used PyPDFLoader and DirectoryLoader to load PDFs and extract text with metadata.
- Chunking: Applied RecursiveCharacterTextSplitter with tuned chunk_size and chunk_overlap to retain context.
- Embedding and Storage: Generated embeddings via HuggingFace MTEB models; stored in FAISS and tested Chroma and Pinecone.
- RAG Pipeline: Built RetrievalQA chain combining LangChain retriever with Llama-2-7B/Mistral-7B. Added ConversationalRetrievalChain for multi-turn Q&A with memory, compared standard vs. MMR retrievers, and used PromptTemplate to ground answers in retrieved content.
- Orchestration in LangGraph: Created a state machine with query analysis, retrieval, reranking, generation, and validation nodes, plus conditional edges that reroute low-confidence queries. Added self-reflection for rephrasing and re-retrieval.
- Validation: Tested with five realistic questions—grading, HF Spaces deployment, deadlines, project differences, and dataset choices—and evaluated groundedness via QAEvalChain.
- Gradio Interface: Wrapped ConversationalRetrievalChain in Gradio with history and “show sources” to display chunks.
- Deployment: Containerized and deployed on Hugging Face Spaces with GPU support and enabled streaming responses via callbacks.
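The standard vs. MMR retriever comparison mentioned in the pipeline above can be illustrated without any library. The sketch below is a plain-Python illustration of the maximal marginal relevance idea on toy 2-D vectors; it is not LangChain's implementation, and the function names, the `lambda_mult` parameter name, and the sample vectors are assumptions chosen for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr_select(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Pick k document indices, trading relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document (0 if none yet).
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy embeddings: documents 0 and 1 are near-duplicates, document 2 differs.
query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.12], [0.8, -0.6]]

similarity_only = mmr_select(query, docs, k=2, lambda_mult=1.0)
diversified = mmr_select(query, docs, k=2, lambda_mult=0.5)
print(similarity_only, diversified)
```

With `lambda_mult=1.0` the selection reduces to plain similarity ranking and returns the two near-duplicates; lowering it makes the second pick favor a chunk that adds new information.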
Improved query answer accuracy by 15%, achieved a 40% faster retrieval workflow, and improved perceived response time by 50%.
Orchestration Efficiency:
• LangGraph State Management: Complex multi-step retrieval workflows were executed 40% faster than custom pipeline code.
• Conditional Logic: LangGraph’s conditional routing automatically retries with expanded search when initial retrieval confidence is low, improving answer accuracy by 15%.
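The retry-on-low-confidence routing described above can be sketched as a plain control loop. This is a library-free illustration of the idea, not LangGraph code: `answer_with_retry`, `toy_retrieve`, the confidence threshold, and the doubling of `top_k` are all hypothetical choices.

```python
def answer_with_retry(query, retrieve, min_confidence=0.7, top_k=4, max_attempts=3):
    """Retry retrieval with a wider search while confidence stays low."""
    attempts = []
    chunks, confidence = [], 0.0
    for _ in range(max_attempts):
        chunks, confidence = retrieve(query, top_k)
        attempts.append((top_k, confidence))
        if confidence >= min_confidence:
            break
        top_k *= 2  # expand the search before the next attempt
    return chunks, confidence, attempts

def toy_retrieve(query, top_k):
    """Hypothetical retriever: confidence grows as more chunks are considered."""
    confidence = min(1.0, 0.2 + 0.1 * top_k)
    return [f"chunk-{i}" for i in range(top_k)], confidence

chunks, confidence, attempts = answer_with_retry("grading rubric", toy_retrieve)
print(attempts)
```

In a graph-based orchestrator the same decision would typically live on a conditional edge, with the retry branch looping back to the retrieval node.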
Developer Productivity:
• LangChain Abstractions: Reduced development time from three weeks to one week using LangChain’s pre-built components.
• Prompt Engineering: LangChain’s PromptTemplate and FewShotPromptTemplate enabled rapid experimentation with 10+ prompt variations.
Maintainability:
• Modular Architecture: LangGraph nodes can be updated independently.
• LangSmith Integration: Logged all LangChain runs for debugging, showing exact chunks retrieved and LLM reasoning, reducing troubleshooting time by 60%.
Advanced Features:
• Conversational Memory: LangChain’s memory buffers enable follow-up questions.
• Agent-Based Routing: LangGraph agents can route queries to different knowledge bases based on query type.
Scalability:
• LangChain LCEL: Parallel retrieval execution from multiple vector stores reduced query latency from five seconds to two seconds.
• Streaming: LangChain streaming callbacks provide token-by-token LLM output, improving perceived response time by 50%.
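The latency gain from querying several vector stores in parallel can be illustrated with the standard library alone. The sketch below uses `concurrent.futures` rather than LCEL, and the store names and simulated delays are hypothetical stand-ins for real vector store lookups.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def make_store(name, delay_s):
    """Build a hypothetical store whose search simulates I/O latency."""
    def search(query):
        time.sleep(delay_s)  # stand-in for a network or index lookup
        return [f"{name}: result for {query!r}"]
    return search

stores = [
    make_store("assignments", 0.1),
    make_store("rubrics", 0.1),
    make_store("faq", 0.1),
]

def retrieve_parallel(query, stores):
    """Query every store concurrently and merge hits in store order."""
    with ThreadPoolExecutor(max_workers=len(stores)) as pool:
        batches = list(pool.map(lambda search: search(query), stores))
    return [hit for batch in batches for hit in batch]

hits = retrieve_parallel("submission deadline", stores)
print(hits)
```

Because the lookups are I/O-bound, total wait time approaches that of the slowest store instead of the sum of all stores.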
End-to-end MLOps & Kubernetes Deployment for AI-driven Text Analytics
- 0.88-0.94 — F1 Score
- 0.90 — Precision
- 85-92% — Multi-class Accuracy
- 5% — Model Drift Detection Accuracy
PROBLEM
Text classification challenges in ticket routing: fastText model limitations in the production environment.
- Poor Contextual Understanding
• fastText was a bag-of-words model with no semantic context
• “urgent server down” and “server down resolved” landed in the same category
• Accuracy was at 72% and needed to be at 85%
• Misrouting was at 15% and caused $180,000 in yearly rework
- Out-of-vocabulary Failures
• Product codes and error messages had a 20% error rate
• “SAP timeout” vs. “Oracle timeout” were treated identically
• Wrong routing caused 4-6-hour delays
- Multilingual Gaps
• English 72%, Mandarin 58%, German 61%
• Separate models per language caused 3x training effort
• Code-mixed tickets failed
- Intent Misclassification
• “Critical issue” vs. “routine query” had the same priority
• False escalations were at 25%, wasting senior engineers’ time
TRIED MISTRAL
• 800 milliseconds to 1.2 seconds latency, exceeding the under-500-millisecond SLA
• GPU costs were at $450 monthly per instance
• 8% hallucinations on technical jargon
• Long tickets truncated
WHAT WASN’T WORKING
• fastText: Fast at 50 milliseconds, but only 72% accurate
• Mistral: 88% accurate, but slow and expensive
• Need: Balance for 10,000+ tickets per day
SOLUTION
Optimized BERT for production text classification
Week 1-2: Model Selection
• Evaluated BERT-base, DistilBERT, ALBERT vs. fastText
• Selected DistilBERT, which was 40% faster and retained 97% of BERT accuracy
• Saw 150-200-millisecond latency, which met the under-500-millisecond SLA, with a 60% smaller model
• Tested on 5,000 labeled tickets and validated on real data
Week 3-4: Hybrid Architecture
• fastText pre-filter was at 50 milliseconds and handled 60% of simple cases
• BERT for complex tickets was at 200 milliseconds for semantic understanding
• The result was a 95-millisecond average latency vs. 200 milliseconds for full BERT
• Infrastructure comprised AWS EKS and GPU nodes, batch inference, ONNX Runtime for a 2.3x speedup, and FP16 quantization that reduced memory by 50%
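The hybrid routing idea above, a cheap model first and a heavier model only when needed, can be sketched as a confidence-threshold cascade. The two model functions below are hypothetical keyword-based stand-ins, and the 0.9 threshold is an assumption; the profile does not state how simple cases were identified.

```python
def route_ticket(text, fast_model, heavy_model, confidence_threshold=0.9):
    """Use the fast model when it is confident, otherwise fall back to the heavy one."""
    label, confidence = fast_model(text)
    if confidence >= confidence_threshold:
        return label, "fast"
    return heavy_model(text), "heavy"

def toy_fast_model(text):
    """Hypothetical stand-in for a fastText-style classifier."""
    words = text.lower().split()
    if "password" in words:
        return "access", 0.97
    return "unknown", 0.40

def toy_heavy_model(text):
    """Hypothetical stand-in for a transformer classifier with more context."""
    lowered = text.lower()
    if "down" in lowered and "resolved" not in lowered:
        return "incident"
    return "general"

for ticket in ["Password reset needed", "urgent server down", "server down resolved"]:
    print(ticket, "->", route_ticket(ticket, toy_fast_model, toy_heavy_model))
```

Only tickets the fast path cannot classify confidently pay the heavier model's latency, which is what pulls the average below the full-model figure.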
Week 5-6: Training Approach
• 42,000 historical tickets cleaned
• Handled class imbalance: Weighted loss (40% zero-class)
• Fine-tuned pre-trained DistilBERT for three domain-specific epochs
• Created a single multilingual mBERT model covering English, Mandarin, and German
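The weighted-loss step above needs a weight per class. One common heuristic is inverse-frequency weighting, `n_samples / (n_classes * class_count)`, the same formula scikit-learn uses for its "balanced" class weights. The profile does not say which weighting was used, so this sketch and its label distribution (one class at 40% of samples) are assumptions.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: rarer classes get proportionally larger weights."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical label distribution with one dominant class at 40%.
labels = ["none"] * 40 + ["network"] * 30 + ["access"] * 20 + ["database"] * 10
weights = balanced_class_weights(labels)

for cls, weight in weights.items():
    print(f"{cls}: {weight:.3f}")
```

These weights would then be passed to the loss function so that mistakes on rare classes cost more than mistakes on the dominant one.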
Week 8-10: Deployment and Validation
• A/B tested with 10% of traffic routed to BERT, monitoring accuracy and latency
• Results were 91% accuracy vs. 72% and 4% misrouting vs. 15%
• Used Prometheus/Grafana for real-time performance monitoring
• Auto-scaled with Kubernetes HPA
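A 10% A/B split like the one above is often implemented with deterministic hashing so that a given ticket always lands in the same arm. A minimal sketch, assuming string ticket IDs and fastText as the control arm (both assumptions; `assign_variant` is a hypothetical helper):

```python
import hashlib

def assign_variant(ticket_id, treatment_share=0.10):
    """Deterministically map a ticket ID to the treatment or control arm."""
    digest = hashlib.sha256(ticket_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 16**8  # uniform value in [0, 1)
    return "bert" if bucket < treatment_share else "fasttext"

# The same ticket always gets the same arm.
print(assign_variant("TICKET-1001") == assign_variant("TICKET-1001"))

# Across many tickets, roughly 10% land in the treatment arm.
assignments = [assign_variant(f"TICKET-{i}") for i in range(10_000)]
treatment_rate = assignments.count("bert") / len(assignments)
print(f"{treatment_rate:.1%}")
```

Raising `treatment_share` over time gives a gradual rollout without reshuffling tickets already assigned to the treatment arm.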
KEY INNOVATIONS
• A hybrid pipeline for speed and accuracy
• Production-ready ONNX optimization
• mBERT eliminated three separate models
• Applied zero-inflated data handling from the food delivery project
Translated food delivery ML—92% R², 54 clusters, and zero-inflated data—to enterprise text classification. Achieved 91% accuracy and a latency of under 200 milliseconds through a hybrid architecture, delivering 3.1x ROI and 39% faster resolution.
BERT IMPLEMENTATION OUTCOMES
- Classification Accuracy
• 26% accuracy improvement from 72% to 91%
• 73% misrouting reduction from 15% to 4%
• Multilingual gaps narrowed: accuracy rose to 91% from 72% for English and to 87% from 58% for Mandarin
- Operational Efficiency
• $180,000 yearly savings in manual re-classification costs.
• 39% faster resolution time from 6.2 hours to 3.8 hours
• 450 engineer hours saved per year with false escalations dropping from 25% to only 6%
• 18% better support ticket routing, reducing ticket volume
- System Performance
• 120-millisecond hybrid latency achieved vs. 50-millisecond fastText and 800-millisecond Mistral
• 2.4x throughput increase from 5,000 to 12,000 tickets daily
• SLA compliance improved from 71% to 94%
- Business Impact
• 62% customer NPS increase from 42% to 68%
• Customer churn rate dropped from 28% to 19% with improved service quality
• 3.1x ROI with $96,000 infrastructure cost and $300,000 savings
LONG-TERM BENEFITS
• Scalability: Single mBERT supports new languages (no retraining)
• Adaptability: Client onboarding dropped from two weeks to three days
• Competitive edge: AI routing vs. competitors’ rule-based systems
• Proactive insights: Semantic clustering detects emerging issues
AI-driven IT Service Desk Automation
- 92% — Accuracy Ratio
- 100% — Validation
- 40% — Reduction in Ticket Resolution Time
Back in 2020, I worked as a traditional infrastructure engineer, managing IT service desk operations. Our team was drowning in repetitive tickets (password resets, system access requests, and basic troubleshooting) that consumed 60% of our daily bandwidth. Talented engineers spent their time on mundane tasks instead of solving complex problems that could truly impact business outcomes.
I automated service desk operations by building AI-driven ticket classification and routing using NLP and MLOps pipelines. Repetitive tasks were handled by bots and self-service workflows, cutting manual effort by over 50%. Engineers moved to high-impact work, while monitoring, dashboards, and automated retraining kept models accurate. This improved SLA compliance, reduced backlog, and increased customer satisfaction through faster, more reliable resolutions. It also lowered costs, improved auditability, and built strong leadership trust in the overall value of AI-driven operations.
• Our BERT-based ticket classification system achieved 92% accuracy.
• Our AI-driven system reduced ticket resolution time by 40%, freeing our team to focus on strategic initiatives instead of repetitive work.
Education
Postgraduate Advanced Certification Program in AI and MLOps
Indian Institute of Science - Bangalore, India
Master's Degree in Business Administration – Finance
Symbiosis Institute of Business Management - Pune, India
Master's Degree in Computer Science
BMS College of Engineering - Bangalore, India
Certifications
AIMLOps
IISc
Certified Scrum Master
Scaled Agile Inc
Skills
Tools
Zoom, Ansible
Paradigms
Continuous Deployment, Agile Product Management, DevOps, Scrum, Azure DevOps
Other
Automation, Python 3, Cloud, Deployment, App Development, Kubernetes, Continuous Integration (CI), Databases, Generative Artificial Intelligence (GenAI), Agile Product Delivery, AIOps, Machine Learning Operations (MLOps), Computer Science, Business Administration, Finance, Security, LangGraph, LangChain, Vector Databases, Transformers, Monitoring, BERT, Terraform, Cloud Automation, Product Ownership, Machine Learning, Artificial Intelligence (AI), Python 2, RESTful APIs