
Harshit Luthra
Verified Expert in Engineering
DevOps Engineer and Software Developer
Bengaluru, Karnataka, India
Toptal member since August 31, 2021
Harshit is a senior site reliability engineer with 7+ years of experience building and operating cloud-native platforms across AWS, GCP, Azure, and on-prem. At TrueFoundry, he owns multi-tenant SaaS reliability for enterprise ML workloads, maintaining 99.99% uptime on 95% spot instances and cutting annual cloud spend by over $200,000. CKS, CKA, and AWS Solutions Architect-certified, Harshit owns incidents end-to-end and partners with global enterprise teams to ship reliable, secure systems.
Portfolio
Experience
- Amazon Web Services (AWS) - 7 years
- Kubernetes Operations (kOps) - 6 years
- Monitoring - 6 years
- Python 3 - 6 years
- Kubernetes - 6 years
- Bash - 6 years
- Continuous Delivery (CD) - 5 years
- Redis - 5 years
Preferred Environment
Kubernetes, Kubernetes Operations (kOps), Google Kubernetes Engine (GKE), Amazon Web Services (AWS), Azure Cloud Services, Terraform, Docker, Prometheus, Helm, Google Cloud Platform (GCP)
The most amazing...
...achievement was building a social media platform with 4 million DAU and a $10 million ARR business, maintaining a 99.99% SLA at $50,000 in monthly cloud costs.
Work Experience
Senior Site Reliability Engineer
TrueFoundry
- Built a modular Terraform framework that slashed client onboarding time by 70%, enabling seamless deployment across AWS, Azure, GCP, and on-premises environments with minimal configuration changes.
- Created marketplace listings for major cloud providers, resulting in a 40% increase in self-service customer acquisition and significantly reducing sales engineering overhead.
- Drove multi-cloud adoption strategy by building flexible onboarding solutions that work across AWS, GCP, Azure, and on-prem environments.
- Migrated the logging stack from Grafana Loki to VictoriaLogs, cutting query latencies by 94%, reducing storage by around 40%, tripling ingestion throughput, and halving CPU and RAM usage. Published the methodology as a public engineering blog post.
- Architected severity-tiered incident management integrating Sentry, Grafana, and New Relic with Zenduty and Slack, with team-wise alert routing across five domains, materially reducing MTTR and on-call noise.
- Led platform-wide monitoring overhaul, bifurcating alerts into P0/P1 severity tiers and migrating critical components to New Relic after a successful PoC, improving signal quality while keeping observability spend flat.
- Hardened the platform for multi-tenant SaaS through tighter tenant isolation, namespace-scoped RBAC, and resource governance on shared clusters; standardized labels, annotations, and security contacts across the fleet.
- Served as an escalation point for complex multi-cloud production debugging across EKS, GKE, and AKS, spanning Karpenter autoscaling, EFS/CSI mounts, GPU node scheduling, IAM/IRSA, and airgapped artifact registries.
- Reduced annual cloud expenditure by $200,000+ through self-hosted services, efficient pod/node binpacking, spot-instance utilization, and optimization of network flow logs and metrics volume.
- Achieved 99.99% uptime while running stateless workloads on 95% spot instances through intelligent time-based scaling and node hibernation strategies.
Infrastructure Lead
Kutumb
- Built the infrastructure from scratch, including three Amazon virtual private clouds (VPCs), private and public subnets, Kubernetes clusters, and VPC peering.
- Ran stateless Kubernetes workloads on Spot Instances, using node affinity and the cluster autoscaler to reduce infrastructure costs.
- Set up a monitoring and logging stack and alerting tools using the Kubernetes/Prometheus stack, Grafana, Loki, and APM.
- Reduced cloud bills by $200,000 yearly through self-hosting, bin-packing pods and nodes, running on Spot nodes, and keeping network flow logs, log volume, and metrics volume in check.
- Set up and maintained multi-broker Apache Kafka and a multi-node ELK Stack on Spot Instances.
- Implemented robust network policies and role-based access control (RBAC) configurations, enhancing the security posture of the Kubernetes cluster and reducing potential attack vectors.
- Implemented IPv6 support across the entire infrastructure and added AWS Global Accelerator to reduce latency in global markets.
- Deployed and maintained OpenVPN within the Kubernetes cluster, simplifying networking between on-cluster services and development machines.
- Reduced deployment times to under one minute using caching on self-hosted GitHub runners on Spot Instances, together with ArgoCD and Devtron.
DevOps Engineer
Smallcase
- Used AWS SSM to run Ansible playbooks on ASG lifecycle hooks.
- Pioneered a templated, Jenkins-based CI/CD solution for integration and deployment in a multi-tenant environment.
- Reduced infrastructure costs by moving to newer-generation instances on a discounted pricing plan.
Experience
CI/CD Project Using CircleCI
https://github.com/sachincool/AutoDeploy-SuperPower-Project
Infrastructure Lead
https://crafto.app
To maintain optimal system health and facilitate rapid issue resolution, I incorporated the principles of RED (rate, errors, duration) and USE (utilization, saturation, errors) monitoring. These practices provide comprehensive insights into the cluster's performance and any potential issues, thus enabling teams to identify and rectify any operational anomalies promptly.
To further enhance the agility and productivity of our development teams, I implemented ArgoCD for continuous deployment. This approach abstracts the complexities of deployment processes, allowing developers to remain focused on critical business requirements. As a result, they can deliver features and fixes more rapidly and reliably, which ultimately contributes to our organization's competitive advantage.
Kafka Debezium CDC Pipeline
Diagram: MySQL tables -> Kafka connector -> Kafka topics -> Kafka connector -> Elasticsearch index
I added proactive monitoring and alerting.
IPv6 and Global Accelerator for Global Markets
I monitored VPC flow logs, integrated ALB logs with an Amazon S3 bucket, and prepared Athena tables and queries.
Education
Bachelor of Engineering Degree in Computer Science and Engineering
Chitkara University - Punjab, India
Certifications
Certified Kubernetes Security Specialist
Cloud Native Computing Foundation
Certified Kubernetes Administrator
CNCF
AWS Certified Solutions Architect Associate
AWS
Skills
Libraries/APIs
Amazon API, Terragrunt, Node.js, vLLM, PySpark
Tools
Grafana, Helm, Amazon CloudWatch, Terraform, AWS Fargate, Observability Tools, AWS IAM, Amazon OpenSearch, Google Kubernetes Engine (GKE), Jenkins, Amazon Elastic Container Service (ECS), GitLab CI/CD, Amazon ElastiCache, Grafana k6, Amazon SageMaker, Istio, Sentry, VPN, CircleCI, Ansible, AWS CloudFormation, Kafka Streams, Amazon EKS, AWS ELB, Kubernetes Operators, GCP Security, OpenTofu, RabbitMQ, Amazon Athena
Paradigms
DevOps, Load Testing, Automation, Serverless Architecture, Continuous Delivery (CD), Continuous Integration (CI), Azure DevOps
Platforms
Kubernetes, AWS Lambda, Amazon Web Services (AWS), Amazon EC2, Docker, Google Cloud Platform (GCP), Apache Kafka, Linux, Vanta, New Relic, Confluent, Azure, Kubeflow
Storage
Amazon DynamoDB, Elasticsearch, Database Administration (DBA), Databases, Azure Cloud Services, Redis, Cassandra, MongoDB, MySQL, PostgreSQL, Druid.io
Languages
Bash, Python 3, Python, Go, C++11
Frameworks
AWS HA
Industry Expertise
Cybersecurity
Other
Kubernetes Operations (kOps), Amazon RDS, CI/CD Pipelines, Containerization, GitHub Actions, Machine Learning Operations (MLOps), Site Reliability Engineering (SRE), Cloud, Distributed Systems, Infrastructure as Code (IaC), Argo CD, Networking, GPU Computing, Graphics Processing Unit (GPU), Identity & Access Management (IAM), Cloudflare, AI Automation, Release Management, Observability, AWS Cloud Security, FedRAMP, AWS Bedrock AgentCore, Apache Cassandra, AWS DevOps, AWS Certified DevOps Engineer, Cloud Infrastructure, Message Queues, Firewalls, AWS Certified Solution Architect, Amazon API Gateway, AWS VPN, Queue Management, Cloud Architecture, Large Language Model Operations (LLMOps), Large Language Models (LLMs), Incident Management, Infrastructure Testing, SOC 2, System Administration, GCP DevOps, Capacity Planning, VictoriaLogs, Karpenter, Agentic AI, Artificial Intelligence (AI), QA Automation, Quality Assurance (QA), Software, Monitoring, Structured Logging, Operating Systems, Computer Networking, Prometheus, Lambda Functions, Relational Database Services (RDS), AI Architecture, MLflow, Amazon Bedrock, AI Model Training, Cloud Computing, Firmware