Harshit Luthra

Verified Expert in Engineering

DevOps Engineer and Software Developer

Bengaluru, Karnataka, India

Toptal member since August 31, 2021

Expertise

Site Reliability, Cloud Engineering, Cloudflare, AWS RDS, AWS Lambda, AWS, AWS IAM, Kubernetes, DevOps, Terraform, Grafana, Docker

Bio

Harshit is a senior site reliability engineer with 7+ years of experience building and operating cloud-native platforms across AWS, GCP, Azure, and on-prem. At TrueFoundry, he owns multi-tenant SaaS reliability for enterprise ML workloads, maintaining 99.99% uptime on 95% spot instances and cutting annual cloud spend by over $200,000. CKS, CKA, and AWS Solutions Architect-certified, Harshit owns incidents end-to-end and partners with global enterprise teams to ship reliable, secure systems.

Portfolio

TrueFoundry

Kubernetes, Kubernetes Operators, AWS HA, GCP Security...

Kutumb

Kubernetes, Kubernetes Operations (kOps), Amazon Web Services (AWS), VPN...

Smallcase

Amazon Web Services (AWS), Continuous Delivery (CD), Monitoring, Bash, Python 3...

Experience

Amazon Web Services (AWS) - 7 years
Kubernetes Operations (kOps) - 6 years
Monitoring - 6 years
Python 3 - 6 years
Kubernetes - 6 years
Bash - 6 years
Continuous Delivery (CD) - 5 years
Redis - 5 years

Preferred Environment

Kubernetes, Kubernetes Operations (kOps), Google Kubernetes Engine (GKE), Amazon Web Services (AWS), Azure Cloud Services, Terraform, Docker, Prometheus, Helm, Google Cloud Platform (GCP)

The most amazing...

...achievement was building a social media platform with 4 million DAU and a $10 million ARR business, maintaining a 99.99% SLA at $50,000 in monthly cloud costs.

Work Experience

Senior Site Reliability Engineer

2024 - PRESENT

TrueFoundry

Built a modular Terraform framework that slashed client onboarding time by 70%, enabling seamless deployment across AWS, Azure, GCP, and on-premises environments with minimal configuration changes.
Created marketplace listings for major cloud providers, resulting in a 40% increase in self-service customer acquisition and significantly reducing sales engineering overhead.
Drove multi-cloud adoption strategy by building flexible onboarding solutions that work across AWS, GCP, Azure, and on-prem environments.
Migrated the logging stack from Grafana Loki to VictoriaLogs, cutting query latencies by 94%, reducing storage by around 40%, tripling ingestion throughput, and halving CPU/RAM usage. Published the methodology in a public engineering blog post.
Architected severity-tiered incident management integrating Sentry, Grafana, and New Relic with Zenduty and Slack, with team-wise alert routing across five domains, materially reducing MTTR and on-call noise.
Led platform-wide monitoring overhaul, bifurcating alerts into P0/P1 severity tiers and migrating critical components to New Relic after a successful PoC, improving signal quality while keeping observability spend flat.
Hardened the platform for multi-tenant SaaS through tighter tenant isolation, namespace-scoped RBAC, and resource governance on shared clusters; standardized labels, annotations, and security contacts across the fleet.
Served as an escalation point for complex multi-cloud production debugging across EKS, GKE, and AKS, spanning Karpenter autoscaling, EFS/CSI mounts, GPU node scheduling, IAM/IRSA, and airgapped artifact registries.
Reduced annual cloud expenditure by $200,000+ through self-hosted services, efficient pod/node binpacking, spot-instance utilization, and optimization of network flow logs and metrics volume.
Achieved 99.99% uptime while running stateless workloads on 95% spot instances through intelligent time-based scaling and node hibernation strategies.
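
For context on the 99.99% figure above, the following is a minimal sketch of the usual error-budget arithmetic. The function name `downtime_budget_minutes` is hypothetical and chosen for illustration; it is not taken from any TrueFoundry tooling.

```python
def downtime_budget_minutes(slo_percent: float, days: int) -> float:
    """Return the downtime (in minutes) an availability target allows over a period."""
    allowed_fraction = 1 - slo_percent / 100  # share of time that may be unavailable
    return allowed_fraction * days * 24 * 60  # convert days to minutes


# A 99.99% target leaves roughly 52.56 minutes per 365-day year,
# or roughly 4.32 minutes per 30-day month.
yearly = downtime_budget_minutes(99.99, 365)
monthly = downtime_budget_minutes(99.99, 30)
print(f"{yearly:.2f} min/year, {monthly:.2f} min/month")
```

A budget this small is why spot-instance interruptions have to be absorbed by scheduling and scaling rather than by manual intervention.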

Technologies: Kubernetes, Kubernetes Operators, AWS HA, GCP Security, Google Cloud Platform (GCP), Azure, Terraform, OpenTofu, AWS CloudFormation, Database Administration (DBA), Databases, Site Reliability Engineering (SRE), Cloud, Machine Learning Operations (MLOps), Python, DevOps, Grafana k6, Cloud Architecture, Distributed Systems, Large Language Model Operations (LLMOps), Large Language Models (LLMs), Go, CI/CD Pipelines, Infrastructure as Code (IaC), Observability Tools, Incident Management, Amazon Athena, Amazon SageMaker, PySpark, Cloud Infrastructure, Infrastructure Testing, Load Testing, Vanta, SOC 2, Grafana, Prometheus, Helm, System Administration, GCP DevOps, Argo CD, Networking, AI Architecture, MLflow, Kubeflow, Amazon Bedrock, AI Model Training, Capacity Planning, vLLM, GPU Computing, Graphics Processing Unit (GPU), Istio, Azure DevOps, Identity & Access Management (IAM), Cloudflare, Agentic AI, AI Automation, Artificial Intelligence (AI), Automation, Release Management, QA Automation, Quality Assurance (QA), Observability, AWS Cloud Security, FedRAMP, AWS Lambda, AWS Bedrock AgentCore, Amazon DynamoDB, AWS IAM

Infrastructure Lead

2021 - 2024

Kutumb

Built the infrastructure from scratch, including three Amazon virtual private clouds (VPCs), private and public subnets, Kubernetes clusters, and VPC peering.
Ran stateless Kubernetes workloads on Spot Instances using node affinity and the cluster autoscaler to reduce infrastructure costs.
Set up a monitoring and logging stack and alerting tools using the Kubernetes/Prometheus stack, Grafana, Loki, and APM.
Reduced cloud bills by $200,000 yearly through self-hosting, bin-packing pods onto nodes, using Spot nodes, and keeping network flow logs, log volume, and metrics volume in check.
Set up and maintained multi-broker Apache Kafka and multi-node ELK Stack on Spot Instances.
Implemented robust network policies and role-based access control (RBAC) configurations, enhancing the security posture of the Kubernetes cluster and reducing potential attack vectors.
Implemented IPv6 support on the whole infrastructure and added AWS Global Accelerator to reduce latencies in global markets.
Deployed and maintained OpenVPN within the Kubernetes cluster, simplifying networking between on-cluster services and development machines.
Reduced deployment times to under one minute by using caching on self-hosted GitHub runners on Spot Instances, together with ArgoCD and Devtron.

Technologies: Kubernetes, Kubernetes Operations (kOps), Amazon Web Services (AWS), VPN, Cassandra, MongoDB, MySQL, PostgreSQL, Redis, Druid.io, Monitoring, Continuous Delivery (CD), Continuous Integration (CI), Bash, Python 3, DevOps, CI/CD Pipelines, Terraform, Helm, Amazon API Gateway, AWS Fargate, GitHub Actions, Terragrunt, Docker, AWS VPN, GitLab CI/CD, Machine Learning Operations (MLOps), Database Administration (DBA), Queue Management, Databases, Site Reliability Engineering (SRE), Cloud, Python, Amazon ElastiCache, Node.js, Grafana k6, Cloud Architecture, Infrastructure as Code (IaC), Distributed Systems, Observability Tools, Incident Management, Cloud Infrastructure, Load Testing, Grafana, Prometheus, Google Cloud Platform (GCP), System Administration, Argo CD, Networking, Identity & Access Management (IAM), Cloudflare, Release Management, Quality Assurance (QA), Observability, AWS Cloud Security, AWS Lambda, Amazon DynamoDB, AWS IAM, Amazon OpenSearch

DevOps Engineer

2019 - 2020

Smallcase

Built AWS SSM automation to run Ansible playbooks on ASG lifecycle hooks.
Pioneered a templated, Jenkins-based CI/CD solution for integration and deployment in a multi-tenant environment.
Reduced infrastructure costs by moving to better generation instances with a discounted price plan.

Technologies: Amazon Web Services (AWS), Continuous Delivery (CD), Monitoring, Bash, Python 3, Helm, Docker, RabbitMQ, Database Administration (DBA), Databases, Site Reliability Engineering (SRE), Cloud, DevOps, Cloud Architecture, CI/CD Pipelines, Distributed Systems, Observability Tools, Incident Management, Grafana, Ansible, System Administration, Argo CD, Networking, Identity & Access Management (IAM), Cloudflare, Release Management, Quality Assurance (QA), Observability, AWS Cloud Security, AWS Lambda, AWS Bedrock AgentCore, AWS IAM

Experience

CI/CD Project Using CircleCI

https://github.com/sachincool/AutoDeploy-SuperPower-Project

A project using CircleCI for testing, building, and deployment along with AWS infrastructure. It creates infrastructure from AWS CloudFormation templates, destroys it after testing, and sets up the Prometheus stack using Ansible roles.

Infrastructure Lead

https://crafto.app

Successfully architected and deployed production-grade, multi-region Kubernetes clusters that extend across key global locations, including Brazil, the United States, and India. These interconnected clusters deliver high availability and resilience, ensuring the smooth operation of services irrespective of geographic boundaries.

To maintain optimal system health and facilitate rapid issue resolution, I incorporated the principles of RED (rate, errors, duration) and USE (utilization, saturation, errors) monitoring. These practices provide comprehensive insights into the cluster's performance and any potential issues, thus enabling teams to identify and rectify any operational anomalies promptly.
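
As an illustration of the RED side of this approach, here is a minimal sketch that derives rate, error ratio, and a p95 duration from a window of request records. The `Request` type, the `red_summary` function, and the choice to count 5xx statuses as errors are assumptions made for the example; in the stack described here these figures would normally come from the monitoring system rather than hand-written code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    duration_ms: float  # how long the request took
    status: int         # HTTP status code


def red_summary(requests: list[Request], window_seconds: float) -> dict[str, Optional[float]]:
    """Summarize Rate, Errors, and Duration for one observation window."""
    total = len(requests)
    if total == 0:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p95_ms": None}

    errors = sum(1 for r in requests if r.status >= 500)  # assumption: 5xx = error
    durations = sorted(r.duration_ms for r in requests)

    # Nearest-rank 95th percentile, using integer math to avoid float rounding.
    rank = (95 * total + 99) // 100  # ceil(0.95 * total)
    p95 = durations[rank - 1]

    return {
        "rate_per_s": total / window_seconds,
        "error_ratio": errors / total,
        "p95_ms": p95,
    }


# 20 requests over a 10-second window; two of them failed with a 500.
sample = [Request(duration_ms=float(i), status=500 if i <= 2 else 200) for i in range(1, 21)]
print(red_summary(sample, window_seconds=10))
```

USE metrics (utilization, saturation, errors) follow the same pattern but are computed per resource, such as a node's CPU or disk, rather than per request.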

To further enhance the agility and productivity of our development teams, I implemented ArgoCD for continuous deployment. This approach abstracts the complexities of deployment processes, allowing developers to remain focused on critical business requirements. As a result, they can deliver features and fixes more rapidly and reliably, which ultimately contributes to our organization's competitive advantage.

Kafka Debezium CDC Pipeline

I created multiple change data capture (CDC) pipelines from MySQL database sources to Elasticsearch sinks using Kafka connectors (Debezium).

DIAGRAM
MySQL tables -> Kafka connector -> Kafka topics -> Kafka connector -> Elasticsearch index

I added proactive monitoring and alerting.
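
A pipeline of this shape is typically defined as two Kafka Connect connector configurations. The sketch below builds such configurations as plain Python dictionaries. All host names, table names, and the `app` prefix are hypothetical; the source-side property names follow Debezium 2.x and the sink side assumes the Confluent Elasticsearch sink connector, neither of which is confirmed by the project description, so check them against the versions actually in use.

```python
import json


def topic_names(prefix: str, tables: list[str]) -> list[str]:
    """Debezium's MySQL connector writes each table to <prefix>.<database>.<table>."""
    return [f"{prefix}.{table}" for table in tables]


def mysql_source(name: str, host: str, prefix: str, tables: list[str]) -> dict:
    """Source connector config (Debezium 2.x property names; values are placeholders)."""
    return {
        "name": name,
        "config": {
            "connector.class": "io.debezium.connector.mysql.MySqlConnector",
            "database.hostname": host,
            "database.port": "3306",
            "database.user": "debezium",        # placeholder; password omitted on purpose
            "database.server.id": "184054",     # must be unique among MySQL replicas
            "topic.prefix": prefix,
            "table.include.list": ",".join(tables),
            "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
            "schema.history.internal.kafka.topic": f"{prefix}.schema-history",
        },
    }


def elasticsearch_sink(name: str, es_url: str, topics: list[str]) -> dict:
    """Sink connector config (assumes the Confluent Elasticsearch sink connector)."""
    return {
        "name": name,
        "config": {
            "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
            "topics": ",".join(topics),
            "connection.url": es_url,
        },
    }


tables = ["shop.orders", "shop.users"]  # hypothetical <database>.<table> names
source = mysql_source("mysql-cdc", "mysql.internal", "app", tables)
sink = elasticsearch_sink("es-sink", "http://elasticsearch:9200", topic_names("app", tables))
print(json.dumps([source, sink], indent=2))
```

The resulting JSON documents are what would be submitted to the Kafka Connect REST API to register each connector.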

IPv6 and Global Accelerator for Global Markets

I added IPv6 support across the whole domain and infrastructure (Route 53, ALB, subnets, VPC), along with AWS Global Accelerator, to decrease latency across regions without moving to multi-region clusters.
I monitored VPC flow logs, integrated ALB logs with an Amazon S3 bucket, and prepared Athena tables and queries.

Education

2016 - 2020

Bachelor of Engineering Degree in Computer Science and Engineering

Chitkara University - Punjab, India

Certifications

JUNE 2023 - PRESENT

Certified Kubernetes Security Specialist

Cloud Native Computing Foundation

JANUARY 2021 - JANUARY 2024

Certified Kubernetes Administrator

CNCF

AUGUST 2019 - AUGUST 2022

AWS Certified Solutions Architect Associate

AWS

Skills

Libraries/APIs

Amazon API, Terragrunt, Node.js, vLLM, PySpark

Tools

Grafana, Helm, Amazon CloudWatch, Terraform, AWS Fargate, Observability Tools, AWS IAM, Amazon OpenSearch, Google Kubernetes Engine (GKE), Jenkins, Amazon Elastic Container Service (ECS), GitLab CI/CD, Amazon ElastiCache, Grafana k6, Amazon SageMaker, Istio, Sentry, VPN, CircleCI, Ansible, AWS CloudFormation, Kafka Streams, Amazon EKS, AWS ELB, Kubernetes Operators, GCP Security, OpenTofu, RabbitMQ, Amazon Athena

Paradigms

DevOps, Load Testing, Automation, Serverless Architecture, Continuous Delivery (CD), Continuous Integration (CI), Azure DevOps

Platforms

Kubernetes, AWS Lambda, Amazon Web Services (AWS), Amazon EC2, Docker, Google Cloud Platform (GCP), Apache Kafka, Linux, Vanta, New Relic, Confluent, Azure, Kubeflow

Storage

Amazon DynamoDB, Elasticsearch, Database Administration (DBA), Databases, Azure Cloud Services, Redis, Cassandra, MongoDB, MySQL, PostgreSQL, Druid.io

Languages

Bash, Python 3, Python, Go, C++11

Frameworks

AWS HA

Industry Expertise

Cybersecurity

Other

Kubernetes Operations (kOps), Amazon RDS, CI/CD Pipelines, Containerization, GitHub Actions, Machine Learning Operations (MLOps), Site Reliability Engineering (SRE), Cloud, Distributed Systems, Infrastructure as Code (IaC), Argo CD, Networking, GPU Computing, Graphics Processing Unit (GPU), Identity & Access Management (IAM), Cloudflare, AI Automation, Release Management, Observability, AWS Cloud Security, FedRAMP, AWS Bedrock AgentCore, Apache Cassandra, AWS DevOps, AWS Certified DevOps Engineer, Cloud Infrastructure, Message Queues, Firewalls, AWS Certified Solution Architect, Amazon API Gateway, AWS VPN, Queue Management, Cloud Architecture, Large Language Model Operations (LLMOps), Large Language Models (LLMs), Incident Management, Infrastructure Testing, SOC 2, System Administration, GCP DevOps, Capacity Planning, VictoriaLogs, Karpenter, Agentic AI, Artificial Intelligence (AI), QA Automation, Quality Assurance (QA), Software, Monitoring, Structured Logging, Operating Systems, Computer Networking, Prometheus, Lambda Functions, Relational Database Services (RDS), AI Architecture, MLflow, Amazon Bedrock, AI Model Training, Cloud Computing, Firmware
