Darshit Suratwala, Developer in Mumbai, Maharashtra, India

Darshit Suratwala

Bio

Darshit is a senior site reliability and platform engineer with more than seven years of experience spanning blockchain, AI, observability, and developer tooling. A CKA-certified engineer fluent in AWS, GCP, Azure, and bare-metal environments, he has managed more than 100 RPC node clusters at Coinbase, delivered over $700,000 in monthly cloud cost savings at Supra, and led SOC 2 compliance initiatives. Darshit thrives at the intersection of reliability, automation, and infrastructure at scale.

Portfolio

Supra
Kubernetes, Terraform, Ansible, Amazon Web Services (AWS)...
Scale3 Labs
Kubernetes, Docker, Argo CD, GitOps, Helm, Terraform, Amazon Web Services (AWS)...
Coinbase
Kubernetes, Docker, Terraform, Python, Django, Datadog, PagerDuty...

Experience

  • Monitoring & Alerting - 7 years
  • Ansible - 6 years
  • Site Reliability Engineering (SRE) - 6 years
  • CI/CD Pipelines - 6 years
  • Google Cloud Platform (GCP) - 6 years
  • Amazon Web Services (AWS) - 5 years
  • Kubernetes - 5 years
  • Python - 5 years

Preferred Environment

Slack, Zoom, Google, Jira, Confluence, Notion

The most amazing...

...cost optimization I've delivered saved over $700,000 per month in cloud expenses at Supra by migrating to a distributed multi-cloud and bare-metal architecture.

Work Experience

Site Reliability Engineer

2025 - 2026
Supra
  • Achieved more than $700,000 a month in cloud cost reduction through strategic migration from cloud-only to a hybrid multi-cloud and bare-metal infrastructure model.
  • Directed migration to a distributed multi-cloud and bare-metal architecture, improving resilience and decentralization for a high-throughput Layer 1 blockchain network.
  • Drove SOC 2 compliance across all infrastructure, implementing security controls, CIS benchmarks, and vulnerability management processes to meet audit requirements.
Technologies: Kubernetes, Terraform, Ansible, Amazon Web Services (AWS), Google Cloud Platform (GCP), Bare-metal Server, Argo CD, Helm, Compliance, Cloud Cost Management, GitHub, GitHub Actions, Blockchain, Bash, Site Reliability Engineering (SRE), Networking, Linux, Cloud, Cloud Security, Google Kubernetes Engine (GKE), Incident Response, Kubernetes Operations (kOps), Agile DevOps, Configuration Management, Disaster Recovery Plans (DRP), Software Development Lifecycle (SDLC), Infrastructure, Cloudflare, Architecture, GCP DevOps, Linux Administration, IT Infrastructure, Virtual Machines, Bash Script, IT Security, Continuous Integration (CI), Continuous Delivery (CD), Infrastructure Automation, Containers, Ubuntu, MySQL, DigitalOcean, Cloud Infrastructure, Performance, Server Optimization, Virtual Private Cloud (VPC), Cloud Run, Google Cloud Build, Logging, Hybrid Cloud Infrastructure, Multi-tenant Architecture, APIs, Docker Compose, Container Orchestration, Disaster Recovery Automation, Scripting, SOC 2, SOC Compliance, Python Script, Domain Migration, Domain DNS Setup, Web Hosting, Migration, Consulting, Cloud Migration

Software Engineer

2022 - 2024
Scale3 Labs
  • Led the full development of the Nodepilot product, from infrastructure architecture and design to customer onboarding, serving as the primary SRE owner of the blockchain observability platform.
  • Implemented a fully automated GitOps deployment workflow using Argo CD, Helm, and Terraform, eliminating manual intervention and ensuring consistent, reproducible infrastructure across environments.
  • Integrated VectorDB and LLM framework support into Python and TypeScript SDKs for Langtrace, extending observability capabilities for AI and machine learning workloads.
  • Deployed scalable self-hosting solutions for Langtrace across Kubernetes, Azure, Docker Compose, and Railway App, enabling diverse customer deployment models.
  • Automated blockchain binary release pipelines using serverless services, reducing release cycle time and minimizing human error in critical node software updates.
Technologies: Kubernetes, Docker, Argo CD, GitOps, Helm, Terraform, Amazon Web Services (AWS), Prometheus, Grafana, GitHub, GitHub Actions, Serverless, Python, TypeScript, Compliance, Blockchain, OpenTelemetry, Site Reliability Engineering (SRE), Networking, Linux, Cloud, Google Kubernetes Engine (GKE), Incident Response, Identity & Access Management (IAM), Kubernetes Operations (kOps), Configuration Management, Debugging Tools, Disaster Recovery Plans (DRP), Software Development Lifecycle (SDLC), SQL, Azure, Infrastructure, Microsoft Azure, Vercel, Architecture, Railway, GCP DevOps, IT Infrastructure, Virtual Machines, Bash Script, AWS IAM, JavaScript, Continuous Integration (CI), Continuous Delivery (CD), Amazon S3 (AWS S3), Go, Infrastructure Automation, Self-hosted, Artificial Intelligence (AI), Containers, Observability Tools, Ubuntu, DigitalOcean, Large Language Models (LLMs), AWS Cloud Operations, Cloud Infrastructure, MongoDB, Amazon Virtual Private Cloud (VPC), Google Cloud Build, Logging, MongoDB Atlas, Pulumi, HIPAA Compliance, Hybrid Cloud Infrastructure, Docker Compose, REST APIs, Container Orchestration, Argo Workflows, Disaster Recovery Automation, Containerization, Scripting, SOC 2, SOC Compliance, Domain DNS Setup, Web Hosting, Claude, Consulting, Cloud Migration

Site Reliability Engineer

2022 - 2022
Coinbase
  • Managed blockchain node operations across more than 30 chains and more than 100 remote procedure call (RPC) node clusters, ensuring high availability for one of the world's largest publicly traded cryptocurrency exchanges.
  • Defined service-level objectives (SLOs) and service-level indicators (SLIs) for critical blockchain infrastructure services, built monitoring dashboards, and reduced alert noise to improve on-call efficiency.
  • Built an in-house Opsbook service using Django to centralize runbooks and incident response procedures, reducing mean time to resolution (MTTR) by 15 minutes.
Technologies: Kubernetes, Docker, Terraform, Python, Django, Datadog, PagerDuty, Amazon Web Services (AWS), Blockchain, Monitoring, Monitoring & Alerting, Incident Management, Site Reliability Engineering (SRE), Amazon EC2, Linux, Cloud, Incident Response, Agile DevOps, Configuration Management, Debugging Tools, Software Development Lifecycle (SDLC), Infrastructure, Linux Administration, IT Infrastructure, Virtual Machines, AWS IAM, Continuous Integration (CI), Continuous Delivery (CD), Infrastructure Automation, Observability Tools, Ubuntu, AWS DevOps, AWS Cloud Architecture, AWS Cloud Operations, AWS ELB, Virtual Private Cloud (VPC), Amazon Virtual Private Cloud (VPC), Logging, Python Script

Senior DevOps Engineer

2019 - 2022
BrowserStack
  • Implemented a disaster recovery strategy on an alternate cloud provider, achieving a 50% lower recovery time objective (RTO) and ensuring business continuity for a platform serving more than 1 million daily testing sessions.
  • Migrated monolith staging environments to Kubernetes, improving resource utilization and enabling faster, more reliable developer workflows across engineering teams.
  • Maintained and operated global cloud and on-premises infrastructure spanning more than 20 data centers, supporting platform reliability at scale.
Technologies: Amazon Web Services (AWS), Kubernetes, Docker, Jenkins, Vault, Disaster Recovery (DR), Terraform, Infrastructure as Code (IaC), On-premise, NGINX, Cloud Cost Management, Monitoring, Observability, Ansible, Site Reliability Engineering (SRE), Amazon EC2, Networking, Linux, Cloud, Virtualization, Cloud Security, Incident Response, Identity & Access Management (IAM), Kubernetes Operations (kOps), Amazon EKS, AWS CloudFormation, Agile DevOps, Configuration Management, Debugging Tools, Disaster Recovery Plans (DRP), Software Development Lifecycle (SDLC), SQL, Infrastructure, Cloudflare, Architecture, Linux Administration, IT Infrastructure, Microsoft SQL Server, Virtual Machines, System Administration, Bash Script, IT Security, AWS IAM, JavaScript, Continuous Integration (CI), Continuous Delivery (CD), Amazon DynamoDB, Amazon S3 (AWS S3), Groovy, Infrastructure Automation, Containers, Observability Tools, Ubuntu, MySQL, Data Migration, Ruby, AWS DevOps, AWS Cloud Architecture, AWS Cloud Operations, Cloud Infrastructure, Performance, Server Optimization, AWS ELB, Redis, Apache Kafka, Virtual Private Cloud (VPC), Amazon Virtual Private Cloud (VPC), Logging, Hybrid Cloud Infrastructure, Multi-tenant Architecture, APIs, Docker Compose, REST APIs, Transport Layer Security (TLS), Container Orchestration, Disaster Recovery Automation, Containerization, Scripting, SOC 2, SOC Compliance, Python Script, API Gateways, Domain Migration, Domain DNS Setup, Web Hosting, Migration, Cloud Migration

Platform Engineer

2018 - 2019
Quantiphi
  • Built the v1 file-browser module from scratch using Django and Google Cloud Storage, enabling end users to upload, organize, and retrieve media assets for AI-driven content analysis.
  • Deployed and configured a multi-node Elastic Stack (ELK) cluster with Kibana dashboards, providing real-time log aggregation and search across platform microservices.
  • Developed RESTful APIs using Django REST Framework and AWS Lambda to power core platform functionality, serving as the integration layer between the front-end and AI inference services.
  • Automated CI/CD pipelines for microservices and AI model deployments, reducing manual release effort and accelerating delivery cycles.
Technologies: Google Cloud, Ansible, Jenkins, ELK (Elastic Stack), Amazon Web Services (AWS), Django, Python, Django REST Framework, CI/CD Pipelines, RESTful Microservices, Amazon EC2, AWS Lambda, Linux, Cloud, Identity & Access Management (IAM), AWS CloudFormation, Agile DevOps, Configuration Management, Disaster Recovery Plans (DRP), Software Development Lifecycle (SDLC), Infrastructure, Architecture, GCP DevOps, IT Infrastructure, Bash Script, Continuous Integration (CI), Continuous Delivery (CD), Amazon DynamoDB, Amazon S3 (AWS S3), Infrastructure Automation, Observability Tools, Ubuntu, Cloud Infrastructure, Virtual Private Cloud (VPC), Amazon Virtual Private Cloud (VPC), Logging, APIs, Docker Compose, REST APIs, Disaster Recovery Automation, Scripting, Python Script, Web Hosting

Experience

Production-grade 3-tier AWS Infrastructure with IaC and CI/CD

https://github.com/DSdatsme/node-3tier-app2
I designed and built a fully automated, scalable, and secure three-tier Node.js application on AWS using infrastructure as code (IaC). Requests flow from CloudFront through an Application Load Balancer to ECS Fargate services backed by RDS PostgreSQL, with the stack deployed across multiple Availability Zones for high availability. I authored 60+ Terraform resources across 14 configuration files, covering VPC networking, path-based ALB routing with CloudFront origin validation, ECS Fargate services with CPU-based autoscaling, Multi-AZ RDS with automated backups, CloudWatch dashboards and alarms, and Cloud Map service discovery for internal service communication.

I built six GitHub Actions workflows: three for PR validation (linting, security audits, Dockerfile linting, Terraform plan) and three for deployment with environment-gated approval flows. I also created operational runbook scripts for day-2 management, including service start/stop/scale and RDS backup operations. This project demonstrates end-to-end ownership from infrastructure design through CI/CD automation to production operations.

Blockchain Goes Kubernetes

https://youtu.be/5_dwKZ88G8w
At Kubernetes Community Days Mumbai 2023, I presented how blockchain node infrastructure can be orchestrated on Kubernetes, addressing challenges with stateful workloads, persistent storage, and network configurations unique to blockchain nodes. I also demonstrated real-world patterns from managing more than 30 blockchains on Kubernetes.

Terraform GitOps CI/CD with Approval and Slack Notifications

https://github.com/DSdatsme/gh-terraform
I built a Terraform GitOps pipeline on GitHub Actions for deploying infrastructure to Google Cloud Platform, featuring a manual approval gate and real-time Slack notifications. The workflow runs a Terraform plan on pull requests for review. On merge to master, it executes the plan, sends a Slack notification requesting approval, waits for a designated approver via GitHub Issues, and applies the changes upon approval, with success and failure notifications at each stage. Remote state is managed in GCS. This project served as the demo for my conference talk at GDG DevFest Mumbai 2022 and was later expanded into a 3-hour hands-on workshop at GDG DevFest Raipur 2022. An accompanying Medium article details the full architecture and workflow design.
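
The control flow of the approval gate described above can be illustrated with a small Python sketch. This is not code from the repository: the names `PipelineRun`, `Decision`, `on_pull_request`, and `on_merge` are hypothetical, the Slack and GitHub Issues integrations are replaced by in-memory stand-ins, and the handling of a denied approval is an assumption, since the description only covers the approved path.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class Decision(Enum):
    """Outcome of the manual approval step (hypothetical model)."""
    APPROVED = "approved"
    DENIED = "denied"


@dataclass
class PipelineRun:
    """Records the ordered stages and notifications of one simulated run."""
    stages: List[str] = field(default_factory=list)
    notifications: List[str] = field(default_factory=list)

    def notify(self, message: str) -> None:
        # Stand-in for posting to Slack; the message is only recorded.
        self.notifications.append(message)


def on_pull_request(run: PipelineRun) -> None:
    """Pull requests only produce a plan for review; nothing is applied."""
    run.stages.append("plan")


def on_merge(run: PipelineRun, wait_for_approval: Callable[[], Decision]) -> bool:
    """Plan, request approval, and apply only if the approver agrees.

    `wait_for_approval` stands in for blocking on a designated approver.
    Returns True when the apply stage ran.
    """
    run.stages.append("plan")
    run.notify("plan ready: approval requested")
    if wait_for_approval() is not Decision.APPROVED:
        # Assumed behaviour: a denial ends the run without applying.
        run.notify("apply skipped: approval denied")
        return False
    run.stages.append("apply")
    run.notify("apply succeeded")
    return True


if __name__ == "__main__":
    run = PipelineRun()
    on_merge(run, lambda: Decision.APPROVED)
    print(run.stages, run.notifications)
```

The key property of the gate is that the apply stage is unreachable without an explicit approval, so a merge alone never changes infrastructure.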

Education

2014 - 2018

Bachelor's Degree in Information Technology

University of Mumbai - Mumbai, India

Certifications

JANUARY 2024 - PRESENT

Microsoft Azure Fundamentals

Microsoft

JANUARY 2023 - JANUARY 2025

Google Cloud Professional Cloud DevOps Engineer

Google Cloud

JANUARY 2023 - DECEMBER 2026

Certified Kubernetes Administrator

The Linux Foundation

JANUARY 2021 - JANUARY 2023

GCP Professional Cloud Architect

Google Cloud

JANUARY 2019 - JANUARY 2022

GCP Associate Cloud Engineer

Google Cloud

Skills

Libraries/APIs

REST APIs, Node.js

Tools

Ansible, Jenkins, Terraform, Amazon CloudWatch, Google Kubernetes Engine (GKE), Slack, Zoom, Vault, NGINX, Helm, Grafana, OpenTofu, Google Compute Engine (GCE), Kubectl, Amazon EKS, AWS IAM, Observability Tools, AWS ELB, Logging, Docker Compose, Claude, Jira, Confluence, Notion, ELK (Elastic Stack), GitHub, Amazon CloudFront, BigQuery, AWS Fargate, AWS CloudFormation, GitLab CI/CD, Amazon Elastic Container Service (ECS), Amazon Virtual Private Cloud (VPC), MongoDB Atlas

Paradigms

DevOps, Continuous Integration (CI), Continuous Delivery (CD), Role-based Access Control (RBAC), HIPAA Compliance, Azure DevOps

Platforms

Amazon Web Services (AWS), Kubernetes, Google Cloud Platform (GCP), Azure, Docker, Blockchain, DigitalOcean, Amazon EC2, AWS Lambda, Linux, Ubuntu, PagerDuty, Bare-metal Server, Vercel, Apache Kafka, Cloud Run

Languages

Python, Bash, Bash Script, Groovy, Python Script, TypeScript, SQL, JavaScript, Go, Ruby

Storage

Google Cloud, Datadog, Microsoft SQL Server, Amazon S3 (AWS S3), MySQL, On-premise, PostgreSQL, Google Cloud Storage, Amazon DynamoDB, Redis, MongoDB

Frameworks

Django, Django REST Framework

Other

Infrastructure as Code (IaC), CI/CD Pipelines, GitHub Actions, Monitoring & Alerting, Cloud Architecture, Site Reliability Engineering (SRE), Cloud, Incident Response, Infrastructure, GCP DevOps, System Administration, Infrastructure Automation, SOC 2, Disaster Recovery (DR), Cloud Cost Management, Monitoring, Observability, Argo CD, GitOps, Prometheus, Compliance, Incident Management, Amazon RDS, Security, Networking, Troubleshooting, Identity & Access Management (IAM), Kubernetes Operations (kOps), Agile DevOps, Configuration Management, Debugging Tools, Disaster Recovery Plans (DRP), Software Development Lifecycle (SDLC), Cloudflare, Architecture, Linux Administration, IT Infrastructure, Virtual Machines, Self-hosted, Containers, AWS DevOps, AWS Cloud Architecture, AWS Cloud Operations, Cloud Infrastructure, Performance, Server Optimization, Hybrid Cloud Infrastructure, Multi-tenant Architecture, APIs, Transport Layer Security (TLS), Container Orchestration, Argo Workflows, Disaster Recovery Automation, Containerization, Scripting, SOC Compliance, Domain Migration, Domain DNS Setup, Web Hosting, Consulting, Cloud Migration, Google, Software Development, RESTful Microservices, Serverless, OpenTelemetry, AWS ECS Fargate, High Availability (HA), AWS Secrets Manager, GitHub Workflows, Slackbot, Virtualization, Cloud Security, Microsoft Azure, Railway, IT Security, Artificial Intelligence (AI), Data Migration, Large Language Models (LLMs), Virtual Private Cloud (VPC), Google Cloud Build, Pulumi, API Gateways, Migration
