Head of Site Reliability Engineering, Consultant | 2015 - PRESENT | HazelOps
Technologies: AWS, Terraform, CloudFormation, Ansible, Docker, Docker Swarm, ECS, Serverless, Java, PHP, WordPress, Python, HAProxy, Traefik, Grafana
- Built infrastructure for startups on AWS: multi-environment, self-healing, scalable, and predictable, all managed as infrastructure as code.
- Containerized legacy JVM, PHP, and Python apps with Docker.
- Audited infrastructure performance and produced dozens of full-cycle reports covering key performance factors and proposed action items.
- Helped software engineers implement DevOps, including close communication and improvements to strategy and processes.
- Established site reliability practices by owning SLAs, SLOs, and SLIs, eliminating toil, and increasing observability through automation, monitoring, and error budgeting.
- Implemented CI/CD, providing a streamlined deployment pipeline for dozens of different projects (GitLab, Jenkins, CircleCI), using Docker, a container registry, and multi-stage builds.
- Implemented OPS procedures in customers' environments, including service-based alerting, on-call rotation, and escalations.
- Deployed and maintained Apache Kafka, including full-cycle management via Terraform, Ansible, and Docker.
Lead Site Reliability Engineer | 2016 - 2019 | Flo Technologies
Technologies: AWS, CloudFormation, Ansible, Kafka, GitLab, ELK, TICK, Docker, CircleCI, Linux, TLS
- Designed and implemented a complex IoT infrastructure from scratch on AWS: a multi-tier, multi-subnet scalable cloud infrastructure, a multi-application stateless stack with Elastic Beanstalk/ECS and Docker, and platform-agnostic local environments with Docker and docker-compose.
- Designed and implemented Ansible infrastructure: idempotent plays/roles to support infrastructure needs, plus community-available roles for multiple platforms under Apache Foundation.
- Designed and implemented CI/CD: complete application lifecycle with green deployments of high-traffic services, a platform-agnostic framework to support SaaS or hosted CI servers, and hassle-free pipelines for software engineers.
- Designed and implemented monitoring solutions: log and data aggregation from multiple sources (ELK), on-prem monitoring via TICK and Grafana, and SaaS monitoring with Datadog and New Relic when needed.
- Designed and implemented operational procedures: service-oriented OLA, PagerDuty with monitoring solutions, and a PagerDuty "Service Owner First" policy.
- Designed and maintained an upgrade procedure for critical distributed systems that allowed zero-downtime, zero-data-loss upgrades throughout the three-year span.
Senior Member of Technical Staff | 2016 - 2017 | Delphix
Technologies: AWS, Jenkins, ELK, Ansible, Foreman, CloudFormation, Python
- Architected and implemented multi-tier hybrid cloud AWS infrastructure for a new project for a high-scale testing framework.
- Architected log and data aggregation from multiple sources (ELK).
- Architected a virtual and bare metal host provisioning system (Foreman).
- Designed and implemented nmap-based inventory software.
- Contributed to company-wide IT processes and improvements.
- Contributed major portions to on-call rotation, monitoring, SOA, and OLA designs/implementations.
Senior DevOps Engineer | 2013 - 2016 | Intuit
Technologies: AWS, Puppet, ELK, TeamCity, Git, Foreman
- Managed a hybrid cloud with around 300 nodes: AWS, VMware, and bare metal.
- Implemented automation, config management, and provisioning, with 90% of the environment managed in Puppet and Git.
- Managed the lifecycle of legacy systems.
- Applied CI to configuration management and IaC: git-flow, reusable code, and open-source contribution.
- Managed and mentored junior IT staff, including separation of concerns and easy onboarding.
- Led most of the post-acquisition infrastructure integration projects.
DevOps Engineer | 2011 - 2013 | Docstoc (Acquired by Intuit)
Technologies: Juniper SRX, A10 LB, MySQL, MongoDB, Python, Bash, Nagios
- Supported a colocation facility with 180+ dedicated Windows and Linux servers, including new server deployments.
- Managed network security and performance (Juniper SSG and SRX firewalls, A10 Networks load balancer, RADIUS, IPsec, NAT, Amazon EC2 VPC).
- Implemented proactive monitoring.
- Optimized Linux and Windows server performance.
- Deployed and maintained MySQL databases.
- Introduced and implemented the ELK stack.