Lead Site Reliability Engineer | 2019 - 2020 | Pango
Technologies: Amazon Web Services (AWS), Teams, Agile, Tableau, Amazon Elastic MapReduce (EMR), Zeppelin, Redshift, Hybrid Cloud Infrastructure, On-premise, HAProxy, Envoy Proxy, Nginx, Python, Google Cloud Platform (GCP), Terraform, Ansible, ELK (Elastic Stack), Fluentd, CI/CD Pipelines, Grafana, Prometheus, Geohash, VPN, Okta, Vault, Consul, Docker, Jenkins, SLA monitoring
- Implemented a CD pipeline for automated deployment of a VPN stack to a production fleet of over 600 hosts, drastically reducing time spent on toil.
- Implemented SLA monitoring and reporting, improving visibility, service quality, and customer experience while reducing incident resolution time.
- Implemented Geohash-based proximity searches.
- Provided higher-level technical support to team members, troubleshooting and resolving complex issues.
- Led a geographically distributed, multilingual team of engineers across multiple countries; coached and mentored team members and motivated them to achieve business and personal goals on time.
- Conducted on-site onboarding of a contractor team located in Costa Rica and Bolivia.
- Planned projects and sprints, conducted retrospectives and performance reviews for team members, ensured team success and efficiency, and reported to stakeholders.
Site Reliability Engineer | 2018 - 2019 | Pango
Technologies: Amazon Web Services (AWS), Docker Compose, RabbitMQ, Teams, ELK (Elastic Stack), Fluentd, PagerDuty, Opsgenie, GitHub, Jira, Hybrid Cloud Infrastructure, On-premise, Tableau, Amazon Elastic MapReduce (EMR), Zeppelin, Spark, Hadoop, SecOps, Apache, MySQL, Okta, Python, VPN, Nginx, HAProxy, Grafana, Prometheus, Ansible, Terraform, Docker, Vault, Consul, Google Cloud Platform (GCP)
- Dockerized and migrated key parts of an on-site Hadoop/Spark cluster to AWS. Fine-tuned AWS EMR to increase stability, improve performance (faster ETL job processing), and reduce costs.
- Migrated a legacy on-site Tableau server to AWS while retaining its data. Implemented automated provisioning and deployment with Terraform and Ansible and monitoring with Prometheus, which drastically improved stability, performance, and report quality.
- Collaborated with the SecOps team to implement golden images and drove their end-to-end deployment across the production fleet, on both cloud and bare-metal hosts, significantly improving security and stability.
- Drove end-to-end implementation and deployment of a standardized naming scheme across the production fleet.
- Trained and assisted team members on various topics, including best practices and documentation writing.
- Troubleshot networking and performance issues across production and worked closely with vendors and developers on resolutions.
DevOps Engineer | 2016 - 2018 | CyderSoft
Technologies: Amazon Web Services (AWS), Apache ZooKeeper, Consul, GitLab, Microservices, High-load, Autoscaling, Node.js, Ansible, Terraform, NATS, InfluxDB, Grafana, Prometheus, Twemproxy, Aerospike, ClickHouse, MySQL, Redis, Linux, Packer, Python, Bash, Lua, OpenResty, Proxies, Nginx
- Developed and supported a custom AWS Cloud orchestration solution, with particular responsibility for an EC2 Spot Instances Auto Scaling module that significantly reduced infrastructure costs.
- Migrated shell script-based automation to Ansible and Terraform and continuously developed new roles and modules for application deployment and infrastructure provisioning.
- Designed Grafana dashboards based on InfluxDB, Prometheus, and ClickHouse data sources to support advanced monitoring and troubleshooting, effective cost control, and the needs of BI and product teams.
Operations Engineer | 2014 - 2015 | iMesh
Technologies: Amazon Web Services (AWS), Apache, Nginx, Redis, MongoDB, MySQL, Ansible, SSL Configurations, SSL Certificates, OpenVAS, Bash, DNS Servers, Akamai, Content Delivery Networks (CDN), CentOS, Linux, KVM/Qemu
- Automated hybrid cloud (AWS and on-premise KVM) operations, significantly reducing time spent on toil.
- Performed a vulnerability assessment with OpenVAS, including issue analysis and security hardening on production hosts, thereby drastically reducing the number of security incidents.
- Optimized backup procedures of a MySQL server fleet, thereby reducing backup time.