Aleksander is available for hire

Aleksander Luiz Lada Arruda

Verified Expert in Engineering

Site Reliability Engineering (SRE) Developer

Location

São Paulo - State of São Paulo, Brazil

Toptal Member Since

February 8, 2019

Aleksander is a DevOps and site reliability engineer with extensive experience in cloud-native technologies. Along with having a bachelor’s degree in computer science, he’s deployed and managed production-grade clusters (such as Kubernetes, Kafka, and Elasticsearch) and worked on microservice architecture and everything that comes with it, including container orchestration, service discovery, message queues, monitoring, logging, and tracing.

Portfolio

Metabase

Amazon Web Services (AWS), Kubernetes, Helm, Terraform, Terragrunt...

Scanifly

Terraform, Amazon Web Services (AWS), RabbitMQ, Google Cloud Platform (GCP)...

HMBradley

Site Reliability Engineering (SRE), PostgreSQL, Kubernetes Operations (kOps)...

Experience

Linux - 10 years
Terraform - 6 years
Kubernetes - 6 years
Amazon Web Services (AWS) - 6 years
Continuous Delivery (CD) - 5 years
Elasticsearch - 5 years
Site Reliability Engineering (SRE) - 5 years
Go - 3 years

Availability

Part-time

Preferred Environment

Linux, MacOS, iTerm2, Bash, Shell Scripting, Git

The most amazing...

...thing I’ve built was a multi-cluster Kafka setup providing very high availability to receive incoming app data from a company with over a billion downloads.

Work Experience

Principal DevOps Engineer

2023 - PRESENT

Metabase

Deployed production Kubernetes clusters across multiple regions in three different continents.
Extended a Kubernetes operator for managing snapshots at scale.
Implemented security best practices and CIS benchmarks using Falco and other tools.

Technologies: Amazon Web Services (AWS), Kubernetes, Helm, Terraform, Terragrunt, CI/CD Pipelines, Grafana, Prometheus, Site Reliability Engineering (SRE), Amazon EKS, DevSecOps, IT Security, GitHub Actions, Java

Site Reliability Engineering Manager

2022 - 2023

Scanifly

Built the company's cloud-native infrastructure from the ground up, migrating all the existing infrastructure into AWS and Kubernetes.
Rewrote a complex CV/ML batch application into separate, independently scalable components.
Implemented event-driven autoscaling with KEDA, optimizing resource allocation for a resource-intensive system.
Instrumented applications with OpenTelemetry and Grafana Tempo, identifying bottlenecks that caused the application to drop connections.

Technologies: Terraform, Amazon Web Services (AWS), RabbitMQ, Google Cloud Platform (GCP), Python, Grafana, Prometheus, Elasticsearch, Kubernetes, Helm, GitLab, OpenTelemetry, Site Reliability Engineering (SRE), Microservices Architecture, Amazon EKS, DevSecOps, Java, GitLab CI/CD

Site Reliability Engineering Manager

2020 - 2022

HMBradley

Built a future-proof cloud-native infrastructure from scratch, managing several Kubernetes clusters across different environments, running self-contained, replaceable components maintained with infrastructure as code.
Implemented a scalable and highly available stack for centralizing logs and metrics with Loki and Cortex, with automated alerts sent to different channels based on their severity level.
Constructed the company's data infrastructure running on Kubernetes, managing clusters such as Kafka, Elasticsearch, and Cassandra; created components to extract data from different sources into Redshift and Snowflake.
Introduced security best practices such as AWS CIS Benchmarks as well as intrusion detection and prevention techniques, targeting SOC 2 compliance; implemented granular access control across the systems, including AWS and Kubernetes.
Automated building and deploying infrastructure components and applications throughout environments, combining continuous delivery and infrastructure as code.
Developed a small tool that extracts detailed AWS cost data hourly, tags it, and ships it to Prometheus and Cortex, allowing the granular costs of the infrastructure to be visualized in real time.

Technologies: Site Reliability Engineering (SRE), PostgreSQL, Kubernetes Operations (kOps), Kubernetes, Redis, Cassandra, Vault, Apache ZooKeeper, Falcon, Prometheus, Grafana, Elasticsearch, Apache Kafka, Terraform, Amazon Web Services (AWS), Redshift, Snowflake, AWS Database Migration Service (DMS), Data Engineering, Data Warehousing, Shell Scripting, CI/CD Pipelines, SQL, Amazon EC2, Amazon S3 (AWS S3), Amazon Virtual Private Cloud (VPC), AWS IAM, Amazon CloudWatch, AWS Certified SysOps Administrator, GitHub, Git, Ansible, Infrastructure as Code (IaC), Cloud Infrastructure, SecOps, Amazon CloudFront CDN, Amazon RDS, Amazon DynamoDB, Cloudflare, Continuous Delivery (CD), Flask, Containerization, Architecture, Bash, Containers, Load Balancers, VPN, DevOps, Technical Leadership, AWS Cloud Architecture, Terragrunt, Microservices Architecture, Amazon EKS, DevSecOps, IT Security, GitLab CI/CD, GitLab

DevOps Technical Screener

2019 - 2021

Toptal

Screened all types of applicants in the DevOps vertical as part of the Toptal screening team.
Vetted candidates so that only the top 3% were approved.
Worked on polishing the interview process, proposing new technical questions and tasks, as well as improving the existing ones.
Advised applicants on improving their skills as DevOps engineers, what technologies they should seek to learn, and what certifications they should pursue based on their goals.
Assisted the approved candidates by building their profiles in a way that would improve their chances of getting hired.

Technologies: Amazon Web Services (AWS), Google Cloud Platform (GCP), Infrastructure Architecture, DevOps, Site Reliability Engineering (SRE), Shell Scripting, CI/CD Pipelines, SQL, Amazon EC2, Amazon S3 (AWS S3), Amazon Virtual Private Cloud (VPC), AWS IAM, Amazon CloudWatch, AWS Certified SysOps Administrator, GitHub, Git, Infrastructure as Code (IaC), Cloud Infrastructure, SecOps, Amazon CloudFront CDN, Amazon RDS, Amazon DynamoDB, DevSecOps

Senior Site Reliability Engineer

2019 - 2020

Pypestream

Deployed and upgraded well-known production clusters and databases, such as Kubernetes, Elasticsearch, PostgreSQL, and Ceph.
Fine-tuned our Elasticsearch cluster, which ingested roughly 300 GB of data per day, applying best practices informed by the low-level implementation of Apache Lucene, thereby improving its performance and allowing us to shrink its size.
Owned the implementation of security components and best practices such as AWS CIS Benchmarks and intrusion detection and prevention tooling, which helped the company earn its SOC 2 certification.
Provided on-call support 24/7, dealing with various incidents on the production infrastructure.
Created several Jenkins pipelines with Groovy and Bash for deploying both infrastructure components and applications and worked with Jenkins Configuration as Code (JCasC), making sure the whole continuous delivery stack was easily replicable.
Containerized several applications, creating CI/CD pipelines not only for building and deploying but also for performing code checks and security scans.
Implemented backup solutions for several systems, which enabled the development of a fast disaster recovery plan.

Technologies: Site Reliability Engineering (SRE), Rancher, LDAP, OpenStack, Harbor, GitLab CI/CD, Grafana, Prometheus, Ansible, Jenkins, Ceph, Elasticsearch, Kubernetes, Security, Docker, Kubernetes Operations (kOps), Amazon Web Services (AWS), Shell Scripting, CI/CD Pipelines, SQL, Amazon EC2, Amazon S3 (AWS S3), Amazon Virtual Private Cloud (VPC), AWS IAM, Amazon CloudWatch, AWS Certified SysOps Administrator, GitHub, Git, Infrastructure as Code (IaC), HIPAA Compliance, Cloud Infrastructure, SecOps, Amazon CloudFront CDN, Amazon RDS, Amazon DynamoDB, Continuous Delivery (CD), Flask, Containerization, Architecture, Bash, Containers, Load Balancers, VPN, DevOps, AWS Cloud Architecture, Microservices Architecture, Amazon EKS, DevSecOps, IT Security

DevOps Consultant

2018 - 2019

Audsat

Set up three Kubernetes clusters for development, staging, and production environments. All clusters were multi-AZ and had autoscaling. Monitoring was done with Datadog and PagerDuty.
Implemented GoCD with custom elastic agents for deploying applications into all Kubernetes clusters. Containerized applications and deployed them as Helm charts.
Implemented automatic provisioning and renewal of Let’s Encrypt TLS certificates with cert-manager.
Deployed Fluentd daemon sets for aggregating logs from all the applications into Elasticsearch. Also deployed Elasticsearch curators for cleaning old logs.
Set up the automatic monitoring of all Java applications deployed in the cluster by running them with sidecar containers exposing metrics retrieved from the application's JMX interface.
Spearheaded the Navalis project, a web application intended to allow developers to deploy, monitor, and scale their applications in multiple Kubernetes clusters with ease. It was developed with Go and Vue.js.
Scaled Kubernetes up to 300 nodes in order to process massive batches of data within a few hours, taking into consideration the network and I/O limitations of both the local instances and the data source.

Technologies: Java, PagerDuty, Datadog, GoCD, Fluentd, Elasticsearch, Kubernetes, Amazon Web Services (AWS), Shell Scripting, CI/CD Pipelines, SQL, Amazon EC2, Amazon S3 (AWS S3), Amazon Virtual Private Cloud (VPC), AWS IAM, Amazon CloudWatch, AWS Certified SysOps Administrator, GitHub, Git, Infrastructure as Code (IaC), Cloud Infrastructure, SecOps, Amazon CloudFront CDN, Amazon RDS, Amazon DynamoDB, Grafana, Continuous Delivery (CD), Containerization, Architecture, Bash, Containers, Load Balancers, VPN, DevOps, AWS Cloud Architecture, Site Reliability Engineering (SRE), DevSecOps

DevOps Engineer

2017 - 2018

Wildlife Studios

Partnered with the data engineering team to develop a new Kafka cluster for the company inspired by Netflix’s way of orchestrating and monitoring Kafka. It consisted of several interconnected Kafka clusters that prevented the loss of data.
Developed a system for monitoring backups consisting of a Python and Flask server and a client written in Go. The system would centralize the status of the backups across the whole infrastructure and notify our team whenever a backup was missing.
Solved an issue with a large Elasticsearch cluster that used to crash at the beginning of each day. The issue was caused by misconfigured Logstash instances that flooded the cluster with requests for creating new shards.
Developed a tool in Go for cross-validating the Kubernetes network; it would try to establish a route between every pair of machines in the cluster, either producing a complete graph or pointing out issues in the network.
Created a redundant VPN between AWS regions (US and AP) using VyOS.
Helped instrument our most important servers with Jaeger APM.
Deployed a Kubernetes cluster with autoscaling as a proof-of-concept to test how well a Kafka cluster would scale within Kubernetes.
Solved an issue in which our Kafka cluster would crash because of unexpected behavior of Netflix's Exhibitor, a tool someone had installed to monitor ZooKeeper.
Deployed multiple MongoDB clusters for collecting data during a high-traffic event.
Deployed a Kubernetes cluster the hard way, without any tools like Kubernetes Operations (kOps) or kubeadm, to learn deeper concepts of its architecture.

Technologies: Hyperledger Burrow, Apache ZooKeeper, Apache Kafka, Datadog, Elasticsearch, Jenkins, Helm, Kubernetes, VyOS, MongoDB, PagerDuty, Amazon Web Services (AWS), Go, Python, Docker, Terraform, Chef, Shell Scripting, CI/CD Pipelines, SQL, Amazon EC2, Amazon S3 (AWS S3), Amazon Virtual Private Cloud (VPC), AWS IAM, Amazon CloudWatch, AWS Certified SysOps Administrator, GitHub, Git, Infrastructure as Code (IaC), Cloud Infrastructure, SecOps, Amazon CloudFront CDN, Amazon RDS, Amazon DynamoDB, Continuous Delivery (CD), Containerization, Architecture, Bash, Containers, Load Balancers, VPN, DevOps, AWS Cloud Architecture, Site Reliability Engineering (SRE), Microservices Architecture, DevSecOps

DevOps Engineer

2017 - 2017

MAV Technology

Centralized all incoming requests in an HAProxy cluster. Previously, the infrastructure had no proper entry point (i.e., DNS pointed to lots of different entry points), so this change avoided single points of failure.
Fixed multiple bugs in Node.js servers, among them a critical one that forced us to restart production containers from time to time because of progressive performance degradation.
Solved multiple bugs in Objective-C servers by creating a system for debugging multiple servers in real time, attaching multiple GDBs to multiple processes distributed amongst nodes and capturing eventual stack traces—allowing us to quickly fix bugs that would only occur in the production environment.
Developed a Node.js server that would hold thousands of connections open as a fronting proxy for a legacy server that was not able to receive too many simultaneous connections.
Stopped an ongoing brute-force password attack, which I was able to detect because of a significant increase in the number of failed authentications in Datadog. I stopped the attack by blocking the attacker’s IP addresses in HAProxy.
Resolved a serious problem that would cause Ceph to crash. We traced the problem to a bug that was tied to the specific version of the software we were using.

Technologies: Ceph, MongoDB, MySQL, Datadog, Consul, HAProxy, Node.js, Shell Scripting, CI/CD Pipelines, SQL, Nagios, GitHub, Git, SecOps, Containerization, Architecture, Bash, Containers, Load Balancers, DevOps, Site Reliability Engineering (SRE), DevSecOps

Software Engineering Intern

2015 - 2016

Synopsys, Inc.

Developed a tool in Python for automatically generating C++ code that would bind hardware transactors written in C++ to TCL.
Built a tool for extracting statistics from a hardware-emulating platform and generating D3.js charts.
Fixed a major C++ bug caused by a race condition between GTK and a hardware transactor.
Worked for a month at Synopsys' headquarters in Mountain View where I learned a lot about electronic design automation.

Technologies: EDA, D3.js, Tcl, Python, C++, Verilog, GitHub, Git, Bash

Junior Back-end Engineer

2012 - 2014

MAV Technology

Developed a substantial part of the back end of a corporate email service; it was written in C++ with language bindings to Lua. I utilized MongoDB for storing the email metadata, GridFS for storing the email bodies, and MySQL for storing relational user data. Worked with REST interfaces in a monolithic architecture.
Built a part of their front end written in Java and Google Web Toolkit.
Constructed IMAP and POP3 proxies to route new users from other email service providers to their old servers while capturing their passwords and transparently migrating their accounts to our servers.
Developed HTTP and SMTP servers from scratch with C++.
Supported the development of the company’s ERP system, built with CakePHP and Bootstrap.

Technologies: Bootstrap, CakePHP, GWT, Java, MySQL, MongoDB, Lua, C++, SQL, Nagios, GitHub, Git, Bash, Load Balancers

Experience

Navalis

Navalis is a platform that enables developers to deploy and visualize applications in Kubernetes with ease. It also checks the cluster for inconsistencies and constantly monitors its health. It consists of an API written in Go and a front end written in Vue.js.

Flux Control Language Compiler

https://github.com/aleksanderllada/FCL-Compiler

FCL is a programming language I designed as my final project in college. Its goal is to allow scientists that are not familiar with low-level programming languages to dynamically control a pipetting robot.

This project is the compiler I wrote with Java and ANTLR4 in order to generate FCL's p-code, based on the formal grammar I wrote for the language.

Flux Control Language Interpreter

https://github.com/aleksanderllada/FCL-Interpreter

FCL is a programming language I designed as my final project in college. Its goal is to allow scientists unfamiliar with low-level programming languages to dynamically control a pipetting robot.

This project is the interpreter I wrote for the language's p-code, which is generated by the FCL Compiler. It works like a stack machine, similar to Python's and Lua's interpreters.
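
To illustrate the stack-machine idea mentioned above, here is a minimal sketch in Python. It is only an assumption-laden illustration: the opcodes (`PUSH`, `ADD`, `MUL`, `JMP_IF_ZERO`) and the `run` function are hypothetical and are not FCL's actual p-code or interpreter, which live in the linked repositories.

```python
from typing import List, Optional, Tuple

# Hypothetical instruction format: (opcode, optional integer operand).
# These opcodes are illustrative only and are NOT FCL's real p-code.
Instruction = Tuple[str, Optional[int]]


def run(program: List[Instruction]) -> List[int]:
    """Execute the program on an operand stack and return the final stack."""
    stack: List[int] = []
    pc = 0  # program counter: index of the next instruction to execute
    while pc < len(program):
        op, arg = program[pc]
        pc += 1
        if op == "PUSH":
            stack.append(arg)
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == "JMP_IF_ZERO":
            # Pop the condition; jump to the operand index when it is zero.
            if stack.pop() == 0:
                pc = arg
        else:
            raise ValueError(f"unknown opcode: {op}")
    return stack


# Computes (2 + 3) * 4 using only pushes and stack operations.
program = [("PUSH", 2), ("PUSH", 3), ("ADD", None), ("PUSH", 4), ("MUL", None)]
print(run(program))  # prints [20]
```

Every operation takes its inputs from the top of the stack and pushes its result back, so instructions need no named registers or variables; a compiler only has to emit operands in the right order.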

Education

2011 - 2017

Bachelor of Science Degree in Computer Science

Federal University of Minas Gerais - Belo Horizonte, Minas Gerais, Brazil

Certifications

FEBRUARY 2020 - FEBRUARY 2023

AWS Certified SysOps Administrator

Amazon Web Services

Skills

Libraries/APIs

Node.js, POCO C++, Terragrunt, Vue, D3.js

Tools

Helm, GitLab CI/CD, Jenkins, Terraform, Amazon Virtual Private Cloud (VPC), AWS IAM, GitHub, Git, GitLab, Amazon EKS, Ansible, Vault, Chef, NGINX, Grafana, Amazon CloudWatch, Amazon CloudFront CDN, VPN, ANTLR 4, Kong, Fluentd, Apache ZooKeeper, MirrorMaker, Nagios, RabbitMQ

Languages

Bash, Java, Go, JavaScript, C++, Python, SQL, Lua, Verilog, Tcl, Falcon, Java 8, Transaction Control Language (TCL), Snowflake

Platforms

Kubernetes, Linux, Apache Kafka, Amazon Web Services (AWS), Docker, Amazon EC2, PagerDuty, Google Cloud Platform (GCP), Heroku, Hyperledger Burrow, Harbor, OpenStack, Rancher, MacOS

Paradigms

Continuous Integration (CI), Continuous Delivery (CD), Distributed Computing, DevOps, Microservices Architecture, DevSecOps, Scrum, Design Patterns, HIPAA Compliance

Storage

Elasticsearch, Datadog, Amazon S3 (AWS S3), MongoDB, MySQL, PostgreSQL, Amazon DynamoDB, Cassandra, Redis, Ceph, Redshift

Frameworks

Qt 5, Flask, Express.js, GWT, CakePHP, Bootstrap, Spring

Other

Kubernetes Operations (kOps), Site Reliability Engineering (SRE), GoCD, Prometheus, AWS DevOps, Shell Scripting, CI/CD Pipelines, Infrastructure as Code (IaC), Cloud Infrastructure, SecOps, Amazon RDS, Containerization, Architecture, Containers, Load Balancers, AWS Cloud Architecture, IT Security, Distributed Tracing, HAProxy, APM, AWS Certified SysOps Administrator, Cloudflare, Technical Leadership, GitHub Actions, EDA, LDAP, Infrastructure Architecture, Security, Computer Science, Compilers, Programming Languages, iTerm2, Consul, VyOS, AWS Database Migration Service (DMS), Data Engineering, Data Warehousing, OpenTelemetry
