Aleksander Luiz Lada Arruda
Verified Expert in Engineering
Site Reliability Engineering (SRE) Developer
Aleksander is a DevOps and site reliability engineer with extensive experience in cloud-native technologies. He holds a bachelor’s degree in computer science and has deployed and managed production-grade clusters (such as Kubernetes, Kafka, and Elasticsearch) and worked on microservice architecture and everything that comes with it, including container orchestration, service discovery, message queues, monitoring, logging, and tracing.
Linux, macOS, iTerm2, Bash, Shell Scripting, Git
The most amazing...
...thing I’ve written was a multi-cluster Kafka setup providing very high availability to receive incoming app data from a company with over a billion downloads.
Site Reliability Engineering Manager
- Built a future-proof cloud-native infrastructure from scratch, managing several Kubernetes clusters across different environments, running self-contained, replaceable components maintained with infrastructure as code.
- Implemented a scalable and highly available stack for centralizing logs and metrics with Loki and Cortex, with automated alerts sent to different channels based on their severity level.
- Constructed the company's data infrastructure running on Kubernetes, managing clusters such as Kafka, Elasticsearch, and Cassandra; created components to extract data from different sources into Redshift and Snowflake.
- Introduced security best practices such as AWS CIS Benchmarks as well as intrusion detection and prevention techniques, targeting SOC 2 compliance; implemented granular access control across the systems, including AWS and Kubernetes.
- Automated building and deploying infrastructure components and applications throughout environments, combining continuous delivery and infrastructure as code.
- Developed a small tool that extracts detailed AWS cost data hourly, tags it, and ships it to Prometheus and Cortex, allowing the granular costs of the infrastructure to be visualized in real time.
DevOps Technical Screener
- Screened all types of applicants in the DevOps vertical as part of the Toptal screening team.
- Vetted candidates so that only the top 3% were approved.
- Worked on polishing the interview process, proposing new technical questions and tasks, as well as improving the existing ones.
- Advised applicants on improving their skills as DevOps engineers, what technologies they should seek to learn, and what certifications they should pursue based on their goals.
- Assisted the approved candidates by building their profiles in a way that would improve their chances of getting hired.
Senior Site Reliability Engineer
- Deployed and upgraded well-known production clusters and databases, such as Kubernetes, Elasticsearch, PostgreSQL, and Ceph.
- Fine-tuned our Elasticsearch cluster, which ingested roughly 300GB of data per day, by applying best practices informed by the low-level implementation of Apache Lucene, thereby improving its performance and allowing us to shrink its size.
- Owned the implementation of security components and best practices such as AWS CIS Benchmarks and intrusion detection and prevention tooling, which earned the company a SOC 2 certification.
- Provided on-call support 24/7, dealing with various incidents on the production infrastructure.
- Created several Jenkins pipelines with Groovy and Bash for deploying both infrastructure components and applications, and worked with Jenkins Configuration as Code (JCasC) to make the whole continuous delivery stack easily replicable.
- Containerized several applications, creating CI/CD pipelines not only for building and deploying but also for performing code checks and security scans.
- Implemented backup solutions for a range of systems, which enabled the development of a fast disaster recovery plan.
- Set up three Kubernetes clusters for development, staging, and production environments. All clusters were multi-AZ and had autoscaling. Monitoring was done with Datadog and PagerDuty.
- Implemented GoCD with custom elastic agents for deploying applications into all Kubernetes clusters. Containerized applications and deployed them as Helm charts.
- Implemented automatic provisioning and renewal of Let’s Encrypt TLS certificates with cert-manager.
- Deployed Fluentd daemon sets for aggregating logs from all the applications into Elasticsearch. Also deployed Elasticsearch curators for cleaning old logs.
- Set up the automatic monitoring of all Java applications deployed in the cluster by running them with sidecar containers exposing metrics retrieved from the application's JMX interface.
- Spearheaded the project Navalis, which was a web application intended to allow developers to deploy, monitor, and scale their applications in multiple Kubernetes clusters with ease. It was developed with Golang and Vue.js.
- Scaled Kubernetes up to 300 nodes in order to process massive batches of data within a few hours, taking into consideration the network and I/O limitations of both the local instances and the data source.
- Partnered with the data engineering team to develop a new Kafka cluster for the company inspired by Netflix’s way of orchestrating and monitoring Kafka. It consisted of several interconnected Kafka clusters that prevented the loss of data.
- Developed a system for monitoring backups consisting of a Python and Flask server and a client written in Go. The system would centralize the status of the backups across the whole infrastructure and notify our team whenever a backup was missing.
- Solved an issue with a large Elasticsearch cluster that used to crash at the beginning of each day. The issue was caused by misconfigured Logstash instances that flooded the cluster with requests for creating new shards.
- Developed a tool in Go for cross-validating the Kubernetes network; it would attempt to establish a route between every pair of machines in the cluster, either producing a complete graph or pointing out issues in the network.
- Created a redundant VPN between availability zones (US and AP) in AWS using VyOS.
- Helped instrument our most important servers with Jaeger APM.
- Deployed a Kubernetes cluster with autoscaling as a proof-of-concept to test how well a Kafka cluster would scale within Kubernetes.
- Solved an issue in which our Kafka cluster would crash because of unexpected behavior in Netflix's Exhibitor, a tool someone had installed to monitor ZooKeeper.
- Deployed multiple MongoDB clusters for collecting data during a high-traffic event.
- Deployed a Kubernetes cluster the hard way, without any tools like Kubernetes Operations (kOps) or kubeadm, to learn deeper concepts of its architecture.
- Centralized in an HAProxy cluster all incoming requests that lacked a proper entry point into the infrastructure (i.e., DNS pointed to many different entry points), thus avoiding single points of failure.
- Fixed multiple bugs in Node.js servers, among them a critical one that forced us to restart production containers from time to time because of progressive performance degradation.
- Solved multiple bugs in Objective-C servers by creating a system for debugging multiple servers in real time, attaching multiple GDB instances to processes distributed among nodes and capturing any resulting stack traces, which allowed us to quickly fix bugs that would only occur in the production environment.
- Developed a Node.js server that would hold thousands of connections open as a fronting proxy for a legacy server that was not able to receive too many simultaneous connections.
- Stopped an ongoing brute-force password attack, which I was able to detect because of a significant increase in the number of failed authentications in Datadog. I stopped the attack by blocking the attacker’s IP addresses in HAProxy.
- Resolved a serious problem that would cause Ceph to crash. We traced the problem to a bug that was tied to the specific version of the software we were using.
Software Engineering Intern
- Developed a tool in Python for automatically generating C++ code that would bind hardware transactors written in C++ to TCL.
- Built a tool for extracting statistics from a hardware-emulating platform and generating D3.js charts.
- Fixed a major C++ bug caused by a race condition between GTK and a hardware transactor.
- Worked for a month at Synopsys' headquarters in Mountain View, where I learned a lot about electronic design automation.
Junior Back-end Engineer
- Developed a substantial part of the back end of a corporate email service; it was written in C++ with language bindings to Lua. I used MongoDB for storing the email metadata, GridFS for storing the message bodies, and MySQL for storing relational user data. Worked with REST interfaces in a monolithic architecture.
- Built a part of their front end written in Java and Google Web Toolkit.
- Constructed IMAP and POP3 proxies to route new users from other email service providers to their old servers while capturing their passwords and transparently migrating their accounts to our servers.
- Developed HTTP and SMTP servers from scratch with C++.
- Supported the development of the company’s ERP system, built with CakePHP and Bootstrap.
Flux Control Language Compiler
https://github.com/aleksanderllada/FCL-Compiler
This project is the compiler I wrote with Java and ANTLR4 in order to generate FCL's p-code, based on the formal grammar I wrote for the language.
Flux Control Language Interpreter
https://github.com/aleksanderllada/FCL-Interpreter
This project is the interpreter I wrote for the language's p-code, which is generated by the FCL Compiler. It works like a stack machine, similar to Python's and Lua's interpreters.
Jenkins, Terraform, Amazon Virtual Private Cloud (VPC), AWS IAM, GitHub, Git, Ansible, Vault, Chef, NGINX, Grafana, Amazon CloudWatch, Amazon CloudFront CDN, VPN, Helm, GitLab CI/CD, ANTLR 4, Kong, Fluentd, Apache ZooKeeper, MirrorMaker, Nagios
Continuous Integration (CI), Continuous Delivery (CD), Distributed Computing, DevOps, Scrum, Design Patterns, HIPAA Compliance
Kubernetes, Linux, Apache Kafka, Amazon Web Services (AWS), Docker, Amazon EC2, PagerDuty, Google Cloud Platform (GCP), Heroku, Hyperledger Burrow, Harbor, OpenStack, Rancher, macOS
Elasticsearch, Datadog, Amazon S3 (AWS S3), MongoDB, MySQL, PostgreSQL, Amazon DynamoDB, Cassandra, Redis, Ceph, Redshift
Kubernetes Operations (kOps), Site Reliability Engineering (SRE), GoCD, Prometheus, AWS DevOps, Shell Scripting, CI/CD Pipelines, Infrastructure as Code (IaC), Cloud Infrastructure, SecOps, Amazon RDS, Containerization, Architecture, Containers, Load Balancers, AWS Cloud Architecture, Distributed Tracing, HAProxy, APM, AWS Certified SysOps Administrator, Cloudflare, Technical Leadership, EDA, LDAP, Infrastructure Architecture, Computer Science, Compilers, Programming Languages, iTerm2, Consul, VyOS, AWS Database Migration Service, Data Engineering, Data Warehousing
Qt 5, Flask, Express.js, GWT, CakePHP, Bootstrap, Spring
Node.js, POCO C++, Vue, D3.js
Bachelor of Science Degree in Computer Science
Federal University of Minas Gerais - Belo Horizonte, Minas Gerais, Brazil
AWS Certified SysOps Administrator
Amazon Web Services