
Caíque França
Verified Expert in Engineering
Platform Engineer and Developer
Belo Horizonte - State of Minas Gerais, Brazil
Toptal member since May 14, 2026
Caíque is a senior platform engineer with 14+ years of experience designing reliable, scalable cloud infrastructure. He specializes in AWS, Kubernetes, and Terraform, with deep expertise in DevOps, SRE, and observability tools such as Datadog, Prometheus, and Grafana. He's delivered high-impact platforms in international environments where uptime, automation, and developer experience drive business success.
Experience
- Linux - 14 years
- Observability - 12 years
- Bash Script - 12 years
- Docker - 7 years
- GitHub - 7 years
- Kubernetes - 7 years
- CI/CD Pipelines - 6 years
- Terraform - 5 years
Preferred Environment
Linux, Docker, Kubernetes, GitHub, GitLab, CI/CD Pipelines, Terraform, Observability, Bash Script, Python 3
The most amazing...
...project I've delivered was a custom VPN observability platform that turned 59% user dissatisfaction into a permanent home-office model for a major enterprise.
Work Experience
Senior Platform Engineer
TradeWeb Markets
- Leveraged Argo CD for application deployment management on Kubernetes clusters.
- Enhanced existing Terraform code to improve infrastructure-as-code.
- Developed tests for Kubernetes clusters using Open Policy Agent (OPA) and Conftest.
- Conducted a proof of concept for Linkerd on EKS and Rafay Kubernetes clusters.
- Managed tasks involving AWS services, RDS, and Logstash.
- Improved logging for Jira and Confluence to support internal teams.
- Created Python code to integrate PagerDuty contacts with Grafana.
- Worked on GitLab CI pipelines for build and deployment workflows.
- Implemented Elasticsearch and Kibana using Helm on EKS.
- Handled tasks involving Atlantis and used Terratest to test Kubernetes modules on EKS.
Senior Platform Engineer
babelforce
- Managed the entire cloud infrastructure, ensuring reliability and scalability.
- Participated in a 24/7 on-call rotation for critical system support.
- Led a project to implement a tracing solution, successfully deploying Grafana Tempo.
- Specialized in observability, leveraging Prometheus, Grafana Loki, and Grafana Tempo for metrics, logs, and tracing.
- Managed CI/CD pipelines using GitLab CI for efficient deployment processes.
- Orchestrated Kubernetes clusters using kops on AWS, optimizing resource allocation.
- Implemented GitLab Runner on Amazon EKS using Fargate for enhanced scalability.
- Developed Grafana dashboards and alerts using Prometheus metrics and Grafana Loki logs.
- Engineered Helm chart templates for streamlined application deployment on AWS.
- Utilized Argo CD to manage application deployment and configuration in AWS.
Senior SRE Cloud Engineer
Venmo
- Focused on observability and incident response for Venmo/PayPal.
- Implemented the Vector observability pipeline with Datadog.
- Created dashboards and alerts for Datadog cost management and developed cost reduction initiatives.
- Developed Terraform modules on GitHub to streamline creation of alerts, SLOs, and dashboards.
- Resolved Jira tickets related to Datadog and PagerDuty.
Specialist SRE, Observability
Dock
- Led the SRE observability squad, overseeing Dock's entire observability ecosystem.
- Developed strategic plans for short, medium, and long-term observability enhancements.
- Led open-source observability projects, starting with Prometheus for metrics.
- Created the "Observability Showroom" to inspire developers with model solutions.
- Developed the "Observability Journey" to document set-up steps for observability tools.
- Architected the observability team's AWS infrastructure.
- Provided technical support for Datadog and Splunk.
- Managed project tasks in Jira and handled backlog construction and delivery deadlines.
- Trained and mentored the team on SRE practices and DevOps culture.
Senior Site Reliability Engineer
Banco Itaú
- Led SRE efforts in a squad focused on observability.
- Provided technical consultancy for monitoring applications using AppDynamics.
- Planned and executed application migrations from PaaS (OpenShift), IaaS (OpenStack), AWS (ECS, EKS, EC2), and on-premises solutions to a new AppDynamics SaaS environment.
- Promoted observability culture across product squads.
Site Reliability Engineer
Localiza
- Ensured system stability as part of a multidisciplinary team.
- Provided technical consultancy for monitoring on-premises and cloud applications using AppDynamics and Datadog.
- Monitored Windows servers with OpManager and WMI technology.
- Automated disaster recovery processes for the network team using a shell script.
- Managed, administered, and operated network assets, including Cisco and HP switches, Aruba Wi-Fi controllers and APs, Palo Alto, Fortinet, and ASA firewalls, Citrix NetScaler load balancers, and Aruba ClearPass NAC.
- Managed, administered, and operated tools supporting operations such as OpManager, Zabbix, TRAFip, SLAviel, CFGtool, and Infoblox.
- Led the project for network traffic management and monitoring at Localiza headquarters and key branches.
- Handled N2 and N3 level support tickets for network infrastructure and information security.
- Produced reports, metrics, and dashboards with strategic insights for the business.
Technical Consultant
Telcomanager
- Provided technical consultancy in pre-sales, post-sales, and special projects, developing new business opportunities in the corporate market.
- Analyzed client network infrastructure to provide optimal solutions and effectively communicated client needs to support and development teams.
- Participated in corporate trade shows such as Futurecom, delivering speeches and presentations to promote Telcomanager solutions.
- Coordinated the technical support team, ensuring high-quality service delivery aligned with the company's mission, vision, and philosophy.
- Achieved significant cost savings, doubled software sales over three years, and substantially increased the client base through technical support advancements.
- Achieved over 98% customer satisfaction ratings for support calls and implemented improvements in response time, incident resolution, and customer surveys.
- Supervised, hired, and provided ongoing training for the technical support team, documenting processes, manuals, and guidelines for the department.
- Conducted customized client training sessions and developed scripts in Shell and Lua to support Telcomanager's network management tools.
- Identified and replicated bugs in network management tools, providing detailed reports and proposing system improvements to the development team.
- Configured client network assets, focusing on flow export protocols (NetFlow, sFlow, IPFIX) and SNMP, and participated in the implementation of Telcomanager solutions.
Experience
Home Office VPN Observability Platform
The solution combined three monitoring layers:
• VPN firewalls, collecting CPU, memory, disk, traffic, simultaneous client connections, and UDP/ICMP/SSL/TCP session counts against device limits.
• User VPN sessions, capturing connection/disconnection events, session durations, and contextual data (employee ID, department, ISP, public/private IP, state, city) in near real-time (5-minute intervals).
• Home internet quality, measured by a custom service running on every remote workstation that executed ICMP tests against VPN peers, the telephony system, internal servers, and public DNS, and detected whether the connection was Wi-Fi or wired, along with Wi-Fi signal strength.
I built Power BI dashboards and SQL-based reports that drove decisions across support, security, and leadership teams.
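As an illustration of the home internet quality layer described above, here is a minimal Python sketch of how one batch of ICMP results per target might be summarized before being stored for reporting. It is not the original implementation: the names `ProbeSummary` and `summarize_probes`, the use of `None` to mark a lost packet, and the choice of packet loss and average round-trip time as the reported metrics are all assumptions made for this example.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional, Sequence


@dataclass(frozen=True)
class ProbeSummary:
    """Hypothetical per-target summary of one batch of ICMP probes."""
    target: str
    sent: int
    lost: int
    loss_pct: float
    avg_rtt_ms: Optional[float]  # None when every probe was lost


def summarize_probes(target: str, rtts_ms: Sequence[Optional[float]]) -> ProbeSummary:
    """Summarize round-trip times in milliseconds; None marks a lost packet (assumed convention)."""
    sent = len(rtts_ms)
    replies = [rtt for rtt in rtts_ms if rtt is not None]
    lost = sent - len(replies)
    # Avoid dividing by zero when no probes were sent.
    loss_pct = 100.0 * lost / sent if sent else 0.0
    avg_rtt_ms = mean(replies) if replies else None
    return ProbeSummary(target, sent, lost, loss_pct, avg_rtt_ms)


if __name__ == "__main__":
    # "vpn-peer" is a placeholder target name, not a real host.
    print(summarize_probes("vpn-peer", [10.0, None, 30.0, 20.0]))
```

Keeping the summary a pure function of the raw probe results means the same logic could be applied to any of the targets listed above (VPN peers, telephony, internal servers, public DNS), whatever tool actually sends the pings.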
Education
Bachelor's Degree in Control and Automation Engineering
Federal Center for Technological Education of Rio de Janeiro Celso Suckow da Fonseca - Rio de Janeiro, Brazil
Technical Course in Telecommunications
Federal Center for Technological Education of Rio de Janeiro Celso Suckow da Fonseca - Rio de Janeiro, Brazil
Certifications
HashiCorp Certified: Terraform Associate
HashiCorp
Certified Kubernetes Administrator
The Linux Foundation
LPIC-1: Linux Administrator
Linux Professional Institute
AppDynamics Certified Associate Performance Analyst
AppDynamics
AWS Certified Solutions Architect
Amazon Web Services
AWS Certified Cloud Practitioner
Amazon Web Services
GitLab Certified Associate
GitLab
Cisco Certified Network Associate
Cisco Systems
MikroTik Certified Network Associate (MTCNA)
MikroTik
Skills
Libraries/APIs
Thanos
Tools
Terraform, GitHub, GitLab, GitLab CI/CD, MATLAB, Helm, Grafana, Ansible, Kustomize, Jira, Confluence, Amazon EKS, CircleCI, Splunk, AppDynamics, Rundeck, F5 Load Balancer, Zabbix, Microsoft Power BI, ACL, VPN, Amazon CloudWatch, Kubectl, Git
Platforms
Linux, Docker, Kubernetes, PagerDuty, Amazon EC2, RouterOS
Languages
Bash Script, Python 3, C, Java, Lua, SQL
Frameworks
Crossplane
Paradigms
Azure DevOps
Storage
Datadog, Amazon S3 (AWS S3)
Other
Observability, CI/CD Pipelines, Robotics, Linear Algebra, Electronics, Networks, Physics, Mathematics, Mechanics, Telecommunication Engineering, TCP/IP, Programming, OSI Model, Argo CD, Prometheus, Karpenter, Open Policy Agent (OPA), OpenTelemetry, Amazon RDS, ServiceNow, Amazon Route 53, System Center Operations Management (SCOM), Palo Alto Networks, Cisco, Fortinet, NetFlow, SNMP, Firewalls, MikroTik, Data Analysis, Routing, Cisco Switches, VLANs, Virtual Private Cloud (VPC), Cloud Computing, Cloud, AWS Pricing, AWS Support, Cloud Security, Shell Scripting, System Administration, File Permissions, Networking, User Management, APM, Dashboards, Notification Center, Troubleshooting, Container Orchestration, Cluster Administration, Infrastructure as Code (IaC), HCL