Aldo Orozco

Verified Expert in Engineering

Data Engineer and Software Developer

Location
Zapopan, Mexico
Toptal Member Since
October 11, 2022

Aldo has over ten years of experience as a software engineer, five of which he has spent focused on data solutions. His broad expertise in embedded Linux systems, cloud infrastructure, site reliability engineering (SRE), and high-performance data architectures equips him to tackle complex systems. Throughout his career, Aldo has taken on multiple roles, including developer, consultant, architect, and lead.

Portfolio

Etsy
Apache Airflow, Apache Spark, Big Data Architecture, Bash, Kubernetes, Helm...
Wizeline
Amazon Web Services (AWS), Google Cloud Platform (GCP), Apache Airflow, Python...
Triolabs
Amazon Web Services (AWS), Apache Kafka, Python, Big Data Architecture...

Experience

Availability

Part-time

Preferred Environment

Apache Airflow, Spark, Python, Terraform, Kubernetes, Google Cloud Platform (GCP), Amazon Web Services (AWS), Helm, Data Warehousing, Big Data Architecture

The most amazing...

...things I've done include a large-scale Apache Airflow re-platforming on Kubernetes and Spark pipeline optimizations that cut execution time by 90%.

Work Experience

Senior Data Engineer II

2022 - PRESENT
Etsy
  • Led an Airflow migration from 1.10 on VMs to Airflow 2 on Kubernetes while upgrading and validating thousands of legacy directed acyclic graphs (DAGs), reducing deployment times from hours to about a minute with zero downtime.
  • Rearchitected a SQL parsing service and introduced multiprocessing, cutting processing time from 30 minutes to five; a sketch of the approach follows this entry.
  • Implemented skew inference and input/output collection services for Spark jobs executed across the company, allowing users to optimize their pipelines.
  • Implemented several RESTful microservices in Python to manage ad hoc testing of the Airflow environment.
Technologies: Apache Airflow, Apache Spark, Big Data Architecture, Bash, Kubernetes, Helm, Data Pipelines, ETL, SQL, Pipelines, Data Engineering, Google Cloud Platform (GCP), BigQuery, Google BigQuery, Google Bigtable, Google Cloud Storage, Buildkite, Jenkins, Google Cloud Build, Microservices, CI/CD Pipelines, Streaming Data, Terraform, Data Warehousing, Grafana, Prometheus, StatsD, Git, Python 3, Docker, Docker Compose, Google Compute Engine (GCE), Google Cloud SQL, Google Cloud Functions, Flask, Spark, Python, Software, PostgreSQL, Scala, Data Analysis, PHP, Databricks, ELT, Databases, Data Architecture, PySpark, Big Data, Scaling
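
A minimal sketch of the multiprocessing approach to the SQL parsing service, assuming queries live in standalone .sql files under a queries/ directory; extract_tables is a naive stand-in for the real parser:

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def extract_tables(sql_text: str) -> set:
    # Naive stand-in for the real parser: collect the identifiers
    # that immediately follow FROM/JOIN keywords.
    tokens = sql_text.split()
    return {
        name.strip(";,")
        for kw, name in zip(tokens, tokens[1:])
        if kw.upper() in ("FROM", "JOIN")
    }


def parse_queries(paths):
    # Parsing is CPU bound, so fan it out across worker processes;
    # each file is independent, so results map back to paths in order.
    texts = [p.read_text() for p in paths]
    with ProcessPoolExecutor() as pool:
        return dict(zip(map(str, paths), pool.map(extract_tables, texts)))


if __name__ == "__main__":
    print(parse_queries(sorted(Path("queries").glob("*.sql"))))
```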

Staff Data Engineer

2019 - 2022
Wizeline
  • Drove an internal program to transition software engineers into data engineering. Eighteen graduates were assigned to long-term projects, increasing profit by making engineers billable and filling project gaps.
  • Counseled solution architects during new customer interactions to propose performant data architectures, winning new deals and improving resource utilization by setting realistic architectures and expectations.
  • Architected an IoT cloud ingestion and monitoring solution for several million devices in collaboration with the SRE team. The project kicked off successfully, and a dozen engineers were assigned to it.
  • Collaborated on a change data capture (CDC) pipeline with Delta Lake, Kafka, and Spark that ingested financial data from other teams and third-party platforms and aggregated it in a data lake consumed by data scientists; a sketch of the ingestion step follows this entry.
  • Led a data community of over 20 members for more than a year, an educational platform where members presented topics on tooling efficiency, real-world troubleshooting, and data architecture trends.
  • Coordinated several mentorship programs to train software engineers in data engineering, easing the shortage of the discipline across the Americas.
Technologies: Amazon Web Services (AWS), Google Cloud Platform (GCP), Apache Airflow, Python, Scala, Terraform, Apache Kafka, Apache Spark, Data Pipelines, ETL, SQL, Pipelines, Data Engineering, CI/CD Pipelines, ELT, Databricks, Amazon S3 (AWS S3), Amazon RDS, Amazon Athena, Amazon Kinesis, Redshift, Snowflake, Azure, Amazon EC2, AWS Lambda, CDC, Bash, BigQuery, AWS Glue, Java, Data Analysis, Spark, Kubernetes, Data Warehousing, Big Data Architecture, Software, Machine Learning, Git, PostgreSQL, Data Warehouse Design, Data Lakes, Google BigQuery, Google Cloud Storage, Google Cloud SQL, Google Compute Engine (GCE), Docker, Python 3, Jenkins, Azure Data Factory, Databases, Apache Hive, Data Architecture, PySpark, Big Data, Scaling
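
A minimal sketch of the Kafka-to-Delta ingestion step in PySpark Structured Streaming; the broker address, topic name, and lake paths are illustrative, and the delta-spark package is assumed to be available:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder.appName("cdc-to-delta")
    # Enable the Delta Lake extension (assumes delta-spark is installed).
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Stream raw change events from the (hypothetical) CDC topic.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "cdc.accounts")
    .load()
    .select(F.col("value").cast("string").alias("payload"), "timestamp")
)

# Append events to a bronze Delta table for downstream aggregation.
(
    events.writeStream.format("delta")
    .option("checkpointLocation", "/lake/checkpoints/accounts")
    .outputMode("append")
    .start("/lake/bronze/accounts")
    .awaitTermination()
)
```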

AWS Big Data Architect

2019 - 2019
Triolabs
  • Implemented a service to ingest and prune brain imaging data from private laboratories in near real time using Apache Kafka and Python on AWS; a sketch of the consumer follows this entry. The resulting data was consumed by a machine learning model in R to extract insights from the scans.
  • Rearchitected a data warehouse data model to run terabyte-scale queries faster and speed up drug research analysis.
  • Coordinated a team of two to build aggregation pipelines in Apache Spark over the brain imaging results so that researchers could fine-tune drug research.
Technologies: Amazon Web Services (AWS), Apache Kafka, Python, Big Data Architecture, Data Warehousing, Data Lakes, Machine Learning, Data Pipelines, ETL, SQL, Pipelines, Data Engineering, Redshift, Amazon S3 (AWS S3), Amazon EC2, AWS Batch, R, Microservices, Amazon API Gateway, Spark, Apache Airflow, Software, Java, Bash, Git, Data Warehouse Design, Scala, Apache Spark, AWS Glue, Amazon RDS, ELT, Python 3, Databases, Data Architecture, PySpark, Big Data, Scaling
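
A minimal sketch of the near-real-time ingest-and-prune loop, using the confluent-kafka client and boto3; the broker, topic, and bucket names are illustrative:

```python
import uuid

import boto3
from confluent_kafka import Consumer

s3 = boto3.client("s3")
consumer = Consumer({
    "bootstrap.servers": "broker:9092",
    "group.id": "brain-imaging-ingest",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["scans.raw"])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue  # nothing ready yet, or a transient broker error
        # Prune obviously empty payloads before landing them in the lake;
        # the real service applied metadata-based filtering here.
        if msg.value():
            s3.put_object(
                Bucket="imaging-data-lake",
                Key=f"raw/{uuid.uuid4()}.scan",
                Body=msg.value(),
            )
finally:
    consumer.close()
```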

Big Data Engineer

2018 - 2019
Apex Systems
  • Developed over ten Spark pipelines to aggregate and store terabytes of data in Hive, Elasticsearch, and MongoDB for the marketing team. The aggregations were exposed via APIs, enabling the analytics team to build campaigns that attracted new users.
  • Created a library for Spark jobs to efficiently enrich datasets with location data from the Google Maps APIs, allowing for better-targeted marketing campaigns.
  • Optimized the Spark settings of several production workloads, cutting the cost and duration of overnight runs to a third and minimizing report delivery delays to senior management; a sketch of typical tuning knobs follows this entry.
Technologies: Amazon Web Services (AWS), Apache Airflow, Spark, Qubole, MongoDB, Elasticsearch, Java, Python, Spring, Data Pipelines, ETL, SQL, Pipelines, Data Engineering, Data Quality, Data Analysis, Terraform, Data Warehousing, Big Data Architecture, Software, Hadoop, Bash, Git, Google Maps API (GeoJSON), R, Apache Spark, Amazon EC2, Amazon Athena, Amazon RDS, Docker, Python 3, Databases, Apache Hive, Data Architecture, PySpark, Big Data, Scaling
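
A minimal sketch of the kind of settings such tuning touches; the values are illustrative and depend entirely on cluster size and data volume:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("overnight-aggregations")
    # Match shuffle parallelism to the cluster rather than the 200 default.
    .config("spark.sql.shuffle.partitions", "512")
    # Right-size executors to avoid over-allocating containers.
    .config("spark.executor.memory", "8g")
    .config("spark.executor.cores", "4")
    # On Spark 3.x, let AQE coalesce small shuffle partitions at runtime.
    .config("spark.sql.adaptive.enabled", "true")
    .getOrCreate()
)
```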

Embedded Software Engineer

2014 - 2018
Continental Automotive Systems
  • Developed a data pipeline in Hadoop MapReduce to aggregate historical cellular network data, helping cut errors in a middleware service by half; a sketch of such a job follows this entry.
  • Architected and led the development of a service to automatically reconfigure a car while driving and hand connections over seamlessly between cellular stations.
  • Assisted in containerizing my team's development environment, reducing setup friction and providing a stable baseline.
  • Gave a series of training sessions for 30 developers on unit testing and coverage using a proprietary tool, helping remove dozens of unreachable code snippets.
Technologies: C, C++, Hadoop, Python, Bash, Git, Linux, Software, Amazon Web Services (AWS), Docker, Python 3, Jenkins
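
A minimal sketch of a comparable aggregation as a Hadoop Streaming job in Python, counting errors per cell station; the CSV layout and error label are hypothetical:

```python
#!/usr/bin/env python3
"""Run as both phases of a streaming job, e.g.:
-mapper "errors.py map" -reducer "errors.py reduce"."""
import sys


def mapper():
    # Emit one count per error record, keyed by station ID.
    for line in sys.stdin:
        fields = line.rstrip("\n").split(",")
        if len(fields) >= 2 and fields[1] == "ERROR":
            print(f"{fields[0]}\t1")


def reducer():
    # Hadoop sorts mapper output by key, so equal keys arrive adjacently.
    current, total = None, 0
    for line in sys.stdin:
        station, count = line.rstrip("\n").split("\t")
        if station != current and current is not None:
            print(f"{current}\t{total}")
            total = 0
        current = station
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")


if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```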

Projects

Adaptive Big Data Pipelines

https://github.com/aldoorozco/adaptive_data_pipelines
A containerized system that generates big data pipelines using Apache Spark and Apache Airflow, configured automatically based on data size and schema. The system provides a web page so users with little to no experience can create datasets from files, specify aggregations in SQL, and set the cadence at which the pipeline should run. It then provisions the cloud infrastructure and submits the tailored pipelines.
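
A minimal sketch of how such a system can emit Airflow DAGs from pipeline specs; the specs below are hypothetical stand-ins for what the web page derives from each dataset:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical generated specs; the real system derives these from
# the submitted dataset's size, schema, SQL, and cadence.
PIPELINES = [
    {"name": "orders_daily", "schedule": "@daily", "query": "orders.sql"},
    {"name": "users_hourly", "schedule": "@hourly", "query": "users.sql"},
]

for spec in PIPELINES:
    with DAG(
        dag_id=spec["name"],
        schedule_interval=spec["schedule"],
        start_date=datetime(2024, 1, 1),
        catchup=False,
    ) as dag:
        # Submit the Spark job tailored to this dataset's aggregation.
        BashOperator(
            task_id="run_spark_job",
            bash_command=f"spark-submit aggregate.py --query {spec['query']}",
        )
    # Expose each generated DAG at module level so Airflow discovers it.
    globals()[spec["name"]] = dag
```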

Marketing Recommendation System

https://www.vrbo.com/es-mx/
A series of orchestrated ETL pipelines using Apache Spark, Apache Airflow, MongoDB, and Google geographical APIs running on AWS that aggregated clickstream visits so marketing campaigns could recommend nearby weekend trips to users.

Brain Imaging Prediction

https://neumoratx.com/
A data architecture encompassing several components: an ingestion pipeline that extracted brain imaging data from a medical institution into a data lake, a series of PySpark- and R-based pipelines that extracted insights from the imaging files, refinement pipelines that merged medical records with those insights, and finally a data warehouse where the data was stored.

Education

2018 - 2020

Master's Degree in Computer Science

ITESO, Jesuit University of Guadalajara - Guadalajara, Mexico

2010 - 2014

Bachelor's Degree in Mechatronics Engineering

Centro de Enseñanza Tecnica Industrial - Guadalajara, Mexico

Certifications

DECEMBER 2019 - DECEMBER 2021

GCP Professional Data Engineer

Google Cloud

Languages

Python, Bash, SQL, Python 3, C++, Java, C, Scala, Snowflake, R, PHP

Frameworks

Spark, Apache Spark, Hadoop, Spring, Flask

Libraries/APIs

PySpark, Google Maps API (GeoJSON)

Tools

Apache Airflow, Git, BigQuery, Terraform, Google Compute Engine (GCE), Amazon Athena, AWS Glue, Helm, Qubole, Docker Compose, Jenkins, Grafana, AWS Batch

Paradigms

ETL, Microservices

Platforms

Linux, Docker, Kubernetes, Google Cloud Platform (GCP), Amazon Web Services (AWS), Apache Kafka, Databricks, Amazon EC2, Buildkite, Azure, AWS Lambda

Storage

Data Pipelines, Amazon S3 (AWS S3), Databases, Data Lakes, PostgreSQL, Google Cloud Storage, Google Cloud SQL, Apache Hive, MongoDB, Elasticsearch, Google Bigtable, Redshift

Other

Data Warehousing, Big Data Architecture, Software, Pipelines, Data Engineering, Google BigQuery, Data Architecture, Scaling, Big Data, Complex Problem Solving, Teamwork, Data Warehouse Design, Google Cloud Build, ELT, Amazon RDS, Data Analysis, Machine Learning, CI/CD Pipelines, Streaming Data, Prometheus, StatsD, Google Cloud Functions, Amazon Kinesis, CDC, Amazon API Gateway, Data Quality, Azure Data Factory
