Sam is available for hire

Sam Rogers

Verified Expert in Engineering

Data Engineer and Developer

Location

Boston, MA, United States

Toptal Member Since

June 24, 2020

Sam is a data engineer who specializes in creating AWS solutions for ETL. Due to his attentiveness and drive for excellence, he has continuously provided scalable, repeatable, and cost-effective solutions to process data at scale. Sam thrives on projects that run his Python code on AWS resources, but he also has a working understanding of Google Cloud and Azure.

Data Warehousing, Data Warehouse Design, Business Intelligence (BI), Python, SQL, ETL, Data Pipelines, Snowflake, Docker, Spark, PySpark, Apache Airflow, Data Modeling, Amazon API Gateway, PostGIS, Dashboard, Dask

Portfolio

Starry Internet

Salesforce, Stripe, NetSuite, Amazon Web Services (AWS), PostGIS, PostgreSQL...

Drift

Amazon Web Services (AWS), Redshift, Docker, Apache Airflow, Snowflake, SQL...

Liberty Mutual

Amazon Web Services (AWS), Geospatial Data, Dask, PostGIS, EMR, Redshift, SQL...

Experience

Python - 3 years, SQL - 3 years, ETL - 3 years, Snowflake - 2 years, Docker - 2 years, Spark - 2 years, Apache Airflow - 2 years

Availability

Part-time

Preferred Environment

Bash, PyCharm, Slack

The most amazing...

...thing that I've ever done was migrate a data warehouse in just two weeks. This involved 20 data sources, over 10TB of data, and hundreds of different reports.

Work Experience

Data Engineer

2019 - PRESENT

Starry Internet

Designed and implemented a large-scale data-processing platform utilizing Spark and AWS EMR. This included building a pipeline to process and aggregate IoT data coming from tens of thousands of devices every minute.
Revamped the team's code deployment process from manual build and upload to a CI/CD pipeline running in AWS CodeBuild, which reduced the engineering effort per deployment from ten minutes to under one minute.
Led the design, implementation, testing, and migration to a highly scalable Airflow environment. The environment serves as the core platform for all ETL running over 800 containerized tasks every hour.
Drove an increased sense of accountability and reduced error-response time by implementing an incident response framework and ticketing system for the broader data engineering team.
Rearchitected the use of Snowflake for large-scale data processing. Shifted workloads from Snowflake to PySpark running on EMR, resulting in 30% cost savings and a 50% reduction in pipeline run time.
Developed a data quality tool that runs over 1,200 checks per hour against the data warehouse to ensure that data meets expectations. This resulted in a shift from reactive error handling to proactive monitoring and incident management.

Technologies: Salesforce, Stripe, NetSuite, Amazon Web Services (AWS), PostGIS, PostgreSQL, Apache Airflow, Docker, Geospatial Data, Spark, Scala, Snowflake, SQL, Python
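
The data quality tool mentioned in this role can be illustrated with a minimal sketch. It is not the production tool: it assumes each check is a SQL query returning a single value plus a rule that value must satisfy, and it uses an in-memory SQLite database as a stand-in for the warehouse. The `Check` class, the `orders` table, and the check names are all hypothetical.

```python
import sqlite3
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Check:
    """One data quality rule (hypothetical structure)."""
    name: str
    sql: str                          # query that returns a single scalar
    passes: Callable[[object], bool]  # rule the scalar must satisfy


def run_checks(conn, checks):
    """Run every check and return (name, observed value) for each failure."""
    failures = []
    for check in checks:
        value = conn.execute(check.sql).fetchone()[0]
        if not check.passes(value):
            failures.append((check.name, value))
    return failures


def demo_connection():
    """Build a tiny in-memory table that stands in for a warehouse table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(1, 10.0), (2, None), (3, 5.5)],
    )
    return conn


CHECKS = [
    Check("orders_not_empty",
          "SELECT COUNT(*) FROM orders",
          lambda n: n > 0),
    Check("orders_amount_not_null",
          "SELECT COUNT(*) FROM orders WHERE amount IS NULL",
          lambda n: n == 0),
]

failures = run_checks(demo_connection(), CHECKS)
```

Running the checks on a schedule and routing the returned failures to an alerting or ticketing system is one way to move from reactive error handling to proactive monitoring.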

Data Engineer

2018 - 2019

Drift

As the data team's first hire, managed and maintained all aspects of ETL, data warehousing, and analytics tools and infrastructure, and was responsible for the ingestion of new data sources, data quality, and availability.
Stood up the Airflow back end using ECS, Fargate, RDS, and Redis to serve as the core ETL tool for all data processing and pipelines.
Led the migration from Redshift to Snowflake involving 17 separate streaming data sources, 1,000+ tables, and over 20 different teams reliant on the warehouse. The migration resulted in no increase in cost and a 75% decrease in query time.
Developed a reliable Spark pipeline to process 100GB+ of data daily and produce clean, manageable aggregations of end-user interaction data.
Built Success Factor Score: a statistical model that determines the health of a customer based on usage, interaction, and engagement data. This score serves as a key business metric that customer success managers are evaluated on.
Evaluated, implemented, and trained a team on Looker, a powerful data definition management and BI tool that enables nontechnical users to access and analyze data.

Technologies: Amazon Web Services (AWS), Redshift, Docker, Apache Airflow, Snowflake, SQL, Python

Data Science Engineer

2017 - 2018

Liberty Mutual

Developed an infrastructure to process and understand the impact of aircraft noise on the livability of a particular location (10+ billion records).
Produced a prototype to enable executives to quickly ingest and understand 1,000+ comments from monthly employee opinion surveys. Developed a front-end web app to make the results easy to access.
Architected and built a data pipeline to enable the automatic summarization of customer service calls.

Technologies: Amazon Web Services (AWS), Geospatial Data, Dask, PostGIS, EMR, Redshift, SQL, Python

Analytics Associate

2016 - 2017

Liberty Mutual

Developed market-sizing models from numerous different sources to estimate the potential business value of various new product concepts.
Assessed an opportunity and developed a model to intelligently select which no-fault claims should be sent to litigation. The model is projected to increase recovery dollars by $700,000.
Gathered use cases from leaders across the organization for a cloud-based infrastructure and prioritized use cases to ultimately create a cloud transition strategy.

Technologies: R, SQL

Experience

Total Home Score Data Pipeline

http://www.totalhomescore.com

Total Home Score is a product designed to help prospective home buyers and renters understand what it is like to live at a particular property prior to making a decision.

To scale this product and calculate scores for millions of properties, I constructed a large-scale data pipeline to perform complex geospatial calculations and aggregations.

This pipeline involved the use of Spark and EMR to run calculations on road traffic data and produce aggregations on how drivers typically drive on a particular length of roadway. Addresses are then loaded into Dask and calculations are performed across thousands of partitions to determine how much "dangerous" roadway exists within a particular radius of a given address.

Additionally, I developed a pipeline to process aircraft location data (10 billion+ points) and determine the level of aircraft noise expected at a particular property.
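
The radius calculation described above can be sketched in plain Python. This is a simplified illustration, not the pipeline's actual code: it assumes each road segment is represented by its midpoint, its length in meters, and a precomputed "dangerous" flag, and it uses the haversine formula for distance. The `RoadSegment` class, the function names, and the sample coordinates are hypothetical.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


@dataclass(frozen=True)
class RoadSegment:
    """Simplified segment: midpoint, length, and a precomputed flag."""
    lat: float
    lon: float
    length_m: float
    dangerous: bool


def dangerous_length_within(lat, lon, radius_m, segments):
    """Total length of dangerous segments whose midpoint is within radius_m."""
    return sum(
        s.length_m
        for s in segments
        if s.dangerous and haversine_m(lat, lon, s.lat, s.lon) <= radius_m
    )


# Hypothetical sample data around a single address.
SEGMENTS = [
    RoadSegment(42.3611, -71.0589, 120.0, True),   # roughly 111 m away
    RoadSegment(42.3601, -71.0589, 300.0, False),  # at the address, not dangerous
    RoadSegment(42.4601, -71.0589, 500.0, True),   # roughly 11 km away
]

total = dangerous_length_within(42.3601, -71.0589, 500.0, SEGMENTS)
```

Because the function depends only on one address and a set of segments, it could be applied independently to each partition of addresses, which is the kind of work Dask distributes.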

End User Analytics Cache

A Python-based application running on AWS Lambda and Redis that enabled the millisecond-level record retrieval of aggregated product usage data.

While working at a marketing technology provider, our product team wanted the ability to surface product usage data to our customers. Customers do not want to wait for a query to run in our data warehouse and return a result. The solution that I devised was to run a predefined set of aggregations and place them into a cache so that results could be received and visualized by a customer almost instantly.

Not only was this faster than running aggregations on demand, but it was also more cost-effective: instead of running thousands of aggregation queries in Snowflake per day, only one query needed to run to generate the output data and place it into our cache.
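
The precompute-then-cache pattern described above can be sketched as follows. This is an illustration under stated assumptions, not the original application: a plain dictionary stands in for Redis, the event rows stand in for the output of the single warehouse query, and the key format `usage:<customer_id>` and all function names are hypothetical.

```python
import json
from collections import defaultdict


def build_usage_cache(events):
    """Aggregate raw usage rows once and return key -> JSON payload."""
    totals = defaultdict(lambda: defaultdict(int))
    for event in events:
        totals[event["customer_id"]][event["feature"]] += event["count"]
    # One serialized entry per customer, ready to be written to a cache.
    return {
        f"usage:{customer_id}": json.dumps(features, sort_keys=True)
        for customer_id, features in totals.items()
    }


def get_usage(cache, customer_id):
    """Serve a request with a single key lookup instead of a warehouse query."""
    raw = cache.get(f"usage:{customer_id}")
    return json.loads(raw) if raw is not None else {}


# Hypothetical rows standing in for the result of the one batch query.
EVENTS = [
    {"customer_id": "acme", "feature": "chat", "count": 3},
    {"customer_id": "acme", "feature": "email", "count": 2},
    {"customer_id": "acme", "feature": "chat", "count": 2},
    {"customer_id": "globex", "feature": "chat", "count": 7},
]

cache = build_usage_cache(EVENTS)
acme_usage = get_usage(cache, "acme")
```

The request path never touches the warehouse, so lookup latency depends only on the cache, and the aggregation cost is paid once per refresh instead of once per request.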

Containerized Airflow Processing

Airflow is an open-source tool designed for orchestration, scheduling, and execution of ETL jobs. The tool was originally designed to run these processes within it as well. However, many problems can arise from Airflow performing actual processing and not just coordinating between resources. For example, all dependencies need to be installed on the same instance, memory leaks can bring down the entire cluster, and additional security measures may be necessary if customer data is being touched by the cluster.

My solution was to have Airflow serve solely as a container execution tool, so no processing actually takes place within the application. The configuration of which job to execute and which parameters to pass to it is still contained within the Airflow code, but the job itself is executed elsewhere. This makes for a simple interface for other data engineers to implement new pipelines.

For example, if an engineer has a file in S3 that they want to be loaded to a database on a schedule, they simply utilize the loader operator class that already exists in the Airflow repository. When executed, the class provisions a task to run in AWS Fargate which executes a process using the configurations passed to it.
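
A loader of this kind might look roughly like the sketch below. It is an assumption-laden illustration, not the class from the repository: it is written as a plain Python class so that it runs without Airflow installed (in Airflow it would typically subclass `BaseOperator`), and the ECS client is passed in so that a boto3 ECS client or a test double can be used. The class name, cluster, task definition, container name, subnet, and environment variable names are all hypothetical; the keyword arguments follow the shape of the boto3 ECS `run_task` call.

```python
from dataclasses import dataclass, field


@dataclass
class S3ToDatabaseLoader:
    """Hypothetical loader: describes the job, runs it on Fargate."""
    s3_uri: str
    target_table: str
    cluster: str = "etl-cluster"          # hypothetical ECS cluster name
    task_definition: str = "s3-loader"    # hypothetical task definition
    container_name: str = "loader"        # hypothetical container name
    subnets: list = field(default_factory=lambda: ["subnet-00000000"])

    def run_task_kwargs(self):
        """Build the arguments for an ECS run_task call on Fargate."""
        return {
            "cluster": self.cluster,
            "taskDefinition": self.task_definition,
            "launchType": "FARGATE",
            "networkConfiguration": {
                "awsvpcConfiguration": {
                    "subnets": self.subnets,
                    "assignPublicIp": "DISABLED",
                }
            },
            # The job's configuration travels as container environment
            # variables; the scheduler itself does no data processing.
            "overrides": {
                "containerOverrides": [{
                    "name": self.container_name,
                    "environment": [
                        {"name": "S3_URI", "value": self.s3_uri},
                        {"name": "TARGET_TABLE", "value": self.target_table},
                    ],
                }]
            },
        }

    def execute(self, ecs_client):
        """Start the task and return its ARN."""
        response = ecs_client.run_task(**self.run_task_kwargs())
        return response["tasks"][0]["taskArn"]
```

With this shape, an engineer adding a pipeline only declares the source file and the target table; the container image that does the actual loading is versioned and deployed separately from the scheduler.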

Education

2013 - 2016

Bachelor's Degree in Economics

University at Buffalo - Buffalo, NY, USA

Skills

Libraries/APIs

PySpark, Flask-RESTful, Dask, Stripe, Luigi

Tools

Apache Airflow, Looker, Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Container Registry (ECR), Amazon Elastic MapReduce (EMR), Slack, PyCharm, Amazon Simple Notification Service (Amazon SNS), Amazon Simple Queue Service (SQS), AWS CodeBuild, AWS IAM

Frameworks

Spark, Flask, Serverless Framework, Django

Languages

Python, SQL, Snowflake, Bash, R, Scala, SAS

Paradigms

Business Intelligence (BI), ETL, REST, DevOps

Platforms

Docker, Amazon Web Services (AWS), AWS Lambda, Amazon EC2, Salesforce

Storage

Data Pipelines, PostGIS, MySQLdb, Databases, Redis, MySQL, PostgreSQL, Amazon S3 (AWS S3), Redshift

Other

Pipelines, Data Warehousing, Dashboards, Web Dashboards, Data Warehouse Design, Data Modeling, Geospatial Data, GeoSpark, Amazon API Gateway, Dash, EMR, NetSuite, Singer ETL, Data Build Tool (dbt)
