Steven Rouk, Developer in Arlington, VA, United States

Steven Rouk

Verified Expert in Engineering

Data Engineer and Developer

Location
Arlington, VA, United States
Toptal Member Since
October 6, 2021

Steven is an expert Python and SQL data engineer with eight years of experience and strong data science and data analysis capabilities. He has led the development of technical solutions, and his most recent role was as a senior member of a data engineering team at a Fortune 100 company. The team processed billions of rows of data in a Snowflake ecosystem with thousands of tables, and the client regarded Steven as one of the highest-contributing members of the 10+ member team.

Portfolio

Slalom
Python, SQL, Snowflake, Jenkins, Apache Airflow, Tableau, Bash...
Mercy For Animals
SQL, Python, R, Optimizely, Tableau, Data Science, Data Pipelines...
Boulder Insight
Tableau, Python, SQL, MySQL, Flask, Data Science, ETL, Data Pipelines...

Experience

Availability

Part-time

Preferred Environment

Python, SQL, Snowflake, Jupyter Notebook, Tableau, Pandas, Data Build Tool (dbt), Fivetran, Google Cloud Platform (GCP), Amazon Web Services (AWS)

The most amazing...

...tools I've developed are a Python-based automated data profiling tool and an automated data quality tool.

Work Experience

Data Engineer Consultant

2019 - 2021
Slalom
  • Served as the lead data engineer for complex ETL processes at a Fortune 100 telecommunications company. Regularly processed billions of rows of data in a Snowflake ecosystem with thousands of tables. One dataset improved data accuracy by more than 25%.
  • Created an exploratory data analysis (EDA) process used by a team of 10 data engineers. The process included automated data profiling using Python and SQL and guidance for ad hoc analysis and data visualization using Jupyter notebooks.
  • Led and managed a workflow migration from Apache Airflow to Jenkins.
  • Created a centralized dashboard to monitor ETL workflows. This required pulling job data from Jenkins, writing it to Snowflake, and visualizing it in Sigma Computing.
  • Developed an automated Snowflake table cleanup script that drops old temp tables nightly, saving our client thousands of dollars per month.
  • Mentored five junior data engineers in Python, SQL, Snowflake, Jenkins, Bash, Git, and data engineering techniques and processes.
  • Prototyped a data lineage solution using SQL parsing and a Neo4j graph database.
Technologies: Python, SQL, Snowflake, Jenkins, Apache Airflow, Tableau, Bash, Amazon Web Services (AWS), Data Science, Data Engineering, ETL, Neo4j
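
The nightly temp-table cleanup mentioned above could be sketched roughly as follows. This is a minimal illustration, not the script used at the client: the `TMP_` name prefix, the seven-day retention window, and the `TableInfo` record are all assumptions made for the example. In practice the table metadata would come from the warehouse catalog, and a scheduler would run the generated statements.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class TableInfo:
    """Hypothetical metadata record for one table (not a Snowflake API type)."""
    schema: str
    name: str
    created: datetime


def build_drop_statements(tables, now, prefix="TMP_", max_age_days=7):
    """Return DROP statements for temp tables older than the retention window.

    The prefix and the retention window are assumed conventions.
    """
    cutoff = now - timedelta(days=max_age_days)
    return [
        f'DROP TABLE IF EXISTS "{t.schema}"."{t.name}";'
        for t in tables
        if t.name.upper().startswith(prefix) and t.created < cutoff
    ]


if __name__ == "__main__":
    # Made-up catalog entries, used only to exercise the function.
    now = datetime(2021, 6, 1)
    catalog = [
        TableInfo("STAGING", "TMP_ORDERS_LOAD", datetime(2021, 5, 1)),
        TableInfo("STAGING", "TMP_RECENT", datetime(2021, 5, 30)),
        TableInfo("CORE", "CUSTOMERS", datetime(2020, 1, 1)),
    ]
    for statement in build_drop_statements(catalog, now):
        print(statement)
```

Generating the statements separately from executing them keeps the selection logic easy to review and test before anything is dropped.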

Analytics and Research Specialist

2017 - 2019
Mercy For Animals
  • Analyzed year-end donation data, uncovering distinct email clusters and revenue trends.
  • Created impact estimation methodologies for multiple programs.
  • Developed a data pipeline prototype using Google Cloud Platform.
  • Presented and led data and research workshops at four staff retreats.
Technologies: SQL, Python, R, Optimizely, Tableau, Data Science, Data Pipelines, Data Engineering

Python Developer and Tableau Consultant

2014 - 2016
Boulder Insight
  • Helped clients across a variety of industries with ETL, data visualization, and data dashboards.
  • Served as the lead developer on a client-facing Python Flask web application to automate the distribution of Tableau dashboards.
  • Constructed Python ETL pipelines to pull data from web APIs, process it, then save it to a MySQL database.
  • Presented at Boulder Startup Week, Analyze Boulder, Boulder Tableau User Group, and Boulder Python.
Technologies: Tableau, Python, SQL, MySQL, Flask, Data Science, ETL, Data Pipelines, Data Engineering, Statistics

Data Infrastructure Setup | Airbyte, dbt, and BigQuery

A nonprofit organization needed to set up a data and analytics infrastructure that ingested data from multiple sources (SendGrid, Bubble.io, and more). I transformed the data and published data marts in BigQuery that are ready for analysis in Looker Studio.

I used Airbyte for the extract/load part of this process and dbt for the data transformations. Both raw and transformed data landed in BigQuery. Data visualizations and reports were built in Looker Studio.

Because of this new, automated analytics infrastructure, the team can now see metrics and data instantaneously rather than through the laborious manual approach they previously had to use.

Production Dataset for Network Customer Service Applications

A dataset that used network device configuration files to determine the correct services that should be available to a customer. The project involved combining data from five different systems—several of which had significant data quality issues—into a single consumable dataset with 99%+ data quality accuracy.

As the sole developer on the project, I worked closely with end users to design the desired data schema, create data logic to meet business needs, and validate the accuracy of the dataset. As the requirements continually evolved, I brought together disparate teams at each step to reach a common understanding of the required data and data logic. In the end, we increased the data accuracy by 25%+ in a customer-facing production environment.

Automated Data Profiling Script for SQL Databases

A process I created to automatically determine the characteristics of a new dataset in a Snowflake database, including nulls, distinct values, correlation with other columns, trends over time, data types, and intersections with other datasets. This allowed data engineers to do in minutes what used to take days or weeks to accomplish. After the initial data profiling, data engineers could conduct additional ad hoc analysis as needed using Python, SQL, and Jupyter Notebooks. They used this analysis to answer questions that arose from the automated data profiling process.
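
Two of the profiling checks listed above, null counts and distinct-value counts per column, can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the original tool: it uses Python's built-in `sqlite3` module as a stand-in for Snowflake, and the `devices` table and the `profile_table` function are hypothetical names.

```python
import sqlite3


def profile_table(conn, table):
    """Return row, null, and distinct counts for every column of one table.

    The table name is interpolated into SQL, so it must come from a trusted source.
    """
    cur = conn.cursor()
    # PRAGMA table_info is SQLite-specific; a warehouse would expose column
    # names through its own catalog instead.
    columns = [row[1] for row in cur.execute(f'PRAGMA table_info("{table}")')]
    total = cur.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    profile = {}
    for col in columns:
        nulls, distinct = cur.execute(
            f'SELECT SUM(CASE WHEN "{col}" IS NULL THEN 1 ELSE 0 END), '
            f'COUNT(DISTINCT "{col}") FROM "{table}"'
        ).fetchone()
        # SUM returns NULL on an empty table, so fall back to zero.
        profile[col] = {"rows": total, "nulls": nulls or 0, "distinct": distinct}
    return profile


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE devices (id INTEGER, region TEXT)")
    conn.executemany(
        "INSERT INTO devices VALUES (?, ?)",
        [(1, "east"), (2, None), (3, "east")],
    )
    for column, stats in profile_table(conn, "devices").items():
        print(column, stats)
```

The other checks mentioned above (correlations, trends over time, intersections with other datasets) would follow the same pattern of generating one query per column or column pair.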

Evolution of Machine Learning | Analysis and Website

https://github.com/stevenrouk/evolution-of-machine-learning
An NLP data science project using arXiv.org research paper metadata to study the evolution of machine learning over the last 20 years. I pulled metadata for 1.6 million research papers, filtered the dataset down to 50,000 machine learning-focused research papers, and then used topic modeling to study shifts in the topics discussed over time. I used the Python scikit-learn package for topic modeling, Pandas for storing and manipulating data, Seaborn and Matplotlib for plotting, and custom text visualizations to demonstrate the topic modeling algorithm.

Finding Patterns in Social Networks Using Graph Data

https://github.com/stevenrouk/social-network-graph-analysis
I worked with the Stanford Social Network: Reddit Hyperlink Network dataset made available through the Stanford Network Analysis Platform (SNAP). This dataset catalogs hyperlinks between subreddits over 2.5 years, from January 2014 through April 2017.

Key Activities
• Created a graph of the data using the NetworkX Python graph library.
• Experimented with creating my own graph data objects to load and traverse the graph.
• Analyzed connections (in-degree and out-degree), distinct networks (component analysis), sharing reciprocity, centrality, and PageRank.
• Experimented with ways to visualize large graph structures by randomly sampling neighbor nodes.
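
A hand-rolled graph object of the kind mentioned in the activities above might look like the following sketch. It is not the project's code: the `DirectedGraph` class, its method names, and the sample subreddit links are hypothetical, and it covers only degree counts, sharing reciprocity, and weakly connected components.

```python
from collections import deque


class DirectedGraph:
    """Minimal directed graph stored as adjacency sets (illustrative only)."""

    def __init__(self):
        self.successors = {}
        self.predecessors = {}

    def add_edge(self, source, target):
        # Register both endpoints in both maps so every node is always present.
        for node in (source, target):
            self.successors.setdefault(node, set())
            self.predecessors.setdefault(node, set())
        self.successors[source].add(target)
        self.predecessors[target].add(source)

    def out_degree(self, node):
        return len(self.successors.get(node, ()))

    def in_degree(self, node):
        return len(self.predecessors.get(node, ()))

    def reciprocity(self):
        """Fraction of edges whose reverse edge also exists."""
        edges = [(u, v) for u, targets in self.successors.items() for v in targets]
        if not edges:
            return 0.0
        mutual = sum(1 for u, v in edges if u in self.successors[v])
        return mutual / len(edges)

    def weak_components(self):
        """Group nodes into components, ignoring edge direction (BFS)."""
        seen, components = set(), []
        for start in self.successors:
            if start in seen:
                continue
            component, queue = set(), deque([start])
            seen.add(start)
            while queue:
                node = queue.popleft()
                component.add(node)
                for neighbor in self.successors[node] | self.predecessors[node]:
                    if neighbor not in seen:
                        seen.add(neighbor)
                        queue.append(neighbor)
            components.append(component)
        return components


if __name__ == "__main__":
    graph = DirectedGraph()
    # Made-up links between subreddits, used only for demonstration.
    for src, dst in [("python", "datascience"), ("datascience", "python"),
                     ("python", "learnpython"), ("aww", "cats")]:
        graph.add_edge(src, dst)
    print(graph.in_degree("python"), graph.out_degree("python"))
    print(graph.reciprocity())
    print(graph.weak_components())
```

Keeping both successor and predecessor sets doubles the memory used but makes in-degree lookups and undirected traversal cheap.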

Education

2008 - 2012

Bachelor's Degree in Mathematics

Arizona State University - Tempe, AZ, USA

Certifications

JULY 2020 - PRESENT

Neo4j Certified Professional

Neo4j

FEBRUARY 2020 - FEBRUARY 2023

AWS Certified Machine Learning – Specialty

AWS

JANUARY 2020 - JANUARY 2023

AWS Certified Solutions Architect – Associate

AWS

Libraries/APIs

Pandas, Scikit-learn, Matplotlib, NetworkX, NumPy

Tools

Jenkins, Git, Tableau, BigQuery, Amazon SageMaker, Apache Airflow, Optimizely, Seaborn

Languages

Python, SQL, Snowflake, Bash, Cypher, R

Platforms

Jupyter Notebook, Visual Studio Code (VS Code), Google Cloud Platform (GCP), Amazon Web Services (AWS), Amazon EC2, Airbyte

Paradigms

Data Science, ETL

Frameworks

Flask

Storage

Data Pipelines, Neo4j, Graph Databases, MySQL

Other

Data Engineering, Mathematics, Programming, Machine Learning, APIs, Data Modeling, Amazon RDS, Amazon Machine Learning, Natural Language Processing (NLP), Graph Theory, Social Network Analysis, Statistics, Generative Pre-trained Transformers (GPT), Data Build Tool (dbt), Fivetran
