Data Engineer
2019 - PRESENTStarry Internet- Designed and implemented a large-scale data-processing platform utilizing Spark and AWS EMR. This included building a pipeline to process and aggregate IoT data coming from tens of thousands of devices every minute.
- Revamped the team's code deployment process from manual build and upload to an CI/CD pipeline running in an AWS code build—which the reduced engineering effort per deployment from ten minutes to under one minute.
- Led the design, implementation, testing, and migration to a highly scalable Airflow environment. The environment serves as the core platform for all ETL running over 800 containerized tasks every hour.
- Drove an increased sense of accountability and reduced error-response time by implementing an incident response framework and ticketing system for the broader data engineering team.
- Rearchitected the use of Snowflake for large scale data processing. Shifted workloads from Snowflake to PySpark running on EMR to result in a 30% cost savings and 50% pipeline run time reduction.
- Developed data quality tool to run over 1200 checks per hour against data warehouse to ensure that data met expectations. This resulted in the shift from reactive error handling to proactive monitoring and incident management.
Technologies: Salesforce, Stripe, NetSuite, Amazon Web Services (AWS), PostGIS, PostgreSQL, Apache Airflow, Docker, Geospatial Data, Spark, Scala, Snowflake, SQL, PythonData Engineer
2018 - 2019Drift- Managed and maintained all aspects of ETL, data warehousing, and analytics tools and infrastructure and was responsible for the ingestion of new data sources, data quality, and availability (was also the data team's hire #1).
- Stood up the Airflow back end using ECS, Fargate, RDS, and Redis to serve as the core ETL tool for all data processing and pipelines.
- Led the migration from Redshift to Snowflake involving 17 separate streaming data sources, 1,000+ tables, and over 20 different teams reliant on the warehouse. Migration resulted in a zero increase in cost and a 75% decrease in query time.
- Developed a reliable Spark pipeline to process 100GB+ of data daily and produce clean, manageable aggregations of end-user interaction data.
- Built Success Factor Score: a statistical model that determines the health of a customer based on usage, interaction, and engagement data. This score serves as a key business metric that customer success managers are evaluated on.
- Evaluated, implemented, and trained a team on Looker, a powerful data definition management and BI tool that enables nontechnical users to access and analyze data.
Technologies: Amazon Web Services (AWS), Redshift, Docker, Apache Airflow, Snowflake, SQL, PythonData Science Engineer
2017 - 2018Liberty Mutual- Developed an infrastructure to process and understand the impact of aircraft noise on the livability of a particular location (10+ billion records).
- Produced a prototype to enable executives to quickly ingest and understand 1,000+ comments from monthly employee opinion surveys. Developed a front-end web app to allow for access with ease.
- Architected and built a data pipeline to enable the automatic summarization of customer service calls.
Technologies: Amazon Web Services (AWS), Geospatial Data, Dask, PostGIS, EMR, Redshift, SQL, PythonAnalytics Associate
2016 - 2017Liberty Mutual- Developed market-sizing models from numerous different sources to estimate the potential business value of various new product concepts.
- Assessed an opportunity and developed a model to intelligently select which no-fault claims should be sent to litigation. The model is projected to increase recovery dollars by $700,000.
- Gathered use cases from leaders across the organization for a cloud-based infrastructure and prioritized use cases to ultimately create a cloud transition strategy.
Technologies: R, SQL