
Jeffrey Halley

Verified Expert in Engineering

Data Engineering Developer

Location

San Francisco, CA, United States

Toptal Member Since

June 24, 2020

Jeff is a data engineer, a software engineer, and a former geneticist and educator. He develops innovative uses for existing data, discovers and implements efficiencies, and helps others excel in their projects. As a scientist, he learned how to understand and explore complex problems. As an educator, he mastered the art of clearly communicating advanced topics. Jeff brings this rare combination of skills and experience to every data and software development project he takes on.

SQL, Natural Language Processing (NLP), Data Engineering, Python 3, Pandas, R, PostgreSQL, Spark, Snowflake, NoSQL, Data Analytics, Data Profiling, Python, Apache Airflow, Pytest

Portfolio

Anthem - AI

Python, SQL, ETL, T-SQL (Transact-SQL), PySpark, Data Analytics, Data Modeling...

Aura

Pytest, PostgreSQL, Apache Airflow, Snowflake, Python

Insight Data Science

Amazon Web Services (AWS), Plotly, Python, PostgreSQL, Spark

Experience

SQL - 3 years, Python 3 - 2 years, Pandas - 2 years, Spark - 1 year, Snowflake - 1 year, NoSQL - 1 year, Database Modeling - 1 year, Data Engineering - 1 year

Availability

Part-time

Preferred Environment

Snowflake, NoSQL, SQL, Pandas, Spark, Python

The most amazing...

...thing I've developed is a tool to extract participation information from online meeting platform logs.

Work Experience

AI ETL Solutions Engineer

2020 - 2021

Anthem - AI

Improved ETL pipelines’ speed by more than 2,000% (from 27 hours to 1.2 hours) by rewriting Hive queries in Spark SQL and optimizing Spark SQL queries and configurations.
Ensured code quality by building an automated CI/CD pipeline to run regression and unit tests of ETL and ML pipelines using GitLab and by implementing a Gitflow-style branching strategy for our repositories.
Guaranteed reproducibility for ML pipelines by creating a Python package to create and run Docker images used in production.
Met customer needs by developing an efficient and user-friendly API and Tableau dashboards to deliver our team's ML results.

Technologies: Python, SQL, ETL, T-SQL (Transact-SQL), PySpark, Data Analytics, Data Modeling, Data Profiling, Spark, Docker, GitLab CI/CD, Apache Hive

Data Engineer

2020 - 2020

Aura

Provided data scientists and business analysts with reliable access to loan application and payment data by building an Airflow-orchestrated data pipeline between an RDS transactional database and a Snowflake data warehouse.
Ensured reliable service by writing automated tests for data pipelines using Pytest and Tox.
Increased team productivity by expanding documentation and writing Bash shell scripts to automate the installation of required tools and packages.

Technologies: Pytest, PostgreSQL, Apache Airflow, Snowflake, Python

Data Engineer

2019 - 2020

Insight Data Science

Helped Google Ads users find the most cost-effective options for their Google Ads (AdWords) purchases.
Created an application that identifies new trending words within social media communities devoted to a specific topic.
Provided a fast and resilient pipeline that ingests data from social media sites, processes the data with Spark to find trending topic-specific words, and stores the processed data in a PostgreSQL database that is updated by an Airflow DAG.
Built an easy-to-use and informative Dash-based UI that delivers results from a database by converting user input into SQL queries to generate a list of possible words for Google Ads and informative plots about the words’ usage on Reddit.

Technologies: Amazon Web Services (AWS), Plotly, Python, PostgreSQL, Spark

Instructor and Technology Committee Member

2010 - 2019

Stanford University

Enabled online teachers to quantitatively track their students’ participation and use of class time.
Developed a Python application that extracts student participation data from XML-log files and generates easily understandable reports and charts using Bokeh.
Saved teachers approximately five hours per week by finding, testing, evaluating, and making recommendations about new software for learning management, grade book, video recording, and video playback.
Increased new technology adoption rate by approximately 30% by giving talks, hosting workshops, and writing user guides for instructors and staff.

Technologies: Python

Experience

WordEdge (Social Media NLP ETL Pipeline)

https://github.com/jehalley/identifying_topic_specific_trending_words

I developed WordEdge to help users get the best deals on their Google Ads purchases. Businesses that advertise on Google Ads purchase search terms through an auction. Suppose you wanted to advertise "Basketball Shoes." You might want to purchase the search term "Basketball Shoes," but because of its popularity, it's likely too expensive to be cost-effective.

WordEdge helps users identify the newest trending words in a topic related to their business before those words become popular and expensive. If you were the basketball-shoe seller described above, WordEdge would help you discover basketball fans' inside jokes, player nicknames, and names of hot new rookies, all of which can make for effective and affordable search term purchases.

Adobe Connect Participation Extractor (XML ETL)

https://github.com/jehalley/Quantify_Participation_From_Adobe_Connect_Recordings

This Python script extracts participation information from the .XML files that are included with downloaded recordings of Adobe Connect sessions. For each participant, the script determines time on camera, time with camera paused, time on microphone, the number of chat messages sent, and a summary participation grade. The script generates a report on all of these features and some related calculations in a summary participation report .csv file. Additionally, the script generates a series of bar plots showing each of the participation features and saves them as a .html file.
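The aggregation step described above can be sketched roughly as follows. This is a minimal illustration, not the script from the repository: the XML parsing is left out because the Adobe Connect file layout is not described here, so the input is assumed to be a list of already-parsed `(participant, feature, start_seconds, end_seconds)` tuples plus a list of chat senders. The function names, column names, and input shapes are all hypothetical, and the summary participation grade is omitted because its formula is not given.

```python
import csv
import io
from collections import defaultdict


def summarize(intervals, chat_senders):
    """Total per-participant camera/microphone time and chat message counts.

    intervals: iterable of (participant, feature, start_s, end_s), where
               feature is "camera" or "microphone" (hypothetical input shape).
    chat_senders: iterable of participant names, one entry per message sent.
    """
    totals = defaultdict(
        lambda: {"camera_s": 0.0, "microphone_s": 0.0, "chat_messages": 0}
    )
    for name, feature, start, end in intervals:
        totals[name][feature + "_s"] += end - start
    for name in chat_senders:
        totals[name]["chat_messages"] += 1
    return dict(totals)


def to_csv(summary):
    """Render the summary as CSV text, one row per participant."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=["participant", "camera_s", "microphone_s", "chat_messages"],
    )
    writer.writeheader()
    for name in sorted(summary):
        writer.writerow({"participant": name, **summary[name]})
    return buf.getvalue()
```

Calling `to_csv(summarize(intervals, chat_senders))` would yield text that could be written to a report file; the bar plots mentioned above would be built from the same per-participant totals.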

Predictive Text in R

https://github.com/jehalley/word_suggestor

The Word Suggester app searches a database of commonly used phrases (wordlists) for a phrase that matches the text typed by a user. The app attempts to match all of the text input by the user (up to five words long), but if no matches are found, the app progressively removes words from the beginning of the phrase until a match is found. Once a match is found, the three words that most commonly follow the matched phrase are suggested to the user, ranked in order from most common to least common. If fewer than three words commonly follow a particular phrase, the app will suggest the words that follow that phrase and then trim words from the beginning of the phrase until a total of three words can be suggested to the user.
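The back-off lookup described above can be sketched in a few lines. The original app is written in R; the Python version below is only an illustration of the same idea under stated assumptions, not the app's code. The in-memory table built by `build_table`, the function names, and the whitespace tokenization are all hypothetical stand-ins for the app's wordlists.

```python
from collections import Counter, defaultdict

MAX_CONTEXT = 5      # longest phrase (in words) that is matched
NUM_SUGGESTIONS = 3  # number of words offered to the user


def build_table(sentences, max_context=MAX_CONTEXT):
    """Count which word follows each phrase of 1..max_context words."""
    table = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(1, len(words)):
            for n in range(1, max_context + 1):
                if i - n < 0:
                    break
                table[tuple(words[i - n:i])][words[i]] += 1
    return table


def suggest(text, table, k=NUM_SUGGESTIONS, max_context=MAX_CONTEXT):
    """Suggest up to k next words, backing off to shorter phrases."""
    context = text.lower().split()[-max_context:]
    suggestions = []
    while context and len(suggestions) < k:
        followers = table.get(tuple(context))
        if followers:
            # most_common() yields followers from most to least frequent.
            for word, _count in followers.most_common():
                if word not in suggestions:
                    suggestions.append(word)
                    if len(suggestions) == k:
                        break
        # Drop the first word of the phrase and try the shorter phrase.
        context = context[1:]
    return suggestions
```

With a table built from a handful of sentences, `suggest("they want to", table)` first looks up the full three-word phrase, then "want to", then "to", collecting followers until three distinct words are found or the phrase is exhausted.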

Skills

Languages

SQL, Python 3, R, Snowflake, Python, T-SQL (Transact-SQL)

Frameworks

Spark

Libraries/APIs

Pandas, PySpark

Storage

Database Modeling, PostgreSQL, NoSQL, Apache Hive

Other

Data Engineering, Natural Language Processing (NLP), Generative Pre-trained Transformers (GPT), Data Analytics, Data Modeling, Data Profiling

Tools

Apache Airflow, Pytest, Plotly, GitLab CI/CD

Paradigms

ETL

Platforms

Amazon Web Services (AWS), Docker

Education

2003 - 2009

Ph.D. in Molecular and Cellular Biology

University of California, Berkeley - Berkeley, CA, USA
