Jeffrey Halley
Verified Expert in Engineering
Data Engineering Developer
Jeff is a data engineer, a software engineer, and a former geneticist and educator. He develops innovative uses for existing data, discovers and implements efficiencies, and helps others excel in their projects. As a scientist, he learned how to understand and explore complex problems. As an educator, he mastered the art of clearly communicating advanced topics. Jeff brings this rare combination of skills and experience to every data and software development project he takes on.
Preferred Environment
Snowflake, NoSQL, SQL, Pandas, Spark, Python
The most amazing...
...thing I've developed is a tool to extract participation information from online meeting platform logs.
Work Experience
AI ETL Solutions Engineer
Anthem - AI
- Improved ETL pipelines’ speed by more than 2,000% (from 27 hours to 1.2 hours) by rewriting Hive queries in Spark SQL and optimizing Spark SQL queries and configurations.
- Ensured code quality by building an automated CI/CD pipeline to run regression and unit tests of ETL and ML pipelines using GitLab and by implementing a Gitflow-style branching strategy for our repositories.
- Guaranteed reproducibility for ML pipelines by creating a Python package to create and run Docker images used in production.
- Met customer needs by developing an efficient and user-friendly API and Tableau dashboards to deliver our team's ML results.
Data Engineer
Aura
- Provided data scientists and business analysts with reliable access to loan application and payment data by building an Airflow-orchestrated data pipeline between an RDS transactional database and a Snowflake data warehouse.
- Ensured reliable service by writing automated tests for data pipelines using Pytest and Tox.
- Increased team productivity by expanding documentation and writing Bash shell scripts to automate the installation of required tools and packages.
Data Engineer
Insight Data Science
- Helped Google Ads users find the most cost-effective options for their Google Ads (AdWords) purchases.
- Created an application that identifies new trending words within social media communities devoted to a specific topic.
- Provided a fast and resilient pipeline that ingests data from social media sites, processes the data with Spark to find trending topic-specific words, and stores the processed data in a PostgreSQL database that is updated via an Airflow DAG.
- Built an easy-to-use and informative Dash-based UI that delivers results from a database by converting user input into SQL queries to generate a list of possible words for Google Ads and informative plots about the words’ usage on Reddit.
Instructor and Technology Committee Member
Stanford University
- Enabled online teachers to quantitatively track their students’ participation and use of class time.
- Developed a Python application that extracts student participation data from XML-log files and generates easily understandable reports and charts using Bokeh.
- Saved teachers approximately five hours per week by finding, testing, evaluating, and making recommendations about new software for learning management, grade book, video recording, and video playback.
- Increased new technology adoption rate by approximately 30% by giving talks, hosting workshops, and writing user guides for instructors and staff.
Experience
WordEdge (Social Media NLP ETL Pipeline)
https://github.com/jehalley/identifying_topic_specific_trending_words
WordEdge helped users identify the newest trending words in a topic related to their business before those words got cool and before they got so expensive. For example, if you were a basketball-shoe seller, WordEdge would help you discover basketball fans' inside jokes, player nicknames, and names of hot new rookies, all of which enable effective and affordable search term purchases.
Adobe Connect Participation Extractor (XML ETL)
https://github.com/jehalley/Quantify_Participation_From_Adobe_Connect_Recordings
Predictive Text in R
https://github.com/jehalley/word_suggestor
Skills
Languages
SQL, Python 3, R, Snowflake, Python, T-SQL (Transact-SQL)
Frameworks
Spark
Libraries/APIs
Pandas, PySpark
Storage
Database Modeling, PostgreSQL, NoSQL, Apache Hive
Other
Data Engineering, Natural Language Processing (NLP), Generative Pre-trained Transformers (GPT), Data Analytics, Data Modeling, Data Profiling
Tools
Apache Airflow, Pytest, Plotly, GitLab CI/CD
Paradigms
ETL
Platforms
Amazon Web Services (AWS), Docker
Education
Ph.D. in Molecular and Cellular Biology
University of California, Berkeley - Berkeley, CA, USA