Jeff Carter, Ph.D.

Data Engineer and Developer in Temecula, CA, United States

Member since March 31, 2018
Jeff is a full-stack data professional, well-versed in both data science and data engineering. He has a passion for building predictive data models, data flow processes and custom infrastructures. With over 15 years in the data arena, his experience spans statistical modeling and data visualization to building out real-time data-streaming infrastructures.

Experience

  • Statistics 10 years
  • Python 8 years
  • Spark 4 years
  • Apache Kafka 4 years
  • NoSQL 4 years
  • SQL 4 years
  • StreamSets 2 years
  • Kudu 2 years

Location

Temecula, CA, United States

Availability

Part-time

Preferred Environment

Apache Kafka, Apache Hive, IntelliJ, Sublime Text, Git, Tableau, CouchDB, ZeroMQ, RabbitMQ, Kinetica, Kudu, Spark, StreamSets, Oracle, PostgreSQL, Microsoft SQL Server, Java, Python, Linux

The most amazing...

...thing that I've built is a real-time streaming infrastructure with more than seven data sources, moving 10+ million records daily into multiple destinations.

Employment

  • Data Engineer | Data Scientist

    2016 - 2020
    Pechanga Resort & Casino
    • Developed real-time streaming data pipelines processing 10 million records daily.
    • Designed and built a data warehouse in Kinetica that tracks all dimensions as SCD Type II (slowly changing dimensions), holding 3 TB of data from previously isolated sources.
    • Wrote custom MCMC algorithms to calculate ROI on marketing events in a high-dimensional space, generating over a million dollars of additional annual revenue.
    • Built a custom ETL process that scans millions of daily records for potential money laundering.
    • Performed advanced customer segmentation of 3+ million individuals, combining custom behavioral metrics, traditional RFM (recency, frequency, monetary) metrics, and geolocation data.
    Technologies: Cloudera, Tableau, Apache Kafka, Kinetica, Kudu, Spark, IBM DB2, Oracle, Microsoft SQL Server, StreamSets, Java, Python
  • Data Scientist

    2013 - 2016
    Picarro
    • Redesigned a configurable and modular real-time data pipeline framework to process data from several IoT sensors in a unified manner.
    • Developed machine learning algorithms to predict the ROI of making additional measurements of the Surveyor product, using Bayesian statistics.
    • Conducted sensitivity analysis of critical model parameters of a highly non-linear, multi-dimensional algorithm.
    • Built a complete software package that collects real-time streaming data from IoT sensors, visualizes multiple time series, conducts on-the-fly statistical calculations, and allows the user to control and interact with hardware firmware.
    Technologies: Amazon Web Services (AWS), AWS EC2, Spark, Logstash, Elasticsearch, RabbitMQ, ZeroMQ, Microsoft SQL Server, Python
  • Postdoctoral Researcher

    2011 - 2013
    Lawrence Livermore National Laboratory
    • Performed nonlinear regression modeling of multi-dimensional experimental data with custom models.
    • Built a framework to enable physics-based computer simulations of state-of-the-art experiments to better understand experimental results and sources of potential errors.
    • Published experimental data and modeling results in peer-reviewed scientific journals.
    Technologies: Python
  • Research Assistant

    2005 - 2011
    University of Illinois
    • Automated real-time data collection and on-the-fly regression modeling from multiple sensors.
    • Developed a framework to simulate quantum dynamics resulting from external perturbations.
    • Published experimental results and data models in peer-reviewed scientific journals.
    Technologies: Python, Data Analysis

Experience

  • Real-time Data into a Data Lake and Data Warehouse (Development)

    Real-time data flows from MS SQL CDC tables, Oracle LogMiner, and various message queues into a data lake that mirrors the production data; from there it is extracted, transformed, and loaded (ETL) into the data warehouse (DWH). The pipelines for this project were implemented primarily in StreamSets, with the caveat that the open-source version of StreamSets did not provide all of the functionality required, so additional pipeline stages were written in Java and integrated into the StreamSets framework.
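
    To illustrate only the lake-mirroring idea (the actual pipelines were built in StreamSets with custom Java stages, not in Python), a minimal sketch of applying one CDC-style change event to a mirror table might look like the following; the event shape, table, and columns are hypothetical, and an in-memory SQLite database stands in for the real lake target.

        import sqlite3

        # Hypothetical CDC-style change event, shaped loosely like the records
        # the pipelines move from the CDC/LogMiner feeds into the mirror tables.
        event = {
            "table": "players",            # source table name (hypothetical)
            "op": "UPDATE",                # INSERT | UPDATE | DELETE
            "key": {"player_id": 42},
            "data": {"player_id": 42, "tier": "gold"},
        }

        # In-memory SQLite stands in for the lake target used in production.
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE players (player_id INTEGER PRIMARY KEY, tier TEXT)")

        def apply_change(conn, ev):
            """Mirror a single change event into the lake table of the same name."""
            table = ev["table"]
            if ev["op"] == "DELETE":
                where = " AND ".join(f"{col} = ?" for col in ev["key"])
                conn.execute(f"DELETE FROM {table} WHERE {where}", list(ev["key"].values()))
            else:
                # INSERT and UPDATE both become an upsert on the mirror table.
                cols = ", ".join(ev["data"])
                marks = ", ".join("?" for _ in ev["data"])
                conn.execute(
                    f"INSERT OR REPLACE INTO {table} ({cols}) VALUES ({marks})",
                    list(ev["data"].values()),
                )

        apply_change(conn, event)
        print(conn.execute("SELECT * FROM players").fetchall())  # [(42, 'gold')]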

    Custom Python code enabled the automated build-out of the entire data lake schema by querying each source database and generating the appropriate tables, including the mapping of data types. This infrastructure-as-code approach enables rapid prototyping and rebuilding from scratch with minimal effort whenever needed.
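
    A minimal sketch of that schema-generation idea is shown below; the type map, table name, and column metadata are illustrative assumptions rather than the production code, which read its metadata from the source catalogs (e.g., information_schema or ALL_TAB_COLUMNS).

        # Hypothetical mapping from source column types to lake column types.
        TYPE_MAP = {"NUMBER": "BIGINT", "VARCHAR2": "STRING", "DATE": "TIMESTAMP"}

        def lake_ddl(table, columns):
            """Build a CREATE TABLE statement for the lake from source metadata.

            `columns` is a list of (name, source_type) tuples, e.g. the result
            of querying the source database's catalog views.
            """
            cols = ",\n  ".join(
                f"{name} {TYPE_MAP.get(src_type, 'STRING')}" for name, src_type in columns
            )
            return f"CREATE TABLE IF NOT EXISTS lake.{table} (\n  {cols}\n)"

        # Metadata as it might come back from a source catalog query:
        print(lake_ddl("players", [("player_id", "NUMBER"),
                                   ("name", "VARCHAR2"),
                                   ("created_at", "DATE")]))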

    The surrogate keys for the DWH are generated from a unique combination of primary keys and source-database log IDs. These meaningful surrogate keys provide not only a way to track changes in mutable data but also intrinsic, built-in data lineage.
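
    As a rough sketch of that keying scheme (the exact delimiter, field order, and source labels below are assumptions, since they are not spelled out above):

        def surrogate_key(source, primary_key, log_id):
            """Compose a warehouse surrogate key from the natural key plus the
            source database's log ID, so every version of a mutable row gets
            its own key and the key itself records where the row came from."""
            pk_part = "-".join(str(part) for part in primary_key)
            return f"{source}:{pk_part}:{log_id}"

        # Two versions of the same source row yield distinct, traceable DWH keys:
        print(surrogate_key("oracle_prod", (42,), 118233))  # oracle_prod:42:118233
        print(surrogate_key("oracle_prod", (42,), 119001))  # oracle_prod:42:119001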

Skills

  • Languages

    Python, Java, SQL
  • Libraries/APIs

    Pandas, ZeroMQ
  • Paradigms

    Functional Programming, ETL, Object-oriented Programming (OOP)
  • Other

    Data Processing, Bayesian Inference & Modeling, StreamSets, Data Visualization, Statistics, Machine Learning, Streaming Data, AWS, Data Analysis
  • Frameworks

    Spark
  • Tools

    Kudu, Kinetica, RabbitMQ, Git, Sublime Text, IntelliJ, Cloudera, Logstash, Tableau
  • Platforms

    Linux, Apache Kafka, AWS EC2, Amazon Web Services (AWS), Oracle
  • Storage

    NoSQL, Apache Hive, PostgreSQL, Microsoft SQL Server, CouchDB, IBM DB2, Elasticsearch

Education

  • Ph.D. in Chemical Physics
    2005 - 2011
    University of Illinois at Urbana-Champaign - Champaign, IL, USA
  • Bachelor of Science degree in Chemistry
    2001 - 2005
    Virginia Tech - Blacksburg, VA, USA
