
David Grayson

Verified Expert in Engineering

Data Scientist and Machine Learning Developer

Location
Oakland, CA, United States
Toptal Member Since
January 1, 2021

David is an experienced data and ML scientist with a PhD and demonstrated success at both large and small companies. He has published 12+ papers on computational neuroimaging and designed and built real-time ML apps for QuickBooks, improving the product experience for over a million users. At a biotech startup, he led multiple initiatives to predict neurological disease using novel computer vision, analytics, and ML methods. David is passionate about helping clients leverage data and AI to maximize their impact.

Portfolio

Logic20/20
Time Series Analysis, Azure DevOps, Data Science, Luigi, Python...
System1 Biosciences
SQL, Deep Learning, Signal Processing, Image Processing, Experimental Research...
Intuit, Inc.
A/B Testing, Git, Python, Pandas, Amazon Web Services (AWS), Docker...

Experience

Availability

Part-time

Preferred Environment

Linux, Slack, Python 3, macOS

The most amazing...

...ML product I've built is a recommender system for QuickBooks Online users needing help, based on real-time user activity and powered by deep learning.

Work Experience

Lead Data Scientist

2021 - 2021
Logic20/20
  • Led a new data science practice within San Diego Gas & Electric’s (SDG&E) asset management group.
  • Served as the lead data scientist and ML engineer for Pacific Gas & Electric’s (PG&E) AI-assisted inspection team.
  • Trained engineers, analysts, and data scientists in the full data science lifecycle, including project scoping, EDA, data pipelining, code testing, model training/validation, and deployment.
  • Built the client's first ML app at SDG&E, making daily predictions about failures on 200,000 devices in the distribution grid using CNNs, LSTMs, and self-supervised embeddings. Built their first continuous integration pipeline using Azure DevOps.
  • Trained and productionized deep computer vision models at scale to prioritize and assist PG&E’s inspection of millions of drone-captured images.
  • Enabled real-time, automated assistance in the inspection of more than 100,000 aerial images via four object detection and classification pipelines.
  • Restructured a database containing millions of AI-detected components. Reduced query execution time on the DB by more than 50x.
  • Replaced manual inspection form questions with AI predictions, reducing manual labor for tens of thousands of inspections. Demonstrated accuracy of over 90% across seven classes.
  • Trained and productionized new iterations of a component classification model, adding new classes and improving the precision of existing classes by 3% on average.
  • Deployed existing model pipelines to GPU, resulting in around 5x speed-up in response time and eliminating crashes on Kubernetes pods.
Technologies: Time Series Analysis, Azure DevOps, Data Science, Luigi, Python, Computer Vision, Convolutional Neural Networks (CNN), Machine Learning, Amazon Web Services (AWS), Presentations, Pandas, Scikit-learn, PyTorch, SQL, Deep Learning, Keras, Technical Project Management, Docker, Project Management, Git, Continuous Integration (CI), Image Processing, Python 3, TensorFlow, Artificial Intelligence (AI), Databases, Data Analysis, Object Detection, Neural Networks

Senior Machine Learning Scientist

2019 - 2020
System1 Biosciences
  • Led the video microscopy data pipeline team with biology, robotics, software, and data science members. Deployed a 12-step processing DAG in AWS on 500+ videos (over 10TB). Reduced the failure rate of QC-ed videos by 75% and increased frame rate 10x.
  • Built and productionized CNN-based image segmentation for automated quantification of tissue protein expression. Deployed in AWS on over 1,000 scanned images (more than 1PB).
  • Demonstrated effects of lab protocols on tissue quality, used for patents and investor demos.
  • Created an advanced analytics pipeline to measure and describe neuronal network activity. It was used to demonstrate the significant and distinct effects of three different neuromodulatory drugs and validate new lab protocols.
  • Built an analytics pipeline to assay hierarchical effects of experimental variables. Created novel, statistically rigorous methods for demonstrating disease effects.
  • Served as a technical lead for the neurodegenerative disease program. Planned and executed scientific roadmaps and company and investor presentations while coordinating experimental designs, data pipelines, ML, and analytics.
Technologies: SQL, Deep Learning, Signal Processing, Image Processing, Experimental Research, Experimental Design, Continuous Integration (CI), Docker, Git, Project Management, Data Visualization, Statistics, Presentations, Amazon Web Services (AWS), Machine Learning, Convolutional Neural Networks (CNN), Computer Vision, PyTorch, Scikit-learn, Pandas, NumPy, SciPy, Python, Data Science, Time Series Analysis, Keras, Technical Project Management, Computational Biology, Scientific Computing, Python 3, Artificial Intelligence (AI), Data Analysis, Neural Networks, Research

Senior Data Scientist—Machine Learning

2017 - 2019
Intuit, Inc.
  • Acted as a technical lead for QuickBooks Online's self-help recommendation algorithm, which required a multi-team collaboration. Expanded its use to all customer segments and submitted multiple patents for its back-end ML algorithms.
  • Trained, productionized, and A/B tested the first real-time deep learning models (RNN and LSTM) in QuickBooks. Boosted customer engagement by 55%, reduced customer support call rates by 10%, and reduced direct annual costs by at least $900,000.
  • Transformed data from millions of users and billions of clickstream events via distributed computing such as Spark to create embedded representations of online user activity and improve multiple existing ML services.
  • Trained interns and led exploratory machine learning and NLP research for customer success. Projects included an API service to anonymize customer chat data and a predictive customer support call intent model.
Technologies: A/B Testing, Git, Python, Pandas, Amazon Web Services (AWS), Docker, Technical Project Management, Keras, Deep Learning, Hadoop, PySpark, SQL, GPT, Natural Language Processing (NLP), Generative Pre-trained Transformers (GPT), SciPy, NumPy, Machine Learning, Data Science, Scikit-learn, Data Visualization, Python 3, TensorFlow, Artificial Intelligence (AI), Data Analysis, Neural Networks

Visiting Scientist

2015 - 2017
Oregon Health & Science University
  • Led two research projects on a six-member data team composed of graduate students, postdoctoral scientists, and research staff, resulting in three publications and multiple conference presentations.
  • Built multilinear regression models explaining more than 60% of the variance in the correlational structure of fMRI time-series data, using anatomical and gene expression data as features.
  • Trained students and research staff in structural and functional MRI, signal processing, and data analysis.
Technologies: Scientific Computing, Linux, Experimental Research, 3D Image Processing, Signal Processing, Experimental Design, Factor Analysis, Python, Data Visualization, Statistics, Computer Vision, Graph Theory, Machine Learning, Data Science, Data Analysis, R, Research

Graduate Student Researcher

2012 - 2017
UC Davis Center for Neuroscience
  • Developed data analysis strategies independently. Selected for a two-year Autism Speaks research fellowship award for my work.
  • Produced results that were instrumental in securing a federal grant worth over $1.5 million.
  • Published 12 peer-reviewed studies with over 700 citations, covering advanced statistical and computational techniques for processing multimodal brain MRI data and characterizing typical and atypical brain organization.
Technologies: Signal Processing, 3D Image Processing, Linux, Experimental Design, Experimental Research, Data Visualization, Statistics, Data Science, Git, Time Series Analysis, Data Analysis, R

Computer Vision for Remote Inspection of Aerial Imagery for Utilities

Led a computer vision team of three data scientists and two engineers at the utility company PG&E. We provided AI-powered assistance for the remote manual inspection of drone-captured images of the electrical grid, training and deploying object detection and classification models at scale: PyTorch models deployed with Seldon and Kubernetes in AWS, with model pipelines running on hundreds of thousands of images. We delivered multiple use cases, including 1) using model predictions to prioritize high-risk structures, 2) integrating model predictions into the inspection UI to reduce inspector error, and 3) replacing manual form questions with AI predictions to reduce manual labor. Challenges we overcame included 1) optimizing model performance, 2) measuring model performance on production data, 3) sharing code across multiple model deployments, 4) deploying models to GPU, and 5) efficiently storing and analyzing tens of millions of predictions.
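As a rough illustration of the batched inference step in a pipeline like this, the sketch below runs object detection with PyTorch and torchvision. The pretrained Faster R-CNN weights, score threshold, and file names are illustrative assumptions, not the production models or the Seldon/Kubernetes deployment described above.

```python
# Hypothetical sketch of batched detection over inspection images with PyTorch.
# The pretrained Faster R-CNN weights and file names are illustrative stand-ins,
# not the production models or the Seldon/Kubernetes deployment described above.
import torch
from torchvision.io import read_image
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms.functional import convert_image_dtype

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = fasterrcnn_resnet50_fpn(weights="DEFAULT").to(device).eval()

@torch.no_grad()
def detect(image_paths, score_threshold=0.5):
    """Run detection on a batch of image files and keep confident boxes."""
    batch = [convert_image_dtype(read_image(p), torch.float).to(device)
             for p in image_paths]
    results = []
    for output in model(batch):
        keep = output["scores"] >= score_threshold
        results.append({k: output[k][keep].cpu()
                        for k in ("boxes", "labels", "scores")})
    return results

# Example: detections = detect(["tower_001.jpg", "tower_002.jpg"])
```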

Predictive Maintenance for the Distribution Grid

As the lead data scientist, my goal was to establish a robust data science practice within the asset management group of San Diego Gas & Electric. This involved teaching by example how to manage the data science lifecycle, using a specific use case to scope, build, and validate the team’s first ML app. I implemented technical best practices such as modular coding, packaging, testing, workflow management, and CI/CD, as well as non-technical best practices such as Agile methodologies.

The specific use case was predicting which devices in the electrical grid are nearing failure. This involved joining many disparate data stores with information on more than 200,000 transformers, including metadata pertaining to GIS, customer outage, and service records; time series of weather variables; and time series of electrical loads. The end-to-end pipeline cleaned, filtered, and joined the data and trained custom artificial neural nets (CNNs, LSTMs, autoencoders) on the metadata and the weather and load time series. The pipeline ran as a Python app using Luigi to manage the workflow, with CI/CD configured in Azure DevOps. We demonstrated accuracy that outperformed existing baselines and uncovered previously unknown mechanistic insights.
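A minimal Luigi sketch of this kind of workflow follows. The task names, columns, and local file targets are hypothetical; the actual app joined far more data sources and trained CNN/LSTM/autoencoder models rather than the placeholder steps shown here.

```python
# Minimal Luigi workflow sketch with hypothetical task names, file paths, and
# local targets; the real pipeline joined GIS, outage, weather, and load data
# for 200,000+ devices and trained CNN/LSTM/autoencoder models.
import datetime
import luigi
import pandas as pd

class CleanLoadData(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget(f"data/clean/loads_{self.date}.csv")

    def run(self):
        raw = pd.read_csv("data/raw/loads.csv")              # placeholder source
        clean = raw.dropna(subset=["device_id", "load_kw"])  # basic cleaning
        with self.output().open("w") as f:
            clean.to_csv(f, index=False)

class TrainFailureModel(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        return CleanLoadData(date=self.date)

    def output(self):
        return luigi.LocalTarget(f"models/failure_model_{self.date}.txt")

    def run(self):
        df = pd.read_csv(self.input().path)
        # ...feature engineering and model training would go here...
        with self.output().open("w") as f:
            f.write(f"trained on {len(df)} rows")

if __name__ == "__main__":
    luigi.build([TrainFailureModel(date=datetime.date.today())],
                local_scheduler=True)
```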

Disease Classification from Neuronal Network Activity at System1 Biosciences

Served as technical lead and scrum master on a team analyzing videos of high-resolution neural tissue microscopy, with members from biology, robotics, software, and data science. The goal was to build, validate, and enable CI/CD for the pipeline, and use it to measure the effects of drug perturbations, lab protocols, and genetic modifications on the activity of artificial neural tissue.

The key challenge was representing extremely high spatiotemporal resolution data via low-dimensional, biologically interpretable metrics. We built a 12-module semi-automated pipeline (a DAG) of supervised and unsupervised CV methods, constrained by biological priors, to clean and standardize the data, with auto-triggered QC that integrated seamlessly with pre- and post-processing.

Deployed as a streaming app in AWS on over 10 TB of data, it reduced QC-ed videos' failure rate by 75% and enabled us to increase the temporal resolution 10x.

For analytics, I designed two novel ML-based methods to deconfound experimental variables. I employed these pipelines to achieve critical endpoints for investors: demonstrating distinct effects of three neuromodulatory drugs and significant accuracy in predicting disease.
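The novel deconfounding methods themselves are not described here. As a generic illustration of the underlying idea (regressing nuisance experimental variables out of an activity metric before testing for drug or disease effects), a hedged scikit-learn sketch follows; all column names are hypothetical.

```python
# Generic confound-regression sketch, for illustration only; this is not the
# proprietary methods described above. Nuisance variables such as batch or
# plate position are regressed out of an activity metric before comparisons.
import pandas as pd
from sklearn.linear_model import LinearRegression

def residualize(df, metric, confounds):
    """Return the metric with linear effects of the confound columns removed."""
    X = pd.get_dummies(df[confounds], drop_first=True).to_numpy(dtype=float)
    y = df[metric].to_numpy(dtype=float)
    fitted = LinearRegression().fit(X, y)
    # Add the mean back so the adjusted metric stays on its original scale.
    return pd.Series(y - fitted.predict(X) + y.mean(), index=df.index,
                     name=f"{metric}_adjusted")

# Hypothetical usage on per-well activity metrics:
# df = pd.read_csv("activity_metrics.csv")
# df["burst_rate_adj"] = residualize(df, "burst_rate", ["batch", "plate_position"])
```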

QuickBooks Online In-product Help Recommender

Acted as a lead data scientist for QuickBooks Online's self-help recommendation app, which required a multi-team collaboration. The goal was to surface the most relevant help articles to customers and enable them to resolve their problems from within the product. My role was to build and integrate the ML engine.

For data exploration, extraction, and feature engineering, I liaised with the data science and data engineering teams to understand the multiple sources of relevant data. I wrote efficient PySpark code to ingest and transform high volumes of clickstream data (billions of rows), customer profile data, and help article databases.
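A minimal PySpark sketch of this kind of clickstream aggregation is shown below; the S3 paths, table layout, and column names are illustrative assumptions rather than the actual data model.

```python
# Hypothetical PySpark sketch of clickstream aggregation: collapse raw events
# into one row per user holding their most recent page views. Paths and column
# names are illustrative assumptions.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("clickstream-features").getOrCreate()

clicks = spark.read.parquet("s3://example-bucket/clickstream/")  # placeholder path

# Keep each user's 50 most recent events and collect them into one row per user.
recency = Window.partitionBy("user_id").orderBy(F.col("event_ts").desc())

user_sequences = (
    clicks
    .withColumn("recency_rank", F.row_number().over(recency))
    .filter(F.col("recency_rank") <= 50)
    .groupBy("user_id")
    .agg(F.collect_list("page_id").alias("recent_pages"),
         F.max("event_ts").alias("last_seen"))
)

user_sequences.write.mode("overwrite").parquet(
    "s3://example-bucket/features/user_sequences/")
```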

For model training, I employed a novel deep learning approach that used shared layers and LSTMs and merged temporal sequences with static features.
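A minimal sketch of that architecture in Keras (listed among the technologies for this role) follows; the input shapes, layer sizes, and output space are illustrative assumptions.

```python
# Minimal Keras sketch of an architecture that merges a temporal clickstream
# branch (LSTM) with static profile features. Shapes and sizes are illustrative.
from tensorflow import keras
from tensorflow.keras import layers

seq_len, n_event_types, n_static, n_articles = 50, 200, 30, 500  # assumed dims

# Temporal branch: sequence of recent click events.
events_in = keras.Input(shape=(seq_len,), name="event_ids")
x = layers.Embedding(n_event_types, 32)(events_in)   # shared embedding layer
x = layers.LSTM(64)(x)

# Static branch: customer profile features.
static_in = keras.Input(shape=(n_static,), name="profile_features")
s = layers.Dense(32, activation="relu")(static_in)

# Merge branches and score candidate help articles.
merged = layers.concatenate([x, s])
out = layers.Dense(n_articles, activation="softmax", name="article_scores")(merged)

model = keras.Model(inputs=[events_in, static_in], outputs=out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```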

To productionize the model, I led a team of data science contributors, front-end and back-end developers, and members of the performance testing and A/B testing teams. Together we integrated the model with the existing click data streams, built I/O specs, ensured adequate stability and response latency, and measured significant improvements in customer engagement (55% higher click-through on articles) and support metrics (10% lower call rates).

Languages

Python, SQL, Python 3, R

Paradigms

Data Science, Continuous Integration (CI), Azure DevOps

Other

Computer Vision, Machine Learning, Presentations, Deep Learning, Experimental Design, Experimental Research, Artificial Intelligence (AI), Data Analysis, Research, Natural Language Processing (NLP), Technical Project Management, Statistics, Data Visualization, Mathematics, Probability Theory, Signal Processing, 3D Image Processing, A/B Testing, Scientific Computing, Image Processing, Neural Networks, GPT, Generative Pre-trained Transformers (GPT), Convolutional Neural Networks (CNN), Graph Theory, Network Science, Cognitive Science, Computational Biology, Factor Analysis, Time Series Analysis, Graphics Processing Unit (GPU), CI/CD Pipelines, Object Detection, Variational Autoencoders, Diffusion Models

Libraries/APIs

SciPy, NumPy, Pandas, Scikit-learn, PyTorch, PySpark, Keras, TensorFlow, Luigi, CatBoost

Tools

Git, PyCharm, Slack

Platforms

Linux, macOS, Jupyter Notebook, Amazon Web Services (AWS), Docker, Kubernetes

Frameworks

Hadoop, Alembic

Industry Expertise

Project Management

Storage

Databases

Education

2012 - 2017

Doctoral Degree in Neuroscience (Computational Neuroimaging)

University of California, Davis - Davis, California

2008 - 2012

Bachelor's Degree in Computational Neuroscience

Cornell University - Ithaca, NY
