David Grayson
Verified Expert in Engineering
Data Scientist and Machine Learning Developer
Oakland, CA, United States
Toptal member since January 1, 2021
David is an experienced data and ML scientist with a PhD and demonstrated success at large and small companies. He has published 12+ papers on computational neuroimaging and designed and built real-time ML apps for QuickBooks that improved the product experience for over a million users. He led multiple initiatives at a biotech startup predicting neurological disease using novel computer vision, analytics, and ML methods. David is passionate about helping clients leverage data and AI to maximize their impact.
Preferred Environment
Linux, Slack, Python 3, MacOS
The most amazing...
...ML product I've built is a recommender system for QuickBooks Online users needing help, based on real-time user activity and powered by deep learning.
Work Experience
Lead Data Scientist
Logic20/20
- Led a new data science practice within San Diego Gas & Electric’s (SDG&E) asset management group.
- Served as the lead data scientist and ML engineer for Pacific Gas & Electric’s (PG&E) AI-assisted inspection team.
- Trained engineers, analysts, and data scientists in the full data science lifecycle, including project scoping, EDA, data pipelining, code testing, model training/validation, and deployment.
- Built the client's first ML app at SDG&E, making daily predictions about failures on 200,000 devices in the distribution grid using CNNs, LSTMs, and self-supervised embeddings. Built their first continuous integration pipeline using Azure DevOps.
- Trained and productionized deep computer vision models at scale to prioritize and assist PG&E’s inspection of millions of drone-captured images.
- Enabled real-time, automated assistance in the inspection of more than 100,000 aerial images via four object detection and classification pipelines.
- Restructured a database containing millions of AI-detected components. Reduced query execution time on the DB by more than 50x.
- Replaced manual inspection form questions with AI predictions, reducing manual labor for tens of thousands of inspections. Demonstrated accuracy of over 90% across seven classes.
- Trained and productionized new iterations of a component classification model, adding new classes and improving the precision of existing classes by 3% on average.
- Deployed existing model pipelines to GPU, resulting in around 5x speed-up in response time and eliminating crashes on Kubernetes pods.
Senior Machine Learning Scientist
System1 Biosciences
- Led the video microscopy data pipeline team with biology, robotics, software, and data science members. Deployed a 12-step processing DAG in AWS on 500+ videos (over 10TB). Reduced the failure rate of QC-ed videos by 75% and increased frame rate 10x.
- Built and productionized CNN-based image segmentation for automated quantification of tissue protein expression. Deployed in AWS on over 1,000 scanned images (more than 1PB).
- Demonstrated effects of lab protocols on tissue quality, used for patents and investor demos.
- Created an advanced analytics pipeline to measure and describe neuronal network activity. It was used to demonstrate the significant and distinct effects of three different neuromodulatory drugs and validate new lab protocols.
- Built an analytics pipeline to assay hierarchical effects of experimental variables. Created novel, statistically rigorous methods for demonstrating disease effects.
- Served as a technical lead for the neurodegenerative disease program. Planned and executed scientific roadmaps and company and investor presentations while coordinating experimental designs, data pipelines, ML, and analytics.
Senior Data Scientist—Machine Learning
Intuit, Inc.
- Acted as a technical lead for QuickBooks Online's self-help recommendation algorithm, which required a multi-team collaboration. Expanded its use to all customer segments and submitted multiple patents for its back-end ML algorithms.
- Trained, productionized, and A/B tested the first real-time deep learning models (RNN and LSTM) in QuickBooks. Boosted customer engagement by 55%, reduced customer support call rates by 10%, and reduced direct annual costs by at least $900,000.
- Transformed data from millions of users and billions of clickstream events via distributed computing such as Spark to create embedded representations of online user activity and improve multiple existing ML services.
- Trained interns and led exploratory machine learning and NLP research for customer success. Projects included an API service to anonymize customer chat data and a predictive customer support call intent model.
Visiting Scientist
Oregon Health & Science University
- Led two research projects on a six-member data team composed of graduate students, postdoctoral scientists, and research staff, resulting in three publications and multiple conference presentations.
- Built multilinear regression models explaining more than 60% variance in the correlational structure of fMRI time-series data, using anatomical and gene expression data as features.
- Trained students and research staff in structural and functional MRI, signal processing, and data analysis.
Graduate Student Researcher
UC Davis Center for Neuroscience
- Developed data analysis strategies independently. Selected for a two-year Autism Speaks research fellowship award for my work.
- Produced results that were instrumental in securing a federal grant worth over $1.5 million.
- Published 12 peer-reviewed studies with over 700 citations, covering advanced statistical and computational techniques for processing multimodal brain MRI data and characterizing typical and atypical brain organization.
Experience
Computer Vision for Remote Inspection of Aerial Imagery for Utilities
Predictive Maintenance for the Distribution Grid
The specific use case was to predict which devices in the electrical grid are nearing failure. This involved joining many disparate data stores with information on more than 200,000 transformers, including metadata pertaining to GIS, customer outage, and service records; time series of weather variables; and time series of electrical loads. The end-to-end pipeline cleaned, filtered, and joined the data, then trained custom neural networks (CNNs, LSTMs, autoencoders) on the metadata and the weather and load time series. The pipeline ran as a Python app using Luigi to manage the workflow, with CI/CD configured in Azure DevOps (ADO). We demonstrated accuracy that outperformed existing baselines and established previously unknown mechanistic insights.
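As a rough illustration of the joining step described above, the following minimal Python sketch pairs per-device static metadata with sliding windows over a load time series to form training examples. The function `build_examples`, the field name `age_years`, the device IDs, and the window length are all hypothetical and are not taken from the actual pipeline.

```python
from typing import Dict, List, Tuple

Example = Tuple[str, Dict[str, float], List[float]]


def build_examples(
    metadata: Dict[str, Dict[str, float]],
    load_series: Dict[str, List[float]],
    window: int,
) -> List[Example]:
    """Join static metadata with sliding windows of a load time series.

    Devices missing from either source, or with fewer than `window`
    readings, are skipped (effectively an inner join).
    """
    examples: List[Example] = []
    for device_id in sorted(metadata):
        series = load_series.get(device_id)
        if series is None or len(series) < window:
            continue  # no usable time series for this device
        for start in range(len(series) - window + 1):
            examples.append(
                (device_id, metadata[device_id], series[start:start + window])
            )
    return examples


# Hypothetical toy data: three transformers, only one with enough readings.
metadata = {
    "T1": {"age_years": 12.0},
    "T2": {"age_years": 30.0},
    "T3": {"age_years": 5.0},
}
load_series = {"T1": [0.4, 0.5, 0.9, 1.1], "T2": [0.7, 0.8]}
examples = build_examples(metadata, load_series, window=3)
```

Each example keeps the static features next to its time window, which is the shape a model combining metadata and sequences would typically consume.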
Disease Classification from Neuronal Network Activity at System1 Biosciences
The key challenge was representing data with extremely high spatiotemporal resolution via low-dimensional, biologically interpretable metrics. We built a 12-module semi-automated pipeline (a DAG), including supervised and unsupervised computer vision methods constrained by biological priors, to clean and standardize the data. The pipeline included auto-triggered QC that integrated with pre- and post-processing.
Deployed as a streaming app in AWS on over 10 TB of data, it reduced QC-ed videos' failure rate by 75% and enabled us to increase the temporal resolution 10x.
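The idea of a processing DAG with auto-triggered QC can be sketched as follows. This is a minimal illustration using Python's standard-library `graphlib`, not the production pipeline: the step names, the QC rule, and the `run_pipeline` helper are hypothetical, and the real system had 12 modules rather than the four shown here.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def run_pipeline(
    steps: Dict[str, Callable[[dict], None]],
    depends_on: Dict[str, Set[str]],
    qc_checks: Dict[str, Callable[[dict], bool]],
    state: dict,
) -> List[str]:
    """Run steps in dependency order and stop at the first failed QC check."""
    completed: List[str] = []
    for name in TopologicalSorter(depends_on).static_order():
        steps[name](state)
        completed.append(name)
        check = qc_checks.get(name)
        if check is not None and not check(state):
            state["qc_failed_at"] = name  # record where QC halted the run
            break
    return completed


# Hypothetical steps operating on a shared state dictionary.
def load(state: dict) -> None:
    state["frames"] = [1.0, 2.0, 100.0, 3.0]


def denoise(state: dict) -> None:
    # Drop obviously corrupted frames (toy threshold).
    state["frames"] = [f for f in state["frames"] if f < 50.0]


def segment(state: dict) -> None:
    state["n_regions"] = len(state["frames"])


def summarize(state: dict) -> None:
    state["mean"] = sum(state["frames"]) / len(state["frames"])


steps = {"load": load, "denoise": denoise, "segment": segment, "summarize": summarize}
# Each key maps a step to the set of steps that must run before it.
depends_on = {"denoise": {"load"}, "segment": {"denoise"}, "summarize": {"segment"}}
# QC runs automatically after the step it is attached to.
qc_checks = {"denoise": lambda s: len(s["frames"]) >= 3}

state: dict = {}
completed = run_pipeline(steps, depends_on, qc_checks, state)
```

Attaching QC checks to individual steps means a failing video stops early instead of consuming downstream compute.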
For analytics, I designed two novel ML-based methods to deconfound experimental variables. I employed these pipelines to achieve two critical endpoints for investors: demonstrating distinct effects of three neuromodulator drugs and demonstrating significant accuracy in predicting disease.
QuickBooks Online In-product Help Recommender
For data exploration, extraction, and feature engineering, I liaised with data science and data engineering teams to understand the multiple sources of relevant data. I wrote efficient PySpark code to ingest and transform high volumes of clickstream (billions of rows), customer profile data, and help article databases.
For model training, I employed a novel deep learning approach consisting of shared layers, LSTMs, and merging temporal sequences with static features.
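One common way to merge a temporal sequence with static features is to encode the sequence with an LSTM and concatenate its final hidden state with a dense encoding of the static inputs. The sketch below shows that pattern in PyTorch. It is an assumption-laden illustration, not the production architecture: the class name `SeqStaticRecommender`, the layer sizes, the vocabulary size, and the choice of PyTorch are all hypothetical.

```python
import torch
from torch import nn


class SeqStaticRecommender(nn.Module):
    """Scores help articles from a click sequence plus static user features."""

    def __init__(self, n_events: int, n_static: int, n_articles: int,
                 embed_dim: int = 16, hidden_dim: int = 32) -> None:
        super().__init__()
        # Temporal branch: embed event IDs, then summarize with an LSTM.
        self.embed = nn.Embedding(n_events, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Static branch: dense encoding of profile-style features.
        self.static = nn.Sequential(nn.Linear(n_static, hidden_dim), nn.ReLU())
        # Layers applied to the merged representation.
        self.head = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_articles),
        )

    def forward(self, events: torch.Tensor, static: torch.Tensor) -> torch.Tensor:
        # h_n has shape (num_layers, batch, hidden_dim); take the last layer.
        _, (h_n, _) = self.lstm(self.embed(events))
        merged = torch.cat([h_n[-1], self.static(static)], dim=1)
        return self.head(merged)  # unnormalized scores, one per article


torch.manual_seed(0)
model = SeqStaticRecommender(n_events=50, n_static=5, n_articles=20)
events = torch.randint(0, 50, (4, 10))  # batch of 4 users, 10 events each
static = torch.randn(4, 5)              # 5 static features per user
logits = model(events, static)
```

Here `logits` holds one score per candidate article for each user in the batch; ranking those scores would yield the recommendations.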
To productionize the model, I led a team consisting of other data science contributors, front-end and back-end developers, and members of the performance testing and A/B testing teams. Together we integrated the model with the existing click data streams, built I/O specs, ensured adequate stability and response latency, and measured significant improvements in customer engagement (55% higher clickthrough on articles) and support metrics (10% lower call rates).
Education
Doctoral Degree in Neuroscience (Computational Neuroimaging)
University of California, Davis - Davis, California
Bachelor's Degree in Computational Neuroscience
Cornell University - Ithaca, NY
Skills
Libraries/APIs
SciPy, NumPy, Pandas, Scikit-learn, PyTorch, PySpark, Keras, TensorFlow, Luigi, CatBoost
Tools
Git, PyCharm, Slack
Languages
Python, SQL, Python 3, R
Platforms
Linux, MacOS, Jupyter Notebook, Amazon Web Services (AWS), Docker, Kubernetes
Frameworks
Hadoop, Alembic
Paradigms
Continuous Integration (CI), Azure DevOps
Industry Expertise
Project Management
Storage
Databases
Other
Computer Vision, Machine Learning, Presentations, Deep Learning, Experimental Design, Experimental Research, Data Science, Artificial Intelligence (AI), Data Analysis, Research, Natural Language Processing (NLP), Technical Project Management, Statistics, Data Visualization, Mathematics, Probability Theory, Signal Processing, 3D Image Processing, A/B Testing, Scientific Computing, Image Processing, Neural Networks, Generative Pre-trained Transformers (GPT), Convolutional Neural Networks (CNNs), Graph Theory, Network Science, Cognitive Science, Computational Biology, Factor Analysis, Time Series Analysis, Graphics Processing Unit (GPU), CI/CD Pipelines, Object Detection, Variational Autoencoders, Diffusion Models