
Allen Gary Grimm

Verified Expert in Engineering

Agile Data Science Developer

Location

Portland, United States

Toptal Member Since

November 5, 2014

Fascinated by the intersection of abstraction and reality, Allen found his calling in data science. Formally trained in machine learning and experienced in taking ML from prototype through production, he specializes in finding and implementing tractable solutions to complex data modeling problems, e.g., user behavior prediction, recommender systems, NLP, spam filters, deduplication, or feature engineering.

Agile Data Science, A/B Testing, Python, Linux, Scikit-learn, REST APIs, Django, SQL, PySpark, Test-driven Development (TDD), Agile Software Development, Git, GitHub, MySQL, Redshift, Haystack, Wolfram, Octave, Cython, Flask, MLflow

Portfolio

Grimm Science

Amazon Web Services (AWS), DigitalOcean, Google Cloud, Agile, Machine Learning...

CVS

Python, PySpark, Databricks, Apache Airflow, MLflow, Decision Trees...

Doing, Inc.

Doc2Vec, Graph Theory, Neural Networks, Tf-idf, SQLAlchemy, Python...

Experience

Data Science - 9 years, Python - 7 years, Agile Data Science - 5 years, PySpark - 4 years, Django - 4 years, Machine Learning Operations (MLOps) - 4 years, Test-driven Development (TDD) - 3 years, Uplift Modeling - 2 years

Availability

Part-time

Preferred Environment

Git, Python, Linux

The most amazing...

...thing I've coded is an evolutionary algorithm to grow complex networks representing massively parallel processors to research the potential of new wire types.

Work Experience

Founder, Engineer, and Data Scientist

2016 - PRESENT

Grimm Science

Built a high-quality spam filter for websites based on logistic regression and several iterations of feature engineering centered entirely on linguistic features.
Developed image and video categorization using AWS Rekognition to extract keywords from media.
Rewrote a DynamoDB-backed CloudSearch implementation wrapped in Lambda, starting from a failed proof of concept.
Compiled and configured constraint satisfaction code (SCIP) for a client's use case. Wrapped software calls in a Docker container and deployed it as a service using AWS Batch.
Built a custom recommender system: an implicit user-item collaborative filter customized to return relevant people based on a product. Wrapped it in a Django project using DRF to serve as the API powering a contractor-made web interface.
Built OCR for Egyptian Hieratic script using convolutional neural networks.
Performed several web scraping jobs using both simple GNU Wget runs and a bot that dynamically navigated the target site.
Debugged, updated, and cleaned up an inherited Looker integration (Looker is now part of Google Cloud) to enable both internal and external analytics dashboards.
Implemented keyword extraction from resumes and job postings using TF-IDF and word2vec.

Technologies: Amazon Web Services (AWS), DigitalOcean, Google Cloud, Agile, Machine Learning, Python, Amazon S3 (AWS S3), AWS Lambda, Amazon EC2, REST APIs, PySpark, Multi-Armed Bandit, A/B Testing, Uplift Modeling, SQL
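
As an illustration of the keyword-extraction bullet above, here is a minimal TF-IDF sketch in plain Python. It is not the code used on the project: the function name `top_keywords`, the whitespace tokenizer, and the tie-breaking rule are assumptions made for this example, and the word2vec half of the approach is left out.

```python
import math
from collections import Counter


def top_keywords(docs, k=3):
    """Return the k highest-scoring TF-IDF terms for each document.

    Hypothetical helper for illustration only; tokenization is a naive
    lowercase whitespace split.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    results = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        }
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        results.append(ranked[:k])
    return results
```

Terms that appear in every document (such as a word common to all resumes) score zero and fall to the bottom of the ranking, which is the property that makes TF-IDF useful for picking out distinguishing keywords.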

Senior Data Scientist and Data Engineer

2019 - 2021

CVS

Optimized the PySpark-based uplift model from a runtime of eight hours down to five minutes on a benchmark dataset with millions of rows and hundreds of variables.
Packaged the uplift model into a properly versioned, pip-installable package shared across the team.
Added an MLflow interface to the uplift model so it fits within the rest of the team's model pipeline.
Updated the tree-based uplift model's decision functions to the current state of the art, increasing average model performance by 50 basis points (bps) across the team.
Helped refactor the experimentation pipeline to make better use of Airflow and PySpark, bringing poorly scaling pipelines back within SLA requirements.

Technologies: Python, PySpark, Databricks, Apache Airflow, MLflow, Decision Trees, Gradient Boosted Trees, Azure, Uplift Modeling, A/B Testing, Multi-Armed Bandit, SQL

Data Scientist and Web Developer

2016 - 2017

Doing, Inc.

Led the processes of scoping and selecting possible machine learning uses, prototyping chosen initiatives, and productizing final models.
Contributed to the Canonicalization project. The core of Doing's data is event postings scraped from several major event publishers, so we frequently encountered duplicate locations across sources and duplicate events across and within sources. A distance-based test, conceptually comparing every event to every other event and every location to every other location (but optimized enough to be computationally feasible, and close to fast), let us find events and locations so similar that they were likely the same. This project was built from scratch through productization.
Helped build a tag extraction project. To help users quickly understand events, it is useful to attach a short list of informative tags to each event. This project was prototyped using an aggregation of Doc2Vec and Tf-idf. It was validated by systematically generating surveys in Google Docs so the team could give feedback on the quality of the generated tags.
Helped build a categorization project. Like tags, categories help us better understand our events and help users navigate the available events. This was also prototyped using Doc2Vec, comparing each event to a whitelist of available categories (chosen by picking the most popular categories listed by our data sources). This project reached the prototype stage.
Contributed to the DoingRank project. Given a complete lack of user data (the startup's app is still unreleased) but significant event data, none of the supervised recommender algorithms fit. The first version, which only barely reached the prototype stage, therefore had two components. The first encoded an abstract notion of event quality as a mathematical formalization of the team's collective intuition about the properties expected of good event postings (a title that matches the description, consistent event postings, etc.). The second is user-specific and maps RSVPs and other direct in-app interactions through tags and categories to form a high-level notion of preference.

Technologies: Doc2Vec, Graph Theory, Neural Networks, Tf-idf, SQLAlchemy, Python, Amazon S3 (AWS S3)
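
The Canonicalization idea above (a distance test over all pairs, made feasible by avoiding most comparisons) can be sketched as follows. This is an assumption-laden illustration, not Doing's implementation: the field names `id`, `date`, and `title`, the Jaccard token distance, the blocking on date, and the `threshold` value are all hypothetical choices.

```python
from collections import defaultdict
from itertools import combinations


def jaccard_distance(a, b):
    """Distance in [0, 1] between two strings based on shared lowercase tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return 1.0 - len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def likely_duplicates(events, threshold=0.4):
    """Return id pairs of events whose titles are within `threshold` distance.

    Events are first grouped ("blocked") by date so that only events in the
    same block are compared, instead of every event against every other.
    """
    blocks = defaultdict(list)
    for event in events:
        blocks[event["date"]].append(event)

    pairs = []
    for block in blocks.values():
        for first, second in combinations(block, 2):
            if jaccard_distance(first["title"], second["title"]) <= threshold:
                pairs.append((first["id"], second["id"]))
    return pairs
```

Blocking trades a little recall (duplicates with mismatched dates are never compared) for a large reduction in the number of pairwise comparisons.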

Senior Data Scientist

2014 - 2016

Veelo

Conceived, prototyped, and productized data science initiatives. Researched models and wrote the valuable ones into the app.
Created a relevance score model applied to content based on how users consume and react to it. It was mathematically equivalent to a neural network, though it was "trained" mainly by interviewing domain experts because little data was available.
Developed a model that generates tags attached to content based on who consumes what content in which context. For example, if many salespeople consume a document and nobody else touches it, the content is probably for salespeople.
Identified and documented gaps in the existing client-facing reporting infrastructure. Built new reports into the app as appropriate. My contribution mainly focused on the back end, but occasionally required front-end work too.
Upgraded the current search engine to include spellcheck, faceting on our current tag infrastructure, and autocomplete.

Technologies: Angular, JavaScript, HTML, Django REST Framework, SQLAlchemy, Solr, Haystack, Git, Django, Python, REST APIs, SQL, Elasticsearch

Data Scientist

2014 - 2014

Cloudability (via Grimm Science)

Surveyed time series prediction methods.
Conducted a case study on time series prediction applied to server usage in R.
Wrote a production-quality implementation of the chosen time series model (Holt-Winters) from scratch in Python.
Calibrated forecast intervals (the expected accuracy of predictions) based on model performance on training and test data sets.
Documented model implementation and testing procedures to enable the client's engineering team to build the model into their dashboard.

Technologies: Holt-Winters, R, Python
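
For readers unfamiliar with Holt-Winters, the following is a minimal sketch of the additive variant in plain Python. It is not the client implementation: the function name, the initialization scheme (level from the first season's mean, trend from the difference between the first two seasons), and the choice of the additive form are assumptions made for this example.

```python
def holt_winters_additive(series, season_len, alpha, beta, gamma, horizon):
    """Forecast `horizon` steps ahead with additive Holt-Winters smoothing.

    alpha, beta, and gamma are the smoothing weights (each in [0, 1]) for
    the level, trend, and seasonal components respectively.
    """
    if len(series) < 2 * season_len:
        raise ValueError("need at least two full seasons of data")

    first = series[:season_len]
    second = series[season_len:2 * season_len]
    level = sum(first) / season_len
    # Average per-step change between the first two seasons.
    trend = (sum(second) - sum(first)) / season_len ** 2
    seasonals = [y - level for y in first]

    # Update the three components one observation at a time.
    for t in range(season_len, len(series)):
        y = series[t]
        seasonal = seasonals[t % season_len]
        prev_level = level
        level = alpha * (y - seasonal) + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        seasonals[t % season_len] = gamma * (y - level) + (1 - gamma) * seasonal

    n = len(series)
    return [
        level + h * trend + seasonals[(n + h - 1) % season_len]
        for h in range(1, horizon + 1)
    ]
```

A call such as `holt_winters_additive(usage, season_len=24, alpha=0.5, beta=0.1, gamma=0.3, horizon=24)` would forecast one further day of hourly data; the parameter values here are placeholders, and in practice they are fitted by minimizing error on held-out data.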

Senior Data Scientist

2014 - 2014

Sovolve (via Grimm Science)

Modeled user activity and interactions to optimize the user experience by filtering content to what is likely to be the most interesting and useful.
Helped build out back-end data infrastructure to improve app performance and prepare for scalability.
Conducted A/B studies to help with product decisions.
Clustered user behavior into distinct and comprehensible segments.
Conducted and internally published an analysis of the app's virality to report on product success and guide product decisions.

Technologies: Mixpanel, Neo4j, PostgreSQL, Python, Linux, REST APIs

Data Scientist

2012 - 2014

PlayHaven

Modeled and predicted user behavior in mobile games. Core projects included churn prediction and user path prediction.
Managed relations between data science and engineering to catalyze productization of initiatives.
Conducted ad hoc advanced analytics to assist in product decisions and to seed ideas for future data modeling.
Rebuilt system logs: solved for errors in observed device identifiers and marked invalid log entries as such. More precisely, the task was to write an iterative MapReduce algorithm to solve for all connected components in a several-billion-node network using Hadoop Streaming and Python.
Recruited, trained, and managed small teams of interns to assist with projects.

Technologies: Hadoop, R, GitHub, Python, Linux, Amazon S3 (AWS S3), Amazon EC2, REST APIs
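
The iterative connected-components idea mentioned above can be sketched in a single process as minimum-label propagation, where each pass plays the role of one map and one reduce step. This is an illustrative assumption about the general technique, not the Hadoop Streaming job itself; the function names are hypothetical and nothing here is distributed.

```python
def map_phase(edges, labels):
    """Emit (node, candidate_label) pairs.

    Each node receives its own current label plus the labels of its neighbors.
    """
    for u, v in edges:
        yield u, labels[v]
        yield v, labels[u]
    for node, label in labels.items():
        yield node, label


def reduce_phase(pairs):
    """Keep the smallest candidate label seen for each node."""
    best = {}
    for node, label in pairs:
        if node not in best or label < best[node]:
            best[node] = label
    return best


def connected_components(edges):
    """Map each node to the smallest node id in its connected component."""
    labels = {node: node for edge in edges for node in edge}
    while True:
        new_labels = reduce_phase(map_phase(edges, labels))
        if new_labels == labels:  # converged: no label changed in this pass
            return labels
        labels = new_labels
```

Nodes that end up with the same label belong to the same component. The number of passes grows with the longest path in a component, which is why distributed versions of this approach are run iteratively until no label changes.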

Data Miner, Software Engineer, and Data Engineer

2011 - 2012

Nike Sport Research Lab

Demonstrated data mining techniques.
Defined roles for new full-time data miners in the lab.
Created a database architecture to centralize the lab's data collection and analysis.
Worked with researchers to import their personal research data into a consistent format.
Liaised with lab researchers and the Wolfram team to build the centralized database.

Technologies: Wolfram, MySQL, Python, C++, REST APIs

Research Assistant

2010 - 2011

Portland State University - Teuscher Lab

Built an evolutionary algorithm in C++ using the library ParadisEO to evolve complex networks.
Wrote a network evaluation utility to simulate traffic and calculate other metrics on networks representing massively parallel processors with non-traditional interconnections.
Built out and documented the experimentation process to enable fellow researchers within and outside of the university to use my framework.
Conducted experiments relating the properties of links to the types of networks they would optimally be used in.
Wrote a thesis on the creation of the framework and the results of the initial experiments.

Technologies: Network Analysis, Simulations, Evolutionary Algorithms, ParadisEO, C++, Linux

Experience

PDX Data

http://www.meetup.com/Portland-Data-Science-Group/

I founded a local data meetup that grew into what is now a well-organized cluster of data meetups for the Portland area. Within that cluster, I also lead the best-attended meetup, which is one of the two most active data meetups.

The website for the cluster is http://pdxdata.org/

The website for the meetup I lead and am most involved with is http://www.meetup.com/Portland-Data-Science-Group/

Churn Prediction with Graphical Models

This was my flagship project at the job that turned me from a data miner into a data scientist. Slides from this and other presentations that I've given can be found here: https://github.com/TheGrimmScientist/SlidesFromTalks.

Trials and Tribulations of a Data Scientist

My blog on data science. I plan to grow it into a stand-alone resource for data science education including everything from business to theory and execution.

An Exploration of Heterogeneous Networks On Chip

http://pdxscholar.library.pdx.edu/cgi/viewcontent.cgi?article=1184&context=open_access_etds

My thesis, which explored the relation between the properties of links and the properties of optimally built networks.

Citation and other metadata are available here: http://archives.pdx.edu/ds/psu/7239

Discrete Multivariate Modeling Simulator

https://github.com/TheGrimmScientist/DMM_Sim

The most recent sample of code that I own and can share. It is to become an open source version of Occam3 (http://dmm.sysc.pdx.edu/weboccam.cgi?action=search), the modeling technique I used for Churn Prediction.

Publication

Python Best Practices and Tips by Toptal Developers

https://www.toptal.com/python/tips-and-practices

Skillset

Languages

Python, SQL, Wolfram, HTML, C++, C, R, Octave, JavaScript

Libraries/APIs

Scikit-learn, SQLAlchemy, Django ORM, Matplotlib, PySpark, REST APIs, Pandas

Tools

IPython Notebook, Apache Solr, Haystack, Git, GitHub, Doc2Vec, Vagrant, Occam3, MATLAB, Boto 3, Apache Airflow

Paradigms

Data Science, Test-driven Development (TDD), Agile Software Development, Agile

Platforms

Linux, AWS Lambda, Amazon EC2, Mixpanel, DigitalOcean, macOS, Windows, AWS Elastic Beanstalk, Amazon Web Services (AWS), Databricks, Azure

Other

Agile Data Science, Decision Trees, Random Forests, Neural Networks, Cython, Uplift Modeling, Machine Learning Operations (MLOps), A/B Testing, Multi-Armed Bandit, Simulated Annealing, Graphical Models, Evolutionary Algorithms, Markov Model, ParadisEO, Simulations, Network Analysis, Holt-Winters, Tf-idf, Graph Theory, Machine Learning, SVMs, Regression, Lambda Functions, Mixed-integer Linear Programming, MLflow, Gradient Boosted Trees

Frameworks

Django, Django REST Framework, Angular, Apache Spark, Hadoop, Flask, AngularJS

Storage

PostgreSQL, MySQL, Amazon DynamoDB, NoSQL, Redshift, Column-oriented DBMS, Neo4j, HDFS, Amazon S3 (AWS S3), Elasticsearch, Google Cloud

Education

2009 - 2011

Master of Science Degree in Electrical Engineering

Portland State University - Portland, Oregon

2005 - 2009

Bachelor of Science Degree in Electrical Engineering

Gannon University - Erie, Pennsylvania
