Senior Data Scientist | 2017 - Present | LSQ
Technologies: SQL, Python, Docker, ECS, Amazon EC2 (Amazon Elastic Compute Cloud), Amazon S3 (AWS S3), Microsoft SQL Server, MySQL, PostgreSQL, Redshift, Matplotlib, NumPy, Pandas, Scikit-learn, XGBoost, Amazon Web Services (AWS), Spark
- Brought in as the first data scientist hire of a leading US invoice finance (factoring) firm processing $3.2 billion in receivables in 2018.
- Frequently challenged conventional wisdom with data-driven insights through detailed exploratory analysis. Saved the firm approximately $1 million by identifying material financial weaknesses in a key market initiative through analysis of client attrition, acquisition costs, and loss rate data. My recommendations led to the cancellation of the program and catalyzed significant process changes in marketing, underwriting, and account management.
- Questioned the prevailing management view that post-recession financial risk is primarily driven by debtors. Determined that deleterious client behavior is a principal risk factor, and identified client CEO personal credit score as a leading predictor of client default. Advocated systematic tracking and evaluation of the client’s ability to pay before extending funds in excess of invoice collateral.
- Identified target industries with the most attractive economics, market segments where LSQ could increase prices without affecting client attrition, and incentives to increase client longevity and lifetime value.
- Reduced risk and streamlined operations with machine learning models and advanced feature engineering. Led an initiative to optimize outbound communication and improve data tracking to reduce invoice delinquency and increase collection rates. Enhanced data-driven decisions by building an invoice risk model to predict non-payment and inform funding choices.
- Identified anomalous client behavior patterns signaling increased risk. Measured each debtor's days to pay in standard deviations above that debtor's norm to detect non-payment risk many weeks earlier than the legacy process, particularly when extreme early payers start to deviate from past behavior although not yet delinquent.
- Built a framework to fully automate machine learning and model training. Applied evolutionary algorithms to optimize model tuning parameters (e.g., number of trees and learning rate), as well as model input selection. Leveraged inexpensive elastic compute power on AWS to train tens of thousands of candidate models.
- Transformed the company’s data infrastructure. Vastly improved data quality through relentless data cleaning initiatives. Cached frequently used metrics and model features in historical daily snapshot tables, dramatically reducing time to prototype new data projects. Enabled the department to quickly scale headcount by making data more intuitive and accessible, shortening onboarding time for new employees by six months.
- Automated Extract, Transform, and Load (ETL) processes on AWS with Apache Airflow. Migrated compute- and data-intensive tasks to large, elastically sized EMR Spark clusters on AWS using the EC2 Spot Market for a three- to five-fold speedup at a fraction of the cost of dedicated hardware.
- Led multi-phase department-wide training to enforce fundamental software best practices, including source control (GitHub), unit testing, and containerization (Docker).
Principal Data Engineer | 2012 - 2017 | Capital One
Technologies: Java, R, SQL, Python, Amazon EC2 (Amazon Elastic Compute Cloud), Amazon S3 (AWS S3), Apache Kafka, Storm, Hadoop, Amazon Web Services (AWS), Spark
- Built a targeted online acquisition platform using AWS, H2O, and Spark to grow the customer base and increase revenue. Presented "Building Real-time Targeting Capabilities on AWS" at the H2O Open Tour 2016 in New York. Discussed H2O and Apache Spark-based machine learning techniques for improving customer acquisition rates on Capital One's website. Explained how to build models at scale on AWS and how to conduct automated daily model refits and model deployments to reactive production systems.
- Built a fully automated cloud infrastructure for a large-scale credit risk model hosted on Amazon Elastic Compute Cloud (EC2) using AWS CloudFormation and Chef. Utilized AWS services and technologies such as Simple Storage Service (S3), Relational Database Service (RDS), Auto Scaling, and Elastic Load Balancing.
- Built and maintained massive-scale machine learning algorithms in production for operational credit risk minimization and marketing on a 240-node Hadoop cluster leveraging half a petabyte of data. Primarily used Python with Hadoop Streaming (MapReduce), Hive, Parquet, and Avro. Tuned performance of Hadoop jobs using YARN and various tools provided by Cloudera's CDH distribution.
- Experimented with various big data technologies including Apache Spark, Apache Storm, and Apache Kafka. Taught numerous legacy internal Capital One teams how to use the latest Hadoop technologies.
- Maintained and productionized more than 50 machine learning predictive models for a diverse range of use cases such as credit risk, marketing, fraud, and customer experience. Worked across multiple lines of business: credit card, auto finance, mortgage, and retail bank.