Lead Data Scientist | Lead Data Engineer, 2021 - PRESENT, Private Company (Confidential)
Technologies: Apache Airflow, PySpark, Elasticsearch, Kibana, AWS, Docker, Data Versioning, PyTorch, TorchServe, XLM-R, BERT, Regular Expressions, Deep Learning, Python, Machine Learning, Flask-RESTful, Web Scraping, Neo4j, Pandas, Hyperparameter Optimization, Data Structures, Machine Learning Operations (MLOps), Natural Language Processing (NLP), Data Science
- Led a team of four and mentored emerging talent under demanding product schedules and requirements.
- Developed a hybrid cloud (local and AWS) big data platform that processed six million documents per day.
- Replaced an industrial 65-language NLP and geoparsing/geolocation solution with in-house models to save $700,000. Improved recall (sentiment) by 20% and accuracy (NER) by 23%, averaged across languages.
- Managed data vendor relations and created an engineering solution to save $600,000.
- Developed three products specifically focused on increasing customer AI trust.
Senior Data Scientist and Back-end Developer, 2019 - 2021, G42
Technologies: PyTorch, Elasticsearch, Kibana, Logstash, Open Distro, TorchServe, Regular Expressions, Data Structures, Docker, Singularity, BERT, Data Versioning, GraphQL, Apache Airflow, Entity Linking, Natural Language Processing (NLP), Python, Machine Learning, Flask-RESTful, Web Scraping, Neo4j, Pandas, Hyperparameter Optimization, Machine Learning Operations (MLOps), Data Science
- Developed a machine translation solution for Arabic and English that performed within ±3 BLEU points of Google, depending on the domain.
- Led data engineering and cleaning, which accounted for 78% of the BLEU improvement over the initial model. Implemented SotA cleaning techniques (back-translation and Bicleaner).
- Created a fast parallel data cleaning pipeline for 24 billion sentences with Apache Airflow.
- Developed a news information extraction solution using a dynamic knowledge graph to interconnect streaming news data at 40,000 articles per day. Published research papers at CIKM, ACL, and SIGIR 2020.
- Implemented and improved SotA entity linking models to be compatible with BERT.
- Developed an OCR machine learning solution to handle very noisy, diverse-domain, mixed-language (AR, EN) data. Applied an adaptive localized binarization filter, which improved bounding box recognition and decreased the character error rate by 50%.
- Served as the lead back-end developer for an oil reservoir image classification app. Created a scalable back end for the computer vision application using PyTorch and TorchServe, Elasticsearch, and Kibana.
- Pioneered GraphQL adoption, resulting in 50% faster front-end loading times.
Data Scientist, 2018 - 2019, Saal.ai
Technologies: Scikit-learn, Docker, Flask-RESTful, Web Scraping, Regular Expressions, Pandas, PyTorch, Python, Machine Learning, Neo4j, Hyperparameter Optimization, Data Structures, Machine Learning Operations (MLOps), Natural Language Processing (NLP), Data Science
- Developed a 7-day forecast model for oil prices: cleaned and feature-engineered diverse time-series data with varying frequencies.
- Built LSTM, HMM, and random forest regression models and used Tree-structured Parzen Estimators (a Bayesian method) for hyperparameter optimization.
- Developed NLP/AI functions and deployed them as a Dockerized REST microservice. Created a multithreaded article extractor using quantitative linguistic features, and stance detection using deep learning with BERT-based semantic distance metrics.
- Led research in graph attention networks for text classification. Implemented violence detection using deep learning and lexical feature engineering.
- Led early-stage discussions with FAB (a bank) on recommender systems and led an investor relationship that resulted in a partnership with Capital Health (a hospital), while serving as product owner and data scientist.
- Pioneered a new product in equestrian data science and led vendor data acquisition for an uncommon domain.
Electrical Engineer and Data Analyst, 2014 - 2017, Collins Aerospace
Technologies: Machine Learning, FPGA, DSP, Signal Analysis, Python, Scikit-learn, Docker, Pandas, Natural Language Processing (NLP), Data Science
- Implemented corporate analytics: used clustering to discover critical programs 50% faster than previous methods and sold senior leadership on methods to improve seven programs, each with over 20% waste.
- Developed on-demand metrics with a real-time, dynamic charts system derived from automated FPGA builds, which saved three person-days of lab time on its first day.
- Contributed to multiple waveforms and projects: added five new features (SALT); verified multiple FPGAs to DO-254 Level A requirements (AWACS); conducted design, verification, and hardware testing (MOSMOD); and received a Lean award for performance.