
Derek Thomas
Verified Expert in Engineering
Data Scientist and AI Developer
Abu Dhabi, United Arab Emirates
Toptal member since September 22, 2021
Derek is a lead data scientist and engineer and a passionate leader with 8+ years of experience. His industry experience includes oil and gas, health, finance, and social media, and he has saved $1.3 million and generated $4 million since 2020. Derek has broad domain experience, three publications in respected conferences, and master's and bachelor's degrees in electrical and computer engineering. He loves working with teams to find solutions that best fit clients' needs.
Experience
- Python - 10 years
- Machine Learning - 8 years
- Generative Pre-trained Transformers (GPT) - 6 years
- PyTorch - 6 years
- Deep Learning - 6 years
- Natural Language Processing (NLP) - 6 years
- Docker - 6 years
- Elasticsearch - 4 years
Preferred Environment
PyCharm, MacOS, Vim Text Editor, Amazon Web Services (AWS), Docker
The most amazing...
...thing I've developed is an Arabic/English translation engine using open data that outperforms Google!
Work Experience
Lead Data Scientist | Lead Data Engineer
Private Company (Confidential)
- Led a team of four and managed emerging talents with difficult product schedules and requirements.
- Developed a mixed cloud (local and AWS) big data platform that processed six million documents per day.
- Replaced an industrial 65-language NLP and geoparsing/geolocation solution with in-house models to save $700,000. Improved recall (sentiment) by 20% and accuracy (NER) by 23%, averaged across languages.
- Managed data vendor relations and created an engineering solution to save $600,000.
- Developed three products specifically focused on increasing customer AI trust.
Senior Data Scientist and Back-end Developer
G42
- Developed a machine translation solution for Arabic and English that achieved performance competitive with Google, within ±3 BLEU points depending on the domain.
- Led data engineering and cleaning, resulting in 78% of the BLEU improvement from the initial model. Implemented SotA cleaning techniques (back-translation and Bicleaner).
- Created a fast parallel data cleaning pipeline for 24 billion sentences with Apache Airflow.
- Developed a news information extraction solution using a dynamic knowledge graph to interconnect streaming news data with 40,000 articles per day. Published research papers at CIKM, ACL, and SIGIR 2020.
- Implemented and improved SotA entity linking models to be compatible with BERT.
- Developed an OCR machine learning solution to handle very noisy, diverse-domain, mixed-language (AR, EN) data. Applied an adaptive localized binarization filter, which improved bounding box recognition and decreased the character error rate by 50%.
- Served as the lead back-end developer for an oil reservoir image classification app. Created a scalable back end for the computer vision application using PyTorch and TorchServe, Elasticsearch, and Kibana.
- Pioneered GraphQL adoption, resulting in 50% faster front-end loading times.
Data Scientist
Saal.ai
- Developed a 7-day forecast model for oil prices: cleaned and feature-engineered diverse, frequency-varying time-series data.
- Built LSTM, HMM, and random forest regression models and used Bayesian Tree Parzen Estimators for hyperparameter optimization.
- Developed NLP/AI functions and deployed them as a Dockerized REST microservice. Created a multithreaded article extractor using quantitative linguistic features and stance detection using deep learning with BERT-based semantic distance metrics.
- Led research in graph attention networks for text classification. Implemented violence detection using deep learning and lexical feature engineering.
- Led early-stage discussions with FAB (a bank) on recommender systems and an investor relationship that resulted in a partnership with Capital Health (a hospital). I did this while serving as a product owner and data scientist.
- Pioneered a new product in equestrian data science and led vendor data acquisition for an uncommon domain.
Electrical Engineer and Data Analyst
Collins Aerospace
- Implemented corporate analytics: used clustering to discover critical programs 50% faster than previous methods and sold senior leadership on methods to improve seven programs, each with over 20% waste.
- Developed on-demand metrics with a real-time, dynamic charts system derived from automated FPGA builds, which saved three person-days of lab time on its first day.
- Contributed to multiple Waveforms and projects: added five new features (SALT); verified multiple FPGAs to DO-254 level A requirements (AWACS); conducted design, verification, and hardware testing (MOSMOD); and received a Lean award for performance.
Experience
Arabic ↔ English Translation Engine
Key Activities
• Handled all the data engineering and trained some models.
• Implemented a back-translation data generator and improved BLEU by 78% over the baseline.
• Developed a trie-based cleaning technique for scraped data.
• Used Bicleaner to detect noisy sentence pairs based on mutual translation likelihood.
Social Media Processing
The routing was deployed in Docker containers that subscribed to AWS and worked in batches. They would then call the models. I used TorchServe to deploy the models because it's lightweight and easily configurable, allowing quick optimization for the computational capacity. The results were ingested into Elastic, which worked well because it's flexible and scales well. Kibana was a good choice for visualizations.
Relation Extraction with Self-determined Graph Convolutional Networks
Although their performance is promising, the reliance on linguistic tools results in a non-end-to-end process. In this work, we proposed a novel model, the self-determined graph convolutional network (SGCN), which determines a weighted graph using a self-attention mechanism rather than using any linguistic tool. Then, the self-determined graph is encoded using a GCN.
We tested our model on the TACRED dataset and achieved the state-of-the-art result. Our experiments show that SGCN outperforms the traditional GCN, which uses dependency parsing tools to build the graph.
I was the second author, and this work was accepted into CIKM '20: Proceedings of the 29th ACM International Conference on Information & Knowledge Management.
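The core idea described above, an attention-derived weighted graph that is then encoded by a graph convolution, can be illustrated with a minimal sketch. This is not the published SGCN implementation: the function names, the single-head scaled dot-product attention, the projection matrices `w_q`, `w_k`, and `w`, and the toy inputs are all assumptions chosen for illustration.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_determined_adjacency(h, w_q, w_k):
    """Build a dense weighted graph from token features with
    scaled dot-product self-attention (assumed form, single head)."""
    q = matmul(h, w_q)
    k = matmul(h, w_k)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = matmul(q, k_t)
    # Each row becomes a distribution over the other tokens.
    return [softmax([s / scale for s in row]) for row in scores]


def gcn_layer(adj, h, w):
    """One graph convolution: aggregate with adj, project with w, apply ReLU."""
    aggregated = matmul(adj, h)
    projected = matmul(aggregated, w)
    return [[max(0.0, v) for v in row] for row in projected]


# Toy example: three tokens with two-dimensional features (hypothetical values).
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
adj = self_determined_adjacency(h, identity, identity)
out = gcn_layer(adj, h, identity)
```

Because the adjacency matrix is computed from the token features themselves, no dependency parser is needed to define the edges, which is the property the paragraph above contrasts with a traditional GCN.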
Autoencoding Keyword Correlation Graph for Document Clustering
https://aclanthology.org/2020.acl-main.366.pdf
The graph was constructed with topical keywords as nodes and multiple local and global features as edges. A GAE was employed to aggregate the two sets of features by learning a latent representation, which could jointly reconstruct them. Clustering was then performed on the learned representations, using vector dimensions as features for inducing document classes.
Extensive experiments on two datasets showed that the features learned by our approach could achieve better clustering performance than other existing features, including term frequency-inverse document frequency and average embedding.
I was the third author, and this work was accepted into ACL 2020.
Attending to Inter-sentential Features in Neural Text Classification
We investigated graph-level extensions to such models and proposed a novel architecture for combining alternative text features. It used an attention mechanism to dynamically decide how much information to use from a sequence- or graph-level component. We evaluated different architectures on a range of text classification datasets, and graph-level extensions were found to improve performance on most benchmarks. In addition, the attention-based architecture, whose weighting was learned adaptively from the data, outperformed the generic and fixed-value concatenation ones.
I was the fourth author, and this work was accepted into SIGIR 2020.
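As a rough illustration of attention deciding how much to draw from a sequence-level versus a graph-level component, the sketch below scores each component vector, normalizes the two scores with a softmax, and returns the weighted sum. It is not the architecture from the paper: `attention_fuse`, the scoring vector `score_vec`, and the toy inputs are hypothetical, and in a real model the scoring vector would be learned rather than fixed.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attention_fuse(seq_vec, graph_vec, score_vec):
    """Fuse two representations with softmax attention weights.

    Returns the fused vector and the (sequence, graph) weights.
    """
    scores = [dot(seq_vec, score_vec), dot(graph_vec, score_vec)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    a_seq, a_graph = exps[0] / total, exps[1] / total
    fused = [a_seq * s + a_graph * g for s, g in zip(seq_vec, graph_vec)]
    return fused, (a_seq, a_graph)


# Toy inputs (hypothetical): the scoring vector favors the sequence component.
fused, weights = attention_fuse([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

A fixed-value concatenation would use the same mixing for every input, whereas here the weights change with the input vectors.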
Education
Master's Degree in Electrical and Computer Engineering
University of Louisville - Louisville, KY, USA
Bachelor's Degree in Electrical and Computer Engineering
University of Louisville - Louisville, KY, USA
Certifications
Natural Language Processing
Coursera
Neo4j Certified Professional
Neo4j
Deep Learning Specialization
Coursera
Skills
Libraries/APIs
Pandas, PyTorch, Scikit-learn, Flask-RESTful, PySpark
Tools
TorchServe, Named-entity Recognition (NER), PyCharm, Kibana, Apache Airflow, Vim Text Editor, Logstash, Amazon Simple Notification Service (SNS)
Languages
Python, GraphQL
Platforms
Docker, Amazon EC2, Amazon Web Services (AWS), MacOS
Storage
Neo4j, Elasticsearch, Amazon S3 (AWS S3), Graph Databases
Other
Regular Expressions, Natural Language Processing (NLP), XLM-R, Machine Translation, Machine Learning Operations (MLOps), Data Science, Generative Pre-trained Transformers (GPT), DSP, Signal Analysis, FPGA, Web Scraping, Hyperparameter Optimization, Data Structures, Waveforms, Statistics, Adaptive Control Systems, Machine Learning, Open Distro, BERT, Data Versioning, Deep Learning, Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Long Short-term Memory (LSTM)