Derek Thomas, Developer in Abu Dhabi, United Arab Emirates

Derek Thomas

Verified Expert in Engineering

Data Scientist and AI Developer

Location
Abu Dhabi, United Arab Emirates
Toptal Member Since
September 22, 2021

Derek is a lead data scientist and engineer and a passionate leader with 8+ years of experience. His industry experience includes oil and gas, health, finance, and social media, and he has saved $1.3 million and generated $4 million since 2020. Derek has broad domain experience, three publications in respected conferences, and master's and bachelor's degrees in electrical and computer engineering. He loves working with teams to find solutions that best fit clients' needs.

Portfolio

Private Company (Confidential)
Apache Airflow, PySpark, Elasticsearch, Kibana, Amazon Web Services (AWS)...
G42
PyTorch, Elasticsearch, Kibana, Logstash, Open Distro, TorchServe...
Saal.ai
Scikit-learn, Docker, Flask-RESTful, Web Scraping, Regular Expressions, Pandas...

Experience

Availability

Part-time

Preferred Environment

PyCharm, MacOS, Vim Text Editor, Amazon Web Services (AWS), Docker

The most amazing...

...thing I've developed is an Arabic/English translation engine using open data that outperforms Google!

Work Experience

Lead Data Scientist | Lead Data Engineer

2021 - PRESENT
Private Company (Confidential)
  • Led a team of four, mentoring emerging talent under demanding product schedules and requirements.
  • Developed a mixed cloud (local and AWS) big data platform that processed six million documents per day.
  • Replaced an industrial 65-language NLP and geoparsing/geolocation solution with in-house models to save $700,000. Improved recall (sentiment) by 20% and accuracy (NER) by 23%, averaged across languages.
  • Managed data vendor relations and created an engineering solution to save $600,000.
  • Developed three products specifically focused on increasing customer AI trust.
Technologies: Apache Airflow, PySpark, Elasticsearch, Kibana, Amazon Web Services (AWS), Docker, Data Versioning, PyTorch, TorchServe, XLM-R, BERT, Regular Expressions, Deep Learning, Python, Machine Learning, Flask-RESTful, Web Scraping, Neo4j, Pandas, Hyperparameter Optimization, Data Structures, Machine Learning Operations (MLOps), Generative Pre-trained Transformers (GPT), Natural Language Processing (NLP), Data Science

Senior Data Scientist and Back-end Developer

2019 - 2021
G42
  • Developed a machine translation solution for Arabic and English that achieved performance competitive with Google, within ±3 BLEU points depending on the domain.
  • Led data engineering and cleaning, resulting in 78% of the BLEU improvement from the initial model. Implemented SotA cleaning techniques (back-translation and Bicleaner).
  • Created a fast parallel data cleaning pipeline for 24 billion sentences with Apache Airflow.
  • Developed a news information extraction solution using a dynamic knowledge graph to interconnect streaming news data with 40,000 articles per day. Published research papers at CIKM, ACL, and SIGIR 2020.
  • Implemented and improved SotA entity linking models to be compatible with BERT.
  • Developed an OCR machine learning solution to handle very noisy, mixed-language (AR, EN) data across diverse domains. Applied an adaptive localized binarization filter, which improved bounding box recognition and decreased the character error rate by 50%.
  • Served as the lead back-end developer for an oil reservoir image classification app. Created a scalable back end for the computer vision application using PyTorch and TorchServe, Elasticsearch, and Kibana.
  • Pioneered GraphQL adoption, resulting in 50% faster front-end loading times.
Technologies: PyTorch, Elasticsearch, Kibana, Logstash, Open Distro, TorchServe, Regular Expressions, Data Structures, Docker, BERT, Data Versioning, GraphQL, Apache Airflow, Named-entity Recognition (NER), Natural Language Processing (NLP), Generative Pre-trained Transformers (GPT), Python, Machine Learning, Flask-RESTful, Web Scraping, Neo4j, Pandas, Hyperparameter Optimization, Machine Learning Operations (MLOps), Data Science

Data Scientist

2018 - 2019
Saal.ai
  • Developed a 7-day forecast model for oil prices: cleaned and feature-engineered diverse time-series data with varying frequencies.
  • Built LSTM, HMM, and random forest regression models and used Bayesian Tree Parzen Estimators for hyperparameter optimization.
  • Developed NLP/AI functions and deployed them as a Dockerized REST microservice. Created a multithreaded article extractor using quantitative linguistic features and stance detection using deep learning with BERT-based semantic distance metrics.
  • Led research in graph attention networks for text classification. Implemented violence detection using deep learning and lexical feature engineering.
  • Led early-stage discussions with FAB (a bank) on recommender systems and an investor relationship that resulted in a partnership with Capital Health (a hospital), all while serving as a product owner and data scientist.
  • Pioneered a new product in equestrian data science and led vendor data acquisition for an uncommon domain.
Technologies: Scikit-learn, Docker, Flask-RESTful, Web Scraping, Regular Expressions, Pandas, PyTorch, Python, Machine Learning, Neo4j, Hyperparameter Optimization, Data Structures, Machine Learning Operations (MLOps), Natural Language Processing (NLP), Generative Pre-trained Transformers (GPT), Data Science

Electrical Engineer and Data Analyst

2014 - 2017
Collins Aerospace
  • Implemented corporate analytics: used clustering to discover critical programs 50% faster than previous methods and sold senior leadership on methods to improve seven programs, each with over 20% waste.
  • Developed on-demand metrics with a real-time, dynamic charts system derived from automated FPGA builds, which saved three person-days of lab time on its first day.
  • Contributed to multiple Waveforms and projects: added five new features (SALT); verified multiple FPGAs to DO-254 level A requirements (AWACS); conducted design, verification, and hardware testing (MOSMOD); and received a Lean award for performance.
Technologies: Machine Learning, FPGA, DSP, Signal Analysis, Python, Scikit-learn, Docker, Pandas, Natural Language Processing (NLP), Generative Pre-trained Transformers (GPT), Data Science

Arabic ↔ English Translation Engine

Our team of four non-Arabic speakers created an Arabic/English translation engine that was quantitatively competitive with Google and outperformed Google qualitatively. This performance was especially notable because we used only open data. The reproducible pipeline handled 24 billion sentences.

Key Activities
• Handled all the data engineering and trained some models.
• Implemented a back-translation data generator and improved BLEU by 78% over the baseline.
• Developed a trie-based cleaning technique for scraped data.
• Used Bicleaner to detect noisy sentence pairs based on mutual translation likelihood.
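
The trie-based cleaning technique isn't spelled out above, but the idea can be sketched in Python: a token-level trie counts how many scraped sentences share the same opening tokens, and sentences with very common prefixes (typical of scraped boilerplate) are flagged for removal. The function names, prefix length, and threshold below are illustrative assumptions, not the production implementation.

```python
from collections import defaultdict

class TrieNode:
    """Token-level trie node; `count` tracks how many sentences pass through it."""
    def __init__(self):
        self.children = defaultdict(TrieNode)
        self.count = 0

def insert(root, tokens):
    """Insert a token prefix into the trie, incrementing counts along the path."""
    node = root
    for tok in tokens:
        node = node.children[tok]
        node.count += 1

def flag_boilerplate(sentences, prefix_len=3, threshold=3):
    """Return sentences whose first `prefix_len` tokens are shared by at
    least `threshold` sentences -- a likely sign of scraped boilerplate."""
    root = TrieNode()
    tokenized = [s.split() for s in sentences]
    for toks in tokenized:
        insert(root, toks[:prefix_len])
    flagged = []
    for s, toks in zip(sentences, tokenized):
        if len(toks) < prefix_len:
            continue  # too short to share a full prefix
        node = root
        for tok in toks[:prefix_len]:
            node = node.children[tok]
        if node.count >= threshold:
            flagged.append(s)
    return flagged
```

A counting approach like this runs in a single pass over the corpus, which matters at the scale of billions of sentences.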

Social Media Processing

This application was developed to aggregate and process social media data. The data was collected in AWS but processed on a private cloud. The processing consisted of data routing, running models, and ingestion in Elastic.

The routing was deployed in Docker containers that subscribed to AWS and worked in batches. They would then call the models. I used TorchServe to deploy the models because it's lightweight and easily configurable, allowing quick tuning to the available compute capacity. The results were ingested into Elasticsearch, which worked well because it's flexible and scales well. Kibana was a good choice for visualizations.
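
A minimal sketch of that routing step, assuming a TorchServe deployment (the `sentiment` model name and localhost endpoint are hypothetical): incoming documents are grouped into fixed-size batches, and each batch is POSTed to TorchServe's standard inference REST API.

```python
import json
import urllib.request

# Hypothetical TorchServe inference endpoint; the model name is illustrative.
TORCHSERVE_URL = "http://localhost:8080/predictions/sentiment"

def batches(items, size):
    """Group an incoming list of documents into fixed-size batches."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def score_batch(docs, url=TORCHSERVE_URL):
    """POST one batch of documents to TorchServe and return its predictions."""
    req = urllib.request.Request(
        url,
        data=json.dumps(docs).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Batching before the model call is what lets the same containers keep up when the subscription delivers bursts of documents.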

Relation Extraction with Self-determined Graph Convolutional Networks

Relation extraction is a way to obtain the semantic relationship between entities in text. The state-of-the-art methods use linguistic tools to build a graph for the text in which the entities appear, then a graph convolutional network (GCN) is employed to encode the pre-built graphs.

Although their performance is promising, the reliance on linguistic tools results in a non-end-to-end process. In this work, we proposed a novel model, the self-determined graph convolutional network (SGCN), which determines a weighted graph using a self-attention mechanism rather than using any linguistic tool. Then, the self-determined graph is encoded using a GCN.

We tested our model on the TACRED dataset and achieved state-of-the-art results. Our experiments show that SGCN outperforms the traditional GCN, which uses dependency parsing tools to build the graph.
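
The core idea, letting self-attention determine the graph instead of a parser, can be illustrated with a small pure-Python sketch. In the paper, queries and keys are learned projections of token embeddings; here they are assumed given, and the GCN layer is a bare `H' = ReLU(A·H·W)`.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def self_determined_graph(queries, keys):
    """Weighted adjacency from scaled dot-product self-attention:
    A[i][j] = softmax_j(q_i . k_j / sqrt(d)) -- no linguistic tool needed."""
    d = len(queries[0])
    scores = [
        [sum(qv * kv for qv, kv in zip(q, k)) / math.sqrt(d) for k in keys]
        for q in queries
    ]
    return [softmax(row) for row in scores]

def gcn_layer(adj, feats, weight):
    """One GCN layer over the self-determined graph: H' = ReLU(A . H . W)."""
    n, d_in, d_out = len(feats), len(feats[0]), len(weight[0])
    # Aggregate neighbor features weighted by the soft adjacency.
    agg = [[sum(adj[i][j] * feats[j][k] for j in range(n)) for k in range(d_in)]
           for i in range(n)]
    return [[max(0.0, sum(agg[i][k] * weight[k][o] for k in range(d_in)))
             for o in range(d_out)] for i in range(n)]
```

Because the adjacency is produced by attention, the whole model stays end-to-end differentiable, which is the paper's motivation for dropping dependency parsers.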

I was the second author, and this work was accepted into CIKM '20: Proceedings of the 29th ACM International Conference on Information & Knowledge Management.

Autoencoding Keyword Correlation Graph for Document Clustering

https://aclanthology.org/2020.acl-main.366.pdf
Document clustering requires a deep understanding of the complex structure of long text; in particular, its intra-sentential (local) and inter-sentential (global) features. Existing representation learning models do not fully capture these features. To address this, we presented a novel graph-based representation for document clustering that builds a graph autoencoder (GAE) on a keyword correlation graph.

The graph was constructed with topical keywords as nodes and multiple local and global features as edges. A GAE was employed to aggregate the two sets of features by learning a latent representation, which could jointly reconstruct them. Clustering was then performed on the learned representations, using vector dimensions as features for inducing document classes.

Extensive experiments on two datasets showed that the features learned by our approach could achieve better clustering performance than other existing features, including term frequency-inverse document frequency and average embedding.
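
One way to sketch the graph construction step (illustrative only; the paper combines several local and global edge features, while the helper below uses a single document-co-occurrence feature for the edges):

```python
from collections import Counter
from itertools import combinations

def keyword_graph(documents, keywords):
    """Build a keyword correlation graph: nodes are topical keywords,
    and the weight of edge (a, b) is the number of documents in which
    both keywords co-occur (one possible 'local' feature)."""
    kw = set(keywords)
    edges = Counter()
    for doc in documents:
        present = sorted(kw & set(doc.lower().split()))
        for a, b in combinations(present, 2):
            edges[(a, b)] += 1
    return edges
```

A graph autoencoder would then be trained to reconstruct these edge weights (and the global, embedding-based ones), yielding node representations that clustering can operate on.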

I was the third author, and this work was accepted into ACL 2020.

Attending to Inter-sentential Features in Neural Text Classification

Text classification requires a deep understanding of the linguistic features in text; in particular, the intra-sentential (local) and inter-sentential (global) features. Models that operate on word sequences have been used successfully to capture the local features, yet they are not effective in capturing the global features in long text.

We investigated graph-level extensions to such models and proposed a novel architecture for combining alternative text features. It used an attention mechanism to dynamically decide how much information to use from a sequence- or graph-level component. We evaluated different architectures on a range of text classification datasets, and graph-level extensions were found to improve performance on most benchmarks. In addition, the attention-based architecture, which adaptively learns from the data, outperformed the generic and fixed-value concatenation architectures.
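
That attention mechanism can be sketched as a learned gate over the two components. This is a simplified scalar-gate variant for illustration; in practice the gate weights would be learned, and the paper's exact formulation may differ.

```python
import math

def gated_combine(h_seq, h_graph, w, b=0.0):
    """Gate deciding how much to take from the sequence vs. graph encoder:
    alpha = sigmoid(w . [h_seq; h_graph] + b)
    h     = alpha * h_seq + (1 - alpha) * h_graph
    `h_seq` and `h_graph` are the two components' representations; `w`, `b`
    are the (here, hand-supplied) gate parameters."""
    concat = h_seq + h_graph  # vector concatenation
    z = sum(wi * xi for wi, xi in zip(w, concat)) + b
    alpha = 1.0 / (1.0 + math.exp(-z))
    return [alpha * s + (1 - alpha) * g for s, g in zip(h_seq, h_graph)]
```

With fixed-value concatenation, the mix of local and global features is the same for every input; the gate lets the model choose per example, which is what the evaluation credits for the improvement.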

I was the fourth author, and this work was accepted into SIGIR 2020.

Languages

Python, GraphQL

Libraries/APIs

Pandas, PyTorch, Scikit-learn, Flask-RESTful, PySpark

Tools

Named-entity Recognition (NER), PyCharm, Kibana, Apache Airflow, Vim Text Editor, Logstash, Amazon Simple Notification Service (Amazon SNS)

Paradigms

Data Science

Platforms

Docker, Amazon EC2, Amazon Web Services (AWS), MacOS

Other

Regular Expressions, TorchServe, Natural Language Processing (NLP), XLM-R, Machine Translation, Machine Learning Operations (MLOps), Generative Pre-trained Transformers (GPT), DSP, Signal Analysis, FPGA, Web Scraping, Hyperparameter Optimization, Data Structures, Waveforms, Statistics, Adaptive Control Systems, Machine Learning, Open Distro, BERT, Data Versioning, Deep Learning, Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNNs), Long Short-term Memory (LSTM)

Storage

Neo4j, Elasticsearch, Amazon S3 (AWS S3), Graph Databases

2011 - 2012

Master's Degree in Electrical and Computer Engineering

University of Louisville - Louisville, KY, USA

2007 - 2011

Bachelor's Degree in Electrical and Computer Engineering

University of Louisville - Louisville, KY, USA

SEPTEMBER 2018 - PRESENT

Natural Language Processing

Coursera

AUGUST 2018 - PRESENT

Neo4j Certified Professional

Neo4j

FEBRUARY 2018 - PRESENT

Deep Learning Specialization

Coursera
