Derek Thomas

Data Scientist and AI Developer in Abu Dhabi, United Arab Emirates

Member since September 14, 2021
Derek is a lead data scientist and engineer and a passionate leader with 8+ years of experience. His industry experience includes oil and gas, health, finance, and social media, and he has saved $1.3 million and generated $4 million since 2020. Derek has broad domain experience, three publications in respected conferences, and master's and bachelor's degrees in electrical and computer engineering. He loves working with teams to find solutions that best fit clients' needs.

Portfolio

  • Private Company (Confidential)
    Apache Airflow, PySpark, Elasticsearch, Kibana, AWS, Docker, Data Versioning...
  • G42
    PyTorch, Elasticsearch, Kibana, Logstash, Open Distro, TorchServe...
  • Saal.ai
    Scikit-learn, Docker, Flask-RESTful, Web Scraping, Regular Expressions...

Experience

Location

Abu Dhabi, United Arab Emirates

Availability

Part-time

Preferred Environment

PyCharm, macOS, Vim Text Editor, AWS, Docker

The most amazing...

...thing I've developed is an Arabic/English translation engine using open data that outperforms Google!

Employment

  • Lead Data Scientist | Lead Data Engineer

    2021 - PRESENT
    Private Company (Confidential)
    • Led a team of four, mentoring emerging talent under demanding product schedules and requirements.
    • Developed a hybrid cloud (on-premises and AWS) big data platform that processed six million documents per day.
    • Replaced an industrial 65-language NLP and geoparsing/geolocation solution with in-house models to save $700,000. Improved recall (sentiment) by 20% and accuracy (NER) by 23%, averaged across languages.
    • Managed data vendor relations and created an engineering solution to save $600,000.
    • Developed three products specifically focused on increasing customer AI trust.
    Technologies: Apache Airflow, PySpark, Elasticsearch, Kibana, AWS, Docker, Data Versioning, PyTorch, TorchServe, XLM-R, BERT, Regular Expressions, Deep Learning, Python, Machine Learning, Flask-RESTful, Web Scraping, Neo4j, Pandas, Hyperparameter Optimization, Data Structures, Machine Learning Operations (MLOps), Natural Language Processing (NLP), Data Science
  • Senior Data Scientist and Back-end Developer

    2019 - 2021
    G42
    • Developed a machine translation solution for Arabic and English that achieved performance competitive with Google, within ±3 BLEU points depending on the domain.
    • Led data engineering and cleaning, accounting for 78% of the BLEU improvement over the initial model. Implemented state-of-the-art data augmentation and cleaning techniques (back-translation and Bicleaner).
    • Created a fast parallel data cleaning pipeline for 24 billion sentences with Apache Airflow.
    • Developed a news information extraction solution using a dynamic knowledge graph to interconnect streaming news data with 40,000 articles per day. Published research papers at CIKM, ACL, and SIGIR 2020.
    • Implemented and improved SotA entity linking models to be compatible with BERT.
    • Developed an OCR machine learning solution for very noisy, mixed-domain, mixed-language (Arabic and English) data. Applied an adaptive local binarization filter, which improved bounding box recognition and decreased the character error rate by 50%.
    • Served as the lead back-end developer for an oil reservoir image classification app. Created a scalable back end for the computer vision application using PyTorch and TorchServe, Elasticsearch, and Kibana.
    • Pioneered GraphQL adoption, resulting in 50% faster front-end loading times.
    Technologies: PyTorch, Elasticsearch, Kibana, Logstash, Open Distro, TorchServe, Regular Expressions, Data Structures, Docker, Singularity, BERT, Data Versioning, GraphQL, Apache Airflow, Entity Linking, Natural Language Processing (NLP), Python, Machine Learning, Flask-RESTful, Web Scraping, Neo4j, Pandas, Hyperparameter Optimization, Machine Learning Operations (MLOps), Data Science
  • Data Scientist

    2018 - 2019
    Saal.ai
    • Developed a 7-day forecast model for oil prices: cleaned and feature-engineered diverse, mixed-frequency time-series data.
    • Built LSTM, HMM, and random forest regression models and used Bayesian Tree Parzen Estimators for hyperparameter optimization.
    • Developed NLP/AI functions and deployed them as a Dockerized REST microservice. Created a multithreaded article extractor using quantitative linguistic features and stance detection using deep learning with BERT-based semantic distance metrics.
    • Led research in graph attention networks for text classification. Implemented violence detection using deep learning and lexical feature engineering.
    • Led early-stage discussions with FAB (a bank) on recommender systems and cultivated an investor relationship that resulted in a partnership with Capital Health (a hospital), all while serving as both product owner and data scientist.
    • Pioneered a new product in equestrian data science and led vendor data acquisition for an uncommon domain.
    Technologies: Scikit-learn, Docker, Flask-RESTful, Web Scraping, Regular Expressions, Pandas, PyTorch, Python, Machine Learning, Neo4j, Hyperparameter Optimization, Data Structures, Machine Learning Operations (MLOps), Natural Language Processing (NLP), Data Science
  • Electrical Engineer and Data Analyst

    2014 - 2017
    Collins Aerospace
    • Implemented corporate analytics: used clustering to discover critical programs 50% faster than previous methods and sold senior leadership on methods to improve seven programs, each with over 20% waste.
    • Developed on-demand metrics with a real-time, dynamic charts system derived from automated FPGA builds, which saved three person-days of lab time on its first day.
    • Contributed to multiple waveform projects: added five new features (SALT); verified multiple FPGAs to DO-254 Level A requirements (AWACS); conducted design, verification, and hardware testing (MOSMOD); and received a Lean award for performance.
    Technologies: Machine Learning, FPGA, DSP, Signal Analysis, Python, Scikit-learn, Docker, Pandas, Natural Language Processing (NLP), Data Science

Experience

  • Arabic ↔ English Translation Engine

    Our team of four non-Arabic speakers created an Arabic/English translation engine that was quantitatively competitive with Google and outperformed Google qualitatively. This performance was especially notable because we used only open data. The reproducible pipeline handled 24 billion sentences.

    Key Activities
    • Handled all the data engineering and trained some models.
    • Implemented a back-translation data generator and improved BLEU by 78% over the baseline.
    • Developed a trie-based cleaning technique for scraped data.
    • Used Bicleaner to detect noisy sentence pairs based on mutual translation likelihood.
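
    The trie-based cleaning step can be sketched as below. This is a minimal illustration rather than the production pipeline: the trie is flattened into a dict keyed by token-prefix tuples, and the thresholds and sample data are hypothetical.

```python
from collections import defaultdict

def build_prefix_trie(sentences, depth=4):
    """Count token-prefix occurrences up to `depth` tokens
    (a trie flattened into a dict keyed by prefix tuples)."""
    counts = defaultdict(int)
    for s in sentences:
        tokens = s.split()
        for i in range(1, min(depth, len(tokens)) + 1):
            counts[tuple(tokens[:i])] += 1
    return counts

def drop_boilerplate(sentences, min_repeats=3, depth=4):
    """Drop sentences whose leading prefix repeats too often across the
    corpus -- a crude proxy for scraped boilerplate such as menus."""
    counts = build_prefix_trie(sentences, depth)
    return [
        s for s in sentences
        if counts[tuple(s.split()[:depth])] < min_repeats
    ]
```

    A filter along these lines would run alongside Bicleaner as one stage of the Airflow pipeline.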

  • Social Media Processing

    This application was developed to aggregate and process social media data. The data was collected in AWS but processed on a private cloud. The processing consisted of data routing, running models, and ingestion in Elastic.

    The routing was deployed in Docker containers that subscribed to AWS and worked in batches. They would then call the models. I used TorchServe to deploy the models because it's lightweight and easily configurable, allowing quick tuning to the available compute capacity. The results were ingested into Elasticsearch, which worked well because it's flexible and scales well. Kibana was a good choice for visualizations.
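
    The routing logic can be sketched as a simple batcher; the `model` callable stands in for a TorchServe inference request, and the batch size and message format are illustrative assumptions.

```python
def batches(messages, size):
    """Group incoming messages into fixed-size batches for one model call."""
    for i in range(0, len(messages), size):
        yield messages[i:i + size]

def route(messages, model, size=32):
    """Score each batch with the model and collect documents ready for
    Elasticsearch ingestion (plain dicts here; bulk-indexed in production)."""
    docs = []
    for batch in batches(messages, size):
        scores = model(batch)  # stand-in for a TorchServe HTTP call
        docs.extend({"text": t, "score": s} for t, s in zip(batch, scores))
    return docs
```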

  • Relation Extraction with Self-determined Graph Convolutional Networks

    Relation extraction is a way to obtain the semantic relationship between entities in text. The state-of-the-art methods use linguistic tools to build a graph for the text in which the entities appear, then a graph convolutional network (GCN) is employed to encode the pre-built graphs.

    Although their performance is promising, the reliance on linguistic tools results in a non-end-to-end process. In this work, we proposed a novel model, the self-determined graph convolutional network (SGCN), which determines a weighted graph using a self-attention mechanism rather than using any linguistic tool. Then, the self-determined graph is encoded using a GCN.

    We tested our model on the TACRED dataset and achieved state-of-the-art results. Our experiments show that SGCN outperforms the traditional GCN, which uses dependency parsing tools to build the graph.

    I was the second author, and this work was accepted into CIKM '20: Proceedings of the 29th ACM International Conference on Information & Knowledge Management.
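
    The paper's central move, deriving the graph from self-attention instead of a parser, can be sketched in a few lines; the single-head form and the dimensions are simplifications of the published model.

```python
import numpy as np

def self_determined_graph(X, Wq, Wk):
    """Build a weighted token adjacency via scaled dot-product
    self-attention rather than a dependency parse."""
    Q, K = X @ Wq, X @ Wk
    scores = Q @ K.T / np.sqrt(K.shape[1])
    # row-wise softmax: each token distributes attention over all tokens
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```

    A GCN layer can then aggregate token features with this learned adjacency (e.g., `A @ X @ W` followed by a nonlinearity).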

  • Autoencoding Keyword Correlation Graph for Document Clustering
    https://aclanthology.org/2020.acl-main.366.pdf

    Document clustering requires a deep understanding of the complex structure of long text; in particular, the intra-sentential (local) and inter-sentential (global) features. Existing representation learning models do not fully capture these features. To address this, we presented a novel graph-based representation for document clustering that builds a graph autoencoder (GAE) on a keyword correlation graph.

    The graph was constructed with topical keywords as nodes and multiple local and global features as edges. A GAE was employed to aggregate the two sets of features by learning a latent representation, which could jointly reconstruct them. Clustering was then performed on the learned representations, using vector dimensions as features for inducing document classes.

    Extensive experiments on two datasets showed that the features learned by our approach could achieve better clustering performance than other existing features, including term frequency-inverse document frequency and average embedding.

    I was the third author, and this work was accepted into ACL 2020.
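
    The keyword correlation graph at the heart of this approach can be sketched from document co-occurrence; keyword selection and the single co-occurrence edge weight here are simplified stand-ins for the paper's multiple local and global features.

```python
from collections import Counter
from itertools import combinations

def keyword_graph(documents, keywords):
    """Connect keyword pairs with edges weighted by how many
    documents contain both (one simplified edge feature)."""
    edges = Counter()
    for doc in documents:
        present = sorted(k for k in keywords if k in doc.split())
        for a, b in combinations(present, 2):
            edges[(a, b)] += 1
    return dict(edges)
```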

  • Attending to Inter-sentential Features in Neural Text Classification

    Text classification requires a deep understanding of the linguistic features in text; in particular, the intra-sentential (local) and inter-sentential (global) features. Models that operate on word sequences have been used successfully to capture the local features, yet they are not effective in capturing the global features in long text.

    We investigated graph-level extensions to such models and proposed a novel architecture for combining alternative text features. It used an attention mechanism to dynamically decide how much information to use from a sequence- or graph-level component. We evaluated different architectures on a range of text classification datasets, and graph-level extensions were found to improve performance on most benchmarks. In addition, the attention-based architecture, which adaptively learns from the data, outperformed the generic and fixed-value concatenation ones.

    I was the fourth author, and this work was accepted into SIGIR 2020.
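
    The attention-based combination can be sketched as a learned gate over the two components' outputs; the single scalar gate is a simplification of the published mechanism, and the weights shown are placeholders.

```python
import numpy as np

def combine(h_seq, h_graph, w, b=0.0):
    """Mix sequence-level and graph-level representations with a
    sigmoid gate computed from both of them."""
    z = np.concatenate([h_seq, h_graph])
    alpha = 1.0 / (1.0 + np.exp(-(w @ z + b)))  # gate in (0, 1)
    return alpha * h_seq + (1 - alpha) * h_graph
```

    With zero gate weights the model falls back to an even average of the two components; training moves the gate toward whichever features help a given dataset.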

Skills

  • Languages

    Python, GraphQL
  • Libraries/APIs

    Pandas, PyTorch, Scikit-learn, Flask-RESTful, PySpark
  • Tools

    PyCharm, Kibana, Apache Airflow, Vim Text Editor, Logstash, Amazon Simple Notification Service (AWS SNS)
  • Paradigms

    Data Science
  • Platforms

    Docker, Amazon EC2, macOS
  • Other

    Regular Expressions, TorchServe, Entity Linking, Natural Language Processing (NLP), XLM-R, Machine Translation, Machine Learning Operations (MLOps), DSP, Signal Analysis, FPGA, Web Scraping, Hyperparameter Optimization, Data Structures, Waveforms, Statistics, Adaptive Control Systems, Machine Learning, Open Distro, Singularity, BERT, Data Versioning, AWS, Deep Learning, Convolutional Neural Networks, Recurrent Neural Networks, Long Short-term Memory (LSTM)
  • Storage

    Neo4j, Elasticsearch, AWS S3, Graph Databases

Education

  • Master's Degree in Electrical and Computer Engineering
    2011 - 2012
    University of Louisville - Louisville, KY, USA
  • Bachelor's Degree in Electrical and Computer Engineering
    2007 - 2011
    University of Louisville - Louisville, KY, USA

Certifications

  • Natural Language Processing
    SEPTEMBER 2018 - PRESENT
    Coursera
  • Neo4j Certified Professional
    AUGUST 2018 - PRESENT
    Neo4j
  • Deep Learning Specialization
    FEBRUARY 2018 - PRESENT
    Coursera
