Data Scientist and Developer
Thomas has been working in machine learning and natural language processing for more than ten years. He initially studied physics to a master's level, then moved into machine learning, completing a second master's degree in computer speech, text, and internet technology at the University of Cambridge in 2008. Thomas has worked at a variety of companies in industries including consulting, computer science, recruitment, retail, and security, and also has some research experience.
Experience
- Machine Learning - 12 years
- Natural Language Processing (NLP) - 12 years
- GPT - 12 years
- Data Science - 12 years
- Generative Pre-trained Transformers (GPT) - 12 years
- Classification Algorithms - 12 years
- Text Classification - 12 years
- Python - 9 years
MacOS, Linux, Windows, SpaCy, Scikit-learn, TensorFlow, PyCharm, Python
The most amazing thing I've delivered for a client is a 7% increase in customers by adding machine learning to their signup form.
Director | Freelance Consultant Data Scientist
Fast Data Science Ltd.
- Provided consulting in many areas of machine learning to clients across industries, building and deploying machine learning models, focusing on natural language processing.
- Conducted AI due diligence of startups for investors.
- Provided training and upskilling in data science for analytics teams.
- Assisted consultancies with public sector procurement in a variety of countries.
Consultant Data Scientist
National Health Service
- Investigated factors behind junior doctor attrition from the NHS and developed a machine learning model to predict who is going to leave the organization.
- Provided consulting services to management and insights on causes of employee turnover.
- Provided general data strategy consulting to the NHS management.
- Trained a text classification model to predict 75 parameters of complexity from 200-page clinical trial protocol PDFs, allowing the clinical operations team to run financial modeling on more reliable data.
- Analyzed text reports of manufacturing defects and performed unsupervised clustering with LDA, allowing the manufacturing division to see key areas of faults.
- Identified molecules in scientific publications linked to molecules discovered by Boehringer Ingelheim, allowing the pre-clinical research team to connect with researchers around the world experimenting with the same compounds.
- Designed and trained a regression model using PySpark/Spark MLLib to predict customers' order weights in kilograms before they even place the order.
- Worked on recommendation systems for online shopping purchases.
- Trained a model to predict vehicle turnaround and loading times.
- Used machine learning to predict information about candidates, allowing the company to simplify the registration process and improve registrations by 7%.
- Deployed a recommender system to send job alerts to candidates by email with a 7% conversion rate.
- Trained deep learning models (CNN, RNN, LSTM, Word2Vec, Seq2Seq) to analyze candidates' CVs and job descriptions, using Google GPU instances.
- Deployed machine learning projects through to production on the live site as scalable Docker instances behind a load balancer.
- Worked on new techniques to recommend a job to a candidate based on past behavior (like the recommendations you see when you buy a product on Amazon).
Computer Vision Scientist
- Designed and trained, using a team of five developers and five testers/data annotators, neural network solutions for face recognition that ran on Android, iOS, and Windows.
- Collected training data from sources such as web scraping and arranged annotators to manually clean data.
- Worked on and patented cryptographic measures to protect biometric data (irises, fingerprints, and so on).
- Trained convolutional neural networks on GPU using deep learning software Caffe and was able to classify images such as fingerprints or pharmaceutical pill bottles.
- Designed cryptographic measures to protect biometric data (irises, fingerprints).
- Worked in a team of five computational linguists who were designing methods for monitoring market sentiment on the internet, specializing in focused web crawling.
- Communicated designs for natural-language-processing programs to a team of developers who implemented these into products marketed to corporate clients.
- Worked on the development of a machine learning NLP pipeline.
- Worked in a multilingual team on the architecture of human-like natural language dialog systems for use on mobile, web, and in consumer electronics, becoming the team expert on advanced parsing of user input.
- Made frequent visits to blue-chip companies in Silicon Valley and Asia while presenting technology solutions to potential clients.
- Defined requirements, estimated time scales, and prototyped during project planning.
- Provided consulting services to clients and partners developing their own dialog systems using my company's proprietary software.
Customer Conversion on an Online Form
Because users also uploaded a CV, which contains a lot of explicit personal information as well as implicit information such as the job type or salary they were looking for, I was able to train a deep neural network on several years of past signup data to analyze the CV and fill out some of the signup form fields automatically. This allowed a field to be removed, which boosted the conversion rate of the form by 7%, as measured by A/B testing.
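As a rough illustration of how a lift like this can be checked in an A/B test, the sketch below compares the conversion rates of two form variants with a two-proportion z-test. The function name and the visitor and signup counts are made up for the example, and it treats the lift as a relative change; it is not the analysis used on the project.

```python
from math import sqrt
from statistics import NormalDist


def ab_test(conversions_a, visitors_a, conversions_b, visitors_b):
    """Compare two form variants (hypothetical helper).

    Returns (relative lift of B over A, z statistic, two-sided p-value).
    """
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    # Pooled conversion rate under the null hypothesis of no difference.
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return (rate_b - rate_a) / rate_a, z, p_value


# Illustrative counts only: A is the original form, B has one field removed.
lift, z, p_value = ab_test(1000, 10_000, 1070, 10_000)
print(f"relative lift: {lift:.1%}, z = {z:.2f}, p = {p_value:.3f}")
```

With counts of this size the same relative lift can still be within noise, which is why the sample size matters as much as the headline percentage.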
Vehicle Unloading Times
Fortunately, there was a system in place for recording vehicle ignition events, GPS location, and geofencing to identify the arrival and departure times of delivery vehicles, and past schedules were available to identify the quantity and type of product delivered on each drop, which driver was in charge, and the time of day and type of vehicle used.
Using this trove of logged data, I was able to train a simple regression model that would predict the unloading time of any future delivery at the time that the schedule is being generated.
This allowed the client to save money on driver overtime, disruption caused by late deliveries, and fines due to drivers working longer than their legally permitted hours.
Analysis of Clinical Trials
Before the trial is run, the drug developer writes a document called a protocol. This contains vital information about how long the trial will run, what the risk to participants is, what kind of treatment is being investigated, and so on.
The problem is that each protocol is up to 200 pages long, and the structure can vary.
For one pharmaceutical company, I developed and trained a deep learning tool to predict more than 50 output variables from a clinical trial protocol. This allows pharma companies and regulators to analyze and quantify large numbers of protocols, allowing more accurate cost estimation.
The technique can be extended to other industries where large unstructured or semi-structured documents are the norm.
Finding Molecules and Proteins in Scientific Literature
As an example, take aspirin. Aspirin is still a trademark of Bayer in some countries, but in a paper the molecule could appear as acetylsalicylic acid, 2-acetoxybenzenecarboxylic acid, C9H8O4, or under one of a number of identifiers such as DB00945. There could also be identifiers that refer to other molecules or identifiers that refer to only one version of a molecule.
I have developed several tried and tested techniques to disambiguate these terms. Usually, I need several annotated examples to start with, and then I train a machine learning model to learn from these examples and annotate new publications as they come in.
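To make the problem concrete, here is a minimal dictionary-lookup baseline, not the trained model described above. It maps a few surface forms of aspirin to one canonical identifier. The synonym table and function name are hypothetical, and the molecular formula is deliberately left out because a formula alone can be shared by different compounds.

```python
import re

# Hypothetical synonym table: lower-cased surface form -> canonical identifier.
SYNONYMS = {
    "aspirin": "DB00945",
    "acetylsalicylic acid": "DB00945",
    "2-acetoxybenzenecarboxylic acid": "DB00945",
    "db00945": "DB00945",
}

# Try longer names first so a long synonym is not cut short by a shorter one.
_names = sorted(SYNONYMS, key=len, reverse=True)
_PATTERN = re.compile(
    r"(?<!\w)(?:" + "|".join(re.escape(name) for name in _names) + r")(?!\w)",
    flags=re.IGNORECASE,
)


def find_mentions(text):
    """Return (matched text, canonical identifier) pairs in order of appearance."""
    return [(m.group(0), SYNONYMS[m.group(0).lower()]) for m in _PATTERN.finditer(text)]


sentence = "Patients received Aspirin (acetylsalicylic acid, DB00945) daily."
for mention, canonical_id in find_mentions(sentence):
    print(mention, "->", canonical_id)
```

A lookup like this cannot resolve ambiguous identifiers or names it has never seen, which is where annotated examples and a trained model come in.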
Python, Python 3, SQL, Java, R, C++, C#
Scikit-learn, NumPy, SpaCy, Natural Language Toolkit (NLTK), SciPy, TensorFlow, PySpark, MLlib
Azure ML Studio, Azure Machine Learning, PyCharm
Azure, Google Cloud Platform (GCP), Docker, MacOS, Unix, Amazon Web Services (AWS), Windows, Linux
Natural Language Understanding (NLU), Natural Language Processing (NLP), Text Classification, Machine Learning, Classification Algorithms, Convolutional Neural Networks, Dialog Systems, Natural Language Generation (NLG), Spanish, German, GPT, Generative Pre-trained Transformers (GPT), Programming, Physics, Clustering, Graphics Processing Unit (GPU), Custom BERT, Computer Vision, Speech Recognition, Speech Synthesis, Sentiment Analysis
Flask, Spark, .NET
Master's Degree in Computer Speech, Text, and Internet Technology
University of Cambridge - Cambridge, UK
Master's Degree in Physics
University of Durham - Durham, UK
Azure Data Science Associate