Thomas Wood, Data Scientist and Developer in London, United Kingdom
Member since January 9, 2019
Thomas has worked in machine learning and natural language processing for more than ten years. He initially studied physics to master's level, then moved into machine learning, completing a second master's degree in computer speech, text, and internet technology at the University of Cambridge in 2008. He has worked at companies in industries including consulting, computer science, recruitment, retail, and security, and also has research experience.

Location

London, United Kingdom

Availability

Part-time

Preferred Environment

MacOS, Linux, Windows, SpaCy, Sklearn, TensorFlow, PyCharm, Python

The most amazing...

...thing I've delivered for a client is a 7% increase in customers by adding machine learning to their signup form.

Employment

  • Director | Freelance Consultant Data Scientist

    2018 - PRESENT
    Fast Data Science Ltd.
    • Provided consulting in many areas of machine learning to clients across industries, building and deploying machine learning models, focusing on natural language processing.
    • Conducted AI due diligence of startups for investors.
    • Provided training and upskilling in data science for analytics teams.
    • Assisted consultancies with public sector procurement in a variety of countries.
    Technologies: Amazon Web Services (AWS), TensorFlow, Azure, Python
  • Consultant Data Scientist

    2020 - 2020
    National Health Service
    • Investigated factors behind junior doctor attrition from the NHS and developed a machine learning model to predict who is going to leave the organization.
    • Provided consulting services to management and insights on causes of employee turnover.
    • Provided general data strategy consulting to the NHS management.
    Technologies: Scikit-learn, Azure, Python
  • Data Scientist

    2019 - 2020
    Boehringer Ingelheim
    • Trained a text classification model to predict 75 parameters of complexity from 200-page clinical trial protocol PDFs, allowing the clinical operations team to run financial modeling on more reliable data.
    • Analyzed text reports of manufacturing defects and performed unsupervised clustering with LDA, allowing the manufacturing division to see key areas of faults.
    • Identified molecules in scientific publications linked to molecules discovered by Boehringer Ingelheim, allowing the pre-clinical research team to connect with researchers around the world experimenting with the same compounds.
    Technologies: TensorFlow, PostgreSQL, Python
  • Data Scientist

    2018 - 2019
    Tesco
    • Designed and trained a regression model using PySpark/Spark MLlib to predict customers' order weights in kilograms before they even place the order.
    • Worked on recommendation systems for recommending online shopping purchases.
    • Trained a model to predict vehicle turnaround and loading times.
    Technologies: MLlib, PySpark, Spark, Python
  • Data Scientist

    2017 - 2018
    CV-Library
    • Used machine learning to predict information about candidates, allowing the company to simplify the registration process and improve registrations by 7%.
    • Deployed a recommender system to send job alerts to candidates by email with a 7% conversion rate.
    • Trained deep learning models (CNN, RNN, LSTM, Word2Vec, Seq2Seq) to analyze candidates' CVs and job descriptions, using Google GPU instances.
    • Deployed machine learning projects through to production on the live site as scalable Docker instances behind a load balancer.
    • Worked on new techniques to recommend a job to a candidate based on past behavior (like the recommendations you see when you buy a product on Amazon).
    Technologies: Google Cloud Platform (GCP), TensorFlow, Python
  • Computer Vision Scientist

    2015 - 2017
    Veridium
    • Working with a team of five developers and five testers/data annotators, designed and trained neural network solutions for face recognition that ran on Android, iOS, and Windows.
    • Collected training data from sources such as web scraping and arranged annotators to manually clean data.
    • Designed and patented cryptographic measures to protect biometric data (irises, fingerprints, and so on).
    • Trained convolutional neural networks on GPUs using the deep learning framework Caffe to classify images such as fingerprints or pharmaceutical pill bottles.
    Technologies: C++, TensorFlow, Python
  • Solution Architect

    2009 - 2016
    CID GmbH
    • Worked in a team of five computational linguists who designed methods for monitoring market sentiment on the internet and specialized in focused web crawling.
    • Communicated designs for natural-language-processing programs to a team of developers who implemented these into products marketed to corporate clients.
    • Worked on the development of a machine learning NLP pipeline.
    Technologies: Machine Learning, .NET, C#
  • Knowledge Engineer

    2011 - 2015
    Artificial Solutions
    • Worked in a multilingual team on the architecture of human-like natural language dialog systems for use on mobile, on the web, and in consumer electronics, becoming the team expert on advanced parsing of user input.
    • Made frequent visits to blue-chip companies in Silicon Valley and Asia while presenting technology solutions to potential clients.
    • Defined requirements, estimated time scales, and prototyped during project planning.
    • Provided consulting services to clients and partners developing their own dialog systems using my company's proprietary software.
    Technologies: Python

Experience

  • Customer Conversion on an Online Form (Development)
    https://fastdatascience.com/customer-conversion/

    I established that the signup form on a job board was causing the company to lose customers, because users were confused by some of its fields.

    Since users also uploaded their CV, which contains a lot of explicit personal information as well as implicit information such as the job type or salary that someone is looking for, I was able to train a deep neural network on several years of past signup data to analyze the CV and fill out some of the fields in the signup form automatically. This allowed a field to be removed, which boosted the conversion rate of the form by 7%, as measured by A/B testing.

  • Vehicle Unloading Times (Development)
    https://fastdatascience.com/vehicle-unloading-times/

    A client in the retail industry had a fleet of vehicles delivering produce at different times of the day. They used third-party logistics software to plan the delivery schedules. However, one element that was hard to plan was the unloading time of the vehicle when it arrived at the store.

    Fortunately, there was a system in place for recording vehicle ignition events, GPS location, and geofencing to identify the arrival and departure times of delivery vehicles, and past schedules were available to identify the quantity and type of product delivered on each drop, which driver was in charge, and the time of day and type of vehicle used.

    Using this trove of logged data, I was able to train a simple regression model that predicts the unloading time of any future delivery at the time the schedule is generated.

    This allowed the client to save money on driver overtime, disruption caused by late deliveries, and fines due to drivers working longer than their legally permitted hours.

  • Analysis of Clinical Trials (Development)
    https://fastdatascience.com/clinical-trials-analysis/

    When a pharmaceutical company develops a drug, it needs to pass through several phases of trials before regulators can approve it.

    Before the trial is run, the drug developer writes a document called a protocol. This contains vital information such as how long the trial will run, what the risk to participants is, and what kind of treatment is being investigated.

    The problem is that each protocol is up to 200 pages long, and the structure can vary.

    For one pharmaceutical company, I developed and trained a deep learning tool to predict more than 50 output variables from a clinical trial protocol. This allows pharma companies and regulators to analyze and quantify large numbers of protocols, allowing more accurate cost estimation.

    The technique can be extended to other industries where large unstructured or semi-structured documents are the norm.

  • Finding Molecules and Proteins in Scientific Literature (Development)
    https://fastdatascience.com/finding-molecules-and-proteins-in-scientific-literature/

    I have worked on several different projects where a client needed to parse scientific literature and identify occurrences of molecules or proteins.

    As an example, consider aspirin. The name is still a trademark of Bayer in some countries. But in a paper, it could appear as acetylsalicylic acid, 2-acetoxybenzenecarboxylic acid, C9H8O4, or under one of a number of identifiers such as DB00945. There could also be identifiers that refer to other molecules or identifiers that refer to only one version of a molecule.

    I have developed several tried and tested techniques to disambiguate these terms. Usually, I need several annotated examples to start with. I then train a machine learning model to learn from these examples and annotate new publications as they come in.
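
    As a point of comparison, the simplest baseline for this task is a dictionary lookup. The sketch below is that baseline, not one of the machine learning techniques described here. It uses only the surface forms from the aspirin example; the choice of DB00945 as the canonical identifier and the function name are assumptions made for illustration.

```python
import re

# Surface forms from the aspirin example, all mapped to one canonical ID.
# Keys are lower-cased so that matching can ignore case.
SYNONYMS = {
    "aspirin": "DB00945",
    "acetylsalicylic acid": "DB00945",
    "2-acetoxybenzenecarboxylic acid": "DB00945",
    "c9h8o4": "DB00945",
    "db00945": "DB00945",
}

# Longest names first, so a long name is preferred over a shorter one that
# overlaps it. The lookarounds stop matches inside longer words.
_ALTERNATIVES = "|".join(
    re.escape(name) for name in sorted(SYNONYMS, key=len, reverse=True)
)
_PATTERN = re.compile(rf"(?<!\w)(?:{_ALTERNATIVES})(?!\w)", flags=re.IGNORECASE)


def find_mentions(text: str) -> list[tuple[str, str]]:
    """Return (matched text, canonical ID) pairs in order of appearance."""
    return [(m.group(0), SYNONYMS[m.group(0).lower()]) for m in _PATTERN.finditer(text)]
```

    A lookup like this cannot resolve the ambiguous cases, such as a formula shared by several compounds or an identifier that covers only one version of a molecule. Those cases are where annotated examples and a trained model are needed.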

Skills

  • Languages

    Python, Python 3, SQL, Java, R, C++, C#
  • Libraries/APIs

    Scikit-learn, NumPy, SpaCy, NLTK, SciPy, TensorFlow, PySpark, MLlib
  • Tools

    Azure ML Studio, PyCharm
  • Paradigms

    Data Science
  • Platforms

    Azure, Google Cloud Platform (GCP), Docker, MacOS, Unix, Amazon Web Services (AWS), Windows, Linux
  • Other

    Natural Language Understanding (NLU), Natural Language Processing (NLP), Text Classification, Machine Learning, Microsoft Azure Machine Learning (ML), Classification Algorithms, Convolutional Neural Networks, Dialog Systems, Natural Language Generation (NLG), Spanish, German, Programming, Clustering, AWS, Graphics Processing Unit (GPU), Custom BERT, Computer Vision, Speech Recognition, Speech Synthesis, Sentiment Analysis
  • Frameworks

    Flask, Spark, .NET
  • Industry Expertise

    Physics
  • Storage

    PostgreSQL

Education

  • Master's degree in Computer Speech, Text, and Internet Technology
    2007 - 2008
    University of Cambridge - Cambridge, UK
  • Master's degree in Physics
    2003 - 2007
    University of Durham - Durham, UK

Certifications

  • Azure Data Science Associate
    APRIL 2020 - APRIL 2022
    Microsoft
