Senior Data Scientist | 2020 - PRESENT | ContractPod AI
Technologies: Transformers, Data Science, Natural Language Processing (NLP), NLTK, SpaCy
- Worked on information extraction from legal documents.
- Built a feature that determines whether a contract is signed by converting each page into a graph whose nodes are words, lines, and signatures.
- Researched methodologies for signature detection and obtained freely available open-source data to train on.
- Fine-tuned YOLOv3 to detect signatures with 80% accuracy.
- Built a dotted line detector to extract lines in documents using OpenCV.
- Developed a signature requirement classifier that used an ensemble of signals such as word density, dotted line presence, and neighboring words. The classifier had 90% accuracy on the test set.
- Built a matching algorithm that matched signature requirements to the signatures.
- Created a clause comparison system to understand whether the clauses in contracts match approved clauses.
Senior Data Scientist | 2020 - PRESENT | Sprout AI
Technologies: Data Science, Transformers, Natural Language Processing (NLP), SpaCy
- Led a small team of consultants to improve information extraction from claims.
- Performed error analysis to understand the current system's results and which subsystems needed to be improved.
- Annotated damaged items in insurance claims to build a custom model.
- Trained an NER model with Hugging Face Transformers to detect damaged items in claims, reaching an F1 score of 75%.
MSc. Data Science Tutor | 2020 - PRESENT | University of London
Technologies: Data Science, Spark, Hadoop, MapReduce
- Answered student questions about Hadoop/Spark and cluster processing.
- Organized tutorials for the students to help them with their Hadoop/Spark questions.
- Graded coursework submissions as first and second marker.
Senior Data Scientist | 2020 - 2020 | Foreign, Commonwealth & Development Office - UK Government
Technologies: Gensim, SpaCy, NLTK, Microsoft Power BI, Agile Data Science
- Defined and explained a number of experiments that could improve information extraction from news around the world.
- Scraped articles from news websites, then cleaned and deduplicated them.
- Built an MVP of an automated topic detection mechanism for the news using LDA and extracted topic names.
- Aggregated processed data into a Power BI visualization.
Senior Data Scientist | 2020 - 2020 | Fortress AI
Technologies: Web Scraping, Scikit-learn, Pandas
- Consulted on the strategic direction to implement machine learning on network devices for home environments.
- Researched machine learning approaches to quality of service (QoS) and produced a report.
Technical Trainer | 2020 - 2020 | OpenClassrooms
Technologies: Linux, Keras, Teamwork, Data Visualization, Pandas, Machine Learning, Jupyter Notebook, Python 3
- Developed a practical introductory course on deep learning.
- Wrote a 3-part course that aimed to introduce students to deep learning with a focus on practicality and simple explanations. The course had the main theme of students working for a pizza company that uses machine learning.
- Focused the first part on the differences between traditional machine learning and deep learning; the second on neurons, how they work, and fully connected networks; and the third part on convolutional neural networks and recurrent neural networks.
- Developed a number of practical examples that the students are encouraged to follow and develop in their own Jupyter Notebooks to gain a better understanding and have a reference tool later on.
Senior Data Scientist | 2020 - 2020 | Cabinet Office
Technologies: Linux, Teamwork, Data Visualization, Pandas, Machine Learning, Agile Data Science, Google Docs, Scikit-learn
- Worked on the discovery and alpha phases aimed at understanding user problems and creating MVPs.
- Defined and explained a number of experiments that could improve knowledge management, such as faceted search and classifiers for different tags.
- Participated in a number of user interviews to better understand their ways of working.
- Wrote a number of small-scale experiments to test ideas.
- Built, cleaned, and labeled datasets for the tasks.
- Created a document type classifier that distinguished between documents based on keywords and structure with an accuracy of 90%. The system used Pika and SpaCy to extract features and Scikit-learn to build the classifier.
- Created a duplicate and near-duplicate document detector using MinHash to make it easy to avoid duplication and to find related documents.
- Built a 100,000-node knowledge graph using SpaCy, DBpedia, Gensim, and Neo4j to better understand connections between people and important topics in the documents.
- The project was mentioned in The Times: https://www.thetimes.co.uk/article/ai-trawls-20-000-miles-of-state-papers-j0l9k5gx9.
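For illustration, the MinHash idea behind the near-duplicate detector above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the project's code: the function names, the 3-word shingle size, the 128 hash functions, and the use of salted BLAKE2b as the hash family are all choices made here for the example.

```python
import hashlib
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (k=3 is an assumed default)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items, num_perm=128):
    """Build a MinHash signature: the minimum hash value per seeded hash function."""
    if not items:
        raise ValueError("cannot build a signature for an empty set")
    signature = []
    for seed in range(num_perm):
        # A different salt per seed gives a family of independent-looking hashes.
        salt = seed.to_bytes(8, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimate_similarity(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)
```

Two documents whose estimated similarity exceeds a chosen threshold would be flagged as near-duplicates. At larger scale, signatures are usually bucketed with locality-sensitive hashing rather than compared pairwise.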
Data Scientist and Machine Learning Engineer | 2019 - 2020 | Ernst & Young
Technologies: Linux, Keras, Teamwork, Data Engineering, Data Visualization, Pandas, Machine Learning, Agile Data Science, Imblearn, Scikit-learn, MLflow, Databricks, PySpark, Python
- Researched public and internal information on ML models for mergers and acquisitions and participated in workshops to generate ideas for potential use cases of ML in the M&A process.
- Cleaned data to ensure that entities existed at the relevant points in time and that entities from different datasets were merged correctly based on dates.
- Created the first proof of concept models for applications of machine learning for M&A using Pandas and Random Forests in Scikit-Learn.
- Set up the ML architecture to integrate with the engineering architecture in Azure, selecting Databricks because it allowed the use of Spark for cluster-based data processing, MLflow for experiment tracking and deployment into Kubernetes.
- Researched and experimented with a number of mechanisms for modeling imbalanced datasets: weight balancing, blagging (random forests where decision trees use undersampling), undersampling and oversampling, and transfer learning.
- Analyzed multiple data sources and selected complementary data sources such as CapIQ for financial data, Factiva for news, and Oxford Economics for forecasts.
- Managed the machine learning team and had duties such as planning the team's workload, providing guidance on priorities, planning the team structure and size, interviewing, and hiring.
- Participated in user interviews to help shape both how we build the algorithms and the platform on which they would be run. A simple product and model explainability were key takeaways.
- Participated in a number of presentations with the aim of explaining how machine learning works and how it could be used by C-level stakeholders.
- Implemented a number of best practices in the team, such as fixing the random seed at the start of each run, in order to get reliable, comparable scores for our models.
Data Scientist and Machine Learning Engineer | 2017 - 2019 | Serendipity AI
Technologies: Linux, Teamwork, Data Engineering, Data Visualization, Pandas, Machine Learning, Agile Data Science, SpaCy, Gensim, Scikit-learn, HBase, PySpark, Python
- Helped put a news classifier into practice and created a topic- and user-based news recommendation system using NLP.
- Used named entity detectors from SpaCy, DBpedia, and Jaccard similarity together with Levenshtein distance to detect and match named entities in news and other text data.
- Developed a new vectorization method for the detected named entities in text and worked on a mechanism that would qualify their expertise in different topics.
- Deployed Spark, Hadoop, and HBase on a cluster of three computers in order to speed up machine learning processing.
- Developed an ML processing pipeline that allowed information to flow into HBase and be processed in parallel using PySpark. Every stage in the pipeline was designed as a microservice that had access to only an input and an output table.
- Implemented a recommendation system using a neural network set up as an autoencoder and cosine similarity from Spotify Annoy.
- Brought an article-judging system to production. The system had a classification service and a training application. I used Celery to train every night and to restart the worker pool of the judging service when new models were available.
- Improved the code quality and reduced repeated code across applications written in both Flask and CherryPy by creating a shared library. Added a logging system based on Python logging that had handlers for local logging and Rollbar.
- Created a number of APIs using Flask that ran on AWS and connected to Neo4j.
- Set up a testing framework that would allow APIs to be tested before and after deployment using Jenkins, and wrote integration tests for the APIs.
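As an illustration of the entity-matching step mentioned above, the following Python sketch combines token-level Jaccard similarity with character-level Levenshtein distance. It is a hedged example, not the production logic: the helper names, the way the two scores are combined (taking the higher of the two), and the 0.8 threshold are assumptions made for this sketch.

```python
def jaccard(a, b):
    """Jaccard similarity between the lowercase token sets of two strings."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def match_entity(mention, candidates, threshold=0.8):
    """Return the best-scoring candidate name, or None if none reaches the threshold.

    The score is the higher of the token Jaccard similarity and a normalized
    edit similarity; this combination rule is an assumption of the sketch.
    """
    best, best_score = None, 0.0
    for candidate in candidates:
        distance = levenshtein(mention.lower(), candidate.lower())
        edit_score = 1 - distance / max(len(mention), len(candidate), 1)
        score = max(jaccard(mention, candidate), edit_score)
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None
```

Jaccard similarity tolerates reordered or missing words, while the edit-distance score catches misspellings inside a single word, which is why the two are often used together.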
Data Scientist and Machine Learning Engineer | 2017 - 2017 | Cappfinity
Technologies: Linux, Teamwork, Pandas, Machine Learning, Tree-Based Pipeline Optimization Tool (TPOT), Flask, TensorFlow, Scikit-learn, Python
- Researched and integrated an automatic machine learning algorithm picker in Python.
- Researched auto-sklearn (Bayesian optimization for algorithm selection), TPOT (genetic algorithms for feature processing and algorithm selection), and NEAT (genetic algorithms for neural network evolution).
- Developed the architecture for experimentation and result visualization for machine learning algorithms using services built with ASP.NET Core (C#) and Flask (Python), which communicated via REST and RabbitMQ.
- Built the system's presentation layer using Angular 4.
- Wrote a service that extracts text from speech using the Google Speech-to-Text API.
- Integrated MongoDB and connected all the services to it so that they could save processing results.
- Integrated all the applications in Docker with their own private network and Docker Compose to allow for continuous integration and faster deployment.
Research Engineer | 2016 - 2017 | Oxehealth
Technologies: Teamwork, Data Engineering, Machine Learning, RabbitMQ, ZeroMQ, Python, C++, C
- Led the data engineering team and worked on big data microservices that would connect cameras installed on-site with Oxehealth’s data warehouse.
- Worked on Oxehealth’s TechCrunch London live demo, which connected a room in Oxford, where a person was being monitored, to the stage in London.
- Designed and developed the microservices architecture for video data retrieval from customer sites using ZeroMQ, gRPC, and Boost Program Options and Property Tree for C++.
- Set up a VPN to connect customer deployments to a central data repository using pfSense.
- Built a breathing robot that could replicate different breathing patterns.
- Designed and developed an application that allowed for multiple room monitoring using Qt.
Computer Vision and Algorithms Engineer | 2016 - 2016 | Meta Vision Systems
Technologies: Linux, Teamwork, Machine Learning, CUDA, C++, C, OpenCV
- Designed the full stack from image capture and processing to point clouds sent over the network using multiple threads and a pipeline architecture in order to measure oil pipes with lasers and cameras.
- Wrote general-purpose GPU (GPGPU) code to accelerate image processing algorithms (convolution and point extraction), via new kernels or through OpenCV, reducing processing time from 40s to 40ms for some code paths.
- Implemented algorithms such as K-means and ordinary least squares through OpenCV for finding points of interest and then line fitting.
- Designed and set up the network communication channels for transmission of data, commands, and replies using Type Length Value (TLV) messages via Boost ASIO.
- Designed and developed a logging system using Microsoft ETW.
- Set up Point Cloud Library (PCL) for surface reconstruction and for visualization of STL files and point clouds.
- Used Boost Property Tree to implement a configuration file parser that uses JSON files.
- Deployed Jenkins for automatic build verification and to run test cases.
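To illustrate the Type Length Value (TLV) framing mentioned above, here is a short Python sketch of encoding and stream decoding. The original system was built in C++ with Boost ASIO; this example only shows the framing idea, and its field widths (a 1-byte type and a 2-byte big-endian length) and function names are assumptions, not the actual wire format.

```python
import struct

# Assumed header layout: 1-byte message type, 2-byte big-endian payload length.
HEADER = struct.Struct("!BH")


def encode_tlv(msg_type, payload):
    """Frame a payload as type + length + value bytes."""
    if not 0 <= msg_type <= 0xFF:
        raise ValueError("message type must fit in one byte")
    if len(payload) > 0xFFFF:
        raise ValueError("payload too long for a 2-byte length field")
    return HEADER.pack(msg_type, len(payload)) + payload


def decode_tlv_stream(buffer):
    """Split a byte buffer into complete (type, payload) messages.

    Returns (messages, remainder), where remainder holds the bytes of any
    trailing partial message to be kept until more data arrives.
    """
    messages = []
    offset = 0
    while len(buffer) - offset >= HEADER.size:
        msg_type, length = HEADER.unpack_from(buffer, offset)
        end = offset + HEADER.size + length
        if end > len(buffer):
            break  # value not fully received yet
        messages.append((msg_type, bytes(buffer[offset + HEADER.size:end])))
        offset = end
    return messages, bytes(buffer[offset:])
```

Because each message carries its own length, a receiver reading from a TCP stream can recover message boundaries even when reads return partial or concatenated messages.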
Software Engineer | 2013 - 2016 | Qualcomm
Technologies: Linux, Teamwork, C++, C
- Wrote the first Windows driver for Qualcomm's NFC chip.
- Participated in a number of integration activities where I helped set up new platforms with our NFC chip.
- Worked on the launch of a Windows mobile phone that contained the chip I worked on.
- Advised other teams across the globe on Windows driver development.
- Developed a script in PowerShell for improving the team’s efficiency.
- Debugged customer and partner issues and those arising during testing.
- Trained new team members from different disciplines such as software engineering and testing.