Senior Data Scientist
2021 - PRESENT
SONY
- Experimented with various technologies to build a better user experience as part of the PlayStation team.
- Gathered requirements from various stakeholders, then collected and aggregated data from multiple sources using technologies such as Alation, Snowflake, SageMaker, AWS EMR, and Databricks.
- Worked on player-to-player recommenders based on game activity and on computer vision-based systems such as shot boundary detectors, blur detectors, salient object detectors, and automatic cropping from game videos.
- Also worked on changing avatar emotions based on people’s faces using GANs, Dockerizing projects that had to be shared or deployed, and setting up large compute clusters.
Technologies: OpenCV, Python 3, Spark, AWS, Deep Learning, PyTorch

University of London Tutor
2020 - PRESENT
University of London
- Provided online tutoring for the Bachelor's Degree in Computer Science and Master's Degree in Data Science.
- Answered student questions about Financial Data Modelling, Hadoop, Spark, Python, and cluster processing.
- Organized webinars for the students that covered a range of topics and prepared them for their mid-terms and finals.
- Graded coursework and exams for various modules such as Big Data and Software Development.
Technologies: Data Science, Spark, Hadoop, MapReduce, Financial Data Modelling, Jupyter

Senior Data Scientist
2021 - 2022
Future Anthem
- Aggregated and wrangled data using PySpark in Databricks on Azure.
- Set up a recommendation system with 3 subsystems that would recommend games to users.
- Built a user-item recommendation subsystem based on cosine similarity to make recommendations to new users.
- Created a sequence-based recommendation system that could be used to make recommendations to early-stage users.
- Constructed a collaborative filtering system based on implicit feedback using LightFM. The system was trained using the number of plays a user had in a game.
- Built dashboards and performed data analysis to understand how new Future Anthem customers are performing and to help them get better results.
- Delivered part of the work through engineers from the Disruptive Engineering team, whom I managed.
Technologies: Spark, Recommendation Systems, Python 3, Delta Lake, Microsoft Power BI

Senior Data Scientist
2020 - 2022
ContractPod AI
- Worked on information extraction from legal documents.
- Built an API to understand whether contracts are signed or not based on computer vision and NLP.
- Researched methodologies for signature detection and obtained open-source, free data to train on.
- Fine-tuned YOLO to detect signatures with an accuracy of 80%.
- Built a dotted line detector to extract lines in documents using OpenCV.
- Built a graph that represented the document and all the extractions.
- Built a signature requirement classifier that used an ensemble of mechanisms such as word density, dotted line presence, and neighboring words. The classifier had 90% accuracy on the test set.
- Built a matching algorithm that matched signature requirements to the signatures. The API was deployed on CUDA-enabled Docker containers.
- Designed and conducted interviews to expand the team and offered support and mentorship to team members.
- Built a contract clause comparison API to determine whether clauses in contracts match pre-approved clauses across multiple languages. Used a pre-trained BERT transformer that was fine-tuned with in-house data and deployed with Docker containers.
Technologies: Transformers, Data Science, Natural Language Processing (NLP), NLTK, spaCy, Flask, Hugging Face, Jupyter

Senior Data Scientist
2020 - 2020
Sprout AI
- Led a small team of consultants to improve information extraction from claims.
- Performed error analysis to understand current system results and what subsystems needed to be improved.
- Annotated damaged items in insurance claims to build a custom model.
- Trained an NER model using Hugging Face Transformers to detect damaged items in claims, reaching an F1 score of 75%.
Technologies: Data Science, Natural Language Processing (NLP), spaCy, Jupyter

Senior Data Scientist
2020 - 2020
Foreign, Commonwealth & Development Office - UK Government
- Defined and explained a number of experiments that could improve information extraction from news worldwide.
- Scraped articles from news websites, then cleaned and deduplicated them.
- Built an MVP of an automated topic detection mechanism in the news using LDA and extracted topic names.
- Aggregated processed data into a Power BI visualization.
Technologies: Gensim, spaCy, NLTK, Microsoft Power BI, Agile Data Science, Jupyter

Senior Data Scientist
2020 - 2020
Fortress AI
- Consulted on the strategic direction to implement machine learning on network devices for home environments.
- Researched ad blocking with machine learning, scraped ads, and built an MVP of an ad-blocking mechanism that applied TF-IDF and logistic regression to JavaScript.
- Researched the use of machine learning for quality of service (QoS) and produced a report.
Technologies: Web Scraping, Scikit-learn, Pandas, Jupyter

Technical Trainer
2020 - 2020
OpenClassrooms
- Developed a practical introductory course on deep learning.
- Wrote a 3-part course that aimed to introduce students to deep learning, focusing on practicality and simple explanations. The course had the main theme of students working for a pizza company that uses machine learning.
- Focused the first part on the differences between traditional machine learning and deep learning; the second on neurons, how they work, and fully connected networks; and the third part on convolutional neural networks and recurrent neural networks.
- Developed a number of practical examples that students are encouraged to follow and extend in their Jupyter Notebooks, both to deepen their understanding and to keep as a reference later on.
Technologies: Linux, Keras, Teamwork, Data Visualization, Pandas, Machine Learning, Jupyter Notebook, Python 3, Jupyter

Senior Data Scientist
2020 - 2020
Cabinet Office
- Worked on the discovery and alpha phases aimed at understanding user problems and creating MVPs.
- Defined and explained a number of experiments that could improve knowledge management, such as faceted search and classifiers for different tags.
- Participated in a number of user interviews to better understand their working methods.
- Wrote a number of small-scale experiments to test ideas.
- Built, cleaned, and labeled datasets for the tasks.
- Created a document type classifier that distinguished between documents based on keywords and structure with an accuracy of 90%. The system used Pika and spaCy to extract features and Scikit-learn to build the classifier.
- Created a duplicate document and near-duplicate document detector using MinHash to make it easy to avoid duplication and understand related documents.
- Built a 100,000-node knowledge graph using spaCy, DBpedia, Gensim, and Neo4j to better understand connections between people and important topics in the documents.
- Had the project featured in The Times: https://www.thetimes.co.uk/article/ai-trawls-20-000-miles-of-state-papers-j0l9k5gx9.
Technologies: Linux, Teamwork, Data Visualization, Pandas, Machine Learning, Agile Data Science, Google Docs, Scikit-learn, Jupyter

Data Scientist | Machine Learning Engineer
2019 - 2020
Ernst & Young
- Researched public and internal information on ML models for mergers and acquisitions and participated in workshops to generate ideas for potential use cases of ML in the M&A process.
- Cleaned data to verify that entities existed at different points in time and that entities from different datasets were merged correctly based on dates.
- Created the first proof-of-concept models for applications of machine learning to M&A using Pandas and Random Forests in Scikit-learn.
- Set up the ML architecture to integrate with the engineering architecture in Azure and selected Databricks, which allows the use of Spark for cluster-based data processing and MLflow for experiment tracking and deployment into Kubernetes.
- Researched and experimented with a number of mechanisms for modeling imbalanced datasets: weight balancing, blagging (random forests where decision trees use undersampling), undersampling and oversampling, and transfer learning.
- Analyzed multiple data sources and selected complementary data sources such as CapIQ for financial data, Factiva for news, and Oxford Economics for forecasts.
- Managed the machine learning team and had duties such as planning the team's workload, providing guidance on priorities, planning the team structure and size, interviewing, and hiring.
- Participated in user interviews to help shape how we built the algorithms and the platform on which they would be run. A simple product and model explainability were key takeaways.
- Participated in a number of presentations with the aim of explaining how machine learning works and how it could be used by C-level stakeholders.
- Introduced a number of best practices in the team, such as random seed starts, to obtain accurate scores for our models.
Technologies: Linux, Keras, Teamwork, Data Engineering, Data Visualization, Pandas, Machine Learning, Agile Data Science, Imblearn, Scikit-learn, MLflow, Databricks, PySpark, Python, Jupyter, Data Scraping

Data Scientist and Machine Learning Engineer
2017 - 2019
Serendipity AI
- Helped put a news classifier into practice and created a topic/user-based news recommendation system using NLP.
- Used named entity detectors from spaCy, DBpedia, and Jaccard similarity together with Levenshtein distance to detect and match named entities in news and other text data.
- Developed a new vectorization method for the named entities detected in text and worked on a mechanism to qualify their expertise in different topics.
- Deployed Spark, Hadoop, and HBase on a cluster of three computers to speed up the machine learning processing.
- Developed an ML processing pipeline that allowed information to flow into HBase and processed it in parallel using PySpark. Every stage in the pipeline was designed as a microservice with access to only an input and an output table.
- Implemented a recommendation system using a neural network set up as an autoencoder and cosine similarity from Spotify Annoy.
- Brought an article judging system to production level. The system had a classification service and a training application. I used Celery to train every night and restart the judging service's worker pool when new models were available.
- Improved code quality and reduced repeated code across applications written in Flask and CherryPy by creating a shared library. Added a logging system based on Python logging that had handlers for local logging and Rollbar.
- Created a number of APIs using Flask that ran on AWS and connected to Neo4j.
- Set up a testing framework that would allow APIs to be tested before and after deployment using Jenkins and wrote integration tests for the APIs.
Technologies: Linux, Teamwork, Data Engineering, Data Visualization, Pandas, Machine Learning, Agile Data Science, spaCy, Gensim, Scikit-learn, HBase, PySpark, Python, Jupyter, Data Scraping

Data Scientist and Machine Learning Engineer
2017 - 2017
Cappfinity
- Researched and integrated an automatic machine learning algorithm picker in Python.
- Researched Auto-Sklearn (Bayesian optimization for algorithm selection), TPOT (genetic algorithms for feature processing and algorithm selection), and NEAT (genetic algorithms for neural network evolution).
- Developed the architecture for experimentation and result visualization for machine learning algorithms using services built with C# ASP.NET Core and Python-Flask, which communicate via REST and RabbitMQ.
- Built the system's presentation layer using Angular 4.
- Wrote a text extraction service from speech using Google Speech to Text API.
- Integrated MongoDB and connected all the services to it so that they can save processing results.
- Integrated all the applications in Docker with their own private network and Docker Compose to allow for continuous integration and faster deployment.
Technologies: Linux, Teamwork, Pandas, Machine Learning, Tree-Based Pipeline Optimization Tool (TPOT), Flask, TensorFlow, Scikit-learn, Python

Research Engineer
2016 - 2017
Oxehealth
- Led the data engineering team and worked on big data microservices that would connect cameras installed on-site with Oxehealth’s data warehouse.
- Worked on Oxehealth’s TechCrunch London live demo that connected a room in Oxford with a human being monitored to the stage in London.
- Designed and developed the microservices architecture for video data retrieval from customer sites using ZeroMQ, gRPC, and Boost Program Options and Property Tree for C++.
- Set up a VPN using pfSense to connect customer deployments to a central data repository.
- Built a breathing robot that could replicate different breathing patterns.
- Designed and developed an application that allowed for multiple room monitoring using Qt.
Technologies: Teamwork, Data Engineering, Machine Learning, RabbitMQ, ZeroMQ, Python, C++, C

Computer Vision and Algorithms Engineer
2016 - 2016
Meta Vision Systems
- Designed the full stack, from image capture and processing to point clouds sent over the network, using multiple threads and a pipeline architecture to measure oil pipes with lasers and cameras.
- Wrote general-purpose GPU (GPGPU) code to accelerate image processing algorithms (convolution and point extraction) via new kernels or through OpenCV, reducing processing time from 40 s to 40 ms for some code paths.
- Implemented K-means and ordinary least squares algorithms through OpenCV for finding points of interest and then line fitting.
- Designed and set up the network communication channels to transmit data, commands, and replies using Type Length Value (TLV) messages via Boost ASIO.
- Designed and developed a logging system using Microsoft ETW.
- Set up Point Cloud Library (PCL) for surface reconstruction and visualization of STL files and point clouds.
- Used Boost Property Tree to implement a configuration file parser that uses JSON files.
- Deployed Jenkins for automatic build verification and to run test cases.
Technologies: Linux, Teamwork, Machine Learning, CUDA, C++, C, OpenCV

Software Engineer
2013 - 2016
Qualcomm
- Wrote the first Windows driver for Qualcomm's NFC chip.
- Participated in a number of integration activities where I helped set up new platforms with our NFC chip.
- Worked on the launch of a Windows mobile phone that contained the chip I worked on.
- Advised other teams across the globe on Windows driver development.
- Developed a script in PowerShell for improving the team’s efficiency.
- Debugged customer and partner issues and those arising during testing.
- Trained new team members from different disciplines such as software engineering and testing.
Technologies: Linux, Teamwork, C++, C