Sergey Antopolskiy

Verified Expert in Engineering

Data Scientist and Developer

Location

Reggio Emilia, Province of Reggio Emilia, Italy

Toptal Member Since

February 2, 2021

Sergey is an expert data scientist and machine learning engineer. He has solved data analytics, visualization, and modeling problems and developed architecture and pipelines for data-centered workflows. He automated manual production steps in drug manufacturing using ML models, which led to a throughput increase of 200%. Sergey has a scientific background with extensive industry experience and studies complex problems in depth to find the approach best suited to generating business value.

Portfolio

ClinicalMind, LLC

Python, Google API, Generative Pre-trained Transformers (GPT), GPT...

IBM Process Mining (formerly myInvenio)

Data Reporting, Data Science, Pandas, Java 8, You Only Look Once (YOLO), Conda...

CAMLIN Group

Gantt Chart, Project Management, Data Preprocessing, Data Pipelines...

Experience

Python 3 - 8 years
Jupyter - 7 years
Feature Engineering - 6 years
Data Visualization - 6 years
Machine Learning - 6 years
Biometrics - 5 years
Gradient Boosting - 3 years
Azure - 2 years

Availability

Part-time

Preferred Environment

Conda, Azure, Docker, Bash Script, Jupyter, PyCharm, MacOS, Linux, Python

The most amazing...

...project I've developed is a distributed cloud platform for on-demand training of ML models to find the root causes of abnormalities in business processes.

Work Experience

Data Science Consultant

2021 - 2022

ClinicalMind, LLC

Developed data collection pipelines for various sources, such as PubMed, Twitter, medical forums, Open Payments database, claims database, Google Books, and more. These data were merged and deduplicated using specialized ML algorithms.
Built an algorithm for surfacing and analyzing lexicon differences across audiences on social media, traditional media, and scientific publications.
Constructed bespoke metrics for finding digital opinion leaders in social media, scientific publications, books, and trends. A name disambiguation algorithm made it possible to find opinion leaders who are prominent across all areas or within a specific area.

Technologies: Python, Google API, Generative Pre-trained Transformers (GPT), Natural Language Processing (NLP), GPT, Data Engineering, Key Performance Metrics, APIs, Data Collection, Data Pipelines, Linear Algebra, Twitter API, PubMed & Mendeley APIs, ELT

Senior Data Scientist

2019 - 2021

IBM Process Mining (formerly myInvenio)

Designed and implemented a cloud-based platform for the on-demand training and deployment of ML models for process mining and business process analysis, which enabled several novel ML algorithms and substantially shortened time to market (TTM) for further ML-based projects.
Codesigned and implemented a novel AI/XAI algorithm for extracting the root causes of business process anomalies; this bleeding-edge algorithm led to advertising our product as AI-enabled, multiple sales to large customers, and partnering with IBM.
Automated manual production steps in drug manufacturing using ML models, leading to a throughput increase of 200% while maintaining labor costs. Proposed cost-effective improvements estimated to reduce manual involvement by 50%.
Designed and prototyped an NLP-based ML pipeline, which allowed unsupervised identification of business threads from screen captures and click-and-key logs of PC user activity; implemented a proof-of-concept PC application to obtain the necessary data.
Created a data quality control pipeline for business process logs and designed a wizard-based UX to guide users in fixing issues with their data; it reduced the load on the helpdesk, as many requests were related to data issues unknown to clients.
Implemented a simple yet powerful engine for business rule mining by extending the Java library for decision trees; designed UI/UX for presenting its results to the users; the sales department often cited the new feature as a major selling point.
Created numerous automated CI/CD pipelines and development and testing tools for the internal use of the data science team, which streamlined workflows and cut the manual labor related to testing to roughly a third, as estimated by coworkers.

Technologies: Data Reporting, Data Science, Pandas, Java 8, You Only Look Once (YOLO), Conda, DevOps, Azure DevOps, Business Rules, Decision Trees, Logistic Regression, Tokenization, Topic Modeling, WinAPI, Tesseract, OCR, Machine Learning Operations (MLOps), Git, Azure Tables, Azure Table Storage, SQL, Project Discovery, Process Mining, Business Process Analysis, Classification Algorithms, Azure Blob Storage API, Bash Script, Azure Kubernetes Service (AKS), Azure Functions, Explainable Artificial Intelligence (XAI), LSTM Networks, Gradient Boosting, Gradient Boosted Trees, Azure Machine Learning, Azure Blobs, Azure, Python, Computer Vision, Kubernetes, GPT, Generative Pre-trained Transformers (GPT), Natural Language Processing (NLP), Random Forests, Artificial Intelligence (AI), REST APIs, RESTful Development, MySQL, PostgreSQL, Root Cause Analysis, Anomaly Detection, MLflow, Apache Airflow, XGBoost, ETL, Amazon Web Services (AWS)

Senior Research Scientist/Senior Data Scientist

2017 - 2019

CAMLIN Group

Discovered and fixed a results-invalidating bug in the previously used data analysis pipeline; without my involvement, wrong results would have been published in a major publication and would have invalidated the filed patent.
Designed and implemented a data preprocessing pipeline for multidimensional biometric data and a deep-learning model for real-time prediction of user intentions from that data.
Ported a TensorFlow-based ML model to the Jetson TX2 edge device, which allowed model training, deployment, and real-time prediction in a portable battery-powered form.
Co-led a multistage project, planning and coordinating work between several parties, including a scientific research lab, industrial R&D, and an engineering team, and communicated with the stakeholders.
Participated in patenting the discoveries, including an ML model, as an end-to-end approach for Brain-Computer Interface architecture (https://patents.google.com/patent/WO2020211958A1).
Created and taught an extensive 3-week course on Applied Data Science for interns and junior employees and conducted internal training.

Technologies: Gantt Chart, Project Management, Data Preprocessing, Data Pipelines, Principal Component Analysis (PCA), Unsupervised Learning, Classification Algorithms, Logistic Regression, Convolutional Neural Networks (CNN), Deep Learning, TensorFlow, Data Quality Analysis, Complex Data Analysis, Time Series, Time Series Analysis, Accelerometers, Experimental Design, Biometrics, HDF, MATLAB, Python 3, Keras, Artificial Intelligence (AI), Anomaly Detection

Experience

Cloud Platform for Data-agnostic On-demand ML Training, Deployment, and Serving

The need was to create a data-agnostic cloud-based platform for ML experiments and production lifecycles, decoupled from the main business analytics software.

I designed and implemented an Azure-based distributed platform, which included (1) Serverless Azure functions as the platform API and workflow orchestrator, (2) Azure Blob Storage as the datalake, (3) MSSQL DB (later moved to Azure Tables for convenience) for storing states and intermediate results, (4) Azure ML Compute Clusters for running ML algorithms and producing artifacts, (5) Azure Kubernetes Cluster for deploying the models and serving the predictions, (6) Git repository of algorithms which can be run on-demand, (7) CI/CD pipelines.

When a user creates a project and uploads the dataset to the main software, the platform accesses the data and runs a series of ML experiments, producing models, predictions and explanations. The predictions and explanations are submitted through the REST API back to the main software, where they are displayed to the user in various scenarios, providing them with detailed insight into their dataset and allowing better decision making. Some of the models are automatically deployed as endpoints to provide real-time predictions.

Increased Throughput of a Pharmacological Production Line Via Improved Process Model

Several stages of drug manufacturing at the client's production plant had a lead time of 1.5 hours per batch and needed constant manual intervention, which prevented the desired scaling of production. The client and I identified the root cause as the lack of a precise model for the amount of chemicals to add to each batch to achieve the desired product properties.

Using historic process data obtained from the client, I created a precise model, which allowed me to combine several production steps without the direct involvement of the personnel. I achieved this by extracting necessary variables from the time-series signals of the production line sensors and combining them in polynomial regression. This reduced the lead time of the bottleneck manufacturing step to slightly less than 30 minutes, leading to a corresponding increase in the throughput (+200% as estimated by the client) while also reducing the load on the technical staff. I packaged and shipped the model, meeting specific client technical requirements.
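As a rough check of the figures above, the link between lead time and throughput can be worked out directly. This is only a sketch: it assumes throughput at the bottleneck step is inversely proportional to its lead time, it rounds "slightly less than 30 minutes" to 30, and the function name is hypothetical.

```python
def throughput_gain_percent(old_lead_time_min: float, new_lead_time_min: float) -> float:
    """Relative throughput increase (in percent) when the bottleneck lead time drops."""
    if old_lead_time_min <= 0 or new_lead_time_min <= 0:
        raise ValueError("lead times must be positive")
    # Assumption: batches per hour scale as 1 / lead time at the bottleneck step.
    speedup = old_lead_time_min / new_lead_time_min
    return (speedup - 1.0) * 100.0


# 1.5 hours (90 min) per batch reduced to about 30 min per batch.
print(throughput_gain_percent(90, 30))  # 200.0
```

Three times as many batches fit into the same time window, which matches the +200% estimate.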

While working on the project, I proposed several cost-effective improvements to the production line, which are estimated to reduce manual involvement by 50% while increasing the production's precision.

Data-agnostic Business Rules Mining (BRM) Algorithm Using Extended Decision Trees

The goal:
- Extract business rules describing the conditions under which a process goes from activity A to one of the possible next activities (B, C, etc.)
- Estimate the consistency of these rules.
- Present this in a user-friendly form.
- Take <5 seconds on 1 million business cases.
- Integrate easily with the main Java software.

I decided to use Java 8 for that project. I extended a publicly available basic Decision Tree library with many necessary functions, such as pruning, metric estimation, tracking groups, working with missing data, and more. With that and feature engineering/augmentation pipelines, the algorithm obtains classification models for each transition and translates them to text rules, such as "A to B: when X > 10, or X < 1 and Y > 100". These are easily interpreted by business users. Metrics are presented in a user-friendly way, allowing users to judge the consistency of the identified rules. I designed the UX/UI for displaying and exploring the insights and adjusting them to the specific users' datasets (e.g., users can make rules more complex and precise if they want).
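The step of turning a fitted tree into a text rule can be sketched as follows. This is an illustration in Python rather than the Java 8 code used in the project, and it is not the library's API: the `Node` class, the function names, and the toy tree are all hypothetical. It also uses `<=` for the lower branch, so the wording differs slightly from the example rule above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Condition = Tuple[str, str, float]  # (feature, operator, threshold)


@dataclass
class Node:
    """A binary decision-tree node; a leaf has `label` set and no children."""
    feature: Optional[str] = None
    threshold: Optional[float] = None
    low: Optional["Node"] = None   # branch taken when feature <= threshold
    high: Optional["Node"] = None  # branch taken when feature > threshold
    label: Optional[str] = None


def paths_to(node: Node, target: str,
             prefix: Tuple[Condition, ...] = ()) -> List[Tuple[Condition, ...]]:
    """Collect every root-to-leaf path that ends in a leaf labelled `target`."""
    if node.label is not None:
        return [prefix] if node.label == target else []
    low_step = (node.feature, "<=", node.threshold)
    high_step = (node.feature, ">", node.threshold)
    return (paths_to(node.low, target, prefix + (low_step,))
            + paths_to(node.high, target, prefix + (high_step,)))


def describe(path: Tuple[Condition, ...]) -> str:
    """Render one path as text, keeping only the tightest bound per feature."""
    bounds: Dict[str, Dict[str, float]] = {}
    for feature, op, value in path:
        slot = bounds.setdefault(feature, {})
        pick = min if op == "<=" else max
        slot[op] = pick(value, slot.get(op, value))
    return " and ".join(f"{feature} {op} {value:g}"
                        for feature, slot in bounds.items()
                        for op, value in slot.items())


def rule_text(tree: Node, source: str, target: str) -> str:
    """Build a rule such as 'A to B: when ...' from all paths reaching `target`."""
    paths = paths_to(tree, target)
    if not paths:
        return f"{source} to {target}: never"
    return f"{source} to {target}: when " + ", or ".join(describe(p) for p in paths)


# Toy tree for the transition out of activity A (made-up thresholds).
tree = Node("X", 1,
            low=Node("Y", 100, low=Node(label="C"), high=Node(label="B")),
            high=Node("X", 10, low=Node(label="C"), high=Node(label="B")))

print(rule_text(tree, "A", "B"))  # A to B: when X <= 1 and Y > 100, or X > 10
```

Collapsing repeated tests on the same feature (here `X > 1` and `X > 10` become `X > 10`) keeps the rule short enough for a business user to read.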

BRM became one of the core features of the software and a key sales point. It became a basis for process simulations, another core feature.

Brain-computer Interface for Neural Menu Navigator

https://arxiv.org/pdf/2004.11978.pdf

We created and tested a brain-computer interface prototype based on the real-time analysis of multidimensional bioelectrical signals obtained from the scalp of a car driver (EEG), showing the selected items in the infotainment menu in a completely hands-free way.

I designed and implemented an EEG data preprocessing pipeline and ML model (based on the convolutional neural network architecture), which was trained on the driver's data and in real-time predicted which infotainment function they wanted to select (navigation, music, etc.). The ML model was ported to the battery-powered portable edge device NVIDIA Jetson TX2, allowing it to work independently inside a car. To increase the project business value, we also collected rich motion data using a set of accelerometer sensors to create future models predicting steering actions.

I co-led this project; in particular, I coordinated the activity between neuroscientists, engineers, and our research partners from Toyota Motor Europe, and designed the prototype tests and data collection.

This work resulted in several papers (for a detailed account, see the project URL) and a patent I co-authored (https://patents.google.com/patent/WO2020211958A1).

Time-frequency Signatures in Brain Activity Related to Car Control During Driving

I analyzed the electroencephalographic (EEG) dataset to extract patterns related to the driving actions: braking, acceleration, and steering. The data consisted of EEG, accelerometer, and driving simulator recordings, all of which were multidimensional time series.

I was invited to the project at a late stage; however, while analyzing the previous work, I found a serious bug in the data analysis, which invalidated the results about to be published. Consequently, I was asked to join the project full-time to improve the analysis, which was eventually published as a scientific paper and partially patented (https://patents.google.com/patent/WO2019025000A1).

In particular, my work involved synchronizing the data streams from different devices, extracting and filtering event triggers, performing PCA, and decomposing EEG signals into independent components (ICA), with subsequent time-frequency statistical analysis.
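One common way to line up event triggers from one device with samples from another is nearest-timestamp matching. The sketch below shows only that idea and is not the project's actual pipeline: it assumes both devices already share a clock, and the function names, sample times, and the `max_gap` tolerance are hypothetical.

```python
from bisect import bisect_left
from typing import List, Sequence


def nearest_index(timestamps: Sequence[float], t: float) -> int:
    """Index of the sample in sorted `timestamps` that is closest to time `t`."""
    if not timestamps:
        raise ValueError("timestamps must not be empty")
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Pick whichever neighbour is closer; ties go to the earlier sample.
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1


def align_events(sample_times: Sequence[float],
                 event_times: Sequence[float],
                 max_gap: float) -> List[int]:
    """Map each event to its nearest sample index, dropping events farther than `max_gap`."""
    aligned = []
    for t in event_times:
        idx = nearest_index(sample_times, t)
        if abs(sample_times[idx] - t) <= max_gap:
            aligned.append(idx)
    return aligned


eeg_times = [0.000, 0.004, 0.008, 0.012, 0.016]  # made-up samples at 250 Hz, in seconds
brake_events = [0.0051, 0.0119, 0.5]             # made-up trigger times, in seconds

print(align_events(eeg_times, brake_events, max_gap=0.004))  # [1, 3]
```

The last trigger falls outside the recording, so it is discarded instead of being attached to the final sample.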

Skills

Languages

Python 3, Python, Bash Script, SQL, Java 8

Libraries/APIs

Azure Blob Storage API, Pandas, REST APIs, TensorFlow, Accelerometers, cuDNN, WinAPI, Keras, Google API, Twitter API, PubMed & Mendeley APIs, XGBoost

Tools

Jupyter, Azure Machine Learning, Git, PyCharm, MATLAB, Azure Kubernetes Service (AKS), You Only Look Once (YOLO), LabVIEW, Apache Airflow

Paradigms

Data Science, Azure DevOps, DevOps, Test-driven Development (TDD), Synthetic Data Generation, RESTful Development, UX Design, UI Design, Anomaly Detection, Key Performance Metrics, ETL

Platforms

Azure Functions, Docker, Azure, Linux, MacOS, Kubernetes, Amazon Web Services (AWS)

Storage

Azure Blobs, Azure Table Storage, Azure Tables, Data Pipelines, MySQL, PostgreSQL

Other

Data Visualization, Data Analytics, Machine Learning, Biometrics, Principal Component Analysis (PCA), Feature Engineering, Biomedical Skills, Experimental Design, Complex Data Analysis, Data Quality Analysis, Logistic Regression, Classification Algorithms, Data Preprocessing, Gradient Boosted Trees, Gradient Boosting, Process Mining, Machine Learning Operations (MLOps), Decision Trees, Neuroscience, Data Preparation, Health IT, Experimental Research, Scientific Data Analysis, Polynomial Regression, Linear Regression, Artificial Intelligence (AI), Conda, Unsupervised Learning, Clustering, Convolutional Neural Networks (CNN), Time Series Analysis, Non-negative Matrix Factorization (NMF), HDF, Time Series, Deep Learning, Gantt Chart, Explainable Artificial Intelligence (XAI), Business Process Analysis, Project Discovery, OCR, Tesseract, Business Rules, APIs, EEG, EEG Libraries for Python, Computational Biology, Computational Statistics, Statistics, Statistical Modeling, Simulations, Data Reporting, Sensor Data, Client Reporting, Random Forests, Root Cause Analysis, Architecture, AI Design, Digital Signal Processing, LSTM Networks, Topic Modeling, Tokenization, Computer Vision, Natural Language Processing (NLP), MLflow, Data Engineering, Data Collection, Linear Algebra, ELT, GPT, Generative Pre-trained Transformers (GPT)

Industry Expertise

Project Management

Education

2011 - 2016

Ph.D. in Systems Neuroscience

International School for Advanced Studies - Trieste, Italy

2014 - 2014

Coursework (Exchange Student) in Computational Neuroscience

Frankfurt Institute for Advanced Studies - Frankfurt, Germany

2006 - 2011

Master's Degree in Physiology

Lomonosov Moscow State University - Moscow, Russia
