Abdul Rafey Tahir

Verified Expert in Engineering

Research Engineer and Developer

Location

Lahore, Punjab, Pakistan

Toptal Member Since

July 24, 2022

Abdul Rafey is a data scientist with five years of industry experience. He has worked on challenging problems and performed data preprocessing, analysis, and modeling on big data in the eCommerce, healthcare, finance, insurance, and safety and compliance domains. He is proficient in Python and data science libraries such as Pandas, NumPy, scikit-learn, PySpark, TensorFlow, Plotly, and Seaborn; AutoML frameworks such as H2O.ai; and recommendation system frameworks such as RecBole and LightFM.

Machine Learning, Algorithms, Data Analytics, Data Cleaning, Data Scientist, Agile Data Science, Predictive Analytics, Analytics, Exploratory Data Analysis, Artificial Intelligence (AI), Data Scraping, Web Scraping, Statistics, Time Series, Data Modeling, LLaMA 2, Signal Processing

Portfolio

QPharma, Inc.

Amazon Web Services (AWS), Big Data, Scala, Python, Graph Theory...

Motive

Python, Big Data, Machine Learning, SQL, PySpark, Ruby, Snowflake, Redash...

Ponte Energy Partners GmbH

Amazon SageMaker, AWS CloudFormation, Machine Learning Operations (MLOps)...

Experience

Data Science - 5 years, Machine Learning - 5 years, Python - 5 years, Git - 5 years, SQL - 3 years, Data Analysis - 3 years, Amazon Web Services (AWS) - 3 years, PySpark - 2 years

Availability

Part-time

Preferred Environment

Data Science, Machine Learning, Big Data, Python, Pandas, Scikit-learn, Amazon Web Services (AWS), Deep Learning, Data Analysis, Forecasting, NumPy

The most amazing...

...thing I've built is a real-time collision detection system using sensor data for Motive, Inc., which is a multi-billion dollar startup in Silicon Valley.

Work Experience

Senior Data Scientist

2022 - PRESENT

QPharma, Inc.

Developed a big data analytics pipeline to identify healthcare professional leaders at local and national levels based on referrals and prescriptions data. These are provided to pharma clients for maximum market penetration of new and existing brands.
Developed social media scrapers for Twitter and YouTube to gauge the social media influence of healthcare professionals (HCPs). This data is preprocessed and fed to a new analytics pipeline that identifies key opinion leaders in specific areas of medicine for pharma clients.
Led the conversion of the existing codebase from Scala to PySpark for better integration with the existing Python modules and faster execution of many functional blocks.

Technologies: Amazon Web Services (AWS), Big Data, Scala, Python, Graph Theory, Artificial Intelligence (AI), PySpark, Machine Learning Automation, Amazon Machine Learning, Machine Learning Operations (MLOps), SQL, Git, Large Data Sets, Data Scientist, Data Gathering, Data Scraping, Web Scraping, AWS CloudFormation, Identity & Access Management (IAM), AWS CodeBuild, Amazon EC2, Amazon S3 (AWS S3), Sentiment Analysis, Agile Data Science, MySQL, NumPy, Pandas, ChatGPT, Large Language Models (LLMs), DevOps, Docker, CI/CD Pipelines, Data Versioning, ETL Tools, CSV, APIs, Data Modeling, Predictive Analytics, Data Matching, Amazon, Analytics, Optimization, Exploratory Data Analysis, Selenium, Chatbots

Data Scientist

2021 - PRESENT

Motive

Developed an unsafe driving detection algorithm for Motive, a US-based multibillion-dollar startup. It uses sensor data to detect unsafe acceleration, braking, and cornering events by truck drivers in customer fleets, and the detected events are used for driver coaching.
Trained a real-time crash detection ML model using huge volumes of sensor data for Motive's safety product. The system saves event and video data, notifies authorities in minutes and helps save lives, exonerate drivers, and reduce insurance liability.
Built a smoothing algorithm in collaboration with the embedded team at Motive to improve the quality of raw sensor data from the electronic logging device in the customers' vehicles, improving the system's precision in catching hard events by 40%.
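
The profile does not say which smoothing method was used. Purely as an illustration of why smoothing can improve precision, the sketch below applies a centered moving average to raw accelerometer readings before thresholding for hard-brake events. The function names, window size, threshold, and sample values are all hypothetical.

```python
from statistics import fmean


def moving_average(samples, window=3):
    """Smooth a sequence with a centered moving average (the window shrinks at the edges)."""
    half = window // 2
    smoothed = []
    for i in range(len(samples)):
        lo = max(0, i - half)
        hi = min(len(samples), i + half + 1)
        smoothed.append(fmean(samples[lo:hi]))
    return smoothed


def hard_brake_indices(accel_g, threshold_g=-0.4):
    """Return the indices where longitudinal acceleration is at or below the threshold."""
    return [i for i, a in enumerate(accel_g) if a <= threshold_g]


# Hypothetical longitudinal acceleration in g. Index 2 is a one-sample noise spike;
# indices 6 to 8 are a sustained braking event.
raw = [0.0, -0.05, -0.9, -0.05, 0.0, -0.1, -0.6, -0.7, -0.65, -0.1, 0.0]
smoothed = moving_average(raw, window=3)

raw_events = hard_brake_indices(raw)            # includes the noise spike
smoothed_events = hard_brake_indices(smoothed)  # keeps only the sustained event
print(raw_events, smoothed_events)
```

On this toy signal the isolated spike is flagged in the raw data but not after smoothing, while the sustained braking event is flagged in both, which is the kind of false-positive reduction a smoothing step aims for.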

Technologies: Python, Big Data, Machine Learning, SQL, PySpark, Ruby, Snowflake, Redash, Amazon Web Services (AWS), Microsoft Excel, Spark SQL, Data Queries, Tableau, Data Visualization, Data Science, Statistics, Predictive Modeling, Predictive Learning, ETL, Signal Processing, Artificial Intelligence (AI), Data Reporting, Data Analytics, Data Engineering, Cloud, Jupyter Notebook, XGBoost, Data Cleaning, AI Design, Automation, Task Automation, AWS Fargate, Graphs, Classification, Data Pipelines, Jupyter, Amazon SageMaker, Machine Learning Automation, Machine Learning Operations (MLOps), Git, Docker, Large Data Sets, Unstructured Data Analysis, Data Scientist, Data Gathering, Amazon SageMaker Pipelines, AWS CodePipeline, Computer Vision, Amazon EC2, Amazon S3 (AWS S3), APIs, PostgreSQL, Neural Networks, Agile Data Science, MySQL, NumPy, Pandas, Generative Artificial Intelligence (GenAI), DevOps, CI/CD Pipelines, Data Versioning, CSV, FastAPI, Containerization, PyTorch, Data Modeling, Predictive Analytics, Trend Analysis, Amazon, Logistic Regression, A/B Testing, Analytics, Optimization, Exploratory Data Analysis, AI Model Training, H2O AutoML

ML Engineer

2023 - 2023

Ponte Energy Partners GmbH

Developed an Amazon SageMaker pipeline to support training, processing, batch transformation, and inference functionality for ML models that predict price variations on the company's renewable energy trading platform.
Restructured a large portion of the codebase, set up debug configs for local model execution, and optimized CI/CD scripts and several functionalities for efficient data loading and processing, including the use of Manifest files and property files.
Adopted new tools such as Typer for parsing CLI arguments and contextlib for building the dependency wheel as a background process while the pipeline executes.

Technologies: Amazon SageMaker, AWS CloudFormation, Machine Learning Operations (MLOps), Machine Learning, Identity & Access Management (IAM), AWS CodeBuild, AWS CodePipeline, Amazon SageMaker Pipelines, Data Versioning, ETL Tools, CSV, Containerization, Predictive Analytics, Amazon, High-frequency Trading (HFT), Optimization, AI Model Training

Data Scientist

2023 - 2023

Neyl Skalli

Developed a web scraper for Transfermarkt.com to scrape data for soccer players. Successfully built and deployed it on AWS Glue to scrape data for 2,000+ teams (more than 60,000 players).
Utilized scraped data to train an unsupervised machine learning model, specifically K-Medoids clustering, enabling effective grouping of players based on their statistics, rankings, and valuation.
Played a key role in integrating the trained model into the client's platform, allowing users to receive the top 5 most similar players based on their search queries.

Technologies: Python, Web Scraping, Data Scraping, Unstructured Data Analysis, Data Scientist, Data Gathering, Amazon EC2, Amazon S3 (AWS S3), NumPy, Pandas, CSV, Analytics, AI Model Training, Selenium

Data Scientist

2020 - 2021

CUNA Mutual Group

Developed a machine learning model to predict which insurance advisors would not be able to sell a product in the following 12 months based on three years of historical sales data. Trained and deployed the model on the Azure cloud.
Built a model to forecast which of the credit unions the company did business with would survive the two years after COVID-19 hit, based on historical data going as far back as 1990.
Developed a weighted-average scoring algorithm that rates insurance advisors on their performance over the last four quarters to identify top-, medium-, and low-performing advisors.
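
A weighted-average score over four quarters could look roughly like the sketch below. The weights (heavier on recent quarters), the 0-100 score scale, the tier cutoffs, and the advisor data are hypothetical; the profile does not state the values that were actually used.

```python
def weighted_score(quarterly_scores, weights=(1, 2, 3, 4)):
    """Weighted average of quarterly scores, oldest first; recent quarters weigh more."""
    if len(quarterly_scores) != len(weights):
        raise ValueError("expected one score per weight")
    return sum(s * w for s, w in zip(quarterly_scores, weights)) / sum(weights)


def tier(score, top_cutoff=75.0, low_cutoff=50.0):
    """Map a score to a performance tier using hypothetical cutoffs."""
    if score >= top_cutoff:
        return "top"
    if score >= low_cutoff:
        return "medium"
    return "low"


# Hypothetical advisors with quarterly performance on a 0-100 scale (oldest quarter first).
advisors = {
    "advisor_a": [60, 70, 80, 90],
    "advisor_b": [50, 60, 60, 60],
    "advisor_c": [70, 50, 30, 20],
}
tiers = {name: tier(weighted_score(scores)) for name, scores in advisors.items()}
print(tiers)
```

Weighting recent quarters more heavily means an improving advisor outranks a declining one with the same simple average, which is one common reason to prefer a weighted scheme.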

Technologies: Python, Data Analysis, Machine Learning, Time Series Analysis, Microsoft Excel, Data Queries, Statistical Modeling, Azure, Data Science, Statistics, Predictive Modeling, Predictive Learning, ETL, Data Reporting, Data Analytics, SQL Server 2016, Statistical Analysis, R, Google Cloud Platform (GCP), Cloud, Risk Analysis, Jupyter Notebook, XGBoost, Agent-based Modeling, Data Cleaning, AI Design, API Integration, Automation, Task Automation, Graphs, Classification, Text Classification, Financial Modeling, Jupyter, Git, Unstructured Data Analysis, Data Scientist, Data Scraping, Web Scraping, PostgreSQL, Recurrent Neural Networks (RNNs), Neural Networks, Sentiment Analysis, Agile Data Science, MySQL, NumPy, Pandas, Stock Trading, CSV, Excel 365, Time Series, ARIMA, ARIMA Models, ARIMAX Models, Data Modeling, Predictive Analytics, Regression Modeling, Marketing Mix Modeling, Finance, Trend Analysis, Bayesian Statistics, Actuarial, Logistic Regression, A/B Testing, Analytics, Optimization, Exploratory Data Analysis, AI Model Training

Data Scientist

2019 - 2020

Foot Locker

Built a machine learning model to predict customer tier change (upgrade and downgrade) in the company's loyalty program in the next quarter based on data from the last three quarters to offer rewards as part of the customer retention policy.
Performed RFM (recency, frequency, and monetary value) analysis for Foot Locker customers to segment more frequent and high-spending customers from others. The purpose was to lay the groundwork for a personalized recommendation system.
Contributed to the churn prediction project at Foot Locker. As with the loyalty program, the company wanted to determine which customers would churn. Predictions were based on data from the previous three quarters, and churn was defined as no spending in one quarter.
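
The RFM analysis mentioned above can be sketched in plain Python as follows. This is a minimal illustration and not the production code: the order records, the reference date, and the cutoffs for "frequent and high-spending" are all hypothetical.

```python
from collections import defaultdict
from datetime import date


def rfm_table(orders, as_of):
    """Aggregate (customer_id, order_date, amount) rows into per-customer RFM values."""
    last_order = {}
    frequency = defaultdict(int)
    monetary = defaultdict(float)
    for customer_id, order_date, amount in orders:
        if customer_id not in last_order or order_date > last_order[customer_id]:
            last_order[customer_id] = order_date
        frequency[customer_id] += 1
        monetary[customer_id] += amount
    return {
        cid: {
            "recency_days": (as_of - last_order[cid]).days,
            "frequency": frequency[cid],
            "monetary": monetary[cid],
        }
        for cid in last_order
    }


def high_value(rfm, max_recency_days=30, min_frequency=3, min_monetary=300.0):
    """Select customers who bought recently, often, and spent a lot (hypothetical cutoffs)."""
    return sorted(
        cid
        for cid, r in rfm.items()
        if r["recency_days"] <= max_recency_days
        and r["frequency"] >= min_frequency
        and r["monetary"] >= min_monetary
    )


# Hypothetical order history.
orders = [
    ("c1", date(2020, 3, 1), 120.0),
    ("c1", date(2020, 3, 15), 150.0),
    ("c1", date(2020, 3, 28), 90.0),
    ("c2", date(2020, 1, 5), 40.0),
    ("c3", date(2020, 3, 20), 500.0),
]
rfm = rfm_table(orders, as_of=date(2020, 3, 31))
print(high_value(rfm))
```

In practice the cutoffs are often replaced by quantile-based scores (for example, quintiles per dimension), but fixed thresholds keep the sketch short.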

Technologies: Databricks, Python, PySpark, Machine Learning, Data Analysis, Data Visualization, Forecasting, Data Science, Statistics, Predictive Modeling, Predictive Learning, ETL, Data Analytics, SQL Server 2016, Cloud, Jupyter Notebook, Amazon SageMaker, Data Cleaning, API Integration, Automation, Graphs, Classification, Text Classification, Financial Modeling, Jupyter, Git, Data Scientist, eCommerce, Sentiment Analysis, Agile Data Science, NumPy, Pandas, CSV, Time Series, ARIMA, Retail, ARIMA Models, ARIMAX Models, Data Modeling, Predictive Analytics, Regression Modeling, Marketing Mix Modeling, Trend Analysis, Digital Marketing, Bayesian Statistics, Logistic Regression, A/B Testing, Exploratory Data Analysis, AI Model Training

Research Associate

2018 - 2018

National University of Computer and Emerging Sciences

Took part in a year-long project that researched and developed detection of road anomalies, namely potholes, manholes, speed breakers, cat-eyes, and rumble strips. Smartphone sensors were used for data collection through hours of drives across the city.
Used the model trained in research to build a crowd-sourced application to map road anomalies across the cities for users to avoid using routes with high anomalies. The model was retrained over time to improve the prediction of road anomalies.
Published a research paper titled "Intelligent Crowd Sourced Road Anomaly Detection System" at the 2018 IEEE Intelligent Vehicles Symposium (IV).

Technologies: Artificial Intelligence (AI), Research, Machine Learning, Data Analysis, Computer Vision, TensorFlow, Sensor Fusion, Data Science, Predictive Modeling, Predictive Learning, Jupyter Notebook, Data Cleaning, Graphs, Classification, Jupyter, OCR, CSV, PyTorch, R, AI Model Training

Experience

Forecast Customer Loyalty Status for Foot Locker USA Loyalty Program

Foot Locker USA has an extensive loyalty program that rewards customers based on their spending. They have three customer classes based on heuristics:
• X1: VIP customers
• X2: average customers (with reasonable spending)
• X3: low-spending customers

The project involved building a machine learning model to predict which customers would and wouldn't change their category in the next quarter based on data from the past eight quarters.

Using pandas, I cleaned and prepared the dataset from 2019 Q1 to 2020 Q4 for feature engineering. The target variable class_change came from 2021 Q1 data. It was set to 1 for customers whose classes had changed during that time (either upgraded or downgraded) and 0 for those who hadn't. The test set target variable came from 2021 Q2. After generating quarterly features from data—including the number of website visits, orders placed, items checked out, items viewed, the amount spent, and other session data and shopping history—a random forest classifier was trained using scikit-learn. The model did fairly well on the test set, with a recall of 0.62 and a precision of 0.87. It was then deployed on Azure.
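
The construction of the class_change target and the reported metrics can be illustrated with the small sketch below. It assumes the label compares each customer's class in the last feature quarter with their class in the target quarter; the customer IDs, classes, and predictions are hypothetical, and the real pipeline used pandas and scikit-learn rather than plain Python.

```python
def class_change_labels(last_feature_quarter, target_quarter):
    """Label 1 if a customer's class differs between the two quarters, else 0."""
    return {
        cid: int(target_quarter[cid] != cls)
        for cid, cls in last_feature_quarter.items()
        if cid in target_quarter
    }


def precision_recall(y_true, y_pred):
    """Compute precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical customer classes (X1, X2, X3) in two consecutive quarters.
classes_2020_q4 = {"u1": "X1", "u2": "X2", "u3": "X3", "u4": "X2"}
classes_2021_q1 = {"u1": "X1", "u2": "X1", "u3": "X2", "u4": "X2"}
labels = class_change_labels(classes_2020_q4, classes_2021_q1)

# Hypothetical model predictions for u1..u4, in that order.
y_true = [labels[cid] for cid in ("u1", "u2", "u3", "u4")]
y_pred = [0, 1, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
print(labels, precision, recall)
```

A model with high precision and lower recall, as reported above, flags few customers incorrectly but misses some who do change class; the toy predictions here show the same pattern.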

YouTube Comment Classification for Content Creators

https://github.com/abdulrafeytahir/Youtube-Comment-Classification

The idea was to create an application that lets YouTube content creators pick out requests for new content from the comments their viewers leave. This would help them create the content their audience was asking for.

The project involved:

• Data collection. I developed a scraper using Python and Selenium, crawled many channels, and collected the top 100 comments on every video. The clients had a team of annotators who annotated about 100,000 comments, of which roughly 10,000 were content requests.

• Model training. The dataset was quite imbalanced, so I downsampled negative class data points to 30,000. I then applied basic text preprocessing techniques using Pandas and NLTK, such as special character removal, lowercasing, tokenization, and stemming. I also generated simple frequency-based features using TF-IDF vectorization in scikit-learn and trained a support vector machine (SVM) model. It achieved 83% recall and 91% precision on the validation set.

• Inference and optimization. I used my scraper to scrape a few more websites and ran inference on the scraped comments. Since the goal was high precision, we took the comments classified as requests, had them annotated again, and retrained the model.
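
The downsampling and basic text cleaning from the model training step can be sketched with the standard library alone. This is only an approximation of the described workflow: stemming, TF-IDF vectorization, and the SVM are omitted because they rely on NLTK and scikit-learn, and the sample comments, function names, and seed are hypothetical.

```python
import random
import re


def preprocess(comment):
    """Lowercase, replace special characters with spaces, and split on whitespace."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", comment.lower())
    return cleaned.split()


def downsample_negatives(rows, n_negative, seed=0):
    """Keep every positive row and a random sample of n_negative negative rows."""
    positives = [r for r in rows if r[1] == 1]
    negatives = [r for r in rows if r[1] == 0]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    kept = rng.sample(negatives, min(n_negative, len(negatives)))
    return positives + kept


# Hypothetical (comment, label) pairs; label 1 marks a content request.
rows = [
    ("Please make a video on SVMs!!", 1),
    ("Great video :)", 0),
    ("First!", 0),
    ("Can you cover TF-IDF next?", 1),
    ("lol", 0),
    ("Nice editing.", 0),
]
balanced = downsample_negatives(rows, n_negative=2)
tokens = preprocess("Can you cover TF-IDF next?")
print(tokens)
```

In the project described above, the same idea reduced the negative class to 30,000 comments against roughly 10,000 requests, a 3:1 ratio rather than the 1:1 ratio of this toy example.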

LLM-based Meeting Minutes Generation

My team and I developed an LLM-based meeting minutes generation model, which was trained on 1,000 meeting videos scraped from YouTube. The first part involved audio/video-to-text transcription, for which we used OpenAI's Whisper model. Once the transcript was generated, we used Llama 2 to create a summary of the transcript and then used the summary to generate meeting minutes in the form of bullet points and a flow chart.

We used GCP for training and ran the training jobs on an NVIDIA A100 GPU. For inference, we used a P100 GPU, which worked well for the scope of our application. We also built a Flask app that is containerized and deployed on GCP. The models are deployed as API endpoints and consumed by the app.

Recommendation System and Churn Reporting for an eCommerce System

I used the "eCommerce Behaviour Data from Multi-category Store" dataset with 285 million users and performed RFMV analysis on it to separate frequent, high-spending customers from the rest. I then used the LightFM library to train a recommendation model on purchase data and trained an SR-GNN deep learning model from the RecBole library on session data, combining the predictions of the two models for better recommendations. In addition to the recommendation system, the focus was on churn reporting, especially for frequent, high-spending customers. The goal was to retain these customers by suggesting personalized marketing strategies.

Skills

Languages

Python, SQL, R, Snowflake, Java, C++, Ruby, C, Scala

Frameworks

Selenium

Libraries/APIs

Pandas, NumPy, Scikit-learn, PySpark, Matplotlib, XGBoost, LSTM, PyTorch, Keras, TensorFlow

Tools

Git, Microsoft Excel, Spark SQL, Amazon SageMaker, Jupyter, H2O AutoML, AWS CloudFormation, Redash, Tableau, AWS Fargate, AWS CodeBuild

Paradigms

Data Science, ETL, Automation, Agent-based Modeling, DevOps, Business Intelligence (BI)

Platforms

Amazon Web Services (AWS), Jupyter Notebook, Amazon EC2, Amazon, Google Cloud Platform (GCP), Docker, Azure, Databricks

Storage

SQL Server 2016, Amazon S3 (AWS S3), PostgreSQL, MySQL, Data Pipelines, Databases

Other

Machine Learning, Data Analysis, Algorithms, Big Data, Natural Language Processing (NLP), Artificial Intelligence (AI), Data Scraping, Time Series Analysis, Data Queries, Web Scraping, Data Visualization, Computer Vision, Statistics, Predictive Modeling, Predictive Learning, Data Reporting, Data Analytics, Data Engineering, Statistical Analysis, GPT, Data Cleaning, AI Design, API Integration, Task Automation, Graphs, Classification, Financial Modeling, Machine Learning Operations (MLOps), Large Data Sets, Unstructured Data Analysis, Data Scientist, Data Gathering, APIs, Recurrent Neural Networks (RNNs), Neural Networks, eCommerce, Sentiment Analysis, Agile Data Science, Stock Trading, Large Language Models (LLMs), Data Versioning, CSV, Time Series, ARIMA, Retail, ARIMA Models, ARIMAX Models, Data Modeling, Predictive Analytics, Regression Modeling, Marketing Mix Modeling, Finance, Trend Analysis, Data Matching, OpenAI GPT-4 API, Digital Marketing, Bayesian Statistics, Actuarial, Logistic Regression, A/B Testing, Analytics, Optimization, Exploratory Data Analysis, AI Model Training, Chatbots, Statistical Modeling, Cloud, Risk Analysis, Generative Pre-trained Transformers (GPT), Text Classification, Machine Learning Automation, Amazon Machine Learning, Amazon SageMaker Pipelines, OCR, ChatGPT, Generative Artificial Intelligence (GenAI), CI/CD Pipelines, BERT, ETL Tools, Excel 365, FastAPI, Containerization, Data Structures, Deep Learning, Forecasting, Churn Analysis, Recommendation Systems, Sensor Fusion, Signal Processing, Research, Language Models, Graph Theory, Identity & Access Management (IAM), AWS CodePipeline, Hugging Face, Llama 2, Text Summarization

Industry Expertise

High-frequency Trading (HFT)

Education

2020 - 2022

Master's Degree in Data Science

National University of Computer and Emerging Sciences - Lahore, Pakistan

2014 - 2018

Bachelor's Degree in Computer Science

National University of Computer and Emerging Sciences - Lahore, Pakistan

Certifications

MARCH 2019 - PRESENT

Neural Networks and Deep Learning

Coursera
