
Abdul Rafey Tahir
Verified Expert in Engineering
Research Engineer and Developer
Lahore, Punjab, Pakistan
Toptal member since July 24, 2022
Abdul Rafey is a data scientist with 5+ years of industry experience. He has worked on challenging AI, big data, and LLM use cases in eCommerce, healthcare, finance, insurance, and safety and compliance domains. He is proficient in Python and data science libraries such as Pandas, NumPy, and scikit-learn; big data frameworks such as Apache Spark and Hadoop; and LLM integration and fine-tuning with OpenAI, LangChain, and Hugging Face.
Experience
- SQL - 5 years
- Machine Learning - 5 years
- Python - 5 years
- Data Science - 5 years
- Amazon Web Services (AWS) - 3 years
- PySpark - 2 years
- Kubeflow - 2 years
Preferred Environment
Data Science, Machine Learning, Big Data, Python, Amazon Web Services (AWS), Deep Learning, Large Language Models (LLMs), LangChain, OpenAI, Kubeflow
The most amazing...
...thing I've built is a real-time collision detection system using sensor data for Motive, Inc., a multibillion-dollar startup in Silicon Valley.
Work Experience
Data/AI Expert
Mazars USA LLP - Main
- Developed a data pipeline in Azure DevOps to ingest and categorize financial documents using a large language model and move them to relevant SharePoint folders to organize workflow for internal users like auditors and tax consultants.
- Implemented a RAG-based approach to categorize documents. I used the OpenAI text-embedding-3-small model to generate embeddings for a large dataset of financial documents. A vector store with embeddings and document categories was built in Cosmos DB.
- Built a FastAPI service to process every new file that users upload to SharePoint. The API generates an embedding for the file, retrieves the top 7 results from the vector store, and uses majority voting to return the document category.
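The retrieve-then-vote step described above can be sketched roughly as follows. This is a minimal illustration, not the production code: the in-memory `store` list stands in for the Cosmos DB vector store, the tiny 2-D vectors stand in for real embeddings, and the names `cosine_similarity` and `categorize` are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def categorize(query_embedding, store, k=7):
    """Return the most common category among the k nearest stored documents.

    `store` is a list of (embedding, category) pairs; here it is a plain
    Python list used as a stand-in for a real vector store (assumption).
    """
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_embedding, item[0]),
        reverse=True,
    )
    top_categories = [category for _, category in ranked[:k]]
    # Majority vote over the retrieved neighbours.
    return Counter(top_categories).most_common(1)[0][0]
```

With k set to an odd number such as 7, a two-category vote cannot tie; with more categories a tie is still possible, and a real service would need an explicit tie-breaking rule.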
Senior Data Scientist
QPharma, Inc.
- Developed an LLM chatbot using a LangChain SQL agent with dynamic few-shot prompting to answer user queries from a database of top healthcare professionals in their respective fields. It was used by medical and sales reps of pharma clients.
- Built a big data analytics pipeline to identify healthcare professional leaders at local and national levels based on referrals and prescription data. These are provided to pharma clients for maximum market penetration of new and existing brands.
- Architected social media scrapers for Twitter and YouTube to gauge the social media influence of HCPs. This data is preprocessed and fed to a new analytics pipeline that identifies key opinion leaders in specific areas of medicine for pharma clients.
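To give a feel for the dynamic few-shot prompting mentioned in the chatbot bullet above, here is a small sketch that picks the stored examples most similar to the user's question and places them in the prompt. Everything in it is an assumption for illustration: the `hcp` table and its columns are hypothetical, and word-overlap (Jaccard) similarity is a simple stand-in for whatever example selector the actual LangChain agent used.

```python
def tokenize(text):
    """Lowercase whitespace tokenization into a set of words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_examples(question, examples, k=2):
    """Pick the k examples whose questions overlap most with the query."""
    query_tokens = tokenize(question)
    return sorted(
        examples,
        key=lambda ex: jaccard(query_tokens, tokenize(ex["question"])),
        reverse=True,
    )[:k]


def build_prompt(question, examples, k=2):
    """Assemble a few-shot text-to-SQL prompt from the selected examples."""
    lines = ["Translate the question into a SQL query.", ""]
    for ex in select_examples(question, examples, k):
        lines.append(f"Question: {ex['question']}")
        lines.append(f"SQL: {ex['sql']}")
        lines.append("")
    lines.append(f"Question: {question}")
    lines.append("SQL:")
    return "\n".join(lines)


# Hypothetical example pool; table and column names are made up.
EXAMPLES = [
    {"question": "how many cardiologists are in texas",
     "sql": "SELECT COUNT(*) FROM hcp WHERE specialty = 'Cardiology' AND state = 'TX'"},
    {"question": "list the top oncologists by referrals",
     "sql": "SELECT name FROM hcp WHERE specialty = 'Oncology' ORDER BY referrals DESC LIMIT 10"},
    {"question": "what is the average prescription count per state",
     "sql": "SELECT state, AVG(rx_count) FROM hcp GROUP BY state"},
]
```

The point of selecting examples per query is that the prompt stays short while still showing the model the query patterns closest to what the user asked.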
ML Engineer
Ponte Energy Partners GmbH
- Developed an AWS Sagemaker pipeline to support training, processing, batch transformation, and inference functionality for ML models that predict price variations on the company's renewable energy trading platform.
- Restructured a large portion of the codebase, set up debug configs for local model execution, and optimized CI/CD scripts and several functionalities for efficient data loading and processing, including the use of Manifest files and property files.
- Adopted new tools such as Typer for parsing CLI arguments and contextlib for building the dependency wheel as a background process while the pipeline executes.
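One way to run a build step in the background with contextlib, as the last bullet describes, is a context manager that starts a subprocess on entry and waits for it on exit. The sketch below is an assumption about the general pattern, not the project's code: `background_process` is a hypothetical name, and the command passed in the usage example is a placeholder for the real wheel build command, which the profile does not give.

```python
import subprocess
import sys
from contextlib import contextmanager


@contextmanager
def background_process(command):
    """Start `command` as a child process and wait for it when the block exits."""
    process = subprocess.Popen(command)
    try:
        # The body of the `with` block runs while the child process is working.
        yield process
    finally:
        return_code = process.wait()
    # Only reached when the body finished without raising.
    if return_code != 0:
        raise RuntimeError(f"background command failed with exit code {return_code}")


if __name__ == "__main__":
    # Placeholder command; a real pipeline would launch its wheel build here.
    with background_process([sys.executable, "-c", "print('building...')"]):
        print("pipeline steps run here while the build proceeds")
```

Waiting in the `finally` clause means the child is never left running, even if a pipeline step raises inside the block.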
Data Scientist
Neyl Skalli
- Developed a web scraper for Transfermarkt.com to scrape data for soccer players. Successfully built and deployed it on AWS Glue to scrape data for 2,000+ teams (more than 60,000 players).
- Used the scraped data to train an unsupervised machine learning model, specifically K-Medoids clustering, enabling effective grouping of players based on their statistics, rankings, and valuation.
- Played a key role in integrating the trained model into the client's platform, allowing users to receive the top 5 most similar players based on their search queries.
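The clustering and "most similar players" lookup described above can be illustrated with a small pure-Python sketch. It assumes each player is already a numeric feature vector (statistics, rankings, valuation) and uses a simple alternating K-Medoids variant with Euclidean distance; the function names are hypothetical, and the deployed model may well have used a library implementation and a different distance.

```python
import math
import random


def assign(points, medoids):
    """Label each point with the index of its nearest medoid."""
    return [min(medoids, key=lambda m: math.dist(p, points[m])) for p in points]


def k_medoids(points, k, max_iter=50, seed=0):
    """Alternating K-Medoids: assign points, then re-pick each cluster's medoid."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(points)), k)
    for _ in range(max_iter):
        labels = assign(points, medoids)
        new_medoids = []
        for m in medoids:
            members = [i for i, label in enumerate(labels) if label == m]
            if not members:  # can only happen with duplicate points
                new_medoids.append(m)
                continue
            # The new medoid is the member with the smallest total distance
            # to all other members of its cluster.
            best = min(
                members,
                key=lambda c: sum(math.dist(points[c], points[j]) for j in members),
            )
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return assign(points, medoids)


def top_similar(points, labels, query_index, n=5):
    """Indices of the n closest points in the same cluster as the query."""
    candidates = [
        i for i, label in enumerate(labels)
        if label == labels[query_index] and i != query_index
    ]
    candidates.sort(key=lambda i: math.dist(points[query_index], points[i]))
    return candidates[:n]
```

Restricting the search to the query's own cluster keeps the lookup cheap; in practice the features would usually be scaled first so that valuation does not dominate the distance.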
Data Scientist
Motive
- Developed an unsafe driving detection algorithm for Motive, a US-based multibillion-dollar startup. Using sensor data, it detects unsafe acceleration, braking, and cornering events by truck drivers in customer fleets; the detected events are used for driver coaching.
- Trained a real-time crash detection ML model with huge volumes of sensor data for Motive's safety product. The system saves event and video data, notifies authorities in minutes, and helps save lives, exonerate drivers, and reduce insurance liability.
- Built a smoothing algorithm in collaboration with the embedded team at Motive to improve the quality of raw sensor data from the electronic logging device in the customers' vehicles, improving the system's precision in catching hard events by 40%.
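The profile does not say which smoothing method was used, so the following is only a generic illustration of why smoothing raw sensor data can reduce false hard-event detections. It assumes a centered moving average and a fixed threshold; the window size, threshold, sample values, and function names are all hypothetical.

```python
def moving_average(samples, window=3):
    """Centered moving average; the window shrinks at the edges of the signal."""
    half = window // 2
    smoothed = []
    for i in range(len(samples)):
        lo = max(0, i - half)
        hi = min(len(samples), i + half + 1)
        smoothed.append(sum(samples[lo:hi]) / (hi - lo))
    return smoothed


def hard_event_indices(samples, threshold):
    """Indices where the absolute sensor reading reaches the threshold."""
    return [i for i, value in enumerate(samples) if abs(value) >= threshold]


# Made-up accelerometer readings: an isolated spike at index 2 (noise)
# and a sustained high reading at indices 5 to 7 (a genuine hard event).
raw = [0.1, 0.0, 3.0, 0.1, 0.0, 2.5, 2.6, 2.7, 0.2]
raw_hits = hard_event_indices(raw, threshold=2.0)
smoothed_hits = hard_event_indices(moving_average(raw), threshold=2.0)
```

On the raw signal the isolated spike crosses the threshold, while on the smoothed signal only the sustained event does, which is the kind of precision gain smoothing is meant to deliver.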
Data Scientist
Foot Locker
- Built a machine learning model to predict customer tier change (upgrade and downgrade) in the company's loyalty program in the next quarter based on data from the last three quarters to offer rewards as part of the customer retention policy.
- Performed RFM (recency, frequency, and monetary value) analysis for Foot Locker customers to segment more frequent and high-spending customers from others. The purpose was to lay the groundwork for a personalized recommendation system.
- Contributed to the churn prediction project at Foot Locker. As with the loyalty program, the company wanted to determine which customers would churn. Churn was defined as no spending in one quarter, and predictions were based on data from the previous three quarters.
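For readers unfamiliar with the RFM analysis mentioned above, the three values can be computed per customer from a plain order log. This is a minimal sketch under assumed inputs: the `(customer_id, order_date, amount)` tuple layout and the `rfm_table` name are hypothetical, not the project's schema.

```python
from datetime import date


def rfm_table(orders, as_of):
    """Compute recency (days), frequency, and monetary value per customer.

    `orders` is an iterable of (customer_id, order_date, amount) tuples.
    """
    table = {}
    for customer_id, order_date, amount in orders:
        row = table.setdefault(
            customer_id,
            {"last_order": order_date, "frequency": 0, "monetary": 0.0},
        )
        row["last_order"] = max(row["last_order"], order_date)
        row["frequency"] += 1
        row["monetary"] += amount
    return {
        customer_id: {
            "recency_days": (as_of - row["last_order"]).days,
            "frequency": row["frequency"],
            "monetary": row["monetary"],
        }
        for customer_id, row in table.items()
    }
```

Customers with low recency, high frequency, and high monetary value form the frequent, high-spending segment; where exactly to draw those cutoffs is a business decision that the profile does not specify.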
Data Scientist
CUNA Mutual Group
- Developed a machine learning model to predict which insurance advisors would not be able to sell a product in the following 12 months based on three years of historical data or sales. Trained and deployed the model on the Azure cloud.
- Built a model to forecast which of the credit unions the company did business with would survive the two years after COVID-19 hit, based on historical data going as far back as 1990.
- Developed an algorithm following a weighted average metric model to score the performance of insurance advisors based on their performance in the last four quarters to identify top, medium, and low-performing advisors.
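A weighted average score over four quarters, as in the last bullet above, might look like the sketch below. The profile does not state whether the weights applied across quarters, across metrics, or both; this example assumes one metric per quarter with heavier weights on recent quarters, and the weights, cutoffs, and function names are hypothetical.

```python
def weighted_score(quarterly_values, weights):
    """Weighted average of per-quarter values (oldest quarter first)."""
    if len(quarterly_values) != len(weights):
        raise ValueError("need exactly one weight per quarter")
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(quarterly_values, weights)) / total_weight


def performance_tier(score, top_cutoff, low_cutoff):
    """Bucket a score into top, medium, or low using assumed cutoffs."""
    if score >= top_cutoff:
        return "top"
    if score >= low_cutoff:
        return "medium"
    return "low"
```

For example, quarterly values of 10, 20, 30, and 40 with weights 1, 2, 3, and 4 give (10 + 40 + 90 + 160) / 10 = 30, so the most recent quarter contributes the most to the score.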
Research Associate
National University of Computer and Emerging Sciences
- Took part in a year-long project to research and develop detection of road anomalies, i.e., potholes, manholes, speed breakers, cat-eyes, and rumble strips. Smartphone sensors were used for data collection through hours of drives across the city.
- Used the model trained in research to build a crowd-sourced application to map road anomalies across the cities for users to avoid using routes with high anomalies. The model was retrained over time to improve the prediction of road anomalies.
- Published a research paper titled "Intelligent Crowd Sourced Road Anomaly Detection System" at the 2018 IEEE Intelligent Vehicles Symposium (IV).
Experience
Forecast Customer Loyalty Status for Foot Locker USA Loyalty Program
• X1: VIP customers
• X2: average customers (with reasonable spending)
• X3: low-spending customers
The project involved building a machine learning model to predict which customers would and wouldn't change their category in the next quarter based on data from the past eight quarters.
Using pandas, I cleaned and prepared the dataset from 2019 Q1 to 2020 Q4 for feature engineering. The target variable class_change came from 2021 Q1 data. It was set to 1 for customers whose classes had changed during that time (either upgraded or downgraded) and 0 for those who hadn't. The test set target variable came from 2021 Q2. After generating quarterly features from data—including the number of website visits, orders placed, items checked out, items viewed, the amount spent, and other session data and shopping history—a random forest classifier was trained using scikit-learn. The model did fairly well on the test set, with a recall of 0.62 and a precision of 0.87. It was then deployed on Azure.
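To make the target definition and the reported metrics concrete, here is a small sketch of how a class_change label and precision/recall could be computed. It uses plain dictionaries and lists rather than pandas and scikit-learn, and the helper names are hypothetical; it is an illustration of the definitions, not the project's pipeline.

```python
def class_change_labels(tier_before, tier_after):
    """1 if a customer's tier differs between two quarters, else 0.

    Both arguments map customer_id to a tier such as "X1", "X2", or "X3".
    Customers missing from `tier_after` are skipped.
    """
    return {
        customer_id: int(tier_after[customer_id] != tier)
        for customer_id, tier in tier_before.items()
        if customer_id in tier_after
    }


def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Read this way, a precision of 0.87 means 87% of the customers flagged as changing tier actually changed, and a recall of 0.62 means the model caught 62% of all customers who changed.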
YouTube Comment Classification for Content Creators
https://github.com/abdulrafeytahir/Youtube-Comment-Classification
The project involved:
• Data collection. I developed a scraper using Python and Selenium, crawled many channels, and collected the top 100 comments on every video. The clients had a team of annotators who annotated about 100,000 comments, of which roughly 10,000 were request comments.
• Model training. The dataset was quite imbalanced, so I downsampled negative class data points to 30,000. I then applied basic text preprocessing techniques using Pandas and NLTK, such as special character removal, lowercasing, tokenization, and stemming. I also generated simple frequency-based features using TF-IDF vectorization in scikit-learn and trained a support vector machine (SVM) model. It achieved 83% recall and 91% precision on the validation set.
• Inference and optimization. I used my scraper to scrape a few more websites and ran inference on the scraped comments. Since the goal was high precision, we had the comments classified as requests annotated again and retrained the model.
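The downsampling and basic preprocessing steps from the model training item above can be sketched with the standard library alone. This is an illustration under assumptions: the `(text, label)` row layout and function names are hypothetical, stemming is left out because it needs NLTK, and the TF-IDF and SVM stages are not shown.

```python
import random
import re


def preprocess(text):
    """Lowercase, replace special characters with spaces, and tokenize."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return cleaned.split()


def downsample_negatives(rows, max_negatives, seed=42):
    """Keep all positive rows and at most `max_negatives` negative rows.

    `rows` is a list of (text, label) pairs, with label 1 for a request
    comment and 0 otherwise.
    """
    positives = [row for row in rows if row[1] == 1]
    negatives = [row for row in rows if row[1] == 0]
    rng = random.Random(seed)
    if len(negatives) > max_negatives:
        negatives = rng.sample(negatives, max_negatives)
    combined = positives + negatives
    rng.shuffle(combined)  # avoid all positives appearing first
    return combined
```

Downsampling only the majority class keeps every scarce positive example, which matters when requests are roughly one tenth of the annotated comments.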
LLM-based Meeting Minutes Generation
We trained on GCP, running the training jobs on an NVIDIA A100 GPU. For inference, we used a P100 GPU, which was sufficient for the scope of our application. We also built a Flask app that is containerized and deployed on GCP. The models are deployed as API endpoints and consumed by the app.
Recommendation System and Churn Reporting for an eCommerce System
MedBot - An LLM-based Medical Chatbot
Education
Master's Degree in Data Science
National University of Computer and Emerging Sciences - Lahore, Pakistan
Bachelor's Degree in Computer Science
National University of Computer and Emerging Sciences - Lahore, Pakistan
Certifications
Neural Networks and Deep Learning
Coursera
Skills
Libraries/APIs
Pandas, NumPy, Scikit-learn, PySpark, Matplotlib, TensorFlow, Python API, XGBoost, LSTM, PyTorch, Node.js, Keras
Tools
Git, Microsoft Excel, Spark SQL, Amazon SageMaker, Jupyter, ARIMA, ARIMAX, H2O AutoML, AWS CloudFormation, ChatGPT, Redash, Tableau, AWS Fargate, AWS CodeBuild, Azure ML Studio
Languages
Python, SQL, R, Snowflake, Java, C++, Ruby, C, Scala
Frameworks
Selenium, LlamaIndex, Streamlit
Paradigms
ETL, Automation, Agent-based Modeling, DevOps, Business Intelligence (BI)
Platforms
Amazon Web Services (AWS), Databricks, Jupyter Notebook, Amazon EC2, Amazon, AWS Lambda, Kubeflow, Google Cloud Platform (GCP), Docker, Azure
Storage
SQL Server 2016, Amazon S3 (AWS S3), PostgreSQL, MySQL, MongoDB, Data Pipelines, Databases
Industry Expertise
Retail & Wholesale, High-frequency Trading (HFT)
Other
Data Science, Machine Learning, Data Analysis, Algorithms, Big Data, Natural Language Processing (NLP), Artificial Intelligence (AI), Data Scraping, Time Series Analysis, Data Queries, Web Scraping, Data Visualization, Computer Vision, Statistics, Predictive Modeling, Predictive Learning, Data Reporting, Data Analytics, Data Engineering, Statistical Analysis, Data Cleaning, AI Design, API Integration, Task Automation, Graphs, Classification, Financial Modeling, Machine Learning Operations (MLOps), Large Data Sets, Unstructured Data Analysis, Data Scientist, Data Gathering, APIs, Recurrent Neural Networks (RNNs), Neural Networks, eCommerce, Sentiment Analysis, Agile Data Science, Stock Trading, Large Language Models (LLMs), Data Versioning, CSV, Time Series, Data Modeling, Predictive Analytics, Regression Modeling, Marketing Mix Modeling, Finance, Trend Analysis, Data Matching, OpenAI GPT-4 API, Digital Marketing, Bayesian Statistics, Actuarial, Logistic Regression, A/B Testing, Analytics, Optimization, Exploratory Data Analysis, AI Model Training, Chatbots, OpenAI GPT-3 API, Bots, Stock Market, Trading Bots, LangChain, Supervised Machine Learning, Model Tuning, Serverless, Generative Pre-trained Transformer 3 (GPT-3), Generative Pre-trained Transformer 4 (GPT-4), Statistical Modeling, Cloud, Risk Analysis, Generative Pre-trained Transformers (GPT), Text Classification, Machine Learning Automation, Amazon Machine Learning, Amazon SageMaker Pipelines, Optical Character Recognition (OCR), Generative Artificial Intelligence (GenAI), CI/CD Pipelines, BERT, ETL Tools, Excel 365, FastAPI, Containerization, PDF Scraping, Amazon Bedrock, Data Structures, Deep Learning, Forecasting, Churn Analysis, Recommendation Systems, Sensor Fusion, Signal Processing, Research, Language Models, Graph Theory, Identity & Access Management (IAM), AWS CodePipeline, Hugging Face, Llama 2, Text Summarization, OpenAI, Retrieval-augmented Generation (RAG)