
Abdul Rafey Tahir
Verified Expert in Engineering
Research Engineer and Developer
Lahore, Punjab, Pakistan
Toptal member since July 24, 2022
Abdul Rafey is an AI engineer with 6+ years of industry experience, focused on architecting scalable ML, GenAI, and LLM solutions at the enterprise level. With a strong AI R&D background, he has hands-on experience with Transformer models, state-of-the-art LLMs such as GPT, Llama, and Claude, and GenAI frameworks such as LangChain and LlamaIndex. He is an expert AI solutions architect who can rapidly build prototypes and then scale them into production-grade systems on cloud platforms such as AWS and Azure.
Portfolio
Experience
- Machine Learning - 6 years
- Data Science - 6 years
- Python - 6 years
- SQL - 6 years
- Time Series Forecasting - 4 years
- Amazon Web Services (AWS) - 3 years
- LangChain - 2 years
- PySpark - 2 years
Availability
Preferred Environment
Data Science, Machine Learning, Big Data, Python, Amazon Web Services (AWS), Deep Learning, Large Language Models (LLMs), LangChain, OpenAI, Retrieval-augmented Generation (RAG)
The most amazing...
...thing I've built is a real-time, cloud-based collision detection ML system using vision and sensor data for Motive, a multi-billion-dollar Silicon Valley startup.
Work Experience
Data Scientist/ML Engineer
Zoetis - Data and Digital Solutions
- Developed an AI pipeline using Pytesseract and Spark to process large volumes of OCR data, generating insights into pet owners' purchase patterns for Zoetis versus competitor products and identifying target customers to maximize revenue gain (the OCR step is sketched below).
- Developed and deployed a customer defection ML model on Azure AI using historical purchase data to identify customers likely to defect, helping the marketing team direct their retention efforts toward those customers.
- Conducted multiple ad-hoc analyses using Pandas and Spark on various marketing and sales campaigns to determine their success and generated insights into the revenue gain and customer retention as a result of those campaigns.
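A condensed sketch of how the OCR extraction step can look, with pytesseract applied inside a Spark UDF so the work parallelizes across the cluster; the paths, column names, and input schema below are placeholders rather than the actual Zoetis setup:

```python
# Illustrative sketch: OCR text extraction distributed over Spark.
# Paths and column names are placeholders, not production values.
from PIL import Image
import pytesseract
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("ocr-pipeline").getOrCreate()

def ocr_image(path: str) -> str:
    """Run Tesseract OCR on a single scanned document image."""
    try:
        return pytesseract.image_to_string(Image.open(path))
    except Exception:
        return ""  # skip unreadable scans rather than failing the job

ocr_udf = udf(ocr_image, StringType())

# Assumed layout: one row per scanned purchase record, with an image path column.
documents = spark.read.parquet("s3://bucket/scanned-docs/")  # placeholder path
texts = documents.withColumn("raw_text", ocr_udf("image_path"))
texts.write.mode("overwrite").parquet("s3://bucket/ocr-text/")  # placeholder path
```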
Data/AI Expert
Mazars USA LLP
- Designed and developed a GenAI pipeline in Azure AI to ingest and categorize financial documents using OpenAI LLMs and move them to the relevant SharePoint folders, organizing the workflow for internal users such as auditors and tax consultants.
- Implemented a RAG-based approach to categorize documents. I used the OpenAI text-embedding-3-small model to generate embeddings for a large dataset of financial documents. A vector store with embeddings and document categories was built in Cosmos DB.
- Built a FastAPI service to process every new file uploaded to SharePoint by the users. The API generates embeddings for the file, retrieves the top 7 matches from the vector store, and returns the document category by popular vote.
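A minimal sketch of that categorization logic; the in-memory NumPy arrays stand in for the Cosmos DB vector store, and the file names are placeholders:

```python
# Embed the new document, pull the 7 nearest labeled documents,
# and return the majority-vote category.
from collections import Counter
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

# Pre-computed embeddings and labels for the reference corpus (placeholders).
corpus_vectors = np.load("corpus_embeddings.npy")                 # shape (N, 1536)
corpus_labels = np.load("corpus_labels.npy", allow_pickle=True)   # shape (N,)

def categorize(document_text: str, k: int = 7) -> str:
    query = embed(document_text)
    # Cosine similarity against every stored embedding.
    sims = corpus_vectors @ query / (
        np.linalg.norm(corpus_vectors, axis=1) * np.linalg.norm(query)
    )
    top_k = np.argsort(sims)[-k:]
    votes = Counter(corpus_labels[i] for i in top_k)
    return votes.most_common(1)[0][0]
```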
Senior Data Scientist
QPharma, Inc.
- Architected a GenAI chatbot on AWS SageMaker using a LangChain SQL agent to answer user queries from a database of top healthcare professionals in their respective fields (a simplified sketch appears below). It was used by pharma clients' medical and sales reps for marketing purposes.
- Built a big data analytics pipeline using AWS EMR to identify healthcare leaders at local and national levels based on referrals and prescription data. These are provided to pharma clients for maximum market penetration of their drug brands.
- Developed social media scrapers for Twitter and YouTube to gauge the social media influence of HCPs. This data is preprocessed and fed to a new analytics pipeline that identifies key opinion leaders in specific areas of medicine for pharma clients.
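A simplified sketch of the SQL-agent idea, assuming a recent LangChain release and an OpenAI chat model in place of the SageMaker-hosted model used in production; the connection string and example question are placeholders:

```python
# A LangChain SQL agent translates a natural-language question into SQL
# against the HCP database and summarizes the result.
from langchain_community.agent_toolkits import create_sql_agent
from langchain_community.utilities import SQLDatabase
from langchain_openai import ChatOpenAI

db = SQLDatabase.from_uri("postgresql://user:pass@host/hcp_db")  # placeholder URI
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

agent = create_sql_agent(llm=llm, db=db, agent_type="openai-tools", verbose=True)

answer = agent.invoke(
    {"input": "Which cardiologists in New Jersey have the highest referral volume?"}
)
print(answer["output"])
```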
ML Engineer
Ponte Energy Partners GmbH
- Developed an AWS SageMaker pipeline to support training, processing, batch transformation, and inference functionality for ML models that predict price variations on the company's renewable energy trading platform.
- Restructured a large portion of the codebase, set up debug configs for local model execution, and optimized CI/CD scripts and several functionalities for efficient data loading and processing, including the use of Manifest files and property files.
- Adopted new tools such as Typer for efficient CLI argument parsing and contextlib for building the dependency wheel as a background process while the pipeline executes.
Data Scientist / ML Engineer
Neyl Skalli
- Developed a web scraper for Transfermarkt.com to collect soccer player data, then built and deployed it on AWS Glue to scrape data for 2,000+ teams (more than 60,000 players).
- Used the scraped data to train an unsupervised machine learning model, specifically K-Medoids clustering, enabling effective grouping of players based on their statistics, rankings, and valuation.
- Played a key role in integrating the trained model into the client's platform, allowing users to receive the top 5 most similar players based on their search queries.
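An illustrative sketch of the clustering and similar-player lookup; the feature names, cluster count, and data loading are placeholders rather than the production pipeline:

```python
# Cluster players on their scraped statistics with K-Medoids, then answer
# "top 5 similar players" by ranking distances within the query player's cluster.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn_extra.cluster import KMedoids  # from scikit-learn-extra

players = pd.read_csv("players.csv", index_col="player_name")            # placeholder
features = ["goals", "assists", "minutes", "market_value", "ranking"]    # illustrative

X = StandardScaler().fit_transform(players[features])
model = KMedoids(n_clusters=20, metric="euclidean", random_state=42).fit(X)
players["cluster"] = model.labels_

def similar_players(name: str, top_n: int = 5) -> pd.Index:
    """Return the top_n players closest to `name` within its cluster."""
    idx = players.index.get_loc(name)
    same_cluster = players["cluster"] == players["cluster"].iloc[idx]
    candidates = pd.DataFrame(X[same_cluster.values], index=players.index[same_cluster])
    distances = ((candidates - X[idx]) ** 2).sum(axis=1)
    return distances.drop(name).nsmallest(top_n).index
```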
Data Scientist / ML Engineer
Motive
- Developed an unsafe driving detection algorithm for Motive, a US-based multibillion-dollar startup. It detects unsafe acceleration, braking, and cornering events from sensor data generated by customer fleets' truck drivers, and the detections are used for driver coaching.
- Trained a real-time crash detection ML model with huge volumes of sensor data for Motive's safety product. The system saves event and video data, notifies authorities in minutes, and helps save lives, exonerate drivers, and reduce insurance liability.
- Built a smoothing algorithm in collaboration with the embedded team at Motive to improve the quality of raw sensor data from the electronic logging device in customers' vehicles, raising the system's precision in catching hard events by 40%.
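The production smoothing logic is proprietary, but the idea can be illustrated with a simple denoising pass over raw accelerometer samples (a rolling median to remove spikes plus an exponential moving average to damp high-frequency noise); this is a generic example, not Motive's algorithm:

```python
# Illustrative only: simple denoising of raw accelerometer samples.
import pandas as pd

def smooth_accelerometer(samples: pd.DataFrame, window: int = 5, alpha: float = 0.3) -> pd.DataFrame:
    """samples: DataFrame with columns ax, ay, az sampled at a fixed rate."""
    # Rolling median removes isolated spikes without shifting the signal.
    despiked = samples[["ax", "ay", "az"]].rolling(window, center=True, min_periods=1).median()
    # Exponential moving average damps remaining high-frequency noise.
    return despiked.ewm(alpha=alpha).mean()
```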
Data Scientist / ML Engineer
Foot Locker
- Built a machine learning model to predict customer tier change (upgrade and downgrade) in the company's loyalty program in the next quarter based on data from the last three quarters to offer rewards as part of the customer retention policy.
- Performed RFM (recency, frequency, and monetary value) analysis for Foot Locker customers to segment more frequent and high-spending customers from others. The purpose was to lay the groundwork for a personalized recommendation system.
- Contributed to the churn prediction project at Foot Locker. As with the loyalty program, the company wanted to determine which customers would churn; using data from the previous three quarters, churn was defined as no spending for a full quarter.
Data Scientist / ML Engineer
CUNA Mutual Group
- Developed a machine learning model to predict which insurance advisors would not sell a product in the following 12 months, based on three years of historical sales data. Trained and deployed the model on the Azure cloud.
- Built a model to forecast which of the credit unions the company did business with would survive the two years following the COVID-19 outbreak, based on historical data going back as far as 1990.
- Developed a weighted-average scoring algorithm to rate insurance advisors on their last four quarters of results and classify them as top, medium, or low performers.
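A minimal sketch of the quarter-weighted scoring idea; the weights and tier cutoffs shown are illustrative, not the ones used in production:

```python
# Recent quarters count more than older ones (weights are assumed values).
QUARTER_WEIGHTS = [0.4, 0.3, 0.2, 0.1]  # most recent quarter first

def score_advisor(quarterly_sales: list[float]) -> float:
    """quarterly_sales: results for the last four quarters, most recent first."""
    return sum(w * s for w, s in zip(QUARTER_WEIGHTS, quarterly_sales))

def tier(score: float, low: float, high: float) -> str:
    """Map a score to a performance tier using placeholder cutoffs."""
    return "top" if score >= high else "medium" if score >= low else "low"
```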
Research Associate
National University of Computer and Emerging Sciences
- Worked on a year-long project researching and developing road anomaly detection, i.e., potholes, manholes, speed breakers, cat-eyes, and rumble strips, using smartphone sensors for data collection over hours of drives across the city.
- Used the model trained in that research to build a crowd-sourced application that maps road anomalies across the city so users can avoid routes with many anomalies. The model was retrained over time to improve its predictions.
- Published a research paper in the 2018 IEEE Intelligent Vehicle Symposium (IV) Conference titled "Intelligent Crowd Sourced Road Anomaly Detection System."
Experience
Forecast Customer Loyalty Status for Foot Locker USA Loyalty Program
Loyalty program customers were grouped into three classes:
• X1: VIP customers
• X2: average customers (with reasonable spending)
• X3: low-spending customers
Using pandas, I cleaned and prepared the dataset from 2019 Q1 to 2020 Q4 for feature engineering. The target variable class_change came from 2021 Q1 data. It was set to 1 for customers whose classes had changed during that time (either upgraded or downgraded) and 0 for those who hadn't. The test set target variable came from 2021 Q2. After generating quarterly features from data—including the number of website visits, orders placed, items checked out, items viewed, the amount spent, and other session data and shopping history—a random forest classifier was trained using scikit-learn. The model did fairly well on the test set, with a recall of 0.62 and a precision of 0.87. It was then deployed on Azure.
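A condensed sketch of that modeling step; the column names, file paths, and hyperparameters are placeholders for the engineered quarterly features described above:

```python
# Train a random forest on the engineered quarterly features and evaluate
# precision/recall on the held-out quarter.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score

train = pd.read_csv("features_2019q1_2020q4.csv")   # placeholder path
test = pd.read_csv("features_2021q2_holdout.csv")   # placeholder path
feature_cols = [c for c in train.columns if c != "class_change"]

clf = RandomForestClassifier(n_estimators=300, class_weight="balanced", random_state=0)
clf.fit(train[feature_cols], train["class_change"])

preds = clf.predict(test[feature_cols])
print("recall:", recall_score(test["class_change"], preds))
print("precision:", precision_score(test["class_change"], preds))
```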
YouTube Comment Classification for Content Creators
https://github.com/abdulrafeytahir/Youtube-Comment-Classification
The project involved:
• Data collection. I developed a scraper using Python and Selenium, crawled many channels, and collected the top 100 comments on every video. The client's team of annotators labeled about 100,000 comments, of which roughly 10,000 were request comments.
• Model training. The dataset was quite imbalanced, so I downsampled the negative class to 30,000 data points. I then applied basic text preprocessing using Pandas and NLTK, such as special-character removal, lowercasing, tokenization, and stemming. I generated simple frequency-based features using TF-IDF vectorization in scikit-learn and trained a support vector machine (SVM) model, which achieved 83% recall and 91% precision on the validation set (see the sketch after this list).
• Inference and optimization. I used the scraper on additional channels and ran inference on the newly scraped comments. Since the goal was high precision, the comments classified as requests were annotated again and the model retrained.
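A short sketch of the training step referenced above (TF-IDF features into a linear SVM); the file name, column names, and hyperparameters are placeholders:

```python
# TF-IDF vectorization followed by a linear SVM for request-comment detection.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

comments = pd.read_csv("annotated_comments.csv")  # placeholder: columns text, is_request
X_train, X_val, y_train, y_val = train_test_split(
    comments["text"], comments["is_request"],
    test_size=0.2, stratify=comments["is_request"], random_state=0
)

vectorizer = TfidfVectorizer(lowercase=True, stop_words="english", max_features=50_000)
clf = LinearSVC(C=1.0)
clf.fit(vectorizer.fit_transform(X_train), y_train)

print(classification_report(y_val, clf.predict(vectorizer.transform(X_val))))
```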
LLM-based Meeting Minutes Generation
Training ran on GCP using a Tesla A100 GPU, and inference used a P100 GPU, which was sufficient for the scope of the application. We also built a Flask app, containerized and deployed on GCP, that consumes the models through their API endpoints.
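A minimal sketch of the serving layer: a Flask route that forwards a meeting transcript to the summarization model's endpoint and returns the minutes; the endpoint URL and payload schema are placeholders, not the production API:

```python
# Flask app that proxies transcripts to the model endpoint and returns minutes.
import requests
from flask import Flask, jsonify, request

app = Flask(__name__)
MODEL_ENDPOINT = "http://model-service:8080/summarize"  # placeholder URL

@app.route("/minutes", methods=["POST"])
def generate_minutes():
    transcript = request.get_json().get("transcript", "")
    resp = requests.post(MODEL_ENDPOINT, json={"text": transcript}, timeout=120)
    resp.raise_for_status()
    return jsonify({"minutes": resp.json().get("summary", "")})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```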
Recommendation System and Churn Reporting for an eCommerce System
MedBot - An LLM-based Medical Chatbot
Education
Master's Degree in Data Science
National University of Computer and Emerging Sciences - Lahore, Pakistan
Bachelor's Degree in Computer Science
National University of Computer and Emerging Sciences - Lahore, Pakistan
Certifications
Neural Networks and Deep Learning
Coursera
Skills
Libraries/APIs
Pandas, NumPy, Scikit-learn, PySpark, Matplotlib, TensorFlow, Python API, XGBoost, LSTM, PyTorch, Node.js, Keras
Tools
Git, Microsoft Excel, Spark SQL, Amazon SageMaker, Jupyter, ARIMA, ARIMAX, H2O AutoML, You Only Look Once (YOLO), Terminal, AWS CloudFormation, ChatGPT, Cloud Scheduler, Redash, Tableau, AWS Fargate, AWS CodeBuild, Azure ML Studio, Azure Machine Learning
Languages
Python, SQL, R, Bash Script, Bash, Snowflake, Java, C++, Ruby, C, Scala
Frameworks
Selenium, Spark, LlamaIndex, Streamlit
Paradigms
ETL, Automation, HIPAA Compliance, Business Intelligence (BI), Agent-based Modeling, DevOps
Platforms
Amazon Web Services (AWS), Azure, Databricks, Jupyter Notebook, Amazon EC2, Amazon, AWS Lambda, Kubeflow, Google Cloud Platform (GCP), Docker
Storage
Databases, SQL Server 2016, Amazon S3 (AWS S3), PostgreSQL, MySQL, MongoDB, Data Pipelines
Industry Expertise
Retail & Wholesale, High-frequency Trading (HFT)
Other
Data Science, Machine Learning, Data Analysis, Algorithms, Big Data, Natural Language Processing (NLP), Artificial Intelligence (AI), Data Scraping, Time Series Analysis, Data Queries, Web Scraping, Data Visualization, Forecasting, Computer Vision, Statistics, Predictive Modeling, Predictive Learning, Data Reporting, Data Analytics, Data Engineering, Statistical Analysis, Data Cleaning, AI Design, API Integration, Task Automation, Graphs, Classification, Financial Modeling, Machine Learning Operations (MLOps), Large Data Sets, Unstructured Data Analysis, Data Scientist, Data Gathering, APIs, Recurrent Neural Networks (RNNs), Neural Networks, eCommerce, Sentiment Analysis, Agile Data Science, Stock Trading, Large Language Models (LLMs), Data Versioning, CSV, Time Series, Data Modeling, Predictive Analytics, Regression Modeling, Marketing Mix Modeling, Finance, Trend Analysis, Data Matching, OpenAI GPT-4 API, Digital Marketing, Bayesian Statistics, Actuarial, Logistic Regression, A/B Testing, Analytics, Optimization, Exploratory Data Analysis, AI Model Training, Chatbots, OpenAI GPT-3 API, Bots, Stock Market, Trading Bots, LangChain, OpenAI, Retrieval-augmented Generation (RAG), Supervised Machine Learning, Model Tuning, Serverless, Generative Pre-trained Transformer 3 (GPT-3), Generative Pre-trained Transformer 4 (GPT-4), AI Agents, Bayesian Machine Learning, Electronic Health Records (EHR), Time Series Forecasting, Bayesian Inference & Modeling, SSH, AWS SSH Keys, Data, Data-centric AI, AI Programming, Document Parsing, Mathematical Statistics, Marketing Analytics, Statistical Data Analysis, Website Data Scraping, Scraping, Google Colaboratory (Colab), Prompt Engineering, Statistical Modeling, Cloud, Risk Analysis, Generative Pre-trained Transformers (GPT), Text Classification, Machine Learning Automation, Amazon Machine Learning, Amazon SageMaker Pipelines, Optical Character Recognition (OCR), Generative Artificial Intelligence (GenAI), CI/CD Pipelines, BERT, ETL Tools, Excel 365, FastAPI, Containerization, PDF Scraping, Amazon Bedrock, Xarray, Equities, Equity Trading, Advertising Technology (Adtech), Google Cloud Functions, Data Structures, Deep Learning, Churn Analysis, Recommendation Systems, Sensor Fusion, Signal Processing, Research, Language Models, Graph Theory, Identity & Access Management (IAM), AWS CodePipeline, Hugging Face, Llama 2, Text Summarization, Azure Databricks, Azure Data Factory (ADF), Reinforcement Learning