Abdul Rafey Tahir
Verified Expert in Engineering
Research Engineer and Developer
Abdul Rafey is a data scientist with five years of industry experience. He has worked on challenging problems and performed data preprocessing, analysis, and modeling on big data in eCommerce, healthcare, finance, insurance, and safety and compliance domains. He is proficient in Python and data science libraries such as Pandas, NumPy, scikit-learn, PySpark, TensorFlow, Plotly, and Seaborn, as well as AutoML frameworks like H2O.ai and recommendation system frameworks like RecBole and LightFM.
Data Science, Machine Learning, Big Data, Python, Pandas, Scikit-learn, Amazon Web Services (AWS), Deep Learning, Data Analysis, Forecasting, NumPy
The most amazing...
...thing I've built is a real-time collision detection system using sensor data for Motive, Inc., a multibillion-dollar startup in Silicon Valley.
Senior Data Scientist
- Helped develop an analytics pipeline to identify healthcare professional (HCP) leaders at local and national levels based on referrals and prescriptions data. These are provided to pharma clients for maximum market penetration of new and existing brands.
- Developed social media scrapers for Twitter and YouTube to gauge the social media influence of HCPs. This data is preprocessed and fed to a new analytics pipeline that identifies key opinion leaders in specific areas of medicine for pharma clients.
- Led the conversion of the existing codebase from Scala to PySpark to integrate better with the existing Python modules and to speed up execution of many functional blocks compared to the Scala version.
- Developed an unsafe driving detection algorithm for Motive, a US-based multibillion-dollar startup. It uses sensor data to detect unsafe acceleration, braking, and cornering events generated by truck drivers in customer fleets; the detected events are used for driver coaching.
- Trained a real-time crash detection machine learning model using sensor data for Motive's safety product. The system saves event and video data, notifies authorities in minutes, and helps save lives, exonerate drivers, and reduce insurance liability.
- Built a smoothing algorithm in collaboration with the embedded team at Motive to improve the quality of raw sensor data from the electronic logging device in the customers' vehicles, improving the system's precision in catching hard events by 40%.
- Developed a web scraper for Transfermarkt.com to collect data on soccer players. Successfully built and deployed it on AWS Glue to scrape data for 2,000+ teams (more than 60,000 players).
- Used the scraped data to train an unsupervised machine learning model, specifically K-Medoids clustering, to group players based on their statistics, rankings, and valuation.
- Played a key role in integrating the trained model into the client's platform, allowing users to receive the top 5 most similar players based on their search queries.
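The "top 5 most similar players" lookup described above can be illustrated with a small sketch. It assumes similarity is ranked by Euclidean distance between standardized player features; the K-Medoids clustering step itself is not reproduced here, and the player names, features, and values are hypothetical.

```python
import math
from statistics import mean, pstdev

# Hypothetical records: (goals per game, assists per game, market value in EUR millions).
players = {
    "player_a": (0.80, 0.30, 90.0),
    "player_b": (0.75, 0.35, 85.0),
    "player_c": (0.10, 0.05, 5.0),
    "player_d": (0.70, 0.25, 80.0),
    "player_e": (0.05, 0.10, 3.0),
    "player_f": (0.40, 0.40, 40.0),
    "player_g": (0.78, 0.28, 95.0),
}


def standardize(records):
    """Scale each feature to zero mean and unit variance so no feature dominates."""
    columns = list(zip(*records.values()))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # guard against constant columns
    return {
        name: tuple((v - m) / s for v, m, s in zip(values, means, stds))
        for name, values in records.items()
    }


def most_similar(query, records, k=5):
    """Return the k players closest to `query` in standardized feature space."""
    scaled = standardize(records)
    target = scaled[query]
    distances = [
        (math.dist(target, vector), name)
        for name, vector in scaled.items()
        if name != query
    ]
    return [name for _, name in sorted(distances)[:k]]


top5 = most_similar("player_a", players)
```

Standardizing first matters because market value is on a much larger scale than per-game statistics and would otherwise dominate the distance.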
CUNA Mutual Group
- Developed a machine learning model to predict which insurance advisors would not be able to sell a product in the following 12 months based on three years of historical sales data. Trained and deployed the model on the Azure cloud.
- Built a model to forecast which of the credit unions the company did business with would survive the two years after COVID-19 hit, based on historical data going back as far as 1990.
- Developed an algorithm following a weighted average metric model to score the performance of insurance advisors based on their performance in the last four quarters to identify top, medium, and low-performing advisors.
- Built a machine learning model to predict customer tier change (upgrade and downgrade) in the company's loyalty program in the next quarter based on data from the last three quarters to offer rewards as part of the customer retention policy.
- Performed RFM (recency, frequency, and monetary value) analysis for Foot Locker customers to segment more frequent and high-spending customers from others. The purpose was to lay the groundwork for a personalized recommendation system.
- Was part of the churn prediction project at Foot Locker. As with the loyalty program, the company wanted to determine which customers would churn. Based on data from the previous three quarters, the churn criterion was set to no spending in one quarter.
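As a rough illustration of the RFM analysis mentioned above, the sketch below computes recency, frequency, and monetary value per customer and applies a simple rule to separate frequent, high-spending customers from the rest. The transactions, snapshot date, and thresholds are all hypothetical and are not taken from the project.

```python
from datetime import date

# Hypothetical transactions: (customer_id, order_date, amount_usd).
transactions = [
    ("c1", date(2021, 3, 28), 120.0),
    ("c1", date(2021, 2, 10), 80.0),
    ("c1", date(2021, 1, 5), 200.0),
    ("c2", date(2020, 11, 2), 40.0),
    ("c3", date(2021, 3, 1), 60.0),
    ("c3", date(2021, 1, 20), 30.0),
]
snapshot = date(2021, 3, 31)  # assumed analysis date


def rfm_table(rows, as_of):
    """Aggregate days since last order, order count, and total spend per customer."""
    table = {}
    for customer, day, amount in rows:
        entry = table.setdefault(
            customer, {"recency_days": None, "frequency": 0, "monetary": 0.0}
        )
        days_ago = (as_of - day).days
        if entry["recency_days"] is None or days_ago < entry["recency_days"]:
            entry["recency_days"] = days_ago
        entry["frequency"] += 1
        entry["monetary"] += amount
    return table


def segment(entry, min_orders=2, min_spend=100.0, max_recency_days=60):
    """Hypothetical rule: recent, repeat, high-spending customers form one segment."""
    is_high_value = (
        entry["frequency"] >= min_orders
        and entry["monetary"] >= min_spend
        and entry["recency_days"] <= max_recency_days
    )
    return "frequent_high_spender" if is_high_value else "other"


rfm = rfm_table(transactions, snapshot)
segments = {customer: segment(entry) for customer, entry in rfm.items()}
```

In practice the cutoffs are often derived from quantiles of each RFM dimension rather than fixed by hand; fixed thresholds are used here only to keep the example short.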
National University of Computer and Emerging Sciences
- Took part in a year-long project to research and develop detection of road anomalies, i.e., potholes, manholes, speed breakers, cat-eyes, and rumble strips. Data was collected with smartphone sensors over hours of drives across the city.
- Used the model trained in research to build a crowd-sourced application to map road anomalies across the cities for users to avoid using routes with high anomalies. The model was retrained over time to improve the prediction of road anomalies.
- Published a research paper at the 2018 IEEE Intelligent Vehicles Symposium (IV) titled "Intelligent Crowd Sourced Road Anomaly Detection System."
Forecast Customer Loyalty Status for Foot Locker USA Loyalty Program
• X1: VIP customers
• X2: average customers (with reasonable spending)
• X3: low-spending customers
The project involved building a machine learning model to predict which customers would and wouldn't change their category in the next quarter based on data from the past eight quarters.
Using pandas, I cleaned and prepared the dataset from 2019 Q1 to 2020 Q4 for feature engineering. The target variable class_change came from 2021 Q1 data. It was set to 1 for customers whose classes had changed during that time (either upgraded or downgraded) and 0 for those who hadn't. The test set target variable came from 2021 Q2. After generating quarterly features from data—including the number of website visits, orders placed, items checked out, items viewed, the amount spent, and other session data and shopping history—a random forest classifier was trained using scikit-learn. The model did fairly well on the test set, with a recall of 0.62 and a precision of 0.87. It was then deployed on Azure.
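The target definition and the reported metrics can be made concrete with a minimal sketch. It shows how a class_change label might be derived by comparing a customer's tier in two consecutive quarters and how precision and recall are computed from predictions. The customer IDs, tier assignments, and predictions are hypothetical, and the random forest itself is not reproduced.

```python
# Hypothetical loyalty tiers per customer at the end of two consecutive quarters.
tier_2020_q4 = {"u1": "X1", "u2": "X2", "u3": "X3", "u4": "X2", "u5": "X3"}
tier_2021_q1 = {"u1": "X1", "u2": "X1", "u3": "X3", "u4": "X3", "u5": "X3"}


def class_change_labels(before, after):
    """Label 1 if the tier differs between quarters (upgrade or downgrade), else 0."""
    return {cust: int(before[cust] != after[cust]) for cust in before if cust in after}


def precision_recall(y_true, y_pred):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


labels = class_change_labels(tier_2020_q4, tier_2021_q1)
y_true = [labels[cust] for cust in sorted(labels)]
y_pred = [0, 1, 0, 0, 0]  # hypothetical model output for u1..u5
precision, recall = precision_recall(y_true, y_pred)
```

A precision higher than recall, as in the reported 0.87 and 0.62, means the model's predicted tier changes are mostly correct but it misses a share of the customers who actually change tier.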
YouTube Comment Classification for Content Creators
https://github.com/abdulrafeytahir/Youtube-Comment-Classification
The project involved:
• Data collection. I developed a scraper using Python and Selenium, crawled many channels, and collected the top 100 comments on every video. The clients had a team of annotators who annotated about 100,000 comments, of which roughly 10,000 were request comments.
• Model training. The dataset was quite imbalanced, so I downsampled the negative class to 30,000 data points. I then applied basic text preprocessing techniques using Pandas and NLTK, such as special character removal, lowercasing, tokenization, and stemming. Next, I generated simple frequency-based features using TF-IDF vectorization in scikit-learn and trained a support vector machine (SVM) model. It achieved 83% recall and 91% precision on the validation set.
• Inference and optimization. I used my scraper to scrape a few more websites and ran inference on the scraped comments. Since the goal was high precision, we had the comments classified as requests annotated again and retrained the model.
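The training step above can be sketched as a scikit-learn pipeline that chains TF-IDF vectorization with a linear SVM. This is a minimal illustration under stated assumptions: the toy comments and labels are hypothetical, the kernel choice (LinearSVC) is an assumption because the project only specifies "SVM", and the NLTK stemming and downsampling steps are omitted to keep the example self-contained.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def clean(text):
    """Lowercase the text and replace special characters with spaces."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower())


# Hypothetical toy data: 1 = comment classified as a request, 0 = anything else.
comments = [
    "Please make a video about pandas!",
    "Can you do a tutorial on SVMs next?",
    "Please cover TF-IDF in the next video",
    "Could you make a tutorial on web scraping?",
    "Great video, loved it",
    "First!",
    "This was so funny lol",
    "Nice editing and great music",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The custom preprocessor replaces the vectorizer's default lowercasing step;
# tokenization still uses the vectorizer's default token pattern.
model = make_pipeline(TfidfVectorizer(preprocessor=clean), LinearSVC())
model.fit(comments, labels)

predictions = model.predict(
    ["Please make a tutorial on Selenium", "Loved the music"]
)
```

Wrapping both steps in one pipeline keeps the vocabulary and IDF weights learned at training time attached to the classifier, so newly scraped comments go through exactly the same transformation at inference.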
Recommendation System and Churn Reporting for an eCommerce System
Python, SQL, R, Snowflake, C++, Ruby, C, Scala
Pandas, NumPy, Scikit-learn, PySpark, Matplotlib, XGBoost, Keras, TensorFlow
Git, Microsoft Excel, Spark SQL, Amazon SageMaker, Jupyter, AWS CloudFormation, Redash, Tableau, AWS Fargate, AWS CodeBuild
Data Science, ETL, Automation, Agent-based Modeling, Business Intelligence (BI)
Amazon Web Services (AWS), Jupyter Notebook, Amazon EC2, Google Cloud Platform (GCP), Azure, Databricks, Docker
SQL Server 2016, Amazon S3 (AWS S3), PostgreSQL, MySQL, Data Pipelines, Databases
Machine Learning, Data Analysis, Algorithms, Natural Language Processing (NLP), Artificial Intelligence (AI), Data Scraping, Time Series Analysis, Data Queries, Web Scraping, Data Visualization, Computer Vision, Statistics, Predictive Modeling, Predictive Learning, Data Reporting, Data Analytics, Data Engineering, Statistical Analysis, Data Cleaning, AI Design, API Integration, Task Automation, Graphs, Classification, Financial Modeling, Machine Learning Operations (MLOps), Large Data Sets, Unstructured Data Analysis, Data Scientist, Data Gathering, Recurrent Neural Networks (RNN), Neural Networks, eCommerce, Sentiment Analysis, Agile Data Science, Statistical Modeling, Cloud, Risk Analysis, GPT, Generative Pre-trained Transformers (GPT), Text Classification, Machine Learning Automation, Amazon Machine Learning, Amazon SageMaker Pipelines, OCR, APIs, Big Data, Data Structures, Deep Learning, Forecasting, Churn Analysis, Recommendation Systems, Sensor Fusion, Signal Processing, Research, Language Models, Graph Theory, Identity & Access Management (IAM), AWS CodePipeline
Master's Degree in Data Science
National University of Computer and Emerging Sciences - Lahore, Pakistan
Bachelor's Degree in Computer Science
National University of Computer and Emerging Sciences - Lahore, Pakistan
Neural Networks and Deep Learning