Peter Papai, Developer in Bangkok, Thailand

Peter Papai

Verified Expert in Engineering

Bio

With a PhD in physics, Peter is a developer working in the field of data science. He has five years of full-time experience working on big data projects at a large internet company. Peter has formulated business goals and designed, prototyped, productized, and A/B-tested machine learning algorithms in several areas. His insights gleaned from data have helped stakeholders make impactful business decisions.

Portfolio

DiagnoseEarly
Python, Docker, Scientific Computing, GC Mass Spectrometry, Statistics...
Tokopedia
Python, Google Cloud Platform (GCP), Machine Learning, BigQuery, A/B Testing...
Agoda
PyTorch, Scikit-learn, SQL, Apache Hive, Apache Spark, Scala, Python...

Experience

  • Python - 10 years
  • Machine Learning - 10 years
  • Statistics - 10 years
  • Spark - 4 years
  • Deep Learning - 4 years
  • Scala - 4 years
  • A/B Testing - 3 years
  • PyTorch - 2 years

Availability

Part-time

Preferred Environment

Git, IntelliJ IDEA, PyCharm, Visual Studio Code (VS Code), Google Cloud Platform (GCP), Spark, Hugging Face, PyTorch

The most amazing...

...thing I've done is to formulate a mathematical framework to price hotel rooms, which has become an essential driver of business for my employer.

Work Experience

Data Scientist

2022 - 2024
DiagnoseEarly
  • Developed the scientific components of the software that identified toxins in the mass spectrum of human breath.
  • Dockerized the solution to make it available as a microservice.
  • Researched the scientific literature on mass spectrometry, made recommendations, and conducted feasibility studies to help the early-stage startup define its product roadmap.
  • Wrote algorithms to process heart rate data from wearable devices as a component of a health care app.
Technologies: Python, Docker, Scientific Computing, GC Mass Spectrometry, Statistics, Data Science, Artificial Intelligence (AI), Data Analysis, Data Cleansing, Visual Studio Code (VS Code), Python 3, Git, Mathematics, NumPy, Pandas, Clustering Algorithms, K-means Clustering, Data Scientist, Data Visualization, Data Processing, Decision Trees

Lead Data Scientist

2021 - 2021
Tokopedia
  • Improved the ranking algorithm on the search page, which increased the number of orders by around 1% overall according to A/B tests.
  • Enhanced the A/B testing framework, rectifying the harm done by many erroneous past experiments.
  • Refined the relevance of keyword targeting ads and increased the ad revenue by 2%.
  • Collaborated with the tech team to deploy models in production on GCP.
  • Built ETL pipelines to create features using BigQuery and Dataflow.
Technologies: Python, Google Cloud Platform (GCP), Machine Learning, BigQuery, A/B Testing, Data Science, Scikit-learn, Artificial Intelligence (AI), ETL, Supervised Machine Learning, Data Analysis, Visual Studio Code (VS Code), Advertising, Python 3, Recommendation Systems, Git, Mathematics, Clustering, NumPy, Pandas, Clustering Algorithms, K-means Clustering, Data Scientist, Google BigQuery, Data Pipelines, Data Warehousing, Data Processing, Decision Trees, Flask, Scripting

Lead Data Scientist

2019 - 2020
Agoda
  • Developed a system to monitor thousands of time series for anomaly detection.
  • Improved fraud detection using machine learning and new A/B testing strategies.
  • Provided mentoring for less experienced data scientists.
  • Communicated with stakeholders, worked on roadmaps, and defined KPIs for projects.
  • Helped the tech team to deploy deep learning models in production.
Technologies: PyTorch, Scikit-learn, SQL, Apache Hive, Apache Spark, Scala, Python, Machine Learning, Data Science, A/B Testing, Time Series Analysis, Anomaly Detection, Artificial Intelligence (AI), ETL, Supervised Machine Learning, Data Analysis, Python 3, IntelliJ IDEA, Time Series, Git, Spark, Mathematics, DBSCAN, Clustering, NumPy, Pandas, Clustering Algorithms, K-means Clustering, Data Scientist, Data Warehousing, Data Visualization, Data Processing, Decision Trees, Flask, Scripting

Senior Data Scientist

2015 - 2019
Agoda (Booking Holdings)
  • Served as a core member of back-end teams following the agile methodology, including Scrum, Jira, and Git pull requests, among others.
  • Turned business goals into math objectives and implemented algorithms to optimize them for pricing.
  • Implemented content and collaborative filtering-based algorithms for ranking.
  • Applied time series prediction techniques for demand forecasting.
  • Cooperated with tech teams to put into production models written in Scala or Python using a variety of frameworks and tools.
  • Built ETL pipelines for feature engineering, mainly using Spark.
Technologies: SQL, Apache Hive, HDFS, Apache Spark, Scala, Python, A/B Testing, Data Science, Machine Learning, Deep Learning, Revenue Optimization, Operations Research, Artificial Intelligence (AI), ETL, Supervised Machine Learning, Data Analysis, Data Cleansing, Python 3, Recommendation Systems, IntelliJ IDEA, PyCharm, Git, Spark, Mathematics, DBSCAN, Clustering, NumPy, Clustering Algorithms, K-means Clustering, Data Scientist, Data Pipelines, Data Warehousing, Data Visualization, Data Processing, Decision Trees, Scripting

Senior Data Analyst

2014 - 2014
IO Technologies
  • Prototyped a model for clickthrough rate prediction, satisfying the architectural constraints of the company.
  • Used the cloud-based stack (AWS) of the company to produce the model.
  • Provided training about machine learning and data science for coworkers.
Technologies: Amazon Web Services (AWS), Scikit-learn, Python, A/B Testing, Artificial Intelligence (AI), Supervised Machine Learning, Data Analysis, Visual Studio Code (VS Code), Python 3, PyCharm, Git, Mathematics, Clustering, Data Processing, Decision Trees

Quantitative Researcher

2013 - 2013
WorldQuant
  • Researched the scientific literature for ideas for automated trading.
  • Implemented predictive models using different data sources, such as historical stock returns, news articles, etc.
  • Tested predictive algorithms offline, using historical data.
Technologies: Python, C++, Natural Language Processing (NLP), Generative Pre-trained Transformers (GPT), Data Cleansing, Mathematics, Algorithmic Trading

Postdoctoral Researcher

2011 - 2012
The Abdus Salam International Centre for Theoretical Physics (ICTP)
  • Published research papers in the field of cosmology.
  • Helped create teaching materials for undergraduates.
  • Analyzed galaxy survey data to constrain theoretical models of the universe.
Technologies: Mathematica, SciPy, NumPy, Python, Physics, Mathematics

Experience

Cost Reduction in Fraud Detection

Fraud detection systems often work in a human-in-the-loop fashion, where an ML model flags suspicious transactions for review by a human expert. My goal was to minimize the number of reviews without increasing the number of uncaught fraudulent transactions. I built a model to predict which transactions were likely to be rejected by a human expert. I was responsible for creating features and training the model using Scala and Spark ML in a batch job that ran daily. I deployed the model online using the MLeap framework. The next iteration of the model was implemented in Python and exported in PMML format for online deployment. I was also responsible for A/B testing the model. Because the test compared costs, which are noisy, variance reduction techniques had to be used to measure the impact of the new algorithm. The new algorithm reduced the workload of the human experts by one-third.
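The project description does not say which variance reduction technique was used. As an illustration only, the sketch below shows CUPED, one common choice, which adjusts a noisy metric with a pre-experiment covariate. The function name `cuped_adjust` and the sample numbers are hypothetical and not taken from the project.

```python
from statistics import mean, variance

def cuped_adjust(metric, covariate):
    """Return CUPED-adjusted metric values.

    theta = cov(Y, X) / var(X), and Y_adj = Y - theta * (X - mean(X)).
    Both lists should hold the pooled control and treatment units.
    """
    mean_x, mean_y = mean(covariate), mean(metric)
    n = len(metric)
    cov_xy = sum((x - mean_x) * (y - mean_y)
                 for x, y in zip(covariate, metric)) / (n - 1)
    theta = cov_xy / variance(covariate)
    return [y - theta * (x - mean_x) for x, y in zip(covariate, metric)]

# Hypothetical per-unit review costs before and during the experiment.
pre_cost = [10.0, 12.0, 9.0, 15.0, 11.0, 14.0]
post_cost = [11.0, 13.5, 9.5, 16.0, 10.0, 13.0]
adjusted = cuped_adjust(post_cost, pre_cost)
```

The adjustment leaves the mean unchanged and lowers the variance whenever the covariate correlates with the metric, so the same experiment can detect smaller cost differences.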

Dynamic Pricing of Hotel Rooms

I came up with the idea of applying counterfactual techniques to find the optimal room prices. By extending the decision tree of Spark ML, I implemented a greedy algorithm that maximizes margin. I designed and evaluated A/B tests to validate the approach. Later, I implemented iterative improvements of the original idea in Python. The company attributed a 20% increase in profit to the success of this project.

Click-through Rate Prediction for Online Advertising

I had to work around a legacy system when designing an algorithm for predicting the click-through rate for banner ads. Using an off-the-shelf algorithm was out of the question. We also faced a cold-start problem because the company was fairly new, and the data was very sparse. I implemented a tree-augmented naïve Bayesian model with hierarchical CTR smoothing. It was the first model we deployed, utilizing our system architecture and data collection practices to the fullest.
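Hierarchical CTR smoothing can take several forms, and the description does not specify one. A minimal sketch of one common form, assuming each level of a hierarchy (for example advertiser, then campaign, then banner) is shrunk toward the estimate of its parent, is shown below. The function names, the hierarchy, and the `prior_strength` value are assumptions for illustration.

```python
def smoothed_ctr(clicks, impressions, prior_ctr, prior_strength):
    """Shrink the observed CTR toward prior_ctr.

    prior_strength acts like a count of pseudo-impressions at the prior rate.
    """
    return (clicks + prior_strength * prior_ctr) / (impressions + prior_strength)

def hierarchical_ctr(levels, global_ctr, prior_strength=100.0):
    """Smooth from the coarsest level to the finest.

    levels: list of (clicks, impressions), coarsest first. Each level's
    estimate becomes the prior for the next, finer level.
    """
    estimate = global_ctr
    for clicks, impressions in levels:
        estimate = smoothed_ctr(clicks, impressions, estimate, prior_strength)
    return estimate

# Hypothetical counts: advertiser level, then a brand-new banner with no data.
cold_start = hierarchical_ctr([(500, 10000), (0, 0)], global_ctr=0.02)
```

A banner with no impressions inherits its parent's smoothed estimate, which is how this kind of scheme copes with sparse data and cold starts.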

Personalized Ranking

I implemented a factorization machine-based model that used context features to estimate the conversion rate for user-hotel pairs. I extended Scala and Spark ML to train the model. I cross-validated the model on several information retrieval metrics, such as NDCG and binary hit rate. This model was part of a model stack that generated a 1–2% increase in margin for the company.
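For readers unfamiliar with the two ranking metrics named above, the following sketch computes NDCG and binary hit rate for a single ranked list. It uses the linear-gain variant of DCG, which is an assumption; the project may have used a different gain or cutoff.

```python
import math

def dcg(relevances, k):
    """Discounted cumulative gain of the top-k items (linear gain)."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

def hit_rate(relevances, k):
    """1.0 if any relevant item appears in the top k, else 0.0."""
    return 1.0 if any(rel > 0 for rel in relevances[:k]) else 0.0

# relevances[i] is 1 if the hotel shown at position i was booked, else 0.
best_case = ndcg([1, 0, 0], k=3)    # booked hotel ranked first
worst_case = ndcg([0, 0, 1], k=3)   # booked hotel ranked last
```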

Inventory Matching

For matching hotels from suppliers' inventory to hotels in our own inventory, I used techniques from NLP (TF-IDF, character CNN) to create an embedding for hotel records. The hotels were compared in this embedding space. The model was deployed as a daily batch job. It increased the percentage of matching hotels from around 80% to around 90%. Removing duplicates from our search page significantly improved the user experience based on our internal UX research.
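The idea of comparing hotel records in an embedding space can be illustrated with a TF-IDF sketch. The version below works on character trigrams and uses cosine similarity, both of which are assumptions; the character CNN part of the project is not covered, and the hotel names are made up.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Lowercased character n-grams, padded with one space on each side."""
    padded = f" {text.lower().strip()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def tfidf_vectors(records, n=3):
    """Sparse TF-IDF vectors (dicts) with a smoothed inverse document frequency."""
    docs = [Counter(char_ngrams(record, n)) for record in records]
    doc_freq = Counter(gram for doc in docs for gram in doc)
    total = len(docs)
    idf = {gram: math.log((1 + total) / (1 + count)) + 1.0
           for gram, count in doc_freq.items()}
    return [{gram: tf * idf[gram] for gram, tf in doc.items()} for doc in docs]

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(gram, 0.0) for gram, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical records: the first two describe the same hotel.
records = ["Grand Palace Hotel Bangkok",
           "Grand Palace Hotel, Bangkok",
           "Seaside Inn Phuket"]
vectors = tfidf_vectors(records)
same_hotel = cosine(vectors[0], vectors[1])
different_hotel = cosine(vectors[0], vectors[2])
```

Character n-grams tolerate punctuation and spelling differences between suppliers, so near-duplicate records score much higher than unrelated ones.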

Anomaly Detection – Time Series

I led the work on the ML component of an internal tool to monitor the health of our IT systems. The modeling involved forecasting quantiles for hundreds of time series in a fully autonomous way. The models were written in Python using scikit-learn, NumPy, SciPy, and PyTorch and deployed via Docker containers. The loss due to technical errors decreased by 90% in the quarter following the deployment.
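To show how quantile forecasts can drive anomaly detection, the sketch below flags a point when it falls outside a lower and upper quantile band. The band here comes from empirical quantiles of a trailing window, a naive stand-in for the learned forecasters described above. The window size, quantile levels, and function names are assumptions. The pinball loss is included because it is the standard objective for training and scoring quantile forecasts.

```python
def pinball_loss(y_true, y_pred, q):
    """Quantile (pinball) loss for a single forecast at quantile level q."""
    diff = y_true - y_pred
    return q * diff if diff >= 0 else (q - 1) * diff

def empirical_quantile(values, q):
    """Simple order-statistic estimate of the q-th quantile."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, max(0, int(q * len(ordered))))
    return ordered[index]

def flag_anomalies(series, window=20, low=0.05, high=0.95):
    """Return (index, is_anomaly) for each point after the warm-up window."""
    flags = []
    for t in range(window, len(series)):
        history = series[t - window:t]
        lower = empirical_quantile(history, low)
        upper = empirical_quantile(history, high)
        flags.append((t, series[t] < lower or series[t] > upper))
    return flags

# Hypothetical metric: a repeating pattern followed by one spike.
series = [10.0 + (i % 5) for i in range(30)] + [100.0]
flags = flag_anomalies(series)
```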

Education

2006 - 2011

PhD in Physics

University of Hawaii at Manoa - Honolulu, Hawaii, USA

2001 - 2006

Master of Science Degree in Physics

ELTE - Budapest, Hungary

Certifications

NOVEMBER 2024 - PRESENT

Data Engineering. AI Data Engineering

Amazon Web Services | via Coursera

JULY 2023 - PRESENT

Generative AI with Large Language Models

Coursera

DECEMBER 2020 - PRESENT

Natural Language Processing Specialization

Coursera

NOVEMBER 2020 - PRESENT

Generative Adversarial Networks (GANs) Specialization

Coursera

JULY 2020 - PRESENT

AI for Medicine Specialization

Coursera

APRIL 2020 - PRESENT

Self-Driving Cars Specialization

Coursera

Skills

Libraries/APIs

Spark ML, NumPy, Pandas, PyTorch, Scikit-learn, SciPy, Hugging Face Transformers

Tools

Git, PyCharm, IntelliJ IDEA, Mathematica, BigQuery

Languages

Python, Python 3, Scala, SQL, C++, PMML

Paradigms

ETL, Anomaly Detection

Frameworks

Spark, Apache Spark, Flask

Platforms

Visual Studio Code (VS Code), Amazon Web Services (AWS), Google Cloud Platform (GCP), Docker, Jupyter Notebook

Storage

Apache Hive, HDFS, NoSQL, Data Pipelines

Other

Machine Learning, Mathematics, Data Science, Physics, Artificial Intelligence (AI), Supervised Machine Learning, Data Analysis, Data Cleansing, Data Scientist, Statistics, Deep Learning, Time Series, Natural Language Processing (NLP), Big Data, Recommendation Systems, DBSCAN, Clustering, Clustering Algorithms, K-means Clustering, Google BigQuery, Data Processing, Decision Trees, Scripting, A/B Testing, Machine Vision, Time Series Analysis, Click-through Rates (CTR), Naive Bayes, Advertising, Revenue Optimization, Operations Research, Generative Pre-trained Transformers (GPT), Hugging Face, Risk Modeling, Optimization, Scientific Computing, GC Mass Spectrometry, Computer Vision, Robotics, Large Language Models (LLMs), Transformer Models, LoRA, AI Agents, LangChain, Algorithmic Trading, Data Engineering, Vector Databases, Streaming Data, Data Warehousing, Data Visualization
