Matteo Pallini

Verified Expert in Engineering

Data Scientist and Software Developer

Location

London, United Kingdom

Toptal Member Since

September 10, 2020

Matteo is a data scientist, machine learning engineer, and software developer with a BSc and MSc in economics and statistics. He has worked as a data analyst, modeler, and developer, mainly in small tech companies. Matteo works with Python, SQL, Git, Bash, MongoDB, and Docker, has built regression analyses and tree-based models, and has applied NLP and web scraping techniques.

Data Analysis, Python, Pandas, Web Scraping, Git, MySQL, Scrapy, Machine Learning, Time Series, Ubuntu, PostgreSQL, MongoDB, Docker, Time Series Analysis, PyCharm

Portfolio

Migacore Technologies

Travel, Data Analysis, Machine Learning, Time Series, Data Science...

Iwoca

Loans & Lending, Business Loans, Marketing, Data Analysis, Time Series...

Experience

Python - 4 years, Pandas - 4 years, Data Science - 3 years, Machine Learning - 2 years, Scikit-learn - 2 years, Web Scraping - 1 year

Availability

Part-time

Preferred Environment

Jupyter Notebook, PyCharm, Ubuntu

The most amazing...

...thing I've built was a scraping pipeline that extracted the number of attendees from ~200,000 event websites, using a combination of regex and NLP techniques.

Work Experience

Machine Learning Engineer

2019 - PRESENT

Migacore Technologies

Determined that the process used to flag events relevant for travel demand was time-consuming and possibly biased. Transitioned to an XGBoost model trained on manually labeled events, reducing the time to add features to the pipeline by 70%.
Extracted event characteristics from the relevant websites, using a combination of XPath, regex, and NLP. The features built (e.g., attendee numbers and presence of sponsored airline offers) had accuracy rates ranging from 80 to 95%.
Scraped websites for events likely to generate uplifts in flight demand.

Technologies: Travel, Data Analysis, Machine Learning, Time Series, Data Science, Jupyter Notebook, PyCharm, Time Series Analysis, Gradient Boosted Trees, Scrapy, Ubuntu, Python, Matplotlib, Pandas, SpaCy, Docker, Scikit-learn, MongoDB

Data Scientist/Software Engineer

2015 - 2019

Iwoca

Improved the accuracy of credit scorecards through the creation and inclusion of a logistic regression model.
Automated credit checks and credit application rejections, reducing the frequency of manual interventions by 15 to 40%.
Created the MySQL marketing database and integrated it with internal and external platforms. The database, storing approximately 3.5 million leads, allowed Iwoca to optimize its marketing channels and grow the main one by more than 125%.
Built tools that allowed the strategy team to monitor and forecast financial metrics and loss statistics. Acquired and applied extensive knowledge of Pandas and Matplotlib during this initiative.

Technologies: Loans & Lending, Business Loans, Marketing, Data Analysis, Time Series, Data Science, Jupyter Notebook, PyCharm, Python, Scikit-learn, Matplotlib, Pandas, Regression Modeling, Git, PostgreSQL, MySQL

Experience

Scraping Visitor Numbers from Big Events Websites

The aim of the project was to scrape the number of attendees from approximately 200,000 event websites. These events ranged from major sports events, such as F1 races and football matches, to trade expos, conferences, and music festivals. The diversity of the website audiences made it fairly complex to design an approach that was general enough to work for all websites while keeping the number of false positives acceptable.

The final pipeline first extracted the website text using Scrapy. Regex then isolated the paragraphs that contained a number and referred to visitors. From this set, only the cases in which the number referred to the event itself were kept, using a combination of SpaCy named-entity recognition (NER) and some NLTK utilities.
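
The regex step described above could look roughly like the following. This is a minimal sketch and not the project's actual code: the patterns, the keyword list, and the function names candidate_paragraphs and extract_counts are all assumptions made for illustration. It only covers the paragraph filter; the Scrapy extraction and the NER/NLTK filtering are not shown.

```python
import re

# Hypothetical pattern: numbers with thousands separators (45,000)
# or plain runs of three or more digits (1200).
NUMBER = re.compile(r"\b\d{1,3}(?:,\d{3})+\b|\b\d{3,}\b")

# Hypothetical keyword list for words that refer to an audience.
AUDIENCE = re.compile(
    r"\b(?:attendees?|visitors?|participants?|delegates?|spectators?)\b",
    re.IGNORECASE,
)


def candidate_paragraphs(text):
    """Return paragraphs containing both a number and an audience keyword."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [p for p in paragraphs if NUMBER.search(p) and AUDIENCE.search(p)]


def extract_counts(paragraph):
    """Return every number in the paragraph as an int, separators removed."""
    return [int(m.group().replace(",", "")) for m in NUMBER.finditer(paragraph)]


# Made-up page text, used only to exercise the functions.
sample = (
    "The expo welcomed 45,000 visitors in 2019.\n\n"
    "Tickets cost 120 euros per day.\n\n"
    "Contact us for sponsorship details."
)

for paragraph in candidate_paragraphs(sample):
    print(paragraph, extract_counts(paragraph))
```

On the sample text, only the first paragraph passes the filter, and both 45,000 and the year 2019 come back as candidate numbers. Cases like the year are why a later filtering step, such as the NER-based one mentioned above, is needed to keep only numbers that actually refer to the event's audience.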

In the end, this process allowed us to extract visitor numbers from websites with a false positive rate below 20%. This was a fairly small percentage, considering the broad variety of websites and the fact that it was achieved in approximately three weeks of work.

Skills

Languages

Python

Libraries/APIs

Pandas, Matplotlib, Scikit-learn, SpaCy

Other

Data Analysis, Web Scraping, Loans & Lending, Statistics, Econometrics, Regression Modeling, Time Series Analysis, Bayesian Statistics, Gradient Boosted Trees, Time Series, Machine Learning, Business Loans, Travel

Frameworks

Scrapy

Tools

Git, PyCharm

Paradigms

Data Science

Platforms

Jupyter Notebook, Ubuntu, Docker

Storage

MySQL, PostgreSQL, MongoDB

Industry Expertise

Marketing

Education

2012 - 2015

Master of Science Degree in Economics and Statistics

Bocconi University - Milan, Italy

2009 - 2012

Bachelor of Science Degree in Economics and Statistics

Bocconi University - Milan, Italy
