Julián Peller, Developer in Buenos Aires, Argentina

Julián Peller

Verified Expert in Engineering

Machine Learning Engineer and Developer

Location
Buenos Aires, Argentina
Toptal Member Since
June 27, 2017

Julián is an autonomous, curious, and self-driven data scientist with a solid theoretical background—an MSc in computer science—and more than 14 years in the software industry. During this time, he has worked in different roles, projects, and types of companies. Julián is also a Python expert and enthusiast and is specializing in—and very passionate about—deep learning.

Portfolio

Selective Wealth Management
Python, Pandas, Tableau, Plotly, Azure, Azure SQL, Azure Cache, Flask...
PSMP (via Toptal)
SpaCy, Selenium, Python, SQL, MongoDB, Amazon Web Services (AWS), Amazon EC2...
Etermax
Amazon Web Services (AWS), Redshift, Apache Airflow, Spark, MLflow, Fast.ai...

Experience

Availability

Part-time

Preferred Environment

Visual Studio Code (VS Code), Jupyter Notebook, Git, Linux

The most amazing...

...recommender system I’ve built is a Prod2Vec-based item-item model for the real estate domain, scaled up with k-means and approximate nearest neighbors.

Work Experience

Data Scientist

2020 - 2022
Selective Wealth Management
  • Implemented financial models and various forecasts for five industries covering up to 40,000 companies worldwide, working together with a CFA, using S&P and Bloomberg data sources.
  • Productionized the models and forecasts into a multithreaded and scalable system.
  • Orchestrated a cost-effective production environment in Azure.
  • Wrote an optimization module based on SciPy for identifying shares and options opportunities.
Technologies: Python, Pandas, Tableau, Plotly, Azure, Azure SQL, Azure Cache, Flask, Data Science, SciPy, Bloomberg API, SQL, Git, NumPy, Linux, Software Design, Requirements Analysis, Redis, Bash, Windows, Jupyter Notebook, Data Preparation, Data Modeling, Matplotlib, Automated Testing, Requests, Data Analysis, Visual Studio Code (VS Code)

Data Scientist

2020 - 2020
PSMP (via Toptal)
  • Developed a web scraper based on Selenium that gathers various data points for a given company from Google, LinkedIn, and generic websites (a minimal sketch follows this entry).
  • Integrated the scraper module into a running website.
  • Performed web admin housekeeping for the existing website; renewed SSL certificates, created a functional dev environment, corrected cron processes, and more.
Technologies: SpaCy, Selenium, Python, SQL, MongoDB, Amazon Web Services (AWS), Amazon EC2, NumPy, Linux, Software Design, Requirements Analysis, Pandas, Bash, Git, Natural Language Toolkit (NLTK), Jupyter Notebook, Data Preparation, Data Modeling, Matplotlib, Requests, Natural Language Processing (NLP), Generative Pre-trained Transformers (GPT), GPT, Data Analysis
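
As a minimal sketch of that kind of Selenium scraping, assuming headless Chrome and a hypothetical target URL and selector:

    # Minimal sketch: headless-Chrome scraping with Selenium 4.
    # The URL and CSS selector below are hypothetical placeholders.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get("https://www.example.com/search?q=Acme+Corp")
        # Collect the text of every result title on the page.
        titles = [el.text for el in driver.find_elements(By.CSS_SELECTOR, "h3")]
        print(titles)
    finally:
        driver.quit()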

Data Scientist

2019 - 2020
Etermax
  • Implemented an image moderation system with convolutional networks using fast.ai.
  • Constructed a win-rate prediction model for content personalization. Given a user and a question, it predicts whether they will answer correctly and with what probability.
  • Created an API with MLflow for the win-rate predictor, stress tested it with Locust, and optimized it, reducing the latency from 400 to 20 milliseconds.
  • Built and took to production an LTV (lifetime value) model for one of the newest games, reaching the same error rate as existing and well-established LTV models for historical games, using partial and incomplete user data.
  • Made a general LTV model for any newly released game, which assumes a logarithmic curve for each user's cumulative revenue and projects it.
  • Validated the previous logarithmic model over games with existing data, comparing its MAPE against trainable models with known good performance. Optimized the model, which fits a linear regression per user, to run fast with multiprocessing (a minimal sketch of the projection follows this entry).
  • Implemented a rule-based categorization system for questions with high coverage for more than 100 classes with spaCy.
  • Applied a rule-based bidding optimization system for user acquisition (UA) campaigns on Facebook Ads.
  • Developed, deployed, and monitored all the previously listed models on AWS. Functioned as the go-to technical role in the data science team with regard to the infrastructure.
Technologies: Amazon Web Services (AWS), Redshift, Apache Airflow, Spark, MLflow, Fast.ai, PyTorch, Python, Pandas, Machine Learning, Data Science, Amazon EC2, NumPy, Recommendation Systems, Linux, Deep Learning, Software Design, Requirements Analysis, Neural Networks, SQL, Bash, Git, PySpark, SpaCy, Agile Software Development, Natural Language Toolkit (NLTK), Jupyter Notebook, Data Preparation, Data Modeling, Matplotlib, Spark ML, Automated Testing, Continuous Integration (CI), Requests, Clustering, Generative Pre-trained Transformers (GPT), GPT, Natural Language Processing (NLP), Data Analysis, Gensim, Computer Vision
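
As a minimal sketch of the logarithmic LTV projection mentioned above, with illustrative data and a hypothetical day-365 horizon:

    # Minimal sketch: fit y = a*log(1 + day) + b per user on early
    # cumulative revenue, then project it forward. Numbers are illustrative.
    import numpy as np

    def project_ltv(days: np.ndarray, cum_revenue: np.ndarray, horizon: int = 365) -> float:
        # Linear regression on log1p(day); np.polyfit returns (slope, intercept).
        a, b = np.polyfit(np.log1p(days), cum_revenue, deg=1)
        return a * np.log1p(horizon) + b

    # A user's first week of cumulative revenue.
    days = np.arange(1, 8)
    cum_revenue = np.array([0.5, 0.9, 1.1, 1.3, 1.4, 1.5, 1.6])
    print(project_ltv(days, cum_revenue))  # projected LTV at day 365

Since each user gets their own tiny regression, the fitting step is embarrassingly parallel, which is why multiprocessing sped it up so effectively.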

Data Scientist

2019 - 2020
Celerative
  • Analyzed, corrected, and led the progress of a recommendation systems project for the fitness industry.
  • Homogenized models, set baselines, and created a general offline evaluation environment, discarding trivial LightFM models with bad performance.
  • Conducted stress tests with Locust and identified important problems with the existing memory-based item-item collaborative filtering model.
  • Wrote and evaluated various recommender models (Spark's ALS, Implicit, and Prod2Vec).
  • Coached a trainee data scientist to write a user-user collaborative filtering model.
  • Found a model that beat the IICF considered until that moment, in terms of both offline metrics and computation and memory usage.
  • Deployed the recommender models in Google App Engine (GAE).
  • Led the deployment and the A/B test of one of the models (ALS matrix factorization with the Implicit library), which achieved a 200% improvement in CTR over the existing baseline model (a minimal sketch follows this entry).
Technologies: Flask, Apache Airflow, Google Cloud Platform (GCP), Spark, Python, Pandas, Machine Learning, Data Science, Amazon EC2, NumPy, Recommendation Systems, Linux, Software Design, Requirements Analysis, Google Cloud, MySQL, SQL, Bash, Git, Amazon Web Services (AWS), Google App Engine, Jupyter Notebook, Data Preparation, Data Modeling, Matplotlib, Data Analysis, Gensim, Google Compute Engine (GCE), SciPy
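
As a minimal sketch of that ALS model, assuming the Implicit library's post-0.5 API and an illustrative interaction matrix:

    # Minimal sketch: ALS matrix factorization with the Implicit library
    # (implicit >= 0.5 API). The confidence matrix below is illustrative.
    import numpy as np
    import scipy.sparse as sp
    from implicit.als import AlternatingLeastSquares

    # User-item confidence matrix (rows: users, columns: items).
    user_items = sp.csr_matrix(np.array([
        [1, 0, 3, 0],
        [0, 2, 0, 1],
        [4, 0, 0, 1],
    ], dtype=np.float32))

    model = AlternatingLeastSquares(factors=16, regularization=0.01, iterations=15)
    model.fit(user_items)

    # Top-N recommendations for user 0: returns item ids and scores.
    ids, scores = model.recommend(0, user_items[0], N=2)
    print(ids, scores)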

Data Science Teacher

2019 - 2020
Digital House
  • Lectured and created materials for a five-month-long data science course for university students.
  • Taught Python fundamentals for data science: NumPy, Pandas and visualization libraries, SQL, data cleaning and preprocessing, machine learning fundamentals, supervised and unsupervised machine learning algorithms, and text mining.
  • Taught descriptive and inferential statistics, APIs, and web scraping.
  • Designed exams and coordinated a full-time auxiliary teacher as well as a handful of other occasional teachers at the school.
Technologies: Flask, Requests, Spark, Python, Pandas, Machine Learning, Data Science, NumPy, Jupyter Notebook, Data Preparation, Data Modeling, Matplotlib, GPT, Generative Pre-trained Transformers (GPT), Natural Language Processing (NLP), Data Analysis, Clustering, Plotly

Data Scientist

2017 - 2019
Navent
  • Implemented a Prod2Vec item-item recommendation system from scratch. Scaled it up with Spark's k-means, approximate nearest neighbors, and multiprocessing.
  • Scaled up the principal recommendation system to handle ten times its original data using Spark's ALS, Annoy, Cassandra, and GCP.
  • Wrote reports with Jupyter notebooks and Google Data Studio.
  • Segmented users via clustering techniques.
  • Prepared for ECI 2018's data competition (dc.uba.ar/events/eci/2018/charlas-y-eventos/even-acad/competencia).
  • Participated in a duplication detection system project using pHash, MongoDB, Cassandra, and GCP.
  • Improved and scaled the recommender events tracking system using Pub/Sub, Dataflow, and BigQuery.
Technologies: MySQL, MongoDB, Cassandra, Apache Kafka, Spring Boot, BigQuery, Spark, Python, Pandas, Machine Learning, Data Science, NumPy, Recommendation Systems, Linux, Software Design, Requirements Analysis, Google Cloud, SQL, Bash, Git, PySpark, Java, Google Cloud Platform (GCP), Jupyter Notebook, Data Preparation, Data Modeling, Matplotlib, Eclipse IDE, Spark ML, Terraform, Automated Testing, Clustering, Data Analysis, Gensim, Google Compute Engine (GCE)

Software Engineer

2014 - 2017
NVIDIA
  • Led a kernel-level integration project between Mellanox and an important Chinese company on a tight schedule—rapidly incorporating new knowledge and working with people in four different time zones.
  • Analyzed the requirements, researched, designed, and implemented the infrastructures for continuous static and coverage analysis for the company's main software product (Mellanox operating system).
  • Designed, configured, and implemented a scalable continuous testing system for long-term testing of released products over different switches.
  • Automated the whole development cycle for five different products with a graphic queue-based tool. For example: beautify code, check for Coverity defects, compile different architectures, install an operating system on switches, and run unit and integration tests.
  • Wrote bindings and a CLI for a low-level C library using IPython, SWIG, and Python's inspect module.
  • Worked on the redesign and the implementation of the current CI infrastructure.
  • Migrated various separated projects into one using Git submodules.
  • Presented solutions to audiences of different sizes (five, 20, 80), from different countries, and within and outside the company.
  • Worked in teams of different sizes with people from all around the world (Israel, US, Russia, Ukraine, China, and India). Performed at the Israeli offices until February 2016, when I relocated to Buenos Aires.
Technologies: Jenkins, Bash, Python, Linux, Software Design, Requirements Analysis, MySQL, SQL, Git, Agile Software Development, DevOps, GNUMake, Automated Testing, Continuous Integration (CI)

Software Engineering Intern

2014 - 2014
NVIDIA
  • Completed a three-month internship at the company's Israeli offices.
  • Worked on simple DevOps tasks—writing tools and utilities with Python.
  • Coded various utilities for the main C developers.
Technologies: Python, Linux, Software Design, Requirements Analysis, SQL, Bash, Git, Agile Software Development, DevOps

Software Developer

2012 - 2014
Planisys
  • Designed and developed the cornerstone system of the company from scratch: a high-traffic email marketing system, similar to MailChimp, with a strong data analysis module and demanding performance, availability, and concurrency requirements.
  • Used Ansible to deploy, maintain, and support the previously described system for more than 40 clients, including Grupo Clarín—the most important mass media group in Argentina.
  • Wrote a pay-per-view system for Samsung Smart TVs for classical music streaming using LAMP, FFmpeg, and the PayPal API.
  • Migrated a client's huge database from PostgreSQL to MySQL.
  • Moved a customer's email client code from SVN to Git.
Technologies: Redis, Ansible, JavaScript, CSS, HTML, SQL, Linux, Software Design, Requirements Analysis, MySQL, Bash, Git, DevOps

Teaching Assistant (Programming Paradigms)

2011 - 2012
University of Buenos Aires, Faculty of Exact and Natural Sciences
  • Served as a teaching assistant in an advanced course of the Master's in Computer Science program.
  • Gave classes on object-oriented programming with Smalltalk and on data types in functional programming with Haskell.
  • Wrote and graded exams and assisted students with their tasks and assignments.
Technologies: C, C++, Haskell, Prolog, Smalltalk

Software Developer

2008 - 2012
Freelance and a Small Startup
  • Worked for a small local startup, where I created and deployed custom SugarCRM modules for different customers.
  • Designed, implemented, deployed, and maintained CRMs and other customer-specific CRUD software solutions for small clients.
  • Developed a web-based survey application that allowed the creation, management, and analysis of the results of customer surveys.
Technologies: JavaScript, CSS, HTML, MySQL

Kaggle Notebook Grandmaster Ranked 6th out of 200,000

http://www.kaggle.com/julian3833/code
I am a Kaggle Notebook Grandmaster, the first and only one from my country. I have published various popular notebooks on Kaggle—with more than 3,600 upvotes and 5,400 forks—reaching 6th place out of 200,000 on that leaderboard.

My main contributions are:

• PyTorch transformer models for the token classification NLP competition "Feedback Prize—Evaluating Student Writing": Sentence Classifier baseline and ShortFormers w/Chunks. Other notebooks for this competition: RoBERTa intra-task pre-training and Topic Modeling with LDA.

• A straightforward, high-scoring Naive Bayes model that opened the path to simple linear models, which surprisingly ended up performing similarly to transformers in the NLP competition "Jigsaw Rate Severity of Toxic Comments" (a minimal sketch follows this list).

• A series of notebooks covering various preprocessing steps and explaining different public models in the multilingual NLP question-answering competition "chaii—Hindi and Tamil QA."

• PyTorch models for the instance segmentation competition "Sartorius—Cell Instance Segmentation": Mask R-CNN, U-Net, and a classifier.

• PyTorch models for the object detection competition "TensorFlow—Help Protect the Great Barrier Reef": Faster R-CNN, DETR, and an intelligent cross-validation strategy proposal.
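
As a minimal sketch of that kind of TF-IDF plus Naive Bayes severity scorer (the tiny dataset and column names are illustrative, not the competition's schema):

    # Minimal sketch: TF-IDF features into Multinomial Naive Bayes;
    # predict_proba yields a continuous score usable for ranking severity.
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    train = pd.DataFrame({
        "text": ["have a nice day", "you are an idiot", "great answer", "I hate you"],
        "toxic": [0, 1, 0, 1],
    })

    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
    model.fit(train["text"], train["toxic"])

    # Rank new comments by predicted toxicity.
    scores = model.predict_proba(["you are an idiot", "great answer"])[:, 1]
    print(scores)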

Prod2Vec Item-Item Recommender System

Prod2Vec is an innovative application of Word2vec to the recommender systems domain. Word2vec uses a large corpus of sentences to learn low-dimensional word embeddings. The key insight is that similar words appear surrounded by similar words.

Prod2Vec uses the same concept, but it learns embeddings for products instead of words. Instead of sentences (sequences of words), it trains on sequences of products, such as user navigation or purchase histories.

With the low-dimensional embeddings, we can compute similarities and neighborhoods among products, which is the core of an item-item recommender model.

At Navent, I implemented this recommender system from scratch using Gensim. Once the POC worked, a few tweaks were necessary to reach a reasonable computation time. First, k-means clustered the items into groups to avoid computing similarities across all products. Second, Annoy, an approximate nearest neighbors library by Spotify that trades some accuracy for performance, replaced the usual nearest neighbors module.
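
As a minimal sketch, assuming the Gensim 4.x and Annoy APIs and illustrative browsing sessions:

    # Minimal Prod2Vec sketch: learn product embeddings from browsing
    # sessions with Word2Vec, then index them with Annoy for fast
    # approximate nearest-neighbor lookups. Sessions are illustrative.
    from annoy import AnnoyIndex
    from gensim.models import Word2Vec

    sessions = [
        ["prod_1", "prod_7", "prod_3"],
        ["prod_7", "prod_3", "prod_9"],
        ["prod_1", "prod_9", "prod_7"],
    ]

    # Skip-gram Word2Vec over product sequences (Gensim 4.x API).
    model = Word2Vec(sentences=sessions, vector_size=32, window=5, min_count=1, sg=1)

    # Annoy index: trades a little accuracy for much faster neighbor queries.
    index = AnnoyIndex(32, "angular")
    for i, product in enumerate(model.wv.index_to_key):
        index.add_item(i, model.wv[product])
    index.build(10)  # number of trees

    # Item-item recommendations: nearest neighbors of a seed product.
    seed = model.wv.key_to_index["prod_7"]
    print([model.wv.index_to_key[j] for j in index.get_nns_by_item(seed, 3)])

The angular metric corresponds to cosine similarity, the usual choice for comparing embeddings.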

During the A/B test, the model obtained a 17.5% increase in CTR over the previous production model, and it is currently in production.

Image Moderator with Fast.ai

In Trivia Crack, users can create content themselves. This question factory, as it is called, is very engaging, but it also opens the door to low-quality or even harmful content, so a manual moderation process filters that content out.

My onboarding project at Etermax was an image moderation tool: a binary classifier built with Fast.ai. The tool allowed us to present images to the moderators ordered by likelihood of approval and, eventually, to fully automate this tedious task. Moreover, the model increased the throughput of the content-creation pipeline and gave us a more extensive set of high-quality questions with images (known to be more engaging).

The project soon went to production, and I became enthusiastic about Fast.ai; I then took the rest of its courses and started a personal side project with ULMFiT.

The production model uses a simple ResNet34 and handles duplicated images with contradictory labels using p-hashes. It reaches 90% accuracy with a simple training trick that runs a few epochs over shrunk versions of the final-sized images, a kind of intermediate pre-training before the final fine-tuning.
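
As a minimal sketch of that progressive-resizing trick, assuming the fastai v2 API and an illustrative approved/rejected folder layout:

    # Minimal sketch (fastai v2): a few cheap epochs on shrunk images as
    # intermediate pre-training, then fine-tuning at the final size.
    # The folder layout (images/approved, images/rejected) is illustrative.
    from fastai.vision.all import *

    path = Path("images/")

    # Phase 1: train briefly on small (128px) versions of the images.
    dls = ImageDataLoaders.from_folder(path, valid_pct=0.2, item_tfms=Resize(128))
    learn = vision_learner(dls, resnet34, metrics=accuracy)
    learn.fine_tune(3)

    # Phase 2: swap in the full-sized (256px) images and keep fine-tuning.
    learn.dls = ImageDataLoaders.from_folder(path, valid_pct=0.2, item_tfms=Resize(256))
    learn.fine_tune(2)

    # Rank images by predicted approval likelihood for the moderation queue.
    probs, _ = learn.get_preds(dl=learn.dls.valid)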

Stack Overflow Top 10% Answerer in Python and Machine Learning, Top 20% in Pandas and scikit-learn

https://stackoverflow.com/users/3254400/dataista
I have answered various Python and data science questions on Stack Overflow. Currently, I am among the top 10% of answerers for Python and machine learning on the platform, and among the top 20% for Pandas and scikit-learn.

GPT-2 Large Experiments

https://www.kaggle.com/julian3833/gpt-2-large-774m-w-pytorch-not-that-impressive
GPT-2 is a deep learning language model that OpenAI released in stages during 2019. It attracted a lot of attention because, in the original announcement, OpenAI showed a handful of human-level language generation examples of impressive length and complexity. Furthermore, they initially withheld the pre-trained weights of the larger models, worried about the harm potential of such an advanced language model, and used this milestone to propose a broad discussion about the dangers of advanced NLP.

My experiment consisted of running the largest version of GPT-2 available at the time (774M parameters) on the conditional prompts from the announcement's examples to check whether the actual output was as good as advertised. I used the most advanced sampling mechanisms known for language generation, top-k and top-p sampling, with different configurations. Sadly, none of the results were as good as the ones in the blog post; they were good, but not that impressive. There are challenging theoretical problems involved in NLG with deep networks, as the authors of top-p sampling show in their paper. For these experiments, I used Hugging Face's Transformers library.
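
As a minimal sketch of those sampling runs, using Hugging Face's Transformers (the prompt is the well-known unicorn example from OpenAI's announcement):

    # Minimal sketch: conditional generation with GPT-2 large (774M) using
    # top-k and top-p (nucleus) sampling via Hugging Face Transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2-large")
    model = AutoModelForCausalLM.from_pretrained("gpt2-large")

    prompt = "In a shocking finding, scientists discovered a herd of unicorns"
    inputs = tokenizer(prompt, return_tensors="pt")

    # do_sample enables stochastic decoding; top_k and top_p restrict the
    # candidate tokens at each step, trading diversity for coherence.
    outputs = model.generate(
        **inputs,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        max_new_tokens=200,
        pad_token_id=tokenizer.eos_token_id,
    )
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))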

Win-rate Predictor Using MLflow

At Etermax, we developed a win-rate predictor. Given a user and a question, the model assesses whether the user will answer the question correctly, and with what probability.

We used MLflow for hyperparameter tuning and for publishing and exposing the production model trained from Airflow. The model is stored by MLflow in S3 and picked up by a Flask-Gunicorn API. Initially, we used the API service provided by MLflow, but since a request to DynamoDB was needed before accessing the model itself, we had an extra HTTP request, which made everything too slow. It was also not easy to integrate with New Relic, so we forked a customized version of the MLflow API.
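
As a minimal sketch of that serving layer, assuming a hypothetical S3 model URI and JSON feature payload:

    # Minimal sketch: load the MLflow-logged model from S3 and expose it
    # through a small Flask app (run under Gunicorn in production).
    # The model URI and JSON fields are hypothetical.
    import mlflow.pyfunc
    import pandas as pd
    from flask import Flask, jsonify, request

    model = mlflow.pyfunc.load_model("s3://models-bucket/win-rate/latest")
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        # Expects the user/question features as a flat JSON object.
        features = pd.DataFrame([request.get_json()])
        return jsonify({"win_rate": float(model.predict(features)[0])})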

We started with a latency of 400 ms and a requirement to answer in less than 80 ms to keep it viable as an online model. Using Locust.io as a stress-test framework and New Relic to break down the timings, we reached a 20 ms response time under everyday load while compromising minimal accuracy.
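
As a minimal sketch of that kind of Locust stress test, with a hypothetical endpoint and payload:

    # Minimal Locust stress-test sketch for the prediction endpoint.
    from locust import HttpUser, between, task

    class WinRateUser(HttpUser):
        wait_time = between(0.01, 0.1)  # aggressive pacing to load the API

        @task
        def predict(self):
            self.client.post("/predict", json={"user_id": 123, "question_id": 456})

Running locust -f locustfile.py --host http://localhost:5000 and ramping up simulated users exposes the latency percentiles under load.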

Education

2006 - 2016

Master of Science Degree in Computer Science

University of Buenos Aires - Buenos Aires, Argentina

Certifications

JANUARY 2022 - PRESENT

Deep Neural Networks with PyTorch by IBM

Coursera

JANUARY 2022 - PRESENT

NLP: 5-course Specialization by Deeplearning.ai

Coursera

OCTOBER 2018 - PRESENT

Deep Learning: 5-course Specialization by Deeplearning.ai

Coursera

JULY 2018 - PRESENT

Recommender Systems: 5-course Specialization by University of Minnesota

Coursera

JUNE 2018 - PRESENT

Applied Data Science with Python: 5-course Specialization by University of Michigan

Coursera

MARCH 2018 - PRESENT

Google Cloud Platform Big Data and Machine Learning Fundamentals

Coursera

MAY 2017 - PRESENT

Machine Learning by Stanford University

Coursera

Skills

Libraries/APIs

Scikit-learn, Pandas, NumPy, Matplotlib, Fast.ai, PyTorch, SpaCy, Requests, SciPy, Bloomberg API, Spark ML, PySpark, Apache Lucene, Stanford NLP, Natural Language Toolkit (NLTK)

Tools

Git, Google Compute Engine (GCE), IntelliJ IDEA, Apache Airflow, Tableau, Plotly, Terraform, BigQuery, Gensim, Eclipse IDE, GNUMake, LaTeX, Ansible, Jenkins

Languages

Python, Bash, SQL, R, CSS, JavaScript, HTML, Java

Platforms

Jupyter Notebook, Linux, Amazon EC2, Visual Studio Code (VS Code), Google App Engine, Amazon Web Services (AWS), Google Cloud Platform (GCP), Azure, Apache Kafka, Windows

Frameworks

Flask, Spring Boot, Selenium, Spark

Paradigms

Data Science, Requirements Analysis, REST, Continuous Integration (CI), Automated Testing, Agile Software Development, DevOps

Storage

MySQL, Redshift, Azure SQL, Azure Cache, Google Cloud, Cassandra, Redis, MongoDB

Other

Recommendation Systems, Data Preparation, Data Modeling, Neural Networks, Machine Learning, Software Design, Deep Learning, MLflow, Clustering, Data Analysis, Google Data Studio, Pub/Sub, Natural Language Processing (NLP), Natural Language Understanding (NLU), Computer Vision, GPT, Generative Pre-trained Transformers (GPT)
