Leandro Roser

Verified Expert in Engineering

Data Scientist and Machine Learning Developer

Buenos Aires, Argentina

Toptal member since September 15, 2021

Expertise

Data Mining, Data Visualization, Data Science, Data Engineering, Machine Learning, Big Data Architecture, Artificial Intelligence, Deep Learning, Azure, Linux, R, Python, ETL, Docker

Bio

Leandro is a machine learning and MLOps engineer with over 14 years of experience in the data field. He has a solid background in ML infrastructure development, machine learning, and statistics. He is proficient in using tools such as Python, Docker, Terraform, CI/CD frameworks, and several AWS services. Leandro excels at transforming ideas into robust, end-to-end products. He is open to on-site job relocation in the European Union.

Portfolio

Prisma Medios de Pago

Amazon, Terraform, Amazon SageMaker, Docker, Python, Serverless...

Le Wagon

Python, Scikit-learn, Pandas, TensorFlow, Docker, Data Visualization

Toptal Client

Data Science, Bioinformatics, Genomics, Machine Learning

Experience

Linux - 13 years
Python - 7 years
Machine Learning Operations (MLOps) - 6 years
Machine Learning - 6 years
Docker - 6 years
APIs - 5 years
Amazon Web Services (AWS) - 5 years
Azure - 3 years

Preferred Environment

Linux, Windows, Visual Studio, Slack, Python, Azure, Amazon Web Services (AWS), Git

The most amazing...

...project I've developed is a conversational recommendation system based on RAG, GPT-3, and Mixture of Experts, built when this technology was at its earliest point of adoption.

Work Experience

MLOps Engineer

2024 - PRESENT

Prisma Medios de Pago

Built a serverless MLOps platform for automatic deployment of ML models.
Reduced time-to-production from months to a few weeks. Streamlined the process of getting models into an automated setup that generates business value with minimal friction.
Interacted with data scientists, DevOps engineers, and other platform members to achieve the best solutions.
Designed the architecture from scratch for different training/inference scenarios (e.g., batch/on-demand API-based inference) using best practices and validated the solution with company solution architects.
Used AWS tools such as Step Functions, SageMaker, DynamoDB and S3. Built custom Terraform modules to deploy the solutions into different environments. Developed CI/CD code to automate the deployment process using GitLab in development/production accounts.
Generated custom libraries for the infrastructure. Created Python packages and modules from the data science notebooks and edited the original code.
Wrote unit tests and security scanning steps and performed automatic static code checks and other quality control steps during CI.
Collaborated with DevOps engineers and defined IAM roles with minimal permissions for different aspects of the infrastructure. Analyzed CloudTrail logs. Worked on other aspects needed for optimal operations, such as cross-account access.
Worked on a transparent logging system that, for example, captures Athena or SageMaker failures and logs them to CloudWatch.

Technologies: Amazon, Terraform, Amazon SageMaker, Docker, Python, Serverless, Data Architecture, Amazon S3 (AWS S3), AWS Lambda, Amazon DynamoDB, GitLab CI/CD, AWS DevOps, AWS Step Functions

Data Science Instructor (Bootcamp)

2022 - PRESENT

Le Wagon

Lectured a hands-on course on data science from entry to more advanced topics, such as ML engineering and MLOps tasks.
Generated custom content for my classes with a focus on practical industry problems.
Guided students through good software engineering practices for data science workflows.

Technologies: Python, Scikit-learn, Pandas, TensorFlow, Docker, Data Visualization

Data Scientist

2024 - 2024

Toptal Client

Developed GWAS analyses. Generated multiple statistical and machine learning methods for the pharmacogenomics use case.
Developed code that was able to match the modeling results in the literature.
Performed follow-up analyses integrating the GWAS data with open-source datasets to connect the genetic information with external data sources.

Technologies: Data Science, Bioinformatics, Genomics, Machine Learning

MLOps Engineer | Cloud Architect

2023 - 2023

Grego-AI

Developed a state-of-the-art conversational recommendation engine based on a semantic search using LangChain, Python, PostgreSQL, Amazon S3, and GPT-3.5/4.
Generated a back end with FastAPI, Amazon S3, PostgreSQL, and Docker.
Developed the first architecture for the platform in AWS, configuring VPC, public and private subnets, EC2 instances, autoscaling groups, load balancers, S3 buckets, and RDS. Configured security groups and NACLs.
Developed CI/CD pipelines for this infrastructure, automating 90% of the deployment process.
Deployed a containerized solution, both for the front end and the back end. Connected the front end, back end, and the rest of the services, such as Amazon S3 and RDS.
Deployed the front end, configuring a reverse proxy (NGINX).
Automated the first version of the architecture using Terraform.
Developed a RAG pipeline to parse natural language queries using the customized infrastructure. Improved the pipeline using a Mixture of Experts.
Developed a Lambda function to provide a customized notification system.

Technologies: Boto 3, Amazon Web Services (AWS), APIs, Machine Learning Operations (MLOps), Machine Learning, Docker, Data Architecture, Python, Bash, Linux, SQLAlchemy, Git, CI/CD Pipelines, NGINX, Amazon EC2, Load Balancers, Autoscaling Groups, Amazon S3 (AWS S3), Amazon RDS, Amazon Virtual Private Cloud (VPC), Amazon Elastic Container Registry (ECR), Terraform, Chatbots, Language Models, Large Language Models (LLMs), OpenAI GPT-3 API, AWS DevOps

Data Scientist (via Toptal)

2022 - 2023

Stanford University - Main

Developed solutions for the Stanford cluster to interact with Databricks via ODBC. Created a package to set up configuration and installation in the cluster. Developed bash scripts and created an R subpackage.
Created CI pipelines that generated artifacts that matched the cluster architecture.
Developed workflows for RNA-Seq and ATAC-Seq analyses.
Developed a solution in Rust to transform genomics datasets. Helped with different parts of a workflow in a population genomics dataset. Generated different scripts to run the analysis in the Slurm cluster.

Technologies: Data Science, R, Bioinformatics, Genomics, RStudio Shiny, Bash, Rust, Slurm Workload Manager, GitHub Actions, Git, Azure Databricks, Python, Spark, Linux

Graph Data Scientist

2022 - 2022

Toptal Client

Developed information extraction pipelines from documents using libraries like spaCy, regular expressions, and other NLP algorithms.
Tested several methods for unsupervised document mining before the final approach.
Integrated the extracted information into a graph database. Generated proper schemas for the data.
Added customized GSQL queries into the solution and adapted a pathfinding algorithm to the schema.
Developed a dockerized solution to automate the whole process.
Added CI to the process to automate building, testing, and pull requests.

Technologies: GraphDB, Linux, Natural Language Processing (NLP), RDBMS, Automated Testing, Finance, TigerGraph, Docker, CI/CD Pipelines, Language Models, Data Visualization

Senior Machine Learning Engineer | eCommerce | FT

2022 - 2022

PROFASEE INC

Developed a detailed MLOps plan based on AWS SageMaker, including all the architecture components and interaction with other components of the base infrastructure.
Developed a customized Shiny application for the visualization of metrics of the ML modeling results and monitoring metrics using interactive charts. The app was containerized via Docker and deployed as an internal service via ECS.
Developed automated Markdown reports based on selections from the app.
Generated an API using FastAPI for querying data for the app. Integrated the app with the API.
Developed production-level pipelines that generate the app inputs from the ML model outputs and connected the inputs and outputs to AWS S3. Integrated the pipelines with the API.
Developed base code for interacting with external data sources via FastAPI and PostgreSQL.
Generated PostgreSQL schemas using SQLAlchemy and adapters for interaction between Pydantic and SQLAlchemy models for the app layer and the external data sources interaction layer.

Technologies: Python, Machine Learning Operations (MLOps), Machine Learning, Data Engineering, R, Pandas, APIs, REST APIs, Amazon Web Services (AWS), ETL, Data Pipelines, Data Visualization

Data Science Engineer

2022 - 2022

BCG

Developed a back end using FastAPI, PostgreSQL, and Ray to interact with customer data and show the information in a dashboard. Containerized the full application using Docker Compose.
Translated data engineering R code written with data.table to Python (Pandas), generating a package that checked output consistency at different steps.
Generated different configuration elements for an on-premise data stack, such as Makefiles, unit tests, and an input data checker package.
Wrote a scikit-learn machine learning pipeline for data imputation, encoding, and outlier detection steps.

Technologies: APIs, Python, R, PostgreSQL, REST, Pandas, Docker, Docker Compose, Pipelines, REST APIs, ETL, Data Visualization

Data Engineer | Data Scientist

2021 - 2022

Toptal Client

Developed three Neo4j graph databases from scratch. Defined nodes, edges, and attributes. Performed data preparation with Pandas.
Automated the generation of the databases and provided Docker containers that build the databases from the raw data and bundle all the needed infrastructure in one place. The containers included the base infrastructure, Neo4j, the databases, and a UI.
Exposed ports in the Docker application to perform queries and visualize the knowledge graphs using an interactive front end. Added CI using GitHub Actions to ensure that the application was built without errors.
Performed unsupervised analyses such as node2vec to understand the knowledge graph structure. Generated interactive charts with Bokeh to explore the results. Provided alternative visualizations of the knowledge graphs using JavaScript libraries.
Tested multiple methods, such as the Louvain algorithm and cluster analysis of the node2vec results using DBSCAN.

Technologies: Python, Neo4j, Docker, Continuous Integration (CI), Pandas, Unsupervised Learning, Knowledge Graphs, ETL, Julia

Data Science Instructor

2020 - 2022

Digital House

Lectured a hands-on course on data science from entry to more advanced topics.
Generated custom content for the classes to improve the understanding of specific topics.
Successfully lectured two courses with approximately 40 students.

Technologies: Python, Scikit-learn, Natural Language Toolkit (NLTK), Random Forests, SQL, APIs, Optimization, Applied Mathematics, Natural Language Processing (NLP), Generative Pre-trained Transformers (GPT), Statistics, Pipelines, Unsupervised Learning, Supervised Learning, Gradient Boosting, Bootstrap, Ensemble Methods, Statistical Data Analysis, REST APIs, BigQuery, Data Visualization

Data Scientist

2021 - 2021

DataArt

Performed analyses on time-distributed metrics collected from multiple portions of a mobile application and external data sources.
Generated the data infrastructure for the mobile application based on components such as data lakes, BigQuery, Databricks, Spark, and Azure Synapse Analytics.
Provided custom metrics using the collected information stored in data lakes and Azure Cosmos DB.

Technologies: Python, Azure, Databricks, SQL, Google BigQuery, PySpark, Azure Data Factory (ADF), Azure Synapse, Azure Data Lake, Data Science, Scikit-learn, Pandas, Dask, Git, Artificial Intelligence (AI), Agile Software Development, Spark, TensorFlow, Keras, Machine Learning, Azure Cosmos DB, Data Engineering, Data Mining, Deep Learning, Time Series Analysis, Azure DevOps, Forecasting, ETL, Data Pipelines, BigQuery, Apache Spark, Deep Neural Networks (DNNs), Data Visualization

Data Scientist | Machine Learning Engineer

2020 - 2021

Self-employed

Developed a graph database using Neo4j (Cypher) and generated subsequent analyses and node embeddings using Node2Vec. Performed entity extraction from documents to populate the database.
Collaborated on a machine learning project to predict the best opportunities for a food business. The analyses were performed using geographically and time-distributed features.
Developed a dashboard using Flask and components such as MongoDB.
Performed sentiment analysis on Twitter data using 1D CNNs.

Technologies: Python, R, Spark, Spark ML, Neo4j, TensorFlow, Dask, Data Science, Natural Language Processing (NLP), Generative Pre-trained Transformers (GPT), Natural Language Toolkit (NLTK), Scikit-learn, Pandas, SQL, spaCy, Spatial Analysis, Git, Artificial Intelligence (AI), Agile Software Development, Gensim, Hugging Face, Keras, Machine Learning, MongoDB, Flask, Data Mining, Image Processing, Deep Learning, Neural Networks, H2O, Amazon SageMaker, Amazon EC2, PostgreSQL, REST APIs, Amazon Web Services (AWS), Data Pipelines, Language Models, Data Visualization

Data Scientist

2020 - 2020

Intellignos

Developed a recommender system based on collaborative filtering using Spark and Spark ML that recommends products to millions of users.
Developed a PySpark pipeline as part of the recommender system solution.
Generated an end-to-end data engineering pipeline for the recommender system using Spark.
Worked on statistical analyses and data modeling using elastic net regression to find the best online investment opportunities for a project.
Performed machine learning engineering tasks, such as the generation of packages, model orchestration, and model tracking with MLflow.

Technologies: Python, R, Azure, Databricks, Spark, Spark ML, Azure Data Factory (ADF), Azure Data Lake, Data Science, Large-scale Distributed Systems, Scikit-learn, Pandas, SQL, Spark NLP, Git, Artificial Intelligence (AI), Agile Software Development, Generative Pre-trained Transformers (GPT), Natural Language Processing (NLP), Docker, Machine Learning, Data Engineering, Data Mining, Statistics, Statistical Data Analysis, Deep Learning, Big Data, Recommendation Systems, ETL, Large Data Sets, BigQuery, Apache Spark, Machine Learning Operations (MLOps), Language Models

Data Scientist | Machine Learning Engineer

2019 - 2020

Softtek

Developed an end-to-end machine learning project for anomaly detection and time series forecasting in near real-time data from IoT devices, using autoencoders and Bayesian modeling.
Developed data engineering pipelines for the analysis of data from PLCs.
Generated a workflow for entity extraction from documents.
Performed unsupervised classification of documents using Doc2Vec.
Developed a machine learning project for employee turnover prediction using an unbalanced dataset and XGBoost.
Performed machine learning engineering tasks, such as generation of packages, model orchestration, and model tracking with MLflow.

Technologies: Python, R, Azure, Databricks, TensorFlow, Bayesian Inference & Modeling, Azure Data Factory (ADF), Data Science, Generative Pre-trained Transformers (GPT), Natural Language Processing (NLP), Spark NLP, Natural Language Toolkit (NLTK), Scikit-learn, Pandas, SQL, Bayesian Statistics, PyMC, Dask, Git, Artificial Intelligence (AI), Agile Software Development, Spark, Docker, Gensim, Keras, Machine Learning, Data Engineering, Data Mining, Statistics, Statistical Data Analysis, Image Processing, Deep Learning, Neural Networks, H2O, Internet of Things (IoT), Time Series Analysis, Forecasting, Big Data, ETL, Large Data Sets, Azure Machine Learning, Data Pipelines, Apache Spark, Machine Learning Operations (MLOps), Language Models, Deep Neural Networks (DNNs), Data Visualization

Postdoctoral Researcher

2018 - 2019

Washington State University

Developed software for precision medicine (whole genome sequencing and transcriptomics).
Generated Python and R packages and pipelines for a terabyte-scale dataset that was processed in parallel on an HPC cluster with Slurm.
Collaborated on research papers and participated in conferences.

Technologies: R, Python, Bash, Data Science, Large-scale Distributed Systems, Scikit-learn, Pandas, SQL, C++, Git, Docker, Machine Learning, Data Mining, Slurm Workload Manager, High-performance Computing (HPC), Big Data, Bioinformatics, Genomics, Biology, Computational Biology, Molecular Biology, Large Data Sets, Data Pipelines, Data Visualization

Postdoctoral Researcher

2016 - 2018

IIB-INTECH UNSAM

Developed software in R and Python for precision medicine (epigenomics and transcriptomics).
Generated interfaces using R Shiny to provide no-code approaches for the packages in order to make the software accessible to a broad range of users.
Collaborated on research papers and participated in conferences.

Technologies: R, Python, Bash, Data Science, Scikit-learn, Pandas, Git, Data Mining, Bioinformatics, Genomics, Biology, Computational Biology, Molecular Biology, Data Visualization

Experience

A Machine Learning Application for Time-series Forecasting and Anomaly Detection

For the forecasting portion of the project, a Bayesian model was developed to account for the uncertainty of the forecasting estimates. In the case of anomaly detection, the model was based on autoencoders.

A Recommender System Using Collaborative Filtering

The goal of the project was to generate the best product recommendations for customers of a retail company. For scalability, the model was developed using PySpark, Spark ML, and the ALS algorithm.

Prediction of the Number of Days It Will Take for a SKU to be Out of Stock

https://github.com/leandroroser/meli_data_challenge_2021

The goal of this project was to predict how long it would take for the inventory of a certain item to be sold completely. Possible values range from 1 to 30. The data was pre-processed with PySpark and modeled with XGBoost.

Prettyparser

https://pypi.org/project/prettyparser/

Prettyparser is a Python library for parsing PDF/TXT and Python objects with text (str, list) using regular expressions. In the case of PDF files, the package reads the content using pdfplumber. It then performs a series of data manipulations to generate higher quality output, removing the boilerplate code needed to read/process/write the content of multiple files with multiple pages. A custom processing function using pdfplumber that takes a page and returns a processed text is also allowed. Additional data processing steps can be added via custom regular expressions that are compiled for improved speed.
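
The idea of chaining precompiled regular expressions over page text can be sketched as follows. This is a minimal illustration using only Python's standard re module, not prettyparser's actual API; the names CLEANUP_RULES and clean_page, the rules themselves, and the sample text are hypothetical.

```python
import re

# Hypothetical cleanup rules: (compiled pattern, replacement).
# Compiling once up front avoids re-parsing each pattern for every page.
CLEANUP_RULES = [
    (re.compile(r"-\n(?=\w)"), ""),   # re-join words hyphenated across line breaks
    (re.compile(r"[ \t]+"), " "),     # collapse runs of spaces and tabs
    (re.compile(r"\n{3,}"), "\n\n"),  # squeeze 3+ newlines into a single blank line
]


def clean_page(text: str) -> str:
    """Apply every cleanup rule, in order, to the text of one page."""
    for pattern, replacement in CLEANUP_RULES:
        text = pattern.sub(replacement, text)
    return text.strip()


# Made-up page text standing in for content extracted from a PDF or TXT file.
pages = ["Regular  expres-\nsions are\tuseful.\n\n\n\nNext paragraph. "]
cleaned = [clean_page(page) for page in pages]
```

Keeping the rules in an ordered list makes it easy to append project-specific patterns without touching the processing loop.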

TensorFlow Speech Recognition Challenge

https://www.kaggle.com/leangab/tensorflow-speech-recognition-challenge

In this Kaggle notebook, I developed a model to classify short audio clips using Librosa and TensorFlow. This example shows the use of batch processing, MFCCs, and a conv2d architecture. The model reached 90% accuracy.

NLP Analysis of E. A. Poe's Corpus of Short Stories

https://www.kaggle.com/leangab/poe-short-stories-corpus-analysis

In this Kaggle notebook, I analyzed the complete corpus of E. A. Poe's short stories. It shows the use of libraries such as NLTK, spaCy, and Gensim and methods such as word2vec and latent Dirichlet allocation (LDA). It is an example of how cosine similarity works under the hood with a simple model (word2vec), and it shows how such analyses were implemented before the adoption of LLMs.
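
What cosine similarity computes under the hood can be sketched in a few lines of plain Python. The three-dimensional vectors below are made-up toy values, not embeddings from the notebook's word2vec model, and cosine_similarity is an illustrative helper name.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


# Toy vectors (hypothetical values chosen for illustration only).
raven = [1.0, 2.0, 0.0]
crow = [2.0, 4.0, 0.0]    # same direction as `raven`, different length
teacup = [0.0, 0.0, 3.0]  # orthogonal to `raven`

same_direction = cosine_similarity(raven, crow)
orthogonal = cosine_similarity(raven, teacup)
```

Because the measure depends only on direction, the first pair scores approximately 1 despite the different vector lengths, while the orthogonal pair scores 0.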

FastqCleaner

https://github.com/leandroroser/FastqCleaner

A Shiny web app for pre-processing of transcriptomics data. The app includes C++ code for optimization of bottleneck portions of the code and customization of the behavior and appearance of the app via JavaScript and CSS.

Publication: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2961-8

EcoGenetics | An R Package for Landscape Genetics

https://github.com/cran/EcoGenetics

An R package for spatial analysis of genetic and phenotypic data. It includes features such as extensive unit testing, best practices using OOP in R (S4 classes), and integration with the R ecosystem.

Citation: https://mran.microsoft.com/snapshot/2018-08-31/web/packages/EcoGenetics/citation.html

ChunkR

https://github.com/leandroroser/chunkR

This package allows reading large text tables in chunks in R, using a fast C++ back end. Text files can be imported as data frames (with automatic column type detection option) or matrices. The program is designed to be simple and user-friendly.

When this library was developed, R offered few ways around the medium-sized local data problem, because R data frames were fully allocated in memory, as Pandas DataFrames are in Python. The more recent Vaex Python library is a good example of a related approach (though it relies on memory maps and lazy loading). The C++ code interfaced with R is available in the src subfolder of the repository.
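
The chunked-reading idea can be illustrated in Python with the standard library alone. This sketch is not chunkR's interface (chunkR is an R package with a C++ back end); read_in_chunks and the sample data are hypothetical, and an in-memory io.StringIO stands in for a large file on disk.

```python
import csv
import io
from itertools import islice


def read_in_chunks(handle, chunk_size):
    """Yield lists of at most `chunk_size` rows (as dicts) from a CSV handle.

    The generator builds one chunk at a time instead of loading the whole table.
    """
    reader = csv.DictReader(handle)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        yield chunk


# Stand-in for a large text table on disk (hypothetical data).
table = io.StringIO("id,value\n1,10\n2,20\n3,30\n4,40\n5,50\n")

chunk_sizes = []
total = 0
for chunk in read_in_chunks(table, chunk_size=2):
    chunk_sizes.append(len(chunk))
    total += sum(int(row["value"]) for row in chunk)
```

Aggregating per chunk, as the running total does here, keeps memory use bounded by the chunk size rather than by the size of the file.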

Publication

Python vs. R: Syntactic Sugar Magic

https://www.toptal.com/developers/python/python-vs-r-syntactic-sugar-magic

Education

2010 - 2015

PhD in Biological Science

University of Buenos Aires - Buenos Aires, Argentina

2003 - 2010

Combined Bachelor's and Master's Degree in Biological Science

University of Buenos Aires - Buenos Aires, Argentina

Certifications

FEBRUARY 2024 - FEBRUARY 2027

AWS Certified Solutions Architect – Associate

Amazon Web Services

OCTOBER 2023 - OCTOBER 2026

AWS Certified Developer – Associate

Amazon Web Services Training and Certification

DECEMBER 2021 - PRESENT

Julia Programming 2021

Udemy

DECEMBER 2021 - PRESENT

MLOps Fundamentals: CI/CD/CT Pipelines of ML with Azure Demo

Udemy

FEBRUARY 2021 - PRESENT

Google Cloud: Insights from Data with BigQuery

Coursera

NOVEMBER 2020 - PRESENT

Getting Started with Google Kubernetes Engine

Coursera

OCTOBER 2020 - PRESENT

Deep Neural Networks with PyTorch

Coursera

Skills

Libraries/APIs

Scikit-learn, Pandas, PyMC, Keras, XGBoost, REST APIs, PySpark, TensorFlow, Spark ML, spaCy, Natural Language Toolkit (NLTK), Spark NLP, Dask, SQLAlchemy, PyTorch

Tools

Amazon Elastic Container Registry (ECR), GitLab CI/CD, Gensim, Git, Amazon SageMaker, GIS, Azure Machine Learning, BigQuery, NGINX, Amazon Virtual Private Cloud (VPC), Terraform, AWS Step Functions, Docker Compose, Boto 3

Languages

Python, R, Bash, Regex, SQL, C++, C++11, JavaScript, CSS, Julia, Rust

Frameworks

Spark, RStudio Shiny, Flask, Bootstrap, Apache Spark

Paradigms

Agile Software Development, High-performance Computing (HPC), Continuous Integration (CI), ETL, Azure DevOps, REST, Automated Testing

Platforms

Linux, Databricks, Docker, AWS Lambda, Azure, Azure Synapse, H2O, Amazon EC2, Amazon Web Services (AWS), Kubernetes, TigerGraph, AWS IoT, Amazon

Storage

Neo4j, Data Pipelines, PostgreSQL, Azure Cosmos DB, MongoDB, Google Cloud, Amazon S3 (AWS S3), Amazon DynamoDB, RDBMS, Elasticsearch

Industry Expertise

Bioinformatics

Other

Statistics, Machine Learning, Bayesian Inference & Modeling, Spatial Analysis, Azure Data Factory (ADF), Azure Data Lake, Artificial Intelligence (AI), Time Series Analysis, Data Science, Data Engineering, Random Forests, Optimization, Applied Mathematics, Pipelines, Unsupervised Learning, Supervised Learning, Gradient Boosting, Data Mining, Statistical Data Analysis, Deep Learning, Slurm Workload Manager, Forecasting, Big Data, Genomics, Biology, Computational Biology, Molecular Biology, Large Data Sets, OpenAI GPT-3 API, CI/CD Pipelines, Data Visualization, Documentation, GitHub Actions, Azure Databricks, Large-scale Distributed Systems, Bayesian Statistics, Hugging Face, Google BigQuery, Natural Language Processing (NLP), Internet of Things (IoT), APIs, Ensemble Methods, Image Processing, Neural Networks, Machine Learning Operations (MLOps), Recommendation Systems, Geospatial Data, Generative Pre-trained Transformers (GPT), Amazon RDS, Language Models, Large Language Models (LLMs), Deep Neural Networks (DNNs), Serverless, AWS DevOps, Knowledge Graphs, GraphDB, Finance, Data Architecture, Load Balancers, Autoscaling Groups, Chatbots
