Leandro Roser, Developer in Buenos Aires, Argentina

Leandro Roser

Verified Expert in Engineering

Data Scientist and Machine Learning Developer

Location
Buenos Aires, Argentina
Toptal Member Since
September 15, 2021

Leandro is a machine learning and MLOps engineer and generalist data expert with a strong background in engineering and applied machine learning. He develops using tools such as Terraform, Docker, Spark, APIs, and Python stacks. He has expertise working with multiple AWS services and designing infrastructure for ML projects. Leandro can materialize ideas into solid, end-to-end products.

Portfolio

Prisma Medios de Pago
Amazon, Terraform, Amazon SageMaker, Docker, Python, Serverless...
Le Wagon
Python, Scikit-learn, Pandas, TensorFlow, Docker, Data Visualization
Grego-AI
Boto 3, Amazon Web Services (AWS), APIs, Machine Learning Operations (MLOps)...

Availability

Part-time

Preferred Environment

Linux, Windows, Visual Studio, Slack, Python, Azure, Amazon Web Services (AWS), Git

The most amazing...

...project I've developed is a conversational recommendation system based on RAG, GPT-3, and Mixture of Experts—at the earliest adoption point of this technology.

Work Experience

MLOps Engineer

2024 - PRESENT
Prisma Medios de Pago
  • Developed SageMaker Python pipelines for different machine learning projects.
  • Designed architectures for different training/inference scenarios and developed functional code with the relevant AWS services for each architecture. Developed code using Terraform for deployment of the solutions into different environments.
  • Interacted with data scientists, DevOps, and other platform members to achieve the best solutions.
  • Developed CI/CD code to automate the deployment process.
Technologies: Amazon, Terraform, Amazon SageMaker, Docker, Python, Serverless, Data Architecture, Amazon S3 (AWS S3), AWS Lambda, Amazon DynamoDB, GitLab CI/CD

Data Science Instructor (Bootcamp)

2022 - PRESENT
Le Wagon
  • Taught a hands-on data science course covering entry-level through more advanced topics, such as ML engineering and MLOps tasks.
  • Generated custom content for my classes with a focus on practical industry problems.
  • Guided students through good software engineering practices for data science workflows.
Technologies: Python, Scikit-learn, Pandas, TensorFlow, Docker, Data Visualization

MLOps Engineer | Cloud Architect

2023 - 2023
Grego-AI
  • Developed a state-of-the-art conversational recommendation engine based on a semantic search using LangChain, Python, PostgreSQL, Amazon S3, and GPT-3.5/4.
  • Generated a back end with FastAPI, Amazon S3, PostgreSQL, and Docker.
  • Developed the first architecture for the platform in AWS, configuring VPC, public and private subnets, EC2 instances, autoscaling groups, load balancers, S3 buckets, and RDS. Configured security groups and NACLs.
  • Developed CI/CD pipelines for the developed infrastructure. Automated 90% of the deployment process using this solution.
  • Deployed a containerized solution, both for the front end and the back end. Connected the front end, back end, and the rest of the services, such as Amazon S3 and RDS.
  • Deployed the front end, configuring a reverse proxy (NGINX).
  • Automated the first version of the architecture using Terraform.
  • Developed a RAG pipeline to parse natural language queries using the customized infrastructure. Improved the pipeline using a Mixture of Experts.
  • Developed an AWS Lambda function to provide a customized notification system.
Technologies: Boto 3, Amazon Web Services (AWS), APIs, Machine Learning Operations (MLOps), Machine Learning, Docker, Data Architecture, Python, Bash, Linux, SQLAlchemy, Git, CI/CD Pipelines, NGINX, Amazon EC2, Load Balancers, Autoscaling Groups, Amazon S3 (AWS S3), Amazon RDS, Amazon Virtual Private Cloud (VPC), Amazon Elastic Container Registry (ECR), Terraform, Chatbots, Language Models, Large Language Models (LLMs), OpenAI GPT-3 API

Data Scientist (via Toptal)

2022 - 2023
Stanford University - Main
  • Developed solutions for the Stanford cluster to interact with Databricks via ODBC. Created a package to set up configuration and installation in the cluster. Developed bash scripts and created an R subpackage.
  • Created CI pipelines that generated artifacts that matched the cluster architecture.
  • Developed workflows for RNA-Seq and ATAC-Seq analyses.
  • Developed a solution in Rust to transform genomics datasets. Helped with different parts of a workflow in a population genomics dataset. Generated different scripts to run the analysis in the Slurm cluster.
Technologies: Data Science, R, Bioinformatics, Genomics, RStudio Shiny, Bash, Rust, Slurm Workload Manager, GitHub Actions, Git, Azure Databricks, Python, Spark, Linux

Graph Data Scientist

2022 - 2022
Toptal Client
  • Developed pipelines for extracting information from documents using libraries like spaCy, regular expressions, and other NLP algorithms.
  • Tested several methods for unsupervised document mining before the final approach.
  • Integrated the extracted information into a graph database. Generated proper schemas for the data.
  • Added customized GSQL queries into the solution and adapted a pathfinding algorithm to the schema.
  • Developed a dockerized solution to automate the whole process.
  • Added CI to the process to automate building, testing, and pull requests.
Technologies: GraphDB, Linux, Natural Language Processing (NLP), RDBMS, Automated Testing, Finance, TigerGraph, Docker, CI/CD Pipelines, Language Models, Data Visualization

Senior Machine Learning Engineer | eCommerce | FT

2022 - 2022
PROFASEE INC
  • Developed an MLOps detailed plan based on AWS SageMaker, including all the architecture components and interaction with other components of the base infrastructure.
  • Developed a customized Shiny application for the visualization of metrics of the ML modeling results and monitoring metrics using interactive charts. The app was containerized via Docker and deployed as an internal service via ECS.
  • Developed automated Markdown reports based on selections from the app.
  • Generated an API using FastAPI for querying data for the app. Integrated the app with the API.
  • Developed production-level pipelines for generating the needed app inputs from outputs of the ML models and connected input and outputs with AWS S3. Integrated the pipelines with the API.
  • Developed base code for interacting with external data sources via FastAPI and PostgreSQL.
  • Generated PostgreSQL schemas using SQLAlchemy and adapters for interaction between Pydantic and SQLAlchemy models for the app layer and the external data sources interaction layer.
Technologies: Python, Machine Learning Operations (MLOps), Machine Learning, Data Engineering, R, Pandas, APIs, REST APIs, Amazon Web Services (AWS), ETL, Data Pipelines, Data Visualization

Data Science Engineer

2022 - 2022
BCG
  • Developed a back end using FastAPI, PostgreSQL, and Ray to interact with customer data and show the information in a dashboard. Containerized the full application using Docker Compose.
  • Translated data engineering R code written with data.table to Python (Pandas), producing a package that checks output consistency at each step.
  • Generated different configuration elements for an on-premise data stack, such as Makefiles, unit tests, and an input data checker package.
  • Wrote a scikit-learn machine learning pipeline for data imputation, encoding, and outlier detection steps.
Technologies: APIs, Python, R, PostgreSQL, REST, Pandas, Docker, Docker Compose, Pipelines, REST APIs, ETL, Data Visualization

Data Engineer | Data Scientist

2021 - 2022
Toptal Client
  • Developed three Neo4j graph databases from scratch. Defined nodes, edges, and attributes. Performed data preparation with Pandas.
  • Automated the generation of the databases and provided Docker containers that build the databases from the raw data and bundle all the required infrastructure in one place. The containers included the base infrastructure, Neo4j, the databases, and a UI.
  • Exposed ports in the Docker application to perform queries and visualize the knowledge graphs using an interactive front end. Added CI using GitHub Actions to ensure that the application was built without errors.
  • Performed unsupervised analyses such as node2vec to understand the knowledge graph structure. Generated interactive charts with Bokeh to explore the results. Provided alternative visualizations of the knowledge graphs using JavaScript libraries.
  • Tested multiple methods, such as the Louvain algorithm and cluster analysis of the node2vec results using DBSCAN.
Technologies: Python, Neo4j, Docker, Continuous Integration (CI), Pandas, Unsupervised Learning, Knowledge Graphs, ETL, Julia

Data Science Instructor

2020 - 2022
Digital House
  • Taught a hands-on data science course covering entry-level through more advanced topics.
  • Generated custom content for the classes to improve the understanding of specific topics.
  • Successfully completed two courses of approximately 40 students.
Technologies: Python, Scikit-learn, Natural Language Toolkit (NLTK), Random Forests, SQL, APIs, Optimization, Applied Mathematics, Natural Language Processing (NLP), GPT, Generative Pre-trained Transformers (GPT), Statistics, Pipelines, Unsupervised Learning, Supervised Learning, Gradient Boosting, Bootstrap, Ensemble Methods, Statistical Data Analysis, REST APIs, BigQuery, Data Visualization

Data Scientist

2021 - 2021
DataArt
  • Performed analyses on time-distributed metrics collected from multiple portions of a mobile application and external data sources.
  • Generated the data infrastructure for the mobile application based on components such as data lakes, BigQuery, Databricks, Spark, and Azure Synapse Analytics.
  • Provided custom metrics using the collected information stored in data lakes and Azure Cosmos DB.
Technologies: Python, Azure, Databricks, SQL, Google BigQuery, PySpark, Azure Data Factory, Azure Synapse, Azure Data Lake, Data Science, Scikit-learn, Pandas, Dask, Git, Artificial Intelligence (AI), Agile Software Development, Spark, TensorFlow, Keras, Machine Learning, Azure Cosmos DB, Data Engineering, Data Mining, Deep Learning, Time Series Analysis, Azure DevOps, Forecasting, ETL, Data Pipelines, BigQuery, Apache Spark, Deep Neural Networks, Data Visualization

Data Scientist | Machine Learning Engineer

2020 - 2021
Self-employed
  • Developed a graph database using Neo4j (Cypher) and generated subsequent analyses and node embeddings using Node2Vec. Performed entity extraction from documents to populate the database.
  • Collaborated on a machine learning project to predict the best opportunities for a food business. The analyses were performed using geographically and time-distributed features.
  • Developed a dashboard using Flask and components such as MongoDB.
  • Performed sentiment analysis on Twitter data using 1D CNNs.
Technologies: Python, R, Spark, Spark ML, Neo4j, TensorFlow, Dask, Data Science, Natural Language Processing (NLP), GPT, Generative Pre-trained Transformers (GPT), Natural Language Toolkit (NLTK), Scikit-learn, Pandas, SQL, SpaCy, Spatial Analysis, Git, Artificial Intelligence (AI), Agile Software Development, Gensim, Hugging Face, Keras, Machine Learning, MongoDB, Flask, Data Mining, Image Processing, Deep Learning, Neural Networks, H2O, Amazon SageMaker, Amazon EC2, PostgreSQL, REST APIs, Amazon Web Services (AWS), Data Pipelines, Language Models, Data Visualization

Data Scientist

2020 - 2020
Intellignos
  • Developed a recommender system based on collaborative filtering using Spark and Spark ML that recommends products to millions of users.
  • Developed a PySpark pipeline as part of the recommender system solution.
  • Generated an end-to-end data engineering pipeline for the recommender system using Spark.
  • Worked on statistical analyses and data modeling using elastic net regression to find the best online investment opportunities for a project.
  • Performed machine learning engineering tasks, such as the generation of packages, model orchestration, and model tracking with MLflow.
Technologies: Python, R, Azure, Databricks, Spark, Spark ML, Azure Data Factory, Azure Data Lake, Data Science, Large Scale Distributed Systems, Scikit-learn, Pandas, SQL, Spark NLP, Git, Artificial Intelligence (AI), Agile Software Development, Generative Pre-trained Transformers (GPT), GPT, Natural Language Processing (NLP), Docker, Machine Learning, Data Engineering, Data Mining, Statistics, Statistical Data Analysis, Deep Learning, Big Data, Recommendation Systems, ETL, Large Data Sets, BigQuery, Apache Spark, Machine Learning Operations (MLOps), Language Models

Data Scientist | Machine Learning Engineer

2019 - 2020
Softtek
  • Developed an end-to-end machine learning project for anomaly detection and time series forecasting in near real-time data from IoT devices, using autoencoders and Bayesian modeling.
  • Developed data engineering pipelines for the analysis of data from PLCs.
  • Generated a workflow for entity extraction from documents.
  • Performed unsupervised classification of documents using Doc2Vec.
  • Developed a machine learning project for employee turnover prediction using an unbalanced dataset and XGBoost.
  • Performed machine learning engineering tasks, such as generation of packages, model orchestration, and model tracking with MLflow.
Technologies: Python, R, Azure, Databricks, TensorFlow, Bayesian Inference & Modeling, Azure Data Factory, Data Science, Natural Language Processing (NLP), Generative Pre-trained Transformers (GPT), GPT, Spark NLP, Natural Language Toolkit (NLTK), Scikit-learn, Pandas, SQL, Bayesian Statistics, PyMC3, Dask, Git, Artificial Intelligence (AI), Agile Software Development, Spark, Docker, Gensim, Keras, Machine Learning, Data Engineering, Data Mining, Statistics, Statistical Data Analysis, Image Processing, Deep Learning, Neural Networks, H2O, Internet of Things (IoT), Time Series Analysis, Forecasting, Big Data, ETL, Large Data Sets, Azure Machine Learning, Data Pipelines, Apache Spark, Machine Learning Operations (MLOps), Language Models, Deep Neural Networks, Data Visualization

Postdoctoral Researcher

2018 - 2019
Washington State University
  • Developed software for precision medicine (whole genome sequencing and transcriptomics).
  • Generated Python and R packages and pipelines for a terabyte-scale dataset that was processed in parallel on an HPC cluster with Slurm.
  • Collaborated on research papers and participated in conferences.
Technologies: R, Python, Bash, Data Science, Large Scale Distributed Systems, Scikit-learn, Pandas, SQL, C++, Git, Docker, Machine Learning, Data Mining, Slurm Workload Manager, High-performance Computing, Big Data, Bioinformatics, Genomics, Biology, Computational Biology, Molecular Biology, Large Data Sets, Data Pipelines, Data Visualization

Postdoctoral Researcher

2016 - 2018
IIB-INTECH UNSAM
  • Developed software in R and Python for precision medicine (epigenomics and transcriptomics).
  • Generated interfaces using R Shiny to provide no-code approaches for the packages in order to make the software accessible to a broad number of users.
  • Collaborated on research papers and participated in conferences.
Technologies: R, Python, Bash, Data Science, Scikit-learn, Pandas, Git, Data Mining, Bioinformatics, Genomics, Biology, Computational Biology, Molecular Biology, Data Visualization

A Machine Learning Application for Time-series Forecasting and Anomaly Detection

For the forecasting portion of the project, a Bayesian model was developed to account for the uncertainty of the forecasting estimates. For anomaly detection, the model was based on autoencoders.

A Recommender System Using Collaborative Filtering

The goal of the project was to generate the best product recommendations for customers of a retail company. For scalability, the model was developed using PySpark, Spark ML, and the ALS algorithm.

Prediction of the Number of Days It Will Take for a SKU to be Out of Stock

https://github.com/leandroroser/meli_data_challenge_2021
The goal of this project was to predict how long it would take for the inventory of a certain item to be sold completely. Possible values range from 1 to 30. The data was pre-processed with PySpark and modeled with XGBoost.

Prettyparser

https://pypi.org/project/prettyparser/
Prettyparser is a Python library for parsing PDF/TXT and Python objects with text (str, list) using regular expressions. In the case of PDF files, the package reads the content using pdfplumber. It then performs a series of data manipulations to generate higher quality output, removing the boilerplate code needed to read/process/write the content of multiple files with multiple pages. A custom processing function using pdfplumber that takes a page and returns a processed text is also allowed. Additional data processing steps can be added via custom regular expressions that are compiled for improved speed.
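
The idea of precompiled regular expression steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Prettyparser's actual API: the rule list and the names `RULES`, `clean_page`, and `clean_pages` are hypothetical, and only Python's standard `re` module is used.

```python
import re

# Hypothetical cleanup rules as (pattern, replacement) pairs.
# They are illustrative only and are not taken from Prettyparser.
RULES = [
    (r"-\n(?=\w)", ""),     # rejoin words hyphenated across a line break
    (r"[ \t]+", " "),       # collapse runs of spaces and tabs
    (r"\n{3,}", "\n\n"),    # squeeze 3+ newlines into one paragraph break
]

# Compile each pattern once so the same objects are reused for every page.
COMPILED = [(re.compile(pattern), repl) for pattern, repl in RULES]


def clean_page(text):
    """Apply every compiled rule, in order, to the text of one page."""
    for pattern, repl in COMPILED:
        text = pattern.sub(repl, text)
    return text.strip()


def clean_pages(pages):
    """Clean a list of page texts, e.g. one string per extracted PDF page."""
    return [clean_page(page) for page in pages]
```

Compiling the patterns up front avoids repeated pattern lookups when the same steps run over many pages of many files.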

TensorFlow Speech Recognition Challenge

https://www.kaggle.com/leangab/tensorflow-speech-recognition-challenge
In this Kaggle notebook, I developed a model to classify short audio clips using Librosa and TensorFlow. This example shows the use of batch processing, MFCCs, and a conv2d architecture. The model reached 90% accuracy.

NLP Analysis of E. A. Poe's Corpus of Short Stories

https://www.kaggle.com/leangab/poe-short-stories-corpus-analysis
In this Kaggle notebook, I analyzed the complete corpus of E. A. Poe's short stories. It shows the use of libraries such as NLTK, spaCy, and Gensim and methods such as word2vec and latent Dirichlet allocation (LDA). It illustrates how cosine similarity works under the hood with a simple model (word2vec) and shows an implementation that predates the adoption of LLMs.
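
As a rough illustration of the cosine similarity computation mentioned above, here is a minimal pure-Python sketch. The three-dimensional vectors are made-up toy values rather than embeddings from the notebook, and the helper names `cosine_similarity` and `most_similar` are assumptions for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def most_similar(word, vectors):
    """Return the other word whose vector has the highest cosine similarity."""
    target = vectors[word]
    candidates = (w for w in vectors if w != word)
    return max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))


# Toy word vectors (hypothetical values, not trained embeddings).
toy_vectors = {
    "raven": [0.9, 0.1, 0.3],
    "crow": [0.8, 0.2, 0.4],
    "house": [0.1, 0.9, 0.2],
}

nearest = most_similar("raven", toy_vectors)
```

With real word2vec embeddings the vectors have hundreds of dimensions, but the computation is the same.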

FastqCleaner

https://github.com/leandroroser/FastqCleaner
A Shiny web app for pre-processing of transcriptomics data. The app includes C++ code for optimization of bottleneck portions of the code and customization of the behavior and appearance of the app via JavaScript and CSS.

Publication: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2961-8

EcoGenetics | An R Package for Landscape Genetics

https://github.com/cran/EcoGenetics
An R package for spatial analysis of genetic and phenotypic data. It includes features such as extensive unit testing, best practices using OOP in R (S4 classes), and integration with the R ecosystem.

Citation: https://mran.microsoft.com/snapshot/2018-08-31/web/packages/EcoGenetics/citation.html

ChunkR

https://github.com/leandroroser/chunkR
This package allows reading large text tables in chunks in R, using a fast C++ back end. Text files can be imported as data frames (with automatic column type detection option) or matrices. The program is designed to be simple and user-friendly.

At the time this library was developed, R offered few ways around the medium-sized local data problem, because R data frames are fully allocated in memory, as Pandas DataFrames are in Python. The more recent Vaex Python library is a good example of this approach (but with memory maps and lazy loading). The C++ code interfaced with R is available in the src subfolder of the repository.
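
The chunked-reading idea can be sketched in Python with the standard library alone. This is only an illustration of the general technique, not chunkR's implementation (which is an R package with a C++ back end); the function name `read_in_chunks` and the sample table are hypothetical.

```python
import csv
import io
from itertools import islice


def read_in_chunks(handle, chunk_size, delimiter=","):
    """Yield lists of row dicts, holding at most chunk_size rows in memory."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader, None)
    if header is None:
        return  # empty input: nothing to yield
    while True:
        rows = list(islice(reader, chunk_size))
        if not rows:
            break
        yield [dict(zip(header, row)) for row in rows]


# Usage with an in-memory table; a real call would pass an open file handle.
sample = io.StringIO("id,value\n1,a\n2,b\n3,c\n")
chunks = list(read_in_chunks(sample, chunk_size=2))
```

Each chunk can be processed and discarded before the next one is read, so memory use depends on the chunk size rather than the file size.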

Education

2010 - 2015

PhD in Biological Science

University of Buenos Aires - Buenos Aires, Argentina

2003 - 2010

Combined Bachelor's and Master's Degree in Biological Science

University of Buenos Aires - Buenos Aires, Argentina

Certifications

FEBRUARY 2024 - FEBRUARY 2027

AWS Certified Solutions Architect – Associate

Amazon Web Services

OCTOBER 2023 - OCTOBER 2026

AWS Certified Developer – Associate

Amazon Web Services Training and Certification

DECEMBER 2021 - PRESENT

Julia Programming 2021

Udemy

DECEMBER 2021 - PRESENT

MLOps Fundamentals: CI/CD/CT Pipelines of ML with Azure Demo

Udemy

FEBRUARY 2021 - PRESENT

Google Cloud: Insights from Data with BigQuery

Coursera

NOVEMBER 2020 - PRESENT

Getting Started with Google Kubernetes Engine

Coursera

OCTOBER 2020 - PRESENT

Deep Neural Networks with Pytorch

Coursera

Libraries/APIs

Scikit-learn, Pandas, Keras, XGBoost, REST APIs, PySpark, TensorFlow, Spark ML, SpaCy, Natural Language Toolkit (NLTK), Dask, PyTorch, SQLAlchemy

Tools

Gensim, Git, Amazon SageMaker, GIS, Azure Machine Learning, BigQuery, Docker Compose, Boto 3, NGINX, Amazon Virtual Private Cloud (VPC), Amazon Elastic Container Registry (ECR), Terraform, GitLab CI/CD

Frameworks

Spark, Flask, Bootstrap, Apache Spark, RStudio Shiny

Paradigms

Agile Software Development, Data Science, High-performance Computing, Continuous Integration (CI), ETL, Azure DevOps, REST, Automated Testing

Languages

Python, R, Bash, Regex, SQL, C++, C++11, JavaScript, CSS, Julia, Rust

Platforms

Linux, Databricks, Docker, Azure, Azure Synapse, H2O, Amazon EC2, Amazon Web Services (AWS), Kubernetes, AWS IoT, Amazon, AWS Lambda

Industry Expertise

Bioinformatics

Storage

Neo4j, Data Pipelines, PostgreSQL, Azure Cosmos DB, MongoDB, Google Cloud, RDBMS, Elasticsearch, Amazon S3 (AWS S3), Amazon DynamoDB

Other

Statistics, Machine Learning, PyMC3, Bayesian Inference & Modeling, Spatial Analysis, Azure Data Factory, Azure Data Lake, Artificial Intelligence (AI), Time Series Analysis, Data Engineering, Random Forests, Optimization, Applied Mathematics, Pipelines, Unsupervised Learning, Supervised Learning, Gradient Boosting, Data Mining, Statistical Data Analysis, Deep Learning, Slurm Workload Manager, Forecasting, Big Data, Genomics, Biology, Computational Biology, Molecular Biology, Large Data Sets, OpenAI GPT-3 API, Data Visualization, Large Scale Distributed Systems, Bayesian Statistics, Spark NLP, Hugging Face, Google BigQuery, Natural Language Processing (NLP), Internet of Things (IoT), APIs, Ensemble Methods, Image Processing, Neural Networks, Machine Learning Operations (MLOps), Recommendation Systems, Geospatial Data, Generative Pre-trained Transformers (GPT), Language Models, Large Language Models (LLMs), Deep Neural Networks, Documentation, Knowledge Graphs, GraphDB, Finance, TigerGraph, GPT, Data Architecture, CI/CD Pipelines, Load Balancers, Autoscaling Groups, Amazon RDS, Chatbots, Serverless, GitHub Actions, Azure Databricks
