Leandro Roser
Verified Expert in Engineering
Data Scientist and Machine Learning Developer
Buenos Aires, Argentina
Toptal member since September 15, 2021
Leandro is a machine learning and MLOps engineer with over 14 years of experience in the data field. He has a solid background in ML infrastructure development, machine learning, and statistics. He is proficient in using tools such as Python, Docker, Terraform, CI/CD frameworks, and several AWS services. Leandro excels at transforming ideas into robust, end-to-end products.
Preferred Environment
Linux, Windows, Visual Studio, Slack, Python, Azure, Amazon Web Services (AWS), Git
The most amazing...
...project I've developed is a conversational recommendation system based on RAG, GPT-3, and a mixture of experts, built at the earliest adoption stage of this technology.
Work Experience
MLOps Engineer
Prisma Medios de Pago
- Developed serverless MLOps frameworks for the automatic deployment of ML models using AWS Step Functions, SAM, DynamoDB, and S3 (a minimal sketch follows this list).
- Designed architectures from scratch for different training and inference scenarios (e.g., batch and on-demand API-based inference) using best practices.
- Collaborated with data scientists, DevOps engineers, and other platform members to arrive at the best solutions.
- Developed CI/CD code to automate the deployment process using GitLab.
- Developed end-to-end code with the relevant AWS services for each architecture. Used Terraform to deploy the solutions into different environments.
- Developed training and deployment pipelines using Amazon SageMaker Pipelines, among other approaches.
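The frameworks themselves are internal, but a minimal sketch of the kind of serverless glue code involved, in this case a Lambda handler that starts a batch-inference state machine and logs the run to DynamoDB, might look like the following (the state machine ARN, table name, and event fields are placeholders):

```python
import json
import uuid
from datetime import datetime, timezone

import boto3

# Hypothetical resource names; the real ones are environment-specific.
STATE_MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:batch-inference"
RUNS_TABLE = "ml-inference-runs"

sfn = boto3.client("stepfunctions")
dynamodb = boto3.resource("dynamodb")


def handler(event, context):
    """Start a batch-inference state machine and record the run in DynamoDB."""
    run_id = str(uuid.uuid4())
    execution = sfn.start_execution(
        stateMachineArn=STATE_MACHINE_ARN,
        name=f"inference-{run_id}",
        input=json.dumps({"run_id": run_id, "s3_input": event["s3_input"]}),
    )
    dynamodb.Table(RUNS_TABLE).put_item(
        Item={
            "run_id": run_id,
            "execution_arn": execution["executionArn"],
            "started_at": datetime.now(timezone.utc).isoformat(),
            "status": "STARTED",
        }
    )
    return {"run_id": run_id, "execution_arn": execution["executionArn"]}
```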
Data Science Instructor (Bootcamp)
Le Wagon
- Taught a hands-on data science course covering entry-level to more advanced topics, such as ML engineering and MLOps.
- Generated custom content for my classes with a focus on practical industry problems.
- Guided students through good software engineering practices for data science workflows.
Data Scientist
Toptal Client
- Performed genome-wide association study (GWAS) analyses and applied multiple statistical and machine learning methods to a pharmacogenomics use case (a simplified association-test sketch follows this list).
- Developed code that reproduced the modeling results reported in the literature.
- Performed downstream analyses integrating the GWAS data with different open-source datasets to connect the genetic information with external data sources.
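The specific models are confidential, but the core of a GWAS scan reduces to a per-variant regression. A simplified sketch with statsmodels, using hypothetical genotype, phenotype, and covariate inputs:

```python
import pandas as pd
import statsmodels.api as sm

def gwas_scan(genotypes: pd.DataFrame, phenotype: pd.Series,
              covariates: pd.DataFrame) -> pd.DataFrame:
    """Run a logistic regression per variant and collect effect sizes and p-values.

    Hypothetical inputs: genotypes coded 0/1/2 per variant, a binary phenotype,
    and covariates such as age, sex, and principal components.
    """
    results = []
    for snp in genotypes.columns:
        X = sm.add_constant(pd.concat([genotypes[[snp]], covariates], axis=1))
        model = sm.Logit(phenotype, X).fit(disp=0)
        results.append(
            {"snp": snp, "beta": model.params[snp], "p_value": model.pvalues[snp]}
        )
    return pd.DataFrame(results).sort_values("p_value")
```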
MLOps Engineer | Cloud Architect
Grego-AI
- Developed a state-of-the-art conversational recommendation engine based on a semantic search using LangChain, Python, PostgreSQL, Amazon S3, and GPT-3.5/4.
- Generated a back end with FastAPI, Amazon S3, PostgreSQL, and Docker.
- Developed the first architecture for the platform on AWS, configuring the VPC, public and private subnets, EC2 instances, Auto Scaling groups, load balancers, S3 buckets, and RDS. Configured security groups and NACLs.
- Built CI/CD pipelines for this infrastructure, automating 90% of the deployment process.
- Deployed a containerized solution, both for the front end and the back end. Connected the front end, back end, and the rest of the services, such as Amazon S3 and RDS.
- Deployed the front end, configuring a reverse proxy (NGINX).
- Automated the first version of the architecture using Terraform.
- Developed a RAG pipeline to parse natural language queries using the customized infrastructure and improved the pipeline with a mixture of experts (a minimal retrieve-then-generate sketch follows this list).
- Developed a Lambda function to provide a customized notification system.
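The production pipeline used LangChain and a mixture-of-experts router; the core retrieve-then-generate loop can be sketched with the OpenAI client and plain cosine similarity (the model names, document store, and prompt are illustrative placeholders):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of texts; the embedding model name is a placeholder."""
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

def answer(query: str, documents: list[str], doc_embeddings: np.ndarray, k: int = 3) -> str:
    """Retrieve the top-k documents by cosine similarity and ask the model to answer."""
    q = embed([query])[0]
    scores = doc_embeddings @ q / (
        np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(q)
    )
    context = "\n\n".join(documents[i] for i in np.argsort(scores)[::-1][:k])
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; the project used GPT-3.5/4
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return completion.choices[0].message.content
```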
Data Scientist (via Toptal)
Stanford University - Main
- Developed solutions for the Stanford cluster to interact with Databricks via ODBC. Created a package to set up configuration and installation in the cluster. Developed bash scripts and created an R subpackage.
- Created CI pipelines that generated artifacts that matched the cluster architecture.
- Developed workflows for RNA-Seq and ATAC-Seq analyses.
- Developed a solution in Rust to transform genomics datasets, helped with different parts of a population genomics workflow, and generated scripts to run the analysis on the Slurm cluster.
Graph Data Scientist
Toptal Client
- Developed information extraction pipelines for documents using spaCy, regular expressions, and other NLP techniques (a minimal extraction sketch follows this list).
- Tested several methods for unsupervised document mining before the final approach.
- Integrated the extracted information into a graph database. Generated proper schemas for the data.
- Added customized GSQL queries into the solution and adapted a pathfinding algorithm to the schema.
- Developed a dockerized solution to automate the whole process.
- Added CI to the process to automate building, testing, and pull requests.
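A minimal sketch of the extraction step, combining spaCy named entities with a custom regular expression (the pattern and sample text are illustrative):

```python
import re
import spacy

# Assumes the small English model is installed: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")  # illustrative custom pattern

def extract_entities(text: str) -> dict:
    """Combine spaCy named entities with custom regex matches."""
    doc = nlp(text)
    entities = {}
    for ent in doc.ents:
        entities.setdefault(ent.label_, []).append(ent.text)
    entities["CUSTOM_DATE"] = DATE_PATTERN.findall(text)
    return entities

print(extract_entities("Acme Corp signed the agreement with Jane Doe on 2020-05-01."))
```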
Senior Machine Learning Engineer | eCommerce | FT
PROFASEE INC
- Developed an MLOps detailed plan based on AWS SageMaker, including all the architecture components and interaction with other components of the base infrastructure.
- Developed a customized Shiny application to visualize ML modeling results and monitoring metrics with interactive charts. The app was containerized via Docker and deployed as an internal service via ECS.
- Developed automated Markdown reports based on selections from the app.
- Generated an API using FastAPI for querying data for the app. Integrated the app with the API.
- Developed production-level pipelines that generated the required app inputs from the outputs of the ML models, connected inputs and outputs with Amazon S3, and integrated the pipelines with the API.
- Developed base code for interacting with external data sources via FastAPI and PostgreSQL.
- Generated PostgreSQL schemas using SQLAlchemy and wrote adapters to convert between Pydantic and SQLAlchemy models for the app layer and the external data source layer (a minimal sketch follows this list).
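A minimal sketch of the SQLAlchemy-to-Pydantic adapter pattern used in the app layer, assuming Pydantic v2 and hypothetical table and field names:

```python
from pydantic import BaseModel, ConfigDict
from sqlalchemy import Column, Float, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class MetricRecord(Base):
    """SQLAlchemy model backing a hypothetical metrics table."""
    __tablename__ = "model_metrics"
    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)
    rmse = Column(Float, nullable=False)

class MetricOut(BaseModel):
    """Pydantic schema returned by the API, built directly from the ORM object."""
    model_config = ConfigDict(from_attributes=True)
    name: str
    rmse: float

# Usage: validate an ORM row into the API schema.
record = MetricRecord(id=1, name="demand_forecast_v2", rmse=0.42)
print(MetricOut.model_validate(record))
```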
Data Science Engineer
BCG
- Developed a back end using FastAPI, PostgreSQL, and Ray to interact with customer data and show the information in a dashboard. Containerized the full application using Docker Compose.
- Translated data engineering R code written with data.table to Python (Pandas), producing a package that checked output consistency at each step.
- Generated different configuration elements for an on-premise data stack, such as Makefiles, unit tests, and an input data checker package.
- Wrote a scikit-learn machine learning pipeline covering data imputation, encoding, and outlier detection (a condensed sketch follows this list).
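A condensed sketch of such a preprocessing pipeline, with placeholder column names and IsolationForest shown as one possible outlier detector:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import IsolationForest
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERIC_COLS = ["revenue", "units"]        # placeholder column names
CATEGORICAL_COLS = ["region", "segment"]   # placeholder column names

# Impute and scale numeric columns; impute and one-hot encode categorical ones.
preprocess = ColumnTransformer(
    transformers=[
        ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())]), NUMERIC_COLS),
        ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                          ("encode", OneHotEncoder(handle_unknown="ignore"))]), CATEGORICAL_COLS),
    ]
)

def flag_outliers(df: pd.DataFrame) -> pd.Series:
    """Fit the preprocessing pipeline and flag outliers on the transformed features."""
    X = preprocess.fit_transform(df)
    return pd.Series(IsolationForest(random_state=0).fit_predict(X) == -1, index=df.index)
```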
Data Engineer | Data Scientist
Toptal Client
- Developed three Neo4j graph databases from scratch. Defined nodes, edges, and attributes. Performed data preparation with Pandas.
- Automated the generation of the databases and provided Docker containers that built them from the raw data and bundled all the required infrastructure in one place, including the base infrastructure, Neo4j, the databases, and a UI.
- Exposed ports in the Docker application to perform queries and visualize the knowledge graphs through an interactive front end. Added CI using GitHub Actions to ensure the application built without errors.
- Performed unsupervised analyses such as node2vec to understand the knowledge graph structure. Generated interactive charts with Bokeh to explore the results. Provided alternative visualizations of the knowledge graphs using JavaScript libraries.
- Tested multiple methods, such as the Louvain algorithm, and clustered the node2vec embeddings using DBSCAN (a minimal embedding-and-clustering sketch follows this list).
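A minimal sketch of the embedding-plus-clustering step using the node2vec package and scikit-learn (the toy graph and hyperparameters are illustrative):

```python
import networkx as nx
import numpy as np
from node2vec import Node2Vec
from sklearn.cluster import DBSCAN

# Illustrative toy graph; in the project the graph was exported from Neo4j.
graph = nx.karate_club_graph()

# Learn node embeddings with node2vec (hyperparameters are illustrative).
node2vec = Node2Vec(graph, dimensions=32, walk_length=20, num_walks=100, workers=2)
model = node2vec.fit(window=5, min_count=1)

# Cluster the embeddings with DBSCAN to inspect the graph's structure.
nodes = list(graph.nodes())
embeddings = np.array([model.wv[str(node)] for node in nodes])
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(embeddings)
print(dict(zip(nodes, labels)))
```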
Data Science Instructor
Digital House
- Taught a hands-on data science course covering entry-level to more advanced topics.
- Generated custom content for the classes to improve the understanding of specific topics.
- Successfully taught two course cohorts of approximately 40 students each.
Data Scientist
DataArt
- Performed analyses on time-distributed metrics collected from multiple portions of a mobile application and external data sources.
- Generated the data infrastructure for the mobile application based on components such as data lakes, BigQuery, Databricks, Spark, and Azure Synapse Analytics.
- Provided custom metrics using the collected information stored in data lakes and Azure Cosmos DB.
Data Scientist | Machine Learning Engineer
Self-employed
- Developed a graph database using Neo4j (Cypher) and generated subsequent analyses and node embeddings using Node2Vec. Performed entity extraction from documents to populate the database.
- Collaborated in a machine learning project to predict the best opportunities for a food business. The analyses were performed using geographically and time-distributed features.
- Developed a dashboard using Flask and components such as MongoDB.
- Performed sentiment analysis on Twitter data using 1D CNNs (a minimal Keras sketch follows this list).
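A minimal Keras sketch of a 1D CNN sentiment classifier of the kind used here (vocabulary size, sequence length, and hyperparameters are illustrative):

```python
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 20_000   # illustrative vocabulary size
MAX_LEN = 100         # illustrative tweet length after padding

# Embedding -> 1D convolution -> global max pooling -> binary sentiment output.
model = keras.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, 64),
    layers.Conv1D(128, kernel_size=5, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dropout(0.3),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # positive vs. negative sentiment
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```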
Data Scientist
Intellignos
- Developed a collaborative filtering recommender system using Spark and Spark ML that recommends products to millions of users (an ALS sketch follows this list).
- Developed a PySpark pipeline as part of the recommender system solution.
- Generated an end-to-end data engineering pipeline for the recommender system using Spark.
- Worked on statistical analyses and data modeling using elastic net regression to find the best online investment opportunities for a project.
- Performed machine learning engineering tasks, such as the generation of packages, model orchestration, and model tracking with MLflow.
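A condensed PySpark sketch of ALS-based collaborative filtering (the input path, column names, and hyperparameters are placeholders):

```python
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("recommender").getOrCreate()

# Illustrative schema: one row per user-product interaction with a rating.
ratings = spark.read.parquet("s3://bucket/interactions/")  # placeholder path
train, test = ratings.randomSplit([0.8, 0.2], seed=42)

als = ALS(
    userCol="user_id", itemCol="product_id", ratingCol="rating",
    rank=32, regParam=0.1, coldStartStrategy="drop",
)
model = als.fit(train)

# Evaluate on held-out data and generate top-10 recommendations per user.
rmse = RegressionEvaluator(metricName="rmse", labelCol="rating",
                           predictionCol="prediction").evaluate(model.transform(test))
top_products = model.recommendForAllUsers(10)
```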
Data Scientist | Machine Learning Engineer
Softtek
- Developed an end-to-end machine learning project for anomaly detection and time series forecasting on near real-time data from IoT devices, using autoencoders and Bayesian modeling (a minimal autoencoder sketch follows this list).
- Developed data engineering pipelines for the analysis of data from PLCs.
- Generated a workflow for entity extraction from documents.
- Performed unsupervised classification of documents using Doc2Vec.
- Developed a machine learning project for employee turnover prediction using an imbalanced dataset and XGBoost.
- Performed machine learning engineering tasks, such as generation of packages, model orchestration, and model tracking with MLflow.
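A minimal Keras sketch of reconstruction-error anomaly detection with a dense autoencoder; the feature dimension and threshold rule are illustrative, and the production system also used Bayesian components:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

N_FEATURES = 16  # illustrative number of sensor readings per sample

# Symmetric dense autoencoder with a small bottleneck.
autoencoder = keras.Sequential([
    layers.Input(shape=(N_FEATURES,)),
    layers.Dense(8, activation="relu"),
    layers.Dense(4, activation="relu"),   # bottleneck
    layers.Dense(8, activation="relu"),
    layers.Dense(N_FEATURES, activation="linear"),
])
autoencoder.compile(optimizer="adam", loss="mse")

def fit_and_flag(normal_data: np.ndarray, new_data: np.ndarray) -> np.ndarray:
    """Train on normal readings, then flag samples with unusually large reconstruction error."""
    autoencoder.fit(normal_data, normal_data, epochs=20, batch_size=64, verbose=0)
    baseline = np.mean((autoencoder.predict(normal_data, verbose=0) - normal_data) ** 2, axis=1)
    errors = np.mean((autoencoder.predict(new_data, verbose=0) - new_data) ** 2, axis=1)
    threshold = np.percentile(baseline, 99)  # illustrative threshold rule
    return errors > threshold
```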
Postdoctoral Researcher
Washington State University
- Developed software for precision medicine (whole genome sequencing and transcriptomics).
- Generated Python and R packages and pipelines for a terabyte-scale dataset that was processed in parallel on an HPC cluster with Slurm.
- Collaborated in research papers and participated in conferences.
Postdoctoral Researcher
IIB-INTECH UNSAM
- Developed software in R and Python for precision medicine (epigenomics and transcriptomics).
- Generated interfaces using R Shiny to provide no-code approaches for the packages in order to make the software accessible to a broad number of users.
- Collaborated in research papers and participated in conferences.
Experience
A Machine Learning Application for Time-series Forecasting and Anomaly Detection
A Recommender System Using Collaborative Filtering
Prediction of the Number of Days It Will Take for a SKU to be Out of Stock
https://github.com/leandroroser/meli_data_challenge_2021
Prettyparser
https://pypi.org/project/prettyparser/
TensorFlow Speech Recognition Challenge
https://www.kaggle.com/leangab/tensorflow-speech-recognition-challenge
NLP Analysis of E. A. Poe's Corpus of Short Stories
https://www.kaggle.com/leangab/poe-short-stories-corpus-analysis
FastqCleaner
https://github.com/leandroroser/FastqCleaner
Publication: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2961-8
EcoGenetics | An R Package for Landscape Genetics
https://github.com/cran/EcoGenetics
Citation: https://mran.microsoft.com/snapshot/2018-08-31/web/packages/EcoGenetics/citation.html
ChunkR
https://github.com/leandroroser/chunkR
At the time this library was developed, there were few resources in R for getting around the medium-sized local data problem: like Pandas data frames in Python, R data frames are fully allocated in memory. The more recent Vaex Python library is a good example of this kind of approach (though with memory maps and lazy loading). The C++ code interfaced with R is available in the src subfolder of the repository.
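For comparison, the same chunked-processing idea is straightforward in Python with pandas (the file and column names are placeholders):

```python
import pandas as pd

# Process a CSV that doesn't fit comfortably in memory, one chunk at a time.
total = 0
row_count = 0
for chunk in pd.read_csv("large_file.csv", chunksize=100_000):  # placeholder file
    total += chunk["value"].sum()       # "value" is a placeholder column
    row_count += len(chunk)

print("mean value:", total / row_count)
```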
Education
PhD in Biological Science
University of Buenos Aires - Buenos Aires, Argentina
Combined Bachelor's and Master's Degree in Biological Science
University of Buenos Aires - Buenos Aires, Argentina
Certifications
AWS Certified Solutions Architect – Associate
Amazon Web Services
AWS Certified Developer – Associate
Amazon Web Services Training and Certification
Julia Programming 2021
Udemy
MLOps Fundamentals: CI/CD/CT Pipelines of ML with Azure Demo
Udemy
Google Cloud: Insights from Data with BigQuery
Coursera
Getting Started with Google Kubernetes Engine
Coursera
Deep Neural Networks with PyTorch
Coursera
Skills
Libraries/APIs
Scikit-learn, Pandas, Keras, XGBoost, REST APIs, PySpark, TensorFlow, Spark ML, SpaCy, Natural Language Toolkit (NLTK), Spark NLP, Dask, SQLAlchemy, PyTorch
Tools
Amazon Elastic Container Registry (ECR), GitLab CI/CD, Gensim, Git, Amazon SageMaker, GIS, Azure Machine Learning, BigQuery, NGINX, Amazon Virtual Private Cloud (VPC), Terraform, AWS Step Functions, Docker Compose, Boto 3
Languages
Python, R, Bash, Regex, SQL, C++, C++11, JavaScript, CSS, Julia, Rust
Frameworks
Apache Spark, RStudio Shiny, Flask, Bootstrap
Paradigms
Agile Software Development, High-performance Computing (HPC), Continuous Integration (CI), ETL, Azure DevOps, REST, Automated Testing
Platforms
Linux, Databricks, Docker, AWS Lambda, Azure, Azure Synapse, H2O, Amazon EC2, Amazon Web Services (AWS), Kubernetes, AWS IoT, Amazon
Storage
Neo4j, Data Pipelines, PostgreSQL, Azure Cosmos DB, MongoDB, Google Cloud, Amazon S3 (AWS S3), Amazon DynamoDB, RDBMS, Elasticsearch
Industry Expertise
Bioinformatics
Other
Statistics, Machine Learning, PyMC3, Bayesian Inference & Modeling, Spatial Analysis, Azure Data Factory, Azure Data Lake, Artificial Intelligence (AI), Time Series Analysis, Data Science, Data Engineering, Random Forests, Optimization, Applied Mathematics, Pipelines, Unsupervised Learning, Supervised Learning, Gradient Boosting, Data Mining, Statistical Data Analysis, Deep Learning, Slurm Workload Manager, Forecasting, Big Data, Genomics, Biology, Computational Biology, Molecular Biology, Large Data Sets, OpenAI GPT-3 API, CI/CD Pipelines, Data Visualization, Documentation, GitHub Actions, Azure Databricks, Large Scale Distributed Systems, Bayesian Statistics, Hugging Face, Google BigQuery, Natural Language Processing (NLP), Internet of Things (IoT), APIs, Ensemble Methods, Image Processing, Neural Networks, Machine Learning Operations (MLOps), Recommendation Systems, Geospatial Data, Generative Pre-trained Transformers (GPT), Amazon RDS, Language Models, Large Language Models (LLMs), Deep Neural Networks, Serverless, Knowledge Graphs, GraphDB, Finance, TigerGraph, Data Architecture, Load Balancers, Autoscaling Groups, Chatbots