Juan Manuel Berros

Verified Expert in Engineering

Data Scientist and Developer

Location

Buenos Aires, Argentina

Toptal Member Since

June 16, 2022

Juan is a bioinformatics PhD with years of combined experience in data analysis, data science, and back-end software engineering. He excels both in the statistical analysis of any dataset and in the implementation of data workflows with solid software engineering practices.

Portfolio

Takeup

Python, Data Visualization, Data Analysis, SQL, Streamlit, Snowflake...

SFR Analytics

Snowflake, Polars, Jupyter, SQL, GeoPandas, Streamlit, Data Analysis...

Grata Inc

Python, Docker, Django, Django ORM, Pandas, Data Engineering, Data Analysis...

Experience

Data Visualization - 7 years
Web Development - 7 years
Data Analysis - 7 years
Life Science - 5 years
Pandas - 5 years
Machine Learning - 5 years
Data Engineering - 5 years
Data Science - 4 years

Availability

Part-time

Preferred Environment

Linux, Jupyter, Vim Text Editor, Tmux

The most amazing...

...thing I've developed is a complex pipeline of data analysis and an associated web application used daily to generate reports of embryo genetic anomalies.

Work Experience

Software Developer

2023 - PRESENT

Takeup

Developed a suite of internal apps in Streamlit and Snowflake for daily reviews of the model output before pushing rates to the hotels. Maintained and expanded these apps as rate reviewers requested new features.
Heavily refactored the rate quality and rate propagation pipelines and added extensive unit tests to the sensitive parts of the code that deal with rates. Improved logging granularity for more flexible monitoring of each process in Prefect.
Wrote the code supporting multiple triage mechanisms that run on the model-proposed rates and flag different kinds of suspicious rates.
Involved in the next project: fully automating the onboarding of new clients (new hotels) to the system, which is done today with a bundle of scripts and SQL queries.

Technologies: Python, Data Visualization, Data Analysis, SQL, Streamlit, Snowflake, Hospitality, Pytest, Data Build Tool (dbt), Amazon Elastic Container Service (Amazon ECS), Docker, Prefect

Data Scientist

2023 - 2023

SFR Analytics

Built a pipeline to automatically generate PDF reports to share demographic insights from a sample.
Built a Streamlit app to interactively display consumer insights related to custom demographic groups.
Integrated a Snowflake database with internal Streamlit apps.

Technologies: Snowflake, Polars, Jupyter, SQL, GeoPandas, Streamlit, Data Analysis, Data Visualization

Data Analyst | Back-end Software Engineer

2022 - 2023

Grata Inc

Developed a Django back-end module to score and clean up a database of 10 million company names. The quality of the names increased from 6 to 9.8 after this intervention, as measured by a team of data raters.
Built a Django back-end module to score geocoding results from Geocode Earth and Mapbox. The scoring was integrated into the ingestion pipeline to decide when Geocode Earth results were satisfactory, thus saving the cost of expensive Mapbox API calls.
Created a cohesive OOP solution for these problems: a Refiner class and a utils module that expanded progressively. Provided unit and integration tests for all features.
Reviewed PRs from peers with a strong emphasis on code style, readability, maintainability, system design, and testing.

Technologies: Python, Docker, Django, Django ORM, Pandas, Data Engineering, Data Analysis, Docker Compose, AWS CLI, Amazon S3 (AWS S3), Kubernetes, Pull Requests, GitHub, Unicode, Object-oriented Programming (OOP), Unit Testing, Jupyter, Amazon Web Services (AWS), Data Reporting, Data Analytics, Geocoder, Geocoding, Mapbox, Mapbox API, Code Review, Data Visualization, APIs, Jupyter Notebook, Data Preparation, Data Cleansing

Data Scientist | Data Engineer

2018 - 2022

Biocódices

Built an ETL workflow for genetic data in Luigi, a Python workflow framework similar to Airflow. It performed data collection, quality assurance, various filtering steps, and report generation for 2,500+ embryos.
Completed a thesis on the statistical properties of genetic scores of disease propensities in adults and chromosome anomalies in embryos.
Participated in experimental design, performed simulations of thousands of genomes, and iterated over hundreds of parametrizations in an on-premises HPC cluster to achieve the thesis goals.
Performed various one-time analyses of genetic data of different origins, such as embryo mitochondrial DNA distribution, COVID-19 genetic variants in Argentina, the performance of various DNA sequencers and genomic panels, and sample contamination.

Technologies: Pandas, NumPy, Matplotlib, Seaborn, SciPy, Jupyter, Linux, PLINK, MySQL, UNIX Utilities, Ubuntu Linux, Redis, NGINX, Monit, Data Visualization, Ansible, Git, GitHub, Machine Learning, Data Analysis, Genomics, Bioinformatics, GATK, Data Engineering, Data Science, Python, Life Science, Python 3, HTML5, Exploratory Data Analysis, Tmux, Vim Text Editor, Sass, Web MVC, Flask, Continuous Integration (CI), PostgreSQL, Docker, SQL, Software Engineering, Jupyter Notebook, Applied Mathematics, Quantitative Analysis, Web Scraping, Data Analytics, Data Reporting, Code Review, Statistical Analysis, Data Pipelines, APIs, Healthcare, Data Scraping, Data Scientist, Data Preparation, Data Cleansing

Genomics Workshop Teacher

2016 - 2019

Faculty of Exact and Natural Sciences | University of Buenos Aires

Wrote detailed workshop guides for university students covering numerous Linux utilities (pipes, AWK, sed, uniq, sort, column, less, for loops, while loops, and GNU parallel), bioinformatics tools and concepts, and basic shell scripting.
Rewrote the workshop guides in Markdown and migrated them to a GitHub repo for more developer-friendly maintenance. Before my intervention, the workshop was a loose bundle of PDFs and scripts mailed between professors.
Created Ubuntu VMs preinstalled with every required guide and software package for distribution to the lab's computers.

Technologies: Linux, Biology, Genomics, Bioinformatics, Exploratory Data Analysis, Tidyverse, Python 3, Principal Component Analysis (PCA), Statistics, Life Science, R, Python, Jupyter Notebook

Full-stack Ruby on Rails Developer

2016 - 2018

Biocódices

Built an internal laboratory information management system (LIMS) in Ruby on Rails to inventory lab samples, display daily stats, and generate PDF reports to communicate genetic results to patients.
Maintained several workflows of genetic data processing, some more straightforward, like conversions between data formats, and some more complex, like genetic data quality control, filtering, and discovery of disease-related mutations.
Scraped a whole site of health-related content, spacing out the requests with random waits and caching the responses locally, and designed a parsing class based on Beautiful Soup.

Technologies: Ruby on Rails 4, jQuery, HTML, CSS, JavaScript, Web Development, Dashboards, Web Scraping, Workflow, Data Pipelines, MySQL

Full-stack Ruby on Rails Developer

2012 - 2015

Pemasys S.R.L.

Built a Ruby on Rails dashboard to display daily profits of Google Ads/Google AdSense campaigns in 10 countries.
Maintained a Ruby on Rails-based aggregator of thousands of car, apartment, and job ads.
Contributed to the landing page and the payments workflow of the HR portal.

Technologies: jQuery, jQuery UI, JavaScript, HTML, CSS3, Ruby on Rails (RoR), Sass, Linux, UNIX Utilities, MySQL, Ubuntu Linux, NGINX, Web Development, Full-stack Development, Git, GitHub, Tmux, Vim Text Editor, Web MVC, Continuous Integration (CI), SQL, Software Engineering

Experience

Thesis on Genetic Disease Scores

https://jmberros.me/pages/thesis-showcase/

I analyzed the distribution of several genetic disease scores for my Ph.D. This project has tested my knowledge of statistics acquired in college and my coding capabilities from years of expertise in the private sector. It also required heavy use of Python's data science stack to achieve the showcased results.

Complex Pipeline of Genetic Data Processing

https://jmberros.me/pages/paip/

While at Biocódices (a medical genetics laboratory), I built a complex pipeline of genetic data processing based on Luigi, a Python workflow framework similar to Airflow that was developed at Spotify.

The workflow started with the raw output of various DNA sequencers (i.e., machines that read the sequence of DNA samples) and went through numerous standard bioinformatic quality assurance and filtering steps until a subset of medically relevant genetic mutations was kept for reporting.

Cargo Data Interactive Analysis

https://jmberros.me/pages/data-modeling-projects/

For this project, the client had a data challenge to quickly analyze a dataset of sites where cargo accumulates over time and is picked up according to different signals—a mixture of automatic and scheduled pickups.

Web Application for Embryo Genetic Disorder Reporting

https://jmberros.me/pages/lab-app/

I developed an internal web application from scratch in Ruby on Rails for a medical genomics laboratory and maintained it for over five years.

The app serves as an online inventory of patients, clinics, doctors, and DNA testing results and implements the CRUD operations through a user-friendly interface. It also automates form data entry and allows bulk edits of different types of data. It generates daily stats of the lab's results that can be interactively filtered and customized. It also automates the generation and interactive editing of PDF reports to communicate the genetic testing results to clinics.

The maintenance of this app required strict adherence to good coding style and software architecture practices and the development of an exhaustive testing suite.

DNA Analysis of Coronavirus Strains

https://jmberros.me/pages/sars-cov-2/

I worked at a laboratory that was part of Proyecto PAIS, the country-wide effort in Argentina to sequence and analyze SARS-CoV-2 samples. I explored the performance of the sequencing and the mutations found in the coronavirus genome samples.

Cleanup of a Database of 10 Million Company Names

https://jmberros.me/pages/grata-names/

I cleaned up a database of millions of company names by addressing multiple parsing issues. To decide whether a company name needed revision, I developed a quantitative score that captured the intuition of what a "correct" company name was in that context, leveraging available company data like social media handles, the domain, and names from different sources.
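
The actual scoring logic is not described here. As a minimal sketch of the general idea only, a candidate name can be compared against reference signals such as the domain and social media handles, and the best agreement taken as the score. The function names, the normalization, and the use of `difflib` string similarity are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def _normalize(text: str) -> str:
    """Lowercase the text and keep only letters and digits."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def _similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized strings."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a, b).ratio()


def name_score(name: str, domain: str, handles: list[str]) -> float:
    """Hypothetical score: best agreement between the name and any reference signal."""
    candidate = _normalize(name)
    # Drop the top-level domain, e.g. "acme.com" -> "acme" (assumes a simple domain).
    references = [_normalize(domain.rsplit(".", 1)[0])]
    references += [_normalize(handle.lstrip("@")) for handle in handles]
    return max(_similarity(candidate, ref) for ref in references)


# A clean name agrees with the domain; a badly parsed page title does not.
clean = name_score("Acme", "acme.com", ["@acme"])
noisy = name_score("Home | Acme - Official Site", "acme.com", [])
print(clean > noisy)
```

Names scoring below some threshold would then be the ones queued for revision; the threshold itself would have to be tuned against rated examples.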

Scoring of Geocoding Results

https://jmberros.me/pages/grata-locations/

I performed extensive data analysis and implemented an OOP solution to score geolocation API results and decide whether each result was trustworthy. The scoring was deployed to production, where it performs thousands of evaluations daily.
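
The production scoring is not shown here. The sketch below only illustrates the decision pattern: score the result from a cheaper provider and call a more expensive one only when that score falls below a threshold. The `GeocodeResult` fields, the penalty, and the threshold are hypothetical, and the providers are passed in as plain functions rather than real API clients.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GeocodeResult:
    provider: str
    confidence: float       # assumed provider-reported confidence in [0, 1]
    country_matches: bool   # assumed check: result country equals the expected one


def score(result: GeocodeResult) -> float:
    """Hypothetical trust score in [0, 1]: penalize a country mismatch."""
    penalty = 0.0 if result.country_matches else 0.5
    return max(0.0, result.confidence - penalty)


def geocode_with_fallback(
    address: str,
    primary: Callable[[str], GeocodeResult],
    fallback: Callable[[str], GeocodeResult],
    threshold: float = 0.8,
) -> GeocodeResult:
    """Return the primary result if it scores well; otherwise call the fallback."""
    first = primary(address)
    if score(first) >= threshold:
        return first
    return fallback(address)


# Stub providers standing in for real geocoding clients.
def cheap_provider(address: str) -> GeocodeResult:
    return GeocodeResult("cheap", 0.9, country_matches=False)


def expensive_provider(address: str) -> GeocodeResult:
    return GeocodeResult("expensive", 0.95, country_matches=True)


result = geocode_with_fallback("Av. Corrientes 1234", cheap_provider, expensive_provider)
print(result.provider)
```

Passing the providers as callables keeps the decision logic testable without network access.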

Education

2018 - 2022

Ph.D. in Bioinformatics

University of Buenos Aires - Buenos Aires, Argentina

2017 - 2019

MSc-equivalent Specialization in Statistics

University of Buenos Aires - Buenos Aires, Argentina

2011 - 2016

Bachelor's Degree in Biology

University CAECE - Buenos Aires, Argentina

Certifications

APRIL 2023 - PRESENT

Natural Language Processing Specialization

Coursera

AUGUST 2022 - PRESENT

Deep Learning Specialization

Coursera

Skills

Libraries/APIs

Pandas, NumPy, Matplotlib, SciPy, Scikit-learn, Tidyverse, jQuery, Luigi, NetworkX, jQuery UI, Django ORM, Mapbox API

Tools

Seaborn, Jupyter, Git, GitHub, Tmux, Vim Text Editor, Pytest, NGINX, Monit, Ansible, Docker Compose, AWS CLI, Geocoder, Geocoding, Amazon Elastic Container Service (Amazon ECS)

Frameworks

Ruby on Rails (RoR), Flask, Django, Ruby on Rails 4, Streamlit

Languages

HTML, Python, Ruby, Python 3, HTML5, CSS3, SQL, R, Sass, JavaScript, CSS, Snowflake

Paradigms

Data Science, Continuous Integration (CI), Object-oriented Programming (OOP), Unit Testing

Platforms

Jupyter Notebook, Linux, Ubuntu Linux, Docker, Kubernetes, Amazon Web Services (AWS), Mapbox

Industry Expertise

Bioinformatics, Healthcare

Storage

MySQL, MariaDB, Data Pipelines, Redis, PostgreSQL, Amazon S3 (AWS S3)

Other

Machine Learning, Data Visualization, Data Analysis, Biology, Genomics, PLINK, Data Engineering, Principal Component Analysis (PCA), Exploratory Data Analysis, Statistical Analysis, Web Development, Life Science, Data Reporting, Code Review, Regression Modeling, Data Scientist, Data Preparation, Data Cleansing, Data Modeling, Statistics, Probability Theory, UNIX Utilities, GATK, Logistic Regression, Linear Regression, Supervised Learning, Statistical Significance, Statistical Modeling, Full-stack Development, Web Dashboards, Web MVC, K-nearest Neighbors (KNN), Software Engineering, Mathematics, Applied Mathematics, Quantitative Analysis, Web Scraping, Data Analytics, APIs, Data Scraping, Generalized Linear Model (GLM), Clustering, Multivariate Testing, Multivariate Statistical Modeling, Graphs, Directed Acyclic Graphs (DAG), Graph Theory, Decision Trees, Random Forests, Ensemble Methods, Computational Biology, Statistical Methods, Statistical Learning, Supervised Machine Learning, Molecular Biology, Predictive Modeling, Pull Requests, Unicode, Neural Networks, Deep Neural Networks, Deep Learning, Pelias, Natural Language Processing (NLP), LSTM Networks, Long Short-term Memory (LSTM), FastAPI, GPT, Generative Pre-trained Transformers (GPT), Recurrent Neural Networks (RNNs), Gated Recurrent Unit (GRU), Sequence Models, Dashboards, Workflow, Data Quality, Interactive UI, Polars, GeoPandas, Hospitality, Data Build Tool (dbt), Prefect
