Juan Manuel Berros
Verified Expert in Engineering
Data Scientist and Developer
Buenos Aires, Argentina
Toptal member since June 16, 2022
Juan is a bioinformatics PhD with years of combined experience in data analysis, data science, and back-end software engineering. He excels both in the statistical analysis of any dataset and in the implementation of data workflows with solid software engineering practices.
Preferred Environment
Linux, Jupyter, Vim Text Editor, Tmux
The most amazing...
...thing I've developed is a complex pipeline of data analysis and an associated web application used daily to generate reports of embryo genetic anomalies.
Work Experience
Software Developer
Takeup
- Developed a suite of internal apps in Streamlit and Snowflake for daily reviews of the model output before pushing rates to the hotels. Maintained and expanded these apps as rate reviewers requested new features.
- Heavily refactored the rate quality and rate propagation pipelines and added extensive unit tests to the sensitive parts of the code that handle rates. Improved the granularity of the logging for more flexible monitoring of each process in Prefect.
- Wrote the code that supports multiple triage mechanisms, which run on the model-proposed rates and flag different kinds of suspicious rates.
- Took part in the next project: fully automating the onboarding of new clients (new hotels) to the system, which is currently done with a bundle of scripts and SQL queries.
Data Scientist
SFR Analytics
- Built a pipeline to automatically generate PDF reports to share demographic insights from a sample.
- Built a Streamlit app to interactively display consumer insights related to custom demographic groups.
- Integrated a Snowflake database with internal Streamlit apps.
Data Analyst | Back-end Software Engineer
Grata Inc
- Developed a Django back-end module to score and clean up a database of 10 million company names. The quality of the names increased from 6 to 9.8 after this intervention, as measured by a team of data raters.
- Built a Django back-end module to score geocoding results from Geocode Earth and Mapbox. The scoring was integrated into the ingestion pipeline to decide when Geocode Earth results were satisfactory, thus saving money from expensive Mapbox API calls.
- Created a cohesive OOP solution for these problems: a Refiner class and a utils module that expanded progressively. Every feature was covered by unit and integration tests.
- Reviewed PRs from peers with a strong emphasis on code style, readability, maintainability, system design, and testing.
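The geocoding bullet above describes a score that decides when a cheaper provider's result is good enough to skip a more expensive call. A minimal sketch of that idea follows. The `GeocodeResult` type, the token-overlap score, the `0.7` threshold, and the provider callables are all hypothetical illustrations, not the module actually built at Grata.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class GeocodeResult:
    provider: str      # which service produced the result
    query: str         # the address text that was sent
    label: str         # the address text that came back
    confidence: float  # assumed provider-reported confidence in [0, 1]


def score_result(result: GeocodeResult) -> float:
    """Hypothetical score: share of query tokens found in the label,
    weighted by the provider's own confidence."""
    query_tokens = set(result.query.lower().split())
    label_tokens = set(result.label.lower().replace(",", " ").split())
    if not query_tokens:
        return 0.0
    overlap = len(query_tokens & label_tokens) / len(query_tokens)
    return overlap * result.confidence


Provider = Callable[[str], Optional[GeocodeResult]]


def geocode_with_fallback(
    query: str,
    cheap: Provider,
    expensive: Provider,
    threshold: float = 0.7,  # assumed cut-off, would need tuning on real data
) -> Optional[GeocodeResult]:
    """Try the cheap provider first; call the expensive one only when the
    cheap result is missing or scores below the threshold."""
    first = cheap(query)
    if first is not None and score_result(first) >= threshold:
        return first
    return expensive(query)
```

Passing the providers in as plain callables keeps the decision logic testable without network access: fake providers can stand in for the real API clients.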
Data Scientist | Data Engineer
Biocódices
- Built an ETL workflow of genetic data in Python Luigi (similar to Airflow). It performed data collection, quality assurance, various filters, and the generation of reports for 2,500+ embryos.
- Completed a thesis on the statistical properties of genetic scores of disease propensities in adults and chromosome anomalies in embryos.
- Participated in experimental design, performed simulations of thousands of genomes, and iterated over hundreds of parametrizations in an on-premises HPC cluster to achieve the thesis goals.
- Performed various one-time analyses of genetic data of different origins like embryo mitochondrial DNA distribution, COVID-19 genetic variants in Argentina, the performance of varying DNA sequencers and genomic panels, and sample contamination.
Genomics Workshop Teacher
Faculty of Exact and Natural Sciences | University of Buenos Aires
- Wrote detailed workshop guides for university students covering numerous Linux utilities (pipes, AWK, sed, uniq, sort, column, less, for loops, while loops, and GNU parallel), bioinformatic tools and concepts, and basic shell scripting.
- Rewrote the workshop guides in Markdown and migrated them to a GitHub repo for more developer-friendly maintenance. Before my intervention, the workshop was a loose bundle of PDFs and scripts mailed between professors.
- Created Ubuntu VMs with every needed guide and software installed to be distributed on the lab's computers.
Full-stack Ruby on Rails Developer
Biocódices
- Built an internal laboratory information management system (LIMS) software in Ruby on Rails to inventory lab samples, display daily stats, and generate PDF reports to communicate genetic results to patients.
- Maintained several genetic data processing workflows, some straightforward, like conversions between data formats, and some more complex, like genetic data quality control, filtering, and discovery of disease-related mutations.
- Scraped an entire site of health-related content, spacing out the requests with random waits and caching the responses locally, and designed a parsing class based on Beautiful Soup.
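The scraping approach in the last bullet (random waits between requests plus a local cache) can be sketched as below. This is an illustrative version using only the Python standard library. The function names, the wait range, and the cache layout are assumptions, and the Beautiful Soup parsing step is left out.

```python
import hashlib
import random
import time
import urllib.request
from pathlib import Path
from typing import Callable


def fetch_url(url: str) -> str:
    """Download a page and return its body as text."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def cached_fetch(
    url: str,
    cache_dir: str,
    fetch: Callable[[str], str] = fetch_url,
    min_wait: float = 1.0,   # assumed lower bound of the pause, in seconds
    max_wait: float = 5.0,   # assumed upper bound of the pause, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return the page for `url`, hitting the network only on a cache miss."""
    cache_path = Path(cache_dir)
    cache_path.mkdir(parents=True, exist_ok=True)
    # Hash the URL so any URL maps to a safe, fixed-length file name.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    cache_file = cache_path / f"{key}.html"
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    # Pause for a random interval so requests are not sent in a tight loop.
    sleep(random.uniform(min_wait, max_wait))
    html = fetch(url)
    cache_file.write_text(html, encoding="utf-8")
    return html
```

Because cached pages are served without waiting, a parser can be rerun many times during development while each page is requested from the site only once.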
Full-stack Ruby on Rails Developer
Pemasys S.R.L.
- Built a Ruby on Rails dashboard to display daily profits of Google Ads/Google AdSense campaigns in 10 countries.
- Maintained a Ruby on Rails-based ads aggregator with thousands of car, apartment, and job ads.
- Contributed to the landing page and the payments workflow of the HR portal.
Experience
Thesis on Genetic Disease Scores
https://jmberros.me/pages/thesis-showcase/
Complex Pipeline of Genetic Data Processing
https://jmberros.me/pages/paip/
The workflow started with the raw output of various DNA sequencers (i.e., machines that read the sequence of DNA samples) and went through numerous standard bioinformatic quality assurance and filtering steps until a subset of medically relevant genetic mutations was kept for reporting.
Cargo Data Interactive Analysis
https://jmberros.me/pages/data-modeling-projects/
Web Application for Embryo Genetic Disorder Reporting
https://jmberros.me/pages/lab-app/
The app serves as an online inventory of patients, clinics, doctors, and DNA testing results and implements the CRUD operations through a user-friendly interface. It also automates the data entry of forms and allows bulk edits of different types of data. It generates daily stats of the lab's results that can be interactively filtered and customized. It also automates the generation and interactive editing of PDF reports to communicate the genetic testing results to clinics.
The maintenance of this app required strict adherence to good coding style and software architecture practices and the development of an exhaustive testing suite.
DNA Analysis of Coronavirus Strains
https://jmberros.me/pages/sars-cov-2/
Cleanup of a Database of 10 Million Company Names
https://jmberros.me/pages/grata-names/
Scoring of Geocoding Results
https://jmberros.me/pages/grata-locations/
Education
Ph.D. in Bioinformatics
University of Buenos Aires - Buenos Aires, Argentina
MSc-equivalent Specialization in Statistics
University of Buenos Aires - Buenos Aires, Argentina
Bachelor's Degree in Biology
University CAECE - Buenos Aires, Argentina
Certifications
Natural Language Processing Specialization
Coursera
Deep Learning Specialization
Coursera
Skills
Libraries/APIs
Pandas, NumPy, Matplotlib, SciPy, Scikit-learn, Tidyverse, jQuery, Luigi, NetworkX, jQuery UI, Django ORM, Mapbox API
Tools
Seaborn, Jupyter, Git, GitHub, Tmux, Vim Text Editor, Pytest, NGINX, Monit, Ansible, Docker Compose, AWS CLI, Geocoder, Geocoding, Amazon Elastic Container Service (ECS), Prefect
Languages
HTML, Python, Ruby, Python 3, HTML5, CSS3, SQL, R, Sass, JavaScript, CSS, Snowflake
Frameworks
Ruby on Rails (RoR), Flask, Django, Ruby on Rails 4, Streamlit
Platforms
Jupyter Notebook, Linux, Ubuntu Linux, Docker, Kubernetes, Amazon Web Services (AWS), Mapbox
Industry Expertise
Bioinformatics, Healthcare
Paradigms
Continuous Integration (CI), Object-oriented Programming (OOP), Unit Testing
Storage
MySQL, MariaDB, Data Pipelines, Redis, PostgreSQL, Amazon S3 (AWS S3)
Other
Machine Learning, Data Visualization, Data Science, Data Analysis, Biology, Genomics, PLINK, Data Engineering, Principal Component Analysis (PCA), Exploratory Data Analysis, Statistical Analysis, Web Development, Life Science, Data Reporting, Code Review, Regression Modeling, Data Scientist, Data Preparation, Data Cleansing, Data Modeling, Statistics, Probability Theory, UNIX Utilities, GATK, Logistic Regression, Linear Regression, Supervised Learning, Statistical Significance, Statistical Modeling, Full-stack Development, Web Dashboards, Web MVC, K-nearest Neighbors (KNN), Software Engineering, Mathematics, Applied Mathematics, Quantitative Analysis, Web Scraping, Data Analytics, APIs, Data Scraping, Generalized Linear Model (GLM), Clustering, Multivariate Testing, Multivariate Statistical Modeling, Graphs, Directed Acyclic Graphs (DAG), Graph Theory, Decision Trees, Random Forests, Ensemble Methods, Computational Biology, Statistical Methods, Statistical Learning, Supervised Machine Learning, Molecular Biology, Predictive Modeling, Pull Requests, Unicode, Neural Networks, Deep Neural Networks (DNNs), Deep Learning, Pelias, Natural Language Processing (NLP), LSTM Networks, Long Short-term Memory (LSTM), FastAPI, Generative Pre-trained Transformers (GPT), Recurrent Neural Networks (RNNs), Gated Recurrent Unit (GRU), Sequence Models, Dashboards, Workflow, Data Quality, Interactive UI, Polars, GeoPandas, Hospitality, Data Build Tool (dbt), Cargo