Daniel Pérez Rubio

Verified Expert in Engineering

Data Scientist and Developer

Guadalajara, Spain

Toptal member since November 29, 2021

Expertise

Machine Learning Data Analysis NLP Data Visualization Python NumPy Visual Studio Development Spark AWS EMR PyTorch Neural Network Chatbot Development

Bio

Daniel is an experienced data scientist with a master's in signal theory (telecommunications). He has eight years of professional experience, ranging from impactful seed-phase startups like Ketekelo (CTO, two years) to global companies like BASF (senior data scientist, two years). Daniel thrives on challenges, so he's decided to become a freelance data scientist to help Toptal clients achieve excellence in developing their machine learning, deep learning, NLP, and big data solutions.

Portfolio

Non-disclosable NLP startup from MIT (Toptal engagement)

Python, Jupyter Notebook, Topic Modeling...

Daimler

Python, Databricks, PySpark, Spark SQL, Scikit-learn, Pandas, NumPy, SciPy...

BASF

Python, Pandas, Scikit-learn, NumPy, SciPy, Docker, MongoDB, SpaCy...

Experience

Machine Learning - 7 years
APIs - 7 years
Scraping - 7 years
Docker - 5 years
Generative Pre-trained Transformers (GPT) - 5 years
Natural Language Processing (NLP) - 5 years
Python - 5 years
Deep Learning - 3 years

Preferred Environment

Windows, Windows Subsystem for Linux (WSL), Visual Studio Code (VS Code), Docker

The most amazing...

...product I've developed was an internal service desk ticket prioritization model, which helped reduce escalations to 60% with the same workforce.

Work Experience

Data Scientist

2021 - 2022

Non-disclosable NLP startup from MIT (Toptal engagement)

Developed a production pipeline based on model explainability with Shapley values to analyze complex dependencies between language and cultural trends in a company.
Designed a robust replicability setup ensuring AutoML capabilities featuring multiple overfitting, dimensionality, and signal/noise control processes like SMOTE, hyperparameter tuning, SHAP-based feature selection, cross-validation, and seed control.
Refactored and optimized two existing big data pipelines, improving stability and reducing resource allocation with a cost reduction of 75% for one.
Implemented and optimized a topic modeling pipeline based on a large language model (BERT), which helped validate their custom topic modeling approach.
Implemented a flexible fine-tuning process for large language model architectures like BERT and GPT2 and used it to train several topic classification models, which were used to refine their custom topic modeling pipeline.
Implemented a clause parsing and classification pipeline based on large language models for a clause sentiment classification tool.
Designed and proved a semi-supervised learning concept for the iterative refinement of large language models based on auto-labeling techniques.
Implemented a production-ready predictive resource allocation concept that runs at runtime to avoid GPU and system memory issues. The process is based on resource usage logging and a polynomial interpolation pipeline, which helped prevent most memory allocation errors.
Conducted several viability analyses over nine months for different functional features devised by the client, each followed by implementation upon the client's decision and the completion of all open points in their product roadmap.
Kept daily contact with the CTO and CEO, providing all necessary insights and low-level details for them to be able to steer product development, always coming forward with proposals and my expert opinion but prioritizing their will.

Technologies: Python, Jupyter Notebook, Topic Modeling, Natural Language Processing (NLP), Generative Pre-trained Transformers (GPT), Amazon Web Services (AWS), Amazon S3 (AWS S3), Amazon Elastic MapReduce (EMR), Amazon SageMaker, Shapely, Spark SQL, PySpark, SciPy, SpaCy, Scikit-learn, PyTorch, BERT, Docker, Language Models, Deep Learning, StatsModels, Matplotlib, Pandas, NumPy, Clustering, Unsupervised Learning, ETL, HyperOpt, Data Analysis, Dashboards, Data Visualization, Product Leadership, Python 3, Predictive Modeling, Data Engineering
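
To illustrate the predictive resource allocation idea described above (resource usage logging plus polynomial interpolation), here is a minimal sketch. It is not the code from the engagement: the logged values, the names `lagrange_interpolate` and `fits_in_memory`, and the safety margin are all hypothetical, and plain Lagrange interpolation stands in for whatever polynomial method was actually used.

```python
from typing import Sequence


def lagrange_interpolate(xs: Sequence[float], ys: Sequence[float], x: float) -> float:
    """Evaluate at x the unique polynomial passing through every (xs[i], ys[i])."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                # Basis factor: 1 at xi, 0 at every other logged point.
                term *= (x - xj) / (xi - xj)
        total += term
    return total


# Hypothetical usage log: batch size -> peak memory in MiB (made-up numbers).
LOGGED_BATCH_SIZES = [8, 16, 32]
LOGGED_PEAK_MIB = [964.0, 1556.0, 3124.0]


def fits_in_memory(batch_size: float, budget_mib: float, safety_margin: float = 1.1) -> bool:
    """Predict peak memory for a batch size and compare it with the budget."""
    predicted = lagrange_interpolate(LOGGED_BATCH_SIZES, LOGGED_PEAK_MIB, batch_size)
    return predicted * safety_margin <= budget_mib
```

A scheduler could call `fits_in_memory` before launching a job and fall back to a smaller batch size when it returns `False`. Interpolation through every logged point gets unstable as the log grows, so a real pipeline would more likely fit a low-degree polynomial by least squares.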

Senior Data Scientist

2021 - 2021

Daimler

Developed three big data after-sales time series forecasting products: the timing of tire replacement, the timing of brake disc replacement, and the timing of brake pad replacement.
Kickstarted the creation of an experimentation library that lets multiple data scientists run experiments on the same product so that the results can be compared, replicated, and easily communicated to business partners.
Fostered the improvement of the branching model and CI/CD pipelines to eliminate human error and operations overhead and unlock the possibility of developing software packages instead of scripts in notebooks.
Created two new data sources for the team's data lake: worldwide elevation at 30-meter resolution (Aster 30) and regional name localization, covering common country, city, and province names written in over ten languages.
Collaborated in the organization of the 2021 Daimler Innovation Days, a two-day event focused on creating fresh product designs and getting familiar with the most modern technologies.

Technologies: Python, Databricks, PySpark, Spark SQL, Scikit-learn, Pandas, NumPy, SciPy, Matplotlib, MLflow, Plotly, Azure Data Lake, Azure Data Factory (ADF), Azure DevOps, GitHub, Jira, Seaborn, GIS, Spark ML, Docker, Data Science, SQL, Predictive Maintenance, Data Visualization, Data Analysis, Python 3, Predictive Modeling, Data Engineering

Senior Data Scientist

2019 - 2021

BASF

Developed two successful NLP products: a fuzzy-logic expert system for customer name matching and a topic modeling dashboard for patent search engine monitoring.
Built three general-purpose products: a recommendation system for inventory health management, a threat-level classifier for domain name trademark fraud detection, and an escalation probability forecaster for a service desk's ticket prioritization.
Produced a topic and sentiment analysis report on Spain's 2020-2021 employee survey to help HR process thousands of valuable free-text feedback fields.
Ran multiple workshops on machine learning, Git, open-source software, and remote Docker environments.
Supported company culture by fostering and co-organizing local and global initiatives: 10% innovation time, cross-squad collaboration initiatives, and custom training plans.
Led, together with my colleagues, the introduction of a modern Python workflow in a global company, seamlessly using best code practices, CI/CD pipelines, containerization, and remote environments.
Supported the hiring process by conducting multiple technical interviews.
Assumed the shared role of product owner during more than half of the squad's lifetime.
Worked successfully and efficiently under the Scrum and Kanban Agile frameworks, delivering five successful products in two years.

Technologies: Python, Pandas, Scikit-learn, NumPy, SciPy, Docker, MongoDB, SpaCy, Natural Language Toolkit (NLTK), Django, FastAPI, Apache Airflow, Databricks, PyTorch, Helm, Kubernetes, Multiprocessing, R, PySpark, Spark SQL, Microsoft SQL Server, SAP HANA SQLScript, Beautiful Soup, lxml, Plotly, Matplotlib, TextRank, GitLab CI/CD, Seaborn, Data Science, SQL, ETL, Data Analysis, Data Visualization, Python 3, Predictive Modeling, Data Engineering

Senior Data Scientist

2018 - 2019

Rebold

Implemented and maintained a daily CD pipeline for model training, optimization, and deployment for an ad-buying agent.
Developed a complete audience-enriched analytics solution for email campaigns.
Took ownership of three big data daily running products: machine learning ad-buying agent training, email campaign audience-enriched analytics, and cookie-based audience classification.
Supported business intelligence (BI) colleagues, implementing custom Python scripts and SQL queries to improve their processes and help them work more efficiently.
Developed, with the assistance of a freelance DevOps engineer, a Python tool for creating and running Ansible templates based on playbooks.
Kickstarted the development of a citizen development web platform based on Flask.

Technologies: PySpark, Spark SQL, PostgreSQL, Apache Airflow, Amazon Web Services (AWS), Ansible, Git, Python, Scikit-learn, NumPy, Pandas, HyperOpt, Amazon Elastic MapReduce (EMR), Amazon S3 (AWS S3), Spark ML, Flask, Apache Superset, Google Data Studio, Continuous Delivery (CD), ETL, Data Science, SQL, Data Analysis, Dashboards, Data Visualization, Python 3, Predictive Modeling, Data Engineering

Data Scientist

2016 - 2018

Human Forecast

Worked autonomously as the only technical profile in the company.
Developed several PoC solutions with a value proposition based on machine learning, most of which can be found on my GitHub profile.
Sold and developed four final products: a topic discovery engine for market research, a real-time social brand image observatory, an Edge AI handrail use advisor, and a chatbot-based smart contract solution for international commerce tracking.
Established the presales strategies together with the CEO.
Delivered several product presentations to large companies such as Airbus, Navantia, Vall d'Hebron Hospital, and Cemex Ventures.
Worked in diverse fields like topic modeling, human pose recognition, hyperspectral imaging, Edge AI, sentiment analysis, chatbots, smart contracts, data mining, dashboarding, and APIs.

Technologies: Python, Machine Learning, Google Cloud, Scikit-learn, Natural Language Toolkit (NLTK), OpenCV, Node.js, Git, TensorFlow, Pandas, NumPy, Raspberry Pi, Arduino, Solidity, Flask, Matplotlib, Bokeh, Plotly, D3.js, Asyncio, Supervisord, Apache HTTP Server, NGINX, Web3.js, SciPy, Beautiful Soup, Docker, Express.js, Tableau, GIS, C, Data Science, SQL, Dashboards, Data Analysis, Data Visualization, Product Leadership, Predictive Modeling

CTO

2014 - 2015

Ketekelo

Worked as technical lead and full-stack developer, setting the development roadmap and executing it, together with an intern student.
Implemented several custom WordPress/WooCommerce components, API integrations, and a scraping tool.
Pitched at multiple events. Won the best pitch award from Madrid's local government and attracted the interest of investors like Kike Sarasola and Fundación José Manuel Entrecanales.
Gained admission to acceleration programs from IE Business School, Lanzadera, and Madrid's local government.

Technologies: PHP, JavaScript, HTML, Linux Administration, WooCommerce, jQuery, Bootstrap, MySQL, Scraping, APIs, Ajax, SQL, Amazon Web Services (AWS), Product Leadership

Experience

Topic Discovery Engine for Market Research

https://github.com/danielperezr88/TOM

A web service I created that integrates the Google Custom Search API, a web scraper, and a topic modeling pipeline, all managed and consumed through an interactive front end. The service featured the creation of new search terms, D3.js visualizations of the topics identified per search term and day, and navigation across different aggregation levels (topic importance per day, n-gram importance per topic, article weight per topic, etc.). It also included user management with varying permission levels and access to the scraped articles.

It allows the user to define fine-grained searches for fields of interest, keep track of the different topics found per field and their relevance with time, and find out quickly if a new topic of interest appears in that field.
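
One of the aggregation levels mentioned above, topic importance per day, could be derived from per-article topic weights roughly as follows. This is only a sketch under assumptions: the record layout, the sample data, and the function name `topic_importance_per_day` are hypothetical and are not taken from the TOM repository, and averaging is just one plausible way to aggregate.

```python
from collections import defaultdict

# Hypothetical records: (day, article id, {topic: weight of that topic in the article}).
ARTICLES = [
    ("2024-01-01", "a1", {"pricing": 0.75, "logistics": 0.25}),
    ("2024-01-01", "a2", {"pricing": 0.25, "logistics": 0.75}),
    ("2024-01-02", "a3", {"pricing": 1.0}),
]


def topic_importance_per_day(records):
    """Average each topic's weight over all articles scraped on the same day."""
    sums = defaultdict(lambda: defaultdict(float))
    article_counts = defaultdict(int)
    for day, _article_id, weights in records:
        article_counts[day] += 1
        for topic, weight in weights.items():
            sums[day][topic] += weight
    return {
        day: {topic: total / article_counts[day] for topic, total in topics.items()}
        for day, topics in sums.items()
    }


importance = topic_importance_per_day(ARTICLES)
```

Comparing the resulting per-day dictionaries over time is one way to notice a topic that suddenly gains weight or appears for the first time in a field.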

Logistics Dapp: Smart Contract Chatbot app for Freight Transport Tracking

https://github.com/danielperezr88/logistics-dapp

An app I developed that runs on Coinbase's legacy Toshi (since replaced by Wallet), designed to support and log all transactions involved in international freight transport. The interface is fully conversational and features multi-party, role-based permissions and a step-by-step follow-up of the whole process.

The app reached the MVP phase. It has been tested and proved useful, but it is currently not supported because of changes in Coinbase's Dapp platform and the end of the relationship with the sponsor.

Handrail Advisor: On-site Human Pose Tracking Camera for Improved Worker Security

https://github.com/danielperezr88/idoonet-rpi-mvncs

I worked on an Edge AI device loaded with standalone human pose tracking software (a modified fork of a FOSS pose estimation project) and composed of a small, low-power computing unit, an attached camera, and optionally a warning lightbulb, a screen, and a sound system.

Once placed at a point with good visibility of a handrail-guarded area and configured with labels for the handrail positions and associated areas of use, it tracks whether all workers in the area use the handrail correctly and gives them real-time feedback in the preferred way (sound, video, or lightbulb).
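
As a rough idea of how a handrail-use check could work on top of pose keypoints, the sketch below tests whether any detected wrist lies close to a labeled handrail segment. This is an assumption made for illustration and not the logic of the linked repository: the function names, the pixel threshold, and the choice of wrists as the relevant keypoints are all hypothetical.

```python
import math


def point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b, all given as (x, y) pixel pairs."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        # Degenerate segment: both ends are the same point.
        return math.hypot(px - ax, py - ay)
    # Project p onto the line and clamp the projection to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def is_holding_handrail(wrists, handrail, max_distance_px=20.0):
    """Return True if any wrist keypoint is within the threshold of the handrail."""
    a, b = handrail
    return any(point_segment_distance(w, a, b) <= max_distance_px for w in wrists)


# Hypothetical label: a horizontal handrail from (0, 0) to (100, 0) in image space.
HANDRAIL = ((0.0, 0.0), (100.0, 0.0))
```

For example, `is_holding_handrail([(50.0, 10.0)], HANDRAIL)` evaluates to `True` with the default threshold, while wrists at `(50.0, 30.0)` and `(150.0, 0.0)` do not count as holding the rail.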

Publication

Ask an NLP Engineer: From GPT Models to the Ethics of AI

https://www.toptal.com/developers/artificial-intelligence/ask-an-nlp-engineer

Education

2020 - 2020

Postgraduate Course in Artificial Intelligence

Stanford University - Stanford, CA

2006 - 2014

Bachelor’s Degree and Master's Degree in Telecommunications

Universidad de Alcalá - Alcalá de Henares, Madrid, Spain

Certifications

OCTOBER 2022 - PRESENT

Stanford Reinforcement Learning

Stanford University | Online

DECEMBER 2020 - PRESENT

Stanford Natural Language Processing with Deep Learning

Stanford University | Online

AUGUST 2020 - PRESENT

AI for Trading Nanodegree

Udacity

MAY 2015 - PRESENT

Startup Acceleration and Consolidation

IE Business School

DECEMBER 2011 - PRESENT

Introduction to Artificial Intelligence

Sebastian Thrun and Peter Norvig

Skills

Libraries/APIs

Scikit-learn, Natural Language Toolkit (NLTK), Pandas, NumPy, Beautiful Soup, PySpark, PyTorch, Shapely, Matplotlib, Spark ML, jQuery, OpenCV, Node.js, TensorFlow, D3.js, Asyncio, Web3.js, SciPy, SpaCy

Tools

Spark SQL, Amazon Elastic MapReduce (EMR), Git, Supervisord, Apache HTTP Server, Apache Airflow, Named-entity Recognition (NER), GitLab CI/CD, GitHub, Jira, Seaborn, MATLAB, Amazon SageMaker, Plotly, NGINX, Tableau, GIS, Ansible, Helm, StatsModels

Languages

Python, Python 3, C, C++, SQL, PHP, JavaScript, HTML, Solidity, R, Java

Platforms

Visual Studio Code (VS Code), Jupyter Notebook, Windows, Docker, Unix, Raspberry Pi, Arduino, Amazon Web Services (AWS), Databricks, WooCommerce, Kubernetes

Frameworks

Flask, Django, Bootstrap, Express.js, Jinja

Paradigms

Continuous Delivery (CD), ETL, Azure DevOps, Agile, Dynamic Programming

Storage

MySQL, PostgreSQL, Amazon S3 (AWS S3), MongoDB, Google Cloud, Microsoft SQL Server, SAP HANA SQLScript

Other

Statistics, Natural Language Processing (NLP), Machine Learning, K-nearest Neighbors (KNN), TextRank, Data Science, Data Analysis, Data Visualization, Predictive Modeling, Generative Pre-trained Transformers (GPT), Windows Subsystem for Linux (WSL), Numerical Methods, Programming, Embedded Systems, Optimization, Computer Vision, Signal Processing, Deep Learning, Linux Administration, Scraping, APIs, HyperOpt, Apache Superset, Google Data Studio, Support Vector Machines (SVM), Neural Networks, K-means Clustering, Bayesian Statistics, Information Retrieval, Transformers, Word Embedding, Linguistic Tagging, FastAPI, Multiprocessing, lxml, MLflow, Time Series Analysis, Recurrent Neural Networks (RNNs), Sentiment Analysis, Business Planning, Chatbots, Topic Modeling, BERT, Dashboards, Product Leadership, Data Engineering, Electronics, Telematics, Evolutionary Computation, Ajax, Bokeh, Robotics, Motion Planning, Language Models, Azure Data Lake, Azure Data Factory (ADF), Encoder-Decoder Neural Architecture, Sequence Models, Fundamental Analysis, Quantitative Analysis, Portfolio Optimization, Risk Models, Attribution Modeling, Backtesting Trading Strategies, Negotiation, Tax Accounting, Business Modeling, Business Model Canvas, Partnerships, Google Custom Search, Clustering, Unsupervised Learning, Predictive Maintenance, Reinforcement Learning, Monte Carlo Simulations, Deep Reinforcement Learning, Temporal Difference Learning, Monte Carlo
