Paulo Silva, Developer in Belo Horizonte - State of Minas Gerais, Brazil

Paulo Silva

Verified Expert in Engineering

Statistics Developer

Location
Belo Horizonte - State of Minas Gerais, Brazil
Toptal Member Since
August 29, 2022

Paulo is a data scientist with four years of experience across multiple lines of business. With Python as his main stack, he has worked on machine learning, data analysis, visualization, hypothesis testing (such as A/B tests), statistical analysis, and data engineering. Paulo has an engineering background, and problem-solving comes naturally to him.

Portfolio

Oko Exchange Inc.
Python, Data Science, Machine Learning, Data Visualization, Data Engineering...
Quero Quitar
Databricks, Python, Data Analysis, Data Science, Big Data, NumPy, Streamlit...
CBD Industries, LLC
Data Science, Data Visualization, SQL, Databases, Qlik Sense...

Experience

Availability

Full-time

Preferred Environment

Python, Google Cloud Platform (GCP), Amazon Web Services (AWS), Jupyter Notebook, PyCharm, Visual Studio Code (VS Code)

The most amazing thing I've done is use data science to reduce the number of students who dropped out of college.

Work Experience

Data Scientist

2023 - PRESENT
Oko Exchange Inc.
  • Used OpenAI's large language model (LLM) APIs (GPT-3.5 Turbo and GPT-4) to parse unstructured text into a structured format.
  • Utilized Azure Document Intelligence (previously Azure Form Recognizer) to extract text from files, and used LangChain with vector stores to apply LLMs to large text files.
  • Used AWS Lambda for model serving and Amazon S3 (AWS S3) for storing files.
Technologies: Python, Data Science, Machine Learning, Data Visualization, Data Engineering, Data Analysis, Azure ML Studio, Azure, Artificial Intelligence (AI), OpenAI GPT-4 API, OpenAI GPT-3 API, Generative Pre-trained Transformers (GPT), AWS Lambda, Amazon S3 (AWS S3), Natural Language Processing (NLP), OpenAI, ChatGPT, Large Language Models (LLMs), NumPy, Amazon SageMaker, LangChain, Pinecone, Retrieval-augmented Generation (RAG)

Data Scientist

2023 - 2023
Quero Quitar
  • Created predictive models to guide debt-recovery agencies on which debtors to approach.
  • Built models to predict the chance of reaching a debtor, making the company's outreach more effective.
  • Migrated from Pandas to Databricks to process large volumes of data.
Technologies: Databricks, Python, Data Analysis, Data Science, Big Data, NumPy, Streamlit, Scikit-learn, Spark ML

Data Engineer/Analyst for Qlik Sense

2022 - 2022
CBD Industries, LLC
  • Developed an ETL (Extract-Transform-Load) architecture inside of Qlik Sense.
  • Integrated multiple third-party APIs into Qlik Sense.
  • Utilized AWS services to scale the solution for a big data context.
Technologies: Data Science, Data Visualization, SQL, Databases, Qlik Sense, Amazon Web Services (AWS), AWS Lambda, Amazon S3 (AWS S3), APIs, API Integration, Data Analysis, Data Lakes, Data Warehousing, Statistical Analysis, Cloud, Statistical Data Analysis, Mathematical Analysis, Mathematics, Apache Spark

Data Scientist

2021 - 2022
Limehome
  • Developed a dynamic pricing algorithm for a hotel chain.
  • Performed ad-hoc data analysis to help drive the business forward.
  • Supported data analysts in their research by finding inconsistencies, giving feedback, and providing overall technical support.
Technologies: Python, Google Cloud Platform (GCP), Machine Learning, GitHub, Data Analysis, Analytics, Data Science, Data, Relational Databases, BigQuery, Databases, Data Visualization, Software Development, Algorithms, Git, Jupyter Notebook, PyCharm, Statistics, Pandas, SQL, ETL, ETL Tools, Data Reporting, Data Analytics, Big Data, Linear Regression, Clustering, Dashboards, Predictive Modeling, Predictive Analytics, Amazon Web Services (AWS), TensorFlow, Python 3, Hospitality, Google BigQuery, Statistical Analysis, Cloud, XGBoost, Statistical Data Analysis, Mathematical Analysis, Mathematics, Statistical Methods, Tableau, NumPy, Scikit-learn

Data Scientist

2021 - 2021
Chama
  • Developed a dynamic pricing algorithm for a business to connect buyers and sellers of bottled gas.
  • Helped with experiments to roll out new features in a data-driven way.
  • Collaborated in the analytics chapter of the company to spread the data-driven culture.
Technologies: Python, R, Docker, Machine Learning, Data Analysis, Git, Tableau, Back-end, APIs, API Integration, Analytics, Business Intelligence (BI), Data Science, Data, Database Design, Relational Databases, BigQuery, Databases, Data Visualization, Software Development, Algorithms, GitHub, Jupyter Notebook, Statistics, Pandas, SQL, Pytest, ETL, ETL Tools, Data Engineering, Data Reporting, Data Analytics, Data Mining, Web Scraping, Big Data, Linear Regression, Clustering, Dashboards, Predictive Modeling, Predictive Analytics, TensorFlow, Python 3, Data Pipelines, Postman, REST APIs, Data Integration, Kubernetes, Swagger, Google BigQuery, Data Warehousing, Statistical Analysis, Cloud, XGBoost, Statistical Data Analysis, Mathematical Analysis, Mathematics, Statistical Methods, Azure ML Studio, Apache Airflow, NumPy, Amazon SageMaker, Scikit-learn

Data Scientist

2020 - 2021
Zup
  • Helped the company identify employees who were abusing food expenses on trips or with clients.
  • Assisted the company in finding leaders who were not billing clients correctly, causing revenue loss.
  • Created a model to help operations know if they had enough computers for the new employees, based on past hiring behavior.
Technologies: Python, Google Cloud Platform (GCP), BigQuery, Google Data Studio, Machine Learning, Data Analysis, Git, Analytics, Business Intelligence (BI), Data Science, Data, Database Design, Relational Databases, Databases, Data Visualization, Software Development, Algorithms, GitHub, Jupyter Notebook, PyCharm, Statistics, Pandas, SQL, ETL, ETL Tools, Data Engineering, Data Reporting, Data Analytics, Web Scraping, Big Data, Linear Regression, Clustering, Dashboards, Predictive Modeling, Predictive Analytics, Python 3, Google BigQuery, Data Warehousing, Statistical Analysis, Cloud, XGBoost, Statistical Data Analysis, Mathematical Analysis, Mathematics, Statistical Methods, Tableau, Artificial Intelligence (AI), NumPy, Scikit-learn

Data Scientist

2019 - 2020
CRM Educacional
  • Developed a lead-scoring model to help privately owned colleges obtain more students.
  • Created a model to identify the risk of students abandoning college and provided insight on the necessary steps to avoid it.
  • Improved the company's data pipeline, which had been built for small data volumes and had become infeasible as the data grew.
Technologies: Python, Azure DevOps, SQL Server 2016, Azure, Machine Learning, Microsoft Power BI, Back-end, APIs, API Integration, Analytics, Business Intelligence (BI), Data Science, Data, Database Design, Relational Databases, Databases, Data Visualization, Software Development, Algorithms, GitHub, Git, Jupyter Notebook, Statistics, C#, Pandas, SQL, ETL, ETL Tools, Data Engineering, Data Reporting, Data Analytics, Data Mining, Web Scraping, Big Data, Linear Regression, Clustering, Azure Data Factory, Dashboards, Predictive Modeling, Predictive Analytics, Python 3, Data Pipelines, Postman, REST APIs, Data Integration, Swagger, Data Analysis, Data Lakes, Data Warehousing, Statistical Analysis, Cloud, XGBoost, Statistical Data Analysis, Mathematical Analysis, Mathematics, Statistical Methods, Tableau, Apache Spark, Azure ML Studio, NumPy, Amazon SageMaker, Scikit-learn

Data Scientist

2019 - 2019
Maxtrack
  • Developed a model to predict if a car was stolen based on tracker data and previously known user behavior.
  • Improved the company's data pipeline using Spark since the previous one was no longer feasible for the amount of processed data.
  • Analyzed data to determine if some previously developed models were working as expected.
Technologies: Python, MongoDB, Redis, Machine Learning, Data Analysis, Spark, Git, Back-end, APIs, Data Science, Data, Databases, Data Visualization, Software Development, Algorithms, GitHub, Jupyter Notebook, PyCharm, Statistics, Pandas, SQL, ETL, ETL Tools, Data Engineering, Data Reporting, Data Analytics, Data Mining, Web Scraping, Big Data, Linear Regression, Clustering, Dashboards, Predictive Modeling, Predictive Analytics, Amazon Web Services (AWS), Python 3, Data Pipelines, Postman, REST APIs, Data Integration, Data Warehousing, Statistical Analysis, Cloud, XGBoost, Statistical Data Analysis, Mathematical Analysis, Mathematics, Statistical Methods, Apache Spark, NumPy, Scikit-learn

Data Scientist

2019 - 2019
4hoofs
  • Created a model to predict a cow's daily milk yield.
  • Helped the company identify new locations to market in, based on milk producers' public data.
  • Developed an IoT device to monitor the milk quality in a tank.
Technologies: Python, Machine Learning, Data Analysis, MongoDB, MySQL, JavaScript, Back-end, APIs, API Integration, Data Science, Data, Database Design, Relational Databases, Databases, Data Visualization, Software Development, Algorithms, GitHub, Git, Jupyter Notebook, Statistics, C#, Android, React Native, PostgreSQL, Pandas, SQL, Pytest, ETL, ETL Tools, Data Engineering, Data Reporting, Data Analytics, Data Mining, Web Scraping, Linear Regression, Clustering, Dremio, Dashboards, Predictive Modeling, Predictive Analytics, Amazon Web Services (AWS), Python 3, Data Pipelines, Postman, REST APIs, Data Integration, Swagger, Data Lakes, Data Warehousing, Statistical Analysis, Cloud, XGBoost, Statistical Data Analysis, Mathematical Analysis, Mathematics, Statistical Methods, Apache Spark, Artificial Intelligence (AI), Web Development, NumPy, Scikit-learn

Projects

College Dropout Prediction

Privately owned colleges in Brazil have a significant problem. Since they are not the top colleges in the country, as the federal universities are, students who join them are usually from lower-income families and tend to drop out frequently.

Students drop out mainly because they face financial hardships, live too far from the campus, can't manage to work and study simultaneously, or even struggle academically and think it's not worth the effort.

Dropping out is a massive problem for the college, which misses out on years of revenue from those students. Therefore, it pays for the college to offer short-term incentives that retain students in the long run.

With that in mind, I developed a machine learning model to identify the risk and the cause of dropping out. I then provided insight on which incentives the college could offer to retain students.
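As a rough illustration of scoring both a risk and a main cause, here is a minimal sketch. It is not the model from the project: the feature names, weights, and bias are hypothetical values made up for this example, and a hand-written logistic score stands in for a trained model.

```python
import math

# Hypothetical weights, invented for illustration only (not learned from real data).
WEIGHTS = {"late_payments": 0.8, "commute_hours": 0.5, "failed_courses": 0.6}
BIAS = -3.0


def dropout_risk(student: dict[str, float]) -> tuple[float, str]:
    """Return a risk score in (0, 1) and the feature that contributes most to it."""
    # Per-feature contribution to the logit: weight times feature value.
    contributions = {name: WEIGHTS[name] * student[name] for name in WEIGHTS}
    logit = BIAS + sum(contributions.values())
    risk = 1 / (1 + math.exp(-logit))  # logistic function
    main_factor = max(contributions, key=contributions.get)
    return risk, main_factor


# Made-up student record, for demonstration only.
risk, factor = dropout_risk(
    {"late_payments": 3, "commute_hours": 1, "failed_courses": 0}
)
print(f"risk={risk:.2f}, main factor={factor}")
```

Reporting the largest contribution alongside the score is one simple way to connect a prediction to a possible incentive, such as financial support when payment issues dominate.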

Car-theft Prediction Using Tracking Data

Some insurance companies require clients to allow a tracking device to be installed in their car, because tracking the car's location makes it easier to recover after a theft. Typically, however, it takes a while for a theft to be reported, and by then it is sometimes too late: thieves either remove the tracker or move the car to a location (usually a favela) that police avoid entering without a major reason because it is too dangerous for them.

The project revolved around tracking users' data, establishing each user's typical behavior with one machine learning model, and then predicting whether the car was being stolen with a second machine learning model. The objective was to detect thefts even before the user reported them, to speed up the recovery of the car.

For this project, I used Python as the programming language. For data processing, we used Apache Spark on the Databricks platform, since the data volume was large and processing on a single machine was too slow for the time-sensitive requirements. The historical data was stored in a MongoDB database, and we served the model through a Flask API.
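The two-stage idea described above (learn a user's typical behavior, then flag deviations from it) can be sketched as follows. This is only an illustration under stated assumptions: a simple mean and standard deviation baseline with a z-score threshold stands in for both machine learning models, and the feature (distance from home in km), the threshold, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Baseline:
    """Typical behavior for one user: mean and spread of one trip feature."""
    mean_km: float
    std_km: float


def fit_baseline(distances_km: list[float]) -> Baseline:
    """Stage 1 (stand-in): summarize a user's historical distances from home."""
    return Baseline(mean(distances_km), pstdev(distances_km))


def deviation_score(baseline: Baseline, distance_km: float) -> float:
    """Number of standard deviations between a new reading and the user's norm."""
    spread = max(baseline.std_km, 1e-9)  # guard against division by zero
    return abs(distance_km - baseline.mean_km) / spread


def looks_stolen(baseline: Baseline, distance_km: float, threshold: float = 3.0) -> bool:
    """Stage 2 (stand-in): flag readings that deviate strongly from the baseline."""
    return deviation_score(baseline, distance_km) > threshold


history = [4.8, 5.1, 5.0, 4.9, 5.2]  # made-up distances from home, in km
baseline = fit_baseline(history)
print(looks_stolen(baseline, 5.3))   # reading close to the usual pattern
print(looks_stolen(baseline, 40.0))  # reading far outside the usual pattern
```

Separating the baseline from the decision step mirrors the split between the two models: the per-user summary can be refreshed from historical data, while the flagging step runs on each new reading.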

Dynamic Pricing to Sell Cooking Gas Bottles

In Brazil, there's a peculiar industry for selling cooking gas in bottles. For a long time, this industry operated mostly offline: a client would call the nearest vendors and ask for a delivery, or a vendor would sell by driving around neighborhoods in a truck.

However, when the gas runs out while a person is cooking, they want a new bottle delivered to their home as soon as possible, since going without it may ruin their meal.

With that in mind, the company connected vendors and clients through a mobile app. The issue was that these vendors were not used to fierce competition and were very displeased with us.

To calm the situation, we developed a dynamic pricing algorithm using machine learning to maintain the prices at a sustainable level for the vendors while also being advantageous for the clients.

For this project, I used Python for the programming part, Flask to serve my model, and Docker to containerize the model with the API.
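The pricing goal described above (prices that stay sustainable for vendors while remaining advantageous for clients) can be pictured as a bounded price adjustment. The sketch below is a rule-based stand-in, not the machine learning algorithm from the project: the function name, the demand and supply inputs, the `sensitivity` parameter, and all numbers are hypothetical.

```python
def dynamic_price(
    base_price: float,
    demand: int,
    supply: int,
    floor: float,
    ceiling: float,
    sensitivity: float = 0.2,
) -> float:
    """Nudge the price by the demand/supply imbalance, then clamp it.

    `floor` represents the lowest price that is sustainable for vendors;
    `ceiling` represents the highest price that is still attractive to clients.
    """
    if supply <= 0:
        return ceiling  # no vendors available: cap at the ceiling
    imbalance = (demand - supply) / supply
    raw_price = base_price * (1 + sensitivity * imbalance)
    return min(max(raw_price, floor), ceiling)


# Made-up scenarios: balanced market, excess demand, and no demand.
for demand, supply in [(10, 10), (30, 10), (0, 10)]:
    print(demand, supply, dynamic_price(100.0, demand, supply, floor=90.0, ceiling=120.0))
```

In a learned version, the adjustment term would come from a model's prediction, while the floor and ceiling would still act as guardrails for both sides of the marketplace.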

Education

2012 - 2017

Bachelor's Degree in Control and Automation Engineering

Federal University of Minas Gerais (UFMG) - Belo Horizonte, Minas Gerais, Brazil

2015 - 2016

Master's Degree in Control Engineering

Lund University - Lund, Skane, Sweden

Certifications

FEBRUARY 2021 - PRESENT

Natural Language Processing Nanodegree

Udacity

Libraries/APIs

Pandas, REST APIs, XGBoost, NumPy, Scikit-learn, TensorFlow, Spark ML

Tools

BigQuery, Tableau, ChatGPT, GitHub, PyCharm, Git, Postman, Amazon SageMaker, Microsoft Power BI, Pytest, Qlik Sense, Azure ML Studio, Apache Airflow, AI Prompts

Languages

Python, SQL, Python 3, C, R, JavaScript, C#

Paradigms

Data Science, ETL, Database Design, Azure DevOps, Business Intelligence (BI)

Platforms

Jupyter Notebook, Google Cloud Platform (GCP), Visual Studio Code (VS Code), Amazon Web Services (AWS), Docker, Azure, Android, Kubernetes, AWS Lambda, Databricks

Storage

Data Pipelines, Databases, SQL Server 2016, MySQL, Redis, Relational Databases, Data Integration, MongoDB, PostgreSQL, Amazon S3 (AWS S3), Data Lakes

Frameworks

Flask, Apache Spark, Streamlit, Spark, React Native, Swagger

Other

Machine Learning, Data Analysis, Data Visualization, Software Development, Statistics, Algorithms, API Integration, Analytics, Data, ETL Tools, Data Reporting, Data Analytics, Big Data, Linear Regression, Clustering, Dashboards, Predictive Modeling, Predictive Analytics, Statistical Analysis, Statistical Data Analysis, Mathematical Analysis, Mathematics, Statistical Methods, Artificial Intelligence (AI), OpenAI, Back-end, APIs, Data Engineering, Data Mining, Signal Processing, Hospitality, Google BigQuery, Data Warehousing, Cloud, Web Development, Large Language Models (LLMs), LangChain, Retrieval-augmented Generation (RAG), Industrial IT, Google Data Studio, Natural Language Processing (NLP), Web Scraping, Azure Data Factory, Dremio, Generative Pre-trained Transformers (GPT), OpenAI GPT-4 API, OpenAI GPT-3 API, Pinecone, Control Engineering, Advanced Analytics, Multivariate Analysis (MVA)
