
Matheus Gaignoux Raiol

Verified Expert in Engineering

Data Engineer and Developer

São Paulo - State of São Paulo, Brazil

Toptal member since November 29, 2022

Bio

Matheus is a data engineer who enjoys data modeling and architecting and developing data pipelines. He is mainly interested in use cases in the finance and retail industries. Matheus aims to design simple, robust solutions that offer high functionality and require little maintenance.

Portfolio

Shopee
Apache Hive, Apache Spark, PySpark, SQL, Data Marts, Data Modeling...
Via
Azure Databricks, Azure Data Factory (ADF), Azure Data Lake, Apache Kafka...
Inmetrics
Python, Spark, Scikit-learn, Databricks, Redshift, Apache Kafka, SQL...

Experience

  • Data Engineering - 4 years
  • Python - 4 years
  • PySpark - 4 years
  • Databricks - 3 years
  • Apache Airflow - 3 years
  • SQL - 3 years
  • Delta Lake - 2 years
  • Azure Data Factory (ADF) - 2 years

Availability

Part-time

Preferred Environment

Databricks, Azure Data Factory (ADF), Apache Airflow, Python, Spark, PySpark, Pandas, Docker, Azure, Data Lakes

The most amazing...

...thing I've implemented is a unified medallion architecture for files with different schemas from several sources.

Work Experience

Senior Data Engineer

2022 - 2022
Shopee
  • Designed and implemented a custom paradigm for feeding the fraud analysis team's data mart tables.
  • Developed optimized Spark SQL jobs, reducing memory usage and queue congestion.
  • Worked on large datasets using Spark as the processing tool. Pipeline development followed software engineering best practices and a standard pattern so future modifications would be easy to understand.
Technologies: Apache Hive, Apache Spark, PySpark, SQL, Data Marts, Data Modeling, Data Engineering, ETL, Data, Spark

Senior Data Engineer

2021 - 2022
Via
  • Developed a data pipeline to ingest files from several sources to be transformed into a unified set of tables for chargeback analysis.
  • Designed and implemented a process to extract, transform, load, and analyze purchase orders made in the marketplace web platform. The goal of the analysis step was to identify fraudulent orders.
  • Created pipelines to feed tables in the fraud team's data mart. Understanding fraud concepts and how transactional business rules were reflected in the available data was required to guarantee clarity for downstream applications.
Technologies: Azure Databricks, Azure Data Factory (ADF), Azure Data Lake, Apache Kafka, Python, SQL, Delta Lake, Data Pipelines, Query Optimization, ETL, ELT, Data, APIs, Data Engineering, Spark

Data Engineer

2020 - 2021
Inmetrics
  • Refactored data pipelines to change the modeling of a data warehouse to a star schema, increasing query performance.
  • Implemented a machine learning workflow for use in a capacity planning platform.
  • Developed a near real-time application to ingest data from a third-party company into a data lake.
Technologies: Python, Spark, Scikit-learn, Databricks, Redshift, Apache Kafka, SQL, Data Marts, Data Pipelines, Amazon Web Services (AWS), ETL, APIs, Scala, Data Engineering, Amazon Athena, AWS Glue, Amazon S3 (AWS S3), AWS Lambda, Amazon Elastic MapReduce (EMR)

Data Engineer

2019 - 2020
EY
  • Created a data warehouse for BI use cases covering several credit products of a major Brazilian bank.
  • Developed data pipelines to feed tables inside the data warehouse and optimized queries since the company produced large daily data loads.
  • Generated pipelines for daily feature store updates. This process drew on knowledge of business rules, calculations, and how the ML models were developed and used.
Technologies: Python, Spark, Apache Hive, PostgreSQL, Data Warehousing, SQL, Shell Scripting, Hadoop, Data Engineering

Experience

Fraud Analysis Data Pipeline

In this project, I designed and developed the steps of a refactored fraud analysis data pipeline for purchase orders placed in a web marketplace. The pipeline's primary purpose was to evaluate each order against key factors and either approve or cancel it. The decision combined an ML model score threshold with a set of dynamic, flexible rules applied to each order. All of the information was then returned to the front-end application, establishing a link between the OLAP and OLTP systems. The changes and new features added to this pipeline cut processing time by 75%, from one hour to at most 15 minutes.
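Below is a minimal PySpark sketch of the decision step described above, assuming a table of scored orders and two illustrative rules expressed as SQL predicates. The table names, columns, threshold, and rules are hypothetical and only show the shape of the approach, not the production implementation.

# Sketch: combine an ML score threshold with configurable rules per order.
# All names (silver.scored_orders, fraud_score, the RULES list) are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("fraud-decision-sketch").getOrCreate()

SCORE_THRESHOLD = 0.85  # assumed value; the real threshold came from the model

# Dynamic rules kept as SQL predicates so they can change without code changes.
RULES = [
    "amount > 5000 AND is_new_customer = true",
    "shipping_country <> billing_country",
]

orders = spark.table("silver.scored_orders")  # hypothetical scored-orders table

rule_hit = F.lit(False)
for predicate in RULES:
    rule_hit = rule_hit | F.expr(predicate)

decisions = orders.withColumn(
    "decision",
    F.when((F.col("fraud_score") >= SCORE_THRESHOLD) | rule_hit, F.lit("CANCEL"))
     .otherwise(F.lit("APPROVE")),
)

# Written back so the front-end application (OLTP side) can consume the result.
decisions.write.mode("overwrite").saveAsTable("gold.order_decisions")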

Medallion Architecture for Chargeback Analysis

I designed and implemented a data pipeline capable of extracting files from several sources and then processing and sending them to downstream applications handling critical information on chargebacks. The following stages of data storage were created:

• Staging layer (raw files)
• Bronze layer (same information as the previous stage but unifying the file schemas)
• Silver layer (last status of a set of keys)
• Gold layer (combined information)

The medallion architecture provided the needed flexibility: changes to a business rule could be applied to the appropriate stage alone without compromising the rest. Furthermore, this architecture ensured a unified source of truth for any application downstream of the gold layer. Rather than having each consumer process raw files in a potentially inconsistent way, building this ready-to-use, refined business layer enabled a high level of reusability. Lastly, applying programming best practices kept job runtimes short and reduced consumption of cluster memory and cloud resources.
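As a rough illustration, the bronze-to-silver step described above could look like the following PySpark sketch, which keeps only the latest status per key; the table names, the chargeback_id key, and the event_ts ordering column are assumptions for the example.

# Sketch: from the unified bronze table, keep the last status of each key.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("chargeback-medallion-sketch").getOrCreate()

bronze = spark.table("bronze.chargebacks")  # assumed unified-schema bronze table

# Latest record per chargeback, ordered by event timestamp.
w = Window.partitionBy("chargeback_id").orderBy(F.col("event_ts").desc())
silver = (
    bronze
    .withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)
    .drop("rn")
)

silver.write.format("delta").mode("overwrite").saveAsTable("silver.chargeback_status")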

Data Mart Custom Paradigm

The key idea was to design and implement data workflows following a custom paradigm, with each workflow dedicated to a single subject, to replace some of the processes that fed tables in a fraud analysis team's data mart. This improved quality and increased reliability for all applications relying on the pipelines. Since the data environment was not built on a lakehouse architecture, the custom paradigm assumed two storage layers:

• Raw layer: holds all historical loads. Each load is a daily snapshot of the transformations applied to the data according to the business rules behind it; uniqueness is not enforced at this layer.
• Business layer: holds deduplicated data with a small volume, covering only the time interval needed for business analysis. Unlike the raw layer, which grows by daily appends, its tables are overwritten every time the workflows run.
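A minimal PySpark sketch of this two-layer loading pattern, assuming a transactional source table, a transaction_id business key, and a 90-day analysis window; all of these names and values are illustrative, not the actual data mart implementation.

# Raw layer: append today's snapshot of the subject's transformed data.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("data-mart-paradigm-sketch").getOrCreate()

daily_snapshot = (
    spark.table("source.transactions")  # assumed source after business-rule transformations
    .withColumn("load_date", F.current_date())
)
daily_snapshot.write.mode("append").partitionBy("load_date").saveAsTable("raw.subject_history")

# Business layer: deduplicate, keep only the analysis window, overwrite on every run.
business = (
    spark.table("raw.subject_history")
    .filter(F.col("load_date") >= F.date_sub(F.current_date(), 90))  # assumed window
    .dropDuplicates(["transaction_id"])                              # assumed business key
)
business.write.mode("overwrite").saveAsTable("business.subject")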

Education

2013 - 2017

Bachelor's Degree in Physics

Federal University of Pará - Pará, Brazil

Certifications

OCTOBER 2022 - OCTOBER 2024

Certified Associate Developer for Apache Spark 3.0

Databricks

Skills

Libraries/APIs

PySpark, Pandas, Scikit-learn

Tools

Spark SQL, Apache Airflow, Amazon Athena, AWS Glue, Amazon Elastic MapReduce (EMR)

Languages

Python, SQL, Scala

Frameworks

Spark, Adaptive Query Execution (AQE), Apache Spark, Hadoop

Platforms

Databricks, Docker, Apache Kafka, Azure, Amazon Web Services (AWS), AWS Lambda

Storage

PostgreSQL, MySQL, Redshift, Microsoft SQL Server, Apache Hive, Data Pipelines, HDFS, Data Lakes, Amazon S3 (AWS S3)

Paradigms

ETL

Other

UDFs, DataFrames, Azure Data Factory (ADF), Delta Lake, Data Engineering, Data Wrangling, Azure Databricks, Azure Data Lake, APIs, SFTP, Data Warehousing, Data Marts, Applied Mathematics, Physics, Computational Physics, Data Modeling, Query Optimization, Data, ELT, Shell Scripting
