Mahmoud Mehdi, Developer in Paris, France

Mahmoud Mehdi

Verified Expert in Engineering

Apache Spark Developer

Location
Paris, France
Toptal Member Since
June 24, 2020

Mahmoud is a senior data engineer specializing in building large-scale data processing systems. His passion for processing massive amounts of data helped him build his data skills rapidly. Mahmoud is a certified Apache Spark developer; he has used the framework to help many clients process big data in various fields (music industry, retail, insurance, and fraud detection). He is also a contributor to Delta Lake, an open-source project developed by Databricks on top of Apache Spark.

Portfolio

Intermarché
Azure, Azure Databricks, Azure Functions, Azure API Management...
SeLoger
Amazon, AWS Lambda, AWS CloudFormation, Python 3, Apache Spark, Scala...
Believe
Amazon Web Services (AWS), Terraform, Databricks, Scala, Spark, Python...

Experience

Availability

Part-time

Preferred Environment

Amazon Web Services (AWS), Java, Scala, Apache Spark, Data Engineering, Data Pipelines

The most amazing...

...things I've implemented were data pipelines for a European retail leader that allow them to retrieve product stock levels (terabytes of data) in real time.

Work Experience

Senior Data Engineer | Technical Lead

2022 - PRESENT
Intermarché
  • Developed an Azure function that retrieves tickets from an Azure Event Hub and calculates clients' discounts in real time.
  • Developed an Apache Spark job that calculates several client KPIs, such as the number of coupons used per client, each client's purchase frequency, and the number of clients who bought a discounted product per campaign (a minimal aggregation sketch follows this entry).
  • Worked as the team's technical lead, defined each project's technical architecture, and performed FinOps studies to estimate the cloud budget.
  • Developed an API using Azure Functions, Azure API Management, and Delta Lake. All the stores use the API to determine substitutes for each unavailable product while an order is being prepared.
  • Developed an API that returns a client's tickets by querying an Azure Cosmos DB table that is enriched in real time whenever a new ticket arrives in the event hub.
Technologies: Azure, Azure Databricks, Azure Functions, Azure API Management, Azure Blob Storage API, Azure Cosmos DB, Delta Lake, Scala, Spark, Python 3, Pandas, APIs, Azure Event Hubs, Git, Big Data Architecture, ETL, Big Data, Data Engineering, Data Architecture, Data Lakes, Data Analytics, ScalaTest, Databases, Data Analysis, Datasets, Data Cleansing, Serverless, Cloud Patterns, Azure SQL, Swagger, REST APIs, API Integration, Data Integration, Data Lakehouse
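Below is a minimal, illustrative sketch of the kind of client KPI aggregation described above, using the Spark Scala API. The column names (clientId, ticketId, campaignId, couponUsed, discounted) and the Delta paths are assumptions made for the example, not the actual schema.

```scala
// Hypothetical sketch: client KPIs computed from a tickets table.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ClientKpis {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("client-kpis").getOrCreate()

    val tickets = spark.read.format("delta").load("/mnt/lake/silver/tickets")

    // Coupons used and purchase frequency per client.
    val perClient = tickets
      .groupBy("clientId")
      .agg(
        sum(when(col("couponUsed"), 1).otherwise(0)).as("couponsUsed"),
        countDistinct("ticketId").as("purchaseCount")
      )

    // Distinct clients who bought a discounted product, per campaign.
    val perCampaign = tickets
      .filter(col("discounted"))
      .groupBy("campaignId")
      .agg(countDistinct("clientId").as("clientsWithDiscountedPurchase"))

    perClient.write.format("delta").mode("overwrite").save("/mnt/lake/gold/kpi_per_client")
    perCampaign.write.format("delta").mode("overwrite").save("/mnt/lake/gold/kpi_per_campaign")
  }
}
```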

AWS Solutions Architect | Senior Data Engineer

2021 - 2022
SeLoger
  • Audited the Apache Spark jobs and proposed optimizations and changes to adopt in order to enhance the workflows.
  • Proposed and initiated the migration from Parquet to Delta Lake.
  • Implemented an SCD2 pattern framework with Pandas DataFrames, dedicated to the company's data scientists (a hedged Delta Lake analogue is sketched after this entry).
  • Implemented Amazon Macie to detect PII data and automated its deployment with AWS CloudFormation.
  • Implemented AWS Glue DataBrew transformations to automatically handle the group's sensitive data (PII), applying operations such as replacement and encryption.
Technologies: Amazon, AWS Lambda, AWS CloudFormation, Python 3, Apache Spark, Scala, Amazon Web Services (AWS), AWS Glue DataBrew, Amazon Macie, Delta Lake, Git, Big Data Architecture, ETL, Big Data, Data Engineering, Data Architecture, Data Lakes, Data Analytics, ScalaTest, Pandas, Databases, Data Analysis, Datasets, Data Cleansing, Serverless, Cloud Patterns, Swagger, REST APIs, Data Integration, Data Lakehouse
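As a hedged illustration of the SCD2 idea mentioned above, here is a Scala sketch that closes out changed rows with Delta Lake's MERGE API. The framework built at SeLoger used Pandas; this is only an analogue, and the table path and column names (id, attributes, valid_from, valid_to, is_current) are hypothetical.

```scala
// Hedged SCD2-style sketch with Delta Lake MERGE (column names are illustrative).
import io.delta.tables.DeltaTable
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object Scd2Merge {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("scd2-merge").getOrCreate()

    // Incoming snapshot, stamped as the new current version.
    val updates = spark.read.parquet("/data/incoming/customers")
      .withColumn("valid_from", current_timestamp())
      .withColumn("valid_to", lit(null).cast("timestamp"))
      .withColumn("is_current", lit(true))

    val target = DeltaTable.forPath(spark, "/data/dim/customers")

    // Close the current rows whose attributes changed and insert brand-new keys.
    target.as("t")
      .merge(updates.as("u"), "t.id = u.id AND t.is_current = true")
      .whenMatched("t.attributes <> u.attributes")
      .updateExpr(Map("is_current" -> "false", "valid_to" -> "u.valid_from"))
      .whenNotMatched()
      .insertAll()
      .execute()

    // A full SCD2 flow would also append the new version of each changed key as a
    // fresh current row; that second step is omitted here for brevity.
  }
}
```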

Senior Data Engineer

2020 - 2022
Believe
  • Developed a big data system for the music industry that calculates the royalties to pay producers, depending on the source (Deezer, Spotify, iTunes, etc.) and the contract signed with the company (a minimal aggregation sketch follows this entry).
  • Shared my Delta Lake knowledge as a contributor to the open-source project, helping my client perform ACID transactions on the stored Parquet files.
  • Tuned the Spark jobs that handle large amounts of data.
  • Orchestrated Apache Spark jobs using Apache Airflow.
  • Wrote APIs to expose data using AWS Lambda and API Gateway.
Technologies: Amazon Web Services (AWS), Terraform, Databricks, Scala, Spark, Python, Data Engineering, Data Pipelines, Delta Lake, Git, Big Data Architecture, Apache Airflow, ETL, Big Data, Data Architecture, Data Lakes, Data Analytics, ScalaTest, Databases, Data Analysis, Datasets, Data Cleansing, Serverless, Cloud Patterns, Swagger, REST APIs, API Integration, Data Integration, Data Lakehouse, AWS Lambda
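A minimal sketch of the royalty aggregation idea, assuming a simplified per-stream rate per source. The column names, paths, and rate table are illustrative and do not reflect the actual contract logic.

```scala
// Illustrative royalty aggregation per producer and streaming source.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object RoyaltiesJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("royalties").getOrCreate()
    import spark.implicits._

    // Usage reports delivered by each platform (Deezer, Spotify, iTunes, ...).
    val reports = spark.read.format("delta").load("s3://music-lake/silver/usage_reports")

    // Per-source rates; in reality these come from the contracts signed with the company.
    val rates = Seq(("deezer", 0.0040), ("spotify", 0.0035), ("itunes", 0.0070))
      .toDF("source", "ratePerStream")

    val royalties = reports
      .join(rates, "source")
      .groupBy("producerId", "source")
      .agg(sum($"streams" * $"ratePerStream").as("royaltyAmount"))

    royalties.write.format("delta").mode("overwrite").save("s3://music-lake/gold/royalties")
  }
}
```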

Senior Data Engineer

2020 - 2021
Tekmetric
  • Managed data migrations from different systems to the RDS database.
  • Wrote Apache Spark jobs (using Scala) that handled ETL processing of repair shops' data.
  • Developed a "labor_guide" ETL that estimates how long it takes to replace a specific part for each vehicle (https://www.tekmetric.com/blog-post/3-0-tekmetric-labor-guide); a minimal sketch follows this entry.
  • Tuned the Spark jobs and ensured they ran efficiently on EMR.
  • Made query-intensive data available by migrating it from RDS to Elasticsearch using AWS DMS (Database Migration Service).
Technologies: Spark, Scala, SQL, ETL, Amazon Web Services (AWS), Amazon Elastic MapReduce (EMR), Python, Data Engineering, Data Pipelines, Git, Big Data Architecture, Big Data, Data Architecture, Data Lakes, Data Analytics, ScalaTest, Databases, Data Analysis, Datasets, Data Cleansing, Serverless, Cloud Patterns, Swagger, API Integration, Data Integration, Data Lakehouse, AWS Lambda
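A hypothetical sketch of a labor-guide style ETL: read repair-order lines from RDS over JDBC, then estimate the average labor time per part and vehicle. The table name, columns, and connection string are placeholders, not Tekmetric's schema.

```scala
// Hypothetical labor-guide ETL: JDBC read from RDS, aggregate, write Parquet.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object LaborGuideEtl {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("labor-guide-etl").getOrCreate()

    val orders = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://rds-host:3306/shop") // placeholder connection string
      .option("dbtable", "repair_order_lines")
      .option("user", sys.env("DB_USER"))
      .option("password", sys.env("DB_PASSWORD"))
      .load()

    // Average time spent replacing each part, per vehicle.
    val laborGuide = orders
      .groupBy("vehicleId", "partId")
      .agg(avg("laborHours").as("estimatedHours"), count(lit(1)).as("sampleSize"))

    laborGuide.write.mode("overwrite").parquet("s3://repair-lake/labor_guide/")
  }
}
```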

Senior Data Engineer

2019 - 2020
AXA
  • Developed a big data system that enables fraud detection on the insurer's data.
  • Designed the AWS data platform that handles AXA's data coming from different sources.
  • Used Spark GraphFrames to model the data as a graph and detect relations between different claims.
  • Made the data available to different teams using AWS Glue and Athena.
  • Wrote Terraform scripts that allowed us to deploy the solution as code on AWS.
Technologies: Amazon Web Services (AWS), Terraform, GIS, Scala, Spark, Python, Data Engineering, Data Pipelines, Git, Big Data Architecture, ETL, Big Data, Data Architecture, Data Lakes, Data Analytics, ScalaTest, Databases, Data Analysis, Datasets, Data Cleansing, Serverless, Cloud Patterns, Swagger, REST APIs, API Integration, Data Integration, Data Lakehouse, AWS Lambda

Data Engineer

2016 - 2019
Carrefour
  • Developed a daily SalesSpark application that calculates the daily sales generated by the different stores and exposes this data through web services.
  • Developed a daily sales comparator that compares sales amounts between the legacy system and the new big data system, allowing us to detect anomalies in the data.
  • Developed assortment jobs that process the products' data and compute the daily price for each product per store and region.
  • Developed a framework that optimizes writes to the Cassandra database.
  • Developed real-time applications using Spark Streaming to calculate the generated sales revenue in real time (a hedged streaming sketch follows this entry).
  • Indexed the product data in Elasticsearch so it can be queried efficiently.
Technologies: Scalatra, Apache Kafka, Cassandra, Elasticsearch, Hadoop, Scala, Apache Spark, Python, Data Engineering, Data Pipelines, Git, Big Data Architecture, ETL, Big Data, Data Architecture, Data Lakes, Data Analytics, ScalaTest, Databases, Data Analysis, Datasets, Data Cleansing, Cloud Patterns, REST APIs, API Integration, Data Integration, Data Lakehouse
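A hedged sketch of the real-time revenue calculation. The project used Spark Streaming; for brevity this example uses the newer Structured Streaming API, and the Kafka topic, schema, and sink are illustrative.

```scala
// Illustrative real-time revenue aggregation from Kafka with Structured Streaming.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object RealtimeRevenue {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("realtime-revenue").getOrCreate()

    val saleSchema = new StructType()
      .add("storeId", StringType)
      .add("amount", DoubleType)
      .add("ts", TimestampType)

    val sales = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "sales")
      .load()
      .select(from_json(col("value").cast("string"), saleSchema).as("sale"))
      .select("sale.*")

    // Revenue per store over 5-minute windows, tolerating 10 minutes of late data.
    val revenue = sales
      .withWatermark("ts", "10 minutes")
      .groupBy(window(col("ts"), "5 minutes"), col("storeId"))
      .agg(sum("amount").as("revenue"))

    revenue.writeStream
      .outputMode("update")
      .format("console") // replace with a real sink (Cassandra, Delta, ...)
      .start()
      .awaitTermination()
  }
}
```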

Data Engineer | Data Scientist

2016 - 2017
Zenika
  • Created a big data application that predicts football game results using machine learning algorithms.
  • Developed a web-scraping solution using Node.js to collect football data from different websites.
  • Developed Apache Spark jobs in Scala to process the data, build features, and run several ML algorithms to train models and predict game scores.
  • Developed web services using the Play framework to interact with the machine learning models (for example, to retrain the models or predict a game).
  • Coded an AngularJS application to interact with the web services I created. We used this application to predict the UEFA Euro 2016 and Copa America games.
Technologies: Amazon Web Services (AWS), Play Framework, Spark ML, Hadoop, Scala, Spark, Git, Big Data Architecture, ETL, Big Data, Data Architecture, Data Lakes, Data Engineering, Data Analytics, Data Scraping, Web Scraping, ScalaTest, Databases, Data Analysis, Datasets, Data Cleansing, Cloud Patterns, REST APIs, API Integration, Data Integration

Delta Lake Contributor

https://github.com/delta-io/delta/
I had the chance to contribute to the open-source Delta Lake project (maintained by Databricks), a storage layer built on top of Apache Spark that brings ACID transactions to stored data.
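A minimal sketch of what Delta Lake adds on top of plain Parquet (atomic commits, in-place updates, time travel); the paths and values are illustrative.

```scala
// Minimal Delta Lake usage sketch: transactional writes, updates, and time travel.
import io.delta.tables.DeltaTable
import org.apache.spark.sql.SparkSession

object DeltaQuickstart {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("delta-quickstart")
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      .getOrCreate()
    import spark.implicits._

    // Each write is an atomic commit to the Delta transaction log.
    spark.range(0, 5).toDF("id").write.format("delta").mode("overwrite").save("/tmp/delta/events")

    // ACID update in place, something plain Parquet files cannot offer.
    DeltaTable.forPath(spark, "/tmp/delta/events")
      .update(condition = $"id" % 2 === 0, set = Map("id" -> ($"id" + 100)))

    // Time travel: read the table as it was before the update.
    spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/events").show()
  }
}
```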

Implementing a Custom Data Source with Apache Spark for Carrefour's Daily Sales

Different jobs often applied the same transformations to the same source data (in my case, daily sales), which resulted in a lot of duplicate code and inconsistent ways of handling the data.
To avoid these issues, I took the initiative to develop a custom Apache Spark data source that reads the data and applies the shared transformations in one place.
My team members used that library and had their data ready for analysis after calling a single line of code.
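To give an idea of the approach, here is a hedged sketch of a custom data source using Spark's DataSource V1 API (RelationProvider plus TableScan). The package name, path option, and columns are illustrative, not the Carrefour implementation.

```scala
// Hedged sketch of a custom Spark data source that centralizes read + cleansing logic.
package com.example.dailysales

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.StructType

class DefaultSource extends RelationProvider {
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    new DailySalesRelation(sqlContext, parameters("path"))
}

class DailySalesRelation(val sqlContext: SQLContext, path: String)
    extends BaseRelation with TableScan {

  // Read the raw files and apply the shared cleansing/typing logic in one place.
  private def cleaned: DataFrame = {
    val raw = sqlContext.sparkSession.read.option("header", "true").csv(path)
    raw.selectExpr("store_id", "cast(amount as double) as amount", "to_date(sale_date) as sale_date")
  }

  override def schema: StructType = cleaned.schema
  override def buildScan(): RDD[Row] = cleaned.rdd
}

// Usage: one line gives analysis-ready data, e.g.
//   spark.read.format("com.example.dailysales").option("path", "/data/raw/sales").load()
```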

Implementing an Optimized Spark GraphFrames Solution to Detect Frauds at AXA

When I joined the AXA fraud detection team, the project contained a basic Node.js graph representation that allowed users to detect fraud between the different nodes (representing the claims). Once the data started to grow, we began having performance issues. To solve them, I took the initiative to develop a distributed solution based on Apache Spark and its GraphFrames library, which takes advantage of Spark DataFrames and parallel data processing.

Once the project was completed, we could detect fraud much faster than with the old solution.
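An illustrative GraphFrames sketch of the approach: claims become vertices, shared attributes become edges, and connected components surface suspicious clusters. The schemas, paths, and threshold are hypothetical, not AXA's model.

```scala
// Illustrative fraud-graph sketch with GraphFrames.
import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

object ClaimGraph {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("claim-graph").getOrCreate()

    // Vertices: one row per claim (GraphFrames requires an "id" column).
    val claims = spark.read.parquet("s3://fraud-lake/claims")      // id, amount, ...
    // Edges: pairs of claims sharing an attribute ("src" and "dst" required).
    val links  = spark.read.parquet("s3://fraud-lake/claim_links") // src, dst, sharedAttribute

    val graph = GraphFrame(claims, links)

    // Clusters of claims connected through shared attributes.
    spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoints")
    val components = graph.connectedComponents.run()

    // Flag components that group an unusually large number of claims.
    components.groupBy("component").count().filter("count >= 5").show()
  }
}
```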

Zenprono: A Spark ML Application That Predicts Football Scores

https://blog.zenika.com/2016/06/10/zenprono-resultats-des-matchs-euro-2016/
2016 was an interesting year for football fans, with both Euro 2016 and the Copa America taking place. We decided to create a big data application that predicts the scores of different games. We extracted data from various sources, processed it with Apache Spark jobs (the data was massive; we had historical football datasets going back to 1930), and then implemented machine learning algorithms with Spark ML to train our models and predict different games.

We had a good prediction rate: 77% of the predictions were correct.
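A minimal Spark ML sketch in the spirit of Zenprono: assemble a feature vector and train a classifier on historical matches. The feature names, paths, and the choice of logistic regression are illustrative; the real project experimented with several algorithms.

```scala
// Illustrative Spark ML pipeline for match outcome prediction.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object MatchPredictor {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("match-predictor").getOrCreate()

    // Historical matches with engineered features and a label (0 = loss, 1 = draw, 2 = win).
    val matches = spark.read.parquet("/data/football/features")

    val assembler = new VectorAssembler()
      .setInputCols(Array("homeRanking", "awayRanking", "homeGoalsAvg", "awayGoalsAvg"))
      .setOutputCol("features")

    val lr = new LogisticRegression()
      .setLabelCol("label")
      .setFeaturesCol("features")

    val Array(train, test) = matches.randomSplit(Array(0.8, 0.2), seed = 42)
    val model = new Pipeline().setStages(Array(assembler, lr)).fit(train)

    // Score held-out matches; downstream, the Play framework web services call the model.
    model.transform(test).select("homeTeam", "awayTeam", "prediction").show()
  }
}
```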

Languages

Scala, SQL, Java, Python, Python 3

Frameworks

Spark, Apache Spark, Data Lakehouse, Hadoop, Scalatra, Swagger, Play, Play Framework

Libraries/APIs

PySpark, REST APIs, Spark ML, Amazon API, Azure API Management, Azure Blob Storage API, Pandas

Tools

ScalaTest, Spark SQL, Git, Terraform, Ansible, AWS Glue, Amazon Elastic MapReduce (EMR), BigQuery, Google Cloud Dataproc, Amazon Simple Queue Service (SQS), GIS, AWS CloudFormation, Apache Airflow

Paradigms

ETL, ETL Implementation & Design

Platforms

Databricks, Spark Core, AWS Lambda, Azure, Apache Kafka, Dataiku, Amazon, Amazon Web Services (AWS), Google Cloud Platform (GCP), Azure Functions, Azure Event Hubs

Storage

Database Modeling, Database Architecture, Databases, Data Pipelines, Azure Cosmos DB, Data Lakes, DB, Data Integration, Amazon S3 (AWS S3), Apache Hive, MySQL, Elasticsearch, Azure SQL, Cassandra

Other

Data Engineering, ETL Testing, ETL Tools, Data, Data Analysis, Data Analytics, Data Modeling, Data Architecture, Big Data Architecture, Scraping, Data Scraping, Web Scraping, Datasets, Data Cleansing, Serverless, API Integration, Big Data, ETL Development, Architecture, Google BigQuery, APIs, Data Warehousing, Data Warehouse Design, Cloud Patterns, AWS Glue DataBrew, Amazon Macie, Delta Lake, Azure Databricks

2011 - 2016

Bachelor of Engineering Degree in Computer Science

National Institute of Applied Science and Technology - Tunis, Tunisia

OCTOBER 2019 - PRESENT

Databricks Associate Developer (Apache Spark 2.4) with Scala

Databricks

MARCH 2017 - PRESENT

Hadoop Programming

IBM

NOVEMBER 2016 - PRESENT

Scala Programming for Data Science

IBM
