Samarth is available for hire

Samarth Jain

Verified Expert in Engineering

Data Engineer and Developer

Abu Dhabi, United Arab Emirates

Toptal member since May 26, 2026

Expertise

Apache Big Data Architecture Data Warehouse Data Engineering Data Science Cloud Architecture Data Analysis Azure Data Migration SQL MySQL

Bio

Samarth is a principal data architect and data engineer with over 11 years of experience building cloud-native data platforms across the banking, telecom, healthcare, airline, and eCommerce domains. He excels at Azure Databricks, Spark, PySpark, Delta Lake, ADF, Kafka, and real-time pipelines. Samarth has cut pipeline runtimes by more than 90% and reduced cloud costs by 30%.

Portfolio

Xebia Group

Apache Kafka, Azure Databricks, PySpark, Azure, Spark, Python, SQL, Jira...

Globant

PySpark, Azure, Azure Databricks, Apache Kafka, Azure Event Hubs...

Online Freelance Agency

Hadoop, Spark, Python, Hive, YARN, ORC, Unix, Shell Scripting, Jenkins...

Experience

SQL - 12 years
PySpark - 8 years
Python 3 - 8 years
Apache Kafka - 5 years
Azure DevOps - 5 years
Azure Data Factory (ADF) - 5 years
Azure - 5 years
Azure Databricks - 5 years

Preferred Environment

Azure Databricks, Azure, PySpark, Apache Kafka, Azure DevOps, Spark, Python, SQL, Azure Data Factory (ADF), Azure Data Lake Storage, SQL, Python, ETL, Big Data, Data Lakes, Data Warehousing, Pytest, Data Engineering, Quality Assurance (QA), Cloud Architecture, Data Architecture, Databases, Data Analysis, ETL Pipelines, Microsoft Entra ID, Architecture, Data Cleansing, Data Management, Azure Data Lake, ELT, Continuous Deployment, Continuous Integration (CI), DevOps, Data Analytics, Data Cleaning, Data Transformation, Data Validation, Data Warehouse Design, Query Optimization, MySQL, Scripting, Event-driven Architecture

The most amazing...

...achievement has been delivering 90% faster data processing and 30% lower cloud cost at enterprise scale.

Work Experience

Principal Data Consultant

2026 - PRESENT

Xebia Group

Optimized a data model by analyzing more than 30 tables and eliminating over 700 redundant columns, reducing storage footprint by approximately 30% and improving query performance.
Architected pipelines that process real-time airline traveler event data (Travel DNA) from Azure Event Hubs into Azure Databricks, using a scalable Databricks Lakehouse and medallion architecture.
Designed event-driven streaming pipelines with exactly-once processing semantics and a fault-tolerant architecture, ensuring highly reliable downstream analytics.
Implemented Delta Lake optimizations, including partitioning, compaction, schema evolution, and Z-ordering, to improve query performance and reduce compute costs.
Built and standardized incremental data processing frameworks, real-time ingestion pipelines, and streaming ETL solutions for near real-time analytics use cases.
Enabled high-volume streaming analytics using Spark Structured Streaming integrated with Azure Event Hubs.
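
Exactly-once processing, as named in the bullets above, is commonly approximated by combining a replayable source with idempotent, keyed writes and a checkpointed offset. The following is a minimal, framework-free Python sketch of that general idea under stated assumptions; it is not the pipeline built in this role, and `Event`, `IdempotentSink`, and the sample events are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    """Hypothetical traveler event; field names are illustrative only."""
    offset: int      # position in the source stream
    event_id: str    # unique key used for deduplication
    payload: str


@dataclass
class IdempotentSink:
    """Stores each event at most once and tracks the last committed offset."""
    rows: dict = field(default_factory=dict)
    committed_offset: int = -1

    def write_batch(self, batch):
        for event in batch:
            # Skip anything at or below the checkpoint (already processed).
            if event.offset <= self.committed_offset:
                continue
            # Keyed upsert: a replayed event overwrites itself, never duplicates.
            self.rows[event.event_id] = event.payload
        if batch:
            # Advance the checkpoint only after the batch has been applied.
            self.committed_offset = max(
                self.committed_offset, max(e.offset for e in batch)
            )


sink = IdempotentSink()
first = [Event(0, "a", "check-in"), Event(1, "b", "boarding")]
sink.write_batch(first)
# Simulate a retry after a failure: the same batch arrives again plus one new event.
sink.write_batch(first + [Event(2, "c", "landing")])
print(len(sink.rows), sink.committed_offset)
```

In this sketch, redelivering a batch leaves the stored rows unchanged, which is the property downstream analytics rely on when a stream is replayed after a failure.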

Technologies: Apache Kafka, Azure Databricks, PySpark, Azure, Spark, Python, SQL, Jira, Parquet, Azure DevOps, Azure Event Hubs, Azure Data Lake Storage, Unity Catalog, Delta Live Tables (DLT), Delta Lake, Delta Tables, Databricks, Streaming, Medallion Architecture, Data Modeling, Data Quality, Data Governance, Data Security, Data Lineage, Metadata, Python 3, SQL, Python, ETL, Big Data, Data Lakes, Data Warehousing, Pytest, Data Engineering, Quality Assurance (QA), Cloud Architecture, Data Architecture, Databases, Data Analysis, ETL Pipelines, Microsoft Entra ID, Architecture, Data Cleansing, Data Management, Azure Data Lake, ELT, Continuous Deployment, Continuous Integration (CI), DevOps, Data Analytics, Data Cleaning, Data Transformation, Data Validation, Data Warehouse Design, Query Optimization, MySQL, Scripting, Cloud, Apache Spark, Data Pipelines, Event-driven Architecture, Dimensional Modeling

Data Architect

2021 - 2026

Globant

Architected an enterprise-scale Azure data platform with ADLS, ADF, and Databricks, migrating from on-premises SQL Server and enabling cloud modernization, resulting in a 30% cost reduction and 2x scalability.
Architected batch and real-time streaming pipelines with ADF, Databricks, and Event Hubs, processing TB-scale data into Delta Lake.
Reduced pipeline runtime from five hours to 20 minutes, an improvement of more than 90%, via Spark optimization, partition tuning, and PySpark performance engineering.
Developed incremental ingestion frameworks using watermarking and Delta Lake, enabling faster, more reliable Power BI refreshes.
Standardized a data quality and validation framework across projects, covering null, format, duplicate, and numeric checks.
Delivered serverless Azure Functions solutions, achieving 20% infrastructure cost savings.
Oversaw a team of eight engineers, driving scalable architecture, enterprise data governance, and delivery excellence.
Unified heterogeneous data sources into an enterprise-wide analytics and reporting platform supporting customer analytics, business intelligence, and eCommerce reporting use cases.
Built scalable data pipelines enabling real-time customer behavior analysis, sales analytics, and operational reporting.
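
Watermark-based incremental ingestion, mentioned in the bullets above, generally means loading only the rows changed since the last successful run and then advancing a stored high-water mark. A minimal sketch of the pattern in plain Python follows; the function name `incremental_load`, the `modified_at` column, and the sample rows are assumptions made for illustration, not details of the framework described here.

```python
from datetime import datetime


def incremental_load(source_rows, last_watermark):
    """Return rows modified after last_watermark, plus the new watermark."""
    new_rows = [r for r in source_rows if r["modified_at"] > last_watermark]
    # If nothing changed, keep the previous watermark.
    new_watermark = max(
        (r["modified_at"] for r in new_rows), default=last_watermark
    )
    return new_rows, new_watermark


# Made-up sample data for illustration.
source = [
    {"id": 1, "modified_at": datetime(2024, 1, 1, 8, 0)},
    {"id": 2, "modified_at": datetime(2024, 1, 1, 9, 30)},
    {"id": 3, "modified_at": datetime(2024, 1, 1, 11, 15)},
]

watermark = datetime(2024, 1, 1, 9, 0)
rows, watermark = incremental_load(source, watermark)
print([r["id"] for r in rows], watermark)

# A second run with no new data loads nothing and leaves the watermark as is.
rows_again, watermark_again = incremental_load(source, watermark)
```

Because each run touches only rows newer than the watermark, refreshes that depend on the target table process far less data than a full reload would.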

Technologies: PySpark, Azure, Azure Databricks, Apache Kafka, Azure Event Hubs, Azure Data Factory (ADF), Azure Data Lake Storage, Azure Synapse Analytics, Azure SQL, Azure Functions, Unity Catalog, Delta Live Tables (DLT), Delta Lake, Delta Tables, Databricks, Streaming, Medallion Architecture, Data Modeling, Data Migration, Data Quality, Data Governance, Data Security, Data Lineage, Metadata, Spark, Python, SQL, Jira, Parquet, Azure DevOps, Artificial Intelligence (AI), Python 3, SQL, Python, ETL, Big Data, Data Lakes, Data Warehousing, Pytest, Data Engineering, Quality Assurance (QA), Cloud Architecture, Data Architecture, Databases, Data Analysis, ETL Pipelines, Microsoft Entra ID, Architecture, Data Cleansing, Data Management, Azure Data Lake, Azure Synapse, ELT, Continuous Deployment, Continuous Integration (CI), DevOps, Data Analytics, Data Cleaning, Data Transformation, Data Validation, Data Warehouse Design, Query Optimization, MySQL, Scripting, Cloud, Apache Spark, Data Pipelines, Event-driven Architecture, Dimensional Modeling
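
The four check types named for the data quality framework in this role (null, format, duplicate, and numeric) can be illustrated with a small rule-based validator. This is a hedged sketch in plain Python, not the framework itself; the `validate` function, the assumed ISO-style date format, the non-negative rule for amounts, and the sample orders are all hypothetical.

```python
import re
from collections import Counter

DATE_FORMAT = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # assumed ISO-style date format


def validate(rows, key, required, date_col, numeric_col):
    """Return a dict mapping each check name to the list of failing keys."""
    key_counts = Counter(r[key] for r in rows)
    failures = {"null": [], "format": [], "duplicate": [], "numeric": []}
    for r in rows:
        # Null check: every required column must be present and not None.
        if any(r.get(c) is None for c in required):
            failures["null"].append(r[key])
        # Format check: non-null dates must match the assumed pattern.
        if r.get(date_col) is not None and not DATE_FORMAT.match(r[date_col]):
            failures["format"].append(r[key])
        # Numeric check: non-null values must be non-negative numbers (assumed rule).
        value = r.get(numeric_col)
        if value is not None and not (
            isinstance(value, (int, float)) and value >= 0
        ):
            failures["numeric"].append(r[key])
    # Duplicate check: keys that occur more than once.
    failures["duplicate"] = sorted(k for k, n in key_counts.items() if n > 1)
    return failures


# Made-up sample data for illustration.
orders = [
    {"order_id": 1, "order_date": "2024-03-01", "amount": 25.0},
    {"order_id": 2, "order_date": "01/03/2024", "amount": 10.0},
    {"order_id": 3, "order_date": None, "amount": -5},
    {"order_id": 3, "order_date": "2024-03-02", "amount": 7.5},
]

report = validate(orders, "order_id", ["order_date", "amount"], "order_date", "amount")
print(report)
```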

Senior Software Engineer

2019 - 2021

Online Freelance Agency

Analyzed daily trading data, including equities and derivatives, to support risk management and financial risk identification for a banking client.
Optimized Spark and Hive pipelines, improving query performance by 40% and strengthening SLA adherence.
Automated file transfers with shell scripts, reducing manual effort by 90%, to approximately five hours per week.
Automated the file and data arrival report for management, saving over 36 person-hours per month.
Implemented data masking for sensitive PII, ensuring compliance with client policies.
Built CI/CD pipelines with Jenkins, Ansible, and Bitbucket, accelerating deployments by 50%.
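
Data masking for PII, as listed above, usually combines partial redaction for display with a deterministic pseudonym that still allows joins. The sketch below shows both ideas in plain Python; the helper names, the masking rules, and the sample record are hypothetical, and a truncated salted hash is shown only for illustration, not as a recommendation for any particular compliance policy.

```python
import hashlib


def mask_email(email):
    """Keep the first character and the domain; hide the rest of the local part."""
    local, _, domain = email.partition("@")
    return f"{local[:1]}{'*' * max(len(local) - 1, 0)}@{domain}"


def mask_account(number, visible=4):
    """Show only the last `visible` characters of an account number."""
    return "*" * max(len(number) - visible, 0) + number[-visible:]


def pseudonymize(value, salt):
    """Deterministic salted hash so masked values can still be joined on."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]


# Made-up sample record for illustration.
record = {"email": "jane.doe@example.com", "account": "1234567890"}
masked = {
    "email": mask_email(record["email"]),
    "account": mask_account(record["account"]),
    "customer_key": pseudonymize(record["email"], salt="demo-salt"),
}
print(masked["email"], masked["account"])
```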

Technologies: Hadoop, Spark, Python, Hive, YARN, ORC, Unix, Shell Scripting, Jenkins, Ansible, Rundeck, Jira, SQL, Parquet, PySpark, Data Governance, Metadata, Python 3, SQL, Python, ETL, Big Data, Data Lakes, Data Warehousing, Data Engineering, Quality Assurance (QA), Data Architecture, Databases, Data Analysis, ETL Pipelines, Data Cleansing, Data Management, ELT, Continuous Deployment, Continuous Integration (CI), DevOps, Data Analytics, Data Cleaning, Data Transformation, Data Validation, Data Warehouse Design, Query Optimization, MySQL, Scripting, Apache Spark, Data Pipelines

System Analyst

2014 - 2019

Amdocs

Implemented various data transformations on batch data using Spark to process large-scale call detail records (CDR), subscriber usage, and billing reconciliation for a telecom client.
Designed efficient Hive queries for joining and filtering tables, improving query performance.
Collaborated with stakeholders to gather requirements and translate them into technical deliverables.
Resolved 50+ critical live defects in SQL production systems, ensuring high availability.
Kept EPC work in sync across multiple ref masters using database dumps, closing all gaps.

Technologies: SQL, Unix, Excel Expert, Hadoop, Spark, Shell Scripting, Oracle, PySpark, SQL, ETL, Big Data, Data Lakes, Data Warehousing, Data Engineering, Quality Assurance (QA), Databases, Data Analysis, ETL Pipelines, ELT, Data Analytics, Data Cleaning, Data Validation, Data Warehouse Design, Query Optimization, MySQL, Scripting, Apache Spark, Data Pipelines

Experience

Real-time Lakehouse Platform for Airline Traveler Events

I designed and implemented a real-time data platform on Azure to process high-volume traveler event streams and enable near real-time analytics. I led the architecture and hands-on development of the ingestion, transformation, and serving layers using a Databricks Lakehouse and medallion architecture. I built event-driven pipelines from Azure Event Hubs to Delta tables, with exactly-once processing and fault tolerance. I standardized incremental processing and streaming ETL patterns for reliable downstream consumption. I optimized 30+ source tables by removing 700+ redundant columns, reducing storage by approximately 30% and improving query performance. I implemented Delta Lake optimizations, including partitioning, compaction, schema evolution, and Z-ordering, to reduce compute costs and improve SLA adherence. Throughout the project, I collaborated with cross-functional stakeholders to align data models with analytics and reporting needs.
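
The profile does not say how redundant columns were identified. As one plausible illustration only, the sketch below flags columns that are always empty or that exactly duplicate an earlier column. The definition of "redundant", the `find_redundant_columns` helper, and the sample bookings are assumptions, not the method used on the project.

```python
def find_redundant_columns(rows):
    """Return columns that are always empty or duplicate an earlier column."""
    if not rows:
        return []
    columns = list(rows[0].keys())
    # Represent each column by the tuple of its values across all rows.
    values = {c: tuple(r.get(c) for r in rows) for c in columns}
    redundant, seen = [], {}
    for c in columns:
        col = values[c]
        if all(v is None or v == "" for v in col):
            redundant.append(c)   # carries no information
        elif col in seen:
            redundant.append(c)   # exact copy of an earlier column
        else:
            seen[col] = c
    return redundant


# Made-up sample data for illustration.
bookings = [
    {"pnr": "A1", "origin": "AUH", "origin_copy": "AUH", "legacy_flag": None},
    {"pnr": "B2", "origin": "DXB", "origin_copy": "DXB", "legacy_flag": None},
]
print(find_redundant_columns(bookings))
```

On real tables a check like this would typically run on a sample or on column statistics, and any flagged column would still need review with data owners before removal.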

Education

2010 - 2014

Bachelor's Degree in Computer Engineering

Chameli Devi Group of Institutions (CDGI) - Indore, India

Certifications

DECEMBER 2025 - PRESENT

Microsoft Certified: Azure AI Fundamentals

Microsoft

FEBRUARY 2025 - FEBRUARY 2027

Databricks Certified Data Engineer Associate

Databricks

MARCH 2024 - MARCH 2026

Microsoft Certified: Azure Data Engineer Associate

Microsoft

JULY 2023 - PRESENT

Databricks Certified Associate Developer for Apache Spark 3.0

Databricks

Skills

Libraries/APIs

PySpark

Tools

Pytest

Languages

Python 3, SQL, Python

Frameworks

Delta Live Tables (DLT), Apache Spark, YARN, Hadoop, Spark Structured Streaming

Paradigms

ETL, Event-driven Architecture, Azure DevOps, Continuous Deployment, Continuous Integration (CI), Dimensional Modeling, DevOps

Platforms

Azure, Apache Kafka, Azure Event Hubs, Azure Data Lake Storage, Databricks, Unix, Azure Synapse Analytics, Azure Functions, Azure Synapse

Storage

Data Lakes, Databases, Microsoft Entra ID, Data Validation, Data Pipelines, Azure SQL

Other

Spark, Python, SQL, Parquet, MySQL, Azure Databricks, Azure Data Factory (ADF), Unity Catalog, Delta Lake, Delta Tables, Streaming, Medallion Architecture, Data Migration, Data Quality, Data Governance, Data Security, Data Lineage, Metadata, Excel Expert, Big Data, Data Warehousing, Data Engineering, Quality Assurance (QA), Cloud Architecture, Data Architecture, Data Analysis, ETL Pipelines, Architecture, Data Cleansing, Data Management, Azure Data Lake, ELT, Data Analytics, Data Cleaning, Data Transformation, Data Warehouse Design, Query Optimization, Scripting, Hive, Shell Scripting, Jira, Data Modeling, Cloud, Hadoop, Apache Sqoop, Ansible, Rundeck, Oracle, Artificial Intelligence (AI), Streaming ETL, Performance Tuning
