
Samarth Jain
Verified Expert in Engineering
Data Engineer and Developer
Abu Dhabi, United Arab Emirates
Toptal member since May 26, 2026
Samarth is a principal data architect and data engineer with over 11 years of experience building cloud-native data platforms across the banking, telecom, healthcare, airlines, and eCommerce domains. He excels at Azure Databricks, Spark, PySpark, Delta Lake, ADF, Kafka, and real-time pipelines. Samarth has delivered up to 90% faster data processing and 30% lower cloud costs.
Portfolio
Experience
- SQL - 12 years
- PySpark - 8 years
- Python 3 - 8 years
- Apache Kafka - 5 years
- Azure DevOps - 5 years
- Azure Data Factory (ADF) - 5 years
- Azure - 5 years
- Azure Databricks - 5 years
Preferred Environment
Azure Databricks, Azure, PySpark, Apache Kafka, Azure DevOps, Spark, Python, SQL, Azure Data Factory (ADF), Azure Data Lake Storage, ETL, Big Data, Data Lakes, Data Warehousing, Pytest, Data Engineering, Quality Assurance (QA), Cloud Architecture, Data Architecture, Databases, Data Analysis, ETL Pipelines, Microsoft Entra ID, Architecture, Data Cleansing, Data Management, Azure Data Lake, ELT, Continuous Deployment, Continuous Integration (CI), DevOps, Data Analytics, Data Cleaning, Data Transformation, Data Validation, Data Warehouse Design, Query Optimization, MySQL, Scripting, Event-driven Architecture
The most amazing...
...achievement has been delivering 90% faster data processing and 30% lower cloud cost at enterprise scale.
Work Experience
Principal Data Consultant
Xebia Group
- Optimized a data model by analyzing more than 30 tables and eliminating over 700 redundant columns, reducing storage footprint by approximately 30% and improving query performance.
- Architected and processed real-time airline traveler event data (Travel DNA) from Azure Event Hubs into Azure Databricks using a scalable Databricks Lakehouse and medallion architecture.
- Designed event-driven streaming pipelines with exactly-once processing semantics and a fault-tolerant architecture, ensuring highly reliable downstream analytics.
- Implemented Delta Lake optimizations, including partitioning, compaction, schema evolution, and Z-ordering, to improve query performance and reduce compute costs.
- Built and standardized incremental data processing frameworks, real-time ingestion pipelines, and streaming ETL solutions for near real-time analytics use cases.
- Enabled high-volume streaming analytics using Spark Structured Streaming and Azure Event Hubs integration.
Data Architect
Globant
- Architected an enterprise-scale Azure Data Platform with ADLS, ADF, and Databricks, migrating from on-prem SQL Server and enabling cloud modernization, resulting in 30% cost reduction and 2x scalability.
- Architected batch and real-time streaming pipelines with ADF, Databricks, and Event Hub, processing TB-scale data into Delta Lake.
- Reduced pipeline runtime from five hours to 20 minutes, an improvement of more than 90%, via Spark optimization, partition tuning, and PySpark performance engineering.
- Developed incremental ingestion frameworks using watermarking and Delta, enabling faster, more reliable Power BI refreshes.
- Standardized a data quality and validation framework across projects, covering null, format, duplicate, and numeric checks.
- Delivered serverless Azure Functions solutions, achieving 20% infrastructure cost savings.
- Oversaw a team of eight engineers, driving scalable architecture, enterprise data governance, and delivery excellence.
- Unified heterogeneous data sources into an enterprise-wide analytics and reporting platform supporting customer analytics, business intelligence, and eCommerce reporting use cases.
- Built scalable data pipelines enabling real-time customer behavior analysis, sales analytics, and operational reporting.
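The four data quality checks named above (null, format, duplicate, and numeric) can be illustrated with a minimal plain-Python sketch. This is not the framework used in the project; the function name, the column names, and the ISO date format rule are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Assumption: dates are expected as YYYY-MM-DD; the real format rules are not given in the profile.
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def run_quality_checks(rows, key, date_col, amount_col):
    """Return a dict mapping each check name to the indexes of failing rows."""
    issues = {"null": [], "format": [], "duplicate": [], "numeric": []}
    key_counts = Counter(row.get(key) for row in rows)
    for i, row in enumerate(rows):
        # Null check: flag rows with any missing or empty value.
        if any(v is None or v == "" for v in row.values()):
            issues["null"].append(i)
        # Format check: non-empty dates must match the expected pattern.
        date_value = row.get(date_col)
        if date_value not in (None, "") and not DATE_RE.match(str(date_value)):
            issues["format"].append(i)
        # Duplicate check: flag every row whose key occurs more than once.
        if row.get(key) is not None and key_counts[row.get(key)] > 1:
            issues["duplicate"].append(i)
        # Numeric check: non-empty amounts must parse as a number.
        amount = row.get(amount_col)
        if amount not in (None, ""):
            try:
                float(amount)
            except (TypeError, ValueError):
                issues["numeric"].append(i)
    return issues

# Hypothetical sample records.
rows = [
    {"order_id": 1, "order_date": "2024-01-05", "amount": "19.99"},
    {"order_id": 2, "order_date": "05/01/2024", "amount": "12.50"},
    {"order_id": 2, "order_date": "2024-01-06", "amount": "abc"},
    {"order_id": 3, "order_date": None, "amount": "7"},
]
issues = run_quality_checks(rows, "order_id", "order_date", "amount")
```

Returning row indexes per check, rather than a single pass/fail flag, keeps the sketch usable for both rejecting records and reporting on them.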
Senior Software Engineer
Online Freelance Agency
- Analyzed daily trading data, including equities and derivatives, for a banking client to support risk management and financial risk identification.
- Optimized Spark and Hive pipelines, improving query performance by 40% and SLA compliance.
- Automated file transfers via shell script, reducing manual effort by 90%, to approximately five hours weekly.
- Automated the file and data arrival report for management, saving over 36 person-hours per month.
- Implemented data masking for sensitive PII, ensuring compliance with client policies.
- Built CI/CD pipelines with Jenkins, Ansible, and Bitbucket, accelerating deployments by 50%.
System Analyst
Amdocs
- Implemented various data transformations on batch data using Spark to process large-scale call detail records (CDR), subscriber usage, and billing reconciliation for a telecom client.
- Designed efficient Hive queries for joining tables and filtering data.
- Collaborated with stakeholders to gather requirements and translate them into technical deliverables.
- Resolved 50+ critical live defects in SQL production systems, ensuring high availability.
- Kept EPC work in sync across multiple ref masters using database dumps, closing all gaps.
Experience
Real-time Lakehouse Platform for Airline Traveler Events
Education
Bachelor's Degree in Computer Engineering
Chameli Devi Group of Institutions (CDGI) - Indore, India
Certifications
Microsoft Certified: Azure AI Fundamentals
Microsoft
Databricks Certified Data Engineer Associate
Databricks
Microsoft Certified: Azure Data Engineer Associate
Microsoft
Databricks Certified Associate Developer for Apache Spark 3.0
Databricks
Skills
Libraries/APIs
PySpark
Tools
Pytest
Languages
Python 3, SQL, Python
Frameworks
Delta Live Tables (DLT), Apache Spark, Yarn, Hadoop, Spark Structured Streaming
Paradigms
ETL, Event-driven Architecture, Azure DevOps, Continuous Deployment, Continuous Integration (CI), Dimensional Modeling, DevOps
Platforms
Azure, Apache Kafka, Azure Event Hubs, Azure Data Lake Storage, Databricks, Unix, Azure Synapse Analytics, Azure Functions, Azure Synapse
Storage
Data Lakes, Databases, Microsoft Entra ID, Data Validation, Data Pipelines, Azure SQL
Other
Spark, Python, SQL, Parquet, MySQL, Azure Databricks, Azure Data Factory (ADF), Unity Catalog, Delta Lake, Delta Tables, Streaming, Medallion Architecture, Data Migration, Data Quality, Data Governance, Data Security, Data Lineage, Metadata, Excel Expert, Big Data, Data Warehousing, Data Engineering, Quality Assurance (QA), Cloud Architecture, Data Architecture, Data Analysis, ETL Pipelines, Architecture, Data Cleansing, Data Management, Azure Data Lake, ELT, Data Analytics, Data Cleaning, Data Transformation, Data Warehouse Design, Query Optimization, Scripting, Hive, Shell Scripting, Jira, Data Modeling, Cloud, Hadoop, Apache Sqoop, Ansible, Rundeck, Oracle, Artificial Intelligence (AI), Streaming ETL, Performance Tuning