Naman Jain, Developer in New Delhi, Delhi, India

Naman Jain

Verified Expert in Engineering

Data Engineering Architect and Lead Developer

Location
New Delhi, Delhi, India
Toptal Member Since
June 24, 2020

Naman is a highly experienced cloud and data solutions architect with more than six years of experience delivering data engineering services to multiple Fortune 100 clients. He has delivered multiple petabyte-scale data migrations and big data infrastructures on Azure, AWS, and Snowflake with dbt, in many instances producing a step change in efficiency for his clients' use cases. Naman fundamentally believes in over-communication, establishing trust, and taking ownership of deliverables.

Portfolio

Enterprise Client
Snowflake, Data Build Tool (dbt), Spark, GitLab, Data Migration...
Enterprise Client (via Toptal)
Scala, Spark, Azure, Azure Data Factory, Azure Data Lake, Azure Databricks...
Stealth-mode AI startup ($20 million Series A)
Data Engineering, Apache Hive, Apache Impala, SQL, Apache Spark, Scala, Bash...

Experience

Availability

Full-time

Preferred Environment

Azure Cloud Services, Apache Spark, Scala, IntelliJ IDEA, Git, Linux, Snowflake, Data Build Tool (dbt), Snowpark, Data Migration

The most amazing...

...enterprise-grade, big data ELT platform I delivered in Azure Cloud was a single-source-of-truth data lakehouse that enabled a wide diversity of use cases.

Work Experience

Senior Data Analytics Engineer

2021 - 2022
Enterprise Client
  • Architected and delivered the client's entire production logic lift (over 200 SQL workflows) and legacy data migration (over 10 petabytes) from AWS Redshift to Snowflake.
  • Automated the daily ingestion jobs via Data Build Tool (dbt) and created a self-updating data catalog via dbt Cloud.
  • Ported all SQL logic from Redshift SQL to dbt SQL using macros and Jinja, which gave us visibility into very complex SQL logic and let us visualize it via the catalog.
  • Achieved an 80% reduction in time and cost, plus real-time materialization of all our client-facing BI reports, by using dbt's cascading triggers instead of Redshift, where all the sequential tables were refreshed at once.
  • Linked all our Periscope charts and dashboards to a Git repo, which was then indexed in an IDE. This allowed us to make and push large bulk updates, replacing the manual logic-update process in Periscope and significantly increasing our efficiency.
  • Trained newly hired data engineers to manage and extend the entire big data infrastructure.
  • Compared performance, cost, and ease of maintenance of Snowpipe versus Fivetran versus Stitch.
  • Migrated existing business logic and wrote complex new logic as Scala UDFs in Snowpark. Helped consolidate multiple SQL tables by simplifying their logic in Scala UDFs.
  • Built a big data platform on Snowpark, dbt, and GitLab that standardized best practices and CI/CD, generated a self-updating documentation DAG, and reduced gold-table freshness latency.
  • Migrated over 10 Spark apps to Snowpark and achieved better net runtimes and reduced computing costs for all of them.
Technologies: Snowflake, Data Build Tool (dbt), Spark, GitLab, Data Migration, Data Warehouse Design, Big Data, Data Pipelines, ELT, Big Data Architecture, Data Architecture, Snowpark
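The cascading, dependency-ordered refresh that dbt provides (in contrast to refreshing every table in one fixed sequence) reduces to a topological sort over the model DAG. A minimal sketch in Python, with a hypothetical model graph standing in for what dbt infers from `ref()` calls:

```python
from graphlib import TopologicalSorter

# Hypothetical model dependency graph: each model maps to the set of
# models it depends on. Names are illustrative, not the client's.
models = {
    "stg_orders": set(),
    "stg_customers": set(),
    "fct_orders": {"stg_orders", "stg_customers"},
    "bi_revenue_report": {"fct_orders"},
}

# Materialize models in dependency order: a model refreshes only after
# everything upstream of it is fresh.
refresh_order = list(TopologicalSorter(models).static_order())
```

Because only models downstream of a changed source need re-materializing, a report can refresh as soon as its own upstream chain is done, which is where the latency and cost savings come from.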

Cloud Solutions Architect

2020 - 2021
Enterprise Client (via Toptal)
  • Worked on the orchestration and automation of the workflows via Azure Data Factory.
  • Optimized and partitioned storage in Azure Data Lake Storage (ADLS) Gen2.
  • Implemented complex, strongly typed Scala Spark workloads in Azure Databricks, along with dependency management and Git integration.
  • Implemented low-cost, low-latency real-time streaming workflows that at their peak processed more than 2 million raw JSON blobs per second. Integrated Azure Blob Storage, Azure Event Hubs, and Azure Queues via ABS-AQS.
  • Created a multi-layered ELT platform consisting of raw/bronze (Azure Blob Storage), current/silver (Azure Delta Lake), and mapped/gold (Azure Delta Lake) layers.
  • Balanced the cost of computing by spinning up clusters on demand versus persisting them.
  • Made big data available for efficient, real-time analysis across the client's organization via Delta tables, which provided indexed and optimized stores, ACID transaction guarantees, and table-level and row-level access controls.
  • Tied everything together in end-to-end workflows that could be refreshed with just a few clicks or automated as jobs.
  • Led a team of five (four developers and one solutions architect) to productionize big data workflows in Azure Cloud, enabling the client to sunset its legacy applications in favor of far more reliable and scalable production workflows.
  • Enabled a wide diversity of use cases and future-proofed them by relying upon open source and open standards.
Technologies: Scala, Spark, Azure, Azure Data Factory, Azure Data Lake, Azure Databricks, Delta Lake, Data Engineering, ETL, Data Migration, Databricks, Big Data, Data Pipelines, ELT, Big Data Architecture, Azure Cloud Services, Azure Event Hubs, Data Architecture, Azure Data Lake Analytics, Data Lakes

Lead Data Engineer

2019 - 2020
Stealth-mode AI startup ($20 million Series A)
  • Architected and implemented a distributed machine learning platform.
  • Productionized 20+ machine learning models via Spark MLlib.
  • Built products and tools to reduce time to market (TTM) for machine learning projects. Reduced the startup's TTM from the design phase to production by 50%.
  • Productionized 8 Scala Spark applications to transform the ETL layer that feeds the machine learning models downstream.
  • Used Spark SQL for ETL and Spark Structured Streaming and Spark MLlib for analytics.
  • Led a team of six comprising three data scientists, two back-end engineers, and one front-end engineer. Delivered a solution whose back-end layer talked to the front end via a REST API and launched and managed Spark jobs on demand.
Technologies: Data Engineering, Apache Hive, Apache Impala, SQL, Apache Spark, Scala, Bash, Linux, Spark Structured Streaming, Machine Learning, MLlib, Spark, Spark SQL, ETL, Big Data, Data Pipelines, ELT, Big Data Architecture, Data Architecture, Data Lakes

Senior Data Engineer

2018 - 2019
Dow Chemical (Fortune 62)
  • Created five Scala Spark apps for ETL and wrote multiple Bash scripts for the automation of these jobs.
  • Architected and built a Scala Spark app to validate Oracle source tables with their ingested counterparts in HDFS. The user can dynamically choose to conduct either a high-level or data-level validation.
  • Developed the application so that its output would be the exact mismatched columns and rows between source and destination in case of a discrepancy.
  • Reduced the engineers' manual debugging workload by over 99%: validation now meant simply running the application and reading the human-readable output file.
  • Delivered the entire ETL and validation project ahead of schedule and under budget.
  • Used Cloudera Distribution of Hadoop (CDH) for HDFS and Hive extensively.
Technologies: Data Engineering, Apache Hive, Apache Impala, SQL, Apache Spark, Scala, Hadoop, Bash, Linux, Oracle Database, Spark SQL, ETL, Big Data, Data Pipelines, ELT, Big Data Architecture, Data Architecture
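The source-to-destination validation described above, reporting the exact mismatched rows and columns, can be sketched as follows. Function, column, and key names are illustrative, and the real application ran as a distributed Scala Spark job rather than in-memory Python:

```python
def validate_tables(source_rows, dest_rows, key):
    """Compare source rows against their ingested counterparts.

    Rows are dicts keyed by column name; `key` names the primary-key
    column. Returns human-readable mismatch descriptions. This is a
    single-machine simplification of the Spark validation app.
    """
    dest_by_key = {row[key]: row for row in dest_rows}
    mismatches = []
    for row in source_rows:
        k = row[key]
        if k not in dest_by_key:
            mismatches.append(f"row {k}: missing in destination")
            continue
        dest = dest_by_key[k]
        for col, value in row.items():
            if dest.get(col) != value:
                mismatches.append(
                    f"row {k}, column {col}: {value!r} != {dest.get(col)!r}"
                )
    return mismatches
```

An empty result means the high-level counts and the data-level contents agree; otherwise the output pinpoints exactly where source and destination diverge.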

Senior Data Engineer

2018 - 2019
Boston Scientific (Fortune 319)
  • Designed and implemented a Scala Spark application to build Apache Solr indices from Hive tables. The app was designed for a rollback on any failure and reduced the downtime for downstream consumers from around 3 hours to around 10 seconds.
  • Implemented a Spark Structured Streaming application to ingest data from Kafka streams and upsert them into Kudu tables in a Kerberized cluster.
  • Set up multiple Shell scripts to automate Spark jobs, Apache Sqoop jobs, and Impala commands.
  • Used Cloudera Distribution of Hadoop (CDH) and Elasticsearch extensively.
Technologies: Data Engineering, Apache Hive, Apache Impala, SQL, Apache Spark, Scala, Hadoop, Bash, Linux, Kudu, Spark Structured Streaming, Apache Solr, Spark SQL, ETL, Big Data, Data Pipelines, ELT, Big Data Architecture, Data Architecture
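At its core, the streaming upsert into Kudu applies each micro-batch of Kafka records to a keyed table: records with a known key update the existing row, and new keys insert. A toy sketch, where record shapes and the key name are assumptions and the real pipeline used Spark Structured Streaming against a Kerberized cluster:

```python
def upsert(table, records, key="id"):
    """Apply a micro-batch of stream records to a keyed table.

    `table` maps primary key -> row dict. Existing rows are merged with
    the incoming record (new columns added, changed columns overwritten);
    unseen keys are inserted. Mirrors, in miniature, Kudu's upsert
    semantics used by the streaming application described above.
    """
    for record in records:
        table[record[key]] = {**table.get(record[key], {}), **record}
    return table
```

Because each batch is applied idempotently per key, replaying a batch after a failure converges to the same table state.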

Senior Data Engineer

2017 - 2018
General Mills (Fortune 200)
  • Consumed social marketing data from various sources, including Google Analytics API, Oracle Databases, and various streaming sources.
  • Created a Scala Spark application to ingest over 100 GB of data per daily batch job, partition it, and store it as Parquet in HDFS, with corresponding Hive partitions at the query layer. The app replaced a legacy Oracle solution and reduced runtime by 90%.
  • Set up Spark SQL and Spark Structured Streaming for ETL.
  • Used Cloudera Distribution of Hadoop (CDH) extensively.
Technologies: Data Engineering, Apache Hive, Apache Impala, SQL, Apache Spark, Scala, Hadoop, Spark Structured Streaming, Spark SQL, ETL, Big Data, Data Pipelines, ELT, Big Data Architecture, Data Architecture

Software Engineer

2015 - 2016
MetLife Insurance (Fortune 44)
  • Acted as the product manager for a motorcycle insurance web app, which grew to become the primary landing site for motorcycle insurance leads.
  • Owned the master build through to production; deployed all builds and was the primary owner of build stability.
  • Led Scrum development for client teams of 30+ developers, testers, and analysts.
  • Architected and supported the solution within the client organization.
Technologies: Model View Controller (MVC), Agile

Optimizing Capital Allocation for Mortgage Market Loans

https://github.com/Namanj/Mortgage-Market-Tri-Analysis
This project was developed as a 2-week capstone project for Galvanize's data science program.

I worked with data from Shubham Housing Finance, a firm that has issued more than USD 150 million in mortgage loans over the past five years.

My goal was to use data science to help the firm optimize its usage of capital, both in its loan allocation process and in its expansion.

I broke this broad goal down into three more specific goals:
- Build a classifier that predicts the probability that a customer will default on their loan
- Recommend new office locations which maximize growth potential
- Forecast the amount of business expected over the next quarter
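As a rough sketch of the first goal: a default classifier is ultimately a model mapping borrower features to a probability in [0, 1]. The toy logistic-regression trainer below is illustrative only; the feature (a debt-to-income ratio), the data, and the scale are made up, and the real project worked with much richer lender data:

```python
import math

def train_logistic(X, y, lr=0.1, epochs=1000):
    """Toy logistic-regression trainer via stochastic gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted default probability
            err = p - yi                    # gradient of log loss w.r.t. z
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict_default_probability(w, b, x):
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Made-up training data: one feature (e.g., debt-to-income ratio),
# label 1 = defaulted on the loan.
X = [[0.1], [0.2], [0.8], [0.9]]
y = [0, 0, 1, 1]
w, b = train_logistic(X, y)
```

The predicted probability, rather than a hard yes/no label, is what lets a lender rank applicants and tune its capital allocation threshold.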

Languages

Scala, SQL, Snowflake, Python 3, Bash

Frameworks

Apache Spark, Play Framework, Spark Structured Streaming, Hadoop, YARN

Libraries/APIs

Spark ML, MLlib, Google APIs

Tools

Git, IntelliJ IDEA, Spark SQL, Apache Impala, Apache Solr, Kudu, Apache Sqoop, Subversion (SVN), GitLab

Paradigms

ETL, ETL Implementation & Design, Functional Programming, Microservices Architecture, Object-oriented Programming (OOP), Agile Software Development, Agile, Model View Controller (MVC)

Platforms

Azure, Azure Event Hubs, Databricks, Linux, Apache Kafka, macOS, Oracle Database

Storage

Data Lakes, Data Lake Design, Data Pipelines, Azure Cloud Services, Apache Hive, HDFS

Other

Azure Data Factory, Azure Data Lake, Data Engineering, Data Warehousing, Delta Lake, Data Migration, Azure Data Lake Analytics, ETL Development, Big Data, Data Architecture, Big Data Architecture, ELT, Azure Databricks, Data Warehouse Design, Data Build Tool (dbt), Machine Learning, Data Structures, Snowpark

2012 - 2014

Bachelor of Science Degree in Computer Science and Engineering

The Ohio State University - Columbus, Ohio, USA

DECEMBER 2017 - PRESENT

Spark and Hadoop Developer

Cloudera

JANUARY 2016 - PRESENT

Data Science Bootcamp

Galvanize | San Francisco, California, USA
