Naman Jain, Data Engineering Architect and Lead Developer in New Delhi, Delhi, India

Member since June 13, 2019
Naman is a Cloudera-certified Spark developer with over four years of experience delivering data engineering services to multiple Fortune 100 clients, both on-site and remotely. He has taken dozens of clients' Scala Spark applications to production, in many instances delivering a step change in efficiency for their use cases. He fundamentally believes in over-communication and establishing trust.
Naman is now available for hire

Location

New Delhi, Delhi, India

Availability

Part-time

Preferred Environment

Azure Cloud Services, Apache Spark, Scala, IntelliJ, Subversion (SVN), Git, Bash, Linux, MacOS

The most amazing...

...enterprise-grade big data ELT platform that I delivered in Azure Cloud was a single-source-of-truth data lakehouse that enabled a wide variety of use cases.

Employment

  • Data Solutions Architect

    2020 - 2021
    Enterprise Client via Toptal
    • Worked on orchestration and automation of the workflows via Azure Data Factory.
    • Optimized and partitioned storage in Azure Data Lake Storage (ADLS) Gen2.
    • Implemented complex strongly-typed Scala Spark workloads in Azure Databricks, along with dependency management and Git integration.
    • Implemented real-time, low-cost, low-latency streaming workflows that processed more than 2 million raw JSON blobs per second at peak. Architected as Azure Blob Storage -> Azure Event Hubs -> Azure Queues via ABS-AQS.
    • Created a multi-layered ELT platform consisting of raw/bronze (Azure Blob Storage), current/silver (Azure Delta Lake), and mapped/gold (Azure Delta Lake) layers.
    • Balanced the cost of computing by spinning up clusters on-demand vs persisting them.
    • Made big data available for efficient, real-time analysis across the client's organization via Delta tables, which provided indexed and optimized stores, ACID transaction guarantees, and table-level and row-level access controls.
    • Tied all of this together in end-to-end workflows that were either refreshed with just a few clicks or automated as jobs.
    • Led a team of five, comprising four developers and one solutions architect, to productionalize big data workflows in Azure Cloud. This enabled the client to sunset its legacy applications and run far more reliable and scalable production workflows.
    • Enabled a wide variety of use cases and future-proofed them by relying on open source and open standards.
    Technologies: Scala, Spark, Azure, Azure Data Factory, Azure Data Lake, Azure Databricks, Delta Lake, Data Engineering, ETL, Data Migration, Databricks
  • Lead Data Engineer

    2019 - 2020
    Stealth-mode AI startup (Series A, $20 million)
    • Architected and implemented a distributed machine learning platform.
    • Productionized 20+ machine learning models via Spark MLlib.
    • Built products and tools to reduce time to market (TTM) for machine learning projects. Reduced the startup's TTM from the design phase to production by 50%.
    • Productionalized eight Scala Spark applications that transformed data in the ETL layer to feed the downstream machine learning models.
    • Used Spark SQL for ETL and Spark Structured Streaming and Spark MLlib for analytics.
    • Led a team of six, comprising three data scientists, two back-end engineers, and one front-end engineer. Delivered a solution whose back-end layer talked to the front end via a REST API and launched and managed Spark jobs on demand.
    Technologies: Data Engineering, Apache Hive, Apache Impala, SQL, Apache Spark, Scala, Bash, Linux, Spark Structured Streaming, Machine Learning, MLlib, Spark, Spark SQL, ETL
  • Senior Data Engineer

    2018 - 2019
    Dow Chemical (Fortune 62)
    • Productionalized five Scala Spark apps for ETL. Wrote multiple Bash scripts to automate these jobs.
    • Architected and productionalized a Scala Spark app for validating the Oracle source tables against their ingested counterparts in HDFS. The user could dynamically choose either a high-level validation or a data-level validation. In case of a discrepancy, the app reported the exact columns and rows that mismatched between source and destination.
    • Cut engineers' manual debugging workload by over 99%, down to just running the app and reading its human-readable output file.
    • Delivered the entire ETL and validation project ahead of schedule.
    Technologies: Data Engineering, Apache Hive, Apache Impala, SQL, Apache Spark, Scala, Hadoop, Bash, Linux, Oracle Database, Spark SQL, ETL
  • Senior Data Engineer

    2018 - 2019
    Boston Scientific (Fortune 319)
    • Designed and implemented a Scala Spark application to build Apache Solr indices from Hive tables. The app was designed to roll back on any failure and reduced the downtime for downstream consumers from about three hours to about ten seconds.
    • Implemented a Spark Structured Streaming application to ingest data from Kafka streams and upsert it into Kudu tables in a Kerberized cluster.
    • Implemented multiple shell scripts to automate Spark jobs, Apache Sqoop jobs, Impala commands, and more.
    Technologies: Data Engineering, Apache Hive, Apache Impala, SQL, Apache Spark, Scala, Hadoop, Bash, Linux, Kudu, Spark Structured Streaming, Apache Solr, Spark SQL, ETL
  • Senior Data Engineer

    2017 - 2018
    General Mills (Fortune 200)
    • Consumed social marketing data from various sources, including the Google Analytics API, Oracle databases, and various streaming sources.
    • Productionalized a Scala Spark application to ingest more than 100 GB of data in a daily batch job, partition it, and store it as Parquet in HDFS, with corresponding Hive partitions at the query layer. The app replaced a legacy Oracle solution and reduced runtime by 90%.
    • Used Spark SQL and Spark Structured Streaming for ETL.
    Technologies: Data Engineering, Apache Hive, Apache Impala, SQL, Apache Spark, Scala, Hadoop, Spark Structured Streaming, Spark SQL, ETL
  • Software Engineer

    2015 - 2016
    MetLife Insurance (Fortune 44)
    • Acted as the product manager for a motorcycle insurance web app. The app grew to become the primary landing site for motorcycle insurance leads.
    • Served as build master for deployments through to production. Deployed all builds and was the primary owner of build stability.
    • Led Scrum development for client teams of 30+ developers, testers, and analysts.
    • Architected and supported the solution within the client organization.
    Technologies: Model View Controller (MVC), Agile

Experience

  • Optimizing Capital Allocation for Mortgage Market Loans
    https://github.com/Namanj/Mortgage-Market-Tri-Analysis

    This project was developed as a two-week capstone project for Galvanize's data science program.

    I worked with data from Shubham Housing Finance, a firm that has issued more than USD 150 million in mortgage loans over the past five years.

    My goal was to use data science to help the firm optimize its usage of capital, both in its loan allocation process and in its expansion.

    I decided to break this broad goal down into three more specific goals:
    - Build a classifier that predicts the probability that a customer will default on their loan
    - Recommend new office locations that maximize growth potential
    - Forecast the volume of business over the next quarter

Skills

  • Languages

    Scala, SQL, Python 3, Bash
  • Frameworks

    Spark, Apache Spark, Play Framework, Spark Structured Streaming, Hadoop, YARN
  • Libraries/APIs

    Spark ML, MLlib, Google APIs
  • Tools

    Git, IntelliJ, Spark SQL, Apache Impala, Apache Solr, Kudu, Apache Sqoop, Subversion (SVN)
  • Paradigms

    ETL, ETL Implementation & Design, Functional Programming, Microservices Architecture, Object-oriented Programming (OOP), Agile Software Development, Agile, Model View Controller (MVC)
  • Platforms

    Azure, Azure Event Hubs, Databricks, Linux, Apache Kafka, MacOS, Oracle Database
  • Storage

    Data Lakes, Data Lake Design, Data Pipelines, Azure Cloud Services, Apache Hive, HDFS
  • Other

    Azure Data Factory, Azure Data Lake, Data Engineering, Data Warehousing, Delta Lake, Data Migration, Azure Data Lake Analytics, ETL Development, Big Data, Data Architecture, Big Data Architecture, ELT, Data Warehouse Design, Machine Learning, Data Structures, Azure Databricks

Education

  • Bachelor of Science Degree in Computer Science and Engineering
    2012 - 2014
    The Ohio State University - Columbus, Ohio, USA

Certifications

  • Spark and Hadoop Developer
    DECEMBER 2017 - PRESENT
    Cloudera
  • Data Science Bootcamp
    JANUARY 2016 - PRESENT
    Galvanize | San Francisco, California, USA