
Abhishek J

Verified Expert in Engineering

Data Engineer and Developer

Location
Mississauga, ON, Canada
Toptal Member Since
January 10, 2023

Abhishek enjoys solving complex use cases and being creative with his solutions. He has exceptional presentation, written, interpersonal, and analytical skills, coupled with a talent for driving results through team building. Abhishek adheres to project timelines, meets business expectations, resolves issues, and works proactively on the projects he is assigned.

Portfolio

Mosaic
Azure, Azure Blobs, Azure Data Lake, Azure Databricks, Azure Data Factory...
BCG
Azure Cloud Services, Dedicated SQL Pool (formerly SQL DW)...
Amdocs
Amazon S3 (AWS S3), AWS Glue, AWS Lambda, AWS Database Migration Service (DMS)...

Experience

Availability

Part-time

Preferred Environment

Big Data, ETL, Cloudera, Hadoop, Spark, PySpark, SQL, Scala, Azure, Amazon Web Services (AWS)

The most amazing...

...recent development I've worked on was migrating a client from an on-premises legacy system to the Azure cloud from scratch and developing an automated ingestion framework.

Work Experience

Senior Data Engineer

2020 - 2022
Mosaic
  • Migrated ETL pipelines and data from on-premises legacy systems to the Azure cloud and translated the mappings to the new data vault fact and dimension tables.
  • Developed a common ingestion framework that automated the manual process of data quality checks and data loads to tables, reducing manual runtimes from four hours to under 30 minutes; over 20 applications use this framework across platforms.
  • Designed a data warehouse for complex business requirements, which needed data integration from multiple source systems.
  • Produced complex translations of IBM DataStage jobs into mapping activities in Azure Synapse Analytics.
  • Worked on Azure Data Share to securely transfer inbound and outbound data.
  • Created a reusable in-house data platform coordinator component to invoke downstream pipelines, built using Azure Cosmos DB configuration files and Azure Functions.
  • Captured real-time data quality logs using Apache Kafka and generated reports from them.
  • Designed and implemented PySpark user-defined functions (UDFs) for evaluating, filtering, loading, and storing data (see the sketch after this entry).
  • Upgraded and migrated Azure Data Lake Storage Gen1 to Gen2 resources.
Technologies: Azure, Azure Blobs, Azure Data Lake, Azure Databricks, Azure Data Factory, Azure SQL, Azure SQL Databases, Azure Synapse, Azure Key Vault, Azure DevOps, Azure Cosmos DB, Azure Event Hubs, Big Data, ETL, Hadoop, Spark, PySpark, SQL, Scala, Databases, Cloud Computing, Data Warehousing, ETL Tools, Snowflake, Microsoft Power BI, Data Engineering, Python, ETL Implementation & Design, Data Modeling, Business Intelligence (BI), Shell Scripting, Big Data Architecture, Snowpark, Synapse, Data Lakes, Data Visualization, Data Pipelines, Azure Functions, SQL Server Integration Services (SSIS), Microsoft SQL Server, Databricks, PostgreSQL, Python 3, APIs, Azure SQL Data Warehouse, Dedicated SQL Pool (formerly SQL DW), Apache Kafka, Data Transformation, Relational Databases, Message Queues, Data Manipulation, Lambda Functions, Apache Spark, Data Governance
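
As an illustration of the PySpark user-defined functions mentioned in this entry, here is a minimal, hypothetical sketch of a data-quality UDF used to filter records. The rule, column names, and table contents are illustrative assumptions, not details from the actual Mosaic project.

    # Minimal sketch of a PySpark data-quality UDF (hypothetical email rule and column names).
    import re

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import BooleanType

    spark = SparkSession.builder.appName("dq-udf-sketch").getOrCreate()

    @udf(returnType=BooleanType())
    def is_valid_email(value):
        # Treat nulls and anything failing a simplistic pattern check as invalid.
        return value is not None and re.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", value) is not None

    df = spark.createDataFrame(
        [("a1", "alice@example.com"), ("a2", "not-an-email")],
        ["customer_id", "email"],
    )

    # Route valid rows onward and quarantine the rest for data-quality reporting.
    valid_df = df.filter(is_valid_email(col("email")))
    invalid_df = df.filter(~is_valid_email(col("email")))
    valid_df.show()
    invalid_df.show()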

Data Engineer

2020 - 2021
BCG
  • Built change data capture (CDC) with Snowflake's Streams and Tasks, using Snowflake to load data with automatic schema detection and make it instantly available.
  • Participated in designing a master data management process and a data catalog.
  • Implemented new Apache Spark 3 features to optimize an ETL job, reducing its runtime from six to seven hours to less than an hour (see the sketch after this entry).
  • Developed a reusable in-house data platform coordinator with custom components to invoke downstream pipelines, built using Azure Cosmos DB configuration files and Azure Functions.
  • Produced complex translations of IBM DataStage jobs into mapping activities in Azure Synapse Analytics.
Technologies: Azure Cloud Services, Azure SQL Data Warehouse, Dedicated SQL Pool (formerly SQL DW), Spark, SQL, Scala, Azure Databricks, Snowflake, Synapse, Data Lakes, Azure Data Lake, Data Build Tool (dbt), Data Pipelines, Azure SQL, Azure Functions, SQL Server Integration Services (SSIS), Microsoft SQL Server, Databricks, Python 3, APIs, Web Services, AWS Glue, AWS Lambda, Apache Kafka, Data Transformation, Amazon EC2, Amazon Athena, Amazon RDS, Relational Databases, Data Manipulation, AWS Step Functions, Lambda Functions, Apache Spark, Amazon Elastic MapReduce (EMR), Data Governance, Cloud Migration
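
As a hedged illustration of the Apache Spark 3 features referenced in this entry, the sketch below enables adaptive query execution, skew-join handling, and dynamic partition pruning. The configuration values and the fact/dimension join are illustrative, not the project's actual settings.

    # Minimal Spark 3 tuning sketch (illustrative settings, not the real job's configuration).
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("spark3-tuning-sketch")
        # Adaptive query execution re-optimizes joins and shuffle partitions at runtime.
        .config("spark.sql.adaptive.enabled", "true")
        .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
        # Skew-join handling splits oversized partitions automatically.
        .config("spark.sql.adaptive.skewJoin.enabled", "true")
        # Dynamic partition pruning skips fact-table partitions the join cannot touch.
        .config("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
        .getOrCreate()
    )

    # Hypothetical fact/dimension join; with AQE enabled, Spark chooses the join
    # strategy and shuffle partition count at runtime rather than at plan time.
    facts = spark.range(0, 1_000_000).withColumnRenamed("id", "order_id")
    dims = spark.range(0, 1_000).withColumnRenamed("id", "order_id")
    facts.join(dims.hint("broadcast"), "order_id").explain()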

ETL Developer

2019 - 2020
Amdocs
  • Developed data processing solutions using Amazon EMR and wrote Apache Spark transformation scripts whose output landed in an Amazon Redshift data warehouse, a data lake, or Amazon S3 storage.
  • Garnered expertise in ETL/ELT, designing and creating no-code data pipelines with the AWS Glue Data Catalog and Glue Studio, and made the data available for Amazon QuickSight analysis.
  • Wrote PySpark data ingestion code using AWS Glue or Lambda functions and maintained an ETL batch processing pipeline integrated with Amazon SNS and SQS, triggered by Amazon S3 events and loading into the target data warehouse (see the sketch after this entry).
  • Built file format services for clients to handle multiple file formats like EBCDIC, CSV, FIXED, Apache Parquet, Apache Avro, etc., using Python and Apache Spark.
  • Extracted data from data lakes and an enterprise data warehouse (EDW) into relational databases for analysis, deriving more meaningful insights using SQL queries and PySpark.
  • Used Airflow scheduler to schedule jobs and load data into facts or dimensions.
  • Created automation testing framework using PySpark to address data inconsistencies and automated manual checks on data.
Technologies: Amazon S3 (AWS S3), AWS Glue, AWS Lambda, AWS Database Migration Service (DMS), Redshift, Big Data, ETL, Hadoop, Spark, PySpark, SQL, Scala, Databases, Cloud Computing, Data Warehousing, ETL Tools, Tableau, Data Engineering, Python, Amazon Web Services (AWS), Amazon Elastic Container Service (Amazon ECS), ETL Implementation & Design, Data Modeling, Business Intelligence (BI), Shell Scripting, Synapse, Data Lakes, Data Build Tool (dbt), Data Pipelines, Azure SQL, Microsoft SQL Server, Databricks, PostgreSQL, Python 3, APIs, Web Services, Dedicated SQL Pool (formerly SQL DW), Azure SQL Data Warehouse, Apache Kafka, Data Transformation, Amazon EC2, Amazon Athena, Amazon Elastic MapReduce (EMR), Amazon RDS, Relational Databases, Data Manipulation, AWS Step Functions, Apache Spark, Java, Data Governance, Cloud Migration
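
The sketch below illustrates the S3-triggered ingestion pattern described in this entry: an AWS Lambda handler that reads S3 event notifications delivered through SQS and starts a Glue job for each new object. The job name, argument names, and message layout are assumptions for illustration, not the actual Amdocs configuration.

    # Minimal Lambda handler sketch: SQS-delivered S3 events trigger a Glue job run.
    import json

    import boto3

    glue = boto3.client("glue")
    GLUE_JOB_NAME = "ingest-to-warehouse"  # hypothetical Glue job name

    def handler(event, context):
        """Handle SQS records that carry S3 ObjectCreated notifications."""
        for record in event.get("Records", []):
            s3_event = json.loads(record["body"])
            for s3_record in s3_event.get("Records", []):
                bucket = s3_record["s3"]["bucket"]["name"]
                key = s3_record["s3"]["object"]["key"]
                # Hand the new object's location to the Glue job as a job argument.
                glue.start_job_run(
                    JobName=GLUE_JOB_NAME,
                    Arguments={"--source_path": f"s3://{bucket}/{key}"},
                )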

Big Data Developer

2015 - 2018
Logic Information Systems
  • Migrated data from an IBM Netezza EDW and an Oracle data warehouse to a Hadoop data lake using Sqoop.
  • Involved in creating Hive tables, loading the data, and writing Hive queries using HiveQL (HQL).
  • Used Hive to perform ad hoc query analysis against the data residing in relational database management systems and optimized query performance using partitioning and bucketing (see the sketch after this entry).
  • Used Aginity Workbench to create views in the Oracle data warehouse to store the aggregated results.
  • Used KNIME, developed ETL workflows to push data from EDW to Hadoop Distributed File System (HDFS), and scheduled jobs using Linux cron.
  • Participated in a proof of concept to design an Apache Spark replacement for the existing MapReduce model and migrated the MapReduce jobs to Spark using PySpark and Scala.
  • Managed the high-availability cluster environment, checking the status of data nodes, including the paths for the ecosystem components in the .bashrc file.
Technologies: Cloudera, Spark, Big Data, ETL, Hadoop, PySpark, SQL, Databases, Data Warehousing, ETL Tools, Data Engineering, Python, Amazon Web Services (AWS), Amazon Elastic Container Service (Amazon ECS), Oracle, Shell Scripting, MySQL, Data Lakes, Data Pipelines, SQL Server Integration Services (SSIS), Microsoft SQL Server, Web Services, Data Transformation, Relational Databases, Data Manipulation, Apache Spark, Java
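
The following is a minimal PySpark sketch of the Hive partitioning and bucketing approach mentioned above; the table and column names (sales, sale_date, store_id) are hypothetical examples rather than the actual warehouse schema.

    # Minimal sketch: write a partitioned, bucketed Hive table and query it with HiveQL.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("hive-partition-bucket-sketch")
        .enableHiveSupport()
        .getOrCreate()
    )

    df = spark.createDataFrame(
        [("2018-01-01", 101, 25.0), ("2018-01-02", 102, 40.0)],
        ["sale_date", "store_id", "amount"],
    )

    # Partition by date and bucket by store so queries filtering on these columns
    # scan only the relevant partitions and buckets.
    (
        df.write
        .mode("overwrite")
        .partitionBy("sale_date")
        .bucketBy(8, "store_id")
        .sortBy("store_id")
        .saveAsTable("sales")
    )

    # Ad hoc HiveQL analysis benefits from partition pruning on sale_date.
    spark.sql(
        "SELECT store_id, SUM(amount) AS total "
        "FROM sales WHERE sale_date = '2018-01-01' GROUP BY store_id"
    ).show()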

Spark Ingestion Framework

Developed a reusable framework designed to ingest the acquired data into the platform.

It included loading data into the raw data platform and data vault (long-term storage) tiers. The framework performed technical validation of the data before loading it to the raw tier. Its architecture and design philosophy were to drive all decision-making information from configurations, making the ingestion process intelligent enough to derive technical details and metadata as much as possible at runtime.

The framework was designed and written in Spark on Azure Databricks, with an Azure SQL database used to store and retrieve configurations. The technical implementation used object-oriented PySpark to generate Apache Spark code automatically and dynamically at runtime. The framework supported Azure Blob Storage and Azure Data Lake Storage Gen1 and Gen2 and relied on Azure Key Vault to retrieve credentials and secrets at runtime. Azure Data Factory orchestrated the various activities in the overall data pipeline. Audit and control features were also designed to capture operational statistics and batch control information, extending and enhancing the existing batch control system. A minimal sketch of this configuration-driven approach follows.
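
The sketch assumes a Databricks notebook context (where dbutils is available), a Key Vault-backed secret scope, and a hypothetical configuration table named ingest_config in Azure SQL Database; none of these names come from the framework itself.

    # Minimal configuration-driven ingestion sketch (hypothetical names and secret scope).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Credentials are resolved at runtime from Azure Key Vault through a
    # Databricks secret scope; dbutils is the Databricks notebook utility.
    jdbc_url = dbutils.secrets.get(scope="kv-scope", key="azure-sql-jdbc-url")

    # 1. Read the ingestion configuration row for one dataset from Azure SQL.
    config = (
        spark.read.format("jdbc")
        .option("url", jdbc_url)
        .option("dbtable", "ingest_config")
        .load()
        .filter("dataset_name = 'customers'")
        .first()
    )

    # 2. Let the configuration drive the load: source format, path, and target
    #    table all come from metadata rather than hard-coded pipeline code.
    raw_df = (
        spark.read.format(config["source_format"])  # e.g., 'csv' or 'parquet'
        .option("header", "true")
        .load(config["source_path"])                # e.g., an abfss:// raw-tier path
    )
    raw_df.write.mode("append").saveAsTable(config["target_table"])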

Migration from On-premise to Azure Cloud

Migrated and automated existing big data pipelines from Cloudera's on-premises data lake to the Azure cloud.

It included translating the on-premises Oracle and MySQL database tables to Hadoop using Sqoop and designing an ETL pipeline that copies data from the Hadoop data lake to the Azure lakehouse using the PySpark ingestion framework. An automated comparison suite written in Python was built to validate the on-premises source's results against those produced on Azure.

SYNAPSE ANALYTICS OUTPUT
Dimensional modeling (facts and dimensions) was implemented using Data Vault 2.0, and historical data was maintained in the data vault using Slowly Changing Dimension (SCD) Type 2 and Delta tables, as sketched below.
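
As a hedged illustration of maintaining history with SCD Type 2 on Delta tables, here is a minimal sketch; the staging and dimension table names (stg_customer, dim_customer) and the tracked column (address) are hypothetical, and a Databricks/Delta Lake runtime with the delta Python package is assumed.

    # Minimal SCD Type 2 sketch on a Delta dimension table (hypothetical schema).
    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    updates = spark.table("stg_customer")            # hypothetical staging table
    dim = DeltaTable.forName(spark, "dim_customer")  # hypothetical dimension table

    # Step 1: close out current rows whose tracked attribute changed.
    (
        dim.alias("d")
        .merge(updates.alias("u"), "d.customer_id = u.customer_id AND d.is_current = true")
        .whenMatchedUpdate(
            condition="d.address <> u.address",
            set={"is_current": "false", "end_date": "current_date()"},
        )
        .execute()
    )

    # Step 2: insert new versions for customers that no longer have a current row
    # (new customers plus the ones just closed out above).
    current = spark.table("dim_customer").filter("is_current = true")
    to_insert = (
        updates.join(current, "customer_id", "left_anti")
        .withColumn("is_current", F.lit(True))
        .withColumn("start_date", F.current_date())
        .withColumn("end_date", F.lit(None).cast("date"))
    )
    to_insert.write.format("delta").mode("append").saveAsTable("dim_customer")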

On-premises Cloudera Development Using ETL Tools

Built distributed data solutions using Hadoop and Spark frameworks.

It involved converting Hive and SQL queries into Spark transformations using Spark RDDs, Python, and Scala (an example of this kind of conversion follows this description). I participated in creating Hive tables, loading the data, and writing Hive queries using HiveQL, which run internally as MapReduce jobs. Aginity Workbench was used to create views in the Oracle data warehouse to store the aggregated results. I used KNIME to develop ETL workflows that pushed data from the enterprise data warehouse (EDW) to the Hadoop Distributed File System (HDFS) and scheduled the jobs using Linux cron.
Data was ingested from third-party systems using data ingestion tools such as Sqoop and Flume, which was appreciated by the client. I used MapReduce functionality to perform data cleansing activities and Hive to perform ad hoc query analysis against the data residing in relational database management systems. I optimized query performance using partitioning and bucketing.
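
As an example of the kind of Hive-to-Spark conversion described above, the sketch below rewrites a simple HiveQL aggregation as DataFrame transformations; the table and columns (web_logs, status, bytes) are hypothetical.

    # Minimal sketch: a HiveQL aggregation expressed as Spark DataFrame transformations.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Original HiveQL (executed as MapReduce when run in Hive):
    #   SELECT status, COUNT(*) AS hits, SUM(bytes) AS total_bytes
    #   FROM web_logs
    #   WHERE status >= 400
    #   GROUP BY status;

    # Equivalent Spark transformation pipeline:
    result = (
        spark.table("web_logs")
        .filter(F.col("status") >= 400)
        .groupBy("status")
        .agg(
            F.count("*").alias("hits"),
            F.sum("bytes").alias("total_bytes"),
        )
    )
    result.show()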

Languages

SQL, Scala, Python, Snowflake, Python 3, PHP, Java

Frameworks

Spark, Apache Spark, Hadoop

Paradigms

ETL, ETL Implementation & Design, MapReduce, Azure DevOps, Business Intelligence (BI)

Storage

Relational Databases, Azure SQL, Azure SQL Databases, Azure Cosmos DB, Amazon S3 (AWS S3), Data Lakes, Data Pipelines, SQL Server Integration Services (SSIS), Microsoft SQL Server, PostgreSQL, Databases, Azure Blobs, Redshift, MySQL, Azure Cloud Services, SQL Server Management Studio (SSMS), Apache Hive

Other

Big Data, Data Engineering, Data Warehousing, ETL Tools, Azure Data Lake, Azure Databricks, Azure Data Factory, AWS Database Migration Service (DMS), Data Modeling, Big Data Architecture, Data Build Tool (dbt), APIs, Web Services, Data Transformation, Amazon RDS, Data Manipulation, Lambda Functions, Data Governance, Cloud Migration, Cloud Computing, Shell Scripting, Snowpark, Data Visualization, Message Queues

Libraries/APIs

PySpark

Tools

AWS Glue, Amazon Elastic Container Service (Amazon ECS), Synapse, Amazon Athena, Amazon Elastic MapReduce (EMR), Cloudera, Azure Key Vault, Tableau, Microsoft Power BI, Apache Sqoop, Cron, Apache Airflow, Informatica ETL, AWS Step Functions

Platforms

Azure, Azure Synapse, Amazon Web Services (AWS), Databricks, Azure SQL Data Warehouse, Azure Functions, Apache Kafka, Amazon EC2, Dedicated SQL Pool (formerly SQL DW), Azure Event Hubs, AWS Lambda, Oracle, KNIME

2018 - 2019

Postgraduate Certificate in Cloud Computing and Big Data Analytics

Lambton College - Toronto, Canada

2011 - 2015

Bachelor's Degree in Information Technology

Sastra University - Thanjavur, India

DECEMBER 2022 - PRESENT

Databricks Certified Associate Developer for Apache Spark

Databricks

DECEMBER 2020 - DECEMBER 2023

Microsoft Certified: Azure Data Engineer Associate

Microsoft

AUGUST 2019 - AUGUST 2022

Cloudera Certified Associate Spark and Hadoop Developer

Cloudera
