Abhishek J
Verified Expert in Engineering
Data Engineer and Developer
Abhishek enjoys solving complex use cases and bringing creativity to his solutions. He has exceptional presentation, written, interpersonal, and analytical skills, coupled with a talent for driving results through team building. He adheres to project timelines, meets business expectations, and works proactively to resolve issues on the projects he is assigned.
Preferred Environment
Big Data, ETL, Cloudera, Hadoop, Spark, PySpark, SQL, Scala, Azure, Amazon Web Services (AWS)
The most amazing...
...recent development I've worked on was migrating a client from an on-premises legacy system to the Azure cloud from scratch and developing an automated ingestion framework.
Work Experience
Senior Data Engineer
Mosaic
- Migrated ETL pipelines and data from on-prem legacy systems to Azure cloud. Translated mappings to the new data vault fact or dimension tables.
- Developed a common ingestion framework that automated manual data quality checks and table loads, cutting runtimes from four hours to under 30 minutes. Over 20 applications use this framework across platforms.
- Designed a data warehouse for complex business requirements, which needed data integration from multiple source systems.
- Produced complex translations and used mapping activities in Azure Synapse Analytics to reproduce existing IBM DataStage jobs.
- Worked on Azure Data Share to securely transfer inbound and outbound data.
- Created a reusable custom component, an in-house data platform coordinator tool, to invoke downstream pipelines. It was built using Azure Cosmos DB configuration files and Azure Functions.
- Captured real-time data quality logs using Apache Kafka and generated reports from them.
- Designed and implemented PySpark user-defined functions (UDF) for evaluation, filtering, loading, and storing of data.
- Upgraded and migrated Azure Data Lake Storage Gen1 to Gen2 resources.
Data Engineer
BCG
- Built change data capture (CDC) with Snowflake’s Streams and Tasks. Used Snowflake to load data with auto-schema detection and make it instantly available.
- Participated in designing a master data management process and a data catalog.
- Implemented new Spark 3 features to optimize an ETL job, reducing its runtime from six to seven hours to less than an hour.
- Developed a reusable data platform coordinator, an in-house tool built from custom components, to invoke downstream pipelines. It was made using Azure Cosmos DB configuration files and Azure Functions.
- Developed complex translations and used mapping activities in Azure Synapse Analytics to reproduce existing IBM DataStage jobs.
ETL Developer
Amdocs
- Developed data processing solutions using Amazon EMR and wrote Apache Spark transformation scripts whose output landed in an Amazon Redshift data warehouse, a data lake, or Amazon S3 storage.
- Garnered expertise in ETL/ELT, creating and designing data pipelines without code using AWS Glue Data Catalog and Glue Studio, and made data available for Amazon QuickSight analysis.
- Formulated PySpark data ingestion code using AWS Glue or Lambda functions and maintained an ETL batch processing pipeline, integrated with Amazon SNS or SQS, that triggered loads from Amazon S3 to the target data warehouse.
- Built file format services for clients to handle multiple file formats like EBCDIC, CSV, FIXED, Apache Parquet, Apache Avro, etc., using Python and Apache Spark.
- Extracted data from data lakes and an enterprise data warehouse (EDW) into relational databases to analyze it and derive more meaningful insights using SQL queries and PySpark.
- Used Airflow scheduler to schedule jobs and load data into facts or dimensions.
- Created automation testing framework using PySpark to address data inconsistencies and automated manual checks on data.
Big Data Developer
Logic Information Systems
- Migrated data from an IBM Netezza EDW and an Oracle data warehouse to a Hadoop data lake using the Sqoop tool.
- Involved in creating Hive tables, loading the data, and writing Hive queries using HiveQL (HQL).
- Used Hive to perform ad-hoc query analysis against the data residing in relational database management systems and optimized the query performance using the partition and bucketing concept.
- Incorporated Aginity Workbench to create views in the Oracle data warehouse to store the aggregated results.
- Used KNIME to develop ETL workflows that pushed data from the EDW to the Hadoop Distributed File System (HDFS) and scheduled the jobs using Linux cron.
- Participated in a POC designing an Apache Spark model to replace the existing MapReduce model and migrated the MapReduce jobs to Spark using PySpark and Scala.
- Managed the high-availability cluster environment, checking the status of data nodes and maintaining the paths for ecosystem components in the .bashrc file.
Experience
Spark Ingestion Framework
The framework loaded data into the raw data platform and data vault (long-term storage) tiers, technically validating the data before loading it into the raw tier. Its architecture and design philosophy kept all decision-making information in configurations, making the ingestion process intelligent enough to derive technical details and metadata at runtime as much as possible.
The framework was designed and written in Spark on Azure Databricks, with Azure SQL Database used to store and retrieve configurations. The implementation used object-oriented PySpark to generate Apache Spark code automatically and dynamically at runtime. The framework supported Azure Blob Storage and Azure Data Lake Storage Gen1 and Gen2 and relied on Azure Key Vault to retrieve credentials and secrets at runtime. Azure Data Factory orchestrated the various activities in the overall data pipeline. Audit and control features captured operational statistics and batch control, extending and enhancing the existing batch control system.
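The configuration-driven approach can be sketched in plain Python. The config fields and function names below are illustrative assumptions, not the framework's actual schema; in production the records would come from Azure SQL Database and drive `spark.read` calls:

```python
# Hypothetical config record of the kind the framework would retrieve
# from Azure SQL Database at runtime. Field names are illustrative.
INGESTION_CONFIG = {
    "dataset": "customer_orders",
    "source_format": "csv",
    "source_path": "abfss://raw@lake.dfs.core.windows.net/orders/",
    "quality_checks": ["not_null:order_id", "unique:order_id"],
    "target_tier": "raw",
}


def build_read_options(config):
    """Derive Spark reader options from configuration instead of code."""
    options = {"format": config["source_format"]}
    if config["source_format"] == "csv":
        # CSV sources need header/schema handling decided at runtime.
        options.update({"header": "true", "inferSchema": "true"})
    return options


def parse_quality_checks(config):
    """Turn 'rule:column' strings into (rule, column) pairs that a
    data quality step can dispatch on."""
    return [tuple(check.split(":", 1)) for check in config["quality_checks"]]
```

Because every decision lives in the config record, onboarding a new dataset means inserting a row, not writing new Spark code.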
Migration from On-premise to Azure Cloud
The project involved translating on-premises Oracle and MySQL database tables to Hadoop using Sqoop and designing an ETL pipeline that copies data from the Hadoop data lake to the Azure lakehouse using the PySpark ingestion framework. An automated comparison suite, written in Python, was built to check the on-premises source's results against the Azure output.
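A minimal sketch of this kind of source-to-target reconciliation check, in plain Python (function and key names are hypothetical; the real suite would compare query extracts from both platforms):

```python
def reconcile(source_rows, target_rows, key):
    """Compare two datasets (lists of dicts) by a key column and report
    rows missing on either side plus rows whose values disagree --
    the kind of check automated between on-premises and Azure outputs."""
    src = {row[key]: row for row in source_rows}
    tgt = {row[key]: row for row in target_rows}
    return {
        "missing_in_target": sorted(set(src) - set(tgt)),
        "missing_in_source": sorted(set(tgt) - set(src)),
        "mismatched": sorted(
            k for k in set(src) & set(tgt) if src[k] != tgt[k]
        ),
    }
```

An empty report on all three keys is the signal that the migrated pipeline reproduces the legacy output.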
Synapse Analytics Output
Dimensional modeling (facts and dimensions) was implemented using Data Vault 2.0, and history was maintained in the data vault using Slowly Changing Dimension (SCD) Type 2 logic on Delta tables.
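The SCD Type 2 history logic can be sketched in plain Python (an illustrative sketch, not the production Delta `MERGE` implementation; field names are assumptions):

```python
from datetime import date


def scd2_upsert(current_rows, incoming, today=None):
    """Sketch of Slowly Changing Dimension Type 2: close out the current
    version of a changed record and append the new version, preserving
    history.

    current_rows: list of dicts with keys: key, value, valid_from,
                  valid_to (None means "current version").
    incoming: dict with keys: key, value.
    """
    today = today or date.today()
    for row in current_rows:
        if row["key"] == incoming["key"] and row["valid_to"] is None:
            if row["value"] == incoming["value"]:
                return current_rows  # no change, keep the current version
            row["valid_to"] = today  # expire the old version
    current_rows.append(
        {"key": incoming["key"], "value": incoming["value"],
         "valid_from": today, "valid_to": None}
    )
    return current_rows
```

On Delta tables the same expire-and-append step is typically expressed as a single `MERGE` statement, which keeps the operation atomic.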
On-premise Cloudera Development Using ETL Tools
The project involved converting Hive and SQL queries into Spark transformations using Spark RDDs, Python, and Scala. I participated in creating Hive tables, loading the data, and writing Hive queries in HiveQL, which run internally as MapReduce jobs. Aginity Workbench was used to create views in the Oracle data warehouse to store aggregated results. I used KNIME to develop ETL workflows that pushed data from the enterprise data warehouse (EDW) to the Hadoop Distributed File System (HDFS) and scheduled the jobs using Linux cron.
Data was ingested from third-party systems using data ingestion tools such as Sqoop and Flume, which the client appreciated. I used MapReduce to perform data cleansing and Hive to run ad hoc query analysis against data residing in relational database management systems, optimizing query performance with partitioning and bucketing.
Skills
Languages
SQL, Scala, Python 3, Snowflake, PHP, Java
Frameworks
Apache Spark, Hadoop
Paradigms
ETL, ETL Implementation & Design, MapReduce, Azure DevOps, Business Intelligence (BI)
Storage
Relational Databases, Azure SQL Database, Azure Cosmos DB, Amazon S3 (AWS S3), Data Lakes, Data Pipelines, SQL Server Integration Services (SSIS), Microsoft SQL Server, PostgreSQL, Databases, Azure Blob Storage, Amazon Redshift, MySQL, Azure Cloud Services, SQL Server Management Studio (SSMS), Apache Hive
Other
Big Data, Data Engineering, Data Warehousing, ETL Tools, Azure Data Lake, Azure Databricks, Azure Data Factory, AWS Database Migration Service (DMS), Data Modeling, Big Data Architecture, Data Build Tool (dbt), APIs, Web Services, Data Transformation, Amazon RDS, Data Manipulation, Lambda Functions, Data Governance, Cloud Migration, Cloud Computing, Shell Scripting, Snowpark, Data Visualization, Message Queues
Libraries/APIs
PySpark
Tools
AWS Glue, Amazon Elastic Container Service (Amazon ECS), Synapse, Amazon Athena, Amazon Elastic MapReduce (EMR), Cloudera, Azure Key Vault, Tableau, Microsoft Power BI, Apache Sqoop, Cron, Apache Airflow, Informatica ETL, AWS Step Functions
Platforms
Azure, Azure Synapse, Amazon Web Services (AWS), Databricks, Azure SQL Data Warehouse, Azure Functions, Apache Kafka, Amazon EC2, Dedicated SQL Pool (formerly SQL DW), Azure Event Hubs, AWS Lambda, Oracle, KNIME
Education
Postgraduate Certificate in Cloud Computing and Big Data Analytics
Lambton College - Toronto, Canada
Bachelor's Degree in Information Technology
Sastra University - Thanjavur, India
Certifications
Databricks Certified Associate Developer for Apache Spark
Databricks
Microsoft Certified: Azure Data Engineer Associate
Microsoft
Cloudera Certified Associate Spark and Hadoop Developer
Cloudera