Vikram Goyal
Verified Expert in Engineering
Data Engineer and Developer
Vaughan, ON, Canada
Toptal member since October 5, 2020
Vikram specializes in leveraging cloud technologies, especially GCP and AWS, to solve business problems. He is an expert in implementing solutions such as data lakes, enterprise data marts, and application data masking. As an accomplished and resourceful software professional with over 19 years of experience, Vikram believes it's crucial to understand and analyze all aspects of a problem before choosing a technology or approach to solve it.
Experience
- Excel VBA - 10 years
- Big Data - 7 years
- SQL Server 2014 - 6 years
- SQL - 6 years
- Microsoft Parallel Data Warehouse (PDW) - 5 years
- Apache Hive - 5 years
- PySpark - 1 year
- Spark SQL - 1 year
Preferred Environment
Excel VBA, Python, Apache Hive, Visual Studio, SQL Server 2014, PySpark, Google Cloud Platform (GCP), Google BigQuery, Google Cloud Functions, Databricks
The most amazing...
...thing about my experience is the variety of data solutions I've implemented: data lakes, enterprise data marts, and application data masking.
Work Experience
Senior Cloud Data Engineer
Economical Insurance
- Designed and implemented a framework to ingest files into Google BigQuery using the BigQuery Python API and Airflow (a sketch follows this list).
- Loaded historical data from on-premises Hive into GCP BigQuery using Scala-Spark and Databricks, and migrated SAS data from on-premises sources into BigQuery using PySpark and Databricks.
- Migrated diverse sources to GCP BigQuery, ensuring data consistency and accuracy.
- Architected and implemented REST APIs on GCP using Cloud Run, Apigee, and Python to give various teams access to the underlying BigQuery data.
- Implemented FinOps reports in Tableau to track cloud spend broken down by parameters such as cloud service, team, and environment.
- Analyzed and optimized BigQuery queries to cut storage and retrieval costs, re-partitioning tables and rewriting queries to use more efficient joins and WHERE clauses.
- Used Azure Graph APIs to create reusable functions that report AD groups, AD group-to-user mappings, etc.
- Created a customer data deletion/obfuscation solution to meet business guidelines using Java, Data Catalog, and BigQuery.
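A minimal sketch of the kind of load step such a file ingestion framework might wrap behind Airflow, using the google-cloud-bigquery client; the bucket, project, dataset, and table names are hypothetical.

```python
from google.cloud import bigquery

def load_csv_to_bigquery(uri: str, table_id: str) -> None:
    """Load a delimited file from Cloud Storage into a BigQuery table."""
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,   # skip the header row
        autodetect=True,       # infer the schema; a real framework would usually pass one explicitly
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    load_job.result()          # block until the job finishes; raises on failure
    table = client.get_table(table_id)
    print(f"Loaded {table.num_rows} rows into {table_id}")

# Hypothetical bucket and table names, for illustration only.
load_csv_to_bigquery(
    "gs://example-bucket/incoming/policies.csv",
    "example-project.staging.policies",
)
```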
Senior Cloud Data Engineer
BMO Bank of Montreal
- Created the solution design blueprint to migrate 145 applications from an on-premises Cloudera cluster to an AWS data lake on S3, and to load downstream data marts using Scala-Spark on Hive/Redshift.
- Built a data ingestion framework in Scala-Spark to load data from S3 into Redshift.
- Developed multiple solutions on Athena, such as data encryption/decryption and access control using Lake Formation.
- Migrated an on-premises legacy Cloudera system, with jobs in Pentaho and Hive/Oozie, to AWS using Redshift, Scala-Spark, and Airflow.
- Implemented a solution to convert single- and multi-segment EBCDIC files from mainframes to ASCII using Scala-Spark.
- Optimized SQL code implementing SCD Type-1 and Type-2 loads into Redshift (see the sketch after this list).
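A minimal sketch of the SCD Type-2 pattern behind loads like the one above, expressed as SQL driven from Python; the table and column names are hypothetical, psycopg2 is an assumed driver (Redshift speaks the PostgreSQL wire protocol), and the production version was tuned for Redshift.

```python
import psycopg2  # assumed driver; any DB-API client for Redshift works similarly

# Hypothetical tables: staging.customers (incoming batch) and mart.dim_customer
# (target dimension with effective_from, effective_to, and is_current columns).
CLOSE_CHANGED_ROWS = """
UPDATE mart.dim_customer d
SET effective_to = CURRENT_DATE, is_current = FALSE
FROM staging.customers s
WHERE d.customer_id = s.customer_id
  AND d.is_current
  AND (d.name <> s.name OR d.address <> s.address);
"""

INSERT_NEW_VERSIONS = """
INSERT INTO mart.dim_customer
    (customer_id, name, address, effective_from, effective_to, is_current)
SELECT s.customer_id, s.name, s.address, CURRENT_DATE, DATE '9999-12-31', TRUE
FROM staging.customers s
LEFT JOIN mart.dim_customer d
  ON d.customer_id = s.customer_id AND d.is_current
WHERE d.customer_id IS NULL;  -- new keys, plus keys whose current row was just closed
"""

def run_scd2_load(conn) -> None:
    """Apply both steps in one transaction: commit on success, roll back on error."""
    with conn:
        with conn.cursor() as cur:
            cur.execute(CLOSE_CHANGED_ROWS)
            cur.execute(INSERT_NEW_VERSIONS)

run_scd2_load(psycopg2.connect("dbname=warehouse"))  # hypothetical connection string
```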
Data Engineer
Manulife
- Implemented a PySpark framework to ingest delimited, fixed-width, and Excel files into Apache Hive tables, simplifying the ingestion process and saving more than 50% of the effort (a sketch follows this list).
- Facilitated the calculation of assets under management across various dimensions, applying complex transformations through data curation scripts written in HQL, Oozie, and shell.
- Built VBA macro templates to generate metadata files for data ingestion and curation into SCD1 and SCD2 tables, reducing code errors and development time by around 30%.
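A minimal sketch of the metadata-driven file-to-Hive ingestion pattern described above; the file spec values and table names are hypothetical stand-ins for what the framework read from generated metadata files.

```python
from pyspark.sql import SparkSession

# Hypothetical metadata record; the real framework read these from metadata files.
FILE_SPEC = {
    "path": "hdfs:///landing/policies/2020-10-01/*.csv",
    "format": "csv",
    "delimiter": "|",
    "target_table": "staging.policies",
}

spark = (
    SparkSession.builder
    .appName("file-ingestion")
    .enableHiveSupport()   # lets Spark write to Hive-managed tables
    .getOrCreate()
)

df = (
    spark.read.format(FILE_SPEC["format"])
    .option("header", "true")
    .option("delimiter", FILE_SPEC["delimiter"])
    .load(FILE_SPEC["path"])
)

# Append the batch into the target Hive table.
df.write.mode("append").saveAsTable(FILE_SPEC["target_table"])
```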
Technology Architect
Infosys
- Created a data ingestion framework to load varied sources, including multi-structured VSAM, XML, JSON, zip, and fixed-width files as well as Microsoft SQL Server and Oracle databases, into an HDFS data lake using Apache Hive, Apache Sqoop, SSIS, and Python.
- Led a team of four professionals to create two complex data marts using T-SQL on Microsoft PDW, implementing load strategies such as SCD1, SCD2, and fact upsert.
- Wrote common data warehouse load strategies to help reduce the development time by nearly 30%.
- Created reusable PySpark components implementing data load strategies such as SCD1, SCD2, and fact upsert, saving 30% of development effort (see the sketch after this list).
- Implemented a solution to ingest a complex XML feed, demanding in both structure and data volume, into the data lake using Apache Hive, saving the client $300,000.
- Created two frameworks: one in Windows PowerShell to send data extracts from views built on mart tables to external systems, and another in Microsoft SQL Server to automatically generate and update statistics on all tables in a given database.
- Automated the end-to-end data masking process, from sourcing the data to saving the masked output, with a framework built on shell scripting and Oracle, cutting the time to create masked copies by nearly 50%.
- Created Excel macro tools to compare source and masked data copies, ensuring the integrity and completeness of the masked data and saving around 70% of the validation effort.
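A minimal sketch of a reusable SCD Type-1 (overwrite-in-place) load in PySpark, the simplest of the load strategies named above; the table names are hypothetical and the incoming batch is assumed to share the dimension's schema.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("scd1-load").enableHiveSupport().getOrCreate()

# Hypothetical tables: the staged batch and the target dimension.
incoming = spark.table("staging.customers")
current = spark.table("mart.dim_customer")

# SCD Type-1: incoming attributes overwrite existing rows for matching keys;
# current rows absent from the batch are carried forward unchanged.
merged = (
    current.join(incoming, on="customer_id", how="left_anti")  # rows not in the batch
    .unionByName(incoming)                                     # plus new/updated rows
)

# Rewrite the dimension. Writing to a new table and swapping it in avoids
# reading and overwriting the same Hive table within one job.
merged.write.mode("overwrite").saveAsTable("mart.dim_customer_new")
```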
Experience
Data Curation Framework
Hadoop Data Ingestion Framework
Customer Data Deletion/Obfuscation Application
Education
Bachelor of Engineering in Electronics and Electrical Communication
Punjab Engineering College - Chandigarh, India
Certifications
AWS Certified Cloud Practitioner
Amazon Web Services
Architecting Microsoft Azure Solutions
Microsoft
Administering Microsoft SQL Server 2012/2014 Databases
Microsoft
Implementing a Data Warehouse with Microsoft SQL Server 2012/2014
Microsoft
Querying Microsoft SQL Server 2012/2014
Microsoft
Skills
Libraries/APIs
PySpark
Tools
Microsoft Excel, Excel 2013, Oozie, Spark SQL, Apache Sqoop, Amazon Athena, Amazon Redshift Spectrum, Tableau, Google Cloud Composer, Apache Airflow
Languages
Excel VBA, T-SQL (Transact-SQL), SQL, Python, Scala, Java
Paradigms
ETL
Storage
SQL Server 2014, Apache Hive, Microsoft Parallel Data Warehouse (PDW), Microsoft SQL Server, Data Lakes, SQL Server 2012, Databases, MySQL, Google Cloud SQL, Google Cloud, SQL Server Integration Services (SSIS)
Platforms
Google Cloud Platform (GCP), Databricks, AWS Lambda, AWS IoT, Pentaho, Amazon Web Services (AWS)
Frameworks
Windows PowerShell
Other
Slowly Changing Dimensions (SCD), Data Engineering, Data Warehousing, Excel Macros, Data Warehouse Design, Google BigQuery, Data Migration, Big Data, Google Cloud Functions, GSM, Shell Scripting, Amazon Redshift, Data Masking