Vikram Goyal

Data Engineer and Developer in Vaughan, ON, Canada

Member since October 5, 2020
For the past eight years, Vikram has specialized in leveraging big data and business intelligence technologies to solve business problems. He is an expert in implementing solutions such as data lakes, enterprise data marts, and application data masking. An accomplished and resourceful software professional with over 15 years of experience, Vikram believes it's crucial to understand and analyze all aspects of a problem before choosing a technology or approach to solve it.

Portfolio

  • Manulife
    Slowly Changing Dimensions (SCD), Excel VBA, Shell Scripting...
  • Infosys Limited
    SQL Server Integration Services (SSIS), Python, Windows PowerShell, T-SQL...

Experience

Location

Vaughan, ON, Canada

Availability

Part-time

Preferred Environment

Apache Sqoop, Excel VBA, Python, Oozie, Apache Hive, Visual Studio, SQL Server 2014

The most amazing...

...thing I have developed is a data curation framework that uses PySpark and Hive to automatically load SCD1 and SCD2 data with minimal setup effort.

Employment

  • Data Engineer

    2019 - PRESENT
    Manulife
    • Implemented a PySpark framework to ingest data from files (delimited, fixed-width, and Excel) into Apache Hive tables, simplifying the ingestion process and cutting effort by more than 50% (an illustrative sketch follows this section).
    • Enabled the calculation of assets under management across various dimensions by building data curation scripts in HQL, Oozie, and shell scripts that perform the required complex transformations and calculations.
    • Created VBA macro-based code templates that generate the metadata files for data ingestion and curation into SCD1 and SCD2 tables, reducing code errors and development time by around 30%.
    Technologies: Slowly Changing Dimensions (SCD), Excel VBA, Shell Scripting, Hive Query Language (HQL), PySpark, Oozie, Spark SQL, Apache Hive
  • Technology Architect

    2005 - 2020
    Infosys Limited
    • Created a data ingestion framework for loading varied sources, including multi-structured VSAM, XML, JSON, zip, and fixed-width files as well as Microsoft SQL Server and Oracle databases, into a data lake on HDFS using Apache Hive, Apache Sqoop, SSIS, and Python.
    • Led a team of four professionals to create two complex data marts using T-SQL on Microsoft PDW, implementing load strategies such as SCD1, SCD2, and fact upsert.
    • Wrote common data warehouse load strategies to help reduce the development time by nearly 30%.
    • Created reusable components in PySpark for implementing data load strategies such as SCD1, SCD2, and fact upsert. This led to a development effort saving of 30%.
    • Implemented a solution using Apache Hive to ingest an XML source that was complex in both structure and data volume into a data lake, resulting in cost savings of $300,000 for the client.
    • Created two frameworks: one using Windows PowerShell to send data extracts from views built on mart tables to external systems, and another using Microsoft SQL Server to automatically create and update statistics on all tables in a given database.
    • Automated the complete data masking process, from pulling source data through saving off the masked copies, by building a framework with shell scripting and Oracle, reducing the processing time for creating masked copies by nearly 50%.
    • Created data comparison tools using Excel macros to compare source and masked data copies to ensure the integrity and completeness of masked data, which helped save around 70% of the validation effort.
    Technologies: SQL Server Integration Services (SSIS), Python, Windows PowerShell, T-SQL, Data Lakes, Microsoft SQL Server, Excel VBA, Microsoft Parallel Data Warehouse (PDW), SQL Server 2014, PySpark, Apache Sqoop, Apache Hive, Big Data
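
Below is a minimal, illustrative sketch of one ingestion path of the kind built in both roles above: a delimited file landed on HDFS loaded into a Hive table with PySpark. Every path, table name, and option here is an assumption for illustration, not the actual framework code.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("file-ingestion-sketch")
    .enableHiveSupport()  # required to read and write Hive tables
    .getOrCreate()
)

# In the real frameworks, the delimiter, header flag, and schema come from
# metadata; they are hard-coded here for brevity.
df = (
    spark.read
    .option("header", "true")
    .option("delimiter", "|")
    .option("inferSchema", "true")
    .csv("hdfs:///landing/sales/transactions.dat")  # hypothetical landing path
)

# Append into a Hive table assumed to already exist with a matching schema.
df.write.mode("append").insertInto("raw.sales_transactions")
```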

Experience

  • Data Curation Framework

    A framework that automates the creation of PySpark components implementing data load strategies such as SCD1, SCD2, and fact upsert. Users write only the business logic query; the framework first validates the data, removing duplicates, null keys, and similar issues, then runs the business query and automatically loads the result into the target using the chosen load strategy (see the sketch after this list).

  • Hadoop Data Ingestion Framework

    A data ingestion framework for loading varied sources, including multi-structured VSAM, XML, JSON, zip, and fixed-width files as well as Microsoft SQL Server and Oracle databases, into a data lake on HDFS using Apache Hive, Apache Sqoop, SSIS, and Python. Users fill in the details of the files to be ingested in an Excel template, and the framework generates all the code for ingesting the files into HDFS and creating external tables over that data, which users can then query (a sketch of this metadata-driven generation follows below).
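
A minimal sketch of the curation framework's validation and SCD2 load steps described above, assuming a single business key and a target dimension that tracks currency with is_current, start_date, and end_date columns. All table and column names are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("curation-load-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

key = "customer_id"  # hypothetical business key supplied via the metadata setup

# 1. Run the user-supplied business logic query.
incoming = spark.sql("SELECT * FROM staging.customer_transformed")

# 2. Generic validation: drop rows with null keys and exact duplicates.
incoming = incoming.dropna(subset=[key]).dropDuplicates()

# 3. SCD2 load. The target is assumed to carry the same attribute columns as
#    the query output, plus is_current, start_date, and end_date.
dim = spark.table("curated.customer_dim")
affected = incoming.select(key).distinct()

history = dim.filter("is_current = 0")    # closed-out versions stay as-is
open_rows = dim.filter("is_current = 1")

# Close out current versions of keys that received new data. (For brevity this
# expires every matched key; a real framework would compare attributes first.)
expired = (
    open_rows.join(affected, on=key, how="inner")
    .withColumn("is_current", F.lit(0))
    .withColumn("end_date", F.current_date())
)

# Current versions of untouched keys carry over unchanged.
carried = open_rows.join(affected, on=key, how="left_anti")

# All incoming rows become the new current versions.
fresh = (
    incoming
    .withColumn("is_current", F.lit(1))
    .withColumn("start_date", F.current_date())
    .withColumn("end_date", F.lit(None).cast("date"))
)

result = history.unionByName(expired).unionByName(carried).unionByName(fresh)

# Spark cannot overwrite a table it is reading from, so land in a new table.
result.write.mode("overwrite").saveAsTable("curated.customer_dim_v2")
```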
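
And a hypothetical sketch of the Excel-driven code generation in the ingestion framework: per-file details are read from a spreadsheet, and Hive external-table DDL is emitted. The spreadsheet layout and every name here are assumptions, not the real templates.

```python
import pandas as pd

# Assumed metadata columns: database, table, column_names, column_types,
# delimiter, hdfs_path; column_names/column_types are pipe-separated lists.
meta = pd.read_excel("ingestion_metadata.xlsx")

for _, row in meta.iterrows():
    columns = ", ".join(
        f"`{name}` {dtype}"
        for name, dtype in zip(row["column_names"].split("|"),
                               row["column_types"].split("|"))
    )
    ddl = (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {row['database']}.{row['table']} "
        f"({columns})\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{row['delimiter']}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{row['hdfs_path']}';"
    )
    print(ddl)  # in practice the generated DDL would be submitted via beeline
```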

Skills

  • Languages

    Excel VBA, T-SQL, SQL, Python
  • Tools

    Microsoft Excel, Oozie, Spark SQL, Apache Sqoop
  • Paradigms

    ETL
  • Storage

    SQL Server 2014, Apache Hive, Microsoft Parallel Data Warehouse (PDW), Microsoft SQL Server, Data Lakes, SQL Server 2012, Databases, MySQL, SQL Server Integration Services (SSIS)
  • Other

    Slowly Changing Dimensions (SCD), Data Engineering, Data Warehousing, Excel Macros, Data Warehouse Design, Big Data, Shell Scripting
  • Frameworks

    Windows PowerShell
  • Libraries/APIs

    PySpark

Education

  • Bachelor of Engineering Degree in Electronics and Electrical Communication
    2001 - 2005
    Punjab Engineering College - Chandigarh, India

Certifications

  • Architecting Microsoft Azure Solutions
    JUNE 2018 - PRESENT
    Microsoft
  • Administering Microsoft SQL Server 2012/2014 Databases
    FEBRUARY 2015 - PRESENT
    Microsoft
  • Implementing a Data Warehouse with Microsoft SQL Server 2012/2014
    DECEMBER 2014 - PRESENT
    Microsoft
  • Querying Microsoft SQL Server 2012/2014
    SEPTEMBER 2014 - PRESENT
    Microsoft
