Verified Expert in Engineering
Data Engineer and Developer
For the past eight years, Vikram has specialized in applying cloud and big data technologies to business problems. He is an expert in implementing solutions such as data lakes, enterprise data marts, and application data masking. An accomplished and resourceful software professional with over 18 years of experience, Vikram believes it's crucial to understand and analyze every aspect of a problem before choosing a technology or approach to solve it.
Excel VBA, Python, Apache Hive, Visual Studio, SQL Server 2014, PySpark, Google Cloud Platform (GCP)
The most amazing...
...thing I have developed is a data ingestion framework using PySpark and BigQuery that automatically loads SCD1 and SCD2 data with minimal setup effort.
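Below is a minimal sketch of the SCD2 merge step such a framework performs, using the google-cloud-bigquery client. The project, dataset, table, and column names (dim_customer, row_hash, the eff_* tracking columns) are hypothetical, not taken from the actual framework. Note that a production framework also needs a follow-up pass (or a union-style staging source) to insert the new version of each changed row, which a single MERGE with this ON clause cannot do.

```python
# A minimal SCD2 sketch with hypothetical table and column names.
from google.cloud import bigquery

client = bigquery.Client()

SCD2_MERGE = """
MERGE `my_project.mart.dim_customer` AS tgt
USING `my_project.staging.customer` AS src
ON tgt.customer_id = src.customer_id AND tgt.is_current = TRUE
WHEN MATCHED AND tgt.row_hash != src.row_hash THEN
  -- Expire the current version of a changed row.
  UPDATE SET eff_end_date = CURRENT_DATE(), is_current = FALSE
WHEN NOT MATCHED THEN
  -- Insert keys with no current row; new versions of changed keys
  -- are inserted by a separate follow-up step.
  INSERT (customer_id, row_hash, eff_start_date, eff_end_date, is_current)
  VALUES (src.customer_id, src.row_hash, CURRENT_DATE(), DATE '9999-12-31', TRUE)
"""

client.query(SCD2_MERGE).result()  # Blocks until the merge completes.
```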
Senior Cloud Data Engineer
- Designed and implemented a file ingestion framework that loads data into Google BigQuery using the BigQuery Python APIs and Airflow (sketched after this list).
- Created solutions to load historical data from on-prem Hive into GCP BigQuery using Scala Spark and Databricks, and to load SAS data from on-prem into BigQuery using PySpark and Databricks.
- Migrated data from diverse sources to GCP BigQuery, ensuring consistency and accuracy.
- Architected and implemented REST APIs on GCP using Cloud Run, Apigee, and Python to give various teams access to the underlying BigQuery data.
- Implemented FinOps reports in Tableau to track cloud costs broken down by parameters such as cloud service, team, and environment.
- Analyzed and optimized BigQuery queries to reduce data storage and retrieval costs through approaches such as re-partitioning tables and rewriting queries to use optimal joins and WHERE clauses.
- Used Azure Graph APIs to create reusable functions that report AD groups, AD group-to-user mappings, and related directory data.
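The file ingestion framework in the first bullet can be illustrated with a minimal sketch built on the google-cloud-bigquery load-job API. The GCS path, table ID, and load_date partition column are hypothetical, and in the framework this function would be wrapped in an Airflow task rather than called directly; date-partitioning the target also ties into the cost-optimization bullet above.

```python
# A minimal load-job sketch; paths, table IDs, and the partition
# column are hypothetical placeholders.
from google.cloud import bigquery

def load_file_to_bq(gcs_uri: str, table_id: str) -> None:
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,  # Infer the schema; production runs pin schemas explicitly.
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        # Partition on a date column so downstream queries scan less data.
        time_partitioning=bigquery.TimePartitioning(field="load_date"),
    )
    job = client.load_table_from_uri(gcs_uri, table_id, job_config=job_config)
    job.result()  # Wait for completion; raises on failure.

load_file_to_bq("gs://my-bucket/landing/orders.csv", "my_project.staging.orders")
```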
Senior Cloud Data Engineer
BMO Bank of Montreal
- Created the solution design blueprint to migrate 145 applications from on-prem (Cloudera) to an AWS data lake on S3, and to subsequently load data marts using Scala Spark on Hive/Redshift.
- Built a data ingestion framework using Scala Spark to load data from S3 to Redshift.
- Developed multiple solutions on Amazon Athena, including data encryption/decryption and access control using Lake Formation.
- Migrated an on-prem legacy system (Cloudera) with jobs in Pentaho and Hive/Oozie to AWS using Redshift, Scala Spark, and Airflow.
- Implemented a solution to convert single- and multi-segment EBCDIC files from mainframes to ASCII using Scala Spark (the character-set conversion is sketched after this list).
- Optimized SQL code to implement SCD Type-1 and Type-2 loads to Redshift.
- Implemented a PySpark framework to ingest data from delimited, fixed-width, and Excel files into Apache Hive tables, simplifying the ingestion process and saving more than 50% of the effort (the framework's dispatch logic is sketched after this list).
- Facilitated the calculation of assets under management across various dimensions, applying complex data transformations and calculations through curation scripts written in HQL, Oozie, and shell.
- Built code templates using VBA macros to create metadata files for data ingestion and curation into SCD1 and SCD2 tables, reducing code errors and development time by around 30%.
- Created a data ingestion framework to load varied data (multi-structured VSAM, XML, JSON, zip, and fixed-width files, as well as Microsoft SQL Server and Oracle sources) into a data lake on HDFS using Apache Hive, Apache Sqoop, SSIS, and Python.
- Led a team of four professionals to create two complex data marts using T-SQL on Microsoft PDW, implementing load strategies such as SCD1, SCD2, and fact upsert.
- Wrote common data warehouse load strategies to help reduce the development time by nearly 30%.
- Created reusable components in PySpark for implementing data load strategies such as SCD1, SCD2, and fact upsert. This led to a development effort saving of 30%.
- Implemented a solution to ingest a complex XML feed (complex in both structure and data volume) into a data lake using Apache Hive, resulting in cost savings of $300,000 for the client.
- Created two frameworks: one using Windows PowerShell to send data extracts from views created on mart tables to external systems and another using Microsoft SQL Server to automatically generate and update stats on all tables for a given database.
- Automated the complete data masking process, from pulling source data to saving off the masked output, by building a new framework with shell scripting and Oracle. This reduced the time to create masked copies by nearly 50%.
- Created data comparison tools using Excel macros to compare source and masked copies, ensuring the integrity and completeness of masked data and saving around 70% of the validation effort.
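For the EBCDIC conversion bullet above, here is a minimal sketch of the core character-set step in plain Python, using the built-in cp037 (EBCDIC US/Canada) codec on a hypothetical fixed-length record file. Real mainframe extracts with COBOL copybooks, packed decimals, and multi-segment layouts require considerably more logic; the production job used Scala Spark on a cluster.

```python
# Decode fixed-length EBCDIC records to ASCII-safe text; the file
# names and 80-byte record length are hypothetical.
RECORD_LENGTH = 80

with open("positions.ebc", "rb") as src, open("positions.txt", "w") as dst:
    while chunk := src.read(RECORD_LENGTH):
        # cp037 is Python's built-in EBCDIC US/Canada codec.
        dst.write(chunk.decode("cp037") + "\n")
```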
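And for the PySpark file ingestion framework above, a minimal sketch of the per-format dispatch. The paths, table names, and fixed-width layout are hypothetical placeholders (the real framework was metadata-driven), and the Excel path is omitted since it typically relies on a third-party reader such as a Spark Excel plugin or pandas.

```python
# Minimal multi-format ingestion sketch with hypothetical names.
from pyspark.sql import SparkSession
from pyspark.sql.functions import substring, trim

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

def ingest_delimited(path: str, table: str, delimiter: str = "|") -> None:
    # Delimited files map directly onto Spark's CSV reader.
    df = spark.read.option("header", True).option("sep", delimiter).csv(path)
    df.write.mode("append").saveAsTable(table)

def ingest_fixed_width(path: str, table: str, layout: dict) -> None:
    # layout maps column name -> (start, length), 1-based as in copybooks.
    raw = spark.read.text(path)
    df = raw.select(
        *[trim(substring("value", start, length)).alias(name)
          for name, (start, length) in layout.items()]
    )
    df.write.mode("append").saveAsTable(table)

ingest_delimited("/landing/trades.psv", "staging.trades")
ingest_fixed_width(
    "/landing/positions.dat",
    "staging.positions",
    {"account_id": (1, 10), "balance": (11, 12)},
)
```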
Data Curation Framework
Hadoop Data Ingestion Framework
Excel VBA, T-SQL (Transact-SQL), SQL, Python, Scala
Microsoft Excel, Excel 2013, Oozie, Spark SQL, Apache Sqoop, Amazon Athena, Tableau
SQL Server 2014, Apache Hive, Microsoft Parallel Data Warehouse (PDW), Microsoft SQL Server, Data Lakes, SQL Server 2012, Databases, MySQL, Google Cloud SQL, Google Cloud, SQL Server Integration Services (SSIS)
Slowly Changing Dimensions (SCD), Data Engineering, Data Warehousing, Excel Macros, Data Warehouse Design, Google BigQuery, Data Migration, Big Data, Google Cloud Functions, GSM, Shell Scripting, Amazon Redshift, Amazon Redshift Spectrum
Google Cloud Platform (GCP), Databricks, AWS Lambda, AWS IoT, Pentaho
Bachelor of Engineering Degree in Electronics and Electrical Communication
Punjab Engineering College - Chandigarh, India
Architecting Microsoft Azure Solutions
Administering Microsoft SQL Server 2012/2014 Databases
Implementing a Data Warehouse with Microsoft SQL Server 2012/2014
Querying Microsoft SQL Server 2012/2014