Data Engineer
2019 - PRESENT
Manulife
- Implemented a PySpark framework to ingest data from files (delimited, fixed-width, and Excel) into Apache Hive tables, simplifying the ingestion process and cutting ingestion effort by more than 50%.
- Enabled the calculation of assets under management across various dimensions by building data curation scripts in HQL, Oozie, and shell that performed the required complex transformations and calculations.
- Built code templates with VBA macros to generate metadata files for data ingestion and curation into SCD1 and SCD2 tables, reducing code errors and development time by around 30%.
Technologies: Slowly Changing Dimensions (SCD), Excel VBA, Shell Scripting, Hive Query Language (HQL), PySpark, Oozie, Spark SQL, Apache Hive

Technology Architect
2005 - 2020
Infosys Limited
- Created a data ingestion framework to load varied data (multi-structured VSAM, XML, JSON, zip, and fixed-width files; Microsoft SQL Server; Oracle) into a data lake on HDFS using Apache Hive, Apache Sqoop, SSIS, and Python.
- Led a team of four professionals to create two complex data marts using T-SQL on Microsoft PDW, implementing load strategies such as SCD1, SCD2, and fact upsert.
- Implemented common data warehouse load strategies, reducing development time by nearly 30%.
- Created reusable PySpark components implementing data load strategies such as SCD1, SCD2, and fact upsert, saving 30% of development effort.
- Implemented a solution to ingest a complex XML source (in both structure and data volume) into a data lake using Apache Hive, saving the client $300,000.
- Created two frameworks: one in Windows PowerShell to send data extracts from views on mart tables to external systems, and another in Microsoft SQL Server to automatically create and update statistics on all tables in a given database.
- Automated the complete data masking process, from sourcing the data to saving the masked output, by building a new framework with shell scripting and Oracle, reducing the time to create masked copies by nearly 50%.
- Built data comparison tools with Excel macros to verify the integrity and completeness of masked data against the source, saving around 70% of the validation effort.
Technologies: SQL Server Integration Services (SSIS), Python, Windows PowerShell, T-SQL, Data Lakes, Microsoft SQL Server, Excel VBA, Microsoft Parallel Data Warehouse (PDW), SQL Server 2014, PySpark, Apache Sqoop, Apache Hive, Big Data