
Ankit Vohra

Verified Expert in Engineering

Data Engineer and Oracle DB Developer

Location
Gurugram, Haryana, India
Toptal Member Since
September 23, 2020

Ankit is a cloud big data and database developer who excels in AWS, big data, Python, Spark, PL/SQL, and complex SQL. He has seamlessly migrated Python 2 code to Python 3, moved on-premise Hadoop-PySpark applications to the cloud, written Unix shell scripts and Windows batch scripts, and optimized SQL queries. A certified AWS Solutions Architect Associate, GCP Professional Data Engineer, Apache Spark developer, and Oracle OCP, Ankit now applies these skills as a freelancer and actively embraces new challenges.

Portfolio

Barefoot Ocean LLC
ETL, SQL, Amazon Web Services (AWS), Data Mining, Matillion, Amazon S3 (AWS S3)...
ExlService
Python, SQL, PySpark, Cloud, Amazon Web Services (AWS)...
IBM
Data Modeling, Batch Scripting, PuTTY, Oracle SQL Data Modeler...

Experience

Availability

Part-time

Preferred Environment

SQL, RDBMS, Unix Shell Scripting, Databricks, Python 3, PostgreSQL, AWS Lambda, Amazon RDS, Amazon S3 (AWS S3), Google Drive API

The most amazing...

...project I've done was automating an AWS Lambda Python function that interacts with Excel files in Google Drive, backed by Amazon RDS PostgreSQL.

Work Experience

Cloud Data Engineer

2022 - PRESENT
Barefoot Ocean LLC
  • Developed AWS Lambda functions for generating Excel files using the Google Drive API and Python, incorporating RapidFuzz to suggest the closest-matching recommendations in the drop-down (sketched below this role).
  • Created Orchestration and Transformation Matillion jobs, efficiently loading large Excel files into Snowflake.
  • Used Python's KoboExtractor library to fetch data from KoboToolbox and load it into an Amazon RDS Postgres DB. Configured the required AWS components, including the role, NAT gateway, VPC, and subnet, as the Lambda function needed internet access to fetch Kobo data.
  • Developed an AWS Glue ETL job to extract data from a Postgres DB view present in a different AWS account, saving it as a CSV file in S3. Scheduled the loading into a Postgres DB using an Amazon RDS pg_cron extension with the required AWS setup.
  • Configured the VPC, subnet, role, & policy for cross-account access of RDS Postgres DB via AWS Glue; extracted data to AWS S3. Enabled the AWS Lambda function to access API data via the internet by configuring the NAT gateway, VPC subnets, role, etc.
Technologies: ETL, SQL, Amazon Web Services (AWS), Data Mining, Matillion, Amazon S3 (AWS S3), Amazon RDS, AWS Glue, Snowflake, Google Drive API, Python 3, PostgreSQL, PL/pgSQL, AWS Lambda, Google Cloud Platform (GCP), Data Warehousing, Matillion ETL for Redshift, AWS Cloud Architecture, Cloud Architecture
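
Below is a minimal sketch of the RapidFuzz-based recommendation logic mentioned above; the reference values, scorer, and cut-off are illustrative assumptions, not the production configuration.

# Hedged sketch: suggest the closest-matching reference values for a free-text
# entry, as used to build drop-down recommendations.
from rapidfuzz import fuzz, process

REFERENCE_VALUES = ["Yellowfin tuna", "Skipjack tuna", "Blue swimming crab"]  # hypothetical list

def closest_matches(user_entry, limit=3, min_score=60):
    """Return up to `limit` reference values ranked by fuzzy similarity."""
    matches = process.extract(user_entry, REFERENCE_VALUES,
                              scorer=fuzz.WRatio, limit=limit,
                              score_cutoff=min_score)
    return [value for value, score, index in matches]

print(closest_matches("yellow fin tunna"))  # e.g. ['Yellowfin tuna']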

Cloud Big Data Lead Engineer

2021 - 2022
ExlService
  • Migrated Hadoop, HDFS, and PySpark 2.0 applications to PySpark 3, AWS cloud (S3 storage, EMR processing), and Azure cloud (ADLS2, DBFS storage, and Azure Databricks).
  • Worked on AWS EMR and Azure Databricks clusters, configuration, init scripts, running code in Databricks via the GitHub repository, Python wheel, pyc files, and cluster log reference.
  • Utilized PySpark scripts to convert ORC to Parquet format, transforming and comparing data via data frames (see the sketch below this role).
  • Used Azure Data Factory and Azure Synapse to build pipelines that read from Azure SQL Database tables, ran queries, and saved the data into Azure Blob Storage in Parquet format.
  • Used Azure Databricks notebooks extensively for testing code snippets, code sharing among peers, passing parameters via widgets, and connecting to Azure SQL database via pyodbc and JDBC.
  • Used the Azure Databricks workflows feature to build the notebook workflow with interdependencies.
Technologies: Python, SQL, PySpark, Cloud, Amazon Web Services (AWS), Amazon Elastic MapReduce (EMR), Azure SQL, Azure SQL Databases, Azure Data Factory, Amazon S3 (AWS S3), Amazon EC2, Hadoop, Apache Hive, Apache Ambari, HDFS, Orc, Parquet, Azure, Shell Scripting, Spark SQL, Spark, Spark Core, Azure Synapse, Databricks
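
A minimal sketch of the ORC-to-Parquet conversion mentioned above, using placeholder S3 paths; the real jobs also transformed and reconciled the data via data frames.

# Hedged sketch: convert an ORC dataset to Parquet and do a basic row-count check.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-to-parquet").getOrCreate()

src_df = spark.read.orc("s3://example-bucket/raw/orders_orc/")        # placeholder path
src_df.write.mode("overwrite").parquet("s3://example-bucket/curated/orders_parquet/")

out_df = spark.read.parquet("s3://example-bucket/curated/orders_parquet/")
assert src_df.count() == out_df.count(), "Row counts differ after conversion"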

Senior Developer | DBA

2015 - 2022
IBM
  • Purged two huge, partitioned FK-related tables to store only two years of data via parallel and partitioned movement using Oracle Data Pump. Addressed the FK constraint separately via PL/SQL code, using an exception table.
  • Built complex SQL queries per business requirements; for example, counting records across multiple tables and displaying the counts as rows, fetching previous and current year sales figures, and comparing sales for individual stores. Led a team of three Oracle DBAs.
  • Converted the main driving Unix shell script code for an Oracle PL/SQL-based DWH ETL application to Windows batch script code as part of a Unix server decommission process. Translated control file validations, procedure calls, and logging functions.
  • Developed a shell script to call SQL-PL/SQL code via Cron on a scheduled basis for report generation. Developed functionality to run the same script on an ad hoc basis from SQL Plus via an Oracle Scheduler job.
  • Built automated monitoring and email notification functionality via Unix shell script and Oracle SQL/PL/SQL integration for an ETL DWH application that used to run hundreds of stored processes over six to eight hours on a nightly basis.
  • Developed PL/SQL functionality to audit and block table drops by anonymous users and to audit the locking of DB schemas due to multiple invalid password attempts. Some scheduled processes allowed users to connect to the DB with expired credentials.
  • Created a shell script to generate a report via SQL that counts the records present in all tables of a schema, with any empty tables listed at the top of the report (sketched below this role). Used unpivot functionality to list the tables row-wise.
  • Debugged complex PL/SQL code with multiple embedded calls for pointing out inefficient queries via profiling, optimized the same via bulk processing methods, and created appropriate indexes.
  • Supported Amazon RDS PostgreSQL databases as application-level DBA. Used Liquibase to maintain database code changes in Amazon RDS PostgreSQL. Conducted SQL optimization and used utilities like psql, pg_dump, pg_restore, etc., for datasets.
  • Conducted performance tuning for a weekly running query in Oracle DB that sometimes ran long. Analyzed the query via ASH, V$SQL, DBA_HIST_SQLSTAT, and more and determined that the query performed poorly only at one instance due to limited resources.
Technologies: Data Modeling, Batch Scripting, PuTTY, Oracle SQL Data Modeler, Oracle SQL Developer, Oracle Database, Unix Shell Scripting, Oracle Performance Tuning, Oracle PL/SQL, Oracle SQL, Oracle DBA, Database Development, ETL, Oracle Database Tuning, Data Engineering, AIX, Task Automation, UNIX Utilities, Database Optimization, Unix, PL/SQL, Oracle, Stored Procedure, Amazon RDS, Relational Database Services (RDS), Liquibase, PostgreSQL
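
The schema report described above was delivered as a shell script driving SQL; the sketch below only illustrates the same counting logic with python-oracledb, and the connection details are placeholders.

# Illustration only: count the rows in every table of the connected schema and
# print the empty tables first, mirroring the report layout described above.
import oracledb

conn = oracledb.connect(user="report_user", password="***", dsn="dbhost/orclpdb")  # placeholders
cur = conn.cursor()

cur.execute("SELECT table_name FROM user_tables ORDER BY table_name")
counts = []
for (table_name,) in cur.fetchall():
    cur.execute(f'SELECT COUNT(*) FROM "{table_name}"')
    counts.append((table_name, cur.fetchone()[0]))

# Empty tables at the top of the report, then the rest alphabetically.
for name, cnt in sorted(counts, key=lambda t: (t[1] != 0, t[0])):
    print(f"{name:<30} {cnt}")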

Developer

2009 - 2015
IBM
  • Fetched data from many sources, such as Oracle, MySQL, Sybase, and SQL Server databases, and Excel, TXT, and CSV files; transformed and loaded it into the staging area; performed complex queries for reporting; and generated files for other systems.
  • Developed an Oracle PL/SQL procedure for a data warehouse application to identify tables that are not in sync with source DB tables and then call appropriate sub-procedures to fix those data mismatches automatically via merging.
  • Built programs in the database (using SQL, PL/SQL, and shell scripting) and scheduled jobs in the database and server to run the programs as required.
  • Created a proof of concept that read files stored in Hadoop, processed them via Spark SQL, and stored data from the Hadoop files in MongoDB (sketched below this role).
  • Developed a report (using SQL) that customized column headers for easier reference by management. Resolved weekly report outages with various filtering criteria. Used CTE, SQL analytical functions, and joins for the final report.
  • Created a dashboard view via SQL to fetch data on recent electrical outages on a region and device basis. The data was fetched from the database every 10 minutes to update the view in the front-end dashboard.
  • Analyzed complex stored procedures with nested calls to determine poor performing queries and optimized them via bulk operation, SQL restructuring, and index creation. Used profiling to find queries with poor performance.
  • Developed a solution for accessing a Microsoft Access database from an Oracle database. Reference: https://dbheart.com/oracle/connect_msaccess.
  • Built a materialized view to store the reference dataset and query it per the input values, reducing execution time to 0.5-1.0 seconds per call.
  • Supported and maintained an ABB electrical NMDMS while working with an energy and utility client. Support included problem troubleshooting, report development, and database development related to the product.
Technologies: MySQL, Oracle PL/SQL, PySpark, Spark SQL, Spark, Hadoop, Oracle, PL/SQL Tuning, Data Engineering, Data Transformation, Task Automation, SQL, RDBMS, UNIX Utilities, Oracle Database Tuning, Performance Tuning, Oracle Performance Tuning, ETL, Unix, PL/SQL, ABB-NMDMS, Spark Core
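
A rough sketch of the proof of concept mentioned above (reading Hadoop files with Spark SQL and storing results in MongoDB); the paths, URI, schema, and the presence of the MongoDB Spark Connector 10.x on the cluster are all assumptions.

# Hedged sketch: read files from HDFS, process them with Spark SQL, and write
# the result to MongoDB (requires the MongoDB Spark Connector jars).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hdfs-to-mongodb-poc")
         .config("spark.mongodb.write.connection.uri", "mongodb://localhost:27017")  # placeholder
         .getOrCreate())

df = spark.read.option("header", True).csv("hdfs:///data/poc/transactions/")  # hypothetical path
df.createOrReplaceTempView("transactions")
summary = spark.sql(
    "SELECT account_id, SUM(amount) AS total_amount FROM transactions GROUP BY account_id")

(summary.write.format("mongodb")
 .mode("append")
 .option("database", "poc")
 .option("collection", "account_totals")
 .save())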

Developer

2008 - 2009
Seasia Infotech
  • Developed T-SQL stored procedures for migrating COBOL (mainframe) jobs from a legacy system for a US government client in the pension and administration domain.
  • Developed SQL for generating reports via Oracle SQL*Plus with customized headers and data formats.
  • Wrote SQL in SQL Server and Oracle for different requirements during the project.
Technologies: SQL Plus, T-SQL (Transact-SQL), Oracle PL/SQL, Database Development, Oracle

Developer

2005 - 2008
Tata Consultancy Services
  • Developed the mandate functionality in the core banking software.
  • Automated account statement generation per account-level statement frequency for the core banking system, using Pro*C, PL/SQL, Forms 6i, and Oracle SQL.
  • Implemented the multicity check-issuing facility in the core banking system that previously had been able to issue only base branch checkbooks. Used Pro*C, PL/SQL, Forms 6i, and Oracle SQL.
  • Developed aspects of a banking product using Pro*C, Oracle PL/SQL, SQL, and Oracle Forms 6i per the requirement specifications.
  • Conducted Pro*C learning and hands-on workshops for colleagues.
Technologies: Core Banking Systems, Oracle Forms, Oracle PL/SQL, Oracle, Unix, Unix Shell Scripting, Pro*C, Database Development, RDBMS, Oracle Database, SQL, Oracle SQL, Consumer Banking, C, UNIX Utilities, Solaris, PL/SQL

Fisheries Data Management

I spearheaded a project employing SQL, Python, Amazon RDS (PostgreSQL), AWS Lambda, and ETL (Matillion-Snowflake). The focus was on ingesting substantial Excel files into the Snowflake Data Warehouse via the Matillion ETL tool. Additionally, I utilized AWS Glue to extract data from a PostgreSQL table in a separate AWS account, transferring it into a CSV file in AWS S3. This required meticulous configuration of AWS components, including roles and network settings.
Also, I orchestrated AWS Lambda Python functions to fetch data from an external Kobo API and seamlessly ingest it into the Amazon RDS PostgreSQL database. This involved configuration of components such as roles, policies, and network settings, facilitating Lambda's access to the internet, Amazon RDS, AWS Secrets Manager, and Amazon S3.

Furthermore, I undertook the responsibility of generating Excel files using Python and seamlessly uploading them to Google Drive through the functionality provided by the Google Drive API.
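
As a rough illustration of the Excel-to-Google-Drive step, the sketch below generates a workbook with openpyxl and uploads it through the Drive v3 API; the service-account file, folder ID, and columns are placeholders, and the production Lambda functions also handled the RapidFuzz drop-downs and RDS lookups not shown here.

# Hedged sketch: build an Excel workbook in memory and upload it to a Drive folder.
import io
from openpyxl import Workbook
from google.oauth2 import service_account
from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseUpload

creds = service_account.Credentials.from_service_account_file(
    "service_account.json", scopes=["https://www.googleapis.com/auth/drive"])
drive = build("drive", "v3", credentials=creds)

wb = Workbook()
ws = wb.active
ws.append(["site", "species", "catch_kg"])   # illustrative header row

buffer = io.BytesIO()
wb.save(buffer)
buffer.seek(0)

media = MediaIoBaseUpload(
    buffer,
    mimetype="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet")
uploaded = drive.files().create(
    body={"name": "survey_template.xlsx", "parents": ["<drive-folder-id>"]},  # placeholder folder ID
    media_body=media,
    fields="id").execute()
print("Uploaded file ID:", uploaded["id"])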

Migration of On-premise Hadoop-HDFS-PySpark2 Application to Cloud

I migrated an on-premise Hadoop-HDFS-PySpark 2 application to the AWS cloud, utilizing S3 storage, Parquet format, an RDS database, and EMR clusters to run the code on PySpark 3.
The same application was also made compatible with the Azure cloud, utilizing Databricks clusters for processing, DBFS/ADLS2 storage, and PySpark 3 code.
In the process, the whole codebase was made Python 3 compatible and cloud agnostic, ensuring that the same code could run successfully in on-premise, AWS, and Azure environments.
The application was run and tested in the cloud, and its output was compared with the on-premise results.
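
The sketch below shows one simple way to express the cloud-agnostic idea in PySpark: the job resolves its storage root from configuration, so the same code runs against HDFS, S3, or ADLS2. The environment variable and paths are illustrative, not the project's actual configuration.

# Hedged sketch: resolve the storage root at runtime so the same PySpark job can
# run on-premise (HDFS), on AWS (S3/EMR), or on Azure (ADLS2/Databricks).
import os
from pyspark.sql import SparkSession

STORAGE_ROOTS = {
    "onprem": "hdfs:///warehouse",
    "aws": "s3://example-bucket/warehouse",
    "azure": "abfss://warehouse@exampleaccount.dfs.core.windows.net",
}

root = STORAGE_ROOTS[os.environ.get("TARGET_CLOUD", "onprem")]  # hypothetical variable
spark = SparkSession.builder.appName("cloud-agnostic-job").getOrCreate()

orders = spark.read.parquet(f"{root}/orders")
(orders.groupBy("order_date").count()
 .write.mode("overwrite").parquet(f"{root}/orders_daily"))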

Automated On-demand Data Import/Migration from Multiple Database Tables

Automated on-demand data export and import among tables in remote Oracle databases via shell scripting. The script can be called by a user with a session ID argument, using a DBMS scheduler job.

Input Parameters:
SOURCE_DATABASE
TARGET_DATABASE
SOURCE_SCHEMA
TARGET_SCHEMA
SOURCE_TABLE
TARGET_TABLE

The user can load the parameters into a CONTROL_TBL_<session_id> table, where session_id is a randomly generated number; for example, CONTROL_TBL_12345.

The job will accept the same session_id as an argument, read the corresponding control table in the target DB, and process the records to load data from the source table into the target table (dropping the target table first if it already exists).

As required, the solution:
i) Imported and refreshed data as per provided input.
ii) Put table metadata (indexes, constraints, triggers, grants, etc.) into a CLOB column (in control table only), which the user could retrieve and run when needed.
iii) Put the error log into the same CLOB column for analysis purposes in case of failure.
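
The delivered solution was a shell script invoked by an Oracle DBMS Scheduler job; purely for illustration, the sketch below shows the core control-table flow in Python with python-oracledb. The column names, connection details, and the SRC_LINK database link are assumptions.

# Illustration only: read the session's control table and rebuild each target
# table from its source over a database link.
import oracledb

def refresh_from_control_table(session_id):
    conn = oracledb.connect(user="loader", password="***", dsn="target-db/orclpdb")  # placeholders
    cur = conn.cursor()
    cur.execute(f"SELECT source_schema, source_table, target_schema, target_table "
                f"FROM control_tbl_{int(session_id)}")
    for src_schema, src_table, tgt_schema, tgt_table in cur.fetchall():
        # Drop the target table if it already exists, ignoring ORA-00942.
        cur.execute("""
            BEGIN
              EXECUTE IMMEDIATE 'DROP TABLE ' || :tgt || ' PURGE';
            EXCEPTION
              WHEN OTHERS THEN
                IF SQLCODE != -942 THEN RAISE; END IF;
            END;""", tgt=f"{tgt_schema}.{tgt_table}")
        # Rebuild the target table from the remote source table.
        cur.execute(f"CREATE TABLE {tgt_schema}.{tgt_table} AS "
                    f"SELECT * FROM {src_schema}.{src_table}@SRC_LINK")
    conn.commit()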

Automated Monitoring of Data Warehouse Jobs

An Oracle data warehouse was refreshed from a source DB via nightly PL/SQL jobs that were, in turn, triggered by Tivoli Workload Scheduler.

The jobs spanned around six to eight hours and required continuous monitoring for any failures or errors, followed by steps to fix data sync issues after the job run window.

I automated the monitoring process via a Unix shell script that did the following:
1. Connected to the Oracle DB via SQL Plus.
2. Queried job(s) status and picked any failing jobs with an error message.
3. Dumped the information in an intermediate log file.
4. Emailed the file to the intended recipients.

The automated job was scheduled via cron to run at 30-minute intervals and send error emails after each job failure, rather than waiting six to eight hours until the end of the job. When the run succeeded, a single email was sent at the end of the job to confirm successful completion.
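
The implemented monitor was a Unix shell script using SQL*Plus and cron; as a hedged illustration of the same check-and-notify logic, here is a Python equivalent. The status table, DSN, addresses, and SMTP host are assumptions.

# Illustration only: query for failed DWH jobs and email the failures.
import smtplib
from email.message import EmailMessage
import oracledb

conn = oracledb.connect(user="monitor", password="***", dsn="dwh-db/orclpdb")  # placeholders
cur = conn.cursor()
cur.execute("""
    SELECT job_name, error_message
      FROM etl_job_status
     WHERE status = 'FAILED'
       AND run_date = TRUNC(SYSDATE)""")
failures = cur.fetchall()

if failures:
    msg = EmailMessage()
    msg["Subject"] = "DWH refresh failures detected"
    msg["From"] = "dwh-monitor@example.com"
    msg["To"] = "dba-team@example.com"
    msg.set_content("\n".join(f"{job}: {err}" for job, err in failures))
    with smtplib.SMTP("mailhost.example.com") as smtp:
        smtp.send_message(msg)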

Purging Two Massive Partitioned Tables with Foreign Key Relationships

A data warehouse database had two huge partitioned tables, 500 and 300 GB in size, with foreign key relationships among them. The client was concerned about query performance and table size. The requirement was to purge the tables and retain only the last two years of data while maintaining referential data integrity in the process and retaining original tables as history tables.

PROCESS
• Let the original huge tables be P (parent) and C (child).
• Let P_NEW and C_NEW be the tables with the last two years of data.

1. Exported data for P & C via parallelism.
2. Renamed the constraints and indexes of P & C, as the import would attempt to create them with the same names and fail.
3. Imported DMP files in the P_NEW table with parallelism and disabled logging.
4. Imported DMP files into C_NEW with parallelism by EXCLUDING REF_CONSTRAINT and disabled logging.
5. Created the FK constraint on C_NEW referencing P_NEW with NOVALIDATE.
6. Enabling and validating the REF constraint was tricky. I resolved the referential data mismatch by catching the missing parent keys in the child via an EXCEPTIONS table mechanism and then inserting them into the parent via a loop.
7. Renamed P and C to P_HISTORY and C_HISTORY. Renamed P_NEW and C_NEW to P and C.
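
Step 6 relies on Oracle's EXCEPTIONS table mechanism (created via utlexcpt.sql). The sketch below drives that step with python-oracledb purely for illustration; the constraint, table, and key names are placeholders, and it uses a single set-based insert where the original fix looped over the missing keys.

# Illustration only: validate the FK, capture orphan child rows in EXCEPTIONS,
# back-fill the missing parent keys, and re-validate.
import oracledb

conn = oracledb.connect(user="dwh_admin", password="***", dsn="dwh-db/orclpdb")  # placeholders
cur = conn.cursor()

try:
    cur.execute("ALTER TABLE c ENABLE VALIDATE CONSTRAINT c_p_fk EXCEPTIONS INTO exceptions")
except oracledb.DatabaseError:
    # Orphan child ROWIDs are now in EXCEPTIONS; insert the missing parent keys.
    cur.execute("""
        INSERT INTO p (p_key)
        SELECT DISTINCT c.p_key
          FROM c
          JOIN exceptions e ON e.row_id = c.ROWID""")
    conn.commit()
    cur.execute("DELETE FROM exceptions")
    cur.execute("ALTER TABLE c ENABLE VALIDATE CONSTRAINT c_p_fk EXCEPTIONS INTO exceptions")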

Generic Data Model for an Amazon S3-based DWH

The client required denormalized data from three different systems to be dumped into Amazon S3 for efficient data retrieval for business intelligence. I built a generic data model to accommodate those entities.

Three similar systems (A, B, C) stored account and finance data for a healthcare client. None of the systems had all the required data, and all three had to be referenced for data analysis. The requirement was to extract the data of relevant common entities from the systems and fit it into a generic, denormalized data model that would store data from all the systems in Amazon S3 for intelligence gathering via Spark clusters.

I first understood the database structure for the systems, then built a generic physical model with entities, attributes, and queries to fetch data from each database. System A had demographic data, B had account data, and C had product-related info in normalized tables.

Here, a generic, denormalized account entity was required to have a single record for a given customer with all the info in a single row (e.g., account, demographic, and product info). Data was pulled via queries through an Informatica ETL process, merged into parquet files, and placed in Amazon S3 for analysis via Spark clusters.
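
A brief sketch of the downstream analysis side: once the ETL has landed the denormalized account entity in S3 as Parquet, Spark can aggregate it without joining back to systems A, B, or C. The bucket, path, and column names are illustrative.

# Hedged sketch: analyze the denormalized account data directly from S3.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("account-analysis").getOrCreate()
accounts = spark.read.parquet("s3://example-dwh/denormalized/account/")  # placeholder path

# One row per customer, so no joins back to the source systems are needed.
(accounts.groupBy("product_category", "state")
 .agg(F.countDistinct("customer_id").alias("customers"),
      F.sum("account_balance").alias("total_balance"))
 .show())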

Migrating a DWH Application from Unix Shell Script to Windows Batch Script

As part of migrating a data warehouse application from an AIX (Unix) server to a Windows server, the code in driving shell scripts had to be rewritten into a Windows batch script.

Functionality included the following:
1. Check the presence of a control file for a job.
2. Call the stored procedure corresponding to the control file's content.
3. Check the final status of the executed job and write it to a file.

I converted the Unix shell script code into a Windows batch script as per the functionality and rescheduled the new job call commands on the new Windows server via the Tivoli Workload Scheduler.
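
The deliverable here was a Windows batch script, so the sketch below is only a language-neutral illustration (written in Python) of the driving logic that was ported: check the control file, call the procedure it names, and record the final status. The paths, control-file format, and connection details are assumptions.

# Illustration only: the control-file / stored-procedure / status flow.
from pathlib import Path
import oracledb

control_file = Path(r"C:\dwh\control\nightly_load.ctl")   # hypothetical paths
status_file = Path(r"C:\dwh\logs\nightly_load.status")

if control_file.exists():
    proc_name = control_file.read_text().strip()           # e.g. "dwh_pkg.load_sales"
    conn = oracledb.connect(user="dwh", password="***", dsn="dwh-db/orclpdb")  # placeholders
    try:
        conn.cursor().callproc(proc_name)
        status_file.write_text(f"{proc_name}: SUCCESS\n")
    except oracledb.DatabaseError as exc:
        status_file.write_text(f"{proc_name}: FAILED - {exc}\n")
else:
    status_file.write_text("Control file missing - job skipped\n")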

Programming for a Customer Communication Project

While working on an energy and utility client's electrical outage management system application—a network manager distribution management system (ABB-NMDMS)—a customer communication project (CCP) came up. I wrote code for the following functionalities in Oracle PL/SQL, SQL, and Unix shell scripting:

1. For each electrical outage, get all impacted customer account IDs from two different data sources and put the non-duplicate IDs in a file every 30 minutes. The file would be consumed by an external application to send SMSs to customers when they have an electrical outage at their home, along with an estimated power restoration time.

2. Generate a separate file for duplicate account IDs and send it by email (via uuencode) to the intended recipients only when a new account ID is present. If an email had already been sent for an ID and the same ID came up again, no repeat email was sent.

I wrote the technical specification document and developed the code for these functionalities.
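
The delivered CCP code was Oracle PL/SQL, SQL, and Unix shell scripting run every 30 minutes; purely as an illustration of the dedupe-and-file logic, here is a Python rendering. The table names, columns, file names, and connection details are assumptions.

# Illustration only: combine impacted accounts from two sources, write unique IDs
# for the SMS feed, and write duplicates to a separate file.
from collections import Counter
import oracledb

conn = oracledb.connect(user="ccp", password="***", dsn="nmdms-db/orclpdb")  # placeholders
cur = conn.cursor()
cur.execute("""
    SELECT account_id FROM outage_accounts_src1 WHERE outage_active = 'Y'
    UNION ALL
    SELECT account_id FROM outage_accounts_src2 WHERE outage_active = 'Y'""")
counts = Counter(row[0] for row in cur.fetchall())

unique_ids = sorted(counts)                                    # consumed by the SMS application
duplicate_ids = sorted(i for i, n in counts.items() if n > 1)  # reported separately by email

with open("impacted_accounts.txt", "w") as f:
    f.write("\n".join(str(i) for i in unique_ids))
with open("duplicate_accounts.txt", "w") as f:
    f.write("\n".join(str(i) for i in duplicate_ids))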

PostgreSQL Basketball League Management System

https://dbheart.com/oracle/files_to_download/AID_BLMS_AV.pdf
A database application for a basketball league management system (BLMS), built per the given requirements. I developed the database model and schema for entities such as players, teams, seasons, and contracts, and I created the required objects, such as tables, database links, procedures, and functions, to fulfill the business requirements.


Examples of reporting functionality:
1. Most expensive players
2. Most expensive lineup
3. Injured players
4. Player trading among teams
5. Replication via database links

Additional deliverables:
1. System context diagram
2. Use case model
3. Architecture overview
4. Location model
5. Logical and physical deployment models
6. Entity Relationship diagram (data modeling)

Python: Dictionary from a JSON File and Web Scraping for Information

https://dbheart.com/oracle/files_to_download/Python_w.pdf
Developed Python code that reads word meanings from a JSON file, suggests alternatives when a word is misspelled, and displays the meaning of the matched word.
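
A minimal sketch of the lookup-with-suggestions behavior, assuming a data.json file keyed by word; the actual file layout and matching threshold may differ.

# Hedged sketch: dictionary lookup with "did you mean" suggestions for misspellings.
import json
from difflib import get_close_matches

with open("data.json", encoding="utf-8") as f:   # assumed {"word": ["meaning", ...]} layout
    dictionary = json.load(f)

def lookup(word):
    word = word.lower()
    if word in dictionary:
        return dictionary[word]
    suggestions = get_close_matches(word, dictionary.keys(), n=3, cutoff=0.8)
    if suggestions:
        return f"Did you mean: {', '.join(suggestions)}?"
    return "Word not found."

print(lookup("definately"))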

I also developed Python code to scrape a website with these options:
i) Crawl all the pages of a bookstore website to collect all unique authors and the top 10 tags.
ii) Crawl all the pages of the bookstore and get only the books with 5-star ratings (see the sketch below).
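
A rough sketch of option ii; the URL pattern and markup follow books.toscrape.com, a common practice site, which is an assumption about the actual target.

# Hedged sketch: page through the catalogue and keep only five-star books.
import requests
from bs4 import BeautifulSoup

five_star_titles = []
page = 1
while True:
    resp = requests.get(f"https://books.toscrape.com/catalogue/page-{page}.html")
    if resp.status_code != 200:          # ran past the last page
        break
    soup = BeautifulSoup(resp.text, "html.parser")
    for book in soup.select("article.product_pod"):
        if book.select_one("p.star-rating.Five"):
            five_star_titles.append(book.h3.a["title"])
    page += 1

print(f"{len(five_star_titles)} five-star books found")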

Website for Technical Knowledge Sharing

A website with several topics related to Oracle Database and Unix technology. The contents can be viewed easily on a PC, mobile, or tablet via a browser. I built the website with CSS, HTML, and Bootstrap.
Education

2008 - 2010

Post Graduate Diploma in Information Technology

Symbiosis Center for Distance Learning - India

2001 - 2005

Bachelor of Technology Degree in Electrical Engineering with Computers

MM Engineering College - Ambala, Haryana, India

Certifications

NOVEMBER 2023 - PRESENT

AWS Solutions Architect Associate

Amazon Web Services

DECEMBER 2022 - PRESENT

Google Certified Professional Data Engineer

Google

MAY 2021 - PRESENT

AWS Certified Cloud Practitioner

Amazon Web Services

JUNE 2019 - PRESENT

IBM Certified Apache Spark Developer

IBM

AUGUST 2014 - PRESENT

Oracle Database 11g DBA (OCP) (Exams 1Z0-052, 1Z0-053)

Oracle

JULY 2012 - PRESENT

Oracle Database 11g: Program with PL/SQL (Exam 1Z0-144)

Oracle

MAY 2007 - PRESENT

Oracle9i DBA OCA (SQL Exams 1Z0-007, 1Z0-031)

Oracle

Languages

SQL, Stored Procedure, T-SQL (Transact-SQL), Python 3, Python, HTML, Pro*C, C, HTML5, CSS, Orc, Snowflake, PL/pgSQL

Paradigms

Database Development, ETL, Database Design, Automation

Platforms

Oracle Database, Oracle, KornShell, Unix, AIX, Linux, Amazon Web Services (AWS), Spark Core, Databricks, Solaris, Amazon EC2, AWS ALB, Azure, Azure Synapse, Google Cloud Platform (GCP), AWS Lambda

Storage

Oracle SQL Developer, Oracle SQL, Oracle RDBMS, Oracle PL/SQL, PL/SQL, RDBMS, Relational Databases, Databases, Oracle 11g, Database Administration (DBA), PostgreSQL, MySQL, Database Modeling, Oracle DBA, Oracle Database Tuning, Amazon S3 (AWS S3), Apache Hive, HDFS, Database Architecture, Microsoft SQL Server, Azure SQL, Azure SQL Databases, Google Cloud SQL, Auto-scaling Cloud Infrastructure

Other

Oracle Performance Tuning, Batch Scripting, Data Engineering, Data Modeling, Database Optimization, PL/SQL Tuning, Database Schema Design, Shell Scripting, Relational Database Design, Unix Shell Scripting, Data Transformation, Modeling, ABB-NMDMS, Electrical Outage Management System, SQL Plus, Core Banking Systems, Performance Tuning, Task Automation, Partitioning, UNIX Utilities, Amazon RDS, Relational Database Services (RDS), Cloud, Consumer Banking, Data Migration, Azure Data Factory, Parquet, Google BigQuery, AWS Cloud Architecture, Cloud Architecture, Cloud Infrastructure, Cloud Services, Data Mining, Matillion, NAT, Kobo, Data Warehousing

Libraries/APIs

PySpark, Liquibase, Pandas, Google Drive API

Tools

Oracle SQL Data Modeler, PuTTY, Cron, Spark SQL, Oracle Forms, Amazon Athena, Amazon Elastic MapReduce (EMR), Apache Ambari, AWS Glue, Amazon Virtual Private Cloud (VPC), Matillion ETL for Redshift

Frameworks

AWS HA, Hadoop, Bootstrap, Spark
