Team Lead, Data Engineering and Analytics
2020 - PRESENT
- Led a team of data engineers to build an enterprise data lake on AWS with Amazon S3, Glue Data Catalog, Glue PySpark jobs, Athena, and Redshift, integrating data from multiple subsidiary companies into a single place for analytics and predictive modeling.
- Managed a team of engineers to build an event streaming system with Apache Kafka and Schema Registry to unify the transactions coming from different business entities and make the data centrally available.
- Implemented a dbt setup spanning two Redshift clusters (one for ETL and one solely for reporting) and two other teams (data analytics and data science) to manage data warehouse transformations and models.
- Made multiple data sources (raw and enriched) available in Elasticsearch for consumption.
- Implemented custom real-time alerts with Elasticsearch and Datadog for technical support and operations teams.
- Supported reporting and data self-service by managing a Tableau server and a Redash instance.
- Collaborated closely with the platform engineering team to keep up with best practices for automated deployments (GitHub Actions and Jenkins) and IaC (Terraform).
- Worked with the data science and analytics teams on best practices for model training and deployment, data modeling, organizing the development process, and automation.
Technologies: Redshift, Kafka Streams, Apache Kafka, Python, Data Build Tool (dbt), SQL, Data Engineering, ETL, Snowflake, Data Architecture, Database Architecture, Data Lakes, PySpark, Elasticsearch, Data Modeling, Data Aggregation, Amazon Athena, ETL Tools, Tableau, Apache Airflow, Amazon Web Services (AWS), Spark, Amazon EC2, Docker, Data Warehousing, Architecture, Performance Tuning
Data Engineer, Consultant (Freelance)
2019 - 2020
- Restructured a monolithic PySpark ML model into well-defined data loading, processing, training, prediction, and output generation stages.
- Expressed the multiple stages of the model's lifecycle through an Airflow DAG; used parallelism, logging, and notification utilities; implemented data quality checks as part of the pipeline.
- Improved data-processing speed, mainly by adjusting data compression formats for I/O operations, partitioning data, and using PySpark native functions instead of UDFs.
- Gathered requirements for, designed, and built a PostgreSQL data warehouse focused on marketing and investment performance, adhering to Kimball's classic facts and dimensions principles.
- Built data pipelines to populate the data warehouse with marketing and market analysis data from a variety of sources.
- Supported the head of BI in setting up the Tableau reporting infrastructure.
- Helped split the reporting requirements and implementation into two buckets: real-time reporting with Elasticsearch, and batch reporting with BigQuery for work that requires pre-processing, joining reference data, and aggregation.
- Refactored and sped up PySpark ML models predicting returns and cancellations.
- Automated the deployment of models to on-demand EMR clusters.
- Remapped data sources from Exasol to a data lake built on top of Amazon S3 with Presto.
Technologies: Python, PostgreSQL, SQL, Data Engineering, ETL, Data Architecture, Database Architecture, Spark ML, PySpark, AWS EMR, Elasticsearch, Data Modeling, Data Aggregation, Amazon Athena, ETL Tools, Tableau, Apache Airflow, Amazon Web Services (AWS), Spark, Amazon EC2, Redshift, Docker, Apache Kafka, Data Warehousing, Architecture, Performance Tuning
Senior Data Engineer
2016 - 2018
- Built data pipelines to ingest data from Kafka, relational databases, MongoDB, financial agencies' APIs, marketing platforms, and Salesforce.
- Managed the ingestion of data sources into a centralized data lake on top of Amazon S3 (for the UK and US businesses) and a PostgreSQL data warehouse (for the EU business).
- Integrated an on-demand AWS EMR cluster with Hive and PySpark into the company's data warehousing, ETL, and reporting activities to replace long-running workloads inside the PostgreSQL relational database.
- Built data marts and models for automated reporting with PostgreSQL, Redshift, Hive, Amazon S3, and Athena (depending on the geography and stack) for C-level stakeholders and governmental agencies.
Technologies: Amazon Web Services (AWS), Kafka Streams, AWS EMR, Redshift, Python, PostgreSQL, SQL, Data Engineering, ETL, Data Architecture, Database Architecture, PySpark, Data Modeling, Data Aggregation, Amazon Athena, ETL Tools, Tableau, Apache Airflow, Spark, Amazon EC2, Docker, Apache Kafka, Data Warehousing, Architecture, Performance Tuning
Data Engineer and Release Manager
2013 - 2016
- Developed ETL processes using IBM DataStage on top of an Oracle database and the SQL Server suite.
- Executed and supervised the release process and communicated it to stakeholders.
- Built and presented multiple prototypes, as a member of the pre-sales squad, with Hadoop, Hive, and Spark.
Technologies: DataStage, Microsoft SQL Server, Oracle, SQL, Data Engineering, ETL, Data Architecture, Database Architecture, Data Modeling, Data Aggregation, ETL Tools, Tableau, Apache Airflow, Amazon Web Services (AWS), Spark, Amazon EC2, Docker, Python, Data Warehousing, Architecture, Performance Tuning