Shahban Riaz

Data Engineer and Developer

Location
Melbourne, Australia
Toptal Member Since
August 3, 2022

Shahban is a data engineer who specializes in architecting, designing, and developing data lakes, warehouses, and analytics solutions. With over 14 years in the technology industry, he has guided large organizations in establishing data governance frameworks, implementing batch and real-time data pipelines, and building data quality frameworks. Shahban has experience with test-driven development, CI/CD, and agile project execution.

Portfolio

SEEK
Amazon Web Services (AWS), Amazon S3 (AWS S3), AWS IAM, Redshift Spectrum...
AusNet Services
Solution Design, Distributed Team Management, Azure Data Factory, Azure SQL...
Jemena
PySpark, AWS EMR, AWS Glue, Amazon Athena, Apache Airflow, Amazon DynamoDB...

Availability

Full-time

Preferred Environment

Azure, Apache Airflow, Azure Synapse, Databricks, Terraform, Apache Kafka, Amazon Web Services (AWS), Redshift, Apache Spark, Agile

The most amazing...

...project I've developed is a framework for configuration-driven data curation, transformation, and quality assurance using Apache Airflow and PySpark.

Work Experience

2021 - 2023

Senior Data Engineer

SEEK
  • Enhanced data ingestion and orchestration frameworks to run jobs in clustered Spark environments using AWS Batch. This reduced the execution time of data pipelines by more than half.
  • Integrated the self-service portal with Airflow and Talend to allow the execution of data processing pipelines across multiple systems with a single click.
  • Developed an automated data tagging solution for an enterprise data lake using Amazon SNS, Amazon SQS, and AWS Lambda functions.
  • Configured AWS Lake Formation to federate data across multiple data lakes, enabling end users to access data in various data lakes from a single location.
  • Built a configuration-driven framework using Apache Airflow and Spark that allows business users to generate customized data objects from simple SQL queries (sketched below).
  • Created and productionized data quality pipelines using Great Expectations.
  • Developed data pipelines to curate data from Salesforce using REST APIs.
Technologies: Amazon Web Services (AWS), Amazon S3 (AWS S3), AWS IAM, Redshift Spectrum, AWS Batch, PySpark, Apache Airflow, Amazon Athena, AWS CloudFormation, Buildkite, Docker, AWS Glue, AWS Lake Formation, Git, Jira, SQL, Python, ETL, Data Engineering, Containerization, APIs, Spark, PostgreSQL, Data Governance, PL/SQL
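
A minimal sketch of the configuration-driven pattern behind that framework, assuming a hypothetical `datasets.yml` that lists a `name`, `sql`, and `target` per dataset; all file names, IDs, and paths are illustrative, not SEEK's actual implementation.

```python
# Illustrative DAG factory: one Airflow task per dataset declared in a
# hypothetical YAML config, each running a SQL transform through PySpark.
from datetime import datetime

import yaml
from airflow import DAG
from airflow.operators.python import PythonOperator


def run_transform(name: str, sql: str, target: str) -> None:
    """Execute one configured SQL transform and write the result out."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName(f"curate-{name}").getOrCreate()
    spark.sql(sql).write.mode("overwrite").parquet(target)


with open("datasets.yml") as fh:  # hypothetical config file
    config = yaml.safe_load(fh)

with DAG(
    dag_id="config_driven_curation",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",  # Airflow 2.4+ argument name
    catchup=False,
) as dag:
    for ds in config["datasets"]:
        PythonOperator(
            task_id=f"build_{ds['name']}",
            python_callable=run_transform,
            op_kwargs={"name": ds["name"], "sql": ds["sql"], "target": ds["target"]},
        )
```
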
2021 - 2021

Data Analytics Technical and Design Lead

AusNet Services
  • Prepared the technical design for full-stack monitoring of the corporate data analytics platform using Azure Monitor, Kusto Query Language, and a Log Analytics workspace, resulting in a 360-degree monitoring view of the platform.
  • Drafted architecture and design patterns for data ingestion, transformation, and storage using Azure Data Factory, PostgreSQL, Databricks, Azure Data Lake Storage, Azure Event Hubs, and a data vault.
  • Prepared data models for spatial and weather data sets using the data vault methodology.
  • Led a team of seven DataOps engineers in developing a data analytics and machine learning platform.
  • Reviewed and enhanced end-to-end architecture for a data lake and a data warehousing solution.
  • Oversaw the development and optimization of streaming data pipelines utilizing Azure Data Factory, Azure Event Hubs, Apache Spark, Azure Databricks, and Azure SQL (sketched below).
  • Designed patterns to curate data from various external systems using REST APIs and Apache Spark.
Technologies: Solution Design, Distributed Team Management, Azure Data Factory, Azure SQL, Databricks, PySpark, Azure Data Lake, EventHub, Data Vaults, Git, Jira, SQL, Python, Data Architecture, Azure, ETL, Data Engineering, Containerization, Data Warehouse Design, APIs, Spark, PostgreSQL, Data Governance, PL/SQL
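
A minimal sketch of one such streaming pipeline, assuming the Event Hubs namespace has its Kafka-compatible endpoint enabled and Delta Lake is available (as it is on Databricks); the namespace, topic, and storage paths are hypothetical placeholders.

```python
# Illustrative stream ingestion: read from Azure Event Hubs through its
# Kafka-compatible endpoint and land raw events in a Delta table on ADLS.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("eventhub-ingest").getOrCreate()

EH_BOOTSTRAP = "myns.servicebus.windows.net:9093"  # hypothetical namespace
EH_CONN_STR = "Endpoint=sb://..."  # kept in a secret scope in practice

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", EH_BOOTSTRAP)
    .option("subscribe", "meter-readings")  # hypothetical topic
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option(
        "kafka.sasl.jaas.config",
        "org.apache.kafka.common.security.plain.PlainLoginModule required "
        f'username="$ConnectionString" password="{EH_CONN_STR}";',
    )
    .load()
)

(
    raw.select(col("value").cast("string").alias("payload"), col("timestamp"))
    .writeStream.format("delta")
    .option("checkpointLocation", "/mnt/lake/checkpoints/meter-readings")
    .start("/mnt/lake/bronze/meter_readings")
)
```
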
2020 - 2021

Senior Data Engineer and Solution Designer

Jemena
  • Developed a reusable data curation and processing framework using PySpark, AWS EMR, Glue, S3, DynamoDB, and Amazon SQS (sketched below).
  • Built a configurable pipeline orchestration framework using Python and Apache Airflow.
  • Created continuous deployment pipelines for automated testing and deployment of infrastructure and data pipelines using AWS CodeCommit, CodePipeline, CodeBuild, and Cloud Development Kit (CDK).
  • Drafted end-to-end data architecture for an AWS-based lakehouse solution utilizing native services and open-source Delta Lake.
Technologies: PySpark, AWS EMR, AWS Glue, Amazon Athena, Apache Airflow, Amazon DynamoDB, Amazon Simple Queue Service (SQS), Docker, AWS CodeCommit, AWS Cloud Development, Python, Jira, Data Architecture, Agile, Amazon Web Services (AWS), ETL, Data Engineering, Containerization, Spark, SQL, Data Governance, PL/SQL
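
A minimal sketch of how that reusable framework's pieces can fit together, assuming a hypothetical DynamoDB table named `pipeline-run-state` and an already-running EMR cluster; all names and S3 URIs are illustrative.

```python
# Illustrative runner: submit a PySpark curation job as an EMR step and
# record its run state in DynamoDB so downstream checks can track it.
from datetime import datetime, timezone

import boto3

emr = boto3.client("emr")
state_table = boto3.resource("dynamodb").Table("pipeline-run-state")  # hypothetical


def submit_curation_step(cluster_id: str, dataset: str, script_s3_uri: str) -> str:
    """Queue one curation step on EMR and persist its submission record."""
    response = emr.add_job_flow_steps(
        JobFlowId=cluster_id,
        Steps=[
            {
                "Name": f"curate-{dataset}",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": [
                        "spark-submit",
                        "--deploy-mode", "cluster",
                        script_s3_uri,
                        "--dataset", dataset,
                    ],
                },
            }
        ],
    )
    step_id = response["StepIds"][0]
    state_table.put_item(
        Item={
            "dataset": dataset,
            "step_id": step_id,
            "status": "SUBMITTED",
            "submitted_at": datetime.now(timezone.utc).isoformat(),
        }
    )
    return step_id
```
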
2019 - 2020

Senior Data Engineer

nbn
  • Developed and productionized data ingestion, transformation, and modeling frameworks, using Confluent Kafka; Spark in Scala; AWS DynamoDB, Lambda, and ECS; and Amazon EMR, EKS, SNS, and SQS.
  • Built a scalable pipeline scheduling framework using Python and Apache Airflow.
  • Designed and developed data consumption patterns using AWS Glue, Athena, Redshift Spectrum, and Tableau.
  • Created a data tagging solution for ensuring data security and traceability.
  • Enhanced infrastructure deployment pipelines built with Jenkins and Terraform for Apache Kafka, ZooKeeper, and Airflow; AWS EMR, S3, ECS, Glue, and DynamoDB; and Amazon SNS and SQS.
  • Designed and assisted in implementing CI/CD processes to deploy canary releases for data ingestion and processing.
  • Optimized the performance of existing Kafka-based data pipelines (sketched below).
Technologies: Apache Spark, Apache Airflow, Amazon DynamoDB, Apache Kafka, Amazon Elastic MapReduce (EMR), Kubernetes, Docker, Amazon Elastic Container Service (Amazon ECS), Amazon Athena, AWS Glue, AWS Simple Notification Service (SNS), Amazon Simple Queue Service (SQS), Amazon CloudWatch, Jenkins, Terraform, Git, Scala, Python, Jira, SQL, Containerization, Amazon Web Services (AWS), PL/SQL
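
These pipelines were written in Scala; as a language-neutral illustration, here is a PySpark sketch of the kind of tuning applied to Kafka-based pipelines, bounding offsets per trigger and raising read parallelism so micro-batches stay predictable. Broker, topic, schema, and sink paths are hypothetical.

```python
# Illustrative tuning of a Kafka-backed Structured Streaming job:
# maxOffsetsPerTrigger caps batch size; minPartitions decouples read
# parallelism from the topic's partition count.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka-ingest").getOrCreate()

schema = StructType(
    [
        StructField("device_id", StringType()),
        StructField("event_time", TimestampType()),
        StructField("status", StringType()),
    ]
)

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")  # hypothetical broker
    .option("subscribe", "network-events")  # hypothetical topic
    .option("maxOffsetsPerTrigger", 500000)
    .option("minPartitions", 48)
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

(
    events.writeStream.format("parquet")
    .option("checkpointLocation", "s3://example-bucket/checkpoints/network-events")
    .option("path", "s3://example-bucket/raw/network_events")
    .start()
)
```
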
2018 - 2019

Senior Consultant – Big Data

Deloitte
  • Designed hybrid data movement, organization, processing, and notification solutions spanning on-premises data lakes in Cloudera and cloud data lakes on Google Cloud Platform.
  • Developed data pipelines for batch and stream processing using StreamSets, Apache Kafka, Pub/Sub, DataFlow, BigQuery, Google's machine learning API, and Twilio.
  • Prepared a big data lake and data warehousing architecture using Azure services, including Azure Data Lake Storage Gen2, Azure Data Factory (ADF), Databricks, PolyBase, Cosmos DB, SQL Database, and Azure SQL Data Warehouse.
  • Built data ingestion and processing pipelines using ADF and Spark.
  • Conducted performance tests on end-to-end data pipelines to establish the suitability of PaaS services for production loads.
  • Designed and developed Type 2 slowly changing dimension (SCD) data sync pipelines using Spark and Spark SQL (sketched below).
  • Created data quality assurance and reconciliation frameworks.
Technologies: Apache Spark, Databricks, Azure SQL Data Warehouse (SQL DW), Cloudera, Azure Data Lake, Azure Data Factory, Data Architecture, Solution Design, Git, Jira, Azure, Python, PySpark, SQL, Data Engineering, Containerization, PL/SQL
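
A minimal sketch of the Type 2 SCD pattern in Spark SQL, assuming hypothetical `staging_customers` (today's extract) and `dim_customer` (the versioned dimension) tables are already registered; column names are illustrative.

```python
# Illustrative Type 2 SCD sync: close off changed rows in the dimension
# and append new versions for changed or brand-new business keys.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("scd2-sync").getOrCreate()

# Rows whose attributes changed: keep them, but close their validity window.
expired = spark.sql("""
    SELECT d.customer_id, d.name, d.segment, d.valid_from,
           current_date() AS valid_to,
           false          AS is_current
    FROM dim_customer d
    JOIN staging_customers s USING (customer_id)
    WHERE d.is_current
      AND (d.name <> s.name OR d.segment <> s.segment)
""")

# New versions: changed keys plus keys never seen before.
new_versions = spark.sql("""
    SELECT s.customer_id, s.name, s.segment,
           current_date()     AS valid_from,
           CAST(NULL AS DATE) AS valid_to,
           true               AS is_current
    FROM staging_customers s
    LEFT JOIN dim_customer d
      ON d.customer_id = s.customer_id AND d.is_current
    WHERE d.customer_id IS NULL
       OR d.name <> s.name OR d.segment <> s.segment
""")

# How the delta is merged back (overwrite by key, Delta MERGE, etc.)
# depends on the storage layer in use.
scd_delta = expired.unionByName(new_versions)
scd_delta.createOrReplaceTempView("dim_customer_delta")
```
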
2007 - 2018

System Implementation Consultant

Techlogix
  • Worked in roles including technical consultant, functional consultant, and team leader for the implementation of PeopleSoft Campus Solutions.
  • Developed operational and statutory reports using Oracle business intelligence tools.
  • Developed dozens of integrations between PeopleSoft systems and third-party products.
  • Developed and automated data migration from legacy systems to PeopleSoft systems.
Technologies: PL/SQL, SQL, Oracle, Microsoft SQL Server, PeopleSoft, Business Analysis, Requirements Analysis, Business Process Analysis, Data Migration, REST APIs, APIs, API Integration, PeopleCode, Oracle BI Publisher, Java, Stakeholder Management, IT Project Management

Experience

Enterprise Data Analytics Platform

I built an enterprise data analytics platform for a leading telecom company to curate, process, and store data from multiple corporate systems in an AWS data lake.

I was part of the team assigned to establish architectural patterns for processing data from relational and non-relational sources and to define data governance frameworks covering data security, classification, ownership, discoverability, and consumption patterns.

During the later stages of the project, I worked on implementing the platform using AWS IAM, Glue, Athena, Lake Formation, ECS, and EMR, along with PySpark, Apache Airflow, and containerization technologies.
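
As one concrete example of the governance side, here is a sketch of the kind of AWS Lake Formation call such patterns automate: granting a consumer role SELECT access to a cataloged table. The account ID, role, database, and table names are hypothetical.

```python
# Illustrative Lake Formation grant: give an analyst role SELECT on a
# governed Glue catalog table. All identifiers are hypothetical.
import boto3

lakeformation = boto3.client("lakeformation")

lakeformation.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/analyst"
    },
    Resource={
        "Table": {
            "DatabaseName": "curated",   # hypothetical Glue database
            "Name": "customer_events",   # hypothetical table
        }
    },
    Permissions=["SELECT"],
)
```
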

Information Management Lakehouse

I was the technical and design lead on a project that curated data from IoT devices across the electricity and gas network, geospatial systems, and relational systems. We modeled that data using the Data Vault methodology and stored it in Azure Data Lake, organized as a lakehouse spanning data lake and warehouse layers to support business intelligence and machine learning use cases.

My responsibilities included creating detailed architectural and design documents for the data platform and reviewing design documents prepared by other team members. I also provided technical guidance to the team for building data pipelines and reviewed other engineers' work to ensure that best practices were followed.
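
A minimal sketch of the Data Vault loading pattern described above, deriving a hub from business keys and a satellite with a hash diff for change detection; the source data, columns, and storage paths are hypothetical.

```python
# Illustrative Data Vault load: a hub keyed by a hashed business key and a
# satellite carrying descriptive attributes plus a hash diff.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, concat_ws, current_timestamp, lit, sha2

spark = SparkSession.builder.appName("dv-load").getOrCreate()

readings = spark.read.parquet("/mnt/lake/staged/meter_readings")  # hypothetical

hub_meter = (
    readings.select("meter_id").distinct()
    .withColumn("hub_meter_hk", sha2(col("meter_id").cast("string"), 256))
    .withColumn("load_ts", current_timestamp())
    .withColumn("record_source", lit("metering_system"))
)

sat_meter_reading = (
    readings.withColumn("hub_meter_hk", sha2(col("meter_id").cast("string"), 256))
    .withColumn(
        "hash_diff",  # change detection across descriptive attributes
        sha2(
            concat_ws(
                "||",
                col("reading_kwh").cast("string"),
                col("quality_flag").cast("string"),
            ),
            256,
        ),
    )
    .withColumn("load_ts", current_timestamp())
    .select("hub_meter_hk", "reading_kwh", "quality_flag", "hash_diff", "load_ts")
)

hub_meter.write.mode("append").parquet("/mnt/lake/raw_vault/hub_meter")
sat_meter_reading.write.mode("append").parquet("/mnt/lake/raw_vault/sat_meter_reading")
```
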

Corporate Data Hub

I created a corporate data hub for a leading online employment marketplace that curates data from various enterprise systems into an AWS-based data warehouse built using lakehouse architecture.

I developed and enhanced the pipeline scheduling, data curation, and processing frameworks using Apache Airflow and Spark; PostgreSQL; REST APIs; Delta Lake; and AWS S3, Glue, Athena, and Lake Formation, applying CDC and Type 2 SCD patterns. I also built data marts for business consumption based on dimensional modeling.
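
A minimal sketch of how CDC records can be applied to a Delta table in such a lakehouse, shown here as a simple upsert keyed on an `op` flag; the production framework also maintained Type 2 history. Paths, keys, and the CDC feed layout are hypothetical.

```python
# Illustrative CDC merge into a Delta table: deletes, updates, and inserts
# dispatched on a hypothetical per-record `op` flag (D/U/I).
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdc-merge").getOrCreate()

changes = spark.read.parquet("s3://example-hub/landing/jobs_cdc")  # hypothetical feed
target = DeltaTable.forPath(spark, "s3://example-hub/curated/jobs")

(
    target.alias("t")
    .merge(changes.alias("c"), "t.job_id = c.job_id")
    .whenMatchedDelete(condition="c.op = 'D'")
    .whenMatchedUpdateAll(condition="c.op = 'U'")
    .whenNotMatchedInsertAll(condition="c.op = 'I'")
    .execute()
)
```
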

Skills

Languages

SQL, Python, Scala, PeopleCode, Java

Frameworks

Apache Spark, AWS EMR

Libraries/APIs

PySpark, REST APIs

Tools

Apache Airflow, Amazon Elastic Container Registry (Amazon ECR), Git, Jira, Amazon Athena, AWS Glue, Terraform, Amazon Elastic MapReduce (EMR), Amazon Elastic Container Service (Amazon ECS), Jenkins, AWS Batch, Oracle BI Publisher

Paradigms

ETL, Agile, Testing, Requirements Analysis

Platforms

Docker, Amazon Web Services (AWS), Azure Event Hubs, Apache Kafka, Azure, Databricks, Azure Functions, Oracle

Storage

Data Pipelines, PL/SQL, Redshift, Amazon S3 (AWS S3), Data Lake Design, PostgreSQL, Microsoft SQL Server

Other

Solution Design, Data Architecture, Data Engineering, AWS Cloud Development, Software Engineering, Delta Lake, Deployment, Azure Data Lake, Data Governance, Containerization, IT Project Management, Azure Databricks, Azure Data Factory, Data Modeling, Azure Blob Storage, Azure Synapse Analytics, Network Data Storage, Data Processing, Data Lakehouse, Azure Monitor, Team Leadership, AWS Lake Formation, Data Warehouse Design, APIs, PeopleSoft, Business Analysis, Business Process Analysis, Data Migration, API Integration, Stakeholder Management, AWS Cloud Architecture, Cloud Infrastructure, Cloud Migration

Education

2003 - 2007

Bachelor's Degree in Computer Science

University of Sargodha - Sargodha, Pakistan

Certifications

MARCH 2023 - PRESENT

AWS Certified Solutions Architect – Professional

AWS

JUNE 2022 - PRESENT

Databricks Certified Data Engineer Professional

Databricks

JULY 2020 - JULY 2023

AWS Certified Data Analytics – Specialty

AWS

FEBRUARY 2020 - PRESENT

DP-200: Implementing an Azure Data Solution

Microsoft

APRIL 2017 - PRESENT

AgilePM Practitioner

APMG International