
Faisal Falah

Verified Expert in Engineering

AWS Data Engineer & Developer

Location
Dubai, United Arab Emirates
Toptal Member Since
May 19, 2021

Faisal is a data engineer with nine years of experience implementing batch and stream data pipelines on both AWS and GCP using their respective cloud services and open-source tools. He's a programmer with a good command of Go and Python and hands-on experience across a variety of data projects, from ETL development for a top global bank's data migration projects, to real-time clickstream processing in AWS, to scaling a recommendation engine in PySpark (EMR) for a leading US real estate provider.

Portfolio

Leading Supply Chain Provider (Data CoE)
Amazon Web Services (AWS), SQL, Python 3
Nissan Digital
Snowflake, Pandas, Apache NiFi, HBase
Brillio
Amazon Web Services (AWS), Spark, Amazon Elastic MapReduce (EMR)

Experience

Availability

Part-time

Preferred Environment

Unix, Amazon Web Services (AWS), PyCharm, Sublime Text 3, SQL, PySpark, Google Cloud Platform (GCP), RabbitMQ

The most amazing...

...thing I've created is a unified data platform in AWS for one of the leading media clients in India. Its architecture was featured on the official AWS blog.

Work Experience

Cloud Data Engineer

2020 - PRESENT
Leading Supply Chain Provider (Data CoE)
  • Implemented a data lakehouse architecture. Batch data was collected from on-premises systems via DataSync, streaming data via RabbitMQ into Kinesis, CRM data via REST APIs, and RDS data via DMS. The lakehouse was built on services such as S3, Glue, and Redshift.
  • Designed the system entirely on serverless AWS services, relying heavily on Step Functions, Lambda, and Docker. Used Athena SQL (INSERT INTO statements) for many of the transformations, which was essential to keep costs to a minimum (a minimal sketch follows this entry).
  • Used open-source Python libraries to automate Jupyter notebooks created by data scientists through a seamless CI/CD pipeline.
Technologies: Amazon Web Services (AWS), SQL, Python 3
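A minimal sketch of that serverless transformation pattern, assuming a hypothetical analytics database with clickstream_raw and clickstream_curated tables and an s3://my-athena-results/ output bucket; in practice, a handler like this would be invoked from a Step Functions state:

    # Hedged sketch: a Lambda handler that runs one Athena INSERT INTO transformation.
    # Database, table, and bucket names here are hypothetical.
    import boto3

    athena = boto3.client("athena")

    TRANSFORM_SQL = """
    INSERT INTO analytics.clickstream_curated
    SELECT user_id, event_name, CAST(event_ts AS timestamp) AS event_ts, dt
    FROM analytics.clickstream_raw
    WHERE dt = '{partition_date}'
    """

    def handler(event, context):
        # Step Functions passes the partition to process, e.g. {"partition_date": "2021-05-19"}.
        response = athena.start_query_execution(
            QueryString=TRANSFORM_SQL.format(partition_date=event["partition_date"]),
            QueryExecutionContext={"Database": "analytics"},
            ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
        )
        # Return the query execution ID so a later state can poll for completion.
        return {"QueryExecutionId": response["QueryExecutionId"]}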

Senior Data Engineer

2019 - 2020
Nissan Digital
  • Served as a data engineer for Nissan car plants in Japan on a project that collected and transformed data generated during final car quality tests conducted by a third-party service provider. Studied the existing architecture and took over the project.
  • Learned Apache NiFi quickly and implemented new modules with it. Initially, the transformed data was stored in HBase; I reworked that part of the architecture to make it faster.
  • Implemented a data pipeline on top of Snowflake using Python and SQL to automate data preparation for several models, with Pandas used for final data checks.
Technologies: Snowflake, Pandas, Apache NiFi, HBase

Technical Specialist

2018 - 2019
Brillio
  • Designed a data pipeline for a recommendation engine using Spark on AWS for a leading US real estate provider. Recommendations for each property were generated with a complex mathematical model involving nested loops.
  • Gained hands-on experience with a large EMR cluster of 80 r4.16xlarge nodes processing roughly 15 TB of data. Output was populated to Elasticsearch, DynamoDB, and S3.
  • Created a PoC for data collection from IoT devices through Logstash, with the data written to S3.
Technologies: Amazon Web Services (AWS), Spark, Amazon Elastic MapReduce (EMR)

Technical Lead

2016 - 2018
Hifx It and Media Services
  • Served as a key member of the data engineering team to create a unified data platform for a leading media house in India.
  • Contributed to major parts of the project, including clickstream analytics for user churn and conversion prediction, and a data lake and warehouse in AWS built with S3, Spark, and Redshift.
  • Created a chatbot that acts as a virtual real estate broker using AWS Lex and deployed it to production. Also created a PoC of a news chatbot using Google Dialogflow.
  • Conducted a flower carpet evaluation using AI: Google Cloud's Vision API extracted features from submitted entries (images), and a multi-class classification (one-vs-rest) model produced the final category.
  • Migrated the critical reporting data warehouse from Redshift to BigQuery for 10-20x better query performance at lower cost, using Redshift UNLOAD to S3 and the GCS Transfer Service to move data from S3 to GCS.
Technologies: Amazon Web Services (AWS), Go, Python, SQL, Spark

Lead Engineer

2014 - 2016
View26 GmbH
  • Contributed to the development of View26, a SaaS solution on AWS that collects and integrates data from software testing tools such as HP ALM and Jira. It was built by a Germany-based startup, and I was in charge of data collection and storage.
  • Created a highly concurrent data collection module in Go, initially storing data in MongoDB and later moving to PostgreSQL for performance and maintainability, which required a solid understanding of both SQL and NoSQL.
  • Collaborated with the front-end team, working mainly with D3 and AngularJS. We also used REST APIs, Mux, and web server frameworks in Go.
  • Contributed to product ideation and other business aspects, as this company was a startup.
Technologies: Amazon Web Services (AWS), Go, MongoDB, SQL, PostgreSQL

ETL Developer

2011 - 2014
Accenture
  • Developed the data migration ETL for Credit Account Information Sharing (CAIS) for a world-leading bank. Informatica and Unix text-processing utilities such as awk and sed were used for data transformation.
  • Performed data integration for the Foreign Account Tax Compliance Act (FATCA), converting complex business logic to SQL for data transformation.
  • Created report designs with a preliminary semantic layer using the SAP reporting tools Crystal Reports and Web Intelligence (WebI).
Technologies: Unix Shell Scripting, Oracle 9i, Informatica ETL

Unified Data Platform (UDP) on AWS

https://aws.amazon.com/solutions/case-studies/malayala-manorama/
Created a centralized data platform with batch and stream pipeline support for a leading media house in India. Major components of the UDP are a data lake on top of S3 (time partitioned in Parquet format) and a data warehouse in Redshift and BigQuery.

Data was collected as events from different properties using SDKs (JavaScript, Android, and iOS), with AWS Kinesis as the message queue. Set up batch ETL using Spark on EMR; used Python, Go, and shell scripts for production table data loads, Apache Airflow for orchestration, and Athena and Redshift Spectrum for ad-hoc queries on top of the data lake.
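A minimal sketch of how the Airflow orchestration for the daily batch ETL might look, assuming Airflow 2.x, an already-running EMR cluster, and a hypothetical cluster ID and script path:

    # Hedged sketch of an Airflow DAG submitting the daily Spark ETL to EMR.
    # The DAG ID, cluster ID, and S3 paths are hypothetical.
    from datetime import datetime, timedelta

    import boto3
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def submit_emr_step(**context):
        """Submit the daily Spark ETL as a step on an existing EMR cluster."""
        emr = boto3.client("emr")
        emr.add_job_flow_steps(
            JobFlowId="j-XXXXXXXXXXXXX",  # hypothetical cluster ID
            Steps=[{
                "Name": "daily-clickstream-etl",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "s3://udp-code/etl/clickstream_etl.py",
                             context["ds"]],  # pass the execution date as the partition
                },
            }],
        )

    with DAG(
        dag_id="udp_daily_batch_etl",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
        default_args={"retries": 1, "retry_delay": timedelta(minutes=10)},
    ) as dag:
        spark_etl = PythonOperator(
            task_id="spark_etl_on_emr",
            python_callable=submit_emr_step,
        )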

Data Lakehouse

Data lake design and implementation in a lakehouse architecture for a multi-billion-dollar supply chain provider. S3 (time partitioned, Parquet format) and Redshift are its main components. AWS services such as DataSync, Lambda, and Step Functions power the batch pipeline. Used Kinesis Data Firehose for near-real-time ingestion, AWS Glue as the catalog, and Amundsen for metadata discovery with OIDC integration via Keycloak. I also used AWS Step Functions for orchestration and SQL on Athena for ETL, and Jupyter notebooks are automated for advanced data transformation and modeling.
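For the automated notebooks mentioned above, a minimal sketch using the open-source papermill library (my assumption for the unnamed library; the notebook paths and parameter are hypothetical):

    # Hedged sketch: executing a parameterized Jupyter notebook with papermill.
    # The notebook paths and the partition_date parameter are hypothetical.
    import papermill as pm

    pm.execute_notebook(
        input_path="s3://lakehouse-notebooks/feature_engineering.ipynb",
        output_path="s3://lakehouse-notebooks/runs/feature_engineering_2021-05-19.ipynb",
        parameters={"partition_date": "2021-05-19"},
    )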

Analytics SaaS Solution on AWS | View26.com

https://view26.com/
View26 was a SaaS analytics product on AWS providing a complete view of testing progress across multiple tools such as HP ALM and Jira.

As a lead engineer at an early-stage startup, I worked hands-on across different systems, mainly designing and implementing a highly concurrent data collection system using Go, MongoDB, PostgreSQL, and Redshift.

Live Clickstream Analytics on Google Cloud

SDKs on different devices send messages to an API endpoint on an App Engine web server, which publishes the events to Pub/Sub topics. The Pub/Sub messages are consumed by Dataflow pipelines written with the Apache Beam Python SDK and inserted into BigQuery.
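A minimal sketch of such a streaming pipeline with the Beam Python SDK; the project, subscription, and table names are hypothetical:

    # Hedged sketch of the Pub/Sub -> Dataflow -> BigQuery streaming pipeline.
    # Project, subscription, and table names are hypothetical.
    import json

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(streaming=True)  # on Dataflow, add --runner=DataflowRunner

    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadEvents" >> beam.io.ReadFromPubSub(
                subscription="projects/my-project/subscriptions/clickstream-sub")
            | "ParseJSON" >> beam.Map(json.loads)
            | "WriteToBQ" >> beam.io.WriteToBigQuery(
                table="my-project:analytics.clickstream_events",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            )
        )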

Chatbot on AWS Lex

Created a chatbot that acts as a virtual real estate broker using AWS Lex. AWS Lambda functions written in Python fulfill the intents, and the existing Solr-based search API is used to fetch relevant results. It was built for a classifieds website and is still running in production.
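A minimal sketch of what an intent-fulfillment Lambda of this kind can look like for Lex V1; the slot names and the search_properties helper (standing in for the Solr-backed search API) are hypothetical:

    # Hedged sketch of a Lex (V1) fulfillment Lambda for a property-search intent.
    # Slot names and the search_properties helper are hypothetical.
    def search_properties(location, max_price):
        """Placeholder for a call to the existing Solr-backed search API."""
        return [{"title": f"2BHK apartment in {location}", "price": max_price}]

    def handler(event, context):
        slots = event["currentIntent"]["slots"]
        results = search_properties(slots.get("Location"), slots.get("MaxPrice"))
        if results:
            message = "Top match: {title} at {price}".format(**results[0])
        else:
            message = "Sorry, I couldn't find any matching properties."
        # Lex V1 expects a dialogAction telling it how to close the conversation.
        return {
            "dialogAction": {
                "type": "Close",
                "fulfillmentState": "Fulfilled",
                "message": {"contentType": "PlainText", "content": message},
            }
        }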

ETL to Data Warehouse

Data migration ETL development for projects such as Credit Account Information Sharing (CAIS) and FATCA for a world-leading bank. Converted complex business logic from mapping documents into SQL. ETL tools such as SAP BODS were used along with Unix text-processing tools like sed and awk in shell scripts. Source data came from Oracle tables; the target was an IBM data warehousing solution on top of Oracle DB.

Redshift to BigQuery Migration

Originally, AWS Redshift was used for data warehousing, but only around a thousand queries ran on the cluster per month. We compared the cost with GCP BigQuery and found it far lower, since BigQuery bills by the data volume scanned. Data was moved from Redshift to S3, then from S3 to GCS, and finally loaded into BigQuery. Beyond the cost savings, complex queries ran 10-20 times faster in BigQuery.
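A minimal sketch of the final load step, assuming the Redshift UNLOAD produced pipe-delimited files; the bucket, dataset, and table names below are hypothetical:

    # Hedged sketch: loading files transferred to GCS into BigQuery.
    # Bucket, dataset, and table names are hypothetical.
    from google.cloud import bigquery

    client = bigquery.Client()

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        field_delimiter="|",  # matches the delimiter used in the Redshift UNLOAD
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
        autodetect=True,
    )

    load_job = client.load_table_from_uri(
        "gs://udp-migration/reporting/orders_*.csv",
        "my-project.reporting.orders",
        job_config=job_config,
    )
    load_job.result()  # wait for the load to complete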

Scaling Recommendation Engine in Spark

Scaled a recommendation engine in Spark for a leading US real estate provider. The original PySpark code ran only for a very limited subset of properties, those currently on sale in the market, and the client wanted to run it for all properties, including off-market ones. The code was difficult to scale because it contained multiple nested loops over all properties, but I did it successfully through Spark tuning and some minor code changes.
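A minimal sketch of the kind of restructuring and tuning involved: replacing per-property loops with a broadcast join and an explicit repartition. The column names, paths, and partition count are hypothetical:

    # Hedged sketch: restructure per-property loops into a broadcast join plus
    # partition-parallel computation. Names and paths are hypothetical.
    from pyspark.sql import SparkSession, functions as F

    spark = (SparkSession.builder
             .appName("property-recommendations")
             .config("spark.sql.shuffle.partitions", "2000")  # tuned for a large cluster
             .getOrCreate())

    properties = spark.read.parquet("s3://realestate-data/properties/")
    features = spark.read.parquet("s3://realestate-data/market_features/")  # small lookup

    scored = (
        properties
        .join(F.broadcast(features), on="zip_code")   # avoid shuffling the large side
        .withColumn("score", F.col("base_score") * F.col("market_factor"))
        .repartition(2000, "zip_code")                # spread work evenly across executors
    )

    scored.write.mode("overwrite").parquet("s3://realestate-data/recommendations/")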

Go Module to Generate Dates

https://github.com/kkfaisal/dates
While doing software testing, especially unit testing for ETL projects, it is tricky to generate dates; for example, we might need random weekend days from the last year. I ran into this problem and created a small Go module for it.

Snowflake Community Blog

https://community.snowflake.com/s/article/PostgreSQL-to-Snowflake-ETL-Steps-to-Migrate-Data
Snowflake published my article on data migration on its official community blog. It is one of many well-received data engineering articles I have written, several of which rank among the top Google search results for their respective topics.

Languages

SQL, Go, Python, Python 3, Snowflake

Tools

Amazon Athena, AWS Step Functions, PyCharm, Apache Airflow, Sublime Text 3, RabbitMQ, Informatica ETL, Amazon Elastic MapReduce (EMR), Apache NiFi, Amazon Lex, Cloud Dataflow, Apache Beam

Platforms

Amazon Web Services (AWS), Unix, AWS Lambda, Google Cloud Platform (GCP), Docker, Amazon EC2, Oracle

Storage

Redshift, Oracle 9i, MongoDB, PostgreSQL, HBase, Amazon S3 (AWS S3), Microsoft SQL Server

Frameworks

Spark

Other

Software Engineering, Unix Shell Scripting, Chatbots, Google BigQuery, Data Warehousing

Libraries/APIs

PySpark, Pandas

Paradigms

ETL

2007 - 2011

Bachelor's Degree in Computer Engineering

Government Engineering College, Kottayam - Kerala, India

DECEMBER 2018 - PRESENT

Databricks Certified Developer for Apache Spark 2.x for Python

Databricks

MAY 2018 - MAY 2020

Google Cloud Certified Professional - Data Engineer

Google Cloud

MARCH 2017 - MARCH 2019

AWS Certified Big Data - Specialty

Amazon Web Services

DECEMBER 2016 - DECEMBER 2018

AWS Certified Solutions Architect – Associate

Amazon Web Services

SEPTEMBER 2015 - PRESENT

C100DEV: MongoDB Certified Developer Associate Exam

MongoDB University
