Aleksei Burenin, Developer in Bangkok, Thailand

Aleksei Burenin

Verified Expert in Engineering

Data Warehouse Architect and Developer

Location
Bangkok, Thailand
Toptal Member Since
November 26, 2020

Aleksei is a highly qualified data warehouse (DWH) architect with more than a decade of experience developing full-stack enterprise standalone and big data DWH solutions. He has extensive technical knowledge and experience in DWH implementations that require attention to detail. A proven team leader with exemplary communication skills and a drive to keep up with industry innovation, Aleksei thrives under pressure to build robust products and deliver projects on time.

Portfolio

Travel Agency
Bash Script, Vertica, quilliup, Data Warehousing, Data Warehouse Design...
Pico Networks, Inc.
Data Science, ETL, Redshift, Data Engineering, Python...
Leroy Merlin (Russia)
Python, Microsoft BI Stack, Oracle, Microsoft SQL Server, Data Warehouse Design...

Experience

Availability

Part-time

Preferred Environment

Hadoop, Spark, Python, Microsoft SQL Server, PostgreSQL, Vertica

The most amazing...

...ETL framework I've built replaced 1,500+ manually built transformations with automated generation and made it easy to migrate from proprietary to open-source ETL tools.

Work Experience

Staff BI Data Architect

2019 - PRESENT
Travel Agency
  • Integrated a brand new in-memory cluster-based BI solution for the customer experience department.
  • Tested different ETL approaches (Spark JDBC, a proprietary loader) and implemented the best variant with the required integration, performance, and monitoring capabilities.
  • Designed CI/CD requirements and built the process of deploying ETL scripts into HDFS using Python; implemented version tracking using Git and TeamCity as a deployment tool.
  • Implemented a backup-and-recovery process along with data, schema changes, and ETL flows (including disaster and recovery tests).
  • Defined metrics for monitoring and logging and integrated with company monitoring solutions: InfluxDB, Telegraf, and Grafana.
  • Built a data quality (DQ) solution to test consistency, freshness, and integrity between the new BI API and Hadoop storage.
  • Constructed a pipeline for collecting data from the internal flow automation system to Hadoop storage using Spark and implemented monitoring and DQ.
  • Built an ETL framework for loading files into the Vertica MPP system, scheduled via Airflow. The framework generated DAGs from YAML configuration files, which control schema, ETL parameters, DQ, and logging.
  • Built a BI uptime tracking solution: Apache Airflow collects usage stats from various storage types and engines, DQ checks validate the results, and the calculated uptime metrics are compared to the KPIs and sent back to the users.
Technologies: Bash Script, Vertica, quilliup, Data Warehousing, Data Warehouse Design, ThoughtSpot, Python, Data Quality, Apache Spark, Data Engineering, Big Data, Apache Airflow, ETL Development, Data Modeling, CI/CD Pipelines, Data Analysis, Scala, Apache Impala, ETL Implementation & Design, Databases, Analytics, Business Intelligence (BI), Integration, PySpark
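For illustration, the uptime KPI in the tracking solution above reduces to the share of periodic availability checks that passed; the check data and the 99% target below are hypothetical, not taken from the actual system:

```python
# Sketch: uptime KPI as the share of periodic availability checks that passed.
# The check results and the 99% target below are hypothetical.

def uptime(checks: list[bool]) -> float:
    """Fraction of checks that succeeded; 0.0 when there are no checks."""
    return sum(checks) / len(checks) if checks else 0.0

checks = [True, True, False, True]  # results of periodic DQ/availability checks
kpi = uptime(checks)                # 0.75
print(f"uptime={kpi:.2%}, meets 99% KPI: {kpi >= 0.99}")
```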

Senior Data Scientist

2022 - 2022
Pico Networks, Inc.
  • Participated in a project to enhance the Content Creator Relationship Management (CRM) system by improving the existing data warehouse (DWH) solution and expanding data sources.
  • Accelerated and optimized critical fact tables in Redshift by implementing materialized views, resulting in a 5-fold increase in query execution speed.
  • Integrated Google Analytics data using BigQuery as a data source, enhancing the depth of analytics available for content creators.
  • Developed advanced analytical reports in Metabase, leveraging the new data source configuration with BigQuery to facilitate future analysis.
  • Designed and deployed an AWS EMR Presto solution to enable multi-data source analysis and reporting. Detailed AWS EMR configuration instructions were documented in Confluence for reference.
  • Implemented secure connections through TLS to safeguard data during transmission and configured Presto as a data source within Metabase, enabling seamless integration of Presto-queryable data into analytical reports.
  • Constructed reports that combined data from BigQuery, Redshift, and PostgreSQL sources, providing a comprehensive view for content creators.
  • Performed a comprehensive risk assessment to identify potential PII data breaches across the company's entire data ecosystem, including Elasticsearch, Redis, S3 storage, MS SQL, and PostgreSQL.
  • Introduced Python probe applications to extract potential PII data, analyze the most vulnerable areas, and visualize the findings within Metabase reports.
  • Recommended best practices to address PII data concerns, including adopting the AWS Glue Data Catalog as a centralized data catalog and implementing data fencing via AWS Lake Formation. These measures aimed to enhance data security and compliance.
Technologies: Data Science, ETL, Redshift, Data Engineering, Python, Amazon Web Services (AWS), Metabase, Data Build Tool (dbt), Big Data, Amazon Elastic MapReduce (EMR), Presto, Integration

Technical Architect

2019 - 2019
Leroy Merlin (Russia)
  • Designed a data model for the ODS, DDS, and DM layers of a data lake.
  • Implemented a monitoring and data quality process of distributed enterprise retail management systems with over 100 retail stores using NiFi. Created a Python application for generating metadata-based NiFi templates.
  • Built a data ingesting pipeline from Microsoft SQL databases to MPP data lake using Debezium, Kafka, and NiFi.
  • Designed NiFi templates for ingesting data from various data sources, including RDBMS, NoSQL, and API sources.
Technologies: Python, Microsoft BI Stack, Oracle, Microsoft SQL Server, Data Warehouse Design, Data Warehousing, Apache Kafka, Greenplum, Apache NiFi, ETL Development, Data Analysis, ETL Implementation & Design, Databases, Integration

Senior Data Warehouse Architect

2018 - 2019
PJSC Mobile TeleSystems (Russia)
  • Designed and implemented a security role model for a Hadoop, MPP, and RDBMS data lake.
  • Reduced the Hive query execution time by implementing LLAP technology. The top 20 most common queries got a 10x speed increase.
  • Designed a logical and physical model for the DDS layer of subscriber event geo entity. Composed all the documentation for ETL and DevOps teams.
  • Designed and implemented a POC for a company website event tracking system using the analytical framework Snowplow. The work included estimating the current and 1-3 year horizon workload, infrastructure planning, procurement, and designing a data model.
Technologies: Data Engineering, Big Data, Apache Spark, Data Analysis, ETL Implementation & Design, Databases, Integration, Data Migration

Business Intelligence Team Lead

2010 - 2017
JSC Europlan (Russia)
  • Designed the architecture and created from scratch a consistent, well-managed enterprise data warehouse (DWH) system that accumulates data from dozens of external sources and meets all customer needs.
  • Developed OLAP cubes (15+ databases). Reports and cubes were made for different internal customers from the leasing, banking, and insurance divisions, including the accounting, finance, risk management, personnel, security, and sales departments.
  • Managed a project that involved migrating part of the existing BI solution to open-source software.
  • Developed a multithreaded Java application to automate the ETL package creation process using only metadata and templates, removing the need to manually create and maintain thousands of transformation packages.
  • Led a team managing a data warehouse (DWH) that became not only a source for enterprise reports but also a source of high-grade, verified data for many other intercorporate systems, allowing us to offload processing from those systems.
Technologies: SQL, Data Warehousing, Data Warehouse Design, ETL, PostgreSQL, Microsoft BI Stack, Microsoft SQL Server, ETL Development, Data Analysis, ETL Implementation & Design, Data Engineering, Databases, Analytics, Business Intelligence (BI), Integration, T-SQL (Transact-SQL)

Airflow Architecture Migration Project for Real Estate

An Airflow migration project that encompassed:

• Conducting an analysis of the current EC2 standalone solution and evaluating its disadvantages.
• Setting up a new AWS EKS cluster to support the updated architecture.
• Configuring Helm charts to deploy the Airflow app within the Kubernetes environment.
• Implementing FluxCD to embrace the GitOps paradigm for managing Kubernetes infrastructure and applications.
• Providing fully hybrid environments:
a) Two cloud-based environments for production and staging.
b) One local environment for DAG development and debugging within an IDE.
c) Docker Compose for automated unit and integration testing.
• Developing unit tests for Airflow DAGs, enabling execution both locally and in CI/CD pipelines.
• Introducing a CI/CD pipeline within the GitLab ecosystem to ensure a proper software development lifecycle (SDLC) with mandatory unit tests during merge requests.
• Securing external access through dynamic DNS provisioning for Kubernetes with Cloudflare for enhanced security.
• Configuring a hybrid logging S3 and local storage.
• Implementing database change management using Liquibase.

The final solution delivered a GitOps approach, scalability, flexible configuration, and proper SDLC support.

Travel Agency Startup | Cloud Data Warehouse Architect

A travel agency startup that required a data analytics solution.

What was done:
• Analyzed data sources for structure and built source-to-target (S2T) mapping.
• Selected target data storage as BigQuery; integrated with customer GCP environment.
• Set up data collectors to extract real-time changes from Firebase to BigQuery RAW schema.
• Created an ETL framework to transform data from RAW schema to the ODS layer (GCP Cloud Scheduled queries).
• Added metadata-based transformation of Firebase document JSON fields to the BigQuery schema.
• Created reporting layer in DWH with recent snapshots and historical data.
• Set up the cloud environment and the Metabase BI tool.
• Migrated manual exports from Retool to five BigQuery-based Metabase dashboards with more than 30 reports.
• Designed data quality policy and processes:
- Freshness checks for each fact table
- Consistency checks between source Firebase collections and target raw tables, implemented as Python Cloud Functions.
- Calculated uptime KPI metric.
• Enabled alerting for DQ checks via Metabase alert feature for the dev team and business users.
• Documented all solutions in Notion.
• Implemented BigQuery cost monitoring via a Cloud Logging sink to BigQuery, with alerting via Metabase question subscriptions.
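As a rough sketch of the freshness checks mentioned above (the function, table timestamps, and SLA thresholds are illustrative, not the deployed Cloud Functions):

```python
# Sketch of a per-table freshness check: a table is stale when its latest load
# timestamp is older than the allowed lag. Thresholds here are illustrative.
from datetime import datetime, timedelta, timezone

def is_fresh(last_loaded_at: datetime, max_lag: timedelta, now: datetime) -> bool:
    """True when the table was loaded within the allowed lag window."""
    return now - last_loaded_at <= max_lag

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
# Loaded 30 minutes ago with a 1-hour SLA: fresh.
print(is_fresh(datetime(2024, 1, 1, 11, 30, tzinfo=timezone.utc), timedelta(hours=1), now))
# Loaded 3 hours ago with a 1-hour SLA: stale.
print(is_fresh(datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc), timedelta(hours=1), now))
```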

Agoda | Implementation of a Data Lake Integration with Customer Service Automation Software

I integrated one of the source systems, customer service automation software built on MongoDB with a reporting REST API, into an enterprise data lake.

Tasks:
• Interacted with vendor representatives to understand how to extract data from the API.
• Analyzed data available in the REST API.
• Designed transactional (ODS) and data mart (DM) data models for storing API data in Hadoop.
• Collaborated with the dev team responsible for system implementation about multi-datacenter HA deployment to consider future ETL and DQ tools.
• Built an ETL Apache Spark-based solution for incremental and full data loads.
• Integrated ETL app metrics to the enterprise monitoring system based on Cassandra and Grafana.
• Built the ETL for transforming data from the transactional layer to business-specific data marts.
• Created Metabase reports for business owners that helped analyze the effectiveness of the flows built by the flow interaction dev team.
• Established data quality processes to check consistency, accuracy, and freshness for each data pipeline stage.

This project created tremendous value for the analytical department, increased agent interaction performance transparency, and enabled real-time analyses.
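The incremental loads mentioned above typically rely on a high-water mark; a minimal sketch, with illustrative field names such as "updated_at" (not the actual source schema), might look like this:

```python
# Sketch of incremental loading via a high-water mark: only rows newer than the
# stored watermark are loaded, and the watermark advances afterwards.
# Field names ("updated_at") are illustrative.

def incremental_rows(rows: list[dict], watermark: str) -> tuple[list[dict], str]:
    """Return the rows newer than the watermark and the new watermark value."""
    new = [r for r in rows if r["updated_at"] > watermark]
    new_mark = max((r["updated_at"] for r in new), default=watermark)
    return new, new_mark

rows = [
    {"id": 1, "updated_at": "2024-01-01"},
    {"id": 2, "updated_at": "2024-02-01"},
]
batch, mark = incremental_rows(rows, "2024-01-15")
print(len(batch), mark)  # only row 2 is new; the watermark moves to 2024-02-01
```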

Real Estate Company | Data Warehouse Solution

I completed a data warehouse solution for a real estate company.

Project milestones:

1. Made the ETL process using Airflow for loading various sources—Amazon RDS MS SQL sources, S3 Flat files (Excel, CSV), and API sources—into the ingesting layer of the data warehouse.
2. Built the data model and ETL DAGs for data layer storage—preprocessing and detailed data storage (DDS)—on PostgreSQL Amazon RDS.
3. Created DQ processes for controlling freshness and consistency between data sources and ingestion layer tables, ingestion layer and preprocessing, and preprocessing and mart layer. This included the data structure for customizing checks, saving aggregated and detailed results, and Airflow DAGs for running these checks.
4. Created the preprocessing layer's adjustment framework, which allowed the following:
• defining datasets for the target table;
• validating changesets before applying changes (Alembic style, but for data);
• reversing changes if required;
• logging and tracking changes for audit purposes.
5. Implemented SCD type 4 for the DDS layer.
6. Built the mart layer for views and materialized views for reporting purposes.
7. Prepared guidelines and built the POC for the migration to Snowflake for future extensions.
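A minimal sketch of the SCD type 4 pattern used for the DDS layer, a current table plus a history table (the structures and timestamps here are illustrative, not the production schema):

```python
# Sketch of SCD type 4: the current version of each row lives in one table,
# and the previous version is archived to a history table on change.
# Keys, attributes, and timestamps are illustrative.

current: dict[int, dict] = {}  # business_key -> current attributes
history: list[dict] = []       # archived prior versions

def upsert(key: int, attrs: dict, ts: str) -> None:
    """Apply a new version; archive the old one if the attributes changed."""
    old = current.get(key)
    if old is not None and old != attrs:
        history.append({"key": key, **old, "valid_to": ts})
    current[key] = attrs

upsert(1, {"city": "Bangkok"}, "2024-01-01")
upsert(1, {"city": "Tver"}, "2024-06-01")   # old version goes to history
print(current[1], len(history))
```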

Agoda | Creation of an ETL Framework for Vertica on Airflow

I created an ETL framework for loading different data sources to the Vertica MPP system. The scheduling was done on Airflow.

The framework generated DAGs from YAML configuration files. A typical pipeline for file storage:
• Grab a file from the remote file storage.
• Push it to a staging table in Vertica with minimal transformation.
• Merge or overwrite into a destination table in Vertica.
• Apply business-specific transformations in Vertica using provided SQL (for the reporting layer).
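As a rough illustration of the YAML-driven generation (the config keys below are hypothetical examples, not the framework's actual schema), a parsed config could be expanded into the ordered task list a generated DAG would contain:

```python
# Sketch: expand a pipeline config (as if parsed from YAML) into the ordered
# task list a generated DAG would contain. All keys ("source", "staging_table",
# "merge_mode", "transform_sql") are hypothetical examples.

def build_pipeline(config: dict) -> list[str]:
    """Return ordered task names for one file-storage pipeline."""
    tasks = [f"fetch_file:{config['source']}"]               # grab remote file
    tasks.append(f"load_staging:{config['staging_table']}")  # minimal-transform load
    tasks.append(f"{config.get('merge_mode', 'merge')}:{config['target_table']}")
    for sql in config.get("transform_sql", []):              # reporting-layer SQL
        tasks.append(f"transform:{sql}")
    return tasks

cfg = {
    "source": "sftp://files/daily.csv",
    "staging_table": "stg.daily",
    "target_table": "dwh.daily",
    "merge_mode": "overwrite",
    "transform_sql": ["report_agg.sql"],
}
print(build_pipeline(cfg))
```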

Supporting features:
• Push data lineage and statistical information to data metric storage.
• Push data freshness values.
• Validate data from the data sources and alert if it is not correct.

Additional task types:
• Run DQ checks to check integrity between data layers and accuracy.
• CI/CD was done via TeamCity.

The created solution was scalable, monitorable, and more robust, and CI/CD enabled tracking of source changes and of the deployment process.
Education

2005 - 2007

Master's Degree in Computer Systems and Networks, Faculty of Automated Systems

Tver State Technical University - Tver, Russia

2001 - 2005

Bachelor's Degree with Honors in Computer Systems and Networks, Faculty of Automated Systems

Tver State Technical University - Tver, Russia

Libraries/APIs

PySpark, Pandas, REST APIs, Liquibase

Tools

Apache Impala, Apache Airflow, BigQuery, Apache NiFi, Git, Vim Text Editor, Grafana, Amazon EKS, AWS ELB, Amazon Elastic MapReduce (EMR)

Paradigms

ETL, ETL Implementation & Design, Business Intelligence (BI), DevOps, Data Science

Storage

Microsoft SQL Server, Apache Hive, PostgreSQL, JSON, Databases, Vertica, Greenplum, MongoDB, Amazon S3 (AWS S3), Redshift

Languages

SQL, T-SQL (Transact-SQL), Python, Bash Script, Python 3, Scala, Pico

Platforms

Microsoft BI Stack, Google Cloud Platform (GCP), Firebase, Apache Kafka, Oracle, CentOS, Amazon Web Services (AWS)

Frameworks

Apache Spark, Hadoop, Spark, Scrapy, Flux, Presto

Other

Data Warehousing, Data Engineering, Data Quality, ETL Development, Data Modeling, Data Warehouse Design, Metabase, Integration, Big Data, ThoughtSpot, Data Analysis, Reporting, CI/CD Pipelines, Cloud, Analytics, Data Migration, quilliup, Computer Science, Amazon RDS, Google Cloud Functions, Catalog Data Entry Services, Web Scraping, Machine Learning, Cloudflare, Data Build Tool (dbt)
