
Aleksei Burenin
Verified Expert in Engineering
Data Warehouse Architect and Developer
Bangkok, Thailand
Toptal member since November 26, 2020
Aleksei is a highly qualified data warehouse (DWH) architect with more than a decade of experience developing full-stack enterprise DWH solutions, both standalone and big data. He has extensive technical knowledge and experience in DWH implementations that require attention to detail. A proven team leader with exemplary communication skills and a drive to keep up with industry innovation, Aleksei thrives under pressure to build robust products and deliver projects on time.
Portfolio
Experience
- ETL - 10 years
- Data Warehousing - 10 years
- SQL - 10 years
- Data Engineering - 10 years
- Apache Spark - 4 years
- Big Data - 4 years
- Apache Hive - 4 years
- Python - 2 years
Availability
Preferred Environment
Hadoop, Spark, Python, Microsoft SQL Server, PostgreSQL, Vertica
The most amazing...
...ETL framework I've built automated the creation of 1,500+ previously hand-built transformations and made it easy to migrate from proprietary to open-source ETL tools.
Work Experience
Staff BI Data Architect
Travel Agency
- Integrated a brand new in-memory cluster-based BI solution for the customer experience department.
- Tested different ETL approaches (Spark JDBC, a proprietary loader) and implemented the best variant with the required integration, performance, and monitoring capabilities.
- Designed CI/CD requirements and built the process of deploying ETL scripts into HDFS using Python; implemented version tracking using Git and TeamCity as a deployment tool.
- Implemented a backup-and-recovery process along with data, schema changes, and ETL flows (including disaster and recovery tests).
- Defined metrics for monitoring and logging and integrated with company monitoring solutions: InfluxDB, Telegraf, and Grafana.
- Built a data quality (DQ) solution to test consistency, freshness, and integrity between the new BI API and Hadoop storage.
- Constructed a pipeline for collecting data from the internal flow automation system to Hadoop storage using Spark and implemented monitoring and DQ.
- Built an ETL framework for loading files into the Vertica MPP system, scheduled via Airflow. The framework generated DAGs from YAML configuration files that controlled the schema, ETL parameters, DQ, and logging (a minimal sketch of this approach follows this list).
- Built a BI uptime tracking solution: Apache Airflow collected usage stats from various storage types and engines, DQ checks validated the results, and the calculated uptime metrics were compared against KPIs and reported back to users.
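
The snippet below is a minimal sketch of the YAML-driven DAG generation described above, assuming a hypothetical config file, keys, and load callable; the production framework also handled schema management, DQ, and logging.

```python
# Minimal sketch of generating Airflow DAGs from a YAML config.
# The config path, keys, and load logic are hypothetical.
from datetime import datetime

import yaml
from airflow import DAG
from airflow.operators.python import PythonOperator


def load_file_to_vertica(source_path: str, target_table: str, **_):
    # Placeholder for the real "file -> staging -> merge" load step.
    print(f"Loading {source_path} into {target_table}")


with open("/opt/airflow/configs/vertica_loads.yaml") as f:  # hypothetical path
    config = yaml.safe_load(f)

for pipeline in config["pipelines"]:
    dag = DAG(
        dag_id=f"vertica_load_{pipeline['name']}",
        schedule_interval=pipeline.get("schedule", "@daily"),
        start_date=datetime(2023, 1, 1),
        catchup=False,
    )
    PythonOperator(
        task_id="load_file",
        python_callable=load_file_to_vertica,
        op_kwargs={
            "source_path": pipeline["source_path"],
            "target_table": pipeline["target_table"],
        },
        dag=dag,
    )
    globals()[dag.dag_id] = dag  # expose the generated DAG to the Airflow scheduler
```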
Senior Data Scientist
Pico Networks, Inc.
- Participated in a project aimed at enhancing the Content Creator Relationship Management (CRM) system by improving the existing data warehouse (DWH) solution and expanding its data sources.
- Accelerated and optimized critical fact tables in Redshift by implementing materialized views, resulting in a 5-fold increase in query execution speed.
- Integrated Google Analytics data using BigQuery as a data source, enhancing the depth of analytics available for content creators.
- Developed advanced analytical reports in Metabase, leveraging the new data source configuration with BigQuery to facilitate future analysis.
- Designed and deployed an AWS EMR Presto solution to enable multi-data source analysis and reporting. Detailed AWS EMR configuration instructions were documented in Confluence for reference.
- Implemented secure connections through TLS to safeguard data during transmission and configured Presto as a data source within Metabase, enabling seamless integration of Presto-queryable data into analytical reports.
- Constructed reports that combined data from BigQuery, Redshift, and PostgreSQL sources, providing a comprehensive view for content creators.
- Performed a comprehensive risk assessment to identify potential PII data breaches across the company's entire data ecosystem, including Elasticsearch, Redis, S3 storage, MS SQL, and PostgreSQL.
- Introduced Python probe applications to extract potential PII data, analyze the most vulnerable areas, and visualize the findings within Metabase reports (a simplified probe is sketched after this list).
- Recommended best practices to address PII data concerns, including adopting the AWS Glue Data Catalog as a centralized data catalog and implementing data fencing via AWS Lake Formation. These measures aimed to enhance data security and compliance.
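
A simplified sketch of such a PII probe for PostgreSQL, with hypothetical connection details and detection patterns; the real probes also covered Elasticsearch, Redis, S3, and MS SQL, and their findings were surfaced in Metabase.

```python
# Simplified PII probe: sample text columns and flag values matching PII patterns.
# The DSN, sample size, and patterns are assumptions.
import re

import psycopg2

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s()-]{7,}\d"),
}

conn = psycopg2.connect("dbname=appdb user=probe password=secret host=db.internal")
with conn:
    with conn.cursor() as cur:
        # Find candidate text columns to sample.
        cur.execute(
            """
            SELECT table_schema, table_name, column_name
            FROM information_schema.columns
            WHERE data_type IN ('text', 'character varying')
              AND table_schema NOT IN ('pg_catalog', 'information_schema')
            """
        )
        columns = cur.fetchall()

    for schema, table, column in columns:
        with conn.cursor() as cur:
            cur.execute(f'SELECT "{column}" FROM "{schema}"."{table}" LIMIT 1000')
            values = [row[0] for row in cur.fetchall() if row[0] is not None]
        for name, pattern in PII_PATTERNS.items():
            hits = sum(1 for value in values if pattern.search(str(value)))
            if hits:
                # In production, these counts fed a reporting table behind Metabase.
                print(f"{schema}.{table}.{column}: {hits} potential {name} values")
conn.close()
```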
Technical Architect
Leroy Merlin (Russia)
- Designed a data model for the ODS, DDS, and DM layers of a data lake.
- Implemented a monitoring and data quality process for distributed enterprise retail management systems covering over 100 retail stores using NiFi. Created a Python application for generating metadata-based NiFi templates.
- Built a data ingestion pipeline from Microsoft SQL Server databases to an MPP data lake using Debezium, Kafka, and NiFi (the shape of the change events is illustrated after this list).
- Designed NiFi templates for ingesting data from various data sources, including RDBMS, NoSQL, and API sources.
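
The production pipeline was built with Debezium, Kafka, and NiFi; the Python consumer below is only an illustration, under assumed topic and broker names, of the Debezium change events that flow through such a pipeline.

```python
# Illustration only: the real routing into the data lake was done in NiFi.
# The topic name and broker address are assumptions.
import json

from kafka import KafkaConsumer  # kafka-python

consumer = KafkaConsumer(
    "mssql.sales.orders",  # Debezium change-event topic (name is hypothetical)
    bootstrap_servers="kafka.internal:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    auto_offset_reset="earliest",
)

for message in consumer:
    payload = message.value.get("payload", {})
    op = payload.get("op")          # "c" insert, "u" update, "d" delete, "r" snapshot read
    before = payload.get("before")  # row state before the change (None for inserts)
    after = payload.get("after")    # row state after the change (None for deletes)
    # NiFi routed these records into the MPP data lake; here we just print them.
    print(op, before, after)
```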
Senior Data Warehouse Architect
PJSC Mobile TeleSystems (Russia)
- Designed and implemented a security role model for the Hadoop, MPP, and RDBMS data lake.
- Reduced the Hive query execution time by implementing LLAP technology. The top 20 most common queries got a 10x speed increase.
- Designed a logical and physical model for the DDS layer of subscriber event geo entity. Composed all the documentation for ETL and DevOps teams.
- Designed and implemented a POC for a company website event tracking system using the analytical framework Snowplow. The work included estimating the current and 1-3 horizon workloads, planning infrastructure and procurement, and designing a data model.
Business Intelligence Team Lead
JSC Europlan (Russia)
- Designed the architecture and built from scratch a consistent, well-managed enterprise data warehouse (DWH) system that accumulates data from dozens of external sources and meets all customer needs.
- Developed OLAP cubes (15+ databases). Reports and cubes served internal customers across the leasing, banking, and insurance divisions, including the accounting, finance, risk management, personnel, security, and sales departments.
- Managed a project that involved migrating part of the existing BI solution to open-source software.
- Developed a multithreaded Java application to automate ETL package creation using only metadata and templates, eliminating the need to manually create and maintain thousands of transformation packages (the approach is illustrated after this list).
- Led the team managing a data warehouse (DWH) that became not only the source for enterprise reports but also a source of high-grade, verified data for many other internal corporate systems, which allowed processing to be offloaded from those systems.
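
The original package generator was a multithreaded Java application; the snippet below is only a simplified Python illustration of the metadata-and-template idea, with hypothetical metadata, template, and output paths.

```python
# Simplified illustration of metadata- and template-driven generation of
# transformation packages; the template, metadata, and output path are hypothetical.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from string import Template

PACKAGE_TEMPLATE = Template(
    "-- auto-generated package for $table\n"
    "INSERT INTO dwh.$table\n"
    "SELECT $columns FROM staging.$table;\n"
)

# In the real system, this metadata came from a repository rather than a literal.
TABLES = [
    {"table": "customers", "columns": "id, name, created_at"},
    {"table": "contracts", "columns": "id, customer_id, amount"},
]


def generate_package(meta: dict) -> Path:
    out = Path("generated") / f"{meta['table']}.sql"
    out.parent.mkdir(exist_ok=True)
    out.write_text(PACKAGE_TEMPLATE.substitute(meta))
    return out


# Generate packages in parallel, mirroring the multithreaded Java original.
with ThreadPoolExecutor(max_workers=8) as pool:
    for path in pool.map(generate_package, TABLES):
        print("wrote", path)
```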
Experience
Airflow Architecture Migration Project for Real Estate
• Conducting an analysis of the current EC2 standalone solution and evaluating its disadvantages.
• Setting up a new AWS EKS cluster to support the updated architecture.
• Configuring Helm charts to deploy the Airflow app within the Kubernetes environment.
• Implementing FluxCD to embrace the GitOps paradigm for managing the Kubernetes infrastructure and applications.
• Providing fully hybrid environments:
a) Two cloud-based environments for production and staging.
b) One local environment for DAG development and debugging within an IDE.
c) Docker Compose for automated unit testing and integration tests.
• Developing unit tests for Airflow DAGs, enabling execution both locally and in CI/CD pipelines (a minimal pytest sketch follows this list).
• Introducing a CI/CD pipeline within the GitLab ecosystem to ensure a proper software development lifecycle (SDLC) with mandatory unit tests during merge requests.
• Securing external access through dynamic DNS provisioning for Kubernetes with Cloudflare for enhanced security.
• Configuring hybrid logging across S3 and local storage.
• Implementing database change management using Liquibase.
The final solution followed a GitOps approach and provided scalability, flexible configuration, and proper SDLC support.
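
A minimal pytest sketch of the DAG unit tests mentioned above, based on the standard DagBag import check; the DAG folder path is an assumption, and the real suite also covered task-level logic.

```python
# Minimal DAG validation tests, runnable locally and in CI/CD.
# The dag_folder path is an assumption.
import pytest
from airflow.models import DagBag


@pytest.fixture(scope="session")
def dag_bag():
    return DagBag(dag_folder="dags/", include_examples=False)


def test_no_import_errors(dag_bag):
    # Every DAG file must import cleanly.
    assert dag_bag.import_errors == {}


def test_every_dag_has_tasks(dag_bag):
    for dag_id, dag in dag_bag.dags.items():
        assert len(dag.tasks) > 0, f"{dag_id} has no tasks"
```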
Travel Agency Startup | Cloud Data Warehouse Architect
What was done:
• Analyzed data sources for structure and built S2T mapping.
• Selected target data storage as BigQuery; integrated with customer GCP environment.
• Set up data collectors to extract real-time changes from Firebase to BigQuery RAW schema.
• Created an ETL framework to transform data from RAW schema to the ODS layer (GCP Cloud Scheduled queries).
• Added metadata-based transformation of Firebase document JSON fields into the BigQuery schema.
• Created reporting layer in DWH with recent snapshots and historical data.
• Engineered the cloud setup for the Metabase BI tool.
• Migrated manual exports from Retool to five BigQuery-based Metabase dashboards with more than 30 reports.
• Designed the data quality policy and processes (a freshness-check sketch follows this list):
- Freshness checks for each fact table.
- Consistency checks between the source Firebase collections and the target raw tables, implemented as Python Cloud Functions.
- A calculated uptime KPI metric.
• Enabled alerting for DQ checks via the Metabase alert feature for the dev team and business users.
• Documented all solutions in Notion.
• Implemented BigQuery cost monitoring via a Cloud Logging sink to BigQuery, with alerting via Metabase question subscriptions.
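
A minimal sketch of one of the freshness checks against BigQuery, with hypothetical project, table, and threshold values; in production, the results fed a DQ table that Metabase alerted on.

```python
# Freshness check sketch; the project, table, column, and threshold are assumptions.
from datetime import datetime, timedelta, timezone

from google.cloud import bigquery

client = bigquery.Client(project="my-project")

FRESHNESS_SQL = """
    SELECT MAX(updated_at) AS last_update
    FROM `my-project.ods.bookings`
"""

row = next(iter(client.query(FRESHNESS_SQL).result()))
lag = datetime.now(timezone.utc) - row.last_update

if lag > timedelta(hours=6):
    # In production, this result was written to a DQ table that Metabase alerts on.
    raise RuntimeError(f"ods.bookings is stale: last update {row.last_update} ({lag} ago)")
print(f"Freshness OK: last update {row.last_update}")
```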
Agoda | Implementation of a Data Lake Integration with Customer Service Automation Software
Tasks:
• Interacted with vendor representatives to understand how to extract data from the API.
• Analyzed data available in the REST API.
• Designed transactional (ODS) and data mart (DM) data models for storing API data in Hadoop.
• Collaborated with the dev team responsible for the system implementation on the multi-datacenter HA deployment, taking future ETL and DQ tooling into account.
• Built an Apache Spark-based ETL solution for incremental and full data loads (the incremental pattern is sketched after this project description).
• Integrated ETL app metrics to the enterprise monitoring system based on Cassandra and Grafana.
• Built the ETL for transforming data from the transactional layer into business-specific data marts.
• Created Metabase reports for business owners, which helped analyze the effectiveness of the flows built by the flow interaction dev team.
• Established data quality processes to check consistency, accuracy, and freshness for each data pipeline stage.
This project created tremendous value for the analytical department, increased agent interaction performance transparency, and enabled real-time analyses.
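
A sketch of the incremental Spark load pattern mentioned above; the API endpoint, parameters, and target Hive table are assumptions, not the actual vendor API.

```python
# Incremental load sketch: pull records changed since the last watermark and
# append them to the ODS layer. Endpoint, params, and table names are hypothetical.
from datetime import datetime

import requests
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder.appName("cs_api_incremental_load")
    .enableHiveSupport()
    .getOrCreate()
)

TARGET_TABLE = "ods.cs_interactions"

# Watermark: the latest change already loaded into the ODS layer.
watermark = (
    spark.table(TARGET_TABLE).agg(F.max("updated_at")).collect()[0][0]
    or datetime(1970, 1, 1)
)

# Fetch only records changed since the watermark.
resp = requests.get(
    "https://api.vendor.example/v1/interactions",
    params={"updated_since": watermark.isoformat()},
    timeout=60,
)
resp.raise_for_status()
records = resp.json()["items"]

if records:
    df = spark.createDataFrame(records)
    df.write.mode("append").saveAsTable(TARGET_TABLE)
```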
Real Estate Company | Data Warehouse Solution
Project milestones:
1. Built the ETL process in Airflow for loading various sources (Amazon RDS MS SQL databases, S3 flat files in Excel and CSV formats, and APIs) into the ingestion layer of the data warehouse.
2. Built the data model and ETL DAGs for the data storage layers, preprocessing and detailed data storage (DDS), on Amazon RDS for PostgreSQL.
3. Created DQ processes for controlling freshness and consistency between data sources and ingestion layer tables, ingestion layer and preprocessing, and preprocessing and mart layer. This included the data structure for customizing checks, saving aggregated and detailed results, and Airflow DAGs for running these checks.
4. Created the preprocessing layer's adjustment framework, which allowed the following:
• defining datasets for a target table;
• validating changesets before applying them (Alembic-style, but for data);
• reversing changes if required;
• logging and tracking changes for audit purposes.
5. Implemented SCD Type 4 for the DDS (sketched after this list).
6. Built the mart layer with views and materialized views for reporting purposes.
7. Prepared guidelines and built the POC for the migration to Snowflake for future extensions.
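
A minimal sketch of the SCD Type 4 pattern (a current table plus a separate history table); the table names, columns, and DSN are hypothetical, and the production version ran as an Airflow task.

```python
# SCD Type 4 sketch: changed rows are archived into a history table before the
# current table is upserted. Tables, columns, and the DSN are assumptions.
import psycopg2

APPLY_SCD4 = """
-- Archive rows that are about to change into the history table.
INSERT INTO dds.customer_history (customer_id, name, segment, valid_to)
SELECT c.customer_id, c.name, c.segment, now()
FROM dds.customer_current c
JOIN stg.customer s USING (customer_id)
WHERE (c.name, c.segment) IS DISTINCT FROM (s.name, s.segment);

-- Upsert the latest state into the current table.
INSERT INTO dds.customer_current (customer_id, name, segment, updated_at)
SELECT customer_id, name, segment, now()
FROM stg.customer
ON CONFLICT (customer_id) DO UPDATE
SET name = EXCLUDED.name,
    segment = EXCLUDED.segment,
    updated_at = EXCLUDED.updated_at;
"""

with psycopg2.connect("dbname=dwh user=etl host=rds.internal") as conn:
    with conn.cursor() as cur:
        cur.execute(APPLY_SCD4)
```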
Agoda | Creation of an ETL Framework for Vertica on Airflow
The created framework could generate DAGs based on YAML configuration files. A typical pipeline for file storage (the staging-and-merge step is sketched at the end of this section):
• Grab a file from the remote file storage.
• Push it into a staging table in Vertica with minimal transformation.
• Merge or overwrite into a destination table in Vertica.
• Apply business-specific transformations in Vertica using the provided SQL (for the reporting layer).
Supporting features:
• Pushing data lineage and statistical information to the data metric storage.
• Pushing data freshness values.
• Validating source data and alerting when it is incorrect.
Additional task types:
• Running DQ checks to verify integrity and accuracy between data layers.
CI/CD was done via TeamCity.
The resulting solution was scalable, monitorable, and more robust, and CI/CD enabled tracking of source changes and the deployment process.
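
A sketch of the staging-and-merge step into Vertica referenced above; the connection details, table names, and file path are assumptions.

```python
# Staging-and-merge sketch for Vertica; connection details, tables, and the
# file path are hypothetical.
import vertica_python

MERGE_SQL = """
MERGE INTO dwh.bookings t
USING stg.bookings s
ON t.booking_id = s.booking_id
WHEN MATCHED THEN UPDATE SET amount = s.amount, status = s.status
WHEN NOT MATCHED THEN INSERT (booking_id, amount, status)
     VALUES (s.booking_id, s.amount, s.status)
"""

conn_info = {
    "host": "vertica.internal",
    "port": 5433,
    "user": "etl",
    "password": "secret",
    "database": "dwh",
}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    cur.execute("TRUNCATE TABLE stg.bookings")
    # Load the grabbed file into the staging table with minimal transformation.
    with open("/tmp/bookings.csv") as f:
        cur.copy("COPY stg.bookings FROM STDIN DELIMITER ',' SKIP 1", f)
    # Merge the staged rows into the destination table.
    cur.execute(MERGE_SQL)
    conn.commit()
```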
Education
Master's Degree in Computer Systems and Networks, Faculty of Automated Systems
Tver State Technical University - Tver, Russia
Bachelor's Degree with Honors in Computer Systems and Networks, Faculty of Automated Systems
Tver State Technical University - Tver, Russia
Skills
Libraries/APIs
PySpark, Pandas, REST APIs, Liquibase
Tools
Apache Impala, Apache Airflow, BigQuery, Apache NiFi, Git, Vim Text Editor, Grafana, Amazon EKS, AWS ELB, Amazon Elastic MapReduce (EMR)
Languages
SQL, T-SQL (Transact-SQL), Python, Bash Script, Python 3, Scala, Pico
Paradigms
ETL, ETL Implementation & Design, Business Intelligence (BI), DevOps
Platforms
Microsoft BI Stack, ThoughtSpot, Google Cloud Platform (GCP), Firebase, Apache Kafka, Oracle, CentOS, Amazon Web Services (AWS)
Storage
Microsoft SQL Server, Apache Hive, PostgreSQL, JSON, Databases, Vertica, Greenplum, MongoDB, Amazon S3 (AWS S3), Redshift
Frameworks
Apache Spark, Hadoop, Spark, Scrapy, Flux, Presto
Other
Data Warehousing, Data Engineering, Data Quality, ETL Development, Data Modeling, Data Warehouse Design, Metabase, Integration, Big Data, Data Analysis, Reporting, CI/CD Pipelines, Cloud, Analytics, Data Migration, quilliup, Computer Science, Amazon RDS, Google Cloud Functions, Catalog Data Entry Services, Web Scraping, Machine Learning, Cloudflare, Data Science, Data Build Tool (dbt)