Pavol Knapek, Developer in Berlin, Germany

Pavol Knapek

Verified Expert in Engineering

Data Engineer and Systems Developer

Location
Berlin, Germany
Toptal Member Since
July 20, 2021

Pavol is a passionate data engineer and integrator who helps businesses manage their products and data. For the past 10+ years, he has focused on a wide range of integration approaches, internal and external databases, row and column stores, SQL/NoSQL databases, OLTP/OLAP use cases, and everything that touches data. Pavol decommissioned an obsolete Amazon Redshift DWH and established an alternative solution using relational databases, column stores, and S3 storage.

Portfolio

Emplifi (Acquired Socialbakers)
Spark, Redshift, Presto, EMR, RabbitMQ, Amazon Kinesis, PostgreSQL, MongoDB...
Dashmote B.V.
Python, Spark, Apache Spark, Distributed Computing, Distributed Systems...
Socialbakers
Spark, Redshift, Presto, EMR, RabbitMQ, Amazon Kinesis, PostgreSQL, MongoDB...

Experience

Availability

Part-time

Preferred Environment

Amazon Web Services (AWS), Linux, Slack, Git

The most amazing...

...project I've developed was an internal ETL-DWH-reporting stack, built from top to bottom, that enabled internal customers to grow the company's business by 150%.

Work Experience

Senior Data Platform Engineer

2021 - 2023
Emplifi (Acquired Socialbakers)
  • Designed the architecture of an internal big data platform, leading to better data democratization across the whole company, including its core products.
  • Implemented efficient Spark pipelines on top of social media data.
  • Led an internal data engineering education group for active knowledge sharing.
  • Built a custom and efficient checkpoint manager for using Amazon S3 as a source for Spark Structured Streaming pipelines (see the sketch after this entry).
  • Owned SQL lectures designed for people from non-technical departments at the company.
Technologies: Spark, Redshift, Presto, EMR, RabbitMQ, Amazon Kinesis, PostgreSQL, MongoDB, Amazon DynamoDB, Amazon S3 (AWS S3), Python, Java, Scala, Machine Learning, Streaming, Databricks, Apache Airflow, Data Engineering, Big Data, NoSQL, Delta Lake, ETL, Data Pipelines, PySpark, Apache Spark, Data Architecture, Docker, Amazon Elastic MapReduce (EMR), Amazon Web Services (AWS), Solution Architecture, Data Build Tool (dbt), ELT, Databases, PL/SQL, Stored Procedure, API Design, Amazon RDS, Big Data Architecture, Data Lakes, AWS Glue, Data Warehousing, Apache Kafka, AWS Lambda, Amazon Athena, Amazon EC2, Relational Databases, Message Queues, Data Transformation, Data Migration, Performance Tuning, T-SQL (Transact-SQL)
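
The checkpoint manager mentioned above follows the same idea as a Structured Streaming file source: each run should pick up only the S3 objects that have not been processed yet. Below is a minimal sketch of that idea as a simple incremental batch read, with a JSON ledger of processed keys; bucket, prefix, and key names are hypothetical, and this is not the actual Emplifi implementation.

    # A minimal sketch, assuming a hypothetical bucket layout and a JSON
    # checkpoint object; not the original Emplifi checkpoint manager.
    import json

    import boto3
    from pyspark.sql import SparkSession

    BUCKET = "example-social-data"                        # hypothetical bucket
    PREFIX = "incoming/"                                  # hypothetical input prefix
    CHECKPOINT_KEY = "_checkpoints/processed_keys.json"   # hypothetical checkpoint object

    s3 = boto3.client("s3")
    spark = SparkSession.builder.appName("s3-incremental-read").getOrCreate()

    def load_checkpoint() -> set:
        """Return the set of S3 keys that were already processed."""
        try:
            body = s3.get_object(Bucket=BUCKET, Key=CHECKPOINT_KEY)["Body"].read()
            return set(json.loads(body))
        except s3.exceptions.NoSuchKey:
            return set()

    def save_checkpoint(keys: set) -> None:
        """Persist the full set of processed keys back to S3."""
        s3.put_object(Bucket=BUCKET, Key=CHECKPOINT_KEY,
                      Body=json.dumps(sorted(keys)).encode("utf-8"))

    processed = load_checkpoint()
    # Pagination is omitted for brevity; list_objects_v2 returns at most 1,000 keys per call.
    listing = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
    new_keys = [obj["Key"] for obj in listing.get("Contents", []) if obj["Key"] not in processed]

    if new_keys:
        paths = [f"s3://{BUCKET}/{key}" for key in new_keys]
        df = spark.read.json(paths)  # read only the not-yet-processed objects
        df.write.mode("append").parquet(f"s3://{BUCKET}/bronze/events")
        save_checkpoint(processed | set(new_keys))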

Big Data Solution Architect

2021 - 2022
Dashmote B.V.
  • Migrated raw Python applications into dockerized PySpark applications on EMR.
  • Optimized most of the internal Spark pipelines to achieve better cost efficiency.
  • Introduced Databricks to bring the platform in line with more modern platform designs.
  • Built resilient pipelines for data enrichment using internally developed ML models.
  • Led a series of internal workshops aimed at best practices, specifically for more junior engineers (Git, Python, PySpark, SQL, CI/CD, and software testing).
  • Implemented internal Airflow operators for triggering EMR Serverless jobs (see the sketch after this entry).
  • Built a generic, metadata-driven internal tool for efficient, transactional data migration from Amazon S3 to PostgreSQL.
  • Cooperated with software and data engineers based in Amsterdam and Shanghai.
Technologies: Python, Spark, Apache Spark, Distributed Computing, Distributed Systems, Amazon Elastic MapReduce (EMR), Amazon Elastic Container Service (Amazon ECS), Mentorship & Coaching, Team Mentoring, Troubleshooting, Teamwork, ETL, Data Pipelines, PySpark, Data Architecture, Docker, Amazon Web Services (AWS), Big Data, Solution Architecture, Databricks, Data Build Tool (dbt), ELT, Databases, Web Scraping, API Design, Amazon RDS, Terraform, Big Data Architecture, Data Lakes, AWS Glue, Data Warehousing, Amazon Athena, Amazon EC2, Relational Databases, Data Transformation, Data Migration, Performance Tuning, Snowflake, T-SQL (Transact-SQL)
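
The internal Airflow operators mentioned above can be sketched as a thin wrapper around the boto3 emr-serverless client that starts a Spark job run and polls it to completion. The operator and argument names below are hypothetical illustrations, not Dashmote's internal code.

    # A minimal sketch of a custom Airflow operator for EMR Serverless Spark jobs.
    import time

    import boto3
    from airflow.models import BaseOperator


    class EmrServerlessSparkJobOperator(BaseOperator):  # hypothetical operator name
        def __init__(self, application_id: str, execution_role_arn: str,
                     entry_point: str, **kwargs):
            super().__init__(**kwargs)
            self.application_id = application_id
            self.execution_role_arn = execution_role_arn
            self.entry_point = entry_point  # s3:// path to the PySpark script

        def execute(self, context):
            client = boto3.client("emr-serverless")
            run = client.start_job_run(
                applicationId=self.application_id,
                executionRoleArn=self.execution_role_arn,
                jobDriver={"sparkSubmit": {"entryPoint": self.entry_point}},
            )
            job_run_id = run["jobRunId"]
            while True:  # poll until the job reaches a terminal state
                state = client.get_job_run(
                    applicationId=self.application_id, jobRunId=job_run_id
                )["jobRun"]["state"]
                if state == "SUCCESS":
                    return job_run_id
                if state in ("FAILED", "CANCELLED"):
                    raise RuntimeError(f"EMR Serverless job ended in state {state}")
                time.sleep(30)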

Data Platform Engineer

2018 - 2021
Socialbakers
  • Initiated the process of designing the architecture of the internal big data platform.
  • Established content-enrichment pipelines by applying internally developed AI/ML models in both batch and real-time streaming scenarios.
  • Built an internal tool to support better CI/CD on Databricks projects.
  • Deployed Airflow on internal infrastructure and established internal guidelines.
  • Implemented a generic streaming framework in plain Python (see the sketch after this entry).
Technologies: Spark, Redshift, Presto, EMR, RabbitMQ, Amazon Kinesis, PostgreSQL, MongoDB, Amazon DynamoDB, Amazon S3 (AWS S3), Python, Java, Scala, Machine Learning, Streaming, Databricks, Apache Airflow, Data Engineering, Big Data, NoSQL, ETL, Data Pipelines, PySpark, Apache Spark, Data Architecture, Docker, Amazon Web Services (AWS), ELT, Databases, PL/SQL, Stored Procedure, API Design, Amazon RDS, R, Big Data Architecture, Data Lakes, Data Warehousing, Amazon Athena, Amazon EC2, Relational Databases, Message Queues, Data Transformation, Data Migration, Performance Tuning, T-SQL (Transact-SQL), ETL Tools
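
The plain-Python streaming framework mentioned above can be illustrated with a minimal consume-enrich-acknowledge loop over RabbitMQ, which is part of the stack listed for this role. Queue and function names are hypothetical; the real framework was more general than this sketch.

    # A minimal sketch of a plain-Python streaming loop over RabbitMQ (pika).
    import json

    import pika


    def enrich(event: dict) -> dict:
        """Placeholder enrichment step (e.g., where an AI/ML model would be applied)."""
        event["enriched"] = True
        return event


    def run_consumer(queue: str = "content-events") -> None:  # hypothetical queue name
        connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
        channel = connection.channel()
        channel.queue_declare(queue=queue, durable=True)

        def on_message(ch, method, properties, body):
            result = enrich(json.loads(body))
            print(result)  # a real sink would publish or persist the enriched event
            ch.basic_ack(delivery_tag=method.delivery_tag)  # ack only after processing

        channel.basic_consume(queue=queue, on_message_callback=on_message)
        channel.start_consuming()


    if __name__ == "__main__":
        run_consumer()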

Back-end Platform Developer | Front-end R&D Developer

2017 - 2018
Edvisor
  • Improved the interaction between the front end and back end by migrating the traditional REST architecture into a GraphQL back end.
  • Implemented generic endpoints for data reporting purposes.
  • Built a React widget pluggable on B2B clients' websites.
  • Evangelized data-related technologies to the development team during internal lunch-and-learn sessions.
Technologies: JavaScript, MariaDB, Node.js, Sentry, AngularJS, React, Jira, Keen.io, SQL, GraphQL, REST, Data Architecture, Docker, Amazon Web Services (AWS), Databases, API Design, Amazon EC2, Relational Databases

Data Engineer and Integrator

2013 - 2017
Socialbakers
  • Maintained an integration solution by using Pentaho ETL, Java, and Node.js.
  • Built the foundations of an internal DWH solution (a traditional Kimball dimensional model with dimension and fact tables).
  • Integrated Salesforce, Mixpanel, and Zendesk with our main SaaS products.
  • Implemented internal REST APIs for product, integration, and reporting uses.
  • Migrated parts of the internal DWH to Redshift and evangelized to the team the advantages of columnar-store database engines (see the sketch after this entry).
Technologies: JavaScript, ETL, Pentaho, PostgreSQL, PHP, RabbitMQ, MongoDB, Data Engineering, Big Data, NoSQL, Data Pipelines, Data Architecture, Docker, Amazon Web Services (AWS), ELT, Databases, PL/SQL, Stored Procedure, Web Scraping, API Design, Amazon RDS, Big Data Architecture, Data Lakes, Data Warehousing, Amazon EC2, Relational Databases, Message Queues, Data Transformation, Data Migration, Performance Tuning, T-SQL (Transact-SQL), ETL Tools
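
The Redshift migration above relied on the usual column-store loading pattern: stage data in S3 and load it with a parallel COPY. Below is a minimal sketch of that pattern issued through psycopg2; all hosts, credentials, tables, buckets, and IAM roles are hypothetical placeholders, not the original integration code.

    # A minimal sketch of loading a staged fact table into Redshift via COPY.
    import psycopg2

    conn = psycopg2.connect(
        host="example-cluster.abc123.eu-west-1.redshift.amazonaws.com",
        port=5439,
        dbname="dwh",
        user="etl_user",
        password="***",
    )

    copy_sql = """
        COPY dwh.fact_interactions
        FROM 's3://example-dwh-staging/fact_interactions/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-copy-role'
        FORMAT AS PARQUET;
    """

    with conn, conn.cursor() as cur:
        cur.execute(copy_sql)  # Redshift ingests the staged Parquet files in parallel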

Junior Big Data Specialist (Internship)

2013 - 2013
IBM
  • Was selected for a summer internship, where I entered the world of big data.
  • Worked on a scientific comparison exploring the use of GPFS instead of HDFS.
  • Completed a wide range of time management and soft skills training.
Technologies: HDFS, GPFS, SQL, IBM BigInsights, Apache Hive, Apache Pig, Big Data, Data Pipelines, Data Architecture, Databases, Relational Databases

Java Developer and Integrator

2010 - 2012
Zitec
  • Designed and implemented an integration architecture using SOAP and REST protocols (ERP, CRM, POS terminals, eCommerce, and BI).
  • Customized ADempiere, an open-source ERP system, to match our client's needs.
  • Managed an on-premise server using VMware ESXi and multiple virtual Linux server instances.
Technologies: Java, PostgreSQL, MySQL, ADempiere, iDempiere ERP, VMware ESXi, VirtualBox, CentOS, Linux, JBoss, GlassFish, Apache Tomcat, ETL, Data Architecture, Databases, Relational Databases

Data Lake at Socialbakers

An AWS S3-based data lake. The data format was a combination of Databricks Delta and raw, organized Parquet files, with file organization based on logical partitioning and bucketing. Input came from various data sources: Salesforce, SOAP services, internal databases (PostgreSQL, MongoDB, DynamoDB, etc.), internal and external REST endpoints, and streams and message queues. Databricks-hosted Spark and pure Python were used for the ETL processes. An AWS-managed PrestoDB was used to access the data in the lake, which required an internal Hive Metastore (HMS). We used Tableau for dashboarding and visualizations and Apache Airflow for job orchestration.
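
A minimal sketch of the layout described above: logically partitioned Parquet written to S3 with Spark and read back with partition pruning. Paths and column names are illustrative, and the actual pipelines also used Databricks Delta.

    # A minimal sketch of the S3 data lake layout: partitioned Parquet, written
    # and read back with Spark. Paths and columns are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("data-lake-sketch").getOrCreate()

    raw = spark.read.json("s3://example-lake/raw/posts/")   # hypothetical raw input

    (
        raw.withColumn("dt", F.to_date("created_at"))       # derive the partition column
        .write.mode("overwrite")
        .partitionBy("dt")                                   # logical partitioning by date
        .parquet("s3://example-lake/curated/posts/")
    )

    # Downstream consumers (Spark jobs, or Presto via the Hive Metastore) can then
    # prune partitions instead of scanning the whole dataset.
    recent = spark.read.parquet("s3://example-lake/curated/posts/").where(
        F.col("dt") >= "2021-01-01"
    )
    recent.show()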

Languages

SQL, Java, Python, JavaScript, Stored Procedure, T-SQL (Transact-SQL), Scala, GraphQL, PHP, Snowflake, C++, R

Frameworks

Spark, Presto, AngularJS, Apache Spark

Libraries/APIs

Node.js, PySpark, React

Tools

Git, RabbitMQ, VirtualBox, Apache Tomcat, Apache Airflow, AWS Glue, Amazon Athena, Jira, ADempiere, iDempiere ERP, Amazon Elastic MapReduce (EMR), Terraform, Slack, Sentry, Amazon Elastic Container Service (Amazon ECS)

Paradigms

Agile Software Development, REST, ETL, Distributed Computing

Platforms

Amazon Web Services (AWS), Linux, Pentaho, Databricks, Docker, Apache Kafka, AWS Lambda, Amazon EC2, CentOS, JBoss, Apache Pig

Storage

Redshift, PostgreSQL, Amazon S3 (AWS S3), MySQL, Data Pipelines, Databases, PL/SQL, Data Lakes, Relational Databases, MongoDB, MariaDB, HDFS, Apache Hive, Amazon DynamoDB, GPFS, NoSQL

Other

Processing & Threading, Software Architecture, OOP Designs, Amazon Kinesis, Streaming, Data Engineering, Big Data, Data Architecture, Solution Architecture, ELT, Web Scraping, API Design, Amazon RDS, Big Data Architecture, Data Warehousing, Message Queues, Data Transformation, Data Migration, Performance Tuning, ETL Tools, Mathematics, EMR, VMware ESXi, GlassFish, Data Build Tool (dbt), Machine Learning, Keen.io, IBM BigInsights, Delta Lake, Distributed Systems, Mentorship & Coaching, Team Mentoring, Troubleshooting, Teamwork

Education

2011 - 2013

Master's Degree in Web and Software Engineering, Focused on Information Systems and Management

Czech Technical University in Prague - Prague, Czech Republic

2008 - 2011

Bachelor's Degree in Informatics

Slovak University of Technology in Bratislava - Bratislava, Slovak Republic

Certifications

MARCH 2017 - PRESENT

Machine Learning Foundations: A Case Study Approach

University of Washington | via Coursera
