Pavol Knapek, Data Engineer and Systems Developer in Berlin, Germany

Member since July 20, 2021
Pavol is a passionate data engineer and integrator who helps businesses manage their products and data. For the past 10+ years, Pavol has focused on a wide range of integration approaches, internal and external databases, row and column stores, SQL/NoSQL databases, OLTP/OLAP use cases, and everything that touches data. Pavol decommissioned an obsolete AWS Redshift DWH and established an alternative solution using relational databases, column-stores, and S3 storage.

Portfolio

  • Emplifi (Acquired Socialbakers)
    Spark, Redshift, Presto DB, EMR, RabbitMQ, AWS Kinesis, PostgreSQL, MongoDB...
  • Socialbakers
    Spark, Redshift, Presto DB, EMR, RabbitMQ, AWS Kinesis, PostgreSQL, MongoDB...
  • Edvisor
    JavaScript, MariaDB, Node.js, Sentry, AngularJS, React, Jira, Keen.io, SQL...

Location

Berlin, Germany

Availability

Full-time

Preferred Environment

Amazon Web Services (AWS), Linux, Slack, Git

The most amazing...

...project I've developed was an internal ETL-DWH reporting application, built from top to bottom, that enabled internal customers to grow the company's business by 150%.

Employment

  • Senior Data Engineer

    2021 - PRESENT
    Emplifi (Acquired Socialbakers)
    • Built a data lake that integrates data across the whole company and enables any internal client to ask any data question.
    • Decommissioned an obsolete AWS Redshift DWH (for cost efficiency) and established an alternative solution using relational databases, column-stores, and S3 storage.
    • Established and maintained an internal framework for real-time and near-real-time streaming of social-media data and the application of machine-learning models.
    Technologies: Spark, Redshift, Presto DB, EMR, RabbitMQ, AWS Kinesis, PostgreSQL, MongoDB, Amazon DynamoDB, Amazon S3 (AWS S3), Python, Java, Scala, Machine Learning, Streaming, Databricks, Apache Airflow, Data Engineering, Big Data, NoSQL
  • Data Engineer

    2018 - 2021
    Socialbakers
    • Established an internal data-engineering educational group.
    • Led the development of an internal ETL-DWH reporting application for internal customers.
    • Led an integration of Mixpanel and Salesforce, which significantly increased the effectiveness of sales personnel.
    • Built an internal CLI tool for integrating a Databricks workspace with local development and a Git versioning system.
    Technologies: Spark, Redshift, Presto DB, EMR, RabbitMQ, AWS Kinesis, PostgreSQL, MongoDB, Amazon DynamoDB, Amazon S3 (AWS S3), Python, Java, Scala, Machine Learning, Streaming, Databricks, Apache Airflow, Data Engineering, Big Data, NoSQL
  • Full-stack Developer

    2017 - 2018
    Edvisor
    • Built a React widget, which was then deployed at Kaplan International Languages school.
    • Delivered a new internal back-end GraphQL platform for the interaction of the front end and databases, which then replaced the old REST platform.
    • Experimented with the integration of AngularJS and React and introduced a way to incrementally migrate the main product to React.
    Technologies: JavaScript, MariaDB, Node.js, Sentry, AngularJS, React, Jira, Keen.io, SQL, GraphQL, REST
  • Systems and Data Integrator

    2013 - 2017
    Socialbakers
    • Designed the architecture of the internal DWH solution according to Kimball's best practices, using open-source technologies including a heavily optimized PostgreSQL as the ROLAP DWH and Pentaho for ETL/BI.
    • Built an internal Salesforce widget that showed health metrics per Salesforce account.
    • Maintained old PHP processes for product integration and replaced them with a more stable ETL solution.
    Technologies: JavaScript, ETL, Pentaho, PostgreSQL, PHP, RabbitMQ, MongoDB, Data Engineering, Big Data, NoSQL
  • Junior Big Data Specialist (Internship)

    2013 - 2013
    IBM
    • Entered the world of big data and dug through many big-data technologies (both closed- and open-source).
    • Prepared an extensive comparison of data-transformation processes' efficiency and performance when using GPFS over HDFS.
    • Completed a wide range of time-management and soft-skills training courses.
    Technologies: HDFS, GPFS, SQL, IBM BigInsights, Apache Hive, Apache Pig, Big Data
  • Java Developer and Integrator

    2010 - 2012
    Zitec
    • Designed and implemented a complex integration platform, which integrated a large volume of point-of-sale (POS) terminals, a central ERP system (ADempiere), an open-source CRM system (SugarCRM), and an open-source eCommerce platform (Magento).
    • Implemented and customized an open-source ERP system (ADempiere) according to client demands. This solution was implemented using a standard open-source stack, including PostgreSQL, JBoss, and Java (J2EE on the back end; Swing and ZK on the front end).
    • Built an effective reporting platform using a set of ETL processes and the Palo multidimensional database (CUDA-accelerated), which was used as a base for an internal BI solution.
    Technologies: Java, PostgreSQL, MySQL, ADempiere, iDempiere ERP, VMware ESXi, VirtualBox, CentOS, Linux, JBoss, GlassFish, Apache Tomcat

Experience

  • Data Lake at Socialbakers

    An AWS S3-based data lake. The data format was a combination of Databricks Delta and raw, organized Parquet files, and the file organization included logical partitioning and bucketing. I used various input data sources: Salesforce, SOAP, internal databases (PostgreSQL, MongoDB, DynamoDB, etc.), internal and external REST endpoints, and streams and message queues. Databricks-hosted Spark and pure Python were used for the ETL processes. An AWS-managed PrestoDB was used for accessing the data in the data lake, which required an internal Hive Metastore (HMS). We used Tableau for dashboarding and visualizations and Apache Airflow for the orchestration of jobs.
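
    As a rough illustration of the logical partitioning and bucketing mentioned above, the sketch below builds an object key in pure Python. It is only an assumption of what such a layout could look like: the function name, the `event_date` partition column, the `account_id` bucketing key, the bucket count, and the file-naming scheme are all hypothetical and do not describe the actual data lake.

```python
import zlib
from datetime import date


def object_key(table: str, event_date: date, account_id: str,
               num_buckets: int = 16) -> str:
    """Return a Hive-style partitioned, bucketed key (hypothetical layout)."""
    # Logical partition: one "column=value" directory per calendar day,
    # so a query filtered by date only needs to read the matching directories.
    partition = f"event_date={event_date.isoformat()}"
    # Bucket: a stable hash of the bucketing key modulo the bucket count,
    # so rows for the same account always land in the same bucket file.
    bucket = zlib.crc32(account_id.encode("utf-8")) % num_buckets
    return f"{table}/{partition}/bucket-{bucket:05d}.parquet"


if __name__ == "__main__":
    # Example with made-up values.
    print(object_key("posts", date(2021, 3, 5), "acct-42"))
```

    A stable hash such as CRC32 is used here instead of Python's built-in `hash()`, which is randomized per process for strings and would therefore not give repeatable bucket assignments.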

Skills

  • Languages

    SQL, Java, Python, JavaScript, Scala, GraphQL, PHP, C++
  • Frameworks

    Spark, Presto DB, AngularJS
  • Libraries/APIs

    Node.js, React
  • Tools

    Git, RabbitMQ, VirtualBox, Apache Tomcat, Apache Airflow, Jira, ADempiere, iDempiere ERP, Slack, Sentry
  • Paradigms

    Agile Software Development, REST, ETL
  • Platforms

    Amazon Web Services (AWS), Linux, AWS Kinesis, Pentaho, Databricks, CentOS, JBoss, Apache Pig
  • Storage

    Redshift, PostgreSQL, Amazon S3 (AWS S3), MySQL, MongoDB, MariaDB, HDFS, Apache Hive, Amazon DynamoDB, GPFS, NoSQL
  • Other

    Processing & Threading, Software Architecture, OOP Designs, Streaming, Data Engineering, Mathematics, EMR, VMware ESXi, GlassFish, Machine Learning, Keen.io, IBM BigInsights, Big Data

Education

  • Master's Degree in Web and Software Engineering, Focused on Information Systems and Management
    2011 - 2013
    Czech Technical University in Prague - Prague, Czech Republic
  • Bachelor's Degree in Informatics
    2008 - 2011
    Slovak University of Technology in Bratislava - Bratislava, Slovak Republic

Certifications

  • Machine Learning Foundations: A Case Study Approach
    MARCH 2017 - PRESENT
    University of Washington | via Coursera
