Gabor Hermann, Developer in Hilversum, Netherlands

Gabor Hermann

Verified Expert in Engineering

Data Engineer and Developer

Location
Hilversum, Netherlands
Toptal Member Since
September 16, 2022

Gabor has eight years of wide-ranging experience with data. Highlights from his career include developing a large-scale stream processing framework, implementing machine learning algorithms, orchestrating data pipelines, creating and maintaining ETL jobs, measuring customer interactions, and supporting data scientists and analysts with engineering.

Portfolio

Bol.com
Python, SQL, Java, Unix, Git, Docker, Apache Maven, GitLab CI/CD, Kubernetes...
bol.com
Apache Kafka, Apache Flink, Yarn, Hadoop, HDFS, Java, Spring, Apache Avro...

Experience

Availability

Full-time

Preferred Environment

Linux, macOS

The most amazing...

...project was setting up the infrastructure for a recommendation system's replacement, resulting in six-figure cost savings and eight-figure additional revenue per year.

Work Experience

Senior Data Engineer

2020 - PRESENT
Bol.com
  • Established the infrastructure and optimized a replacement of a recommendation system, resulting in estimated six-figure cost savings and eight-figure additional revenue per year.
  • Helped an analytics team set up the deployment, scheduling, and monitoring of analytics pipelines (dbt, Airflow). Used Docker to make the local development environment easier to set up for analysts.
  • Introduced Apache Airflow, cutting the time to take a machine learning prototype to production from a few weeks to a few days.
  • Introduced Site Reliability Engineering (SRE) practices, reducing time spent on maintenance.
Technologies: Python, SQL, Java, Unix, Git, Docker, Apache Maven, GitLab CI/CD, Kubernetes, Apache Airflow, Spark, Flink, Apache Kafka, Apache Beam, Spring Boot, Google Cloud Platform (GCP), BigQuery, Google Cloud Storage, Google Bigtable, Google Cloud Dataproc, Google Cloud Dataflow, Data Build Tool (dbt), Data Pipelines, Data Engineering
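The core idea behind scheduling such pipelines with Airflow is expressing them as a directed acyclic graph of dependent steps. A minimal sketch of that dependency ordering in plain Python (the step names below are hypothetical; the production pipelines ran as Airflow DAGs, not this code):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical steps of a recommendation pipeline: each key is a task,
# each value the set of tasks that must finish before it can run.
deps = {
    "extract_clicks": set(),
    "preprocess": {"extract_clicks"},
    "train_model": {"preprocess"},
    "load_recommendations": {"train_model"},
}

# An Airflow scheduler resolves the same kind of graph before running tasks.
order = list(TopologicalSorter(deps).static_order())
print(order)
# → ['extract_clicks', 'preprocess', 'train_model', 'load_recommendations']
```

In a real Airflow DAG the same dependencies would be declared between operator instances (e.g., with `>>`), and the scheduler would also handle retries, backfills, and monitoring.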

Data Engineer

2018 - 2020
Bol.com
  • Built and maintained scalable data pipelines for a recommendation system processing terabytes of data daily. Oversaw the pipeline that received data from the data warehouse, preprocessed it for ML, and loaded recommendations into a database for serving.
  • Worked with a team to create and maintain a recommendation-serving service handling 5,000 requests per second with 99.9% availability and under 15 ms p99 latency (Java Spring, Kubernetes, Prometheus, and Grafana).
  • Introduced PySpark to a team to be used by data scientists. Led migration from Apache Pig jobs on a Hadoop (YARN) cluster to PySpark jobs on Google Cloud Dataproc.
Technologies: Python, SQL, Java, Unix, Git, Docker, Apache Maven, GitLab CI/CD, Kubernetes, Apache Airflow, Spark, Apache Beam, Spring Boot, Google Cloud Platform (GCP), BigQuery, Google Cloud Storage, Google Bigtable, Google Cloud Dataproc, Google Cloud Dataflow, Apache Pig, Data Pipelines, Data Engineering, Bash

Data Engineer

2017 - 2018
bol.com
  • Worked with a team to develop and maintain click data collection from a Java Spring back-end service to Kafka. The daily volume of collected data was in the terabyte range.
  • Built a simple Avro schema registry backed by Kafka to support schema evolution.
  • Worked on a system loading the data from Kafka to Parquet files on HDFS to be consumed by data analysts and data scientists.
Technologies: Apache Kafka, Apache Flink, YARN, Hadoop, HDFS, Java, Spring, Apache Avro, Data Pipelines, Data Engineering
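A schema registry of the kind described assigns a version to each registered schema per subject, so producers and consumers can evolve schemas independently. A minimal in-memory sketch (all names are illustrative; the real implementation was backed by Kafka and stored Apache Avro schemas):

```python
# Minimal in-memory schema registry sketch. A production registry would
# persist versions to a Kafka topic and validate Avro compatibility rules.
class SchemaRegistry:
    def __init__(self):
        self._schemas = {}  # subject -> list of schema versions

    def register(self, subject, schema):
        """Append a new schema version and return its 1-based version id."""
        versions = self._schemas.setdefault(subject, [])
        versions.append(schema)
        return len(versions)

    def latest(self, subject):
        return self._schemas[subject][-1]

    def get(self, subject, version):
        return self._schemas[subject][version - 1]

registry = SchemaRegistry()
# Hypothetical Avro schemas: v2 adds an optional field, a backward-compatible change.
v1 = registry.register("clicks", {
    "type": "record", "name": "Click",
    "fields": [{"name": "url", "type": "string"}],
})
v2 = registry.register("clicks", {
    "type": "record", "name": "Click",
    "fields": [{"name": "url", "type": "string"},
               {"name": "ts", "type": ["null", "long"], "default": None}],
})
```

Consumers can then fetch the schema version a message was written with and resolve it against the latest one, which is what makes schema evolution safe.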

Software Engineer

2014 - 2017
Informatics Laboratory, SZTAKI (Hungarian Academy of Sciences)
  • Collaborated on the development of the first prototype of the Apache Flink Streaming API.
  • Led a small team developing distributed machine learning algorithms on Apache Flink and Spark.
  • Completed a six-month research internship at the Database Group (DIMA) at TU Berlin (February to July 2016).
  • Taught Apache Flink and Spark at the National Polytechnic Institute (IPN) in Mexico City, Mexico.
Technologies: Flink, Spark, Bash, Java, Scala, Apache Maven, Hadoop, YARN, Apache Kafka, Data Pipelines, Data Engineering

Migrating a Recommendation System to the Cloud

I worked within a team of data scientists and data engineers, leading the setup of the infrastructure and tooling for migrating a recommendation system to Google Cloud Platform. The legacy system consisted of Hadoop/Pig jobs and a Java service backed by Cassandra.

We migrated it to BigQuery and PySpark jobs running on Dataproc, scheduled with Airflow and served from Kubernetes backed by Bigtable. We also set up a new CI/CD pipeline in GitLab CI/CD, and I introduced SRE practices for service maintenance.

I was mostly in charge of establishing the infrastructure and tooling, which included GitLab CI/CD, Docker images, deployment with Kubernetes, creating clusters to run PySpark on, setting up permissions, monitoring services, scheduling with Airflow, and so on.

Implementing Distributed Machine Learning Algorithms

https://github.com/gaborhermann/flink-parameter-server
I led a small team that implemented distributed machine learning algorithms as part of a research project. Using Scala, we implemented matrix factorization (iALS, DSGD) on top of Apache Flink and Spark. I also implemented the Parameter Server architecture on top of Apache Flink, which makes many machine learning algorithms easier to implement (e.g., matrix factorization and the passive-aggressive classifier).
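The core update rule behind SGD-based matrix factorization can be sketched on a single machine. The project's actual implementations were distributed Scala jobs on Flink and Spark; the data and hyperparameters below are made up for illustration:

```python
import random

# Tiny single-machine sketch of matrix factorization with SGD:
# approximate a ratings matrix R as P @ Q^T with low-rank factors.
random.seed(0)
K, LR, REG = 2, 0.05, 0.01  # rank, learning rate, L2 regularization

# (user, item, rating) observations of a small made-up matrix
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0)]
P = {u: [random.uniform(-0.1, 0.1) for _ in range(K)] for u in range(3)}
Q = {i: [random.uniform(-0.1, 0.1) for _ in range(K)] for i in range(3)}

def loss():
    return sum((r - sum(pu * qi for pu, qi in zip(P[u], Q[i]))) ** 2
               for u, i, r in ratings)

before = loss()
for _ in range(200):  # SGD epochs
    for u, i, r in ratings:
        err = r - sum(pu * qi for pu, qi in zip(P[u], Q[i]))
        for k in range(K):
            pu, qi = P[u][k], Q[i][k]
            # Gradient step on the regularized squared error
            P[u][k] += LR * (err * qi - REG * pu)
            Q[i][k] += LR * (err * pu - REG * qi)
after = loss()
```

In DSGD the same updates are partitioned into independent blocks so workers can apply them in parallel, and a Parameter Server centralizes the shared factors that workers read and update.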

Establishing Analytics Engineering Practices

I established the analytics engineering practices for an analytics team.

This included setting up data testing and coaching analysts on how to verify their assumptions. Implementing CI/CD made the development cycle faster; this included SQL code-style checks, query validation, fast Docker image builds, and automatic deployment.

I also made the local development environment easier to set up and use with Docker. I then set up the scheduling of pipelines with Airflow and implemented pipeline monitoring by loading dbt metadata into the data warehouse.
Education

2013 - 2017

Bachelor's Degree in Computer Science

Eötvös Loránd University - Budapest, Hungary

Libraries/APIs

PySpark

Tools

GitLab CI/CD, Apache Airflow, BigQuery, Git, Flink, Google Cloud Dataproc, Apache Maven, Apache Beam, Apache Avro

Languages

Python, SQL, Java, Bash, Haskell, Scala

Frameworks

Spark, Spring Boot, Hadoop, YARN, Spring

Platforms

Docker, Linux, Unix, Kubernetes, Google Cloud Platform (GCP), macOS, Apache Kafka, Apache Flink, Apache Pig

Paradigms

ETL

Storage

Google Cloud Storage, Google Bigtable, Data Pipelines, Databases, HDFS

Other

Data Build Tool (dbt), Google Cloud Dataflow, Algorithms, Networks, Linear Algebra, Calculus, Numerical Methods, Discrete Mathematics, Linguistics, Machine Learning, Data Engineering
