Gabor Hermann
Verified Expert in Engineering
Data Engineer and Developer
Hilversum, Netherlands
Toptal member since September 16, 2022
Gabor has eight years of wide-ranging experience with data. A few highlights from Gabor's career include developing a large-scale stream processing framework, implementing machine learning algorithms, orchestrating data pipelines, creating and maintaining ETL jobs, measuring customer interactions, and supporting data scientists and analysts with engineering.
Portfolio
Experience
- Java - 8 years
- Python - 4 years
- SQL - 4 years
- Docker - 4 years
- GitLab CI/CD - 4 years
- Kubernetes - 4 years
- Google Cloud Platform (GCP) - 4 years
- Data Build Tool (dbt) - 1 year
Preferred Environment
Linux, macOS
The most amazing...
...project was setting up the infrastructure for a recommendation system's replacement, resulting in 6-figure cost savings and 8-figure additional revenue/year.
Work Experience
Senior Data Engineer
Bol.com
- Established the infrastructure and optimized a replacement of a recommendation system, resulting in estimated six-figure cost savings and eight-figure additional revenue per year.
- Helped an analytics team set up the deployment, scheduling, and monitoring of analytics pipelines (dbt, Airflow). Used Docker to make the local development environment easier to set up for analysts.
- Introduced Apache Airflow, which cut the time to take a machine learning prototype to production from a few weeks to a few days.
- Introduced Site Reliability Engineering (SRE) practices, reducing maintenance time.
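The Docker-based local setup for analysts mentioned above could look something like the following docker-compose sketch; the image tag, mounted paths, and adapter are placeholders, not the actual bol.com configuration:

```yaml
# Hypothetical docker-compose.yml: one container with dbt and the project
# mounted, so analysts only need Docker installed to get a working setup.
services:
  dbt:
    image: ghcr.io/dbt-labs/dbt-bigquery:1.7.0    # assumed adapter/version
    working_dir: /project
    volumes:
      - .:/project                                 # dbt project from the repo
      - ~/.config/gcloud:/root/.config/gcloud:ro   # reuse local GCP credentials
    entrypoint: ["dbt"]
```

With a setup along these lines, an analyst runs models with a single command such as `docker compose run dbt run --select my_model`, without installing Python or dbt locally.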
Data Engineer
Bol.com
- Built and maintained scalable data pipelines for a recommendation system processing TBs of data daily. Oversaw the pipeline that pulled data from the data warehouse, preprocessed it for ML, and loaded recommendations into a database for serving.
- Worked with a team to create and maintain a service that serves recommendations at 5,000 requests per second with 99.9% availability and a p99 latency under 15 ms (Java Spring, Kubernetes, Prometheus, and Grafana).
- Introduced PySpark to a team to be used by data scientists. Led migration from Apache Pig jobs on a Hadoop (YARN) cluster to PySpark jobs on Google Cloud Dataproc.
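SLOs like the latency and availability figures above are typically tracked from Prometheus histograms. A sketch of the kind of queries involved, assuming the service exports a `http_server_requests_seconds` histogram (the metric and label names here are assumptions, not the actual setup):

```promql
# p99 request latency over the last 5 minutes, aggregated across pods
histogram_quantile(
  0.99,
  sum(rate(http_server_requests_seconds_bucket[5m])) by (le)
)

# availability: share of non-5xx responses
sum(rate(http_server_requests_seconds_count{status!~"5.."}[5m]))
  / sum(rate(http_server_requests_seconds_count[5m]))
```

Queries like these can back both Grafana dashboards and alerting rules, so a latency regression is visible before it breaches the SLO.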
Data Engineer
Bol.com
- Worked with a team to develop and maintain click data collection from a Java Spring back-end service to Kafka. The daily amount of collected data was in the TB range.
- Built a simple Avro schema registry backed by Kafka to support schema evolution.
- Worked on a system loading the data from Kafka to Parquet files on HDFS to be consumed by data analysts and data scientists.
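A schema registry like the one above keeps every version of a subject's Avro schema and hands out stable IDs so producers and consumers can evolve independently. A minimal sketch of the idea, with an in-memory dict standing in for the Kafka topic that would make the store durable (names and details are illustrative, not the actual implementation):

```python
import hashlib
import json

class SchemaRegistry:
    """Minimal schema registry: stores Avro schemas per subject and assigns
    each distinct schema a stable ID. A dict stands in for the compacted
    Kafka topic that would back the store in practice."""

    def __init__(self):
        self._store = {}        # (subject, schema_id) -> canonical schema JSON
        self._by_subject = {}   # subject -> schema_ids, in registration order

    def register(self, subject, schema):
        # Canonicalize so formatting differences don't create new versions.
        canonical = json.dumps(schema, sort_keys=True)
        schema_id = hashlib.sha256(canonical.encode()).hexdigest()[:8]
        if (subject, schema_id) not in self._store:
            self._store[(subject, schema_id)] = canonical
            self._by_subject.setdefault(subject, []).append(schema_id)
        return schema_id

    def latest(self, subject):
        last_id = self._by_subject[subject][-1]
        return json.loads(self._store[(subject, last_id)])

registry = SchemaRegistry()
# v2 adds an optional field with a default -- a backward-compatible change.
v1 = {"type": "record", "name": "Click",
      "fields": [{"name": "url", "type": "string"}]}
v2 = {"type": "record", "name": "Click",
      "fields": [{"name": "url", "type": "string"},
                 {"name": "ts", "type": ["null", "long"], "default": None}]}
id1 = registry.register("clicks", v1)
id2 = registry.register("clicks", v2)
```

Messages on Kafka then only need to carry the small schema ID instead of the full schema, and consumers resolve the ID against the registry to decode old and new records alike.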
Software Engineer
Informatics Laboratory, SZTAKI (Hungarian Academy of Sciences)
- Collaborated on the development of the first prototype of the Apache Flink Streaming API.
- Led a small team developing distributed machine learning algorithms on Apache Flink and Spark.
- Worked in a research internship at the Database Group (DIMA) at TU Berlin for six months (February to July 2016).
- Taught Apache Flink and Spark at the National Polytechnic Institute (IPN) in Mexico City, Mexico.
Experience
Migrating a Recommendation System to the Cloud
We migrated the recommendation system to BigQuery and PySpark jobs running on Dataproc, scheduled with Airflow and serving on Kubernetes backed by Bigtable. We also set up a new CI/CD pipeline in GitLab CI/CD, and I introduced SRE practices for service maintenance.
I was mostly in charge of establishing the infrastructure and tooling, which included GitLab CI/CD, Docker images, deployment with Kubernetes, creating clusters to run PySpark on, setting up permissions, monitoring services, scheduling with Airflow, and so on.
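The build-and-deploy part of that tooling could be sketched as a two-stage GitLab CI/CD pipeline like the following; job names, images, and the deployment name are placeholders (only the `$CI_*` variables are real GitLab predefined variables), not the actual configuration:

```yaml
# Hypothetical .gitlab-ci.yml: build a Docker image, then roll it out to Kubernetes.
stages: [build, deploy]

build-image:
  stage: build
  image: docker:latest
  services: [docker:dind]
  script:
    - docker build -t "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA" .
    - docker push "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"

deploy:
  stage: deploy
  image: bitnami/kubectl:latest
  script:
    - kubectl set image deployment/recommender app="$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"
```

Tagging images with the commit SHA keeps every deployment traceable back to the exact source revision, which also makes rollbacks a one-line change.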
Implementing Distributed Machine Learning Algorithms
https://github.com/gaborhermann/flink-parameter-server
Establishing Analytics Engineering Practices
This included setting up data testing and coaching analysts on how to verify their assumptions. Implementing CI/CD made the development cycle faster; this included SQL code-style checks, query validation, fast Docker image builds, and automatic deployment.
I also made the local development environment easier to set up and use with Docker. I then set up the scheduling of pipelines with Airflow and implemented pipeline monitoring by loading dbt metadata to the data warehouse.
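Monitoring via dbt metadata usually means flattening dbt's `run_results.json` artifact into rows for a warehouse table. A small sketch of that step, using a trimmed-down, hypothetical payload (the artifact has many more fields than shown here):

```python
import json

def summarize_run_results(run_results_json):
    """Flatten a dbt run_results.json payload into rows that can be loaded
    into a warehouse table for monitoring. The field names follow dbt's
    artifact schema (results[].unique_id / status / execution_time)."""
    payload = json.loads(run_results_json)
    return [
        {
            "model": result["unique_id"],
            "status": result["status"],
            "execution_seconds": round(result["execution_time"], 2),
        }
        for result in payload["results"]
    ]

# Hypothetical, heavily trimmed run_results.json:
raw = json.dumps({
    "results": [
        {"unique_id": "model.shop.orders", "status": "success", "execution_time": 12.34},
        {"unique_id": "model.shop.clicks", "status": "error", "execution_time": 0.51},
    ]
})
rows = summarize_run_results(raw)
failed = [r["model"] for r in rows if r["status"] != "success"]
```

Loading rows like these after every run gives analysts a queryable history of model failures and runtimes, so slow or flaky models surface in the same warehouse they already work in.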
Education
Bachelor's Degree in Computer Science
Eötvös Loránd University - Budapest, Hungary
Skills
Libraries/APIs
PySpark
Tools
GitLab CI/CD, Apache Airflow, BigQuery, Git, Flink, Google Cloud Dataproc, Apache Maven, Apache Beam, Apache Avro
Languages
Python, SQL, Java, Bash, Haskell, Scala
Platforms
Docker, Linux, Unix, Kubernetes, Google Cloud Platform (GCP), macOS, Apache Kafka, Apache Flink, Apache Pig
Frameworks
Spark, Spring Boot, Hadoop, YARN, Spring
Paradigms
ETL
Storage
Google Cloud Storage, Google Bigtable, Data Pipelines, Databases, HDFS
Other
Data Build Tool (dbt), Google Cloud Dataflow, Algorithms, Networks, Linear Algebra, Calculus, Numerical Methods, Discrete Mathematics, Linguistics, Machine Learning, Data Engineering