
Sebastian Brestin
Verified Expert in Engineering
Azure Databricks Developer
Cluj-Napoca, Cluj County, Romania
Toptal member since March 12, 2018
Since 2012, Sebastian has been developing distributed systems on platforms ranging from Solaris, IBM AIX, and HP-UX to Linux and Windows. He has worked with technologies such as Apache Spark, Elasticsearch, PostgreSQL, RabbitMQ, Django, and Celery to build scalable, data-intensive software. Sebastian is passionate about delivering high-quality solutions and is extremely interested in big data challenges.
Portfolio
Experience
- Python - 10 years
- Apache Spark - 4 years
- Elasticsearch - 4 years
- PySpark - 4 years
- Java - 3 years
- Pandas - 3 years
- Azure Databricks - 2 years
- Amazon Web Services (AWS) - 2 years
Preferred Environment
Amazon Web Services (AWS), Apache Kafka, Spark, Linux, Hadoop
The most amazing...
...thing I've built is a MapReduce system using Docker, Django, PostgreSQL, and Celery.
Work Experience
Data Engineer
Fortune 500 Retail Company (via Toptal)
- Redesigned a Spark application to make it more robust and flexible for data scientists and software engineers.
- Researched ways to increase Spark S3 Parquet write performance, using the Spark History Server and Ganglia to pinpoint where time was spent and to compare configuration fixes.
- Implemented a robust and generic test framework for Spark pipelines.
- Redesigned Spark application from a monolith into a modular Spark application which writes intermediate results and can be run in parallel.
- Built a Spark application that ingests 1 TB of data from multiple sources and generates 40,000 possible features on which the data science team can perform EDA and test models.
- Implemented new application APIs to help other teams improve their productivity.
Python Developer
Reconstrukt (via Toptal)
- Implemented a concurrent orchestrator for a real-time video rendering system using Python Tornado.
- Used HTTP, WebSockets, raw TCP connections, AWS S3, and NAS storage to build the pipelines needed to process content.
Data Engineer
Spyhce
- Matched mutable objects (which users can create and update) against millions of immutable objects in real time (or as close to it as possible) by creating three Spark-based apps. Additional details about the project can be found in my portfolio section.
- Built a task manager Django application over Celery. The application allows administrators to easily manage tasks and view progress/statistics without an additional monitoring service.
- Developed a Django audit application on top of the Django ORM to keep track of all user actions.
Software Engineer
Hewlett-Packard
- Developed a Python-based build system for a virtual appliance, allowing HP customers to deploy the product into production with little effort.
- Maintained the project.
- Introduced software/patch time-windowed installation that the server agents use in order to avoid loading/rebooting the server during critical hours.
- Led the upgrade from SSL to TLS between all server and client components.
- Redesigned the strategy that the server agents use to select the IP address for communicating with the core components.
- Refactored legacy code to support a custom installation path for the Windows server agent.
Software Developer
GFI Software
- Improved patch download speed by using the caches of LAN neighbors.
- Enhanced the build system for a better UX.
- Maintained the project.
- Redesigned the product's legacy architecture so it could easily be extended with new features.
- Added a feature for discovering Android and iOS devices inside the LAN.
Experience
Personalization Engine Optimization (Fortune 500 Retail Company)
We wanted to improve the user and developer experience. The Spark application had performance problems, and tech debt had become a blocker for adding new functionality.
Solution:
Redesigned the Spark application to make it more robust and flexible for data scientists and software engineers (see the sketch after this list):
- Partitioned data to avoid shuffles and data skew
- Decreased the number of partitions while increasing partition size to avoid data skew and driver congestion
- Used strategic caching to avoid recomputation and to improve computation by cutting down the query plan
- Refactored the application as a pipeline so engineers can add features as plugins
- Generated deterministic Spark output in the case of rows with equal comparison values
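A minimal PySpark sketch of the partitioning, caching, and deterministic-ordering techniques above; the paths, table, and column names are illustrative, not from the real project:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("personalization-engine").getOrCreate()

events = spark.read.parquet("s3://bucket/events/")  # placeholder path

# Repartition by the join key up front so downstream joins and aggregations
# avoid extra shuffles; pick a partition count that keeps partitions large
# enough to avoid driver congestion from many tiny tasks.
events = events.repartition(200, "member_id")

# Cache a reused intermediate result to cut the query plan and avoid
# recomputation in the branches that consume it.
features = events.groupBy("member_id").agg(F.count("*").alias("event_count"))
features.cache()

# Deterministic output for rows with equal comparison values: add a
# tiebreaker column to the ordering so repeated runs produce identical results.
ranked = features.orderBy(F.desc("event_count"), F.asc("member_id"))
```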
Results:
- Reduced execution time by 50% and increased configuration flexibility, which in turn improves data scientists' productivity
- Reduced tech debt, which shortened development time and improved the development experience
- Improved functional tests by having a deterministic output
Personalization Engine Write Optimization (Fortune 500 Retail Company)
We wanted to increase cluster availability, which was being blocked by long-running Spark jobs.
Solution:
Researched ways to increase Spark S3 Parquet write performance. To analyze Spark jobs, I used the Spark History Server and Ganglia to pinpoint where time was spent and to compare configuration fixes.
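A hedged sketch of the configuration angle: these are standard Spark/Hadoop options commonly evaluated for S3 Parquet writes, but the values and the final choices made on the project are assumptions.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3-write-tuning")
    # The v2 commit algorithm moves task output directly instead of committing
    # through a temporary directory, avoiding slow S3 copy-based renames.
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    # Skip writing _SUCCESS marker files.
    .config("spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
    .getOrCreate()
)

df = spark.read.parquet("s3://bucket/input/")  # placeholder path

# Coalesce to fewer, larger files: many small Parquet files make the S3
# commit phase (and later reads) disproportionately expensive.
df.coalesce(64).write.mode("overwrite").parquet("s3://bucket/output/")
```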
Results:
Increased write performance by a factor of three.
Personalization Engine Test Framework (Fortune 500 Retail Company)
We wanted to improve product stability and speed up our development process.
Solution:
Implemented a robust, generic functional test framework for Spark pipelines (sketched after this list):
- Allowed multiple instances of a test to run in parallel
- Allowed tests to run in a pipeline manner as part of a suite
- Simulated production runs by using a representative dataset
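A minimal sketch of such a test harness, assuming pytest and a hypothetical run_pipeline entry point (stubbed here so the example is self-contained):

```python
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # A small local session lets the suite simulate production runs with a
    # representative dataset but far fewer resources.
    session = (
        SparkSession.builder.master("local[2]")
        .appName("pipeline-tests")
        .getOrCreate()
    )
    yield session
    session.stop()

def run_pipeline(spark, df):
    # Placeholder for the real pipeline module under test.
    return df.orderBy("event_count", "member_id")

def test_pipeline_output_is_deterministic(spark):
    df = spark.createDataFrame([("m1", 3), ("m2", 5)], ["member_id", "event_count"])
    # Running the pipeline twice on the same input must yield identical rows.
    assert run_pipeline(spark, df).collect() == run_pipeline(spark, df).collect()
```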
Results:
- Faster POC development
- Reduced development time by 50%
- Increased product stability
- Reduced AWS costs by simulating production runs with fewer resources
Personalization Engine Pipeline Optimization (Fortune 500 Retail Company)
Some of our Spark applications were long-running, taking up to nine hours. In that timeframe, nodes could run out of memory or lose connectivity. We wanted our applications to run faster and to be easier to recover after a failure.
Solution:
Redesigned the Spark application from a monolith into a modular application whose modules write intermediate results and can run in parallel.
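A hedged sketch of the modular structure: each module persists its output so a failed run resumes from the last completed module. Paths, module names, and the build functions are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("modular-pipeline").getOrCreate()

def completed(path):
    # A module counts as completed if its output can be read back.
    try:
        spark.read.parquet(path)
        return True
    except Exception:
        return False

def run_module(build, out_path):
    # Skip modules whose intermediate results already exist, so recovery
    # only re-runs the module that failed.
    if completed(out_path):
        return spark.read.parquet(out_path)
    result = build()
    result.write.mode("overwrite").parquet(out_path)
    return result

raw = run_module(lambda: spark.read.json("/data/raw/"), "/data/stage1/")
clean = run_module(lambda: raw.dropDuplicates(), "/data/stage2/")
```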
Results:
- No failures due to out-of-memory errors
- Easier recovery by re-running only the failed module
Personalization Engine Churn Model (Fortune 500 Retail Company)
We wanted a churn model that we could use to predict future member behavior and that we could integrate into other Spark applications.
Solution:
Built a Spark application that ingests 1 TB of data from multiple sources and generates 40,000 possible features on which the data science team can perform EDA and test models.
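An illustrative PySpark feature-generation pattern for this kind of workload; the source tables, columns, and feature definitions are hypothetical, not the project's.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("churn-features").getOrCreate()

orders = spark.read.parquet("s3://bucket/orders/")  # placeholder source
visits = spark.read.parquet("s3://bucket/visits/")  # placeholder source

# Build a wide feature matrix: one row per member, one column per candidate
# feature. Pivoting per-category aggregates is one way a handful of inputs
# fans out into thousands of columns for EDA and model testing.
order_feats = (
    orders.groupBy("member_id")
    .pivot("category")
    .agg(F.sum("amount").alias("spend"), F.count("*").alias("orders"))
)
visit_feats = visits.groupBy("member_id").agg(
    F.countDistinct("session_id").alias("sessions"),
    F.max("visit_ts").alias("last_visit"),
)
features = order_feats.join(visit_feats, "member_id", "outer")
```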
Personalization Engine Application Development (Fortune 500 Retail Company)
The data science team needed to use the data and our systems as efficiently as possible in order to be productive.
Solution:
Implemented new application APIs to help other teams improve their productivity (a configuration-wrapper sketch follows the list):
1. Wrote better API for complex Spark or Cloud functionality
2. Wrote wrappers for complex application configuration
3. Implemented integration with adjacent applications
4. Wrote documentation and Unit Tests
5. Ran data analysis and data quality checks
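A hypothetical example of the configuration-wrapper idea from point 2: a small dataclass that validates settings once and hands teams a ready SparkSession, instead of each team copying raw config keys. The class and defaults are assumptions, not the project's API.

```python
from dataclasses import dataclass
from pyspark.sql import SparkSession

@dataclass
class JobConfig:
    app_name: str
    shuffle_partitions: int = 200

    def session(self) -> SparkSession:
        # Validate once here so every consuming team gets a clear error
        # instead of an obscure runtime failure deep in a Spark job.
        if self.shuffle_partitions <= 0:
            raise ValueError("shuffle_partitions must be positive")
        return (
            SparkSession.builder.appName(self.app_name)
            .config("spark.sql.shuffle.partitions", str(self.shuffle_partitions))
            .getOrCreate()
        )

# Usage: a team asks for a tuned session in one line.
spark = JobConfig(app_name="team-x-job", shuffle_partitions=400).session()
```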
Django MapReduce System
Spyhce | Mutable Object Matching Project
We chose Apache Spark because of its Python support, rich analytics toolkit, and streaming capabilities.
The solution was split into three Spark apps (a streaming sketch follows the list):
1. Retrieves mutable data from the sources using Kafka and Spark Streaming, extracts features, and saves the results to Cassandra.
2. Retrieves immutable data from the sources using Kafka and Spark Streaming, extracts features, and saves the results to disk as Parquet files.
3. Loads the data from the Parquet files and computes a match percentage between all immutable objects and a single mutable object from Cassandra.
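A sketch of the second app's shape using today's Structured Streaming API (the original may well have used the DStream API); the broker, topic, schema, and feature are placeholders.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("immutable-ingest").getOrCreate()

# Consume the immutable-object stream from Kafka.
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "immutable-objects")
    .load()
)

# Extract features from the message payload, then persist them as Parquet
# for the third (batch matching) app to load.
features = raw.select(
    F.col("key").cast("string").alias("object_id"),
    F.length(F.col("value")).alias("payload_size"),  # placeholder feature
)

query = (
    features.writeStream.format("parquet")
    .option("path", "/data/immutable/")
    .option("checkpointLocation", "/data/checkpoints/immutable/")
    .start()
)
```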
To query data in a reasonable amount of time, I denormalized the PostgreSQL tables (billions of records) into Elasticsearch. This improved read performance by two orders of magnitude but added a write penalty. Because only parts of the documents changed frequently, the problem was solved using Elasticsearch bulk partial updates, with Groovy scripts for the complex fields.
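A hedged sketch of the bulk partial-update technique using the elasticsearch-py bulk helper; the index, fields, and script body are illustrative, and the Groovy script language reflects the Elasticsearch 1.x/2.x era the project targeted.

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])

def update_actions(changes):
    # Partial updates touch only the frequently changing fields, avoiding
    # the write amplification of reindexing whole documents.
    for doc_id, new_score in changes:
        yield {
            "_op_type": "update",
            "_index": "objects",
            "_id": doc_id,
            "script": {
                "lang": "groovy",
                "inline": "ctx._source.scores.add(s)",  # illustrative script
                "params": {"s": new_score},
            },
        }

helpers.bulk(es, update_actions([("42", 0.97), ("43", 0.12)]))
```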
Single Sign-on Platform
Machine Learning
High-content-streaming Platform
Spark Secondary Sort
https://www.qwertee.io/blog/spark-secondary-sort/
The article covers two solutions. The first uses Spark's groupByKey, which sorts in the reduce phase, while the second uses Spark's repartitionAndSortWithinPartitions, which leverages the shuffle phase and iterator-to-iterator transformations to sort more efficiently without running out of memory.
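A condensed version of the article's second approach on toy data: partition on the first key element only, while the full composite key drives the shuffle-time sort.

```python
from pyspark import SparkContext
from pyspark.rdd import portable_hash

sc = SparkContext(appName="secondary-sort")

# (user, timestamp) composite keys; we want records grouped by user and
# ordered by timestamp within each user.
records = sc.parallelize([
    (("alice", 3), "c"), (("bob", 1), "x"),
    (("alice", 1), "a"), (("alice", 2), "b"),
])

result = records.repartitionAndSortWithinPartitions(
    numPartitions=2,
    # Partition only on the user so all of a user's records land together...
    partitionFunc=lambda key: portable_hash(key[0]),
    # ...while the full (user, timestamp) key is used for the in-shuffle sort.
    keyfunc=lambda key: key,
)
```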
PostgreSQL Data Partitioning and Django
https://www.qwertee.io/blog/postgres-data-partitioning-and-django/
The first part of the article covers the reasons to partition data and how partitioning can be implemented in PostgreSQL. The second part describes a few solutions for taking advantage of PostgreSQL partitioning while using a web framework such as Django.
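A minimal sketch of PostgreSQL declarative range partitioning (PostgreSQL 10+) driven from Python; the table layout and DSN are illustrative, not taken from the article.

```python
import psycopg2

conn = psycopg2.connect("dbname=app")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS events (
            id bigserial,
            created_at timestamptz NOT NULL,
            payload jsonb
        ) PARTITION BY RANGE (created_at);
    """)
    # One partition per month: Django can keep querying "events" unchanged
    # while PostgreSQL routes rows to the right child table.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS events_2018_01
        PARTITION OF events
        FOR VALUES FROM ('2018-01-01') TO ('2018-02-01');
    """)
```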
PostgreSQL B-Tree Index Explained | Part 1
https://www.qwertee.io/blog/postgresql-b-tree-index-explained-part-1/
The first section of the article explains the fundamentals of the PostgreSQL B-tree index, focusing on the B-tree data structure and its main components (leaf nodes and internal nodes): what their roles are and how they are accessed while executing a query, a process also known as an index lookup. The section ends with a classification of indexes and a broader view of index key arity, which helps explain the impact on a query's performance.
The second section gives a quick introduction to the query plan, to better understand the examples that follow.
The third section introduces the concept of predicates and covers the process PostgreSQL uses to classify predicates depending on index definition and feasibility.
The last section goes deeper into the mechanics of scans and how an index definition, data distribution, or even predicate usage can influence the performance of a query.
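A tiny companion example to the predicate discussion, assuming a hypothetical users table and index: the planner can serve an equality predicate on the indexed column with an index lookup, but not a predicate wrapped in a function the index does not cover.

```python
import psycopg2

conn = psycopg2.connect("dbname=app")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute("CREATE INDEX IF NOT EXISTS users_email_idx ON users (email);")
    # An equality predicate on the index key allows an index lookup...
    cur.execute("EXPLAIN SELECT * FROM users WHERE email = %s;", ("a@b.c",))
    print(cur.fetchall())  # typically an Index Scan using users_email_idx
    # ...while a predicate the index cannot serve falls back to a seq scan.
    cur.execute("EXPLAIN SELECT * FROM users WHERE lower(email) = %s;", ("a@b.c",))
    print(cur.fetchall())
```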
Education
Bachelor's Degree in Computer Science
Babeș-Bolyai University - Cluj-Napoca, Romania
Certifications
Certified Scrum Master
Scrum Alliance
Oracle Certified Associate, Java SE 7 Programmer
Net BRINEL SA
Skills
Libraries/APIs
PySpark, NumPy, Pandas, Asyncio, OpenSSL, Scikit-learn, ZeroMQ, Natural Language Toolkit (NLTK)
Tools
Celery, RabbitMQ, WildFly, Ganglia, Amazon Elastic MapReduce (EMR), VMware vSphere, Apache Avro, Subversion (SVN), Git, Jenkins
Languages
Python, C++, Java, Delphi, C#, TypeScript, JavaScript
Frameworks
Hadoop, Apache Spark, JSON Web Tokens (JWT), OAuth 2, Django REST Framework, Django, .NET, Spring, Twisted, Spring MVC, Angular
Platforms
Amazon Web Services (AWS), Apache Kafka, Windows, Linux, Ubuntu, Oracle Database, Unix, Jupyter Notebook, HP-UX, Solaris, Docker
Storage
Elasticsearch, PostgreSQL, Microsoft SQL Server, Cassandra, Amazon S3 (AWS S3), Memcached, Redis, Oracle RDBMS, MongoDB, SQL Server 2012, MySQL
Paradigms
Scrum
Other
RESTful Web Services, Data Mining, Data Engineering, OpenID Connect (OIDC), Azure Databricks, WebLogic, Parquet, VMware ESXi, Apache Cassandra, NATS, Apache Flume, Cryptography, Tornado