
Sebastian Brestin
Verified Expert in Engineering
Azure Databricks Developer
Cluj-Napoca, Cluj County, Romania
Toptal member since March 12, 2018
Since 2012, Sebastian has been developing distributed systems on platforms ranging from Solaris, IBM AIX, and HP-UX to Linux and Windows. He has worked with technologies such as Apache Spark, Elasticsearch, PostgreSQL, RabbitMQ, Django, and Celery to build scalable, data-intensive software. Sebastian is passionate about delivering high-quality solutions and is extremely interested in big data challenges.
Portfolio
Experience
- Python - 10 years
- Apache Spark - 4 years
- Elasticsearch - 4 years
- PySpark - 4 years
- Java - 3 years
- Pandas - 3 years
- Azure Databricks - 2 years
- Amazon Web Services (AWS) - 2 years
Preferred Environment
Amazon Web Services (AWS), Apache Kafka, Spark, Linux, Hadoop
The most amazing...
...thing I've built is a MapReduce system using Docker, Django, PostgreSQL, and Celery.
Work Experience
Data Engineer
Fortune 500 Retail Company (via Toptal)
- Redesigned a Spark application to make it more robust and flexible for data scientists and software engineers.
- Researched ways to increase Spark S3 Parquet write performance, using the Spark History Server and Ganglia to pinpoint where time was spent and to compare configuration fixes.
- Implemented a robust and generic test framework for Spark pipelines.
- Redesigned Spark application from a monolith into a modular Spark application which writes intermediate results and can be run in parallel.
- Built a Spark application that ingests 1 TB of data from multiple sources and generates 40,000 possible features on which the data science team can perform EDA and test models.
- Implemented new application APIs to help other teams improve their productivity.
Python Developer
Reconstrukt (via Toptal)
- Implemented a concurrent orchestrator for a real-time video rendering system using Python Tornado.
- Used HTTP, WebSockets, raw TCP connections, AWS S3, and NAS storage to build the pipelines needed to process content.
Data Engineer
Spyhce
- Matched mutable objects (which users can create and update) against millions of immutable objects in real time (or as close to it as possible) by creating three Spark-based apps. Additional details about the project can be found in my portfolio section.
- Built a task manager Django application over Celery. The application allows administrators to easily manage tasks and view progress/statistics without an additional monitoring service.
- Developed a Django audit application on top of the Django ORM to keep track of all user actions.
Software Engineer
Hewlett-Packard
- Developed a Python-based build system for a virtual appliance, allowing HP customers to deploy the product into production with little effort.
- Maintained the project.
- Introduced software/patch time-windowed installation that the server agents use in order to avoid loading/rebooting the server during critical hours.
- Led the upgrade from SSL to TLS between all server and client components.
- Redesigned the strategy that the server agents use to select the IP address for communicating with the core components.
- Refactored legacy code to support a custom installation path for the Windows server agent.
Software Developer
GFI Software
- Improved patch download speed by using the caches of LAN neighbors.
- Enhanced the build system for a better UX.
- Maintained the project.
- Redesigned the product's legacy architecture so it could easily be extended with new features.
- Added a feature for discovering Android and iOS devices inside the LAN.
Experience
Personalization Engine Optimization (Fortune 500 Retail Company)
We wanted to improve the user and developer experience. The Spark application had performance problems, and tech debt had become a blocker for adding new functionality.
Solution:
Redesigned the Spark application to make it more robust and flexible for data scientists and software engineers (see the sketch after this list):
- Partitioned data to avoid shuffles and data skew
- Decreased the number of partitions while increasing partition size to avoid data skew and driver congestion
- Used strategic caching to avoid recomputation and to improve computation by cutting down the query plan
- Refactored the application as a pipeline so engineers can add features as plugins
- Generated deterministic Spark output in the case of rows with equal comparison values
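A minimal PySpark sketch of the partitioning, caching, and deterministic-ordering techniques above; the paths, table, and column names are illustrative, not from the real project:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("personalization-engine").getOrCreate()

events = spark.read.parquet("s3://bucket/events/")  # placeholder path

# Repartition by the join key up front so downstream joins and aggregations
# avoid extra shuffles; pick a partition count that keeps partitions large
# enough to avoid driver congestion from many tiny tasks.
events = events.repartition(200, "member_id")

# Cache a reused intermediate result to cut the query plan and avoid
# recomputation in the branches that consume it.
features = events.groupBy("member_id").agg(F.count("*").alias("event_count"))
features.cache()

# Deterministic output for rows with equal comparison values: add a
# tiebreaker column to the ordering so repeated runs produce identical results.
ranked = features.orderBy(F.desc("event_count"), F.asc("member_id"))
```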
Results:
- Reduced execution time by 50% and increased configuration flexibility, which in turn improves data scientists' productivity
- Reduced tech debt, which shortened development time and improved the development experience
- Improved functional tests by having a deterministic output
Personalization Engine Write Optimization (Fortune 500 Retail Company)
We wanted to increase cluster availability, which was being blocked by long-running Spark jobs.
Solution:
Researched ways to increase Spark S3 Parquet write performance. To analyze Spark jobs, I used the Spark History Server and Ganglia to pinpoint where time was spent and to compare configuration fixes.
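A hedged sketch of the configuration angle: these are standard Spark/Hadoop options commonly evaluated for S3 Parquet writes, but the values and the final choices made on the project are assumptions.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3-write-tuning")
    # The v2 commit algorithm moves task output directly instead of committing
    # through a temporary directory, avoiding slow S3 copy-based renames.
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    # Skip writing _SUCCESS marker files.
    .config("spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
    .getOrCreate()
)

df = spark.read.parquet("s3://bucket/input/")  # placeholder path

# Coalesce to fewer, larger files: many small Parquet files make the S3
# commit phase (and later reads) disproportionately expensive.
df.coalesce(64).write.mode("overwrite").parquet("s3://bucket/output/")
```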
Results:
Increased write performance by a factor of three.
Personalization Engine Test Framework (Fortune 500 Retail Company)
We wanted to improve product stability and speed up our development process.
Solution:
Implemented a robust, generic functional test framework for Spark pipelines (sketched after this list):
- Allowed multiple instances of a test to run in parallel
- Allowed tests to run in a pipeline manner as part of a suite
- Simulated production runs by using a representative dataset
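A minimal sketch of such a test harness, assuming pytest and a hypothetical run_pipeline entry point (stubbed here so the example is self-contained):

```python
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # A small local session lets the suite simulate production runs with a
    # representative dataset but far fewer resources.
    session = (
        SparkSession.builder.master("local[2]")
        .appName("pipeline-tests")
        .getOrCreate()
    )
    yield session
    session.stop()

def run_pipeline(spark, df):
    # Placeholder for the real pipeline module under test.
    return df.orderBy("event_count", "member_id")

def test_pipeline_output_is_deterministic(spark):
    df = spark.createDataFrame([("m1", 3), ("m2", 5)], ["member_id", "event_count"])
    # Running the pipeline twice on the same input must yield identical rows.
    assert run_pipeline(spark, df).collect() == run_pipeline(spark, df).collect()
```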
Results:
- Faster POC development
- Reduced development time by 50%
- Increased product stability
- Reduced AWS costs by simulating production runs with fewer resources
Personalization Engine Pipeline Optimization (Fortune 500 Retail Company)
Some of our Spark applications were long-running, taking up to nine hours. In that timeframe, nodes could run out of memory or lose connectivity. We wanted our applications to run faster and to be easier to recover after a failure.
Solution:
Redesigned the Spark application from a monolith into a modular application whose modules write intermediate results and can run in parallel.
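A hedged sketch of the modular structure: each module persists its output so a failed run resumes from the last completed module. Paths, module names, and the build functions are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("modular-pipeline").getOrCreate()

def completed(path):
    # A module counts as completed if its output can be read back.
    try:
        spark.read.parquet(path)
        return True
    except Exception:
        return False

def run_module(build, out_path):
    # Skip modules whose intermediate results already exist, so recovery
    # only re-runs the module that failed.
    if completed(out_path):
        return spark.read.parquet(out_path)
    result = build()
    result.write.mode("overwrite").parquet(out_path)
    return result

raw = run_module(lambda: spark.read.json("/data/raw/"), "/data/stage1/")
clean = run_module(lambda: raw.dropDuplicates(), "/data/stage2/")
```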
Results:
- No failures due to out-of-memory errors
- Easier recovery by re-running only the failed module
Personalization Engine Churn Model (Fortune 500 Retail Company)
We wanted a churn model that we could use to predict future member behavior and that we could integrate into other Spark applications.
Solution:
Built a Spark application that ingests 1 TB of data from multiple sources and generates 40,000 possible features on which the data science team can perform EDA and test models.
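An illustrative PySpark feature-generation pattern for this kind of workload; the source tables, columns, and feature definitions are hypothetical, not the project's.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("churn-features").getOrCreate()

orders = spark.read.parquet("s3://bucket/orders/")  # placeholder source
visits = spark.read.parquet("s3://bucket/visits/")  # placeholder source

# Build a wide feature matrix: one row per member, one column per candidate
# feature. Pivoting per-category aggregates is one way a handful of inputs
# fans out into thousands of columns for EDA and model testing.
order_feats = (
    orders.groupBy("member_id")
    .pivot("category")
    .agg(F.sum("amount").alias("spend"), F.count("*").alias("orders"))
)
visit_feats = visits.groupBy("member_id").agg(
    F.countDistinct("session_id").alias("sessions"),
    F.max("visit_ts").alias("last_visit"),
)
features = order_feats.join(visit_feats, "member_id", "outer")
```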
Personalization Engine Application Development (Fortune 500 Retail Company)
The data science team needed to use the data and our systems as efficiently as possible in order to be productive.
Solution:
Implemented new application APIs to help other teams improve their productivity (a configuration-wrapper sketch follows the list):
1. Wrote better API for complex Spark or Cloud functionality
2. Wrote wrappers for complex application configuration
3. Implemented integration with adjacent applications
4. Wrote documentation and Unit Tests
5. Ran data analysis and data quality checks
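A hypothetical example of the configuration-wrapper idea from point 2: a small dataclass that validates settings once and hands teams a ready SparkSession, instead of each team copying raw config keys. The class and defaults are assumptions, not the project's API.

```python
from dataclasses import dataclass
from pyspark.sql import SparkSession

@dataclass
class JobConfig:
    app_name: str
    shuffle_partitions: int = 200

    def session(self) -> SparkSession:
        # Validate once here so every consuming team gets a clear error
        # instead of an obscure runtime failure deep in a Spark job.
        if self.shuffle_partitions <= 0:
            raise ValueError("shuffle_partitions must be positive")
        return (
            SparkSession.builder.appName(self.app_name)
            .config("spark.sql.shuffle.partitions", str(self.shuffle_partitions))
            .getOrCreate()
        )

# Usage: a team asks for a tuned session in one line.
spark = JobConfig(app_name="team-x-job", shuffle_partitions=400).session()
```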
Django MapReduce System
Spyhce | Mutable Object Matching Project
We chose Apache Spark because of its Python support, rich analytics toolkit, and streaming capabilities.
The solution was split into three Spark apps (a streaming sketch follows the list):
1. Retrieves mutable data from the sources using Kafka and Spark Streaming, extracts features, and saves the results to Cassandra.
2. Retrieves immutable data from the sources using Kafka and Spark Streaming, extracts features, and saves the results to disk as Parquet files.
3. Loads the data from the Parquet files and computes a match percentage between all immutable objects and a single mutable object from Cassandra.
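A sketch of the second app's shape using today's Structured Streaming API (the original may well have used the DStream API); the broker, topic, schema, and feature are placeholders.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("immutable-ingest").getOrCreate()

# Consume the immutable-object stream from Kafka.
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "immutable-objects")
    .load()
)

# Extract features from the message payload, then persist them as Parquet
# for the third (batch matching) app to load.
features = raw.select(
    F.col("key").cast("string").alias("object_id"),
    F.length(F.col("value")).alias("payload_size"),  # placeholder feature
)

query = (
    features.writeStream.format("parquet")
    .option("path", "/data/immutable/")
    .option("checkpointLocation", "/data/checkpoints/immutable/")
    .start()
)
```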
To query data in a reasonable amount of time, I denormalized the PostgreSQL tables (billions of records) into Elasticsearch. This improved read performance by two orders of magnitude but added a write penalty. Because only parts of the documents changed frequently, the problem was solved using Elasticsearch bulk partial updates, with Groovy scripts for the complex fields.
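A hedged sketch of the bulk partial-update technique using the elasticsearch-py bulk helper; the index, fields, and script body are illustrative, and the Groovy script language reflects the Elasticsearch 1.x/2.x era the project targeted.

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])

def update_actions(changes):
    # Partial updates touch only the frequently changing fields, avoiding
    # the write amplification of reindexing whole documents.
    for doc_id, new_score in changes:
        yield {
            "_op_type": "update",
            "_index": "objects",
            "_id": doc_id,
            "script": {
                "lang": "groovy",
                "inline": "ctx._source.scores.add(s)",  # illustrative script
                "params": {"s": new_score},
            },
        }

helpers.bulk(es, update_actions([("42", 0.97), ("43", 0.12)]))
```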
Single Sign-on Platform
Machine Learning
High-content-streaming Platform
Spark Secondary Sort
https://www.qwertee.io/blog/spark-secondary-sort/
The article covers two solutions. The first uses Spark's groupByKey, which sorts in the reduce phase, while the second uses Spark's repartitionAndSortWithinPartitions, which leverages the shuffle phase and iterator-to-iterator transformations to sort more efficiently without running out of memory.
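A condensed version of the article's second approach on toy data: partition on the first key element only, while the full composite key drives the shuffle-time sort.

```python
from pyspark import SparkContext
from pyspark.rdd import portable_hash

sc = SparkContext(appName="secondary-sort")

# (user, timestamp) composite keys; we want records grouped by user and
# ordered by timestamp within each user.
records = sc.parallelize([
    (("alice", 3), "c"), (("bob", 1), "x"),
    (("alice", 1), "a"), (("alice", 2), "b"),
])

result = records.repartitionAndSortWithinPartitions(
    numPartitions=2,
    # Partition only on the user so all of a user's records land together...
    partitionFunc=lambda key: portable_hash(key[0]),
    # ...while the full (user, timestamp) key is used for the in-shuffle sort.
    keyfunc=lambda key: key,
)
```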
PostgreSQL Data Partitioning and Django
https://www.qwertee.io/blog/postgres-data-partitioning-and-django/
The first part of the article covers the reasons to partition data and how partitioning can be implemented in PostgreSQL. The second part describes a few solutions for taking advantage of PostgreSQL partitioning while using a web framework such as Django.
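A minimal sketch of PostgreSQL declarative range partitioning (PostgreSQL 10+) driven from Python; the table layout and DSN are illustrative, not taken from the article.

```python
import psycopg2

conn = psycopg2.connect("dbname=app")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS events (
            id bigserial,
            created_at timestamptz NOT NULL,
            payload jsonb
        ) PARTITION BY RANGE (created_at);
    """)
    # One partition per month: Django can keep querying "events" unchanged
    # while PostgreSQL routes rows to the right child table.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS events_2018_01
        PARTITION OF events
        FOR VALUES FROM ('2018-01-01') TO ('2018-02-01');
    """)
```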
PostgreSQL B-Tree Index Explained | Part 1
https://www.qwertee.io/blog/postgresql-b-tree-index-explained-part-1/
The first section of the article explains the fundamentals of the PostgreSQL B-tree index, focusing on the B-tree data structure and its main components (leaf nodes and internal nodes): what their roles are and how they are accessed while executing a query, a process also known as an index lookup. The section ends with a classification of indexes and a broader view of index key arity, which helps explain the impact on a query's performance.
The second section gives a quick introduction to the query plan, to better understand the examples that follow.
The third section introduces the concept of predicates and covers the process PostgreSQL uses to classify predicates depending on index definition and feasibility.
The last section goes deeper into the mechanics of scans and how an index definition, data distribution, or even predicate usage can influence the performance of a query.
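A tiny companion example to the predicate discussion, assuming a hypothetical users table and index: the planner can serve an equality predicate on the indexed column with an index lookup, but not a predicate wrapped in a function the index does not cover.

```python
import psycopg2

conn = psycopg2.connect("dbname=app")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute("CREATE INDEX IF NOT EXISTS users_email_idx ON users (email);")
    # An equality predicate on the index key allows an index lookup...
    cur.execute("EXPLAIN SELECT * FROM users WHERE email = %s;", ("a@b.c",))
    print(cur.fetchall())  # typically an Index Scan using users_email_idx
    # ...while a predicate the index cannot serve falls back to a seq scan.
    cur.execute("EXPLAIN SELECT * FROM users WHERE lower(email) = %s;", ("a@b.c",))
    print(cur.fetchall())
```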
Education
Bachelor's Degree in Computer Science
Babeș-Bolyai University - Cluj-Napoca, Romania
Certifications
Certified Scrum Master
Scrum Alliance
Oracle Certified Associate, Java SE 7 Programmer
Net BRINEL SA
Skills
Libraries/APIs
PySpark, NumPy, Pandas, Asyncio, OpenSSL, Scikit-learn, ZeroMQ, Natural Language Toolkit (NLTK)
Tools
Celery, RabbitMQ, WildFly, Ganglia, Amazon Elastic MapReduce (EMR), VMware vSphere, Apache Avro, Subversion (SVN), Git, Jenkins
Languages
Python, C++, Java, Delphi, C#, TypeScript, JavaScript
Frameworks
Hadoop, Apache Spark, JSON Web Tokens (JWT), OAuth 2, Django REST Framework, Django, .NET, Spring, Twisted, Spring MVC, Angular
Platforms
Amazon Web Services (AWS), Apache Kafka, Windows, Linux, Ubuntu, Oracle Database, Unix, Jupyter Notebook, HP-UX, Solaris, Docker
Storage
Elasticsearch, PostgreSQL, Microsoft SQL Server, Cassandra, Amazon S3 (AWS S3), Memcached, Redis, Oracle RDBMS, MongoDB, SQL Server 2012, MySQL
Paradigms
Scrum
Other
RESTful Web Services, Data Mining, Data Engineering, OpenID Connect (OIDC), Azure Databricks, WebLogic, Parquet, VMware ESXi, Apache Cassandra, NATS, Apache Flume, Cryptography, Tornado