Sebastian Brestin

Software Developer in Cluj-Napoca, Cluj County, Romania

Member since February 20, 2018
Since 2012, Sebastian has been developing distributed systems for platforms ranging from Solaris, IBM AIX, and HP-UX to Linux and Windows. He has worked with technologies such as Apache Spark, Elasticsearch, PostgreSQL, RabbitMQ, Django, and Celery to build data-intensive, scalable software. Sebastian is passionate about delivering high-quality solutions and is extremely interested in big data challenges.

Location

Cluj-Napoca, Cluj County, Romania

Availability

Full-time

Preferred Environment

Amazon Web Services (AWS), Apache Kafka, Apache Spark, Linux, Hadoop

The most amazing...

...thing I've built is a MapReduce system using Docker, Django, PostgreSQL, and Celery.

Employment

  • Data Engineer

    2019 - 2021
    Fortune 500 Retail Company (via Toptal)
    • Redesigned a Spark application to make it more robust and flexible for data scientists and software engineers.
    • Researched ways to increase Spark S3 Parquet write performance, using the Spark History Server and Ganglia to pinpoint where time was spent and to compare configuration fixes.
    • Implemented a robust and generic test framework for Spark pipelines.
    • Redesigned a monolithic Spark application into modular components that write intermediate results and can run in parallel.
    • Built a Spark application that ingests 1 TB of data from multiple sources and generates 40,000 candidate features on which the data science team can perform EDA and test models.
    • Implemented new application APIs to help other teams improve their productivity.
    Technologies: Jupyter Notebook, Pandas, Amazon Web Services (AWS), Parquet, Python, PySpark
  • Python Developer

    2018 - 2019
Reconstrukt (via Toptal)
    • Implemented a concurrent orchestrator for a real-time video rendering system using Python Tornado.
    • Used HTTP, WebSockets, raw TCP connections, AWS S3, and NAS storage to build the pipelines needed to process content.
    Technologies: Amazon Web Services (AWS), Asyncio, Tornado, Python
  • Data Engineer

    2016 - 2018
    Spyhce
    • Matched mutable objects (which users can create or update) against millions of immutable objects in near real time by creating three Spark-based apps. Additional details about the project can be found in my portfolio section.
    • Built a task manager Django application over Celery. The application allows administrators to easily manage tasks and view progress/statistics without an additional monitoring service.
    • Developed a Django audit application over the Django ORM to keep track of all user actions.
    Technologies: Jenkins, Docker, Cassandra, Redis, Elasticsearch, PostgreSQL, Apache Kafka, RabbitMQ, Celery, Django, Python, Spark
  • Software Engineer

    2012 - 2016
    Hewlett-Packard
    • Developed a Python-based build system for a virtual appliance, allowing HP customers to deploy the product into production with little effort.
    • Maintained the project.
    • Introduced time-windowed software/patch installation, which the server agents use to avoid loading or rebooting the server during critical hours.
    • Led the upgrade from SSL to TLS between all server and client components.
    • Redesigned the strategy that the server agents use to select the IP address for communicating with the core components.
    • Refactored legacy code to support a custom installation path for the Windows server agent.
    Technologies: OpenSSL, Windows, Unix, Linux, PostgreSQL, Oracle Database, C++, Python, Spring, WebLogic, WildFly, Java
  • Software Developer

    2012 - 2012
    GFI Software
    • Improved download speed for patches by using the cache of LAN neighbors.
    • Enhanced the build system for a better UX.
    • Maintained the project.
    • Redesigned the product's legacy architecture so it can be easily extended with new features.
    • Added a feature for discovering Android and iOS devices inside the LAN.
    Technologies: Microsoft SQL Server, Delphi, .NET, C#, C++

Experience

  • Personalization Engine Optimization (Fortune 500 Retail Company)

    Context:
    We wanted to improve the user and developer experience. The Spark application had performance problems, and tech debt had become a blocker for adding new functionality.

    Solution:
    Redesigned the Spark application to make it more robust and flexible for data scientists and software engineers (see the sketch after this list):
    • Partitioned data to avoid shuffles and data skew
    • Decreased the number of partitions while increasing partition size to avoid data skew and driver congestion
    • Used strategic caching to avoid recomputation and to cut down the query plan
    • Refactored the application as a pipeline so engineers can add features as plugins
    • Generated deterministic Spark output for rows with equal comparison values
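
    A minimal PySpark sketch of these patterns; paths, column names, and partition counts are hypothetical:

      from pyspark.sql import SparkSession, functions as F

      spark = SparkSession.builder.appName("personalization-sketch").getOrCreate()

      # Hypothetical input; repartitioning by the join key up front avoids
      # repeated shuffles in downstream joins.
      events = spark.read.parquet("s3://bucket/events/").repartition(400, "member_id")

      # Caching an intermediate result that several branches reuse cuts the
      # query plan so each branch does not recompute the full lineage.
      enriched = events.withColumn("day", F.to_date("ts")).cache()

      # Deterministic output when rows tie on the comparison value: include a
      # unique column as the final sort key to impose a total order.
      ranked = enriched.orderBy(F.desc("score"), F.asc("member_id"))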

    Results:
    • Reduced execution time by 50% and increased configuration flexibility, which in turn improves data scientists' productivity
    • Reduced tech debt, which improved development time and the development experience
    • Improved functional tests by having a deterministic output

  • Personalization Engine Write Optimization (Fortune 500 Retail Company)

    Context:
    We wanted to increase cluster availability, which was being blocked by long-running Spark jobs.

    Solution:
    Researched ways to increase Spark S3 Parquet write performance. To analyze the Spark jobs, I used the Spark History Server and Ganglia to pinpoint where time was spent and to compare configuration fixes.
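
    A sketch of the kind of configuration compared during this research; the committer setting is a standard Hadoop option, and the partition count is illustrative:

      from pyspark.sql import SparkSession

      spark = (
          SparkSession.builder
          # The v2 file output committer reduces the slow commit/rename phase
          # that dominates Parquet writes to S3.
          .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
          .getOrCreate()
      )

      df = spark.read.parquet("s3://bucket/input/")  # hypothetical input
      # Control the output file count explicitly instead of inheriting the
      # shuffle partition count; fewer, larger files write faster to S3.
      df.repartition(64).write.mode("overwrite").parquet("s3://bucket/output/")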

    Results:
    Increased write performance by a factor of three.

  • Personalization Engine Test Framework (Fortune 500 Retail Company)

    Context:
    We wanted to increase product stability and improve our development process.

    Solution:
    Implemented a robust and generic functional test framework for Spark pipelines (sketched below):
    • Allowed multiple instances of a test to run in parallel
    • Allowed tests to run in a pipeline manner as part of a suite
    • Simulated production runs by using a representative dataset
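
    A minimal sketch of the framework's core idea using pytest; fixture and stage names are hypothetical:

      import pytest
      from pyspark.sql import SparkSession

      @pytest.fixture(scope="session")
      def spark():
          # local[*] runs the suite on a single machine, simulating
          # production with far fewer resources.
          return (SparkSession.builder
                  .master("local[*]")
                  .appName("pipeline-tests")
                  .getOrCreate())

      def test_stage_filters_invalid_rows(spark):
          # A small, representative dataset stands in for production data.
          df = spark.createDataFrame([(1, "a"), (2, None)], ["id", "value"])
          result = df.filter("value IS NOT NULL")  # stand-in for a pipeline stage
          assert result.count() == 1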

    Results:
    • Faster POC development
    • Reduced development time by 50%
    • Increased product stability
    • Reduced AWS costs by simulating production runs with fewer resources

  • Personalization Engine Pipeline Optimization (Fortune 500 Retail Company)

    Context:
    Some of our Spark applications ran for up to nine hours. In that timeframe, nodes could run out of memory or lose connectivity. We wanted our applications to run faster and to be easier to recover in case of failure.

    Solution:
    Redesigned the monolithic Spark application into modular components that write intermediate results and can run in parallel, as sketched below.
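
    A sketch of the modular shape; function and path names are hypothetical:

      def run_module(spark, transform, input_paths, output_path):
          # Each module reads its inputs, applies one transform, and persists
          # its own intermediate result.
          frames = [spark.read.parquet(p) for p in input_paths]
          transform(*frames).write.mode("overwrite").parquet(output_path)

      # Because every module checkpoints its output, a failure is recovered
      # by re-running only that module, and independent modules can be
      # scheduled in parallel.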

    Results:
    • No failures due to out-of-memory errors
    • Easier recovery by re-running only the failed module

  • Personalization Engine Churn Model (Fortune 500 Retail Company)

    Context:
    We wanted a churn model that we could use to make predictions about our members and also integrate into other Spark applications.

    Solution:
    Built a Spark application that ingests 1 TB of data from multiple sources and generates 40,000 candidate features on which the data science team can perform EDA and test models.

  • Personalization Engine Application Development (Fortune 500 Retail Company)

    Context:
    The data science team needs to use the data and our systems efficiently to be productive.

    Solution:
    Implemented new application APIs to help other teams improve their productivity (a sketch follows this list):
    1. Wrote a better API for complex Spark and cloud functionality
    2. Wrote wrappers for complex application configuration
    3. Implemented integrations with adjacent applications
    4. Wrote documentation and unit tests
    5. Ran data analysis and data quality checks
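
    A sketch of the wrapper idea; the path layout and function name are hypothetical:

      def read_feature_table(spark, table, date):
          # Hide path resolution and partition-layout details from callers so
          # data scientists can load data with one call.
          path = f"s3://bucket/features/{table}/dt={date}"
          return spark.read.parquet(path)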

  • Django MapReduce System

    For a minimum viable product, I designed and implemented a custom Django MapReduce system for matching an object against a million other objects. The system ran on 60 Docker nodes, and a single match task completed in under 60 seconds. To test the system, I extended the Django test framework so the custom MapReduce system could be tested locally on multiple processes and cores, without spawning virtual machines. A sketch of the map/reduce idea follows.
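
    A minimal sketch of the map/reduce idea on Celery (the production system also involved Django, PostgreSQL, and Docker); task names and the scoring function are hypothetical:

      from celery import Celery, chord

      app = Celery("matcher", broker="amqp://localhost")

      def score(a, b):
          return 0.0  # placeholder similarity metric

      @app.task
      def match_chunk(obj, chunk):
          # "map": score one object against a chunk of candidates
          return [(c["id"], score(obj, c)) for c in chunk]

      @app.task
      def merge(partials):
          # "reduce": merge partial results and keep the best matches
          flat = [m for part in partials for m in part]
          return sorted(flat, key=lambda m: m[1], reverse=True)[:100]

      def run_match(obj, chunks):
          # chord fans the map tasks out across workers, then calls merge
          return chord(match_chunk.s(obj, c) for c in chunks)(merge.s())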

  • Spyhce | Mutable Object Matching Project

    This project aimed to match mutable objects (which users can create/update) with millions of immutable objects in real time (or as close as possible).

    We chose Apache Spark because of its Python support, rich analytics toolkit, and streaming capabilities.

    The solution was split into three Spark apps:
    1. Retrieves mutable data from the sources using Kafka and Spark Streaming, extracts features, and saves the result to Cassandra.
    2. Retrieves immutable data from the sources using Kafka and Spark Streaming, extracts features, and saves the result to disk as Parquet files (sketched below).
    3. Loads data from the Parquet files and computes a match percentage between all immutable objects and a single mutable object from Cassandra.
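
    A sketch of app 2's shape, written against today's Structured Streaming API for brevity (the original used Spark Streaming); topic, broker, and paths are hypothetical:

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("immutable-ingest").getOrCreate()

      stream = (spark.readStream
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker:9092")
                .option("subscribe", "immutable-objects")
                .load())

      # The real job extracted features here; the CAST is a stand-in.
      features = stream.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

      (features.writeStream
       .format("parquet")
       .option("path", "s3://bucket/features/")
       .option("checkpointLocation", "s3://bucket/checkpoints/")
       .start())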

    In order to query data in a reasonable amount of time, I denormalized the PostgreSQL tables (billions of records) using Elasticsearch. This improved the read performance by two orders of magnitude but added a write penalty. Because only parts of the documents were changing frequently, the problem was solved using Elasticsearch bulk partial updates with Groovy scripts for complex fields.
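
    A sketch of the bulk partial-update pattern (Groovy scripting as in Elasticsearch 2.x); index, field, and variable names are hypothetical:

      from elasticsearch import Elasticsearch, helpers

      es = Elasticsearch(["http://localhost:9200"])
      changed = [{"id": "42", "new_tag": "blue"}]  # example of changed rows

      actions = (
          {
              "_op_type": "update",
              "_index": "objects",
              "_type": "doc",
              "_id": row["id"],
              # The Groovy script mutates only the complex field in place
              # instead of reindexing the whole document.
              "script": {
                  "inline": "ctx._source.tags.add(tag)",
                  "lang": "groovy",
                  "params": {"tag": row["new_tag"]},
              },
          }
          for row in changed
      )
      helpers.bulk(es, actions)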

  • Single Sign-on Platform

    I led the implementation of a single sign-on platform, using OpenID Connect, Django, PostgreSQL, Angular/TypeScript, and AWS, which integrated the customer's entire product suite.

  • Machine Learning

    While working as a contractor, I used NLTK, Scikit-learn, AWS Transcribe, and AWS Lambda to run sentiment analysis aimed at improving customer support.

  • High-content-streaming Platform

    I led the implementation of a high-content-streaming platform using Django, PostgreSQL, Elasticsearch, and AWS. The system allowed pushing/pulling content efficiently using a RESTful API.

  • Spark Secondary Sort
    https://www.qwertee.io/blog/spark-secondary-sort/

    While working with Spark, I noticed a lack of proper documentation for PySpark. That motivated me to put together this article, which aims to thoroughly explain the secondary sort design pattern using PySpark.

    The article covers two solutions. The first uses Spark's groupByKey, which sorts in the reducer phase; the second uses Spark's repartitionAndSortWithinPartitions, which leverages the shuffle phase and iterator-to-iterator transformations to sort more efficiently without running out of memory.
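
    A condensed PySpark sketch of the second approach; the dataset is illustrative:

      from pyspark import SparkContext

      sc = SparkContext("local[*]", "secondary-sort")

      # (key, value) records whose values we want sorted within each key.
      data = sc.parallelize([("a", 3), ("b", 1), ("a", 1), ("b", 2), ("a", 2)])

      # Build a composite key (key, value), partition on the natural key
      # only, and let the shuffle sort on the full composite key.
      composite = data.map(lambda kv: ((kv[0], kv[1]), None))
      result = composite.repartitionAndSortWithinPartitions(
          numPartitions=2,
          partitionFunc=lambda k: hash(k[0]),  # partition by natural key
          keyfunc=lambda k: k,                 # sort by (key, value)
      )

      # Each partition now holds whole key groups with values already sorted.
      print(result.keys().collect())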

  • PostgreSQL Data Partitioning and Django
    https://www.qwertee.io/blog/postgres-data-partitioning-and-django/

    I wrote this article about PostgreSQL and data partitioning in general with practical examples using the Django web framework.

    The first part of the article covers the reasons to partition data and how partitioning can be implemented in PostgreSQL.

    The second part describes a few solutions one can use to take advantage of PostgreSQL partitioning while using a web framework such as Django.
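
    A sketch of one such solution: declarative range partitioning created through a Django migration's RunSQL (table and app names are hypothetical):

      from django.db import migrations

      SQL = """
      CREATE TABLE metrics (
          id bigserial,
          recorded_at timestamptz NOT NULL,
          value numeric
      ) PARTITION BY RANGE (recorded_at);

      CREATE TABLE metrics_2024 PARTITION OF metrics
          FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
      """

      class Migration(migrations.Migration):
          dependencies = [("metrics", "0001_initial")]
          operations = [migrations.RunSQL(SQL)]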

  • PostgreSQL B-Tree Index Explained | Part 1
    https://www.qwertee.io/blog/postgresql-b-tree-index-explained-part-1/

    Working with PostgreSQL for many years inspired me to write an article about indexes that every developer should know.

    In the first section of the article, I explain the fundamentals of the PostgreSQL B-tree index structure, focusing on the B-tree data structure and its main components (leaf nodes and internal nodes), their roles, and how they are accessed while executing a query (a process also known as an index lookup). The section ends with an index classification and a broader view of index key arity, which helps explain the impact on a query's performance.

    In the second section, there is a quick introduction to the query plan to better understand the examples that follow.

    In the third section, I introduce the concept of predicates and explain the process PostgreSQL uses to classify them depending on index definition and feasibility.

    The last section goes deeper into the mechanics of scans and how an index definition, data distribution, or even predicate usage can influence the performance of a query.

Skills

  • Languages

    Python, C++, Java, Delphi, C#, TypeScript, JavaScript
  • Frameworks

    Hadoop, Apache Spark, JSON Web Tokens (JWT), OAuth 2, Django REST Framework, Django, .NET, Spring, Twisted, Spring MVC, Angular
  • Libraries/APIs

    PySpark, NumPy, Pandas, Asyncio, OpenSSL, Scikit-learn, ZeroMQ, NLTK
  • Tools

    Celery, RabbitMQ, WildFly, Ganglia, Amazon Elastic MapReduce (EMR), VMware vSphere, Apache Avro, Subversion (SVN), Git, Jenkins
  • Platforms

    Amazon Web Services (AWS), Apache Kafka, Windows, Linux, Ubuntu, Oracle Database, Unix, Jupyter Notebook, HP-UX, Solaris, Docker
  • Storage

    Elasticsearch, PostgreSQL, Microsoft SQL Server, Cassandra, AWS S3, Memcached, Redis, Oracle RDBMS, MongoDB, SQL Server 2012, MySQL
  • Other

    RESTful Web Services, Data Mining, Data Engineering, OpenID Connect (OIDC), WebLogic, AWS, Parquet, VMware ESXi, Apache Cassandra, NATS, Apache Flume, Cryptography, Tornado
  • Paradigms

    Scrum

Education

  • Bachelor's Degree in Computer Science
    2010 - 2013
    Babeș-Bolyai University - Cluj-Napoca, Romania

Certifications

  • Certified Scrum Master
    MAY 2014 - MAY 2016
    Scrum Alliance
  • Oracle Certified Associate, Java SE 7 Programmer
    APRIL 2014 - PRESENT
    Net BRINEL SA
