Big Data Posts

The Toptal Engineering Blog is a hub for in-depth development tutorials and new technology announcements created by professional software engineers in the Toptal network.
Roman Vashchegin
Conquer String Search with the Aho-Corasick Algorithm

The Aho-Corasick algorithm efficiently searches for multiple patterns in a large blob of text, which makes it useful in data science and many other areas.

In this article, Toptal Freelance Software Engineer Roman Vashchegin shows how the Aho-Corasick algorithm uses a trie data structure to efficiently match a dictionary of words against any text.

Juan Pablo Carzolio
A Guide to Consistent Hashing

Consistent hashing is a distributed hashing scheme that operates independently of the number of servers or objects in a distributed hash table. It powers many high-traffic dynamic websites and web applications.

In this tutorial, Toptal Freelance Software Engineer Juan Pablo Carzolio walks us through what consistent hashing is and how hashing, distributed hashing, and consistent hashing work.
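
To make "independent of the number of servers" concrete, here is a minimal sketch of a hash ring in Python. It is an illustration under assumptions, not the tutorial's implementation: the `HashRing` class, the choice of MD5, and the absence of virtual nodes are all choices made for this example.

```python
import bisect
import hashlib


def _hash(key):
    """Map a string to a point on the ring (MD5 is an arbitrary choice here)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical minimal ring: one point per server, no virtual nodes."""

    def __init__(self, servers=()):
        self._points = []   # sorted ring positions
        self._owner = {}    # ring position -> server name
        for server in servers:
            self.add(server)

    def add(self, server):
        point = _hash(server)
        bisect.insort(self._points, point)
        self._owner[point] = server

    def remove(self, server):
        point = _hash(server)
        self._points.remove(point)
        del self._owner[point]

    def lookup(self, key):
        """Return the first server clockwise from the key's position."""
        if not self._points:
            raise LookupError("ring has no servers")
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


if __name__ == "__main__":
    ring = HashRing(["server-a", "server-b", "server-c"])
    keys = ["key%d" % i for i in range(10)]
    before = {k: ring.lookup(k) for k in keys}
    ring.remove("server-b")
    moved = [k for k in keys if ring.lookup(k) != before[k]]
    # Only keys that lived on server-b need a new home.
    print(all(before[k] == "server-b" for k in moved))
```

With plain modulo hashing (`hash(key) % n`), changing `n` remaps most keys; on the ring, removing a server only reassigns the keys that server owned.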

Dino Causevic
Tree Kernels: Quantifying Similarity Among Tree-Structured Data

Today, a massive amount of data is available in the form of networks or graphs: for example, the World Wide Web with its web pages and hyperlinks, social networks, semantic networks, biological networks, and citation networks for scientific literature.

A tree is a special type of graph and is naturally suited to representing many types of data. The analysis of trees is an important field in computer and data science. In this article, we will look at the analysis of the link structure in trees. In particular, we will focus on tree kernels, a method for comparing trees to each other that gives us quantifiable measurements of their similarities or differences. This is an important process for many modern applications, such as classification and data analysis.

Michele Sciabarra
Developing for the Cloud in the Cloud: BigData Development with Docker in AWS

More and more people are moving their work from desktop applications to equivalent web applications in the cloud. Unfortunately, this has not been true for software development IDEs. Although there have been some attempts to provide an online IDE, they have not come anywhere close to traditional IDEs.

In this article, Toptal Freelance Software Engineer Michele Sciabarra shows how to build a cloud-based development environment for Scala and big data applications with the help of Docker on AWS.

Avinash Kaza
Business Intelligence Platform: Tutorial Using MongoDB Aggregation Pipeline

In today’s data-driven world, researchers are busy answering interesting questions by churning through huge volumes of data. Some obvious challenges they face are due to the sheer size of the datasets they have to deal with. In this article, we take a peek at a simple business intelligence platform implemented on top of the MongoDB Aggregation Pipeline.

Doug Sparling
Full Text Search of Dialogues with Apache Lucene: A Tutorial

Apache Lucene is a powerful Java library for implementing full-text search on a corpus of text. With its wide array of configuration options and customizability, Lucene can be tuned specifically to the corpus at hand, improving both search quality and query capability.

This article gives us a glimpse of the simplicity and ease of customization of the Apache Lucene analysis pipeline.

Radek Ostrowski
Introduction to Apache Spark with Examples and Use Cases

In this post, Toptal engineer Radek Ostrowski introduces Apache Spark – fast, easy-to-use, and flexible big data processing. Billed as offering “lightning fast cluster computing”, the Spark technology stack incorporates a comprehensive set of capabilities, including SparkSQL, Spark Streaming, MLlib (for machine learning), and GraphX. Spark may very well be the “child prodigy of big data”, rapidly gaining a dominant position in the complex world of big data processing.
