Data Science and Databases posts

The Toptal Engineering Blog is a platform for sharing projects and discussing technologies.
Roman Vashchegin
Conquer String Search with the Aho-Corasick Algorithm

The Aho-Corasick algorithm can be used to efficiently search for multiple patterns in a large blob of text, making it a really useful algorithm in data science and many other areas.

In this article, Toptal Freelance Software Engineer Roman Vashchegin shows how the Aho-Corasick algorithm uses a trie data structure to efficiently match a dictionary of words against any text.

Continue reading →
Yuri da Silva Villas Boas
Getting Started with the SRVB Cryptosystem

This article will give you an introduction to the principles behind public-key cryptosystems and introduce you to the Santana Rocha-Villas Boas (SRVB) cryptosystem, developed by the author of the article and prof. Daniel Santana Rocha. The algorithm authors are making a campaign that includes a financial reward to anyone who manages to crack the code.

Continue reading →
Gabriel Livan
The 12 Worst Mistakes Advanced WordPress Developers Make

WordPress is a very popular way to get a site up and running quickly. However, in their haste, plenty of developers end up making horrible decisions. Some mistakes, like leaving WP_DEBUG set to “true,” may be easy to make. Others, like lumping all your JavaScript into a single file, are as common as lazy engineers. Whichever mistake you manage to make, read on to find out the 12 most common WordPress mistakes that new and seasoned developers make.

Continue reading →
Dusan Simonovic
Get Started With Microservices: A Dropwizard Tutorial

Dropwizard allows developers to quickly bootstrap their projects and package applications as easily deployable standalone services. It also happens to be relatively simple to use and implement.

In this tutorial, Toptal Freelance Software Engineer Dusan Simonovic will introduce you to Dropwizard and demonstrate how you can use this powerful framework to create RESTful web services with ease.

Continue reading →
Hanee' Medhat
Apache Spark Streaming Tutorial: Identifying Trending Twitter Hashtags

Social networks are among the biggest sources of data today, and this means they are an extremely valuable asset for marketers, big data specialists, and even individual users like journalists and other professionals. Harnessing the potential of real-time Twitter data is also useful in many time-sensitive business processes.

In this article, Toptal Freelance Software Engineer Hanee’ Medhat explains how you can build a simple Python application to leverage the power of Apache Spark, and then use it to read and process tweets to identify trending hashtags.

Continue reading →
Josip Šaban
SQL Server 2016 Always Encrypted: Easy to Implement, Tough to Crack

Security has always been a primary concern for database experts, and with the advent of new, decentralized services, it’s become even more crucial. Microsoft addressed the need for an added level of security in SQL with the introduction of Always Encrypted functionality in SQL Server 2016.

In this blog post, Toptal Freelance Software Engineer Josip Saban explains how Microsoft’s Always Encrypted concept works, how it’s implemented, and why developers can’t afford to ignore it.

Continue reading →
Juan Pablo Carzolio
A Guide to Consistent Hashing

Consistent Hashing is a distributed hashing scheme that operates independently of the number of servers or objects in a distributed hash table. It powers many high-traffic dynamic websites and web applications.

In this tutorial, Toptal Freelance Software Engineer Juan Pablo Carzolio will walk us through what it is and how hashing, distributed hashing and consistent hashing work.

Continue reading →
Mohammad Altarade
The Definitive Guide to NoSQL Databases

Limited SQL scalability has prompted the industry to develop and deploy a number of NoSQL database management systems, with a focus on performance, reliability, and consistency. The trend was driven by proprietary NoSQL databases developed by Google and Amazon. Eventually, open-source systems like MongoDB, Cassandra, and Hypertable brought NoSQL within reach of everyone.

In this post, Toptal Software Engineer Mohamad Altarade dives into some of them and explains why NoSQL will probably be with us for years to come.

Continue reading →
Ken Hu
A Data Engineer's Guide To Non-Traditional Data Storages

With the rise of big data and data science, storage and retrieval have become a critical pipeline component for data use and analysis. Recently, new data storage technologies have emerged. But the question is: Which one should you choose? Which one is best suited for data engineering?

In this article, Toptal Data Scientist Ken Hu compares three prominent storage technologies within the context of data engineering.

Continue reading →
Marian Paul
Migrate Legacy Data Without Screwing It Up

Nobody wants to leave valuable customer data behind. Unfortunately, though, the hardest part of data migration to a complex CRM system, such as Salesforce, is the handling of legacy data.

In this article, Toptal Software Engineer Marian Paul provides 10 tips for successful legacy data migration to Salesforce.

Continue reading →
Dallas H. Snider
An HDFS Tutorial for Data Analysts Stuck With Relational Databases

The Hadoop Distributed File System (HDFS) is a scalable, open source solution for storing and processing large volumes of data. With its built-in replication and resilience to disk failures, HDFS is an ideal system for storing and processing data for analytics.

In this step-by-step tutorial, Toptal Database Developer Dallas H. Snider details how to migrate existing data from a PostgreSQL database into the more efficient HDFS.

Continue reading →
Zhuyi Xue
A Comprehensive Introduction To Your Genome With the SciPy Stack

Genome data is one of the most widely analyzed datasets in the realm of Bioinformatics. The SciPy stack offers a suite of popular Python packages designed for numerical computing, data transformation, analysis and visualization, which is ideal for many bioinformatic analysis needs.

In this tutorial, Toptal Software Engineer Zhuyi Xue walks us through some of the capabilities of the SciPy stack. He also answers some interesting questions about the human genome, including: How much of the genome is incomplete? How long is a typical gene?

Continue reading →
Jan Gorecki
Boost Your Data Munging with R

As a language, R is strongly tied to data and is thus used mostly by statisticians and data scientists. Many who already use R for machine learning, though, are not aware that data munging can be done faster in R, meaning another tool is not required for that task.

In this article, Freelance Software Engineer Jan Gorecki explores tabular data transformations and introduces us to one of the fastest open-source data wrangling tools available.

Continue reading →
Nirmel Murtic
Bidirectional Relationship Support in JSON

Ever tried to create a JSON data structure that includes entities with bidirectional relationships? If you have, you know that this often results in errors or exceptions being thrown.

In this article, Toptal Freelance Software Engineer Nirmel Murtic provides a robust working approach to avoiding these errors when creating JSON structures that included entities with bidirectional (i.e. circular) relationships.

Continue reading →
Andrea Nalon
The Rise Of Automated Trading: Machines Trading the S&P 500

More than 60 percent of trading activities with different assets rely on automated trading and machine learning instead of human traders. Today, specialized programs based on particular algorithms and learned patterns automatically buy and sell assets in various markets, with a goal to achieve a positive return in the long run.

In this article, Toptal Freelance Data Scientist Andrea Nalon explains how to predict, using machine learning and Python, which trade should be made next on the S&P 500 to get a positive gain.

Continue reading →
Lovro Iliassich
Clustering Algorithms: From Start To State Of The Art

Clustering algorithms are very important to unsupervised learning and are key elements of machine learning in general. These algorithms give meaning to data that are not labelled and help find structure in chaos. But not all clustering algorithms are created equal; each has its own pros and cons.

In this article, Toptal Freelance Software Engineer Lovro Iliassich explores a heap of clustering algorithms, from the well known K-Means algorithm to the elegant, state-of-the-art Affinity Propagation technique.

Continue reading →
Rohit Boggarapu
A Tutorial on Drill-down FusionCharts in jQuery

When dealing with data analysis, most companies rely on MS Excel or Google Sheets, but dealing with data presented this way isn’t very eye-catching or intuitive. It’s once you add visualizations to this data that things become a little easier to manage. That’s the topic of today’s tutorial by our guest author from Adobe, Rohit Boggarapu. Join us as he guides us though the process of making interactive drill-down charts using jQuery and FusionCharts.

Continue reading →
Dino Causevic
Tree Kernels: Quantifying Similarity Among Tree-Structured Data

Today, a massive amount of data is available in the form of networks or graphs. For example, the World Wide Web, with its web pages and hyperlinks, social networks, semantic networks, biological networks, citation networks for scientific literature, and so on.

A tree is a special type of graph, and is naturally suited to represent many types of data. The analysis of trees is an important field in computer and data science. In this article, we will look at the analysis of the link structure in trees. In particular, we will focus on tree kernels, a method for comparing tree graphs to each other, allowing us to get quantifiable measurements of their similarities or differences. This an important process for many modern applications such as classification and data analysis.

Continue reading →
Radek Ostrowski
How I Used Apache Spark and Docker in a Hackathon to Build a Weather App

Hackathons often inspire engineers to create amazing software. By blending various technologies together, really useful and often fun projects can be realized in a short period of time.

In this article, Toptal engineer Radek Ostrowski shares his experience participating in the IBM Sparkathon, and walks us through how he elegantly combined the power of Apache Spark and Docker in IBM Bluemix to build a weather app.

Continue reading →
Necati Demir
Ensemble Methods: Elegant Techniques to Produce Improved Machine Learning Results

Machine Learning, in computing, is where art meets science. Perfecting a machine learning tool is a lot about understanding data and choosing the right algorithm. But why choose one algorithm when you can choose many and make them all work to achieve one thing: improved results.

In this article, Toptal Engineer Necati Demir walks us through some elegant techniques of ensemble methods where a combination of data splits and multiple algorithms is used to produce machine learning results with higher accuracy.

Continue reading →
Ivan Voras
Guide to Multi-processing Network Server Models

In this article, Toptal engineer Ivan Voras provides a useful overview and comparison of multi-processing network server models, with the goal being to take some of the mystery out of writing high performance networking code. The article is intended for “system programmers”, i.e., back-end developers who will work with the low-level details of their applications, implementing network server code.

Continue reading →
Ivan Bojovic
MySQL Master-Slave Replication on the Same Machine

Developers often work on only one machine, and have their whole development environment on that machine. Testing database replication before deploying changes in this kind of a development environment can be a challenging task.

In this article, Toptal engineer Ivan Bojovic guides us through a step-by-step tutorial on how to implement MySQL master-slave replication on one machine.

Continue reading →
Artem Galtsev
Building an IMAP Email Client with PHP

Developers sometimes run into tasks that require access to email mailboxes. In most cases, this is done using the Internet Message Access Protocol, or IMAP. As a PHP developer, I first turned to PHP’s built in IMAP library, but this library is buggy and impossible to debug or modify. So today we will create a working IMAP email client from the ground up using PHP. We will also see how to use Gmail’s special commands.

Continue reading →
Daniel Angel Muñoz Trejo
Optimized Successive Mean Quantization Transform

Image processing algorithms are often very resource intensive due to fact that they process pixels on an image one at a time and often requires multiple passes. Successive Mean Quantization Transform (SMQT) is one such resource intensive algorithm that can process images taken in low-light conditions and reveal details from dark regions of the image.

In this article, Toptal engineer Daniel Angel Munoz Trejo gives us some insight into how the SMQT algorithm works and walks us through a clever optimization technique to make the algorithm a viable option for handheld devices.

Continue reading →
Rob Moore
Towards Updatable D3.js Charts

When Mike Bostock created D3.js, he introduced a tried and true reusable charts pattern for implementing the same chart in any number of selections. However, the limitations of this pattern are realized once the chart is initialized. In this article, Toptal engineer Rob Moore presents a revised reusable charts pattern that leverages the full power of D3.js.

Continue reading →
Nermin Hajdarbegovic
Google Cloud Source Repositories vs. Bitbucket vs. GitHub: A Worthy Alternative?

Google’s new cloud code platform does not appear to be taking on GitHub head on. Instead, Cloud Source Repositories (CSR) will allow users to connect to repositories hosted on GitHub or Bitbucket. However, everything is automatically synced to the Google Cloud Source Repository.

The good news is that a Google CSR can be connected to another Git repository hosted on GitHub or Bitbucket. All changes will be synchronised on both platforms, as you can set Google CSR to automatically mirror from GitHub and Bitbucket.

Continue reading →
Sigfried Gold
Ultimate In-memory Data Collection Manipulation with Supergroup.js

In-memory data collection manipulation is something that we often need to do in data-centric reporting and visualization applications. When needed, we often tend to resort to complex loops, list comprehensions, and other suboptimal means, which can easily end up being a huge mess of hard-to-maintain spaghetti code. Supergroup.js is an in-memory data manipulation library that can be used to solve some common data manipulation challenges on limited datasets.

Continue reading →
Sripal Reddy Vindyala
How to Tune Microsoft SQL Server for Performance

To retain its users, any application or website must run fast. For mission critical environments, a couple of milliseconds delay in getting information might create big problems. As database sizes grow day by day, we need to fetch data as fast as possible, and write the data back into the database as fast as possible. To make sure all operations are executing smoothly, we have to tune Microsoft SQL Server for performance.

Continue reading →
Elder Santos
Data Mining for Predictive Social Network Analysis

Analysts have come to recognize social network data as a virtual treasure trove of information for sensing public opinion trends and groundswells of support. In this article, Toptal Engineer Elder Santos describes the techniques he employed for a proof-of-concept that effectively analyzed Twitter Trend Topics to predict, as a sample test case, regional voting patterns in the 2014 Brazilian presidential election.

Continue reading →
Ivan Matec
Azure Tutorial: Predicting Gas Prices Using Azure Machine Learning Studio

Machine learning has changed the way we deal with data. Data driven problems, that are difficult to solve using standard methods, can often be tackled with much more ease using machine learning algorithms. In this article, we will explore Azure Machine Learning features and capabilities through solving one of the problems that we face in our everyday lives.

Continue reading →
Avinash Kaza
Business Intelligence Platform: Tutorial Using MongoDB Aggregation Pipeline

In today’s data driven world, researches are busy answering interesting questions by churning through huge volumes of data. Some obvious challenges they face are due the sheer size of dataset that they have to deal with. In this article, we take a peek at a simple business intelligence platform implemented on top of the MongoDB Aggregation Pipeline.

Continue reading →
Radek Ostrowski
Introduction to Apache Spark with Examples and Use Cases

In this post, Toptal engineer Radek Ostrowski introduces Apache Spark – fast, easy-to-use, and flexible big data processing. Billed as offering “lightning fast cluster computing”, the Spark technology stack incorporates a comprehensive set of capabilities, including SparkSQL, Spark Streaming, MLlib (for machine learning), and GraphX. Spark may very well be the “child prodigy of big data”, rapidly gaining a dominant position in the complex world of big data processing.

Continue reading →
Ivan Voras
Installing Django on IIS: A Step-by-Step Tutorial

Although the most wide-spread and supported way of running Django is on a Linux system (e.g., with uwsgi and nginx), it actually doesn’t take much work to get it to run on IIS. In this article, Toptal Engineer Ivan Voras walks you through a step-by-step tutorial, clearly explaining how to install Django on IIS.

Continue reading →
Ahmed Al-Amir
Needle in a Haystack: A Nifty Large-Scale Text Search Algorithm Tutorial

When coming across the term “text search”, one usually thinks of a large body of text, which is indexed in a way that makes it possible to quickly look up one or more search terms when they are entered by a user. This is a classic problem in computer science, to which many solutions exist.

But how about a reverse scenario? What if what’s available for indexing beforehand is a group of search phrases, and only at runtime is a large body of text presented for searching?

Continue reading →
Ivan Dimoski
Top 10 Most Common Mistakes That Android Developers Make: A Programming Tutorial

There are thousands of different Android powered devices, with different screen sizes, chip architectures, hardware configurations, and software versions. Unfortunately, segmentation is the price to pay for openness, and there are thousands ways your app can fail on different devices.

Regardless of such huge segmentation, the majority of bugs are actually introduced because of logic errors. These bugs are easily prevented, as long as we get the basics right!

Here’s a quick rundown of the 10 most common mistakes Android developers make.

Continue reading →
Alex Rattray
Simple Data Flow in React Apps Using Flux and Backbone: A Tutorial with Examples

React.js is a fantastic library. It is only one part of a front-end application stack, however. It doesn’t have much to offer when it comes to managing data and state. Facebook, the makers of React, have offered some guidance there in the form of Flux. I’ll introduce basic Flux control flow, discuss what’s missing for Stores, and how to use Backbone Models and Collections to fill the gap in a “Flux-compliant” way.

Continue reading →
Francisco Sanchez Clariá
Data Encoding: A Guide to UTF-8 for PHP and MySQL

Once you step beyond the comfortable confines of English-only character sets, you quickly find yourself entangled in the wonderfully wacky world of UTF-8.

Indeed, navigating through UTF-8 related issues can be a frustrating and hair-pulling experience. This post provides a concise cookbook for addressing these issues when working with PHP and MySQL in particular, based on practical experience and lessons learned.

Continue reading →
Gergely Kalman
Fixing the “Heartbleed” OpenSSL Bug: A Tutorial for Sys Admins

A potentially critical problem, nicknamed “Heartbleed”, has surfaced in the widely-used OpenSSL cryptographic library. The vulnerability is particularly dangerous in that potentially critical data can be leaked and the attack leaves no trace.

As a user, chances are that sites you frequent regularly are affected and your data may have been compromised. As a developer or sys admin, sites or servers you’re responsible for are likely to have been affected.

Here are the key facts you need to know about this dangerous bug and how to mitigate your vulnerability.

Continue reading →
Ryan Wilcox
Your Boss Won't Appreciate TDD: Try This Behavior-Driven Development Example

Testing. It always seems to get left to the last minute, then cut because you’re out of time, budget, or whatever else. Management wonders why developers can’t just “get it right the first time”, and developers (especially on large systems) can be taken off-guard when different stakeholders describe different parts of the system.

With behavior-driven development, you can turn testing into a shared process that focuses on the behaviors of the system, why they matter, and who cares.

Continue reading →
Steven S. Morgan
Anti-Patterns in Telecommuting

As a veteran telecommuter through multiple jobs in my career, I have witnessed and experienced the many joys of being a remote worker. As for the horror stories, I have more than a few I could tell. With a bit of artistic inclination and a talent for mathematics, I also have a fascination with patterns: design patterns, architectural patterns, behavioral patterns, social patterns, weather patterns—all sorts of patterns!

When I first encountered anti-patterns, I discovered a trove of wisdom I wish I had known before I had learned the hard way. Anti-patterns are recognizable repeated patterns that contribute significantly to failure. For example, the manager that keeps interrupting the employee in order to see if the employee is getting any work done is engaging in an anti-pattern that serves to prevent the employee from getting any work done!

Based on my own experiences and experiences of friends and co-workers, I am assembling descriptions of anti-patterns related to telecommuting.

Continue reading →
Gergely Kalman
With a Filter Bypass and Some Hexadecimal, Hacked Credit Card Numbers Are Still, Still Google-able

In 2007, Bennett Haselton revealed a minor hack with major implications: querying ranges of numbers on Google would return pages of sensitive information, including Credit Card numbers, Social Security numbers, and more. While Haselton’s hack was addressed and patched, I was able to tweak his original technique to bypass Google’s filter and return the same old dangerous results.

Continue reading →
Rodrigo Koch
SQL Database Performance Tuning for Developers

Database tuning can be an incredibly difficult task, particularly when working with large-scale data where even the most minor change can have a dramatic (positive or negative) impact on performance.

In mid-sized and large companies, most database tuning will be handled by a Database Administrator (DBA). But believe me, there are plenty of developers out there who have to perform DBA-like tasks. Further, in many of the companies I’ve seen that do have DBAs, they often struggle to work well with developers—the positions simply require different modes of problem solving, which can lead to disagreement among coworkers.

In this article, I’d like to both provide developers with some developer-side database tuning tips and explain how developers and DBAs can work together effectively.

Continue reading →
Anna Chiara Bellini
The Trie: A Neglected Data Structure

From the very first days in our lives as programmers, we’ve all dealt with data structures: Arrays, linked lists, trees, sets, stacks and queues are our everyday companions, and the experienced programmer knows when and why to use them.

In this article we’ll see how an oft-neglected data structure, the trie, really shines in application domains with specific features, like word games.

Continue reading →
Paulo Renato Campos de Siqueira
Scaling Play! to Thousands of Concurrent Requests

Web Developers often fail to consider the consequences of thousands of users accessing our applications at the same time. Perhaps it’s because we love to rapidly prototype; perhaps it’s because testing such scenarios is simply hard.

Regardless, I’m going to argue that ignoring scalability is not as bad as it sounds—if you use the proper set of tools and follow good development practices. In this case: the Play! framework and the Scala language.

Continue reading →
Alejandro Rigatuso
Growing Growth: Perform Your Own Cohort Analysis with This Open Source Code

But this isn’t just another article about cohort analysis. If you already know the importance of the topic and want to skip the introduction, you can jump to the simulator, where you can either simulate startup growth based on retention, churn, and a number of other factors, or analyze your own PayPal logs with the code I’ve open sourced.

If, however, you don’t realize that these are some of the most important metrics around–continue reading.

Continue reading →
Gergely Kalman
How I Made Porn 20x More Efficient with Python Video Streaming

Porn is a big industry. There aren’t many sites on the Internet that can rival the traffic of its biggest players.

And juggling this immense traffic is tough. To make things even harder, much of the content served from porn sites is made up of low latency live streams rather than simple static video content. But for all of the challenges involved, rarely have I read about the developers who take them on. So I decided to write about my own experience on the job.

Continue reading →