Data Science

Engineering > Back-end

The Definitive Guide to NoSQL Databases

by Mohamad Altarade

Limited SQL scalability has prompted the industry to develop and deploy a number of NoSQL database management systems, with a focus on performance, reliability, and consistency. The trend was driven by proprietary NoSQL databases developed by Google and Amazon. Eventually, open-source systems like MongoDB, Cassandra, and Hypertable brought NoSQL within reach of everyone. In this post, Senior Software Engineer Mohamad Altarade dives into some of them and explains why NoSQL will probably be with us for years to come.

16 minute read

Engineering > Data Science and Databases

A Data Engineer's Guide To Non-Traditional Data Storages

by Ken Hu

With the rise of big data and data science, storage and retrieval have become a critical pipeline component for data use and analysis. Recently, new data storage technologies have emerged. But the question is: Which one should you choose? Which one is best suited for data engineering? In this article, Toptal Data Scientist Ken Hu compares three prominent storage technologies within the context of data engineering.

7 minute read

Engineering > Data Science and Databases

An HDFS Tutorial for Data Analysts Stuck with Relational Databases

by Dallas H. Snider

The Hadoop Distributed File System (HDFS) is a scalable, open-source solution for storing and processing large volumes of data. With its built-in replication and resilience to disk failures, HDFS is an ideal system for storing and processing data for analytics. In this step-by-step tutorial, Toptal Database Developer Dallas H. Snider details how to migrate existing data from a PostgreSQL database into the more efficient HDFS.

10 minute read

Engineering > Data Science and Databases

A Comprehensive Introduction To Your Genome With the SciPy Stack

by Zhuyi Xue

Genome data is one of the most widely analyzed datasets in the realm of Bioinformatics. The SciPy stack offers a suite of popular Python packages designed for numerical computing, data transformation, analysis and visualization, which is ideal for many bioinformatic analysis needs. In this tutorial, Toptal Software Engineer Zhuyi Xue walks us through some of the capabilities of the SciPy stack. He also answers some interesting questions about the human genome, including: How much of the genome is incomplete? How long is a typical gene?

23 minute read

Engineering > Back-end

Boost Your Data Munging with R

by Jan Gorecki

As a language, R is strongly tied to data and is thus used mostly by statisticians and data scientists. Many who already use R for machine learning, though, are not aware that data munging can be done faster in R, meaning another tool is not required for that task. In this article, Freelance Software Engineer Jan Gorecki explores tabular data transformations and introduces us to one of the fastest open-source data wrangling tools available.

17 minute read

Engineering > Data Science and Databases

Clustering Algorithms: From Start To State Of The Art

by Lovro Iliassich

Clustering algorithms are central to unsupervised learning and are key elements of machine learning in general. These algorithms give meaning to data that are not labelled and help find structure in chaos. But not all clustering algorithms are created equal; each has its own pros and cons. In this article, Toptal Freelance Software Engineer Lovro Iliassich explores a heap of clustering algorithms, from the well-known K-Means algorithm to the elegant, state-of-the-art Affinity Propagation technique.

11 minute read

Engineering > Data Science and Databases

Tree Kernels: Quantifying Similarity Among Tree-structured Data

by Dino Causevic

Today, a massive amount of data is available in the form of networks or graphs: for example, the World Wide Web, with its web pages and hyperlinks, as well as social networks, semantic networks, biological networks, and citation networks for scientific literature. A tree is a special type of graph and is naturally suited to representing many types of data. The analysis of trees is an important field in computer and data science. In this article, we will look at the analysis of the link structure in trees. In particular, we will focus on tree kernels, a method for comparing tree graphs to each other, allowing us to get quantifiable measurements of their similarities or differences. This is an important process for many modern applications such as classification and data analysis.

12 minute read

Engineering > Data Science and Databases

Ensemble Methods: Elegant Techniques to Produce Improved Machine Learning Results

by Necati Demir, PhD

Machine learning, in computing, is where art meets science. Perfecting a machine learning tool is largely a matter of understanding the data and choosing the right algorithm. But why choose one algorithm when you can choose many and make them all work toward one thing: improved results? In this article, Toptal Engineer Necati Demir walks us through some elegant ensemble methods, in which a combination of data splits and multiple algorithms is used to produce machine learning results with higher accuracy.

6 minute read

Engineering > Back-end

Guide To Budget-friendly Data Mining

by Jeffrey Shumaker

Although database programming does not evolve at nearly the same pace as traditional application programming, recent advancements in several fields are bringing new techniques and technologies within the reach of small and independent developers. In this guide, Toptal Freelance Software Engineer Jeffrey Shumaker explains how developers can quickly and easily tap these methods to identify database issues they may not even be aware of, and how they can build excellent data mining tools without spending a lot on expensive software licenses.

9 minute read