The Protein Data Bank (PDB) bioinformatics database is the world’s largest repository of experimentally determined structures of proteins, nucleic acids, and complex assemblies. All data is gathered using experimental methods such as X-ray crystallography and NMR spectroscopy, among others. This article explains how to extract, filter, and clean data from the PDB to make it suitable for further analysis.
In this article, Toptal Freelance Software Engineer Neven Pičuljan introduces you to the intricacies of deep learning in hedge funds and finance in general.
Limited SQL scalability has prompted the industry to develop and deploy a number of NoSQL database management systems, with a focus on performance, reliability, and consistency. The trend was driven by proprietary NoSQL databases developed by Google and Amazon. Eventually, open-source systems like MongoDB, Cassandra, and Hypertable brought NoSQL within reach of everyone.
In this post, Senior Software Engineer Mohamad Altarade dives into some of them and explains why NoSQL will probably be with us for years to come.
With the rise of big data and data science, storage and retrieval have become a critical pipeline component for data use and analysis. Recently, new data storage technologies have emerged. But the question is: Which one should you choose? Which one is best suited for data engineering?
In this article, Toptal Data Scientist Ken Hu compares three prominent storage technologies within the context of data engineering.
The Hadoop Distributed File System (HDFS) is a scalable, open-source solution for storing and processing large volumes of data. With its built-in replication and resilience to disk failures, HDFS is an ideal system for storing and processing data for analytics.
In this step-by-step tutorial, Toptal Database Developer Dallas H. Snider details how to migrate existing data from a PostgreSQL database into the more efficient HDFS.
Genome data is one of the most widely analyzed types of data in bioinformatics. The SciPy stack offers a suite of popular Python packages designed for numerical computing, data transformation, analysis, and visualization, which makes it a good fit for many bioinformatics analysis needs.
In this tutorial, Toptal Software Engineer Zhuyi Xue walks us through some of the capabilities of the SciPy stack. He also answers some interesting questions about the human genome, including: How much of the genome is incomplete? How long is a typical gene?
As a language, R is strongly tied to data and is thus used mostly by statisticians and data scientists. Many who already use R for machine learning, though, are not aware that data munging can be done faster in R, meaning another tool is not required for that task.
In this article, Freelance Software Engineer Jan Gorecki explores tabular data transformations and introduces us to one of the fastest open-source data wrangling tools available.
Clustering algorithms are very important to unsupervised learning and are key elements of machine learning in general. These algorithms give meaning to data that are not labelled and help find structure in chaos. But not all clustering algorithms are created equal; each has its own pros and cons.
In this article, Toptal Freelance Software Engineer Lovro Iliassich explores a heap of clustering algorithms, from the well-known K-Means algorithm to the elegant, state-of-the-art Affinity Propagation technique.
Today, a massive amount of data is available in the form of networks or graphs. Examples include the World Wide Web, with its web pages and hyperlinks, as well as social networks, semantic networks, biological networks, and citation networks for scientific literature.
A tree is a special type of graph, and is naturally suited to represent many types of data. The analysis of trees is an important field in computer and data science. In this article, we will look at the analysis of the link structure in trees. In particular, we will focus on tree kernels, a method for comparing tree graphs to each other, allowing us to get quantifiable measurements of their similarities or differences. This is an important process for many modern applications such as classification and data analysis.
Machine learning, in computing, is where art meets science. Perfecting a machine learning tool is largely a matter of understanding the data and choosing the right algorithm. But why choose one algorithm when you can choose many and make them all work toward one goal: improved results?
In this article, Toptal Engineer Necati Demir walks us through some elegant techniques of ensemble methods where a combination of data splits and multiple algorithms is used to produce machine learning results with higher accuracy.
Although database programming does not evolve at nearly the same pace as traditional application programming, recent advancements in several fields are bringing new techniques and technologies within the reach of small and independent developers.
In this guide, Toptal Freelance Software Engineer Jeffrey Shumaker explains how developers can quickly and easily tap these methods to identify database issues they may not even be aware of, and how they can build excellent data mining tools without spending a lot on expensive software licenses.
Analysts have come to recognize social network data as a virtual treasure trove of information for sensing public opinion trends and groundswells of support. In this article, Toptal Engineer Elder Santos describes the techniques he employed for a proof-of-concept that effectively analyzed Twitter Trend Topics to predict, as a sample test case, regional voting patterns in the 2014 Brazilian presidential election.
Machine learning has changed the way we deal with data. Data-driven problems that are difficult to solve using standard methods can often be tackled much more easily with machine learning algorithms. In this article, we will explore Azure Machine Learning features and capabilities by solving one of the problems that we face in our everyday lives.
In today’s data-driven world, researchers are busy answering interesting questions by churning through huge volumes of data. Some obvious challenges they face are due to the sheer size of the datasets that they have to deal with. In this article, we take a peek at a simple business intelligence platform implemented on top of the MongoDB Aggregation Pipeline.
The Bitcoin blockchain is the backbone of the network and provides a tamper-proof data structure that serves as a shared public ledger open to all. This article provides insight into blockchain technology, its current status, and its potential.
How do we understand and interpret the huge amounts of data coming out of simulations? How do we visualize potentially gigabytes of data points in a large dataset? In this article, I will give a quick introduction to VTK and its pipeline architecture, and go on to discuss a real-life visualization example.
Data conversion, translation, and mapping is by no means rocket science, but it is by all means tedious. This article introduces MetaDapper, a .NET library that strives to simplify, streamline, and automate the data conversion process to the greatest extent possible.
Once you step beyond the comfortable confines of English-only character sets, you quickly find yourself entangled in the wonderfully wacky world of UTF-8.
Indeed, navigating through UTF-8-related issues can be a frustrating and hair-pulling experience. This post provides a concise cookbook for addressing these issues when working with PHP and MySQL in particular, based on practical experience and lessons learned.
The recent resurgence in Artificial Intelligence has been powered in no small part by a new trend in machine learning, known as “Deep Learning”. In this article, I’ll introduce you to the key concepts and algorithms behind Deep Learning, beginning with the simplest building block.
A few years ago, driven by my curiosity, I took my first steps into the world of Forex by creating a demo account and playing out simulations (with fake money) using the MetaTrader 4 trading platform.
After a week of ‘trading’, I’d almost doubled my ‘money’. Spurred on by my own success, I dug deeper and eventually signed up for a number of forums. Soon, I was spending hours reading about trading systems (i.e., rule sets that determine whether you should buy or sell), custom indicators, market moods, and more.
But this isn’t just another article about cohort analysis. If you already know the importance of the topic and want to skip the introduction, you can jump to the simulator, where you can either simulate startup growth based on retention, churn, and a number of other factors, or analyze your own PayPal logs with the code I’ve open sourced.
If, however, you don’t realize that these are some of the most important metrics around, continue reading.
I live in Córdoba, Argentina, approximately 130 kilometers (~80 miles) away from the lake where I kitesurf. That’s roughly a two-hour drive, which I can deal with. But I can’t deal with the fact that weather forecasts are inaccurate. And where I live, good wind conditions last just a couple of hours. The last thing you want to do is clear up your Monday schedule to go kitesurfing and find yourself cursing the gods on a windless lake after two hours of driving.
I needed to know the wind conditions of my favorite kitesurfing spot—in real time. So I decided to build my own weather station.