Toptal is a marketplace for top Big Data architects. Top companies and startups choose Toptal Big Data freelancers for their mission-critical software projects.
Brian is a senior data engineer and software engineer with over 14 years of experience writing code, leading projects, and solving some of the toughest problems within popular cloud and big data ecosystems. He specializes in data skills such as Airflow and DBT, as well as cloud automation skills like Terraform and Ansible. Brian's favorite projects are data and cloud migration projects, machine learning, automation, and even traditional software development.
Alex is an innovative and experienced big data engineer, skilled in a wide variety of tools and technologies. He is competent in all aspects of the data science process, including data ingestion, storage, transformation, and statistical modeling. Alex has a proven track record of thoughtfully dissecting business problems, vetting requirements, and then designing the processes to address them.
Benjamin has over two decades of software and big data development experience, including data modeling and data warehouse design. His active toolset includes Spark, Python, Scala, AWS, Azure, SQL, Hive, Linux, Microsoft BI solutions, C#.NET, and Java. His orientation to detail and strong analytical and problem-solving skills make him an excellent addition to any team. A kind and intentional communicator, Benjamin always produces high-quality work.
Vivek is an IT professional with 15 years of experience in designing and building systems, the last five years in big data systems. He excels in multiple tools, technologies, and programming languages, including SQL, .NET, Java, Scala, and JavaScript. Vivek has worked with data in Excel, VBA, Access, RDBMS, and distributed systems.
Renato has 13+ years of experience in big data projects. He has worked for Tier 1 tech companies, consulting firms, and financial institutions. Renato has migrated petabytes of data to on-premises and cloud data lake environments, architected entire lakehouses, implemented machine learning models that provided intelligent suggestions to clients, and managed multicultural data teams that delivered data projects to top-notch banks in Brazil. He has a master's degree in big data.
Karip is a data engineer and developer with 18 years of experience in the IT industry and six years in the freelance community. He specializes in Java, Scala, and Python and is skilled at Hadoop, Spark, and PL/SQL. Karip enjoys working on the back end, database, and big data projects.
Mohamed is a big data platform architect with 15 years of experience in the IT industry. He excels with distributed systems, data engineering, machine learning, and DevOps. Mohamed builds robust batch and real-time data platforms for stakeholders and moves companies to state-of-the-art data platforms. He designed GDPR-compliant data lakes and optimized an ETL pipeline, which saved the client hundreds of thousands of dollars annually. Mohamed is pragmatic and has an agile mindset.
Attila is a senior data engineer with 14 years of experience, initially specializing in data warehousing and later in big data and machine learning. He has worked on large-scale use cases with over 100TB of data and excelled in challenging environments. Attila's industry experience is backed by a master's degree in technical informatics and certifications in Azure data science, Databricks, DevOps, and AI engineering.
Igor is a seasoned data architect with 16+ years of experience in high-load systems, DWH, ETL, and ML pipelines. He has delivered innovative solutions for industry leaders such as TangoMe, Gazprombank, Stanford University, and Royal Mail. As a cloud-agnostic expert specializing in Flask, FastAPI, and database integration, Igor builds robust, scalable architectures. His passion for cloud-based systems empowers businesses to operate efficiently, gain flexibility, and achieve strategic advantages.
Sandro holds a BSc and MSc in computer science and has over a decade of software engineering experience. His main fields of expertise include machine learning R&D (NLP/CV/DL), MLOps, big data, and algorithm design. Sandro has worked with big companies like Microsoft and Logitech, as well as small startups.
As a highly effective technical leader with over 20 years of experience, Andrew specializes in data: integration, conversion, engineering, analytics, visualization, science, ETL, big data architecture, analytics platforms, and cloud architecture. He has an array of skills in building data platforms, analytic consulting, trend monitoring, data modeling, data governance, and machine learning.
Big Data is an extremely broad domain, typically addressed by a hybrid team of data scientists, software engineers, and statisticians. Real expertise in big data therefore requires far more than learning the ins and outs of a particular technology. This guide offers a sampling of effective questions to help evaluate the breadth and depth of a candidate's mastery of this complex domain.
... allows corporations to quickly assemble teams that have the right skills for specific projects.
Despite accelerating demand for coders, Toptal prides itself on almost Ivy League-level vetting.
Our clients
Creating an app for the game
Leading a digital transformation
Building a cross-platform app to be used worldwide
Drilling into real-time data creates an industry game changer
Testimonials
Tripcents wouldn't exist without Toptal. Toptal Projects enabled us to rapidly develop our foundation with a product manager, lead developer, and senior designer. In just over 60 days we went from concept to Alpha. The speed, knowledge, expertise, and flexibility are second to none. The Toptal team was as much a part of Tripcents as any in-house team member. They contributed and took ownership of the development just like everyone else. We will continue to use Toptal. As a startup, they are our secret weapon.
Brantley Pace
CEO & Co-Founder
I am more than pleased with our experience with Toptal. The professional I got to work with was on the phone with me within a couple of hours. I knew after discussing my project with him that he was the candidate I wanted. I hired him immediately and he wasted no time in getting to my project, even going the extra mile by adding some great design elements that enhanced our overall look.
Paul Fenley
Director
The developers I was paired with were incredible -- smart, driven, and responsive. It used to be hard to find quality engineers and consultants. Now it isn't.
Ryan Rockefeller
CEO
Toptal understood our project needs immediately. We were matched with an exceptional freelancer from Argentina who, from Day 1, immersed himself in our industry, blended seamlessly with our team, understood our vision, and produced top-notch results. Toptal makes connecting with superior developers and programmers very easy.
Jason Kulik
Co-founder
As a small company with limited resources we can't afford to make expensive mistakes. Toptal provided us with an experienced programmer who was able to hit the ground running and begin contributing immediately. It has been a great experience and one we'd repeat again in a heartbeat.
Stuart Pocknee
Principal
How to Hire Big Data Architects Through Toptal
1
Talk to One of Our Client Advisors
A Toptal client advisor will work with you to understand your goals, technical needs, and team dynamics.
2
Work With Hand-selected Talent
Within days, we'll introduce you to the right Big Data architect for your project. Average time to match is under 24 hours.
3
The Right Fit, Guaranteed
Work with your new Big Data architect for a trial period (pay only if satisfied), ensuring they're the right fit before starting the engagement.
Capabilities of Big Data Architects
Elevate your data strategy with our Big Data architects, experts in deploying Hadoop, Spark, and cloud services for scalable, secure data solutions.
Big Data Tools and Technologies
Big data tools and technologies encompass frameworks and databases like Hadoop, Spark, and Cassandra. Toptal’s big data architects leverage these tools for distributed processing, enabling them to manage and analyze large volumes of data effectively.
Data Modeling and Optimization
In data analytics, conceptual models are often used to represent relationships and structures. Toptal’s experts have extensive experience in designing and fine-tuning data models to handle large-scale datasets, as well as optimizing the data and data processing workflows to maximize speed and resource utilization.
Data Lakes and Data Warehouses
Data lakes and data warehouses play complementary roles in the management of data. Data lakes store data in its raw form, while data warehouses store and manage structured data. Our big data architects harness these technologies to implement reliable and efficient solutions for advanced analytics.
Cloud-based Solutions
Big data architects use cloud services like AWS, Google Cloud, and Azure to store, process, and manage data. With modern datasets growing larger and larger, Toptal experts take advantage of the scalability of cloud services to architect robust, data-driven systems with enhanced efficiency.
Real-time Data Processing and Streaming
Real-time data processing and streaming involve continuously collecting, processing, and analyzing data as it is generated, enabling immediate insights and responses to dynamic information streams. Toptal’s big data architects are adept at designing scalable data frameworks that handle the continuous influx of data efficiently.
Data Integration and ETL Pipelines
Data integration and ETL (extract, transform, load) pipelines consolidate data from various sources and prepare it for storage or analysis. Using tools such as Apache Airflow and Fivetran, Toptal’s big data architects design and implement ETL pipelines that provide efficient data flow and processing.
Scalable Data Architecture Design
Big data architects design scalable architectures by building distributed systems capable of handling large volumes of diverse data. Toptal’s architects leverage cloud services and technologies like Hadoop and Spark to implement robust data solutions that can continue to scale in the future.
Security and Governance
Big data architects integrate meticulous security and governance protocols, utilizing advanced encryption, stringent access controls, and comprehensive auditing to align with regulatory requirements. Our big data architects are highly proficient at crafting and implementing these protective measures to strengthen organizational data integrity and privacy.
Batch Processing and Analytics
Batch processing and analytics involve collecting large datasets and running analysis tasks on them in batches. Toptal’s big data architects optimize the systems that process these batches, ensuring robust and scalable analytics solutions for diverse business needs.
Big Data Infrastructure Management
In order to support large-scale data analytics, the underlying systems that store and process the data must be robust and efficient. Toptal’s big data architects are adept at deploying these complex systems, ensuring high availability, fault tolerance, and disaster recovery capabilities for mission-critical systems.
Find Experts With Related Skills
Access a vast pool of skilled developers in our talent network and hire the top 3% within just 48 hours.
Typically, you can hire a Big Data architect with Toptal in about 48 hours. For larger teams of talent or Managed Delivery, timelines may vary. Our talent matchers are highly skilled in the same fields they’re matching in—they’re not recruiters or HR reps. They’ll work with you to understand your goals, technical needs, and team dynamics, and match you with ideal candidates from our vetted global talent network.
Once you select your Big Data architect, you’ll have a no-risk trial period to ensure they’re the perfect fit. Our matching process has a 98% trial-to-hire rate, so you can rest assured that you’re getting the best fit every time.
How do I hire a Big Data architect?
To hire the right Big Data architect, it’s important to evaluate a candidate’s experience, technical skills, and communication skills. You’ll also want to consider the fit with your particular industry, company, and project. Toptal’s rigorous screening process ensures that every member of our network has excellent experience and skills, and our team will match you with the perfect Big Data architects for your project.
How are Toptal Big Data architects different?
At Toptal, we thoroughly screen our Big Data architects to ensure we only match you with the highest caliber of talent. Of the more than 200,000 people who apply to join the Toptal network each year, fewer than 3% make the cut.
In addition to screening for industry-leading expertise, we also assess candidates’ language and interpersonal skills to ensure that you have a smooth working relationship.
When you hire with Toptal, you’ll always work with world-class, custom-matched Big Data architects ready to help you achieve your goals.
Can you hire Big Data architects on an hourly basis or for project-based tasks?
You can hire Big Data architects on an hourly, part-time, or full-time basis. Toptal can also manage the entire project from end to end with our Managed Delivery offering. Whether you hire an expert for a full- or part-time position, you’ll have the control and flexibility to scale your team up or down as your needs evolve. Our Big Data architects can fully integrate into your existing team for a seamless working experience.
What is the no-risk trial period for Toptal Big Data architects?
We make sure that each engagement between you and your Big Data architect begins with a trial period of up to two weeks. This means that you have time to confirm the engagement will be successful. If you’re completely satisfied with the results, we’ll bill you for the time and continue the engagement for as long as you’d like. If you’re not completely satisfied, you won’t be billed. From there, we can either part ways, or we can provide you with another expert who may be a better fit and with whom we will begin a second, no-risk trial.
How to Hire an Excellent Big Data Architect
Big Data is an extremely broad domain, typically addressed by a hybrid team of data scientists, data analysts, software engineers, and statisticians. Finding a single individual knowledgeable across the entire breadth of this domain (as opposed to, say, an expert in one specific technology such as Microsoft Azure) is therefore highly unlikely. Rather, one will most likely be searching for multiple individuals with specific sub-areas of expertise. This guide is therefore divided at a high level into two sections: one covering big data algorithms, techniques, and approaches, and the other covering big data technologies.
This guide highlights questions related to key concepts, paradigms, and technologies in which a big data expert can be expected to have proficiency. Bear in mind, though, that not every “A” candidate will be able to answer them all, nor does answering them all guarantee an “A” candidate. Ultimately, effective interviewing and hiring is as much of an art as it is a science.
Big Data Algorithms, Techniques, and Approaches
When it comes to big data and data management, fundamental knowledge of and work experience with relevant algorithms, techniques, big data platforms, and approaches is essential. Generally speaking, mastering these areas requires more time and skill than becoming an expert with a specific set of software languages or tools. As such, software engineers who do have expertise in these areas are both hard to find and extremely valuable to your team. The questions that follow can be helpful in gauging such expertise.
Q: Given a stream of data of unknown length, and a requirement to create a sample of a fixed size, how might you perform a simple random sample across the entire dataset? (i.e., given N elements in a data stream, how can you produce a sample of k elements, where N > k, such that every element has the same probability, k/N, of being included in the sample?)
One of the effective algorithms for addressing this is known as Reservoir Sampling.
The basic procedure is as follows:
Create an array of size k.
Fill the array with the first k elements from the stream.
For each subsequent element E (with index i) read from the stream, generate a random number j between 0 and i (inclusive). If j is less than k, replace the jth element in the array with E.
This approach gives each element in the stream the same probability of appearing in the output sample.
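A minimal Python sketch of this procedure might look like the following (the function name and the use of random.randint are our own choices, not part of any particular library):

```python
import random

def reservoir_sample(stream, k):
    """Return a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, element in enumerate(stream):
        if i < k:
            reservoir.append(element)      # fill the reservoir with the first k elements
        else:
            j = random.randint(0, i)       # random index in [0, i], inclusive
            if j < k:
                reservoir[j] = element     # replace an existing element with probability k/(i+1)
    return reservoir

sample = reservoir_sample(range(1_000_000), 10)
```

Note that the stream is consumed exactly once and only k elements are held in memory, which is what makes the technique suitable for data of unknown or unbounded length.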
Q: Describe and compare some of the more common algorithms and techniques for cluster analysis.
Cluster analysis is a common unsupervised learning technique used in many fields. It has a huge range of applications both in science and in business. A few examples include:
Bioinformatics: Organizing genes into clusters by analyzing similarity of gene expression patterns.
Marketing: Discovering distinct groups of customers and then using this knowledge to structure a campaign that targets the right marketing segments.
Insurance: Identifying categories of insurance holders that have a high average claim cost.
Clustering algorithms can be logically categorized based on their underlying cluster model, as summarized below.

Connectivity-based (hierarchical) clustering
Description: Based on the core idea that objects "closer" to one another are more related, these algorithms connect objects to form clusters based on distance. A distance algorithm (such as Levenshtein distance in the case of string comparison) determines the "nearness" of objects. Linkage criteria can be based on minimum distance (single linkage), maximum distance (complete linkage), average distance, centroid distance, or any other algorithm of arbitrary complexity. Clustering can be agglomerative (starting with single elements and aggregating them into clusters) or divisive (starting with the complete dataset and dividing it into partitions).
Advantages: Does not require pre-specifying the number of clusters; can be useful for proof-of-concept or preliminary analyses.
Disadvantages: Produces a hierarchy from which the user still needs to choose appropriate clusters; complexity generally makes these methods too slow for large datasets; the "chaining phenomenon", whereby outliers either show up as additional clusters or cause other clusters to merge erroneously.

Centroid-based clustering (e.g., K-means)
Description: Clusters are represented by a central vector, which is not necessarily a member of the dataset. When the number of clusters is fixed to K, K-means clustering gives a formal definition as an optimization problem: find the K cluster centers and assign each object to the nearest center, such that the squared distances from the cluster centers are minimized.
Advantages: With a large number of variables, K-means may be computationally faster than hierarchical clustering (if K is small); K-means may produce tighter clusters than hierarchical clustering, especially if the clusters are globular.
Disadvantages: Requires the number of clusters (K) to be specified in advance; prefers clusters of approximately similar size, which often leads to incorrectly set borders between clusters.

Density-based clustering (e.g., DBSCAN)
Description: Clusters are defined as areas of higher density than the remainder of the dataset; objects in sparse areas are usually considered noise or border points. Points are connected based on distance thresholds (similar to linkage-based clustering), but only those that satisfy a specified density criterion. The most popular density-based method is DBSCAN, which features a well-defined cluster model called "density-reachability."
Advantages: Does not require specifying the number of clusters a priori; can find arbitrarily shaped clusters, even a cluster completely surrounded by (but not connected to) a different cluster; mostly insensitive to the ordering of the points in the database.
Disadvantages: Expects a density "drop" or "cliff" to detect cluster borders; DBSCAN is unable to detect intrinsic cluster structures that are prevalent in much real-world data; on datasets consisting of mixtures of Gaussians, it is almost always outperformed by methods such as EM clustering that can precisely model such data.
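As a quick illustration of these practical differences, the following sketch (using scikit-learn, with hyperparameters chosen only for demonstration) runs K-means and DBSCAN on the same non-globular dataset. DBSCAN typically recovers the two interleaved shapes, while K-means, which favors globular clusters of similar size, tends to split them along a straight boundary.

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-circles: a shape K-means handles poorly but DBSCAN handles well.
X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print("K-means cluster labels:", np.unique(kmeans_labels))
print("DBSCAN cluster labels (-1 denotes noise):", np.unique(dbscan_labels))
```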
Q: Define and discuss ACID, BASE, and the CAP theorem.
ACID refers to the following set of properties that collectively guarantee reliable processing of database transactions, with the goal being “immediate consistency”:
Atomicity. Requires each transaction to be “all or nothing”; i.e., if one part of the transaction fails, the entire transaction fails, and the database state is left unchanged.
Consistency. Requires every transaction to bring the database from one valid state to another. Any data written to the database must be valid according to all defined rules, including (but not limited to) constraints, cascades, triggers, and any combination thereof.
Isolation. Requires concurrent execution of transactions to yield a system state identical to that which would be obtained if those same transactions were executed sequentially.
Durability. Requires that, once a transaction has been committed, it will remain so even in the event of power loss, crashes, or errors.
Many databases rely upon locking to provide ACID capabilities. Locking means that the transaction marks the data that it accesses so that the DBMS knows not to allow other transactions to modify it until the first transaction succeeds or fails. An alternative to locking is multiversion concurrency control in which the database provides each reading transaction the prior, unmodified version of data that is being modified by another active transaction.
Guaranteeing ACID properties in a distributed transaction across a distributed database where no single node is responsible for all data affecting a transaction presents additional complications. Network connections might fail, or one node might successfully complete its part of the transaction and then be required to roll back its changes, because of a failure on another node. The two-phase commit protocol (not to be confused with two-phase locking) provides atomicity for distributed transactions to ensure that each participant in the transaction agrees on whether the transaction should be committed or not.
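To make the two-phase commit idea concrete, here is a minimal, conceptual Python sketch of the voting and commit phases; the Participant class and its prepare/commit/rollback methods are illustrative stand-ins, not a real database API, and real implementations must also handle timeouts, logging, and coordinator failure.

```python
class Participant:
    """Illustrative stand-in for a node holding part of a distributed transaction."""

    def __init__(self, name, will_succeed=True):
        self.name = name
        self.will_succeed = will_succeed

    def prepare(self):
        # Phase 1: vote on whether this node can commit its part of the transaction.
        return self.will_succeed

    def commit(self):
        print(f"{self.name}: committed")

    def rollback(self):
        print(f"{self.name}: rolled back")


def two_phase_commit(participants):
    # Phase 1 (voting): every participant must vote "yes" for the transaction to proceed.
    if all(p.prepare() for p in participants):
        # Phase 2 (commit): all participants make their changes durable.
        for p in participants:
            p.commit()
        return True
    # Any "no" vote aborts the whole transaction on every node.
    for p in participants:
        p.rollback()
    return False


two_phase_commit([Participant("node-1"), Participant("node-2", will_succeed=False)])
```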
BASE (Basically Available, Soft state, Eventual consistency) was developed as an alternative for producing more scalable and affordable data architectures. Allowing data to be less rigorously up to date at all times gives developers the freedom to build other efficiencies into the overall system. In BASE, engineers embrace the idea that data can be “eventually” updated, resolved, or made consistent, rather than instantly resolved.
The Eventual Consistency model employed by BASE informally guarantees that, if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value. A system that has achieved eventual consistency is often said to have converged, or achieved replica convergence. Eventual consistency is sometimes criticized as increasing the complexity of distributed software applications.
In nearly all models, properties like consistency and availability are viewed as competing concerns, where adjusting one can impact the other. Accordingly, the CAP theorem (a.k.a. Brewer’s theorem) states that it’s impossible for a distributed computer system to provide more than two of the following three guarantees concurrently:
Consistency (all nodes see the same data at the same time)
Availability (a guarantee that every request receives a response about whether it was successful or failed)
Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)
In the decade since its introduction, designers and researchers have used the CAP theorem as a reason to explore a wide variety of novel distributed systems. The CAP Theorem has therefore certainly proven useful, fostering much discussion, debate, and creative approaches to addressing tradeoffs, some of which have even yielded new systems and technologies.
Yet at the same time, the “2 out of 3” constraint does somewhat oversimplify the tensions between the three properties. By explicitly handling partitions, for example, designers can optimize consistency and availability, thereby achieving some trade-off of all three. Although designers do need to choose between consistency and availability when partitions are present, there is an incredible range of flexibility for handling partitions and recovering from them.
Aspects of the CAP theorem are often misunderstood, particularly the scope of availability and consistency, which can lead to undesirable results. If users cannot reach the service at all, there is no choice between C and A except when part of the service runs on the client. This exception, commonly known as disconnected operation or offline mode, is becoming increasingly important. Some HTML5 features make disconnected operation easier going forward. These systems normally choose A over C and thus must recover from long partitions.
Q: What is dimensionality reduction and how is it relevant to processing big data? Name some techniques commonly employed for dimensionality reduction.
Dimensionality reduction is the process of converting data of very high dimensionality into data of lower dimensionality, typically for purposes such as visualization (i.e., projection onto a 2D or 3D space), compression (for efficient storage and retrieval), or noise removal.
Some of the more common techniques for dimensionality reduction include principal component analysis (PCA), singular value decomposition (SVD), linear discriminant analysis (LDA), t-SNE, and autoencoders.
Note: Each of the techniques listed above is itself a complex topic, worth exploring further for those interested in learning more.
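As a brief illustration, the following sketch uses scikit-learn's PCA implementation (our choice of library here) to project a synthetic 50-dimensional dataset down to three dimensions; the data is random and purely for demonstration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 50))        # 1,000 samples with 50 features

pca = PCA(n_components=3)               # project onto the 3 directions of highest variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                  # (1000, 3)
print(pca.explained_variance_ratio_)    # fraction of variance retained by each component
```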
Q: Discuss some common statistical sampling techniques, including their strengths and weaknesses.
When analyzing big data, processing the entire dataset would often be operationally untenable. Exploring a representative sample is easier, more efficient, and can in many cases be nearly as accurate as exploring the entire dataset. Some of the statistical sampling techniques more commonly used with big data are summarized below.

Common Sampling Techniques

Simple random sampling. Every element has the same chance of selection (as does any pair of elements, triple of elements, etc.). Advantages: minimizes bias and simplifies analysis of results. Disadvantages: vulnerable to sampling error.

Systematic sampling. Orders the data and then selects elements at regular intervals through the ordered dataset. Advantages: easy to implement and can be efficient (depending on the ordering scheme). Disadvantages: vulnerable to periodicities in the ordered data, and its theoretical properties make it difficult to quantify accuracy.

Stratified sampling. Divides the data into separate strata (i.e., categories) and then samples each stratum separately. Advantages: able to draw inferences about specific subgroups; focuses on important subgroups and ignores irrelevant ones; improves the accuracy/efficiency of estimation; allows different sampling techniques to be applied to different subgroups. Disadvantages: can increase the complexity of sample selection; selection of stratification variables can be difficult; not useful when there are no homogeneous subgroups; can sometimes require a larger sample than other methods.
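The sketch below shows one simple way each of these techniques might be expressed with pandas; the DataFrame and its "region" column are hypothetical, and the sample sizes are arbitrary.

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["north", "south", "east", "west"] * 250,
    "value": range(1_000),
})

simple = df.sample(n=100, random_state=0)       # simple random sample of 100 rows
systematic = df.iloc[::10]                      # every 10th row of the ordered data
stratified = (df.groupby("region", group_keys=False)
                .apply(lambda g: g.sample(frac=0.1, random_state=0)))  # 10% from each stratum
```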
Q: Explain the term “cache oblivious”. Discuss some of its advantages and disadvantages in the context of processing big data.
A cache oblivious (a.k.a., cache transcendent) algorithm is designed to take advantage of a CPU cache without knowing its size. Its goal is to perform well – without modification or tuning – on machines with different cache sizes, or for a memory hierarchy whose levels are of different cache sizes.
Typically, a cache oblivious algorithm employs a recursive divide and conquer approach, whereby the problem is divided into smaller and smaller sub-problems, until a sub-problem size is reached that fits into the available cache. For example, an optimal cache oblivious matrix multiplication is obtained by recursively dividing each matrix into four sub-matrices to be multiplied.
In tuning for a specific machine, one may use a hybrid algorithm which uses blocking tuned for the specific cache sizes at the bottom level, but otherwise uses the cache-oblivious algorithm.
The ability to perform well, independent of cache size and without cache-size-specific tuning, is the primary advantage of the cache oblivious approach. However, it is important to acknowledge that this lack of any cache-size-specific tuning also means that a cache oblivious algorithm may not perform as well as a cache-aware algorithm (i.e., an algorithm tuned to a specific cache size). Another disadvantage of the cache oblivious approach is that it typically increases the memory footprint of the data structure, which may further degrade performance. Interestingly, though, in practice the performance of cache oblivious algorithms is often surprisingly comparable to that of cache-aware algorithms, making them that much more interesting and relevant for big data processing.
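The recursive structure described above can be sketched in Python as follows. This is a simplified illustration that assumes square matrices whose size is a power of two and falls back to NumPy's built-in multiplication for small blocks, so it demonstrates the recursion pattern rather than a tuned implementation.

```python
import numpy as np

def co_matmul(A, B, threshold=64):
    """Cache-oblivious matrix multiply (sketch): recursively split into quadrants
    until the sub-problems are small enough to fit in cache, whatever its size."""
    n = A.shape[0]
    if n <= threshold:                  # base case: multiply small blocks directly
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    C = np.empty((n, n), dtype=A.dtype)
    C[:h, :h] = co_matmul(A11, B11, threshold) + co_matmul(A12, B21, threshold)
    C[:h, h:] = co_matmul(A11, B12, threshold) + co_matmul(A12, B22, threshold)
    C[h:, :h] = co_matmul(A21, B11, threshold) + co_matmul(A22, B21, threshold)
    C[h:, h:] = co_matmul(A21, B12, threshold) + co_matmul(A22, B22, threshold)
    return C

A = np.random.rand(256, 256)
B = np.random.rand(256, 256)
assert np.allclose(co_matmul(A, B), A @ B)
```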
Big Data Technologies
These days, it’s not only about finding a single tool to get the job done; rather, it’s about building a scalable architecture to effectively collect, process, and query enormous volumes of data. Armed with a strong foundational knowledge of big data algorithms, techniques, and approaches, a big data expert will be able to employ tools from a growing landscape of technologies that can be used to exploit big data to extract actionable information. The questions that follow can help evaluate this dimension of a candidate’s expertise.
Q: What is Hadoop? What are its key features? What modules does it consist of?
Apache Hadoop is a software framework that allows for the distributed processing of large datasets across clusters of computers. It is designed to effectively scale from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer.
The project includes these modules:
Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
Hadoop MapReduce: A system for parallel processing of large datasets.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop Common: The common utilities that support the other Hadoop modules.
Q: Provide an overview of HDFS, including a description of an HDFS cluster and its components.
HDFS is a highly fault-tolerant distributed filesystem, commonly used as a source and output of data in Hadoop jobs. It builds on a simple coherence model of write-once-read-many (with append possible) access. It supports a traditional hierarchical organization of directories and files.
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
An HDFS cluster also contains what is referred to as a Secondary NameNode, which periodically downloads the current NameNode image and edit log files, joins them into a new image, and uploads the new image back to the NameNode. (Despite its name, though, the Secondary NameNode does not serve as a backup to the primary NameNode in case of failure.)
It’s important to note that HDFS is meant to handle large files, with the default block size being 128 MB. This means that, for example, a 1 GB file will be split into just 8 blocks. On the other hand, a file of just 1 KB still occupies its own block, and every block adds metadata overhead on the NameNode. Selecting a block size that optimizes performance for a cluster can be a challenge, and there is no “one-size-fits-all” answer in this regard. Setting the block size to too small a value might increase network traffic and put a huge overhead on the NameNode, which processes each request and locates each block. Setting it to too large a value, on the other hand, might result in a great deal of wasted space. Since the focus of HDFS is on large files, one strategy is to combine small files into larger ones where possible.
Q: Provide an overview of MapReduce, including a discussion of its key components, features, and benefits.
MapReduce is a programming model and an associated implementation for processing and generating large datasets with a parallel, distributed algorithm on a cluster.
A MapReduce program is composed of a Map() procedure that performs filtering and sorting and a Reduce() procedure that performs a summary (i.e., data reduction) operation. A MapReduce system orchestrates the processing by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance.
MapReduce processes parallelizable problems across huge datasets using a large number of nodes in a cluster or grid as follows:
Map step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node.
Reduce step: The master node collects the answers from its worker nodes and combines them to form its output (i.e., the answer to the problem it was tasked with solving).
MapReduce allows for distributed processing of the map and reduction operations. Provided that each mapping operation is independent of the others, all maps can be performed in parallel (though in practice this is limited by the number of independent data sources and/or the number of CPUs near each source). Similarly, a set of ‘reducers’ can perform the reduction phase, provided that all outputs of the map operation that share the same key are presented to the same reducer at the same time, or that the reduction function is associative. While this process can often appear inefficient compared to algorithms that are more sequential, MapReduce can be applied to significantly larger datasets than high performance servers can usually handle (e.g., a large server farm of “commodity” machines can use MapReduce to sort a petabyte of data in only a few hours).
The parallelism in MapReduce also helps facilitate recovery from a partial failure of servers or storage during the operation (i.e., if one mapper or reducer fails, the work can be rescheduled, assuming the input data is still available). This is particularly important, since failures of single nodes are fairly common in multi-machine clusters. When there are thousands of disks spinning for a couple of hours, it’s not unlikely that one or two of them will fail at some point.
The key contributions of the MapReduce framework are not the map and reduce functions per se, but the scalability and fault-tolerance achieved for a variety of applications. It should be noted that a single-threaded implementation of MapReduce will usually not be faster than a traditional implementation. Its benefits are typically only realized when the optimized distributed shuffle operation (which reduces network communication cost) and fault tolerant features of the framework come into play.
As a side note, the name MapReduce originally referred to a proprietary Google technology but has since been genericized.
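To make the map and reduce steps concrete, here is a minimal, single-process Python simulation of the classic word-count example. Real frameworks such as Hadoop distribute the same logic across many machines and handle the shuffle, sort, and fault tolerance for you, but the shape of the computation is the same.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    # Map: emit a (key, value) pair for every word in every input line.
    for line in records:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle/sort: group all values sharing a key, as a MapReduce framework would,
    # then Reduce: sum the values for each key.
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield (key, sum(v for _, v in group))

docs = ["the quick brown fox", "the lazy dog", "the quick dog"]
print(dict(reduce_phase(map_phase(docs))))
# {'brown': 1, 'dog': 2, 'fox': 1, 'lazy': 1, 'quick': 2, 'the': 3}
```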
Q: Describe the MapReduce process in detail.
While not all big data software engineers will be familiar with the internals of the MapReduce process, those who are will be that much better equipped to leverage its capabilities and thereby architect and design an optimal solution.
MapReduce libraries have been written in many programming languages, with different levels of optimization. A popular open-source implementation is Apache Hadoop. We therefore provide a Hadoop-centric description of the MapReduce process, described here both for Apache Hadoop MapReduce 1 (MR1) as well as for Apache Hadoop MapReduce 2 (MR2, a.k.a. YARN). For simplicity, we use an example that takes input from an HDFS and stores it back to an HDFS. Such a process would operate as follows:
The local Job Client prepares the job for the submission and sends it to Resource Manager (MR2) or Job Tracker (MR1).
The Resource Manager (MR2) / Job Tracker (MR1) schedules the job across the cluster. In the case of MR2, an Application Master is started, which will take care of managing the job and terminating it after completion. The job is sent to Node Managers (MR2) or Task Trackers (MR1). An InputSplitter is used to distribute the map tasks across the cluster in the most efficient way.
Each of the Node Managers (MR2) / Task Trackers (MR1) spawns a map task, sending progress updates to the Application Master (MR2) / Job Tracker (MR1). The output is partitioned according to the reducers it needs to go to. An optional Combiner might also be used to join values for the same local key.
If a map fails, it is retried (up to a defined maximum number of retries).
After a map succeeds, the local partitions with the output can be copied to respective reduce nodes (depending on the partition).
After a reduce task receives all relevant files, it starts a sort phase, in which it merges the map outputs, maintaining the natural order of keys.
For each of the keys in the sorted output, a reduce function is invoked and the output is stored to the desired sink (in our case, HDFS). Since the node where it’s being executed is usually also a DataNode, the first replica is written to the local disk.
If a reducer fails, it is run again.
After a successful chain of operations, temporary files are removed and the Application Master closes gracefully.
Q: What are major types of NoSQL databases commonly used for big data?
In recent years, the growing prevalence and relevance of big data has dramatically increased the need to store and process huge quantities of data. This has caused a shift away from more traditional relational database systems toward other approaches and technologies, commonly known as NoSQL databases. NoSQL databases are typically simpler than traditional SQL databases and generally lack ACID transaction capabilities, which makes them more efficient and scalable.
The taxonomy of such databases can be constructed along a couple of different dimensions. Although the summary below groups them by data model, in the context of big data, the consistency model is another pivotal feature to consider when evaluating options for these types of datastores.

Column-family stores
Description: Inspired by Google's BigTable publication, these store data in records that can hold a very large number of dynamic columns. In general, they describe data using a somewhat loose schema, defining tables and the main columns (e.g., ColumnFamilies in the case of HBase) while remaining flexible about the data stored inside them, which are key-value pairs.
Advantages: Very fast writes; easy to scale; organizes data into tables and columns using a flexible schema; easy to store a history of operations with timestamps.
Disadvantages: Some implementations (e.g., HBase) require the whole (usually complex) Hadoop environment as a prerequisite; often requires effort to tune and optimize performance parameters.

Document stores
Description: Designed around the concept of a "document"; i.e., a single object encapsulating data in some standard format (such as JSON or XML). Instead of tables with rows, they operate on collections of documents.
Advantages: Schema-free design; simplicity gives significant gains in scalability and performance; many real-world objects are in fact documents; multiple implementations support indexing out of the box.
Disadvantages: Tracking any kind of relation is difficult; the absence of an enforced schema puts more effort on the client.
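As a small illustration of the document model, the sketch below uses PyMongo (MongoDB's Python driver, our choice for this example) to store and retrieve a JSON-like document; the connection string, database, and collection names are hypothetical and assume a locally running MongoDB instance.

```python
from pymongo import MongoClient

# Hypothetical local MongoDB instance, database, and collection.
client = MongoClient("mongodb://localhost:27017")
db = client["analytics"]

# Documents are schema-free: each one is simply a nested, JSON-like object.
db.events.insert_one({"user_id": 42, "type": "click", "props": {"page": "/home"}})

doc = db.events.find_one({"user_id": 42})
print(doc["props"]["page"])  # "/home"
```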
Q: What are Pig, Hive, and Impala? What are the differences between them?
Pig, Hive, and Impala are examples of big data frameworks for querying and processing data in the Apache Hadoop ecosystem. While a number of other systems have been introduced more recently (notable mentions include Facebook’s Presto and Spark SQL), these are still considered “the big 3” when dealing with big data challenges. The comparison below provides a basic description of the three.

Comparative Overview: Pig, Hive, and Impala

Origin: Pig was introduced in 2006 by Yahoo Research as an ad hoc way of creating and executing MapReduce jobs on very large datasets. Hive was introduced in 2007 by Facebook as a petabyte-scale data warehousing framework. Impala was introduced in 2012 by Cloudera as a massively parallel processing SQL query engine on HDFS.

Language: Pig uses Pig Latin, a procedural, data-flow-oriented language. Hive uses HiveQL, a declarative, SQL-like syntax. Impala uses SQL.

Execution: Pig and Hive run MapReduce jobs, whereas Impala uses a custom execution engine and does most of its operations in memory.

Typical use: Pig is geared toward batch processing and ETL; Hive toward batch processing and ad hoc queries; Impala toward interactive-style querying.

Extensibility: Pig is relatively easy to extend with user-defined functions (UDFs); Hive is relatively harder to extend with UDFs; Impala supports native (C++) UDFs as well as Hive UDFs.

Fault tolerance: Pig and Hive queries have high fault tolerance; with Impala, if a node fails, the query is aborted.
Q: What are Apache Tez and Apache Spark? What kind of problems do they solve (and how)?
One issue with MapReduce is that it’s not suited to interactive processing. Moreover, stacking one MapReduce job on top of another (a common real-world use case when running, for example, Pig or Hive queries) is typically quite inefficient. After each stage, the intermediate result is stored to HDFS, only to be picked up as input by another job, and so on. MapReduce also requires a lot of shuffling of data during the reduce phases. While this is sometimes indeed necessary, in many cases the actual flow could be optimized, as some jobs could be joined together or could use a cache that would reduce the need for joins and, in effect, significantly increase speed.
Addressing these issues was one of the important reasons behind creating Apache Tez and Apache Spark. They both generalize the MapReduce paradigm and execute the overall job by first defining the flow as a directed acyclic graph (DAG). At a high level of abstraction, each vertex is a user operation (i.e., performing some form of processing on the input data), while each edge defines the relation to other steps of the overall job (such as grouping, filtering, counting, and so on). Based on the specified DAG, the scheduler can decide which steps can be executed together (and when) and which require pushing data across the cluster. Additionally, both Tez and Spark offer forms of caching, minimizing the need to push huge datasets between the nodes.
There are some significant differences between the two technologies, though. Apache Tez was created more as an additional element of the Hadoop ecosystem, taking advantage of YARN and allowing it to be easily included in existing MapReduce solutions; it can even be run alongside regular MapReduce jobs. Apache Spark, on the other hand, was built more as a new approach to processing big data. It adds some new concepts, such as RDDs (Resilient Distributed Datasets), provides a well-thought-out, idiomatic API for defining the steps of the job (with many more operations than just map and reduce, such as joins and co-groups), and has multiple cache-related features. The code is lazily evaluated and the DAG is created and optimized automatically (whereas in the case of Tez, the graph must be defined manually). Taking all of this into account, Spark code tends to be very concise. Spark is also a significant part of the Berkeley Data Analytics Stack (BDAS).
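For a sense of how concise Spark code can be, here is a minimal PySpark word-count sketch; the input path is hypothetical, and the call to cache() illustrates the caching features mentioned above. The DAG is built lazily and is only executed when an action such as takeOrdered is called.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()

lines = spark.sparkContext.textFile("hdfs:///data/input.txt")  # hypothetical HDFS path
counts = (lines.flatMap(lambda line: line.split())   # emit individual words
               .map(lambda w: (w, 1))                # pair each word with a count of 1
               .reduceByKey(lambda a, b: a + b))     # sum counts per word across the cluster

counts.cache()                                        # keep results in memory for reuse
top10 = counts.takeOrdered(10, key=lambda kv: -kv[1])
print(top10)

spark.stop()
```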
Q: What is a column-oriented database? When and why would you use one?
Column-oriented databases arrange data storage on disk by column, rather than by row, which allows more efficient disk seeks for particular operations. This has a number of benefits when working with large datasets, including faster aggregation related queries, efficient compression of data, and optimized updating of values in a specific column across all (or many) rows.
Whether or not a column-oriented database will be more efficient in operation varies in practice. It would appear that operations that retrieve data for objects would be slower, requiring numerous disk operations to collect data from multiple columns to build up the record. However, these whole-row operations are generally rare. In the majority of cases, only a limited subset of data is retrieved. In a rolodex application, for instance, operations collecting the first and last names from many rows in order to build a list of contacts are far more common than operations reading all data for a single entity in the rolodex. This is even more true for writing data into the database, especially if the data tends to be “sparse” with many optional columns.
It is also the case that data organized by columns, being coherent, tends to compress very well.
For these reasons, column stores have demonstrated excellent real-world performance in spite of any theoretical disadvantages.
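As a rough illustration of why a columnar layout favors aggregation-style queries, the sketch below (plain Python and NumPy, not any particular database engine) contrasts computing an average over a row-oriented layout, which must touch every field of every record, with the same aggregate over a contiguous column array.

```python
import numpy as np

# Row-oriented: each record is stored together; the aggregate walks every record.
rows = [{"id": i, "name": f"user{i}", "age": 20 + i % 50} for i in range(100_000)]
avg_age_rows = sum(r["age"] for r in rows) / len(rows)

# Column-oriented: each column is a contiguous array; the aggregate scans one compact array
# (which also tends to compress well, since values within a column are similar).
columns = {
    "id": np.arange(100_000),
    "name": np.array([f"user{i}" for i in range(100_000)]),
    "age": np.array([20 + i % 50 for i in range(100_000)]),
}
avg_age_cols = columns["age"].mean()

assert round(avg_age_rows, 6) == round(float(avg_age_cols), 6)
```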
Q: What is a Lambda Architecture and how might it be used to perform analytics on streams with real-time updates?
One of the common use cases for big data is real-time processing of huge volumes of incoming data streams. Depending on the actual problem and the nature of the data, multiple solutions might be proposed. In general, a Lambda Architecture approach is often an effective way to address this type of challenge.
The core concept is that the result is always a function of the input data (lambda). The final solution should work as such a lambda function, irrespective of the amount of data it has to process. To make this possible, the data is split into two parts: raw data (which never changes and may only be appended to) and pre-computed data. The pre-computed data is further subdivided into two groups: old data and recent data (where “old” and “recent” are relative terms whose specifics depend on the operational context).
Having this distinction, we can now build a system based on the lambda architecture consisting of the following three major building blocks:
Batch layer. Manages the master dataset (an immutable, append-only set of raw data) and pre-computes arbitrary query functions (called batch views).
Serving layer. Indexes the batch views to support ad-hoc queries with low latency.
Speed layer. Accommodates all requests that are subject to low latency requirements. Using fast and incremental algorithms, the speed layer deals with recent data only.
There are many tools that can be applied to each of these architectural layers. For instance, the raw results could simply be stored in HDFS and the batch views might be pre-computed with the help of Pig jobs. Similarly, to make it possible to query batch views effectively, they might be indexed using technologies such as Apache Drill, Impala, or Elasticsearch, among others.
The speed layer, though, often requires a non-trivial amount of effort. Fortunately, there are many newer technologies that can help with that as well. One of the most commonly chosen is Apache Storm, which uses an abstraction of spouts and bolts to define data sources and manipulations of them, allowing distributed processing of streaming data.
The final element is to combine the results of the Speed layer and the Serving layer when assembling the query response.
This architecture combines the batch processing power of Hadoop with real-time availability of results. If the underlying algorithm ever changes, it’s very easy to just recalculate all batch views as a batch process and then update the speed layer to take the new version into account.
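As a minimal illustration of that final merging step, the following Python sketch (with made-up view names and counts) shows how a query might combine a pre-computed batch view with a speed-layer view covering only recent events:

```python
from collections import Counter

# Hypothetical batch view, e.g., pre-computed nightly by a Hadoop/Pig job.
batch_view = {"page:/home": 120_430, "page:/pricing": 8_112}

# Hypothetical speed-layer view covering only events that arrived since the last batch run.
realtime_view = Counter({"page:/home": 312, "page:/signup": 45})

def query(key):
    # A query merges the (older, complete) batch view with the (recent, incremental) speed layer.
    return batch_view.get(key, 0) + realtime_view.get(key, 0)

print(query("page:/home"))    # 120742
print(query("page:/signup"))  # 45
```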
Wrap-up
Without a doubt, big data represents one of the most formidable and pervasive challenges facing the software industry today. From social networking to marketing to security and law enforcement, the need for large-scale solutions that can effectively handle and process big data is rapidly on the rise.
Real expertise and proficiency in this domain requires far more than learning the ins and outs of a particular technology. It requires an understanding, at the first principles level, of the manifold technical challenges and complexities involved as well as the most effective methodologies and problem-solving approaches to address these hurdles.
We hope you find the questions presented in this article to be a useful foundation for “separating the wheat from the chaff” in your quest for the elite few among Big Data engineers, whether you need them full-time or part-time.
Featured Toptal Big Data Architecture Publications