How to Hire a Great Big Data Architect
Big data is an extremely broad domain, typically addressed by a hybrid team of data scientists, software engineers, and statisticians. Finding a single individual knowledgeable in the entire breadth of this domain is therefore extremely unlikely and rare. Rather, one will most likely be searching for multiple individuals with specific sub-areas of expertise. This guide is therefore divided at a high level into two sections:
- Big Data Algorithms, Techniques, and Approaches: Geared primarily toward data scientists and statisticians.
- Big Data Technologies: Geared primarily toward software engineers.
This guide highlights questions related to key concepts, paradigms, and technologies in which a big data expert can be expected to have proficiency. Bear in mind, though, that not every “A” candidate will be able to answer them all, nor does answering them all guarantee an “A” candidate. Ultimately, effective interviewing and hiring is as much of an art as it is a science.
Big Data Algorithms, Techniques, and Approaches
When it comes to big data, fundamental knowledge of relevant algorithms, techniques, and approaches is essential. Generally speaking, mastering these areas requires more time and skill than becoming an expert with a specific set of software languages or tools. As such, software engineers who do have expertise in these areas are both hard to find and extremely valuable to your team. The questions that follow can be helpful in gauging such expertise.
Q: Given a stream of data of unknown length, and a requirement to create a sample of a fixed size, how might you perform a simple random sample across the entire dataset? (i.e., given N elements in a data stream, how can you produce a sample of k elements, where N > k, whereby every element has a 1/N chance of being included in the sample?
One of the effective algorithms for addressing this is known as Reservoir Sampling.
The basic procedure is as follows:
- Create an array of size k.
- Fill the array with the first k elements from the stream.
- For each subsequent element E (with index i) read from the stream, generate a random number j between 0 and i. If j is less than k, replace the jth element in the array with E.
This approach gives each element in the stream the same probability of appearing in the output sample.
Q: Describe and compare some of the more common algorithms and techniques for cluster analysis.
Cluster analysis is a common unsupervised learning technique used in many fields. It has a huge range of applications both in science and in business. A few examples include:
- Bioinformatics: Organizing genes into clusters by analyzing similarity of gene expression patterns.
- Marketing: Discovering distinct groups of customers and the using this knowledge to structure a campaign that targets the right marketing segments.
- Insurance: Identifying categories of insurance holders that have a high average claim cost.
Clustering algorithms can be logically categorized based on their underlying cluster model as summarized in the table below.
|Connectivity-based (a.k.a. hierarchical) clustering||
(a.k.a. K-means) clustering
Q: Define and discuss ACID, BASE, and the CAP theorem.
ACID refers to the following set of properties that collectively guarantee reliable processing of database transactions, with the goal being “immediate consistency”:
Atomicity. Requires each transaction to be “all or nothing”; i.e., if one part of the transaction fails, the entire transaction fails, and the database state is left unchanged.
Consistency. Requires every transaction to bring the database from one valid state to another. Any data written to the database must be valid according to all defined rules, including (but not limited to) constraints, cascades, triggers, and any combination thereof.
Isolation. Requires concurrent execution of transactions to yield a system state identical to that which would be obtained if those same transactions were executed sequentially.
Durability. Requires that, once a transaction has been committed, it will remain so even in the event of power loss, crashes, or errors.
Many databases rely upon locking to provide ACID capabilities. Locking means that the transaction marks the data that it accesses so that the DBMS knows not to allow other transactions to modify it until the first transaction succeeds or fails. An alternative to locking is multiversion concurrency control in which the database provides each reading transaction the prior, unmodified version of data that is being modified by another active transaction.
Guaranteeing ACID properties in a distributed transaction across a distributed database where no single node is responsible for all data affecting a transaction presents additional complications. Network connections might fail, or one node might successfully complete its part of the transaction and then be required to roll back its changes, because of a failure on another node. The two-phase commit protocol (not to be confused with two-phase locking) provides atomicity for distributed transactions to ensure that each participant in the transaction agrees on whether the transaction should be committed or not.
In contrast to ACID (and its immediate-consistency-centric approach), BASE (Basically Available, Soft State, Eventual Consistency) favors availability over consistency of operations.
BASE was developed as an alternative for producing more scalable and affordable data architectures. Allowing less constantly updated data gives developers the freedom to build other efficiencies into the overall system. In BASE, engineers embrace the idea that data has the flexibility to be “eventually” updated, resolved or made consistent, rather than instantly resolved.
The Eventual Consistency model employed by BASE informally guarantees that, if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value. A system that has achieved eventual consistency is often said to have converged, or achieved replica convergence. Eventual consistency is sometimes criticized as increasing the complexity of distributed software applications.
In nearly all models, elements like consistency and availability often are viewed as resource competitors, where adjusting one can impact another. Accordingly, the CAP theorem (a.k.a. Brewer’s theorem) states that it’s impossible for a distributed computer system to provide more than two of the following three guarantees concurrently:
- Consistency (all nodes see the same data at the same time)
- Availability (a guarantee that every request receives a response about whether it was successful or failed)
- Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)
In the decade since its introduction, designers and researchers have used the CAP theorem as a reason to explore a wide variety of novel distributed systems. The CAP Theorem has therefore certainly proven useful, fostering much discussion, debate, and creative approaches to addressing tradeoffs, some of which have even yielded new systems and technologies.
Yet at the same time, the “2 out of 3” constraint does somewhat oversimplify the tensions between the three properties. By explicitly handling partitions, for example, designers can optimize consistency and availability, thereby achieving some trade-off of all three. Although designers do need to choose between consistency and availability when partitions are present, there is an incredible range of flexibility for handling partitions and recovering from them.
Aspects of the CAP theorem are often misunderstood, particularly the scope of availability and consistency, which can lead to undesirable results. If users cannot reach the service at all, there is no choice between C and A except when part of the service runs on the client. This exception, commonly known as disconnected operation or offline mode, is becoming increasingly important. Some HTML5 feature make disconnected operation easier going forward. These systems normally choose A over C and thus must recover from long partitions.
Q: What is dimensionality reduction and how is it relevant to processing big data? Name some techniques commonly employed for dimensionality reduction.
Dimensionality reduction is the process of converting data of very high dimensionality into data of lower dimensionality, typically for purposes such as visualization (i.e, projection onto a 2D or 3D space for visualization purposes), compression (for efficient storage and retrieval), or noise removal.
Some of the more common techniques for dimensionality reduction include:
- Principal component analysis
- Factor analysis
- Projection pursuit
- Independent component analysis
- Non-linear principal component analysis
- Random projection
Note: Each of the techniques listed above is itself a complex topic, so each is provided as a hyperlink to further information for those interested in learning more.
Q: Discuss some common statistical sampling techniques, including their strengths and weaknesses.
When analyzing big data, processing the entire dataset would often be operationally untenable. Exploring a representative sample is easier, more efficient, and can in many cases be nearly as accurate as exploring the entire dataset. The table below describes some of the statistical sampling techniques that are more commonly used with big data.
|COMMON SAMPLING TECHNIQUES|
|Simple random sampling||Every element has the same chance of selection (as does any pair of elements, triple of elements, etc.)||
|Systematic sampling||Orders data and selects elements at regular intervals through the ordered dataset||
|Stratified sampling||Divides data into separate strata (i.e., categories) and then samples each stratum separately||
Q: Explain the term “cache oblivious”. Discuss some of its advantages and disadvantages in the context of processing big data.
A cache oblivious (a.k.a., cache transcendent) algorithm is designed to take advantage of a CPU cache without knowing its size. Its goal is to perform well – without modification or tuning – on machines with different cache sizes, or for a memory hierarchy whose levels are of different cache sizes.
Typically, a cache oblivious algorithm employs a recursive divide and conquer approach, whereby the problem is divided into smaller and smaller sub-problems, until a sub-problem size is reached that fits into the available cache. For example, an optimal cache oblivious matrix multiplication is obtained by recursively dividing each matrix into four sub-matrices to be multiplied.
In tuning for a specific machine, one may use a hybrid algorithm which uses blocking tuned for the specific cache sizes at the bottom level, but otherwise uses the cache-oblivious algorithm.
The ability to perform well, independent of cache size and without cache-size-specific tuning, is the primary advantage of the cache oblivious approach. However, it is important to acknowledge that this lack of any cache-size-specific tuning also means that a cache oblivious algorithm may not perform as well as a cache-aware algorithm (i.e., an algorithm tuned to a specific cache size). Another disadvantage of the cache oblivious approach is that it typically increases the memory footprint of the data structure, which may further degrade performance. Interestingly though, in practice, performance of cache oblivious algorithms are often surprisingly comparable to that of cache aware algorithms, making them that much more interesting and relevant for big data processing.
Big Data Technologies
These days, it’s not only about finding a single tool to get the job done; rather, it’s about building a scalable architecture to effectively collect, process, and query enormous volumes of data. Armed with a strong foundational knowledge of big data algorithms, techniques, and approaches, a big data expert will be able to employ tools from a growing landscape of technologies that can be used to exploit big data to extract actionable information. The questions that follow can help evaluate this dimension of a candidate’s expertise.
Q: What is Hadoop? What are its key features? What modules does it consist of?
Apache Hadoop is a software framework that allows for the distributed processing of large datasets across clusters of computers. It is designed to effectively scale from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer.
The project includes these modules:
- Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
- Hadoop MapReduce: A system for parallel processing of large datasets.
- Hadoop YARN: A framework for job scheduling and cluster resource management.
- Hadoop Common: The common utilities that support the other Hadoop modules.
Q: Provide an overview of HDFS, including a description of an HDFS cluster and its components.
HDFS is a highly fault-tolerant distributed filesystem, commonly used as a source and output of data in Hadoop jobs. It builds on a simple coherence model of write-once-read-many (with append possible) access. It supports a traditional hierarchical organization of directories and files.
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
An HDFS cluster also contains what is referred to as a Secondary NameNode that periodically downloads the current NameNode image, edits log files, joins them into a new image. and uploads the new image back to the NameNode. (Despite its name, though, the Secondary NameNode does not serve as a backup to the primary NameNode in case of failure.)
It’s important to note that HDFS is meant to handle large files, with the default block size being 128 MB. This means that, for example, a 1 GB file will be split into just 8 blocks. On the other hand, a file that’s just 1 KB will still take a full 128 MB block. Selecting a block size that will optimize performance for a cluster can be a challenge and there is no “one-size-fits-all” answer in this regard. Setting block size to too small value might increase network traffic and put huge overhead on the NameNode, which processes each request and locates each block. Setting it to too large a value, on the other hand, might result in a great deal of wasted space. Since the focus of HDFS is on large files, one strategy could be to combine small files into larger ones, if possible.
Q: Provide an overview of MapReduce, including a discussion of its key components, features, and benefits.
MapReduce is a programming model and an associated implementation for processing and generating large datasets with a parallel, distributed algorithm on a cluster.
A MapReduce program is composed of a
Map() procedure that performs filtering and sorting and a
Reduce() procedure that performs a summary (i.e., data reduction) operation. A MapReduce system orchestrates the processing by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance.
MapReduce processes parallelizable problems across huge datasets using a large number of nodes in a cluster or grid as follows:
Map step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node.
Reduce step: The master node collects the answers from its worker nodes and combines them to form its output (i.e., the answer to the problem it was tasked with solving).
MapReduce allows for distributed processing of the map and reduction operations. Provided that each mapping operation is independent of the others, all maps can be performed in parallel (though in practice this is limited by the number of independent data sources and/or the number of CPUs near each source). Similarly, a set of ‘reducers’ can perform the reduction phase, provided that all outputs of the map operation that share the same key are presented to the same reducer at the same time, or that the reduction function is associative. While this process can often appear inefficient compared to algorithms that are more sequential, MapReduce can be applied to significantly larger datasets than high performance servers can usually handle (e.g., a large server farm of “commodity” machines can use MapReduce to sort a petabyte of data in only a few hours).
The parallelism in MapReduce also helps facilitate recovery from a partial failure of servers or storage during the operation (i.e., if one mapper or reducer fails, the work can be rescheduled, assuming the input data is still available). This is particularly important, since the failure of single nodes are fairly common in multi-machine clusters. When there are thousands of disks spinning for a couple of hours, it’s not unlikely that one or two of them will fail at some point.
The key contributions of the MapReduce framework are not the map and reduce functions per se, but the scalability and fault-tolerance achieved for a variety of applications. It should be noted that a single-threaded implementation of MapReduce will usually not be faster than a traditional implementation. Its benefits are typically only realized when the optimized distributed shuffle operation (which reduces network communication cost) and fault tolerant features of the framework come into play.
As a side note, the name MapReduce originally referred to a proprietary Google technology but has since been genericized.
Q: Describe the MapReduce process in detail.
While not all big data software engineers will be familiar with the the internals of the MapReduce process, those who are will be that much better equipped to leverage its capabilities and to thereby architect and design an optimal solution.
MapReduce libraries have been written in many programming languages, with different levels of optimization. A popular open-source implementation is Apache Hadoop. We therefore provide a Hadoop-centric description of the MapReduce process, described here both for Apache Hadoop MapReduce 1 (MR1) as well as for Apache Hadoop MapReduce 2 (MR2, a.k.a. YARN). For simplicity, we use an example that takes input from an HDFS and stores it back to an HDFS. Such a process would operate as follows:
- The local Job Client prepares the job for the submission and sends it to Resource Manager (MR2) or Job Tracker (MR1).
- Resource Manager (MR2) / Job Tracker (MR1) schedules the job across the cluster. In the case of MR2, the Application Master is started, which will take care of managing the job and terminating it after of completion. The job is sent to Node Managers (MR2) or Task Trackers (MR1). An InputSplitter is used to distribute the map tasks across the cluster in the most efficient way.
- Each of the Note Managers (MR2) / Task Trackers (MR1) spawns a map task, sending progress updates to an Application Master (MR2) / Job Tracker (MR1). The output is partitioned, according to the reducers it needs to go to. An optional Combiner might be used as well to join values for the same local key.
- If a map fails, it is retried (up to a defined maximum number of retries).
- After a map succeeds, the local partitions with the output can be copied to respective reduce nodes (depending on the partition).
- After a reduce task receives all relevant files, it starts a sort phase, in which it merges the map outputs, maintaining the natural order of keys.
- For each of the keys in the sorted output, a reduce function is invoked and the output is stored to the desired sink (in our case, HDFS). Since the node where it’s being executed is usually also a DataNode, the first replica is written to the local disk.
- If a reducer fails, it is run again.
- After a successful chain of operations, temporary files are removed and the Application Master closes gracefully.
Q: What are major types of NoSQL databases commonly used for big data?
In recent years, the growing prevalence and relevance of big data has dramatically increased the need to store and process huge quantities of data. This has caused a shift away from more traditional relational database systems to other approaches and technologies, commonly known as NoSQL databases. NoSQL databases are typically simpler than traditional SQL databases and typically lack ACID transactions capabilities, thereby making them more efficient and scalable.
The taxonomy of such databases might be created using a couple of different spaces. Although the table below groups them by data model, in the context of big data, consistency model would be another pivotal feature to consider when evaluating options for these types of datastores.
|Distributed file system
(e.g., Riak, Amazon DynamoDB, Redis)
(e.g., BigTable, Apache Cassandra, Apache HBase, Apache Accumulo)
|Document-oriented (e.g. MongoDB, CouchDB)||
Q: What are Pig, Hive, and Impala? What are the differences between them?
Pig, Hive, and Impala are examples of big data frameworks for querying and processing data in the Apache Hadoop ecosystem. While a number of other systems have recently been introduced (notable mentions include Facebook’s Presto or Spark SQL), these are still considered “the big 3” when dealing with big data challenges. The table below provides a basic description and comparison of the three.
|Comparitive Overview: Hive, Pig, and Impala|
|Introduced in 2006 by Yahoo Research. Ad-hoc way of creating and executing map-reduce jobs on a very large datasets||Introduced in 2007 by Facebook. Peta-byte scale data warehousing framework.||Introduced in 2012 by Cloudera. Massively parallel processing SQL query engine on HDFS.|
|PigLatin, procedural, data flow oriented||HiveQL, SQL-like syntax, declarative||SQL|
|Runs MapReduce jobs||Runs MapReduce jobs||Custom execution engine; does most of its operation in-memory|
|Batch processing or ETL||Batch processing, ad-hoc queries||Interactive-like|
|Relatively easy to extend with user-defined functions (UDF)||Relatively harder to extend with UDF||Supports native (C++) and Hive UDFs|
|High fault-tolerance of queries||High fault-tolerance of queries||If a node fails, the query is aborted|
Q: What are Apache Tez and Apache Spark? What kind of problems do they solve (and how)?
One issue with MapReduce is that it’s not suited for interactive processing. Moreover, stacking one MapReduce job over another (which is a common real-world use case when running, for example, Pig or Hive queries) is typically quite ineffective. After each stage, the intermediate result is stored to HDFS, only to be picked up as an input by another job, and so on. It also requires a lot of shuffling of data during the reduce phases. While sometimes it is indeed necessary, in many cases the actual flow could be optimized, as some of the jobs could be joined together or could use some cache that would reduce the need of joins and in effect increase the speed significantly.
Addressing these issues was one of important reasons behind creating Apache Tez and Apache Spark. They both generalize the MapReduce paradigm and execute the overall job by first defining the flow using a Direct Acyclic Graph (DAG). At a high level of abstraction, each vertex is a user operation (i.e., performing some form of processing on the input data), while each edge defines the relation with other steps of the overall job (such as grouping, filtering, counting, and so on). Based on the specified DAG, the scheduler can decide which steps can be executed together (and when) and which require pushing the data over the cluster. Additionally, both Tez and Spark offer forms of caching, minimizing the need to push huge datasets between the nodes.
There are some significant differences between the two technologies though. Apache Tez was created more as an additional element of Hadoop ecosystem, taking advantage of YARN and allowing to easily include it in existing MapReduce solutions. It can even be easily run with regular MapReduce jobs. Apache Spark, on the other hand, was built more as a new approach to processing big data. It adds some new concepts, such as RDDs (Resilient Distributed Datasets), provides a well-thought-out idiomatic API for defining the steps of the job (which has many more operations than just map or reduce, such as joins or co-groups), and has multiple cache-related features. The code is lazily evaluated and the Direct Acyclic Graph is created and optimized automatically (in contrast, in the case of Tez, the graph must be defined manually). Taking all of this into account, the code in Spark tends to be very concise. Spark is also a significant part of the Berkeley Data Analytics Stack (BDAS).
Q: What is a column-oriented database? When and why would you use one?
Column-oriented databases arrange data storage on disk by column, rather than by row, which allows more efficient disk seeks for particular operations. This has a number of benefits when working with large datasets, including faster aggregation related queries, efficient compression of data, and optimized updating of values in a specific column across all (or many) rows.
Whether or not a column-oriented database will be more efficient in operation varies in practice. It would appear that operations that retrieve data for objects would be slower, requiring numerous disk operations to collect data from multiple columns to build up the record. However, these whole-row operations are generally rare. In the majority of cases, only a limited subset of data is retrieved. In a rolodex application, for instance, operations collecting the first and last names from many rows in order to build a list of contacts are far more common than operations reading all data for a single entity in the rolodex. This is even more true for writing data into the database, especially if the data tends to be “sparse” with many optional columns.
It is also the case that data organized by columns, being coherent, tends to compress very well.
For these reasons, column stores have demonstrated excellent real-world performance in spite of any theoretical disadvantages.
Q: What is a Lambda Architecture and how might it be used to perform analytics on streams with real-time updates?
One of the common use cases for big data is real-time processing of huge volumes of incoming data streams. Depending on the actual problem and the nature of the data, multiple solutions might be proposed. In general, a Lambda Architecture approach is often an effective way to address this type of challenge.
The core concept is that the result is always a function of input data (lambda). The final solution should work as such a lambda function, irrelevant of the amount of data it has to process. To make this possible, the data is split into two parts; namely, raw data (which never changes and might be only appended) and pre-computed data. The pre-computed data is further subdivided into two other groups; namely, the old data and the recent data (where “old” and “recent” are relative terms, the specifics of which depend on the operational context).
Having this distinction, we can now build a system based on the lambda architecture consisting of the following three major building blocks:
- Batch layer. Manages the master dataset (an immutable, append-only set of raw data) and pre-computes arbitrary query functions (called batch views).
- Serving layer. Indexes the batch views to support ad-hoc queries with low latency.
- Speed layer. Accommodates all requests that are subject to low latency requirements. Using fast and incremental algorithms, the speed layer deals with recent data only.
There are many tools that can be applied to each of these architectural layers. For instance, the raw results could simply be stored in HDFS and the batch views might be pre-computed with help of Pig jobs. Similarly, to make it possible to query batch views effectively, they might be indexed using technologies such as Apache Drill, Impala, ElasticSearch or many others.
The speed layer, though, often requires a non-trivial amount of effort. Fortunately, there are many new technologies that can help with that as well. One of the most commonly chosen ones is Apache Storm which uses an abstraction of spouts and bolts to define data sources and manipulations of them, allowing distributed processing of streaming data.
The final element is to combine the results of the Speed layer and the Serving layer when assembling the query response.
This architecture combines the batch processing power of Hadoop with real-time availability of results. If the underlying algorithm ever changes, it’s very easy to just recalculate all batch views as a batch process and then update the speed layer to take the new version into account.
Without a doubt, big data represents one of the most formidable and pervasive challenges facing the software industry today. From social networking, to marketing, to security and law enforcement, the need for solutions that can effectively handle and process big data is becoming increasingly important and is rapidly on the rise.
Real expertise and proficiency in this domain requires far more than learning the ins and outs of a particular technology. It requires an understanding, at the first principles level, of the manifold technical challenges and complexities involved as well as the most effective methodologies and approaches to address these hurdles.
We hope you find the questions presented in this article to be a useful foundation for “separating the wheat from the chaff” in your quest for the elite few among Big Data engineers.