10 Essential Hadoop Interview Questions

Toptal sourced essential questions that the best Hadoop developers and engineers can answer. Driven from our community, we encourage experts to submit questions and offer feedback.

Toptal is an exclusive network of the top freelance software developers, designers, finance experts, product managers, and project managers in the world. Top companies hire Toptal freelancers for their most important projects.

Interview Questions

How can one define custom input and output data formats for MapReduce jobs?

Hadoop MapReduce comes with built-in support for many common file formats such as SequenceFile. To implement custom types, one has to implement the InputFormat and OutputFormat Java interfaces for reading and writing, respectively.

A class implementing InputFormat (and similarly OutputFormat) should implement both the logic to split the data and the logic to read records out of each split. The latter should be an implementation of the RecordReader interface (RecordWriter on the output side).

Implementations of InputFormat and OutputFormat may retrieve data by means other than from files on HDFS. For instance, Apache Cassandra ships with implementations of InputFormat and RecordReader.
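The two responsibilities described above, splitting the input and reading records out of each split, can be illustrated with a small sketch. It is a plain-Python analogy, not Hadoop's Java API: the `Split` type, the `FixedWidthInputFormat` class, and its method names are hypothetical and only mirror the roles of InputFormat and RecordReader.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass(frozen=True)
class Split:
    """A contiguous byte range of the input (hypothetical helper type)."""
    start: int
    length: int


class FixedWidthInputFormat:
    """Toy analogue of an InputFormat for fixed-width binary records."""

    def __init__(self, record_size: int, records_per_split: int) -> None:
        self.record_size = record_size
        self.records_per_split = records_per_split

    def get_splits(self, total_bytes: int) -> List[Split]:
        # Splitting logic: cut on record boundaries so that no record
        # straddles two splits.
        split_bytes = self.record_size * self.records_per_split
        return [
            Split(start, min(split_bytes, total_bytes - start))
            for start in range(0, total_bytes, split_bytes)
        ]

    def read_records(self, data: bytes, split: Split) -> Iterator[bytes]:
        # Record-reading logic: the role a RecordReader plays for one split.
        end = split.start + split.length
        for offset in range(split.start, end, self.record_size):
            yield data[offset:offset + self.record_size]


data = b"AAAABBBBCCCCDDDDEEEE"  # five 4-byte records
fmt = FixedWidthInputFormat(record_size=4, records_per_split=2)
splits = fmt.get_splits(len(data))
records = [list(fmt.read_records(data, s)) for s in splits]
```

Each split can be handed to a different task, which then only reads the records inside its own byte range.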

What is HDFS?

The Hadoop Distributed File System (HDFS) is a distributed file system and a central part of the Hadoop collection of software. HDFS attempts to abstract away the complexities involved in distributed file systems, including replication, high availability, and hardware heterogeneity.

Two major components of HDFS are NameNode and a set of DataNodes. NameNode exposes the filesystem API, persists metadata, and orchestrates replication amongst DataNodes.

MapReduce natively makes use of HDFS’ data locality API to dispatch MapReduce tasks to run where the data lives.

What read and write consistency guarantees does HDFS provide?

Even though data is distributed amongst multiple DataNodes, NameNode is the central authority for file metadata and replication (and as a result, a single point of failure). The configuration parameter dfs.namenode.replication.min defines the minimum number of replicas a block must be written to before the write returns as successful.

What is the MapReduce programming paradigm and how can it be used to design parallel programs?

MapReduce is a programming model for writing parallel programs that run on a distributed set of machines. The similarly named “Hadoop MapReduce” is an implementation of the MapReduce model.

Input and output data in MapReduce are modeled as records of key-value pairs.

Central to MapReduce are map and reduce programs, reminiscent of map and reduce in functional programming. They transform data in two phases, each running in parallel and linearly scalable.

The map function takes each key-value pair and outputs a list of key-value pairs. The reduce function receives an aggregate of all values emitted for each key across all outputs of instances of map invocations and reduces them to a single final value.
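The flow described above can be sketched on a single machine. This is a minimal illustration rather than Hadoop's API: `map_fn`, `reduce_fn`, `run_job`, and the sample city and temperature records are all made up for the example, and the grouping step stands in for the shuffle that a real framework performs across machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_fn(key: int, value: str) -> Iterator[Tuple[str, int]]:
    # One input record in, a list of key-value pairs out.
    city, temperature = value.split(",")
    yield city, int(temperature)


def reduce_fn(key: str, values: Iterable[int]) -> Tuple[str, int]:
    # All values emitted for one key in, a single final value out.
    return key, max(values)


def run_job(records: List[Tuple[int, str]]) -> Dict[str, int]:
    grouped = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            grouped[out_key].append(out_value)  # stand-in for the shuffle
    return dict(reduce_fn(k, v) for k, v in sorted(grouped.items()))


records = [(0, "oslo,3"), (1, "cairo,31"), (2, "oslo,-4"),
           (3, "cairo,35"), (4, "lima,19")]
max_by_city = run_job(records)
```

Because each `map_fn` call depends only on its own record and each `reduce_fn` call only on the values for its own key, both phases can be spread over many workers.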

MapReduce integrates with HDFS to provide data locality for the data it processes. For sufficiently large data, it is cheaper to send a map or reduce program to the machines where the data lives than to move the data to the program.

Hadoop’s implementation of MapReduce provides native support for the JVM runtime and extended support for other runtimes communicating via standard in/out.

What common data serialization formats are used to store data in HDFS and what are their properties?

HDFS can store any type of file regardless of format; however, certain properties make some file formats better suited for distributed computation.

HDFS organises and distributes files in blocks of fixed size. For example, given a block size of 128MB, a 257MB file is split into three blocks. Records at block boundaries, as a result, may be split. File formats designed to be consumed when split, also called “splittable,” include “sync markers” between groups of records so that any contiguous chunk of the file can be consumed. Furthermore, compression may be desired in conjunction with splittability.
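The block arithmetic in the example above can be checked with a few lines of code. The `block_layout` helper is hypothetical and only models how a file length divides into fixed-size blocks.

```python
from typing import List

MB = 1024 * 1024


def block_layout(file_size: int, block_size: int = 128 * MB) -> List[int]:
    """Return the length of each block a file of `file_size` bytes occupies."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


layout = block_layout(257 * MB)
block_count = len(layout)
```

For the 257MB file this gives two full 128MB blocks followed by a final block holding the remaining 1MB.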

Support for compression is particularly important because it trades off IO and CPU resources. A compressed file is quicker to load from disk but takes extra time to decompress.

CSV files, for instance, are splittable since they include a “line separator” between records. However, they are not suitable for binary data, and the format has no built-in compression: compressing a whole CSV file with a codec such as gzip makes it non-splittable.
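To see why a line separator is enough to make a format splittable, consider a reader that is handed an arbitrary byte range of the file. The sketch below is a simplified illustration, not Hadoop's own reader: the `read_split` function and the sample data are assumptions. The rule it applies is that a line belongs to the split in which its first byte falls, so a reader that starts mid-line skips ahead to the next separator and may read past the end of its range to finish its last line.

```python
from typing import Iterator, List


def read_split(data: bytes, start: int, length: int) -> Iterator[bytes]:
    """Yield every line whose first byte lies in [start, start + length)."""
    end = min(start + length, len(data))
    pos = start
    if start > 0:
        # Find the first line that begins at or after `start`; a partial
        # line at the front belongs to the previous split.
        newline = data.find(b"\n", start - 1)
        if newline == -1:
            return
        pos = newline + 1
    while pos < end:
        newline = data.find(b"\n", pos)
        if newline == -1:
            newline = len(data)  # last line without a trailing separator
        yield data[pos:newline]
        pos = newline + 1


data = b"id,name\n1,ada\n2,grace\n3,edsger\n"
split_size = 10  # deliberately tiny so that lines straddle split boundaries
per_split: List[List[bytes]] = [
    list(read_split(data, start, split_size))
    for start in range(0, len(data), split_size)
]
all_lines = [line for lines in per_split for line in lines]
```

Every line is produced exactly once even though the split boundaries fall in the middle of records.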

The SequenceFile format, native to the Hadoop ecosystem, is a binary format that stores key-value records, is splittable, and supports compression at the block and record levels.

Apache Avro, a data serialization and RPC framework, defines the Avro Object Container File format that stores Avro-encoded records. It is both splittable and compressible. Having also a flexible schema definition language, it’s widely used.

The Parquet file format, another Apache project, is a columnar format, in which the values belonging to each column are stored together.

What availability guarantees does HDFS provide?

HDFS relies on NameNode to store metadata about which DataNodes different blocks are stored at. Since NameNode runs on a single node, it’s a single point of failure and its failure makes HDFS unavailable.

To achieve high availability, a Standby NameNode may be configured as a failover target. The Active NameNode streams a log of mutations to a group of JournalNodes, from which the Standby NameNode receives the latest changes to the filesystem metadata.

Automatic failover between Active and Standby NameNodes can be configured by maintaining an ephemeral lock on a quorum of a ZooKeeper cluster. A failover controller process running alongside each NameNode is responsible for checking the NameNode’s health, for maintaining the ephemeral lock, and for executing a fencing mechanism that makes sure that upon failover, the previous Active NameNode does indeed act passively.

What’s the purpose of Hadoop Streaming and how does it work?

Hadoop Streaming is an extension of Hadoop’s MapReduce API that makes it possible for programs that run within runtimes other than the JVM to act as map and reduce programs. Hadoop Streaming defines an interface where data can be sent and received via the standard out and standard in streams provided by operating systems (and hence its name).
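As an illustration, a word count written for this interface might look like the sketch below. It assumes the commonly used convention of one record per line with key and value separated by a tab, and that the reducer receives its input sorted by key. The function names and the `map`/`reduce` command-line argument are choices made for this example, not part of Hadoop.

```python
import sys
from itertools import groupby
from typing import Iterable, Iterator


def mapper(lines: Iterable[str]) -> Iterator[str]:
    # Emit "word<TAB>1" for every word read from standard input.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines: Iterable[str]) -> Iterator[str]:
    # Input arrives sorted by key, so equal keys are adjacent.
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Run as "script.py map" or "script.py reduce" (hypothetical convention).
if __name__ == "__main__" and len(sys.argv) == 2 and sys.argv[1] in ("map", "reduce"):
    stage = mapper if sys.argv[1] == "map" else reducer
    for output_line in stage(sys.stdin):
        print(output_line)
```

The same pipeline can be tried locally with a shell pipe that sorts the mapper output before feeding it to the reducer, which approximates what the framework does between the two phases.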

What is speculative execution and when can it be used?

A MapReduce program may translate into many invocations of mapper and reducer tasks running on different nodes of the cluster. If a task is slow to complete, MapReduce “speculatively” runs a duplicate of the same task on another node, as the first node might have been overloaded or faulty. The output of whichever attempt finishes first is used.

For speculative execution to work correctly, tasks need to have no side effects; or if they do they need to be “idempotent.” A side-effect-free task is one that besides producing the expected output, does not mutate any external state (such as writing into a database). Idempotence in this context means that if a side effect is repeatedly applied (due to speculative execution), it would not change the end result. Nevertheless, side effects are generally undesirable for a MapReduce task regardless of speculative execution.
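The difference between an idempotent and a non-idempotent side effect can be shown with a toy model. Everything here is hypothetical: the two dictionaries stand in for an external database, and `run_with_speculation` simply executes a task twice to mimic a speculative duplicate that also reached the point of writing.

```python
from typing import Callable, Dict

counter_store: Dict[str, int] = {"rows": 0}  # stand-in for an external counter
result_store: Dict[str, int] = {}            # stand-in for a keyed table


def non_idempotent_task(task_id: str) -> None:
    # Incrementing shared state: every extra attempt changes the result.
    counter_store["rows"] += 100


def idempotent_task(task_id: str) -> None:
    # Writing under a key derived from the task: repeats overwrite in place.
    result_store[task_id] = 100


def run_with_speculation(task: Callable[[str], None], task_id: str,
                         attempts: int = 2) -> None:
    for _ in range(attempts):
        task(task_id)


run_with_speculation(non_idempotent_task, "task-7")
run_with_speculation(idempotent_task, "task-7")
total_from_increments = counter_store["rows"]
total_from_keyed_writes = sum(result_store.values())
```

The increment-based task double counts its rows, while the keyed write ends in the same state no matter how many attempts ran.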

What is the “small files problem” with Hadoop?

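NameNode is the registry for all metadata in HDFS. The metadata, although journaled on disk, is served from memory and as a result is subject to the limitations of the runtime. NameNode, being a Java application, runs on the JVM and cannot operate efficiently with very large heap allocations. Because every file, directory, and block is tracked as an entry in that memory, storing the same volume of data as many small files rather than fewer large ones multiplies the metadata the NameNode has to hold. This is what is known as the small files problem.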
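A back-of-the-envelope estimate gives a feel for why many small files strain the NameNode. The sketch assumes that each file and each block costs roughly 150 bytes of NameNode heap, a commonly quoted rule of thumb that is treated here purely as an assumption. The helper names and the sample workload are made up for the example.

```python
MB = 1024 * 1024
BYTES_PER_OBJECT = 150  # assumption: rough rule of thumb, not a measured value


def namenode_objects(num_files: int, file_size: int,
                     block_size: int = 128 * MB) -> int:
    """Count one object per file plus one per block the file occupies."""
    blocks_per_file = max(1, (file_size + block_size - 1) // block_size)
    return num_files * (1 + blocks_per_file)


def estimated_heap_bytes(num_files: int, file_size: int) -> int:
    return namenode_objects(num_files, file_size) * BYTES_PER_OBJECT


# The same 10,000,000 MB of data stored two ways.
small_files_heap = estimated_heap_bytes(10_000_000, 1 * MB)
large_files_heap = estimated_heap_bytes(78_125, 128 * MB)
ratio = small_files_heap // large_files_heap
```

Under these assumptions the small-file layout needs about 3 GB of heap for metadata against roughly 23 MB for the large-file layout, a factor of 128 for identical data.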

Explain rack awareness in Hadoop.

HDFS replicates blocks onto multiple machines. In order to have higher fault tolerance against rack failures (network or physical), HDFS is able to distribute replicas across multiple racks.

Hadoop obtains network topology information by either invoking a user-defined script or by loading a Java class which should be an implementation of the DNSToSwitchMapping interface. It’s the administrator’s responsibility to choose the method, to set the right configuration, and to provide the implementation of said method.
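A user-defined topology script can be very small. The sketch below assumes the usual contract for such scripts, namely that host names or IP addresses arrive as command-line arguments and one rack path per argument is written to standard output. The addresses, rack paths, and the fallback rack name are hypothetical values chosen for the example.

```python
import sys
from typing import Dict, List

DEFAULT_RACK = "/default-rack"  # assumption: fallback for unknown hosts

# Hypothetical host-to-rack table; a real script might load it from a file.
RACKS: Dict[str, str] = {
    "10.0.1.11": "/dc1/rack1",
    "10.0.1.12": "/dc1/rack1",
    "10.0.2.21": "/dc1/rack2",
}


def resolve(hosts: List[str]) -> List[str]:
    """Map each host to its rack path, in the order the hosts were given."""
    return [RACKS.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__" and len(sys.argv) > 1:
    for rack in resolve(sys.argv[1:]):
        print(rack)
```

With this mapping, two replicas placed on 10.0.1.11 and 10.0.2.21 would be known to sit on different racks.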

There is more to interviewing than tricky technical questions, so these are intended merely as a guide. Not every “A” candidate worth hiring will be able to answer them all, nor does answering them all guarantee an “A” candidate. At the end of the day, hiring remains an art, a science — and a lot of work.
