I first heard of Spark in late 2013 when I became interested in Scala, the language in which Spark is written. Some time later, I did a fun data science project trying to predict survival on the Titanic. This turned out to be a great way to get further introduced to Spark concepts and programming. I highly recommend it for any aspiring Spark developers looking for a place to get started.

Today, Spark is being adopted by major players like Amazon, eBay, and Yahoo! Many organizations run Spark on clusters with thousands of nodes. According to the Spark FAQ, the largest known cluster has over 8000 nodes. Indeed, Spark is a technology well worth taking note of and learning about.


This article provides an introduction to Spark including use cases and examples. It contains information from the Apache Spark website as well as the book Learning Spark: Lightning-Fast Big Data Analysis.

What is Apache Spark? An Introduction

Spark is an Apache project advertised as “lightning fast cluster computing”. It has a thriving open-source community and is the most active Apache project at the moment.

Spark provides a faster and more general data processing platform than Hadoop's MapReduce. Spark lets you run programs up to 100x faster in memory, or 10x faster on disk, than Hadoop. In 2014, Spark overtook Hadoop by completing the 100 TB Daytona GraySort contest 3x faster on one tenth the number of machines, and it also became the fastest open-source engine for sorting a petabyte.

Spark also makes it possible to write code more quickly as you have over 80 high-level operators at your disposal. To demonstrate this, let’s have a look at the “Hello World!” of Big Data: the Word Count example. Written in Java for MapReduce, it has around 50 lines of code, whereas in Spark (and Scala) you can do it as simply as this:

sparkContext.textFile("hdfs://...")
            .flatMap(line => line.split(" "))
            .map(word => (word, 1)).reduceByKey(_ + _)
            .saveAsTextFile("hdfs://...")

Another important aspect when learning how to use Apache Spark is the interactive shell (REPL) which it provides out of the box. Using the REPL, one can test the outcome of each line of code without first needing to code and execute the entire job. The path to working code is thus much shorter, and ad-hoc data analysis is made possible.
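For example, a quick interactive session in the shell (started with bin/spark-shell) might look like this; it is only a sketch, and the file name is purely illustrative:

// Inside spark-shell the SparkContext is already available as `sc`,
// so every step can be run and inspected on its own.
val lines = sc.textFile("README.md")
lines.count()                                  // how many lines are in the file?
lines.filter(_.contains("Spark")).count()      // how many of them mention Spark?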

Additional key features of Spark include:

  • Currently provides APIs in Scala, Java, and Python, with support for other languages (such as R) on the way
  • Integrates well with the Hadoop ecosystem and data sources (HDFS, Amazon S3, Hive, HBase, Cassandra, etc.)
  • Can run on clusters managed by Hadoop YARN or Apache Mesos, and can also run standalone

The Spark core is complemented by a set of powerful, higher-level libraries which can be seamlessly used in the same application. These libraries currently include SparkSQL, Spark Streaming, MLlib (for machine learning), and GraphX, each of which is further detailed in this article. Additional Spark libraries and extensions are currently under development as well.


Spark Core

Spark Core is the base engine for large-scale parallel and distributed data processing. It is responsible for:

  • memory management and fault recovery
  • scheduling, distributing and monitoring jobs on a cluster
  • interacting with storage systems

Spark introduces the concept of an RDD (Resilient Distributed Dataset), an immutable, fault-tolerant, distributed collection of objects that can be operated on in parallel. An RDD can contain any type of object and is created by loading an external dataset or distributing a collection from the driver program.
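Both ways of creating an RDD are shown in this short sketch (the HDFS path and the numbers are purely illustrative):

// Create an RDD by loading an external dataset...
val linesRDD = sc.textFile("hdfs://.../input.txt")
// ...or by distributing a collection from the driver program.
val numbersRDD = sc.parallelize(Seq(1, 2, 3, 4, 5))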

RDDs support two types of operations:

  • Transformations are operations (such as map, filter, join, union, and so on) that are performed on an RDD and which yield a new RDD containing the result.
  • Actions are operations (such as reduce, count, first, and so on) that return a value after running a computation on an RDD.

Transformations in Spark are “lazy”, meaning that they do not compute their results right away. Instead, they just “remember” the operation to be performed and the dataset (e.g., a file) to which the operation is to be applied. The transformations are only actually computed when an action is called and the result is returned to the driver program. This design enables Spark to run more efficiently. For example, if a big file were transformed in various ways and passed to the first() action, Spark would only need to process and return the result for the first line, rather than do the work for the entire file.
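A minimal sketch of this lazy behaviour (again, the file name is just an illustration):

val lines = sc.textFile("big_file.txt")   // nothing is read yet
val upper = lines.map(_.toUpperCase)      // transformation: still nothing happens
upper.first()                             // action: Spark now reads only as much data as it needs to return one element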

By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist or cache method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it.
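For example (a sketch, with an illustrative file name):

val logs = sc.textFile("server_logs.txt")
val errors = logs.filter(_.contains("ERROR")).cache()   // mark this RDD for in-memory caching
errors.count()    // first action: computes the RDD and caches it
errors.take(10)   // later actions reuse the cached data instead of re-reading the file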

SparkSQL

SparkSQL is a Spark component that supports querying data either via SQL or via the Hive Query Language. It originated as the Apache Hive port to run on top of Spark (in place of MapReduce) and is now integrated with the Spark stack. In addition to providing support for various data sources, it makes it possible to weave SQL queries with code transformations, which results in a very powerful tool. Below is an example of a Hive-compatible query:

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL
sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)
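To illustrate the weaving of SQL and regular code mentioned above, here is a small sketch that continues from the src table created in the snippet. It assumes Spark 1.3 or later, where sql() returns a DataFrame, and the key > 100 threshold is an arbitrary example:

// Query with SQL, then keep processing the result with ordinary transformations.
val bigKeys = sqlContext.sql("SELECT key, value FROM src")
                        .rdd
                        .map(row => (row.getInt(0), row.getString(1)))
                        .filter { case (key, _) => key > 100 }
bigKeys.take(5).foreach(println)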

Spark Streaming

Spark Streaming supports real-time processing of streaming data, such as production web server log files (e.g., via Apache Flume or HDFS/S3), social media like Twitter, and various messaging queues like Kafka. Under the hood, Spark Streaming receives the input data streams and divides the data into batches. Next, these batches are processed by the Spark engine to generate the final stream of results, also in batches, as depicted below.

Figure: Spark Streaming divides incoming data streams into batches, which the Spark engine processes into batches of results.

The Spark Streaming API closely matches that of the Spark Core, making it easy for programmers to work in the worlds of both batch and streaming data.
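As a small illustration of that similarity, here is a sketch of the earlier word count rewritten against a text stream. It assumes a text source on localhost:9999 (for example one started with nc -lk 9999), and the batch interval of 10 seconds is arbitrary:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// At least two local threads: one to receive the stream, one to process it.
val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))   // 10-second micro-batches

ssc.socketTextStream("localhost", 9999)
   .flatMap(line => line.split(" "))
   .map(word => (word, 1))
   .reduceByKey(_ + _)
   .print()

ssc.start()
ssc.awaitTermination()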

MLlib

MLlib is a machine learning library that provides various algorithms designed to scale out on a cluster for classification, regression, clustering, collaborative filtering, and so on (check out Toptal’s article on machine learning for more information on that topic). Some of these algorithms also work with streaming data, such as linear regression using ordinary least squares or k-means clustering (and more on the way). Apache Mahout (a machine learning library for Hadoop) has already turned away from MapReduce and joined forces on Spark MLlib.
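To give a flavour of the API, here is a tiny k-means sketch; the points are made up for illustration, and in practice they would be parsed from a real dataset:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
))
val clusters = KMeans.train(points, 2, 20)   // 2 clusters, 20 iterations
clusters.clusterCenters.foreach(println)     // the two learned cluster centres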


GraphX

GraphX is a library for manipulating graphs and performing graph-parallel operations. It provides a uniform tool for ETL, exploratory analysis and iterative graph computations. Apart from built-in operations for graph manipulation, it provides a library of common graph algorithms such as PageRank.
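For example, running PageRank over a graph loaded from an edge list could look roughly like this; the file name and its format (one "srcId dstId" pair per line) are assumptions:

import org.apache.spark.graphx.GraphLoader

val graph = GraphLoader.edgeListFile(sc, "followers.txt")      // hypothetical edge-list file
val ranks = graph.pageRank(0.0001).vertices                    // (vertexId, rank) pairs
ranks.sortBy(_._2, ascending = false).take(5).foreach(println) // five highest-ranked vertices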

How to Use Apache Spark: Event Detection Use Case

Now that we have answered the question “What is Apache Spark?”, let’s think of what kind of problems or challenges it could be used for most effectively.

I came across an article recently about an experiment to detect an earthquake by analyzing a Twitter stream. Interestingly, it was shown that this technique was likely to inform you of an earthquake in Japan quicker than the Japan Meteorological Agency. Even though they used different technology in their article, I think it is a great example to see how we could put Spark to use with simplified code snippets and without the glue code.

First, we would have to filter the tweets which seem relevant, such as those containing “earthquake” or “shaking”. We could easily use Spark Streaming (with its Twitter integration) for that purpose as follows:

TwitterUtils.createStream(...)
            .filter(status => status.getText.contains("earthquake") || status.getText.contains("shaking"))

Then, we would have to run some semantic analysis on the tweets to determine whether they appear to be referencing a current earthquake occurrence. Tweets like “Earthquake!” or “Now it is shaking”, for example, would be considered positive matches, whereas tweets like “Attending an Earthquake Conference” or “The earthquake yesterday was scary” would not. The authors of the paper used a support vector machine (SVM) for this purpose. We'll do the same here, but can also try a streaming version. A resulting code example from MLlib would look like the following:

// Imports for the MLlib utilities used below.
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.util.MLUtils

// We would prepare some earthquake tweet data and load it in LIBSVM format.
val data = MLUtils.loadLibSVMFile(sc, "sample_earthquake_tweets.txt")

// Split data into training (60%) and test (40%).
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)

// Run training algorithm to build the model
val numIterations = 100
val model = SVMWithSGD.train(training, numIterations)

// Clear the default threshold so that predict() returns raw scores.
model.clearThreshold()

// Compute raw scores on the test set. 
val scoreAndLabels = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}

// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = metrics.areaUnderROC()

println("Area under ROC = " + auROC)

If we are happy with the prediction rate of the model, we could move on to the next stage and react whenever we discover an earthquake. To detect one, we need a certain number (i.e., density) of positive tweets in a defined time window (as described in the article). Note that, for tweets with Twitter location services enabled, we would also extract the location of the earthquake. Armed with this knowledge, we could use SparkSQL and query an existing Hive table (storing users interested in receiving earthquake notifications) to retrieve their email addresses and send them a personalized warning email, as follows:

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
// sendEmail is a custom function
sqlContext.sql("FROM earthquake_warning_users SELECT firstName, lastName, city, email")
          .collect().foreach(sendEmail)
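As for the density check itself, a rough sketch with DStream window operations might look like the following; the window length, the threshold of 10 tweets, the positiveTweets stream, and the triggerEarthquakeWarning helper are all illustrative assumptions:

// positiveTweets: a DStream of tweets the classifier marked as describing a current earthquake.
positiveTweets
  .window(Seconds(10), Seconds(10))   // look at the last 10 seconds of tweets
  .count()                            // one count per window
  .foreachRDD { rdd =>
    val tweetsInWindow = rdd.collect().headOption.getOrElse(0L)
    if (tweetsInWindow > 10) triggerEarthquakeWarning()   // hypothetical alerting function
  }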

Other Apache Spark Use Cases

Potential use cases for Spark extend far beyond detection of earthquakes, of course.

Here’s a quick (but certainly nowhere near exhaustive!) sampling of other use cases that require dealing with the velocity, variety and volume of Big Data, for which Spark is so well suited:

In the game industry, processing and discovering patterns from the potential firehose of real-time in-game events and being able to respond to them immediately is a capability that could yield a lucrative business, for purposes such as player retention, targeted advertising, auto-adjustment of complexity level, and so on.

In the e-commerce industry, real-time transaction information could be passed to a streaming clustering algorithm like k-means or collaborative filtering like ALS. Results could then even be combined with other unstructured data sources, such as customer comments or product reviews, and used to constantly improve and adapt recommendations over time with new trends.

In the finance or security industry, the Spark stack could be applied to a fraud or intrusion detection system or risk-based authentication. It could achieve top-notch results by harvesting huge amounts of archived logs, combining them with external data sources like information about data breaches and compromised accounts (see, for example, https://haveibeenpwned.com/) and information from the connection/request, such as IP geolocation or time.


To sum up, Spark helps to simplify the challenging and compute-intensive task of processing high volumes of real-time or archived data, both structured and unstructured, seamlessly integrating relevant complex capabilities such as machine learning and graph algorithms. Spark brings Big Data processing to the masses. Check it out!

About the author

Radek Ostrowski, United Kingdom

Radek is a talented big data engineer able to hit the ground running. He is highly effective in taking applications from inception to completion and improving existing solutions. He is particularly interested in Apache Spark (Certified Developer), Apache Cassandra, Docker, and Scala. He is also a double winner of IBM Sparkathon: http://devpost.com/software/my-perfect-weather