There are some very misleading articles out there titled “Spark or Hadoop”, “Spark is better than Hadoop” or even “Spark is replacing Hadoop”.
In this article I am going to show you the differences between Spark and Hadoop. After reading it, you will know when to use each and what for.
You’ll also understand why “Hadoop or Spark” is completely the wrong question.
Where’s the difference?
To make it clear how Hadoop differs from Spark, I created this simple feature table:

Feature             | Hadoop    | Spark
Storage             | HDFS      | none
Analytics           | MapReduce | Spark
Resource management | YARN      | standalone (seldom used)
Hadoop is used to store data in the Hadoop Distributed File System (HDFS). It can analyse the stored data with MapReduce and manage resources with YARN.
However, Hadoop is more than just storage, analytics and resource management. There’s a whole ecosystem of tools around the Hadoop core. I’ve written about this ecosystem in this article: What is Hadoop and why is it so freakishly popular. You should check it out as well.
Compared to Hadoop, Spark is “just” an analytics framework. It has no storage capability. Although it has a standalone resource manager, you usually don’t use that feature.
More on that later.
What’s wrong with MapReduce?
As you can see from the table above, comparing Spark to Hadoop makes no sense. You need to compare Spark to MapReduce.
When do I use MapReduce?
MapReduce is awesome for simpler analytics tasks, like counting stuff. It just has one flaw: it has only two stages, Map and Reduce.
First, MapReduce loads the data from HDFS into the mapping function, where you prepare the input data for processing in the reducer. After the reduce phase is finished, the results get written back to the data store.
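To make the two stages concrete, here is a minimal word count sketch in the Hadoop Streaming style. The Map and Reduce stages are just two small Python scripts that read from stdin; the file names are made up for illustration.

```python
#!/usr/bin/env python
# mapper.py -- the Map stage: emit a (word, 1) pair for every word in the input.
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t1" % word)
```

```python
#!/usr/bin/env python
# reducer.py -- the Reduce stage: sum up the counts per word.
# Hadoop Streaming delivers the mapper output to the reducer sorted by key.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, 0
    current_count += int(count)

if current_word is not None:
    print("%s\t%d" % (current_word, current_count))
```

Hadoop runs the mapper on the input splits, sorts and shuffles its output by key, feeds it to the reducer, and writes the result back to HDFS.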
If you don’t already know exactly how MapReduce works, you should read my article about distributed processing.
The problem with MapReduce is that there is no simple way to chain multiple map and reduce processes together. At the end of each reduce process, the data must be stored somewhere.
This makes complicated analytics processes very hard to do: you would need to chain multiple MapReduce jobs together, as in the sketch below.
Chaining jobs while storing and loading intermediate results on every hop just makes no sense.
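Here is what such a chain looks like in a hypothetical Python driver script. The jar, class names and HDFS paths are invented for illustration, but the pattern is real: every hop between two jobs is a full write to and read from HDFS.

```python
# chain.py -- a hypothetical driver for two chained MapReduce jobs.
# The jar, class names and HDFS paths are invented for illustration.
import subprocess

def run_job(main_class, input_path, output_path):
    # Each job spins up, processes its input and persists its FULL output to HDFS.
    subprocess.run(
        ["hadoop", "jar", "analytics.jar", main_class, input_path, output_path],
        check=True,
    )

# Job 1 writes its complete result set to an intermediate HDFS directory ...
run_job("ExtractFeatures", "/data/raw", "/tmp/intermediate")
# ... which job 2 then has to read back from disk before it can even start.
run_job("AggregateFeatures", "/tmp/intermediate", "/data/result")
```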
Another issue with MapReduce is that it is not capable of streaming analytics. Jobs take some time to spin up, do the analytics and shut down. Wait times of several minutes are totally normal.
This is a big drawback in a world that is processing data in real time more and more.
Why Spark is the perfect fit for complex analytics
Spark is a completely in-memory framework. Data gets loaded from, for instance, HDFS into the memory of the workers.
There is no longer a fixed map and reduce stage. Your code can be as complex as you want.
Once in memory, the input data and the intermediate results stay in memory (until the job finishes). They do not get written to a drive like with MapReduce.
This makes Spark the optimal choice for doing complex analytics. It allows you, for instance, to run iterative processes: modifying a dataset multiple times in order to create an output becomes totally easy.
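Here is a minimal PySpark sketch of that idea: the input is loaded from HDFS once, cached in worker memory, and then reused by several computations without any intermediate writes to disk. The path and the statistics are made up for illustration.

```python
# A hypothetical PySpark job: load once from HDFS, keep the data in memory,
# and run several computations on it without writing intermediate results to disk.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("in-memory-example").getOrCreate()

numbers = (spark.sparkContext
           .textFile("hdfs:///data/measurements.txt")  # made-up path
           .map(float)
           .cache())  # keep the parsed data in worker memory

# Multiple passes over the same cached dataset -- with MapReduce, each of
# these would be a separate job that reads its input from HDFS again.
mean = numbers.mean()
stdev = numbers.stdev()
outliers = numbers.filter(lambda x: abs(x - mean) > 3 * stdev).count()

print(mean, stdev, outliers)
spark.stop()
```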
Streaming analytics capability is also what makes Spark so great. Spark natively has the option to schedule a job to run every X seconds or X milliseconds.
As a result, Spark can deliver results from streaming data in “real time”.
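A minimal sketch with the classic Spark Streaming API shows this scheduling. The socket source and port are made up for illustration, and the batch interval is the “every X seconds” from above.

```python
# A hypothetical streaming word count with the classic Spark Streaming API.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-example")
ssc = StreamingContext(sc, batchDuration=1)  # run a micro-batch every second

# Made-up source: count words arriving on a local socket, batch by batch.
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print each micro-batch result as it is computed

ssc.start()
ssc.awaitTermination()
```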
Spark and Hadoop: A perfect fit
So, if Hadoop and Spark are not the same things, can they work together?
Absolutely! Here’s what the setup looks like if you combine Hadoop with Spark:
As storage you use the Hadoop Distributed File System (HDFS). Analytics is done with Apache Spark, and YARN takes care of the resource management.
Why does that work so well together?
From a platform architecture perspective, Hadoop and Spark are usually managed on the same cluster. This means that on each server where an HDFS DataNode is running, a Spark worker runs as well.
In distributed processing, network transfer between machines is a major bottleneck. Transferring data within a machine reduces this traffic significantly.
Spark is able to determine on which data node the needed data is stored. This allows a direct load of the data from the local storage into the memory of the machine.
This reduces network traffic a lot.
As for YARN: you need to make sure that your physical resources are distributed sensibly between the services. This is especially the case when you run Spark workers alongside other Hadoop services on the same machine.
It just would not make sense to have two resource managers managing the same server’s resources. Sooner or later they will get in each other’s way.
That’s why the Spark standalone resource manager is seldom used.
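In practice that means you point Spark at YARN instead of its standalone manager. A minimal sketch, assuming HADOOP_CONF_DIR is set so Spark can find the cluster configuration; the app name and resource values are made up:

```python
# A sketch of pointing Spark at YARN instead of its standalone manager.
# Assumes HADOOP_CONF_DIR points at the cluster configuration.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("yarn")                            # let YARN allocate executors
         .appName("spark-on-yarn-example")
         .config("spark.executor.memory", "4g")     # resources requested from YARN
         .config("spark.executor.instances", "10")
         .getOrCreate())
```

Typically you would pass --master yarn to spark-submit instead of hard-coding it, but the effect is the same: one resource manager, YARN, for everything on the cluster.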
So, the question is not Spark or Hadoop. The question has to be: should you use Spark or MapReduce alongside Hadoop’s HDFS and YARN?
My simple rule of thumb is:
If you are doing simple batch jobs like counting values or calculating averages: go with MapReduce.
If you need more complex analytics like machine learning or fast stream processing: go with Apache Spark.
What do you think? Did you expect this conclusion, or did you root for Spark or Hadoop (MapReduce)?
Tell me in the comments section below 🙂
PS: Did you like this article? Leave a like on the LinkedIn post.
That would help me a lot! Thanks again.
Title Image by Photofunia Retro Wave generator: https://photofunia.com/categories/all_effects/retro-wave