Apache Hadoop Or Spark?


There are some very misleading articles out there with titles like "Spark or Hadoop", "Spark Is Better Than Hadoop" or even "Spark Is Replacing Hadoop".

In this article I am going to show you the differences between Spark and Hadoop. After reading it, you will know when to use Spark, when to use Hadoop, and for which tasks.

You'll also understand why "Hadoop or Spark?" is totally the wrong question.

Where’s the difference?

To make it clear how Hadoop differs from Spark, I created this simple feature table:

                          Hadoop          Spark
    Storage               HDFS            none
    Analytics             MapReduce       Spark Core
    Resource management   YARN            standalone (rarely used)

Hadoop is used to store data in the Hadoop Distributed File System (HDFS). It can analyse the stored data with MapReduce and manage resources with YARN.

However, Hadoop is more than just storage, analytics and resource management. There's a whole ecosystem of tools around the Hadoop core. I've written about this ecosystem in this article: What is Hadoop and why is it so freakishly popular. You should check it out as well.

Compared to Hadoop, Spark is "just" an analytics framework. It has no storage capability. Although it has a standalone resource manager, you usually don't use that feature.

More on that later.

What’s wrong with MapReduce?

As you can see from the table above, comparing Spark to Hadoop makes no sense. You need to compare Spark to MapReduce.

When do I use MapReduce?

MapReduce is awesome for simpler analytics tasks, like counting stuff. It just has one flaw: it has only two stages, Map and Reduce.

First, MapReduce loads the data from HDFS into the mapping function. There you prepare the input data for processing in the reducer. After the reduce stage is finished, the results get written to the data store.
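To make that concrete, here is a minimal word-count sketch against the classic Hadoop MapReduce API (written in Scala to match the Spark examples further down; the class names are made up):

```scala
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Mapper, Reducer}

// Map stage: prepare the input for the reducer by emitting (word, 1) pairs.
class TokenMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()

  override def map(key: LongWritable, value: Text,
                   ctx: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { w =>
      word.set(w)
      ctx.write(word, one)
    }
}

// Reduce stage: sum the counts per word; the result is written to the data store.
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      ctx: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var sum = 0
    val it  = values.iterator()
    while (it.hasNext) sum += it.next().get()
    ctx.write(key, new IntWritable(sum))
  }
}
```

That is the entire shape of the model: whatever you want to compute has to be squeezed into these two functions.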

If you don't already know how exactly MapReduce works, you should read my article about distributed processing.

The problem with MapReduce is that there is no simple way to chain multiple map and reduce phases together. At the end of each reduce phase, the data must be stored somewhere.

This makes complicated analytics processes very hard to do: you would need to chain several MapReduce jobs together.

Chaining jobs, with storing and re-loading intermediate results in between, just makes no sense.
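Here is a sketch of what such a chained pipeline looks like with the standard Hadoop driver API (the job names and HDFS paths are invented for the example). Note how the first job has to write its complete output to disk before the second job can even start:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

val conf = new Configuration()

// Step 1: read the raw input, write the intermediate result to HDFS.
val job1 = Job.getInstance(conf, "step-1")
// ... set mapper/reducer classes and key/value types here ...
FileInputFormat.addInputPath(job1, new Path("/data/input"))
FileOutputFormat.setOutputPath(job1, new Path("/data/intermediate")) // materialised to disk
job1.waitForCompletion(true)

// Step 2: re-read the intermediate result from disk and process it again.
val job2 = Job.getInstance(conf, "step-2")
// ... set mapper/reducer classes and key/value types here ...
FileInputFormat.addInputPath(job2, new Path("/data/intermediate"))
FileOutputFormat.setOutputPath(job2, new Path("/data/output"))
job2.waitForCompletion(true)
```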

Another issue with MapReduce is that it is not capable of streaming analytics. Jobs take some time to spin up, do the analytics, and shut down. Wait times of several minutes are totally normal.

This is a big drawback in a world that is moving more and more towards real-time data processing.

Why Spark is the perfect fit for complex analytics

Spark is a completely in-memory framework. Data gets loaded from storage, for instance HDFS, into the memory of the workers.

There is no longer a fixed map and reduce stage. Your code can be as complex as you want.

Once in memory, the input data and the intermediate results stay in memory (until the job finishes). They do not get written to a drive like with MapReduce.

This makes Spark the optimal choice for doing complex analytics. It allows you, for instance, to run iterative processes: modifying a dataset multiple times in order to create an output becomes totally easy.
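Here is a minimal sketch of that in Spark (the HDFS path is a placeholder): an iterative pipeline that modifies the dataset ten times, where nothing is written to disk and nothing even runs until the final action:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("iterative-demo").getOrCreate()
val sc    = spark.sparkContext

// Load once from HDFS into the workers' memory.
var data = sc.textFile("hdfs:///data/numbers").map(_.toDouble)

// Iterative processing: each round derives a new in-memory dataset.
for (_ <- 1 to 10)
  data = data.map(x => x * 0.9 + 1.0)

// Only this action triggers the whole pipeline; no intermediate
// result was ever written to a drive.
println(data.sum())
```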

Streaming analytics capability is also what makes Spark so great. Spark natively offers the option to schedule a job to run every X seconds or every X milliseconds.

As a result, Spark can deliver you results from streaming data in “real time”.
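A minimal Spark Streaming sketch, assuming a simple socket test source on localhost:9999; the batch interval passed to the StreamingContext is exactly that "run every X seconds" scheduling:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("stream-counts")
val ssc  = new StreamingContext(conf, Seconds(5)) // one micro-batch every 5 seconds

// Count words over each 5-second batch and print the result "in real time".
ssc.socketTextStream("localhost", 9999)
   .flatMap(_.split("\\s+"))
   .map((_, 1))
   .reduceByKey(_ + _)
   .print()

ssc.start()
ssc.awaitTermination()
```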

Spark and Hadoop: A perfect fit

So, if Hadoop and Spark are not the same things, can they work together?

Absolutely! Here's how the table from the beginning looks when you combine Hadoop with Spark:

As storage you use the Hadoop Distributed File System (HDFS). Analytics is done with Apache Spark, and YARN takes care of the resource management.

Why does that work so well together?

From a platform architecture perspective, Hadoop and Spark are usually managed on the same cluster. This means that on each server where an HDFS DataNode is running, a Spark worker runs as well.

In distributed processing, network transfer between machines is a major bottleneck. Keeping data local to a machine reduces this traffic significantly.

Spark is able to determine on which DataNode the needed data is stored. This allows it to load the data directly from local storage into the memory of that machine.

This reduces network traffic a lot.
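You can even inspect this from the API. A small sketch (the path is again a placeholder): for every partition of a file, Spark knows which hosts hold the underlying HDFS blocks and tries to schedule the task on one of them:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("locality-demo").getOrCreate()
val sc    = spark.sparkContext

val rdd = sc.textFile("hdfs:///data/input")

// Each partition maps to HDFS blocks; preferredLocations lists the
// DataNode hosts that store them locally.
rdd.partitions.foreach { p =>
  println(s"partition ${p.index} prefers: ${rdd.preferredLocations(p).mkString(", ")}")
}
```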

As for YARN: you need to make sure that your physical resources are distributed sensibly between the services. This is especially the case when you run Spark workers alongside other Hadoop services on the same machine.

It just would not make sense to have two resource managers managing the same server's resources. Sooner or later they would get in each other's way.

That’s why the Spark standalone resource manager is seldom used.
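In practice, handing resource management to YARN is just a matter of the master setting. A sketch (the executor sizes are example values; normally you would pass all of this to spark-submit with --master yarn instead of hard-coding it):

```scala
import org.apache.spark.sql.SparkSession

// Let YARN, not Spark's standalone manager, allocate the cluster resources.
val spark = SparkSession.builder
  .appName("on-yarn")
  .master("yarn")
  .config("spark.executor.memory", "4g") // example values, tune per cluster
  .config("spark.executor.cores", "2")
  .getOrCreate()
```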

My Conclusion

So, the question is not Spark or Hadoop. The question has to be: should you use Spark or MapReduce alongside Hadoop's HDFS and YARN?

My simple rule of thumb is:

If you are doing simple batch jobs like counting values or calculating averages: go with MapReduce.

If you need more complex analytics like machine learning or fast stream processing: go with Apache Spark.


What do you think? Did you expect this conclusion, or did you root for Spark or Hadoop (MapReduce)?

Tell me in the comments section below 🙂


PS: Did you like this article? Leave a like at the LinkedIn Post.
That would help me a lot! Thanks again.


Title Image by Photofunia Retro Wave generator: https://photofunia.com/categories/all_effects/retro-wave

Comments (15)

  1. What about the use case where the ETL is relatively straightforward, but the dataset is so large that it makes sense to process it via Spark for the performance gain? This is especially true when you have to do a group-by or an analytic function over the whole dataset, where M/R is extremely slow.

  2. Hi,
    Nice article. I am currently doing a Coursera course on the Hadoop Platform and Application Framework, and your articles are supplementing my learning and giving me good points to think about. In my learning so far, whenever the question is asked which one is better, most of the time the answer has been of this nature: the combination is more fruitful and yields results. The only thing is that the processors need to be faster, otherwise it is a headache when large data is to be processed. Any tips for me so I can make my learning effective? I am using Cloudera Live and VMware for learning purposes.

    With Regards
    Rahul

    Author's reply: I would say concentrate on one thing, for instance Spark. Get a Scala book and just write applications. I am not sure if Cloudera Live already has the Spark parcel; maybe you need to install it first. The processing speed does not depend on how much data you handle. Hadoop is a distributed system, so usually you can add more nodes to the cluster to speed up processing.

  3. It's true the comparison should be between MapReduce and Spark. Learning Scala becomes a prerequisite for Spark then. MapReduce has its place and Spark has its place. What I feel is that, as a developer, one must know what can be done using MapReduce and Spark. Then one can choose the exact tool based on the purpose.

    Author's reply: This is an awesome comment, Shraddha! It's what I always preach to people as well. Learn what the tools can do and then choose one.
    It also is a great advantage if you know how to do basic things with the tools.

  4. I have a newbie question: when you say data is loaded in memory, you mean RAM, right? Does it mean that if you want to process 1 TB of data with Spark, you need 1 TB of RAM?

      Author's reply: Also, 1 TB of RAM is nothing in a Spark cluster. If you run just a 10-node Spark setup with 512 GB of RAM each, that's already 5 TB of RAM.

  5. I am good with Hadoop but new to Spark. In real scenarios I have seen that MapReduce is more flexible than Spark.
    When I built a Spark streaming service, I got stuck on sharing an object across the cluster. I wanted to create one object per service that would be shared by all the executors in Spark, but I was not able to because of custom serialization. In Hadoop there are a lot of box classes that already implement Java serialization. Please let us know your thoughts.

  6. First of all, thanks for sharing helpful information about Spark. But I want to point out: Hadoop is a combination of the Hadoop Distributed File System (HDFS) + MapReduce, and it's possible to replace MapReduce, but not HDFS. OK, I agree with all your answers. Thanks for sharing, nice post.

    Author's reply: For simple tasks MapReduce is still OK to use. That Spark is not included in Hadoop is because they are different Apache projects.
    However, big data cluster vendors like Cloudera, MapR or Hortonworks have incorporated Spark into their distributions.
    So, as you said, in these cases Hadoop actually comes with Spark.

  7. I think Spark is a cluster-computing framework, so it competes more with MapReduce than with the Hadoop system as a whole. But Spark doesn't have its own distributed filesystem, and MapReduce is disk-based. I agree with your points that for simple batch jobs you should go with MapReduce and for complex analytics use Apache Spark. Thank you!
