When people talk about big data, Hadoop is one of the first things that comes to mind. A Google search for Hadoop returns about 28 million results.
It seems like you need Hadoop to do big data. Today I am going to shed some light on why Hadoop is so trendy.
You will see that Hadoop has evolved from a platform into an ecosystem. Its design allows a lot of Apache projects and third-party tools to benefit from Hadoop.
I will conclude with my opinion on whether you need to learn Hadoop and whether it is the right technology for everybody.
What is Hadoop?
Hadoop is a platform for distributed storage and analysis of very large data sets.
Hadoop has four main modules: Hadoop Common, HDFS, MapReduce and YARN. The way these modules are woven together is what makes Hadoop so successful.
The Hadoop Common libraries and utilities work in the background to support the other modules. That’s why I will not go further into them.
The Hadoop distributed file system, or HDFS, allows you to store files in Hadoop. The difference between HDFS and other file systems like NTFS or EXT is that HDFS is a distributed one.
What does that mean exactly?
A typical file system stores your data on the actual hard drive. It is hardware dependent.
If you have two disks then you need to format every disk with its own file system. They are completely separate.
You then decide on which disk you physically store your data.
HDFS works differently from a typical file system. HDFS is hardware independent.
Not only does it span many disks in a server, it also spans many servers.
HDFS will automatically place your files somewhere in the Hadoop server collective.
It will not only store your file; Hadoop will also replicate it two or three times (you can configure that). Replication means that copies of the file are distributed to different servers.
This gives you superior fault tolerance. If one server goes down, then your data stays available on a different server.
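To make the replication idea concrete, here is a toy Python sketch (this is not HDFS code; the server and block names are made up) that places replicas of each block on distinct servers and checks that the data stays available when one server dies:

```python
import itertools

def place_replicas(blocks, servers, replication=3):
    """Assign each block to `replication` distinct servers, round-robin.
    A toy placement policy -- real HDFS also considers racks and load."""
    placement = {}
    ring = itertools.cycle(range(len(servers)))
    for block in blocks:
        start = next(ring)
        placement[block] = [servers[(start + i) % len(servers)]
                            for i in range(replication)]
    return placement

def available_after_failure(placement, dead_server):
    """A block survives as long as at least one replica is on a live server."""
    return all(any(s != dead_server for s in replicas)
               for replicas in placement.values())

placement = place_replicas(["blk_1", "blk_2", "blk_3"],
                           ["node-a", "node-b", "node-c", "node-d"])
print(available_after_failure(placement, "node-a"))  # True: replicas elsewhere
```

With a replication factor of one, losing the single server holding a block would make that block unavailable, which is exactly why the default factor is three.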
Another great thing about HDFS is that there is no limit on how big files can be. You can have server log files that are terabytes in size.
How can files get so big? HDFS allows you to append data to files. Therefore, you can continuously dump data into a single file without worries.
HDFS physically stores files differently than a normal file system: it splits the file into blocks.
These blocks are then distributed and replicated on the Hadoop cluster. The splitting happens automatically.
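A minimal Python sketch of the splitting idea, pretending a file is just a byte string (HDFS of course does this on disk, at a much larger scale):

```python
def split_into_blocks(data: bytes, block_size: int):
    """Split a byte string into fixed-size blocks, like HDFS splits a file.
    The last block may be smaller than block_size."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

blocks = split_into_blocks(b"x" * 300, block_size=128)
print([len(b) for b in blocks])  # [128, 128, 44]
```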
In the configuration you can define how big the blocks should be. 128 megabytes or 1 gigabyte?
No problem at all.
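If you run your own cluster, the default block size is set in hdfs-site.xml through the dfs.blocksize property; the value below is an example for 128 megabytes:

```xml
<!-- hdfs-site.xml fragment: cluster-wide default block size -->
<property>
  <name>dfs.blocksize</name>
  <value>134217728</value> <!-- 128 * 1024 * 1024 bytes; "128m" also works -->
</property>
```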
Why is it so great that HDFS is splitting large files into blocks and distributing them over the server collective?
The answer is simply analytics. Splitting files into blocks makes it easy to analyze them in a distributed fashion.
Let’s say you store a gigabyte-sized file in a normal file system. To analyze that file, the analysis process needs to read it sequentially.
Often the content gets transferred into memory for fast access and analytics.
With smaller files it is a simple and fast process; it works great. Problems arise when files get big, terabyte-sized.
A terabyte file will no longer fit into the RAM of your computer. And it takes forever to read the file, because you have to read it sequentially.
This is where MapReduce shines.
MapReduce is a framework for distributed data analysis. In conjunction with HDFS, MapReduce is able to analyze the blocks of a file in parallel.
Because the blocks in HDFS are distributed across the servers, the analysis works distributed as well. Blocks are read and analyzed in parallel instead of sequentially.
For MapReduce it makes almost no difference how big your file is.
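Here is a toy word count in plain Python that mimics the two phases. A real MapReduce job would run the map phase in parallel on the nodes holding the blocks, and its input splits would also handle records that cross block boundaries; this sketch only shows the logic:

```python
from collections import defaultdict

def map_phase(block):
    """Map: emit (word, 1) pairs for one block. Each block is processed
    independently, which is what allows a cluster to run maps in parallel."""
    return [(word, 1) for word in block.split()]

def reduce_phase(pairs):
    """Shuffle + reduce: group the pairs by key and sum the counts."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

blocks = ["big data big", "data big hadoop"]  # pretend these are HDFS blocks
mapped = [pair for block in blocks for pair in map_phase(block)]
print(reduce_phase(mapped))  # {'big': 3, 'data': 2, 'hadoop': 1}
```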
My article about distributed processing explains how MapReduce works exactly.
Storing data in HDFS and analyzing it with MapReduce takes immense resources. These resources need to be managed.
YARN (Yet Another Resource Negotiator) does that for you. YARN has two basic functions: resource management and job scheduling/management.
Resource management has authority over the actual physical resources like CPU, RAM, disk and network.
Job scheduling hands out these resources to the actual processes and monitors them. YARN makes it very efficient to run a Hadoop cluster.
Once configured to your specific needs, it works for you purely in the background :)
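As an illustration of that configuration, here are two yarn-site.xml properties that tell YARN how much of each node it may hand out to containers (the values are just examples):

```xml
<!-- yarn-site.xml fragment: per-node resources YARN may allocate -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>16384</value> <!-- example: 16 GB of RAM per node for containers -->
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>8</value> <!-- example: 8 virtual cores per node for containers -->
</property>
```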
The Ecosystem – why it’s so popular
Storing and analyzing data as large as you want is nice. But what makes Hadoop so popular?
Hadoop’s core functionality is the driver of Hadoop’s adoption. Many Apache side projects use its core functions.
Because of all those side projects Hadoop has turned more into an ecosystem. An ecosystem for storing and processing big data.
To better visualize this ecosystem I have drawn the following graphic. It shows some projects of the Hadoop ecosystem that are closely connected with Hadoop.
It is not a complete list. There are many more tools that even I don’t know. Maybe I will draw a complete map in the future.
How the ecosystem’s components work together
Remember my big data platform blueprint? The blueprint has four stages: ingest, store, analyze and display.
Because of the Hadoop ecosystem, the different tools in these stages can work together perfectly.
Here’s an example:
You use Apache Kafka to ingest data and store it in HDFS. You do the analytics with Apache Spark, and as a backend for the display you store data in Apache HBase.
To have a working system you also need YARN for resource management. You also need ZooKeeper, a coordination and configuration management service, to use Kafka and HBase.
As you can see in the picture below each project is closely connected to the other.
Spark, for instance, can directly access Kafka to consume messages. It is able to access HDFS to store or process data.
It also can write into HBase to push analytics results to the front end.
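To illustrate how the four blueprint stages chain together, here is a toy Python sketch in which plain functions and data structures stand in for Kafka, HDFS, Spark and HBase. No real cluster is involved; the event names are made up:

```python
def ingest():
    """Stand-in for Kafka: yield a stream of raw events."""
    yield from ["page_view user=1", "click user=2", "page_view user=1"]

def store(events, storage):
    """Stand-in for HDFS: append the raw events to durable storage."""
    storage.extend(events)
    return storage

def analyze(storage):
    """Stand-in for Spark: count the stored events per type."""
    counts = {}
    for event in storage:
        kind = event.split()[0]
        counts[kind] = counts.get(kind, 0) + 1
    return counts

def display_backend(results, table):
    """Stand-in for HBase: write results where a front end can read them."""
    table.update(results)
    return table

hdfs, hbase = [], {}
store(ingest(), hdfs)
print(display_backend(analyze(hdfs), hbase))  # {'page_view': 2, 'click': 1}
```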
The cool thing about such an ecosystem is that it is easy to add new functionality.
Want to store data from Kafka directly into HDFS without using Spark?
No problem, there is a project for that. Apache Flume has interfaces for Kafka and HDFS.
It can act as an agent that consumes messages from Kafka and stores them in HDFS. You do not even have to worry about much extra plumbing:
Flume integrates with the rest of the Hadoop stack out of the box.
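A minimal Flume agent configuration for this Kafka-to-HDFS path could look like the following sketch (the agent name, broker address, topic and HDFS path are made-up examples):

```properties
# Flume agent "a1": Kafka source -> memory channel -> HDFS sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.kafka.bootstrap.servers = broker1:9092
a1.sources.r1.kafka.topics = events
a1.sources.r1.channels = c1

a1.channels.c1.type = memory

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode/data/events/%Y-%m-%d
a1.sinks.k1.channel = c1
```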
Is Hadoop working everywhere?
Although Hadoop is so popular, it is not a silver bullet. It isn’t the tool that you should use for everything.
Oftentimes it does not make sense to deploy a Hadoop cluster, because it can be overkill. Hadoop does not run on a single server.
You basically need at least five servers, better six, to run a small cluster. Because of that, the initial platform costs are quite high.
One option you have is to use specialized systems like Cassandra, MongoDB or other NoSQL databases for storage. Or you move to Amazon and use Amazon’s Simple Storage Service, or S3.
S3 is not HDFS under the hood, but it can serve as a drop-in replacement for HDFS in most Hadoop tools. That’s why AWS also offers Elastic MapReduce, its managed Hadoop service.
The great thing about S3 is that you can start very small. When your system grows, you don’t have to worry about S3’s server scaling.
Should you learn Hadoop?
Yes, I definitely recommend you get to know how Hadoop works and how to use it. As I have shown you in this article, the ecosystem is quite large.
Many big data projects use Hadoop or can interface with it. That’s why it is generally a good idea to know as many big data technologies as possible.
Not in depth, but to the point that you know how they work and how you can use them. Your main goal should be to be able to hit the ground running when you join a big data project.
Plus, most of the technologies are open source. You can try them out for free.
Liked this post? Please share it with your peers! 🙂
Also, make sure not to miss anything new by subscribing to the newsletter. This way I will be able to send you an email when I publish a new post 🙂