
The Perfect Big Data Platform — My Blueprint



Some time ago I created a simple and modular big data platform blueprint for myself. It is based on what I have seen in the field and read in tech blogs all over the internet.

Today I am going to share it with you.

Why do I believe it will be super useful to you?

Because, unlike other blueprints, it is not focused on technology. It is based on four common big data platform design patterns.

Following my blueprint will allow you to create the big data platform that fits your needs exactly. Building the perfect platform will allow data scientists to discover new insights.

It will enable you to handle big data properly and allow you to make data-driven decisions.

The Blueprint

The blueprint is focused on four key areas: ingest, store, analyse and display.

Ingest, Store, Analyse and Display

Splitting the platform like this turns it into a modular platform with loosely coupled interfaces.

Why is it so important to have a modular platform?

If you have a platform that is not modular, you end up with something that is fixed and hard to modify. This means you cannot adjust the platform to the changing requirements of the company.

Because of modularity, it is possible to switch out any component if you need to.

Now, let's talk more about each key area.

Ingest

Ingestion is all about getting the data in from the source and making it available to later stages. Sources can be everything from tweets and server logs to IoT sensor data, for example from cars.

Sources send data to your API services. The API then pushes the data into temporary storage.

The temporary storage gives other stages simple and fast access to incoming data.

A great solution is to use message queue systems like Apache Kafka, RabbitMQ or AWS Kinesis. Sometimes people also use caches like Redis for specialised applications.

A good practice is that the temporary storage follows the publish/subscribe pattern. This way APIs can publish messages and analytics can quickly consume them.
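
To make the idea concrete, here is a minimal sketch of the ingest side, assuming a local Apache Kafka broker and the kafka-python package; the topic name "raw-events" and the sample event are made up for illustration.

```python
# Minimal ingest sketch (assumptions: local Kafka broker on localhost:9092,
# kafka-python installed, "raw-events" is an illustrative topic name).
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# The API service publishes every incoming event to the temporary storage.
event = {"source": "iot-sensor", "temperature": 21.4, "ts": "2017-01-01T12:00:00Z"}
producer.send("raw-events", event)
producer.flush()
```

Any stage that subscribes to the "raw-events" topic can now consume the data without knowing anything about the API service that produced it.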

Store

This is the typical big data storage where you just store everything. It enables you to analyse the big picture.

Most of the data might seem useless for now, but it is of utmost importance to keep it. Throwing data away is a big no-no.

Why not throw something away when it is useless?

Although it seems useless for now, data scientists can still work with the data. They might find new ways to analyse it and generate valuable insight from it.

What kind of systems can be used to store big data?

Systems like Hadoop HDFS, HBase, Amazon S3 or DynamoDB are a perfect fit for storing big data.
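
As a small illustration, here is a sketch of keeping a raw event in Amazon S3 with boto3; the bucket name "my-datalake" and the key layout are assumptions, not recommendations.

```python
# Minimal store sketch (assumptions: AWS credentials are configured,
# a bucket called "my-datalake" exists, the key layout is illustrative).
import json
import boto3

s3 = boto3.client("s3")

event = {"source": "iot-sensor", "temperature": 21.4, "ts": "2017-01-01T12:00:00Z"}

# Store the raw event unchanged; partitioning the key by source and date
# lets later batch jobs read only the slices they need.
s3.put_object(
    Bucket="my-datalake",
    Key="raw/iot-sensor/2017-01-01/event-0001.json",
    Body=json.dumps(event).encode("utf-8"),
)
```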

Analyse

The analyse stage is where the actual analytics is done, in the form of stream and batch processing.

Streaming data is taken from ingest and fed into analytics. Streaming analyses the “live” data and thus generates fast results.
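
Here is a minimal sketch of the streaming side, assuming the "raw-events" Kafka topic from the ingest example above; the per-source counter is just a stand-in for a real streaming job.

```python
# Minimal streaming sketch (assumption: same "raw-events" topic and local
# Kafka broker as in the ingest example).
import json
from collections import Counter
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "raw-events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

counts = Counter()
for message in consumer:
    event = message.value
    counts[event["source"]] += 1   # a trivial "live" metric
    print(dict(counts))            # fast results, available immediately
```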

As the central and most important stage, analytics also has access to the big data storage. Because of that connection, analytics can take a big chunk of data and analyse it.

This type of analysis is called batch processing. It will deliver answers to the big questions.

To learn more about stream and batch processing, read my blog post: How to Create New and Exciting Big Data Aided Products

The analytics process, batch or streaming, is not a one-way process. Analytics can also write data back to the big data storage.

Oftentimes, writing data back to the storage makes sense. It allows you to combine previous analytics outputs with the raw data.

Analytics insight can give meaning to the raw data when you combine them. This combination will often allow you to create even more useful insight.

A wide variety of analytics tools is available, ranging from MapReduce and AWS Elastic MapReduce to Apache Spark and AWS Lambda.
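
As an example of batch processing, here is a sketch with Apache Spark (PySpark), assuming the raw JSON events from the store example live under s3a://my-datalake/raw/; paths and column names are illustrative.

```python
# Minimal batch sketch with PySpark (assumptions: Spark is configured with
# S3 access, paths and column names are illustrative).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-averages").getOrCreate()

# Take a big chunk of raw data from the big data storage.
raw = spark.read.json("s3a://my-datalake/raw/iot-sensor/")

# Batch analysis: average temperature per source and day.
result = (
    raw.withColumn("day", F.to_date("ts"))
       .groupBy("source", "day")
       .agg(F.avg("temperature").alias("avg_temperature"))
)

# Write the analytics output back to the big data storage so it can later
# be combined with the raw data.
result.write.mode("overwrite").parquet("s3a://my-datalake/analytics/daily-averages/")
```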

Display

Displaying data is as important as ingesting, storing and analysing it. People need to be able to make data-driven decisions.

This is why it is important to have a good visual presentation of the data. Sometimes you have a lot of different use cases or projects using the platform.

It might not be possible for you to build the perfect UI that fits everyone. What you should do in this case is enable others to build the perfect UI themselves.

How to do that? By creating APIs to access the data and making them available to developers.

Either way, UI or API, the trick is to give the display stage direct access to the data in the big data cluster. This kind of access allows developers to use analytics results as well as raw data to build the perfect application.
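
A minimal sketch of such an API, assuming the Parquet output from the batch example above; Flask and pandas are used only for illustration, and the endpoint name is made up.

```python
# Minimal display-API sketch (assumptions: Flask and pandas with pyarrow
# installed, Parquet output from the batch example available at a local path).
import pandas as pd
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/api/daily-averages/<source>")
def daily_averages(source):
    # In a real platform this would query the cluster directly; here we
    # simply read the analytics output.
    df = pd.read_parquet("analytics/daily-averages/")
    rows = df[df["source"] == source]
    return jsonify(rows.to_dict(orient="records"))

if __name__ == "__main__":
    app.run(port=5000)
```

Developers can then build whatever UI fits their project on top of an endpoint like this, without needing direct access to the cluster.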

What kind of systems can you use and where?

Here is how the blueprint looks if you replace the symbols with software:

Unfortunately, it would be too much to explain every single software platform in this post. That is why, in the following weeks, I will go through each of them with you.

Starting next week, I will show you what Hadoop is and what makes it so popular.

Jump directly to the post: What Is Hadoop And Why Is It So Freakishly Popular?

Make sure not to miss it by subscribing to the newsletter. This way I will be able to send you an e-mail when I publish the post 🙂

 

In the meantime, I recommend you read about how stream and batch processing work.

Do you have some comments or questions about this article? Please drop a line in the comment section to get in contact with me and the community.

Andreas

Comments

  1. I like the modular design and I agree that it is very important. I was curious how this was different from a Lambda architecture with a BI data connector? I could see using MapR’s Lambda architecture and Progress Software’s Spark ODBC driver to Tableau as being the same with fewer moving parts.

    Regardless I am a fan of both Lambda and your design. I think it can solve a lot of problems.

    1. Author's reply:

      Hi Julien, thanks for the comment.

      Well, actually my blueprint incorporates the concept of the Lambda architecture, namely stream and batch processing.
      I think your suggested solution would also fit into my blueprint with MapR’s tools as storage, Spark for analytics and Tableau for display.

      I like the underlying question in your comment: Do I need to use all the blocks of the blueprint?
      No. For instance, if ingestion is already done some other way (SAP?), then you can leave out the ingestion stage.

      I am not 100% sure what parts of the MapR lambda architecture you would deploy. Can you elaborate a bit further?

      Andreas

  2. Very well written and the concepts are explained with examples. I liked the way you described the modular aspects of a big data platform.

  3. Hi Andreas,
    The design is simple and easy to grasp. Where does security fit in? If a company invests in big data there is likely to be more than one user displaying the analytics results. What if there is some sensitive ingested data that some users should not be able to access?

    1. Author's reply:

      Hello Barbara, I left out security because it depends in part on the exact technologies you use.
      Generally speaking, you have three options:

      1 – Separating the data in the storage, for example into different folders in HDFS or different tables of a NoSQL database.
      2 – Having an analytics tool or front end that permits users to access only some of the data.
      3 – I have also seen that HP has a proprietary method of encrypting part of the data stored in HDFS. This data can then only be decrypted in Hive through the user's key. Keys are managed in a keystore for easy and safe key management.

      What do you think?
