The Brutally Honest Truth About Learning Big Data the Right Way

Andreas Kretz


People try to teach you Big Data by focusing on the technology: Hadoop, Spark and whatnot. The truth is that this strategy is wrong and a waste of your time.

This has to stop, and it stops today!

In this first post I am going to start with the exact opposite of the usual approach: not focusing on technology at all. Because the most important thing you need to understand first is: what problems does Big Data help to solve?

My Motivation for This Awesome Blog Series

My motivation for this series is simple. Every day I stumble upon posts on Twitter, Reddit or LinkedIn where people ask:

“I want to learn about Big Data as an absolute beginner but where do I start?”

Nobody is pointing out that to start with Big Data you need to figure out the why.

Why Big Data?

What is so wrong with the traditional way of storing and analysing data?

So, I am taking matters into my own hands and helping you kickstart your Big Data journey to success.

When you are finished with this one I recommend you read the following five parts:

  1. Mastering Big Data With Distributed Processing (MapReduce)
  2. How Everybody Can Harvest The Power Of Data Mining (Spark)
  3. What Is Stream And Batch Processing?
  4. The Perfect Big Data Platform
  5. What Is Hadoop And Why Is It So Freakishly Popular?

The Hockey Stick Curve of Catastrophic Success

The first thing I want to point out is the concept of exponential growth.

Exponential growth can be described by a function of the form f(x) = 2^x.

What that means is that the value basically doubles with every step. At first it looks like nothing is happening, and then suddenly it goes through the roof. Just look at it, what a beauty!

Exponential Growth
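To make the curve concrete, here is a minimal Python sketch with made-up numbers: a platform whose daily data volume doubles every month, starting from a harmless 1 GB per day.

```python
# Illustrative numbers only: a platform whose daily data volume
# doubles every month, starting from 1 GB per day.
daily_gb = 1.0

for month in range(1, 13):
    daily_gb *= 2  # exponential growth: double with every step
    print(f"Month {month:2d}: {daily_gb:6.0f} GB per day")

# Month  1:      2 GB per day
# ...
# Month 12:   4096 GB per day  (about 4 TB, every single day)
```

For eleven months the numbers look manageable, and then suddenly they do not. That is the hockey stick.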

This curve is what every investor is looking for. It’s how very successful startups thrive. This is where the big money is at.

Doubling your user count every month or doubling the IoT devices connected to your platform is a great thing. But this kind of catastrophic success will create some problems.

Let’s stay with the IoT example. Exponential growth of the data to be stored and processed means that your IT infrastructure also has to be capable of handling it. And this is where the problems start.

The Ultra Problematic Extract Transform Load (ETL) Process

A typical old-school platform deployment would look like the picture below. Devices use a data API to upload data, which gets stored in a SQL database. An external analytics tool queries the data and uploads the results back to the SQL database. Users then use the front end to display data stored in the database.

SQL IoT Infrastructure

Now, when the front end queries data from the SQL database, the following three steps happen (sketched in code below):

  1. The database extracts all the needed rows from storage
  2. The extracted data gets transformed, for instance sorted by timestamp, or something a lot more complex
  3. The extracted and transformed data gets loaded to the destination (the user interface) for chart creation
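As a minimal sketch of these three steps, assuming a simple readings table (the table, column names and time filter are made up for illustration, and sqlite3 merely stands in for the production SQL database):

```python
import sqlite3

conn = sqlite3.connect("iot.db")  # stands in for the production SQL database

# 1. Extract: the database reads all needed rows from storage.
rows = conn.execute(
    "SELECT device_id, ts, value FROM readings WHERE ts >= ?",
    ("2016-06-01",),
).fetchall()

# 2. Transform: sort by timestamp (real transforms are often far more complex).
rows.sort(key=lambda row: row[1])

# 3. Load: hand the result to the destination, here the charting front end.
chart_data = [{"device": d, "ts": ts, "value": v} for d, ts, v in rows]
```

Note that the extract step pulls every matching row through a single database server, and that is exactly where things start to break down.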

With exploding amounts of stored data, the ETL process starts to become a real problem.

Analytics works with large data sets, for instance whole days, weeks, months or more. These data sets quickly reach 100 GB or even terabytes, which means billions or trillions of rows.

As a result, the ETL process for large data sets takes longer and longer. Very quickly the ETL performance gets so bad that it cannot deliver results to analytics in time anymore.
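A back-of-the-envelope calculation shows why. Even if we ignore transform and network time completely and assume an optimistic sequential read throughput of 200 MB/s (an assumed number, real hardware varies), just extracting the data takes:

```python
# Time to merely read a data set from disk at an assumed 200 MB/s.
throughput_mb_s = 200

for size_gb in (100, 1_000, 10_000):  # 100 GB, 1 TB, 10 TB
    seconds = size_gb * 1024 / throughput_mb_s
    print(f"{size_gb:6,d} GB -> {seconds / 3600:5.1f} hours just to extract")

# 100 GB -> ~0.1 hours, 1 TB -> ~1.4 hours, 10 TB -> ~14.2 hours
```

At terabyte scale you spend hours on the extract step alone, before a single transformation has run.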

A traditional solution to overcome these performance issues is to increase the performance of the database server. That is what’s called scaling up.

Scaling Up the System – The Easy Way Out

To scale up the system and thereby increase ETL speed, administrators resort to more powerful hardware by:

  • Speeding up the extract performance by adding faster disks that physically read the data faster
  • Increasing RAM for row caching; what is already in memory does not have to be read from slow disk drives
  • Using more powerful CPUs for better transform performance (more RAM helps here as well)
  • Increasing or optimising network performance for faster data delivery to the front end and analytics

Scaling up the system is fairly easy.

But with exponential growth it is obvious that sooner or later (and rather sooner than later) you will run into the same problems again. At some point you simply cannot scale up anymore, because you already have a monster system or you cannot afford to buy even more expensive hardware.

The next step you could take would be scaling out.

Scaling Out the System – The Complicated Way Out

Scaling out is the opposite of scaling up. Instead of building bigger systems the goal is to distribute the load between many smaller systems.

The simplest way of scaling out an SQL database is to use a storage area network (SAN) to store the data. You can then attach up to eight SQL servers to the SAN and let them handle the queries. This way the load gets distributed between those eight servers.

Scaling Out the SQL Database

One major downside of this setup is that, because the storage is shared between the SQL servers, the database can only be used read-only. Updates have to be done periodically, for instance once a day. To do an update, all SQL servers have to detach from the database. Then one of them attaches the database in read-write mode and refreshes the data. This procedure can take a while if a lot of data needs to be loaded.
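To illustrate why this is operationally painful, here is hypothetical orchestration pseudocode for that daily refresh. Every function name here is made up, since the actual mechanics depend on the database product:

```python
# Hypothetical orchestration of the periodic refresh of a shared
# read-only database on a SAN. None of these calls are a real API.

def refresh_shared_database(sql_servers, san_database, new_data):
    # 1. Every query server must detach from the shared database first,
    #    so no queries can be answered during the refresh.
    for server in sql_servers:
        server.detach_database(san_database)

    # 2. One server re-attaches in read-write mode and loads the new data.
    writer = sql_servers[0]
    writer.attach_database(san_database, read_only=False)
    writer.bulk_load(new_data)  # can take hours for a large day's data
    writer.detach_database(san_database)

    # 3. All servers re-attach read-only and start serving queries again.
    for server in sql_servers:
        server.attach_database(san_database, read_only=True)
```

During the whole refresh window the cluster cannot answer a single query, which is why such setups are usually updated only once a day.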

This link to a Microsoft MSDN page shows more options for scaling out an SQL database.

I deliberately don’t want to get into the details of possible scale-out solutions. The point I am trying to make is that, while it is possible to scale out SQL databases, it is very complicated.

There is no perfect solution. Every option has its up- and downsides. One common major issue is the administrative effort needed to implement and maintain a scaled-out solution.

The Sad Conclusion

Let’s recap what we have learned today.

With catastrophic success in the form of exponential growth some major problems of traditional SQL databases emerge.

The ETL process hinders analytics from doing its job, because the database cannot deliver data fast enough.

Traditional scaling methods for relational databases, like scaling up or scaling out, can hardly keep up with exponential growth. Every solution comes with certain tradeoffs that need to be considered, and keeping such a system running can get very complicated and time-consuming.

You have to ask yourself: is it really worth spending all that money on a solution that might turn out not to perform that well in the end? Probably not 🙁

The Big Data Way

So, what is the Big Data way of dealing with the ETL problem? Well, you can go straight to my next post for that:
http://iotdonequick.com/2016/06/24/mastering-big-data-with-distributed-processing/

In it, we will go over the details of how the SQL ETL problem is solved by the concept of distributed processing. I will talk about what distributed processing is and why it is the foundation of Big Data systems like Hadoop.

To make sure you don’t miss my new posts, all you have to do is subscribe to my newsletter. Just put in your e-mail address right here and hit subscribe.

This way I will be able to send you an e-mail when I have uploaded the next article in this series.
You can also follow me on Twitter, where I share stuff I find throughout the day.

Also please share this article on social media with your friends and professional network. That would be super awesome!

See you next time!
Andreas

Comments

  1. Why do you say that Google invented Hadoop? Google released a white paper which sparked the creation of MapReduce, and HDFS had already been designed by Doug Cutting and his Yahoo colleagues. Even the name ‘Hadoop’ itself comes from Doug’s son’s toy elephant, named Hadoop. 😃

    Author’s reply:

      Yes, you are totally right. Google did not invent Hadoop.
      What I should have written is why Google had such an interest in Hadoop and MapReduce.

      Thanks a lot for pointing that out!

  2. It’s a great article. I too am starting out on Big Data Technologies. Would you be so kind as to be my mentor for this technology journey of mine? Thanks in advance!

    Author’s reply:

      Hi Rahman, to be honest I currently don’t have the time to do individual mentoring.
      Besides my full-time job and, well, life, writing this blog takes all of my spare time.

      However, if you have any specific questions please do write me an e-mail to andreas@iotdonequick.com
      I will get back to you as quickly as possible.

  3. Thanks for sharing. But there is nothing wrong with learning Spark, because the ideas in this post are covered by most Big Data MOOCs. They know they should talk about what Big Data is before teaching you the techniques. Check out the Spark XSeries on edX, for example.

  4. I agree that a database has its own scalability limits and that scaling out a database can be very expensive. Distributed caching as a middle tier between ETL and the database is a really scalable solution, depending on which distributed caching product you choose. If it doesn’t provide data reliability, failover and object/data query support, then it is useless. The distributed cache should also be kept in sync with the database, because other apps might change the data, and then all analytics would be wrong.
