Apache Spark: A Success Guarantee For Data Scientists


I know it is hard. As a data scientist there are so many things you could learn.

But what data science skills are companies actually looking for? What makes you a sought-after professional in the job market?

You definitely should learn how to use Apache Spark!

Why? Because you need to work together with another type of professional: the big data solution architect.

Together, the two of you can achieve great things by utilising the power of Spark.

What is a data scientist?

Data scientists aren’t like every other scientist.

Data scientists do not wear white coats or work in high tech labs full of science fiction movie equipment. They work in offices just like you and me.

What sets them apart from most of us is that they are math experts. They use linear algebra and multivariable calculus to create new insight from existing data.

What exactly does this insight look like?

Here’s an example:

An industrial company produces a lot of products that need to be tested before shipping.

Usually such tests take a lot of time because there are hundreds of things to check, all to make sure that the product is not broken.

Wouldn’t it be great to know early on that a test will fail ten steps down the line? If you knew that, you could skip the remaining tests and just scrap the product or repair it.

That’s exactly where a data scientist can help you, big-time. This field is called predictive analytics and the technique of choice is machine learning.

Machine what? Learning?

Yes, machine learning. It works like this:

  1. You feed an algorithm with measurement data.
  2. It generates a model and optimises it based on the data you fed it. That model basically represents the patterns in your data.
  3. You show the model new data, and it tells you whether that data still matches the data you trained it with.

This technique can also be used to predict machine failure in advance. Of course, the whole process is not that simple.

The actual process of training and applying a model is not that hard. A lot of the data scientist’s work is figuring out how to pre-process the data that gets fed to the algorithms.

Because to train an algorithm you need useful data. If you use just any data for training, the resulting model will be very unreliable.

An unreliable model for predicting machine failure would tell you that your machine is damaged even if it is not. Or even worse: it would tell you the machine is fine even when there is a malfunction.

Model outputs are very abstract. You also need to post-process them, for instance to turn them into health values from 0 to 100.
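To make that a bit more concrete, here is a toy sketch in Scala. The "model" in it is just a stand-in (a real one would come from a machine learning library, more on that below); the point is only to show where pre-processing and post-processing sit around a model, and how an abstract model output can be turned into a health value from 0 to 100.

```scala
// Toy sketch: pre-processing -> model -> post-processing.
// The "model" here is a stand-in, not a real machine learning model.
object HealthScoreSketch {

  // Pre-processing: clean the raw sensor readings and normalise them to 0..1.
  def preprocess(raw: Seq[Double]): Seq[Double] = {
    val cleaned = raw.filter(!_.isNaN)
    val max = cleaned.max
    if (max == 0) cleaned else cleaned.map(_ / max)
  }

  // Stand-in "model": pretend the mean of the features is a failure probability.
  def predictFailureProbability(features: Seq[Double]): Double =
    features.sum / features.size

  // Post-processing: turn the abstract model output into a health value 0..100.
  def healthValue(failureProbability: Double): Int =
    math.round((1.0 - failureProbability) * 100).toInt

  def main(args: Array[String]): Unit = {
    val rawMeasurements = Seq(0.2, 0.4, Double.NaN, 0.9) // made-up sensor data
    val features = preprocess(rawMeasurements)
    val p = predictFailureProbability(features)
    println(s"Health: ${healthValue(p)} / 100")
  }
}
```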

[Figure: Machine Learning Preprocessing]

Where to get more info about data science?

Want to know more about data science? I recommend checking out datasciencecentral.com. The data science articles there are great.

You should also definitely follow Kirk D. Borne on Twitter. He is the principal data scientist at Booz Allen Hamilton and his tweets are pure data science gold!

As for learning data science, check out Lilian Pierson and her site Data Mania. She has a very interesting data science blog and also offers online courses on data science.

3 reasons why data scientists love Spark

1. Complex analytics is possible

The problem with data analytics through MapReduce lies with iterative processes. That means processes where you have to reuse results multiple times.

To realise iterative processes you have to chain multiple MapReduce jobs together and carry the results over. The problem is that results cannot be carried over to the next job directly.

To get the results from one job to another you have to store them somewhere, for instance in HDFS or HBase. The stored results are then read by the next job in line to continue the calculations.

This is complicated to manage and very inefficient, because why store and reload the data when you already have access to it?

With Spark this is no longer an issue. Spark’s Resilient Distributed Datasets (RDD) are the key to iterative analytics processes.

The trick is that RDDs are immutable data sets. Once an RDD is created it cannot be changed.

This fact allows Spark to distribute RDDs to workers on different machines and allows the workers to run calculations on the RDD’s data in parallel.

The great thing is that RDDs stay in memory. When you need to reuse them in your job for iterative purposes they are still there. No need to reload the data.
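Here is a minimal sketch of what that looks like in Scala. The input path and the data format are made up for illustration; the point is that the RDD is cached once and then reused in every iteration instead of being written to and reloaded from HDFS.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IterativeSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-iteration-sketch")
    val sc = new SparkContext(conf)

    // Path and format are made up for illustration: one numeric value per line.
    val values = sc.textFile("hdfs:///data/measurements.txt")
      .map(_.toDouble)
      .cache()          // keep the RDD in memory after the first action

    // Iterative part: every pass reuses the cached RDD,
    // unlike chained MapReduce jobs that reload intermediate results.
    var threshold = 1.0
    for (i <- 1 to 10) {
      val count = values.filter(_ > threshold).count()
      println(s"Iteration $i: $count values above $threshold")
      threshold *= 2
    }

    sc.stop()
  }
}
```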

2. It’s easy to get into programming Spark

Spark jobs can be programmed in a variety of languages. That makes creating analytic processes very user-friendly for data scientists.

Spark supports Python, Scala and Java. With the help of SparkR you can even connect your R program to a Spark cluster.

If you are a data scientist who is very familiar with Python, just use Python. It's great. If you know how to code Java, I suggest you start using Scala.

Spark jobs are easier to code in Scala than in Java. In Scala you can use anonymous functions to do processing.

This results in less overhead and much cleaner, simpler code.

Java 8 introduced simplified function calls with lambda expressions. Still, a lot of people, including me, prefer Scala over Java.
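As a small illustration (assuming an existing SparkContext `sc` and a made-up input path), here is the classic word count written with anonymous functions in Scala:

```scala
// Assumes `sc` is an existing SparkContext; the input path is made up.
val counts = sc.textFile("hdfs:///data/articles.txt")
  .flatMap(line => line.split(" "))   // anonymous function: split lines into words
  .map(word => (word, 1))             // anonymous function: pair each word with 1
  .reduceByKey(_ + _)                 // anonymous function: sum the counts per word

counts.take(10).foreach(println)
```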

3. Machine learning is included

The machine learning library MLlib is included in Spark so there is often no need to import another library.

I have to admit because I am not a data scientist I am not an expert in machine learning.

From what I have seen and read, though, MLlib is a nice treat for data scientists who want to train and apply models with Spark.
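To give you an idea of what that looks like, here is a hedged sketch using MLlib's DataFrame-based API. The input path, the column names and the choice of logistic regression are assumptions for illustration, not a recipe:

```scala
// Sketch only: paths, column names and the algorithm are made-up examples.
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object MLlibSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("mllib-sketch").getOrCreate()

    // Historical measurement data with a label column:
    // 1.0 = product failed a later test, 0.0 = product passed.
    val measurements = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///data/test-measurements.csv")

    // Pre-processing: combine the raw measurement columns into one feature vector.
    val assembler = new VectorAssembler()
      .setInputCols(Array("temperature", "vibration", "pressure"))
      .setOutputCol("features")
    val training = assembler.transform(measurements)

    // Train the model on the historical data.
    val model = new LogisticRegression()
      .setLabelCol("failed")
      .setFeaturesCol("features")
      .fit(training)

    // Apply the model: the "prediction" column tells you whether a product
    // is likely to fail later tests.
    val predictions = model.transform(training)
    predictions.select("features", "prediction").show(5)

    spark.stop()
  }
}
```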

What is a big data solution architect?

Big data solution architects are the link between the management’s big data strategy and the data scientists that need to work with data.

What they do is build the platforms that enable data scientists to do their magic.

These platforms are usually used in four different ways:

  1. Data ingestion and storage of large amounts of data
  2. Algorithm creation by data scientists
  3. Automation of the data scientist’s machine learning models and algorithms for production use
  4. Data visualisation for employees and customers

Most of the time these guys start as traditional solution architects for systems that involve SQL databases, web servers, SAP installations and other “standard” systems.

But to create big data platforms the solution architect needs to be an expert in specifying, setting up and maintaining big data technologies like Hadoop, Spark, HBase, Cassandra, MongoDB, Kafka, Redis and more.

What they also need is experience in deploying systems on cloud infrastructure like Amazon's or Google's.

3 reasons why solution architects love Spark

1. The way you can set up a distributed Spark system

From a solution architect’s point of view Spark is a perfect fit for Hadoop big data platforms. This has a lot to do with cluster deployment and management.

Companies like Cloudera, MapR or Hortonworks include Spark in their Hadoop distributions. Because of that, Spark can be deployed and managed through the cluster's Hadoop management web frontend.

This makes deploying and configuring a Spark cluster very quick and admin-friendly.

2. How you can manage Spark resources

When running a computing framework you need resources for computation: CPU time, RAM, I/O and so on. Out of the box, Spark can manage resources with its standalone resource manager.

If Spark is running in a Hadoop environment, you don't have to use Spark's own standalone resource manager. You can configure Spark to use Hadoop's YARN resource management instead.

Why would you do that?
It allows YARN to efficiently allocate resources to your Hadoop and Spark processes.

Having a single resource manager instead of two independent ones makes it a lot easier to configure the resource management.
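As a rough sketch (the resource numbers are made-up examples and it assumes HADOOP_CONF_DIR points to your cluster's Hadoop configuration), telling Spark to use YARN can be as simple as this:

```scala
// Sketch: letting YARN manage Spark's resources instead of the standalone manager.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("yarn-managed-job")
  .master("yarn")                               // ask YARN for the resources
  .config("spark.executor.instances", "4")      // example values only
  .config("spark.executor.memory", "4g")
  .config("spark.executor.cores", "2")
  .getOrCreate()

// In practice the same settings are often passed on the command line instead:
//   spark-submit --master yarn --num-executors 4 --executor-memory 4g ...
```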

[Figure: YARN Cluster Resource Management]

3. Spark’s ability to access data

Another thing is data locality. In my previous posts I always made the point that processing data locally where it is stored is the most efficient thing to do.

That’s exactly what Spark is doing. You can and should run Spark workers directly on the data nodes of your Hadoop cluster.

[Figure: Spark Data Locality]

Spark can then natively identify which data node stores the needed data. This enables Spark to use the worker running on that machine to load the data into memory.

The downside of this setup is that you need more expensive servers, because Spark processing needs stronger servers with more RAM and CPUs than a "pure" Hadoop setup.

Who companies really need

For a good company it is absolutely important to get well-trained solution architects and data scientists.

Think of the data scientist as the professional race car driver. A fit athlete with talent and driving skills like you have never seen.

What he needs to win races is someone who will provide him with the perfect race car to drive. That's what the solution architect is for.

Like the driver and his team, the data scientist and the solution architect need to work closely together. They need to know the different big data tools inside and out.

That's why companies are looking for people with Spark experience. It is a common ground between data scientists and solution architects that drives innovation.

Spark gives data scientists the tools to do analytics and helps solution architects to bring the data scientist’s algorithms into production.

After all, those two decide how good the data platform is, how good the analytics insight is and how fast the whole system gets into a production ready state.

Next time:

Next time we will look into stream and batch processing.

It will help you figure out when you need batching or streaming and what kind of frameworks you can use to implement it.

We will also look into why streaming and batching are so popular at Twitter, Netflix and IBM (IoT example).

Jump directly to the post: How to Create New and Exciting Big Data Aided Products

To make sure you don't miss any of my new posts, I suggest you subscribe to my newsletter. Just put in your email address right here and hit subscribe.

This way I will be able to send you an email when I have uploaded the next post.

You know what would also be super awesome? If you shared this post with your friends on LinkedIn, Facebook or Twitter.

Thanks a lot!

Have a great week! Until next time,

Andreas
