Apache Spark – Introduction

In plain terms, Spark is an open source cluster computing framework used to solve big data problems. Spark distributes the data across the nodes of a cluster, processes each portion on the node that holds it, and then sends the consolidated result back to whoever submitted the Spark job. If someone asks you what Spark is, this explanation is good enough.
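
To make the distribute, process locally, consolidate idea concrete, here is a toy sketch in plain Java. It does not use Spark at all: threads stand in for cluster nodes, and the class and method names (DistributeAndConsolidate, partition, distributedSum) are made up for this illustration. Real Spark also handles data placement, failures and much more, so treat this only as a mental model.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Toy model only: threads stand in for cluster nodes. No Spark API is used here.
public class DistributeAndConsolidate {

    // Split the data into roughly equal partitions, one per simulated node.
    static List<List<Integer>> partition(List<Integer> data, int nodes) {
        List<List<Integer>> parts = new ArrayList<>();
        int size = (data.size() + nodes - 1) / nodes; // ceiling division
        for (int start = 0; start < data.size(); start += size) {
            parts.add(data.subList(start, Math.min(start + size, data.size())));
        }
        return parts;
    }

    // Each "node" sums only its own partition; the caller merges the partial sums.
    static long distributedSum(List<Integer> data, int nodes) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(nodes);
        try {
            List<Future<Long>> partials = new ArrayList<>();
            for (List<Integer> part : partition(data, nodes)) {
                partials.add(pool.submit(() -> {
                    long sum = 0;
                    for (int value : part) {
                        sum += value;
                    }
                    return sum;
                }));
            }
            // Consolidation step: combine the per-node results into one answer.
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<Integer> data = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            data.add(i);
        }
        System.out.println(distributedSum(data, 3)); // prints 55
    }
}
```

With 10 numbers and 3 simulated nodes, the partitions hold 4, 4 and 2 elements, and only the three partial sums travel back to the caller.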

Big data has gained a lot of traction over the last decade or so, as internet users continuously create huge amounts of data that older frameworks were not capable of processing. Data at that scale requires special handling, which was initially provided by Hadoop. What Spark offers over Hadoop is speed: in most cases Spark performs better than Hadoop MapReduce. Spark keeps intermediate data in memory wherever it can, whereas MapReduce writes intermediate results to disk between steps. By the Spark project's own figures, Spark runs up to 100 times faster than Hadoop MapReduce when the data fits in memory, and up to 10 times faster when it has to go to disk.

Spark may look complex and difficult to follow at first. However, Spark abstracts away most of that complexity, and it is easy to start coding with it. If you know the basics of Java, Scala, Python or R, you can write a Spark job. In Java, we write the job as a regular program with a main method and submit it to a Spark cluster.

A typical Spark cluster is laid out as follows. Suppose we have a Spark cluster of 3 nodes. One of the nodes becomes the master node and the rest become worker nodes. Spark ships with a standalone cluster manager, which runs on the master and allocates resources on the workers to your program. A Spark job is nothing but a main program bundled in a jar. When the job is submitted to the cluster, the process that runs its main method is called the driver program, and it holds the instance of the Spark context. The actual data processing happens on the worker nodes. Spark can work with HDFS, Hive, Cassandra and HBase as its storage. We will see why storage is needed in future posts.
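
As a rough illustration of the main program and Spark context described above, here is a minimal sketch of a Spark job in Java. It assumes the spark-core library is on the classpath. The class name SumOfSquaresJob, the application name and the sample numbers are made up for this example, and the fallback to local mode is only there so the sketch can be tried without a cluster.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Hypothetical example job: squares each number and adds the squares together.
public class SumOfSquaresJob {

    // parallelize() spreads the list across the cluster as an RDD;
    // map() and reduce() then run on the worker nodes, and only the
    // final number comes back to the driver.
    static int sumOfSquares(JavaSparkContext sc, List<Integer> numbers) {
        JavaRDD<Integer> rdd = sc.parallelize(numbers);
        JavaRDD<Integer> squares = rdd.map(x -> x * x);
        return squares.reduce((a, b) -> a + b);
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("sum-of-squares")               // assumed name, pick your own
                .setIfMissing("spark.master", "local[*]");  // local mode if no master was supplied

        // The driver program creates and owns the Spark context.
        JavaSparkContext sc = new JavaSparkContext(conf);
        try {
            // 1 + 4 + 9 + 16
            System.out.println(sumOfSquares(sc, Arrays.asList(1, 2, 3, 4)));
        } finally {
            sc.stop();
        }
    }
}
```

Once bundled into a jar, a program like this is typically handed to the cluster with Spark's spark-submit script, passing the main class and the master URL on the command line.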

This was a short introduction to Apache Spark. In future posts I will write more about the Spark architecture, its core components, and running your first Spark application. Let me know what you would like to see first.

Also, please comment or like if you found the explanation useful. Thank you for taking the time to read this post.