
Apache Spark – Beginner Example

Let's look at a simple Spark job.

We are going to look at a basic Spark example. By the end of the tutorial, you will know:

  1. A basic Spark project structure
  2. The bare minimum libraries required to run a Spark application
  3. How to run a Spark application locally

Git Repo – https://github.com/letsblogcontent/SparkSimpleExample

Pre-requisites
1. Java
2. Maven
3. IntelliJ IDEA or STS (optional but recommended)

Follow the steps below to complete your first Spark application:

  1. Create a new Maven project without any specific archetype. I am using the IntelliJ IDEA editor, but you may choose any other suitable editor. I have created a project named “SparkExample”.
    • Navigate to File-> New Project
    • Select Maven from Left Panel
    • Do not select any archetype
    • Click on “Next”
    • Name the project “SparkExample”
    • Click on “Finish”
      This should create a new Maven project like the one below.
      (screenshot: project structure)
  2. Next, we update the pom.xml with the spark-core dependency as shown below.
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>SparkExamples</artifactId>
    <version>1.0-SNAPSHOT</version>
    <properties>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
            <version>2.0.0</version>
        </dependency>

    </dependencies>

</project>

3. Now we create a new class “WordCount” in the “com.examples” package and copy the contents below into it.

package com.examples;


import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;
import java.util.Map;

public class WordCount {

    public static void main(String[] args) throws Exception {

        SparkConf conf = new SparkConf().setAppName("wordCounts").setMaster("local[3]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("in/word_count.txt");
        JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());

        Map<String, Long> wordCounts = words.countByValue();

        for (Map.Entry<String, Long> entry : wordCounts.entrySet()) {
            System.out.println(entry.getKey() + " : " + entry.getValue());
        }
    }
}

4. Create a new directory “in” at the project root and add a plain-text file named “word_count.txt” to it. In this example we are going to read this file and use Spark to count the occurrences of each word in it.
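If you do not have a sample text file handy, the optional snippet below is one quick way to generate one. The sentences in it are placeholder text of my own; they are not the file that produced the output shown later.

package com.examples;

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

public class CreateSampleFile {

    public static void main(String[] args) throws Exception {
        // Create the "in" directory if it does not exist, then write a few sample lines
        Files.createDirectories(Paths.get("in"));
        Files.write(Paths.get("in/word_count.txt"), Arrays.asList(
                "spark makes distributed processing simple",
                "spark reads the file and counts each word",
                "each word is counted by spark"));
    }
}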

5. Now we have to build our project. Since we are using Maven, we can run “mvn clean install” from the command prompt, or we can use Rebuild Project in IntelliJ; both work. Once the build is done, we can run our application: right-click the WordCount class and run “WordCount.main()”.

6. This should fire up a standalone Spark application and run our “WordCount” job, which counts the occurrences of each word in the file “word_count.txt”. The output should look like the following:

Twenties, : 1
 II : 2
 industries. : 1
 economy : 1
  : 7
 ties : 2
 buildings : 1
 for : 3
 eleventh : 1
 ultimately : 1
 support : 1
 channels : 1
 Thereafter, : 1
 subsequent : 1
.....
..

7. Now that we have successfully run the program, let's learn what really happened.
The line below configures the name of our Spark application and sets the master to local, which tells Spark to run the application locally using 3 threads (cores).

SparkConf conf = new SparkConf().setAppName("wordCounts").setMaster("local[3]");
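As a side note, “local[*]” is another commonly used master setting; it tells Spark to use all available cores on the machine instead of exactly 3:

SparkConf conf = new SparkConf().setAppName("wordCounts").setMaster("local[*]"); // use all available cores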

8. The next line initializes the Spark context.

 JavaSparkContext sc = new JavaSparkContext(conf);

9. The line below reads the file and converts it into what is called a Resilient Distributed Dataset (RDD). Spark splits the file into partitions that can be processed in parallel by the 3 local threads, and returns a single reference to the RDD for further manipulation.

JavaRDD<String> lines = sc.textFile("in/word_count.txt");
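If you want to influence the parallelism explicitly, textFile also accepts a minimum number of partitions as an optional second argument; for example:

JavaRDD<String> lines = sc.textFile("in/word_count.txt", 3); // ask Spark for at least 3 partitions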

10. The rest of the code is self-explanatory. The RDD API provides a number of operations like the one we have used here: countByValue, as the name suggests, counts the occurrences of each value in our text file, and when we print the entries of the resulting map we get a consolidated view of the aggregation. The same result can also be computed with a pair RDD, as sketched below.
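For completeness, here is a sketch of the same word count implemented with mapToPair and reduceByKey instead of countByValue. The class name WordCountReduceByKey is just a placeholder; everything else uses the same project setup as above.

package com.examples;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;

public class WordCountReduceByKey {

    public static void main(String[] args) throws Exception {

        SparkConf conf = new SparkConf().setAppName("wordCountsReduceByKey").setMaster("local[3]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("in/word_count.txt");
        JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());

        // Pair each word with a count of 1, then sum the counts per word
        JavaPairRDD<String, Integer> counts = words
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);

        // collect() brings the results back to the driver so we can print them
        for (Tuple2<String, Integer> entry : counts.collect()) {
            System.out.println(entry._1() + " : " + entry._2());
        }

        sc.close();
    }
}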

As you can see, writing a Spark application is really easy: a single class is enough to get started. Please comment if you faced any issues following this tutorial, or like this post if you would like to see more.


Apache Spark – Introduction

In plain computing language, Spark is an open-source cluster computing framework used to solve big data problems. Spark distributes data across the cluster nodes, processes the distributed data locally on each node, and then sends the consolidated response back to the requester of the Spark job. If someone asks you about Spark, this explanation is good enough.

Big data has gained a lot of traction in the last decade or so, as internet users continuously create huge amounts of data that our older frameworks were not capable of processing. Such volumes of data require special handling, which was initially provided by Hadoop. What Spark provides over Hadoop is speed: in most cases Spark will perform better than Hadoop, because Spark does its processing in memory whereas Hadoop writes intermediate results to disk. With in-memory processing, Spark performs up to 100 times faster than Hadoop, and up to 10 times faster when Spark itself writes to disk.

Spark may appear complex and difficult to follow and understand; however, most of the complexity is abstracted away by Spark, and it is extremely easy to start coding with it. If you know the basics of Java, Scala, Python or R, you can easily write a Spark job. In Java, we write everything inside a main program and submit it to a Spark cluster, as sketched below.
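A minimal sketch of such a main program, assuming it will be packaged as a jar and submitted with spark-submit (the class name MySparkJob is just a placeholder):

package com.examples;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class MySparkJob {

    public static void main(String[] args) {
        // No setMaster() here: when the jar is submitted with spark-submit,
        // the master (local[*], spark://host:7077, yarn, ...) is supplied via the --master option
        SparkConf conf = new SparkConf().setAppName("mySparkJob");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // ... transformations and actions go here ...

        sc.close();
    }
}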

A Spark cluster typically looks like this: suppose we have a Spark cluster of 3 nodes. One of the nodes becomes the master node and the rest become worker nodes. Spark ships with a standalone cluster manager which drives your program across the cluster and acts as the master. A Spark job (which is nothing but a main program bundled in a jar) is submitted to the cluster; the node from which we submit the job runs the driver program and holds the instance of the Spark context. All the processing happens on the worker nodes. Spark can work with HDFS, Hive, Cassandra, or HBase as its storage; we will see why storage is needed in future posts. For example, reading input from HDFS instead of the local file system looks like the line sketched below.
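A one-line sketch, assuming a hypothetical HDFS namenode (replace the host, port and path with your cluster's values):

JavaRDD<String> lines = sc.textFile("hdfs://namenode:8020/data/word_count.txt"); // read from HDFS instead of the local file system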

This was a small introduction to Apache Spark. I will be writing more about the Spark architecture, core components, and running your first Spark application in future posts. Let me know what you would like to see first.

Please comment or like if you found the explanation useful. Thank you for taking the time to read this post.