We are going to look at a basic Spark example. By the end of this tutorial, you will know:
- A basic Spark project structure
- The bare minimum libraries required to run a Spark application
- How to run a Spark application locally
Git Repo – https://github.com/letsblogcontent/SparkSimpleExample
Pre-requisites
1. Java
2. Maven
3. IntelliJ IDEA or STS (optional but recommended)
Follow the steps below to complete your first Spark application.
1. Create a new Maven project without any specific archetype. I am using the IntelliJ IDEA editor, but you may choose any other suitable editor. I have created a project named “SparkExample”.
- Navigate to File -> New Project
- Select Maven from the left panel
- Do not select any archetype
- Click “Next”
- Name the project “SparkExample”
- Click “Finish”
This should create a new, empty Maven project.
2. Next, we update the pom.xml with the spark-core dependency as shown below. Note that the “_2.10” suffix in the artifactId refers to the Scala version the artifact is built against.
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>SparkExamples</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
            <version>2.0.0</version>
        </dependency>
    </dependencies>
</project>
3. Now we create a new class “WordCount” in the “com.examples” package and copy the contents below into it.
package com.examples;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;
import java.util.Map;

public class WordCount {
    public static void main(String[] args) throws Exception {
        // Configure the application name and run locally with 3 worker threads
        SparkConf conf = new SparkConf().setAppName("wordCounts").setMaster("local[3]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read the input file as an RDD of lines
        JavaRDD<String> lines = sc.textFile("in/word_count.txt");
        // Split each line into individual words
        JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
        // Count the occurrences of each distinct word
        Map<String, Long> wordCounts = words.countByValue();
        for (Map.Entry<String, Long> entry : wordCounts.entrySet()) {
            System.out.println(entry.getKey() + " : " + entry.getValue());
        }
    }
}
4. Create a new directory “in” at the project root and add the input file “word_count.txt” to it. In this example we are going to read this file and use Spark to count the occurrences of each word in it.
5. Now we have to build the project to see the output. Since we are using Maven, we can run “mvn clean install” from the command prompt, or use Rebuild Project in IntelliJ; both work. Once the build is done, we can run the application: right-click the WordCount class and run “WordCount.main()”.
6. This should fire up a standalone Spark application and run our “WordCount” job, which counts the occurrences of words in the file “word_count.txt”. The output should look like the following:
Twenties, : 1
II : 2
industries. : 1
economy : 1
: 7
ties : 2
buildings : 1
for : 3
eleventh : 1
ultimately : 1
support : 1
channels : 1
Thereafter, : 1
subsequent : 1
.....
..
7. Now that we have successfully run the program, let’s learn what really happened.
The code below configures the name of our Spark application and sets the master to “local[3]”, which tells Spark to run the application locally using 3 worker threads.
SparkConf conf = new SparkConf().setAppName("wordCounts").setMaster("local[3]");
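As a side note, here is a minimal sketch of other common local master settings; the “local[*]” variant below is an illustration and not part of the tutorial code.
// "local"    -> run Spark with a single worker thread
// "local[*]" -> use as many worker threads as there are logical cores on the machine
SparkConf altConf = new SparkConf().setAppName("wordCounts").setMaster("local[*]");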
8. The line below initializes the Spark context.
JavaSparkContext sc = new JavaSparkContext(conf);
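It is also good practice to stop the context once the job is finished. The call below is not part of the example above (the JVM exits anyway), but JavaSparkContext provides it.
// Release Spark resources once the job has finished
sc.stop();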
9. The code below reads the file and converts it into what is called a Resilient Distributed Dataset (RDD). Spark splits the file into partitions so the worker threads can process it in parallel, and returns a single reference to the RDD for further manipulation.
JavaRDD<String> lines = sc.textFile("in/word_count.txt");
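If you are curious how many partitions Spark created, a quick check like the sketch below works; this snippet is an illustration and not part of the original example. textFile also accepts an optional minimum-partitions argument.
// Inspect how many partitions the RDD was split into (illustrative snippet)
System.out.println("partitions: " + lines.getNumPartitions());
// textFile can also take a minimum number of partitions as a hint
JavaRDD<String> linesWithHint = sc.textFile("in/word_count.txt", 3);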
10. The rest of the code is self-explanatory. The RDD API provides a number of operations like the one we used here. countByValue, as the name suggests, is an action that counts the occurrences of each distinct value in the RDD and returns the result to the driver as a Map; printing the entries of that map gives us a consolidated view of the aggregation.
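For comparison, the same word count can also be expressed with pair-RDD transformations instead of the countByValue action. The sketch below is an alternative formulation, not part of the tutorial code, and assumes the same “words” RDD from the WordCount class above.
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// Alternative word count using transformations instead of the countByValue action (illustrative sketch)
JavaPairRDD<String, Integer> pairs = words.mapToPair(word -> new Tuple2<>(word, 1));
JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);
for (Tuple2<String, Integer> entry : counts.collect()) {
    System.out.println(entry._1() + " : " + entry._2());
}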
So, as you can see, writing a Spark application is really easy; with just a single class we can get started. Please comment if you faced any issues following this tutorial, or like the post if you would like to see more.