Design a site like this with WordPress.com
Get started

Apache Spark – Beginner Example

Lets look at a simple spark job

We are going to look at a basic spark example. At the end of the tutorial, we will come to know

  1. A basic spark project structure
  2. Bare minimum libraries required to run a spark application
  3. How to run a spark application on local

Git Repo – https://github.com/letsblogcontent/SparkSimpleExample

Pre-requisites
1. Java
2. Maven
3. Intellij or STS (Optional but recommended)

Follow below steps to complete your first spark application

  1. Create a new maven project without any specific archetype. I am using IntelliJ editor but you may choose any other suitable editor as well. I have created a project with name “SparkExample”
    • Navigate to File-> New Project
    • Select Maven from Left Panel
    • Do not select any archetype
    • Click on “Next”
    • Name the project “SparkExample”
    • Click on “Finish”
      This should create a new maven project like below
      project structure
  2. Next we update the pom.xml with spark-core dependency as below.
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>SparkExamples</artifactId>
    <version>1.0-SNAPSHOT</version>
    <properties>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
            <version>2.0.0</version>
        </dependency>

    </dependencies>

</project>

3. Now we create a new Class “WordCount” in “com.examples” package and copy below contents.

package com.examples;


import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;
import java.util.Map;

public class WordCount {

    public static void main(String[] args) throws Exception {

        SparkConf conf = new SparkConf().setAppName("wordCounts").setMaster("local[3]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("in/word_count.txt");
        JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());

        Map<String, Long> wordCounts = words.countByValue();

        for (Map.Entry<String, Long> entry : wordCounts.entrySet()) {
            System.out.println(entry.getKey() + " : " + entry.getValue());
        }
    }
}

4. create a new directory folder “in” at the project root and add below file into the “in” directory. In this example we are going to read this file and use spark to count the occurrence of each word in the file.

5. Now we have to build our project to see our output. Since we are using maven, we can run the “mvn clean install” from the command prompt or we can use the rebuild from the intellij. both works and once that is done we can run our application. So basically we have to run the WordCount class so right click on the class and run “WordCount.main()”

6. This should fire up a standalone spark application and run our job of “WordCount”. This job basically counts for the occurrences of words in the file “word_count.txt”. The output should look like below

Twenties, : 1
 II : 2
 industries. : 1
 economy : 1
  : 7
 ties : 2
 buildings : 1
 for : 3
 eleventh : 1
 ultimately : 1
 support : 1
 channels : 1
 Thereafter, : 1
 subsequent : 1
.....
..

7. Now that we have successfully ran the program, lets learn what really happened
The below code configures the name of our spark application and we set the master to be local, which basically tells spark to run this application locally and run it on 3 cores.

SparkConf conf = new SparkConf().setAppName("wordCounts").setMaster("local[3]");

8. Initializes the spark context

 JavaSparkContext sc = new JavaSparkContext(conf);

9. The below code reads the file and converts it into what is called as Resilient Distributed Dataset (RDD) . This will distribute our file into 3 cores to be processed further and returns us with a single reference to the RDD for manipulation

JavaRDD<String> lines = sc.textFile("in/word_count.txt");

10. Rest of the code is self explanatory. The RDD api provides with certain apis like the one we have used. The countByValue as name suggest counts the occurrence of values in our text file and then when we print the values from the map, we get a consolidated view of the aggregation.

So as can be seen, writing a spark application is really easy and its only with a single class we can start writing a spark application. Please comment if you have faced any issue following this tutorial or like if you would like to see more.

Advertisement

HashMap

HashMap is one of the most widely used collection in java. As the name suggests, it is based on the hashing function and it is a collection of key value pairs represented by map. The collection is favorite question for an interviewer and hence it is critical to understand its functioning and its working.

As already told a HashMap is a collection of key value pairs. so unlike other collections where we store only one value, in a HashMap we store 2 values against each element of the collection. One being the key and other being a value.

So the concept itself is very simple and there is very little to understand here. The collection is unordered which means the order of retrieval is not guaranteed. These are mere basic of the map.

Lets now understand how they are stored and how the hashing function is used. So when we call put function on a map with a key and value pair, the map stores this at some part of the array and array in a map is collection of node.

The collection uses the hascode function of the object being stored to determine the index of the array within the hashmap where the object will be stored. Feels a little complicated ?

lets say we are storing employee details against employee id. So in our case, employee id is the key and employee details will be the value. Lets say we are storing “A101”,EmployeeDetail[“name:Sushil”,”email: sushil@gmail.com”]

So when we call put function with key “A101” which is String object, then the hashcode function of this object will be called to generate a number and this number will be used to determine the exact location of the pair within the map.

So why really is it done this way ? I can still store a pair in list if I create a class with key and value, isn’t it ?

The answer is to make retrieval faster. Lets see how the above mentioned storage method makes retrieval faster.

So at the time of retrieval, I present the collection with the key. In our case is “A101”, In order to determine the position of the value object, hashmap will again use the hashcode function to find our where it has stored the value of the key. So the hashcode function will help determine the index within the map and it can directly go that location to find out the value as it would be the same position which will be used for storing the pair initially.

So seems ok at this moment. So we have one key and one value and we can find the location within the map for our object and get the value. So it will be faster than traversing the list and then each time trying to find if the key in the list is equal to the key presented. So HashMap will be much faster in retrieval when we are looking for a specific object in the collection.

hashcode’s are not supposed to be unique, two objects can have the same hashcode and its logically permissible. We are supposed to make them as unique as possible but it may not be possible all the time.

So what happens when we have 2 objects with same hascode ? How will the HashMap manage the storing and retrieval. Lets see.

So during storage as earlier stated it will used the hashcode function to determine the index within the map and then against that index store the pair in a list, In this case its a linked list. For illustration look at the below example

So the bucket is really the linked list which stores the elements in a list having the same hashcode. So against each index there is a linked list where the key value pairs are stored.

So if there are more than one element with the same hascode then the element is stored in the linked list.

At the time of retrieval, the hashmap will first identify the index and then traverse through the linked list to find the element matching the key using the equals method in the object.

So the equals and hascode methods are very important when storing values in the Map. Java has provided the hashcode and equals methods for the wrapper classes but if you wish to use user defined objects as key in the HashMap then you must override the equals and hashcode methods for the HashMap to work correctly.

2 different objects can have the same hashcode. If the hashcode is same, then it does not mean that both the objects are equal. However if 2 objects are equal, they must evaluate to same hashcode. This is primary rule you must keep in mind while overriding these two methods.

You must also note that you must try and create unique hashcode for each object as it would help in the performance of the map. A hashcode which returns the same integer all the time is also a correct implementation but then it will affect the performance of the HashMap as all the elements will be stored in the same bucket.

When we try and store two objects which are logically not equal but evaluate to same hascode, a hash collision is said to be occurred. This scenario must be avoided to improve the performance of the HashMap.

If you find this tutorial helpful, Please like it and comment if you are looking of any additional information on HashMaps.

Setup Authentication in Django – In just 10 mins

A quick 10 min tutorial to setup authentication in django.

We are quickly going to see how do we setup authentication in Django. You wont believe how easy it will be, It is a matter of only 10 min.

git url https://github.com/letsblogcontent/Python/tree/master/djangoauthentication

Step 1

Lets begin with creation of the project using command django-admin startproject djangoauth. I will be using pycharm to edit files. You can edit in pycharm or anyother ide or your choice.

Step 2

On the terminal, in our project root directory, execute commands “python manage.py makemigrations” and “python manage.py migrate”


Step 3
Lets create a template directory where we will place our html files. I have created a templates directory in the root of the project. You may create the same structure. Also we need to register this directory in the settings.py file in the templates object. Against the DIRS add the following path. This will register our templates directory with the django project


 'DIRS': [os.path.join(BASE_DIR, 'templates')],

Lets add a home page for our application by adding home.html under the templates directory and add some sample content as below.

Step 4 – Add url mapping to enable authentication and the home page in the urls.py of djangoauth project (/djangoauth/djangoauth) file. The “django.contrib.auth.urls” is where we set up authentication

from django.contrib import admin
from django.urls import path, include
from django.views.generic.base import TemplateView

urlpatterns = [
    path('admin/', admin.site.urls),
]

urlpatterns += [
    path('', include('django.contrib.auth.urls')),
    path('', TemplateView.as_view(template_name='home.html'))
]

Step 5 –
Now lets run and check what is happening on the browser. Run our application by execution “python manage.py runserver”. You should see following code

Step 6
Django provides an admin page where we can add modify users. We will first create a superuser so that we can add more users to log-in. To create a superuser execute command “python maange.py createsuperuser”

To login with this user, On the browser navigate to http://localhost:8000/admin, this will show up the login page. Login with the user created above, you should see the below page

Now create a new user by clicking on add button. you will be presented with below page. fill in the details and click save. On the next page you can just click save.

Step 7 Next up is we need to setup the login page. Django looks for login page under the registration directory of templates. So we create a login.html file as below in templates/registrations directory.



{% block content %}
  <form method="post" action="{% url 'login' %}">
    {% csrf_token %}
    <table>
      <tr>
        <td>{{ form.username.label_tag }}</td>
        <td>{{ form.username }}</td>
      </tr>
      <tr>
        <td>{{ form.password.label_tag }}</td>
        <td>{{ form.password }}</td>
      </tr>
    </table>
    <input type="submit" value="login" />
    <input type="hidden" name="next" value="{{ next }}" />
  </form>

{% endblock %}

Next we update our home.html to show login and logout. Copy the below content to home.html

{% if user.is_authenticated %}
<h1>Welcome to home</h1>
{% else %}
<h1> Login to the application</h1>
 {% endif %}
<a href = "http://localhost:8000/login/">login</a>
{% if user.is_authenticated %}
<a href = "http://localhost:8000/logout/">logout</a>
{% endif %}

Add below variables to settings.py file. Here we are setting the login url and where should it redirect upon login and logout. Save the file and now navigate to http://localhost:8000/

LOGIN_REDIRECT_URL = '/'

LOGIN_URL = '/login'

LOGIN_REDIRECT_URL = '/'

LOGOUT_REDIRECT_URL= '/'

navigate to http://localhost:8000/ to below page

Click on login and enter with the user you created to see below page. And now you can click on logout to check if it works. It should take you back to the login screen.