Implementation of Spark Applications in Hadoop

Welcome to the twelfth lesson, ‘Implementation of Spark Applications’, of the Big Data Hadoop Tutorial, which is part of the ‘Big Data and Hadoop’ course offered by OnlineItGuru.

In this lesson, we will discuss how to implement a Spark application. You will also be introduced to SparkContext and the cluster options for Spark applications.

Let us first look at the objectives of this lesson.

Objectives

After completing this lesson, you will be able to:

  • Explain the difference between Spark Shell and Spark applications.
  • Describe SparkContext.
  • Explain the three supported cluster resource managers of Spark applications: Hadoop YARN, Spark Standalone, and Apache Mesos.
  • List the steps to run a Spark application.
  • List the steps for dynamic resource allocation.
  • Explain the different configurations related to a Spark application.


Spark Shell vs. Spark Applications

The comparison below explains how the Spark Shell differs from Spark applications.

Spark Shell

  • Allows interactive exploration and manipulation of data
  • Example: REPL using Python or Scala

Spark Applications

  • Run as independent programs
  • Examples: programs written in Python, Scala, or Java, and jobs such as ETL processing and streaming

The SparkContext

Every Spark program needs a SparkContext, which is the main entry point to Spark for a Spark application. The interactive Spark Shell creates one for you. The diagram below shows where the SparkContext sits in a Spark program.

[Figure: SparkContext in a Spark program]


The first thing that any Spark program needs to do is to create a SparkContext object, which informs Spark how to access a cluster through a resource manager.

In order to create a SparkContext, begin by building a SparkConf object, which contains information about your application.

In a Spark application, you create the SparkContext yourself. In the Spark shell, a SparkContext is already created for you in the variable called sc. When you want to terminate the program, you can call sc.stop().

Just as you close files or network connections after you are done with them, calling sc.stop() lets the Spark master know that your application has finished consuming resources.

Here is an example where you can see a SparkContext being initiated in Scala.

[Figure: A SparkContext being initiated in Scala]


Here, the number of lines containing “a” and the number of lines containing “b” are calculated.
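The original listing is in Scala. As a rough Python counterpart, here is a minimal PySpark sketch of the same line count. It assumes a SparkContext named sc is passed in (for example, the one the pyspark shell provides); the function name count_lines_with and the file path are hypothetical, chosen only for illustration.

```python
def count_lines_with(sc, path, letters=("a", "b")):
    """Return {letter: number of lines in the file that contain that letter}."""
    # cache() keeps the lines in memory because they are scanned once per letter
    lines = sc.textFile(path).cache()
    counts = {}
    for letter in letters:
        # the default argument binds the current letter into the lambda
        counts[letter] = lines.filter(lambda line, l=letter: l in line).count()
    return counts
```

In the pyspark shell, you could call count_lines_with(sc, "README.md"). In a standalone application, you would create the SparkContext first and call sc.stop() once the counts have been printed.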

In the following sections, let’s learn about the different options of Spark application clusters and supported cluster resource managers.

Spark Application Cluster Options

Spark applications can run locally without any distributed processing, locally with multiple worker threads, and on a cluster.

Local mode

A Spark application that runs locally is shown below.

[Figure: A Spark application running locally]

Local mode is useful for development and testing, while the cluster is preferred for production.

Cluster mode

A Spark application that runs on a cluster is shown below.

[Figure: A Spark application running on a cluster]


Supported Cluster Resource Managers

The three supported cluster resource managers of Spark applications are Hadoop YARN, Spark Standalone, and Apache Mesos.

Hadoop YARN

Hadoop YARN is included in Cloudera's Distribution including Apache Hadoop (CDH). It is the resource manager most commonly used for production sites and allows Spark to share cluster resources with other applications.

Spark Standalone

Spark Standalone is included with Spark. It has limited configurability and scalability but is easy to install and run. It is useful for testing, development, or small systems. However, its security support is limited.

Apache Mesos

Apache Mesos was the first platform to be supported by Spark. However, currently, it is not as popular as the other resource managers.

In the next section, we will learn how to run Spark on Hadoop YARN, both in client mode and cluster mode.

Running Spark on YARN: Client Mode (1)

Now that you know what the three supported cluster resource managers are, let’s understand how Spark runs in Hadoop YARN.

[Figure: Running Spark on YARN, client mode (1)]


As you can see from the diagram, when Spark is run in client mode, the SparkContext runs on the client machine.

The resource manager launches the application master, which in turn requests containers for the executors. The executors then run the program's tasks.

Running Spark on YARN: Client Mode (2)

In the diagram below, you can see that once the executors finish processing, they return the result to the SparkContext.

[Figure: Running Spark on YARN, client mode (2)]


Now, let’s look at an example where another client has opened a SparkContext.

Running Spark on YARN: Client Mode (3)

[Figure: Running Spark on YARN, client mode (3)]


This application has its own application master, which opens its own executors.

Once the executors finish processing, they return the result to the new SparkContext, as shown in the diagram below.

Running Spark on YARN: Client Mode (4)

[Figure: Running Spark on YARN, client mode (4)]


You’ve seen how Spark runs in client mode. Now let’s understand how Spark runs in cluster mode.

Running Spark on YARN: Cluster Mode (1)

In cluster mode, the SparkContext runs on the cluster, inside the application master, and launches its executors from there.

[Figure: Running Spark on YARN, cluster mode (1)]


Once the executors finish their tasks, they return the results to the SparkContext.

[Figure: Running Spark on YARN, cluster mode (2)]


As mentioned earlier, a Spark application can run in different modes. In the following sections, let’s see how to run a Spark application locally and how to start a Spark shell on a cluster.

Running Spark Application

Let us first understand how to run a Spark application locally.

Running a Spark Application Locally

To run a Spark application locally, use spark-submit --master to specify one of the local options.

Here are the different local options:

  • Use local[*] to run the application locally with as many threads as cores. This is the default option.
  • Use local[n] to run the application locally with n threads.
  • Use local to run the application locally with a single thread.
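To make the three local options concrete, here is a small sketch that works out how many worker threads each master string implies. The helper local_threads is hypothetical and is not part of Spark; it uses os.cpu_count() as a stand-in for the number of cores Spark would detect.

```python
import os
import re


def local_threads(master):
    """Return the number of worker threads implied by a local master string."""
    if master == "local":
        return 1  # plain "local" means a single thread
    match = re.fullmatch(r"local\[(\*|\d+)\]", master)
    if match is None:
        raise ValueError(f"not a local master: {master!r}")
    value = match.group(1)
    # "*" means one thread per available core; otherwise use the given number
    return (os.cpu_count() or 1) if value == "*" else int(value)


for option in ("local", "local[4]", "local[*]"):
    print(option, "->", local_threads(option), "thread(s)")
```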

Running a Spark Application on a Cluster

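To run a Spark application on a cluster, use spark-submit --master to specify the cluster option. The different cluster options are:

  • yarn-client
  • yarn-cluster
  • spark://masternode:port, which is used in Spark Standalone
  • mesos://masternode:port, which is used in Apache Mesos

In Spark 2.0 and later, yarn-client and yarn-cluster are deprecated in favor of --master yarn combined with --deploy-mode client or --deploy-mode cluster.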
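The sketch below assembles spark-submit command lines for the different master options. The helper spark_submit_command and the application file my_app.py are hypothetical. Ports 7077 and 5050 are assumed here because they are the usual defaults for Spark Standalone and Apache Mesos; your cluster may use different ones. The YARN example uses the --master yarn with --deploy-mode form, which newer Spark versions expect in place of yarn-client and yarn-cluster.

```python
import shlex


def spark_submit_command(app, master, deploy_mode=None, conf=None, app_args=()):
    """Build the argument list for a spark-submit invocation."""
    command = ["spark-submit", "--master", master]
    if deploy_mode is not None:
        command += ["--deploy-mode", deploy_mode]
    for key, value in (conf or {}).items():
        command += ["--conf", f"{key}={value}"]
    command.append(app)          # the application comes after all options
    command.extend(app_args)     # anything after it is passed to the application
    return command


# Print shell-safe command lines; shlex.join quotes values such as local[*]
for master in ("local[*]", "spark://masternode:7077", "mesos://masternode:5050"):
    print(shlex.join(spark_submit_command("my_app.py", master)))
print(shlex.join(spark_submit_command("my_app.py", "yarn", deploy_mode="cluster")))
```

Returning a list rather than a single string means the result can be passed directly to subprocess.run without worrying about shell quoting.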

Starting a Spark Shell on a Cluster

Not only can you run a Spark application on a cluster, you can also run a Spark shell on a cluster. Both pyspark and spark-shell have a --master option.

On Hadoop YARN, the Spark shell can run only in client mode, because the machine you are running on acts as the driver.

To start the Spark shell on a Spark Standalone or Apache Mesos cluster, pass the cluster manager URL to --master.

In the next section of the lesson, you will learn the steps included in dynamic resource allocation.

Dynamic Resource Allocation in Spark

Spark can dynamically allocate executors, as required. Dynamic allocation allows a Spark application to add or release executors.

In the example below, three executors are initially provided. Spark can add more executors at any time during the execution of the program if required.

[Figure: Dynamic resource allocation in Spark]


You can see that two more executors have been added. Spark can also release executors when they are no longer required. In this example, the number is then reduced by two executors.

Dynamic allocation on Hadoop YARN is enabled by default starting in CDH 5.5.

  • It is enabled at the site level in Hadoop YARN rather than at the application level.
  • It can be disabled for an individual application, for example by specifying --num-executors for the spark-submit script or by setting spark.dynamicAllocation.enabled to false.
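The idea of adding and releasing executors can be illustrated with a toy model. This is a simplified sketch, not Spark's actual policy, which also takes running tasks and idle timeouts into account. The function target_executors and the backlog numbers are hypothetical; the two bounds play the role of the spark.dynamicAllocation.minExecutors and spark.dynamicAllocation.maxExecutors properties.

```python
import math


def target_executors(pending_tasks, tasks_per_executor, min_executors, max_executors):
    """Executors needed for the current backlog, clamped to the configured bounds."""
    needed = math.ceil(pending_tasks / tasks_per_executor)
    return max(min_executors, min(max_executors, needed))


# Backlog over time for a hypothetical job: steady, then a burst, then steady again
backlog = [12, 20, 12]
print([target_executors(tasks, 4, min_executors=1, max_executors=10) for tasks in backlog])
# prints [3, 5, 3]
```

The output mirrors the example in this section: three executors at first, two more added during the burst, and two released afterwards.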

Let’s understand the different configurations related to a Spark application.

Configuring Your Spark Application

Spark provides numerous properties for configuring your application. Here are some example properties, some of which have reasonable default values.

  • spark.master: The cluster manager to connect to, given as a master URL such as local[*] or spark://masternode:port.
  • spark.app.name: The application name, which appears in the UI and in log data.
  • spark.local.dir: Where to store local files, such as shuffle output. The default is /tmp.
  • spark.ui.port: The port for the Spark application UI, the dashboard that shows memory and workload data. The default is 4040.
  • spark.executor.memory: The amount of memory to allocate to each executor process. The default is 1g.
  • spark.driver.memory: The amount of memory to allocate to the driver process, that is, the process where the SparkContext is initialized. The default is 1g. In client mode, set it through spark-submit or a properties file rather than through SparkConf, because the driver has already started by the time SparkConf is read.

Spark applications can be configured declaratively or programmatically.

Let’s discuss them in detail in the following sections.

Declarative Configuration Options

Declarative configuration options include:

  • Spark-submit script
  • Properties file
  • Site default properties file

Spark-submit script

The spark-submit script in Spark’s bin directory is used to launch applications on a cluster.

Examples of its use include spark-submit --driver-memory 500M and spark-submit --conf spark.executor.cores=4.

Properties file

Properties files are a popular means of configuring applications. A properties file contains a list of properties and values, one pair per line, separated by tabs or spaces. You can load the file with spark-submit --properties-file filename.
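To show what such a file looks like, here is a minimal sketch that reads tab- or space-separated properties into a dictionary. The parser parse_properties is hypothetical and deliberately simpler than the loader Spark itself uses, and the sample settings are made up for illustration.

```python
def parse_properties(text):
    """Parse 'name value' lines into a dict, skipping blank lines and # comments."""
    properties = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of spaces or tabs
        name = parts[0]
        value = parts[1] if len(parts) == 2 else ""
        properties[name] = value
    return properties


# Hypothetical file contents; the second property uses a tab as the separator
example = """
# settings for illustration only
spark.master            local[4]
spark.app.name\tLine Count
spark.executor.memory   2g
"""
print(parse_properties(example))
```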

Site default properties file

Using the site default properties file, spark-defaults.conf, you can configure properties for all Spark applications.

The file is located at $SPARK_HOME/conf/spark-defaults.conf; edit it to set properties that apply to every application. A template file, spark-defaults.conf.template, is also provided in the same directory.

Let’s take a look at how they can be configured programmatically.

Setting Configuration Properties Programmatically

Spark configuration settings are part of the SparkContext and are configured using a SparkConf object. The set functions return the SparkConf object itself, which supports chaining.

Some examples of set functions include:

  • setAppName(name)
  • setMaster(master)
  • set(propertyName, value)

SparkConf Example (Scala)

Here is an example of code that configures SparkConf programmatically in Scala.

[Figure: Code that configures SparkConf programmatically in Scala]
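As a rough PySpark counterpart to the Scala listing, here is a minimal sketch of chained set functions. The function names configure and create_context, the application name, and the port value 4141 are hypothetical; setAppName, setMaster, and set are the SparkConf methods listed above.

```python
def configure(conf):
    """Apply application settings to a SparkConf.

    Each setter returns the same SparkConf, so the calls can be chained.
    """
    return (
        conf.setAppName("Line Count")       # sets spark.app.name
        .setMaster("local[*]")              # sets spark.master
        .set("spark.ui.port", "4141")       # sets any other property by name
    )


def create_context():
    """Create a SparkContext from the configured SparkConf (requires PySpark)."""
    # Imported here so that configure() can be read and tried without PySpark
    from pyspark import SparkConf, SparkContext

    return SparkContext(conf=configure(SparkConf()))
```

A standalone application would call create_context() once at startup and sc.stop() on the returned context when it finishes.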


Spark: Points to Remember

Remember the following points while implementing a Spark application:

  • Use the Spark Shell for interactive data exploration.
  • Write a Spark application to run independently.
  • Spark applications require a SparkContext object.
  • Spark applications are run using the spark-submit script.
  • Spark configuration parameters can be set declaratively using the spark-submit script or a properties file or set programmatically using a SparkConf object.
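When the same property is set in more than one place, Spark's documentation gives the order of precedence: values set on the SparkConf object win, then flags passed to spark-submit, then entries in spark-defaults.conf. The sketch below models that order with plain dictionaries; the function effective_conf and the sample values are hypothetical.

```python
from collections import ChainMap


def effective_conf(spark_conf, submit_flags, site_defaults):
    """Merge the three configuration sources; earlier arguments win on conflicts."""
    # ChainMap looks up each key in the first mapping that contains it
    return dict(ChainMap(spark_conf, submit_flags, site_defaults))


site_defaults = {"spark.master": "yarn", "spark.executor.memory": "1g"}
submit_flags = {"spark.executor.memory": "2g"}
spark_conf = {"spark.app.name": "Line Count"}

print(effective_conf(spark_conf, submit_flags, site_defaults))
```

Here the executor memory from the spark-submit flag overrides the site default, while spark.master falls through to the value in the site defaults file.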

Summary

Let’s now summarize what we learned in this lesson.

  • The Spark Shell allows interactive exploration and manipulation of data, while Spark applications run as independent programs.
  • Every Spark program needs a SparkContext.
  • The interactive Spark Shell creates it for the user.
  • Spark applications can run locally without any distributed processing, locally with multiple worker threads, and on a cluster.
  • The three supported cluster resource managers of Spark applications are Hadoop YARN, Spark Standalone, and Apache Mesos.
  • Spark provides numerous properties for configuring an application such as spark.master, spark.app.name, spark.local.dir, spark.ui.port, spark.executor.memory, and spark.driver.memory.
  • Spark applications can be configured declaratively or programmatically.


Conclusion

This concludes the lesson “Implementation of Spark Applications.”
