
SparkContext

Introduction to SparkContext

Creating the Spark context is the first and essential step for the Spark driver in any Spark application. The driver program runs the operations inside the executors on the worker nodes. SparkContext is the gateway to Apache Spark functionality: the parameters of the Spark driver application are passed through the Spark context. As soon as we run any Spark application, a driver program with the main function starts, and the SparkContext is initiated and generated there. Access to the Spark cluster is granted with the help of a resource manager, the two main types being Mesos and YARN. Initially, a SparkConf (Spark configuration) should be made in order to create a SparkContext.

Syntax for Apache SparkContext:

from pyspark import SparkContext
sc = SparkContext("local", "First App")

How is Apache SparkContext Created?

Initially, a SparkConf should be made if one has to create a SparkContext. The configuration parameters of this SparkConf are what our Spark driver application passes to the SparkContext. A few of these parameters define the properties of the Spark driver application.

The others are used in allocating cluster resources, such as the memory size and the number of cores on the worker nodes used by the executors that Spark runs. To put it simply, the Spark context guides how the Spark cluster is accessed. Invoking a text file read (textFile), a sequence file read (sequenceFile), parallelize, and a few other operations can be done after creating the SparkContext object, as sketched in the example below.
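As a minimal sketch of this flow (the application name, master URL, memory setting, and file path are illustrative placeholders, not values from this article), a SparkConf is built first and then handed to the SparkContext, after which operations such as parallelize and textFile become available:

Code:

from pyspark import SparkConf, SparkContext

# Build the configuration first: application name, master URL,
# and an example resource setting (all values are illustrative).
conf = (SparkConf()
        .setAppName("FirstApp")
        .setMaster("local[2]")
        .set("spark.executor.memory", "1g"))

# The SparkContext is created from this configuration and acts as
# the gateway to the cluster.
sc = SparkContext(conf=conf)

# Once the context exists, RDDs can be created.
numbers = sc.parallelize([1, 2, 3, 4, 5])
print(numbers.sum())  # 15

# Reading a text file works the same way (the path is hypothetical).
# lines = sc.textFile("data/input.txt")

sc.stop()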

Parameters:

profiler_cls: The class of a custom profiler used to do profiling; the default is pyspark.profiler.BasicProfiler.
jsc: An instance of the Java Spark context (JavaSparkContext).
gateway: Use an existing gateway and JVM, or instantiate a new JVM.
serializer: The serializer for RDDs.
batchSize: The number of Python objects represented as a single Java object. Set it to 1 to disable batching, to 0 to choose the batch size automatically based on object sizes, or to -1 to use an unlimited batch size.
environment: Environment variables for the worker nodes.
pyFiles: The .zip or .py files to send to the cluster and add to the PYTHONPATH.
sparkHome: The Spark installation directory.
appName: The name of the job (application).
master: The URL of the cluster to connect to.
conf: An object of SparkConf (Spark configuration) used to set all the Spark properties.

The data flow of the Spark context is as follows:

The Spark context uses Py4J to launch a Java virtual machine, which in turn creates a Java Spark context. PySpark makes the Spark context available by default as sc, which is why creating a second Spark context in the same session will not work.
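Because only one Spark context may be active at a time, a standalone script (outside the PySpark shell, where sc already exists) can either reuse a running context with getOrCreate or stop the old one before creating a new one. A minimal sketch, with an illustrative application name:

Code:

from pyspark import SparkConf, SparkContext

# Reuse the existing context if one is already running,
# otherwise create a new one from this configuration.
conf = SparkConf().setAppName("ReuseApp").setMaster("local")
sc = SparkContext.getOrCreate(conf)

# To replace a running context, stop it first; creating a second
# active SparkContext in the same process will fail.
sc.stop()
sc = SparkContext(conf=conf)
sc.stop()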

Code:

class pyspark.SparkContext (
master = None,
appName = None,
sparkHome = None,
pyFiles = None,
environment = None,
batchSize = 0,
serializer = PickleSerializer(),
conf = None,
gateway = None,
jsc = None,
profiler_cls = BasicProfiler
)
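A hedged sketch of calling this constructor directly with keyword arguments follows; the master URL, application name, and the commented-out dependency file are placeholders rather than values prescribed by the class itself.

Code:

from pyspark import SparkContext

# master, appName and batchSize correspond to the parameters listed above.
sc = SparkContext(
    master="local[*]",
    appName="ParameterDemo",
    batchSize=0,                  # let PySpark choose the batch size
    # pyFiles=["helpers.py"],     # hypothetical .py dependency to ship to workers
)

print(sc.master, sc.appName)
sc.stop()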

Example:

Code:

package com.dataflair.spark

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

object Word_Count {
  def main(args: Array[String]) {
    // Create the configuration for the Spark context
    val conf = new SparkConf().setAppName("WordCount")
    // Create the Spark context object
    val sc = new SparkContext(conf)
    // Check whether sufficient parameters are supplied
    if (args.length < 2) {
      println("Usage: ScalaWordCount <inputfile> <outputdir>")
      System.exit(1)
    }
    // Read the file and create an RDD
    val rawData = sc.textFile(args(0))
    // Using the flatMap operation, convert the lines into words
    val words = rawData.flatMap(line => line.split(" "))
    // Using map and reduceByKey operations, count the individual words
    val wordCount = words.map(word => (word, 1)).reduceByKey(_ + _)
    // Save the result
    wordCount.saveAsTextFile(args(1))
    // Stop the Spark context
    sc.stop()
  }
}

Output:
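For reference, the same word count can be sketched in PySpark as well; the input and output paths are supplied on the command line, just as in the Scala version above.

Code:

import sys
from pyspark import SparkConf, SparkContext

# Check whether sufficient parameters are supplied
if len(sys.argv) < 3:
    print("Usage: wordcount.py <inputfile> <outputdir>")
    sys.exit(1)

conf = SparkConf().setAppName("WordCount")
sc = SparkContext(conf=conf)

# Read the file, split the lines into words, and count each word
raw_data = sc.textFile(sys.argv[1])
words = raw_data.flatMap(lambda line: line.split(" "))
word_count = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

# Save the result and stop the context
word_count.saveAsTextFile(sys.argv[2])
sc.stop()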

Conclusion

To sum up, Spark helps to simplify the challenging and computationally intensive task of processing high volumes of real-time or archived data, both structured and unstructured, seamlessly integrating complex capabilities such as machine learning and graph algorithms. Spark brings Big Data processing to the masses. SparkContext is the entry point to this Spark functionality: it provides functions such as getting the current status of the Spark application, setting the configuration, cancelling a job, cancelling a stage, and much more. Thus, it acts as the backbone.

Recommended Articles

This is a guide to SparkContext. Here we discuss the introduction to SparkContext and how Apache SparkContext is created, along with a respective example. You may also have a look at the following articles to learn more –

  1. Spark Accumulator
  2. Spark Parallelize
  3. Spark Functions
  4. Spark Versions
