To create a SparkSession, we use the builder pattern. The SparkSession class from the pyspark.sql module exposes a builder whose getOrCreate() method creates a new SparkSession if none exists, or else returns the existing SparkSession object. The following code is an example of creating a SparkSession:
import pyspark
from pyspark.sql import SparkSession
# The parentheses let the chained builder calls span several lines
spark = (
    SparkSession.builder.master("local[1]")
    .appName('InterviewBitSparkSession')
    .getOrCreate()
)
Here,
* master(): Sets the master URL, which decides where the application runs: on a cluster (pass the cluster's master URL) or locally on one machine. For local mode, we pass local[x], where x is the number of worker threads; it also serves as the default parallelism, and so as the default partition count for RDDs, DataFrames, and Datasets. The value of x is ideally the number of CPU cores available.
* appName(): Sets the application name.
* getOrCreate(): Returns a SparkSession object. This creates a new object if none exists; if one already exists, it simply returns that.
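If we want a new SparkSession object every time, we can call the newSession() method on an existing session, as shown below. The new session shares the underlying SparkContext with the original but keeps its own SQL configuration, temporary views, and registered functions.
from pyspark.sql import SparkSession
# newSession() is an instance method, so it is called on an existing session
spark = SparkSession.builder.getOrCreate()
spark_session = spark.newSession()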
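To make the difference between getOrCreate() and newSession() concrete, here is a toy model in plain Python that runs without Spark. ToySession, ToyBuilder, and their method names are hypothetical and are not PySpark classes. The sketch only mimics the two behaviors described above, reusing an existing session versus always building a new one, and leaves out everything else real Spark does.

```python
# Toy model of the builder pattern behind SparkSession.
# ToySession and ToyBuilder are hypothetical names used only for
# illustration; they are not part of PySpark.

class ToySession:
    _active = None  # the session that get_or_create() hands back

    def __init__(self, conf, shared_context):
        self.conf = dict(conf)                # per-session settings
        self.shared_context = shared_context  # stands in for SparkContext

    def new_session(self):
        # Always a fresh object: its own copy of conf, same shared context.
        return ToySession(self.conf, self.shared_context)


class ToyBuilder:
    def __init__(self):
        self._options = {}

    def master(self, url):
        self._options["master"] = url
        return self  # returning self is what allows call chaining

    def app_name(self, name):
        self._options["app.name"] = name
        return self

    def get_or_create(self):
        # Build a session only if none exists yet; otherwise reuse it.
        if ToySession._active is None:
            ToySession._active = ToySession(self._options, object())
        return ToySession._active


first = ToyBuilder().master("local[1]").app_name("demo").get_or_create()
second = ToyBuilder().get_or_create()
fresh = first.new_session()

print(first is second)  # True: the existing session is reused
print(fresh is first)   # False: new_session() always builds a new object
print(fresh.shared_context is first.shared_context)  # True: context is shared
```

Each builder method returns the builder itself, which is why the calls can be chained before the final get_or_create().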