PySpark - Interview Questions
What are PySpark serializers?
Serialization is used in PySpark for performance tuning: data that is sent over the network, written to disk, or persisted in memory must first be serialized. PySpark provides serializers for this purpose and supports two types:

PickleSerializer: Serializes objects using Python's pickle protocol (class pyspark.serializers.PickleSerializer). It supports almost every Python object (a short example follows this list).
MarshalSerializer: Serializes objects using Python's marshal module (class pyspark.serializers.MarshalSerializer). It is faster than PickleSerializer but supports only a limited set of types.

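For comparison, here is a minimal sketch of the same pattern using PickleSerializer; the file name pickle_example.py and the sample pair data are our own illustrative choices:

# --pickle_example.py----
from pyspark.context import SparkContext
from pyspark.serializers import PickleSerializer

# Initialize the Spark context with PickleSerializer
# (historically the default serializer in PySpark)
sc = SparkContext("local", "Pickle Serialization", serializer=PickleSerializer())

# Unlike MarshalSerializer, PickleSerializer handles almost any
# Python object, such as these (key, value) tuples
pairs = sc.parallelize([("a", 1), ("b", 2), ("c", 3)])
print(pairs.mapValues(lambda v: v * 10).collect())
sc.stop()

Submitted the same way ($SPARK_HOME/bin/spark-submit pickle_example.py), this prints [('a', 10), ('b', 20), ('c', 30)].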
Next, consider an example of serialization that uses MarshalSerializer:
# --serializing.py----
from pyspark.context import SparkContext
from pyspark.serializers import MarshalSerializer

# Initialize the Spark context with MarshalSerializer as the serializer
sc = SparkContext("local", "Marshal Serialization", serializer=MarshalSerializer())
print(sc.parallelize(list(range(1000))).map(lambda x: 3 * x).take(5))
sc.stop()

We run the file using the command:
$SPARK_HOME/bin/spark-submit serializing.py

The output is the first five numbers, each multiplied by 3:
[0, 3, 6, 9, 12]