PySpark - Interview Questions
What is the checkpointing in PySpark?
A streaming application must run 24/7 and must therefore be resilient to failures outside the application code (e.g., system crashes, JVM crashes). Checkpointing makes streaming applications fault tolerant by periodically saving both data and metadata to a checkpoint directory on reliable storage.

Checkpointing comes in two types – metadata checkpointing and data checkpointing.

Metadata checkpointing saves the information that defines the streaming computation to a fault-tolerant storage system such as HDFS. This allows the streaming application to recover after a failure of the driver node.

Data checkpointing means saving the generated RDDs themselves to reliable storage. It is required for stateful transformations that combine data across multiple batches, where the RDD dependency chain would otherwise grow without bound.
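As a minimal sketch of both kinds of checkpointing, the PySpark Streaming snippet below enables a checkpoint directory and uses a stateful `updateStateByKey` word count, which is exactly the case that requires data checkpointing. The checkpoint path, host, and port are illustrative assumptions; running it requires a Spark installation and a socket source.

```python
def update_count(new_values, running_count):
    """Combine the counts from the current batch with the running state.

    On the first batch for a key, running_count is None.
    """
    return sum(new_values) + (running_count or 0)

def create_context(checkpoint_dir):
    """Build a fresh StreamingContext; only called when no checkpoint exists."""
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="CheckpointedWordCount")
    ssc = StreamingContext(sc, 10)          # 10-second batches
    ssc.checkpoint(checkpoint_dir)          # enables metadata + data checkpointing

    lines = ssc.socketTextStream("localhost", 9999)  # hypothetical source
    counts = (lines.flatMap(str.split)
                   .map(lambda w: (w, 1))
                   .updateStateByKey(update_count))  # stateful across batches
    counts.pprint()
    return ssc

if __name__ == "__main__":
    from pyspark.streaming import StreamingContext

    checkpoint_dir = "hdfs:///tmp/streaming-checkpoint"  # hypothetical path
    # On restart after a driver failure, getOrCreate rebuilds the context
    # from the checkpoint metadata instead of calling create_context again.
    ssc = StreamingContext.getOrCreate(
        checkpoint_dir, lambda: create_context(checkpoint_dir))
    ssc.start()
    ssc.awaitTermination()
```

`StreamingContext.getOrCreate` is what ties the two checkpoint types together: metadata checkpointing lets the driver rebuild the computation, while the checkpointed RDD state lets `updateStateByKey` resume its running counts.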