Spark - Interview Questions
How is Caching relevant in Spark Streaming?
Spark Streaming divides the incoming data stream into batches at a configured interval (for example, every X seconds); each such stream of batches is represented as a DStream. DStreams let developers cache (persist) the stream's data in memory, which is very useful when the data of a DStream is used for multiple computations. Caching can be done using the cache() method, or using the persist() method with an appropriate storage level. For input streams that receive data over the network (such as Kafka, Flume, etc.), the default persistence level replicates the data to two nodes (StorageLevel.MEMORY_AND_DISK_SER_2) to achieve fault tolerance.
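As an illustration, here is a minimal sketch of a cached DStream reused by two separate computations; the application name, host, port, and 10-second batch interval are illustrative assumptions, not values from Spark itself:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Batches of 10 seconds; host/port are placeholder values.
val conf = new SparkConf().setAppName("StreamingCacheExample").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))

val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
words.cache() // persist once, reuse for both computations below

words.count().print()                             // computation 1: records per batch
words.map(w => (w, 1)).reduceByKey(_ + _).print() // computation 2: frequency per word

ssc.start()
ssc.awaitTermination()

Because words is cached, the second computation reads the batch data from memory instead of recomputing it from the source stream.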
 
Caching using the cache() method:
val cacheDf = dframe.cache()

 

Caching using the persist() method:
import org.apache.spark.storage.StorageLevel

val persistDf = dframe.persist(StorageLevel.MEMORY_ONLY)
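persist() works the same way on a DStream. A minimal sketch (reusing the hypothetical words stream from the example above) that explicitly requests the replicated level mentioned earlier:

import org.apache.spark.storage.StorageLevel

// Replicate the stream's data on two nodes, mirroring the default
// persistence level for receiver-based input streams.
words.persist(StorageLevel.MEMORY_AND_DISK_SER_2)

MEMORY_ONLY keeps deserialized objects in memory only, while the _2 suffix requests replication on two nodes for fault tolerance.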
The main advantages of caching are:
 
Cost efficiency: Spark computations can be expensive, so caching enables data reuse, and hence computation reuse, which reduces the cost of repeated operations.

Time efficiency: Reusing cached results instead of recomputing them saves a significant amount of execution time.

More jobs accomplished: With less time spent on repeated computation, the worker nodes are free to execute more jobs.