PySpark - Interview Questions
What is RDD in PySpark?
In PySpark, RDD stands for Resilient Distributed Dataset. It is the core data structure of PySpark: a low-level object that is highly efficient at performing distributed tasks.

PySpark RDDs are elements that can be operated on across multiple nodes to perform parallel processing on a cluster. They are immutable: once an RDD is created, it cannot be changed. RDDs are also fault-tolerant; in the event of a failure, they recover automatically. Multiple operations can be applied to an RDD to accomplish a task, and these fall into two groups: transformations, which produce a new RDD, and actions, which return a result to the driver program. A minimal sketch follows below.
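For example, the following is a minimal sketch of creating and operating on an RDD, assuming a local Spark installation (the app name "RDDExample" and the sample data are chosen here purely for illustration):

from pyspark import SparkContext

# Create a SparkContext that runs locally using all available cores.
sc = SparkContext("local[*]", "RDDExample")

# parallelize() distributes a Python collection across partitions as an RDD.
numbers = sc.parallelize([1, 2, 3, 4, 5])

# Transformations such as map() and filter() return a new RDD;
# the original RDD is never modified (immutability).
squares = numbers.map(lambda x: x * x)
even_squares = squares.filter(lambda x: x % 2 == 0)

# Actions such as collect() trigger the actual distributed computation.
print(even_squares.collect())   # [4, 16]

sc.stop()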