A typical Spark program follows this workflow:
* First, create input RDDs from external data. The data can come from a variety of data sources.
* Next, apply transformation operations such as filter() or map() to derive new RDDs according to the business logic.
* If any intermediate RDDs will be reused later, persist them so they do not have to be recomputed.
* Finally, invoke an action such as first() or count(). Transformations are evaluated lazily, so it is the action that causes Spark to launch the parallel computation.
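
The four steps above can be sketched with a small pure-Python stand-in. This is not Spark's API: `ToyRDD`, `from_source`, and `parse` are hypothetical names invented for illustration, and everything runs in a single process with no parallelism. The sketch only aims to show the shape of the workflow: transformations record work, persist() marks a result for reuse, and nothing is computed until an action is called.

```python
from typing import Callable, Iterable, Iterator, List, Optional


class ToyRDD:
    """Hypothetical single-process stand-in for an RDD (not Spark's API)."""

    def __init__(self, compute: Callable[[], Iterator]):
        self._compute = compute              # recipe that produces the elements
        self._persist = False                # set by persist()
        self._cache: Optional[List] = None   # filled on first evaluation if persisted

    @staticmethod
    def from_source(data: Iterable) -> "ToyRDD":
        # Step 1: build an input RDD from external data.
        items = list(data)
        return ToyRDD(lambda: iter(items))

    def _iterate(self) -> Iterator:
        if self._cache is not None:
            return iter(self._cache)         # reuse the persisted result
        if self._persist:
            self._cache = list(self._compute())
            return iter(self._cache)
        return self._compute()               # recompute from the recipe

    # Step 2: transformations return a new ToyRDD and run nothing yet.
    def filter(self, pred: Callable) -> "ToyRDD":
        return ToyRDD(lambda: (x for x in self._iterate() if pred(x)))

    def map(self, fn: Callable) -> "ToyRDD":
        return ToyRDD(lambda: (fn(x) for x in self._iterate()))

    # Step 3: mark this RDD so its result is kept after the first evaluation.
    def persist(self) -> "ToyRDD":
        self._persist = True
        return self

    # Step 4: actions force evaluation of the recorded pipeline.
    def count(self) -> int:
        return sum(1 for _ in self._iterate())

    def first(self):
        return next(self._iterate())


calls = []  # records every time the map function actually runs


def parse(line: str) -> int:
    calls.append(line)
    return int(line)


lines = ToyRDD.from_source(["3", "x", "10", "7"])   # step 1: input data
numbers = lines.filter(str.isdigit).map(parse)      # step 2: lazy transformations
numbers.persist()                                   # step 3: keep for reuse
evaluated_before_action = len(calls)                # 0: nothing has run yet

total = numbers.count()                             # step 4: action runs the pipeline
head = numbers.first()                              # served from the persisted result
print(evaluated_before_action, total, head, len(calls))  # prints: 0 3 3 3
```

In this sketch `parse` runs three times in total even though two actions are called, because the second action reads the persisted result instead of recomputing it. In real Spark the same idea applies across a cluster, with the storage level and partitioning handled by the framework.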