Representation of Data |
Spark Datasets combine features of DataFrames and RDDs, adding static type safety and an object-oriented interface. |
A Spark DataFrame is a distributed collection of data organized into named columns. |
Spark RDDs are distributed collections of data with no schema attached. |
Optimization |
Datasets use the Catalyst optimizer for optimization. |
DataFrames also use the Catalyst optimizer for optimization. |
RDDs have no built-in optimization engine. |
Schema Projection |
Datasets determine the schema automatically using the Spark SQL engine. |
DataFrames also determine the schema automatically. |
With RDDs, the schema must be defined manually. |
Aggregation Speed |
Dataset aggregation is faster than RDD aggregation but slower than DataFrame aggregation. |
Aggregations are fastest in DataFrames, whose simple, high-level aggregation APIs let the Catalyst optimizer plan the work. |
RDDs are slower than both DataFrames and Datasets, even for simple operations such as grouping data. |
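
The difference between schema-carrying and schema-less data in the comparison above can be sketched in plain Python. This is a toy model only: it does not use Spark, and the names `infer_schema`, `sum_by`, `named_rows`, `raw_rows`, and `manual_schema` are hypothetical helpers invented for illustration. It assumes that "named columns" behave like dictionary keys and that RDD-style records are bare tuples.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Any


def infer_schema(rows: list[dict[str, Any]]) -> dict[str, str]:
    """Toy schema inference: map each column name to the type name of its values."""
    schema: dict[str, str] = {}
    for row in rows:
        for column, value in row.items():
            type_name = type(value).__name__
            # Simplification: a column with mixed value types falls back to "str".
            if schema.setdefault(column, type_name) != type_name:
                schema[column] = "str"
    return schema


def sum_by(rows: list[dict[str, Any]], key: str, value: str) -> dict[Any, int]:
    """Group rows by one named column and sum another named column."""
    totals: dict[Any, int] = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


# DataFrame-like data: every record carries its column names,
# so a schema can be derived from the data itself.
named_rows = [
    {"city": "Oslo", "sales": 120},
    {"city": "Lima", "sales": 95},
    {"city": "Oslo", "sales": 30},
]

# RDD-like data: bare tuples. What each position means is known
# only to the caller, who must supply the schema by hand.
raw_rows = [("Oslo", 120), ("Lima", 95), ("Oslo", 30)]
manual_schema = ("city", "sales")

inferred = infer_schema(named_rows)
totals = sum_by(named_rows, key="city", value="sales")

# Attaching the manual schema turns the raw tuples into named rows.
rebuilt = [dict(zip(manual_schema, row)) for row in raw_rows]
```

Here the named rows can be aggregated by column name (`sum_by(named_rows, key="city", value="sales")`), while the raw tuples are only usable once `manual_schema` is attached. In Spark itself, that declarative, column-based form is what gives an optimizer something to work with, whereas functions applied to schema-less records are opaque to it.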