Spark - Interview Questions
Differentiate between Spark Datasets, DataFrames, and RDDs.
| Criteria | Spark Datasets | Spark DataFrames | Spark RDDs |
| --- | --- | --- | --- |
| Representation of Data | A Dataset combines features of DataFrames and RDDs, adding static type safety and an object-oriented interface. | A DataFrame is a distributed collection of data organized into named columns. | An RDD is a distributed collection of data without a schema. |
| Optimization | Datasets use the Catalyst optimizer. | DataFrames also use the Catalyst optimizer. | RDDs have no built-in optimization engine. |
| Schema Projection | Datasets determine the schema automatically through the Spark SQL engine. | DataFrames also determine the schema automatically. | The schema must be defined manually for RDDs. |
| Aggregation Speed | Dataset aggregation is generally faster than RDD aggregation but slower than DataFrame aggregation. | Aggregations are generally fastest with DataFrames, because their declarative API lets the Catalyst optimizer plan the aggregation. | RDDs are generally slower than both DataFrames and Datasets, even for simple operations such as grouping. |
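
To make the comparison above concrete, here is a minimal Scala sketch that runs the same per-region total through all three APIs. The `Sale` case class, the sample values, the application name, and the `local[*]` master are assumptions made up for illustration. The sketch is not a benchmark and says nothing about relative speed on its own.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

// Hypothetical record type used only for this illustration.
case class Sale(region: String, amount: Double)

object ApiComparison {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("api-comparison")
      .master("local[*]") // assumption: local run for experimentation
      .getOrCreate()
    import spark.implicits._ // encoders needed by toDF and toDS

    val data = Seq(Sale("east", 10.0), Sale("west", 5.0), Sale("east", 2.5))

    // RDD: a distributed collection of objects. Spark sees no schema,
    // so the grouping logic is written by hand and is not optimized.
    val rdd = spark.sparkContext.parallelize(data)
    val rddTotals = rdd.map(s => (s.region, s.amount)).reduceByKey(_ + _)

    // DataFrame: rows with named columns. The schema is derived from the
    // case class here, and the aggregation is planned by Catalyst.
    val df = data.toDF()
    df.printSchema()
    val dfTotals = df.groupBy("region").agg(sum("amount").as("total"))
    dfTotals.explain() // prints the physical plan for the aggregation

    // Dataset: a typed view of the same data. Field access such as
    // _.region is checked at compile time.
    val ds = data.toDS()
    val dsTotals = ds.groupByKey(_.region).mapValues(_.amount).reduceGroups(_ + _)

    rddTotals.collect().foreach(println)
    dfTotals.show()
    dsTotals.show()

    spark.stop()
  }
}
```

The typed Dataset version passes Scala lambdas to `groupByKey` and `reduceGroups`. Catalyst cannot inspect those lambdas, which is one commonly cited reason typed aggregations tend to trail the column-based DataFrame form.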