Spark - Interview Questions
What can you say about Spark Datasets?
Spark Datasets are the data structures of Spark SQL that provide JVM objects with all the benefits of RDDs (such as data manipulation using lambda functions) alongside the Spark SQL-optimised execution engine. They have been part of Spark since version 1.6.
 
* Spark Datasets are strongly typed structures that represent structured queries along with their encoders.

* They provide type safety for the data and offer an object-oriented programming interface.

* Datasets are more structured than RDDs and are lazily evaluated: a query is expressed as a lazy expression that only runs when an action is triggered. Datasets combine the powers of both RDDs and DataFrames. Internally, each Dataset represents a logical plan that describes the computation required to produce the data. When an action is invoked, the logical plan is analyzed and resolved, and a physical query plan is formed that performs the actual execution, as sketched below.
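
A minimal sketch of how this looks in practice (the spark-shell-style session, the Person case class, and the sample data below are illustrative assumptions, not part of the original answer):

// Assumes a spark-shell-like session where a SparkSession named `spark` already exists.
import spark.implicits._

case class Person(name: String, age: Int)

// A strongly typed Dataset of JVM objects, built from an in-memory collection.
val people = Seq(Person("Ann", 34), Person("Ben", 19)).toDS()

// RDD-style manipulation with a lambda; this only builds a lazy logical plan.
val adults = people.filter(p => p.age >= 21)

// The action triggers analysis of the logical plan, physical planning, and execution.
adults.show()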

Datasets have the following features:
 
Optimized Queries: Spark Datasets provide optimized queries through the Catalyst query optimizer and the Tungsten execution engine. Catalyst represents and manipulates a data-flow graph, i.e. a tree of expressions and relational operators, while Tungsten improves execution speed by optimizing Spark jobs for the hardware architecture of the execution platform.
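
As an illustration, the plans produced by Catalyst and Tungsten can be inspected with explain() (reusing the people Dataset from the sketch above):

// Prints the parsed and analyzed logical plans, the Catalyst-optimized logical
// plan, and the physical plan; in Spark 2.0+, Tungsten's whole-stage code
// generation appears as WholeStageCodegen stages in the physical plan.
people.filter(p => p.age >= 21).explain(true)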

Compile-Time Analysis: Datasets allow syntax and type errors to be caught at compile time, which is not possible with DataFrames or regular SQL queries, where such errors only surface at runtime.
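
For example (reusing the people Dataset from above; the misspelled field name is deliberate):

people.map(p => p.age + 1)       // type-checked at compile time
// people.map(p => p.agee)       // compile-time error: value agee is not a member of Person
// people.toDF().select("agee")  // compiles, but fails only at runtime with an AnalysisException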

Interconvertible: Type-safe Datasets can be converted to “untyped” DataFrames using the following methods provided by DatasetHolder:
toDS(): Dataset[T]
toDF(): DataFrame
toDF(colNames: String*): DataFrame
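
A short sketch of these conversions, assuming import spark.implicits._ is in scope (which provides the DatasetHolder implicits) and reusing the Person case class from above:

import org.apache.spark.sql.{DataFrame, Dataset}

// Collection -> typed Dataset, via the implicit DatasetHolder.
val ds: Dataset[Person] = Seq(Person("Ann", 34)).toDS()

// Typed Dataset -> untyped DataFrame (a DataFrame is just Dataset[Row]).
val df1: DataFrame = ds.toDF()

// The same conversion, renaming the columns on the way.
val df2: DataFrame = ds.toDF("person_name", "person_age")

// And back again: DataFrame -> Dataset[Person], checked against the case class.
val dsAgain: Dataset[Person] = df1.as[Person]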

Faster Computation: Dataset operations typically execute much faster than the equivalent RDD operations, which helps increase system performance.

Persistent Storage: Since Datasets are both queryable and serializable, they can easily be stored in any persistent store.
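
For instance, a Dataset can be serialized to Parquet and read back later (the path below is illustrative):

// Serialize the Dataset to a persistent store -- here, Parquet files on disk.
adults.write.mode("overwrite").parquet("/tmp/adults.parquet")

// Read it back and restore the typed view.
val restored = spark.read.parquet("/tmp/adults.parquet").as[Person]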

Less Memory Consumed: When a Dataset is cached, Spark creates a more optimal in-memory data layout, so less memory is consumed.
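
Caching is opt-in; a minimal sketch:

// cache() stores the Dataset in Spark's compact, compressed in-memory format,
// which typically takes less space than caching the equivalent RDD of Java objects.
people.cache()
people.count()      // the first action materializes the cache
people.unpersist()  // release the cached data when it is no longer needed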

Single Interface, Multiple Languages: A single API is provided for both Java and Scala, the languages most widely used with Apache Spark. This reduces the burden of maintaining different libraries for different languages.