Apache Spark - Quiz (MCQ)
A)
2007
B)
2008
C)
2009
D)
2010

Correct Answer :   2009


Explanation : Spark is one of Hadoop's sub-projects, developed in 2009 in UC Berkeley's AMPLab by Matei Zaharia.

A)
RDDs
B)
Spark SQL
C)
Spark Streaming
D)
None of the above

Correct Answer :   Spark SQL


Explanation : Spark SQL introduces a new data abstraction called SchemaRDD, which provides support for structured and semi-structured data.

A)
Spark SQL
B)
Standalone
C)
Hadoop Yarn
D)
Spark in MapReduce

Correct Answer :   Spark SQL


Explanation : There are three ways of Spark deployment: Standalone, Hadoop Yarn, and Spark in MapReduce (SIMR).

A)
RDD
B)
Dataset
C)
DataFrame
D)
All of the above

Correct Answer :   DataFrame

A)
MLlib
B)
GraphX
C)
Spark R
D)
Spark SQL

Correct Answer :   Spark SQL

A)
Speed
B)
Advanced Analytics
C)
Supports multiple languages
D)
All of the above

Correct Answer :   All of the above


Explanation : Apache Spark has the following features: speed, support for multiple languages, and advanced analytics.

A)
2
B)
3
C)
4
D)
5

Correct Answer :   2


Explanation : Spark uses Hadoop in two ways: one is storage and the second is processing.

A)
MLlib
B)
GraphX
C)
Spark Streaming
D)
None of the above

Correct Answer :   GraphX


Explanation : GraphX started initially as a research project at UC Berkeley AMPLab and Databricks, and was later donated to the Spark project.

A)
Java
B)
Scala
C)
Pascal
D)
Python

Correct Answer :   Pascal


Explanation : Spark does not provide an API in Pascal; its high-level APIs are available in Java, Scala, Python, and R.

A)
Driver nodes
B)
Cluster manager
C)
Executor Nodes
D)
None of the above

Correct Answer :   Executor Nodes

A)
RIS
B)
SIR
C)
SIM
D)
SIMR

Correct Answer :   SIMR


Explanation : With SIMR, users can start experimenting with Spark and use its shell within a couple of minutes after downloading it.

A)
Virtual
B)
Structured
C)
Real-time
D)
All of the above

Correct Answer :   Real-time


Explanation : Spark is best suited for real-time data whereas Hadoop is best suited for structured data.

A)
True
B)
False
C)
None of the above
D)
--

Correct Answer :   True

A)
Execution
B)
Physical planning
C)
Analysis
D)
Logical Optimization

Correct Answer :   Execution

A)
True
B)
False
C)
None of the above
D)
--

Correct Answer :   False

A)
We can build DataFrames from different data sources: structured data files, tables in Hive
B)
The Application Programming Interfaces (APIs) of DataFrame are available in various languages
C)
In both Scala and Java, we represent a DataFrame as a Dataset of rows.
D)
DataFrame in Apache Spark is behind RDD

Correct Answer :   DataFrame in Apache Spark is behind RDD

A)
Tables in Hive
B)
External databases
C)
Structured data files
D)
All of the above

Correct Answer :   All of the above

A)
RDD
B)
Dataset
C)
DataFrame
D)
None of the above

Correct Answer :   RDD

A)
Yes
B)
No
C)
None of the above
D)
--

Correct Answer :   No

A)
R
B)
Java
C)
Scala
D)
Python

Correct Answer :   Scala

A)
Flume
B)
Kafka
C)
Kinesis
D)
All of the above

Correct Answer :   All of the above

A)
10 times faster
B)
100 times faster
C)
200 times faster
D)
300 times faster

Correct Answer :   100 times faster

A)
Decision Trees
B)
Naive Bayes
C)
Random Forests
D)
Logistic Regression

Correct Answer :   Decision Trees

A)
Decision Trees
B)
Ridge Regression
C)
Logistic Regression
D)
Gradient-Boosted Trees

Correct Answer :   Logistic Regression

A)
DataFrames provide a more user-friendly API than RDDs.
B)
The DataFrame API has provision for compile-time type safety
C)
Both (A) and (B)
D)
None of the above

Correct Answer :   DataFrames provide a more user-friendly API than RDDs.

A)
Creates one or many new RDDs
B)
The ways to send results from executors to the driver
C)
Takes RDD as input and produces one or more RDDs as output.
D)
All of the above

Correct Answer :   The ways to send results from executors to the driver

A)
The data required to compute resides on multiple partitions.
B)
The data required to compute resides on the single partition.
C)
Both (A) and (B)
D)
None of the above

Correct Answer :   The data required to compute resides on the single partition.

A)
A tree contains a node object.
B)
A tree is the main data type in the catalyst.
C)
New nodes are defined as subclasses of TreeNode class.
D)
All of the above

Correct Answer :   All of the above

A)
We can manipulate tree using rules.
B)
We can define rules as a function from one tree to another tree.
C)
Using rules, we apply pattern matching that maps each matched pattern to a result.
D)
All of the above

Correct Answer :   All of the above
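The statements above can be sketched in miniature: a toy tree of node objects plus one rewrite rule, in the spirit of Catalyst but not using Spark's real classes (the `Add`/`Literal` names here are made up for illustration).

```python
# Toy analogue of Catalyst's tree-plus-rules model (not Spark's API):
# nodes form a tree, and a rule is a function from one tree to another
# that pattern-matches nodes and rewrites the ones it recognizes.

class Add:
    def __init__(self, left, right):
        self.left, self.right = left, right

class Literal:
    def __init__(self, value):
        self.value = value

def constant_fold(node):
    """Rule: rewrite Add(Literal, Literal) into a single Literal."""
    if isinstance(node, Add):
        left, right = constant_fold(node.left), constant_fold(node.right)
        if isinstance(left, Literal) and isinstance(right, Literal):
            return Literal(left.value + right.value)
        return Add(left, right)
    return node

tree = Add(Literal(1), Add(Literal(2), Literal(3)))
folded = constant_fold(tree)
print(folded.value)  # the whole constant subtree collapses to 6
```

Constant folding is one of the optimizations named later in this quiz; a real optimizer applies many such tree-to-tree rules until the plan stops changing.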

30 .
Which of the following organizes data into named columns?

1) RDD
 
2) DataFrame
 
3) Dataset
A)
Both 1 and 2
B)
Both 2 and 3
C)
Both 1 and 3
D)
None of the above

Correct Answer :   Both 2 and 3

A)
RDD
B)
Dataset
C)
DataFrame
D)
None of the above

Correct Answer :   Dataset

A)
Java
B)
Scala
C)
Python
D)
All of the above

Correct Answer :   All of the above

A)
Sqoop
B)
MLlib
C)
GraphX
D)
BlinkDB

Correct Answer :   Sqoop

A)
RDD
B)
Dstream
C)
Shared Variable
D)
None of the above

Correct Answer :   Dstream
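A DStream is essentially a sequence of RDDs, one per batch interval. A plain-Python sketch of that micro-batching idea, with no Spark required (the batch size and event values are arbitrary choices for illustration):

```python
def micro_batches(stream, batch_size):
    """Group an incoming stream of records into fixed-size batches,
    mimicking how a DStream is a sequence of per-interval RDDs."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly partial batch
        yield batch

events = [1, 2, 3, 4, 5, 6, 7]
print(list(micro_batches(events, 3)))  # [[1, 2, 3], [4, 5, 6], [7]]
```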

A)
Streaming KMeans
B)
Streaming Linear Regression
C)
Tanimoto distance
D)
None of the above

Correct Answer :   Tanimoto distance

A)
RDD
B)
GraphX
C)
Spark R
D)
Spark Streaming

Correct Answer :   Spark Streaming

A)
RDDs are immutable and fault-tolerant
B)
Support for different language APIs like Scala, Java, Python and R
C)
DAG execution engine and in-memory computation
D)
None of the above

Correct Answer :   DAG execution engine and in-memory computation

A)
Pipelines
B)
Persistence
C)
Utilities like linear algebra, statistics
D)
All of the above

Correct Answer :   All of the above

A)
Spark is an open source framework which is written in Java
B)
Spark is 100 times faster than Bigdata Hadoop
C)
It provides high-level API in Java, Python, R, Scala
D)
It can be integrated with Hadoop and can process existing Hadoop HDFS data

Correct Answer :   Spark is an open source framework which is written in Java

A)
It is the kernel of Spark
B)
It enables users to run SQL / HQL queries on top of Spark.
C)
Provides an execution platform for all the Spark applications
D)
Enables powerful interactive and data analytics applications across live streaming data

Correct Answer :   It enables users to run SQL / HQL queries on top of Spark.

A)
It is the kernel of Spark
B)
It enables users to run SQL / HQL queries on top of Spark.
C)
Improves the performance of iterative algorithm drastically.
D)
It is the scalable machine learning library which delivers efficiencies

Correct Answer :   It is the kernel of Spark

A)
DAG
B)
Lazy-evaluation
C)
In-memory processing
D)
All of the above

Correct Answer :   All of the above

A)
Scheduling
B)
Monitoring data across a cluster
C)
Distributing data across a cluster
D)
All of the above

Correct Answer :   All of the above

A)
True
B)
False
C)
None of the above
D)
--

Correct Answer :   True

A)
SparkContext
B)
SparkSession
C)
Both (A) and (B)
D)
None of the above

Correct Answer :   SparkSession

A)
Apache Spark
B)
Apache Flink
C)
Apache Hadoop
D)
All of the above

Correct Answer :   All of the above

A)
Constant folding
B)
Predicate pushdown
C)
Abstract syntax tree
D)
Projection pruning

Correct Answer :   Abstract syntax tree

48 .
In the analysis phase, which is the correct order of execution after forming the unresolved logical plan?

a) Search relation by name from catalog.
 
b) Determine which attributes match to the same value to give them a unique ID.
 
c) Map the name attribute
 
d) Propagate and push type through expressions
A)
abcd
B)
acbd
C)
adbc
D)
dcab

Correct Answer :   acbd

A)
True
B)
False
C)
None of the above
D)
--

Correct Answer :   True

A)
Yes
B)
No
C)
None of the above
D)
--

Correct Answer :   No

A)
RDD
B)
Dataset
C)
DataFrame
D)
All of the above

Correct Answer :   RDD

A)
Apache Spark
B)
Apache Hadoop
C)
Apache Flink
D)
None of the above

Correct Answer :   Apache Flink

A)
To simplify working with structured data it provides DataFrame abstraction in Python, Java, and Scala.
B)
The best way to use Spark SQL is inside a Spark application. This empowers us to load data and query it with SQL.
C)
The data can be read and written in a variety of structured formats. For example, JSON, Hive Tables, and Parquet.
D)
Using SQL we can query data only from inside a Spark program and not from external tools.

Correct Answer :   Using SQL we can query data only from inside a Spark program and not from external tools.

A)
DataSet
B)
DataFrame
C)
Either DataFrame or Dataset
D)
Neither DataFrame nor Dataset

Correct Answer :   Either DataFrame or Dataset

A)
Scala and R
B)
Java and Scala
C)
Scala and Python
D)
Java, Scala and python

Correct Answer :   Java and Scala

A)
The optimizer helps us to run queries much faster than their RDD counterparts.
B)
The optimizer helps us to run queries a little faster than their RDD counterparts.
C)
The optimizer helps us to run queries at the same speed as their RDD counterparts.
D)
None of the above

Correct Answer :   The optimizer helps us to run queries much faster than their RDD counterparts.

A)
It executes SQL queries.
B)
We can read data from existing Hive installation using SparkSQL.
C)
When we run SQL within another programming language we will get the result as Dataset/DataFrame.
D)
All of the above

Correct Answer :   All of the above

A)
Resilient
B)
In-memory
C)
Immutability
D)
All of the above

Correct Answer :   All of the above

A)
Map transforms an RDD of length N into another RDD of length N.
B)
It applies to each element of the RDD and returns the result as a new RDD.
C)
Map allows returning 0, 1 or more elements from the map function.
D)
In the Map operation the developer can define custom business logic.

Correct Answer :   Map allows returning 0, 1 or more elements from the map function.
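The distinction above can be shown with plain Python lists (an analogue, not Spark's API): map produces exactly one output per input, so an RDD of length N maps to an RDD of length N, while flatMap may return 0, 1, or more elements per input.

```python
data = ["a b", "c", ""]

# map-like: one result per element, output length equals input length
mapped = [s.split() for s in data]

# flatMap-like: 0..n results per element, output length can change
flat_mapped = [w for s in data for w in s.split()]

print(mapped)        # [['a', 'b'], ['c'], []]  -> still length 3
print(flat_mapped)   # ['a', 'b', 'c']          -> length changed
```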

A)
top()
B)
take(n)
C)
countByValue()
D)
mapPartitionWithIndex()

Correct Answer :   mapPartitionWithIndex()

A)
Distinct()
B)
CountByValue()
C)
Union(dataset)
D)
Intersection(other-dataset)

Correct Answer :   CountByValue()
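countByValue() returns the number of occurrences of each distinct element as a map; the stdlib `Counter` is a plain-Python analogue of that result (the sample data here is made up):

```python
from collections import Counter

rdd_like = ["a", "b", "a", "c", "a", "b"]

# Analogue of rdd.countByValue(): each distinct value -> its count
counts = dict(Counter(rdd_like))
print(counts)  # {'a': 3, 'b': 2, 'c': 1}
```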

A)
foreach()
B)
top()
C)
collect()
D)
countByValue()

Correct Answer :   foreach()

A)
Windowed operations and updateStateByKey() are two types of stateless transformation.
B)
The processing of each batch has no dependency on the data of previous batches.
C)
Uses data or intermediate results from previous batches and computes the result of the current batch.
D)
None of the above

Correct Answer :   The processing of each batch has no dependency on the data of previous batches.

A)
Stateful transformations are simple RDD transformations.
B)
The processing of each batch has no dependency on the data of previous batches.
C)
Uses data or intermediate results from previous batches and computes the result of the current batch.
D)
None of the above

Correct Answer :   Uses data or intermediate results from previous batches and computes the result of the current batch.
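A stateful transformation such as updateStateByKey() carries intermediate results from batch to batch. A plain-Python running-count sketch of that idea (the batch contents are hypothetical, and this is an analogy, not Spark's API):

```python
def update_state(state, batches):
    """Fold each batch into a running per-key count, mimicking how
    updateStateByKey() carries state across micro-batches."""
    for batch in batches:
        for key in batch:
            state[key] = state.get(key, 0) + 1
    return state

batches = [["err", "ok"], ["err"], ["ok", "err"]]
print(update_state({}, batches))  # {'err': 3, 'ok': 2}
```

Each batch's result depends on the accumulated state from previous batches, which is exactly what distinguishes it from the stateless case above.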

A)
It is the scalable machine learning library which delivers efficiencies
B)
Provides an execution platform for all the Spark applications
C)
Enables powerful interactive and data analytics applications across live streaming data
D)
All of the above

Correct Answer :   It is the scalable machine learning library which delivers efficiencies

A)
Iterative algorithms
B)
Interactive data mining tools
C)
Both (A) and (B)
D)
None of the above

Correct Answer :   Both (A) and (B)

A)
Upon action
B)
Upon transformation
C)
On both transformation and action
D)
None of the above

Correct Answer :   Upon action
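Transformations are lazy: nothing is computed until an action forces evaluation. A generator-based sketch of the same idea in plain Python (an analogy only, not Spark internals):

```python
log = []

def transform(data):
    # Lazily "transform" each element; nothing runs until consumed.
    for x in data:
        log.append(x)   # record that work actually happened
        yield x * 2

pipeline = transform([1, 2, 3])   # like a transformation: no work yet
assert log == []                  # still nothing computed

result = list(pipeline)           # like an action: forces evaluation
print(result)  # [2, 4, 6]
```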

A)
Fine-grained
B)
Coarse-grained
C)
Neither fine-grained nor coarse-grained
D)
Either fine-grained or coarse-grained

Correct Answer :   Either fine-grained or coarse-grained

A)
Fine-grained
B)
Coarse-grained
C)
Either fine-grained or coarse-grained
D)
Neither fine-grained nor coarse-grained

Correct Answer :   Coarse-grained

A)
Lazy-evaluation
B)
Immutable nature of RDD
C)
DAG (Directed Acyclic Graph)
D)
None of the above

Correct Answer :   DAG (Directed Acyclic Graph)
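Lineage-based recovery can be sketched as replaying the recorded chain of transformations against the source data. This toy version (not Spark internals; the sample data and steps are made up) shows why an immutable source plus a DAG of transformations is enough to rebuild lost results:

```python
source = [1, 2, 3, 4]

# The recorded lineage: an ordered chain of transformations
lineage = [lambda xs: [x + 1 for x in xs],
           lambda xs: [x * 10 for x in xs]]

def recompute(source, lineage):
    """Rebuild a lost result by replaying the lineage from the source."""
    data = source
    for step in lineage:
        data = step(data)
    return data

print(recompute(source, lineage))  # [20, 30, 40, 50]
```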

A)
Returns final result of RDD computations.
B)
The ways to send results from executors to the driver
C)
Takes RDD as input and produces one or more RDDs as output.
D)
None of the above

Correct Answer :   Takes RDD as input and produces one or more RDDs as output.

A)
Only one
B)
More than one
C)
Not specific
D)
None of the above

Correct Answer :   Only one

A)
2
B)
3
C)
4
D)
5

Correct Answer :   3

A)
Rscript
B)
R Shell
C)
RStudio
D)
All of the above

Correct Answer :   All of the above

A)
Fault-tolerance
B)
It is cost efficient
C)
Supports in-memory computation
D)
Compatible with other file storage system

Correct Answer :   It is cost efficient

A)
State size, window length
B)
State size, sliding interval
C)
Window length, sliding interval
D)
None of the above

Correct Answer :   Window length, sliding interval
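The two parameters can be illustrated with a list-based sketch: the window length is how many batches each window spans, and the sliding interval is how far the window moves each step (plain Python, not Spark's windowing API; batch contents are made up):

```python
def windows(batches, window_length, sliding_interval):
    """Return the flattened contents of each window position."""
    out = []
    for start in range(0, len(batches) - window_length + 1, sliding_interval):
        window = batches[start:start + window_length]
        out.append(sum(window, []))  # flatten the batches in this window
    return out

batches = [[1], [2], [3], [4], [5]]
print(windows(batches, window_length=3, sliding_interval=2))
# [[1, 2, 3], [3, 4, 5]]
```

With a window length of 3 batches and a sliding interval of 2, consecutive windows overlap by one batch, which is why windowed results can count the same record more than once.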

A)
ForeachRDD
B)
SaveAsTextFiles
C)
SaveAsHadoopFiles
D)
ReduceByKeyAndWindow

Correct Answer :   ReduceByKeyAndWindow

A)
YARN
B)
MESOS
C)
Standalone Cluster Manager
D)
All of the above

Correct Answer :   All of the above

A)
MEMORY_ONLY
B)
DISK_ONLY
C)
MEMORY_AND_DISK
D)
MEMORY_ONLY_SER

Correct Answer :   MEMORY_ONLY

A)
Both are data processing platforms
B)
Both are cluster computing environments
C)
Both have their own file system
D)
Both use open source APIs to link between different tools

Correct Answer :   Both have their own file system