Apache Spark - Quiz (MCQ)
A)
2007
B)
2008
C)
2009
D)
2010

Correct Answer :   2009


Explanation : Spark is one of Hadoop's sub-projects, developed in 2009 in UC Berkeley's AMPLab by Matei Zaharia.

A)
RDDs
B)
Spark SQL
C)
Spark Streaming
D)
None of the above

Correct Answer :   Spark SQL


Explanation : Spark SQL introduces a new data abstraction called SchemaRDD, which provides support for structured and semi-structured data.

A)
Spark SQL
B)
Standalone
C)
Hadoop Yarn
D)
Spark in MapReduce

Correct Answer :   Spark SQL


Explanation : There are three ways of Spark deployment: Standalone, Hadoop Yarn, and Spark in MapReduce (SIMR).

A)
RDD
B)
Dataset
C)
DataFrame
D)
All of the above

Correct Answer :   DataFrame

A)
MLlib
B)
GraphX
C)
Spark R
D)
Spark SQL

Correct Answer :   Spark SQL

A)
Speed
B)
Advanced Analytics
C)
Supports multiple languages
D)
All of the above

Correct Answer :   All of the above


Explanation : Apache Spark has the following features: speed, support for multiple languages, and advanced analytics.

A)
2
B)
3
C)
4
D)
5

Correct Answer :   2


Explanation : Spark uses Hadoop in two ways: one is storage and the second is processing.

A)
MLlib
B)
GraphX
C)
Spark Streaming
D)
None of the above

Correct Answer :   GraphX


Explanation : GraphX started initially as a research project at UC Berkeley AMPLab and Databricks, and was later donated to the Spark project.

A)
Java
B)
Scala
C)
Pascal
D)
Python

Correct Answer :   Pascal


Explanation : Spark does not provide an API in Pascal; its high-level APIs are available in Java, Scala, Python, and R.

A)
Driver nodes
B)
Cluster manager
C)
Executor Nodes
D)
None of the above

Correct Answer :   Executor Nodes

A)
RIS
B)
SIR
C)
SIM
D)
SIMR

Correct Answer :   SIMR


Explanation : With SIMR, users can start experimenting with Spark and use its shell within a couple of minutes after downloading it.

A)
Virtual
B)
Structured
C)
Real-time
D)
All of the above

Correct Answer :   Real-time


Explanation : Spark is best suited for real-time data whereas Hadoop is best suited for structured data.

A)
True
B)
False
C)
None of the above
D)
--

Correct Answer :   True

A)
Execution
B)
Physical planning
C)
Analysis
D)
Logical Optimization

Correct Answer :   Execution

A)
True
B)
False
C)
None of the above
D)
--

Correct Answer :   False

A)
We can build DataFrames from different data sources: structured data files, tables in Hive
B)
The Application Programming Interfaces (APIs) of DataFrame are available in various languages
C)
In both Scala and Java, we represent a DataFrame as a Dataset of rows.
D)
DataFrame in Apache Spark is behind RDD

Correct Answer :   DataFrame in Apache Spark is behind RDD

A)
Tables in Hive
B)
External databases
C)
Structured data files
D)
All of the above

Correct Answer :   All of the above

A)
RDD
B)
Dataset
C)
DataFrame
D)
None of the above

Correct Answer :   RDD

A)
Yes
B)
No
C)
None of the above
D)
--

Correct Answer :   No

A)
R
B)
Java
C)
Scala
D)
Python

Correct Answer :   Scala

A)
Flume
B)
Kafka
C)
Kinesis
D)
All of the above

Correct Answer :   All of the above

A)
10 times faster
B)
100 times faster
C)
200 times faster
D)
300 times faster

Correct Answer :   100 times faster

A)
Decision Trees
B)
Naive Bayes
C)
Random Forests
D)
Logistic Regression

Correct Answer :   Decision Trees

A)
Decision Trees
B)
Ridge Regression
C)
Logistic Regression
D)
Gradient-Boosted Trees

Correct Answer :   Logistic Regression

A)
DataFrames provide a more user-friendly API than RDDs.
B)
The DataFrame API has provision for compile-time type safety
C)
Both (A) and (B)
D)
None of the above

Correct Answer :   DataFrames provide a more user-friendly API than RDDs.

A)
Creates one or many new RDDs
B)
The ways to send results from executors to the driver
C)
Takes RDD as input and produces one or more RDDs as output.
D)
All of the above

Correct Answer :   The ways to send results from executors to the driver

A)
The data required to compute resides on multiple partitions.
B)
The data required to compute resides on the single partition.
C)
Both (A) and (B)
D)
None of the above

Correct Answer :   The data required to compute resides on the single partition.

A)
A tree contains a node object.
B)
A tree is the main data type in the catalyst.
C)
New nodes are defined as subclasses of TreeNode class.
D)
All of the above

Correct Answer :   All of the above

A)
We can manipulate tree using rules.
B)
We can define rules as a function from one tree to another tree.
C)
Using rules, we apply pattern matching that maps each matched pattern to a result.
D)
All of the above

Correct Answer :   All of the above
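The statements above can be sketched in miniature: a toy tree of node objects plus one rewrite rule, in the spirit of Catalyst but not using Spark's real classes (the `Add`/`Literal` names here are made up for illustration).

```python
# Toy analogue of Catalyst's tree-plus-rules model (not Spark's API):
# nodes form a tree, and a rule is a function from one tree to another
# that pattern-matches nodes and rewrites the ones it recognizes.

class Add:
    def __init__(self, left, right):
        self.left, self.right = left, right

class Literal:
    def __init__(self, value):
        self.value = value

def constant_fold(node):
    """Rule: rewrite Add(Literal, Literal) into a single Literal."""
    if isinstance(node, Add):
        left, right = constant_fold(node.left), constant_fold(node.right)
        if isinstance(left, Literal) and isinstance(right, Literal):
            return Literal(left.value + right.value)
        return Add(left, right)
    return node

tree = Add(Literal(1), Add(Literal(2), Literal(3)))
folded = constant_fold(tree)
print(folded.value)  # the whole constant subtree collapses to 6
```

Constant folding is one of the optimizations named later in this quiz; a real optimizer applies many such tree-to-tree rules until the plan stops changing.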

30 .
Which of the following organizes data into named columns?

1) RDD
 
2) DataFrame
 
3) Dataset
A)
Both 1 and 2
B)
Both 2 and 3
C)
Both 1 and 3
D)
None of the above

Correct Answer :   Both 2 and 3

A)
RDD
B)
Dataset
C)
DataFrame
D)
None of the above

Correct Answer :   Dataset

A)
Java
B)
Scala
C)
Python
D)
All of the above

Correct Answer :   All of the above

A)
Sqoop
B)
MLlib
C)
GraphX
D)
BlinkDB

Correct Answer :   Sqoop

A)
RDD
B)
Dstream
C)
Shared Variable
D)
None of the above

Correct Answer :   Dstream
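A DStream is essentially a sequence of RDDs, one per batch interval. A plain-Python sketch of that micro-batching idea, with no Spark required (the batch size and event values are arbitrary choices for illustration):

```python
def micro_batches(stream, batch_size):
    """Group an incoming stream of records into fixed-size batches,
    mimicking how a DStream is a sequence of per-interval RDDs."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly partial batch
        yield batch

events = [1, 2, 3, 4, 5, 6, 7]
print(list(micro_batches(events, 3)))  # [[1, 2, 3], [4, 5, 6], [7]]
```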

A)
Streaming KMeans
B)
Streaming Linear Regression
C)
Tanimoto distance
D)
None of the above

Correct Answer :   Tanimoto distance

A)
RDD
B)
GraphX
C)
Spark R
D)
Spark Streaming

Correct Answer :   Spark Streaming

A)
RDDs are immutable and fault-tolerant
B)
Support for different language APIs like Scala, Java, Python and R
C)
DAG execution engine and in-memory computation
D)
None of the above

Correct Answer :   DAG execution engine and in-memory computation

A)
Pipelines
B)
Persistence
C)
Utilities like linear algebra, statistics
D)
All of the above

Correct Answer :   All of the above

A)
Spark is an open source framework which is written in Java
B)
Spark is 100 times faster than Bigdata Hadoop
C)
It provides high-level API in Java, Python, R, Scala
D)
It can be integrated with Hadoop and can process existing Hadoop HDFS data

Correct Answer :   Spark is an open source framework which is written in Java

A)
It is the kernel of Spark
B)
It enables users to run SQL / HQL queries on top of Spark.
C)
Provides an execution platform for all the Spark applications
D)
Enables powerful interactive and data analytics applications across live streaming data

Correct Answer :   It enables users to run SQL / HQL queries on top of Spark.

A)
It is the kernel of Spark
B)
It enables users to run SQL / HQL queries on top of Spark.
C)
Improves the performance of iterative algorithm drastically.
D)
It is the scalable machine learning library which delivers efficiencies

Correct Answer :   It is the kernel of Spark

A)
DAG
B)
Lazy-evaluation
C)
In-memory processing
D)
All of the above

Correct Answer :   All of the above

A)
Scheduling
B)
Monitoring data across a cluster
C)
Distributing data across a cluster
D)
All of the above

Correct Answer :   All of the above

A)
True
B)
False
C)
None of the above
D)
--

Correct Answer :   True

A)
SparkContext
B)
SparkSession
C)
Both (A) and (B)
D)
None of the above

Correct Answer :   SparkSession

A)
Apache Spark
B)
Apache Flink
C)
Apache Hadoop
D)
All of the above

Correct Answer :   All of the above

A)
Constant folding
B)
Predicate pushdown
C)
Abstract syntax tree
D)
Projection pruning

Correct Answer :   Abstract syntax tree

48 .
In the analysis phase, which is the correct order of execution after forming the unresolved logical plan?

a) Search relation by name from catalog.
 
b) Determine which attributes match to the same value to give them a unique ID.
 
c) Map the name attribute
 
d) Propagate and push type through expressions
A)
abcd
B)
acbd
C)
adbc
D)
dcab

Correct Answer :   acbd

A)
True
B)
False
C)
None of the above
D)
--

Correct Answer :   True

A)
Yes
B)
No
C)
None of the above
D)
--

Correct Answer :   No

A)
RDD
B)
Dataset
C)
DataFrame
D)
All of the above

Correct Answer :   RDD

A)
Apache Spark
B)
Apache Hadoop
C)
Apache Flink
D)
None of the above

Correct Answer :   Apache Flink

A)
To simplify working with structured data it provides DataFrame abstraction in Python, Java, and Scala.
B)
The best way to use Spark SQL is inside a Spark application. This empowers us to load data and query it with SQL.
C)
The data can be read and written in a variety of structured formats. For example, JSON, Hive Tables, and Parquet.
D)
Using SQL we can query data only from inside a Spark program and not from external tools.

Correct Answer :   Using SQL we can query data only from inside a Spark program and not from external tools.

A)
DataSet
B)
DataFrame
C)
Either DataFrame or Dataset
D)
Neither DataFrame nor Dataset

Correct Answer :   Either DataFrame or Dataset

A)
Scala and R
B)
Java and Scala
C)
Scala and Python
D)
Java, Scala and python

Correct Answer :   Java and Scala

A)
The optimizer helps us to run queries much faster than their RDD counterparts.
B)
The optimizer helps us to run queries a little faster than their RDD counterparts.
C)
The optimizer helps us to run queries at the same speed as their RDD counterparts.
D)
None of the above

Correct Answer :   The optimizer helps us to run queries much faster than their RDD counterparts.

A)
It executes SQL queries.
B)
We can read data from existing Hive installation using SparkSQL.
C)
When we run SQL within another programming language we will get the result as Dataset/DataFrame.
D)
All of the above

Correct Answer :   All of the above

A)
Resilient
B)
In-memory
C)
Immutability
D)
All of the above

Correct Answer :   All of the above

A)
Map transforms an RDD of length N into another RDD of length N.
B)
It applies to each element of the RDD and returns the result as a new RDD.
C)
Map allows returning 0, 1 or more elements from the map function.
D)
In the Map operation the developer can define custom business logic.

Correct Answer :   Map allows returning 0, 1 or more elements from the map function.
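The distinction above can be shown with plain Python lists (an analogue, not Spark's API): map produces exactly one output per input, so an RDD of length N maps to an RDD of length N, while flatMap may return 0, 1, or more elements per input.

```python
data = ["a b", "c", ""]

# map-like: one result per element, output length equals input length
mapped = [s.split() for s in data]

# flatMap-like: 0..n results per element, output length can change
flat_mapped = [w for s in data for w in s.split()]

print(mapped)        # [['a', 'b'], ['c'], []]  -> still length 3
print(flat_mapped)   # ['a', 'b', 'c']          -> length changed
```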

A)
top()
B)
take(n)
C)
countByValue()
D)
mapPartitionWithIndex()

Correct Answer :   mapPartitionWithIndex()

A)
Distinct()
B)
CountByValue()
C)
Union(dataset)
D)
Intersection(other-dataset)

Correct Answer :   CountByValue()
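countByValue() returns the number of occurrences of each distinct element as a map; the stdlib `Counter` is a plain-Python analogue of that result (the sample data here is made up):

```python
from collections import Counter

rdd_like = ["a", "b", "a", "c", "a", "b"]

# Analogue of rdd.countByValue(): each distinct value -> its count
counts = dict(Counter(rdd_like))
print(counts)  # {'a': 3, 'b': 2, 'c': 1}
```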

A)
foreach()
B)
top()
C)
collect()
D)
countByValue()

Correct Answer :   foreach()

A)
Windowed operations and updateStateByKey() are two types of stateless transformation.
B)
The processing of each batch has no dependency on the data of previous batches.
C)
Uses data or intermediate results from previous batches and computes the result of the current batch.
D)
None of the above

Correct Answer :   The processing of each batch has no dependency on the data of previous batches.

A)
Stateful transformations are simple RDD transformations.
B)
The processing of each batch has no dependency on the data of previous batches.
C)
Uses data or intermediate results from previous batches and computes the result of the current batch.
D)
None of the above

Correct Answer :   Uses data or intermediate results from previous batches and computes the result of the current batch.
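A stateful transformation such as updateStateByKey() carries intermediate results from batch to batch. A plain-Python running-count sketch of that idea (the batch contents are hypothetical, and this is an analogy, not Spark's API):

```python
def update_state(state, batches):
    """Fold each batch into a running per-key count, mimicking how
    updateStateByKey() carries state across micro-batches."""
    for batch in batches:
        for key in batch:
            state[key] = state.get(key, 0) + 1
    return state

batches = [["err", "ok"], ["err"], ["ok", "err"]]
print(update_state({}, batches))  # {'err': 3, 'ok': 2}
```

Each batch's result depends on the accumulated state from previous batches, which is exactly what distinguishes it from the stateless case above.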

A)
It is the scalable machine learning library which delivers efficiencies
B)
Provides an execution platform for all the Spark applications
C)
Enables powerful interactive and data analytics applications across live streaming data
D)
All of the above

Correct Answer :   It is the scalable machine learning library which delivers efficiencies

A)
Iterative algorithms
B)
Interactive data mining tools
C)
Both (A) and (B)
D)
None of the above

Correct Answer :   Both (A) and (B)

A)
Upon action
B)
Upon transformation
C)
On both transformation and action
D)
None of the above

Correct Answer :   Upon action
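Transformations are lazy: nothing is computed until an action forces evaluation. A generator-based sketch of the same idea in plain Python (an analogy only, not Spark internals):

```python
log = []

def transform(data):
    # Lazily "transform" each element; nothing runs until consumed.
    for x in data:
        log.append(x)   # record that work actually happened
        yield x * 2

pipeline = transform([1, 2, 3])   # like a transformation: no work yet
assert log == []                  # still nothing computed

result = list(pipeline)           # like an action: forces evaluation
print(result)  # [2, 4, 6]
```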

A)
Fine-grained
B)
Coarse-grained
C)
Neither fine-grained nor coarse-grained
D)
Either fine-grained or coarse-grained

Correct Answer :   Either fine-grained or coarse-grained

A)
Fine-grained
B)
Coarse-grained
C)
Either fine-grained or coarse-grained
D)
Neither fine-grained nor coarse-grained

Correct Answer :   Coarse-grained

A)
Lazy-evaluation
B)
Immutable nature of RDD
C)
DAG (Directed Acyclic Graph)
D)
None of the above

Correct Answer :   DAG (Directed Acyclic Graph)
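Lineage-based recovery can be sketched as replaying the recorded chain of transformations against the source data. This toy version (not Spark internals; the sample data and steps are made up) shows why an immutable source plus a DAG of transformations is enough to rebuild lost results:

```python
source = [1, 2, 3, 4]

# The recorded lineage: an ordered chain of transformations
lineage = [lambda xs: [x + 1 for x in xs],
           lambda xs: [x * 10 for x in xs]]

def recompute(source, lineage):
    """Rebuild a lost result by replaying the lineage from the source."""
    data = source
    for step in lineage:
        data = step(data)
    return data

print(recompute(source, lineage))  # [20, 30, 40, 50]
```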

A)
Returns final result of RDD computations.
B)
The ways to send results from executors to the driver
C)
Takes RDD as input and produces one or more RDDs as output.
D)
None of the above

Correct Answer :   Takes RDD as input and produces one or more RDDs as output.

A)
Only one
B)
More than one
C)
Not specific
D)
None of the above

Correct Answer :   Only one

A)
2
B)
3
C)
4
D)
5

Correct Answer :   3

A)
Rscript
B)
R Shell
C)
RStudio
D)
All of the above

Correct Answer :   All of the above

A)
Fault-tolerance
B)
It is cost efficient
C)
Supports in-memory computation
D)
Compatible with other file storage system

Correct Answer :   It is cost efficient

A)
State size, window length
B)
State size, sliding interval
C)
Window length, sliding interval
D)
None of the above

Correct Answer :   Window length, sliding interval
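The two parameters can be illustrated with a list-based sketch: the window length is how many batches each window spans, and the sliding interval is how far the window moves each step (plain Python, not Spark's windowing API; batch contents are made up):

```python
def windows(batches, window_length, sliding_interval):
    """Return the flattened contents of each window position."""
    out = []
    for start in range(0, len(batches) - window_length + 1, sliding_interval):
        window = batches[start:start + window_length]
        out.append(sum(window, []))  # flatten the batches in this window
    return out

batches = [[1], [2], [3], [4], [5]]
print(windows(batches, window_length=3, sliding_interval=2))
# [[1, 2, 3], [3, 4, 5]]
```

With a window length of 3 batches and a sliding interval of 2, consecutive windows overlap by one batch, which is why windowed results can count the same record more than once.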

A)
ForeachRDD
B)
SaveAsTextFiles
C)
SaveAsHadoopFiles
D)
ReduceByKeyAndWindow

Correct Answer :   ReduceByKeyAndWindow

A)
YARN
B)
MESOS
C)
Standalone Cluster Manager
D)
All of the above

Correct Answer :   All of the above

A)
MEMORY_ONLY
B)
DISK_ONLY
C)
MEMORY_AND_DISK
D)
MEMORY_ONLY_SER

Correct Answer :   MEMORY_ONLY

A)
Both are data processing platforms
B)
Both are cluster computing environments
C)
Both have their own file system
D)
Both use open source APIs to link between different tools

Correct Answer :   Both have their own file system