PySpark - Quiz (MCQ)

PySpark : PySpark is the Python API for Apache Spark. Spark is an open-source cluster-computing system used for big data solutions; it is a lightning-fast technology designed for fast computation.

PySpark ships with the Py4j library, which allows Python to be easily integrated with Apache Spark.

A)
C
B)
PHP
C)
Python
D)
Java

Correct Answer :   Python


Explanation : PySpark is the API for using Spark in Python.

A)
PySpark is a standalone data processing system.
B)
PySpark does not support distributed processing.
C)
PySpark is used for processing data only in small batches.
D)
PySpark is a Python library used for Big Data processing.

Correct Answer :   PySpark is a Python library used for Big Data processing.


Explanation : PySpark is a Python library used for Big Data processing. It is built on top of Apache Spark, which is a distributed computing system. PySpark provides APIs in Python for data processing, machine learning, and graph processing.

A)
filter()
B)
count()
C)
collect()
D)
reduce()

Correct Answer :   filter()


Explanation : filter() is a transformation operation in PySpark. It creates a new RDD by selecting elements from an existing RDD based on a condition. Other transformation operations in PySpark include map(), flatMap(), union(), distinct(), and groupByKey().
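
For illustration, a minimal runnable sketch (the sample numbers and the even-number predicate are made up; a local SparkSession is assumed):

from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").getOrCreate().sparkContext
numbers = sc.parallelize([1, 2, 3, 4, 5, 6])
evens = numbers.filter(lambda x: x % 2 == 0)   # transformation: lazily builds a new RDD
print(evens.collect())                         # action triggers execution: [2, 4, 6]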

4 .
Using Spark, users can implement big data solutions in an ____-source, cluster computing environment.
A)
Closed
B)
Open
C)
Hybrid
D)
None of the above

Correct Answer :   Open


Explanation : Using Spark, users can implement big data solutions in an open-source, cluster computing environment.

A)
PyJ
B)
PyS4J
C)
Py4j
D)
PySpark4J

Correct Answer :   Py4j

A)
set()
B)
map()
C)
get()
D)
data()

Correct Answer :   map()


Explanation : Serializing another function can be done using the map() function.

A)
Serialization
B)
Profiler
C)
SparkFiles
D)
StorageLevel

Correct Answer :   Serialization


Explanation : PySpark Serialization is used to perform tuning on Apache Spark.

A)
getJobInfo(jobId)
B)
getActiveStageIds()
C)
getJobIdsForGroup(jobGroup = None)
D)
All of the above

Correct Answer :   getActiveStageIds()


Explanation : The active stage ids are returned by getActiveStageIds() in an array.
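
A minimal sketch of calling these status methods through SparkContext.statusTracker() (a local SparkSession is assumed; the output depends on what is running at the moment):

from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").getOrCreate().sparkContext
tracker = sc.statusTracker()
print(tracker.getActiveStageIds())               # array of ids of the currently active stages
print(tracker.getJobIdsForGroup(jobGroup=None))  # job ids for a particular job group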

A)
map()
B)
filter()
C)
flatMap()
D)
count()

Correct Answer :   count()


Explanation : count() is an action operation in PySpark. It returns the number of elements in an RDD. Other action operations in PySpark include collect(), reduce(), take(), and foreach().
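
A minimal sketch (the sample data is made up; a local SparkSession is assumed):

from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").getOrCreate().sparkContext
rdd = sc.parallelize(["a", "b", "c"])
print(rdd.count())   # action: triggers execution and returns 3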

A)
DataSet
B)
DataFrame
C)
SparkContext
D)
SQLContext

Correct Answer :   SparkContext


Explanation : SparkContext is used to create an RDD in PySpark. It is the entry point to the Spark computing system and provides APIs to create RDDs, accumulate values, and manipulate data. Other Spark components in PySpark include SQLContext, SparkSession, and DataFrameReader.
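
A minimal sketch of creating an RDD through the SparkContext (the data and application name are made up; a local session is assumed):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()
sc = spark.sparkContext              # the SparkContext behind the session
rdd = sc.parallelize(range(10))      # build an RDD from a local collection
print(rdd.take(3))                   # [0, 1, 2]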

A)
It sets up internal services
B)
It sends the application to executors
C)
It does not execute tasks in each executor
D)
Establishes a connection to an execution environment

Correct Answer :   It sets up internal services

A)
SparkContent
B)
SparkContext
C)
ContentSpark
D)
ContextSpark

Correct Answer :   SparkContext

A)
They are mutable in nature
B)
They are distributed in nature
C)
Execution starts before an action is triggered
D)
None of the above

Correct Answer :   They are distributed in nature

A)
Low
B)
High
C)
Average
D)
None of the above

Correct Answer :   Low


Explanation : Job and stage progress can be monitored using PySpark's low-level APIs.

A)
Add
B)
Stats
C)
Profile
D)
All of the above

Correct Answer :   All of the above


Explanation : Among the methods that need to be defined by the custom profiler are:

* Add
* Stats
* Profile

A)
cProfile
B)
Accumulator
C)
Both (A) and (B)
D)
None of the above

Correct Answer :   Both (A) and (B)


Explanation : class pyspark.BasicProfiler(ctx) is the default profiler, implemented on the basis of cProfile and Accumulator.

A)
Persistence
B)
Fault Tolerant
C)
Lazy Evaluation
D)
All of the above

Correct Answer :   All of the above


Explanation : The following are the features of PySpark :

* Lazy Evaluation
* Fault Tolerant
* Persistence

A)
Static
B)
Real-time
C)
Virtual
D)
Dynamic

Correct Answer :   Real-time


Explanation : In-memory processing of large data makes PySpark ideal for real-time computation.

A)
readTextFile()
B)
writeTextFile()
C)
readDataFrame()
D)
writeDataFrame()

Correct Answer :   readTextFile()


Explanation : readTextFile() is used to read data from a file in PySpark. It reads the contents of a file and creates an RDD with each line of the file as an element. Other file input/output operations in PySpark include read.csv(), read.json(), write.csv(), and write.json().
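
As a sketch of line-by-line text input with the RDD API, using SparkContext.textFile() as the reader (a throwaway temporary file is written first only so the example runs):

import os, tempfile
from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").getOrCreate().sparkContext
path = os.path.join(tempfile.mkdtemp(), "input.txt")
with open(path, "w") as f:
    f.write("first line\nsecond line\n")
lines = sc.textFile(path)   # each line of the file becomes one RDD element
print(lines.count())        # 2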

A)
fromRDD()
B)
RDDtoDF()
C)
toDataFrame()
D)
createDataFrame()

Correct Answer :   createDataFrame()


Explanation : createDataFrame() is used to convert an RDD to a DataFrame in PySpark. It creates a DataFrame from an RDD with a specified schema. Other DataFrame operations in PySpark include select(), filter(), groupBy(), and join().
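
A minimal sketch (the sample rows and column names are made up; a local SparkSession is assumed):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
rdd = spark.sparkContext.parallelize([("Alice", 34), ("Bob", 45)])
df = spark.createDataFrame(rdd, schema=["name", "age"])   # RDD -> DataFrame with a schema
df.show()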

A)
persist()
B)
cache()
C)
collect()
D)
saveAsTextFile()

Correct Answer :   persist()


Explanation : persist() is used to cache an RDD in memory in PySpark. It stores the RDD in memory and/or on disk so that it can be reused efficiently in subsequent operations. Other RDD operations in PySpark include mapPartitions(), sortByKey(), reduceByKey(), and aggregateByKey().
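
A minimal sketch of caching and reusing an RDD (the data is made up; a local SparkSession is assumed):

from pyspark import StorageLevel
from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").getOrCreate().sparkContext
squares = sc.parallelize(range(1000)).map(lambda x: x * x)
squares.persist(StorageLevel.MEMORY_AND_DISK)   # keep the computed RDD for reuse
print(squares.count())                          # first action materializes and caches it
print(squares.sum())                            # subsequent actions reuse the cached data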

A)
SparkFiles.go
B)
SparkFiles.get
C)
SparkFiles.set
D)
SparkFiles.fetch

Correct Answer :   SparkFiles.get


Explanation : With SparkFiles.get, we can obtain the working directory path.
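
A minimal sketch of the sc.addFile / SparkFiles pairing (a throwaway temporary file is created just so the example runs):

import os, tempfile
from pyspark import SparkFiles
from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").getOrCreate().sparkContext
path = os.path.join(tempfile.mkdtemp(), "settings.txt")
with open(path, "w") as f:
    f.write("demo")
sc.addFile(path)                          # upload the file to every node
print(SparkFiles.get("settings.txt"))     # resolved path of the uploaded file
print(SparkFiles.getRootDirectory())      # directory that holds the added files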

A)
DISK_ONLY
B)
DISK_ONLY_2
C)
MEMORY_AND_DISK
D)
All of the above

Correct Answer :   All of the above


Explanation : To decide how RDDs are stored, PySpark has different StorageLevels, such as the following :

* DISK_ONLY
* DISK_ONLY_2
* MEMORY_AND_DISK

A)
Java
B)
Scala
C)
Python
D)
All of the above

Correct Answer :   All of the above


Explanation : A variety of programming languages can be used with the PySpark framework, such as Scala, Java, Python, and R.

A)
RCD
B)
RAD
C)
RDD
D)
RBD

Correct Answer :   RDD


Explanation : When working with RDDs, Python's dynamic typing comes in handy.

A)
10
B)
100
C)
1000
D)
10000

Correct Answer :   10


Explanation : In memory, PySpark processes data 100 times faster, and on disk, the speed is 10 times faster.

A)
groupByKey()
B)
map()
C)
filter()
D)
reduce()

Correct Answer :   groupByKey()


Explanation : groupByKey() is a transformation operation that shuffles data in PySpark. It groups the values of each key in an RDD and creates a new RDD of (key, value) pairs. Other shuffling operations in PySpark include sortByKey(), reduceByKey(), and aggregateByKey().
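
A minimal sketch (the key/value pairs are made up; a local SparkSession is assumed):

from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").getOrCreate().sparkContext
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
grouped = pairs.groupByKey()   # shuffles values with the same key onto the same partition
print([(k, sorted(v)) for k, v in grouped.collect()])   # e.g. [('a', [1, 3]), ('b', [2])]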

A)
zip()
B)
map()
C)
flatMap()
D)
groupByKey()

Correct Answer :   zip()


Explanation : zip() is used to create a PairRDD in PySpark. It creates a new RDD by pairing the elements of two RDDs positionally: each element of the first RDD becomes a key and the corresponding element of the second RDD becomes its value. Other PairRDD operations in PySpark include reduceByKey(), groupByKey(), and join().
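
A minimal sketch (both RDDs are made up and have matching lengths and partitioning, which zip() requires):

from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").getOrCreate().sparkContext
keys = sc.parallelize(["a", "b", "c"])
values = sc.parallelize([1, 2, 3])
pair_rdd = keys.zip(values)    # positional pairing of the two RDDs
print(pair_rdd.collect())      # [('a', 1), ('b', 2), ('c', 3)]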

A)
broadcast()
B)
rdd.broadcast()
C)
spark.broadcast()
D)
sc.broadcast()

Correct Answer :   sc.broadcast()


Explanation : sc.broadcast() is used to broadcast a read-only variable in PySpark. It ships the variable to all nodes in a Spark cluster so that tasks can access it efficiently. The other kind of shared variable in PySpark is the accumulator.
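
A minimal sketch (the lookup table is made up; a local SparkSession is assumed):

from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").getOrCreate().sparkContext
lookup = sc.broadcast({"a": 1, "b": 2})              # read-only copy shipped to every node
rdd = sc.parallelize(["a", "b", "a"])
print(rdd.map(lambda k: lookup.value[k]).collect())  # [1, 2, 1]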

A)
RDD
B)
Dataframe
C)
PySpark Core
D)
PySpark SQL

Correct Answer :   PySpark SQL

A)
Using join() function
B)
Using filter() function
C)
Using parallel() function
D)
Using parallelize() function

Correct Answer :   Using parallelize() function

A)
spark-cluster
B)
spark cluster
C)
spark-submit
D)
spark submit

Correct Answer :   spark-submit

A)
RDDread.Collect()
B)
RDDread.Content()
C)
RDD.CollectContent()
D)
RDDread.ContentCollect()

Correct Answer :   RDDread.Collect()

A)
Random Forest
B)
Linear Regression
C)
K-Means Clustering
D)
All of the above

Correct Answer :   All of the above


Explanation : PySpark provides several built-in machine learning algorithms, including Linear Regression, K-Means Clustering, Random Forest, Decision Trees, Gradient Boosting, and Naive Bayes. These algorithms can be used for regression, classification, clustering, and collaborative filtering.
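
As a sketch of one of these algorithms through the pyspark.ml API (the tiny training set is made up purely for illustration):

from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import LinearRegression
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
train = spark.createDataFrame(
    [(1.0, Vectors.dense(1.0)), (2.0, Vectors.dense(2.0)), (3.0, Vectors.dense(3.0))],
    ["label", "features"])
model = LinearRegression(maxIter=10).fit(train)   # fit a simple linear model
print(model.coefficients)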

A)
Clustering Concise
B)
Clustering Computing
C)
Clustering Collective
D)
Clustering Calculative

Correct Answer :   Clustering Computing


Explanation : The Apache Software Foundation introduced Apache Spark, an open-source clustering computing framework.

36 .
____ are among the key features of Apache Spark. It is easy to use, provides simplicity, and can run virtually anywhere.
A)
High Speed
B)
Stream Analysis
C)
Both (A) and (B)
D)
None of the above

Correct Answer :   Both (A) and (B)


Explanation : Stream analysis and high speed are among the key features of Apache Spark. It is easy to use, provides simplicity, and can run virtually anywhere.

A)
Scala
B)
Spark
C)
PySpark
D)
None of the above

Correct Answer :   Scala


Explanation : Scala is the official language of Apache Spark.

38 .
Scala is a ____ typed language as opposed to Python, which is an interpreted, ____ programming language.
A)
Statically, Dynamic
B)
Dynamic, Statically
C)
Statically, Partially Dynamic
D)
Dynamic, Partially Statically

Correct Answer :   Statically, Dynamic


Explanation : Scala is a statically typed language as opposed to Python, which is an interpreted, dynamic programming language.

A)
2
B)
5
C)
10
D)
20

Correct Answer :   10


Explanation : Python is 10 times slower than Scala.

A)
C
B)
C++
C)
Scala
D)
Python

Correct Answer :   Python


Explanation : Java version 1.8.0 or higher is required for PySpark, as is Python version 3.6 or higher.

A)
Associative
B)
Commutative
C)
Both (A) and (B)
D)
None of the above

Correct Answer :   Both (A) and (B)


Explanation : Associative and commutative operations are carried out on the accumulator variables to combine the information.
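
A minimal sketch of an accumulator (the numbers are made up; a local SparkSession is assumed):

from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").getOrCreate().sparkContext
total = sc.accumulator(0)                                      # numeric accumulator, starts at 0
sc.parallelize([1, 2, 3, 4]).foreach(lambda x: total.add(x))   # tasks may only add to it
print(total.value)                                             # 10, read back on the driver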

A)
sc.newFile
B)
sc.addFile
C)
sc.deleteFile
D)
sc.updateFile

Correct Answer :   sc.addFile


Explanation : Using sc.addFile, PySpark allows you to upload your files.

A)
Broadcast
B)
Accumulator
C)
Both (A) and (B)
D)
None of the above

Correct Answer :   Both (A) and (B)


Explanation : The following types of shared variables are supported by Apache Spark :

* Broadcast
* Accumulator

A)
set (key, value)
B)
setAppName (value)
C)
setMastervalue (value)
D)
All of the above

Correct Answer :   All of the above


Explanation : The following are the features of the SparkConf :

* set (key, value)
* setMastervalue (value)
* setAppName (value)
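
A minimal sketch using the SparkConf setters as they appear in the shipped API, i.e. setAppName(value), setMaster(value), and set(key, value) (the application name and memory setting are made up):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("quiz-demo")              # setAppName(value)
        .setMaster("local[*]")                # setMaster(value): the master URL
        .set("spark.executor.memory", "1g"))  # set(key, value)
sc = SparkContext.getOrCreate(conf)
print(sc.appName)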

A)
URL
B)
Site
C)
Page
D)
Browser

Correct Answer :   URL


Explanation : The Master URL identifies the cluster connected to Spark.

A)
Conf
B)
pyFiles
C)
BatchSize
D)
SparkHome

Correct Answer :   SparkHome


Explanation : The SparkHome directory contains the Spark installation files.

A)
.py
B)
.zip
C)
Both (A) and (B)
D)
None of the above

Correct Answer :   Both (A) and (B)


Explanation : The PYTHONPATH is set by sending .zip or .py files to the cluster.

48 .
A ____ memory abstraction, resilient distributed datasets (RDDs), allows programmers to run in-memory computations on clustered systems.
A)
Configured
B)
Distributed
C)
Compressed
D)
Concentrated

Correct Answer :   Distributed


Explanation : A distributed memory abstraction, resilient distributed datasets (RDDs), allows programmers to run in-memory computations on clustered systems.

A)
Tolerant
B)
Intolerant
C)
Manageable
D)
None of the above

Correct Answer :   Tolerant


Explanation : The main advantage of RDD is that it is fault-tolerant, which means that if there is a failure, it automatically recovers.

A)
Resilient Defined Dataset
B)
Resilient Distributed Dataset
C)
Resilient Defined Database
D)
Resilient Distributed Database

Correct Answer :   Resilient Distributed Dataset


Explanation : The full form of RDD is Resilient Distributed Dataset.

A)
Slowness
B)
Py4JJavaError
C)
Both (A) and (B)
D)
None of the above

Correct Answer :   Both (A) and (B)


Explanation : The following are the common UDF problems :

* Py4JJavaError
* Slowness

A)
Stacks
B)
Arrays
C)
Queues
D)
Objects

Correct Answer :   Objects


Explanation : BatchSize is the number of Python objects represented as a single Java object.

A)
0
B)
1
C)
Void
D)
Null

Correct Answer :   1


Explanation : Batching can be disabled by setting it to 1.

A)
aggregate()
B)
reduce()
C)
collect()
D)
groupByKey()

Correct Answer :   aggregate()


Explanation : aggregate() is used to aggregate data in PySpark. It applies a function to each partition of an RDD and then combines the results using another function. Other aggregation operations in PySpark include reduce(), fold(), and combineByKey().
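
A minimal sketch that computes a sum and a count in one pass (the data is made up; a local SparkSession is assumed):

from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").getOrCreate().sparkContext
rdd = sc.parallelize([1, 2, 3, 4])
# seqOp folds elements into a (sum, count) pair per partition; combOp merges the pairs
total, count = rdd.aggregate((0, 0),
                             lambda acc, x: (acc[0] + x, acc[1] + 1),
                             lambda a, b: (a[0] + b[0], a[1] + b[1]))
print(total / count)   # 2.5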

A)
sort()
B)
groupByKey()
C)
reduceByKey()
D)
sortByKey()

Correct Answer :   sortByKey()


Explanation : sortByKey() is used to sort data in PySpark. It sorts an RDD of (key, value) pairs by the key in ascending or descending order. Other sorting operations in PySpark include sort() and sortBy().
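
A minimal sketch (the pairs are made up; a local SparkSession is assumed):

from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").getOrCreate().sparkContext
pairs = sc.parallelize([("b", 2), ("a", 1), ("c", 3)])
print(pairs.sortByKey().collect())                  # [('a', 1), ('b', 2), ('c', 3)]
print(pairs.sortByKey(ascending=False).collect())   # [('c', 3), ('b', 2), ('a', 1)]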

A)
readTextFile()
B)
readDataFrame()
C)
writeTextFile()
D)
writeDataFrame()

Correct Answer :   writeTextFile()


Explanation : writeTextFile() is used to write data to a file in PySpark. It writes the contents of an RDD to a file with each element of the RDD on a separate line. Other file input/output operations in PySpark include write.csv() and write.json().

A)
readCSV()
B)
readJSON()
C)
readTextFile()
D)
read.parquet()

Correct Answer :   readCSV()


Explanation : readCSV() is used to read data from a CSV file in PySpark. It reads the contents of a CSV file and creates a DataFrame with each row of the file as a separate row in the DataFrame. Other file input/output operations in PySpark include read.json(), read.parquet(), and read.text().
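
A sketch of CSV input through the DataFrameReader, i.e. spark.read.csv() (a throwaway CSV file is written first only so the example runs):

import os, tempfile
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
path = os.path.join(tempfile.mkdtemp(), "people.csv")
with open(path, "w") as f:
    f.write("name,age\nAlice,34\nBob,45\n")
df = spark.read.csv(path, header=True, inferSchema=True)   # each CSV row becomes a DataFrame row
df.show()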

A)
Functional-to-relational
B)
Relational-to-functional
C)
Functional-to-functional
D)
None of the above

Correct Answer :   Relational-to-functional


Explanation : An integrated relational-to-functional programming API is provided by PySpark SQL in Spark.

A)
Changing the trash setting will prevent us from dropping encrypted databases in cascade.
B)
If the workflow execution fails in the middle, you cannot recover from the position where it stopped.
C)
MapReduce executes ad-hoc queries, which are launched by Hive, but the performance of the analysis is delayed due to the medium-sized database.
D)
All of the above

Correct Answer :   All of the above


Explanation : The drawbacks of Hive are :

* Changing the trash setting will prevent us from dropping encrypted databases in cascade.
* If the workflow execution fails in the middle, you cannot recover from the position where it stopped.
* MapReduce executes ad-hoc queries, which are launched by Hive, but the performance of the analysis is delayed due to the medium-sized database.

A)
pyspark.sql.types
B)
pyspark.sql.Window
C)
pyspark.sql.functions
D)
All of the above

Correct Answer :   pyspark.sql.functions


Explanation : A list of built-in functions for DataFrame is stored in pyspark.sql.functions.
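
A minimal sketch of a couple of these built-in functions (the DataFrame is made up; a local SparkSession is assumed):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([("alice", 34), ("bob", 45)], ["name", "age"])
df.select(F.upper(F.col("name")).alias("name"), (F.col("age") + 1).alias("age_next")).show()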

A)
pyspark.sql.Row
B)
pyspark.sql.Column
C)
pyspark.sql.functions
D)
pyspark.sql.DataFrameNaFunctions

Correct Answer :   pyspark.sql.DataFrameNaFunctions


Explanation : Missing data can be handled via pyspark.sql.DataFrameNaFunctions.
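
A minimal sketch of handling missing values through the DataFrame's na accessor (the rows are made up; a local SparkSession is assumed):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([("Alice", 34), ("Bob", None)], ["name", "age"])
df.na.fill({"age": 0}).show()   # replace missing ages with 0
df.na.drop().show()             # or drop rows containing nulls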

A)
Data.groupBy()
B)
DataFrame.groupBy()
C)
Data.groupedBy()
D)
DataFrame.groupedBy()

Correct Answer :   DataFrame.groupBy()


Explanation : DataFrame.groupBy() returns aggregation methods.
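
A minimal sketch (the data is made up; the aggregation functions come from pyspark.sql.functions):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([("a", 1), ("a", 3), ("b", 2)], ["key", "value"])
df.groupBy("key").agg(F.sum("value").alias("total"), F.avg("value").alias("average")).show()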

A)
rename()
B)
column()
C)
withColumn()
D)
renameColumn()

Correct Answer :   rename()


Explanation : rename() is used to rename a column in a PySpark DataFrame. It renames the specified column to a new name. Other DataFrame operations in PySpark include withColumn(), withColumnRenamed(), and drop().

A)
write.csv()
B)
write.parquet()
C)
write.json()
D)
write.text()

Correct Answer :   write.parquet()


Explanation : write.parquet() is used to write data to a Parquet file in PySpark. Parquet is a columnar storage format that is optimized for query performance. Other file input/output operations in PySpark include write.csv(), write.json(), and write.text().
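
A minimal sketch of a Parquet round trip (a temporary output path is used only so the example runs):

import os, tempfile
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
out = os.path.join(tempfile.mkdtemp(), "people.parquet")
df.write.parquet(out)             # columnar Parquet output
spark.read.parquet(out).show()    # read it back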

A)
Standard Connectivity
B)
Consistent Data Access
C)
Incorporation with Spark
D)
All of the above

Correct Answer :   All of the above


Explanation : The features of PySpark SQL are :

* Standard Connectivity
* Consistent Data Access
* Incorporation with Spark

66 .
The Consistent Data Access feature allows SQL to access a variety of data sources, such as ____, JSON, and JDBC, from a single place.
A)
Hive
B)
Avro
C)
Parquet
D)
All of the above

Correct Answer :   All of the above


Explanation : The Consistent Data Access feature allows SQL to access a variety of data sources, such as Hive, Avro, Parquet, JSON, and JDBC, from a single place.

A)
User-Defined Fidelity
B)
User-Defined Fortray
C)
User-Defined Functions
D)
User-Defined Formula

Correct Answer :   User-Defined Functions


Explanation : The full form of UDF is User-Defined Functions.

68 .
A UDF extends Spark SQL's DSL vocabulary for transforming DataFrames by defining a new ____-based function.
A)
Tuple
B)
Row
C)
Column
D)
None of the above

Correct Answer :   Column


Explanation : A UDF extends Spark SQL's DSL vocabulary for transforming DataFrames by defining a new column-based function.
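
A minimal sketch of such a column-based UDF (the function, column, and data are made up; a local SparkSession is assumed):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.master("local[*]").getOrCreate()
shout = udf(lambda s: s.upper() + "!", StringType())   # wrap a Python function as a column-based UDF
df = spark.createDataFrame([("hello",), ("world",)], ["word"])
df.select(shout("word").alias("shouted")).show()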

A)
pyspark.sql.Row
B)
pyspark.sql.Column
C)
pyspark.sql.DataFrame
D)
pyspark.sql.SparkSession

Correct Answer :   pyspark.sql.SparkSession


Explanation : DataFrame and SQL functionality are accessed through pyspark.sql.SparkSession.
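
A minimal sketch that reaches both the DataFrame and the SQL side through a SparkSession (the table name and rows are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("sql-demo").getOrCreate()
df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 40").show()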

A)
pyspark.sql.DataFrame
B)
pyspark.sql.Row
C)
pyspark.sql.Column
D)
pyspark.sql.GroupedData

Correct Answer :   pyspark.sql.DataFrame


Explanation : pyspark.sql.DataFrame represents a distributed collection of data grouped into named columns.