Spark Interview Questions
Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance.

Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.

Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
Support for Several Programming Languages : Spark code can be written in any of four programming languages, namely Java, Python, R, and Scala, and Spark provides high-level APIs in each of them. Additionally, Apache Spark provides interactive shells in Python and Scala : the Python shell is launched with ./bin/pyspark, while the Scala shell is launched with ./bin/spark-shell.

Speed : For large-scale data processing, Spark can be up to 100 times faster than Hadoop MapReduce. Apache Spark achieves this speed through controlled partitioning. The distributed, general-purpose cluster-computing framework manages data by means of partitions that help parallelize distributed data processing with minimal network traffic.

Machine Learning : For big data processing, Apache Spark's MLlib machine learning component is useful. It eliminates the need for separate engines for processing and machine learning.

Lazy Evaluation : Apache Spark makes use of lazy evaluation, delaying evaluation until it becomes absolutely necessary.

Multiple Format Support : Apache Spark supports multiple data sources, including Cassandra, Hive, JSON, and Parquet. The Data Sources API offers a pluggable mechanism for accessing structured data via Spark SQL. These data sources can be much more than simple pipes that convert data and pull it into Spark.

Real-Time Computation : Spark is designed especially for meeting massive scalability requirements. Thanks to its in-memory computation, Spark delivers real-time computation with low latency.

Hadoop Integration : Spark offers smooth connectivity with Hadoop. In addition to being a potential replacement for the Hadoop MapReduce functions, Spark is able to run on top of an extant Hadoop cluster by means of YARN for resource scheduling.

Built-in Libraries : Spark supports GraphX for graph-parallel execution, Spark SQL, MLlib for machine learning, and more.

Active Developer Community : Apache Spark has a large developer base involved in continuous development, and it is one of the most active projects in the Apache community.

Cost Efficiency : Apache Spark is considered a more cost-efficient solution than Hadoop, since Hadoop requires large amounts of storage and data-center capacity for data processing and replication.
RDD stands for Resilient Distributed Dataset. It is a fault-tolerant collection of elements that can be operated on in parallel. The partitioned data of an RDD is distributed and immutable. There are two types of RDDs :
 
Parallelized collections : Existing collections in the driver program that are distributed so they can be operated on in parallel.

Hadoop datasets : These perform operations on each file record in HDFS or other storage systems.

RDDs are basically parts of data that are stored in the memory distributed across many nodes. RDDs are lazily evaluated in Spark. This lazy evaluation is what contributes to Spark’s speed.
Features | Spark | Hadoop
Data processing | Batch as well as real-time processing | Batch processing only, even for high volumes
Streaming engine | Spark Streaming (micro-batches) | MapReduce (no native streaming engine)
Data flow | Directed Acyclic Graph (DAG) | MapReduce (map and reduce stages)
Computation model | Collect and process in memory | Batch-oriented MapReduce model
Performance | Fast, due to in-memory processing | Slow, due to disk-based batch processing
Memory management | Automatic memory management in the latest releases | Dynamic and static - configurable
Fault tolerance | Recovery available without extra code (via RDD lineage) | Highly fault-tolerant through MapReduce replication
Scalability | Highly scalable - Spark clusters of up to ~8,000 nodes | Highly scalable - supports a very large number of nodes
Spark has the following benefits over MapReduce :
 
* Due to the availability of in-memory processing, Spark executes processing around 10 to 100 times faster than Hadoop MapReduce, whereas MapReduce uses persistent storage for all of its data processing tasks.

* Unlike Hadoop, Spark provides built-in libraries to perform multiple tasks from the same core engine, such as batch processing, streaming, machine learning, and interactive SQL queries. Hadoop, by contrast, only supports batch processing.

* Hadoop is highly disk-dependent whereas Spark promotes caching and in-memory data storage.

* Spark is capable of performing computations multiple times on the same dataset. This is called iterative computation, whereas Hadoop has no built-in support for iterative computing.
Apache Spark | MapReduce
Processes data in batches as well as in real-time | Processes data in batches only
Runs almost 100 times faster than Hadoop MapReduce | Slower when it comes to large-scale data processing
Stores data in RAM (in-memory), so it is easier to retrieve | Stores data in HDFS, so it takes a long time to retrieve the data
Provides caching and in-memory data storage | Highly disk-dependent
Standalone Mode : By default, applications submitted to the standalone mode cluster will run in FIFO order, and each application will try to use all available nodes. You can launch a standalone cluster either manually, by starting a master and workers by hand, or use our provided launch scripts. It is also possible to run these daemons on a single machine for testing.

Apache Mesos : Apache Mesos is an open-source project to manage computer clusters, and can also run Hadoop applications. The advantages of deploying Spark with Mesos include dynamic partitioning between Spark and other frameworks as well as scalable partitioning between multiple instances of Spark.

Hadoop YARN : Apache YARN is the cluster resource manager of Hadoop 2. Spark can be run on YARN as well.

Kubernetes : Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications.
YARN is one of the key features of the Hadoop ecosystem, providing a central resource-management platform for delivering scalable operations across the cluster. YARN is a distributed container manager, like Mesos, whereas Spark is a data processing tool. Spark can run on YARN in the same way that Hadoop MapReduce can. Running Spark on YARN requires a binary distribution of Spark that is built with YARN support.
No, because Spark runs on top of YARN and runs independently of where it is installed. Spark offers options to use YARN when dispatching jobs to the cluster, rather than its own built-in manager or Mesos. Further, there are several configurations for running on YARN, including master, deploy-mode, driver-memory, executor-memory, executor-cores, and queue.
Spark provides two methods to create RDD :
 
* By parallelizing a collection in your driver program. This makes use of SparkContext's parallelize() method :

val DataArray = Array(2, 4, 6, 8, 10)
val DataRDD = sc.parallelize(DataArray)   // distributes the collection across the cluster

* By loading an external dataset from external storage such as HDFS, HBase, or a shared file system, as sketched below.
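A minimal sketch of the second approach, assuming a hypothetical HDFS path (local paths and S3 URIs work the same way) :
val fileRDD = sc.textFile("hdfs://namenode:9000/data/input.txt")   // hypothetical path; each line becomes one RDD element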
DAG stands for Directed Acyclic Graph : a graph with a finite number of vertices and edges and no directed cycles, where each edge is directed from one vertex to another in a sequential manner. The vertices represent the RDDs of Spark and the edges represent the operations to be performed on those RDDs.
Apache Spark has 3 main categories that comprise its ecosystem. Those are :
 
Language support : Spark integrates with different languages for building applications and performing analytics. These languages are Java, Python, Scala, and R.

Core Components : Spark supports 5 main core components. These are Spark Core, Spark SQL, Spark Streaming, Spark MLlib, and GraphX.

Cluster Management : Spark can be run in 3 environments. Those are the Standalone cluster, Apache Mesos, and YARN.
When Spark operates on any dataset, it remembers the instructions. When a transformation such as map() is called on an RDD, the operation is not performed instantly. Transformations in Spark are not evaluated until you perform an action; this is known as lazy evaluation, and it aids in optimizing the overall data processing workflow.
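A minimal sketch of this behaviour (the numbers are illustrative data) :
val numbers = sc.parallelize(1 to 100000)
val doubled = numbers.map(_ * 2)     // transformation : nothing is computed yet
val total = doubled.reduce(_ + _)    // action : triggers evaluation of the whole lineage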
Apache Spark stores data in-memory for faster processing and for building machine learning models. Machine learning algorithms require multiple iterations and different conceptual steps to create an optimal model, and graph algorithms traverse all the nodes and edges to generate a graph. For these low-latency workloads that need multiple iterations, keeping the data in memory leads to increased performance.
Receivers are those entities that consume data from different data sources and then move them to Spark for processing. They are created by using streaming contexts in the form of long-running tasks that are scheduled for operating in a round-robin fashion. Each receiver is configured to use up only a single core. The receivers are made to run on various executors to accomplish the task of data streaming. There are two types of receivers depending on how the data is sent to Spark :
 
Reliable receivers : Here, the receiver sends an acknowledgement to the data source after successfully receiving the data and replicating it on the Spark storage.

Unreliable receivers : Here, no acknowledgement is sent to the data source.
Criteria | Repartition | Coalesce
Usage | Can increase or decrease the number of data partitions. | Can only reduce the number of data partitions.
Shuffling | Creates new data partitions and performs a full shuffle, distributing the data evenly. | Reuses existing partitions, reducing the amount of shuffled data, which can leave partitions unevenly sized.
Speed | Internally calls coalesce with the shuffle parameter enabled, making it slower than coalesce. | Faster than repartition, although unequal-sized partitions may make subsequent processing slightly slower.
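A minimal sketch of the difference, assuming an existing DataFrame named df :
val wide = df.repartition(10)    // full shuffle into 10 evenly sized partitions
val narrow = wide.coalesce(2)    // merges existing partitions into 2 without a full shuffle
println(narrow.rdd.getNumPartitions)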
Transformations are functions applied to an RDD that result in another RDD. They do not execute until an action occurs. map() and filter() are examples of transformations : map() applies the function passed to it to each element of the RDD and produces another RDD, while filter() creates a new RDD by selecting the elements of the current RDD that pass the function passed as an argument.
val rawData = sc.textFile("path to/movies.txt")     // load the file as an RDD of lines
 
val moviesData = rawData.map(x => x.split("  "))    // transformation : split each line into fields
As we can see here, rawData RDD is transformed into moviesData RDD. Transformations are lazily evaluated.
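As a further illustration, filter() behaves the same way; nothing runs until an action is called (the non-empty predicate is just illustrative) :
val nonEmptyRows = moviesData.filter(fields => fields.nonEmpty)   // transformation : still lazy
nonEmptyRows.count()                                              // action : triggers evaluation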
18 .
How does Spark partition the data?
Spark uses the MapReduce API to partition the data. In the input format, we can create a number of partitions. By default, the HDFS block size is the partition size (for best performance), but it is possible to change the partition size, for example by using Split.
19 .
What are the data formats supported by Spark?
Spark supports both raw files and structured file formats for efficient reading and processing. File formats like Parquet, JSON, XML, CSV, RC, Avro, TSV, etc. are supported by Spark.
Spark Architecture
Spark applications run as independent processes coordinated by the driver program through a SparkSession object. The cluster manager, the resource-manager entity of Spark, assigns the work of running Spark jobs to the worker nodes following the one-task-per-partition principle. Iterative algorithms apply operations to the data repeatedly and benefit from caching datasets across iterations. Every task applies its unit of work to the dataset within its partition and produces a new partitioned dataset. These results are sent back to the main driver application for further processing or to be stored on disk.
There are a total of 4 steps that can help you connect Spark to Apache Mesos.
 
* Configure the Spark Driver program to connect with Apache Mesos
* Put the Spark binary package in a location accessible by Mesos
* Install Spark in the same location as that of the Apache Mesos
* Configure the spark.mesos.executor.home property to point to the location where Spark is installed (see the sketch below)
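A minimal sketch of the relevant configuration, assuming a Mesos master at mesos://mesos-master:5050 and Spark installed under /opt/spark (both values are hypothetical) :
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("SparkOnMesos")
  .setMaster("mesos://mesos-master:5050")            // hypothetical Mesos master URL
  .set("spark.mesos.executor.home", "/opt/spark")    // hypothetical Spark installation path
val sc = new SparkContext(conf)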
Hive contains significant support for Apache Spark, wherein Hive execution is configured to Spark :
hive> set spark.home=/location/to/sparkHome;
hive> set hive.execution.engine=spark;
Hive supports Spark on YARN mode by default.
23 .
What is GraphX?
Spark uses GraphX for graph processing to build and transform interactive graphs. The GraphX component enables programmers to reason about structured data at scale.
24 .
What does MLlib do?
MLlib is a scalable Machine Learning library provided by Spark. It aims at making Machine Learning easy and scalable, with common learning algorithms and use cases like clustering, regression, collaborative filtering, dimensionality reduction, and the like.
Spark SQL modules help in integrating relational processing with Spark’s functional programming API. It supports querying data via SQL or HiveQL (Hive Query Language).
 
Also, Spark SQL supports a wide range of data sources and allows weaving SQL queries with code transformations. The DataFrame API, Data Source API, Interpreter & Optimizer, and SQL Service are the four libraries contained in Spark SQL.
Yes, it is possible to use Apache Spark to access and analyze data stored in Cassandra databases using the Spark Cassandra Connector. The connector needs to be added to the Spark project; with it, a Spark executor talks to a local Cassandra node and queries only local data.

Connecting Cassandra with Apache Spark makes queries faster by reducing the use of the network for sending data between Spark executors and Cassandra nodes.
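A minimal sketch using the connector's DataFrame API, assuming an existing SparkSession named spark and hypothetical keyspace/table names (the connector dependency must be on the classpath) :
val users = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "users"))   // hypothetical names
  .load()
users.show()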
Any node that is capable of running the code in a cluster can be said to be a worker node. The driver program needs to listen for incoming connections and then accept the same from its executors. Additionally, the driver program must be network addressable from the worker nodes.
 
A worker node is basically a slave node. The master node assigns work that the worker node then performs. Worker nodes process the data stored on the node and report their resources to the master node. The master node schedules tasks based on resource availability.
A sparse vector is used for storing non-zero entries for saving space. It has two parallel arrays :
 
* One for indices
* The other for values

An example of a sparse vector is as follows :
import org.apache.spark.mllib.linalg.Vectors
// Vectors.sparse(size, indices, values)
Vectors.sparse(7, Array(0,1,2,3,4,5,6), Array(1650d, 50000d, 800d, 3.0, 3.0, 2009d, 95054d))
* Loading data from a variety of structured sources

* Querying data using SQL statements, both inside a Spark program and from external tools that connect to Spark SQL through standard database connectors (JDBC/ODBC), e.g., using Business Intelligence tools like Tableau

* Providing rich integration between SQL and the regular Python/Java/Scala code, including the ability to join RDDs and SQL tables, expose custom functions in SQL, and more.
30 .
How is Spark SQL different from HQL and SQL?
Spark SQL is a component on top of the Spark Core engine that supports SQL and Hive Query Language (HQL) without changing any syntax. It is possible to join SQL tables and HQL tables.
31 .
What is PageRank?
A unique feature and algorithm in GraphX, PageRank measures the importance of each vertex in a graph. For instance, an edge from u to v represents an endorsement of v's importance by u. In simple terms, if a user on Instagram is followed massively, he/she will be ranked high on that platform.
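A minimal sketch using GraphX, assuming an edge-list file at a hypothetical HDFS path :
import org.apache.spark.graphx.GraphLoader

val graph = GraphLoader.edgeListFile(sc, "hdfs://namenode:9000/data/followers.txt")   // hypothetical path
val ranks = graph.pageRank(0.0001).vertices   // run PageRank until the convergence tolerance is reached
ranks.take(5).foreach(println)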
There are 2 ways to convert a Spark RDD into a DataFrame:
 
Using the helper function - toDF :

import com.mapr.db.spark.sql._

val df = sc.loadFromMapRDB(<table-name>)
  .where(field("first_name") === "Peter")
  .select("_id", "first_name")
  .toDF()
 
Using SparkSession.createDataFrame :
You can convert an RDD[Row] to a DataFrame by calling createDataFrame on a SparkSession object :

def createDataFrame(rowRDD: RDD[Row], schema: StructType): DataFrame
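A minimal self-contained sketch of the second approach (column names and row values are illustrative) :
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder().appName("RddToDataFrame").getOrCreate()
val rowRDD = spark.sparkContext.parallelize(Seq(Row(1, "Peter"), Row(2, "Mary")))
val schema = StructType(Seq(
  StructField("_id", IntegerType, nullable = false),
  StructField("first_name", StringType, nullable = true)
))
val df = spark.createDataFrame(rowRDD, schema)
df.show()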
In case the client machines are not close to the cluster, Cluster mode should be used for deployment. This is done to avoid the network latency caused by communication between the driver and the executors, which would occur in Client mode. Also, in Client mode, the entire process is lost if the machine goes offline.

If we have the client machine inside the cluster, then Client mode can be used for deployment. Since the machine is inside the cluster, there won't be issues of network latency, and since the maintenance of the cluster is already handled, there is no cause for worry in case of failure.
Spark Datasets are data structures in Spark SQL that provide the JVM-object benefits of RDDs (such as data manipulation using lambda functions) together with Spark SQL's optimized execution engine. They were introduced in Spark 1.6.
 
* Spark datasets are strongly typed structures that represent the structured queries along with their encoders.

* They provide type safety to the data and also give an object-oriented programming interface.

* Datasets are more structured and use lazy query evaluation, which is triggered by an action. Datasets combine the powers of both RDDs and Dataframes. Internally, each Dataset represents a logical plan that describes the computational query needed to produce the data. Once the logical plan is analyzed and resolved, a physical query plan is formed that performs the actual query execution.

Datasets have the following features :
 
Optimized Query feature : Spark Datasets provide optimized queries using the Tungsten and Catalyst query optimizer frameworks. The Catalyst query optimizer represents and manipulates a data-flow graph (a graph of expressions and relational operators), while Tungsten optimizes the execution speed of Spark jobs by exploiting the hardware architecture of the Spark execution platform.

Compile-Time Analysis : Datasets offer the flexibility of analyzing and checking syntax at compile time, which is not technically possible with RDDs, Dataframes, or regular SQL queries.

Interconvertible : The type-safe Datasets can be converted to "untyped" Dataframes by making use of the following methods provided by DatasetHolder :
toDS(): Dataset[T]
toDF(): DataFrame
toDF(colNames: String*): DataFrame

Faster Computation : Dataset operations are much faster than their RDD counterparts, which helps increase system performance.

Persistent storage qualified : Since Datasets are both queryable and serializable, they can easily be stored in any persistent storage.

Less Memory Consumed : Spark uses the feature of caching to create a more optimal data layout. Hence, less memory is consumed.

Single Interface, Multiple Languages : A single API is provided for both Java and Scala, the most widely used languages with Apache Spark. This reduces the burden of using different libraries for different types of input.
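A minimal sketch of a typed Dataset, assuming an existing SparkSession available as the val spark (the Person case class and values are illustrative) :
case class Person(name: String, age: Int)

import spark.implicits._
val people = Seq(Person("Peter", 34), Person("Mary", 29)).toDS()   // strongly typed Dataset[Person]
people.filter(_.age > 30).show()                                   // type-safe, lambda-based filtering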
Applications developed in Spark have a fixed core count and fixed heap size defined for Spark executors. The heap size refers to the memory of the Spark executor, controlled by the spark.executor.memory property, which corresponds to the --executor-memory flag. Every Spark application has one executor allocated on each worker node it runs on. The executor memory is a measure of how much memory the application consumes on that worker node.
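For illustration, the same setting can be supplied either via the --executor-memory flag on spark-submit or programmatically through SparkConf (the 4g value is just an example) :
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "4g")   // equivalent to passing --executor-memory 4g to spark-submit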
GraphX is the Spark API for graphs and graph-parallel computation. Thus, it extends the Spark RDD with a Resilient Distributed Property Graph.
 
The property graph is a directed multi-graph which can have multiple edges in parallel. Every edge and vertex has user-defined properties associated with it. The parallel edges allow multiple relationships between the same vertices. At a high level, GraphX extends the Spark RDD abstraction by introducing the Resilient Distributed Property Graph : a directed multigraph with properties attached to each vertex and edge.
 
To support graph computation, GraphX exposes a set of fundamental operators (e.g., subgraph, joinVertices, and mapReduceTriplets) as well as an optimized variant of the Pregel API. In addition, GraphX includes a growing collection of graph algorithms and builders to simplify graph analytics tasks.
Parquet is a columnar format file supported by many other data processing systems. Spark SQL performs both read and write operations with Parquet files and considers it to be one of the best big data analytics formats so far.
 
The advantages of having columnar storage are as follows :
 
* Columnar storage limits IO operations.
* It can fetch specific columns that you need to access.
* Columnar storage consumes less space.
* It gives better-summarized data and follows type-specific encoding.
A lineage graph is a graph of the dependencies between an existing RDD and a new RDD. It means that all the dependencies between RDDs are recorded in a graph, rather than duplicating the original data.
 
The need for an RDD lineage graph arises when we want to compute a new RDD or recover lost data from a lost persisted RDD. Spark does not support data replication in memory, so if any data is lost it can be rebuilt using the RDD lineage. It is also called an RDD operator graph or RDD dependency graph.
Sensor Data Processing : Apache Spark’s “In-memory” computing works best here, as data is retrieved and combined from different sources.

Real Time Processing : Spark is preferred over Hadoop for real-time querying of data. e.g. Stock Market Analysis, Banking, Healthcare, Telecommunications, etc.

Stream Processing : For processing logs and detecting frauds in live streams for alerts, Apache Spark is the best solution.

Big Data Processing : Spark runs up to 100 times faster than Hadoop when it comes to processing medium and large-sized datasets.
Minimizing data transfers and avoiding shuffling helps in writing Spark programs that run fast and reliably. The various ways in which data transfers can be minimized when working with Apache Spark are :
 
Using Broadcast Variable : Broadcast variable enhances the efficiency of joins between small and large RDDs.

Using Accumulators : Accumulators help update the values of variables in parallel while executing.

Avoiding shuffle-heavy operations : The most common way is to avoid ByKey operations, repartition, or any other operations that trigger shuffles.
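A minimal sketch of a broadcast-variable join, assuming a small lookup map and an RDD of (id, amount) pairs (all data is illustrative) :
val lookup = Map(1 -> "books", 2 -> "music")    // small dataset, shipped once to every executor
val bLookup = sc.broadcast(lookup)

val orders = sc.parallelize(Seq((1, 25.0), (2, 10.0), (1, 5.0)))
val withCategory = orders.map { case (id, amount) =>
  (bLookup.value.getOrElse(id, "unknown"), amount)   // lookup happens locally, so no shuffle is triggered
}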
41 .
What are Accumulators in Spark?
Accumulators are sometimes described as Spark's offline debuggers. Spark accumulators are similar to Hadoop counters; you can use them to count the number of events and track what is happening during the job. Only the driver program can read an accumulator's value, not the tasks.
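A minimal sketch, assuming Spark 2.x or later where SparkContext exposes longAccumulator (the log lines are illustrative) :
val errorCount = sc.longAccumulator("errorCount")

val logs = sc.parallelize(Seq("INFO ok", "ERROR disk full", "ERROR timeout"))
logs.foreach(line => if (line.startsWith("ERROR")) errorCount.add(1L))   // tasks can only add to the accumulator

println(errorCount.value)   // only the driver reads the value; prints 2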
Spark utilizes memory. The developer has to be careful. A casual developer might make the following mistakes :
 
* She may end up running everything on the local node instead of distributing work over to the cluster.
* She might hit some web service too many times by way of using multiple clusters.

The first problem is well tackled by the Hadoop MapReduce paradigm, as it ensures that the data your code is processing is fairly small at any point in time, so you cannot make the mistake of trying to handle the whole dataset on a single node.

The second mistake is possible in Map-Reduce too. While writing Map-Reduce, the user may hit a service from inside of map() or reduce() too many times. This overloading of service is also possible while using Spark.
43 .
How does Spark use Akka?
Spark uses Akka basically for scheduling. All the workers request a task from the master after registering, and the master simply assigns the task. Here, Spark uses Akka for messaging between the workers and the master.
A Discretized Stream (DStream) is a continuous sequence of RDDs and the fundamental abstraction in Spark Streaming. These RDD sequences are all of the same type and represent a continuous stream of data. Every RDD contains data from a specific interval.
 
DStreams in Spark take input from many sources such as Kafka, Flume, Kinesis, or TCP sockets. A DStream can also be created by transforming an input stream. It provides developers with a high-level API and fault tolerance.
Yes, Apache Spark provides an API for adding and managing checkpoints. Checkpointing is the process of making streaming applications resilient to failures. It allows you to save the data and metadata into a checkpointing directory. In case of a failure, Spark can recover this data and start from wherever it stopped.
 
There are 2 types of data for which we can use checkpointing in Spark.
 
Metadata Checkpointing : Metadata means the data about data. It refers to saving the metadata to fault-tolerant storage like HDFS. Metadata includes configurations, DStream operations, and incomplete batches.
 
Data Checkpointing : Here, we save the RDD to reliable storage because its need arises in some of the stateful transformations. In this case, the upcoming RDD depends on the RDDs of previous batches. 
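A minimal sketch for a streaming application, assuming an existing SparkContext named sc and a hypothetical fault-tolerant checkpoint directory on HDFS :
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))            // 10-second batch interval (illustrative)
ssc.checkpoint("hdfs://namenode:9000/spark/checkpoints")   // hypothetical directory for metadata and data checkpoints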
The sliding window controls the transmission of data packets between multiple computer networks. The Spark Streaming library provides windowed computations where transformations on RDDs are applied over a sliding window of data.

(Figure : sliding window operation)
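A minimal sketch of a windowed computation, assuming a DStream of (word, count) pairs named pairs (the 30-second window and 10-second slide are illustrative) :
import org.apache.spark.streaming.Seconds

// count words over the last 30 seconds of data, recomputed every 10 seconds
val windowedCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))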
DISK_ONLY : Stores the RDD partitions only on the disk
 
MEMORY_ONLY_SER : Stores the RDD as serialized Java objects, with one byte array per partition
 
MEMORY_ONLY : Stores the RDD as deserialized Java objects in the JVM. If the RDD is not able to fit in the memory available, some partitions won’t be cached
 
OFF_HEAP : Works like MEMORY_ONLY_SER but stores the data in off-heap memory
 
MEMORY_AND_DISK : Stores RDD as deserialized Java objects in the JVM. In case the RDD is not able to fit in the memory, additional partitions are stored on the disk
 
MEMORY_AND_DISK_SER : Identical to MEMORY_ONLY_SER with the exception of storing partitions not able to fit in the memory to the disk
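A minimal sketch of applying one of these levels, assuming an existing RDD named rdd :
import org.apache.spark.storage.StorageLevel

rdd.persist(StorageLevel.MEMORY_AND_DISK)   // partitions that do not fit in memory spill to disk
rdd.count()                                 // the first action materializes and caches the data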
Consider you have the below details regarding the cluster:
Number of nodes = 10
Number of cores in each node = 15 cores
RAM of each node = 61GB 
To identify the number of executors per node, we follow this approach :
Number of cores = number of concurrent tasks that an executor can run in parallel.
The optimal value, as a general rule of thumb, is 5.
Hence, to calculate the number of executors, we follow the approach below :
Number of executors per node = Number of cores per node / Concurrent tasks per executor
                             = 15 / 5
                             = 3
Total number of executors = Number of nodes * Number of executors per node
                          = 10 * 3
                          = 30 executors per Spark job
49 .
What is the function of filter()?
The function of filter() is to create a new RDD by selecting the elements from the existing RDD that pass the function passed as an argument.
50 .
What is another method than "spark.cleaner.ttl" to trigger automated clean-ups in Spark?
Another method than "spark.cleaner.ttl" to trigger automated clean-ups in Spark is to divide the long-running jobs into different batches and write the intermediate results to disk.
Criteria | Spark Datasets | Spark Dataframes | Spark RDDs
Representation of Data | A combination of Dataframes and RDDs, with features like static type safety and an object-oriented interface. | A distributed collection of data organized into named columns. | A distributed collection of data without a schema.
Optimization | Uses the Catalyst optimizer. | Also uses the Catalyst optimizer. | No built-in optimization engine.
Schema Projection | Schema is found automatically using the SQL engine. | Schema is also found automatically. | Schema needs to be defined manually.
Aggregation Speed | Faster than RDDs but slower than Dataframes. | Fastest, thanks to easy and powerful aggregation APIs. | Slower than both Dataframes and Datasets, even for simple operations like grouping.
Spark Streaming involves dividing the data stream's data into batches of X seconds, called DStreams. DStreams let developers cache the data in memory, which is very useful when the data of a DStream is used for multiple computations. Caching can be done using the cache() method or the persist() method with an appropriate persistence level. The default persistence level for input streams receiving data over the network (such as from Kafka or Flume) replicates the data on two nodes to achieve fault tolerance.
 
Caching using the cache method :
val cacheDf = dframe.cache()

Caching using the persist method :
val persistDf = dframe.persist(StorageLevel.MEMORY_ONLY)
The main advantages of caching are :
 
Cost efficiency : Since Spark computations are expensive, caching enables data reuse, which leads to reused computations and saves the cost of operations.

Time efficiency : Reusing computations saves a lot of time.

More Jobs Achieved : By saving computation time, the worker nodes can perform/execute more jobs.
map() | flatMap()
Returns a new DStream by passing each element of the source DStream through a function func. | Similar to the map function : it applies to each element of the RDD and returns the result as a new RDD.
Takes one element as input, processes it according to custom code (specified by the developer), and returns exactly one element at a time. | Allows returning 0, 1, or more elements from the mapping function for each input element.
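A minimal sketch of the difference (the sentences are illustrative data) :
val lines = sc.parallelize(Seq("spark is fast", "hadoop is reliable"))
val upper = lines.map(_.toUpperCase)      // exactly one output element per input element
val words = lines.flatMap(_.split(" "))   // zero or more output elements per input element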
ML Algorithms : Classification, Regression, Clustering, and Collaborative filtering

Featurization : Feature extraction, Transformation, Dimensionality reduction, and Selection
 
Pipelines : Tools for constructing, evaluating, and tuning ML pipelines

Persistence : Saving and loading algorithms, models, and pipelines

Utilities : Linear algebra, statistics, and data handling