Cassandra Interview Questions
Apache Cassandra is one of the most widely used open-source NoSQL distributed database management systems. It is designed to store and manage large volumes of data while tolerating node failures.

Originally designed at Facebook, Apache Cassandra is written in Java, scales well for Big Data workloads, and supports flexible schemas. It has no single point of failure.

There are various types of NoSQL databases, and Cassandra is a hybrid of a column-oriented and a key–value store. The keyspace is the outermost container for an application, and the table (or column family) is an entity within the keyspace.
Apache Cassandra delivers near real-time performance simplifying the work of Developers, Administrators, Data Analysts, and Software Engineers.
 
* Instead of a master–slave architecture, Cassandra is built on a peer-to-peer architecture, so there is no single point of failure.

* It also offers great flexibility, as nodes can be added to any Cassandra cluster in any data center. Further, any client can send its request to any server.

* Cassandra scales elastically and can easily be scaled up or down as requirements change. With high throughput for read and write operations, it does not need to be restarted while scaling.

* Cassandra is also known for strong data replication: data is stored on multiple nodes, so users can retrieve it from another node if one fails. Users can configure the number of replicas they want.

* It performs well on massive datasets, which makes it a preferred NoSQL database for many organizations.

* It operates on a column-oriented structure, which speeds up and simplifies slicing. Data access and retrieval also become more efficient with a column-based data model.

* Further, Apache Cassandra supports a schema-free/schema-optional data model, so you do not need to declare in advance all the columns required by your application.
Cassandra has the following features :

* High Scalability
* High fault tolerance
* Flexible Data storage
* Easy data distribution
* Tunable Consistency
* Efficient Writes
* Cassandra Query Language
There are four types of NoSQL databases :

* Document store types (MongoDB and CouchDB)
* Key-Value store types (Redis and Voldemort)
* Column store types (Cassandra)
* Graph store types (Neo4j and Giraph)
NoSQL refers to non-relational databases. It is also read as "Not only SQL". It provides a mechanism to store and retrieve different types of data, including images, sounds, etc.
The features of NoSQL databases are :
 
* Schema Agnostic
* AutoSharding and Elasticity
* Highly Distributable
* Easily Scalable
* Integrated Caching
The original authors of Cassandra are Avinash Lakshman and Prashant Malik. It was initially developed at Facebook to power the Facebook inbox search feature.
These are some key components of Cassandra data model :
 
* Table
* Column : It consists of a column name, value and timestamp
* Column family : This refers to multiple columns with row key reference.

* Cluster : These are made up of multiple nodes and keyspaces
* Keyspace : It is a namespace to group multiple column families, typically one per application.
* Node : A node is a single machine running Cassandra
 
 
Some other components of Cassandra are :
 
* Node
* Data Center
* Commit log
* Mem-table
* SSTable
* Bloom Filter
In Cassandra, composite keys are used to define a key or a column name as a concatenation of data of different types. There are two types of composite keys in Cassandra :
 
* Row Key
* Column Name
Data replication is an electronic copying of data from a database in one computer or server to a database in another so that all users can share the same level of information. Cassandra stores replicas on multiple nodes to ensure reliability and fault tolerance. The replication strategy decides the nodes where replicas are placed.
A mem-table is a memory-resident data structure. After the commit log, data is written to the mem-table. The mem-table is an in-memory, write-back cache holding content in key and column format. Data in the mem-table is sorted by key, and each column family has its own mem-table, from which column data is retrieved by key. It stores writes until it is full and is then flushed to disk.
SSTable, or ‘Sorted String Table’, is an important data file in Cassandra. Memtables are regularly flushed to disk as SSTables, which exist for each Cassandra table. Being immutable, SSTables do not allow any addition or removal of data items once written. For each SSTable, Cassandra also creates three separate structures: a partition index, a partition summary, and a Bloom filter.
A Bloom filter is an off-heap data structure used to check whether an SSTable might contain the requested data before performing any disk I/O.
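
The Bloom filter idea can be illustrated with a small Python sketch. This is a simplified model, not Cassandra's implementation: the class name, the bit-array size, and the salted SHA-256 hashing are all assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, key):
        # Derive several bit positions by salting one hash with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means "definitely absent", so the disk read can be skipped.
        return all(self.bits[pos] for pos in self._positions(key))


bloom = BloomFilter()
for partition_key in ("user:1", "user:2", "user:3"):
    bloom.add(partition_key)
```

A `False` answer from `might_contain` is always reliable, while a `True` answer only means the SSTable has to be checked.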
cqlsh is a command-line shell that lets users communicate with the Cassandra database using the Cassandra Query Language (CQL). Using cqlsh, you can do the following :
 
* Define a schema
* Insert data, and
* Execute a query
A keyspace is an outermost container for data in Cassandra. Like a relational database, a keyspace has a name and a set of attributes that define keyspace-wide behavior. The keyspace is used to group Column families together.

A Cassandra keyspace supports three operations :
 
* Create keyspace
* Alter keyspace
* Drop keyspace
* The Drop table command drops the specified table including all the data from the keyspace.

* The Truncate table command permanently deletes all the rows of the table but keeps the table itself.
* A Static Table uses a relatively static set of column names and is similar to Relational Database Table.

* A dynamic table allows you to pre-compute result sets and stores them in a single row for efficient data retrieval.
* Apache Hadoop, File Storage, Grid Compute processing via Map Reduce.

* Apache Hive, SQL like interface on top of Hadoop.

* Apache Hbase, Column Family Storage built like BigTable

* Apache Cassandra, Column Family Storage built like BigTable with Dynamo topology and consistency.
A seed node in Cassandra is a node that other nodes contact when they first start up and join the cluster. A cluster can have multiple seed nodes. Seed nodes help bootstrap a new node joining the cluster. Two seed nodes per data center are recommended.
* Database replication is the frequent electronic copying of data from a database in one computer or server to a database in another so that all users share the same level of information.

* Cassandra stores replicas on multiple nodes to ensure reliability and fault tolerance. A replication strategy determines the nodes where replicas are placed. The total number of replicas across the cluster is referred to as the replication factor. A replication factor of 1 means that there is only one copy of each row on one node. A replication factor of 2 means two copies of each row, where each copy is on a different node. All replicas are equally important; there is no primary or master replica. As a general rule, the replication factor should not exceed the number of nodes in the cluster. However, you can increase the replication factor and then add the desired number of nodes later.
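
As a rough illustration of how a replication strategy might pick replica nodes, here is a minimal Python sketch that walks a token ring clockwise. The node names, the `token_for` hash, and the ring size of 1000 are hypothetical; real strategies also take racks and data centers into account.

```python
import bisect
import hashlib


def token_for(value):
    # Hypothetical token function: a stable hash mapped onto a small ring.
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16) % 1000


def replicas_for(key, ring, replication_factor):
    """ring: sorted list of (token, node_name). Returns replica nodes for key."""
    if replication_factor > len(ring):
        raise ValueError("replication factor exceeds number of nodes")
    tokens = [t for t, _ in ring]
    # First node at or after the key's token, wrapping around the ring.
    start = bisect.bisect_left(tokens, token_for(key)) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(replication_factor)]


ring = sorted((token_for(name), name) for name in ("node-a", "node-b", "node-c", "node-d"))
owners = replicas_for("user:42", ring, replication_factor=2)
```

With a replication factor of 2, `owners` holds two different nodes, and asking for more replicas than there are nodes raises an error, which mirrors the general rule above.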
In early releases, Cassandra by default used port 7000 for cluster communication, 9160 for clients (Thrift), and 8080 for JMX. Later releases use 7199 for JMX and 9042 for CQL native protocol clients. These are all editable in the configuration file or bin/cassandra.in.sh (for JVM options). All ports are TCP.
Cassandra is a NoSQL database and does not provide full ACID transactions or relational data properties. If you have a strong requirement for ACID properties (for example, financial data), Cassandra is not a good fit. You can make it work, but you will end up writing a lot of application code to handle ACID properties and will lose time to market. Managing that kind of system with Cassandra would also be complex and tedious.
Anti-entropy is the replica synchronization mechanism that ensures data on different nodes is updated to the newest version.
Cassandra uses a Merkle tree for anti-entropy repair. A Merkle tree is a hash tree where leaves are hashes of the values of individual keys.
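
To make the comparison step concrete, the following Python sketch builds a Merkle root over a replica's key/value pairs and compares two replicas. It follows the description above (leaves are hashes of individual keys' values); the helper names and the choice of SHA-256 are assumptions, and this is not Cassandra's actual repair code.

```python
import hashlib


def _h(data):
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


def merkle_root(rows):
    """rows: dict of key -> value. Leaves are hashes of individual key/value pairs."""
    level = [_h(f"{k}={rows[k]}") for k in sorted(rows)]
    if not level:
        return _h("")
    while len(level) > 1:
        if len(level) % 2:  # duplicate the last hash on odd-sized levels
            level.append(level[-1])
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


replica_a = {"k1": "v1", "k2": "v2", "k3": "v3"}
replica_b = {"k1": "v1", "k2": "stale", "k3": "v3"}
needs_repair = merkle_root(replica_a) != merkle_root(replica_b)
```

Two replicas with identical data produce the same root, so only a single hash needs to be exchanged to detect that a repair is required.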
* ALL : All levels including custom levels

* TRACE : Designates finer-grained informational events than the DEBUG

* DEBUG : Designates fine-grained informational events that are most useful to debug an application

* INFO : Designates informational messages that highlight the progress at a coarse-grained level

* WARN : Designates potentially harmful situations

* ERROR : Designates error events that might still allow the application to continue running

* OFF : The highest possible rank and is intended to turn off logging
Secondary indexes are indexes built over column values. For example, say you have a user table that contains each user’s email. The primary index would be the user ID, so if you wanted to access a particular user’s email, you could look them up by their ID. However, solving the inverse query (given an email, fetch the user ID) requires a secondary index.
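
The inverse lookup can be modelled in a few lines of Python. This is only a conceptual sketch using in-memory dictionaries; the sample users and the `build_secondary_index` helper are hypothetical.

```python
from collections import defaultdict

# Primary index: user ID -> row (hypothetical sample data).
users = {
    1: {"email": "ann@example.com", "town": "Leeds"},
    2: {"email": "bob@example.com", "town": "York"},
    3: {"email": "cat@example.com", "town": "Leeds"},
}


def build_secondary_index(rows, column):
    """Map each value of `column` back to the primary keys that hold it."""
    index = defaultdict(list)
    for primary_key, row in rows.items():
        index[row[column]].append(primary_key)
    return index


email_index = build_secondary_index(users, "email")
town_index = build_secondary_index(users, "town")
```

Here `email_index["bob@example.com"]` yields the matching user ID without scanning every row, and `town_index["Leeds"]` returns every user sharing that value.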
Cassandra SuperColumn is a unique element consisting of similar collections of data. They are actually key-value pairs with values as columns. It is a sorted array of columns, and they follow a hierarchy when in action.
Thrift is the name of the Remote Procedure Call (RPC) client used to communicate with the Cassandra server.
There are three collection data types :
 
* List : A list is a collection of one or more ordered elements.
 
* Map : A map is a collection of key-value pairs.
 
* Set : A set is a collection of one or more unique elements.
Syntax for creating keyspace in Cassandra is
 
CREATE KEYSPACE <identifier> WITH <properties>
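
As one possible concrete form of this syntax, the Python sketch below builds a CREATE KEYSPACE statement string using the SimpleStrategy replication class. The helper name, the keyspace name `demo_ks`, and the identifier check are assumptions for illustration; the string would still need to be run through cqlsh or a driver.

```python
def create_keyspace_cql(name, replication_factor=1, strategy="SimpleStrategy"):
    """Build a CREATE KEYSPACE statement as a string (no cluster needed)."""
    # Rough sanity check only; CQL's own identifier rules are stricter.
    if not name.isidentifier():
        raise ValueError(f"invalid keyspace name: {name!r}")
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {name} "
        f"WITH replication = {{'class': '{strategy}', "
        f"'replication_factor': {int(replication_factor)}}};"
    )


statement = create_keyspace_cql("demo_ks", replication_factor=3)
```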

Cassandra supports two consistency models :

    * Eventual Consistency
    * Strong Consistency.
 
* Eventual consistency means that when no new updates are made to a given data item, all accesses eventually return the last updated value. Systems with eventual consistency are said to have achieved replica convergence.
 
* Cassandra supports the following conditions for strong consistency:
 
R + W > N
 
Here
 
N: Number of replicas
 
W: Number of nodes that need to agree for a successful write
 
R: Number of nodes that need to agree for a successful read
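
The condition can be checked with a short Python sketch. The function names are hypothetical, and `quorum` assumes the usual majority definition.

```python
def is_strongly_consistent(n, w, r):
    """True when read and write replica sets must overlap (R + W > N)."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("W and R must be between 1 and N")
    return r + w > n


def quorum(n):
    # A quorum is a majority of replicas.
    return n // 2 + 1


n = 3
quorum_both = is_strongly_consistent(n, quorum(n), quorum(n))  # 2 + 2 > 3
one_and_one = is_strongly_consistent(n, 1, 1)                  # 1 + 1 <= 3
```

With three replicas, quorum reads and quorum writes satisfy the condition, while reading and writing at a single replica does not.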
Tunable Consistency is a phenomenal characteristic of Cassandra which makes it a popular choice. Consistency refers to the up-to-date and synchronized data rows on all their replicas. Cassandra's Tunable Consistency facilitates users to select the consistency level best suited for their use cases.
DataStax OpsCenter : It is a web-based management and monitoring solution for Cassandra clusters and DataStax. It is free to download and includes an additional Edition of OpsCenter.
 
SPM : SPM primarily monitors Cassandra metrics along with various OS and JVM metrics. It also monitors Hadoop, Spark, Solr, Storm, ZooKeeper and other Big Data platforms besides Cassandra.
The main features of SPM are :
 
* Correlation of events and metrics
* Distributed transaction tracing
* Creating real-time graphs with zooming
* Detection and heartbeat alerting
* Node : A node is a single machine running Cassandra.
 
* Cluster : A cluster is a collection of nodes that contains similar types of data together.
 
* Datacenter : A datacenter is a useful component when serving customers in different geographical areas. Different nodes of a cluster can be grouped into different data centers.
Hadoop, HBase, Hive and Cassandra all are Apache products.
 
Apache Hadoop supports file storage and grid compute processing via MapReduce. Apache Hive is a SQL-like interface on top of Hadoop. Apache HBase follows column family storage built like BigTable. Apache Cassandra also follows column family storage built like BigTable, with Dynamo topology and consistency.
Although Cassandra comes with built-in fault tolerance, it still needs to be monitored for effective results. Here are some tools used to monitor Cassandra databases :
 
* SolarWinds Server & Application Monitor
* Instana
* Instaclustr
* AppDynamics
* Dynatrace
* ManageEngine Applications Manager
These operations are used to make changes in the Cassandra database.
 
CRUD stands for :
 
* Create operation
* Read operation
* Update operation and
* Delete/drop operation.
A secondary index is useful when you want to query on a column that isn't the primary key and isn't part of a composite key, and when that column has few unique values. For example, a Town column is a good choice for secondary indexing because lots of people will be from the same town; date of birth, however, is not such a good choice.
Cassandra Query Language (CQL) is the query language for the Cassandra database. It is the interface through which a user accesses the database and carries out all operations.
These are the advantages of Cassandra :
 
* Since data can be replicated to several nodes, Cassandra is fault tolerant.
 
* Cassandra can handle a large set of data.
 
* Cassandra provides high scalability.
The main objective of Cassandra is to handle a large amount of data. Furthermore, the objective also ensures fault tolerance with the swift transfer of data.
Compaction is a maintenance process in Cassandra in which SSTables are reorganized to optimize the data structures on disk. It keeps the number of SSTables produced by memtable flushes under control. There are two types of compaction in Cassandra :
 
* Minor compaction : It gets started automatically when a new SSTable is created. Here, Cassandra condenses all the equally sized SSTables into one.
 
* Major compaction : It is triggered manually using the nodetool. It compacts all SSTables of a column family into one.
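
The merge at the heart of compaction can be sketched in Python as follows. The model is an assumption made for illustration: each SSTable is a dictionary mapping a key to a `(timestamp, value)` pair, and the newest timestamp wins.

```python
def compact(sstables):
    """Merge SSTables into one, keeping the newest version of each key.

    Each SSTable is modelled as a dict: key -> (timestamp, value).
    """
    merged = {}
    for table in sstables:
        for key, (timestamp, value) in table.items():
            if key not in merged or timestamp > merged[key][0]:
                merged[key] = (timestamp, value)
    # SSTables are sorted by key, so emit the result in key order.
    return dict(sorted(merged.items()))


older = {"a": (1, "apple"), "b": (1, "banana")}
newer = {"b": (2, "blueberry"), "c": (2, "cherry")}
compacted = compact([older, newer])
```

The inputs are left untouched and a new merged table is produced, which fits the immutability of SSTables.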
Both Column and Super Column are tuples with a name and a value. However, a Column’s value is a string, while a Super Column’s value is a map of columns with different data types.
 
Unlike Columns, Super Columns do not contain the third component of timestamp.
A tombstone is a marker indicating that a column has been deleted. Marked columns are removed during compaction. Tombstones are significant because Cassandra is eventually consistent: the deletion marker has to propagate to the other replicas before the data itself can be removed.
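The replication factor is the number of copies of each piece of data that exist in the cluster. When authentication is enabled, increasing the replication factor of the system_auth keyspace helps ensure that users can still log into the cluster if a node is down.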
All rows can be iterated using get_range_slices. Start the iteration with an empty string as the start key; after each iteration, the last key read serves as the start key for the next iteration.
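
The paging pattern can be sketched in Python against an in-memory stand-in. `fake_get_range_slices` is a hypothetical function, not the real Thrift call, and the sketch assumes the start key is inclusive, so each later page repeats the last row of the previous one.

```python
import bisect

# Hypothetical stand-in for a column family: row keys kept in sorted order.
ROWS = {f"row{i:02d}": {"n": i} for i in range(7)}
SORTED_KEYS = sorted(ROWS)


def fake_get_range_slices(start_key, count):
    """Return up to `count` (key, row) pairs with key >= start_key."""
    begin = bisect.bisect_left(SORTED_KEYS, start_key)
    return [(k, ROWS[k]) for k in SORTED_KEYS[begin:begin + count]]


def iterate_all_rows(page_size=3):
    if page_size < 2:
        raise ValueError("page_size must be at least 2")
    start_key = ""  # empty string starts at the beginning
    first_page = True
    while True:
        page = fake_get_range_slices(start_key, page_size)
        # The start key is inclusive, so later pages repeat the previous last row.
        fresh = page if first_page else page[1:]
        yield from fresh
        if len(page) < page_size:
            return
        start_key = page[-1][0]
        first_page = False


all_keys = [key for key, _ in iterate_all_rows()]
```

Skipping the repeated first row on every page after the first keeps each row from being returned twice.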
SSTables are immutable, so a row cannot be removed from them directly. When a row needs to be deleted, Cassandra assigns the column value a special value called a tombstone. When the data is read, a value marked with a tombstone is treated as deleted.
There are various cqlsh shell commands in Cassandra. The CAPTURE command captures the output of a command and adds it to a file, while the CONSISTENCY command displays the current consistency level or sets a new one.
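
A delete implemented as a tombstone write can be modelled in Python like this. The `TOMBSTONE` sentinel, the `(timestamp, value)` layout, and the function names are assumptions for illustration only.

```python
TOMBSTONE = object()  # sentinel standing in for Cassandra's deletion marker


def read(key, sstables):
    """Return the newest value for key, or None if absent or deleted."""
    newest = None
    for table in sstables:
        if key in table and (newest is None or table[key][0] > newest[0]):
            newest = table[key]
    if newest is None or newest[1] is TOMBSTONE:
        return None
    return newest[1]


def delete(key, timestamp):
    # A delete writes a new immutable SSTable entry rather than editing old ones.
    return {key: (timestamp, TOMBSTONE)}


sstables = [{"user:1": (1, "ann"), "user:2": (1, "bob")}]
sstables.append(delete("user:1", timestamp=2))
```

The original entry for `user:1` is still present in the first SSTable, but reads treat the key as deleted because the tombstone carries a newer timestamp.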
* Cassandra appends changed data to the commit log
* The commit log acts as a crash recovery log for data
* A write operation is never considered successful until the changed data has been appended to the commit log

Data will not be lost once the commit log is flushed out to file.
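
The write path described above (commit log first, then mem-table, then a flush to an SSTable) can be sketched as a toy Python class. The `Node` class, its `memtable_limit` parameter, and the in-memory list used as a commit log are all hypothetical simplifications; a real commit log is a file on disk.

```python
class Node:
    """Toy write path: commit log first, then memtable, then flush to an SSTable."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []  # append-only, used for crash recovery
        self.memtable = {}    # in-memory, sorted on flush
        self.sstables = []    # immutable once written
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. append to the commit log
        self.memtable[key] = value            # 2. in-memory update
        if len(self.memtable) >= self.memtable_limit:
            self.flush()
        return True  # acknowledged only after step 1

    def flush(self):
        # Freeze the memtable as a sorted, read-only SSTable.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs replay

    def recover(self):
        # After a crash the memtable is gone; rebuild it from the commit log.
        self.memtable = dict(self.commit_log)


node = Node()
node.write("k1", "v1")
node.memtable = {}  # simulate a crash that loses in-memory state
node.recover()
```

After `recover()`, the write that had only reached the commit log and the mem-table is available again.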