Cassandra Interview Questions

Prepare better with the best interview questions and answers, and walk away with top interview tips. These interview questions and answers will boost your core interview skills and help you perform better. Be smarter with every interview.

Beginner

There are mainly 4 types of NoSQL databases:

  • Document store types (MongoDB and CouchDB)
  • Key-value store types (Redis and Voldemort)
  • Column store types (Cassandra)
  • Graph store types (Neo4j and Giraph)

The main Cassandra configuration file is the cassandra.yaml file, which houses all the main options that control how Cassandra operates.
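
A few commonly tuned options, as an illustrative excerpt (the option names are real; the values are examples, and defaults vary by Cassandra version):

cluster_name: 'Test Cluster'          # must match on every node in the cluster
num_tokens: 256                       # number of virtual nodes assigned to this node
listen_address: 192.168.1.10          # address other nodes use to reach this node
endpoint_snitch: SimpleSnitch         # how Cassandra maps nodes to racks and datacenters
data_file_directories:
    - /var/lib/cassandra/data         # where SSTables are stored
commitlog_directory: /var/lib/cassandra/commitlog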

The Cassandra data model has four main components: cluster, keyspace, column family, and column. A cluster contains many nodes (machines) and can contain multiple keyspaces. A keyspace is a namespace that groups multiple column families, typically one keyspace per application. A column contains a name, a value, and a timestamp. A column family contains multiple columns referenced by a row key.

INSERT INTO formula1 (race_id, race_name, race_start_date, race_end_date) VALUES (101, 'championship2','2015-04-27', '2018-04-28') USING TTL 172800;
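
Here USING TTL 172800 gives the inserted values a time-to-live of 172,800 seconds (48 hours), after which they expire. The remaining time-to-live can be checked with the TTL() function, for example (assuming race_id is the partition key):

SELECT TTL(race_name) FROM formula1 WHERE race_id = 101;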

There are three types of such read requests:

  • A direct read request
  • A digest request
  • A background read repair request

In a direct read request, the coordinator node contacts one replica node. The coordinator then sends a digest request to a number of replicas determined by the consistency level specified by the client; the digest request checks that the data on each replica is up to date. After that, a digest request is sent to all remaining replicas, and if any replica node has out-of-date data, a background read repair request is sent. Read repair ensures that the requested row is made consistent on all replicas involved in the read query.

In a digest request, the coordinator first contacts the replicas specified by the consistency level. The coordinator sends these requests to the replicas that respond fastest at that time. The contacted nodes respond with a digest of the requested data; if multiple nodes are contacted, the rows from each replica are compared in memory for consistency. If they are not consistent, the coordinator uses the replica with the most recent data to forward the result back to the client. To ensure that all replicas have the most recent version of the data, read repair is then carried out to update out-of-date replicas.

A quorum is the number of nodes that must agree in order to reach consensus. The formula to determine the number of nodes needed for a quorum is:

NodesNeededForQuorum = ReplicationFactor / 2 + 1 (rounded down to a whole number)

The sum of all the replication_factor settings for each datacenter is the sum_of_replication_factors, given by:

sum_of_replication_factors = datacenter1_RF + datacenter2_RF + . . . + datacentern_RF
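
For example, in a single datacenter with a replication factor of 3, a quorum is 3 / 2 + 1 = 2 nodes (rounded down). In a cluster of two datacenters, each with a replication factor of 3, sum_of_replication_factors = 3 + 3 = 6, so a quorum is 6 / 2 + 1 = 4 nodes.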

An SSTable, also known as a 'Sorted String Table', is a file of key/value string pairs, sorted by key. SSTables are created when memtables are flushed to disk, and they exist for each Cassandra table. SSTables are immutable, i.e. they do not allow addition or removal of data items once written. For every SSTable, Cassandra creates three companion structures: a partition index, a partition summary, and a Bloom filter.

A memtable is a memory-resident data structure. After a write is recorded in the commit log, the data is written to the memtable. The memtable is an in-memory, write-back cache of content in key/column format. Data in the memtable is sorted by key, and each column family has its own memtable, from which column data is retrieved by key.

Cassandra provides the Cassandra Query Language shell, also known as cqlsh, with which one can execute CQL statements. In Cassandra, CQL collections can be used in the following ways (a short sketch follows the list):

  • List: used when the order of the data has to be maintained, or when a value needs to be stored multiple times.
  • Set: used for a group of elements that are stored and returned in sorted order.
  • Map: used to store key-value pairs of elements.
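
A minimal sketch of the three collection types (the table and column names here are hypothetical):

CREATE TABLE users (
user_id uuid,
emails list<text>,
tags set<text>,
preferences map<text, text>,
PRIMARY KEY (user_id)
);

UPDATE users SET emails = emails + ['alice@example.com'] WHERE user_id = 62c36092-82a1-3a00-93d1-46196ee77204;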

A Cassandra super column is a unique element consisting of similar collections of data: it is in effect a key-value pair whose value is a collection of columns. A super column is a sorted array of columns, and it follows the hierarchy keyspace > column family > super column > column when in action.

Similar to row keys, super column data entries contain no independent values but are used to collect other columns. It is interesting to note that super column keys appearing in different rows do not necessarily match.

Unlike relational databases, Cassandra does not support ACID transactions.

The CAP theorem states that it is impossible for a distributed computer system to simultaneously provide all three of Consistency, Availability, and Partition tolerance.

Cassandra is generally classified as an AP system, meaning that availability and partition tolerance are considered more important than consistency. However, Cassandra can be tuned with the replication factor and consistency level to also meet the C in CAP.

CREATE KEYSPACE "KeySpace Name"
WITH replication = {'class': 'Strategy name', 'replication_factor': 'No. of replicas'}
AND durable_writes = 'Boolean value';
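
A concrete example, using a hypothetical keyspace name:

CREATE KEYSPACE formula1_data
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
AND durable_writes = true;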

In Cassandra, commit log is a crash-recovery mechanism. Every write operation is written to the commit log.

Cassandra is a NoSQL database and does not provide ACID or relational data properties. If you have a strong requirement for ACID properties (for example, financial data), Cassandra is not a good fit. You could make it work, but you would end up writing a lot of application code to handle ACID properties and would lose badly on time to market. Managing that kind of system with Cassandra would also be complex and tedious.

First, Cassandra writes data to the commit log, then to a memtable; a write is successful once it is recorded in both. Memtables and SSTables are maintained per column family. Memtables are flushed to disk in a table structure called an SSTable. In the event of a fault before a memtable is flushed to an SSTable, Cassandra simply replays the commit log. Because of this write path, the commit log is append-only and Cassandra performs no reads or seeks before writing, so it has very low disk I/O and offers high-speed write performance.

Cassandra exposes a number of statistics and management operations via Java Management Extensions (JMX). JMX is a Java technology that supplies tools for managing and monitoring Java applications and services. Any statistic or operation that a Java application has exposed as an MBean can then be monitored or manipulated using JMX.

Cassandra offers several solutions for migrating from other databases:

  • The COPY command, which mirrors what the PostgreSQL RDBMS uses for file import/export.
  • The Cassandra bulk loader provides the ability to bulk load external data into a cluster.

If you need more sophistication applied to a data movement situation (more than just extract-load), then you can use any number of extract-transform-load (ETL) solutions that now support Cassandra.
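
For instance, a minimal sketch of a CSV import with the cqlsh COPY command (the table, columns, and file name are hypothetical):

COPY formula1.race_results (race_id, driver, points) FROM 'race_results.csv' WITH HEADER = true;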

Cassandra backs up data by taking a snapshot of all on-disk data files (SSTable files) stored in the data directory. You can take a snapshot of all keyspaces, a single keyspace, or a single table while the system is online.

Using a parallel ssh tool (such as pssh), you can snapshot an entire cluster. This provides an eventually consistent backup. Although no one node is guaranteed to be consistent with its replica nodes at the time a snapshot is taken, a restored snapshot resumes consistency using Cassandra's built-in consistency mechanisms.

The nodetool utility is a command line interface for managing a cluster.
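
A few commonly used commands (the tag and keyspace names below are hypothetical):

nodetool status                          # show the state, load, and tokens of each node
nodetool repair                          # repair inconsistent replicas
nodetool snapshot -t backup1 formula1    # snapshot the formula1 keyspace with tag 'backup1'
nodetool compact                         # force a major compaction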

CREATE TABLE WebLogs (
webpage_id uuid,
webpage_name Text,
insert_time timestamp,
webpage_count counter,
PRIMARY KEY (webpage_id, insert_time)
);

You will get the below error:

ERROR: InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot mix counter and non counter columns in the same table"

Counter columns cannot coexist with non-counter columns, so every other column in the table must be part of the primary key (a partition or clustering column).

CREATE TABLE WebLogs (
webpage_id uuid,
webpage_name Text,
insert_time timestamp,
webpage_count counter,
PRIMARY KEY ((webpage_id, webpage_name), insert_time)
);

Advanced

In a multi-node cluster, Cassandra can store replicas of the same data on two or more nodes. This helps prevent data loss but at the same time complicates the delete process. If a node receives a delete for data it stores locally, the node tombstones the specified record and tries to pass the tombstone to other nodes containing replicas of that record. If any replica node is unresponsive at that time and does not receive the tombstone immediately, it will still contain the pre-delete version of the record. If the tombstoned record has already been deleted from the rest of the cluster before that node recovers, Cassandra treats the record on the recovered node as new data and later propagates it to the rest of the cluster. This 'deleted but persistent' record is called a zombie.

To prevent the reappearance of zombies, the database gives each tombstone a grace period. The purpose of the grace period is to give unresponsive nodes time to recover and process tombstones normally. When multiple replicas answer part of a read request and the responses differ, whichever value is most recent takes precedence.

If one node has a tombstone and another node has only an older value for the record, the final record will have the tombstone. If a client writes a new update to the tombstoned record during the grace period, the database overwrites the tombstone.

When an unresponsive node recovers, Cassandra uses hinted handoff to replay the database mutations the node missed while it was down. If the node does not recover within the grace period, Cassandra may miss the deletion.

After the tombstone's grace period ends, Cassandra deletes the tombstone during the process of compaction.
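
The grace period is controlled per table by the gc_grace_seconds option, which defaults to 864000 seconds (10 days). A minimal sketch, assuming a hypothetical table:

ALTER TABLE formula1.race_results WITH gc_grace_seconds = 864000;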

a) Replication takes the same data and copies it over multiple nodes; sharding puts different data on different nodes.

b) Sharding is particularly valuable for performance because it can improve both read and write performance. Replication, particularly with caching, can greatly improve read performance but does little for applications that have a lot of writes. Sharding provides a way to horizontally scale writes.

cqlsh> CREATE OR REPLACE FUNCTION function_log (input double) CALLED ON NULL INPUT RETURNS double LANGUAGE java AS 'return Double.valueOf(Math.log(input.doubleValue()));';

CALLED ON NULL INPUT ensures the function will always be executed. RETURNS NULL ON NULL INPUT ensures the function will always return NULL if any of the input arguments is NULL. RETURNS defines the data type of the value returned by the function.
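
Note that user-defined functions are disabled by default and must be enabled in cassandra.yaml before the CREATE FUNCTION statement above will succeed. A hedged usage sketch, assuming a hypothetical table sales with a double column amount:

SELECT id, function_log(amount) FROM sales;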

Strong consistency can be guaranteed when the following condition is true:

R + W > N

where

  • R is the consistency level of read operations
  • W is the consistency level of write operations
  • N is the number of replicas

If the replication factor is 3, then the consistency level of the reads and writes combined must be at least 4. For example, read operations using 2 out of 3 replicas to verify the value, and write operations using 2 out of 3 replicas to verify the value will result in strong consistency. If fast write operations are required, but strong consistency is still desired, the write consistency level is lowered to 1, but now read operations have to verify a matched value on all 3 replicas. Writes will be fast, but reads will be slower.

Eventual consistency occurs if the following condition is true:
R + W <= N
where

  • R is the consistency level of read operations
  • W is the consistency level of write operations
  • N is the number of replicas

If the replication factor is 3, then the consistency level of the reads and writes combined are 3 or less. For example, read operations using QUORUM (2 out of 3 replicas) to verify the value, and write operations using ONE (1 out of 3 replicas) to do fast writes will result in eventual consistency. All replicas will receive the data, but read operations are more vulnerable to selecting data before all replicas write the data.

In Cassandra, compaction refers to the operation of merging multiple SSTables into a single new one. It mainly deals with the following:

  • Merge keys
  • Combine columns
  • Discard tombstones

Compaction is done for two purposes.

  • To bound the number of SSTables to consult on reads.
  • To reclaim space taken by obsolete data in SSTables.

After compaction, the old SSTables will be marked as obsolete. These SSTables are deleted asynchronously when JVM performs a GC, or when Cassandra restarts, whichever happens first. It is also possible to force a deletion from JConsole.
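
Compaction normally runs automatically in the background, but it can also be forced from the command line, for example (keyspace and table names are hypothetical):

nodetool compact formula1 race_results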

Paxos is a technique for achieving consensus on a single value over unreliable communication channels.

The role of Paxos in distributed systems is similar to that of compare-and-swap in concurrent, single-machine systems. The algorithm defines a peer-to-peer consensus protocol based on simple majority rule, capable of ensuring that one and only one resulting value can be achieved. No peer's suggestion is more or less valid than any other's, and all peers are allowed to make them. Unlike some other consensus protocols, there is no concept of a dedicated leader in Paxos. Any peer may make a suggestion and lead the effort in achieving resolution, but other peers are free to do the same and may even override the efforts of their neighbors. Eventually, though, a majority of peers will agree upon a suggestion, and the value associated with that suggestion becomes the final solution.

The Paxos protocol is implemented as a series of phases in Cassandra (see the sketch after this list):

  1. Prepare/Promise
  2. Read/Results
  3. Propose/Accept
  4. Commit/Acknowledge
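
In Cassandra, Paxos underpins lightweight transactions: compare-and-set operations expressed with IF clauses in CQL. A minimal sketch, using a hypothetical users table:

INSERT INTO users (user_id, email) VALUES (uuid(), 'alice@example.com') IF NOT EXISTS;
UPDATE users SET email = 'alice@new.example.com' WHERE user_id = 62c36092-82a1-3a00-93d1-46196ee77204 IF email = 'alice@example.com';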

Secondary indexes are indexes built over column values. In other words, say there is a user table that contains a user's email; the primary index would be the user ID. To access a particular user's email, one can look them up by ID. However, the inverse query, fetching the user ID given an email, requires a secondary index. Secondary indexes are used when you want to query on a column that isn't the primary key and isn't part of a composite key, or, say, on a column that has few unique values.
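
For example, a minimal sketch (the table, index, and email value are hypothetical):

CREATE INDEX users_email_idx ON users (email);
SELECT user_id FROM users WHERE email = 'alice@example.com';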

Use get_range_slices. You can start the iteration with an empty start key, and after each iteration the last key read serves as the start key for the next iteration.
