Scrutiny: Cassandra Interview Questions And Answers

What is NoSQL Database?

A NoSQL database is sometimes called "Not Only SQL". It is a database that provides a mechanism to store and retrieve data modeled in ways other than the tabular relations used in relational databases.
These databases are typically schema-free, support easy replication, have a simple API, are often eventually consistent, and can handle huge amounts of data.
The primary objectives of a NoSQL database are simplicity of design, horizontal scaling, and finer control over availability.
SQL was designed to be a query language for relational databases, and relational databases are usually table-based, much like what we see in a spreadsheet. In a relational database, records are stored as rows, and the columns represent the fields in each row. SQL allows us to query within and between tables in that relational database.
NoSQL databases, on the other hand, are more flexible: they allow us to define fields as we create a record.
Nested values are common in NoSQL databases. We can have hashes, arrays and objects, and then nest more objects, arrays and hashes within those.
Also, fields are not standardized between records in NoSQL databases: we can have a different structure for every record in our NoSQL database.

What is the difference between a NoSQL database and a relational database?

A relational database supports a powerful query language, whereas a NoSQL database usually supports a much simpler query language.
A relational database has a fixed schema. A NoSQL database has no fixed schema.
A relational database follows ACID (Atomicity, Consistency, Isolation, and Durability). A NoSQL database, on the other hand, is often only 'eventually consistent'.
A relational database supports transactions, whereas a NoSQL database typically offers limited or no support for multi-record transactions.

Name the different types of NoSQL database.

There are four types of NoSQL database:

Key Value Store type database
Document Store type database
Column Store type database
Graph Database

What are the key features of any NoSQL database?

Schema Agnostic
Auto-Sharding and Elasticity
Highly Distributable
Easily Scalable
Integrated Caching

Why is Cassandra an eventually consistent database?

Another big difference from relational databases is that Cassandra is an eventually consistent database. That means there may be times, usually quite brief, when replicas of a row hold different versions of the data.
This is because Cassandra keeps multiple copies of data on different nodes. In case a node fails, users can still get their data from a replica on another node.
Even for databases designed for fast operations, it can take some period of time before all copies are updated. In that case, a user might read an old version of data.
This difference in the copies of data is known as an inconsistency. Eventually, the inconsistency will be corrected.

Define commit log, Memtable and SSTable.

Every write operation is first appended to the commit log. The commit log is a recovery mechanism: because every write is saved in it, the data can be recovered if the database crashes.
After the data is written to the commit log, it is written to the memtable. A memtable is an in-memory structure that holds recent writes in key and column format. Data lives in the memtable only temporarily.
When the memtable reaches a certain threshold, its data is flushed to an SSTable file on disk. SSTable stands for Sorted Strings Table; it is an immutable data file to which memtables are regularly flushed.
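
The write path described above can be sketched as a toy model in Python. This is only an illustration of the ordering (commit log, then memtable, then flush to an SSTable), not how Cassandra is implemented. The class name ToyNode and the memtable_limit parameter are made up for this example.

```python
class ToyNode:
    """Toy model of the write path: commit log -> memtable -> SSTable."""

    def __init__(self, memtable_limit=3):
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # in-memory buffer of recent writes
        self.sstables = []     # each flush produces one sorted, never-modified list
        self.memtable_limit = memtable_limit  # hypothetical flush threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. log first, for crash recovery
        self.memtable[key] = value            # 2. then update the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                      # 3. flush once the threshold is hit

    def flush(self):
        # Write the memtable out sorted by key, then start a fresh memtable.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest SSTable first
            for k, v in table:
                if k == key:
                    return v
        return None


node = ToyNode(memtable_limit=2)
node.write("b", 1)
node.write("a", 2)  # second write reaches the limit and triggers a flush
node.write("c", 3)  # lands in the new, empty memtable
```

After these three writes the toy node holds one SSTable with the keys a and b in sorted order, and c is still only in the memtable and the commit log.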

Define Node.
A node represents a system that is a part of a cluster. It is the main area in which data is stored.

Define a column family.
A keyspace contains one or more column families. A column family is what CQL calls a table: it groups the rows of one application-specific table and defines the columns those rows can have.

What is CQL?
Cassandra has a query language, called CQL, which stands for Cassandra Query Language. It is similar to SQL, but has more restrictions.

Define data replication and replication factor.

Data replication is an operation in which data from one node is copied to different nodes in the cluster.
This operation ensures redundancy and fault tolerance in the database.
The replication factor decides the number of copies, and the replication strategy decides the nodes to which the data is copied.

Define replication strategy.

A replication strategy defines how replicas are placed in a cluster.
There are two main replication strategies: SimpleStrategy and NetworkTopologyStrategy.
SimpleStrategy is used when you have just one data center. It places the first replica on the node selected by the partitioner. The remaining replicas are then placed on the next nodes in the clockwise direction around the node ring.
NetworkTopologyStrategy is used when you have more than one data center. With NetworkTopologyStrategy, the number of replicas is set for each data center separately. Within a data center, it places replicas by walking the ring clockwise until it reaches the first node in another rack, so that replicas end up on different racks in the same data center. The reason is that a failure can take out a whole rack; replicas on nodes in other racks can then still serve the data.
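
To make the SimpleStrategy placement concrete, here is a minimal Python sketch. It assumes a made-up token function (an MD5 hash of the name) and one token per node, which is a simplification; the function names token and simple_strategy_replicas are hypothetical and not part of any Cassandra API.

```python
import hashlib
from bisect import bisect_left


def token(value):
    """Hypothetical token function: the MD5 hash of the string, as an integer."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


def simple_strategy_replicas(partition_key, nodes, replication_factor):
    """Pick the first replica by token, then walk the ring clockwise."""
    ring = sorted((token(n), n) for n in nodes)  # nodes ordered by their token
    tokens = [t for t, _ in ring]
    # First node whose token is >= the key's token, wrapping around the ring.
    start = bisect_left(tokens, token(partition_key)) % len(ring)
    count = min(replication_factor, len(ring))   # cannot have more replicas than nodes
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


replicas = simple_strategy_replicas("emp-1", ["n1", "n2", "n3", "n4"], 3)
print(replicas)  # three distinct nodes that are neighbours on the ring
```

The sketch ignores racks and data centers entirely, which is exactly the part NetworkTopologyStrategy adds on top.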

Define Consistency level and how write operation works in Cassandra.

When a write request reaches a node, it is first logged in the commit log. Then Cassandra writes the data to the memtable.
Every write therefore lands in both places: the memtable and, separately, the commit log.
The memtable holds the data temporarily in memory, while the commit log records each write for recovery purposes.
When the memtable is full, its data is flushed to an SSTable data file.
The consistency level determines how many replicas must respond with a success acknowledgment before the write is considered successful.
The coordinator sends the write request to the replicas. If all the replicas are up, they all receive the write request regardless of the consistency level. A replica responds with a success acknowledgment once the data has been written to its commit log and memtable.
For example, in a single data center with a replication factor of seven, seven replicas receive the write request. If the consistency level is three, the coordinator waits for only three success acknowledgments; the remaining four replicas still apply the write, but the coordinator does not wait for them. If any of those four replicas miss the write because a node is down or has some other problem, Cassandra's built-in repair mechanisms will make the row consistent again.
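
The relationship between replication factor and consistency level in the example above can be sketched in a few lines of Python. This is a toy model under simple assumptions: each replica is either up or down, and every replica that is up acknowledges the write. The function name coordinate_write is made up for illustration.

```python
def coordinate_write(replica_up, required_acks):
    """Toy coordinator: send the write to every replica, count acknowledgments.

    replica_up    -- list of booleans, one per replica (True means the node is up)
    required_acks -- number of acknowledgments the consistency level asks for
    """
    acks = sum(1 for up in replica_up if up)  # only live replicas acknowledge
    return {
        "sent_to": len(replica_up),           # the write goes to all replicas
        "acks": acks,
        "success": acks >= required_acks,     # enough acks for the consistency level?
    }


# Replication factor 7, consistency level 3, four replicas down:
result = coordinate_write([True, True, True, False, False, False, False], 3)
print(result["success"])  # True: three acknowledgments are enough

# Same setup, but only two replicas are up:
result = coordinate_write([True, True, False, False, False, False, False], 3)
print(result["success"])  # False: the consistency level cannot be met
```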

What are the types of read requests that a coordinator can send to replicas?

There are three types of read requests: direct request, digest request and read repair request.
The coordinator sends a direct request to one of the replicas.
After that, the coordinator sends digest requests to as many replicas as the consistency level specifies and checks whether the returned data is up to date.
After that, the coordinator sends digest requests to all the remaining replicas. If any node returns an out-of-date value, a background read repair request updates that data. This process is called the read repair mechanism.
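
The idea behind digest requests and read repair can be illustrated with a small Python sketch. It assumes each replica is a plain dictionary mapping a key to a (value, timestamp) pair and that the newest timestamp wins; the helper names digest and read_with_repair are hypothetical, and the repair runs inline here rather than in the background.

```python
import hashlib


def digest(value, timestamp):
    """Short hash of a (value, timestamp) pair, standing in for a digest reply."""
    return hashlib.md5(f"{value}:{timestamp}".encode()).hexdigest()


def read_with_repair(replicas, key):
    """Read from the first replica, compare digests, repair stale replicas."""
    data = replicas[0][key]                   # direct request: full data
    expected = digest(*data)
    digests = [digest(*r[key]) for r in replicas[1:]]  # digest requests
    if all(d == expected for d in digests):
        return data                           # every replica agrees
    # Mismatch: pick the version with the newest timestamp...
    newest = max((r[key] for r in replicas), key=lambda pair: pair[1])
    # ...and write it back to every replica that is out of date.
    for r in replicas:
        if r[key] != newest:
            r[key] = newest
    return newest


replicas = [{"k": ("old", 1)}, {"k": ("new", 2)}, {"k": ("old", 1)}]
print(read_with_repair(replicas, "k"))  # ('new', 2)
```

Comparing digests instead of full rows keeps the common case cheap: full data only needs to be compared when the hashes disagree.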

What are keyspaces, tables and columns in Cassandra?

Cassandra Keyspace is like a schema in relational databases.
The top level data structure in Cassandra is the Keyspace.
Keyspaces are logical containers for tables, indexes, and other data structures.
When you define a keyspace, you also define a replication strategy and replication factor. This determines how replicas of data are made.
Cassandra's high availability comes from replication. If a node fails, the data on the failed node can still be accessed from one of its replicas.

Command to create a keyspace:
CREATE KEYSPACE scrutinyMonitor
WITH
replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

After creating the keyspace, we can define a table using the CREATE TABLE command. CREATE TABLE takes a list of column names, each followed by a data type such as text or int. It also takes a PRIMARY KEY clause to define which columns to use as the primary key.

CREATE TABLE scrutiny_employees(
id uuid,
emp_name text,
emp_age int,
emp_level int,
PRIMARY KEY (id));

Column Types in Cassandra:
Columns in Cassandra can be either a single value or a group of multiple values called collections.

Basic Data Types:

Int
BigInt
TinyInt
VarInt
Decimal
Double
Float for floating point numbers
Text for strings
ASCII
Varchar
Timestamp for points in time
TimeUUID
Date
Blob for arbitrary byte streams
UUID for Universally Unique Identifier.

Collection Data Type: In addition to above atomic types, in Cassandra we can use collections as column type:

List for an ordered collection of elements, where duplicates are allowed
Map for key-value pairs
Set for a collection of unique elements, where insertion order does not matter.

Define Primary key and secondary indexes in Cassandra

Cassandra tables have primary keys, which uniquely identify rows in a table.
Primary keys are used to access data in Cassandra, but they are not enough to find rows, because Cassandra is designed to run on a cluster of servers. There is no single server in a highly-available Cassandra database.
We can run Cassandra on a single machine (mostly for development), but production databases are best run on clusters of multiple servers. To enable fast access to rows within tables that span multiple servers, Cassandra tables use two additional kinds of keys.
A partition key is used to determine which node in the cluster to store a row in.
A clustering column defines the order in which rows are stored within a partition.
In Cassandra, the primary key will uniquely identify a row, but it will also limit how we can retrieve rows.

Let's create a table to understand this. We create a table scrutiny_employees, in which we assume emp_id and emp_pan together uniquely identify an employee.

CREATE TABLE scrutiny_employees(
emp_id text,
emp_name text,
emp_age int,
emp_level int,
emp_pan text,
PRIMARY KEY (emp_id, emp_pan));

The first attribute in the primary key, emp_id, is the partition key, and emp_pan is the clustering key.
In general, the first attribute specified in the primary key list is used as the partition key, which determines which node the row is stored on. The rest of the primary key is used as the clustering key, which determines how data is ordered on disk.
We can query a table by specifying a WHERE clause, but the WHERE clause can only use columns in the primary key or columns that have a secondary index.
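
The two roles of the primary key can be sketched with a toy Python model. It assumes a fixed list of three made-up node names and picks a node by taking a hash modulo the node count, which is a simplification of a real token ring. ToyTable, node_for and NODES are hypothetical names used only for this illustration.

```python
import hashlib
from collections import defaultdict

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster


def node_for(partition_key):
    """Partition key -> node. Simplified: hash modulo node count, not a token ring."""
    h = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    return NODES[h % len(NODES)]


class ToyTable:
    """Rows grouped by partition key (emp_id), ordered by clustering key (emp_pan)."""

    def __init__(self):
        self.partitions = defaultdict(dict)  # emp_id -> {emp_pan: columns}

    def insert(self, emp_id, emp_pan, **columns):
        # Same (emp_id, emp_pan) overwrites: the primary key identifies the row.
        self.partitions[emp_id][emp_pan] = columns

    def select(self, emp_id):
        rows = self.partitions.get(emp_id, {})
        return [(pan, rows[pan]) for pan in sorted(rows)]  # clustering order


table = ToyTable()
table.insert("1", "ZZZ", emp_name="Kelvin")
table.insert("1", "AAA", emp_name="Maria")
print(node_for("1"))      # the node that would hold partition '1'
print(table.select("1"))  # rows come back ordered by emp_pan: AAA, then ZZZ
```

Looking rows up by emp_id is cheap because the partition key points straight at one node; a lookup by emp_name has no such shortcut in this model.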

Define composite key.

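Composite keys combine the row key with the column name. They are used to define a column family whose key is a concatenation of values of different types. In CQL terms, a primary key made of more than one column, such as PRIMARY KEY (emp_id, emp_pan), is a composite key.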

Clustering Order

Cassandra cannot sort query results by an arbitrary column at query time, so we have to consider sort order when creating a table.
Cassandra gives us the option to change the default sort order of rows on disk, but this has to be done when the table is created.
This is done by adding a CLUSTERING ORDER BY clause.

CREATE TABLE scrutiny_employees(
emp_id text,
emp_name text,
emp_age int,
emp_level int,
emp_pan text,
PRIMARY KEY (emp_id, emp_pan))
WITH CLUSTERING ORDER BY (emp_pan DESC);

Define Secondary Index in Cassandra.
Let's say we created scrutiny_employees, in which we used emp_id and emp_pan as primary key. But now we want to fetch all the employees with the name 'Kelvin'.

SELECT * from scrutiny_employees where emp_name='Kelvin';

When we execute the above query, we get the error 'Bad Request: Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING.'

To fetch the employees based on name we need to add a secondary index. We can do this by:

CREATE INDEX emp_name_index
ON scrutiny_employees(emp_name);

But we need to be a bit careful when creating a secondary index, because it creates another table that is updated along with the base table on which the index is defined. Both tables need to be updated when data is inserted, which adds overhead.
It is important not to use secondary indexes on columns with very low or very high numbers of unique values.
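
The overhead mentioned above can be seen in a toy Python model of an indexed table. This is only a sketch of the idea that every insert has to update two structures; it does not reflect how Cassandra stores its indexes. The class IndexedTable and its attributes are made up for this example.

```python
from collections import defaultdict


class IndexedTable:
    """Toy base table plus a secondary index on emp_name."""

    def __init__(self):
        self.rows = {}                      # primary key -> row dict
        self.name_index = defaultdict(set)  # emp_name -> set of primary keys

    def insert(self, key, row):
        old = self.rows.get(key)
        if old is not None:
            # Overwriting a row: remove the stale index entry first.
            self.name_index[old["emp_name"]].discard(key)
        self.rows[key] = row                       # write 1: the base table
        self.name_index[row["emp_name"]].add(key)  # write 2: the index

    def find_by_name(self, name):
        keys = self.name_index.get(name, ())
        return [self.rows[k] for k in sorted(keys)]


table = IndexedTable()
table.insert(("1", "BOOS9987H"), {"emp_name": "Kelvin", "emp_age": 33})
table.insert(("2", "AXYZ1234K"), {"emp_name": "Kelvin", "emp_age": 41})
print(len(table.find_by_name("Kelvin")))  # 2
```

In this model, a column where almost every row shares the same value produces one huge index entry, and a column where every value is unique produces an index as large as the table itself.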

What is TTL in Cassandra?

TTL (Time to Live) is a useful feature of Cassandra, with which we can specify how long the data should remain in the database while inserting the data.
After the Time to Live period (which is in seconds) has passed, the data is marked for deletion.

e.g. the record below will be marked for deletion after 86400 seconds (one day).
INSERT INTO scrutiny_employees (emp_id, emp_name, emp_age, emp_level, emp_pan)
VALUES ('1', 'Kelvin', 33, 2, 'BOOS9987H')
USING TTL 86400;

To know how long the above record has left to live, we need to execute the query below. TTL(emp_name) returns a different value depending on when it is executed. Let's say the query is executed 6000 seconds after the insert; then TTL(emp_name) will return 80400.

SELECT emp_name, TTL(emp_name) FROM scrutiny_employees;
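
The arithmetic behind the remaining TTL is simple enough to check in Python. The function remaining_ttl is a made-up helper that only models the countdown; returning None for expired data is an assumption of this sketch, chosen to stand for "the value is no longer there".

```python
def remaining_ttl(ttl_seconds, elapsed_seconds):
    """Seconds left before the data is marked for deletion, or None once expired."""
    remaining = ttl_seconds - elapsed_seconds
    return remaining if remaining > 0 else None


print(remaining_ttl(86400, 6000))   # 80400, matching the example above
print(remaining_ttl(86400, 90000))  # None: the TTL has already passed
```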

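Example delete queries.
The query below deletes only the emp_name from the scrutiny_employees table for the employee with emp_id '1'. Because emp_pan is part of the primary key, it has to be specified as well when deleting a single column.
DELETE emp_name from scrutiny_employees where emp_id='1' and emp_pan='BOOS9987H';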

If you want to delete the entire record of the employee with id 1:
DELETE from scrutiny_employees where emp_id='1';

-K Himaanshu Shuklaa..

January 01, 2020