Scrutiny: #Cassandra Part 2 : Data Model, Keys, Ring Representation, Virtual Nodes

ALSO CHECK: Cassandra Interview Questions And Answers

Cassandra

Cassandra is an open source, distributed database from Apache that is highly scalable and designed to manage very large amounts of structured data.
Cassandra is made to easily deploy over a cluster of machines located at geographically different places.
There is no master slave or central master server, so no single point of failure, no bottleneck, data is replicated, and a faulty node can be replaced without any downtime.
Cassandra is linearly scalable, which means that with more nodes, the requests served per second per node will not go down. Also, the total throughput of the system will increase with each node being added.
Cassandra is column oriented, much like a map (or better, a map of sorted maps) or a table with flexible columns where each column is essentially a key-value pair. So, we can add columns as we go, and each row can have a different set of columns (key-value).
Cassandra does not provide any relational integrity. It is up to the application developer to perform relation management.
It is a type of NoSQL database.
Apache HBase and MongoDB are quite popular NoSQL databases besides Cassandra.
It is widely used in Netflix, eBay, GitHub, Facebook, Twitter etc.
Cassandra does automatic partitioning and replication.

What is the main objective of creating Cassandra?

The main objective of Cassandra is to handle a large amount of data. Furthermore, the objective also ensures fault tolerance with the swift transfer of data.

What are the benefits of using Cassandra?

Apache Cassandra delivers near real-time performance.
Cassandra is established on a peer-to-peer architecture instead of master–slave, this ensures no failure
Cassandra assures amazing flexibility as it allows the insertion of multiple nodes to any Cassandra cluster in any data center. Also, any client can forward its request to any server.
Cassandra is highly scalable. It can be easily scaled up and scaled down as per the requirements. With a high throughput for read and write operations, this NoSQL application need not be restarted while scaling.
Cassandra allows data storage at multiple locations enabling users to retrieve data from another location if one node fails. Users have the option to set up the number of replicas they want to create.
Cassandra supports schema-free/schema-optional data model.

Cassandra Data Model

Cassandra has three containers, one within another.
Keyspace which is synonymous to a database in the RDBMS land is the outermost container.
Tables reside under keyspace. A table is basically a sorted map of sorted maps. Each table must have a primary key, which is called row key or partition key
If the primary key is made up of more than one column, the first component of this composite key is equivalent to the row key.
Each partition is associated with a set of cells and each cell has a name and a value. We can say these cells are synonymous with the columns in the traditional database.
A partition is completely stored on a node. The benefit of this is that the fetches are faster, but at the same time a partition is limited by the total number of cells that it can hold, which is 2 billion cells (The maximum number of cells per partition is limited by the Java integer’s max value, which is about 2 billion).
Also if we put everything on one partition it may cause lots of requests to go to only a couple of nodes(replicas), making them a hotspot in the cluster, which is not good. We can avoid this bucketing, this will make sure that the partition changes every month and each partition has only one month worth of records

Define Node.
A node represents a system that is a part of a cluster. It is the main area in which data is stored.

What is a Keyspace?
It is the outermost storage unit in a node. It contains many column families. When you define a keyspace, we need to also define a replication strategy.

Define a column family.
A keyspace contains many column families. They basically represent the table. A column family refers to a structure having an infinite number of rows. Those are referred by a key–value pair, where the key is the name of the column and the value represents the column data.

It is much similar to a hashmap in Java or a dictionary in Python. The column family is absolutely flexible with one row having 100 columns while the other having only 2 columns.

Can we add or remove column families in a working cluster?
Yes, we can, but while doing that we need to keep in mind the following processes:

Clear the commitlog with ‘nodetool drain’
Turn off Cassandra to ensure that there is no data left in the commitlog
Delete the SSTable files for the removed column families.

Give the data storage units in Cassandra
Cluster, Node, Keyspace, Column Family, Row, Column

Types Of Keys In Cassandra
There are 5 types of keys in Cassandra:

Primary key: This is the column or a group of columns that uniquely defines a row of the CQL table.
Composite key: Unlike RDBMS, we cannot just perform an ORDER BY operation across partitions in Cassandra. To do this we need to use a composite key. A composite key consists of a partition key and one or more column(s) that determines where the other columns are going to be stored. Also, the other columns in the composite key determine relative ordering for the set of columns that are being inserted as a row with the key.
Partition key or Row key: Cassandra’s internal data representation is large rows with a unique key called row key. It uses these row key values to distribute data across cluster nodes. Since these row keys are used to partition data, they as called partition keys. When we define a table with a simple key, that key is the partition key. If we define a table with a composite key, the first term of that composite key works as the partition key. This means all the CQL rows with the same partition key lives on one machine. (Every data with the same partition key will be stored in same node in the cluster).
Clustering key: This is the column that tells Cassandra how the data within a partition is ordered (or clustered). This essentially provides presorted retrieval if you know what order you want your data to be retrieve in.
Composite partition key: Optionally, CQL lets you define a composite partition key (the first part of a composite key). This key helps you distribute data across nodes if any part of the composite partition key differs.

Lets create a table. In this case empid is the primary as well as the partition key.

CREATE TABLE EMPLOYEES (
empid uuid,
email text,
PRIMARY KEY (empid)
)

In the below example, we create a composite key that uses state and city combination to uniquely define a CQL row. The state column is the partition key, so all the rows with
the same state node will belong to the same node/machine. The rows within a partition will be sorted by the city names.

CREATE TABLE STATE_CITY(
state text,
city text,
population int,
PRIMARY KEY (state, city)
)

In the below example we have created a composite key involving four columns: empName, experience, num_of_subordinates, department_name, with empName, experience constituting composite partition key. This means the rows with same empName, but different years of experience will be in different partition. Rows will be ordered by the num_of_subordinates followed by department_name.

CREATE TABLE EMPLOYEES (
empName text,
experience int,
num_of_subordinates int,
department_name int,
PRIMARY KEY ((empName, experience), num_of_subordinates, department_name)
)

Query First Approach

Apache Cassandra follows Query-driven data modeling methodology, whereby specific queries are the key to organizing data.
A query-driven database design facilitates faster reading and writing of data, i.e., the better the model design, the more rapid data is written and read.
In query first approach, we design our tables for specific queries rather than relational database. The drawback of this approach is that we might end up storing same data in different tables.

Ring Representation

A Cassandra cluster is called a ring.
Every node in a Cassandra cluster is given an initial token. This initial token defines the end of the range a node is responsible for.
Each node is responsible for storing all the rows with token values (a token is basically a hash value of a row key) ranging from the previous node’s initial token (exclusive) to the node’s initial token (inclusive).
This way, the first node, the one with the smallest initial token, will have a range from the token value of the last node (the node with the largest initial token) to the first token value.
If you jump from node to node, you will make a circle, and this is why a Cassandra cluster is called a ring.

Example:
Lets say we have 8 nodes, each node will be assigned a token(in our case it is 100, 200..800). Each node will responsible for storing the data with the token less than the value of that token and greater than the value aside the previous node. In our case Node one can store all the tokens which as less than 100 but greater than 800

Let's say we want to store employee information in a database based on the departments they are working. As of now let's assume an employee can work in a single department.
So we need to make sure, all the employee's working in a particular department will be stored on the same node of the cluster. So if we want to query all the employee's who are working in a particular department, we will be able to get it very quickly as they will be stored on a same node.
Now say we have 10 departments: HR, Marketing, Customer Support, Tech Support, Tech Team, Resource Management Team, Security, Travel, Lost And Found, Food And Beverages, Storage. All these department names will be our partition key. (FYI, every data with the same partition key will be stored in same node in the cluster)
Cassandra passes each partition key to a hash function, the purpose of this function is to turn the partition key (string) into a unique id. This unique id is known as 'tokens' in Cassandra. Based on these tokens Cassandra will decide on which node data need to be stored. In real Cassandra, these tokens will be 64 bits integers.
All the departments names will generate a unique value, which will ensure all the employee's with same department will be stored in same node.
Lets say we have 8 nodes, each node will be assigned a token(in our case it is 100, 200..800). Each node will responsible for storing the data with the token less than the value of that token and greater than the value aside the previous node. In our case Node one can story all the tokens which as less than 100 but greater than 800

Virtual Nodes in Cassandra

In above example token range is from 100-200, 200-300 and so on, which is quite less but in real life it's very large.
There's one token per node, and thus a node owns exactly one contiguous range in the ringspace.
Vnodes or Virtual Nodes change this paradigm from one token or range per node, to many per node. Within a cluster these can be randomly selected and be non-contiguous, giving us many smaller ranges that belong to each node.
By default, each node has 256 virtual nodes.Which can be changed and configured in Cassandra configuration.
Virtual nodes help achieve finer granularity in the partitioning of data, and data gets partitioned into each virtual node using the hash value of the key.
On adding a new node to the cluster, the virtual nodes on it get equal portions of the existing data. So there is no need to separately balance the data by running a balancer.

What are the replication strategies available in Cassandra?
Data replication enables high availability and durability. Cassandra replicates rows in a column family on to multiple endpoints based on the replication strategy associated to its keyspace.

The endpoints which store a row are called replicas or natural endpoints for that row. The replication strategy controls how the replicas are chosen and replication factor determines the number of replicas for a key. Replication strategy is defined when creating a keyspace and replication factor is configured differently based on the chosen replication strategy.

There are two kinds of replication strategies available in Cassandra.

SimpeStrategy

SimpeStrategy is rack unaware and data center unaware policy. It is commonly used when nodes are in a single data center.
In SimpleStrategy, successor nodes or the nodes on the ring immediate following in clockwise direction to the coordinator node are selected as replicas.

NetworkTopologyStategy

NetworkTopologyStategy is both rack aware and data center aware. Even if you have single data center, but nodes are in different racks it is better to use NetworkTopologyStategy because its rack aware and enables fault tolerance by choosing replicas from different racks.
Also if we have a plan to expand the cluster in the future to have more than one data center then choosing NetworkTopologyStategy from the beginning avoids data migration.
In NetworkTopologyStategy, nodes from distinct available racks in each data center are chosen as replicas. User need to specify per data center replication factor in multiple data center environment. In each data center, successor nodes to the coordinator node which are from distinct racks are chosen.
NetworkTopologyStrategy places replicas in the clockwise direction in the ring until reaches the first node in another rack.

-K Himaanshu Shuklaa..

Pages

November 11, 2019

#Cassandra Part 2 : Data Model, Keys, Ring Representation, Virtual Nodes

No comments:

Post a Comment