November 11, 2019

Cassandra

NoSQL Database
  • A NoSQL database is sometimes called "Not Only SQL". It is a database that provides a mechanism to store and retrieve data other than the tabular relations used in relational databases.
  • These types of databases are schema-free, support easy replication, have a simple API, are eventually consistent, and can handle huge amounts of data.
  • The primary objectives of a NoSQL database are simplicity of design, horizontal scaling, and finer control over availability.
  • SQL was designed to be a query language for relational databases, and relational databases are usually table-based, much like what we see in a spreadsheet. In a relational database, records are stored in rows, and the columns represent the fields of each row. SQL allows us to query within and between the tables of that relational database.
  • NoSQL databases, on the other hand, are more flexible: they allow us to define fields as we create a record.
  • Nested values are common in NoSQL databases. We can have hashes and arrays and objects, and then nest more objects and arrays and hashes within those.
  • Also, fields are not standardized between records in NoSQL databases. We can have a different structure for every record in our NoSQL database.
Difference between a NoSQL Database and a Relational Database
  • A relational database supports a powerful query language, whereas a NoSQL database usually supports a much simpler query language.
  • A relational database has a fixed schema. A NoSQL database has no fixed schema.
  • A relational database follows ACID (Atomicity, Consistency, Isolation, and Durability). A NoSQL database, on the other hand, is often only 'eventually consistent'.
  • A relational database supports transactions, whereas a NoSQL database generally does not support full multi-row ACID transactions.
Distributed Database
  • Distributed means splitting data or tasks across multiple machines.
  • In Cassandra, no single node (a machine in a cluster is usually called a node) holds all the data; each node holds just a chunk of it.
  • The main advantage of this is that we are not limited by the storage and processing capabilities of a single machine. If the data grows in the future, we can add more machines.
High Availability
  • A high-availability system is one that is ready to serve any request at any time.
  • It is usually achieved by adding redundancy, so that if one part fails, another part of the system can serve the request without the client noticing.
  • Cassandra is robust software in which nodes joining and leaving the cluster are taken care of automatically.
  • With proper settings (for example, an adequate replication factor), Cassandra can be made failure-resistant, so that data is not lost when individual servers fail.
Replication
  • Replication is achieved by frequent copying of data from a database in one computer/server to a database in another so that all users share the same level of information.
  • Cassandra has a pretty powerful replication mechanism.
  • It treats every node in the same manner and doesn't have any master-slave concept. In Cassandra, data need not be written to a specific server (master), and we need not wait until the data is written to all the nodes that replicate this data (slaves). This means that, depending on the configured consistency level, the client can be returned a success response as soon as the data is written on at least one server.
Cassandra
  • Cassandra is an open source, distributed database from Apache that is highly scalable and designed to manage very large amounts of structured data.
  • Cassandra is designed to be deployed easily across a cluster of machines located in geographically different places.
  • There is no master-slave setup or central master server, so there is no single point of failure and no bottleneck. Data is replicated, and a faulty node can be replaced without any downtime.
  • Cassandra is linearly scalable, which means that with more nodes, the requests served per second per node will not go down. Also, the total throughput of the system will increase with each node being added.
  • Cassandra is column oriented, much like a map (or better, a map of sorted maps) or a table with flexible columns where each column is essentially a key-value pair. So, we can add columns as we go, and each row can have a different set of columns (key-value).
  • Cassandra does not provide any relational integrity. It is up to the application developer to perform relation management.
  • It is a type of NoSQL database.
  • Apache HBase and MongoDB are quite popular NoSQL databases besides Cassandra.
  • It is used at companies such as Netflix, eBay, GitHub, Facebook, and Twitter.
  • Cassandra does automatic partitioning and replication.
Cassandra Data Model
  • Cassandra has three containers, one within another.
  • The keyspace, which is analogous to a database in the RDBMS world, is the outermost container.
  • Tables reside under a keyspace. A table is basically a sorted map of sorted maps. Each table must have a primary key, whose first part is called the row key or partition key.
  • If the primary key is made up of more than one column, the first component of this composite key is equivalent to the row key.
  • Each partition is associated with a set of cells, and each cell has a name and a value. We can say these cells are analogous to the columns in a traditional database.
  • A partition is stored completely on one node. The benefit of this is that fetches are faster, but at the same time a partition is limited in the total number of cells it can hold, which is about 2 billion (the maximum number of cells per partition is limited by the Java integer's max value).
  • Also, if we put everything in one partition, lots of requests may go to only a couple of nodes (replicas), making them a hotspot in the cluster, which is not good. We can avoid this by bucketing: for example, if we add the month to the partition key, the partition changes every month and each partition holds only one month's worth of records.
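The bucketing idea can be sketched in a few lines of Python. This is only an illustration of how a monthly bucket could be derived; the function name and the sensor_id column are hypothetical and not part of Cassandra itself.

```python
from datetime import datetime

def bucketed_partition_key(sensor_id, event_time):
    """Build a (sensor_id, 'YYYY-MM') pair to use as a composite partition key.

    sensor_id is a hypothetical column; any natural key would work the same way.
    """
    return (sensor_id, event_time.strftime("%Y-%m"))

# Records from different months get different partition keys,
# so no single partition keeps growing forever.
jan = bucketed_partition_key("sensor-1", datetime(2019, 1, 15))
feb = bucketed_partition_key("sensor-1", datetime(2019, 2, 3))
```

The trade-off is that a query spanning several months now has to read several partitions, one per bucket.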

Types Of Keys In Cassandra
There are 5 types of keys in Cassandra:
  • Primary key: This is the column or group of columns that uniquely identifies a row of the CQL table.
  • Composite key: Unlike in an RDBMS, we cannot simply run an ORDER BY across partitions in Cassandra. To get ordered results we need a composite key. A composite key consists of a partition key plus one or more additional columns. The partition key determines where the row is stored, and the additional columns determine the relative ordering of the rows stored under that partition key.
  • Partition key or Row key: Cassandra's internal data representation is large rows with a unique key called the row key. It uses these row key values to distribute data across the cluster nodes. Since these row keys are used to partition data, they are called partition keys. When we define a table with a simple key, that key is the partition key. If we define a table with a composite key, the first term of that composite key works as the partition key. This means all the CQL rows with the same partition key live on one machine (all data with the same partition key is stored on the same node in the cluster).
  • Clustering key: This is the column that tells Cassandra how the data within a partition is ordered (or clustered). This essentially provides presorted retrieval if you know the order in which you want your data to be retrieved.
  • Composite partition key: Optionally, CQL lets you define a partition key made up of several columns (written as the first, parenthesized part of the primary key). Rows go to different partitions if any part of the composite partition key differs, which helps distribute data across nodes.
Let's create a table. In this case empid is the primary key as well as the partition key.

CREATE TABLE EMPLOYEES (
empid uuid,
email text,
PRIMARY KEY (empid)
);

In the example below, we create a composite key that uses the combination of state and city to uniquely identify a CQL row. The state column is the partition key, so all the rows with the same state value will belong to the same node/machine. The rows within a partition will be sorted by city name.

CREATE TABLE STATE_CITY (
state text,
city text,
population int,
PRIMARY KEY (state, city)
);

In the example below we create a composite key involving four columns: empName, experience, num_of_subordinates, and department_name, with empName and experience constituting the composite partition key. This means rows with the same empName but different years of experience will be in different partitions. Within a partition, rows will be ordered by num_of_subordinates followed by department_name.

CREATE TABLE EMPLOYEES (
empName text,
experience int,
num_of_subordinates int,
department_name text,
PRIMARY KEY ((empName, experience), num_of_subordinates, department_name)
);
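To make the effect of this key more concrete, here is a small Python sketch that imitates it in memory: rows are grouped by (empName, experience) and sorted inside each group by (num_of_subordinates, department_name). It is only a mental model with made-up sample rows, not how Cassandra stores data on disk.

```python
from collections import defaultdict

# Sample rows (invented for illustration):
# (empName, experience, num_of_subordinates, department_name)
rows = [
    ("alice", 5, 3, "tech"),
    ("alice", 5, 1, "hr"),
    ("alice", 7, 2, "tech"),
    ("bob", 5, 0, "travel"),
]

def layout(rows):
    """Group rows by the composite partition key, sort each group by clustering columns."""
    partitions = defaultdict(list)
    for emp_name, experience, subordinates, department in rows:
        partitions[(emp_name, experience)].append((subordinates, department))
    return {key: sorted(cells) for key, cells in partitions.items()}

table = layout(rows)
```

Here ("alice", 5) and ("alice", 7) end up as separate partitions, and the two rows under ("alice", 5) come back ordered by number of subordinates.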

Query First Approach
  • Apache Cassandra follows a query-driven data modeling methodology, in which specific queries are the key to organizing data.
  • A query-driven database design facilitates faster reading and writing of data, i.e., the better the model design, the more rapidly data is written and read.
  • In the query-first approach, we design our tables for specific queries rather than around the relationships between entities, as we would in a relational database. The drawback of this approach is that we might end up storing the same data in different tables.
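As a rough illustration of the query-first idea, the Python sketch below keeps one "table" per query, using plain dictionaries as stand-ins. The table names, the add_employee helper, and the sample employees are all hypothetical.

```python
# One "table" per query we want to answer (dicts stand in for Cassandra tables).
employees_by_id = {}          # query: find an employee by id
employees_by_department = {}  # query: list the employees of a department

def add_employee(empid, name, department):
    """Write the same record to both tables, duplicating the data on purpose."""
    record = {"empid": empid, "name": name, "department": department}
    employees_by_id[empid] = dict(record)
    employees_by_department.setdefault(department, []).append(dict(record))

add_employee(1, "Asha", "HR")
add_employee(2, "Ben", "Security")
add_employee(3, "Chen", "HR")
```

Each query is now a single lookup by key, at the cost of writing every employee twice.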
Ring Representation
  • A Cassandra cluster is called a ring. 
  • Every node in a Cassandra cluster is given an initial token. This initial token defines the end of the range a node is responsible for.
  • Each node is responsible for storing all the rows with token values (a token is basically a hash value of a row key) ranging from the previous node’s initial token (exclusive) to the node’s initial token (inclusive).
  • This way, the first node, the one with the smallest initial token, will have a range from the token value of the last node (the node with the largest initial token) to the first token value.
  • If you jump from node to node, you will make a circle, and this is why a Cassandra cluster is called a ring.


Example:
Let's say we have 8 nodes, and each node is assigned a token (in our case 100, 200, ..., 800). Each node is responsible for storing the data whose token is less than or equal to its own token and greater than the previous node's token. In our case, node one stores all the tokens that are greater than 800 or at most 100, because its range wraps around the ring.


  • Let's say we want to store employee information in a database based on the department the employees work in. For now, let's assume an employee can work in only one department.
  • We need to make sure that all the employees working in a particular department are stored on the same node of the cluster. Then, if we want to query all the employees who work in a particular department, we will be able to get them very quickly, as they are stored on the same node.
  • Now say we have 11 departments: HR, Marketing, Customer Support, Tech Support, Tech Team, Resource Management Team, Security, Travel, Lost And Found, Food And Beverages, Storage. These department names will be our partition keys. (FYI, all data with the same partition key is stored on the same node in the cluster.)
  • Cassandra passes each partition key to a hash function, which turns the partition key (a string here) into a numeric id. This id is known as a 'token' in Cassandra. Based on the token, Cassandra decides on which node the data needs to be stored. In real Cassandra, these tokens are 64-bit integers.
  • A given department name always produces the same token, which ensures that all the employees of the same department are stored on the same node.
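The department example can be sketched in Python as follows. This is a toy model under stated assumptions: the token space is shrunk to 0..899 so that it lines up with the node tokens 100..800, and MD5 is used only as a convenient stand-in for a hash function, not as the partitioner Cassandra actually uses.

```python
import bisect
import hashlib

NODE_TOKENS = [100, 200, 300, 400, 500, 600, 700, 800]  # node 1 .. node 8
RING_SIZE = 900  # toy token space 0..899 (an assumption for this example)

def token_for(partition_key):
    """Hash a partition key into the toy token space (MD5 is a stand-in hash)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % RING_SIZE

def node_for_token(token):
    """Return the 1-based node whose range (previous token, own token] holds the token."""
    index = bisect.bisect_left(NODE_TOKENS, token)
    # Tokens above the largest node token wrap around to the first node.
    return (index % len(NODE_TOKENS)) + 1

def node_for_department(department):
    return node_for_token(token_for(department))
```

Calling node_for_department("HR") returns the same node every time, so every employee row with that department lands on the same node.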
Virtual Nodes in Cassandra

  • In the above example the token ranges are 100-200, 200-300, and so on, which is tiny; in real life the token range is very large.
  • Without virtual nodes, there is one token per node, and thus a node owns exactly one contiguous range in the ring space.
  • Vnodes, or virtual nodes, change this paradigm from one token or range per node to many per node. Within a cluster these can be randomly selected and be non-contiguous, giving us many smaller ranges that belong to each node.
  • By default, each node has 256 virtual nodes in Cassandra versions before 4.0. This number can be changed through the num_tokens setting in the Cassandra configuration.
  • Virtual nodes help achieve finer granularity in the partitioning of data, and data gets partitioned into each virtual node using the hash value of the key.
  • When a new node is added to the cluster, its virtual nodes take over roughly equal portions of the existing data, so there is no need to balance the data separately by running a balancer.
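Here is a minimal Python sketch of the vnode idea, assuming tokens are picked at random from a toy range of 0 to 2^64 - 1. The build_ring and owner helpers are hypothetical names for this example, and real Cassandra allocates and stores tokens differently.

```python
import bisect
import random

def build_ring(nodes, vnodes_per_node, seed=42):
    """Give every node several random tokens and return the ring sorted by token."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    ring = []
    for node in nodes:
        for _ in range(vnodes_per_node):
            ring.append((rng.randrange(2**64), node))
    ring.sort()
    return ring

def owner(ring, token):
    """Find the node owning the first vnode token >= token, wrapping at the end."""
    tokens = [t for t, _ in ring]
    index = bisect.bisect_left(tokens, token) % len(ring)
    return ring[index][1]

ring = build_ring(["A", "B", "C"], vnodes_per_node=4)
```

With 4 vnodes per node, the ring has 12 small ranges scattered around it instead of 3 large contiguous ones.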