Deploying Apache Cassandra

Apache Cassandra is a NoSQL, distributed database system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. It offers multi-datacenter clustering with asynchronous, masterless replication and near-linear scalability. [1]

NoSQL designs trade the ACID (Atomicity, Consistency, Isolation, Durability) [2] guarantees of traditional RDBMS solutions for scalability and high availability. Cassandra follows the BASE (Basically Available, Soft-state, Eventual consistency) [3] principles, which places it on the "available" and "partition tolerant" (AP) side of the CAP theorem [4] triangle, though such classifications are mostly meant to answer the question "what is the default behaviour of the distributed system when a partition happens?"

A Cassandra cluster is called a ring. Each node is assigned multiple virtual nodes (vnodes), each responsible for a contiguous range of token values (a token is a hash of the row key). Cassandra is a peer-to-peer system in which data is distributed among all nodes in the ring. Each node exchanges state information across the cluster every second using a gossip protocol. A partitioner determines how data is distributed across the nodes in the cluster, including which node receives the first copy of a piece of data.
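Once a cluster is running, the per-node ownership and the vnode token ranges can be inspected with nodetool; a quick sketch (these commands require a live node):

```shell
# Per-node state, data ownership and DC/rack placement
nodetool status

# The full token ring, one line per vnode token
nodetool ring
```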

When a client sends a write request it can connect to any node in the ring; that node becomes the coordinator node. The coordinator delegates the write to a StorageProxy service, which determines which nodes are responsible for the data. It identifies them using a mechanism called a Snitch, which defines the groups of machines the replication strategy uses to place replicas. Once the replica nodes are identified, the coordinator sends them a RowMutation message and waits for confirmation that the data was written; it only waits for as many nodes as the pre-configured consistency level requires. If the replicas span multiple datacenters, the message is sent to one replica in each datacenter with a special header telling it to forward the request to the other replicas in that datacenter. A node that receives a RowMutation first appends it to the commit log, then writes it to a MemTable; the MemTable is eventually flushed to disk in a structure called an SSTable. Periodically the SSTables are merged in a process called compaction.
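The number of replicas the coordinator waits for is controlled per session via the consistency level. A hedged cqlsh sketch against a running cluster (the demo keyspace and users table are assumptions, not from the original):

```shell
cqlsh -e "
-- wait for a majority of replicas to acknowledge the write
CONSISTENCY QUORUM;
INSERT INTO demo.users (id, name) VALUES (1, 'alice');
"
```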

When a client needs to read data back, it again connects to any node. The StorageProxy gets a list of nodes containing the requested key based on the replication strategy, and the coordinator sorts the candidate nodes by proximity using the (configurable) Snitch. The read request is then forwarded to the closest node for execution. That node first attempts to read the data from its MemTable; if the data is not in memory, Cassandra looks into the SSTables on disk, using a bloom filter to skip SSTables that cannot contain the key. At the same time, the other nodes responsible for the same data respond with just a digest, without the actual data. If a digest does not match, a read repair process is started and the out-of-date nodes eventually receive the latest data and become consistent. For further information on the architecture of Cassandra refer to [5].
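The digest requests and any resulting read repair can be observed by enabling request tracing in cqlsh; a sketch (the demo keyspace is again an assumption):

```shell
cqlsh -e "
TRACING ON;
CONSISTENCY QUORUM;
SELECT * FROM demo.users WHERE id = 1;
"
```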

The data in Cassandra is stored in a nested hashmap (a hashmap whose values are themselves hashmaps, i.e. nested key-value pairs) and can be visualized as a map of maps.
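A sketch of that nesting (all names and values here are illustrative only):

```
{ "my_keyspace": {                           # keyspace (~ database)
      "users": {                             # column family (~ table)
          "row-key-1": {                     # row, located via token(row key)
              "name": ("alice", 1420000000)  # cell: key : value : timestamp
          }
      }
  }
}
```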

The keyspace is similar to a database and it stores the column families, along with other properties like the replication factor and replica placement strategies. The properties of the keyspace apply to all tables contained within the keyspace.

The column family is similar to a table and contains a collection of rows, where each row contains cells.

A cell is the smallest data unit: a triplet that holds data in the form "key:value:timestamp". The timestamp is used to resolve consistency discrepancies during data repairs triggered by mismatched digests.

The row key uniquely identifies a row. Since each node in the ring holds only a subset of rows (the rows are distributed among the nodes), the row key also determines which nodes store a given row.

With all this in mind, let's deploy a two-node, single-DC cluster. First let's install the prerequisites:
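The original commands are not shown here; on a Debian/Ubuntu host (an assumption) the main prerequisite, a supported Java runtime, might be installed like this:

```shell
# Cassandra runs on the JVM; install a supported JRE plus helper tools
sudo apt-get update
sudo apt-get install -y openjdk-8-jre-headless curl gnupg
```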
Then let's install Cassandra:
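A sketch of installing from the official Apache Debian repository (the 311x series line is an assumption; substitute the series you need):

```shell
# Add the Apache Cassandra repository and its signing keys, then install
echo "deb https://debian.cassandra.apache.org 311x main" | \
    sudo tee /etc/apt/sources.list.d/cassandra.list
curl -s https://downloads.apache.org/cassandra/KEYS | sudo apt-key add -
sudo apt-get update
sudo apt-get install -y cassandra
```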
The main configuration file is very well documented and most of the defaults are quite sensible. The few changes for the purpose of this blog are as follows:
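A minimal sketch of the relevant cassandra.yaml fragment (the cluster name and IP addresses are placeholders):

```yaml
cluster_name: 'MyCluster'
authenticator: PasswordAuthenticator     # enable authentication
authorizer: CassandraAuthorizer          # enable authorization
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "10.0.0.1"                # IP of the seed node
listen_address: 10.0.0.1
rpc_address: 10.0.0.1
endpoint_snitch: GossipingPropertyFileSnitch
```

With GossipingPropertyFileSnitch, each node declares its own DC and rack in cassandra-rackdc.properties (e.g. `dc=DC1` and `rack=RAC1`).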
The important options are the cluster name, enabling authentication and authorization, the IP of the seed node, and the Snitch. In this example I am using the GossipingPropertyFileSnitch, with which you specify the DC and rack each node is in. As long as the nodes are configured with the same cluster name they'll discover each other and form the ring using the gossip protocol.

Managing a Cassandra cluster can be accomplished with the nodetool utility. Here are a few examples of removing a live node and then re-adding it to the cluster:
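A hedged sketch of those nodetool operations (the host ID is a placeholder taken from `nodetool status` output, and paths assume the default data directories):

```shell
# On the node leaving the ring: stream its data to the remaining replicas
nodetool decommission

# Alternatively, from any live node, if the leaving node is already down:
# nodetool removenode <host-id>

# To re-add the node, wipe its old state and start the service again
sudo rm -rf /var/lib/cassandra/data/* /var/lib/cassandra/commitlog/*
sudo systemctl start cassandra

# Verify it rejoined the ring
nodetool status
```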
To take a snapshot of the data for backup run the following:
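A sketch using nodetool snapshot (the tag is an arbitrary label of your choosing):

```shell
# Snapshot all keyspaces under a named tag
nodetool snapshot -t mybackup

# List existing snapshots; files live under each table's data directory:
#   /var/lib/cassandra/data/<keyspace>/<table>/snapshots/<tag>/
nodetool listsnapshots
```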
Once the snapshot is complete you can copy it to a secure location. To enable incremental backups, edit cassandra.yaml and set "incremental_backups: true". To restore the data, stop Cassandra, clean up the data directory (delete only the *.db files) and the commitlog directory, copy the snapshot files back, and start Cassandra again.
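The restore procedure might look like this (paths assume the default data directory; `<keyspace>`, `<table>` and `<tag>` are placeholders to substitute before running):

```shell
sudo systemctl stop cassandra

# Delete only the current *.db files, keeping the snapshots directory
find /var/lib/cassandra/data/<keyspace>/<table>/ -maxdepth 1 -name '*.db' -delete
sudo rm -rf /var/lib/cassandra/commitlog/*

# Copy the snapshot SSTables back into the table directory and restart
sudo cp /var/lib/cassandra/data/<keyspace>/<table>/snapshots/<tag>/* \
        /var/lib/cassandra/data/<keyspace>/<table>/
sudo systemctl start cassandra
```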

And finally let's manipulate some data by adding a new user, changing the default cassandra password, and creating and inserting records:
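A sketch of those steps in cqlsh (the role name, keyspace and table are assumptions; the CREATE ROLE syntax assumes Cassandra 2.2+, older versions use CREATE USER):

```shell
# Log in with the default superuser credentials (cassandra/cassandra)
cqlsh -u cassandra -p cassandra <<'EOF'
-- Create a new superuser, then change the default password
CREATE ROLE admin WITH PASSWORD = 'S3cret!' AND SUPERUSER = true AND LOGIN = true;
ALTER ROLE cassandra WITH PASSWORD = 'new-password';

-- Create a keyspace and a table, insert a record and read it back
CREATE KEYSPACE IF NOT EXISTS demo
  WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 2};
CREATE TABLE IF NOT EXISTS demo.users (id int PRIMARY KEY, name text);
INSERT INTO demo.users (id, name) VALUES (1, 'alice');
SELECT * FROM demo.users;
EOF
```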