An earlier post described how you can go about installing an open source single node Cassandra cluster. With the advent of cloud technologies, it’s easier to have access to multiple servers for the purpose of trying out a clustered environment.
These instructions show you how to set up a basic Cassandra cluster. The process for two data nodes is outlined here, but this can be easily replicated to add more nodes.
System Setup
These instructions were created using CentOS 6.7 and two data nodes. The servers were configured with 4 GB of memory and a 40 GB hard drive. After installing the OS, install Java 7 or later.
Networking Setup
Similarly to what was done in the single node setup, decide how you will refer to your servers: DNS name or IP address? If you are going to use the DNS name, make sure that the name resolves from all hosts by using the ping command. Add the names to /etc/hosts if necessary. In all places in these instructions where you see HOST_NAME, use either the DNS name or IP address.
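A quick sanity check along these lines can confirm that every host resolves before you continue (HOST_NAME1 and HOST_NAME2 are placeholders for your actual server names):

```shell
#!/bin/bash
# Check that each cluster host name resolves. HOST_NAME1/HOST_NAME2
# are placeholders -- substitute your actual DNS names or IP addresses.
for HOST in HOST_NAME1 HOST_NAME2; do
    if getent hosts "$HOST" > /dev/null; then
        echo "$HOST resolves"
    else
        echo "$HOST does NOT resolve -- add an entry to /etc/hosts"
    fi
done
```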
User/Group Configuration
This guide will create one user for the Cassandra system and one group on each server:
Groups
cassandra – This group will contain the Cassandra system user.
Users
cassandra – This user will be used for all core Cassandra operations. It is a member of the cassandra group.
Steps
- Create the cassandra group as a system group.
- Create the cassandra user as a system user with the primary group being cassandra and also make it a member of the users group.
On each host, run these commands:
> sudo groupadd -r cassandra
> sudo useradd -r -m cassandra -g cassandra -G users
Download and Install Cassandra
Download Cassandra from the Cassandra releases page at http://cassandra.apache.org/download/ . For this post I used the 2.2.3 release (named apache-cassandra-2.2.3-bin.tar.gz). On each host, as the cassandra user, extract the archive:
> tar -zxf apache-cassandra-2.2.3-bin.tar.gz
Now move the extracted directory to the path /usr/share, and create a symbolic link to it:
> sudo mv apache-cassandra-2.2.3 /usr/share
> sudo ln -s /usr/share/apache-cassandra-2.2.3 /usr/share/cassandra
Why a symbolic link? A symbolic link will allow you to easily have multiple versions of Cassandra side by side in your installation, and you would be able to switch between them simply by changing the location that the link points to.
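To see how the switch works in practice, here is a sandboxed sketch of the idea. It uses a temporary directory instead of /usr/share so it can be run without root, and the 2.2.3/2.2.4 version numbers are only for illustration:

```shell
#!/bin/bash
# Demonstrate switching versions by re-pointing a symbolic link,
# using a temp dir so no root access is needed.
BASE=$(mktemp -d)
mkdir "$BASE/apache-cassandra-2.2.3" "$BASE/apache-cassandra-2.2.4"

# Point the "cassandra" link at the currently active version...
ln -s "$BASE/apache-cassandra-2.2.3" "$BASE/cassandra"
readlink "$BASE/cassandra"   # prints $BASE/apache-cassandra-2.2.3

# ...and switch versions simply by re-pointing the link.
ln -sfn "$BASE/apache-cassandra-2.2.4" "$BASE/cassandra"
readlink "$BASE/cassandra"   # prints $BASE/apache-cassandra-2.2.4

rm -rf "$BASE"
```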
Create the Data Directories
As with the single node setup, the data directories now need to be created. Cassandra needs three directories. As the cassandra user, run the following:
> mkdir /usr/share/cassandra/commitlog
> mkdir /usr/share/cassandra/data
> mkdir /usr/share/cassandra/saved_caches
Configuration: cassandra.yaml
Now the configuration of the node needs to be set up. The yaml file format is similar to a properties file, except with a colon (:) after the name of the property instead of an equals (=) sign. The cassandra.yaml file is located under /usr/share/cassandra/conf. First, change the cluster name:
cluster_name: 'MyClusterName'
Specify whatever name you would like for the cluster, but be sure to change it, since it's an easy indicator that you are using the right configuration. Next, set the paths to the data directories:
data_file_directories:
    - /usr/share/cassandra/data
commitlog_directory: /usr/share/cassandra/commitlog
saved_caches_directory: /usr/share/cassandra/saved_caches
Now, set the seed to the IP address of the first server:
seed_provider:
    # Addresses of hosts that are deemed contact points.
    # Cassandra nodes use this list of hosts to find each other and learn
    # the topology of the ring. You must change this if you are running
    # multiple nodes!
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          # seeds is actually a comma-delimited list of addresses.
          # Ex: "<ip1>,<ip2>,<ip3>"
          - seeds: "IP_ADDRESS"
Replace IP_ADDRESS with the server’s IP address. This setting will be the same on all servers. Not every server needs to be a seed node. For a simple setup, a single seed node will suffice.
Next set the listen_address property to each server’s IP address:
listen_address: IP_ADDRESS
Then set the rpc_address to each server’s IP address:
rpc_address: IP_ADDRESS
Note that this will cause the config file to be different on each of the nodes of the cluster. You can make the process of updating the config files easy by configuring all of the settings on one server, then copying the entire file to the other server(s) and then updating only the properties that are unique to it.
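One way to script that fix-up is with sed. The sketch below works on local temporary files so it can be run anywhere; the IP addresses are illustrative, and on a real cluster you would push the resulting file to the other server with scp before (or instead of) editing it locally:

```shell
#!/bin/bash
# Start from the seed node's finished cassandra.yaml and produce a
# variant for the second node by rewriting only the per-node settings.
# The paths and IPs are illustrative stand-ins.
SRC=$(mktemp)   # stand-in for the seed node's cassandra.yaml
printf 'listen_address: 192.168.110.32\nrpc_address: 192.168.110.32\n' > "$SRC"

NODE2_CONF=$(mktemp)
NODE2_IP=192.168.110.33
cp "$SRC" "$NODE2_CONF"
# Rewrite only the properties that are unique to the second node.
sed -i "s/^listen_address:.*/listen_address: $NODE2_IP/" "$NODE2_CONF"
sed -i "s/^rpc_address:.*/rpc_address: $NODE2_IP/" "$NODE2_CONF"

cat "$NODE2_CONF"
rm -f "$SRC" "$NODE2_CONF"
```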
Firewall Configuration
It’s almost time to start Cassandra. If you are running a server with a firewall enabled, open these ports to be able to access Cassandra services from outside the server:
- 7000 (internode cluster communication)
- 7199 (JMX monitoring)
- 9042 (CQL native transport)
- 9160 (Thrift client API)
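On CentOS 6 the firewall is iptables-based, so a minimal sketch for opening these ports looks like the following. Run it as root, and adjust it to fit your existing firewall policy rather than treating it as the one right way:

```shell
#!/bin/bash
# Open the Cassandra ports in iptables (CentOS 6 style). Run as root.
for PORT in 7000 7199 9042 9160; do
    iptables -I INPUT -p tcp --dport "$PORT" -j ACCEPT
done
service iptables save   # persist the rules across reboots
```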
Starting Cassandra
Now it’s time to run Cassandra for the first time. Run the following command on the seed node as the cassandra user:
> /usr/share/cassandra/bin/cassandra
This will start Cassandra and log the startup information to stdout. You want to see messages like the following:
INFO 21:14:53,291 Node /IP_ADDRESS state jump to normal
INFO 21:14:53,360 Waiting for gossip to settle before accepting client requests...
INFO 21:15:01,363 No gossip backlog; proceeding
Once you see this, you can repeat the same process on the other node(s). Look for the same message.
Checking the status of the cluster
After getting the nodes started, you can use the nodetool status command to verify that all hosts were registered. Run the following command on any node:
> /usr/share/cassandra/bin/nodetool status
You will see output similar to:
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens  Owns  Host ID                               Rack
UN  192.168.110.32  235.2 KB   256     ?     8d1e16c2-854a-4670-baf3-cdb3a89d7a7c  rack1
UN  192.168.110.33  301.17 KB  256     ?     7ae097c8-feb2-421a-ac24-b35aec638a2c  rack1

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
You should see all of your nodes listed with a UN status.
For ease of execution, it’s helpful to add the /usr/share/cassandra/bin directory to the path.
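For example, a line like this in the cassandra user's ~/.bash_profile makes the tools available by name:

```shell
# Add the Cassandra tools to the search path for this shell session;
# put this line in ~/.bash_profile to make it permanent.
export PATH="$PATH:/usr/share/cassandra/bin"
```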
Startup/Shutdown Scripts
In a clustered environment, it is very beneficial to have a single way to start and stop all services on all hosts. Unlike Hadoop, Cassandra doesn't include built-in scripts for this. Here are some simple ones.
Startup script
#!/bin/bash
SERVERS="HOST_NAME1 HOST_NAME2"
for SERVERNAME in $SERVERS; do
    echo Starting on $SERVERNAME...
    sudo -u cassandra ssh $SERVERNAME "/usr/share/cassandra/bin/cassandra"
done
Shutdown Script
#!/bin/bash
SERVERS="HOST_NAME1 HOST_NAME2"
for SERVERNAME in $SERVERS; do
    echo Stopping on $SERVERNAME...
    sudo -u cassandra ssh $SERVERNAME "/home/cassandra/bin/cassandra-stop.sh"
done
The cassandra-stop.sh script is:
#!/bin/bash
CASS_PID=`ps -ef | grep CassandraDaemon | grep -v grep | awk '{ print $2 }'`
if [[ "$CASS_PID" == "" ]]
then
    echo Cassandra does not appear to be running
else
    kill $CASS_PID
fi
Description
These scripts can be run as any user with sudo access to the cassandra user. The cassandra-stop.sh script should be located in the /home/cassandra/bin directory.
Optimum Server Configuration
If you see this message in the log:
Cassandra server running in degraded mode. Is swap disabled? : false, Address space adequate? : true, nofile limit adequate? : false, nproc limit adequate? : false
Make sure that these steps have been followed:
- Disable the swap partition in the /etc/fstab file
- Add these settings to the limits.conf file for the cassandra user:
cassandra - memlock unlimited
cassandra - nofile 100000
cassandra - nproc 32768
cassandra - as unlimited
- Set vm.max_map_count = 131072 in /etc/sysctl.conf
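A sketch of applying those changes without a reboot follows. It must be run as root, and the limits.conf entries only take effect the next time the cassandra user logs in:

```shell
#!/bin/bash
# Apply the tuning changes immediately (run as root).
swapoff -a                          # disable swap now; the /etc/fstab edit makes it permanent
sysctl -w vm.max_map_count=131072   # apply now; /etc/sysctl.conf makes it permanent
# The limits.conf entries take effect at the cassandra user's next
# login; verify them with:
su - cassandra -c 'ulimit -l -n -u'
```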
Testing Your Servers
To verify that your servers are running properly, run the following as any user:
> /usr/share/cassandra/bin/nodetool -h IP_ADDRESS info
You should see output similar to this:
Token                  : (invoke with -T/--tokens to see all 256 tokens)
ID                     : c602825f-367d-4ae0-8c90-afcf6bb818bf
Gossip active          : true
Thrift active          : true
Native Transport active: true
Load                   : 105.82 KB
Generation No          : 1416107662
Uptime (seconds)       : 266
Heap Memory (MB)       : 26.85 / 928.00
Data Center            : datacenter1
Rack                   : rack1
Exceptions             : 0
Key Cache              : size 2096 (bytes), capacity 48234496 (bytes), 76 hits, 96 requests, NaN recent hit rate, 14400 save period in seconds
Row Cache              : size 0 (bytes), capacity 0 (bytes), 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds
If nodetool seems to pause and not return for upwards of a minute, there may be issues with the network configuration of your server.
Running Remotely
The Cassandra client tools can be run remotely from another workstation. Just make sure the workstation has Java 7 or later, then download the Cassandra archive and extract it. Put the bin directory in your path for convenience.
In order to run remotely, make sure you have the ports listed above open in the firewall, if the server has a firewall enabled.
Creating Sample Data
In order to create data in Cassandra, run the CQLSH tool by running cqlsh from the Cassandra bin directory. The format is:
> cqlsh HOST
Where HOST is the host name or IP address of one of the servers. (Note: You can avoid typing the host name each time by setting the CQLSH_HOST environment variable.) When running CQLSH, you should see:
Connected to CLUSTER_NAME at HOST:9160.
[cqlsh 4.1.1 | Cassandra 2.0.11 | CQL spec 3.1.1 | Thrift protocol 19.39.0]
Use HELP for help.
cqlsh>
Before creating any data, you need a keyspace to put it in. Create one as follows:
cqlsh> create keyspace sampledata with replication = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
SimpleStrategy is the default placement strategy, and the replication factor is set to 1 because this is a simple cluster. Production data sets would have a higher replication factor. Now that the keyspace is created, create a table:
cqlsh> use sampledata;
cqlsh:sampledata> create table months ( id int PRIMARY KEY, name text );
Once the table is created, insert some data. Exit out of cqlsh, then create a file named months.cql with these contents:
insert into sampledata.months ( id, name ) VALUES( 1, 'January' );
insert into sampledata.months ( id, name ) VALUES( 2, 'February' );
insert into sampledata.months ( id, name ) VALUES( 3, 'March' );
insert into sampledata.months ( id, name ) VALUES( 4, 'April' );
insert into sampledata.months ( id, name ) VALUES( 5, 'May' );
insert into sampledata.months ( id, name ) VALUES( 6, 'June' );
insert into sampledata.months ( id, name ) VALUES( 7, 'July' );
insert into sampledata.months ( id, name ) VALUES( 8, 'August' );
insert into sampledata.months ( id, name ) VALUES( 9, 'September' );
insert into sampledata.months ( id, name ) VALUES( 10, 'October' );
insert into sampledata.months ( id, name ) VALUES( 11, 'November' );
insert into sampledata.months ( id, name ) VALUES( 12, 'December' );
Create the data as follows:
> cqlsh -f months.cql
If the command is successful, you will not see any output, and the command prompt will simply return. Now, let’s query the data. Run cqlsh again. At the command prompt, type this:
cqlsh> select * from sampledata.months;
If the data was inserted properly, it will show output similar to this:
 id | name
----+-----------
  5 |       May
 10 |   October
 11 |  November
  1 |   January
  8 |    August
  2 |  February
  4 |     April
  7 |      July
  6 |      June
  9 | September
 12 |  December
  3 |     March
(12 rows)
Note that the rows come back in a different order than they were inserted, depending on how the keys are actually distributed around the cluster. At this point, you have a working multi-node Cassandra cluster.