Be a Dooper!


Setting up a Cassandra cluster

Posted on December 14, 2015 by dmontroy

An earlier post described how to install a single-node open source Cassandra instance.  With the advent of cloud technologies, it's easier than ever to get access to multiple servers for trying out a clustered environment.

These instructions show you how to set up a basic Cassandra cluster.  The process for two data nodes is outlined here, but this can be easily replicated to add more nodes.

System Setup

These instructions were created using CentOS 6.7 and two data nodes.  The servers were configured with 4 GB of memory and a 40 GB hard drive.  After installing the OS, install Java 7 or later.

Networking Setup

As in the single node setup, decide how you will refer to your servers: by DNS name or by IP address. If you use DNS names, make sure each name resolves from all hosts (for example, with the ping command), and add entries to /etc/hosts if necessary. Wherever these instructions show HOST_NAME, substitute either the DNS name or the IP address.
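That resolution check can be scripted. Here is a small sketch, assuming a HOSTS list you would fill in with your own server names (localhost is used here only so the example is self-contained):

```shell
#!/bin/bash
# Verify that every cluster host name resolves before going further.
# HOSTS is a placeholder list -- substitute your DNS names or IPs.
HOSTS="localhost"

for h in $HOSTS; do
  if getent hosts "$h" > /dev/null; then
    echo "$h: resolves"
  else
    echo "$h: does NOT resolve -- add it to /etc/hosts" >&2
  fi
done
```

Run this from each host, since /etc/hosts entries are per-machine.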

User/Group Configuration

This guide will create one user and one group for the Cassandra system on each server:

Groups

cassandra – This group will contain the Cassandra system user.

Users

cassandra – This user will be used for all core Cassandra operations. It is a member of the cassandra group.

Steps

  1. Create the cassandra group as a system group.
  2. Create the cassandra user as a system user with the primary group being cassandra and also make it a member of the users group.

On each host, run these commands:

> sudo groupadd -r cassandra
> sudo useradd -r -m cassandra -g cassandra -G users

 

Download and Install Cassandra

Download Cassandra from the Cassandra releases page at http://cassandra.apache.org/download/ .  For this post I used the 2.2.3 release (named apache-cassandra-2.2.3-bin.tar.gz). On each host, as the cassandra user, extract the archive:

> tar -zxf apache-cassandra-2.2.3-bin.tar.gz

Now move the extracted directory to the path /usr/share, and create a symbolic link to it:

> sudo mv apache-cassandra-2.2.3 /usr/share
> sudo ln -s /usr/share/apache-cassandra-2.2.3 /usr/share/cassandra

Why a symbolic link? A symbolic link will allow you to easily have multiple versions of Cassandra side by side in your installation, and you would be able to switch between them simply by changing the location that the link points to.
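As a quick illustration of that switch (using throwaway directories under a temp path instead of /usr/share, so no root is needed), `ln -sfn` repoints an existing link in place:

```shell
#!/bin/bash
# Demonstrate switching Cassandra versions by repointing a symlink.
# Temporary directories stand in for the real install locations.
base=$(mktemp -d)
mkdir "$base/apache-cassandra-2.2.3" "$base/apache-cassandra-2.2.4"

ln -s "$base/apache-cassandra-2.2.3" "$base/cassandra"
echo "before: $(readlink "$base/cassandra")"

# -sfn replaces the existing link, switching versions in one step.
ln -sfn "$base/apache-cassandra-2.2.4" "$base/cassandra"
echo "after:  $(readlink "$base/cassandra")"
```

Anything referencing /usr/share/cassandra (configs, PATH entries, scripts) keeps working across upgrades.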

Create the Data Directories

As with the single node setup, the data directories now need to be created.  Cassandra needs three directories.  As the cassandra user, run the following:

> mkdir /usr/share/cassandra/commitlog
> mkdir /usr/share/cassandra/data
> mkdir /usr/share/cassandra/saved_caches

Configuration:  cassandra.yaml

Now the node's configuration needs to be set up.  The YAML file format is similar to a properties file, except that a colon (:) follows each property name instead of an equals (=) sign.  The cassandra.yaml file is located under /usr/share/cassandra/conf. First, change the cluster name:

cluster_name: 'MyClusterName'

Specify whatever name you would like for the cluster, but be sure to change it from the default, since it's an easy indicator that you are using the right configuration. Next, set the paths to the data directories:

data_file_directories:
     - /usr/share/cassandra/data
commitlog_directory: /usr/share/cassandra/commitlog
saved_caches_directory: /usr/share/cassandra/saved_caches

Now, set the seed to the IP address of the first server:

seed_provider:
    # Addresses of hosts that are deemed contact points. 
    # Cassandra nodes use this list of hosts to find each other and learn
    # the topology of the ring. You must change this if you are running
    # multiple nodes!
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          # seeds is actually a comma-delimited list of addresses.
          # Ex: "<ip1>,<ip2>,<ip3>"
          - seeds: "IP_ADDRESS"

Replace IP_ADDRESS with the server’s IP address.  This setting will be the same on all servers.  Not every server needs to be a seed node.  For a simple setup, a single seed node will suffice.

Next set the listen_address property to each server’s IP address:

listen_address: IP_ADDRESS

Then set the rpc_address to each server’s IP address:

rpc_address: IP_ADDRESS

Note that this makes the config file different on each node of the cluster.  An easy approach is to configure all of the settings on one server, copy the entire file to the other server(s), and then update only the properties that are unique to each host.
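That per-host update can be scripted with sed. A sketch that patches a copied cassandra.yaml in place; the IP address and the /tmp path are placeholders, and a stand-in file is generated so the example is self-contained:

```shell
#!/bin/bash
# Patch the per-host properties in a copied cassandra.yaml.
# MY_IP would normally come from the host itself; it is hard-coded
# here only for illustration.
MY_IP="192.168.110.33"
CONF="/tmp/cassandra.yaml"

# Create a stand-in config so the sketch runs anywhere.
cat > "$CONF" <<'EOF'
listen_address: 192.168.110.32
rpc_address: 192.168.110.32
EOF

sed -i "s/^listen_address: .*/listen_address: $MY_IP/" "$CONF"
sed -i "s/^rpc_address: .*/rpc_address: $MY_IP/" "$CONF"
grep -E '^(listen|rpc)_address' "$CONF"
```

The seeds line stays untouched, since it is the same on every node.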

Firewall Configuration

It’s almost time to start Cassandra.  If you are running a server with a firewall enabled, open these ports to be able to access Cassandra services from outside the server:

  • 7000 (internode communication)
  • 7199 (JMX)
  • 9042 (CQL native transport)
  • 9160 (Thrift client API)
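On CentOS 6 the firewall is iptables. The sketch below builds the rules rather than applying them, so you can review them first; on a real server you would run each line with sudo and then `service iptables save`:

```shell
#!/bin/bash
# Build the iptables commands that open Cassandra's ports.
# Printed for review rather than executed -- applying them needs root.
PORTS="7000 7199 9042 9160"

RULES=$(for p in $PORTS; do
  echo "iptables -I INPUT -p tcp --dport $p -j ACCEPT"
done)
echo "$RULES"
```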

Starting Cassandra

Now it’s time to run Cassandra for the first time.  Run the following command on the seed node as the cassandra user:

> /usr/share/cassandra/bin/cassandra

This will start Cassandra and log the startup information to stdout.  You want to see messages like the following:

 INFO 21:14:53,291 Node /IP_ADDRESS state jump to normal
 INFO 21:14:53,360 Waiting for gossip to settle before accepting client requests...
 INFO 21:15:01,363 No gossip backlog; proceeding

Once you see this, you can repeat the same process on the other node(s).  Look for the same message.
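Waiting for that line can be automated when scripting a rolling start. A sketch that polls a log file for the gossip-settled message, with a timeout so it never hangs; the log path is a placeholder, and a fake log is created here so the example is self-contained:

```shell
#!/bin/bash
# Wait (up to ~30s) for Cassandra's gossip-settled message in a log.
# LOG is a placeholder path; a fake log is seeded for demonstration.
LOG="/tmp/cassandra-startup.log"
echo " INFO 21:15:01,363 No gossip backlog; proceeding" > "$LOG"

for i in $(seq 1 30); do
  if grep -q "No gossip backlog" "$LOG"; then
    echo "node is up"
    break
  fi
  sleep 1
done
```

On a real node you would point LOG at wherever you redirected Cassandra's stdout, or at logs/system.log under the install directory.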

Checking the status of the cluster

After getting the nodes started, you can use the nodetool status command to verify that all hosts were registered.  Run the following command on any node:

> /usr/share/cassandra/bin/nodetool status

You will see output similar to:

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens       Owns    Host ID                               Rack
UN  192.168.110.32  235.2 KB   256          ?       8d1e16c2-854a-4670-baf3-cdb3a89d7a7c  rack1
UN  192.168.110.33  301.17 KB  256          ?       7ae097c8-feb2-421a-ac24-b35aec638a2c  rack1
 
Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless

You should see all of your nodes listed with a UN status.

For ease of execution, it’s helpful to add the /usr/share/cassandra/bin directory to the path.
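If you script health checks, the UN rows can be counted directly. A sketch using canned status lines so it runs anywhere; on a real node you would pipe `/usr/share/cassandra/bin/nodetool status` instead:

```shell
#!/bin/bash
# Count nodes reporting Up/Normal (UN) in nodetool status output.
# Canned output stands in for a live cluster here.
STATUS='UN  192.168.110.32  235.2 KB   256          ?       8d1e16c2  rack1
UN  192.168.110.33  301.17 KB  256          ?       7ae097c8  rack1'

UP=$(echo "$STATUS" | awk '$1 == "UN"' | wc -l)
echo "nodes up: $UP"
```

Comparing the count against the expected node total makes a simple cluster-health probe.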

Startup/Shutdown Scripts

In a clustered environment, it is very beneficial to have a single way to start and stop all services on all hosts.  Unlike Hadoop, Cassandra doesn't include built-in scripts for this.  Here are some simple ones.

Startup script

#!/bin/bash
SERVERS="HOST_NAME1
HOST_NAME2"

for SERVERNAME in $SERVERS; do
  echo Starting on $SERVERNAME...
  sudo -u cassandra ssh $SERVERNAME "/usr/share/cassandra/bin/cassandra"
done

Shutdown Script

#!/bin/bash
SERVERS="HOST_NAME1
HOST_NAME2"
 
for SERVERNAME in $SERVERS; do
  echo Stopping on $SERVERNAME...
  sudo -u cassandra ssh $SERVERNAME "/home/cassandra/bin/cassandra-stop.sh"
done

The cassandra-stop.sh script is:

#!/bin/bash
 
CASS_PID=$(ps -ef | grep CassandraDaemon | grep -v grep | awk '{ print $2 }')

if [ -z "$CASS_PID" ]
then
  echo "Cassandra does not appear to be running"
else
  kill $CASS_PID
fi

Description

These scripts can be run as any user with sudo access to the cassandra user.  The cassandra-stop.sh script should be located in the /home/cassandra/bin directory.

Optimum Server Configuration

If you see this message in the log:

Cassandra server running in degraded mode. Is swap disabled? : false,  Address space adequate? : true,  nofile limit adequate? : false, nproc limit adequate? : false

Make sure that these steps have been followed:

  • Disable the swap partition in the /etc/fstab file
  • Add these settings to the limits.conf file for the cassandra user:
cassandra - memlock unlimited
cassandra - nofile 100000
cassandra - nproc 32768
cassandra - as unlimited
  • Set vm.max_map_count = 131072 in /etc/sysctl.conf
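A quick way to see which of these limits still need attention is to print the current values next to the recommended ones. A small sketch:

```shell
#!/bin/bash
# Report current limits alongside the recommended values from the
# list above, so you can spot anything that still needs raising.
echo "nofile: $(ulimit -n) (recommended 100000)"
echo "nproc:  $(ulimit -u) (recommended 32768)"
if [ -r /proc/sys/vm/max_map_count ]; then
  echo "vm.max_map_count: $(cat /proc/sys/vm/max_map_count) (recommended 131072)"
fi
```

Run it as the cassandra user, since limits.conf settings are per-user.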

Testing Your Servers

To verify that your servers are running properly, run the following as any user:

> /usr/share/cassandra/bin/nodetool -h IP_ADDRESS info

You should see output similar to this:

Token : (invoke with -T/--tokens to see all 256 tokens)
ID : c602825f-367d-4ae0-8c90-afcf6bb818bf
Gossip active : true
Thrift active : true
Native Transport active: true
Load : 105.82 KB
Generation No : 1416107662
Uptime (seconds) : 266
Heap Memory (MB) : 26.85 / 928.00
Data Center : datacenter1
Rack : rack1
Exceptions : 0
Key Cache : size 2096 (bytes), capacity 48234496 (bytes), 76 hits, 96 requests, NaN recent hit rate, 14400 save period in seconds
Row Cache : size 0 (bytes), capacity 0 (bytes), 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds

If nodetool seems to pause and not return for upwards of a minute, there may be issues with the network configuration of your server.

Running Remotely

The Cassandra client tools can be run remotely from another workstation.  Just make sure the workstation has Java 7 or later, then download the Cassandra archive and extract it.  Add the bin directory to your path for convenience.

In order to run remotely, make sure you have the ports listed above open in the firewall, if the server has a firewall enabled.

Creating Sample Data

In order to create data in Cassandra, run the CQLSH tool by running cqlsh from the Cassandra bin directory.  The format is:

> cqlsh HOST

where HOST is the host name or IP address of one of the servers.  (Note: you can set a default host with the CQLSH_HOST environment variable.) When running cqlsh, you should see:

Connected to CLUSTER_NAME at HOST:9160.
[cqlsh 4.1.1 | Cassandra 2.0.11 | CQL spec 3.1.1 | Thrift protocol 19.39.0]
Use HELP for help.
cqlsh>

Before creating any data, it needs to go into a keyspace.  Create a keyspace as follows:

cqlsh> create keyspace sampledata with replication = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };

SimpleStrategy is the default placement strategy, and the replication factor is set to 1 because this is a simple cluster.  Production data sets would have a higher replication factor. Now that the keyspace is created, create a table:

cqlsh> use sampledata;
cqlsh:sampledata> create table months ( id int PRIMARY KEY, name text );
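As an aside on the replication settings chosen earlier: a production cluster, especially one spanning data centers, would typically use NetworkTopologyStrategy rather than SimpleStrategy, with a per-data-center replica count. A sketch, reusing the datacenter1 name from the nodetool output above (proddata is a hypothetical keyspace name):

cqlsh> create keyspace proddata with replication = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 2 };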

Once the table is created, insert some data.  Exit out of cqlsh, then create a file named months.cql with these contents:

insert into sampledata.months ( id, name )
VALUES( 1, 'January' );
insert into sampledata.months ( id, name )
VALUES( 2, 'February' );
insert into sampledata.months ( id, name )
VALUES( 3, 'March' );
insert into sampledata.months ( id, name )
VALUES( 4, 'April' );
insert into sampledata.months ( id, name )
VALUES( 5, 'May' );
insert into sampledata.months ( id, name )
VALUES( 6, 'June' );
insert into sampledata.months ( id, name )
VALUES( 7, 'July' );
insert into sampledata.months ( id, name )
VALUES( 8, 'August' );
insert into sampledata.months ( id, name )
VALUES( 9, 'September' );
insert into sampledata.months ( id, name )
VALUES( 10, 'October' );
insert into sampledata.months ( id, name )
VALUES( 11, 'November' );
insert into sampledata.months ( id, name )
VALUES( 12, 'December' );

Create the data as follows:

> cqlsh -f months.cql

If the command is successful, you will not see any output, and the command prompt will simply return. Now, let’s query the data.  Run cqlsh again.  At the command prompt, type this:

cqlsh> select * from sampledata.months;

If the data was inserted properly, it will show output similar to this:

id | name
---+-----------
 5 | May
10 | October
11 | November
 1 | January
 8 | August
 2 | February
 4 | April
 7 | July
 6 | June
 9 | September
12 | December
 3 | March
(12 rows)

Note that the rows come back in a different order than they were inserted; the ordering depends on how the row keys hash to tokens across the cluster. At this point, you have a working multi-node Cassandra cluster.
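To get a deterministic result, query by the partition key. With the table above:

cqlsh> select name from sampledata.months where id = 3;

This returns just the March row, regardless of how the partitions are distributed around the ring.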

Posted in Cassandra | Tags: cluster, opensource |
