By default, Cassandra uses org.apache.cassandra.locator.EndPointSnitch. It operates by simply comparing different octets in the IP addresses of each node. If two hosts have the same value in the second octet of their IP addresses, then they are determined to be in the same data center. If two hosts have the same value in the third octet of their IP addresses, then they are determined to be in the same rack. “Determined to be” really means that Cassandra has to guess based on an assumption of how your servers are located in different VLANs or subnets.
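To make the octet comparison concrete, here is a rough sketch of that logic only; it is not the actual EndPointSnitch source, and the method names are invented for illustration:

// Illustration of the octet comparison described above (not the actual snitch code)
static boolean sameDataCenter(String ip1, String ip2) {
    return ip1.split("\\.")[1].equals(ip2.split("\\.")[1]); // same second octet
}
static boolean sameRack(String ip1, String ip2) {
    return ip1.split("\\.")[2].equals(ip2.split("\\.")[2]); // same third octet
}
// For example, 192.168.10.5 and 192.168.20.9 share the second octet (168), so this
// scheme treats them as being in the same data center; their third octets differ
// (10 versus 20), so they are assumed to be on different racks.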
The org.apache.cassandra.locator.PropertyFileSnitch used to be in contrib, but was moved into the main code base in 0.7. This snitch allows you more control when using a Rack-Aware Strategy by specifying node locations in a standard key/value properties file called cassandra-rack.properties.
This snitch was contributed by Digg, which uses Cassandra and regularly contributes to its development. This snitch helps Cassandra know for certain if two IPs are in the same data center or on the same rack—because you tell it that they are. This is perhaps most useful if you move servers a lot, as operations often need to, or if you have inherited an unwieldy IP scheme.
The default configuration of cassandra-rack.properties looks like this (the entries shown here are representative; the exact addresses and names in the file that ships with your version may differ):
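# Cassandra node IP = data center : rack
10.0.0.10=DC1:RAC1
10.0.0.11=DC1:RAC1
10.0.0.12=DC1:RAC2
10.20.114.10=DC2:RAC1
10.20.114.11=DC2:RAC1
10.21.119.13=DC3:RAC1
10.21.119.14=DC3:RAC2

# default assignment for nodes not listed above
default=DC1:RAC1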
There are a few basic properties of Cassandra’s write ability that are worth noting. First, writing data is very fast in Cassandra, because its design does not require performing disk reads or seeks. The memtables and SSTables save Cassandra from having to perform these operations on writes, which slow down many databases. All writes in Cassandra are append-only.
Because of the database commit log and hinted handoff design, the database is always writeable, and within a column family, writes are always atomic.
Let’s run an example that deletes some data we previously inserted. Note that there is no “delete” operation in Cassandra; the operation is called remove, and even a remove is really just a write (of a tombstone flag). Because a remove is a write, you still have to supply a timestamp with the operation: if multiple clients are writing, the operation with the highest timestamp wins, whether it carries a tombstone or a new value. Cassandra doesn’t discriminate here.
A simple delete looks like this:
Connector conn = new Connector();
Cassandra.Client client = conn.connect();
String columnFamily = "Standard1";
byte[] key = "k2".getBytes(); //this is the row key
Clock clock = new Clock(System.currentTimeMillis());
ColumnPath colPath = new ColumnPath(columnFamily);
colPath.setColumn("b".getBytes()); //the column to remove ("b" is just an example name)
client.remove(key, colPath, clock, ConsistencyLevel.ALL);
conn.close();
There were many examples of using batch mutate to perform multiple inserts in Chapter 4, so I won’t rehash that here. I’ll just present an overview.
To perform many insert or update operations at once, use the batch_mutate method instead of the insert method. Like a batch update in the relational world, the batch_mutate operation groups calls on many keys into a single call in order to save on the cost of network round trips. If batch_mutate fails in the middle of its list of mutations, there will be no rollback, so any updates that have already occurred up to that point will remain intact. In the case of such a failure, the client can simply retry the batch_mutate operation.
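As a rough sketch of the call shape (reusing the Connector helper and the Clock-based Thrift interface from the earlier examples, with imports omitted as before; the column family, column name, and value are placeholders, and depending on your Thrift version the row key may need to be a ByteBuffer rather than a byte[]):

Connector conn = new Connector();
Cassandra.Client client = conn.connect();
Clock clock = new Clock(System.currentTimeMillis());

// Wrap a single column insertion as a Mutation
Column col = new Column("fname".getBytes(), "George".getBytes(), clock);
ColumnOrSuperColumn cosc = new ColumnOrSuperColumn();
cosc.setColumn(col);
Mutation mutation = new Mutation();
mutation.setColumn_or_supercolumn(cosc);

// Outer map: row key -> (column family name -> mutations for that row)
Map<String, List<Mutation>> cfToMutations = new HashMap<String, List<Mutation>>();
cfToMutations.put("Standard1", Arrays.asList(mutation));
Map<byte[], Map<String, List<Mutation>>> mutationMap =
        new HashMap<byte[], Map<String, List<Mutation>>>();
mutationMap.put("k2".getBytes(), cfToMutations);

client.batch_mutate(mutationMap, ConsistencyLevel.ONE);
conn.close();

A real batch would add more Mutation objects per column family and more row keys to the outer map before making the single batch_mutate call.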
Word count is one of the examples given in the MapReduce paper and is the starting point for many who are new to the framework. It takes a body of text and counts the occurrences of each distinct word. Here we provide some code to perform a word count over data contained in Cassandra. A working example of word count is also included in the Cassandra source download.
First we need a Mapper class, shown in Example 12-1.
Example 12-1. The TokenizerMapper.java class
public static class TokenizerMapper extends Mapper<byte[],
        SortedMap<byte[], IColumn>, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    private String columnName; // set from the job configuration in setup()
    public void map(byte[] key, SortedMap<byte[], IColumn> columns, Context context)
            throws IOException, InterruptedException {
        // tokenize the named column's value and emit (word, 1) pairs
        StringTokenizer itr = new StringTokenizer(new String(columns.get(columnName.getBytes()).value()));
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one); } } }
Cassandra has a Java source package for Hadoop integration code, called org.apache.cassandra.hadoop. There we find the following classes (a sketch showing how they fit into a job setup follows this list):
ColumnFamilyInputFormat
The main class we’ll use to interact with data stored in Cassandra from Hadoop. It’s an extension of Hadoop’s InputFormat abstract class.
ConfigHelper
A helper class to configure Cassandra-specific information such as the server node to point to, the port, and information specific to your MapReduce job.
ColumnFamilySplit
The extension of Hadoop’s InputSplit abstract class that creates splits over our Cassandra data. It also provides Hadoop with the location of the data, so that it may prefer running tasks on nodes where the data is stored.
ColumnFamilyRecordReader
The layer at which individual records from Cassandra are read. It’s an extension of Hadoop’s RecordReader abstract class.
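To show how these pieces fit together, here is a hedged sketch of a word count driver, modeled loosely on the word count demo in the Cassandra source. The keyspace and column family names, the column named "text", the output path, and the enclosing WordCount driver class and IntSumReducer are assumptions not shown in this section, and the exact ConfigHelper method names may differ in your version:

// Standard Hadoop job wiring
Job job = new Job(getConf(), "wordcount");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileOutputFormat.setOutputPath(job, new Path("/tmp/word_count"));

// Read input from Cassandra rather than HDFS
job.setInputFormatClass(ColumnFamilyInputFormat.class);
ConfigHelper.setColumnFamily(job.getConfiguration(), "Keyspace1", "Standard1");

// Restrict each row to the single column whose value holds the text to count
SlicePredicate predicate = new SlicePredicate()
        .setColumn_names(Arrays.asList("text".getBytes()));
ConfigHelper.setSlicePredicate(job.getConfiguration(), predicate);

job.waitForCompletion(true);

The key difference from an ordinary Hadoop word count is the input side: ColumnFamilyInputFormat replaces a file-based input format, and ConfigHelper tells the job which keyspace, column family, and columns to read.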
There are similar classes for outputting data to Cassandra in the Hadoop package, but at the time of this writing, those classes are still being finalized.