Sunday, 18 May 2014

Understanding How CQL3 Maps to Cassandra’s Internal Data Structure



John Berryman — July 24, 2013 | 1 Comment | Filed in: solr
CQL3 appears to be the newly ordained, canonical, and best-practices means of interacting with Cassandra. Indeed, the Apache Cassandra documentation itself declares the Thrift API “legacy” and recommends that CQL be used instead. At the same time, however, I’ve heard several people express concern over the added layer of abstraction and uncertainty about what’s really happening inside of Cassandra.
In this post we will open up the hood and take a look at exactly how Cassandra treats CQL queries. Our methodology is simple: we will create tables with CQL statements and then, in the “legacy” cassandra-cli, look at exactly how the corresponding data gets written. For your further edification, consider following along: download Cassandra, cd into the Cassandra directory, start it (bin/cassandra -f), and then in two new terminal windows start cassandra-cli and cqlsh (respectively bin/cassandra-cli and bin/cqlsh). We’ll be switching back and forth between cqlsh and cassandra-cli.

A Simple Example

First off we’ll create and use a test keyspace:
cqlsh> CREATE KEYSPACE test
       WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
cqlsh> USE test;
Go ahead and flip over to cassandra-cli and use the test keyspace there as well.
[default] use test;
In our first example, let’s look at a simple table composed of three fields and a simple primary key.
cqlsh:test> CREATE TABLE example (
        ... field1 int PRIMARY KEY,
        ... field2 int,
        ... field3 int);
And we will toss in a few example rows:
cqlsh:test> INSERT INTO example (field1, field2, field3) VALUES ( 1,2,3);
cqlsh:test> INSERT INTO example (field1, field2, field3) VALUES ( 4,5,6);
cqlsh:test> INSERT INTO example (field1, field2, field3) VALUES ( 7,8,9);
And as we would expect, the resulting entry as viewed through CQL is
cqlsh:test> SELECT * FROM example;

 field1 | field2 | field3
--------+--------+--------
      1 |      2 |      3
      4 |      5 |      6
      7 |      8 |      9
However this looks very different when viewed from cassandra-cli:
[default@test] list  example;
-------------------
RowKey: 1
=> (column=, value=, timestamp=1374546754299000)
=> (column=field2, value=00000002, timestamp=1374546754299000)
=> (column=field3, value=00000003, timestamp=1374546754299000)
-------------------
RowKey: 4
=> (column=, value=, timestamp=1374546757815000)
=> (column=field2, value=00000005, timestamp=1374546757815000)
=> (column=field3, value=00000006, timestamp=1374546757815000)
-------------------
RowKey: 7
=> (column=, value=, timestamp=1374546761055000)
=> (column=field2, value=00000008, timestamp=1374546761055000)
=> (column=field3, value=00000009, timestamp=1374546761055000)
Even though this presentation of the data is much harder to decipher, it’s important to understand what’s going on here because cassandra-cli presents the information exactly as it’s stored internally in Cassandra. For each item above, there are 3 important things to look at: the row key (RowKey: <?>), the column name (column=<?>) and the column value (value=<?>). From these examples, we can make a couple of initial observations about the mapping from CQL statements to their internal representations.
  • The value of the CQL primary key is used internally as the row key (which in the new CQL paradigm is being called a “partition key”).
  • The names of the non-primary key CQL fields are used internally as columns names. The values of the non-primary key CQL fields are then internally stored as the corresponding column values.
You may have also noticed that these rows all contain columns with no column name and no column value. This is not a bug! It’s actually a way of handling the fact that it should be possible to declare the existence of field1=<some number> without necessarily specifying values for field2 or field3. (See here for further details.)
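As a sanity check on those hex column values: a CQL int is stored as a 4-byte big-endian value. This quick Python sketch (an illustration only, not Cassandra code) reproduces the values shown in the cassandra-cli listing:

```python
import struct

def encode_cql_int(n):
    """Serialize an int the way a CQL int column value is stored:
    4 bytes, big-endian, shown here hex-encoded like cassandra-cli does."""
    return struct.pack(">i", n).hex()

print(encode_cql_int(2))  # → 00000002
print(encode_cql_int(3))  # → 00000003
```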

A More Complex Example

Now that we’re getting used to how the CQL and cassandra-cli results are organized, let’s jump into the thick of it with a more complicated example. Back in cqlsh:
cqlsh:test> DROP TABLE example;
cqlsh:test> CREATE TABLE example ( 
        ... partitionKey1 text,
        ... partitionKey2 text,
        ... clusterKey1 text,
        ... clusterKey2 text,
        ... normalField1 text,
        ... normalField2 text,
        ... PRIMARY KEY (
                (partitionKey1, partitionKey2), 
                clusterKey1, clusterKey2
              )
        ... );
Here we’ve named the fields to indicate where they’ll end up in the internal representation. And we’ve also pulled out all the stops: our primary key is not only compound, it uses a compound partition key (represented by the double nesting of parentheses in the PRIMARY KEY) and compound clustering keys (more on partition vs. clustering keys in a bit).
Correspondingly we will also insert data where, again, the name gives us a hint of where the data is headed in the internal Cassandra data structure.
cqlsh:test> INSERT INTO example (
        ... partitionKey1, 
        ... partitionKey2, 
        ... clusterKey1, 
        ... clusterKey2, 
        ... normalField1, 
        ... normalField2
        ... ) VALUES ( 
        ... 'partitionVal1', 
        ... 'partitionVal2', 
        ... 'clusterVal1', 
        ... 'clusterVal2', 
        ... 'normalVal1', 
        ... 'normalVal2'); 
If we query this in CQL, we get the results that we would expect:
cqlsh:test> SELECT * FROM example;    
 partitionkey1 | partitionkey2 | clusterkey1 | clusterkey2 | normalfield1 | normalfield2
---------------+---------------+-------------+-------------+--------------+--------------
 partitionVal1 | partitionVal2 | clusterVal1 | clusterVal2 |   normalVal1 |   normalVal2
But again, this looks very different in cassandra-cli:
[default@test] list example;
-------------------
RowKey: partitionVal1:partitionVal2
=> (column=clusterVal1:clusterVal2:, value=, timestamp=1374630892473000)
=> (column=clusterVal1:clusterVal2:normalfield1, value=6e6f726d616c56616c31, timestamp=1374630892473000)
=> (column=clusterVal1:clusterVal2:normalfield2, value=6e6f726d616c56616c32, timestamp=1374630892473000)
One by one, let’s consider where every element of the CQL SELECT statement ends up in the cassandra-cli list statement. Again, it’s important to remember that the cassandra-cli representation closely corresponds to how the data is stored in the actual Cassandra data structure!
  • Looking at partitionVal1 and partitionVal2 we see that these are concatenated together and used internally as the RowKey (a.k.a. the partition key).
  • Similarly, clusterVal1 and clusterVal2 are concatenated together and used in the names of the columns. But look what else they’re concatenated with – the non-primary key field names – specifically normalfield1 and normalfield2.
  • What came of the actual values of normalfield1 and normalfield2? They are encoded as the values of columns clusterVal1:clusterVal2:normalfield1 and clusterVal1:clusterVal2:normalfield2 respectively: 6e6f726d616c56616c31 is the hex encoding of normalVal1, and 6e6f726d616c56616c32 of normalVal2.
  • Look what’s missing: partitionkey1, partitionkey2, clusterkey1 and clusterkey2 are not present at all! This is because the components of the primary key are tracked separately in the system.schema_columnfamilies table. Go ahead and try the CQL query:
    SELECT key_aliases, column_aliases FROM system.schema_columnfamilies WHERE keyspace_name='test' AND columnfamily_name='example';
And you’ll find our missing field names:
 key_aliases                       | column_aliases
-----------------------------------+-------------------------------
 ["partitionkey1","partitionkey2"] | ["clusterkey1","clusterkey2"]
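If you’d like to verify the hex values yourself: a CQL text column value is just the UTF-8 bytes of the string, which cassandra-cli displays hex-encoded. A small Python sketch (an illustration, not part of either tool):

```python
def decode_text_value(hex_value):
    """Decode a cassandra-cli hex dump of a CQL text column value."""
    return bytes.fromhex(hex_value).decode("utf-8")

print(decode_text_value("6e6f726d616c56616c31"))  # → normalVal1
print(decode_text_value("6e6f726d616c56616c32"))  # → normalVal2
```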

A More Practical Example

In the example above, we still don’t have the big picture because we’re only looking at a single entry. It’s also a little difficult to identify with the example here because it’s abstract. So in our final example we look at something that includes several entries while at the same time being something that we can all readily identify with – tweets. Consider the following table:
cqlsh:test> CREATE TABLE tweets ( 
        ... user text, 
        ... time timestamp,
        ... tweet text,
        ... lat float,
        ... long float,
        ... PRIMARY KEY (user, time)
        ... );
Take special notice of how the primary key is defined. Again we have a partition key, user. The partition key is always the first field in the primary key, and it can optionally be compound, as in the previous example. We also have a clustering key, time. The clustering keys are the fields of the primary key aside from the partition key. It will become clear in a moment why the pieces of the primary key are labeled “partition” or “clustering”.
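The splitting rule can be restated in a few lines of Python (the helper below is hypothetical, purely to illustrate the rule): the first component of the PRIMARY KEY, which may itself be a parenthesized tuple, is the partition key; every remaining component is a clustering key.

```python
def split_primary_key(components):
    """Illustrative restatement of the PRIMARY KEY rule: the first
    component (possibly a tuple, i.e. a compound partition key) is the
    partition key; the rest are clustering keys."""
    partition, *clustering = components
    if not isinstance(partition, tuple):
        partition = (partition,)
    return partition, clustering

# PRIMARY KEY (user, time) from the tweets table:
print(split_primary_key(["user", "time"]))
# → (('user',), ['time'])

# PRIMARY KEY ((partitionKey1, partitionKey2), clusterKey1, clusterKey2):
print(split_primary_key([("partitionKey1", "partitionKey2"),
                         "clusterKey1", "clusterKey2"]))
# → (('partitionKey1', 'partitionKey2'), ['clusterKey1', 'clusterKey2'])
```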
After entering in several items we select the contents of the table and here’s what we get:
cqlsh:test> SELECT * FROM tweets;
 user         | time                     | lat    | long    | tweet
--------------+--------------------------+--------+---------+---------------------
 softwaredoug | 2013-07-13 08:21:54-0400 | 38.162 | -78.549 |  Having chest pain.
 softwaredoug | 2013-07-21 12:15:27-0400 | 38.093 | -78.573 |   Speedo self shot.
      jnbrymn | 2013-06-29 20:53:15-0400 | 38.092 | -78.453 | I like programming.
      jnbrymn | 2013-07-14 22:55:45-0400 | 38.073 | -78.659 |     Who likes cats?
      jnbrymn | 2013-07-24 06:23:54-0400 | 38.073 | -78.647 |  My coffee is cold.
What’s more, we can easily (and efficiently) select all tweets for a particular user:
cqlsh:test> SELECT * FROM tweets WHERE user='jnbrymn';
And we can easily (and efficiently) select all tweets for a particular user within a particular time slice:
cqlsh:test> SELECT * FROM tweets WHERE user='jnbrymn' AND time>='2013-07-01';
In all of the queries, it is also significant that we are efficiently retrieving all information associated with each tweet (tweet text, lat, long, …) without having to issue multiple queries.
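The reason the time-slice query is cheap can be sketched in plain Python: within a partition, Cassandra keeps columns sorted by clustering key, so a slice amounts to a binary search for the slice boundary rather than a scan. (The timestamps are from the example above; the code is a model of the idea, not Cassandra’s implementation.)

```python
import bisect

# Columns within the jnbrymn partition, kept sorted by clustering key (time):
columns = [
    "2013-06-29 20:53:15",
    "2013-07-14 22:55:45",
    "2013-07-24 06:23:54",
]

# time >= '2013-07-01' is a binary search for the start of the slice:
start = bisect.bisect_left(columns, "2013-07-01")
print(columns[start:])  # → ['2013-07-14 22:55:45', '2013-07-24 06:23:54']
```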
In order to understand why these queries are efficient, it’s informative to look again at the cassandra-cli listing of tweets:
[default@test] list tweets;
-------------------
RowKey: softwaredoug
=> (column=2013-07-13 08\:21\:54-0400:, value=, timestamp=1374673155373000)
=> (column=2013-07-13 08\:21\:54-0400:lat, value=4218a5e3, timestamp=1374673155373000)
=> (column=2013-07-13 08\:21\:54-0400:long, value=c29d1917, timestamp=1374673155373000)
=> (column=2013-07-13 08\:21\:54-0400:tweet, value=486176696e67206368657374207061696e2e, timestamp=1374673155373000)
=> (column=2013-07-21 12\:15\:27-0400:, value=, timestamp=1374673155407000)
=> (column=2013-07-21 12\:15\:27-0400:lat, value=42185f3b, timestamp=1374673155407000)
=> (column=2013-07-21 12\:15\:27-0400:long, value=c29d2560, timestamp=1374673155407000)
=> (column=2013-07-21 12\:15\:27-0400:tweet, value=53706565646f2073656c662073686f742e, timestamp=1374673155407000)
-------------------
RowKey: jnbrymn
=> (column=2013-06-29 20\:53\:15-0400:, value=, timestamp=1374673155419000)
=> (column=2013-06-29 20\:53\:15-0400:lat, value=42185e35, timestamp=1374673155419000)
=> (column=2013-06-29 20\:53\:15-0400:long, value=c29ce7f0, timestamp=1374673155419000)
=> (column=2013-06-29 20\:53\:15-0400:tweet, value=49206c696b652070726f6772616d6d696e672e, timestamp=1374673155419000)
=> (column=2013-07-14 22\:55\:45-0400:, value=, timestamp=1374673155434000)
=> (column=2013-07-14 22\:55\:45-0400:lat, value=42184ac1, timestamp=1374673155434000)
=> (column=2013-07-14 22\:55\:45-0400:long, value=c29d5168, timestamp=1374673155434000)
=> (column=2013-07-14 22\:55\:45-0400:tweet, value=57686f206c696b657320636174733f, timestamp=1374673155434000)
=> (column=2013-07-24 06\:23\:54-0400:, value=, timestamp=1374673155485000)
=> (column=2013-07-24 06\:23\:54-0400:lat, value=42184ac1, timestamp=1374673155485000)
=> (column=2013-07-24 06\:23\:54-0400:long, value=c29d4b44, timestamp=1374673155485000)
=> (column=2013-07-24 06\:23\:54-0400:tweet, value=4d7920636f6666656520697320636f6c642e, timestamp=1374673155485000)
We see here that internally, these tweets are stored in two different rows. The internal rows are keyed by the user names — the partition keys. This is why it is so efficient to retrieve all the tweets for a single user. The internal columns are named by the tweet times — the clustering keys. This is why it is so efficient to query for slices along the clustering keys. Also demonstrated here, the names of the non-primary key fields are appended to the internal column names. This makes it possible to grab all fields associated with a given tweet without having to make independent queries.
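You can even recover the latitudes and longitudes from the hex above: a CQL float is stored as a 4-byte big-endian IEEE 754 value. Another Python sketch (illustrative, not Cassandra code):

```python
import struct

def decode_cql_float(hex_value):
    """Decode a cassandra-cli hex dump of a CQL float column value."""
    return struct.unpack(">f", bytes.fromhex(hex_value))[0]

# softwaredoug's first tweet: lat=4218a5e3, long=c29d1917
print(round(decode_cql_float("4218a5e3"), 3))  # → 38.162
print(round(decode_cql_float("c29d1917"), 3))  # → -78.549
```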

Why? What’s the Rationale?

Why would so much trouble be taken to carefully disorganize the field names and field values like this? The answer is that this is actually a fantastic Cassandra data modeling pattern – so fantastic in fact, that the Cassandra community has decided to bet the farm on it, so to speak, and build an interface to it that abstracts away the messy bits. The interface — CQL.
But what does this pattern provide you? As it turns out, many things!
  • It provides fast look-up by partition key and efficient scans and slices by cluster key.
  • It groups together related data as CQL rows. This means that you can do in one query what would otherwise take multiple queries into different column families.
  • It allows for individual fields to be added, modified, and deleted independently.
  • It is strictly better than the old Cassandra paradigm. Proof: you can coerce CQL Tables to behave exactly like old-style Cassandra ColumnFamilies. (See the examples here.)
  • It extends easily to the implementation of sets, lists, and maps (which are super ugly if you’re working directly in old Cassandra) — but that’s for another blog post.
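One way to picture what all of this buys you (a rough mental model only, not Cassandra’s actual data structures): each partition is an ordered map keyed by (clustering key, field name), so a single partition lookup yields every field of every CQL row in it.

```python
# Rough model: partition key -> ordered map of composite column names.
# The values are drawn from the tweets example; the structure is illustrative.
tweets = {
    "jnbrymn": {
        ("2013-06-29 20:53:15", "tweet"): "I like programming.",
        ("2013-06-29 20:53:15", "lat"): 38.092,
        ("2013-07-14 22:55:45", "tweet"): "Who likes cats?",
        ("2013-07-14 22:55:45", "lat"): 38.073,
    },
}

# One lookup grabs all fields of all of a user's tweets, already grouped
# and ordered by clustering key:
row = tweets["jnbrymn"]
print(sorted(row))
```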
And although I haven’t addressed the matter here, the new CQL protocol allows for asynchronous communication as compared with the synchronous, call-response communication required by Thrift. As a result, CQL is capable of being much faster and less resource intensive than Thrift – especially when using single threaded clients.
Still not certain that you’re ready for CQL? Don’t worry — Thrift will still have to be supported for legacy Cassandra applications, so it will be with us for quite some time yet. But if you’d like to take the leap into modernity, take a look at this Datastax blog post. It should get you started in the right direction.
