Figure 2. ER Model for the Book rating site
Let's see how this can be implemented using the Cassandra data model. Listing 3 shows a potential schema with Cassandra, where the first line represents the "Books" column family which has multiple rows, each having properties of the book as columns. <TS1> and <TS2> denote timestamps.
Listing 3. Cassandra schema for the book rating sample
Books[BookID->(author, rank, price, link, tag<TS1>, tag<TS2> .., cmt+<TS1>= text + "-" + author) …] Tags2BooksIndex[TagID->(<TS1>=bookID1, <TS2>=bookID2, ..) ] Tags2AuthorsIndex[TagID->(<TS1>=bookID1, <TS2>=bookID2, ..) ] RanksIndex["RANK" -> (rank<TS1>=bookID)]
Table 1 is a sample data set as per the schema.
Table 1. Sample data for the book rating site
Column Family Name | Sample Dataset |
---|---|
Books | "Foundation" -> ("author"="Asimov", "rank"=9, "price"=14, "tag1"="sci-fi", "tag2"="future", "cmt1311031405922"="best book-sanjiva", "cmt1311031405923"="well I disagree-srinath") "I Robot" -> ("author"="Asimov", "rank"=7, "price"=14, "tag1"="sci-fi" "tag2"="robots", "cmt1311031405924"="Asimov's best-srinath", "cmt1311031405928"="I like foundation better-sanjiva") |
RanksIndex | "Rank" -> (9="Foundation", 7="I Robot") |
Tags2BooksIndex | "sci-fi" -> ("1311031405918"="Foundation", "1311031405919"="I Robot" "future" -> … |
Tags2AuthorsIndex | "sci-fi" -> (1311031405920="Asimov") "future" -> … |
This example shows several design differences between the relational and Cassandra models. The Cassandra model stores data about books in a single column family called "Books," and the other three Column Families are indexes built to support queries.
Looking at the "Books" column family in detail, the model uses a row to represent each book where a book name is the row ID. Details about the book are represented as columns stored within the row.
Looking closely, you might notice that data items stored (like comments, and tags that have 1:M relationship with books) are also within a single row. To do that, append the time stamp to the column names for tags and comments. This approach stores all data within the same column. This action avoids having to do JOINs to retrieve data. Cassandra circumvents the lack of support for JOINs through this approach.
This provides several advantages.
- You can read all data about a book through a single query reading the complete row.
- You can retrieve comments and tags without a JOIN by using slice queries that have cmt0-cmt9999 and tag0-tag9999 as starting and ending ranges.
Because Cassandra stores columns sorted by their column names, making slice queries is very fast. It is worth noting that storing all the details about the data item in a single row and the use of sort orders are the most crucial ideas behind the Cassandra data design. Most Cassandra data model designs follow these ideas in some form. User can use the sort orders while storing data and building indexes. For example, another side effect of appending time stamps to column names is that as column names are stored in the sorted order, comments having column names post-fixed by the timestamps are stored in the order they are created, and search results would have the same order.
Cassandra does not support any search methods from the basic design. Although it supports secondary indexes, they are supported using indexes that are built later, and secondary indexes have several limitations including lack of support for range queries.
Consequently, the best results in a Cassandra data design needs users to implement searches by building custom indexes and utilizing column and row sort orders. Other three-column families (Tags2BooksIndex, Tags2AuthorsIndex, and RankIndex) do exactly that. Since users need to search for books given a tag, "Tags2BooksIndex" column family builds an index by storing the tag name as the row ID and all books tagged by that tag as columns under that row. As shown by the example, timestamps are added as the column keys, but that is to provide a unique column ID. The search implementation simply reads the index by looking up the row by tag name and finding the matches by reading all columns stored within that rowID.
Table 2 discusses how each of the queries required by the application is implemented using the above Cassandra indexes.
Table 2. Comparison of query implementations
Query description | Query as SQL | Cassandra implementation |
---|---|---|
List books sorted by the rank | Run the query"Select * from Books order by rank" and then on each result do "Select tag from Tags where bookid=?" and "Select comment from Comments where bookid=?" | Do a slice query on "RankIndex" column family to receive an ordered list of books, and for each book do a slice query on "Books" column family to read the details about the book. |
Given a tag, find the authors whose books have the given tag. | Select distinct author from Tags, Books where Tags.bookid=Books.bookid and tag=? | Read all columns for the given tag from Tags2Authors using a slice query. |
Given a tag, list books that have the given tag. | Select bookid from Tags where tag=? | Read all columns for the given tag from Tags2BooksIndex using a slice query. |
Given a book, list the comments for that book in sorted order of the time when the comments were created. | Select text, time, user from Comments where bookid=? Order by time | In "Books" column family, do a slice query from the row corresponding to the given book. They are in sorted order due to timestamps used as the column name. |
Although the above design can efficiently support queries required by the book-rating site, it can only support queries that it is designed for and cannot support ad-hoc queries. For example, it cannot do the following queries without building new indexes.
Select * from Books where price > 50;
Select * from Books where author="Asimov"
It is possible to change the design to support those and other queries by either building appropriate indexes or by writing code to walk through the data. The need for custom code to support new queries, however, is a limitation compared to relational models where adding new queries often needs no changes to the schema.
From the 0.8 release, Cassandra supports secondary indexes where users can specify a search by a given property, and Cassandra automatically builds indexes for searching based on that property. That model, however, provides less flexibility. For example, secondary indexes do not support range queries and provide no guarantees on sort orders of results.
No comments:
Post a Comment