Saturday, 21 June 2014

Cassandra data models: Book rating site

Figure 2. ER model for the book rating site (flow diagram of the book site data model)
Let's see how this can be implemented using the Cassandra data model. Listing 3 shows a potential Cassandra schema, where the first line represents the "Books" column family, which has multiple rows, each holding the properties of a book as columns. <TS1> and <TS2> denote timestamps.
Listing 3. Cassandra schema for the book rating sample
Books[BookID -> (author, rank, price, link, tag<TS1>, tag<TS2>, ...,
    cmt<TS1> = text + "-" + author) ...]
Tags2BooksIndex[TagID -> (<TS1>=bookID1, <TS2>=bookID2, ...)]
Tags2AuthorsIndex[TagID -> (<TS1>=author1, <TS2>=author2, ...)]
RanksIndex["RANK" -> (rank=bookID, ...)]
Table 1 is a sample data set as per the schema.
Table 1. Sample data for the book rating site
Column Family Name    Sample Dataset

Books
    "Foundation" -> ("author"="Asimov", "rank"=9, "price"=14, "tag1"="sci-fi", "tag2"="future", "cmt1311031405922"="best book-sanjiva", "cmt1311031405923"="well I disagree-srinath")
    "I Robot" -> ("author"="Asimov", "rank"=7, "price"=14, "tag1"="sci-fi", "tag2"="robots", "cmt1311031405924"="Asimov's best-srinath", "cmt1311031405928"="I like foundation better-sanjiva")

RanksIndex
    "Rank" -> (9="Foundation", 7="I Robot")

Tags2BooksIndex
    "sci-fi" -> ("1311031405918"="Foundation", "1311031405919"="I Robot")
    "future" -> …

Tags2AuthorsIndex
    "sci-fi" -> (1311031405920="Asimov")
    "future" -> …
This example shows several design differences between the relational and Cassandra models. The Cassandra model stores data about books in a single column family called "Books," and the other three Column Families are indexes built to support queries.
Looking at the "Books" column family in detail, the model uses a row to represent each book, where the book name is the row ID. Details about the book are represented as columns stored within that row.
Looking closely, you might notice that data items that have a 1:M relationship with the book (like comments and tags) are also stored within the same row. To make that possible, the model appends a timestamp to the column names for tags and comments, giving each a unique column name within the row. Keeping all of a book's data in one row means it can be retrieved without JOINs; this is how the design works around Cassandra's lack of JOIN support.
This provides several advantages.
  • You can read all data about a book through a single query reading the complete row.
  • You can retrieve comments and tags without a JOIN by using slice queries that have cmt0-cmt9999 and tag0-tag9999 as starting and ending ranges.
Because Cassandra stores columns sorted by their column names, slice queries are very fast. It is worth noting that storing all the details about a data item in a single row and exploiting sort orders are the two most crucial ideas behind Cassandra data design, and most Cassandra data models follow them in some form. Users can take advantage of the sort orders both when storing data and when building indexes. For example, another side effect of appending timestamps to column names is that, since column names are stored in sorted order, comments whose column names are suffixed by timestamps are stored in the order they were created, and search results come back in that same order.
Cassandra's basic design does not include built-in search. Although it supports secondary indexes, these are built after the fact and have several limitations, including lack of support for range queries.
Consequently, getting the best results from a Cassandra data design requires users to implement searches by building custom indexes and exploiting column and row sort orders. The other three column families (Tags2BooksIndex, Tags2AuthorsIndex, and RanksIndex) do exactly that. Since users need to search for books given a tag, the "Tags2BooksIndex" column family builds an index by storing the tag name as the row ID and all books having that tag as columns under that row. As the example shows, timestamps are added as the column keys, but only to provide a unique column ID. The search implementation simply looks up the row by tag name and finds the matches by reading all columns stored within that row.
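The schema above uses Thrift-era dynamic column families, but the same tag index can be sketched in CQL 3, where a clustering column plays the role of the timestamped column names (the table layout below is my own illustration, not part of the original design):

    CREATE TABLE tags2books_index (
        tag text,
        inserted_at timeuuid,   -- stands in for the <TS> column-name timestamps
        bookid text,
        PRIMARY KEY (tag, inserted_at)
    );

    -- "Find books for a tag" reads a single row (partition) as a slice:
    SELECT bookid FROM tags2books_index WHERE tag = 'sci-fi';

Because rows within the tag partition are clustered by inserted_at, the books come back in insertion order, mirroring the sort-order trick described above.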
Table 2 discusses how each of the queries required by the application is implemented using the above Cassandra indexes.
Table 2. Comparison of query implementations
1. List books sorted by rank.
   SQL: Run "Select * from Books order by rank", then on each result do "Select tag from Tags where bookid=?" and "Select comment from Comments where bookid=?".
   Cassandra: Do a slice query on the "RanksIndex" column family to get an ordered list of books, then for each book do a slice query on the "Books" column family to read its details.

2. Given a tag, find the authors whose books have that tag.
   SQL: "Select distinct author from Tags, Books where Tags.bookid=Books.bookid and tag=?"
   Cassandra: Read all columns for the given tag from Tags2AuthorsIndex using a slice query.

3. Given a tag, list books that have that tag.
   SQL: "Select bookid from Tags where tag=?"
   Cassandra: Read all columns for the given tag from Tags2BooksIndex using a slice query.

4. Given a book, list its comments sorted by creation time.
   SQL: "Select text, time, user from Comments where bookid=? order by time"
   Cassandra: In the "Books" column family, do a slice query on the row for the given book; because the column names include timestamps, the comments come back already sorted by creation time.
Although the above design can efficiently support queries required by the book-rating site, it can only support queries that it is designed for and cannot support ad-hoc queries. For example, it cannot do the following queries without building new indexes.
  • Select * from Books where price > 50;
  • Select * from Books where author="Asimov"
It is possible to change the design to support those and other queries by either building appropriate indexes or by writing code to walk through the data. The need for custom code to support new queries, however, is a limitation compared to relational models where adding new queries often needs no changes to the schema.
As of the 0.8 release, Cassandra supports secondary indexes: users can specify a search by a given property, and Cassandra automatically builds indexes for searching on that property. That model, however, provides less flexibility; for example, secondary indexes do not support range queries and provide no guarantees on the sort order of results.

Friday, 20 June 2014

CQL programming series - WHERE clause

The Breaking Down the CQL WHERE Clause blog posting was created by Ike Walker. To view more posts by Ike, check out the Mechanics of Flite blog.

When explaining the differences between CQL and SQL, I often start by rattling off a list of things you can’t do in CQL. The list usually starts something like this:
  1. No joins
  2. No GROUP BY
  3. No arbitrary WHERE clauses
  4. No arbitrary ORDER BY clauses
The first 2 items on the list are fairly obvious. Joins and GROUP BY are not supported at all.
But what about the WHERE clause? SQL supports a whole bunch of operators and predicates in
the WHERE clause, and allows filtering on non-indexed columns. The CQL WHERE clause only works
for indexed columns, specifically columns in the primary key (both the partition key and the clustering
columns), and columns that have secondary indexes on them. Multiple predicates can only be
connected with AND, not OR, only a few operators are supported, and all of the operators have
restrictions. The DataStax CQL documentation describes what the WHERE clause supports. I mostly agree with that language, but I’d say that the = and IN operators are also restricted; for example, you can only use IN on the last column of the partition key. More on that later.
From my experience, here’s the bottom line of which operators are supported for which columns in the CQL WHERE clause:
  1. Partition key columns support the = operator
  2. The last column in the partition key supports the IN operator
  3. Clustering columns support the =, >, >=, <, and <= operators
  4. Secondary index columns support the = operator
Read on for examples.

1. Partition key columns support the = operator

I’ll borrow a table from Patrick McFadin’s post about time series data modeling for my examples:
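Roughly, that table looks like this (a reconstruction based on the columns discussed below; the CQL snippets that follow are likewise illustrative sketches against this table):

    CREATE TABLE temperature_by_day (
        weatherstation_id text,
        date text,
        event_time timestamp,
        temperature text,
        PRIMARY KEY ((weatherstation_id, date), event_time)
    );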
Since this table has a composite partition key on (weatherstation_id, date), I need to include both of those columns if I want to use the = operator.
So this query is allowed:
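    SELECT * FROM temperature_by_day
    WHERE weatherstation_id = '1234ABCD' AND date = '2014-06-20';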
But these queries are not allowed:
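    -- Only part of the composite partition key is restricted:
    SELECT * FROM temperature_by_day WHERE weatherstation_id = '1234ABCD';

    SELECT * FROM temperature_by_day WHERE date = '2014-06-20';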


2. The last column in the partition key supports the IN operator

For single column partition keys, the IN operator is allowed without restriction. I’ll use a table with a single column partition key to illustrate that:
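A minimal sketch, assuming a hypothetical table keyed by weatherstation_id alone:

    CREATE TABLE temperature (
        weatherstation_id text,
        event_time timestamp,
        temperature text,
        PRIMARY KEY (weatherstation_id, event_time)
    );

    -- IN is allowed without restriction on a single-column partition key:
    SELECT * FROM temperature
    WHERE weatherstation_id IN ('1234ABCD', '5678EFGH');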
For composite partition keys, I have to use the = operator on the first N-1 columns of the partition key in order to use the IN operator on the last column. So this query is allowed:
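    SELECT * FROM temperature_by_day
    WHERE weatherstation_id = '1234ABCD'
    AND date IN ('2014-06-20', '2014-06-21');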
But these queries are not allowed:
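    -- IN on a partition key column that is not the last one:
    SELECT * FROM temperature_by_day
    WHERE weatherstation_id IN ('1234ABCD', '5678EFGH') AND date = '2014-06-20';

    -- IN on the last column without = on the preceding column:
    SELECT * FROM temperature_by_day
    WHERE date IN ('2014-06-20', '2014-06-21');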


3. Clustering columns support the =, >, >=, <, and <= operators

The clustering columns support lots of operators, with the caveats described in the DataStax documentation mentioned earlier.
The simplest case is to use the = operator for the partition key and the clustering columns, like this:
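    SELECT * FROM temperature_by_day
    WHERE weatherstation_id = '1234ABCD' AND date = '2014-06-20'
    AND event_time = '2014-06-20 07:01:00';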
I can also use the = or IN operator on the partition key and do a range scan on the clustering columns:
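    SELECT * FROM temperature_by_day
    WHERE weatherstation_id = '1234ABCD'
    AND date IN ('2014-06-20', '2014-06-21')
    AND event_time >= '2014-06-20 00:00:00';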
Regardless of the operator, if I try to filter on the clustering columns only, I’ll get an error if I don’t use the ALLOW FILTERING command. For example:
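    -- Rejected: a clustering column is restricted without the partition key.
    SELECT * FROM temperature_by_day
    WHERE event_time > '2014-06-20 00:00:00';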
If I add the ALLOW FILTERING command then I can use any operator on the clustering columns, but the query may be very slow and could eventually time out.
So these queries are all allowed:
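    SELECT * FROM temperature_by_day
    WHERE event_time > '2014-06-20 00:00:00' ALLOW FILTERING;

    SELECT * FROM temperature_by_day
    WHERE event_time = '2014-06-20 07:01:00' ALLOW FILTERING;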
But sometimes they may time out, like this:
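A sketch of the failure mode (the exact error text varies by version; in cqlsh of that era it typically read "Request did not complete within rpc_timeout."):

    SELECT * FROM temperature_by_day
    WHERE event_time < '2014-06-20 00:00:00' ALLOW FILTERING;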


4. Secondary index columns support the = operator

For the secondary index examples I’ll add a secondary index to the un-indexed temperature column:
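    -- Cassandra auto-names the index (e.g. temperature_by_day_temperature_idx):
    CREATE INDEX ON temperature_by_day (temperature);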
Now I can filter on the temperature column with the = operator:
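    -- '72F' is an illustrative value:
    SELECT * FROM temperature_by_day WHERE temperature = '72F';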
I can also use a secondary index in conjunction with the = operator on the partition key and any supported operator on the clustering columns. For example:
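    SELECT * FROM temperature_by_day
    WHERE weatherstation_id = '1234ABCD' AND date = '2014-06-20'
    AND event_time > '2014-06-20 07:01:00'
    AND temperature = '72F';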

Anti-patterns in Cassandra

Implementation or design patterns that are ineffective and/or counterproductive in Cassandra production installations. Correct patterns are suggested in most cases.

Network attached storage

Storing SSTables on a network attached storage (NAS) device is not recommended. Using a NAS device often results in network related bottlenecks resulting from high levels of I/O wait time on both reads and writes. The causes of these bottlenecks include:
  • Router latency.
  • The Network Interface Card (NIC) in the node.
  • The NIC in the NAS device.
If you are required to use NAS for your environment, please contact a technical resource from DataStax for assistance.

Shared network file systems

Shared network file systems (NFS) have the same limitations as NAS. The temptation with NFS implementations is to place all SSTables from a node into one NFS share. Doing this negates one of Cassandra's strongest features: no single point of failure (SPOF). When all SSTables from all nodes are stored on a single NFS, the NFS becomes a SPOF. To best use Cassandra, avoid NFS.

Excessive heap space size

DataStax recommends using the default heap space size for most use cases. Exceeding this size can impair the Java virtual machine's (JVM) ability to perform fluid garbage collections (GC). The following table shows a comparison of heap space performances reported by a Cassandra user:
Heap      CPU utilization    Queries per second       Latency
40 GB     50%                750                      1 second
8 GB      5%                 8500 (not maxed out)     10 ms
For information on heap sizing, see Tuning Java resources.

Cassandra's rack feature

Defining one rack for the entire cluster is the simplest and most common implementation. Multiple racks should be avoided for the following reasons:
  • Most users tend to ignore or forget the rack requirement that racks should be organized in an alternating order. This order allows the data to be distributed safely and appropriately.
  • Many users are not using the rack information effectively. For example, setting up with as many racks as nodes (or similar non-beneficial scenarios).
  • Expanding a cluster when using racks can be tedious. The procedure typically involves several node moves and must ensure that racks are distributing data correctly and evenly. When clusters need immediate expansion, racks should be the last concern.
To use racks correctly:
  • Use the same number of nodes in each rack.
  • Start with one rack; once the cluster is stable, swap nodes and make the appropriate moves so that nodes are placed in the ring in an alternating pattern with respect to the racks. This still gives you the benefits of Cassandra's rack feature while allowing quick and fully functional expansions.

SELECT ... IN or index lookups

SELECT ... IN and index lookups (formerly secondary indexes) should be avoided except for specific scenarios. See When not to use IN in SELECT and When not to use an index in Indexing in CQL for Cassandra 2.0.
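To see why, consider a hypothetical users table where each user_id is its own partition: an IN that spans many partitions forces a single coordinator to fan the request out to many replica sets and buffer every result before responding.

    -- Hypothetical table: one partition per user_id.
    SELECT * FROM users WHERE user_id IN ('u1', 'u2', 'u3');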

Using the Byte Ordered Partitioner

The Byte Ordered Partitioner (BOP) is not recommended.
Use virtual nodes (vnodes) and either the Murmur3Partitioner (the default) or the RandomPartitioner. Vnodes allow each node to own a large number of small ranges distributed throughout the cluster, and they save you the effort of generating tokens and assigning them to your nodes. Even without vnodes, these partitioners are recommended because they place each key at a position derived from the key's hash value, so writes are spread evenly across the token ranges of the ring. This even distribution also holds when stale data is deleted, so removals stay balanced around the cluster.

Reading before writing

Reads take time for every request, as they typically have multiple disk hits for uncached reads. In workflows requiring reads before writes, this small amount of latency can affect overall throughput. All write I/O in Cassandra is sequential, so there is very little performance difference regardless of data size or key distribution.

Load balancers

Cassandra was designed to avoid the need for load balancers. Putting load balancers between Cassandra and Cassandra clients is harmful to performance, cost, availability, debugging, testing, and scaling. All high-level clients, such as Astyanax and pycassa, implement load balancing directly.

Insufficient testing

Be sure to test at scale with production loads. This is the best way to ensure your system will function properly when your application goes live. The information you gather from testing is the best indicator of the per-node throughput you should use in future expansion calculations.
To properly test, set up a small cluster and drive it with production loads. For each cluster size there is a maximum throughput beyond which performance no longer increases. Measure that maximum throughput, apply it linearly to other cluster sizes, and extrapolate (graph) your results to predict the cluster sizes required for your production throughputs, now and in the future. The Netflix case study shows an excellent example of this kind of testing.

Lack of familiarity with Linux

Linux has a great collection of tools. Becoming familiar with its built-in tools will help you greatly and reduce the cost of normal, routine operation and management. The essential tools and techniques to learn are:
  • Parallel SSH and Cluster SSH: The pssh and cssh tools allow SSH access to multiple nodes. This is useful for inspections and cluster wide changes.
  • Passwordless SSH: SSH authentication is carried out using public and private keys. This allows SSH connections to hop easily from node to node without password prompts. In cases where more security is required, you can add a password-protected jump box and/or a VPN.
  • Useful common command-line tools include:
    • top: Provides an ongoing look at CPU processor activity in real time.
    • System performance tools: Tools such as iostat, mpstat, iftop, sar, lsof, netstat, htop, vmstat, and similar can collect and report a variety of metrics about the operation of the system.
    • vmstat: Reports information about processes, memory, paging, block I/O, traps, and CPU activity.
    • iftop: Shows a list of network connections. Connections are ordered by bandwidth usage, with the pair of hosts responsible for the most traffic at the top of list. This tool makes it easier to identify the hosts causing network congestion.

Running without the recommended settings

Be sure to use the recommended settings in the Cassandra documentation.
Also be sure to consult the Planning a Cassandra cluster deployment documentation, which discusses hardware and other recommendations before making your final hardware purchases.

More anti-patterns

For more about anti-patterns, visit Matt Dennis's slideshare.

Monday, 16 June 2014

Jasper Advanced Features: Expanding a Crosstab in Jaspersoft iReport Designer

To create a crosstab report (sometimes called a "pivot table") in Jaspersoft's iReport Designer, simply drag a Crosstab component from the palette to a Summary band and configure it through the wizard. However, if you need more than 2 row or 2 column groups in the crosstab, you'll need additional configuration.


Jaspersoft's iReport Designer features a New Crosstab wizard that will guide you through adding a Crosstab component to a report.

The wizard is limited to 2 row groups and 2 column groups.  For most concise reports, this is sufficient.  Too many row and column groups can be disconcerting because various totals will appear throughout the tabular data.

Row Groups

If you need additional row and column groups, add a Row Group and then build a set of totaling cells into the crosstab. This example works with the following spreadsheet. A 2 x 2 crosstab was built with Year and Region forming the initial set of rows and Dept Name and Dept ID forming the initial set of columns. This is the maximum that can be configured using the wizard.

The spreadsheet has an additional column, "Sr. Officer", that will be added as an additional row group.

Data Supporting a Crosstab Report

Start by right-clicking in the Report Inspector and selecting "Add Row Group".

Add a New Row Group
The new row group will appear in the report's Crosstab tab.

Crosstab with Sr. Officer Variable Added
Totals

The report can be previewed at this point, and the data will total correctly. To add the Sr. Officer subtotaling, components need to be added from the palette for the label (Static Text) and each of the three totals: Dept ID, Department, and Total. The totaling components are Text Fields.


Crosstab with Totals Created

Each of the three Text Fields is configured to use the same variable, $V{BudgetAmount_Measure}. The type is set to java.lang.Integer.

The result is a report that includes the extra data.  You may want to apply background colors to make it appear more consistent with the other table cells.  (I skipped the extra formatting to make the cells stand out.)

Crosstab Report with Row Group Created Outside of Wizard
Most crosstab reports can be rendered using 2 row groups, and Jaspersoft's iReport Designer provides a convenient wizard that walks you through crosstab creation. If you need more row groups, add them manually, then add extra components from the palette for the totals.

As Angular 7 has just been released a few days ago. This tutorial is updated to show you how to create an Angular 7 project and the new fe...