Tuesday, 27 May 2014

Cassandra and Backups

Cassandra is a peer-to-peer, fault-tolerant system. Data is replicated among multiple nodes across multiple data centers, so single-node or even multi-node failures can be recovered from surviving nodes that hold the data.
Restores from backups are unnecessary in the event of disk or system hardware failure, even if an entire site goes offline. As long as one node with the replicated data exists in one data center, Cassandra can recover the data without having to restore it from an external source.
However, Cassandra backups are still necessary to recover from erroneous data edits made by client applications. A "point-in-time" recovery of Cassandra is needed in the event of data corruption or some other catastrophic situation.
Cassandra provides backup utilities and a restore process. Constant Contact has extended these features with scripts to provide more flexibility with backups and to simplify recovery operations.

Cassandra File Structure

To understand Cassandra backups, a brief overview of Cassandra file structure is necessary.
Cassandra keeps data in SSTable files. They are stored in the keyspace directory within the data directory path specified by the data_file_directories parameter in the cassandra.yaml file. By default this directory path is /var/lib/cassandra/data/<keyspace_name>.
SSTables are written when Cassandra flushes its memtable. When the memtable contents are written to files, the memtable is cleared and ready to process new data. Cassandra continually creates new SSTable files in the keyspace directories as its memtable "bucket" is filled and emptied.
Once the memtable is flushed to an SSTable file, the SSTable file is said to be immutable – that is, no more writes are made to that file. However, Cassandra performs compaction, in which multiple old SSTable files are merged into a single new one and obsolete data (and the SSTable file(s) that contained it) is removed. So while the data is immutable, the files which contain the data are not.
A "point-in-time" recovery requires recovering all the SSTable files in a keyspace exactly as they were at a given instant.

Native Cassandra Backup Tools

Cassandra's nodetool utility provides a snapshot command. It flushes all in-memory writes to disk, then hard-links each current SSTable file for each keyspace into a "snapshots" subdirectory in the local disk keyspace area (<data_file_directories path>/<keyspace_name>/snapshots).
It creates a unique subdirectory for each set of snapshot files, so multiple instances in time can be preserved.
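For example, a snapshot of a single keyspace can be taken with nodetool and then inspected on disk (the keyspace name is illustrative and the paths assume the default data directory):

nodetool snapshot my_keyspace                         # flush memtables and hard-link the current SSTables
ls /var/lib/cassandra/data/my_keyspace/snapshots/     # each snapshot gets its own uniquely named subdirectory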
Cassandra versions 1.x and later also have an "incremental" backup feature.
There is no command to trigger incremental backups on demand; the feature is enabled by setting "incremental_backups" to "true" in the cassandra.yaml file.
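The relevant cassandra.yaml line looks like the following; the file location varies by install (commonly /etc/cassandra/conf/cassandra.yaml), and a node restart is typically required for the change to take effect:

incremental_backups: true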
The incremental backup feature creates a hard link to each new SSTable file, as it is written, in a backups subdirectory in the local disk keyspace area (<data_file_directories path>/<keyspace_name>/backups).
As incremental backups only contain new SSTable files, they are dependent on the last snapshot created. Natively, Cassandra incremental backups only supplement the last snapshot made.
Note that in both the snapshot and incremental backups, Cassandra only creates hard-links to the active data files in the parent keyspace, not actual copies of the files.  As hard links are simply different pointer names to the same inode on the disk, the processes are quick and consume little disk space.
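This is easy to verify with ls -i: the active SSTable and its snapshot hard link report the same inode number. The file name and snapshot id below are illustrative:

ls -i /var/lib/cassandra/data/my_keyspace/users-hd-5-Data.db
ls -i /var/lib/cassandra/data/my_keyspace/snapshots/1360082572055/users-hd-5-Data.db
# both commands print the same inode number, confirming the two names point to the same blocks on disk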
Because Cassandra doesn’t modify SSTable files after creating them, but simply adds new files and deletes old ones as needed, the hard-link method for backups does provide consistency for a successful restore. When Cassandra removes an old SSTable file from the active keyspace, a pointer to the file still exists in the snapshot or backup sub-directory.
As Cassandra continues to operate, the contents of the snapshot/backup sub-directories begin to diverge in size and content from the active keyspace area. The longer the local snapshots/backups are retained, the greater the local disk usage becomes, because the backup directories continue to retain links to older SSTable files that were removed from the active keyspace area.
The incremental backup feature in Cassandra 1.0 substantially reduces disk space requirements because it only contains links to new SSTable files generated since the last full snapshot. In contrast, all snapshots have links to all the files in the active keyspace area at the time the snapshot was made. The value of incremental backups in reducing disk usage is especially noticeable in larger data sets with minimal active writes.

Native Cassandra Recovery Process

Restoring a Cassandra keyspace means restoring all the keyspace SSTable files as they existed at a point in time.
Cassandra does not provide a native restore utility, but it does provide a restore procedure. For each node in the cluster:
  1. Shut down Cassandra.
  2. Clear all files in the commitlog directory (the path defined by the commitlog_directory parameter in the cassandra.yaml file, by default /var/lib/cassandra/commitlog). Ideally, the commit log will be flushed before Cassandra is shut down, as the commitlog directory is a shared resource for all keyspaces, not just the one being restored.
  3. Remove all current contents of the active keyspace (all *.db files).
  4. Copy the contents of the desired snapshot into the active keyspace.
  5. Only if the restored snapshot is the latest one, and you want the latest data, copy the contents of the backup directory into the active keyspace area on top of the restored snapshot files (a shell sketch of these steps follows the note below).
Note that the process must be executed on all nodes in the cluster, otherwise nodes that did not get the restored data will “update” the restored nodes with the newer, bad data.
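A minimal shell sketch of those per-node steps, assuming the default directory layout, a package-style service script, and an illustrative keyspace name and snapshot id, might look like this:

# run on every node in the cluster
service cassandra stop
rm -f /var/lib/cassandra/commitlog/*                    # step 2: clear the commit log
rm -f /var/lib/cassandra/data/my_keyspace/*.db          # step 3: remove the active keyspace files
cp /var/lib/cassandra/data/my_keyspace/snapshots/1360082572055/* \
   /var/lib/cassandra/data/my_keyspace/                 # step 4: copy in the desired snapshot
cp /var/lib/cassandra/data/my_keyspace/backups/* \
   /var/lib/cassandra/data/my_keyspace/ 2>/dev/null     # step 5: only when restoring the latest snapshot
chown -R cassandra:cassandra /var/lib/cassandra/data/my_keyspace
service cassandra start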

Extensions to Cassandra Native Backup Tools

We had additional goals for our Cassandra backups that were unmet by native Cassandra tools, specifically:
  • Automation: While incremental backups occur automatically in Cassandra, snapshots must be triggered manually from the command line. We needed to automate this process, scheduling it on a regular basis to establish consistent full backups for each node in the cluster.
  • Non-local storage: While Cassandra’s hard-link backup method is fast, it does not account for any potential problems with the local disk. We needed to make real copies of snapshot and backup files, and put them on storage devices separate from the local Cassandra node.
  • Multi-day/instance retention: While Cassandra allows multiple snapshots to be retained on disk, it retains only one set of incremental backups, and those only cover changes made since the last snapshot. We needed to retain incremental backups for previous snapshots as well.
To achieve these objectives, we utilized the common sysadmin tools Bash, Puppet, cron, and NFS.  We decided on these tools because they are routinely used in our group and would be the quickest and most straightforward method to achieve our goals.
Puppet
Puppet is used to distribute the backup script and cron entries to the hosts. The script is set up in Puppet as a template file, with variables that define the appropriate NFS target for each cluster by type and location.
Cron
Cron is already in use on the hosts and is managed via Puppet. Puppet variables in the cron statements, defined by cluster type and location, determine the execution times and command-line flags.
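As an illustration, the rendered cron entries might end up looking like the /etc/cron.d fragment below. The schedule, user, and use of only the --forcesnap flag are assumptions for the example; actual times and flags come from the Puppet variables described above.

# weekly full snapshot early Sunday morning, incremental copy to NFS the rest of the week
30 2 * * 0   cassandra  /usr/local/bin/cc_backup.sh --forcesnap
30 2 * * 1-6 cassandra  /usr/local/bin/cc_backup.sh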
NFS
NFS was the quickest and most economical way to provide non-local storage to a large number of Cassandra nodes. We use a mix of NetApp appliances and Isilon filers for our on-line Cassandra backup repositories.
It is important to note in discussing NFS that all Cassandra nodes need backup, even though data is replicated across nodes. The reason is that the data is replicated to be eventually consistent between all nodes – not that all nodes sharing a file set will have exactly the same data at the same time. Backing up all the nodes is the only way to ensure a consistent, complete backup of a keyspace.
This, however, means that the amount of NFS storage space needed for Cassandra backups may be larger than that for a similarly-scaled relational DB – because you will essentially be backing up X-number of copies of the same data (where X is the number of nodes in your Cassandra cluster that are configured to replicate to one another).
Bash
The Bash backup script, called "cc_backup.sh," does the following (a simplified sketch of the copy step appears after this list):
  • If the interval between snapshots (full backups) is greater than 7 days, or if specifically invoked via the "--forcesnap" flag, runs the Cassandra snapshot command.
  • If the script takes a snapshot, it copies the contents of that snapshot to an NFS mount point.
  • If the script does not take a snapshot, it copies the current contents of the backup directory to NFS.
  • The script creates the directory structure as needed in the NFS mount point as follows: /<nfs_mount>/<hostname>/<date>/snapshots
    /<nfs_mount>/<hostname>/<date>/backups
  • The script tars and compresses each keyspace snapshot or backup directory as a single file. Snapshot tar files are identified by keyspace name and snapshot id number (e.g., system-1360082572055.tar.gz). Incremental backup tar files are identified by keyspace name, parent snapshot id number, the phrase "bkup," and the date of the backup copy to NFS (e.g., system-1360082572055_bkup-07FEB13.tar.gz). The script uses the latest snapshot listed on the node to identify the appropriate parent snapshot id number.
  • The tar processes are run with reduced CPU and I/O priority so as not to interfere with regular Cassandra operations.
  • The script prunes older snapshots and incremental backups from both NFS and the local file system.
  • On the local system disk, the script removes the oldest snapshots over the maximum local retention value specified. Upon creation of a new snapshot, the script removes the contents from the local incremental backup directory, as that content is now useless with the new snapshot.
  • On the NFS mount point, the script removes older snapshots and backups based on NFS retention values specified.
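As a rough illustration, the incremental-copy portion of the script boils down to something like the sketch below. The mount point, variable names, and loop are simplified assumptions for illustration; the real script adds per-cluster configuration, retention pruning, and error handling.

#!/bin/bash
# simplified sketch of cc_backup.sh's incremental copy-to-NFS step (illustrative only)
NFS_ROOT=/mnt/cassandra_backups            # example NFS mount point
DATA_DIR=/var/lib/cassandra/data
HOST=$(hostname -s)
TODAY=$(date +%d%b%y | tr 'a-z' 'A-Z')     # e.g. 07FEB13, matching the naming described above

for ks_dir in "$DATA_DIR"/*; do
    ks=$(basename "$ks_dir")
    snap_id=$(ls -1t "$ks_dir/snapshots" 2>/dev/null | head -1)   # latest snapshot id on this node
    [ -n "$snap_id" ] || continue                                 # skip keyspaces with no snapshot yet
    dest="$NFS_ROOT/$HOST/$TODAY/backups"
    mkdir -p "$dest"
    # tar with reduced CPU and I/O priority so regular Cassandra operations are not disturbed
    nice -n 19 ionice -c3 tar -czf "$dest/${ks}-${snap_id}_bkup-${TODAY}.tar.gz" \
        -C "$ks_dir" backups
done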
Unlike the native Cassandra process, which only keeps one set of incremental backup files, the script keeps multiple instances of incremental backup files on NFS to provide for more choices in a point-in-time restore. The script also keeps multiple copies of snapshot files at certain points in time to support recovery of older incremental backups that are dependent on the older snapshot files.
For example, Cassandra clusters may have a 3-day “live” recovery capacity on NFS (older recoveries, if available, would be dependent on NDMP tape restore to NFS first).  This 3-day “live” recovery potential would require a mix of both snapshots and incremental backups, depending on the day of the week.
Backup File Retention Examples
For example, let’s assume that a given cluster with 3-day NFS retention performs snapshots on Sundays and incremental backups the rest of the week.
The Monday morning NFS backup tree for a given Cassandra host in that cluster would have the following:
  • Previous Sunday snapshot (necessary for recovery of Friday and Saturday incremental backups)
  • Friday incremental backup
  • Saturday incremental backup
  • Yesterday’s Sunday snapshot (necessary for recovery of Monday incremental backup)
  • Monday incremental backup
This gives us 3 days of recovery options, not including Monday morning's backup.
On Wednesday the NFS backup tree would have:
  • Latest Sunday snapshot (necessary for recovery of Monday, Tuesday, or Wednesday incremental backups)
  • Monday incremental backup
  • Tuesday incremental backup
  • Wednesday incremental backup

Extensions to the Native Cassandra Recovery Process

As our backup tool is merely a wrapper around the native Cassandra snapshot/backup process, data can be restored manually following the native Cassandra restore steps previously discussed.  Local snapshots/backups can be recovered this way, and our NFS backups can be restored the same way once they are untarred and uncompressed.
To simplify restores, however, we have created the script “cc_restore.sh.”
This script allows you to specify just the keyspace and date to restore from, and it will gather the appropriate snapshot and backup files from NFS for the restore. The script also verifies that all Cassandra processes are offline before it will proceed to restore any data. This makes it easier to execute restores en masse across an entire cluster via func or ssh.
Syntax:
/usr/local/bin/cc_restore.sh (table) (date)
Where:
table = table name to restore, or specify "ALL"
        (table names are case-sensitive; ALL must be all caps)
date  = restore the last instance of the date specified, in two-digit day, month, year format (e.g., 05JUL11),
        or specify "lastlocal" to restore the last local backup
Examples:
/usr/local/bin/cc_restore.sh PLINK_L1 05JUL11
/usr/local/bin/cc_restore.sh SharedVol1_F1 07JUL11
/usr/local/bin/cc_restore.sh ALL 11JUL11
/usr/local/bin/cc_restore.sh SharedVol1_F1 lastlocal

Future Challenges

Our backup/restore process is working, but we are constantly monitoring and tweaking it as the scale and complexity of our Cassandra environment grows.
Some of the things we are keeping an eye on:
System resource utilization and length of time to copy snapshots to NFS.
Although we feel we need the compression capability and file bundling of tar, we are aware of its aggressive use of system resources and the latency of NFS.
Careful, distributed scheduling of snapshots, so that no more than one site cluster goes to any one NFS appliance at the same time, coupled with highly optimized NFS client settings, has substantially reduced this impact.
Still, we are formulating possible mid-term and long-term alternatives should we encounter any performance problems.
Mid-term alternatives include using rsync in place of tar, and using non-Cassandra nodes to compress and tar up the files on the NFS mount. This, however, will increase NFS volume usage considerably.
Long-term alternatives include using a full-fledged Cassandra management solution such as Priam, which has its own built-in Cassandra backup/restore function within a Java VM. However, the fast compression used by this tool is said to be substantially less effective than traditional gzip, meaning NFS volume usage will likely grow substantially. Also, the implementation of a separate Java VM for backups adds a level of complexity to the process and would bring with it its own set of support requirements that would have to be addressed.
Backup space requirements.
As previously stated, Cassandra needs a lot of space for backups. We have to prepare for much more rapid acquisition and implementation of NFS appliances. As our clusters grow in size, we are finding that the horizontal scaling ability of Isilon filers is a good fit for Cassandra.
Cassandra Repair and Snapshot Conflicts.
In the previous Cassandra release, the Cassandra repair function would modify files that were still hard-linked to the snapshot directory. This would generate an error if it occurred while the backup script was still copying files to the NFS directory, as tar would report that the source file had changed. We have not yet observed this behavior with Cassandra 1.x and above, but we continue to watch for it.
Continue the conversation by sharing your comments here on the blog.

Monday, 26 May 2014

Bulk Integration Into and Out of Cassandra CQL3 Data Models


Purpose

The purpose of this blog is to provide an aggregated view of the batch/bulk techniques available for integrating with Apache Cassandra using CQL3-based data models, covering Cassandra 1.2 and 2.0 (DSE 3.2 and 4.0). This blog post will be augmented over time as techniques evolve.

Background

As long as people store data digitally, there will be a need to move large "chunks" of data between systems.  This could be for reasons of:
  • aggregation and combination (e.g., analytical systems), where multiple sources of data are combined and queried
  • archive purposes (removal of old and unneeded data)
  • data migration projects
  • large scale data model changes
  • database transition initiatives
  • other
This blog will provide a comprehensive view of bulk integration techniques when moving data into and out of Cassandra: 
  • with links to code samples
  • tips and tricks
  • recommended use cases
Information will be presented as a matrix that lists the different techniques as rows. Each technique has two entries: one for loading data into Cassandra from a source (Into Cassandra) and the other for moving data from Cassandra out to a target (Out of Cassandra). "A source" generally means an RDBMS, a flat file, Hadoop, a mainframe, etc. Each entry contains a summary of the technique as well as links to additional posts that provide supporting details.

We will work to provide a low-latency matrix in a subsequent post to help with near-real-time integration and data-pipelining techniques.

Batch Integration Matrix

 

Each technique below is listed with its "Into Cassandra" and "Out of Cassandra" uses.
Cassandra Bulk Loader (SSTableLoader)*

Into Cassandra:
  • Loads SSTables into a Cassandra ring from a node or set of nodes that are not actively part of the ring.
  • Good option for migration, cluster generation, etc.
  • Enables parallel loading of data.
  • Should be executed on a set of "non-cluster" nodes for optimal performance.
  • Requires creation of SSTables (see below).
  • CQL3 support depends on the creation of the SSTables.
  • Counters may not be fully supported.
  • Leverages Cassandra's internal streaming.
  • After completion, run a repair.
Out of Cassandra: n/a
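A typical invocation, run from a machine that is not an active member of the ring (host addresses and paths below are illustrative), looks roughly like this:

# the last two path components must be <keyspace>/<table> and contain the generated SSTables
sstableloader -d 10.0.0.11,10.0.0.12 /tmp/bulkload/my_keyspace/my_table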
Bulk Loading via JMX*

Into Cassandra:
  • JMX-based utility to stream SSTable data into Cassandra.
  • Leverages the same code as SSTableLoader.
  • Enables loading of data from a Cassandra node into the same node without requiring the configuration of a separate network interface, which SSTableLoader requires when run on the same node.
  • Same requirements and limitations as SSTableLoader.
  • Source and documentation (look for bulkLoad(String)).
Out of Cassandra: n/a
Copy SSTables

Into Cassandra (copy into the data directory):
  • Loads SSTables into a Cassandra ring.
  • Good option for migration, cluster generation, etc.
  • Requires creation of SSTables (see below).
  • Could leverage snapshot SSTables for migration purposes.
  • CQL3 support depends on the structure of the SSTables.
  • Working example and blog to be provided soon.
Out of Cassandra (copy out of the data directory):
  • Get access to SSTable data with minimal production-system impact.
  • Can be used as a source to populate another cluster.
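A rough sketch of the "into" direction, assuming Cassandra 1.1 or later, the default data directory, and illustrative keyspace/table names (nodetool refresh is one way to pick up the copied files without restarting the node):

# copy previously generated (or snapshotted) SSTables into the target table's data directory
cp /tmp/bulkload/my_keyspace/my_table/*.db /var/lib/cassandra/data/my_keyspace/my_table/
nodetool refresh my_keyspace my_table      # load the newly placed SSTables without a restart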
ColumnFamily<>Format

Into Cassandra (ColumnFamilyOutputFormat):
  • Thrift-based driver to load data into Cassandra Thrift-based column families.
  • Not compatible with CQL3 tables.
  • Example.
Out of Cassandra (ColumnFamilyInputFormat):
  • Thrift-based driver to extract data out of Cassandra Thrift-based column families.
  • Not compatible with CQL3 tables.
  • Example.
BulkOutputFormat

Into Cassandra:
  • Thrift-based tool with no CQL3 support (CQL3 support has to be creatively programmed using composites, similar to the SSTable loader techniques for the SSTableSimpleWriter and SSTableSimpleUnsortedWriter).
  • Used to stream data from MapReduce to Cassandra.
  • Similar to bulk loading above, but with no need for a "fat client" (i.e., a Cassandra node) to execute the process.
  • Can be used with Pig.
Out of Cassandra: n/a
CQL<>Format

Into Cassandra (CQLOutputFormat):
  • CQL3-based driver to load data into Cassandra CQL3 tables.
  • This is not necessarily a "bulk loader".
  • Example.
Out of Cassandra (CQLPagingInputFormat):
  • CQL3-based driver to extract data out of Cassandra CQL3 tables.
  • This is not necessarily a "bulk loader".
  • Example.
CQL3 Statements via M/R

Into Cassandra:
  • We have heard that several users simply create map-only jobs and insert data into Cassandra using a Java driver, such as the DSE Java driver.
  • Example.
Out of Cassandra: n/a
CQL3 Batch Statements

Into Cassandra:
  • Batch statements group individual statements into single operations and can be used for bulk-like integration processes.
  • Use UNLOGGED batches if performance is a concern, though you lose atomicity.
Out of Cassandra: n/a
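A minimal example of an unlogged batch, fed to cqlsh from the shell (the keyspace and table are illustrative and assumed to exist already):

cqlsh <<'EOF'
BEGIN UNLOGGED BATCH
  INSERT INTO demo_ks.events (id, day, payload) VALUES (1, '2014-05-26', 'first row');
  INSERT INTO demo_ks.events (id, day, payload) VALUES (2, '2014-05-26', 'second row');
APPLY BATCH;
EOF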
ETL Tools - Pentaho

Into Cassandra / Out of Cassandra (the same approach applies in both directions):
  • Visual, ETL approach for the bulk integration of Cassandra data with other sources.
  • Pentaho Data Integration 5 supports CQL3 (JIRA).
  • We have not had the chance to test this approach but will do so soon.
  • Working example and blog to be provided soon.
ETL Tools - JasperSoft

Into Cassandra / Out of Cassandra (the same approach applies in both directions):
  • Visual, ETL approach for the bulk integration of Cassandra data with other sources.
  • Jaspersoft Studio with the Cassandra Connector v1.0 supports CQL3 (Release).
  • We have not had the chance to test this approach but will do so soon.
  • Working example and blog to be provided soon.
ETL Tools - Talend

Into Cassandra / Out of Cassandra (the same approach applies in both directions):
  • Visual, ETL approach for the bulk integration of Cassandra data with other sources.
  • We are still investigating what Talend supports with regards to Cassandra. We did notice that it appears as if Talend would like to support the SSTableLoader mentioned above (JIRA).
  • We have not had the chance to test this approach but will do so soon.
  • Working example and blog to be provided soon.
Sqoop (DSE version)

Into Cassandra / Out of Cassandra (the same approach applies in both directions):
  • Offers a good tool for bulk/batch integration between many different databases, Hadoop, and Cassandra.
  • The DSE 3.x version does not yet support CQL3, but this should be coming in DSE 4.x.
  • This is a very promising feature and we will update once DataStax releases the CQL3 support.
  • For CQL2, here is a write-up with a working example.
CQLSH Copy

Into Cassandra:
  • Good tool for bulk loading a small amount of data; less than 1 million records is the recommendation.
  • Uses .csv files as import sources only.
Out of Cassandra:
  • Good tool for writing out a small amount of data (again, less than 1 million records is the recommendation) to a .csv file.
  • Writes .csv files as the export target only.
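For example, run from cqlsh (keyspace, table, column, and file names are illustrative):

cqlsh <<'EOF'
COPY my_keyspace.my_table (id, day, payload) TO '/tmp/my_table.csv';    -- export to CSV
COPY my_keyspace.my_table (id, day, payload) FROM '/tmp/my_table.csv';  -- import from CSV
EOF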
Flume Integration

Into Cassandra:
  • We found a project on GitHub that offers Flume integration into Cassandra.
  • Currently this is Hector-based.


*Requires SSTable Creation

*SSTable Writers

In order to load data into Cassandra using the bulk loading techniques denoted by * above, SSTables first need to be generated. The following provides information on the available techniques for creating SSTables.

CQLSSTableWriter
  Overview:
  • Fully supports CQL3-compatible SSTable creation.
  • Contained in Cassandra 2.x and higher.
  CQL3 support: Full

SSTableSimpleUnsortedWriter
  Overview:
  • Generates SSTables simply and easily; data does not have to be added in partitioner order.
  • Does not inherently support CQL3.
  • CQL3 support can be created using composite types.
  • DataStax example (non-CQL).
  Limitations:
  • Limited CQL support.
  • More complex CQL configuration compared to CQLSSTableWriter.
  CQL3 support: Limited

SSTableSimpleWriter
  Overview:
  • Generates SSTables, but does not sort the data itself; data must be added in partitioner-sorted order.
  • Not recommended for use unless there is a specific use case.
  Limitations:
  • Requires inserting data in partition-sorted order.
  CQL3 support: Limited

Conclusion

This list is meant to be a comprehensive guide.  Please let us know if we missed anything or if you have feedback on specific techniques.  Hopefully this post provides value to people who are analyzing different techniques to move large chunks of data into or out of Cassandra.
