Perfect MVC code: Operations


Tuesday, 27 May 2014

Cassandra and Backups

FEBRUARY 28, 2013 BY MICHAEL ANDERSON

Cassandra is a peer-to-peer, fault-tolerant system. Data is replicated among multiple nodes across multiple data centers. Single or even multi-node failures can be recovered from surviving nodes with the data.

Restores from backups are unnecessary in the event of disk or system hardware failure even if an entire site goes off-line. As long as there exists one node with the replicated data in one data center, Cassandra can recover the data without having to restore data from an external source.

However, Cassandra backups are still necessary to recover from any errors made in data edits by client applications. There is a need for a “point-in-time” recovery of Cassandra in the event of data corruption or some other catastrophic situation.

Cassandra provides backup utilities and a restore process. Constant Contact has extended these features with scripts to provide more flexibility with backups and to simplify recovery operations.

Cassandra File Structure

To understand Cassandra backups, a brief overview of Cassandra file structure is necessary.

Cassandra keeps data in SSTable files. They are stored in the keyspace directory within the data directory path specified by the data_file_directories parameter in the cassandra.yaml file (referred to below as <DataFileDirectory>). By default this directory path is /var/lib/cassandra/data/<keyspace_name>.

SSTable files are written when Cassandra fills its memtable. When the memtable contents are written to files, the memtable is cleared and ready to process new data. Cassandra continually makes new SSTable files in keyspace directories as its memtable “bucket” is filled and emptied.

Once the memtable is written to an SSTable file, the SSTable file is said to be immutable; that is, no more writes are made to that file. However, Cassandra executes compaction, meaning multiple old SSTable files are merged into a single new one, and obsolete data (and the SSTable file(s) that contained it) is removed. So while the data is immutable, the set of files that contains the data is not.

A “point-in-time” recovery requires recovery of all the SSTable files in a keyspace exactly as they were in a given instant.

Native Cassandra Backup Tools

Cassandra's nodetool command-line utility has a snapshot command. This flushes all in-memory writes to disk, then hard-links each current SSTable file for each keyspace into a “snapshots” subdirectory in the local disk keyspace area (<DataFileDirectory>/<keyspace_name>/snapshots).

It creates a unique subdirectory for each set of snapshot files, so multiple instances in time can be preserved.

Versions of Cassandra 1.x and later also have an “incremental” backup feature.

There is no command to execute incremental backups. The feature is enabled by changing the value of “incremental_backups” to “true” in the cassandra.yaml file.

The incremental backup feature creates a hard link to each new SSTable, as it is created, in a “backups” subdirectory in the local disk keyspace area (<DataFileDirectory>/<keyspace_name>/backups).

As incremental backups only contain new SSTable files, they are dependent on the last snapshot created. Natively, Cassandra incremental backups only supplement the last snapshot made.

Note that in both the snapshot and incremental backups, Cassandra only creates hard-links to the active data files in the parent keyspace, not actual copies of the files. As hard links are simply different pointer names to the same inode on the disk, the processes are quick and consume little disk space.

Because Cassandra doesn’t modify SSTable files after creating them, but simply adds new files and deletes old ones as needed, the hard-link method for backups does provide consistency for a successful restore. When Cassandra removes an old SSTable file from the active keyspace, a pointer to the file still exists in the snapshot or backup sub-directory.
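To make the hard-link behavior concrete, here is a minimal Python sketch. It does not touch Cassandra at all: the file name and the snapshot id used for the directory are made up for illustration, and it assumes a file system that supports hard links. It shows that the link shares the original file's inode, and that the data is still readable through the link after the original name is deleted (as happens when compaction removes an old SSTable from the active keyspace).

```python
import os
import tempfile


def hard_link_demo():
    """Return (same_inode, link_count_before_delete, data_after_delete)."""
    with tempfile.TemporaryDirectory() as data_dir:
        # Hypothetical names; real SSTable file names look different.
        sstable = os.path.join(data_dir, "example-Data.db")
        snapshot_dir = os.path.join(data_dir, "snapshots", "1360082572055")
        os.makedirs(snapshot_dir)
        with open(sstable, "wb") as f:
            f.write(b"immutable sstable contents")

        linked = os.path.join(snapshot_dir, "example-Data.db")
        os.link(sstable, linked)  # hard link: a second name for the same inode

        same_inode = os.stat(sstable).st_ino == os.stat(linked).st_ino
        link_count = os.stat(sstable).st_nlink

        # Simulate compaction deleting the file from the active keyspace area.
        os.remove(sstable)
        with open(linked, "rb") as f:
            data = f.read()
        return same_inode, link_count, data


print(hard_link_demo())
```

No bytes are copied when the link is created, which is why snapshots are fast and initially consume almost no extra disk space.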

As Cassandra continues to operate, the contents of the snapshot/backup sub-directories begin to diverge in size and content from the active keyspace area. The longer the local snapshots/backups are retained, the greater the local disk usage becomes, because the backup directories continue to retain links to older SSTable files that were removed from the active keyspace area.

The incremental backup feature in Cassandra 1.0 substantially reduces disk space requirements because it only contains links to new SSTable files generated since the last full snapshot. In contrast, all snapshots have links to all the files in the active keyspace area at the time the snapshot was made. The value of incremental backups in reducing disk usage is especially noticeable in larger data sets with minimal active writes.

Native Cassandra Recovery Process

Restoring a Cassandra keyspace means restoring all the keyspace SSTable files as they existed in a point in time.

Cassandra does not provide a native restore utility, but does provide a restore procedure. For each node in the cluster:

Shut down Cassandra.
Clear all files in the commitlog directory (path defined by the commitlog_directory parameter in the cassandra.yaml file, by default /var/lib/cassandra/commitlog). Ideally, logs will be flushed before Cassandra is shut down, as the commitlog directory is a shared resource of all keyspaces, not just the one to be restored.
Remove all current contents of the active keyspace (all *.db files).
Copy the contents of the desired snapshot to the active keyspace.
If the restored snapshot is the latest one and you also want the latest incremental backup, copy the contents of the backups directory into the active keyspace area on top of the restored snapshot files.

Note that the process must be executed on all nodes in the cluster, otherwise nodes that did not get the restored data will “update” the restored nodes with the newer, bad data.
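The file-handling part of the procedure above (removing the current *.db files, copying the snapshot, and optionally layering the incremental backup on top) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name and its parameters are hypothetical, it assumes Cassandra has already been shut down and the commitlog directory cleared, and it handles a single keyspace directory on a single node.

```python
import glob
import os
import shutil


def restore_keyspace(keyspace_dir, snapshot_dir, backups_dir=None):
    """Restore one keyspace directory from a snapshot (hypothetical helper).

    Assumes Cassandra is stopped and the commit log has been cleared.
    Returns the names of the files copied into the active keyspace.
    """
    # Remove the current contents of the active keyspace (top-level *.db only,
    # so the snapshots/ and backups/ subdirectories are left alone).
    for path in glob.glob(os.path.join(keyspace_dir, "*.db")):
        os.remove(path)

    # Copy the snapshot first, then (optionally) the incremental backups on top.
    sources = [snapshot_dir] + ([backups_dir] if backups_dir else [])
    restored = []
    for source in sources:
        for name in sorted(os.listdir(source)):
            src = os.path.join(source, name)
            if os.path.isfile(src):
                shutil.copy2(src, os.path.join(keyspace_dir, name))
                restored.append(name)
    return restored
```

Pass backups_dir only when the snapshot being restored is the latest one, since incremental backups only supplement the last snapshot.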

Extensions to Cassandra Native Backup Tools

We had additional goals for our Cassandra backups that were unmet by native Cassandra tools, specifically:

Automation: While incremental backups occur automatically in Cassandra, snapshots must be run as a command from the CLI. We needed to automate this process, scheduling it on a regular basis to establish consistent full backups for each node in the cluster.
Non-local storage: While Cassandra’s hard-link backup method is fast, it does not account for any potential problems with the local disk. We needed to make real copies of snapshot and backup files, and put them on storage devices separate from the local Cassandra node.
Multi-day/instance retention: While Cassandra allows for multiple snapshots to be retained on disk, it retains only one set of incremental backups, and those only cover changes made from the last snapshot. We needed to retain incremental backups for previous snapshots as well.

To achieve these objectives, we utilized the common sysadmin tools Bash, Puppet, cron, and NFS. We decided on these tools because they are routinely used in our group and would be the quickest and most straightforward method to achieve our goals.

Puppet

Puppet is used to distribute the backup script and cron entries to the hosts. The script is set up in Puppet as a template file, with variables to define the appropriate NFS target for each cluster for each type and location.

Cron

Cron is already in use on the hosts and managed via Puppet. Puppet variables in the cron statements, defined by cluster type and location, are used to determine the execution times and command-line flags.

NFS

NFS was the most economical and quickest way to provide non-local storage to large numbers of Cassandra nodes. We use a mix of NetApp appliances and Isilon filers for our on-line Cassandra backup repositories.

It is important to note in discussing NFS that all Cassandra nodes need backup, even though data is replicated across nodes. The reason is that the data is replicated to be eventually consistent between all nodes, not that all nodes sharing a file set will have exactly the same data at the same time. Backing up all the nodes is the only way to ensure a consistent, complete backup of a keyspace.

This, however, means that the amount of NFS storage space needed for Cassandra backups may be larger than that for a similarly-scaled relational DB – because you will essentially be backing up X-number of copies of the same data (where X is the number of nodes in your Cassandra cluster that are configured to replicate to one another).

Bash

The Bash backup script, called “cc_backup.sh,” does the following:

If the interval between snapshots (full backups) is greater than 7 days, or if specifically invoked via the “–forcesnap” flag, runs the Cassandra snapshot cli command.
If the script takes a snapshot, it copies that snapshot's contents to an NFS mount point.
If the script does not take a snapshot, it copies the current contents of the backup directory to NFS.
The script creates the directory structure as needed in the NFS mount point as follows:
/<nfs_mount>/<hostname>/<date>/snapshots
/<nfs_mount>/<hostname>/<date>/backups
The script tars and compresses each keyspace snapshot or backup directory as a single file. Snapshot tar files are identified by keyspace name and snapshot id number (e.g., system-1360082572055.tar.gz). Incremental backup tar files are identified by keyspace name, parent snapshot id number, the phrase “bkup,” and the date of the backup copy to NFS (e.g., system-1360082572055_bkup-07FEB13.tar.gz). The script uses the latest snapshot listed on the node to identify the appropriate parent snapshot id number.
The tar processes are run with reduced CPU and I/O priority so as not to interfere with regular Cassandra operations.
The script prunes older snapshots and incremental backups from both NFS and the local file system.

On the local system disk, the script removes the oldest snapshots over the maximum local retention value specified. Upon creation of a new snapshot, the script removes the contents from the local incremental backup directory, as that content is now useless with the new snapshot.
On the NFS mount point, the script removes older snapshots and backups based on NFS retention values specified.
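The snapshot decision and the tar file naming described above can be sketched as follows. This is not the actual cc_backup.sh (which is a Bash script); it is a small Python illustration whose function names and parameters are hypothetical. Only the 7-day interval, the force flag, and the file name patterns come from the description above.

```python
from datetime import date, timedelta

MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")


def date_tag(day):
    """Format a date as DDMMMYY, e.g. 07FEB13 (locale-independent)."""
    return f"{day.day:02d}{MONTHS[day.month - 1]}{day.year % 100:02d}"


def snapshot_due(last_snapshot, today, force=False, max_interval_days=7):
    """True if a full snapshot should be taken instead of an incremental copy."""
    return force or (today - last_snapshot) > timedelta(days=max_interval_days)


def snapshot_tar_name(keyspace, snapshot_id):
    return f"{keyspace}-{snapshot_id}.tar.gz"


def backup_tar_name(keyspace, parent_snapshot_id, copied_on):
    return f"{keyspace}-{parent_snapshot_id}_bkup-{date_tag(copied_on)}.tar.gz"


print(snapshot_tar_name("system", 1360082572055))
print(backup_tar_name("system", 1360082572055, date(2013, 2, 7)))
```

Embedding the parent snapshot id in the incremental file name is what lets a restore pair each incremental backup with the snapshot it depends on.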

Unlike the native Cassandra process, which only keeps one set of incremental backup files, the script keeps multiple instances of incremental backup files on NFS to provide for more choices in a point-in-time restore. The script also keeps multiple copies of snapshot files at certain points in time to support recovery of older incremental backups that are dependent on the older snapshot files.

For example, Cassandra clusters may have a 3-day “live” recovery capacity on NFS (older recoveries, if available, would be dependent on NDMP tape restore to NFS first). This 3-day “live” recovery potential would require a mix of both snapshots and incremental backups, depending on the day of the week.

Backup File Retention Examples

For example, let’s assume that a given cluster with 3-day NFS retention performs snapshots on Sundays and incremental backups the rest of the week.

The Monday morning NFS backup tree for a given Cassandra host in that cluster would have the following:

Previous Sunday snapshot (necessary for recovery of Friday and Saturday incremental backups)
Friday incremental backup
Saturday incremental backup
Yesterday’s Sunday snapshot (necessary for recovery of Monday incremental backup)
Monday incremental backup

This gives us the 3 days of recovery options not including Monday morning’s backup.

On Wednesday the NFS backup tree would have:

Latest Sunday snapshot (necessary for recovery of Monday, Tuesday, or Wednesday incremental backups)
Monday incremental backup
Tuesday incremental backup
Wednesday incremental backup
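The two retention examples above follow one rule: keep everything from the last 3 days plus today, and also keep any older snapshot that a retained incremental backup depends on. Here is a minimal Python sketch of that rule. The data layout and function names are assumptions for illustration, not how our script stores things, and the dates are simply an arbitrary stretch of February 2013 that starts on a Sunday.

```python
from datetime import date, timedelta


def build_schedule(first_day, last_day):
    """Snapshot on Sundays, incremental backup on every other day."""
    backups, parent, day = [], None, first_day
    while day <= last_day:
        if day.weekday() == 6:  # Sunday
            parent = day
            backups.append({"date": day, "kind": "snapshot", "parent": None})
        elif parent is not None:
            backups.append({"date": day, "kind": "incremental", "parent": parent})
        day += timedelta(days=1)
    return backups


def retained(backups, today, retention_days=3):
    """Backups inside the retention window, plus the snapshots they depend on."""
    cutoff = today - timedelta(days=retention_days)
    recent = [b for b in backups if cutoff <= b["date"] <= today]
    parents = {b["parent"] for b in recent if b["kind"] == "incremental"}
    return [b for b in backups
            if b in recent or (b["kind"] == "snapshot" and b["date"] in parents)]


def labels(backups):
    return [f'{b["date"].isoformat()} {b["kind"]}' for b in backups]


schedule = build_schedule(date(2013, 2, 3), date(2013, 2, 13))
print(labels(retained(schedule, date(2013, 2, 11))))  # a Monday
print(labels(retained(schedule, date(2013, 2, 13))))  # a Wednesday
```

On the Monday, the previous Sunday's snapshot falls outside the 3-day window but is kept because the Friday and Saturday incremental backups need it. By Wednesday nothing depends on it any more, so it drops out.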

Extensions to the Native Cassandra Recovery Process

As our backup tool is merely a wrapper around the native Cassandra snapshot/backup process, data can be restored manually following the native Cassandra restore steps previously discussed. Local snapshots/backups can be recovered this way, and our NFS backups can be restored the same way once they are untarred and uncompressed.

To simplify restores, however, we have created the script “cc_restore.sh.”

This script allows you to specify just the keyspace and date to restore from, and it will gather the appropriate snapshot and backup files from NFS to restore. The script will also verify that any and all Cassandra processes are offline before it will proceed to restore any data. It makes it easier to execute restores en masse across an entire cluster via func or ssh.

Syntax:

/usr/local/bin/cc_restore.sh (table) (date)

Where:

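table = table name to restore, or “ALL” to restore every table. Table names are case-sensitive, and “ALL” must be in all caps.

date = date to restore from, in DDMMMYY form (two-digit day, three-letter month, two-digit year, e.g. 05JUL11); the last backup instance from that date is restored. Alternatively, specify “lastlocal” to restore the last local backup.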

Examples:

/usr/local/bin/cc_restore.sh PLINK_L1 05JUL11

/usr/local/bin/cc_restore.sh SharedVol1_F1 07JUL11

/usr/local/bin/cc_restore.sh ALL 11JUL11

/usr/local/bin/cc_restore.sh SharedVol1_F1 lastlocal
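For anyone wrapping these calls in their own tooling, the date argument can be validated before the script is invoked. The helper below is a hypothetical Python sketch and not part of cc_restore.sh; it assumes two-digit years refer to the 2000s and that month abbreviations are upper case, as in the examples above.

```python
from datetime import date

MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")


def parse_restore_date(arg):
    """Parse a DDMMMYY argument such as '05JUL11'; pass 'lastlocal' through."""
    if arg == "lastlocal":
        return arg
    if (len(arg) != 7 or arg[2:5] not in MONTHS
            or not (arg[:2] + arg[5:]).isdigit()):
        raise ValueError(f"expected DDMMMYY or 'lastlocal', got {arg!r}")
    # Assumption: two-digit years belong to the 2000s.
    return date(2000 + int(arg[5:]), MONTHS.index(arg[2:5]) + 1, int(arg[:2]))


print(parse_restore_date("05JUL11"))
```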

Future Challenges

Our backup/restore process is working, but we are constantly monitoring and tweaking it as the scale and complexity of our Cassandra environment grows.

Some of the things we are keeping an eye on:

System resource utilization and length of time to copy snapshots to NFS.

Although we feel we need the compression capability and file bundling of tar, we are aware of its aggressive use of system resources and the latency of NFS.

Careful, distributed scheduling of snapshots so no more than one site cluster goes to any one NFS appliance at the same time, coupled with highly optimized NFS client settings, has substantially reduced this impact.

Still, we are formulating possible mid-term and long-term alternatives should we encounter any performance problems.

Mid-term alternatives include using rsync in place of tar, and using non-Cassandra nodes to compress and tar-up the files on the NFS mount. This, however, will increase the NFS volume usage considerably.

Long-term alternatives include using a full-fledged Cassandra management solution such as Priam, which has its own built-in Cassandra backup/restore function within a Java VM. However, the fast compression used by this tool is said to be substantially less effective than traditional gzip, meaning NFS volume usage will likely grow substantially. Also, the implementation of a separate Java VM for backups adds a level of complexity to the process, and would bring with it its own set of support requirements that would have to be addressed.

Backup space requirements.

As previously stated, Cassandra needs a lot of space for backups. We have to prepare for much more rapid acquisitions and implementations of NFS appliances. As our clusters grow in size, we are finding that the horizontal scaling ability of Isilon filers is a good fit for Cassandra.

Cassandra Repair and Snapshot Conflicts.

In the previous Cassandra release, the Cassandra repair function would modify files that were still hard-linked into the snapshot directory. This would generate an error if it occurred while the backup script was still copying files to the NFS directory, as tar would report that the source file had changed. We have not yet observed this behavior with Cassandra 1.x and above, but we continue to watch for it.


Sunday, 18 May 2014

Cassandra: tuning the JVM for read heavy workloads

We recently completed a very successful round of Cassandra tuning here at SHIFT. This post will cover one of the most impactful adjustments we made, which was to the JVM garbage collection settings. I’ll be discussing how the JVM garbage collector works, how it was affecting our cluster performance, the adjustments we made, their effects, and the reasoning behind them, and sharing the tools and techniques we used.

The cluster we tuned is hosted on AWS and consists of 6 hi1.4xlarge EC2 instances, each with 2 1TB SSDs striped together in a RAID 0 configuration. The cluster’s dataset is growing steadily. At the time of this writing, our dataset is 341GB, up from less than 200GB a few months ago, and is growing by 2-3GB per day. The workload on this cluster is very read heavy, with quorum reads making up 99% of all operations.

How the JVM’s garbage collection works, and how it affects Cassandra’s performance

When tuning your garbage collection configuration, the main things you need to worry about are pause time and throughput. Pause time is the length of time the collector stops the application while it frees up memory. Throughput is determined by how often the garbage collector runs and pauses the application. The more often the collector runs, the lower the throughput. When tuning for an OLTP database like Cassandra, the goal is to maximize the number of requests that can be serviced, and minimize the time it takes to serve them. To do that, you need to minimize the length of the collection pauses, as well as the frequency of collection.

With the garbage collector Cassandra ships with, the JVM’s available memory is divided into 3 sections: the new generation, the old generation, and the permanent generation. I’m going to be talking mainly about the new and old generations. For your googling convenience, the new gen is collected by the Parallel New (ParNew) collector, and the old gen is collected by the Concurrent Mark Sweep (CMS) collector.

[Image: http://media.tumblr.com/91421b322038c33cc8ab478102507f62/tumblr_inline_mzvi37dk5f1rd24f4.png]

The New Generation

The new generation is divided into 2 sections: eden, which takes up the bulk of the new generation, and 2 survivor spaces. Eden is where new objects are allocated, and objects that survive collection of eden are moved into the survivor spaces. There are 2 survivor spaces, but only one is occupied with objects at a time; the other is empty.

When eden fills up with new objects, a minor GC is triggered. A minor GC stops execution, iterates over the objects in eden and in the active survivor space, copies any objects that are not (yet) garbage to the other, empty survivor space, and clears eden and the previously active survivor space. If an object has survived a certain number of minor collections (Cassandra defaults to 1), or if the survivor space has no room for it, it is promoted to the old generation instead. Once this is done, the application resumes execution.

The two most important things to keep in mind when we’re talking about ParNew collection of the new gen are:

1) It’s a stop-the-world algorithm, which means that every time it’s run, the application is paused, the collector runs, then the application resumes.

2) Finding and removing garbage is fast; moving active objects from eden to the survivor spaces, or from the survivor spaces to the old gen, is slow. If you have long ParNew pauses, it means that a lot of the objects in eden are not (yet) garbage, and they’re being copied around to the survivor space, or into the old gen.

The Old Generation

The old generation contains objects that have survived long enough to not be collected by a minor GC. When a pre-determined percentage of the old generation is full (75% by default in Cassandra), the CMS collector is run. Under most circumstances, it runs while the application is running. There are 2 stop-the-world pauses during a CMS cycle (the initial mark and remark phases), but they are typically very short, and don’t take more than 10ms (in my experience). However, if the old gen fills up before the CMS collector can finish, it’s a different story. The application is paused while a full GC is run. A full GC checks everything: new gen, old gen, and perm gen, and can result in significant (multi-second) pauses. If you’re seeing multi-second GC pauses, you’re likely seeing full collections happening, and you need to fix your GC settings.

Our performance problems

As our dataset grew, performance slowly started to degrade. Eventually, we reached a point where nodes would become unresponsive for several seconds or more. This would then cause the cluster to start thrashing load around, bringing down 3 or more nodes for several minutes.

As we looked into the data in OpsCenter, we started to notice a pattern. Reads per second would increase, then the ParNew collection time and frequency would increase, then the read latency times would shoot up to several seconds, and the cluster would become unresponsive.

So we began tailing the GC logs, and noticed there were regular pauses of over 200ms (ParNew collections), with some that were over 15 seconds (these were full GCs). We began monitoring Cassandra on one or two nodes with jstat during these periods of high latency.

jstat is a utility that ships with the JVM. It shows what is going on in your different heap sections, and what the garbage collector is doing. The command jstat -gc <pid> 250ms 0 will print the status of all generations every quarter second. Watching the eden figures, we could see that eden was filling up several times per second, triggering very frequent minor collections. Additionally, the minor collection times were regularly between 100 and 300 milliseconds, and up to 400 milliseconds in some cases. We were also seeing major collections happening every few minutes that would take 5-15 seconds. Basically, the garbage collector was so far out of tune with Cassandra’s behavior that Cassandra was spending a ton of time collecting garbage. Cutting the number of requests isn’t a real solution, and iostat made it pretty clear that the disk was not the bottleneck (read throughput was around 2MB/sec), so adding new nodes would be an expensive waste of hardware (we’d also tried adding new nodes, and it hadn’t helped).

Given this information, we came up with the following hypothesis: each read request allocates short-lived objects, both for the result being returned to the client/coordinator and for the objects that actually process the request (iterators, request/response objects, etc.). With the rate that the requests were coming in, and the frequency of new gen collections, it seemed pretty likely that a lot of the objects in eden at the start of a GC would be involved in the processing of requests, and would therefore be garbage very soon. However, given the rate of requests and ParNew collections, they weren’t yet garbage when inspected by the ParNew collector. Since 99% of the requests are reads, requests don’t have any long-term side effects, like mutating memtables, so there’s no reason why their objects need to be promoted out of eden.

If this hypothesis was true, it had 2 implications:

First, the ParNew collection is going to take a long time because it’s copying so many objects around (remember, collecting garbage is fast, copying objects between eden/survivor spaces and generations is slow). The 200ms ParNew collection times indicated this was happening.

Second, all of these transient request-related objects are getting pushed into the old gen, which is quickly getting filled up with objects that will soon be garbage. If these transient objects are moved into the old gen faster than the CMS collector can keep up, a major GC will be triggered, stopping Cassandra for several seconds.

If this was the case, it seemed likely that increasing the size of eden would solve our problems. If eden reaches capacity less often, more of its contents will already be garbage by the time a collection runs. This will make the ParNew collection faster, and reduce the rate that transient objects are pushed into the old gen. More importantly, objects would be promoted at a rate that the CMS collector can handle, eliminating major, multi-second, stop-the-world collections.
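The reasoning above can be checked with a toy model. The Python sketch below is not the HotSpot collector and uses no real measurements: it assumes one object is allocated per tick, every object lives for exactly the same number of ticks, and an object is promoted once it has already survived one minor collection (matching the tenuring threshold of 1 mentioned earlier). All names and numbers are invented for illustration.

```python
def simulate(eden_capacity, lifetime, ticks, tenuring_threshold=1):
    """Toy generational GC: count minor GCs, survivor copies, and promotions."""
    eden = []      # allocation tick of each object currently in eden
    survivor = []  # (allocation tick, age) of each object in the survivor space
    stats = {"minor_gcs": 0, "copied": 0, "promoted": 0}

    for now in range(1, ticks + 1):
        eden.append(now)  # one allocation per tick
        if len(eden) < eden_capacity:
            continue

        # Eden is full: run a minor GC.
        stats["minor_gcs"] += 1
        next_survivor = []
        for born, age in survivor:
            if born + lifetime <= now:
                continue  # already garbage, costs nothing to drop
            if age >= tenuring_threshold:
                stats["promoted"] += 1  # moved to the old gen
            else:
                next_survivor.append((born, age + 1))
                stats["copied"] += 1
        for born in eden:
            if born + lifetime > now:  # still live, must be copied
                next_survivor.append((born, 1))
                stats["copied"] += 1
        eden = []
        survivor = next_survivor
    return stats


for capacity in (20, 100, 300):
    print(capacity, simulate(eden_capacity=capacity, lifetime=50, ticks=1200))
```

In this model, an eden smaller than the object lifetime copies every object and then promotes almost all of them, while an eden larger than the lifetime promotes nothing and runs proportionally fewer collections. That is the same direction of effect we were hoping for from a bigger new gen.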

I didn’t take any screen shots of jstat when the garbage collector was misbehaving, but this is an approximation of what we were seeing.

[Image: http://media.tumblr.com/d7c958e37315b600b347c479be7253ba/tumblr_inline_mzvi41suBT1rd24f4.png]

In this image, we can see that there are a lot of new gen collections (see the YGCT column). And we can see the survivor section usage switching back and forth very often, indicating a lot of young gen collections. Additionally, the old gen is continuously increasing as objects are prematurely promoted.

New GC settings

The initial heap settings were a total heap size of 8GB, and a new gen size of 800MB. Initially, we tried doubling the new gen size to 1600MB, and the results were promising. We were not having any more runaway latency spikes, but we were still seeing read latencies as high as 50ms under heavy load, which, while not catastrophic, made our application noticeably sluggish. The new gen collection times were still higher than 50ms.

After a few days of experimenting with various GC settings, the final settings we converged on were 10GB total for the heap, and 2400MB for the new gen. We had increased the total heap by 25%, and tripled the size of the new gen. The results have been excellent. With these settings, I haven’t seen the read latencies go above 10ms, and I’ve seen the cluster handle 40 thousand plus reads per second with latencies around 7ms. New gen collection times are now around 15ms, and they happen slightly less than once per second. This means that Cassandra went from spending around 20% or more of its time collecting garbage, to a little over 1%.
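The percentages above are just pause length multiplied by collection frequency. A quick Python sketch of the arithmetic: the "after" inputs are the 15ms and slightly-under-once-per-second figures from this post (taken here as 0.9 per second), while the "before" frequency of one 200ms collection per second is an assumption chosen only to illustrate the roughly 20% figure.

```python
def gc_overhead(pause_ms, collections_per_second):
    """Fraction of wall-clock time spent in stop-the-world pauses."""
    return pause_ms * collections_per_second / 1000.0


before = gc_overhead(pause_ms=200, collections_per_second=1.0)  # assumed frequency
after = gc_overhead(pause_ms=15, collections_per_second=0.9)    # ~once per second
print(f"before: {before:.1%}, after: {after:.2%}")
```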

This is a look at the garbage collection activity on one of our tuned up nodes today.

[Image: http://media.tumblr.com/efbdae5e6e802c9a5e89b3b532f843c1/tumblr_inline_mzvi4fEWqj1rd24f4.png]

You can see the eden consumption creep up over 2 seconds (see the EU column), then a minor GC is performed. Additionally, the old gen size is pretty stable.

[Image: http://media.tumblr.com/5d0efca7288dc969c1ac4fc3d36e0151/tumblr_inline_mzvj254quj1rd24f4.png]

Tools we used to diagnose the problems.

1) OpsCenter: DataStax’s OpsCenter tool was very helpful and provided a high-level view of our cluster’s health and performance.

2) GC logging: The garbage collection logs are not enabled by default, but they give a lot of insight into what the garbage collector is doing, and how often it’s doing it. To enable the GC logs, uncomment the GC logging options in cassandra-env.sh.

3) iostat: reports disk usage. Running iostat -dmx 1 will print out your disk usage stats every second. You can use this to quickly determine if disk is your bottleneck.

4) jstat: as mentioned earlier, jstat provides a real-time look at what GC is doing, and is very helpful. With jstat, you can watch the usage of eden, the survivor spaces, and the old gen, see GC counts and times, and watch as the JVM shifts things around the different sections. Using the command jstat -gc <pid> 250ms 0 will print the status of all generations every quarter second.
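If you capture jstat -gc output to a file, a few lines of Python can turn it into the numbers discussed in this post. The sketch below assumes the column layout printed by the Java 6/7 version of jstat -gc, and the two sample rows are made-up values for illustration, not output from our cluster. The function names are hypothetical.

```python
# Made-up sample in the jstat -gc column layout (sizes in KB, times in seconds).
SAMPLE = """
S0C      S1C      S0U    S1U    EC        EU       OC        OU        PC      PU      YGC YGCT  FGC FGCT  GCT
262144.0 262144.0 0.0    1024.0 1933312.0 100000.0 8028160.0 3000000.0 65536.0 40000.0 100 1.500 2   0.100 1.600
262144.0 262144.0 2048.0 0.0    1933312.0 500000.0 8028160.0 3000100.0 65536.0 40000.0 104 1.560 2   0.100 1.660
"""


def parse_jstat_gc(text):
    """Turn jstat -gc output into a list of {column: value} dicts."""
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, map(float, row))) for row in rows]


def young_gc_summary(samples):
    """Return (number of young collections, average pause in ms) over the samples."""
    first, last = samples[0], samples[-1]
    collections = int(last["YGC"] - first["YGC"])
    pause_seconds = last["YGCT"] - first["YGCT"]
    average_ms = 1000 * pause_seconds / collections if collections else 0.0
    return collections, average_ms


count, average_ms = young_gc_summary(parse_jstat_gc(SAMPLE))
print(f"{count} young collections, {average_ms:.1f} ms average pause")
```

YGC is the cumulative count of young gen collections and YGCT the cumulative time spent in them, so the difference between two samples gives the count and total pause for that interval.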

For experimentation, we used a single node in our production cluster as our test bed. We would make incremental changes to the node’s settings and watch how it performed relative to the other nodes.

Source: http://tech.shift.com/post/74311817513/cassandra-tuning-the-jvm-for-read-heavy-workloads
