Top 4 Things To Know About Cassandra in EC2

If you ask people in the know about best practices for running Cassandra in the cloud (and specifically in Amazon's EC2), they'll usually just tell you not to: Cassandra is designed to run on bare-metal commodity hardware. Luckily, at the excellent Cassandra SF 2011 conference, a few key points were repeated by presenters who are actually doing it:

1. Keep Cassandra's datastore on ephemeral drives, not EBS volumes

It seems counter-intuitive, but analysis of production workloads has produced a consensus that ephemeral drives have better and more consistent performance than EBS.

Although Amazon is tight-lipped about its actual EC2 infrastructure, a couple of speculative explanations have been offered. First, because EBS traffic must share the same network interface as all the other traffic on the instance, EBS performance is thought to degrade as network usage rises. Second, EBS volumes are assumed to live on shared NAS-type systems without any guaranteed I/O throughput, so when someone else on the same NAS starts abusing the disk, you end up seeing performance degradation. (See for example these bonnie++ metrics and this blog post by one of the founders of Heroku.)

2. Don't use smaller instance types

Both Netflix and Urban Airship reported that they were running their Cassandra clusters in EC2 on XL instances or larger. This makes sense: for a database you want to keep your working set in RAM if at all possible. But more than that, there is a general feeling among sysadmins who use EC2 extensively that the smaller instance sizes (in particular smalls, mediums and of course micros) seem to lose out on resources.

Some of the larger instances also have multiple ephemeral drives available for use. Combining them into a RAID 0 array has reportedly led to the best performance. (And don't worry about it being RAID 0: Cassandra handles machine failures at the application level, so we're less concerned about making sure that storage is locally redundant.)
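To make the RAID 0 suggestion concrete, here is a minimal sketch that only builds the commands you might run to stripe the ephemeral drives together. The device names (/dev/xvdb, /dev/xvdc), the array name, the ext4 filesystem and the /var/lib/cassandra mount point are all assumptions for illustration, not details reported by the presenters; check what your instance type actually exposes before running anything.

```python
import shlex


def raid0_commands(devices, md_device="/dev/md0", mount_point="/var/lib/cassandra"):
    """Return the commands (as argv lists) to stripe `devices` into one RAID 0 array.

    Nothing is executed here; the caller decides how and when to run them.
    Device names, filesystem and mount point are illustrative assumptions.
    """
    if len(devices) < 2:
        raise ValueError("RAID 0 needs at least two devices")
    return [
        # Create the striped array across all ephemeral drives.
        ["mdadm", "--create", md_device, "--level=0",
         f"--raid-devices={len(devices)}", *devices],
        # Put a filesystem on the array (ext4 is an assumption).
        ["mkfs.ext4", md_device],
        # Mount it where Cassandra will keep its data.
        ["mkdir", "-p", mount_point],
        ["mount", md_device, mount_point],
    ]


if __name__ == "__main__":
    for cmd in raid0_commands(["/dev/xvdb", "/dev/xvdc"]):
        print(shlex.join(cmd))
```

Returning the commands instead of running them keeps the sketch safe to experiment with, and makes it easy to drop into whatever boot-time automation you already use.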

3. Perform backups into S3 or using EBS snapshots

If you have good tools for backing up into S3 (such as those provided by my employer, RightScale), it is a relatively cheap and easy place to store backups. Another option is to periodically copy Cassandra's data from an ephemeral drive to an EBS volume and then snapshot the EBS volume. Frankly, there is a lot of room for improvement here, and it would be nice to see an open source tool for Cassandra cloud backups.
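As a rough illustration of the S3 option, the sketch below plans a backup as three commands: take a tagged Cassandra snapshot, sync only that snapshot's files to S3, then clear the snapshot. It assumes the `nodetool` and AWS command-line tools are installed; the bucket name, node name, data directory and tag format are hypothetical, and this is not the tooling any of the presenters described.

```python
from datetime import datetime, timezone


def snapshot_backup_commands(bucket, node_name,
                             data_dir="/var/lib/cassandra/data", now=None):
    """Return argv lists for a snapshot-then-upload backup of one node.

    Nothing is executed here. `bucket`, `node_name` and `data_dir` are
    illustrative assumptions; adjust them to your own layout.
    """
    now = now or datetime.now(timezone.utc)
    # A timestamped tag keeps each backup in its own S3 prefix.
    tag = now.strftime("backup-%Y%m%dT%H%M%SZ")
    dest = f"s3://{bucket}/{node_name}/{tag}/"
    return [
        # Snapshot the SSTables under the given tag.
        ["nodetool", "snapshot", "-t", tag],
        # Upload only files that live under this snapshot's directories.
        ["aws", "s3", "sync", data_dir, dest,
         "--exclude", "*", "--include", f"*/snapshots/{tag}/*"],
        # Remove the local snapshot once it has been uploaded.
        ["nodetool", "clearsnapshot", "-t", tag],
    ]
```

Keeping the node name and timestamp in the S3 prefix means a restore can pick one node's backup from one point in time without guessing which files belong together.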

4. Automate

Key to being able to leverage Cassandra in the cloud is the ability to automate deployment. Bringing up new nodes should of course be easy, but so should bringing up a whole new ring.
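One small piece of ring automation that is easy to script is token assignment. With the RandomPartitioner (the default in Cassandra releases of this era), tokens fall in the range 0 to 2^127, and a balanced ring gives each node an evenly spaced initial_token. The sketch below computes those values; the function name is hypothetical, and newer Cassandra versions with virtual nodes or a different partitioner handle this differently.

```python
def initial_tokens(node_count):
    """Return evenly spaced initial_token values for a RandomPartitioner ring.

    Assumes the RandomPartitioner token space of 2**127; this does not
    apply to other partitioners or to clusters using virtual nodes.
    """
    if node_count < 1:
        raise ValueError("a ring needs at least one node")
    ring_size = 2 ** 127
    # Node i owns the slice starting at i/node_count of the way around the ring.
    return [i * ring_size // node_count for i in range(node_count)]


if __name__ == "__main__":
    for index, token in enumerate(initial_tokens(4)):
        print(f"node {index}: initial_token: {token}")
```

A deployment script could write each value into the corresponding node's configuration before the node first starts, so a whole new ring comes up balanced.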

