Monitoring with statsd and CloudWatch

For organizations on AWS looking for a monitoring solution, CloudWatch is an attractive choice. EC2 instances and other services come with built-in CloudWatch monitoring, and alerts can be routed to email or text messages via SNS. I recently had the opportunity to set up a new monitoring system for a client that used CloudWatch as its backend. It was an interesting challenge, since the project called for monitoring both application and system metrics.

My goal was to route all monitoring information through the same medium and store it in the same backend. While it is possible to handle application monitoring via statsd and system monitoring through something like collectd, I felt it would be cleaner to send all data over statsd and store it all in CloudWatch.

statsd

Developers really like working with statsd. It provides an easy, well-supported way to write metrics out to a pluggable backend. For example, the Python statsd package (installable with pip) allows you to log metrics like this:

import statsd
c = statsd.StatsClient('localhost', 8125)
c.incr('counter') # increment a counter
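
Under the hood there is not much to it: the client sends a small plain-text UDP datagram of the form name:value|type to the daemon. As a rough sketch using only the standard library (the helper names format_counter and send_counter are made up for illustration, and this is not how the statsd package is implemented internally):

```python
import socket


def format_counter(name, delta=1):
    """Build the statsd wire format for a counter: b'<name>:<delta>|c'."""
    return f"{name}:{delta}|c".encode("ascii")


def send_counter(name, delta=1, host="localhost", port=8125):
    """Fire-and-forget a counter increment to a statsd daemon over UDP."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(format_counter(name, delta), (host, port))
```

Because this is UDP, send_counter('counter') does not fail if no daemon is listening; the application keeps running whether or not monitoring is up.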

The statsd daemon can run on each application host, collecting data and forwarding it to your backend. Typically this means running your own Graphite host, which means managing another server. But my client didn't need the added overhead of running their own backend. This is where CloudWatch becomes a tantalizing option. Could we give our devs the statsd API and a CloudWatch backend?

aws-cloudwatch-statsd-backend

Enter the AWS CloudWatch statsd backend module. This statsd backend provides the piece we need to write custom metrics to CloudWatch. To install it, use npm. Assuming statsd lives in /usr/share/statsd:

cd /usr/share/statsd
npm install camitz/aws-cloudwatch-statsd-backend

The configuration in /etc/statsd looks like this:

{
  backends: [ "aws-cloudwatch-statsd-backend" ],
  cloudwatch: {
    accessKeyId: "aws_access_key_id",
    secretAccessKey: "aws_secret_key",
    region: "us-west-1",
    namespace: "my_namespace"
  }
}

That's it! Now our application metrics are using CloudWatch as a backend.
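
Conceptually, each time statsd flushes, the backend has to turn the aggregated values into CloudWatch metric data. The following is only a sketch of that idea in Python, not the module's actual (node.js) code: counters_to_metric_data and publish are hypothetical helpers, and publish assumes the boto3 package is installed and AWS credentials are configured.

```python
from datetime import datetime, timezone


def counters_to_metric_data(counters, timestamp=None):
    """Turn {'metric.name': count} into a list of CloudWatch MetricDatum dicts."""
    timestamp = timestamp or datetime.now(timezone.utc)
    return [
        {
            "MetricName": name,
            "Timestamp": timestamp,
            "Value": float(value),
            "Unit": "Count",
        }
        for name, value in sorted(counters.items())
    ]


def publish(namespace, metric_data, region="us-west-1"):
    """Send the data to CloudWatch (assumes boto3 and valid credentials)."""
    import boto3  # imported lazily so the helper above works without it

    client = boto3.client("cloudwatch", region_name=region)
    client.put_metric_data(Namespace=namespace, MetricData=metric_data)
```

With the configuration above, a flush containing {"counter": 5} would correspond to something like publish("my_namespace", counters_to_metric_data({"counter": 5})).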

diamond

The other part of the solution is feeding system metrics into CloudWatch via statsd. While Amazon provides some really useful metrics that can detect a server going down, or show how much network bandwidth a server is using, it doesn't provide any built-in way to look at how much memory a server is using, and its view of CPU is limited to the overall utilization reported by the hypervisor.

Diamond is an extensible monitoring system written in Python. I think that one of the most important features of a monitoring system is how many metrics it can gather for you, and Diamond can collect quite a bit. It has collectors for common system metrics like memory, CPU, disk and network, as well as collectors for common protocols (e.g. HTTP) and daemons such as MySQL and Nginx.

Diamond can also forward its metrics to statsd. My /etc/diamond/diamond.conf file contains the line:

handlers = diamond.handler.stats_d.StatsdHandler

and that's about all that is required. I can then configure metric collectors in Diamond, have them forward to statsd, and statsd in turn forwards all the information on to CloudWatch.
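
To make the pipeline concrete, here is a rough sketch of what a memory collector plus a statsd handler amount to: parse /proc/meminfo and emit each value as a statsd gauge (name:value|g). This is an illustration only, not Diamond's code; the helper names and the servers.myhost.memory prefix are assumptions.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: number}.

    Values are kept as reported by the kernel (most fields are in kB).
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def to_gauges(values, prefix="servers.myhost.memory"):
    """Format each value as a statsd gauge payload, sorted by field name."""
    return [
        f"{prefix}.{name}:{value}|g".encode("ascii")
        for name, value in sorted(values.items())
    ]
```

On a Linux host, to_gauges(parse_meminfo(open("/proc/meminfo").read())) yields payloads that could be sent to the local statsd daemon over UDP, one datagram per gauge.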

Hiccup: API request signing

One issue I did run into is that aws-cloudwatch-statsd-backend is a little old. We're running some servers in the AWS eu-central-1 region, which only supports v4 signing for API requests, and the version of the aws-sdk package that the backend depended on predated v4 support.

The solution was easy enough: I forked the GitHub repo and updated the aws-sdk version. After installing the fork via npm ("npm install christopherdeutsch/aws-cloudwatch-statsd-backend"), everything just worked. Which is lucky, because I don't really do node.js :)
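
For context, Signature Version 4 signs each request with a key derived from the secret key, the date, the region, and the service, so an SDK that only implements an older scheme has no way to produce a signature a v4-only region will accept. A minimal Python sketch of just the key-derivation step (not the full signing process, and not code from aws-sdk; the function names are illustrative):

```python
import hashlib
import hmac


def _hmac_sha256(key, msg):
    """Return the raw HMAC-SHA256 digest of msg under key."""
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def sigv4_signing_key(secret_key, date_stamp, region, service):
    """Derive the Signature Version 4 signing key.

    date_stamp is YYYYMMDD; the resulting key is scoped to that day,
    region and service.
    """
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    return _hmac_sha256(k_service, "aws4_request")
```

For example, sigv4_signing_key("aws_secret_key", "20150101", "eu-central-1", "monitoring") returns a 32-byte key; "monitoring" here is my assumption for the service name CloudWatch uses when signing.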

The finishing touches

To tie it all together, I automated the infrastructure setup by (a) installing and configuring statsd, Diamond, and aws-cloudwatch-statsd-backend via Chef; and (b) automating the creation of CloudWatch alarms via Terraform. This allows me to bring up arbitrary servers and have all the monitoring, metrics, and alarms created automatically.

