I recently wanted to start recording various business level metrics for our Rails app. We already use New Relic for performance monitoring and while you can inject custom metrics you can only surface them if you have the Pro version (unless they happen to surface naturally as one of the top n slowest/highest throughput/highest total time metrics). I’d been hearing a lot of good things about statsd and graphite so decided to try my hand at setting those up. Graphite is written in python, so this meant venturing out of my cosy little ruby world.
Installation
These instructions worked pretty much identically on OS X Mountain Lion and on our amazon linux machines (amazon linux being a CentOS derivative). Just substitute brew for yum.
Graphite
Graphite is a collection of python components that come together to form a graphing application: there’s a django based web app + api, a data storage library and a data collection mechanism. While I’m only feeding graphite data from statsd, there are loads of other tools that can send data to it (for example collectd collects system level metrics).
One of the ways of getting graphite is as a set of pip packages. On OS X the python installed by brew installs pip too, but on amazon linux I had to install pip itself first:
sudo easy_install pip
Graphite needs the cairo graphics library and its python bindings. With brew this is the py2cairo package but amazon’s package list just calls it pycairo. Then I installed graphite itself:
sudo pip install twisted
sudo pip install django
sudo pip install django-tagging
sudo pip install whisper
sudo pip install carbon
sudo pip install graphite-web
This should install a bunch of stuff in /opt/graphite. The python installed by brew puts binaries in /usr/local/share/python/ but graphite’s bin scripts expect to be in /opt/graphite/bin, so I symlinked them into /opt/graphite/bin - if you try to run them straight from /usr/local/share/python/ they can’t find their graphite libs. On my development machine I chown’ed everything to me; on the production instance I created a graphite user which owns all this (and I run all the assorted daemons as that user too).
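The symlinking is just something along these lines (the exact list of scripts will depend on what pip installed for you; this assumes brew’s /usr/local/share/python/ location):
mkdir -p /opt/graphite/bin
# link the carbon scripts so they sit where graphite expects to find them
for script in carbon-cache.py carbon-aggregator.py validate-storage-schemas.py; do
  ln -s /usr/local/share/python/$script /opt/graphite/bin/$script
done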
There’s a bunch of example configuration files in /opt/graphite/conf/. The only ones you really need at this point are carbon.conf, dashboard.conf, graphTemplates.conf, storage-aggregation.conf and storage-schemas.conf. I added the following to storage-schemas.conf, above the [default_1min_for_1day] section:
[stats]
pattern = ^stats\..*
retentions = 10:2160,60:10080,600:262974
[stats_count]
pattern = ^stats_counts\..*
retentions = 10:2160,60:10080,600:262974
This turns out to be really important, but more on that later. Those retentions keep 10 second data for 6 hours, 1 minute data for a week and 10 minute data for roughly 5 years. Changes to this file are picked up within 60 seconds, but as I understand it this only affects new metrics - existing metrics use whatever settings were in effect when they were created. If you want to change these settings for an existing metric you need to use the supplied scripts for updating whisper files (whisper-resize.py and friends).
I also changed storage-aggregation.conf and added
[stats_counts]
pattern = ^stats_counts\.
xFilesFactor = 0
aggregationMethod = sum
above the default entry - I’ll explain why later. I left the other files with their default contents.
At this point you should be able to start up the carbon daemon by running
/opt/graphite/bin/carbon-cache.py start
It logs to /opt/graphite/log/carbon-cache if things go wrong.
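A quick sanity check is to push a metric at carbon by hand over its plaintext protocol (port 2003 by default) - the metric name here is made up:
echo "test.carbon.alive 1 $(date +%s)" | nc localhost 2003  # some netcat builds need -q0 or -w1 to close the connection
If it worked, a .wsp file for the metric should appear under graphite’s whisper storage directory shortly afterwards.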
Webapp
I must admit to complete ignorance when it comes to django, so some of this is probably super obvious. First I edited /opt/graphite/webapp/graphite/local_settings.py in order to set my timezone and to enable the default sqlite database configuration (by uncommenting the DATABASES section). Then I ran
python manage.py syncdb # cd to /opt/graphite/webapp/graphite/ first
which sets up the database for django.
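For reference, the bits of local_settings.py in question look roughly like this once uncommented (the timezone is a placeholder - use your own):
TIME_ZONE = 'Europe/London'

DATABASES = {
    'default': {
        'NAME': '/opt/graphite/storage/graphite.db',
        'ENGINE': 'django.db.backends.sqlite3',
        'USER': '',
        'PASSWORD': '',
        'HOST': '',
        'PORT': ''
    }
}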
Then finally start up the web app with
python /usr/local/share/python/run-graphite-devel-server.py /opt/graphite
If you point your browser at localhost:8080 you should see the graphite web dashboard with a graph that says ‘No Data’. If the graph area contains a broken image icon then something is wrong, probably with cairo or py2cairo. The tree control lets you explore the set of metrics that are graphable, but at the moment the only ones there will be the ones carbon generates about itself.
A little laborious because I’m not too familiar with the python ecosystem, but no real issues. The one problem I did have was that initially my PATH wasn’t set up correctly, so some commands ended up using the OS X provided python and others the brew installed python. This led to unhappy Fatal Python error: PyThreadState_Get: no current thread errors when I ran the web app.
Statsd
Statsd is a nodejs app that you send your metrics to. It aggregates them and sends them on to graphite in 10 second chunks. Installing this is easy:
brew install nodejs #or whatever your platform install mechanism is
git clone https://github.com/etsy/statsd.git
Amazon’s distro doesn’t have nodejs yet so I installed from source there. You can edit dConfig.js if you need to set any statsd settings (I just changed the address of the graphite server) and then start it with
node stats.js dConfig.js
In production I use forever to run it in the background.
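If you go the forever route it’s something along these lines (assuming forever is installed globally via npm):
npm install -g forever
forever start stats.js dConfig.js  # 'forever list' shows what's running, 'forever logs' shows output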
Creating and manipulating metrics
I’m using the statsd-ruby gem to talk to statsd. Once installed it’s easy to use:
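Something along these lines is all it takes (the host and port are whatever your statsd instance uses; 8125 is statsd’s default):
require 'statsd'  # provided by the statsd-ruby gem

statsd = Statsd.new('localhost', 8125)
statsd.increment 'users.signup'            # bump a counter
statsd.timing 'users.signup.duration', 42  # record a timing, in milliseconds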
This should result in a metric appearing. Metrics are sent over UDP so should be fairly lightweight - in my simple tests incrementing a counter took around 0.2ms on average - and they will fail silently if statsd isn’t running. Statsd-ruby should soon support batching of metrics to reduce overhead when many metrics are sent in quick succession.
Different kinds of metrics
When you increment a counter metric, statsd actually pushes two metrics into graphite. In the previous example stats.users.signup and stats_counts.users.signup would be inserted. stats.users.signup is the rate at which signups occurred (signups per second) as measured over the sample period, whereas stats_counts.users.signup is the raw value the counter changed by over that same period (i.e. stats_counts metrics are dimensionless but stats metrics are 1/T, assuming the counter itself is dimensionless).
Both can be useful depending on what you’re doing: the per second counters are always easy to compare (with the raw counters you need to know what the sample period is) but the raw counters are good when you want the total number of times something has happened. Graphite has an enormous number of functions that you can use to transform your data in all sorts of ways: adding series together, integrating them, smoothing them, averaging them, tracking moving averages, changing the name used in the legend and so on.
Working out which functions to use and whether to use stats or stats_counts can be a bit subtle, especially when you take into account how data is aggregated. For example, if you’re viewing an entire day’s worth of 1 minute data (1440 datapoints) on a graph that’s 720 pixels wide, graphite needs to combine pairs of datapoints. By default it will average adjacent values, which is correct if the metric is stats.message_received (average number of messages received per second), but not if it’s stats_counts.message_received (the number of messages received in each sample period). In the latter case you’d want to graph cumulative(stats_counts.message_received) instead, so that it adds the two datapoints that it’s merging. Personally I’ve found stats_counts to be particularly subtle to deal with.
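Since every graph is ultimately just a render api call, the distinction shows up directly in the URLs you build (the hostname here is made up):
# per second rate - the default averaging consolidation is what you want
http://graphite.example.com/render?target=stats.message_received&from=-24hours&width=720
# raw counts - wrap in cumulative() so merged datapoints are summed rather than averaged
http://graphite.example.com/render?target=cumulative(stats_counts.message_received)&from=-24hours&width=720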
Aggregation
Statsd aggregates your data in 10 second chunks before forwarding it on to graphite. For counters this aggregation is just summing; for timers statsd calculates averages, percentiles etc. Graphite in turn aggregates data to save space: 10 second data is eventually downsampled to per minute, per hour or per day data depending on the configuration in storage-schemas.conf. The correct way of doing this aggregation depends on the sort of data: a per second counter should be averaged, a metric tracking the maximum response time should take the max of all the points being aggregated, and a raw counter should sum the values.
How
The storage-aggregation.conf file defines how data is aggregated. The example rules use the last component of the metric name: if it ends in .count then sum it, if it ends in .avg then average it and so on. When submitting timing metrics statsd adds a suffix for you, but when submitting counters you’re on your own. If you’re just interested in the per second counters then the default (averaging) is fine. For the stats_counts metrics you want to sum values, hence the rule I mentioned earlier.
As an aside, a comment in storage-aggregation.conf claims that the file is scanned every 60 seconds and changes are picked up automatically (and applied when creating a new metric). I had initially created metrics with the wrong aggregation and, after deleting the .wsp files, they were still being recreated with the old settings. Restarting carbon-cache fixed it, although I don’t know if my case was different because I was recreating existing metrics. You can use the whisper-info.py script to get information about a .wsp file, including what its aggregation settings are (the metric files live under /opt/graphite/storage/whisper).
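For example (the path is illustrative - it just mirrors the metric name):
/opt/graphite/bin/whisper-info.py /opt/graphite/storage/whisper/stats_counts/users/signup.wsp
# prints aggregationMethod, xFilesFactor, maxRetention and the archive layout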
When
The rules in storage-schemas.conf define how much data is kept at each time resolution. You might keep an hour’s worth of 10s data, a day’s worth of per minute data and a year’s worth of per 5 minute data. Obviously the more data you keep, the more disk space is used. You can keep a lot of historical data if you tune the resolution down a little - in the example given earlier 5 years’ worth of 10 minute data is kept, at a cost of around 3 megabytes per metric.
When using statsd with its default settings it’s crucial that you keep at least some 10 second data (the default graphite config keeps 1min resolution data only). If you don’t then the mismatch between statsd’s flush rate and graphite’s flush rate will cause you to lose data. Assume that graphite is only storing data with 1min granularity:
- At 00:00:01 you increment a statsd counter (value = 1)
- At 00:00:05 you increment a statsd counter (value = 2)
- At 00:00:10 statsd sends this counter value (2) to graphite and resets the counter
- At 00:00:20 statsd sends this counter value (0) to graphite
If you’re only storing 60s data then both of those flushes fall into the same graphite aggregation period and the second one overwrites the first: at first graphs would show a counter of 2 and then that counter would change to 0. If you do change these settings you’ll need to resize any existing .wsp files you want the changes to apply to (see whisper-resize.py and friends).
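Resizing is along these lines - the retentions use the same resolution:points format as storage-schemas.conf, and the path is again illustrative:
/opt/graphite/bin/whisper-resize.py /opt/graphite/storage/whisper/stats_counts/users/signup.wsp 10:2160 60:10080 600:262974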
Combining metrics
Aggregation also refers to the practice of combining multiple metrics into a new one. To do this you add rules to aggregation-rules.conf, run the carbon-aggregator.py daemon and reconfigure statsd to send its data there instead (it runs on port 2023 by default).
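Pointing statsd at the aggregator is just a change to the graphite port in its config file - these are the standard etsy/statsd config keys, and the hostname is made up:
{
  graphiteHost: "graphite.example.com",
  graphitePort: 2023,  // carbon-aggregator, rather than carbon-cache on 2003
  port: 8125           // the UDP port statsd itself listens on
}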
As an example, you might want to track successful and failed login attempts separately but also track total login attempts. If you emit users.login.successful and users.login.failed then the rule
stats.users.login.all (10) = sum stats.users.login.*
would sum your two login metrics into an all logins metric. You can also define multiple related aggregations at a time in this way. For example, if you had a multitenanted app you might want to track login rate on a per tenant basis. Instead of incrementing users.login.successful you could increment users.client1.login.successful for your first tenant, users.client2.login.successful for your second tenant and so on. You could generate a per tenant total logins per second metric with this rule
stats.users.<tenant>.login.all (10) = sum stats.users.<tenant>.login.*
which would create stats.users.client1.login.all, stats.users.client2.login.all and so on. If you wanted logins per second across all tenants you could use
stats.users.all.login.all (10) = sum stats.users.*.login.*
I’ve found that I sometimes need to restart carbon-aggregator to get it to pick up changes to aggregations rules, despite comments in the example config files claiming the opposite.
Deployment
I’ve never deployed a python app, so I followed the path of least resistance, leaning heavily on the example configurations that come with graphite. This ended up meaning apache and mod_wsgi. I renamed /opt/graphite/conf/graphite.wsgi.example to graphite.wsgi, and the apache configuration is pretty much the example file from /opt/graphite/examples. This is pretty much boilerplate, but you might find it useful.
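Roughly, the vhost ends up looking like the sketch below. It’s cut down from the example vhost config that ships with graphite, so treat the server name, paths and process settings as illustrative; the WSGIDaemonProcess line is where the user=graphite change goes.
<VirtualHost *:80>
    ServerName graphite.example.com
    DocumentRoot "/opt/graphite/webapp"

    # run the django app under mod_wsgi as the unprivileged graphite user
    WSGIDaemonProcess graphite user=graphite processes=2 threads=2
    WSGIProcessGroup graphite
    WSGIScriptAlias / /opt/graphite/conf/graphite.wsgi

    # static content is served directly by apache
    Alias /content/ /opt/graphite/webapp/content/
    <Location "/content/">
        SetHandler None
    </Location>

    <Directory /opt/graphite/conf/>
        Order deny,allow
        Allow from all
    </Directory>
</VirtualHost>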
About the only change I made was to add user=graphite so that the app runs as the unprivileged graphite user.
Graphite doesn’t provide any sort of access control out of the box. We host all of our stuff on EC2, so leaving the web interface wide open wasn’t an option. If it was just me I would probably have settled for using ssh tunnels, but the point was to collect business metrics and ssh tunnels don’t really fly for the non technical people interested in these. It would also be really nice if people didn’t have to have a second set of credentials just to access the graphs.
In the end I wrote a small proxy app. Users make requests to the proxy app that then checks with our main app to see if they are authorized. If they are then the main app hands our proxy app a signed token (valid for a finite amount of time) that is used to authorize subsequent requests. The proxy app then forwards the request on to graphite and returns the response to the user. It’s not particularly clever (and almost certainly not an http compliant proxy) but it seems to work well enough. Graphite itself is not visible from the outside at all. It’s likely that there are better ways of achieving this within the python ecosystem but that’s not a world I’m familiar with.
Onwards
I feel like I’m still only scratching the surface of how best to use graphite both in terms of what to measure and how best to present it. One of the neat things about graphite is that the default web app is really just a way of creating urls for the rendering api. So far I’ve found it to be a reasonable way of nosing through all the options graphite has to offer but you’re in no way tied to the default ui - you can use any method you want to generate the image url for a graph.
One such alternative frontend is Graphiti. It also adds features such as being able to save snapshots of graphs, email them to people etc.
Cubism goes a step further: instead of letting graphite render the images it retrieves the raw data (graphite gives you a choice of its custom format, csv or json) and uses this to do its own rendering clientside. This also gets you nice features such as being able to hover the mouse over a datapoint and see its value. Cubism loads its data over ajax, so you may need to set CORS headers on your graphite server for it to work.
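With apache in front of graphite that can be as simple as something like the following in the vhost (it needs mod_headers; the origin is whatever host serves your cubism dashboard, not graphite itself):
Header set Access-Control-Allow-Origin "http://dashboards.example.com"
Header set Access-Control-Allow-Methods "GET, OPTIONS"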
Both Cubism & Graphiti are easy to set up (there’s an example cubism frontend written using twitter bootstrap) and are well worth having a play with.