I recently wanted to start recording various business level metrics for our Rails app. We already use New Relic for performance monitoring and while you can inject custom metrics you can only surface them if you have the Pro version (unless they happen to surface naturally as one of the top n slowest/highest throughput/highest total time metrics). I’d been hearing a lot of good things about statsd and graphite so decided to try my hand at setting those up. Graphite is written in python, so this meant venturing out of my cosy little ruby world.
Installation
These instructions worked pretty much identically on OS X Mountain Lion and on our amazon linux machines (amazon linux being a CentOS derivative). Just substitute brew for yum.
Graphite
Graphite is a collection of python components that come together to form a graphing application: there’s a django based web app + api, a data storage library and a data collection mechanism. While I’m only feeding graphite data from statsd, there are loads of other tools that can send data to it (for example collectd collects system level metrics).
One of the ways of getting graphite is as a set of pip packages. On OS X the python installed by brew installs pip too, but on amazon linux I had to install pip itself first:
sudo easy_install pip
Graphite needs the cairo graphics library and its python bindings. With brew this is the py2cairo package but amazon’s package list just calls it pycairo. Then I installed graphite itself:
sudo pip install twisted
sudo pip install django
sudo pip install django-tagging
sudo pip install whisper
sudo pip install carbon
sudo pip install graphite-web
This should install a bunch of stuff in /opt/graphite. The python installed by brew puts binaries in /usr/local/share/python/ but graphite’s bin scripts expect to be in /opt/graphite/bin, so I symlinked them into /opt/graphite/bin - if you try to run them straight from /usr/local/share/python/ they can’t find their graphite libs. On my development machine I chown’ed everything to me; on the production instance I created a graphite user which owns all this (and I run all the assorted daemons as that user too).
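The symlinking is just something along these lines (the exact list of scripts will depend on what pip installed for you; this assumes brew’s /usr/local/share/python/ location):
mkdir -p /opt/graphite/bin
# link the carbon scripts so they sit where graphite expects to find them
for script in carbon-cache.py carbon-aggregator.py validate-storage-schemas.py; do
  ln -s /usr/local/share/python/$script /opt/graphite/bin/$script
done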
There’s a bunch of example configuration files in /opt/graphite/conf/. The only ones you really need at this point are carbon.conf, dashboard.conf, graphTemplates.conf, storage-aggregation.conf and storage-schemas.conf. I added the following to storage-schemas.conf, above the [default_1min_for_1day] section:
[stats]
pattern = ^stats\..*
retentions = 10:2160,60:10080,600:262974
[stats_count]
pattern = ^stats_counts\..*
retentions = 10:2160,60:10080,600:262974
This turns out to be really important, but more on that later. Those retentions keep 10 second data for 6 hours, 1 minute data for a week and 10 minute data for roughly 5 years. Changes to this file are picked up within 60 seconds, but as I understand it this only affects new metrics - existing metrics use whatever settings were in effect when they were created. If you want to change these settings for an existing metric you need to use the supplied scripts for updating whisper files (whisper-resize.py and friends).
I also changed storage-aggregation.conf and added
[stats_counts]
pattern = ^stats_counts\.
xFilesFactor = 0
aggregationMethod = sum
above the default entry - I’ll explain why later. I left the other files with their default contents.
At this point you should be able to start up the carbon daemon by running
/opt/graphite/bin/carbon-cache.py start
It logs to /opt/graphite/log/carbon-cache if things go wrong.
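A quick sanity check is to push a metric at carbon by hand over its plaintext protocol (port 2003 by default) - the metric name here is made up:
echo "test.carbon.alive 1 $(date +%s)" | nc localhost 2003  # some netcat builds need -q0 or -w1 to close the connection
If it worked, a .wsp file for the metric should appear under graphite’s whisper storage directory shortly afterwards.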
Webapp
I must admit to complete ignorance when it comes to django, so some of this is probably super obvious. First I edited /opt/graphite/webapp/graphite/local_settings.py in order to set my timezone and to enable the default sqlite database configuration (by uncommenting the DATABASES section). Then I ran
python manage.py syncdb # cd to /opt/graphite/webapp/graphite/ first
which sets up the database for django.
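For reference, the bits of local_settings.py in question look roughly like this once uncommented (the timezone is a placeholder - use your own):
TIME_ZONE = 'Europe/London'

DATABASES = {
    'default': {
        'NAME': '/opt/graphite/storage/graphite.db',
        'ENGINE': 'django.db.backends.sqlite3',
        'USER': '',
        'PASSWORD': '',
        'HOST': '',
        'PORT': ''
    }
}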
Then finally start up the web app with
python /usr/local/share/python/run-graphite-devel-server.py /opt/graphite
If you point your browser at localhost:8080 you should see the graphite web dashboard with a graph that says ‘No Data’. If the graph area contains a broken image icon then something is wrong, probably with cairo or py2cairo. The tree control lets you explore the set of metrics that are graphable, but at the moment the only ones there will be the ones carbon generates about itself.
A little laborious because I’m not too familiar with the python ecosystem, but no real issues. The one problem I did have was that initially my PATH wasn’t set up correctly, so some commands ended up using the OS X provided python and others the brew installed python. This led to unhappy Fatal Python error: PyThreadState_Get: no current thread errors when I ran the web app.
Statsd
Statsd is a nodejs app that you send your metrics to. It aggregates them and sends them on to graphite in 10 second chunks. Installing this is easy:
brew install nodejs #or whatever your platform install mechanism is
git clone https://github.com/etsy/statsd.git
Amazon’s distro doesn’t have nodejs yet so I installed from source there. You can edit dConfig.js if you need to set any statsd settings (I just changed the address of the graphite server) and then start it with
node stats.js dConfig.js
In production I use forever to run it in the background.
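If you go the forever route it’s something along these lines (assuming forever is installed globally via npm):
npm install -g forever
forever start stats.js dConfig.js  # 'forever list' shows what's running, 'forever logs' shows output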
Creating and manipulating metrics
I’m using the statsd-ruby gem to talk to statsd. Once installed it’s easy to use:
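Something along these lines is all it takes (the host and port are whatever your statsd instance uses; 8125 is statsd’s default):
require 'statsd'  # provided by the statsd-ruby gem

statsd = Statsd.new('localhost', 8125)
statsd.increment 'users.signup'            # bump a counter
statsd.timing 'users.signup.duration', 42  # record a timing, in milliseconds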
This should result in a metric appearing. Metrics are sent over UDP so should be fairly lightweight - in my simple tests incrementing a counter took around 0.2ms on average - and they will fail silently if statsd isn’t running. Statsd-ruby should soon support batching of metrics to reduce overhead when many metrics are sent in quick succession.
Different kinds of metrics
When you increment a counter metric, statsd actually pushes two metrics into graphite. In the previous example stats.users.signup and stats_counts.users.signup would be inserted. stats.users.signup is the rate at which signups occurred (signups per second) as measured over the sample period, whereas stats_counts.users.signup is the raw value the counter changed by over that same period (i.e. stats_counts metrics are dimensionless but stats metrics are 1/T, assuming the counter itself is dimensionless).
Both can be useful depending on what you’re doing: the per second counters are always easy to compare (with the raw counters you need to know what the sample period is) but the raw counters are good when you want the total number of times something has happened. Graphite has an enormous number of functions that you can use to transform your data in all sorts of ways: adding series together, integrating them, smoothing them, averaging them, tracking moving averages, changing the name used in the legend and so on.
Working out which functions to use and whether to use stats or stats_counts can be a bit subtle, especially when you take into account how data is aggregated. For example, if you’re viewing an entire day’s worth of 1 minute data (1440 datapoints) on a graph that’s 720 pixels wide, graphite needs to combine pairs of datapoints. By default it will average adjacent values, which is correct if the metric is stats.message_received (average number of messages received per second), but not if it’s stats_counts.message_received (the number of messages received in each sample period). In the latter case you’d want to graph cumulative(stats_counts.message_received) instead, so that it adds the two datapoints that it’s merging. Personally I’ve found stats_counts to be particularly subtle to deal with.
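Since every graph is ultimately just a render api call, the distinction shows up directly in the URLs you build (the hostname here is made up):
# per second rate - the default averaging consolidation is what you want
http://graphite.example.com/render?target=stats.message_received&from=-24hours&width=720
# raw counts - wrap in cumulative() so merged datapoints are summed rather than averaged
http://graphite.example.com/render?target=cumulative(stats_counts.message_received)&from=-24hours&width=720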
Aggregation
Statsd aggregates your data in 10 second chunks before forwarding it on to graphite. For counters this aggregation is just summing; for timers statsd calculates averages, percentiles etc. Graphite in turn aggregates data to save space: 10 second data is eventually downsampled to per minute, per hour or per day data depending on the configuration in storage-schemas.conf. The correct way of doing this aggregation depends on the sort of data: a per second counter should be averaged, a metric tracking the maximum response time should take the max of all the points being aggregated, and a raw counter should sum the values.
How
The storage-aggregation.conf file defines how data is aggregated. The example rules use the last component of the metric name: if it ends in .count then sum it, if it ends in .avg then average it and so on. When submitting timing metrics statsd adds a suffix for you, but when submitting counters you’re on your own. If you’re just interested in the per second counters then the default (averaging) is fine. For the stats_counts metrics you want to sum values, hence the rule I mentioned earlier.
As an aside, a comment in storage-aggregation.conf claims that the file is scanned every 60 seconds and changes are picked up automatically (and applied when creating a new metric). I had initially created metrics with the wrong aggregation and, after deleting the .wsp files, they were still being recreated with the old settings. Restarting carbon-cache fixed it, although I don’t know if my case was different because I was recreating existing metrics. You can use the whisper-info.py script to get information about a .wsp file, including what its aggregation settings are (the metric files live under /opt/graphite/storage/whisper).
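For example (the path is illustrative - it just mirrors the metric name):
/opt/graphite/bin/whisper-info.py /opt/graphite/storage/whisper/stats_counts/users/signup.wsp
# prints aggregationMethod, xFilesFactor, maxRetention and the archive layout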
When
The rules in storage-schemas.conf define how much data is kept at each time resolution. You might keep an hour’s worth of 10s data, a day’s worth of per minute data and a year’s worth of per 5 minute data. Obviously the more data you keep, the more disk space is used. You can keep a lot of historical data if you tune the resolution down a little - in the example given earlier 5 years’ worth of 10 minute data is kept, at a cost of around 3 megabytes per metric.
When using statsd with its default settings it’s crucial that you keep at least some 10 second data (the default graphite config keeps 1min resolution data only). If you don’t then the mismatch between statsd’s flush rate and graphite’s flush rate will cause you to lose data. Assume that graphite is only storing data with 1min granularity:
- At 00:00:01 you increment a statsd counter (value = 1)
- At 00:00:05 you increment a statsd counter (value = 2)
- At 00:00:10 statsd sends this counter value (2) to graphite and resets the counter
- At 00:00:20 statsd sends this counter value (0) to graphite
If you’re only storing 60s data then both of those flushes fall into the same graphite aggregation period and the second one overwrites the first: at first graphs would show a counter of 2 and then that counter would change to 0. If you do change these settings you’ll need to resize any existing .wsp files you want the changes to apply to (see whisper-resize.py and friends).
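Resizing is along these lines - the retentions use the same resolution:points format as storage-schemas.conf, and the path is again illustrative:
/opt/graphite/bin/whisper-resize.py /opt/graphite/storage/whisper/stats_counts/users/signup.wsp 10:2160 60:10080 600:262974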
Combining metrics
Aggregation also refers to the practice of combining multiple metrics into a new one. To do this you add rules to aggregation-rules.conf, run the carbon-aggregator.py daemon and reconfigure statsd to send its data there instead (it runs on port 2023 by default).
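Pointing statsd at the aggregator is just a change to the graphite port in its config file - these are the standard etsy/statsd config keys, and the hostname is made up:
{
  graphiteHost: "graphite.example.com",
  graphitePort: 2023,  // carbon-aggregator, rather than carbon-cache on 2003
  port: 8125           // the UDP port statsd itself listens on
}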
As an example, you might want to track successful and failed login attempts separately but also track total login attempts. If you emit users.login.successful and users.login.failed then the rule
stats.users.login.all (10) = sum stats.users.login.*
would sum your two login metrics into an all logins metric. You can also define multiple related aggregations at a time in this way. For example, if you had a multitenanted app you might want to track login rate on a per tenant basis. Instead of incrementing users.login.successful you could increment users.client1.login.successful for your first tenant, users.client2.login.successful for your second tenant and so on. You could generate a per tenant total logins per second metric with this rule
stats.users.<tenant>.login.all (10) = sum stats.users.<tenant>.login.*
which would create stats.users.client1.login.all, stats.users.client2.login.all and so on. If you wanted logins per second across all tenants you could use
stats.users.all.login.all (10) = sum stats.users.*.login.*
I’ve found that I sometimes need to restart carbon-aggregator to get it to pick up changes to aggregations rules, despite comments in the example config files claiming the opposite.
Deployment
I’ve never deployed a python app, so I followed the path of least resistance, leaning heavily on the example configurations that come with graphite. This ended up meaning apache and mod_wsgi. I renamed /opt/graphite/conf/graphite.wsgi.example to graphite.wsgi, and the apache configuration is pretty much the example file from /opt/graphite/examples. This is pretty much boilerplate, but you might find it useful.
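Roughly, the vhost ends up looking like the sketch below. It’s cut down from the example vhost config that ships with graphite, so treat the server name, paths and process settings as illustrative; the WSGIDaemonProcess line is where the user=graphite change goes.
<VirtualHost *:80>
    ServerName graphite.example.com
    DocumentRoot "/opt/graphite/webapp"

    # run the django app under mod_wsgi as the unprivileged graphite user
    WSGIDaemonProcess graphite user=graphite processes=2 threads=2
    WSGIProcessGroup graphite
    WSGIScriptAlias / /opt/graphite/conf/graphite.wsgi

    # static content is served directly by apache
    Alias /content/ /opt/graphite/webapp/content/
    <Location "/content/">
        SetHandler None
    </Location>

    <Directory /opt/graphite/conf/>
        Order deny,allow
        Allow from all
    </Directory>
</VirtualHost>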
About the only change I made was to add user=graphite so that the app runs as the unprivileged graphite user.
Graphite doesn’t provide any sort of access control out of the box. We host all of our stuff on EC2, so leaving the web interface wide open wasn’t an option. If it was just me I would probably have settled for using ssh tunnels, but the point was to collect business metrics and ssh tunnels don’t really fly for the non technical people interested in these. It would also be really nice if people didn’t have to have a second set of credentials just to access the graphs.
In the end I wrote a small proxy app. Users make requests to the proxy app that then checks with our main app to see if they are authorized. If they are then the main app hands our proxy app a signed token (valid for a finite amount of time) that is used to authorize subsequent requests. The proxy app then forwards the request on to graphite and returns the response to the user. It’s not particularly clever (and almost certainly not an http compliant proxy) but it seems to work well enough. Graphite itself is not visible from the outside at all. It’s likely that there are better ways of achieving this within the python ecosystem but that’s not a world I’m familiar with.
Onwards
I feel like I’m still only scratching the surface of how best to use graphite both in terms of what to measure and how best to present it. One of the neat things about graphite is that the default web app is really just a way of creating urls for the rendering api. So far I’ve found it to be a reasonable way of nosing through all the options graphite has to offer but you’re in no way tied to the default ui - you can use any method you want to generate the image url for a graph.
One such alternative frontend is Graphiti. It also adds features such as being able to save snapshots of graphs, email them to people etc.
Cubism goes a step further: instead of letting graphite render the images it retrieves the raw data (graphite gives you a choice of its custom format, csv or json) and uses this to do its own rendering clientside. This also gets you nice features such as being able to hover the mouse over a datapoint and see its value. Cubism loads its data over ajax, so you may need to set CORS headers on your graphite server for it to work.
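With apache in front of graphite that can be as simple as something like the following in the vhost (it needs mod_headers; the origin is whatever host serves your cubism dashboard, not graphite itself):
Header set Access-Control-Allow-Origin "http://dashboards.example.com"
Header set Access-Control-Allow-Methods "GET, OPTIONS"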
Both Cubism & Graphiti are easy to set up (there’s an example cubism frontend written using twitter bootstrap) and are well worth having a play with.