I got a series of Rollbar alerts a few days ago:
NoMemoryError: failed to allocate memory. The application wasn’t under any particular load at the time, and a quick look at memory usage stats in appsignal showed the telltale signs of a memory leak: slowly but inevitably rising memory usage.
A needle in a haystack
We had not noticed similar errors before, so my first thought was “what have we changed?”. As it turns out, nothing – the leak had just flown under the radar. The leak is pretty slow, so between deploys and instances scaling up and down due to variations in load, it was actually pretty rare for any given instance to be up long enough for the leak to have any real consequences. On the rare occasions it actually impaired functionality it was handled automatically as any instance failure would be.
I started the proverbial hunt for a needle in a haystack with Richard Schneeman’s excellent derailed_benchmarks tool ( the
perf:mem_over_time variant ). I didn’t really know what I was looking for though and wasn’t able to find anything very significant – I tried some of the app’s most hit endpoints but wasn’t able to reproduce the leak. Clearly there was a combination of circumstances or a difference in our production setup I was overlooking.
rbtrace to the rescue
My next port of call was rbtrace, via Sam Saffron’s excellent blog post on debugging memory leaks. The basic idea is to use rbtrace to run ObjectSpace’s debugging tools inside the Rails app running in production.
To get the most out of this you need to tell ObjectSpace to collect some details from each memory allocation by calling
ObjectSpace.trace_object_allocations_start. This has a noticeable overhead (both in terms of performance and memory), so I did this against a single Passenger instance. When using with Passenger, you need to set
rbtrace to not be loaded by bundler (
require: false in the Gemfile) and instead load it from a controller after the app is up (see https://github.com/tmm1/rbtrace/issues/9).
Then it’s much as Sam describes:
To dump a file with allocation info. I ran his first analyze script to aggregate the data and got some output that looked a little like this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
(If you find everything reported as an empty generation (as I did at first), then you’ve probably forgotten to call
A slow but steady leak. Digging into what code was leaking objects the allocation locations were mostly like this:
Which was a little surprising – I was expecting to see some of our application code here. I eyeballed some of the actual values and they were the names of the methods that views get compiled into. This didn’t make any sense: in production Rails caches compiled views, so I should have a flurry of these when the app has just restarted but nothing once the app gets into a steady state.
After a few minutes double checking that template caching was actually turned on (it was) I decided that I needed to understand more about how template caching worked – maybe something was invalidating this cache? Off I went spelunking through the Rails codebase.
I fairly quickly found that this cache lives inside
ActionView::Resolver but it wasn’t used in the way I had assumed. I had thought there’d be one big resolver (with its cache), whereas there are often are multiple resolvers (and hence multiple caches).
A Rails controller has a
view_paths attribute, which is conceptually a list of paths to search, but is actually a
ActionView::PathSet that contains many
ActionView::Resolver objects (one for each path), i.e. each element in
view_paths is cached separately.
At this point it seemed logical to see if something was changing
view_paths – maybe that would explain things. A quick search through our codebase found the following controller instance method:
1 2 3 4 5
This app doesn’t actually have a separate mobile site anymore –
mobile_site_requested? is always true in the real world, although it is technically possible to request the old versions of the pages. This is hangover from when we redesigned the site from desktop only to a responsive, mobile first, site. A candidate for clearing up for sure, but it seemed pretty harmless.
When I did this in Rails console (looking for the outfits/show template)
It consistently returned the same
object_id: the cache was being used. On the other hand
1 2 3
resulted in a different template object for each controller, so prepending a view path like this was resulting in the cache not being used (for views in
app/views/mobile). This felt like I was definitely onto something!
The Rails code for prepending a path looks like this
1 2 3 4 5 6 7 8 9 10 11 12 13 14
So when you prepend a string, Rails creates a new resolver for that path and adds that to the path set. When done from a controller instance method, this means that each request uses a new resolver for
app/views/mobile (with an empty cache). Memory leak aside, my code was bad because it meant views were compiled over and over again.
Now that I understood that
prepend_view_path was key, I knew why I wasn’t able to reproduce the leak earlier – I’d forgotten about this part of the app and had been testing against a codepath that didn’t use
prepend_view_path. Once I adjusted my test setup I was able to see the same leak using derailed.
The actual leak
I was still curious about the actual leak: compiling templates over and over again was bad, but why was it leaking? The leaked values were the names of methods, which are symbols. Ruby garbage collects symbols since ruby 2.2, so how is this possible?
Although the headlines were “Ruby 2.2 adds Symbol GC”, the reality is a little more complicated. There are two kinds of symbols: mortal and immortal ones. Immortal ones are (simplifying grossly) used by the C parts of ruby and not collectable. Defining a method creates an immortal symbol (or transitions a symbol from mortal to immortal) so creating large numbers of methods with unique names will still cause a leak, even if you undefine the methods when you are done (see https://www.slideshare.net/authorNari/symbol-gc for more details on symbol GC).
It was a pretty small leak – a hundred or so bytes per distinct template or partial rendered per request, but this app gets over a million requests a day, so it all adds up. It does like this may have bitten other people before – see rails#22580 and rails#14301
In this very particular case the use of
prepend_view_path was unnecessary and would be possible to just remove it. This isn’t always the case – we’ve got other apps which do actually need to use
prepend_view_path (for example a white labelled application that has per request host view overrides) – so I wanted to see if I could fix the leak while keeping
prepend_view_path only creates a new resolver if you give it a string or pathname – if you give it a resolver, it will just add that to the array. Therefore we just need to create (and cache ourselves) an
ActionView::Resolver (i.e. replicate what the
typecast method was doing above), and prepend that instead of the string.
1 2 3 4 5 6
This is a graph of memory usage across the hosts for this application either side of when the fix above was deployed.
Pretty convincing! (Graph from appsignal)