Space Vatican

Ramblings of a curious coder

Debugging a Memory Leak in a Rails App

I got a series of Rollbar alerts a few days ago: NoMemoryError: failed to allocate memory. The application wasn’t under any particular load at the time, and a quick look at memory usage stats in appsignal showed the telltale signs of a memory leak: slowly but inevitably rising memory usage.

A needle in a haystack

We had not noticed similar errors before, so my first thought was “what have we changed?”. As it turns out, nothing - the leak had just flown under the radar. The leak is pretty slow, so between deploys and instances scaling up and down due to variations in load, it was actually pretty rare for any given instance to be up long enough for the leak to have any real consequences. On the rare occasions it actually impaired functionality it was handled automatically as any instance failure would be.

I started the proverbial hunt for a needle in a haystack with Richard Schneeman’s excellent derailed_benchmarks tool ( the perf:mem_over_time variant ). I didn’t really know what I was looking for though and wasn’t able to find anything very significant - I tried some of the app’s most hit endpoints but wasn’t able to reproduce the leak. Clearly there was a combination of circumstances or a difference in our production setup I was overlooking.

rbtrace to the rescue

My next port of call was rbtrace, via Sam Saffron’s excellent blog post on debugging memory leaks. The basic idea is to use rbtrace to run ObjectSpace’s debugging tools inside the Rails app running in production.

To get the most out of this you need to tell ObjectSpace to collect some details from each memory allocation by calling ObjectSpace.trace_object_allocations_start. This has a noticeable overhead (both in terms of performance and memory), so I did this against a single Passenger instance. When using with Passenger, you need to set rbtrace to not be loaded by bundler (require: false in the Gemfile) and instead load it from a controller after the app is up (see https://github.com/tmm1/rbtrace/issues/9).

Then it’s much as Sam describes:

1
bundle exec rbtrace --pid $PID -e 'Thread.new{GC.start;require "objspace";io=File.open("/tmp/ruby-heap.dump", "w"); ObjectSpace.dump_all(output: io); io.close

To dump a file with allocation info. I ran his first analyze script to aggregate the data and got some output that looked a little like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
generation 7650 objects 49
generation 7651 objects 59
generation 7652 objects 64
generation 7653 objects 44
generation 7654 objects 41
generation 7655 objects 64
generation 7656 objects 79
generation 7657 objects 59
generation 7658 objects 78
generation 7659 objects 43
generation 7660 objects 56
generation 7661 objects 74
generation 7662 objects 130
generation 7663 objects 48
generation 7664 objects 44
generation 7665 objects 58
generation 7666 objects 55

(If you find everything reported as an empty generation (as I did at first), then you’ve probably forgotten to call ObjectSpace.trace_object_allocations_start).

A slow but steady leak. Digging into what code was leaking objects the allocation locations were mostly like this:

1
2
/var/www/application/deploy/shared/bundle/ruby/2.4.0/gems/activesupport-4.2.11.1/lib/active_support/core_ext/module/remove_method.rb:4 * 49
/var/www/application/deploy/shared/bundle/ruby/2.4.0/gems/actionview-4.2.11.1/lib/action_view/template.rb:296 * 48

Which was a little surprising - I was expecting to see some of our application code here. I eyeballed some of the actual values and they were the names of the methods that views get compiled into. This didn’t make any sense: in production Rails caches compiled views, so I should have a flurry of these when the app has just restarted but nothing once the app gets into a steady state.

Template caching

After a few minutes double checking that template caching was actually turned on (it was) I decided that I needed to understand more about how template caching worked - maybe something was invalidating this cache? Off I went spelunking through the Rails codebase.

I fairly quickly found that this cache lives inside ActionView::Resolver but it wasn’t used in the way I had assumed. I had thought there’d be one big resolver (with its cache), whereas there are often are multiple resolvers (and hence multiple caches).

A Rails controller has a view_paths attribute, which is conceptually a list of paths to search, but is actually a ActionView::PathSet that contains many ActionView::Resolver objects (one for each path), i.e. each element in view_paths is cached separately.

At this point it seemed logical to see if something was changing view_paths - maybe that would explain things. A quick search through our codebase found the following controller instance method:

1
2
3
4
5
def set_view_path
  if mobile_site_requested?
    prepend_view_path "app/views/mobile"
  end
end

This app doesn’t actually have a separate mobile site anymore - mobile_site_requested? is always true in the real world, although it is technically possible to request the old versions of the pages. This is hangover from when we redesigned the site from desktop only to a responsive, mobile first, site. A candidate for clearing up for sure, but it seemed pretty harmless.

When I did this in Rails console (looking for the outfits/show template)

1
FooController.new.lookup_context.find_all('show', 'outfits').object_id

It consistently returned the same object_id: the cache was being used. On the other hand

1
2
3
c = FooController.new
c.prepend_view_path "app/views/mobile"
c.lookup_context.find_all('show', 'outfits').object_id

resulted in a different template object for each controller, so prepending a view path like this was resulting in the cache not being used (for views in app/views/mobile). This felt like I was definitely onto something!

The Rails code for prepending a path looks like this

1
2
3
4
5
6
7
8
9
10
11
12
13
14
def unshift(*args)
  paths.unshift(*typecast(args))
end

def typecast(paths)
  paths.map do |path|
    case path
    when Pathname, String
      OptimizedFileSystemResolver.new path.to_s
    else
      path
    end
  end
end

So when you prepend a string, Rails creates a new resolver for that path and adds that to the path set. When done from a controller instance method, this means that each request uses a new resolver for app/views/mobile (with an empty cache). Memory leak aside, my code was bad because it meant views were compiled over and over again.

Now that I understood that prepend_view_path was key, I knew why I wasn’t able to reproduce the leak earlier - I’d forgotten about this part of the app and had been testing against a codepath that didn’t use prepend_view_path. Once I adjusted my test setup I was able to see the same leak using derailed.

The actual leak

I was still curious about the actual leak: compiling templates over and over again was bad, but why was it leaking? The leaked values were the names of methods, which are symbols. Ruby garbage collects symbols since ruby 2.2, so how is this possible?

Although the headlines were “Ruby 2.2 adds Symbol GC”, the reality is a little more complicated. There are two kinds of symbols: mortal and immortal ones. Immortal ones are (simplifying grossly) used by the C parts of ruby and not collectable. Defining a method creates an immortal symbol (or transitions a symbol from mortal to immortal) so creating large numbers of methods with unique names will still cause a leak, even if you undefine the methods when you are done (see https://www.slideshare.net/authorNari/symbol-gc for more details on symbol GC).

It was a pretty small leak - a hundred or so bytes per distinct template or partial rendered per request, but this app gets over a million requests a day, so it all adds up. It does like this may have bitten other people before - see rails#22580 and rails#14301

Fixing it

In this very particular case the use of prepend_view_path was unnecessary and would be possible to just remove it. This isn’t always the case - we’ve got other apps which do actually need to use prepend_view_path (for example a white labelled application that has per request host view overrides) - so I wanted to see if I could fix the leak while keeping prepend_view_path.

prepend_view_path only creates a new resolver if you give it a string or pathname - if you give it a resolver, it will just add that to the array. Therefore we just need to create (and cache ourselves) an ActionView::Resolver (i.e. replicate what the typecast method was doing above), and prepend that instead of the string.

1
2
3
4
5
6
@@mobile_views = ActionView::OptimizedFileSystemResolver.new("app/views/mobile")
def set_view_path
  if mobile_site_requested?
    prepend_view_path @@mobile_views
  end
end

Results

This is a graph of memory usage across the hosts for this application either side of when the fix above was deployed.

memory-usage

Pretty convincing! (Graph from appsignal)