Space Vatican

Ramblings of a curious coder

Who's Afraid of the Big Bad Lock?

Unless you’ve been living under a rock for the past few years you will have heard of C-ruby’s GIL/GVL, the Global Interpreter/VM Lock (I prefer to think of it as the Giant VM Lock). While your ruby threads may sit on top of native pthreads, this big bad lock stops more than one thread running at the same time. Allegedly one of the main reasons behind this was to protect non threadsafe ruby extensions and also to shield us from the horrors of threading. Personally it feels like a lot of 3rd party gems needed updating for 1.9 anyway (particularly with encoding related issues) and so it would have been a good opportunity to make that change. The complexities of threading, locking etc. could be handled by providing higher level abstractions over them (actors etc.).

True concurrency isn’t completely dead though. Ruby can already release the GVL when a thread is blocked on IO and if you are writing a C extension you can release the lock too. The mysql2 gem does this: clearly there is no point in holding onto the GVL when you’re just waiting on mysql to return results. Similarly Eric Hodel recently submitted a patch to the zlib extension so that the lock is released while zlib is doing its thing. This obviously doesn’t make mysql queries or zlib run any faster individually but it means you can run many in parallel and that these operations don’t block other unrelated threads. When even laptops have hyperthreaded quad-core processors, this is a good thing.
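To see the first of those behaviours from plain ruby, here is a minimal sketch. It assumes that sleep is a fair stand-in for a thread blocked on IO (like a blocking read, it parks the thread and lets the others run). The names fake_blocking_call and run_in_threads are made up for this illustration.

```ruby
require 'benchmark'

# Hypothetical stand-in for a blocking call such as a database query.
# While a thread sleeps it does not hold the GVL, so other threads can run.
def fake_blocking_call(seconds)
  sleep seconds
  :done
end

# Start `count` threads that each block for `seconds`, then collect results.
def run_in_threads(count, seconds)
  threads = Array.new(count) { Thread.new { fake_blocking_call(seconds) } }
  threads.map(&:value)
end

results = nil
elapsed = Benchmark.realtime { results = run_in_threads(4, 0.2) }
puts format('4 blocked threads finished in %.2fs', elapsed)
```

If the four waits were serialised this would take around 0.8 seconds. Because the waits overlap, the total should stay close to a single wait.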

The magic API is rb_thread_blocking_region, whose header documentation comes with copious warnings. Threading after all is hard (and should not be mixed with alcohol (I speak from experience)). A call to rb_thread_blocking_region looks like this

rb_thread_blocking_region(do_some_work, argument,
                        unblocker, unblocker_argument);

When you call this, ruby

  • releases the GVL (other ruby threads can now run)
  • calls your do_some_work function, passing it argument
  • reacquires the GVL

Operating without the GVL is a scary place to be. You can’t in general call any of the C-ruby api, because they all assume they hold the GVL. If you’ve ever written MP threaded code on Apple’s OS 8 you’ll feel right at home.

The second pair of arguments is the so called unblocking function (ubf) and its argument. If ruby needs to kill your thread (in response to Thread.kill, the VM exiting, etc) this function will be called. Your do_some_work should then exit. For example ruby’s bignum.c has this code that runs inside rb_thread_blocking_region

static VALUE bigdivrem1(void *ptr)
{
  //setup code removed
  do {
    if (bds->stop) return Qnil;
    //calculation code here
  } while (--j >= ny);
  return Qnil;
}

The ubf just sets the bds->stop flag so that bigdivrem1 returns early.
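The same stop-flag idea can be sketched in plain ruby. This is only an analogy for the pattern, not the C API: StopFlag, long_calculation and unblock are hypothetical names, with long_calculation playing the part of do_some_work and unblock playing the part of the ubf.

```ruby
require 'thread'

# Shared state, analogous to the struct that holds bds->stop in bignum.c.
StopFlag = Struct.new(:stop)

# Plays the role of do_some_work: checks the flag on every pass of the loop.
def long_calculation(state, started)
  passes = 0
  loop do
    return passes if state.stop
    passes += 1
    started.push(:running) if passes == 1 # tell the caller we are under way
  end
end

# Plays the role of the ubf: does as little as possible, just sets the flag.
def unblock(state)
  state.stop = true
end

state = StopFlag.new(false)
started = Queue.new
worker = Thread.new { long_calculation(state, started) }

started.pop      # wait until the worker is inside its loop
unblock(state)   # ask it to stop
passes = worker.value
puts "worker stopped after #{passes} passes"
```

Without the flag check the loop would never end. With it, the worker returns on the next pass after unblock is called.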

You can specify the constants RUBY_UBF_PROCESS or RUBY_UBF_IO to use the ruby provided ubf_select function which handles the common case of being blocked on a call to select or accept or other such functions (it’s not a general purpose ubf - see the posting to ruby-core). The overall intent is that you don’t do an awful lot inside your ubf, just enough to get your main do_some_work function to stop. Sometimes there just isn’t a good way to stop what you’re doing; it’s not the end of the world if your ubf does nothing.

This works best when you can isolate a chunk of C code that doesn’t need any interaction with the ruby world. Code that wants to frequently call back into ruby (for example a sax style xml parser delivering events to a ruby class) isn’t a good fit.

In some cases you might consider rb_thread_call_with_gvl, which reacquires the GVL, executes some code for you and releases it again. The headers around it are plastered with warnings that it is experimental and might be removed in ruby 1.9.2, but it would seem that it is here to stay. If you end up calling it a lot then you’re pretty much back to executing serially.

A simple example

The rdiscount gem (a markdown parser) presents a reasonable opportunity for this sort of work. RDiscount objects have a to_html method that doesn’t do much other than call into the discount C library. Initially the core of this method looked like this

static VALUE rb_rdiscount_to_html(int argc, VALUE *argv, VALUE self) {
  /*boring setup stuff */
  MMIOT *doc = mkd_string(RSTRING_PTR(text), RSTRING_LEN(text), flags);
  if ( mkd_compile(doc, flags) ) {
    szres = mkd_document(doc, &res);

    if ( szres != EOF ) {
      rb_str_cat(buf, res, szres);
      rb_str_cat(buf, "\n", 1);
    }
  }
  mkd_cleanup(doc);
  /*boring cleanup */
  return buf;
}

It just pushes a string through the markdown library and creates a ruby string from the result. The first thing I did was to write a small benchmark that loads a folder of markdown files and converts them to html. I pulled the IO part out of the benchmark part because that wasn’t the bit I was interested in.

require 'thread'
require 'benchmark'
require 'rdiscount'
include Benchmark
$content = []
Dir.glob('/Users/fred/markdown/*.markdown').each {|path| $content.push File.read path}

def parse_with_threads(count)
  q = Queue.new
  $content.each {|data| q.push data}
  count.times {q.push nil}
  threads = (1..count).collect do
    Thread.new do
      while s = q.pop
        RDiscount.new(s).to_html
      end
    end
  end
  threads.each {|th| th.join}
end

bmbm(5) do |x|
  x.report("1 thread") { 40.times {parse_with_threads 1}}
  x.report("2 thread") { 40.times {parse_with_threads 2}}
  x.report("4 thread") { 40.times {parse_with_threads 4}}
end

With vanilla rdiscount/master, this produces the following output on my quad-core iMac

               user     system      total        real
1 thread   6.420000   0.010000   6.430000 (  6.426911)
2 thread   6.430000   0.010000   6.440000 (  6.439822)
4 thread   6.380000   0.020000   6.400000 (  6.395870)

This is pretty fast - the folder contained nearly 1400 files that we parsed 40 times over and it still only took a handful of seconds. However the results are identical no matter how many threads (to within 0.5%). The cpu usage (as measured by top) never goes above 100% (On OS X 100% means 100% of one core, so values of 400% on a 4 core machine are possible).

My changed version of rdiscount.c looks like this

static void *rb_rdiscount_to_html_no_gvl(void * param_block){
  rb_discount_to_html_param_block *block = (rb_discount_to_html_param_block*)param_block;
  block->doc = mkd_string(block->input, block->input_length, block->flags);
  if(mkd_compile(block->doc, block->flags)){
    block->szres = mkd_document(block->doc, &block->res);
  }
  return NULL; /* result is unused, but a non-void function must return something */
}

static VALUE
rb_rdiscount_to_html(int argc, VALUE *argv, VALUE self)
{
  /* boring setup stuff */
  rb_thread_blocking_region(rb_rdiscount_to_html_no_gvl, (void*)&block, NULL, NULL);
  if(block.szres !=EOF){
    rb_str_cat(buf, block.res, block.szres);
    rb_str_cat(buf, "\n", 1);
  }
  mkd_cleanup(block.doc);
  /* boring cleanup */
  return buf;
}

The setup code now marshals all of the parameters needed into a struct of type rb_discount_to_html_param_block and then uses rb_thread_blocking_region to execute rb_rdiscount_to_html_no_gvl which does the bulk of the work.

This time the output looks like this, on the same quad-core iMac.

               user     system      total        real
1 thread   6.250000   0.060000   6.310000 (  6.315216)
2 thread   6.730000   0.110000   6.840000 (  3.463594)
4 thread   7.520000   0.350000   7.870000 (  2.146914)

The 2 thread case now runs in 55% of the time it took to run the 1 thread case, and the 4 thread case runs in 34% of the time. The most we could hope for was to halve execution time with 2 threads and quarter it for 4 threads, so we got pretty close. CPU utilisation is also much higher. There is of course some code for which the GVL is still held, and there’s always some overhead. Still, not bad for what was essentially a 5 minute change!
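For anyone who wants to check the arithmetic, here is a small sketch that derives those percentages from the "real" column of the benchmark output. The scaling_report helper and its hash keys are names invented for this illustration. Efficiency is taken here to mean speedup divided by thread count, so 1.0 would be perfect scaling.

```ruby
# Wall-clock ("real") seconds copied from the benchmark output above,
# keyed by number of threads.
REAL_SECONDS = { 1 => 6.315216, 2 => 3.463594, 4 => 2.146914 }

# Compare each run against the single threaded baseline.
def scaling_report(times)
  baseline = times.fetch(1)
  times.map do |threads, seconds|
    speedup = baseline / seconds
    { threads: threads,
      percent_of_baseline: (100 * seconds / baseline).round,
      speedup: speedup.round(2),
      efficiency: (speedup / threads).round(2) }
  end
end

report = scaling_report(REAL_SECONDS)
report.each do |row|
  puts format('%d thread(s): %d%% of baseline, %.2fx speedup, %.2f efficiency',
              row[:threads], row[:percent_of_baseline],
              row[:speedup], row[:efficiency])
end
```

The efficiency figure drops as threads are added, which is what you would expect given the parts of to_html that still run with the GVL held.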

Cleaning up after yourself

This code has one flaw: should ruby try to kill the thread then rb_thread_blocking_region won’t ever return. When the GVL is reacquired ruby will check whether the thread should be killed and bail out if appropriate. In our case we would leak the resources allocated by mkd_string. One way around this is to ensure anything that is allocated by rb_rdiscount_to_html_no_gvl is also disposed of inside rb_rdiscount_to_html_no_gvl. That doesn’t really work in this case: we need to be able to convert the result back into a ruby object, and we can’t do that in the no-man’s land that is rb_thread_blocking_region. rb_thread_call_with_gvl doesn’t help since it also checks whether the thread should be killed when the GVL is reacquired.

In pure ruby when you want to make sure code gets executed even in the presence of such things you use ensure, and things are not so different (if more verbose) when using the C api: the rb_ensure function allows you to call a function while specifying a second function that should be called after, no matter what.

static void *rb_rdiscount_to_html_no_gvl(void * param_block){
  rb_discount_to_html_param_block *block = (rb_discount_to_html_param_block*)param_block;
  block->doc = mkd_string(block->input, block->input_length, block->flags);
  if(mkd_compile(block->doc, block->flags)){
    block->szres = mkd_document(block->doc, &block->res);
  }
  return NULL;
}

/*wrapper function. only exists because we need to pass a function pointer to rb_ensure*/
static VALUE rb_discount_to_html_no_gvl_wrapper(VALUE arg){
  /* arg is really a pointer to our param block, passed through rb_ensure as a VALUE */
  rb_thread_blocking_region(rb_rdiscount_to_html_no_gvl, (void*)arg, NULL, NULL);
  return Qnil;
}

/*called after rb_discount_to_html_no_gvl_wrapper. cleans up markdown resources and sticks the result in block->result*/
static VALUE rb_discount_to_html_no_gvl_cleanup(VALUE arg){
  rb_discount_to_html_param_block *block = (rb_discount_to_html_param_block*)arg;
  if(block->szres !=EOF){
      block->result = rb_str_buf_new(block->szres+1);
      rb_str_cat(block->result, block->res, block->szres);
      rb_str_cat(block->result, "\n", 1);
  }
  else{
    block->result = rb_str_buf_new(1);
  }
  mkd_cleanup(block->doc);
  return Qnil;
}
static VALUE rb_rdiscount_to_html(int argc, VALUE *argv, VALUE self)
{
  /* boring setup */
  rb_ensure(rb_discount_to_html_no_gvl_wrapper, (VALUE)&block, rb_discount_to_html_no_gvl_cleanup, (VALUE)&block);

  /*encoding massaging omitted*/
  return block.result;
}

This ensures that the cleanup/conversion code always runs. Properly cleaning up after yourself in this sort of situation is definitely tricky.
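For comparison, this is roughly what the pure ruby version of the same guarantee looks like. It is only a sketch: the symbols pushed onto the log are placeholders standing in for mkd_string, the long-running parse and mkd_cleanup, and nothing here touches rdiscount itself.

```ruby
require 'thread'

log = Queue.new    # records what the worker got round to doing
ready = Queue.new  # lets the main thread know the worker has started

worker = Thread.new do
  begin
    log.push(:allocated)   # stands in for mkd_string
    ready.push(true)
    sleep                  # stands in for the long-running work
    log.push(:finished)    # never reached: the thread is killed first
  ensure
    log.push(:cleaned_up)  # stands in for mkd_cleanup
  end
end

ready.pop     # wait until the worker holds its "resource"
worker.kill   # the ruby-level equivalent of the thread being killed mid-call
worker.join

events = []
events << log.pop until log.empty?
p events
```

The worker never gets as far as :finished, but the ensure clause still runs when the thread is killed, which is the behaviour rb_ensure gives us on the C side.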

When it’s a good fit, rb_thread_blocking_region can be pretty handy but it’s definitely a little verbose (if consistent with the rest of the ruby API). Perhaps exposing the internal BLOCKING_REGION macro would help ease some of the callback spaghetti. A better way of dealing with libraries that want to call back into ruby would be great too.