Profiling Ruby, or, How I Made Rails Start Up Faster

Last week Ruby application startup got about 2-3x faster when four patches of mine were committed to the MRI/CRuby interpreter. The improvements will be in the upcoming Ruby 2.0 release.

The techniques I used to identify what was slow, so I could fix it, are accessible to anyone and are useful for diagnosing any performance problem in MRI or any C extension. But they’re a little obscure, so I thought I’d do my part to make them better known.

Here’s what startup looked like for our frontend Rails application at Solano, before I dug in to fix anything:

  $ date; bundle exec rails c
  Thu Nov 15 17:08:09 EST 2012
  Time.now
  Loading development environment (Rails 3.0.17)
  irb(main):001:0> Time.now
  => 2012-11-15 17:08:21 -0500

That’s 12 seconds from hitting enter on rails c to actually getting a prompt. (I typed Time.now ahead of the prompt, so no typing delay is included; it’s 11.96s when I measure systematically.) And that’s with the latest 1.9.3 release — with 1.9.2 it’s twice as bad. So I decided to do something about it. That means going through the four basic steps of fixing performance problems:

  1. Measure the problem with a benchmark. As they say, you can’t improve what you don’t measure.
  2. Profile to locate the problem.
  3. Fix the problem.
  4. Repeat until fast.

Steps 1 and 4 are relatively easy. Step 2 can be easy with the right tool — but the right tool isn’t easy to find or to start using. So in this post I’ll go into extra technical detail on Step 2, to make that step easy for you, my reader, in the future.

Step 1: Measure

Why is it so slow? First off, to make the problem easier to study we’ll isolate one thing that shouldn’t be slow but is. Solano’s Rails app requires about 2000 different files, including all its dependencies:

  $ bundle exec rails runner 'p $LOADED_FEATURES.size'
  2125

So suppose we isolate just that bit — require a few thousand files, and make the files empty so there’s nothing else going on.

  $ mkdir reqtest
  $ for i in {1..10000}; do touch reqtest/$i.rb; done

  $ time ruby -I reqtest -e '(1..10000).each {|i| require "#{i}"}'

That should be really fast. What actually happens? (For the table below, I reran the setup and the benchmark with 10000 replaced by each value of N.)

             time     avg. /
     N   (best of 5)   file
   2000     0.286s     0.1ms
   4000     0.838      0.2
   8000     2.775      0.3
  16000    10.677      0.7
  32000    42.531      1.3

Hooray, it’s slow. That means we can focus on fixing this simplified case, and maybe that will fix our real problem.

First, it shouldn’t take a tenth of a millisecond just to require an empty file — that’s 100,000 ns, time enough to fetch 1000 separate items all the way from main memory. We don’t have that much real work to do.

But worse, the more files we’ve already required, the slower each additional require gets: notice the per-file average growing with N. We’re taking quadratic time, and there’s no reason this shouldn’t be linear. Why exactly is this so slow?
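The same measurement can be scripted from inside Ruby. Here’s a small sketch (file counts kept small, and the method name is mine, not from the shell benchmark above):

```ruby
# Sketch of the benchmark in Ruby: create n empty files in a fresh
# temp directory, then time how long it takes to require them all.
require 'tmpdir'
require 'benchmark'

def time_requires(n, tag)
  Dir.mktmpdir do |dir|
    (1..n).each { |i| File.write(File.join(dir, "#{tag}_#{i}.rb"), "") }
    $LOAD_PATH.unshift(dir)
    begin
      Benchmark.realtime { (1..n).each { |i| require "#{tag}_#{i}" } }
    ensure
      $LOAD_PATH.shift
    end
  end
end

# On a quadratic-time Ruby (like 1.9.3), doubling n roughly
# quadruples the time; on a fixed Ruby it roughly doubles.
puts "n=200: %.4fs" % time_requires(200, "a")
puts "n=400: %.4fs" % time_requires(400, "b")
```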

Step 2: Locate the Problem — Profile

To finger the specific code inside MRI responsible for this sluggish performance, we’ll need a good profiler. The best profiler I’ve found for this purpose is the Linux ‘perf’ tool. It can tell us

  • which C functions we’re spending our time in;
  • which functions called those functions, and which called those, and so on, accounting for the most time;
  • even which code inside each function took time, down to the machine instruction, and annotated with source lines to help us follow what’s what.

Using perf in a basic way is very simple. We run the slow command with perf record:

  $ perf record ruby -I reqtest -e '(1..10000).each {|i| require "#{i}"}'

That writes to a file called perf.data (or you can pass a different filename with -o). Then perf report reads that file (or another, with -i) and organizes the results. They come in a nice in-terminal browser and look something like this, showing how much time we spend in each function:

   58.68%      ruby  libc-2.15.so        [.] 0x25820
   12.36%      ruby  ruby                [.] rb_string_value_ptr
    7.90%      ruby  ruby                [.] rb_feature_p
    6.19%      ruby  ruby                [.] loaded_feature_path.part.2
    5.83%      ruby  [kernel.kallsyms]   [k] 0xc10359d8

The third column here is the “DSO”. perf is a system profiler, so it captures time spent not only in our own code but in external libraries and the kernel. We’re interested in studying the Ruby interpreter’s own code, so let’s limit to the “ruby” DSO — we can either select it in the perf browser or say perf report -d ruby:

   37.20%  ruby  rb_string_value_ptr
   23.78%  ruby  rb_feature_p
   18.62%  ruby  loaded_feature_path.part.2
    3.18%  ruby  0x108e0
    0.83%  ruby  file_expand_path

So we’re spending 79.60% of our time in three functions. Two are part of the require implementation, but the biggest is a general-purpose function used all over MRI. To really nail the issue, we should find out what code is responsible for calling rb_string_value_ptr() so much. Solution: call-graph profiling. This takes a little setup but is well worth it.

Step 2, better: Call-Graph Profiling for MRI

When perf records that we’re spending time in a function, it can also record the whole call stack so we can blame the callers too. To do this it needs a little help from us when we compile the program we’re profiling, i.e. the Ruby interpreter — we need to pass the -pg option to gcc. Unfortunately -pg has a side effect which MRI doesn’t digest well, so we also need a patch to MRI to neutralize the side effect. Here’s the full recipe for beautiful call-graph profiles:

  1. Go to your MRI source checkout (git clone https://github.com/ruby/ruby if you don’t already have one), pick a directory to put your Ruby build in (say DIR=~/lib/ruby-trunk), and run
      autoconf; ./configure --prefix=$DIR
  2. Apply my patch to make MRI safe for call-graph profiling:
      $ curl https://raw.github.com/gist/4108905/c765937648ec5a06104b6c5a3381ac0898c2eea0/sigprof.patch \
          | git am

    (If you’re not on an up-to-date trunk, git am may fail; try patch -p1 instead and commit the result yourself.)

  3. Build Ruby and install it to the $DIR you chose:
      $ make debugflags='-ggdb -pg'; make install

    (-ggdb is already the default value of debugflags; we keep it and add -pg.)

  4. Run your slow job with perf record -g:
      $ perf record -g $DIR/bin/ruby -I reqtest -e '(1..10000).each {|i| require "#{i}"}'

    (If you use rbenv, a quick ln -s $DIR ~/.rbenv/versions/experimental; rbenv shell experimental eliminates the fussing with directories here.)

  5. Run perf report. You can also tune the display with the -g option; see perf help report.

And here are the results. I’ve taken advantage of the in-terminal browser to expand a handful of subtrees and leave others unexpanded.

  $ perf report -g graph,1 -d ruby
  -  32.16%  ruby  rb_string_value_ptr
     - rb_string_value_ptr
        + 28.38% rb_feature_p
        + 3.78% rb_require_safe
  +  27.41%  ruby  rb_feature_p
  -  22.10%  ruby  loaded_feature_path.part.2
     - loaded_feature_path.part.2
        + 21.00% rb_feature_p
        + 1.11% rb_require_safe

So the bulk of that time in rb_string_value_ptr() is actually the fault of rb_feature_p(). A summary of the time looks like this:

  77% rb_feature_p() (at least)
    27% directly
    28% in rb_string_value_ptr()
    21% in loaded_feature_path()
  23% other (including all GC, and some rb_feature_p() I missed)

We’re spending over 3/4 of our time in just one function, rb_feature_p(), either directly or in a couple of functions it calls. So it’s pretty clear what we have to fix.

Step 3: Fix

Our culprit rb_feature_p() is in load.c, which is the file that implements the require method. Here it is (as it was before my patches) stripped to the essential logic, highlighting the two slow calls it makes:

  rb_feature_p(const char *feature, const char *ext, int rb, int expanded, const char **fn)
  {
      features = /* $LOADED_FEATURES aka $" */;
      /* for v in features */ {
          f = StringValuePtr(v);                                              /* slow call #1 */
          if (loaded_feature_path(f, /*f.size*/, feature, /*feature.size*/))  /* slow call #2 */
              return 'u';
      }
      /* also check get_loading_table() */
      return 0;
  }

It takes $LOADED_FEATURES, which is the array of all the files we’ve loaded on previous #require calls — thousands of files, in a Rails app with many dependencies — and it iterates through that whole array, every time #require is called. That’s what makes the startup time slow. And that’s why we slow down as we require more files: the array keeps growing, so requiring N files costs on the order of N²/2 comparisons in total.

Connoisseurs of algorithms and data structures know how that search should work to be faster — instead of searching the whole list of previously loaded files as an array, it should use a hash table.
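The shape of that fix can be sketched in Ruby (an illustration of the idea, not MRI’s actual C code; the names here are made up):

```ruby
# Illustration: membership by scanning an array is O(n) per require,
# while a hash-based index answers the same question in O(1).
def slow_loaded?(loaded_features, feature)
  loaded_features.any? { |f| f == feature }  # rescans the whole array
end

def fast_loaded?(feature_index, feature)
  feature_index.key?(feature)                # a single hash lookup
end

loaded_features = ["foo.rb"]                 # stands in for $LOADED_FEATURES
feature_index   = { "foo.rb" => true }       # the added index

raise unless slow_loaded?(loaded_features, "foo.rb")
raise unless fast_loaded?(feature_index, "foo.rb")
raise if fast_loaded?(feature_index, "bar.rb")
```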

In this case, there are complications:

  • We’re not looking for an exact string match, but rather for a match accepted by loaded_feature_path(), which is complex. That means a simple hash table with the elements of $LOADED_FEATURES as keys won’t do the job.
  • $LOADED_FEATURES is exposed as part of the Ruby language, so we can’t just ditch it in favor of a more efficient data structure.
  • $LOADED_FEATURES is mutable in Ruby, so Ruby code could modify the array and expect the behavior of require to follow along. In case anybody out there is doing that, we’d like to follow along too. That means we have to somehow keep our better data structure consistent when the array has been modified.

(Whether these semantics are a good idea is another matter. Compare Python’s approach, which is simple to understand and simple to efficiently implement.)
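The third complication is real, observable behavior. Here’s a small demonstration that require tracks mutations of $LOADED_FEATURES (the file and the global counter name are throwaways of my own):

```ruby
# Demonstrate that require follows mutations of $LOADED_FEATURES:
# removing a feature's entry makes the next require load the file again.
require 'tmpdir'

$demo_loads = 0
Dir.mktmpdir do |dir|
  path = File.join(dir, "demo_feature.rb")
  File.write(path, "$demo_loads += 1\n")

  require path        # first require: loads the file
  require path        # already recorded: a no-op
  $LOADED_FEATURES.reject! { |f| f.end_with?("/demo_feature.rb") }
  require path        # entry gone, so the file loads again
end

raise unless $demo_loads == 2
```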

Fortunately, with some close study of the loaded_feature_path() code on one hand and a clever use of a little-used feature of the Array implementation on the other, it’s possible to solve all three of these problems. For details, see my $LOADED_FEATURES-index patch and my overall patch report.

Step 4: Repeat

How fast are we now? In our benchmark:

  $ time ruby -I reqtest -e '(1..10000).each {|i| require "#{i}"}'

             time     speedup   avg. /
     N   (best of 5)             file
   2000     0.145       2.0x     0.07ms
   4000     0.241       3.5      0.06
   8000     0.700       4.0      0.09
  16000     2.547       4.2      0.16
  32000    10.053       4.2      0.31

We’re a lot faster than before — over 4x faster in the large cases, which is pretty nice. But we still get slower and slower the more files we require, still taking quadratic time.

And what about the task I really care about — starting our dependency-laden Rails application? 8.44 seconds. That’s 29% shorter than before, three-and-a-half sweet seconds less waiting time, but we should be able to do better.

So let’s rinse and repeat. What’s still slow in the new version? Here’s the top of the profile, with one item partly expanded to show its origin:

  $ perf report -g graph,3 -d ruby
  -  21.07%  ruby  st_foreach
     - st_foreach
        + 20.77% gc_mark_children
  +  19.74%  ruby  gc_mark
  +  16.92%  ruby  gc_mark_children
  +   7.48%  ruby  st_lookup
  +   6.00%  ruby  slot_sweep
  +   4.18%  ruby  mark_keyvalue

All the functions with ‘gc’, ‘mark’, or ‘sweep’ in their names belong to the garbage collector. Adding them up, we have over 67% of our time consumed by the garbage collector. So if we want to make much more progress, we’re going to have to cut down on that time. That means either making the GC itself more efficient — a tough job, though there is lots of room for improvement — or eliminating whatever allocations are giving it so much exercise.
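From the Ruby side, a quick sanity check on GC’s share of a workload (without perf) is the built-in GC::Profiler. A sketch, with some throwaway allocation-heavy busywork standing in for the real job:

```ruby
# Measure how much of a workload's wall-clock time goes to GC,
# using Ruby's built-in GC::Profiler.
require 'benchmark'

GC::Profiler.enable
total = Benchmark.realtime do
  200_000.times { "x" * 50 }        # allocation-heavy busywork
end
gc_time = GC::Profiler.total_time   # seconds spent in GC runs
GC::Profiler.disable

puts "total: %.3fs, in GC: %.3fs" % [total, gc_time]
```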

Finding what those allocations are requires its own set of techniques, in “GC profiling” or “allocation profiling”. In my next post, I’ll write about how to do that and what I did with the results to make Ruby application startup another notch faster.

(You should follow me on Twitter here or Solano here so you see that next post.)

