Profiling Ruby, or, How I Made Rails Start Up Faster
Posted: November 20, 2012 Filed under: Performance, Ruby 6 Comments »Last week Ruby application startup got about 2-3x faster when four patches of mine were committed to the MRI/CRuby interpreter. The improvements will be in the upcoming Ruby 2.0 release.
The techniques I used to identify what was slow, so I could fix it, are accessible to anyone and are useful for diagnosing any performance problem in MRI or any C extension. But they’re a little obscure, so I thought I’d do my part to make them better known.
Here’s what startup looked like for our frontend Rails application at Solano, before I dug in to fix anything:
$ date; bundle exec rails c Thu Nov 15 17:08:09 EST 2012 Time.now Loading development environment (Rails 3.0.17) irb(main):001:0> Time.now => 2012-11-15 17:08:21 -0500
That’s 12 seconds from hitting enter on rails c to actually getting a prompt. (11.96s when I measure systematically.) And that’s with the latest 1.9.3 release — with 1.9.2 it’s twice as bad. So I decided to do something about it. That means going through the four basic steps of fixing performance problems:
- Measure the problem with a benchmark. As they say, you can’t improve what you don’t measure.
- Profile to locate the problem.
- Fix the problem.
- Repeat until fast.
Steps 1 and 4 are relatively easy. Step 2 can be easy with the right tool — but the right tool isn’t easy to find or to start using. So in this post I’ll go into extra technical detail on Step 2, to make that step easy for you, my reader, in the future.
Step 1: Measure
Why is it so slow? First off, to make the problem easier to study we’ll isolate one thing that shouldn’t be slow but is. Solano’s Rails app requires about 2000 different files, including all its dependencies:
$ bundle exec rails runner 'p $LOADED_FEATURES.size' 2125
So suppose we isolate just that bit — require a few thousand files, and make the files empty so there’s nothing else going on.
$ mkdir reqtest
$ for i in {1..10000}; do touch reqtest/$i.rb; done
$ time ruby -I reqtest -e '(1..10000).each {|i| require "#{i}"}'
That should be really fast. What actually happens?
time avg. /
N (best of 5) file
2000 0.286s 0.1ms
4000 0.838 0.2
8000 2.775 0.3
16000 10.677 0.7
32000 42.531 1.3
Hooray, it’s slow. That means we can focus on fixing this simplified case, and maybe that will fix our real problem.
First, it shouldn’t take a tenth of a millisecond just to require an empty file — that’s 100,000 ns, time enough to fetch 1000 separate items all the way from main memory. We don’t have that much real work to do.
But worse is that the more files we get, the slower and slower we run. We’re taking quadratic time. There’s no reason this shouldn’t be linear time. Why exactly is this so slow?
Step 2: Locate the Problem — Profile
To finger the specific code inside MRI responsible for this sluggish performance, we’ll need a good profiler. The best profiler I’ve found for this purpose is the Linux ‘perf’ tool. It can tell us
- which C functions we’re spending our time in;
- which functions called those functions, called those, etc, for the most time;
- even which code inside each function took time, down to the machine instruction, and annotated with source lines to help us follow what’s what.
Using perf in a basic way is very simple. We run the slow command with perf record:
$ perf record ruby -I reqtest -e '(1..10000).each {|i| require "#{i}"}'
That writes to a file called perf.data (or you can pass a different filename with -o). Then perf report reads that file (or another, with -i) and organizes the results. They come in a nice in-terminal browser and look something like this, showing how much time we spend in each function:
58.68% ruby libc-2.15.so [.] 0x25820
12.36% ruby ruby [.] rb_string_value_ptr
7.90% ruby ruby [.] rb_feature_p
6.19% ruby ruby [.] loaded_feature_path.part.2
5.83% ruby [kernel.kallsyms] [k] 0xc10359d8
The third column here is the “DSO”. perf is a system profiler, so it captures time spent not only in our own code but in external libraries and the kernel. We’re interested in studying the Ruby interpreter’s own code, so let’s limit to the “ruby” DSO — we can either select it in the perf browser or say perf report -d ruby:
37.20% ruby rb_string_value_ptr
23.78% ruby rb_feature_p
18.62% ruby loaded_feature_path.part.2
3.18% ruby 0x108e0
0.83% ruby file_expand_path
So we’re spending 79.60% of our time in three functions. Two are part of the require implementation, but the biggest is a general-purpose function used all over MRI. To really nail the issue, we should find out what code is responsible for calling rb_string_value_ptr() so much. Solution: call-graph profiling. This takes a little setup but is well worth it.
Step 2, better: Call-Graph Profiling for MRI
When perf records that we’re spending time in a function, it can also record the whole call stack so we can blame the callers too. To do this it needs a little help from us when we compile the program we’re profiling, i.e. the Ruby interpreter — we need to pass the -pg option to gcc. Unfortunately -pg has a side effect which MRI doesn’t digest well, so we also need a patch to MRI to neutralize the side effect. Here’s the full recipe for beautiful call-graph profiles:
- Go to your MRI source checkout (
git clone https://github.com/ruby/rubyif you don’t already have one), pick a directory (say~/lib/ruby-trunk) to put your Ruby build, and runautoconf; ./configure --prefix $DIR
- Apply my patch to make MRI safe for call-graph profiling:
$ curl https://raw.github.com/gist/4108905/c765937648ec5a06104b6c5a3381ac0898c2eea0/sigprof.patch \ | git am(If you’re not on an up-to-date trunk,
git ammay fail; trypatch -p1instead and commit the result yourself.) - Build Ruby and install it to the
$DIRyou chose:$ make debugflags='-ggdb -pg'; make install
(The
-ggdbflag is there by default.) - Run your slow job with
perf record -g:$ perf record -g $DIR/bin/ruby -I reqtest -e '(1..10000).each {|i| require "#{i}"}'(If you use rbenv, a quick
ln -s $DIR ~/.rbenv/versions/experimental; rbenv shell experimentaleliminates the fussing with directories here.) - Run
perf report. You can also tune the display with the-goption; seeperf help report.
And here are the results. I’ve taken advantage of the in-terminal browser to expand a handful of subtrees and leave others unexpanded.
$ perf report -g graph,1 -d ruby
- 32.16% ruby rb_string_value_ptr
- rb_string_value_ptr
+ 28.38% rb_feature_p
+ 3.78% rb_require_safe
+ 27.41% ruby rb_feature_p
- 22.10% ruby loaded_feature_path.part.2
- loaded_feature_path.part.2
+ 21.00% rb_feature_p
+ 1.11% rb_require_safe
So the bulk of that time in rb_string_value_ptr() is actually the fault of rb_feature_p(). A summary of the time looks like this:
77% rb_feature_p() (at least)
27% directly
28% in rb_string_value_ptr()
21% in loaded_feature_path()
23% other (including all GC, and some rb_feature_p() I missed)
We’re spending over 3/4 of our time in just one function, rb_feature_p(), either directly or in a couple of functions it calls. So it’s pretty clear what we have to fix.
Step 3: Fix
Our culprit rb_feature_p() is in load.c, which is the file that implements the require method. Here it is (as it was before my patches) stripped to the essential logic, highlighting the two slow calls it makes:
rb_feature_p(const char *feature, const char *ext, int rb, int expanded, const char **fn)
{
features = /* $LOADED_FEATURES aka $" */;
/* for v in features */ {
f = StringValuePtr(v);
if (loaded_feature_path(f, /*f.size*/, feature, /*feature.size*/))
return 'u';
}
/* also check get_loading_table() */
return 0;
}
It takes $LOADED_FEATURES, which is the array of all the files we’ve loaded on previous #require calls — thousands of files, in a Rails app with many dependencies — and it iterates through that array, every time #require is called. That’s what makes the startup time slow. And that’s why we slow down as we require more files — because this array gets longer.
Connoisseurs of algorithms and data structures know how that search should work to be faster — instead of searching the whole list of previously loaded files as an array, it should use a hash table.
In this case, there are complications:
- We’re not looking for an exact string match, but rather for a match accepted by
loaded_feature_path(), which is complex. That means a simple hash table with the elements of$LOADED_FEATURESas keys won’t do the job. $LOADED_FEATURESis exposed as part of the Ruby language, so we can’t just ditch it in favor of a more efficient data structure.$LOADED_FEATURESis mutable in Ruby, so Ruby code could modify the array and expect the behavior ofrequireto follow along. In case anybody out there is doing that, we’d like to follow along too. That means we have to somehow keep our better data structure consistent when the array has been modified.
(Whether these semantics are a good idea is another matter. Compare Python’s approach, which is simple to understand and simple to efficiently implement.)
Fortunately, with some close study of the loaded_feature_path() code on one hand and a clever use of a little-used feature of the Array implementation on the other, it’s possible to solve both of these problems. For details see the patches just linked to, or my $LOADED_FEATURES-index patch and overall patch report.
Step 4: Repeat
How fast are we now? In our benchmark:
$ time ruby -I reqtest -e '(1..10000).each {|i| require "#{i}"}'
time speedup avg. /
N (best of 5) file
2000 0.145 2.0x 0.07ms
4000 0.241 3.5 0.06
8000 0.700 4.0 0.09
16000 2.547 4.2 0.16
32000 10.053 4.2 0.31
We’re a lot faster than before — over 4x faster in the large cases, which is pretty nice. But we still get slower and slower the more files we require, still taking quadratic time.
And what about the task I really care about — starting our dependency-laden Rails application? 8.44 seconds. That’s 29% shorter than before, three-and-a-half sweet seconds less waiting time, but we should be able to do better.
So let’s rinse and repeat. What’s still slow in the new version? Here’s the top of the profile, with one item partly expanded to show its origin:
$ perf report -g graph,3 -d ruby
- 21.07% ruby st_foreach
- st_foreach
+ 20.77% gc_mark_children
+ 19.74% ruby gc_mark
+ 16.92% ruby gc_mark_children
+ 7.48% ruby st_lookup
+ 6.00% ruby slot_sweep
+ 4.18% ruby mark_keyvalue
All the functions with ‘gc’, ‘mark’, or ‘sweep’ in their names belong to the garbage collector. Adding them up, we have over 67% of our time consumed by the garbage collector. So if we want to make much more progress, we’re going to have to cut down on that time. That means either making the GC itself more efficient — a tough job, though there is lots of room for improvement — or eliminating whatever allocations are giving it so much exercise.
Finding what those allocations are requires its own set of techniques, in “GC profiling” or “allocation profiling”. In my next post, I’ll write about how to do that and what I did with the results to make Ruby application startup another notch faster.
(You should follow me on Twitter here or Solano here so you see that next post.)
[...] Profiling Ruby (or How I Made Rails Start Up Faster) Four steps: measure, find the problem, fix, and repeat. [...]
This patch looks awesome. Are there instructions for back porting this to ruby 1.9.3?
Yes, there are. You can install ruby 1.9.3-p194 using these instructions if you’re using rbenv:
http://bugs.ruby-lang.org/issues/7158#note-1
or these if you’re using RVM:
https://gist.github.com/3993269
Nice! How do these patches relate to the falcon patch?
https://gist.github.com/1688857
Can they be used together? This is getting tough to keep track of
.
Hmm — it looks like that version of the falcon patches actually incorporates a backport of my changes. =) Yura aka funny-falcon says so in his description (https://gist.github.com/4136373 ), and the require-related changes look like my code.
Previous iterations of the falcon patches included two patches that accomplish the same purpose as my changes. There are some differences in technique, which helped these changes get into the Ruby mainline; they’re mentioned in my report at http://bugs.ruby-lang.org/issues/7158 .
[...] I will find the time to learn Profiling Ruby, or, How I Made Rails Start Up Faster for Python and/or PHP if just so I really understand what goes on under the hood. (And you should [...]