Silly Little Parrot Optimizations

chromatic on 2009-05-06T21:20:32

I've spent the past three days profiling Parrot and Rakudo startup times. Christoph Otto and Vasily Chekalkin did some great work on a roadmap item I added a while back -- specifically removing thousands of exported symbols from our shared library. (The more symbols you export, the longer dynamic linking takes.)

As every successful optimization changes the performance profile of your application, I've found some interesting bottlenecks. Some of them are even amusing, in the forehead-slapping "You're such a geek if you find this funny" way.

For example, my most recent trace showed that calculating the correct method resolution order in Rakudo classes created a lot of PMCs. In particular, a section of the algorithm removes an entry from an array by index. (In Parrot vtable terms, this is a delete_keyed_int operation.) Every instance of this PMC is a ResizablePMCArray, roughly akin to your standard Perl 5 array. For some reason, RPA had no specific implementation of the delete_keyed_int, which takes a primitive integer value and removes the PMC at that index in the array.

Instead, the RPA fell back to the default implementation of that operation. It takes the primitive integer and constructs a PMC Key from it in a boxing operation. Then it performs the original PMC's delete_keyed operation, which takes a Key PMC and removes the PMC identified by that key in the array.

RPA defines that operation. The first thing it does is to extract a primitive integer value from the Key.

I added a local delete_keyed_int definition and rewrite delete_keyed in terms of the latter. Rakudo now starts up 1.34% faster and allocates 9.38% fewer PMCs -- and all of those PMCs are PMCs with very short lifespans that would never survive the first garbage collection run. Avoiding even that is a performance improvement.

I estimate that Rakudo starts up nearly 40% faster now than it did when I started on Sunday night. We can get it faster yet.


A bas les symboles

slanning on 2009-05-07T06:04:35

I've spent the past three days profiling Parrot and Rakudo startup times. Christoph Otto and Vasily Chekalkin did some great work on a roadmap item I added a while back -- specifically removing thousands of exported symbols from our shared library. (The more symbols you export, the longer dynamic linking takes.)

How do you, roughly speaking, go about removing thousands of symbols from a library?

You said that, due to the local delete_keyed_int, Rakudo starts up 1.34% faster; and you said that Rakudo starts up 40% faster than it did on the weekend. Is a lot of that from removing the extra symbols, or is it mostly from other optimizations? (In other words, do you know how much speed-up results from removing the symbols?)

P.S. I realize you're probably not responsible for the Rakudo page, but it needs a short description of Rakudo somewhere. As someone who doesn't follow Perl 6 that closely (and/or is getting senile), I just sometimes forget what the names refer to. On the other hand, the page is claiming to be a "one stop" for Rakudo news, so maybe that implies that whoever is reading it is at least minimally aware of what Rakudo is. But on the Parrot page, for example, there is a paragraph at the top of the home page saying what it is; but when I went to the Rakudo page, I was looking around for an About button/page but there doesn't seem to be any.

Re:A bas les symboles

chromatic on 2009-05-07T20:27:47

How do you, roughly speaking, go about removing thousands of symbols from a library?

We removed their external visibility. (Apologies if you know all of this; I think it might be interesting to other readers.)

When you build a shared library in C, some of the functions and variables are usable by programs which use the shared library -- that's the point of building a shared library. On some platforms you must explicitly mark these functions as externally visible. Windows is an offender here. On most Unix-like platforms with which I'm familiar, all symbols are externally visible by default.

We suffered from many failures at the linking stage where Windows builds would fail because we added functions but forgot to mark them as exported. Modern versions of GCC include a flag which flips the external visibility of symbols: by default, they're invisible outside the shared library. You have to add an explicit "export this symbol" annotation so that code which uses the shared library can access them.

We enabled that flag quite a while ago so that we'd stop breaking the Windows build accidentally. With that change, any time we added a symbol which should be externally visible but forgot to annotate it correctly, even our Linux and BSD builds would fail at the linking stage, so we could correct it without waiting for feedback from a Windows tester.

That was great, but where we should be parsimonious with symbol exporting, we were overly generous. In particular, many generated functions which should never be visible nor accessible outside of libparrot itself were in the listed of exported symbols.

When your dynamic loader loads a shared library, it has to relocate all exported symbols that you use. Within the shared library, the compiler may have resolved all of the addresses of functions to local offsets, there's no way to predict at linking time where the library will be in memory. Thus if you want to call a function defined in a shared library, you have to resolve its offset within the shared library with the starting memory location of the library itself.

The more externally visible symbols you have to relocate, the more time the dynamic linker spends relocating them (and the more expensive it is to find the correct symbol).

Bacek and Christoph changed a code generator so that we could remove the visibility of several thousand symbols. I changed another code generator to remove a few thousand more. We still export too many, but two individual changes made a huge improvement in startup time.

(In other words, do you know how much speed-up results from removing the symbols?)

The first change cut Parrot's startup time in half. It had less of an effect on Rakudo because Rakudo does a lot more initialization. The ~40% improvement in Rakudo came from other changes. I don't have a profile from before that commit handy, but I estimate it was less than a 1% improvement, as Rakudo spends its time elsewhere.

I'll pass on your comments about the Rakudo home page; that's useful feedback. Thanks.