Refcounting Isn't All Bad

chromatic on 2008-04-22T06:50:04

A lot of people complain about reference counting as a memory management strategy in Perl 5, and they have a point. Unless you're very careful, reference counting spreads its tendrils throughout your system. Unless you're very disciplined, you'll forget to increase a reference count somewhere, and you'll end up collecting things in the wrong places.

The paper A Unified Theory of Garbage Collection discusses the most popular modern GC strategies to demonstrate how reference counting and tracing garbage collection are roughly equivalent, especially as concerns about performance, correctness, and conservativeness increase.

I explain this because I added reference counting to a small part of Parrot today as an optimization -- 11.73% on my Rakudo-building benchmark.

Parrot has an internal data structure called a stack. This represents a particular environment within the call graph. If you think of a normal program flow, where function A calls function B calls function C which returns to B and returns to A, you'll have a sense of a normal call stack. Parrot's stack is similar, except that we use continuation passing style (CPS), where it's not a stack, it's a graph, and control flow doesn't have to be that linear.

There's plenty of literature on CPS (including Luke Palmer's how to get a free sandwich with continuations, but the important idea is that function A sets a bookmark inside and passes that bookmark along to function B. B does the same thing when it calls C. At any point, you can look up any bookmark you happen to have and magically teleport (okay, it's just a jump instruction in your processor -- someday I'll explain how very low level languages have eval and very high level languages have eval and it's just the poor jerks in the middle who can't do anything useful without feeding the world's pointiest programming language into their AbstractFactoryFactoryFactories) to the point of that bookmark.

Ranty story short, our stack is a mostly-acyclic graph represented as a linked list, and stack entries -- or chunks -- are garbage collectable entities.

If you've even sat next to one of my "Let's Make Parrot Go Faster!" entries on a bus somewhere, or heard it playing softly in an elevator, you know that one good way to make Parrot go faster is to use fewer garbage collectable entities. One way to make an O(2n) algorithm faster is to make it an O(log n) algorithm. Another way is to make n smaller, and that's easier for now, especially because all I have to do is watch over Andrew Whitworth who's going to write a nice new garbage collector and keep track of Senaka Fernando, who's going to connect the nice garbage collector from Apache Harmony into Parrot.

One nice feature of reference counting (ah, now you see the connection!) is that you can recycle dead objects as soon as you know they're dead. One sad misfeature of a conservative garbage collector is that dead bodies stack up for a while until you can mulch them. One horrible thing about a stop-the-world full mark-and-sweep garbage collector (and try to guess what Parrot has at the moment) is that if you're out of memory in one of the arenas you care about, you have to mark all of the living GCable entities in your system and sweep all of the pools in all of the arenas to find everything that's alive and to recycle everything that isn't.

If you want to go fast, you want to avoid this until it's absolutely necessary.

Here's the interesting thing about stack chunks: Parrot stores them in one place in one function (stack_push) and unstores them in one place in one function (stack_pop). Again, the names are really bad because it's a graph and not a stack, but you get the point.

I noticed this when I profiled the Rakudo-building benchmark again and saw that stack_push was one of the most expensive inclusive calls, apart from the garbage collector. It calls a function which calls a function which calls a function which asks for a new stack chunk from the GC. When there are no more free stack chunks, the GC runs again, and that eats up precious cycles.

After a few minutes of thinking, I realized that the stack_push/stack_pop pair formed a boundary around a stack chunk's lifespan. Popping a chunk off of the stack meant that nothing else needed it. That chunk is immediately identifiably a dead object, suitable for recycling.

Rather than waiting for a full GC run to find every live object in the system and then run across this dead object and recycle it, I added a couple of lines of code to recycle it automatically then and there.

There was my immediate 11.73% speedup.

There was also a problem in the Continuation PMC tests; I noticed this only after I patted myself on the back for moving the bottleneck in my benchmark elsewhere. I'm sure you'll see it if you think for a moment, especially about how I kept writing that the stack isn't a stack, it's a graph implemented as a linked list.

The problem is that there's not a single linked list. If you take a continuation at various points in the call graph, you can have several linked lists that all share the same tail. If you exhaust one of those call chains, you may pop a stack chunk that's part of another call graph elsewhere. Your reference to it goes away, but it's still live elsewhere, so if you recycle it then and there, you may end up reusing the chunk and storing inappropriate data in it while something else expects it to stay pristine.

If you're very clever, you've figured out the solution.

I added a reference count to the stack chunk structure, incremented it on stack_push, decremented it on stack_pop, and recycled the chunk in the latter function only if the refcount is zero.

Is this a long-term solution? Probably not; I posted my initial investigation to the parrot-porters list, and we had a nice discussion on alternate techniques. We'll have to refactor the code more heavily once we find a better approach, but for now, I'll live with this optimization.

(Seneca Cunningham found what appears to be a crash related to this, but I think I've found the culprit and fixed it by adding another reference count pair to the Continuation PMC. See? I told you reference counting made itself pervasive.)


refcounting good

bart on 2008-04-22T19:33:19

I'm not sure why it couldn't be the basis for a permanent solution. All I can see that's wrong about refcounting is that occasionally it misses releasing memory that technically could be released, due to circular references.

So, a combination of refcounting for the quick releases and mark-and-sweep for the stuff that it overlooked, looks like a perfect marriage to me.

Re:refcounting good

chromatic on 2008-04-22T19:55:17

If we can keep the refcounting simple and hidden behind a single interface that every consumer of the refcounted objects use, it's fine and good. My goal is to keep refcounting from spreading throughout the system, to avoid creating subtle bugs. Circular references aren't a problem here. Premature freeing (based on refcounts) is.

New GCs?

RasterBurn on 2008-04-23T04:07:58

c, could you in a future post go into more detail about what to expect in the new GC? I've never heard of Apache Harmony before.

BrowserUk’s thoughts on GC and refcounting

Aristotle on 2008-04-23T05:21:39

“I believe there is a third way (And no, I’m not talking about Tony :) )”

Re:BrowserUk’s thoughts on GC and refcountin

chromatic on 2008-04-23T05:58:56

It's a decent idea. It allows for a copying/compacting collector, which is very handy. I don't immediately see how to decrease refcounts, but it'll probably come to me in a little while.

The whiteboard in my office has some notes on a similar system I've toyed with a little bit. My existing code there implements a bootstrapping garbage collector (that is, the memory used to store GC info is itself garbage collected).