Chromatic always knows the right thing to say, even if I'm not always quick enough to pick up on it. He was helping me do some debugging last night, and came out with this little gem that I never noticed:
(chromatic) The scheduler gets marked and marks its kids, but they get swept and it doesn't.
This didn't make any sense to me at the time. If the scheduler is getting marked and put on the queue, then all its children should also be marked. All the objects travel through the pobject_lives interface, so the kids should be treated the same as the parent. However, this ignores a problem that occurs inside pobject_lives. A few lines later, chromatic has this insight:
(chromatic) I ran into this when I was working with constants.
(chromatic) If something's already marked live, we don't mark its kids.
Still, nothing clicks for me. I set a few breakpoints as per his recommendation and run a few debugging sessions. After making a few changes, I get a segfault on a dereferenced null pointer inside pobject_lives. I go take a look at it and see this line (or an approximation thereof):
if(card_mark == GC_IT_CARD_BLACK) return;
Here is the problem! Here is exactly the thing chromatic said the problem should be, and I wrote it, and I didn't even realize how big a problem it would be. Here's a rundown:
When I allocate a new object, I allocate it as black. I do this for a few reasons, but mostly to ensure a newly-created object doesn't get prematurely swept if we're doing an incremental (or even asynchronously-concurrent!) sweep. But because every object starts out black, that early return in pobject_lives fires the first time an object comes through: the object is never really marked, its kids aren't marked, and the children get swept even though they apparently have been marked.
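To make the failure concrete, here is a minimal sketch of the pattern, not the actual Parrot code. Only the name GC_IT_CARD_BLACK comes from the line above; the Obj struct, the other colors, the scanned flag, and both mark functions are hypothetical stand-ins I made up for illustration. The second function shows one possible way around the problem (tracking "kids already scanned" separately from the color), which is not necessarily what I ended up doing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical colors; only GC_IT_CARD_BLACK appears in the real code. */
enum { GC_IT_CARD_WHITE, GC_IT_CARD_GREY, GC_IT_CARD_BLACK };

#define MAX_KIDS 2

/* Hypothetical object header, just enough to show the problem. */
typedef struct Obj {
    int card_mark;               /* current color */
    int scanned;                 /* 1 once the kids have been walked */
    struct Obj *kids[MAX_KIDS];  /* NULL entries mean "no child" */
} Obj;

/* Build an object with no kids and the given starting color. */
static Obj make_obj(int card_mark) {
    Obj o = { 0 };
    assert(card_mark >= GC_IT_CARD_WHITE && card_mark <= GC_IT_CARD_BLACK);
    o.card_mark = card_mark;
    return o;
}

/* Stand-in for a mark routine with the early return: an object that
 * is already black is skipped, so its kids are never visited. */
static void mark_with_early_return(Obj *o) {
    if (o == NULL || o->card_mark == GC_IT_CARD_BLACK)
        return;
    o->card_mark = GC_IT_CARD_BLACK;
    for (size_t i = 0; i < MAX_KIDS; i++)
        mark_with_early_return(o->kids[i]);
}

/* One possible alternative: skip only objects whose kids were already
 * walked, so a born-black object still gets its children marked. */
static void mark_tracking_scan(Obj *o) {
    if (o == NULL || o->scanned)
        return;
    o->scanned = 1;
    o->card_mark = GC_IT_CARD_BLACK;
    for (size_t i = 0; i < MAX_KIDS; i++)
        mark_tracking_scan(o->kids[i]);
}
```

With a parent that was allocated black and a white child hanging off it, mark_with_early_return leaves the child white (and therefore sweepable), while mark_tracking_scan turns it black. The separate flag also keeps the recursion from looping forever on cyclic references, which simply deleting the check would not.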
So I fixed this problem (click, drag, delete). That segfault disappears, but now I'm failing assertions left and right (plus, null pointers seem to be floating around causing other problems).
Also, on chromatic's suggestion, I went back and redid the finalization code too, to ensure that objects are swept in reverse order. That was another gem from chromatic last night: he figured out that all the errors in the string tests were a direct result of my lack of finalization. ParrotIO PMCs are line-buffered, and the last line isn't properly output because we never finalize and therefore never flush the buffers.
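Here is a rough sketch of why a missing finalizer eats the last line of output. This is not ParrotIO; the LineBuf type and the lb_* functions are hypothetical names invented for the example, and the "output" is just an in-memory array so the effect is easy to inspect. It only illustrates the flush-on-finalize part, not the reverse sweep order.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical line-buffered writer with an in-memory sink. */
typedef struct LineBuf {
    char   pending[64];   /* bytes written but not yet flushed */
    size_t pending_len;
    char   out[256];      /* what has actually been "output" */
    size_t out_len;
} LineBuf;

/* Move pending bytes to the sink, truncating if the sink is full. */
static void lb_flush(LineBuf *b) {
    size_t room, n;
    assert(b != NULL);
    room = sizeof b->out - 1 - b->out_len;
    n = b->pending_len < room ? b->pending_len : room;
    memcpy(b->out + b->out_len, b->pending, n);
    b->out_len += n;
    b->out[b->out_len] = '\0';
    b->pending_len = 0;
}

/* Buffer the string, flushing on each newline or when the buffer fills. */
static void lb_write(LineBuf *b, const char *s) {
    assert(b != NULL && s != NULL);
    for (; *s != '\0'; s++) {
        if (b->pending_len == sizeof b->pending)
            lb_flush(b);
        b->pending[b->pending_len++] = *s;
        if (*s == '\n')
            lb_flush(b);
    }
}

/* The finalizer: the only place an unterminated last line gets out. */
static void lb_finalize(LineBuf *b) {
    lb_flush(b);
}
```

Writing "ok 1\nok 2" to a zero-initialized LineBuf leaves only "ok 1\n" in out; the trailing "ok 2" sits in pending until lb_finalize is called. If the collector never runs the finalizer, that last line is simply lost, which matches what the string tests were showing.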
I love chromatic, and when this whole thing is over, I'm going to bake him some cookies (if I can't find his address, I'll eat them in his honor).
Debugging garbage collectors is easy. At least, debugging your second garbage collector is easier.