Maybe We're Doing Concurrency Wrong in VMs

chromatic on 2008-10-23T04:46:10

I want to have multiple units of work scheduled and running in parallel with well-defined separation but specific and well-declared points of communication between them. I want preemption, memory protection, an abstract view of the underlying machinery and hardware, statistics, tuning, and a sane extension mechanism. Why does that sound familiar? Maybe it reminds me of something else you might find in computer science.


I will take CompSci for 1000 Alex.

Stevan on 2008-10-23T13:10:06

What are Operating Systems?

- Stevan

Re:I will take CompSci for 1000 Alex.

chromatic on 2008-10-23T17:42:32

Nailed it in one. I haven't figured out how an optimizer and a JIT work with this approach, however.

Re:I will take CompSci for 1000 Alex.

btilly on 2008-10-23T22:33:54

Then you haven't looked at the AS/400 operating system.

It works like this: a virtual machine sits on top of the regular machine, and the operating system and the programs both sit inside that virtual machine. When a program that hasn't yet been optimized runs, it gets recompiled to machine code, but neither that recompilation nor its result is visible from inside the virtual machine.
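As a toy sketch of the idea (this is not the AS/400's actual machinery; the entry point, counter, and threshold here are invented for illustration), the caller always goes through the same entry point, and the switch from interpretation to native code happens behind it:

    #include <stdio.h>

    typedef long (*native_fn)(long);

    /* Slow path: pretend this walks bytecode one instruction at a time. */
    static long interpret(long n)
    {
        long sum = 0;
        for (long i = 0; i <= n; i++)
            sum += i;
        return sum;
    }

    /* Fast path: pretend this is the machine code the recompiler produced. */
    static long recompiled(long n)
    {
        return n * (n + 1) / 2;
    }

    static native_fn cached = NULL;   /* filled in once "recompilation" happens */
    static int runs = 0;

    /* The only entry point callers ever see. */
    long run_program(long n)
    {
        if (cached)
            return cached(n);         /* native version, invisibly swapped in  */
        if (++runs >= 3)
            cached = recompiled;      /* "recompile" once the program is warm  */
        return interpret(n);
    }

    int main(void)
    {
        for (int i = 0; i < 5; i++)
            printf("run %d -> %ld\n", i, run_program(1000));
        return 0;
    }

Nothing above run_program() can tell whether the work was interpreted or recompiled; the AS/400 keeps that boundary below everything its virtual machine exposes.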

Re:I will take CompSci for 1000 Alex.

chromatic on 2008-10-23T23:46:01

Thanks for the reference. I think it might be different, though, because I don't want to replace the operating system, and because I'm considering making concurrency pervasive, even at the expression level.

Re:I will take CompSci for 1000 Alex.

btilly on 2008-10-24T18:27:52

While that is a cool abstraction, I think that for real performance it is the wrong direction to go. There is a non-trivial minimum overhead to having a new process, and somewhere on the way down to the expression level you'll find that parallelizing introduces more overhead than it saves.

For parallelizing "simple" code, what you really, really want to do is turn it into vector operations that can work in parallel on blocks of data, and then offload that work to a specialized vector processing unit, like a GPU. That strategy is not appropriate for general-purpose code, but when you hit an operation that can be parallelized this way, you will see amazing speedups. For computing RSA you get a 100-fold increase.
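As a rough sketch of what "turn it into vector operations" means (the four-wide float type and the scaling loop are just an illustration, not anything specific to RSA or to a particular GPU), compare a scalar loop with the same work written against the GCC/Clang vector extensions; a real GPU offload through CUDA or OpenCL has the same shape with a lot more ceremony:

    /* Sketch of turning a "simple" loop into vector operations, using the
     * GCC/Clang vector-extension syntax.  scale_scalar() and scale_vector()
     * compute the same thing; the second works on four floats per step. */

    typedef float v4f __attribute__((vector_size(16)));   /* 4 floats at once */

    void scale_scalar(float *out, const float *in, float k, int n)
    {
        for (int i = 0; i < n; i++)
            out[i] = in[i] * k;                 /* one element per iteration */
    }

    void scale_vector(float *out, const float *in, float k, int n)
    {
        v4f kv = { k, k, k, k };
        int i = 0;
        for (; i + 4 <= n; i += 4) {            /* four elements per iteration */
            v4f x;
            __builtin_memcpy(&x, in + i, sizeof x);   /* unaligned-safe load  */
            x = x * kv;
            __builtin_memcpy(out + i, &x, sizeof x);  /* unaligned-safe store */
        }
        for (; i < n; i++)                      /* scalar tail for leftovers   */
            out[i] = in[i] * k;
    }

Note that even here the vector version needs a scalar tail for the leftover elements; a little bookkeeping never goes away entirely, even in the good cases.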

Re:I will take CompSci for 1000 Alex.

chromatic on 2008-10-24T22:00:25

There is a non-trivial minimum overhead to having a new process...

At the OS level, sure -- you need a new process struct, new page tables (even if only to mark pages as COW), a new stack pointer, a new register backing store, and new process-local storage.

You don't need most of that in a VM, and you don't have that strict kernel/userspace divide. If you're clever, you can even use a smart generational garbage collector where creating most of those objects is effectively free.

That strategy is not appropriate for general-purpose code...

Which is why I want to experiment with a different strategy.

Re:I will take CompSci for 1000 Alex.

btilly on 2008-10-24T22:58:52

In the AS/400 operating system they don't need a lot of that stuff either. :-) Seriously, why not have every process use the exact same address space when there is no possible way to address a memory location that you haven't been given access to?

But that said, you still do have overhead. At a minimum you need to keep track of all of the "processes", be able to schedule them, and remember when two parallel pieces share a variable and when they don't. (If you parallelize a loop and a lexical variable inside the loop gets shared, then we have a problem.) If the code within the loop is cheap enough (say, a matrix operation), then this bookkeeping overhead can become very significant, particularly since on a single CPU thread of execution the overhead is pure loss -- it just wastes CPU cycles that could be used for useful computation instead.
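To make the shared-versus-private distinction concrete, here is a small OpenMP sketch (the array, variable names, and reduction are just an example, not anything from an existing VM): the programmer has to declare, per variable, whether the parallel iterations share it or each get their own copy -- exactly the bookkeeping a VM would have to do for you automatically.

    #include <stdio.h>

    #define N 1000

    int main(void)
    {
        double a[N], total = 0.0;
        for (int i = 0; i < N; i++)
            a[i] = i;

        double tmp = 0.0;   /* per-iteration scratch value */

        /* 'tmp' must be private: if the iterations shared one copy they
         * would stomp on each other.  'total' is genuinely shared, and the
         * reduction clause tells the runtime to combine per-thread copies
         * safely at the end. */
        #pragma omp parallel for private(tmp) reduction(+:total)
        for (int i = 0; i < N; i++) {
            tmp = a[i] * a[i];
            total += tmp;
        }

        printf("total = %f\n", total);
        return 0;
    }

Built with -fopenmp the loop runs across threads; without it the pragma is ignored and the same code runs serially, which is the single-CPU case where all that bookkeeping would have been pure loss.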

Re:I will take CompSci for 1000 Alex.

Aristotle on 2008-10-25T00:00:47

The Haskell people tried parallelism at the expression level, which works even better for them because side effects are contained, but they found just what Ben said: strong data dependencies between nearby operations permeate code at the micro scale, so parallelizing there introduces the potential for debilitating overhead. The algorithm really has to be designed with parallelism in mind; what the language can do is provide a very easy way to declare a desire for parallelism explicitly while hiding all the details necessary to make it work.

Re:I will take CompSci for 1000 Alex.

chromatic on 2008-10-25T10:38:13

Did the Haskell people write a paper about that? I'd like to read it. (You can solve more of the data-dependency problem if you use a register machine with a smart register allocator to identify dependencies... though the algorithm to calculate expected call graphs through a compilation unit may be somewhat hairy.)

Re:I will take CompSci for 1000 Alex.

Aristotle on 2008-10-25T19:26:24

I’m afraid I do not know. I got that from a talk I watched about concurrency in Haskell. Unfortunately I do not remember where I saw the link, or when or where it was given, or by whom, and I hardly remember any of the details of the concurrency sugar they created and why that design was chosen. These assertions about no free concurrency lunch (which I think the speaker briefly went into based on an attendee question, not as part of the prepared talk) were the only bit that really stuck with me.

Sorry. :-(