Parallel foreach

acme on 2006-07-06T13:59:21

Computers are getting faster, that's for sure. However, they are also getting more cores: new laptops these days are dual-core and servers are four-or-more-core. "Core" is a fancy word for something a bit like another processor. I happen to have lots of things I want to process independently, but if I only use a single process I'll only use one core, a quarter of those available on my server. That's wasting CPU power. The solution is to do more than one thing at a time, and common solutions for this are threading and forking. I've found a particularly neat solution which is a very nice idiom too: parallel foreach with Proc::ParallelLoop. Something along the lines of the following, which will parallelise the loop by forking 4 workers at a time:

pareach \@todo, \&generate, { Max_Workers => 4 };

The nice thing is that Linux balances each long-lived process on a core, and suddenly my program runs about four times faster! (Okay, so cores aren't complete CPUs - they tend to have dismal floating point performance - but all I am doing are integer calculations, so it is mostly the same in my case.)
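For anyone curious what "forking 4 workers at a time" amounts to, here is a rough sketch using only core Perl. It is not Proc::ParallelLoop's actual implementation: parallel_each is a made-up helper name, and the generate below is just a stand-in integer job, since the real one isn't shown. The idea is simply to fork one child per item and never have more than a fixed number alive at once.

```perl
use strict;
use warnings;

# Hypothetical helper (NOT part of Proc::ParallelLoop): run $work on each
# item in a forked child, keeping at most $max children alive at once.
# Returns the number of children that exited with a non-zero status.
sub parallel_each {
    my ($items, $work, $max) = @_;
    my $running = 0;
    my $failed  = 0;

    for my $item (@$items) {
        # At the limit: block until one child exits before forking another.
        if ($running >= $max) {
            wait();
            $failed++ if $? != 0;
            $running--;
        }

        my $pid = fork();
        die "fork failed: $!" unless defined $pid;

        if ($pid == 0) {
            # Child: process one item, then exit so it never re-enters the loop.
            $work->($item);
            exit 0;
        }
        $running++;
    }

    # Parent: reap whatever children are still running.
    while ($running > 0) {
        wait();
        $failed++ if $? != 0;
        $running--;
    }
    return $failed;
}

# Stand-in for generate(): any independent, integer-only, CPU-bound job.
sub generate {
    my ($n) = @_;
    my $sum = 0;
    $sum += $_ for 1 .. $n;
    return $sum;
}

my @todo   = (100_000, 200_000, 300_000, 400_000, 500_000);
my $failed = parallel_each(\@todo, \&generate, 4);
print "failed workers: $failed\n";
```

Note that each child is a separate process, so anything it computes stays in the child: return values are thrown away here, and a real job would write its output to a file, a pipe, or a database. That is the usual trade-off of forking compared with threading.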

There are many modules on the CPAN which do something similar, but I particularly like the fact that this is a cute idiom: exactly like a foreach, but slightly parallelised. Neat!

See you at the London.pm social meeting tonight!

Multiple Cores and Floating Point Parallel-Each

n1vux on 2006-07-06T22:07:42

so cores aren't complete CPUs - they tend to have dismal floating point performance

That may depend upon brand and architecture. But really, Parallel-Each dispatch of floating point operations would be the wrong way, multi-core or single-core SMP.

Sun - UltraSparc T1 aka Niagara, T2000 - True.
*, **.
Prior USiii dual-core models had 1 FPU per Core, but it wasn't safe to max them. ***.
IBM/Apple - PowerPC E.g., QuadCore PowerMac G5 - False.
2xFPU/core = 8 FPUs in the 2 socket, 4 core G5 *, **
IBM - Power5 - False.
2 FPU per Core, since Power3 at least. Always had it, always will.
Intel - Dual-Core Xeons (IA32) - Maybe.
Has 4 shared FPUs for MMX/SSE SIMD (see discussion below). Serious High Performance Computing vendors are using this chip, where Floating Point counts,* but it reportedly has fewer FPUs per core than the old server-grade single-core P4s, being more comparable to the Pentium-M** (see below).
AMD - similar to Intel IA32?
64bit SSE FP will be faster than 32bit default x87 FP.*
Intel/HP - Itanium 2 - Maybe, *

The issue with Turion64 and Intel Duo-Core / Xeon is largely that they're based on the lower-power Mobile versions of the core. This is apparently because power and heat are the limiting design factors today. Heat, power, and cores sharing the FSB are also why the clock speeds aren't as fast as those of the prior single-core server chips.

But more to the point, if you have floating-point arrays to parallel-each over, you want to do it with native arrays and hardware dispatch, not a Perl list structure and a perl-core XS module loop. This applies to any serious Fortran engine's FPU Vector mode as well as to IA32 MMX SIMD. PDL already has native representation and parallel SMT threaded multiple dispatch, and has a TODO to use the native FP vectors. Even so, it should keep the FP pipeline much more active than a Perl-list-based parallel-each, plus its CPAN packages interface to scientific libraries that can use vector-mode FP if acquired/built that way (e.g., pick a BLAS lib for Gnu Sci Lib).

Do Perl 6 Junctions do this anyway?

n1vux on 2006-07-14T20:13:05

Isn't there a Perl 6 way to do this?