Parallel tests for the win

brian_d_foy on 2008-08-09T13:54:00

Test::Harness 3 has the ability to run regression tests in parallel. Until recently, in parallel mode, it treated all tests as one pool, running them in whatever order kept the desired number of jobs busy. This is certainly KISS-approved, and fine if any arbitrary pair of your tests can run simultaneously without tripping over each other by using the same resources, such as test databases or temporary files. However, some distributions have tests such as 00-setup.t and 99-teardown.t, and rely on the tests running in order. The perl core is a broad church, and contains such distributions (as well as its own tests that trip over each other), so this wasn't something we could use yet.
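In the simple pool mode, all it takes is prove's --jobs switch, for example to keep three tests running at once:

    prove --jobs 3 t/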

Aware of this, Andy Armstrong and others worked on a way of giving the harness a set of rules describing what can, and more importantly can't, run in parallel. This appeared in release 3.11, but there were some bugs and leaks, and it wasn't really good to go until 3.13, which I merged into the core 10 days ago.
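For the curious, the rules interface looks something like this sketch (the globs and job count here are illustrative, not from a real configuration): anything inside a { seq => ... } group runs in order within that group, and everything listed under par may interleave:

    use TAP::Harness;

    my $harness = TAP::Harness->new({
        jobs  => 3,
        rules => {
            # Each { seq => ... } group keeps its own tests strictly in
            # order; otherwise the scheduler is free to interleave.
            par => [
                { seq => 't/ordered/*.t' },  # e.g. 00-setup.t .. 99-teardown.t
                '*',                         # anything else: free-for-all
            ],
        },
    });
    $harness->runtests(@tests);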

Nothing happened. Well, obviously, because if you want a job doing, you have to do it yourself.

So first I experimented on the work code. Our unit tests ran in about 15 minutes on my machine. As an aside, I'd also tried running Devel::NYTProf on the tests, and identified a big bottleneck (specifically, "why are we regenerating these test SQLite databases from fixed data at the start of several tests, when we could generate them once earlier, and just copy them on disk in the test, prior to opening the temporary copy?"), which my colleague fixed, and got us down closer to 10 minutes.
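The fix amounted to something like the following sketch (the paths and names are invented for illustration): build the SQLite file once up front, then have each test copy the prebuilt file rather than re-running the fixture setup. Note the fixed name of the copy; that detail comes back below.

    use strict;
    use warnings;
    use File::Copy qw(copy);
    use DBI;

    my $master  = 't/fixtures/master.sqlite';  # generated once, up front
    my $scratch = 't/tmp/scratch.sqlite';      # note: a fixed name

    # Copying the file wholesale is much cheaper than re-running the
    # fixture INSERTs at the start of every test.
    copy($master, $scratch) or die "copy $master -> $scratch failed: $!";

    my $dbh = DBI->connect("dbi:SQLite:dbname=$scratch", '', '',
                           { RaiseError => 1 });
    # ... run the test against the temporary copy ...
    END { unlink $scratch }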

I tried running them all in parallel, but things broke - just like the core, because of temporary file names that clashed. But I noticed that at least 50% of the time was spent in tests with names such as 000-strict.t (which verifies that everything uses strict) and suchlike - meta tests that validate constraints, and definitely don't write to ill-named temporary files or talk to databases. So I wrote a little program that searched for tests, partitioned them into two groups, and ran the groups in parallel. Initially I had the meta tests running in series, but soon tweaked it to run them all in parallel. Run time went down from 10 minutes to about 6. WIN. Arguably I could post the script here, but it's at work and I'm not, so FAIL. Oops.
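Since I can't post the real script, here's a hypothetical reconstruction of the idea, with the meta-test naming pattern guessed: every meta test becomes its own parallel unit, while everything else stays in series with each other:

    #!/usr/bin/perl
    # Hypothetical reconstruction - the real script lives at work.
    use strict;
    use warnings;
    use File::Find;
    use TAP::Harness;

    my @tests;
    find( sub { push @tests, $File::Find::name if /\.t\z/ }, 't' );
    @tests = sort @tests;

    # Guess at the meta-test naming convention: 000-strict.t and friends.
    my @meta = grep {  m{/\d{2,3}-\w+\.t\z} } @tests;
    my @rest = grep { !m{/\d{2,3}-\w+\.t\z} } @tests;

    my $harness = TAP::Harness->new({
        jobs  => 4,
        rules => {
            par => [
                @meta,               # each meta test may run concurrently
                { seq => \@rest },   # the rest run in series with each other
            ],
        },
    });
    my $aggregator = $harness->runtests(@tests);
    exit( $aggregator->has_errors ? 1 : 0 );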

As another aside, it transpired that our inability to run everything in parallel was because the recently added copy functionality used a fixed file name! Fortunately my colleague had written it very cleanly, with the name generation in a method, and simply adding $$ to the filename there was enough to let us run everything in parallel, so we could just move over to using the --jobs option to prove. Except that my colleague was considering parsing /proc/cpuinfo to work out automatically how many jobs to run, which understandably a portable tool like prove shouldn't do, so we'll probably stick with our program.
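The fix was roughly this (the method name is invented for illustration): because the name generation lived in one method, appending $$ (the process ID) de-conflicted every concurrent test in one line:

    # Before: every test process fought over the same file name.
    sub scratch_copy_name { 't/tmp/scratch.sqlite' }

    # After: $$ gives each test process its own private copy.
    sub scratch_copy_name { "t/tmp/scratch.$$.sqlite" }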

So then I tried transferring my skills to the Perl core tests. I knew that the tests in t/op/*.t as-is wouldn't run in parallel, so initially I tweaked the harness script to run the tests within each directory in series, with the directories in parallel. Also, because of the way the tests are set up, I still wanted all the lower-level tests (t/*/*.t) to run before starting all the extensions' and libraries' tests, which in practice meant a big wait until the slow tests in op finished. (Serial offenders.) Tests all passed, but there was no big speedup. OK, to be accurate, tests all passed once I tweaked the temporary file name generation in t/test.pl that the runperl function (and derivatives) uses to run something in a fresh perl. It turns out that was the biggest source of clashing temporary file names.
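The t/test.pl tweak is the same trick as at work. A minimal sketch of the idea (the real code differs in detail): bake the PID into the generated names so two concurrent tests can't pick the same file:

    # Names like tmp12345A, tmp12345B ... are unique per process, and
    # the -e check guards against leftovers from earlier runs.
    my $suffix = 'A';
    sub tempfile {
        my $name;
        do {
            $name = "tmp$$" . $suffix++;
        } while -e $name;
        return $name;
    }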

However, performance still wasn't that great; running the tests in parallel didn't seem any faster than running them in series. So I set about auditing t/*/*.t to find every use of temporary files. t/{uni,mro,lib}/*.t were already clean, but t/op/*.t needed some fixing to use non-conflicting names. (Grepping for unlink will actually find more than grepping for open, assuming the code was well behaved and cleaned up its litter.)
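In one-liner form, that audit is essentially a grep -l, something like:

    # Print each test that mentions unlink; closing ARGV skips straight
    # to the next file once the first match is found.
    perl -lne 'if (/\bunlink\b/) { print $ARGV; close ARGV }' t/op/*.t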

So, with t/op/*.t in parallel, timings got better, but still not wow. So I found and fixed the races in t/{cmd,comp,run,io}/*.t too, so that they can all run in parallel. That turned out to be useful, as several tests in t/io/*.t sleep to generate specific age differences, which understandably makes those tests take a long elapsed time even on a fast CPU. Now they can run in parallel with something else actually doing work. (And just running t/op/*.t in parallel, queued in the default lexical order, it's an eye-opener seeing how late through the alphabet the results of t/op/alarm.t show up - another sleep addict.)

So, what does it look like? Here's the before and after on a MacBook - a dual-core machine - running a complete clean build, make -j3, and the tests. Before is without parallel tests; after is with 3 tests in parallel, via TEST_JOBS=3. There's a touch pie before anything starts, an ls -l pie at the end, and the last number is that elapsed time in seconds, courtesy of perl -wle '$a = (stat "pie")[9]; print time - $a'

Serial tests

All tests successful.
Files=1553, Tests=209393, 992 wallclock secs (70.33 usr 11.47 sys + 644.42 cusr 59.71 csys = 785.93 CPU)
Result: PASS
      998.01 real       719.09 user        72.77 sys
Fri  8 Aug 2008 19:24:49 BST
-rw-r--r--  1 nick  admin  0  8 Aug 19:06 pie
1124

Parallel tests

All tests successful.
Files=1553, Tests=209393, 460 wallclock secs (92.08 usr 13.23 sys + 630.34 cusr 59.47 csys = 795.12 CPU)
Result: PASS
      466.46 real       726.69 user        74.25 sys
Fri  8 Aug 2008 19:56:46 BST
-rw-r--r--  1 nick  admin  0  8 Aug 19:46 pie
591

So the total time drops from 1124 seconds to 591. A saving of 533 seconds, almost 9 minutes, almost 50%. Time for a tweak, recompile (ccache for the win) and retest is way more than halved. WIN. Probably even EPIC WIN. "Dear Sun, about that 32 core box you kindly loaned us a couple of years ago. Could we have it back please? We now have the technology to exploit it properly".

Of course, there are still improvements that could be made, if anyone is interested. In particular, right now, it would be useful to check all the test files at the top level of lib, i.e. lib/*.t, and change any that use custom temporary file routines to use File::Temp. I'm not sure how much more it would gain us, but it's easy to do in small bursts, and it won't be in the wrong direction.
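For anyone tempted, the conversion is mechanical. A sketch of before and after (the hand-rolled routine here is invented):

    use File::Temp qw(tempfile);

    # Before: a hand-rolled fixed name, which collides under parallel runs.
    # my $file = 'lib_foo.tmp';
    # open my $fh, '>', $file or die "open $file: $!";

    # After: File::Temp picks a unique name and unlinks it for us.
    my ($fh, $filename) = tempfile(UNLINK => 1);
    print $fh "test data\n";
    close $fh or die "close: $!";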


did you try 9 jobs?

Eric Wilhelm on 2008-08-10T06:43:00

In most cases of parallel testing (e.g. for 12-40 short-running test files), I've found 8 or 9 jobs to be the sweet spot on a 2-CPU machine (assuming adequate RAM). With 3 tasks I would see e.g. a 30-40% wallclock saving, vs 50-60% with -j 9.

I suspect it has to do with keeping the parser busy, but could be some cool aspect about the (Linux) kernel (starting near-concurrent tasks which access the same .pm/.so files on disk?)

Note that your wallclock lower bound is $cpu_time/$num_cpu, which is ~795s/2 => 398s according to the output you posted; your 460s wallclock means you're losing a whole minute (assuming you have RAM to spare, but at this duration you might want a browser's worth of responsiveness on your machine unless you're going to go make tea).

Re:did you try 9 jobs?

chromatic on 2008-08-10T07:59:21

A sweet spot at nine jobs seems to point to I/O-bound processes, which makes sense given how much time Perl-based tests spend printing, the granularity of a lot of these tests, and the design of Perl's TAP tools.

More info?

kwilliams on 2008-08-11T15:09:54

Where can we get more info about how T::H decides which tests are okay to run in parallel? All I can find is a line in the 'Changes' log saying "Implemented rule based parallel scheduler."