Code Coverage and the Zune Bug

Ovid on 2009-01-05T10:28:22

Many of you have probably heard of the Zune bug and some of you may have seen the source code of the bug. Specifically, the bug resides in this:

BOOL ConvertDays(UINT32 days, SYSTEMTIME* lpTime)
{
    int dayofweek, month, year;
    UINT8 *month_tab;

    //Calculate current day of the week
    dayofweek = GetDayOfWeek(days);

    year = ORIGINYEAR;

    while (days > 365)
    {
        if (IsLeapYear(year))
        {
            if (days > 366)
            {
                days -= 366;
                year += 1;
            }
        }
        else
        {
            days -= 365;
            year += 1;
        }
    }


    // Determine whether it is a leap year
    month_tab = (UINT8 *)((IsLeapYear(year))? monthtable_leap : monthtable);

    for (month=0; month<12; month++)
    {
        if (days <= month_tab[month])
            break;
        days -= month_tab[month];
    }

    month += 1;

    lpTime->wDay = days;
    lpTime->wDayOfWeek = dayofweek;
    lpTime->wMonth = month;
    lpTime->wYear = year;

    return TRUE;
}

Regardless of whether or not you know this programming language (C), you should be able to see what caused the Zune player to freeze. In fact, it's a common code smell: the arrow anti-pattern.

Still don't see it? Take a close look at that while/if/if loop. Still don't see it? Well, there was some discussion of this code over at Reddit. There was some talk about whether or not it would be possible to find this bug automatically and that's really an interesting topic. Some type inference systems can catch infinite loops (the paper that Mark Jason Dominus drew on for his Strong Typing and Perl talk). However, I doubt such inference could catch this bug.

Using type systems to find infinite loop bugs requires that you have a sound type system for your language ("sound type systems" has a very specific meaning), something Perl 5 does not have (Perl 6 won't, either). Microsoft has Terminator, software which looks for infinite loops and, while admitting it can't handle everything, can find many infinite loop bugs. But Terminator probably isn't going to work for us -- it appears to be geared towards C programs and in any event, dynamic languages cause all sorts of havoc with many analysis tools.

So what are we to do? Given that the code above is probably a small sample in many hundreds of thousands of lines of code, it's quite understandable how that could be missed. But can we avoid that?

Well, yeah, actually. It wouldn't be fully automatic, but the simple answer is more than just "write tests for you software". It's "use code coverage tools". Rewriting the problem in Perl (and reformatting to avoid the excessive horizontal whitespace):

 1: while ( $days > 365 ){
 2:    if ( is_leap_year($year) ) {
 3:        if ( $days > 366 ) {
 4:            $days -= 366;
 5:            $year += 1;
 6:        }
 7:    }
 8:    else {
 9:        $days -= 365;
10:        $year += 1;
11:    }
12: }

Now does the problem stand out? What happens if $day == 366? Now we have a problem, but code coverage would have caught that.

Specifically, "conditional coverage" would have caught that the condition in line 3 was never false. Then, a sharp programmer could examine the coverage report and for those conditions which are easy to test, test them! It doesn't guarantee that your software is bug free, but it would have helped Microsoft avoid an embarrassing PR fiasco like this one.

What amazes me the most...

Alias on 2009-01-05T11:47:48

... is despite all the vast amounts of code in the Microsoft mobile areas, there wasn't something reusable for that, because clearly someone's had to write it from scratch some time in the last 4 years.

Re:What amazes me the most...

jplindstrom on 2009-01-05T13:40:47

Maybe it's so vast noone can find things in it.
Or maybe it's there, but coupled to other things they can't afford to bring in along with the deps.

Re:What amazes me the most...

btilly on 2009-01-05T18:14:53

It was not a Microsoft bug. Which goes to show the dangers of reusing someone else's code.

Not a strong typing kind of bug

ziggy on 2009-01-05T18:25:14

First, I appreciate the schadenfreude here as much as the next hacker, but it appears this bug was in the Freescale code for the platform, not the code Microsoft wrote for the Zune. So it effects more than just Zunes...

Second, the "some type inference systems" is misleading. The one type inference system that Andrew Koenig and MJD discuss is the Hindley-Milner type system, which is used in ML and its descendants (OCaml, F#, Haskell, etc.). Koenig's example demonstrated that a recursive algorithm would have never worked properly because it was missing a base case. That's totally different from a while loop in an imperative program. No type inferencing system can find a bug like this in an imperative program.

Third, this is not the kind of problem simple code coverage would be able to isolate. What happens if $days is 367 and $year is a leap year? Then the inner if branch fires, $year increments, and $days decrements by 1 leap year.

That's why this code works perfectly, except for one day every 4 years -- the last day of every leap year. You could try 1000 random inputs and demonstrate all branches are exercised, and still miss the case where $days = 366 in a leap year.

Date math is hard, and full of corner cases like this. This is a problem that by its very nature isn't amenable to automated reasoning, whether you're talking about type inferencing, theorem proving, automated testing or code coverage analysis. Your best bet for all date math related code is to build up a body of known edge cases and try them all, in a brute force manner.

Re:Not a strong typing kind of bug

Ovid on 2009-01-05T20:11:00

Thanks for the extra information.
I think, ultimately, that part of the problem here could be alleviated by coupling things which must be coupled. For example, having a DayOfYear type with a type coercion system could partially help. Imagine the following type definition (in Perl 6, which doesn't do inferencing, of course, but bear with me):
subset DayOfYear of Int where { 0 <= $_ <= 366 };

Obviously, that still leaves edge cases, but I would imagine with inferencing, while you wouldn't get a compile time check, you couldn't get this:
Int day = get_day_of_year(); day++; # runtime-fail if day was 365

Even though that's an "Int" and its static type would be an Int (thus preventing catching this at compile time), you could at least get a runtime failure when there's no suitable method/function to be called on its dynamic type. This will prevent certain classes of problems, but at the end of the day, there are still plenty of edge cases for date math and trying to decouple the day of the year from the year is the real culprit here.
If you merge that with Test::LectroTest (like Haskell's Quickcheck), and you might actually catch those errors in testing, even if you didn't explicitly test for them.
All of this, of course, if from a typing newbie, so if I'm talking complete bollocks, feel free to correct me!
Er ... except, now that I think of it, since its static type is an Int, that still allows adding two days together, something which DayOfYear would have to forbid. This means that DayOfYear would have to be able to know all operators available for Int and disallow inappropriate ones, something which is just begging for trouble. I guess the decoupling is the real problem and I don't know that coercive typing would help :(

Re:Not a strong typing kind of bug

ziggy on 2009-01-05T21:00:59

All of this, of course, if from a typing newbie, so if I'm talking complete bollocks, feel free to correct me!
You're talking complete bollocks. :-)
The kinds of things a strong static typing system can do is let you create types like DayOfYear, which let you specify a single calendar date, and prevents such nonsense like the 47th of April as being a "date". From there, you can create functions like addDays that take a DayOfYear and an integer number of days and return a new DayOfYear.
What this code was doing was taking a count of days since the beginning of time, and figuring out the current date by subtracting one year's worth of days at a time, and producing the current year and the current count of days since the start of the year. If we were dealing with a calendar that didn't have leap days, it would be a simple matter of calculating day_count div days_in_year and day_count mod days_in_year. But date math is more complex than that, so...
What type checking cannot do is infer specific magic properties about a simple counter used in a specific context. Adding and subtracting numbers is always just adding and subtracting numbers. If you want a type checker to help you out (in Java or in Haskell), then you need to define a new type with new behaviors that more closely matches the domain, whether you're talking about date math, tax math or statistical probabilities.

Re:Not a strong typing kind of bug

Aristotle on 2009-01-06T21:06:26

I think the mindblowingly clever technology you are looking for here is called a truth table. :-)