I think many of you may be interested in my latest O'Reilly blog post. It discusses a free software product called Alloy which can analyze your software model and find "counter-examples" where your assumptions are false. It looks fairly easy to use and has tutorials, a reference guide and many links explaining its use.
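To give a rough feel for what that counter-example hunting looks like, here is a toy sketch (in Python, not Alloy itself - the tiny access-control model below is invented purely for illustration): Alloy-style analysis enumerates every instance of a model up to a small bound and reports the first one that satisfies your stated rules but violates your assertion.

    # Toy sketch of bounded counter-example search, in the spirit of
    # Alloy's "small scope" analysis. The access-control model here is
    # invented for illustration; real Alloy models are written in
    # Alloy's own relational language.
    from itertools import chain, combinations

    USERS, DOCS = ["u0", "u1"], ["d0", "d1"]
    PAIRS = [(u, d) for u in USERS for d in DOCS]

    def subsets(xs):
        return chain.from_iterable(combinations(xs, n) for n in range(len(xs) + 1))

    def fact_holds(owns, can_read):
        # The design's stated rule: every owner can read their own document.
        return owns <= can_read

    def assertion_holds(owns, can_read):
        # What we hope follows from the rule: nobody reads a document
        # they don't own.
        return can_read <= owns

    # Enumerate every instance of the model up to the bound; report the
    # first one that obeys the rule but breaks the assertion.
    for owns in map(set, subsets(PAIRS)):
        for can_read in map(set, subsets(PAIRS)):
            if fact_holds(owns, can_read) and not assertion_holds(owns, can_read):
                print("counter-example: owns =", sorted(owns),
                      "can_read =", sorted(can_read))
                raise SystemExit
    print("assertion holds within the bound")

The checker immediately turns up an instance where someone reads a document they don't own: the stated rule was too weak, which is exactly the kind of design hole this approach surfaces.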
Re:Failure is an option
Ovid on 2006-06-01T18:36:21
Depending upon what you're doing, failure may not be an option. Consider the Therac-25, a well-known radiation therapy machine which killed at least 5 patients due to a software bug.
Or how about the doctors who were indicted for murder because they didn't double-check the results of some software and had several patients die as a result?
On a less lethal scale, tests can be used to prevent software flaws from reappearing, but if the underlying design of the software is flawed, the fixes that go in are often patchwork messes which merely help bad systems limp along. Eventually, many systems with core flaws turn into a big ball of mud. One of my friends works at a company which has offered a huge amount of money for this person to stay, but key flaws in the underlying software have made it very difficult for the software to scale, and the company may go out of business because it is having trouble keeping up with demand.
Software flaws caught at design time are usually less expensive to fix than post-deployment flaws, and the larger the project, the more likely it is to have serious design flaws. Any tool which might help catch those flaws up front seems worth looking at.
To be honest, I'm not sure what you're getting at. I'm sure plenty of programmers have horror stories about software flaws driving companies into bankruptcy (the article I point to mentions this in regards to United Airlines, but plenty of smaller companies have issues like this and I've seen it firsthand).
Re:Failure is an option
Matts on 2006-06-01T18:51:01
What I mean is it's very easy to come up with the examples of where this sort of strictness of design is necessary (medical software, flight control, etc - stuff where people's lives are at stake!) but the majority of software developers don't work in those environments. They're hacking together a tool to help put together the monthly accounts, or displaying things from a database on a web site...
So what happens is that software developers don't get trained in formal design methodology, or if they do, they forget it (I learned Z at Uni, but I couldn't even recognise it now if I had to - I just don't use it).
I may also be suggesting that formal design isn't worth it for the majority of software development, simply because failures in the field (again, for the majority of software) aren't critical enough to demand it.
I haven't made my mind up completely on this, but I've worked on enough projects to know it's not black and white.
Re:Failure is an option
mock on 2006-06-02T08:14:28
I have a couple of arguments against this kind of reasoning.
First of all, most programmers do work in environments where bugs matter - the internet. If your web-enabled CRUD app has an exploitable vulnerability, then you risk both exposing the rest of us on the internet to DDoS attacks, worms, etc. from your script-kiddy-owned host, and suffering the fraud or reputation damage that a more savvy attacker can cause.
Second, code reuse means that what looks like a non critical bug can quickly become catastrophic. Did anyone else spider CPAN when the sprintf bug came out to see what else was vulnerable? Or did people assume that Webmin was the only thing?
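That kind of sweep doesn't take much, either. Here's a rough sketch of the idea (Python, with a deliberately crude heuristic regex - not a real Perl parser; the file extensions and pattern are just assumptions) that flags sprintf calls whose format string is a variable rather than a literal:

    # Crude audit sketch: walk a source tree and flag Perl sprintf calls
    # whose format string is a variable (sprintf($fmt, ...)) rather than
    # a literal string - the sort of call the sprintf bug made dangerous.
    # The regex is a rough heuristic, so expect false positives.
    import os, re, sys

    SUSPECT = re.compile(r'\bsprintf\s*\(\s*\$')

    def scan(root):
        for dirpath, _, files in os.walk(root):
            for name in files:
                if name.endswith((".pm", ".pl")):
                    path = os.path.join(dirpath, name)
                    with open(path, errors="replace") as fh:
                        for lineno, line in enumerate(fh, 1):
                            if SUSPECT.search(line):
                                print("%s:%d: %s" % (path, lineno, line.rstrip()))

    if __name__ == "__main__":
        scan(sys.argv[1] if len(sys.argv) > 1 else ".")

Anything it prints is worth a human look; most hits will be fine, but that's the point of an audit.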
On the other hand, both you and I make a good living based solely on the fact that people are incapable of making secure software, so I don't know what I'm complaining about.
Re:Failure is an option
Matts on 2006-06-02T12:53:58
The thing is, what you wrote is today's reality. So failure to catch those things clearly was an option, and the world hasn't ended. Yeah, it sucks, but that doesn't mean we aren't coping with it (OK, so that is debatable too :-))
I'm basically saying that a lot of places and jobs would like to do better, but can't afford it (again, mostly not for financial reasons - catching these bugs later is more expensive anyway - but because of time-to-market pressures).
Re:Failure is an option
mock on 2006-06-05T11:52:44
Yeah, that would be survivorship bias talking. When I was a kid I (ate lead paint, got shot with a bow, fell out of trees, split my head with an axe, etc.) and I survived, so clearly these things aren't harmful. ;-)
The problem with the "I would like to, but it costs too much" argument is that when something bad inevitably happens, the rest of us have to pay for it. Either directly in SYN floods, indirectly through the market (phishing, credit card fraud, stock scams), or, worst of all, forever and ever through ill-thought-out laws.
Pretty typical tragedy of the commons, really...
Re:Failure is an option
Matts on 2006-06-05T12:31:46
Yup, exactly - it's a mess. I suspect it won't change until people's lives are at risk, though, and even then that clearly isn't enough - it needs to be a publicity disaster too. Get the newspapers to make people enraged about it.
It's funny - I almost wish we as software developers had ignored Y2K. Show the public what a real disaster looks like. Let's get it right for 2038, eh?
Re:Failure is an option
ziggy on 2006-06-01T20:05:39
"Consider the Therac-25, a well-known radiation therapy machine which killed at least 5 patients due to a software bug."
The Therac-25 is a really important story, but it is an outlier, and ultimately not relevant to most discussions about bugs, reliability or catastrophic failure. There is no general lesson to learn from that, except to be extremely careful when working on a system where life is on the line (medical, embedded or otherwise).
Case in point: I've worked on many online publishing systems in my day, and the absolute worst-case scenario is that somehow the overcomplicated pile of mud crashes and the system goes offline. But even in those rare circumstances, there's always a backup plan to get something working immediately to restore service. The next-worst scenario is that too much or too little content reaches paying customers. Better too little than too much, since "too much" may mean everything, or embargoed news that could lead to a contract violation with a data provider. Ultimately, as bug-ridden as these systems tend to get, no software problem is worth losing sleep over; no one is going to die, and no one is going to lose their job over the odd bug.
The more important problem is increasing and insurmountable complexity. Yet even that isn't critical, certainly not to the level of a Therac-25. Sure, a system can get so complex that it becomes unmaintainable. Either a business is healthy enough for a rewrite/upgrade/port to be feasible, or it's not a healthy business. Companies do not go under because of bugs and big unmaintainable systems; they go under because they are not flexible enough to evolve as the industry evolves around them. I witnessed one company migrate one app from dialup access to a VAX to a webapp; numerous shops went from DOS to Win3.1 to Win95 to webapp development, shepherding the same project all the way through. Whoever says rewrites aren't necessary or prudent is looking at small time scales or just fooling themselves.
The reason why the Therac-25 isn't relevant is that sometimes the price of total catastrophic failure really isn't all that bad. (Forgive me, Gene Kranz, but not every project is life-or-death.) I worked for an engineering firm once, and we had a project where we needed to install a curved piece of extruded aluminum into a showpiece entryway. We got the extrusion, sent it to the benders, and it didn't fit. Why? Someone sent the wrong measurements to the bender, and the bender did exactly what they were told. Total failure. Cost to the company? One extra, expensive piece of bent aluminum; if that was the entire profit margin for the project, that's bad management, not the end of the world.
Eliminating failure is a worthwhile and noble pursuit, but not when the effort is unwarranted or unprofitable.
Re:Failure is an option
mr_bean on 2006-06-02T03:26:04
I wondered how Brooks' distinction between accidental complexity and essential complexity fits into this distinction between acceptable and unacceptable failures.
Whether the failure is acceptable or not depends on the values of the clients, I think. Or does it?
I was thinking accidental complexity comes from the problem that the software is supposed to solve, but it looks like Brooks didn't think this way.
He said use of a high-level language frees a program from much of its accidental complexity.
I think I have a problem with his distinction.
An alternative classification of sources of complexity, into those coming from the nature of software and those coming from the nature of the problem the software is trying to solve, might give us a better fit between un/acceptable failure and complexity.
I think programmers have to cope more with failure than some other professions.
But I am not a programmer. I just play one on the Internet.
Re:Failure is an option
ziggy on 2006-06-02T14:40:13
Actually, you bring up a very good point.
In the systems I can remember at the moment, catastrophic failure related to essential complexity is intolerable. Catastrophic failure related to accidental complexity is accepted as part of the "cost of doing business". Prime example: IIS and Windows servers instead of something more solid, like VMS, Trusted Solaris, or something even more paranoid that can run a webapp. :-)
You could make a convincing case that the inherent complexity of a computer is part of the essential complexity of the Therac-25, and that asynchronous communication is part of the essential complexity of an ATM network, but operating systems are a form of accidental complexity in the realm of line-of-business apps.
To a first approximation, Brooks' distinction suffices, but it's not the entire story. High-level languages free developers from a certain kind and a certain amount of accidental complexity, but not all of it.
Re:Failure is an option
mock on 2006-06-02T08:42:58
No, this is the worst-case scenario: vulnerabilities in SAP, or perhaps this: Who turned out the lights? The price of catastrophic failure really can be that bad. The problem is that the same components you used in your publishing house, or to bend sheet metal, are being used everywhere else as well - and they suck!
Now for my lovely little anecdote to debunk the rest of your point. Back before I worked for ActiveState, I was an IT consultant to a very large forestry company (which shall remain nameless). Someone didn't configure the access controls on the GIS data server correctly. Someone accidentally uploaded the wrong data to that server. Several thousand acres of protected national park old-growth forest accidentally got clearcut. But that isn't the funny part. The funny part is that the company paid the several million dollars in fines without even blinking, because it happens all the time. It was even profitable - but the net result is a lot less parkland for you and me to enjoy, because of a measuring error.
Bugs matter.
Re:Failure is an option
ziggy on 2006-06-02T14:57:05
mock! How've you been? Where have you been hiding yourself?
"Bugs matter."
Yes, they do, but not all bugs have equal weight. Not even security-related bugs. Do I care if a package has a known buffer overflow if it's running inside my firewall? OK, I care, but do I care as much as I would if it were in the DMZ or on a public site? Do I care enough to patch inside the firewall first, leaving a public machine wide open?
We can trade anecdotes all day about how bugs matter or don't. In the end, though, the severity of a bug or a catastrophic failure is directly related to the amount of damage it can do when something breaks. The forestry example is on a different scale because it has a lasting impact; customers losing access to information they've subscribed to is regrettable, and should be prevented with all reasonable effort, but it's nothing to lose sleep over. (Setting loose millions of credit card or medical records, on the other hand - well, that's pretty damn serious.)
In construction, waste happens and you need to factor that into the overall cost. It's annoying when that waste is expensive and needless, but it's certainly not catastrophic in any real sense, not when you should really be worrying about panes of glass falling and decapitating some innocent bystander.
Bugs matter, but so does perspective.
Re:Failure is an option
mock on 2006-06-05T11:39:13
Well I'm still kicking around Vancouver, however you might see me in London or Tokyo as well. I founded MailChannels with another former ActiveStater, and I've been making the bits go for CanSecWest and associated conferences for the last few years. Right now, I'm reworking our conference registration system, which entailed an audit of all the bits I was planning on using. I'm not really pleased with what I found.
While I don't disagree that perspective is necessary - obviously, when limited resources are available, priorities must be assigned based upon perceived risk - I do think the industry currently errs way too far on the side of closing its eyes and thinking of the Queen.
Most developers... hell, most people... don't think maliciously enough. They think that no one will notice, because their project or company or library is "too small." What they fail to recognize is that being malicious has become big business. Customers losing access to data matters when you're shorting a stock.
We live in a world where a (new) remote exploit for IE or Outlook currently goes for about $100,000 to the right interested parties - only some of whom are spammers. Now normally I wouldn't care - I'd be happily profiting from both sides like I usually do - but in this case we all share the same internet, and someone else's Darwinian malcompetence (cough*Sony DRM*cough) ends up costing me.
What I hoped to illustrate with my forestry example is that someone else's bugs can affect your life without your even being aware of the bug, and that perhaps bugs should be treated a bit more like the attractive nuisance that they are.
We don't let construction companies dump their waste in our drinking water either.
Re:Design By Contract
Ovid on 2006-06-02T15:18:58
No, it's not DBC, but I can see how it might look like that from my limited description. In DBC, things fail when bad data gets in, but only if the bad data actually gets there. That might happen if your tests push through bad data, but if the bad data never reaches the code in the normal course of running the app, the contract holds. Alloy explicitly looks for things which could reasonably happen in your app and cause a failure state.
Also, DBC only ensures that the contract holds, not that the contract is reasonable. Alloy tests the design, not the code. Plus, from what I've seen of DBC, it operates on a much smaller scale than Alloy. Alloy can test the assumptions of an entire application and look for missed corner cases, bad assumptions, and so on. Read the Scientific American article. It's fascinating.
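To put the difference in code (a toy Python sketch with invented names - neither real DBC tooling nor Alloy): a contract fires only when bad data actually reaches the call at runtime, while an Alloy-style check searches the whole bounded space of states the design permits before anything runs.

    # Toy contrast between the two approaches; the transfer() example
    # and its "design rule" are invented for illustration.

    def transfer(balance, amount):
        # DBC-style precondition: it fires only if bad data actually
        # reaches this call while the program is running.
        assert 0 <= amount <= balance, "contract violated"
        return balance - amount

    # Alloy-style bounded check: if the design never constrains amount
    # (say it comes straight from a form field), search the whole small
    # state space up front for a state the contract would forbid.
    counterexample = next(
        ((b, a) for b in range(0, 4) for a in range(-3, 7)
         if not (0 <= a <= b)),
        None,
    )
    if counterexample:
        b, a = counterexample
        print("design permits a state the contract forbids:"
              " balance=%d, amount=%d" % (b, a))

The contract stays silent until a negative amount shows up in production; the exhaustive search names that corner case before any code runs.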