Why People Hate Normalization

Ovid on 2006-07-03T14:23:48

Imagine that I have tasks and users tables. Each task has a start status, and eventually has either a cancel or finish status. Each status must be tagged with the user.

I can't put the cancel or finish status in the tasks table because since they will be NULL to start with, I can't put a foreign key constraint from, say, the task.cancel_user_id column to the users.id column since foreign key constraints typically don't allow null values. Thus, to properly normalize them, I need at least two other tables, task_cancel and task_finish, each with at minimum a task_id and a user_id (and probably a date). This assumes that tasks entered are automatically started. If not, that needs a task_start table. So a conceptually simple problem now requires a minimum of five tables, tasks, users, task_start, task_cancel and task_finish.

That gets annoying and one might be tempted to create a fake user called "none" and insert that user id in the tasks table, but that's just special casing a particular ID and overloading its meaning. That's an even worse problem as most SQL queries which fetch data from the users table will need to account for the special case.

So getting back to the five tables, what if we want to fetch all of our data at once? If a task is neither cancelled or finished, we can't do an inner join against those tables, we have to do an outer join. So once again the rules of normalization take a relatively simple idea and start to complicate it to the point where programmers want to throw in the town and skip normalization.

Mind you, these sorts of "real world" problems crop up all the time. Did I miss something here? Shouldn't simple ideas be simple? It might seem like ORMs can hide some of these problems, but they add some of their own problems (link taken from a journal entry of TeeJay's).


re-Normalize

n1vux on 2006-07-03T16:07:19

I think you've overnormalized. You don't need task_start, task_finish, task_cancel junction tables. You've just traded trying to enforce UserID NOT NULL IF Status NOT NULL for enforcing that a task appears at most once in the UNION of the three task_{start,finish,cancel} tables, which will be even more awkward to code in your data dictionary (1=COUNT(*) FROM UNION...)

You need just one task_status table that has NOT NULL FKs to TaskID and UserID, plus a State enumerated type and an autoinsert date timestamp field. The status is the State that matches max(Stamp).

This is an INSERT-only, no UPDATE table that collects history. This is good if you have SarbOx or other audit-trail requirements; DB privilege grants can enforce this.

If you want to avoid nested queries, you need transactions, and add an Latest boolean column that is inserted true and cleared false when updating status by inserting a new status record. In this case, a stored proc with this transaction should be the only allowed way to insert, to assure the invariant predicate of only one active state per task.

In either case, defining a VIEW that sees only the Latest=TRUE rows (and perhaps another for Latest=TRUE and NOT State=FINISHED) will make life easier for folks building dashboards or reports.

As you'd defined it, TaskID FK is the sole primary key, as each task has but one user; you were splitting only to allow requiring UserID NOT NULL IF State NOT NULL. Making it a history log requires Primary Key (TaskID,Stamp). But splitting state out to a side table allows the possibility of multiple users working a task in parallel or in relay, in which case primary key is (TaskID,UserID,Stamp). That this normalization gives us extra flexibility suggests it's a good one.

Re:re-Normalize

Ovid on 2006-07-03T16:25:19

Ah, thanks. You're correct that my normalization wasn't correct. I can now see the value of what you propose and that does make the task simpler and more maintainable. As a caveat, though, some of what you suggest implies using a real database, not a toy like MySQL (which is what we're using).

This gives me another good rule of thumb to keep in mind. Naturally, when I see things like field_1, field_2, field_3 and so on, it's obvious that I need another table as what I'm looking at is essentially an array. However, what I described in my original post was essentially a hash and that should also be a suggestion (not guarantee) of poor normalization.

DBA Barbie says "normalization is hard!"

Re:re-Normalize

chromatic on 2006-07-03T17:30:46

You have to use a real programming language too, not a toy like Perl.

(Oh, I meant Perl version 2. What? There's a newer version? I bet it's a toy just like Perl 2.)

Re:re-Normalize / Toys

n1vux on 2006-07-03T17:44:43

As a caveat, though, some of what you suggest implies using a real database, not a toy like MySQL (which is what we're using).

Can you upgrade your MySQL to the latest & greatest? Yes, MySQL3 with default engine MyISAM is a toy. MySQL5.[01] with InnoDB engine (or BDB, or the coming Falcon engine) is no longer a toy -- subquery, views, transactions. (They've even got a cluster engine for partitioning.) [Engines] (I think this is the point of the Perl 2 joke.)

Barbie is correct that normalizing is hard if done right -- that's why DBA is a speciality.

Re:re-Normalize / Toys

chromatic on 2006-07-04T18:38:29

Bill nails it. I stopped having any sympathy for people who refuse to upgrade to new versions of free software at approximately 7 am on Sunday morning.

What happened at 7am Sunday?

n1vux on 2006-07-05T20:15:29

Bill nails it. I stopped having any sympathy for people who refuse to upgrade to new versions of free software at approximately 7 am on Sunday morning.

What happened Sunday? I don't see a journal entry for Sunday. Is there a new entry for stupid customer tricks?

Memo to self: update home servers to Ubuntu to 6.06LTS and Debian to 3.1 now that they're really stable ...

Re:What happened at 7am Sunday?

chromatic on 2006-07-06T01:08:56

I read yet another Nicholas Clark "This isn't fun anymore, and no one seems to want to pay someone to fix Perl 5" message on p5p. That's when I decided that companies which rely on Perl 5 but don't send bug reports or test snapshots and have all of their code in the Darkpan can fix their own bugs, or at least pay for the continued development of Perl. That is, they can do that if they care about their code.

Paying for Support for Perl5

n1vux on 2006-07-06T17:54:05

"companies which rely on Perl 5 but don't send bug reports or test snapshots and have all of their code in the Darkpan can fix their own bugs, or at least pay for the continued development of Perl. That is, they can do that if they care about their code."

That hits home.

Support: I tried to get commercial support for Perl, and couldn't. Yes, I could pay for it, but I couldn't get anything I could recognize as support. I talked to ActiveState at LinuxWorld 2005, interested in buying support for my $DayJob for multiple platforms (Windows, Linux, AIX, Solaris), since the VPs always ask "what's the support model" (which implies they'd be willing to pay for it). AS said they'd build and support everything except DBD::Oracle, DBD::DB2, DBD::Solaris for us. But Tim Bunce recommends if you build the DBD module you do it with a Perl and DBI you hand built. So what's the value proposition to my management of ActiveState commercial support, if I'm stuck with supporting DBD::* and either trying to replicate their build environment for DBD::* builds, which against Tim's advice, or re-building all from sources anyway, which will require all bugs in "supported" components to be replicated on "clean" (non-DBI, vendor-built) environment?

So my usual answer to the VPs about support is that they'll never see a critical bug in Perl itself. If their lame Perl scripters ever manage to stumble over a Bug in Perl itself -- not referring to a module bug -- they've probably strayed into a dark alley they don't belong in, they'd have bad code in his/her script that I can fix, not only as a work-around but as a basic improvement. And I offer to fix module bugs and/or be the internal point of contact to talk to the CPAN authors and use.perl, #perl community as needed if/when we hit a Module bug. (And even there, the only one of those we've hit was found by someone with enough savvy to patch the module himself.)

Giving Back Patches: I think you intended pay for the continued development to mean employing a p5p person with understanding that some of their job responsibility is giving back, if only to ensure local fixes are in the upstream for next-time, but allowed to fix RT bugs that sound good to fix too. I'd love to expand my semi-official Perl Evangelist & Guru role into one where Perl support was my fulltime job and I was paid to do P5P stuff. (That's about the only change to my current job that would be an improvement, so I can't complain.) I consider that I'm lucky that I get paid to read and post to use.perl.org a few hours a week. I'm going to try to help Parrot Cage-Cleaners, but that's going to be on my own time, I'm sure; if I use hardware that isn't mine, it won't be $DayJob's, at least not initially. (I may try for that later, in which case I might get some $Hours{$DayJob} for it as well.)

At least $DayJob now has a policy and procedure to approve giving back fixes to CPAN modules (or even Perl5) if we find a bug in the future, and maybe even to contribute new modules (if we can show the new module isn't of competitive advantage). Sadly, this good policy came with a policy requiring approval of CPAN modules (and all other F/LOSS) for internal use: GPL license on a CPAN module is a dead NO, due to CopyLeft viral nature (but ok on standalone utilities, barring code reuse); "Artistic" or "As Perl" or "LGPL" are all ok. (I need to review the current Artistic 2.0 Draft with the lawyers, and will provide feedback if they choke.)

The DarkPan: (I love that coinage!) While I've got the start of a couple potentially useful modules at home that may make it to CPAN when I get some TUITs to finish refactoring, I haven't found much DarkPan stuff at $DayJob. Lots of (simple) scripts, some using DBI or Net::FTP, but not a lot of proprietary modules. There may be a larger DarkPAN in other Business Units, but not mine -- our division uses Perl for light administration, not even fancy glue.

I did take a better-than-average Net::FTP script and put it's three functions in a .PM for easier reuse, but it's really peculiar to how we do transfers, and it's still really more of a template for cribbing than for literal module use, so it's staying DarkPan.

The one halfway interesting module I have personally at $DayJob isn't even reused out of my $USERID yet. It's just a WWWMech milkinging machine for an internal proprietary administrative web site to automate my personal workflow. I don't see anything in it worth refactoring into a CPAN module.

Oh, we do have one module that would have been interesting to try to get corporate Intellectual Property release on if I'd ever gotten budget to refactor it. It's a C-API wrapper for a commercial reporting package. But it was built Perl4 style, redirecting through system() to locally-built CLI utilities (that wrap the C API individually) with dubious security and non-scalable performance, with session create/destroy per request from the Perl script. If I built this today, I'd use the C API directly via XS/Inline, so that a Perl script execution had a single session. When I thought this toolset would spread in the organization, I had plan to refactor this hard and request permission (from Vendor and from Legal) to release it. But since there's only one Application using it internally, there's no internal reuse, thus no budget for refactoring, and no chance (or need) for external reuse. Anyone with the vendor's C API and Inline::C could do a better job today.

My most useful module on the $DayJob desktop is copied in from my old NT4 laptop at home and is Trivial ... it's to allow Perl OneLiners on Win32 to have wildcard fileargs like on a real shell, without needing to drag in Cygwin or MKS toolkit. It consists of just

BEGIN { @ARGV = map glob, @ARGV;}
so I can just say
C:\project\foo> perl -MMyGlob -lane "print if 1..1 or $F[3]=~/pattern/" logs/*.log

Supporting Perl - My $DayJob does provide some financial support to Perl, if not to p5p's and thus Perl5 itself - we've hosted Damian twice and Stonehenge once, and I'll try to bring them back again. We also had Uri on-site for a few months as an interim internal rent-a-guru, and I've inherited (on a very part time basis) that role and the re-nascent community he helped stitch back together. My internal sales pitch to justify the exepense of gettting World Class Trainers is that there's a downside to paying for "Perl" training with Corp.Training's favorite "we teach everything" one-stop-shopping vendors: trainers who have to check the slide's running footer for the class title to see which language they're teaching today aren't going to give good deep answers to intersting questions. You want the guy who wrote the book if you can get him, and for Perl we can. (I'm used to getting the guy that wrote the book from our local ACM PDS seminar series.)

(I've only lost that argument when Sun gave us "free" training credits with the new hardware. "Free" is hard to argue with. Damian comes close to being infinitely better than a slide-reader, but supposedly infinite ROI looks so good on paper. Of course, when you factor in the "opportunity" cost of the "free" offer, the ROI on "free" may drop precipitously, but that doesn't overbear most people's gut reaction to "FREE! FREE! FREE!". And some people like nearby off-site training, it's away from the BOSS and the PHONE and EMAIL without being away from home.)

Re:re-Normalize

dws on 2006-07-03T19:18:38

Bill pointed you in the same direction I was going to. Statuses accumulate over time, and you're typically only interested in the most recent one, unless you're rolling back time for audit purposes. But if you're stepping into an area where auditing might be a concern, you have to start paying attention to the distinction between "actual" time and "record" time (a distinction that can tie your brain in little knots when you try to think through queries). Martin Fowler has a decent intro to the issues.

NULLs

Theory on 2006-07-03T16:29:11

I can't put the cancel or finish status in the tasks table because since they will be NULL to start with, I can't put a foreign key constraint from, say, the task.cancel_user_id column to the users.id column since foreign key constraints typically don't allow null values.

While current relational theory is pretty contentious when it comes to NULL, the fact is that most databases that I'm aware of, including PostgreSQL, MySQL, and SQLite, allow foreign key columns to be NULL.

—Theory

Re:NULLs

Ovid on 2006-07-03T16:36:52

I was doing this in MySQL and having NULL values in the FK column caused MySQL to die a horrible death. And then just to prove my point, I created a small test case in MySQL and it didn't die. There must have been something else going on, but now I can't reproduce it. Damn.

Re:NULLs

Aristotle on 2006-07-04T09:05:37

I was thinking the same as n1vux all along while reading the post – there just isn’t any need for so many separate join tables, one is perfectly sufficient given a start/cancel/finish enum column. And I’d still do it with such a join table even if NULL FKs work: for one thing, such an enum column for would be naturally extendable without having to touch the schema. For another, you don’t ever need to UPDATE any rows – apart from better scalability, having immutable rows tends to reduce the complexity of all associated code.

Nah, normalisation is nice.