Database - Slave or Master? 2 of 3 - Object Persistence

mugwumpjism on 2006-08-21T22:13:06

One of the coolest things about "Object Persistence", is that it has the word "Object" in it, which of course means better than anything that was around before "Object Oriented Programming" was decided to be flavour of the decade. Even better than that, it even has the word "Persistence" in it, which sounds much more sophisticated and modern than "Database" or "Relational".

Then are shiny tools in this space, like Hibernate for Java. Using Hibernate, you can make an Enterprisey Macchiato Java Servlet that blends your database to a set of Java objects, and then provides Soaped-up Beany Hyper-galactic XML Web services for other parts of your Enterprisey code base to access. It's fantastic - you end up with a set of tables (all with surrogate IDs, of course) that you are guaranteed to not be able to write to safely from anything except the Java Server. This puts the Java developer in control. Which is the way (s)he likes it. Maybe hibernate doesn't have to work like this, but (s)he prefers it because it means that all the changes to the objects have to go through the same central set of functions. Otherwise, the development investment is wasted. And we can't have that, not at the price it cost.

Anyway, Tangram is not quite so antisocial as that. It at least behaves transactionally, given appropriate use of $storage->unload_all (also ->recycle) and distribution of clue. But it is currently anti-social in other ways, such as the surrogate ID requirement.

Wait a minute - the database has .oids, too

Postgres, and Oracle, both have a concept of .rowid; all tables except for 'index organised tables' have them by nature. I have observed that in the vast majority of code that uses the Tangram API, I never need to use or refer to this .id; in fact, when storing an object in multiple stores, its .id will vary across those stores. In light of this, while I consider surrogate IDs a design flaw - it's not a tragic one, it's consistent with what the database does anyway, and it has allowed for interesting patterns to be built in the mean-time while better ideas come forth. For a more detailed analysis of what I think is wrong with Tangram, see the bugs POD file in the Tangram distribution, especially the section on surrogate type columns (actually I've just tidied those up, if you're reading this before I make a release then read the fresh one here).

What defines "Object Persistence"?

Again, hazarding a set of common features of object persistence tools that could plausibly form part of a definition;

  1. They normally do have requirements of the database; usually not all valid DDL models can be mapped to a set of objects.
  2. They will map features of objects not usually considered relational concepts such as inheritance and Perl structures like Arrays and Hashes.

What's so cool about Tangram

The key design feature of Tangram is what is frequently referred to as being orthogonal - it is normally non-intrusive on the objects being stored. A given object may even exist in multiple stores simultaneously (but be represented by the same Perl object). The result? Classes do not need to be aware of their storage; any more than a tuple needs to know it's being stored in a table space.

This is implemented with Lazy Loading. The in-memory data structure is considered equivalent to the database form; via types such as Tangram::Type::Set::FromOne, it is possible to follow joins between tables by just walking Perl objects with visitor iterators like Data::Dumper.

Tangram Querying

For the cases where you have specific questions for your data model, and you are not just following adjacent relations between objects, lazy loading is not enough. We still need some form of query syntax.

For this, Tangram uses Tangram::Expr objects that represent database objects - and they use overload so that you can write your query expressions using standard perl operators (as far as overload allows). Depending on your inclination, you either "run screaming" from this syntax or love it.

In my experience, Tangram's query syntax makes some previously hard queries easy, and some "impossibly difficult" queries easy. You can build intricate joins with a consistent notation. For example, process a form, make a list of Tangram::Expr fragments, and then combine them into a filter that can be used for multiple queries.

  # get the table aliases
  my ($r_artist, $r_cd, $r_track)
      = $storage->remote(qw(Artist CD Track));

  # build a set of filter expressions - some of these
  # represent joins.
  my @filters =
      ( ( $r_artist->{name} eq "The Black Seeds" ),
        ( $r_cd->{artist} == $r_artist ),
        ( $r_cd->{tracks}->includes($r_track) ),
        ( $r_track->{name} eq "Heavy Mono E" )      );

  # AND them all together
  my $filter = reduce { $a & $b } @filters;

  # then use them for queries
  my (@cds)    = $storage->select( $r_cd, $filter );
  my (@tracks) = $storage->select( $r_track, $filter );

The query there is already getting reasonably impressive; the first ->select() maps to:

 SELECT
    t1.id,
    t1.type,
    t1.artist_id,
    t1.name
 FROM
    CD t1,
    Artist t2,
    Track t3
 WHERE
    t1.artist_id = t2.id         AND
    t2.name = "The Black Seeds"  AND
    t3.cd = t1.id                AND
    t3.name = "Heavy Mono E"

This is a simple example, and I have found that there are very few real queries on well designed schema that do not map to this syntax well. That being said, sub-selects require an undocumented syntax, and while I have some sympathy to the notion that you should be able to write sub-selects as joins most of the time, it's certainly an example that the API hasn't been extended in all directions yet.

Tangram Maps Inheritance

There are those that would say inheritance is about a relational a concept as an .mdb file, but I think that there is adequate justification for its use in data modelling.

A good question to ask when validating a relational schema to be normal form, is "what does this relation mean?" or "what fact is being represented by this tuple?". We can ask this question for all tables - and the basic answer is "there exists an object with these values"¹. The fact is that the object exists. Better answers can be made for individual tables; consider that answer a template - ask a meta-question, get a meta-answer.

This is where the argument for inheritance stems. The relations still describe existence of an object, but certain types of objects will have extra items in their tuple - relations to the extra properties bestowed upon them by their sub-classes.

In the CD store schema, for instance, 'Artist', 'Person' (perhaps better called 'Musician') and 'Band' are related like this. The justification is, that an artist can be either a musician or a band, but if we are relating to something in its capacity as an artist (ie, from the CD table, to say who released it), there also exists by association a relationship between the CD and all of the properties of the artist in its capacity of a musician or a band.

Tangram short-cuts the query overhead of this situation using a 'type' column. The type column is an index into the schema, and is used to derive a bitmap of which extra tables associated with a base class are expected to have rows for this primary key. This is a de-normalization of data, so technically a hack - as noted on Tangram::Sucks, it should be possible to detect the type using the presence or absence of tuples in the corresponding tables. Or, somewhat equivalently, NULLs when using "Horizontal" mapping - see Tangram::Relational::Mappings for a description of these terms). I'm told that David Wheeler's Object::Relation can work like this.

But what about the Schema?

Having a schema structure that is free from side-effects can be quite useful. Tangram has this down well; its input is a plain set of perl hashes and arrays, no side effects. If you want to use the pure objects to create code, you can still pass them to Class::Tangram::Generator. If you want to connect to storage, pass to Tangram::Storage. T2 was my attempt at making a set of objects that can both describe this Tangram model relationally, and itself be stored in a Tangram database. This is useful for building rapid application development / Computer-Aided Software Engineering (RAD / CASE) tools. Consider Umbrello; it could not compile classes as the objects were manipulated, otherwise you might override internal behaviour and break the running program!

You don't have to write comprehensive schemata any more

Consider the package manager, Fink. Whilst using Storable for persistence can make applications like Fink faster by reducing parse time to load their entire state at start-up, it is still not as fast as a Berkeley, ISAM or SQLite-style database which is loaded on demand for small accesses.

The general approach is not making the whole schema relational in one go, but instead cherry-picking out the columns that you think are useful enough to be indexed, and throwing the rest into a single column that contains a Storable, Data::Dumper or even YAML data field which is used to construct the rest of the object. Tangram::Type::Dump::Any is built for this. I wrote a Tangram schema for Fink that does this, which is lurking here.

You end up with a data source which can be queried on all mapped columns, and almost all close that was written for the old, non-Tangram system works too - because previously, the only option was to follow Perl references, but we've made sure they all get lazy loaded.

Where Object Persistence Wins

RAD-developed, and imported models
In the RAD case, the model for your program is developed with a tool; the relational mapping is then derived by mutating the generated model.
In the imported case, it comes from the metamodel of an another module, such as Class::Tangram or Class::Meta.
In both of these cases, a general form of translation is "all that is required" - write a few rules about how to convert from one metamodel to another, and you have automatic Object Persistence. Sadly this "all that is required" part can get quite difficult to understand and debug.
retro-fitting storage around existing objects
This works out best when you have code that already stores via something like Storable, and hasn't been written relationally in the first place, just like Fink.

Yes, I know this is another absurdly long post in a multipart series. That's actually mostly in this case because I have more to say about it, rather than being a particular endorsement of the approach. But more on what I will endorse in the next part.

Footnotes:

  1. Yes, I know there is a widely circulating school of thought saying "that's not The Right Wayâ„¢ to do object-relational mapping, you should be using object values as columns and tuples as object relations". The former isn't available in current databases, and the latter is done using classes that consist only of foreign keys (Tangram::Type::*::FromOne relations).


enjoying this series

perrin on 2006-08-23T02:10:33

While I don't currently use Tangram, I always found the filter syntax and the inheritance support really impressive. It makes for a fun read.