Log::Apache::SQLite

Ovid on 2008-08-29T10:40:28

No, that's not its name and I've no idea what to call it, much less how to release it, but due to some serious production problems we've had in the past couple of days and a lack of the obvious tools we've needed, I got fed up and started ripping apart log files and shoving them into an SQLite database. Here's the schema:

    CREATE TABLE status (
        id      INTEGER PRIMARY KEY,
        code    INTEGER NOT NULL,
        message VARCHAR(30) NOT NULL
    );

    CREATE TABLE path (
        id   INTEGER PRIMARY KEY,
        path VARCHAR(500) NOT NULL
    );

    CREATE TABLE client (
        id      INTEGER PRIMARY KEY,
        host    VARCHAR(255) NOT NULL,
        agent   VARCHAR(255)
    );

    CREATE TABLE log (
        id        INTEGER PRIMARY KEY,
        name      VARCHAR(255) NOT NULL,
        md5_hex   VARCHAR(32)  NOT NULL,
        date_run  DATETIME     NOT NULL DEFAULT CURRENT_TIMESTAMP,
        last_line INTEGER      NOT NULL
    );

    CREATE TABLE request (
        id             INTEGER PRIMARY KEY,
        request_date   DATE,
        response_time  INTEGER,
        method         VARCHAR(6),
        protocol       VARCHAR(10),
        bytes_received INTEGER,
        path_id        INTEGER,
        remote_user    VARCHAR(255),
        referer        VARCHAR(255),
        status_id      INTEGER,
        client_id      INTEGER,
        log_id         INTEGER
    );

Of course, this being SQLite, we don't have real foreign key constraints and the schema is a bit of a hack. I could write triggers to fake foreign keys, but at fewer than 100 inserts per second (and yes, I'm using one big transaction), a 2 million line log file can take almost 6 hours to process. Adding those triggers would only make that worse. I'm probably going to have to switch to MySQL (maybe MyISAM, since this is really read-only).
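
For the curious, faking even one of those constraints means a trigger per column per operation. A sketch for inserts on path_id alone (the trigger name is mine):

    CREATE TRIGGER fk_request_path_insert
    BEFORE INSERT ON request
    FOR EACH ROW
    WHEN NEW.path_id IS NOT NULL
         AND (SELECT id FROM path WHERE id = NEW.path_id) IS NULL
    BEGIN
        SELECT RAISE(ABORT, 'request.path_id refers to a missing path row');
    END;

Multiply that by every foreign key column, plus the UPDATE and DELETE cases, and it's easy to see why the insert rate would suffer.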

One problem I faced was how to prevent someone from parsing the same log file more than once. Well, you'll note that we have log.md5_hex up there. A log file to be processed must have at least five lines or else it's rejected. Otherwise, we calculate the MD5 hex value of the first five lines and use that as a signature. We also track the last line number processed. Throw a log file at it that it's already seen and it will only process the new lines.
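
The bookkeeping against the log table is roughly this (the MD5 is computed outside of SQL, and the placeholder names are just for illustration):

    -- Have we seen this file before?
    SELECT id, last_line FROM log WHERE md5_hex = :md5;

    -- No: record it and process from the first line.
    INSERT INTO log (name, md5_hex, last_line) VALUES (:name, :md5, 0);

    -- Either way, after processing remember how far we got so the
    -- next run only picks up new lines.
    UPDATE log
    SET    last_line = :lines_processed,
           date_run  = CURRENT_TIMESTAMP
    WHERE  id = :log_id;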

On the plus side, it's nice to be able to write SQL queries against my log file. Naturally, this was initially just a quick and easy hack. Regrettably, I now see other packages out on the 'net which do this. Sigh.
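
For instance, a query along these lines (the threshold and limit are just for illustration) pulls out the paths throwing the most server errors:

    SELECT   p.path,
             COUNT(*) AS errors
    FROM     request r
    JOIN     status  s ON s.id = r.status_id
    JOIN     path    p ON p.id = r.path_id
    WHERE    s.code >= 500
    GROUP BY p.path
    ORDER BY errors DESC
    LIMIT    10;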


Splunk?

melo on 2008-08-29T12:14:04

If you can afford it, Splunk looks good.

http://www.splunk.com/

If not, I like

http://search.cpan.org/dist/Log-Log4perl-Appender-Spread/

for log collection, and then use the hashed Spread groups to insert into multiple MyISAM tables in parallel, or on multiple servers.

If using a single server, a merge table on top gives you a nice global view for quick and dirty stuff. For more complex processing, you can use a simple map-reduce-style operation: query your multiple MySQL log databases in parallel and reduce their results into a final one.
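
For what it's worth, the merge table bit is only a couple of statements (table names here are invented, and the shards must be identical MyISAM tables):

    CREATE TABLE request_1 (
        id           INTEGER NOT NULL,
        request_date DATETIME,
        path         VARCHAR(500),
        status       INTEGER,
        KEY (request_date)
    ) ENGINE=MyISAM;

    CREATE TABLE request_2 LIKE request_1;

    -- Read-only view over both shards for the quick and dirty queries.
    CREATE TABLE request_all (
        id           INTEGER NOT NULL,
        request_date DATETIME,
        path         VARCHAR(500),
        status       INTEGER,
        KEY (request_date)
    ) ENGINE=MERGE UNION=(request_1, request_2) INSERT_METHOD=NO;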

Best regards,