No, that's not its name, and I've no idea what to call it, much less how to release it. But after some serious production problems in the past couple of days, and a lack of the obvious tools we needed, I got fed up and started ripping apart log files and shoving them into an SQLite database. Here's the schema:
    CREATE TABLE status (
        id      INTEGER PRIMARY KEY,
        code    INTEGER NOT NULL,
        message VARCHAR(30) NOT NULL
    );

    CREATE TABLE path (
        id   INTEGER PRIMARY KEY,
        path VARCHAR(500) NOT NULL
    );

    CREATE TABLE client (
        id    INTEGER PRIMARY KEY,
        host  VARCHAR(255) NOT NULL,
        agent VARCHAR(255)
    );

    CREATE TABLE log (
        id        INTEGER PRIMARY KEY,
        name      VARCHAR(255) NOT NULL,
        md5_hex   VARCHAR(32) NOT NULL,
        date_run  DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
        last_line INTEGER NOT NULL
    );

    CREATE TABLE request (
        id             INTEGER PRIMARY KEY,
        request_date   DATE,
        response_time  INTEGER,
        method         VARCHAR(6),
        protocol       VARCHAR(10),
        bytes_received INTEGER,
        path_id        INTEGER,
        remote_user    VARCHAR(255),
        referer        VARCHAR(255),
        status_id      INTEGER,
        client_id      INTEGER,
        log_id         INTEGER
    );
Of course, this being SQLite, we don't have real foreign key constraints, and the schema is a bit of a hack. I could write triggers to fake the foreign keys, but at fewer than 100 inserts per second (and yes, I'm using one big transaction), a 2 million line log file can already take almost 6 hours to process, and those triggers would only make it worse. I'm probably going to have to switch to MySQL (maybe MyISAM tables, since this is really read-only).
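For the curious, here's roughly what faking one of those foreign keys with a trigger would look like. This is a minimal sketch via DBI, not code from the tool itself; the trigger name is made up, but the tables come from the schema above:

    use DBI;

    # Connect to the log database (the file name here is hypothetical).
    my $dbh = DBI->connect( 'dbi:SQLite:dbname=logs.db', '', '',
        { RaiseError => 1 } );

    # Reject any insert into request whose status_id does not point at
    # an existing status row -- a hand-rolled foreign key check.
    $dbh->do(q{
        CREATE TRIGGER fk_request_status_id
        BEFORE INSERT ON request
        FOR EACH ROW
        WHEN NEW.status_id IS NOT NULL
             AND NOT EXISTS (SELECT 1 FROM status WHERE id = NEW.status_id)
        BEGIN
            SELECT RAISE(ABORT, 'request.status_id violates foreign key');
        END;
    });

Multiply that by every foreign key column, and by UPDATE as well as INSERT, and you can see why I don't relish the extra cost per row.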
One problem I faced was how to prevent someone from parsing the same log file more than once. Well, you'll note that we have log.md5_hex up there. A log file to be processed must have at least five lines or else it's rejected. Otherwise, we calculate the MD5 hex digest of the first five lines and use that as a signature. We also track the last line number processed, so throw a log file at it which it's already seen and it will only process the new lines.
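Here's roughly how that check works; a minimal sketch, assuming the $dbh from above (the subroutine name is mine, not something shipped with the tool):

    use strict;
    use warnings;
    use Digest::MD5 qw(md5_hex);

    # MD5 the first five lines as a signature, then look up how far we
    # got last time. md5_hex and last_line are columns of the log table.
    sub signature_and_resume_point {
        my ( $dbh, $file ) = @_;

        open my $fh, '<', $file or die "Cannot open $file: $!";
        my @lines;
        while ( defined( my $line = <$fh> ) ) {
            push @lines, $line;
            last if @lines == 5;
        }
        close $fh;
        die "$file has fewer than five lines: rejected\n" if @lines < 5;

        my $signature = md5_hex(@lines);
        my ($last_line) = $dbh->selectrow_array(
            'SELECT last_line FROM log WHERE md5_hex = ?',
            undef, $signature,
        );

        # Resume after the last processed line (0 for an unseen file).
        return ( $signature, $last_line || 0 );
    }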
On the plus side, it's nice to be able to write SQL queries against my log file. Naturally, this was initially just a quick and easy hack. Regrettably, I now see other packages out on the 'net which do this. Sigh.
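To show what I mean by querying the logs, here's an illustrative query against the schema above (again, not code from the tool) that pulls the ten most-requested paths which returned a 404:

    # Ten most-requested paths that 404ed; the column names are from
    # the schema above.
    my $rows = $dbh->selectall_arrayref(q{
        SELECT   p.path, COUNT(*) AS hits
        FROM     request r
        JOIN     path    p ON p.id = r.path_id
        JOIN     status  s ON s.id = r.status_id
        WHERE    s.code = 404
        GROUP BY p.path
        ORDER BY hits DESC
        LIMIT    10
    });
    printf "%6d  %s\n", $_->[1], $_->[0] for @$rows;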
If you can afford it, Splunk looks good.

If not, I like Log::Log4perl::Appender::Spread
(http://search.cpan.org/dist/Log-Log4perl-Appender-Spread/)
for log collection, and then using the hashed spread groups to insert into multiple MyISAM tables in parallel, or on multiple servers.
If using a single server, a MERGE table on top gives you a nice global view for quick-and-dirty stuff. For more complex processing, you can use a simple map-reduce-style operation: query your multiple MySQL log databases in parallel and reduce their results into the final one.
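To sketch the single-server setup (my own illustration, with made-up table names, assuming a DBI handle connected to MySQL):

    # Per-writer MyISAM tables, with a MERGE table on top for the
    # global view. Each parallel writer inserts into its own request_N.
    my $columns = q{( id INT, request_date DATE,
                      path VARCHAR(500), status INT )};

    $dbh->do("CREATE TABLE request_$_ $columns ENGINE=MyISAM") for 0 .. 3;

    # The MERGE table must have the same columns as the tables it unions.
    $dbh->do("CREATE TABLE request_all $columns
              ENGINE=MERGE UNION=(request_0, request_1, request_2, request_3)
              INSERT_METHOD=NO");

    # Quick-and-dirty global queries then go against request_all.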
Best regards,