So you load a 10000 record file. On record 9990, an error is detected. Thankfully, Oracle throws an exception, and all your work is rolled back. Otherwise, you'd have a real dickens of a time starting over when you need to reload the same file.
Except that your DBA just called you, as mine did, and said, "You're affecting performance by not committing frequently! You need to commit every 1000 records or so!" That's great, but the purpose of transaction processing is so that all my records can go in as a unit. I want them all to go in at once, or not at all.
This is not the database I regularly work on, nor the DBA I regularly work with. As a matter of fact, the real problem, to me, seems to be that this database is extraordinarily slow. The regular database I work on would load all 10,000 records in a heartbeat and not even shrug. Over there, loading 24 hours' worth of data took 24 hours; on this database instance, loading 24 hours' worth of data seems to take more than 24 hours...
So what do I do? If I commit before the transaction is done, I lose the whole point of transaction processing. Isn't the database supposed to handle this kind of crap on its own? How do I appease the DBA without losing the fundamental point of why I need it to work this way?
However, the natural business-logic answer and operational efficiency are different things. Having each file be restartable is quite desirable at the operations level, but keeping any pending transaction from growing too large is also a very real operational issue for the DBAs. Your debate with the DBA is strictly in operational space; even if there were a natural business answer, the operational issues might trump it anyway. This is, however, a solvable conundrum.
PLAN B -- Well, maybe you could use Perl or *nix split(1) to split the input file into subfiles of 1,000 records and process them in sequence, marking each one as completed when committed. Then if you need to restart, you can skip the done ones and pick up at the 1,000-line break to which the DB rolled back.
This requires only wrapping your existing job with a splitter/re-sequencer script.
$ split --lines=1000 inputfile pieces.
$ ls piece*
pieces.aa pieces.ab pieces.ac pieces.ad pieces.ae
...
# now process pieces.?? in sequence, commit each.
# = pieces.a[a-j] for your 10k records...
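The wrapper itself can be a few lines of Perl; here's a minimal sketch, where load_job stands in for whatever your existing load-and-commit program is, and the ".done" marker files are just an assumed convention for remembering which pieces are already in:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical driver: run the existing loader once per piece, in order,
    # leaving a ".done" marker so a restart skips pieces already committed.
    for my $piece (sort glob 'pieces.??') {
        my $marker = "$piece.done";
        next if -e $marker;                  # already loaded on a previous run

        system('load_job', $piece) == 0      # stand-in for your existing loader
            or die "load of $piece failed: $?";

        open my $fh, '>', $marker or die "cannot mark $piece done: $!";
        close $fh;
    }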
PLAN C -- Alternatively, you include in the transaction an update to a side table or status table indicating the last file and line processed. Upon restart, that's where you resume. You probably already have a table listing files received, with a status column.
This is somewhat more invasive, changing the schema and the basic logic, but probably well worth it.
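As a rough Perl/DBI sketch of Plan C (the table, column names, and pipe delimiter are placeholders for whatever your load actually uses, and it assumes the files-received table already has a row for this file), the trick is that the progress update rides inside the same transaction as the inserts, so commit and bookmark can never disagree:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:Oracle:mydb', 'user', 'pass',
                           { RaiseError => 1, AutoCommit => 0 });

    my $file = 'inputfile';
    # Assumes load_status already has a row for this file (created on receipt).
    my ($last_line) = $dbh->selectrow_array(
        'SELECT last_line FROM load_status WHERE filename = ?', undef, $file);
    $last_line ||= 0;

    my $ins = $dbh->prepare('INSERT INTO target_table (col1, col2) VALUES (?, ?)');
    my $upd = $dbh->prepare('UPDATE load_status SET last_line = ? WHERE filename = ?');

    open my $fh, '<', $file or die "cannot open $file: $!";
    my $line_no = 0;
    while (my $line = <$fh>) {
        $line_no++;
        next if $line_no <= $last_line;          # already loaded on a prior run
        chomp $line;
        my ($col1, $col2) = split /\|/, $line;   # assumed pipe-delimited
        $ins->execute($col1, $col2);

        if ($line_no % 1000 == 0) {
            $upd->execute($line_no, $file);      # bookmark travels in the same txn
            $dbh->commit;
        }
    }
    $upd->execute($line_no, $file);
    $dbh->commit;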
PLAN F -- The obvious answer of having the SQL avoid inserting records already in existence is, of course, so very, very wrong that I hope I don't need to explain why.
(Hint: speed and/or race conditions)
Your instinct to ask the DBA why this DB can't handle a transaction of 10k records when other DBs can is of course a good question to raise. The DBA may have a good answer, though -- if the DBA is any good, they're paid to have a good answer to that question, one we outsiders do not want to have to understand. If the DB in question is optimized for on-line usage, it may have internal structures that make queueing a huge transaction awkward. Or it may be running on junk hardware, or an older rev of the DB, or whatever. If there's a real answer, be prepared to fall back to Plan B or Plan C above (split).
If your business requires you to accept all valid records immediately and re-try only the invalids, you'd not roll back on the invalid record, but commit the 999 good records in the batch and write the one invalid record to an exception file that could be reprocessed after repair. This is stored-proc ugliness.
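Purely to illustrate that pattern (the ugliness really belongs in a stored proc, as above; this is just a client-side Perl sketch with made-up table, column, and file names): a failing insert gets parked in the exception file while the good rows in the batch still get committed.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:Oracle:mydb', 'user', 'pass',
                           { RaiseError => 1, AutoCommit => 0 });
    my $ins = $dbh->prepare('INSERT INTO target_table (col1, col2) VALUES (?, ?)');

    open my $in,  '<', 'inputfile'      or die "cannot read inputfile: $!";
    open my $bad, '>', 'exceptions.txt' or die "cannot write exceptions.txt: $!";

    my $n = 0;
    while (my $line = <$in>) {
        chomp $line;
        my ($col1, $col2) = split /\|/, $line;   # assumed pipe-delimited
        eval { $ins->execute($col1, $col2); 1 } or do {
            print {$bad} "$line\n";              # park the bad record for repair
            next;
        };
        $dbh->commit if ++$n % 1000 == 0;        # the periodic commit the DBA wants
    }
    $dbh->commit;
    close $bad;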
Re:__
jdavidb on 2005-12-15T15:48:23
Thanks. I think in addition to the plans n1vux offered, I needed that validation to know I'm not being completely unreasonable.
I did do some checking after my post and there are some 40K-record files I am loading. But that's not even a difference of an order of magnitude.