Google woes

Beatnik on 2005-05-30T15:39:52

Our all-time favorite search engine has a nice feature that lets you read PDF files (among others) as HTML. However, that feature could use a little tweaking here and there.

Re:

Aristotle on 2005-05-30T22:56:45

Not Google’s fault. PostScript and PDF are designed so that the only guarantee you have is that rendering them will yield the exact same result, wherever it happens. Any other operation, such as trying to extract text, is not reliably possible.
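
To make that concrete, here is a rough sketch in Python. It is a toy model, not a real PDF parser: it assumes a page is just a list of text fragments with coordinates, stored in whatever order the generating program emitted them. The names `fragments`, `naive_extract` and `heuristic_extract` are made up for illustration.

```python
# Toy model of a page: (x, y, text) fragments in file order.
# The order in the file need not match the reading order on the page.
fragments = [
    (300, 700, "world"),
    (72, 680, "Second line"),
    (72, 700, "Hello"),
]


def naive_extract(frags):
    """Concatenate the text in file order, ignoring positions."""
    return " ".join(text for _, _, text in frags)


def heuristic_extract(frags):
    """Guess the reading order: top to bottom, then left to right."""
    lines = {}
    for x, y, text in frags:
        lines.setdefault(y, []).append((x, text))
    return "\n".join(
        " ".join(text for _, text in sorted(parts))
        for _, parts in sorted(lines.items(), reverse=True)
    )


print(naive_extract(fragments))
print(heuristic_extract(fragments))
```

Both versions render identically on screen, but only the second recovers sensible text, and it does so by guessing. Columns, rotated text or slightly misaligned baselines would defeat this particular guess.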

It works in a usefully large number of cases because documents are generally machine-generated by low-complexity processes. Think of how easy it is to scrape information out of HTML pages with regexes when they were all generated by a script that populates tables from database queries. Such a scraper works, but it’s by no means a reliable HTML parser, and neither is Google a reliable PDF extractor. The difference is that, unlike with HTML, there is no way to extract text from PDF reliably. That’s why all shrink-wrap PDF extraction software is actually OCR stuff.
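
The HTML analogy can be sketched in a few lines of Python. This is only an illustration: the markup, the `ROW` pattern and the `scrape_rows` helper are invented for the example, assuming a generator script that always emits rows in exactly one shape.

```python
import re

# Matches only the exact row shape the hypothetical generator emits.
ROW = re.compile(r"<tr><td>(.*?)</td><td>(.*?)</td></tr>")


def scrape_rows(html):
    """Return (name, count) pairs for every row matching the known shape."""
    return ROW.findall(html)


# Output of the imagined script: uniform, no attributes, no whitespace.
generated = (
    "<table><tr><td>apple</td><td>3</td></tr>"
    "<tr><td>pear</td><td>5</td></tr></table>"
)

# Equally valid HTML with the same kind of data, written differently.
hand_written = '<table><tr class="odd">\n<td>apple</td> <td>3</td></tr></table>'

print(scrape_rows(generated))
print(scrape_rows(hand_written))
```

The scraper pulls both rows out of the generated page and nothing at all out of the hand-written one, which is the sense in which it works without being a parser.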