My favourite question at job interviews is this. It is not suitable for every situation, but when a candidate says that he is well versed in both Perl and databases, I ask it.
Imagine that you are building a search engine (yet another Google, yes): you have billions of scanned pages and an inverted index. The structure of the index is straightforward: a table of triples ($word_id, $page_id, $number_of_occurrences). Both identifiers point into tables that contain every known word and every known URI, and the number of occurrences tells how many times a word appeared on that particular page.
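To make the structure concrete, here is a tiny in-memory sketch in Perl; the ids, URIs and counts are invented purely for illustration:

# dictionaries of known words and known URIs
my %word_id  = ( putin => 1, bush => 2 );
my %page_uri = ( 100 => 'http://example.org/a', 101 => 'http://example.org/b' );
# the inverted index itself: triples of (word_id, page_id, occurrences)
my @inverted_index = (
    [ 1, 100, 7 ],    # "putin" appears 7 times on page 100
    [ 2, 100, 3 ],    # "bush" appears 3 times on page 100
    [ 1, 101, 2 ],    # "putin" appears 2 times on page 101
);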
The question itself is how to display the 10 most relevant pages for the query "Putin -Bush", where the minus sign indicates that the word must not appear on a page. The search result should also include the total number of matching pages.
Each word queried on its own returns millions of pages, so you cannot hold the matching index entries in memory, and of course you cannot do any calculations on two such lists kept fully in memory.
Can we assume that the database is capable of doing a join like this?
SELECT with1.page_id
, with1.number_of_occurrences
FROM reverse_index with1
LEFT JOIN reverse_index without1
ON without1.word_id = ?          -- id of the excluded word
AND without1.page_id = with1.page_id
WHERE with1.word_id = ?          -- id of the required word
AND without1.page_id IS NULL
ORDER BY with1.number_of_occurrences DESC
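For what it's worth, running that query from Perl with DBI might look like the sketch below; the handle $dbh, the word ids, and cutting the result off at 10 rows are my assumptions on top of the query itself:

# assumes an open DBI handle $dbh and word ids already looked up in the dictionary
my $sth = $dbh->prepare(q{
    SELECT with1.page_id, with1.number_of_occurrences
    FROM reverse_index with1
    LEFT JOIN reverse_index without1
      ON without1.word_id = ?
     AND without1.page_id = with1.page_id
    WHERE with1.word_id = ?
      AND without1.page_id IS NULL
    ORDER BY with1.number_of_occurrences DESC
});
$sth->execute($bush_id, $putin_id);   # first placeholder is the excluded word
my @top;
while (my $row = $sth->fetchrow_arrayref) {
    push @top, [ @$row ];             # copy, fetchrow_arrayref reuses its buffer
    last if @top == 10;               # we only display the ten best pages
}
$sth->finish;

The total number of matching pages would still need a separate COUNT(*) over the same join, and that count is exactly the part that stays expensive no matter how early you stop fetching rows.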
If we are forced to assume that this can't work, then the simplest way to proceed is to sort each result set by page_id with an external merge-sort (the data doesn't have to fit in memory), then do a "merge-filter" that drops every page which also contains the excluded word, then do another sort to get the survivors into the order you want.
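A rough Perl sketch of that "merge-filter" step, assuming both result sets are already sorted by page_id and arrive as iterators that return one [page_id, occurrences] pair per call (the iterator interface is my own invention for illustration):

# $with and $without are code refs; each call returns the next
# [ $page_id, $occurrences ] pair in ascending page_id order, or undef at EOF.
sub merge_filter {
    my ($with, $without, $emit) = @_;
    my $skip = $without->();
    while (defined( my $row = $with->() )) {
        # advance the "excluded word" stream until it catches up with this page
        $skip = $without->() while defined $skip && $skip->[0] < $row->[0];
        # keep the page only if the excluded word never appears on it
        $emit->($row) unless defined $skip && $skip->[0] == $row->[0];
    }
}

The survivors can be streamed straight to disk through $emit and fed into the final external sort by number_of_occurrences, so no step ever needs the full list in memory.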
If you have lots of machines to throw at this (like Google does) then each of these steps can be set up as a parallel algorithm, and the whole algorithm, excepting a final gathering of data at the last step, can run in parallel. But even that final gathering can work off of partially summarized data, with the result that your response time can be kept bounded.
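One way to picture that final gathering step, assuming each shard has already produced its local count and its local top 10 (the per-shard data layout here is my own assumption): the global top 10 must lie inside the union of the per-shard top 10 lists, so the coordinator only ever merges a handful of small summaries.

# each shard reports: { count => $n, top => [ [ $page_id, $occurrences ], ... ] }
sub gather {
    my @shards = @_;
    my $total  = 0;
    my @candidates;
    for my $s (@shards) {
        $total += $s->{count};                 # exact global count from partial counts
        push @candidates, @{ $s->{top} };      # at most 10 candidate rows per shard
    }
    my @top10 = sort { $b->[1] <=> $a->[1] } @candidates;
    splice @top10, 10 if @top10 > 10;          # keep only the ten best overall
    return ( $total, \@top10 );
}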
All good stuff to know, but most people won't run into a situation where you can't just shove it down to the database and let the database take care of it.
Re:How beefy is the database?
andy.sh on 2008-08-04T18:26:57
Yes, there is a Google paper about the MapReduce algorithm, which could be used in this situation.
People whom I asked usually forgot that even to display the top 10 links you cannot get away with a couple of queries with LIMIT 10 (or 20), one to select the pages with the most occurrences of word1 and then a grep to exclude those that contain word2: in this case you can end up running an unbounded number of queries.
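The trap looks something like this naive sketch; $dbh, the word ids, and the helper page_contains() are all hypothetical, and nothing bounds how many round trips the loop makes, because every page in a batch may turn out to contain the excluded word:

my @good;
my $offset = 0;
while (@good < 10) {
    # fetch the next 20 best pages for the required word
    my $batch = $dbh->selectall_arrayref(q{
        SELECT page_id, number_of_occurrences
        FROM reverse_index
        WHERE word_id = ?
        ORDER BY number_of_occurrences DESC
        LIMIT 20 OFFSET ?
    }, undef, $putin_id, $offset);
    last unless @$batch;    # ran out of pages entirely
    $offset += 20;
    # page_contains() is a hypothetical check for the excluded word on a page;
    # grep away every page that mentions it...
    push @good, grep { !page_contains($_->[0], $bush_id) } @$batch;
    # ...and possibly loop again, and again, and again
}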
Re:How beefy is the database?
btilly on 2008-08-04T18:48:13
Heh, that reminds me of an interesting challenge I faced once.
I was creating a basic desktop query tool which would be used to summarize how much money had been spent in, say, a particular area on a particular kind of product. Products were categorized at various levels. For example a specific glass beaker would be classified as the exact item, as a glass beaker, as a laboratory supply made out of glass, and under laboratory supplies.
I had two interesting requirements. The first is that the database shipped to people in a particular region should not have specific order data for other regions. The second is that I needed to support various reports of the form, "Show how much of X was purchased locally and across the whole company over time period Y," where X could be anything from a specific product to, say, any laboratory supply.
How did I do it? (Note: I did not just pre-can a set of reports. There were simply far too many reports that could have been run to pre-can them all.)