I have been interviewing people for various positions related to web development. It is not exactly a routine process, but over time I have collected a set of favourite questions that I ask often. Yesterday I added one more question to it. Here it is.
Suppose that you have a website with deeply nested sections. Each section has its own part of the URL between slashes and may be a subsection of a higher-level section. There is no depth limit, and there are no restrictions on duplicating section names. Thus we may face URLs like /alpha/, /alpha/beta/, or /alpha/beta/gamma/, and also /alpha/gamma/beta/, which is a separate leaf of the site tree.
Every section's URL part is stored in the database in a table with three fields: the ID of the section, its URL part, and the ID of its parent (if any), like this: (1, 'alpha', 0), (2, 'beta', 1), ..., (N, 'beta', M). The last 'beta' rows are different sections that share the same piece of the URL.
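For concreteness, such a table might look like this (a sketch; the table and column names are illustrative):

CREATE TABLE sections (
    id        INTEGER PRIMARY KEY,
    uri       VARCHAR(255) NOT NULL,      -- one path component, no slashes
    parent_id INTEGER NOT NULL DEFAULT 0  -- 0 means a top-level section
);

INSERT INTO sections (id, uri, parent_id) VALUES
    (1, 'alpha', 0),
    (2, 'beta',  1),
    (3, 'gamma', 2),
    (4, 'gamma', 1),
    (5, 'beta',  4);   -- a different 'beta', reachable as /alpha/gamma/beta/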
The task is to propose a fast algorithm for parsing the URL and obtaining the corresponding section ID. The IDs of the parents are not needed, nor is the depth level. Just find the ID of the section described by a URL. Note that storing full URLs with slashes in the database is prohibited.
Your algorithm is already forced on you by your data structure. You've left yourself little choice except the obvious split followed by querying each piece in turn in a loop. Your time will be dominated by round trips to the database. The two ways to improve that are to avoid the round trips to the database, e.g. by caching information locally (memcached comes to mind), or to keep the round trips inside the database (you can do this with a stored procedure).
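A minimal sketch of that obvious split-and-loop approach, assuming Perl with DBI and the sections table sketched in the question (the DSN is a placeholder):

use DBI;

# Placeholder connection; swap in whatever DSN the site actually uses.
my $dbh = DBI->connect('dbi:SQLite:dbname=site.db', '', '', { RaiseError => 1 });

# Split the URL and look up each part in turn, carrying the parent ID along.
sub section_id_for {
    my ($url) = @_;
    my @parts = grep { length } split m{/}, $url;
    return undef unless @parts;

    my $sth = $dbh->prepare(
        'SELECT id FROM sections WHERE uri = ? AND parent_id = ?'
    );
    my $parent_id = 0;                    # top-level sections have parent 0
    for my $part (@parts) {               # one round trip per path component
        $sth->execute($part, $parent_id);
        my ($id) = $sth->fetchrow_array;
        $sth->finish;
        return undef unless defined $id;  # broken path, i.e. a 404
        $parent_id = $id;
    }
    return $parent_id;                    # ID of the deepest section
}

# my $id = section_id_for('/alpha/beta/gamma/');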
Assuming that the paths are relatively static, I would suggest caching. That is because your scalability limit is the database capacity. So when you can offload work from the database, your scalability improves.
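As an illustration, a hedged sketch of such a cache in front of the lookup above, assuming Cache::Memcached (the key format, the 'MISS' sentinel and the expiry time are arbitrary choices):

use Cache::Memcached;

my $cache = Cache::Memcached->new({ servers => ['127.0.0.1:11211'] });

# Cache resolved URLs, including misses, so bad URLs do not hammer the DB.
sub cached_section_id {
    my ($url) = @_;
    my $key = "section_id:$url";

    my $cached = $cache->get($key);
    return $cached eq 'MISS' ? undef : $cached if defined $cached;

    my $id = section_id_for($url);                       # falls back to the DB
    $cache->set($key, defined $id ? $id : 'MISS', 600);  # 10-minute TTL
    return $id;
}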
If you want to do something smarter, then you'll need to pick a different data structure. The obvious one is to store the full path, but you've already banned that. Joe Celko has a number of articles describing the benefits of a nested set solution. That solution has no real advantage for finding a single URL. But if you ever want to display the whole tree, then it has a lot of benefits and is worth looking into.
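For reference, a Celko-style nested set stores left/right boundaries per node instead of a parent pointer, so an entire subtree comes back in a single query (a sketch with illustrative names):

-- Each section carries the boundaries (lft, rgt) of its subtree.
CREATE TABLE sections_ns (
    id  INTEGER PRIMARY KEY,
    uri VARCHAR(255) NOT NULL,
    lft INTEGER NOT NULL,
    rgt INTEGER NOT NULL
);

-- The whole subtree under section 1, in tree order:
SELECT child.id, child.uri
FROM sections_ns AS parent, sections_ns AS child
WHERE child.lft BETWEEN parent.lft AND parent.rgt
  AND parent.id = 1
ORDER BY child.lft;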
Anyways this feels too simple to me, so I'm sure there is some clever twist that you're aiming for. I don't know what it is, but unless you're going to provide some big hints, you're going to be filtering for people who have seen the trick you're interested in. My experience suggests that questions that look for such tricks are usually useless interview questions. (Of course you can always use it to make decisions. But a correct versus an incorrect answer will tell you little about how good the employee will be once hired.)
Re:What am I missing?
andy.sh on 2008-08-04T12:11:21
I expect to hear at least two things: 1) repeated lookups of correct URLs should be resolved without recursive algorithms (even if the DB is hit for every request), and 2) results must be cached to avoid DB calls for URLs that look well-formed but lead to a 404 page.
There is no single correct answer, so both correct and incorrect answers are acceptable. The main goal is to see whether the person can think about the bottlenecks of his or her algorithm.
Re:What am I missing?
btilly on 2008-08-04T14:34:46
The "should be resolved without recursive algorithms" sounds to me like you believe some myth about performance.
The database doesn't care what the code looks like that makes the database calls, it just cares what the calls are. So it can't even see that the code is recursive.
You can see the recursion in the code that runs on the webservers. There the performance differences are marginal, and your webservers should not be a bottleneck since you can always just add more webservers.
Therefore the dominating concern should be code clarity and quality. In this case the code is simpler when it is written in an iterative fashion, so you should write it that way. There are some cases where it is simpler written recursively; in those you should write it recursively. An example would be a case where one lookup might return something telling you to do a different kind of lookup, which might or might not do the same. (Of course you should think twice before setting up a data structure that does that in a CRUD application - you're probably over-complicating something.)
If you want a better example to demonstrate that people can understand basic bottlenecks of their algorithms, try setting up an unnecessary double loop and see whether they can understand the performance repercussions. Bonus points if you can have the inner loop at a different place in the code than the outer loop. Double bonus points if the inner loop is issuing a database query, therefore keeping the database from doing an efficient join.
That would mirror the most common performance mistakes that I've run into in the real world.
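To illustrate with invented table names (a pages table hanging off sections), the first version issues one query per row, while the second lets the database do the join in a single round trip:

# Anti-pattern: the inner "loop" is a query run once per section,
# so the database never gets a chance to do the join itself.
my $sections = $dbh->selectall_arrayref(
    'SELECT id, uri FROM sections', { Slice => {} }
);
for my $s (@$sections) {
    my ($cnt) = $dbh->selectrow_array(
        'SELECT COUNT(*) FROM pages WHERE section_id = ?', undef, $s->{id}
    );
    print "$s->{uri}: $cnt pages\n";
}

# Better: one query and one round trip; the database joins efficiently.
my $rows = $dbh->selectall_arrayref(
    'SELECT s.id, s.uri, COUNT(p.id) AS cnt
       FROM sections s LEFT JOIN pages p ON p.section_id = s.id
      GROUP BY s.id, s.uri',
    { Slice => {} }
);
print "$_->{uri}: $_->{cnt} pages\n" for @$rows;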
Re:What am I missing?
andy.sh on 2008-08-04T18:12:48
Maybe I did not explain my thought clearly, but I am not against recursion in general; many things look easy in the form of recursion :-)

What about /url/parsing/mechanism? Then only one loop (no recursion) is needed to get the last ID:

for (@uri_part) { select id ... where uri = $_ and parent_id = $prev_id }

But for common websites (where there are only a few sections with duplicated names) it is even easier not to make one query per level. It might be faster to select the data for all URI parts in one query using IN:

select id, parent_id from t where uri in ('alpha', 'beta', 'gamma')

and later find the path to the last section's ID in Perl.

BTW, the solution I currently use myself is to prepare a hash of full URIs and keep it in memory, so that I can get the answer directly by fetching $id{$given_uri}.
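Returning to the IN variant: a sketch of the single query followed by the path walk in Perl (same sections table as above; the helper name and the hash key format are my own choices):

sub section_id_via_in {
    my ($url) = @_;
    my @parts = grep { length } split m{/}, $url;
    return undef unless @parts;

    # One query fetches every row whose uri matches any part of the path.
    my $placeholders = join ', ', ('?') x @parts;
    my $rows = $dbh->selectall_arrayref(
        "SELECT id, uri, parent_id FROM sections WHERE uri IN ($placeholders)",
        { Slice => {} }, @parts
    );

    # Index the candidates by (parent_id, uri), then walk the path in memory;
    # this keeps same-named sections under different parents apart.
    my %id_for;
    $id_for{"$_->{parent_id}/$_->{uri}"} = $_->{id} for @$rows;

    my $parent_id = 0;
    for my $part (@parts) {
        $parent_id = $id_for{"$parent_id/$part"};
        return undef unless defined $parent_id;   # structurally wrong path
    }
    return $parent_id;
}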
One option would be to build SQL for a self-join at run-time.
So
select f2.id
from foo f1, foo f2
where f1.parent = 0 and f1.name = 'alpha'
and f2.parent = f1.id and f2.name = 'beta'
This won't work if the path can have extra information after the structure, e.g. ...
If the tree can be deeply nested, I'd go for a different data structure.
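A hedged sketch of generating that self-join at run time for any number of path components (same foo table as above; the helper name is invented). The query, and the work the database has to do, grows with every extra level of nesting:

# Build one self-join per path component; the names are bound as placeholders.
sub build_self_join {
    my @parts = @_;
    my (@from, @where);
    for my $i (0 .. $#parts) {
        push @from, "foo f$i";
        push @where, $i == 0
            ? 'f0.parent = 0 AND f0.name = ?'
            : sprintf('f%d.parent = f%d.id AND f%d.name = ?', $i, $i - 1, $i);
    }
    my $sql = sprintf 'SELECT f%d.id FROM %s WHERE %s',
        $#parts, join(', ', @from), join(' AND ', @where);
    return ($sql, @parts);
}

# my ($sql, @bind) = build_self_join('alpha', 'beta', 'gamma');
# my ($id) = $dbh->selectrow_array($sql, undef, @bind);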
Re:self-join
andy.sh on 2008-08-04T12:13:00
Yes! :-) When you have three, four or more nested parts you end up with a very complicated SQL query, which also makes the site an easy target for DDoS attacks with bogus URLs.