Dear Log,
I spent all day poking at writing a chapter on spiders.
I think there are roughly four kinds of LWP programs:
I can imagine writing a program that does the third (a single-site spider), and indeed I'm doing so for the chapter, and I think it'll comprise the meat of the chapter, showing off LWP::RobotUA.
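Here's roughly the shape of the thing I have in mind, as a sketch and nothing more (the site root http://example.com/, the agent name, and the From address are placeholders, and the chapter may well extract links differently than with HTML::LinkExtor):

    use strict;
    use warnings;
    use LWP::RobotUA;
    use HTML::LinkExtor;
    use URI;

    # Polite robot UA: identifies itself and sleeps between requests.
    my $ua = LWP::RobotUA->new(
        agent => 'my-site-spider/0.1',    # placeholder name
        from  => 'me@example.com',        # placeholder address
    );
    $ua->delay(1);    # minutes to wait between requests to one host

    my $root  = URI->new('http://example.com/');   # placeholder site
    my @queue = ($root);
    my %seen  = ($root => 1);

    while (my $url = shift @queue) {
        my $resp = $ua->get($url);
        next unless $resp->is_success
            and $resp->content_type eq 'text/html';

        # Pull links out of the page, resolved against its URL.
        my $extor = HTML::LinkExtor->new(undef, $url);
        $extor->parse($resp->decoded_content);
        for my $link ($extor->links) {
            my ($tag, %attr) = @$link;
            next unless $tag eq 'a' and $attr{href};
            my $abs = URI->new($attr{href})->canonical;
            $abs->fragment(undef);
            next unless $abs->scheme eq 'http'
                and $abs->host eq $root->host;   # the one constraint: stay on this site
            push @queue, $abs unless $seen{$abs}++;
        }
    }

And since it's LWP::RobotUA rather than plain LWP::UserAgent, it fetches and obeys each site's robots.txt on its own, which is most of the politeness for free.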
But it seems that many people want to write a program like the fourth -- a freely-traversing spider -- and for them I'm trying to muster something more useful than just saying "DON'T! [endchapter]".
After spending much of the day watching the cursor blink ON and OFF and ON and OFF, I think that that section will be "Don't, because... [fifty good reasons]".
Because just about everyone I know who admins a server has had some numbskull's useless, aimless spider come and hammer their server senseless for just no good reason. "I was just searching the whole web for pages about Duran Duran, is all! How was I supposed to know your host would contain an infinite URL-space in its events calendar site?"
It sounds like you need to handle the issue of senselessly hammering someone's webserver with the third part of the chapter. Perhaps all you need to do with the fourth part is outline what needs to be changed (as an exercise for the reader) and continue with the litany of reasons why you shouldn't do this unless you really know what you're doing?
Re:Third order spiders
darobin on 2002-03-16T16:44:04
Once you've got a link checker / site spider (type 3), how much work is it to make a general-purpose unconstrained spider (type 4)?
Actually, I think it's more the other way round. An unconstrained spider is easy to write; it's adding the constraints that's harder ;)
Re:Third order spiders
vsergu on 2002-03-16T18:24:51
Yes, adding the constraints is harder, but the one constraint that distinguishes type 3 from type 4 isn't hard at all. All you need to do is check that each URL starts with a particular prefix before you fetch it.
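Something like this sketch, say (the prefix, @found_urls, and the fetch() routine are just placeholders for whatever your spider already has):

    # Only follow URLs under one (placeholder) prefix.
    my $prefix = 'http://www.example.com/docs/';
    for my $url (@found_urls) {    # links gathered so far
        next unless substr($url, 0, length $prefix) eq $prefix;
        fetch($url);               # whatever fetch routine you're using
    }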
The more difficult constraints, such as avoiding infinite URL spaces and not hammering a server too hard, apply to type 3 just as much as type 4. But I guess you don't need to worry so much about them when you're running a type 3 program on a site that you control, since you'll be aware of any problems (or should be) and can stop the program when they come up.
A type 3 program that's running on a site you don't control, however, seems to be just as troublesome as a type 4 program, so I'm not sure the distinction is all that useful.
Re:Third order spiders
darobin on 2002-03-16T19:08:52
True. Not hammering a site is rather simple: just sleep correctly (using LWP::RobotUA). Avoiding infinite URL spaces, on the other hand, is hard and can probably only rely on heuristics (don't request more than n docs from server foo, don't request URLs that are longer than n chars, etc.) or on the owner of the site having set up a robots.txt that you can read with WWW::RobotRules.
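For the heuristics, a sketch along these lines might do (the 128-character and 200-documents-per-host limits are made-up numbers, the agent name is a placeholder, and LWP::RobotUA already handles the robots.txt and the sleeping for you; this just shows the WWW::RobotRules piece by hand):

    use strict;
    use warnings;
    use LWP::UserAgent;
    use WWW::RobotRules;
    use URI;

    my $agent_name = 'my-wandering-spider/0.1';    # placeholder
    my $rules      = WWW::RobotRules->new($agent_name);
    my $ua         = LWP::UserAgent->new(agent => $agent_name);

    my %considered_per_host;

    # Fetch and remember a site's robots.txt before touching the site.
    sub learn_robots_txt {
        my ($host) = @_;
        my $robots_url = "http://$host/robots.txt";
        my $resp = $ua->get($robots_url);
        $rules->parse($robots_url, $resp->decoded_content)
            if $resp->is_success;
    }

    # Heuristic filter: very long URLs and over-visited hosts smell
    # like an infinite URL space; robots.txt gets the final say.
    sub worth_fetching {
        my ($url) = @_;
        return 0 if length($url) > 128;                     # made-up limit
        my $uri = URI->new($url);
        return 0 unless ($uri->scheme || '') eq 'http';
        my $host = $uri->host;
        return 0 if $considered_per_host{$host}++ >= 200;   # made-up limit
        return $rules->allowed($url);
    }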