Arachnia!

TorgoX on 2002-03-16T09:37:42

Dear Log,

I spent all day poking at writing a chapter on spiders.

I think there are roughly four kinds of LWP programs:

  1. Programs that get one object off the web and process it (e.g., save it, whatever).
  2. Programs that get one object off the web, find everything it links to, and process (save, etc.) those.
  3. Programs that get a page, process it, look at everything it links to that's on the same host, and get and process all of those, and everything that they link to, and everything that those link to, recursively.
  4. Programs that get a page, process it, look at everything it links to wherever it is on the Web, and get and process all of those, and everything that they link to, and everything that those link to, recursively.

I constantly write programs that do the first two, but I don't generally call them "spiders". I generally reserve the term "spider" for the last two.
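
For concreteness, a minimal sketch of a type 2 program (the start URL is just a placeholder), using LWP::UserAgent to fetch one page and HTML::LinkExtor to pull out its links:

    #!/usr/bin/perl
    # "Type 2" sketch: get one page, find everything it links to, and
    # process those links (here, "process" just means print them).
    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTML::LinkExtor;
    use URI;

    my $start = 'http://www.example.com/';    # placeholder URL
    my $ua    = LWP::UserAgent->new;

    my $resp = $ua->get($start);
    die "Couldn't get $start: ", $resp->status_line, "\n"
      unless $resp->is_success;

    my @links;
    my $extor = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        push @links, $attr{href} if $tag eq 'a' and defined $attr{href};
    });
    $extor->parse($resp->content);

    # Resolve relative links against the page's base URL before processing.
    print URI->new_abs($_, $resp->base), "\n" for @links;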

I can imagine writing a program that does the third (a single-site spider), and indeed I'm doing so for the chapter, and I think it'll comprise the meat of the chapter, showing off LWP::RobotUA.
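
A bare-bones sketch of what such a single-site spider might look like (the start URL, agent name, and contact address are placeholders, and the "process the page" step is left open):

    #!/usr/bin/perl
    # Sketch of a "type 3" single-site spider built on LWP::RobotUA,
    # which obeys robots.txt and pauses between requests to the host.
    use strict;
    use warnings;
    use LWP::RobotUA;
    use HTML::LinkExtor;
    use URI;

    my $start  = 'http://www.example.com/';    # placeholder start page
    my $prefix = 'http://www.example.com/';    # stay on this site

    my $ua = LWP::RobotUA->new('MySpider/0.1', 'me@example.com');
    $ua->delay(1);    # wait at least 1 minute between requests to the host

    my %seen;
    my @queue = ($start);

    while (my $url = shift @queue) {
        next if $seen{$url}++;
        my $resp = $ua->get($url);
        next unless $resp->is_success
                and $resp->content_type eq 'text/html';

        # ... process the page here (save it, index it, check it, etc.) ...

        # Queue up every on-site link found in the page.
        my $extor = HTML::LinkExtor->new(sub {
            my ($tag, %attr) = @_;
            return unless $tag eq 'a' and defined $attr{href};
            my $abs = URI->new_abs($attr{href}, $resp->base)->canonical;
            $abs->fragment(undef);    # drop #fragment parts
            push @queue, $abs if $abs =~ /^\Q$prefix\E/;
        });
        $extor->parse($resp->content);
    }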

But it seems that many people want to write a program like the fourth -- a freely-traversing spider -- and for them I'm trying to muster something more useful than just saying "DON'T! [endchapter]".

After spending much of the day watching the cursor blink ON and OFF and ON and OFF, I think that section will be "Don't, because... [fifty good reasons]".

Because just about everyone I know who admins a server has had some numbskull's useless, aimless spider come and hammer their server senseless for no good reason. "I was just searching the whole web for pages about Duran Duran, is all! How was I supposed to know your host would contain an infinite URL-space in its events calendar site?"


    Third order spiders

    ziggy on 2002-03-16T15:00:16

    Once you've got a link checker / site spider (type 3), how much work is it to make a general-purpose unconstrained spider (type 4)? Not much.

    It sounds like you need to handle the issue of senselessly hammering someone's web server in the third part of the chapter. Perhaps all you need to do with the fourth part is outline what needs to be changed (as an exercise for the reader) and continue with the litany of reasons why you shouldn't do this unless you really know what you're doing?

    Re:Third order spiders

    darobin on 2002-03-16T16:44:04

    Once you've got a link checker / site spider (type 3), how much work is it to make a general-purpose unconstrained spider (type 4)?

    Actually, I think it's more the other way round. An unconstrained spider is easy to write; it's adding the constraints that's harder ;)

    Re:Third order spiders

    vsergu on 2002-03-16T18:24:51

    Yes, adding the constraints is harder, but the one constraint that distinguishes type 3 from type 4 isn't hard at all. All you need to do is check that each URL starts with a particular prefix before you fetch it.
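
    In code, that one constraint might be nothing more than this (the prefix is a placeholder; the main loop is whatever the spider already does):

        use strict;
        use warnings;

        my $prefix = 'http://www.example.com/';    # placeholder prefix

        # The constraint that turns an unconstrained (type 4) spider into
        # a single-site (type 3) one: skip anything off-site.
        sub on_our_site {
            my $url = shift;
            return $url =~ /^\Q$prefix\E/;
        }

        # ... then, in the spider's main loop:
        #     next unless on_our_site($url);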

    The more difficult constraints, such as avoiding infinite URL spaces and not hammering a server too hard, apply to type 3 just as much as type 4. But I guess you don't need to worry so much about them when you're running a type 3 program on a site that you control, since you'll be aware of any problems (or should be) and can stop the program when they come up.

    A type 3 program that's running on a site you don't control, however, seems to be just as troublesome as a type 4 program, so I'm not sure the distinction is all that useful.

    Re:Third order spiders

    darobin on 2002-03-16T19:08:52

    True. Not hammering a site is rather simple: just sleep correctly (using LWP::RobotUA). Avoiding infinite URL spaces, on the other hand, is hard and can probably only rely on heuristics (don't request more than n docs from server foo, don't request URLs that are longer than n chars, etc.) or on the owner of the site having set up a robots.txt that you can read with WWW::RobotRules.
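
    For what it's worth, a rough sketch of those heuristics plus robots.txt handling via WWW::RobotRules (the agent name, the limits, and the ok_to_fetch helper are all made up for illustration):

        use strict;
        use warnings;
        use LWP::UserAgent;
        use WWW::RobotRules;
        use URI;

        my $agent        = 'MySpider/0.1';    # made-up agent name
        my $max_per_host = 200;    # don't request more than n docs per server
        my $max_url_len  = 128;    # don't request URLs longer than n chars

        my $ua = LWP::UserAgent->new;
        $ua->agent($agent);
        my $rules = WWW::RobotRules->new($agent);
        my (%count, %seen_host);

        # Hypothetical helper: decide whether a URL is worth fetching at all.
        sub ok_to_fetch {
            my $url  = URI->new(shift);
            my $host = $url->host;

            return 0 if length("$url") > $max_url_len;
            return 0 if ($count{$host} || 0) >= $max_per_host;

            # Fetch and parse robots.txt the first time we see this host.
            unless ($seen_host{$host}++) {
                my $robots_url = "http://$host/robots.txt";
                my $resp = $ua->get($robots_url);
                $rules->parse($robots_url, $resp->content)
                  if $resp->is_success;
            }
            return 0 unless $rules->allowed("$url");

            $count{$host}++;
            return 1;
        }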