WWW::Search Module Advisory

hfb on 2002-03-01T16:39:18

hfb writes "Recently Google requested that a module be removed from CPAN as it violated their terms of use agreement. The author agreed to remove the module without a fuss, but those of you who have modules for other search engines, who are considering writing one or who write crawlers should be aware of this development. The search engines likely don't mind sensible and courteous crawlers but cannot abide the DoS-like crawling that happens with poorly written clients. WWW::Search namespace, this means you."


Agreement?

drhyde on 2002-03-01T19:14:33

Call me an idiot, but I'm buggered if I can see any such agreement on google's site. It ain't on the front page, and it ain't on the results page.

Re:Agreement?

krellis on 2002-03-01T19:19:54

Google Terms Of Service are linked from the bottom of the "All About Google" page. Took me a while to find it, too, since silly me didn't think to look at the bottom of that page.

Re:Agreement?

miyagawa on 2002-03-01T19:21:18

It is on http://www.google.com/terms_of_service.html

BTW, I have two modules related to Google on CPAN, but I've not got any requests from Google. (Though Apache::No404Proxy::Google has a disclaimer inside which points to the problem of violating the ToS, and how to get approval from Google.)

Re:Agreement -- Permission vs. Forgivness?

billybob on 2002-03-07T21:08:15

Any experience on anyone making a permission request to google?

My crawler is hopefully perceived to be very well behaved but this is sort of like -- "Do you want to bring attention to your crawler by asking?"

After reading their TOS and manually reviewing their robots.txt, I'm left with no reason to believe they would be reasonable.

Guns don't kill people, people kill people

hossman on 2002-03-01T22:32:17

Assuming the issue here was the "Personal Use Only" and "No Automated Querying" sections, I really don't see how the mere existence of a module can violate Google's TOS (unless there's some automated test case that they take issue with).

As long as the module CAN be used in accordance with their TOS (i.e., as long as I can use it to write a script which I use for personal use), then the module itself is not in violation. If they don't like the way some asshole is using the module, they should go after the asshole.

I can write a meta-searching site that violates their TOS using nothing but Apache, /bin/sh, and lynx -- does that mean lynx violates their TOS and should be pulled from circulation?

(Admittedly, I don't know the details of what module was pulled. Maybe it was called Apache::SearchGoogleWithYourOwnAds and did all the work for you to create a proxy to Google that showed their results with your ads -- but I doubt it could have been that bad.)

Re:Guns don't kill people, people kill people

hfb on 2002-03-02T17:29:36

I don't think they would have asked for its removal had it not been a pressing issue. I get at least one fucktard a day who wrote a crawler to use on the CPAN search engine that doesn't respect robots.txt or anything else for that matter, effectively crippling the service for everyone else... idiots with a little Perl and not a lot of common sense can ruin your day.

The author removed the module voluntarily. However, the others in the namespace would do well to consider the applications of their modules and compliance with the terms of service to avoid this sort of problem in the future.

Search engines like to provide a useful service without the added hassle of someone trying to hoover their database with 50 queries a second or more. I consider abusive crawlers to be a menace and a threat to freely available search engines like google, CPAN search and others.

Re:Guns don't kill people, people kill people

hossman on 2002-03-02T19:21:53

I still say they should be going after the users, not the code.

I mean, if people are slamming their site with a module, taking the module off CPAN isn't going to stop them -- they've still got it, and they'll still use it.

There is definitely something to be said, however, for trying to make your modules play as nicely as possible -- having a section in the documentation on how to use the module responsibly is good, but module writers might also want to consider putting "safety valves" in their code that users have to go out of their way to open. That way you're doing your part to make your software play nice with the rest of the children, and you can point a clear finger at the user for disabling the safety feature.
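A "safety valve" of this sort might look like the following sketch. The class name, the `throttle_seconds` knob, and the default delay are all illustrative, not taken from any real WWW::Search module; HTTP::Tiny is used because it ships with core Perl.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Hypothetical search client with a built-in delay between queries.
# The throttle is on by default; a user has to pass an explicit
# argument to disable it, so abuse requires deliberate effort.
package WWW::Search::Throttled;

sub new {
    my ($class, %args) = @_;
    return bless {
        http             => HTTP::Tiny->new(agent => 'WWW-Search-Throttled/0.01'),
        throttle_seconds => 5,   # safety valve: minimum gap between requests
        last_request     => 0,
        %args,                   # user-supplied overrides win
    }, $class;
}

sub fetch {
    my ($self, $url) = @_;
    # Sleep out the remainder of the throttle window, if any.
    my $wait = $self->{last_request} + $self->{throttle_seconds} - time;
    sleep $wait if $wait > 0;
    $self->{last_request} = time;
    return $self->{http}->get($url);
}

package main;

my $search = WWW::Search::Throttled->new;
# Hammering the server requires opening the valve on purpose:
# my $rude = WWW::Search::Throttled->new(throttle_seconds => 0);
```

The point of putting the override in the constructor (rather than just documenting "please sleep between requests") is exactly the finger-pointing argument above: the well-behaved path is the default.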

I'm reminded of some code a buddy showed me a few years ago. There was yet another buffer overflow hole in some software, and the person who found the hole had released a C program to exploit it (in the spirit of SATAN). If you compiled the code (or got a binary from someone) and used it without looking at the source to understand what it did, you would never notice the #ifdef SCRIPT_KIDDIE block that put the user's name, email, IP, hostname, and a bunch of other really useful information into the large string that was generated to overrun the buffer -- giving anyone who had patched the bug all the data they needed to track down the person trying to hack them.

Perhaps people writing modules in the WWW::Search hierarchy could put similar data into X- headers without documenting the "feature", so search engines can better block/track assholes abusing the module.

Re:Guns don't kill people, people kill people

dlc on 2002-03-04T18:56:31

Perhaps people writing modules in the WWW::Search hierarchy could put similar data into X- headers without documenting the "feature", so search engines can better block/track assholes abusing the module.

Even outside of the context of this discussion, this is a fabulous idea. Not so much to enable search engines to block abusers, but because software that uses the Net should be self-identifying, especially software that iteratively traverses a site.

(darren)

Re:Guns don't kill people, people kill people

vsergu on 2002-03-05T00:00:28

We don't need no stinkin' X- headers. That's what User-Agent is for.

Re:Guns don't kill people, people kill people

hossman on 2002-03-05T04:41:31

Most of the CPAN modules I've seen that act as HTTP clients set the User-Agent, but they also have a documented method for the user to override it (in case they need to masquerade as a particular User-Agent).

I'm suggesting some headers that would be completely undocumented, and could only be overridden using an undocumented method. Most people would be completely unaffected (since the extra X- headers would be ignored), and anyone who was affected wouldn't have too much trouble looking at the source to figure out where the headers were coming from (especially if the headers themselves were self-documenting):

X-CPAN-Module: WWW::Search::FooBar
X-CPAN-Module-Info: you can remove these headers using hide_extra_headers()
X-CPAN-Module-User: hossman@fucit.org
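In client code, stamping headers like those onto every request could be sketched like this. The header names follow the examples above, and `hide_extra_headers()` is the hypothetical undocumented escape hatch; HTTP::Tiny (core Perl) is assumed as the client.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Identifying X- headers sent with every request by default.
# Names follow the thread's examples; they are not a real convention.
my %extra = (
    'x-cpan-module'      => 'WWW::Search::FooBar',
    'x-cpan-module-info' =>
        'you can remove these headers using hide_extra_headers()',
);

my $http = HTTP::Tiny->new(default_headers => \%extra);

sub hide_extra_headers {
    # Deliberately undocumented: a caller has to read the source
    # to find out how to stop self-identifying.
    delete $http->default_headers->{$_} for keys %extra;
}
```

Since the headers are sent on every request, a search engine's access logs would show which module (and, with a user header, which user) generated the traffic, without affecting anyone who plays nicely.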

Hypocrisy.

solhell on 2002-03-01T22:41:31

Let me make sure I understood this correctly. Google doesn't let a program query their website, retrieve the search results, parse them, and use them -- like a metasearch script that extracts information from multiple search engines and combines it. They supposedly don't allow people to do that.
Let's rephrase that: a remote program (a web browser, in a sense) visits their web page and parses the data to keep only the URLs of web pages that Google obviously doesn't own, then uses only that information that Google doesn't own. So why would this be a problem? And how is this different from Google crawling people's web pages and caching their data and images?

This is not a DoS attack. You don't crawl google iteratively in parallel. It is a simple one page query.

You might argue that people can put up files to prevent search engines from indexing their pages. Don't forget, Google extracts copyrighted material from others' pages, while a metasearch script extracts only the data Google doesn't own at all.

Re:Hypocrisy.

jerry22 on 2002-03-10T00:51:40

Hmm, let me make sure *I* understand this correctly. You are proposing that it is OK to steal their results (i.e., cost them money) without *any* compensation? How exactly do you expect them to stay in business? Their business model is simple: either you pay per query (like Yahoo), or it's free and you "pay" (in the aggregate) by viewing/clicking on ads. Automated queries without pay don't make any sense economically.

Re your comment on ownership of the information, you're missing the point too -- Google does own the result list (i.e., which 10 results to show). That it doesn't own the actual URLs is irrelevant. (If you don't agree with this argument: why do you query Google instead of going directly to the sites? Uhm, perhaps because you don't know which ones? Well, that's the information they provide and own.)

What was the module in question?

jbc on 2002-03-02T19:09:47

Just out of curiosity, what was it that was removed, and how much can we know about what it was that was deemed DoS-y about it? It would make it easier to assign an appropriate karmic burden to the Googlemeisters if I knew what they objected to.

What about my terms of use?

Starky on 2002-03-03T02:33:37

Curious as to what kind of precedent Google is encouraging, I have posted my own terms of use.

The terms I stipulate are that anyone may access my homepage for any reason at any time by any means, automated or not, with the exception of Google or any of its agents. Google must pay a fee of $1 per hit for any access to any web page or graphic element on my domain, whether that access is through a browser, a spider, or a robot.

I'm looking forward to my first check!

Re:What about my terms of use?

solhell on 2002-03-03T10:50:47

The problem is that Google has more and better lawyers and they are not afraid to use them :)

Re:What about my terms of use?

hfb on 2002-03-03T16:41:02

Grow up

Search engines are quite possibly the single most useful part of the internet. Try to spend a day without them. Google provides the service for free, and largely advertisement-free as well. Unless you would prefer that Google and others resorted to subscription-only or ad-filled content, it is not an unreasonable request to consider the consequences and potential misuses of these modules.

Google also respects the ban in robots.txt if you don't wish them to index your site. A lot of other crawlers don't.
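Honoring a robots.txt ban takes only a few lines in the crawler itself. A minimal sketch (handling only the simple "User-agent: *" / "Disallow:" case -- a real crawler should use WWW::RobotRules or LWP::RobotUA, which implement the full protocol):

```perl
use strict;
use warnings;

# Returns true if $path is disallowed for a generic crawler,
# given the text of a site's robots.txt. Only the wildcard
# user-agent group and plain prefix Disallow rules are handled.
sub disallowed {
    my ($robots_txt, $path) = @_;
    my $applies = 0;
    for my $line (split /\n/, $robots_txt) {
        $line =~ s/#.*//;                       # strip comments
        if ($line =~ /^User-agent:\s*(\S+)/i) {
            $applies = ($1 eq '*');             # we match the wildcard group
        }
        elsif ($applies && $line =~ /^Disallow:\s*(\S+)/i) {
            return 1 if index($path, $1) == 0;  # prefix match
        }
    }
    return 0;
}

# Usage: fetch /robots.txt once per site, then consult it before
# every request instead of hammering banned paths.
my $rules = "User-agent: *\nDisallow: /search\n";
print disallowed($rules, '/search?q=perl') ? "skip\n" : "fetch\n";   # prints "skip"
```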

THE AUTHOR VOLUNTARILY OFFERED TO REMOVE IT AT GOOGLE'S REQUEST. NO PUPPIES WERE KILLED IN THE FILMING OF THIS MOVIE.

Re:What about my terms of use?

autarch on 2002-03-04T00:56:42

A module that simply goes to Google, does a search, and displays or parses the result is not violating their terms of service AFAICT, though it could be used to do so.

Of course, their terms of service are very unclear here. What is "automated searching"?

Anyway, what's scary to me is that a corporation is telling people what kind of code they can and cannot write and distribute.

But of course, without actually knowing what the removed module did I can't really say too much, though I am sympathetic to people who are suspicious of Google.

Re:What about my terms of use?

hfb on 2002-03-04T01:36:24

I don't know any details other than that the author voluntarily agreed to remove it from CPAN. I found it interesting, and I sympathise with Google, as abusive crawlers are difficult to identify and stop, and consume far more resources than is 'fair use'. I don't know that removing the module from distribution will ease or solve the problem, but I certainly can't blame them for trying.

Again, the point isn't whether or not Google is a big bad old mean corporation for picking on a particular module... the point is that when writing and distributing these kinds of modules, authors should keep in mind potential legal snafus, as well as being good internet neighbors by designing the modules in a way that makes such problems more difficult to create. Just because you can do something doesn't mean you should, or that it's right.

Re:What about my terms of use?

Elian on 2002-03-04T03:15:25

I worked for a search engine for a few years. There were days when 30% of our traffic was meta-search engines, automated placement checkers, and toolbar search things. None of which displayed the banner ads we were putting up. That's a lot of traffic, with a lot of cost associated with it, with no revenue from it at all.

Like it or not, advertising pays some of the bills, and automated search tools skip the advertising. And, while the ads don't pay for all the traffic (I'd wager a lot that Google's public search is a loss-leader, with each search costing them more than the ads return) they offset some of the cost.

Yes, google gets the content for free. But, on the other hand, you get the searches for free too. Skip the ads, or abuse the service (which is easy with a program) and the service will go away as a free public service.

Been there. Done that.

Googlewhacking

Mur on 2002-03-04T15:07:22

This may be related to Googlewhacking.

Disclaimer: the term "Googlewhack" was invented by my employer, Gary Stock (his home page is where it all started). At one point he was getting hundreds of emails a day, people offering up their clever googlewhacks, arguing about scoring and legitimacy, describing programming hacks, etc. The scariest was somebody who said he was coding up a script to walk through /usr/share/dict/words (or whatever) and submit all the word pairs to Google ... yipe!

Middle solution?

jeppe on 2002-03-04T15:53:31

How about someone making a module with Google's cooperation, using some specific interface? That way, Google could monitor how many queries are run using the engine. We'd stay in line, and Google would be protected from leeches making Google proxies (they'd be able to spot the popular ones when looking through their log files).

Proxies?

cjb on 2002-03-07T12:38:39

According to Google:
"You may not take the results from a Google search and reformat and display them,"
I do all of my browsing through an ad-removing proxy. I suppose they could claim my proxy is also in violation of the TOS, since my software reformats every page that I see.

Sad, really.