Using HTTP::Proxy to log my web browsing history

agent on 2006-01-20T14:03:52

Yeah, I visit many websites every day. What I want, and what I've always been looking for, is a facility to automagically keep a record of the URLs and page titles I've just accessed, so that I can analyse the history later to find out, for example, where my interest was focused in a particular period of time. And it's very likely that I can come up with even more interesting statistics.

The Mozilla browser certainly has built-in support for browsing history, but unfortunately exporting that history is not trivial. What I want is not only the URLs, but also the corresponding page titles (if any!) and the time stamp of each visit.

Several weeks ago, I happily found that the CPAN module HTTP::Proxy can come to the rescue. All I need to do is write several lines of Perl code using that module, run the script in the background as a local HTTP proxy server, and set my web browser to use it. That way, my local proxy gets a chance to monitor all the HTTP traffic between my browser and the Internet.

It's fun to see that my local proxy server can also use a remote proxy, so my local one then becomes a secondary proxy, no? ;D

The HTTP::Proxy module also supports logging internally, so my code is even simpler:

    use HTTP::Proxy ':log';

    my $logfile = "$home/myproxy.log";
    open my $log, '>>', $logfile
        or die "Can't open $logfile for appending: $!";

    my $proxy = HTTP::Proxy->new(
        logmask => STATUS,
        logfh   => $log,
    );

The logmask parameter here controls what kind of things the proxy should record. The STATUS constant indicates that only the basic URL and response code will be logged. What I get in the log file is something like this:

    [Fri Jan 13 17:25:42 2006] (1888) REQUEST: GET http://www.perl.com/
    [Fri Jan 13 17:25:53 2006] (1888) RESPONSE: 200 OK
    [Fri Jan 13 17:25:53 2006] (1888) REQUEST: HEAD http://www.google.com/mozilla/google.src
    [Fri Jan 13 17:25:54 2006] (1888) RESPONSE: 200 OK
    [Fri Jan 13 17:25:54 2006] (1888) REQUEST: GET http://www.perl.com/styles/main.css
    [Fri Jan 13 17:25:55 2006] (1888) RESPONSE: 304 Not Modified
    ...

Hmm... very cute! However, HTTP::Proxy's built-in logging mechanism doesn't record HTML titles. So I need to provide a user agent of my own:

    package MyUA;

    use HTTP::Proxy ':log';
    use base 'LWP::UserAgent';

    sub send_request {
        my ($self, $request) = @_;
        my $response;
        eval {
            $response = $self->SUPER::send_request( $request );
        };
        if ($@ and not $response) {
            return HTTP::Response->new(500, $@);
        }
        if ($response->is_success) {
            my $type = $response->header('content-type');
            if ($type and $type =~ m[text/html]i) {
                # log the text between the <title> tags, if any
                if ($response->content =~ m[<title>\s*(.*?\S)\s*</title>]si) {
                    $proxy->log( STATUS, 'TITLE', $1 );
                }
            }
        }
        return $response;
    }

Now HTML titles get recorded as well, as witnessed in my log file:

    [Tue Jan 17 20:33:47 2006] (2484) REQUEST: GET http://perladvent.org/2004/20th/
    [Tue Jan 17 20:33:50 2006] (2484) TITLE: Perl 2004 Advent Calendar: Filesys::Virtual
    [Tue Jan 17 20:33:50 2006] (2484) RESPONSE: 200 OK

Then I feed the customized user agent to the HTTP::Proxy instance I created earlier:

    my $agent = MyUA->new(
        env_proxy => 1,
        timeout   => 100,
    );
    $proxy->agent( $agent );

At last, we enter an infinite loop, as every HTTP proxy server does:

    while (1) {
        eval { $proxy->start(); };
        warn $@ if $@;
    }

That's it!
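To give an idea of the later analysis step, here is a minimal sketch that turns log lines of the shape shown above into (time, URL, title) records. The helper name parse_log and the rule of attaching a TITLE line to the most recent REQUEST with the same process id are assumptions made for illustration, not anything HTTP::Proxy provides.

```perl
use strict;
use warnings;

# Hypothetical helper: turn proxy log lines into visit records.
# Assumption: a TITLE line belongs to the most recent REQUEST line
# carrying the same process id (the number in parentheses).
sub parse_log {
    my @lines = @_;
    my (@visits, %last);    # %last maps pid => most recent visit record
    for my $line (@lines) {
        next unless $line =~ /^\[([^\]]+)\]\s+\((\d+)\)\s+(\w+):\s+(.*?)\s*$/;
        my ($time, $pid, $kind, $rest) = ($1, $2, $3, $4);
        if ($kind eq 'REQUEST') {
            my ($method, $url) = split ' ', $rest, 2;
            my $visit = { time => $time, method => $method,
                          url  => $url,  title  => undef };
            push @visits, $visit;
            $last{$pid} = $visit;
        }
        elsif ($kind eq 'TITLE' and $last{$pid}) {
            $last{$pid}{title} = $rest;
        }
    }
    return @visits;
}

# Sample lines copied from the log excerpt above.
my @sample = (
    '[Tue Jan 17 20:33:47 2006] (2484) REQUEST: GET http://perladvent.org/2004/20th/',
    '[Tue Jan 17 20:33:50 2006] (2484) TITLE: Perl 2004 Advent Calendar: Filesys::Virtual',
    '[Tue Jan 17 20:33:50 2006] (2484) RESPONSE: 200 OK',
);

# Print one tab-separated line per visit that has a title.
for my $visit (parse_log(@sample)) {
    next unless defined $visit->{title};
    print join("\t", $visit->{time}, $visit->{url}, $visit->{title}), "\n";
}
```

In real use the lines would come from reading myproxy.log instead of the @sample array; counting visits per host or per day can then be built on top of the returned records.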

It already works for me, but there are still several pitfalls in this solution:

  • Images won't display in MS Internet Explorer (Mozilla works fine, however)
  • It seems to me that HTTP::Proxy doesn't fork by default, which leads to poor performance when I request multiple URLs simultaneously. (BTW, is there a way to switch to a forking engine? I can't find a word about it in the POD docs.)
  • SSL connections don't work on my box.
Have fun!


Using your own agent...

BooK on 2006-01-22T15:53:23

Hi! I'm the author of HTTP::Proxy. :-) Glad you like it.

I don't think you need to define your own agent to log information. In fact, I think I should never have opened the opportunity to set your own agent. You could simply use a response filter that catches the title tag and prints it in your log file.
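A minimal sketch of what such a response filter might look like, using HTTP::Proxy::BodyFilter::simple. The helper name extract_title is hypothetical, and the extra request filter that strips Accept-Encoding (so that bodies arrive uncompressed) is an assumption of this sketch rather than something mentioned above. Body filters see the response in chunks, so a title split across two chunks would be missed by this simple version.

```perl
use strict;
use warnings;
use HTTP::Proxy ':log';
use HTTP::Proxy::BodyFilter::simple;
use HTTP::Proxy::HeaderFilter::simple;

# Hypothetical helper: return the text of the first <title> element
# found in a piece of HTML, or undef if there is none.
sub extract_title {
    my ($html) = @_;
    return $html =~ m[<title[^>]*>\s*(.*?)\s*</title>]si ? $1 : undef;
}

my $proxy = HTTP::Proxy->new( logmask => STATUS );

# Assumption: ask servers for uncompressed bodies so the regex sees plain HTML.
$proxy->push_filter(
    request => HTTP::Proxy::HeaderFilter::simple->new(
        sub {
            my ($self, $headers, $message) = @_;
            $headers->remove_header('Accept-Encoding');
        }
    ),
);

# Log the title of each HTML response chunk that contains one.
$proxy->push_filter(
    mime     => 'text/html',
    response => HTTP::Proxy::BodyFilter::simple->new(
        sub {
            my ($self, $dataref, $message, $protocol, $buffer) = @_;
            my $title = extract_title($$dataref);
            $proxy->log( STATUS, 'TITLE', $title ) if defined $title;
        }
    ),
);
```

With the filters registered, the proxy is started with $proxy->start() as before, and no custom user agent is needed.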

I also don't understand your while(1) loop. The $proxy->start() is already a while(1) loop.

And you say that the proxy doesn't fork? That probably means you're running it under Win32, aren't you? Alas, the forking code doesn't work very well under Windows. Also, maybe the documentation doesn't state this clearly, but you can change the engine (an HTTP::Proxy::Engine subclass) by passing the engine parameter to the constructor. On Unix, I use the ScoreBoard engine. This way:

    my $proxy = HTTP::Proxy->new( engine => 'ScoreBoard' );

Regarding images in Internet Explorer, I'm not sure what's going on. I know for sure that HTTP::Proxy doesn't support pipelined requests (because that's what apt-get does, and it fails for the moment).

As for SSL connections, HTTP::Proxy supports the CONNECT method, but cannot look inside.

Re:Using your own agent...

agent on 2006-01-26T15:45:52

Thank you very much for your comments! Yeah, it's odd not to use the filter mechanism. Filters make things simpler.

I'm so glad to receive feedback from you, the very author of HTTP::Proxy. :=)

Re:Using your own agent...

Vegetable on 2006-12-23T17:19:16

Heeeeeeeello,

I'm new to HTTP::Proxy and I was wondering if anyone could debug the following code for me...

The script is meant to display everything that happens while I'm browsing in the Windows Cmd Prompt (since logfh defaults to *STDERR).

    use HTTP::Proxy;
    use HTTP::Recorder;

    my $proxy = HTTP::Proxy->new(logmask => ALL);
    $proxy->start();

For some reason, no messages are displayed even though I'm browsing like crazy. Is there anything I missed?

Thanks