Parse::HTTP::UserAgent: yet another user agent string parser

Burak on 2009-09-04T04:25:35

I was using HTTP::BrowserDetect for a long time. Not because it's a piece of art or accurate, but perhaps because of laziness. When I had some free time, I thought about re-inventing the wheel, like I did several times before. The main reasons for re-inventing it are the source code and interface of the module (try to read it, you'll understand) and the lack of new releases. Also, it's not accurate.

There are two other alternatives though: HTML::ParseBrowser and HTTP::DetectUserAgent. The former is really good parser-wise, while the latter is actually a sniffer and does not give you a verbose result.

So, I wrote Parse::HTTP::UserAgent. It tries to be verbose and parse as much as possible from the junk named "User Agent String". It tries to identify the major browsers first and then falls back to minor/old ones with an extended probe. The parsed structure has many fields like:

name               Browser name. You may need to check original_name() if the browser is a faker (like Maxthon).
version_raw        Browser version
version            version(version_raw)->numify: The float version of the parsed version.
original_name      The original name (e.g. Maxthon)
original_version   The original version (e.g. 2.0 (Maxthon))
os                 Operating system. Windows names are returned instead of versions
lang               The "user interface" language of the browser
toolkit            [tk_name, tk_version, version(tk_version)->numify]. Gecko, Trident, etc.
dotnet             If it has .NET CLR version in the string, this'll have all versions
mozilla            If a Mozilla browser, returns Moz version: [original, version(original)->numify]
strength           Encryption strength (I guess this does not have much value today)
robot              UA is a robot
extras             Any non-parsable junk. Arrayref.
parser             The name of the parser that returned the result set
generic            Parsed by a generic parser? Bool.
string             The original User Agent String
unknown            User Agent String cannot be parsed
device             ***not implemented yet
wap                ***not implemented yet
mobile             ***not implemented yet

The module also has ->as_hash and ->dumper methods for debugging purposes.
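A minimal usage sketch, for orientation only. It assumes the constructor takes the user agent string as its only argument and that the fields listed above are available as accessor methods; the sample string is made up for illustration. The script skips the parsing part if the module is not installed.

```perl
use strict;
use warnings;

# Illustrative sample string (an assumption, not taken from the test suite).
my $string = 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)';

if ( eval { require Parse::HTTP::UserAgent; 1 } ) {
    # Assumption: new() accepts the raw User Agent string.
    my $ua = Parse::HTTP::UserAgent->new($string);

    if ( $ua->unknown ) {
        print "Unable to parse: $string\n";
    }
    else {
        # Any field may be undefined, so print a placeholder for those.
        for my $field (qw( name version os original_name )) {
            my $value = $ua->$field();
            printf "%-15s %s\n", $field, defined $value ? $value : '(n/a)';
        }
        # Dump the whole parsed structure for debugging.
        print $ua->dumper;
    }
}
else {
    print "Parse::HTTP::UserAgent is not installed, nothing to show\n";
}
```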

The biggest difference is that it parses fakers like Maxthon accurately. It also extracts .NET versions and toolkit names and versions, and it identifies Opera 10 correctly (btw, Opera is the first thing I install on a new system).
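To make the "faker" idea concrete, here is a rough sketch in plain Perl. It is not the module's actual parser: sniff_msie_like() is a hypothetical helper and the sample string is made up. It only shows how a string that claims to be MSIE can also carry the real browser name and a list of .NET CLR versions.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of Parse::HTTP::UserAgent.
# Splits the first parenthesised block into tokens and picks out
# the claimed browser, the real browser and any .NET CLR versions.
sub sniff_msie_like {
    my ($ua) = @_;
    my ($comment) = $ua =~ /\(([^)]*)\)/;
    return unless defined $comment;

    my %r = ( dotnet => [] );
    for my $token ( split /\s*;\s*/, $comment ) {
        if ( $token =~ /^MSIE\s+([\d.]+)/ ) {
            @r{qw(name version_raw)} = ( 'MSIE', $1 );
        }
        elsif ( $token =~ /^Maxthon\s*([\d.]*)/ ) {
            @r{qw(original_name original_version)} = ( 'Maxthon', $1 );
        }
        elsif ( $token =~ /^\.NET CLR\s+([\d.]+)/ ) {
            push @{ $r{dotnet} }, $1;
        }
    }
    return \%r;
}

my $r = sniff_msie_like(
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0; '
  . '.NET CLR 2.0.50727; .NET CLR 3.0.04506.30)'
);

print "claims to be: $r->{name} $r->{version_raw}\n";
print "actually is:  $r->{original_name} $r->{original_version}\n";
print ".NET CLR:     @{ $r->{dotnet} }\n";
```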

The version numbers are converted to decimals to ease comparison (I dislike the major/minor stuff the others implement). The conversion also removes any junk string (like "gold") from the version number. While using version is good, as it handles all the nasty stuff, I got some failure reports from the 5.6.2 smokers after releasing the module. It looks like they (5.6.2) have the pure-Perl version::vpp (I couldn't compile the XS version under 5.6.1 either), which has some kind of bug. I've opened a ticket about the issue, but also added a workaround to fool version::vpp (postfix '.0' if the version is three digits). I currently have no idea about 5.5.x, but 5.6.x seems to be fine at least (I also tested it myself with ActivePerl 5.6.1 on a virtual Windows XP).
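A small sketch of why the decimal form helps. numify_version() is a hypothetical helper, not the module's code, and the way it strips the junk (keeping only the leading dotted-number part) is an assumption made for illustration. It does not reproduce the version::vpp workaround.

```perl
use strict;
use warnings;
use version;

# Hypothetical helper (not the module's own code): keep the leading
# dotted-number part of a raw version and turn it into a plain decimal.
sub numify_version {
    my ($raw) = @_;
    my ($clean) = $raw =~ /^(\d+(?:\.\d+)*)/;
    return unless defined $clean;
    return version->new($clean)->numify;
}

for my $raw ( '3.0.9', '3.0.11', '1.0.2gold' ) {
    printf "%-10s => %s\n", $raw, numify_version($raw);
}

# A plain string comparison sorts these two the wrong way round ...
print "string compare says 3.0.11 is older\n" if '3.0.11' lt '3.0.9';

# ... while the numified values compare as expected.
print "numified compare says 3.0.11 is newer\n"
    if numify_version('3.0.11') > numify_version('3.0.9');
```

With decimals, checks like "is this at least version X?" become a single numeric comparison instead of separate major/minor tests.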

The module also has some example programs in it for benchmarking. I'll give some figures below. The test system is: Windows Vista Home Premium SP2 32bit & P8600 @ 2.40GHz & ActivePerl 5.10.0.1004

C:\>perl -Ilib eg\bench.pl -c 1000
*** The data integrity is not checked in this run.
*** This is a benchmark for parser speeds.
*** Testing 161 User Agent strings on each module with 1000 iterations each.

This may take a while. Please stand by ...

          Rate    HTML   HTML2 Browser   Parse  Parse2  Detect
HTML    12.6/s      --     -2%    -63%    -75%    -82%    -90%
HTML2   12.9/s      2%      --    -62%    -75%    -81%    -90%
Browser 34.2/s    170%    166%      --    -33%    -51%    -73%
Parse   51.1/s    304%    297%     50%      --    -26%    -59%
Parse2  69.4/s    449%    439%    103%     36%      --    -44%
Detect   125/s    888%    871%    266%    144%     80%      --

The code took: 241.65 wallclock secs (228.21 usr +  0.08 sys = 228.29 CPU)

---------------------------------------------------------

List of abbreviations:

HTML      HTML::ParseBrowser v1
HTML2     HTML::ParseBrowser v1 (re-use the object)
Browser   HTTP::BrowserDetect v0.99
Detect    HTTP::DetectUserAgent v0.01
Parse     Parse::HTTP::UserAgent v0.16
Parse2    Parse::HTTP::UserAgent v0.16 (without extended probe)

HTML::ParseBrowser is slow as hell. Even re-using the object as the doc suggests does not help. It's good that I wasn't aware of the module until now :p HTTP::BrowserDetect is not a good performer either. But its interface is extensive and it's kind of the de facto standard in this area. It tries to match *anything* possible and this choice slows it down (who cares if $ua->win31 is true as of today, right?). HTTP::DetectUserAgent is the speedy one here. It is about 2.4 times as fast as Parse::HTTP::UserAgent, and still about 1.8 times as fast when the extended probe is disabled. However, it gains this speed with several CAVEATs, as the version number suggests.
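The layout of the table above matches what cmpthese() from the core Benchmark module prints. The following is only a sketch of that kind of comparison, not the actual eg/bench.pl: the two probe functions are hypothetical stand-ins for real parsers and the three strings are illustrative samples.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Illustrative sample strings (assumptions, not the real test data).
my @agents = (
    'Opera/9.80 (Windows NT 6.0; U; en) Presto/2.2.15 Version/10.00',
    'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
);

# Hypothetical cheap strategy: name and version only.
sub probe_quick {
    my ($ua) = @_;
    return $ua =~ m{(Opera|Firefox|MSIE)[/ ]([\d.]+)} ? [ $1, $2 ] : undef;
}

# Hypothetical costlier strategy: also digs out the OS and language.
sub probe_thorough {
    my ($ua) = @_;
    my $hit = probe_quick($ua) or return;
    my ($os)   = $ua =~ /(Windows NT [\d.]+)/;
    my ($lang) = $ua =~ /;\s*([a-z]{2}(?:-[A-Z]{2})?)\s*[;)]/;
    return [ @{$hit}, $os, $lang ];
}

# A negative count means: run each entry for at least that many CPU seconds.
cmpthese( -1, {
    quick    => sub { probe_quick($_)    for @agents },
    thorough => sub { probe_thorough($_) for @agents },
} );
```

In such a table, each cell compares the row's rate against the column's rate. For example, the 144% in the Detect row under the Parse column means 125/s against 51.1/s.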

C:\>perl -Ilib eg\accuracy.pl
*** This is a test to compare the accuracy of the parsers.
*** The data set is from the test suite. There are 161 UA strings
*** Parse::HTTP::UserAgent will detect all of them
*** A tiny fraction of the regressions can be related to wrong parsing.
*** Equation tests are not performed. Tests are boolean.

This may take a while. Please stand by ...

----------------------------------------------------------------------------------------------
| Parser                 | Name FAILS     | Version FAILS  | Language FAILS | OS FAILS       |
----------------------------------------------------------------------------------------------
| HTTP::DetectUserAgent  |   27 -  16.77% |   37 -  23.27% |   67 - 100.00% |   35 -  24.31% |
| HTTP::BrowserDetect    |   28 -  17.39% |    8 -   5.03% |   67 - 100.00% |   20 -  13.89% |
| HTML::ParseBrowser     |    0 -   0.00% |    3 -   1.89% |   42 -  62.69% |   19 -  13.19% |
| Parse::HTTP::UserAgent |    0 -   0.00% |    3 -   1.89% |    3 -   4.48% |    4 -   2.78% |
----------------------------------------------------------------------------------------------

Parse::HTTP::UserAgent is not perfect, but at least it seems to be close. HTML::ParseBrowser matches it on name/version accuracy and is more accurate than the other two there, but it falls behind on language and OS. Speedy HTTP::DetectUserAgent seems to be the worst. However, there is one caveat: the test data is from the Parse::HTTP::UserAgent test suite. So Parse::HTTP::UserAgent is not actually that good yet, since there are some patterns it cannot match.
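A note on reading the percentages in the accuracy table: they are not all relative to 161. The numbers are consistent with a different denominator per column (161 for names, 159 for versions, 67 for languages, 144 for OS), presumably the number of strings that actually carry that piece of information. These denominators are inferred from the table, not taken from the script. A quick check with the HTTP::DetectUserAgent row:

```perl
use strict;
use warnings;

# Fail rate as printed in the table: failures / tested strings, in percent.
sub fail_rate {
    my ( $fails, $total ) = @_;
    return sprintf '%.2f', 100 * $fails / $total;
}

# Denominators are inferred from the table (an assumption);
# the fail counts are from the HTTP::DetectUserAgent row.
my %total = ( name => 161, version => 159, lang => 67, os => 144 );
my %fails = ( name => 27,  version => 37,  lang => 67, os => 35 );

for my $field (qw( name version lang os )) {
    printf "%-8s %3d / %3d = %6s%%\n",
        $field, $fails{$field}, $total{$field},
        fail_rate( $fails{$field}, $total{$field} );
}
```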

Note: The module is already on CPAN, but you can get the latest code and non-CPAN content from the code repository. The repo also has an etc/Migration.pod for HTTP::BrowserDetect users.