Parse URI with grammars

andy.sh on 2008-07-30T08:30:51

At the German Perl Workshop in February 2008 I gave a lightning talk, "Syntaxanalyse von URL mit die Grammatik" ("Syntax analysis of URLs with grammars"), in which I showed how a website can use BNF-like grammars to parse URL strings.

A typical mod_perl application has a single entry point that receives all user requests. So before the script can do anything, it has to work out what to do from the requested URI.

In that lightning talk I proposed using the Parse::RecDescent module to parse URIs. At the time of the GPW that scheme was in trial operation on a development server. A week ago I started to implement another web application, namely a simple online forum, and wanted to start with a simple regular expression for parsing URLs:

my ($sectionName, $sectionPage, $threadID, $threadPage, $messageID) = $uri =~ m{
    ^/ (\w+) (?:-(\d+))?
    (?: / (\d+) (?:-(\d+))?
        (?: / (\d+) )?
    )?
    /$
}x;


Here I extract the parts of a URL that correspond to the section name, the thread ID and the message ID, plus the page numbers that can be appended to the first two parts. The parts are optional in the sense that the tail of a URI can be cut off and the page numbers can be omitted. Such a regexp seems very easy to write, but then I realised that I had forgotten the URLs for posting messages. The URL for writing a message looks like the URL for reading; the only difference is that it carries the suffix post/.
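To make the captures concrete, here is a minimal sketch that wraps the regexp above in a helper and runs it against a few URLs. The helper name parse_forum_uri and the sample URLs are made up for illustration; only the pattern itself comes from the post.

```perl
use strict;
use warnings;

# Hypothetical helper: wraps the regexp from the post. Returns a hash
# reference on a match and nothing when the URI does not match.
sub parse_forum_uri {
    my ($uri) = @_;
    my ($sectionName, $sectionPage, $threadID, $threadPage, $messageID) =
        $uri =~ m{
            ^/ (\w+) (?:-(\d+))?
            (?: / (\d+) (?:-(\d+))?
                (?: / (\d+) )?
            )?
            /$
        }x
        or return;
    return {
        sectionName => $sectionName,
        sectionPage => $sectionPage,
        threadID    => $threadID,
        threadPage  => $threadPage,
        messageID   => $messageID,
    };
}

# Sample URLs are assumptions, not taken from the real forum.
for my $uri ('/perl-2/15-3/42/', '/perl/15/', '/perl/', '/perl') {
    my $parts = parse_forum_uri($uri);
    if ($parts) {
        my @pairs = map {
            "$_=" . (defined $parts->{$_} ? $parts->{$_} : '-')
        } sort keys %$parts;
        print "$uri => @pairs\n";
    }
    else {
        print "$uri => no match\n";
    }
}
```

Parts that are absent from the URL come back as undef, and a URL without the trailing slash does not match at all.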

It is quite easy to add an optional element to an existing regular expression, even leaving aside that capture variables like $1 shift their numbers (which is not even an issue with 5.10's named captures). But along with the new captures I wanted to set an additional variable that defines the type of the requested URL: whether it is a URL for reading or for writing. To do that I would have had to analyze the values and states of the variables obtained after applying the regexp. I definitely did not want that, because 1) I would end up with lots of if/elses, and 2) the logic would move out of the regex scheme. Features like embedded Perl code, (?{ ... }), were not attractive either.
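For comparison, this is roughly what the regexp-only route could look like with 5.10's named captures. It is a sketch of the approach I decided against, not code from the forum: the capture names, the helper name classify_uri, and the rule that a message URL cannot carry post/ are assumptions.

```perl
use strict;
use warnings;
use 5.010;    # named captures and the %+ hash

# Hypothetical helper: regexp with named captures plus the extra logic
# that has to live outside the pattern.
sub classify_uri {
    my ($uri) = @_;
    return unless $uri =~ m{
        ^/ (?<section>\w+) (?:-(?<sectionPage>\d+))?
        (?: / (?<thread>\d+) (?:-(?<threadPage>\d+))?
            (?: / (?<message>\d+) )?
        )?
        / (?<post>post/)? $
    }x;

    my %parts = %+;    # only the groups that actually captured something

    # Assumption: posting is not allowed on a single-message URL.
    return if defined $parts{message} && defined $parts{post};

    # The URL type is derived after the match, outside the regexp.
    my $post = delete $parts{post};
    $parts{type} = defined $post ? 'write' : 'read';
    return \%parts;
}

# Sample URLs are made up for illustration.
for my $uri ('/perl/15/post/', '/perl/15/42/', '/perl/15/42/post/') {
    my $parts = classify_uri($uri);
    say "$uri => ", $parts ? $parts->{type} : 'rejected';
}
```

Named captures solve the renumbering problem, but the type of the URL and the validity check still sit in ordinary Perl code after the match.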

So I came back to using a grammar, even for this simple task of analyzing six components of a URL. Implementing an optional suffix like post/ is as easy as in an ordinary regexp: just add post(?) and define the rule for it, which in my case is the literal string 'post' followed by a slash.

Here is the grammar in use now (grammar actions are not shown). It also gives a direct '404' answer if the URL is invalid.

uri : S section S thread S message S EOL
    | S section S thread S post(?) EOL
    | S section S post(?) EOL
    | S post(?) EOL
    | /.*/

section : sectionUri page(?)

thread : threadID page(?)

message : messageID

sectionUri : word

threadID : number

messageID : number

number : /\d+/

word : /\w+/

page : '-' /\d+/

post : 'post' S

S : '/'

EOL : /\Z/
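Since the actions are omitted above, here is one possible way to fill them in so that the grammar can be run as is. The actions, the shape of the returned hash, and the helper url_type are assumptions made for this sketch and not the code used on the site; the rules themselves are the ones listed above.

```perl
use strict;
use warnings;
use Parse::RecDescent;

# Hypothetical helper called from the actions: post(?) yields an array
# reference that is empty when the optional suffix was absent.
sub url_type {
    my ($post) = @_;
    return @$post ? 'write' : 'read';
}

my $grammar = q{
    uri : S section S thread S message S EOL
            { $return = { type => 'read', %{ $item{section} },
                          %{ $item{thread} }, messageID => $item{message} } }
        | S section S thread S post(?) EOL
            { $return = { type => main::url_type($item[6]),
                          %{ $item{section} }, %{ $item{thread} } } }
        | S section S post(?) EOL
            { $return = { type => main::url_type($item[4]),
                          %{ $item{section} } } }
        | S post(?) EOL
            { $return = { type => main::url_type($item[2]) } }
        | /.*/
            { $return = { type => '404' } }

    section : sectionUri page(?)
            { $return = { sectionName => $item{sectionUri},
                          sectionPage => $item[2][0] } }
    thread  : threadID page(?)
            { $return = { threadID   => $item{threadID},
                          threadPage => $item[2][0] } }

    message    : messageID
    sectionUri : word
    threadID   : number
    messageID  : number
    number     : /\d+/
    word       : /\w+/
    page       : '-' /\d+/
    post       : 'post' S
    S          : '/'
    EOL        : /\Z/
};

my $parser = Parse::RecDescent->new($grammar) or die "Bad grammar";

# Sample URLs are made up for illustration.
for my $uri ('/perl-2/15/post/', '/perl/15/42/', '/perl') {
    my $result = $parser->uri($uri);
    print "$uri => $result->{type}\n";
}
```

With this layout the type of the URL is decided by whichever production matched, so no if/else chain is needed after parsing.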


Since the application runs under mod_perl, I factored all this code out into a separate module that is loaded when Apache starts. The only job left for each request is to call the $parser->uri() method, which matches the given URL against the already compiled grammar.
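The idea of building the parser once can be sketched as follows. The package name Forum::UriParser, the function parse_uri and the one-rule grammar are placeholders invented for this example; under mod_perl such a module would typically be loaded at server startup, for instance with a PerlModule directive.

```perl
package Forum::UriParser;    # hypothetical module name
use strict;
use warnings;
use Parse::RecDescent;

# Placeholder grammar, much smaller than the real one.
my $grammar = q{
    uri : '/' /\w+/ '/' /\Z/ { $return = $item[2] }
        | /.*/               { $return = '404' }
};

# Built once, when the module is loaded, not on every request.
my $parser = Parse::RecDescent->new($grammar) or die "Bad grammar";

# Per-request work is a single method call on the ready-made parser.
sub parse_uri {
    my ($uri) = @_;
    return $parser->uri($uri);
}

package main;

# Sample URLs are made up for illustration.
print "$_ => ", Forum::UriParser::parse_uri($_), "\n" for '/perl/', '/nope';
```

Compiling the grammar is the expensive step, so keeping $parser in a file-scoped variable means each request only pays for the match itself.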


Have you seen....

Ron Savage on 2008-07-30T23:19:33

http://search.cpan.org/user/wonko/CGI-Application-Dispatch-2.12/

Re:Have you seen....

andy.sh on 2008-07-31T15:19:02

timtowtdi