toke.c

ethan on 2005-04-18T07:55:19

I never quite understood why Perl offered no hooks into its lexer and parser. They're contained in the interpreter, the very same program that runs my Perl scripts.

So I snuck a peek at the dreaded toke.c. My initial thought was that it was merely a matter of calling yylex() after initializing a few of the global PL_* variables appropriately. On closer inspection, however, exactly 99 of these global variables turned out to be involved in the lexing process, including those dealing with the various perl stacks, control OPs and symbol tables.

So what I did was create a C++ class with 99 member variables. Each function in toke.c became a method that no longer operates on PL_variable but on this->pl_variable instead. Some functions unrelated to the lexer had to be modified in the same way, such as Perl_init_stacks() and a handful of the Perl_save_*() functions in scope.c. The whole point of this was to make the lexer re-entrant.
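To illustrate the idea on a much smaller scale, here is a minimal sketch of a toy lexer whose state lives in member variables instead of globals. None of these names (ToyLexer, next_word, interleave) come from toke.c or from my module; they are made up for the example, and the "lexing" is just splitting on whitespace. The point is only that two instances can be driven in alternation without stepping on each other.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical toy lexer: all scanning state is per-object, not global.
class ToyLexer {
public:
    explicit ToyLexer(std::string input) : buf_(std::move(input)), pos_(0) {}

    // Returns the next whitespace-separated word, or "" at end of input.
    std::string next_word() {
        while (pos_ < buf_.size() && is_space(buf_[pos_]))
            ++pos_;
        std::size_t start = pos_;
        while (pos_ < buf_.size() && !is_space(buf_[pos_]))
            ++pos_;
        return buf_.substr(start, pos_ - start);
    }

private:
    static bool is_space(char c) {
        return std::isspace(static_cast<unsigned char>(c)) != 0;
    }

    std::string buf_;  // stands in for what would be a global buffer
    std::size_t pos_;  // stands in for what would be a global scan position
};

// Drives two lexers in alternation. With a single set of global
// variables, the second lexer would clobber the state of the first.
std::string interleave(const std::string& a, const std::string& b) {
    ToyLexer la(a), lb(b);
    std::string out;
    for (;;) {
        std::string wa = la.next_word();
        std::string wb = lb.next_word();
        if (wa.empty() && wb.empty())
            break;
        if (!wa.empty()) { out += wa; out += ' '; }
        if (!wb.empty()) { out += wb; out += ' '; }
    }
    return out;
}
```

Calling interleave("a b c", "1 2") walks both inputs side by side, which is exactly what a lexer built on one shared set of PL_* globals cannot do.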

With these adjustments (and a few hundred #undefs/#defines), the actual XS code is very tiny:

MODULE = Perl::Lexer		PACKAGE = Perl::Lexer

Lexer *
Lexer::new ()
    CODE:
    {
        RETVAL = new Lexer();
        RETVAL->Pinit_stacks(aTHX);
    }
    OUTPUT:
        RETVAL
    CLEANUP:
        RETVAL->ME = newSVsv(ST(0));

void
Lexer::set_string (SV *line)
    CODE:
    {
        THIS->lex_start(aTHX_ line);
    }

void
Lexer::next_token ()
    CODE:
    {
        int tok = THIS->yylex(aTHX);

        /* skip empty lines */
        if (tok && THIS->bufptr)
            while (*THIS->bufptr == '\n') THIS->bufptr++;
	
        if (tok == 0)
            XSRETURN_EMPTY;

        EXTEND(SP, 2);
        ST(0) = sv_2mortal(newSViv(tok));
        ST(1) = sv_2mortal(newSVpv(TOKENNAME(tok), 0));
        XSRETURN(2);
    }

void
Lexer::DESTROY ()

And a sample script along with its output looks like this:

use blib;
use Perl::Lexer;

my $string = <<'EOS';
$a{1} = 1;
print keys %a;
EOS

my $lexer = Perl::Lexer->new;
$lexer->set_string($string);
while (my $l = $lexer->next_token) {
    print $l, " ";
}
print "\n";

__END__
$ WORD { THING ; } ASSIGNOP THING ; LSTOP UNIOP % WORD ; ;

A couple of problems remain: once the lexer sees a comment, an empty line or a shebang line, it seems to gobble up all characters up to the end of the string and thus stops scanning. The shebang-line handling is done in S_find_beginning() in perl.c before parsing even starts. As for empty lines, I suppose they are handled by perl's parser and not its lexer.

The last thing that needs to be done is making the actual attributes belonging to a token available. Ideally, this is just a matter of exposing yylval to the outside world.
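As a rough sketch of what exposing the attributes could look like, here is a toy lexer that keeps a yylval-like member next to the token code it returns. This is not how perl's yylval is laid out (there it is a union defined by the grammar); ToyValue, AttrLexer and dump_tokens are hypothetical names invented for this example, and the token set is deliberately tiny.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>

enum ToyToken { TOK_EOF = 0, TOK_WORD, TOK_NUMBER, TOK_OTHER };

// Hypothetical stand-in for a token's semantic value.
struct ToyValue {
    std::string text;  // the matched source text
    long number = 0;   // only meaningful for TOK_NUMBER
};

class AttrLexer {
public:
    explicit AttrLexer(std::string input) : buf_(std::move(input)), pos_(0) {}

    ToyValue yylval;  // attribute of the most recently returned token

    // Returns the next token code and fills in yylval as a side effect.
    ToyToken yylex() {
        while (pos_ < buf_.size() && is_space(buf_[pos_]))
            ++pos_;
        yylval = ToyValue();
        if (pos_ >= buf_.size())
            return TOK_EOF;

        std::size_t start = pos_;
        if (is_digit(buf_[pos_])) {
            while (pos_ < buf_.size() && is_digit(buf_[pos_]))
                ++pos_;
            yylval.text = buf_.substr(start, pos_ - start);
            yylval.number = std::stol(yylval.text);
            return TOK_NUMBER;
        }
        if (is_word(buf_[pos_])) {
            while (pos_ < buf_.size() && is_word(buf_[pos_]))
                ++pos_;
            yylval.text = buf_.substr(start, pos_ - start);
            return TOK_WORD;
        }
        ++pos_;  // any other single character is its own token
        yylval.text = buf_.substr(start, 1);
        return TOK_OTHER;
    }

private:
    static bool is_space(char c) { return std::isspace(static_cast<unsigned char>(c)) != 0; }
    static bool is_digit(char c) { return std::isdigit(static_cast<unsigned char>(c)) != 0; }
    static bool is_word(char c) {
        return std::isalnum(static_cast<unsigned char>(c)) != 0 || c == '_';
    }

    std::string buf_;
    std::size_t pos_;
};

// Formats every token as NAME(text), showing the code and attribute together.
std::string dump_tokens(const std::string& src) {
    static const char* const names[] = { "EOF", "WORD", "NUMBER", "OTHER" };
    AttrLexer lexer(src);
    std::string out;
    for (ToyToken tok = lexer.yylex(); tok != TOK_EOF; tok = lexer.yylex()) {
        out += names[tok];
        out += '(';
        out += lexer.yylval.text;
        out += ") ";
    }
    return out;
}
```

On the XS side the equivalent would presumably be pushing a third value onto the stack in next_token, built from whatever the real yylval holds for that token type.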