I never quite understood why Perl offered no hooks into its lexer and parser. They're contained in the interpreter, the very same program that runs my Perl scripts.
So I snuck a peek at the dreaded toke.c. My initial thought was that it was merely a matter of calling yylex() after initializing a few of the global PL_* variables appropriately. On closer inspection, however, there turned out to be exactly 99 of these global variables involved in the lexing process, including those dealing with the various perl stacks, control OPs and symbol tables.
So what I did was create a C++ class with 99 member variables. Each function in toke.c became a method that no longer works on PL_variable but on this->pl_variable instead. Some non-lexer-related functions had to be modified in the same way, too, such as Perl_init_stacks() and a handful of the Perl_save_*() functions in scope.c. The whole purpose of that was to make the lexer re-entrant.
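To illustrate why moving the globals into an object makes a lexer re-entrant, here is a minimal C++ sketch. It is a toy whitespace tokenizer and not the code from toke.c; the names ToyLexer, buf_ and bufptr_ are made up for the example. Because each instance carries its own cursor, two lexers can be advanced in interleaved fashion without trampling each other's state.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>

// Toy stand-in for a lexer whose state used to live in globals.
// All names here are hypothetical; the real class wraps the PL_* variables.
class ToyLexer {
public:
    explicit ToyLexer(std::string src) : buf_(std::move(src)), bufptr_(0) {}

    // Returns the next whitespace-delimited word, or "" at end of input.
    std::string next_token() {
        // Skip leading whitespace.
        while (bufptr_ < buf_.size() &&
               std::isspace(static_cast<unsigned char>(buf_[bufptr_])))
            ++bufptr_;
        // Consume characters up to the next whitespace.
        std::size_t start = bufptr_;
        while (bufptr_ < buf_.size() &&
               !std::isspace(static_cast<unsigned char>(buf_[bufptr_])))
            ++bufptr_;
        return buf_.substr(start, bufptr_ - start);
    }

private:
    std::string buf_;     // per-instance copy of the source text
    std::size_t bufptr_;  // per-instance cursor (a global in the C version)
};
```

With two instances a and b, calls to a.next_token() and b.next_token() can alternate freely, and each lexer continues where it left off.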
With these adjustments (and a few hundred #undefs/#defines), the actual XS code is very tiny:
MODULE = Perl::Lexer    PACKAGE = Perl::Lexer

Lexer *
Lexer::new ()
    CODE:
    {
        RETVAL = new Lexer();
        RETVAL->Pinit_stacks(aTHX);
    }
    OUTPUT:
        RETVAL
    CLEANUP:
        RETVAL->ME = newSVsv(ST(0));

void
Lexer::set_string (SV *line)
    CODE:
    {
        THIS->lex_start(aTHX_ line);
    }

void
Lexer::next_token ()
    CODE:
    {
        int tok = THIS->yylex(aTHX);

        /* skip empty lines */
        if (tok && THIS->bufptr)
            while (*THIS->bufptr == '\n')
                THIS->bufptr++;
        if (tok == 0)
            XSRETURN_EMPTY;

        /* return the token number and its name */
        EXTEND(SP, 2);
        ST(0) = sv_2mortal(newSViv(tok));
        ST(1) = sv_2mortal(newSVpv(TOKENNAME(tok), 0));
        XSRETURN(2);
    }

void
Lexer::DESTROY ()

use blib;
use Perl::Lexer;

my $string = <<'EOS';
$a{1} = 1;
print keys %a;
EOS

my $lexer = Perl::Lexer->new;
$lexer->set_string($string);
while (my $l = $lexer->next_token) {
    print $l, " ";
}
print "\n";
__END__
$ WORD { THING ; } ASSIGNOP THING ; LSTOP UNIOP % WORD ; ;
S_find_beginning() in perl.c before parsing even starts. As for empty lines, I suppose they are handled by perl's parser and not its lexer.
yylval to the outside world.