utf8

cogent on 2002-10-28T07:33:14

This is make testdb run under Devel::DProf, with JavaScript::Tokenizer tokenizing a largish bit of real-world JS code, when the regular expressions include such Unicodey things as \p{Nd} and \x{000A}:

01:20:18 [cogent@birthday] Parser>$ dprofpp
Total Elapsed Time = 167.9217 Seconds
  User+System Time = 151.5851 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 88.6   134.3 263.93  32590   0.0041 0.0081  utf8::SWASHNEW
 7.75   11.74 150.49  22400   0.0005 0.0067  JavaScript::Token::try
 3.91   5.930  7.911  32590   0.0002 0.0002  utf8::SWASHGET
 3.01   4.562 155.29   1602   0.0028 0.0969  JavaScript::Tokenizer::pop
 0.40   0.609  0.235  32579   0.0000 0.0000  utf8::DESTROY
 0.31   0.469  0.231  20787   0.0000 0.0000  JavaScript::Token::__ANON__
 0.12   0.189  0.091   1600   0.0001 0.0001  JavaScript::Token::newlines
 0.11   0.170  0.114   4892   0.0000 0.0000  JavaScript::Token::length
 0.09   0.140  0.128   3258   0.0000 0.0000  JavaScript::Token::bool
 0.07   0.110  0.073   3200   0.0000 0.0000  JavaScript::Tokenizer::state
 0.07   0.110  0.072   3209   0.0000 0.0000  JavaScript::Token::BEGIN
 0.07   0.110  0.128   1600   0.0001 0.0001  JavaScript::Tokenizer::token_types
 0.07   0.100  0.062   3261   0.0000 0.0000  UNIVERSAL::isa
 0.05   0.080  0.025   4800   0.0000 0.0000  JavaScript::Token::lexeme
 0.05   0.080  0.062   1600   0.0000 0.0000  JavaScript::Token::line

And this is the same code when the Unicode escapes are replaced with similar (though obviously not identical) entities such as \d and \n:

02:14:27 [cogent@birthday] Parser>$ dprofpp
Total Elapsed Time = 3.985910 Seconds
  User+System Time = 3.932676 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 111.   4.389  4.051  20230   0.0002 0.0002  JavaScript::Token::try
 31.5   1.239  5.505   1447   0.0009 0.0038  JavaScript::Tokenizer::pop
 12.1   0.479  0.198  21661   0.0000 0.0000  JavaScript::Token::__ANON__
 6.84   0.269  0.259   1445   0.0002 0.0002  JavaScript::Token::newlines
 4.55   0.179  0.136   1445   0.0001 0.0001  JavaScript::Tokenizer::token_types
 3.81   0.150  0.109   3181   0.0000 0.0000  UNIVERSAL::isa
 3.05   0.120  0.146   3178   0.0000 0.0000  JavaScript::Token::bool
 2.54   0.100  0.044   4335   0.0000 0.0000  JavaScript::Token::lexeme
 2.03   0.080  0.019   4658   0.0000 0.0000  JavaScript::Token::length
 1.53   0.060  0.041   1445   0.0000 0.0000  JavaScript::Tokenizer::line
 1.27   0.050  0.012   2890   0.0000 0.0000  JavaScript::Tokenizer::state
 1.27   0.050  0.031   1445   0.0000 0.0000  JavaScript::Token::line
 1.02   0.040  0.147      4   0.0100 0.0367  main::BEGIN
 0.76   0.030 -0.006   1445   0.0000      -  JavaScript::Token::string
 0.76   0.030  0.029      8   0.0037 0.0037  JavaScript::Token::BEGIN

autrijus says this is a well-known problem. <sigh.> Out comes Unicode support from JavaScript::Token::Regex.
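
For anyone wanting to check whether their own perl shows the same gap, here is a minimal sketch using the core Benchmark module. The input string and the count_matches helper are made up for illustration, and the two patterns are simplified stand-ins, not the actual regexes from JavaScript::Token::Regex. No timings are claimed here: the cost above came from utf8::SWASHNEW on a 2002-era perl, and a newer perl may well show little or no difference.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical input: a stand-in for the real-world JS source.
my $js = join '', map { "var x$_ = $_ * 42;\n" } 1 .. 2000;

# Hypothetical helper: count every match of a pattern in $js.
sub count_matches {
    my ($re) = @_;
    my $n = 0;
    $n++ while $js =~ /$re/g;
    return $n;
}

# Unicode-property version vs. the plain shorthand version.
my $unicode_re = qr/\p{Nd}+|\x{000A}/;
my $ascii_re   = qr/\d+|\n/;

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    unicode => sub { count_matches($unicode_re) },
    ascii   => sub { count_matches($ascii_re) },
});
```

On pure-ASCII input like this both patterns match the same things, so any difference in the table is overhead from how the character classes are looked up, not from extra matches.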