This is make testdb using Devel::DProf on JavaScript::Tokenizer to tokenize a largish bit of real-world JS code when regular expressions include such Unicodey things as \p{Nd} and \x{000A}:
01:20:18 [cogent@birthday] Parser>$ dprofpp
Total Elapsed Time = 167.9217 Seconds
  User+System Time = 151.5851 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 88.6   134.3 263.93  32590   0.0041 0.0081  utf8::SWASHNEW
 7.75   11.74 150.49  22400   0.0005 0.0067  JavaScript::Token::try
 3.91   5.930  7.911  32590   0.0002 0.0002  utf8::SWASHGET
 3.01   4.562 155.29   1602   0.0028 0.0969  JavaScript::Tokenizer::pop
 0.40   0.609  0.235  32579   0.0000 0.0000  utf8::DESTROY
 0.31   0.469  0.231  20787   0.0000 0.0000  JavaScript::Token::__ANON__
 0.12   0.189  0.091   1600   0.0001 0.0001  JavaScript::Token::newlines
 0.11   0.170  0.114   4892   0.0000 0.0000  JavaScript::Token::length
 0.09   0.140  0.128   3258   0.0000 0.0000  JavaScript::Token::bool
 0.07   0.110  0.073   3200   0.0000 0.0000  JavaScript::Tokenizer::state
 0.07   0.110  0.072   3209   0.0000 0.0000  JavaScript::Token::BEGIN
 0.07   0.110  0.128   1600   0.0001 0.0001  JavaScript::Tokenizer::token_types
 0.07   0.100  0.062   3261   0.0000 0.0000  UNIVERSAL::isa
 0.05   0.080  0.025   4800   0.0000 0.0000  JavaScript::Token::lexeme
 0.05   0.080  0.062   1600   0.0000 0.0000  JavaScript::Token::line
And this is the same code, with the Unicode constructs replaced by similar (though obviously not identical) entities such as \d and \n:
02:14:27 [cogent@birthday] Parser>$ dprofpp
Total Elapsed Time = 3.985910 Seconds
  User+System Time = 3.932676 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 111.   4.389  4.051  20230   0.0002 0.0002  JavaScript::Token::try
 31.5   1.239  5.505   1447   0.0009 0.0038  JavaScript::Tokenizer::pop
 12.1   0.479  0.198  21661   0.0000 0.0000  JavaScript::Token::__ANON__
 6.84   0.269  0.259   1445   0.0002 0.0002  JavaScript::Token::newlines
 4.55   0.179  0.136   1445   0.0001 0.0001  JavaScript::Tokenizer::token_types
 3.81   0.150  0.109   3181   0.0000 0.0000  UNIVERSAL::isa
 3.05   0.120  0.146   3178   0.0000 0.0000  JavaScript::Token::bool
 2.54   0.100  0.044   4335   0.0000 0.0000  JavaScript::Token::lexeme
 2.03   0.080  0.019   4658   0.0000 0.0000  JavaScript::Token::length
 1.53   0.060  0.041   1445   0.0000 0.0000  JavaScript::Tokenizer::line
 1.27   0.050  0.012   2890   0.0000 0.0000  JavaScript::Tokenizer::state
 1.27   0.050  0.031   1445   0.0000 0.0000  JavaScript::Token::line
 1.02   0.040  0.147      4   0.0100 0.0367  main::BEGIN
 0.76   0.030 -0.006   1445   0.0000      -  JavaScript::Token::string
 0.76   0.030  0.029      8   0.0037 0.0037  JavaScript::Token::BEGIN
autrijus says this is a well-known problem. <sigh.> Out comes Unicode support from JavaScript::Token::Regex.
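For anyone who wants to poke at the same comparison without the whole tokenizer, here is a rough sketch using the core Benchmark module. The two patterns and the sample input are made up for illustration; they are not the real patterns from JavaScript::Token::Regex, and count_tokens is a hypothetical helper. How big the gap is depends heavily on the perl version, and newer perls may show little or no difference.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical stand-ins for the tokenizer's patterns (assumption:
# the real ones are more involved and are not shown in this post).
my $unicode_re = qr/\G(?:\p{Nd}+|\x{000A})/;
my $plain_re   = qr/\G(?:\d+|\n)/;

# Made-up sample input: only digits and newlines, so both patterns
# can consume it completely.
my $sample = join '', map { "12345\n" } 1 .. 200;

# Pull tokens off the front of the text one at a time, the way a
# tokenizer would, and return how many were matched.
sub count_tokens {
    my ($re, $text) = @_;
    my $count = 0;
    pos($text) = 0;
    $count++ while $text =~ /$re/gc;
    return $count;
}

# Run each variant for at least one CPU second and print a comparison.
cmpthese(-1, {
    unicode => sub { count_tokens($unicode_re, $sample) },
    plain   => sub { count_tokens($plain_re,   $sample) },
});
```

Both variants walk the same input with precompiled qr// patterns, so any difference in the printed rates comes from how the regex engine handles the Unicode property and escape rather than from the surrounding loop.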