PHP oddity

jarich on 2007-08-27T11:00:32

The PerlNet Wiki uses MediaWiki, which is written in PHP, and so are the plugin scripts we use to manage it. As the wiki is open to anonymous edits, we have a fun arms race with spammers. Most of the time, they get caught by our blacklist and their edits don't even get saved. Recently, however, one spammer has done something clever...

When I first spotted this afternoon's spam, I checked it for common URLs and found that most links pointed to ifrance.com. I would have added it to our blacklist, but it was already there! So I checked that the anti-spam bot was still active, and it was. Why wasn't this page being picked up? I took a copy of the cleanup script, printed out the regex and ran that against some of the page text: yup, it matched!

I printed out what the script was seeing as the text, and it matched the page content. I printed that to a file and ran a simple regex over it: yup, it matched. I ran the bigger regex over it: no match.

I looked at the data again. Couldn't see anything special about it except that all of it was on one line. Well... surely it couldn't be a memory issue. I was reading the whole text into memory before performing the regex, how would line boundaries make a difference? I wasted time looking into other possibilities.

I eventually came back to the fact that it was a _very_ long single line... that a simple regex could match. Could that be it anyway? I removed a few thousand characters and wow! It started matching again.

I eventually found that (for the size of regular expression we're using) strings of 13808 characters or fewer would match, but any more and the match would fail... silently. I did this with the following code:
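
A minimal sketch of that kind of length test (not the original script) might look like this. The function name longest_matching_length and the stand-in pattern and text are assumptions for illustration; the real blacklist regex is far longer and the real input was the spam page itself.

```php
<?php
// Hypothetical helper: chop characters off the end of $text until $pattern
// matches. Returns the longest length at which preg_match() succeeds,
// or -1 if it never matches.
function longest_matching_length($pattern, $text)
{
    for ($len = strlen($text); $len > 0; $len--) {
        // preg_match() returns 1 on a match; anything else (0 or false)
        // is treated as "did not match at this length".
        if (preg_match($pattern, substr($text, 0, $len)) === 1) {
            return $len;
        }
    }
    return -1;
}

// Stand-in data (assumption): a short pattern and a short single-line text.
$pattern = '/ifrance\.com/';
$text = str_repeat('x', 50) . ' http://example.ifrance.com/ ' . str_repeat('y', 50);

$len = longest_matching_length($pattern, $text);
if ($len === -1) {
    echo "No match at any length\n";
} else {
    echo "Matched spam at $len characters\n";
}
```

With a well-behaved regex engine the match should succeed at the full length; a result shorter than strlen($text) for a pattern that plainly occurs in the text is the symptom described here.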


My string started with 14979 characters!

I wondered how much of this was because it was a very long _line_ as opposed to a very long string. So I edited the data file to add newlines after each URL. It matched immediately!

I thought about the length of the regular expression (it's 2584 characters). The simple regular expression ifrance\.com had worked, so I wondered if the failure was due to alternation or capturing. I added in a small hunk of the real regex for about 30 characters (4 alternations) and it still matched. Removing a third of the real regex length (string length, not necessarily alternation opportunities) resulted in matching the string one character earlier but that was it.

Odd.


Seems like a bug

Shlomi Fish on 2007-08-27T15:31:23

This seems like a bug, and it needs to be reported. Can you supply the offending string and regex? It may be a bug in PCRE, which is the regex engine that PHP uses for this. Trying the same with an equivalent ANSI C program using PCRE may be instructive in pinpointing the problem.

Regards, Shlomi Fish.

Re:Seems like a bug

jarich on 2007-08-28T05:12:07

That's what I think too. Mind you, I'm using PHP 4.3.10-22 so maybe it's already been fixed. I mentioned it only because it seemed so unlikely.

If you have a more up-to-date version of PHP or just want to have a play, then you can download my test script and two test files (both of which fail at first and then match as they lose length) from http://perltraining.com.au/user/jarich/php-pcre.tgz.

The two files are clean.txt and dirty.txt. dirty.txt is one page of the spam that was successfully tricking the regular expression; the spam is obnoxious and sexually explicit, certainly not safe for work. When using dirty.txt you'll probably want to chop off the last 1150 characters straight away just to speed up the success time. clean.txt is some generated text, which oddly had to be much, much bigger than dirty.txt in order to make the regex fail. For more data points: all the spam from http://perl.net.au/ that mentions ifrance.com and was cleaned up on 28th August 2007 between 2 and 3pm (Australian Eastern Standard Time) causes this problem.

Passing the same data to Perl with an exactly equivalent regular expression resulted in a match straight away.

All the best,

jarich

Re:Seems like a bug

Shlomi Fish on 2007-08-28T10:11:28

Running a slightly modified test script against php-cli-5.2.3-10mdv2008.0 with libpcre0-7.2-1mdv2008.0, I'm getting:

746
745
744
743
742
741
740
739
738
737
Matched spam.

So it got worse. I can later try it with grep -P or with pcregrep.

Kind of a known bug.... sort of

jarich on 2007-08-28T07:39:01

I brought this up with Ben Balbo (an excellent PHP programmer) and he mentioned that there are two similar bugs which have been submitted in the past.

As he says:

The first suggests it's a limitation of PCRE, and the second simply dismisses it as not implying a bug in PHP itself.

As the PCRE website appears to be having problems, I'm at a loss as to how to get this issue fixed. I've worked around it, but really I'd just rather it behaved properly.
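
One way to at least stop the failure being silent is to ask PCRE whether it gave up. This is a minimal sketch, assuming PHP 5.2 or later (where preg_last_error() is available). The wrapper name classify_match is hypothetical, and whether the failure described above sets an error code on any particular build is an assumption, not something tested here.

```php
<?php
// Hypothetical wrapper: report whether preg_match() matched, cleanly failed
// to match, or hit an internal PCRE error (such as a backtrack or recursion
// limit) that would otherwise look like "no match".
function classify_match($pattern, $text)
{
    $result = preg_match($pattern, $text);
    if ($result === false || preg_last_error() !== PREG_NO_ERROR) {
        return 'error';
    }
    return $result === 1 ? 'match' : 'no match';
}

// Stand-in pattern and text (assumption), not the real blacklist regex.
echo classify_match('/ifrance\.com/', 'see http://example.ifrance.com/'), "\n";
```

An anti-spam script could treat 'error' the same as 'match' and hold the edit for review, so that an over-long line cannot slip past just because the engine bailed out.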