November 2010 Archives

Two RegExp bugs in Internet Explorer 8

It turns out there are at least two bugs in IE 8's JavaScript regex engine. One of them is widely known; the other doesn't seem to be.

Testcase: /^(?:(?=(.))a|(?=(.))b|(?=(.))c)/.exec('bar')
Correct result: ['b', undefined, 'b', undefined]
Actual result: ['b', 'b', 'b', '']

The last '' is the known bug: capturing groups that don't participate in a successful match are set to '' instead of undefined. Slightly annoying, but not too bad.

But the 'b' at index 1 is just wrong: the first capturing group was entered, but that branch failed (because the target string didn't start with an 'a'). At this point all captures from this branch should have been reset to their previous state, which in this case is undefined (or '' in IE). That didn't happen.

End result: we get a successful match, but the captured strings may have completely bogus values.
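To check a given browser for both problems, something like the following should do. It is only a sketch: the variable names are made up for illustration, and it assumes the testcase above is enough to trigger both bugs wherever they exist.

```javascript
// Testcase from above. Per the ECMAScript spec the result should be
// ['b', undefined, 'b', undefined].
var m = /^(?:(?=(.))a|(?=(.))b|(?=(.))c)/.exec('bar');

// Known bug: group 3 never took part in the match, so it must be
// undefined; the buggy engine reports '' instead.
var emptyStringBug = m[3] === '';

// Lesser-known bug: group 1 was set inside a branch that later failed,
// so it must have been reset; the buggy engine keeps the stale 'b'.
var staleCaptureBug = m[1] === 'b';

var report = 'unset groups are "": ' + emptyStringBug +
             ', stale capture from failed branch: ' + staleCaptureBug;
if (typeof console !== 'undefined') {
  console.log(report);
}
```

A conforming engine reports false for both flags; with the actual result quoted above, both would be true.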

I think this is pretty funny because I once wrote a toy "regex engine" when I didn't really know anything about bytecode or automata or anything. It used "brute force" backtracking based on recursive function calls (no explicit stack). Well, it had that exact bug ... until I noticed and fixed it a few months later. In other words, this is a beginner's mistake in state management/backtracking.
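For illustration, here is a minimal sketch of this class of mistake. It is neither the toy engine mentioned above nor anything resembling IE's actual code, and it is a plain loop rather than a recursive matcher. The step format (`peek`, `lit`) and the function names are invented for this example, and it only handles the shape of the testcase: an anchored alternation where each branch peeks at one character and then matches a literal.

```javascript
// Each alternative is a list of steps. Two hypothetical step kinds:
//   { peek: n }  like (?=(.)): store the next char in group n, consume nothing
//   { lit: ch }  match the literal character ch and advance
var alternatives = [
  [{ peek: 1 }, { lit: 'a' }],
  [{ peek: 2 }, { lit: 'b' }],
  [{ peek: 3 }, { lit: 'c' }]
];

// Run one branch from position 0. Returns the end position, or -1 on failure.
// Note that it writes into caps even if the branch fails later on.
function runBranch(steps, input, caps) {
  var pos = 0;
  for (var i = 0; i < steps.length; i++) {
    var step = steps[i];
    if (pos >= input.length) return -1;
    if ('peek' in step) {
      caps[step.peek] = input.charAt(pos);
    } else if (input.charAt(pos) === step.lit) {
      pos++;
    } else {
      return -1;
    }
  }
  return pos;
}

// Try the branches in order. With restoreOnFailure set to false, captures
// written by a failed branch leak into the final result (the bug).
function matchAlternation(input, restoreOnFailure) {
  var caps = [undefined, undefined, undefined, undefined];
  for (var i = 0; i < alternatives.length; i++) {
    var saved = caps.slice();          // snapshot before entering the branch
    var end = runBranch(alternatives[i], input, caps);
    if (end >= 0) {
      caps[0] = input.slice(0, end);   // index 0 holds the whole match
      return caps;
    }
    if (restoreOnFailure) caps = saved; // undo what the failed branch wrote
  }
  return null;
}

var buggy = matchAlternation('bar', false); // ['b', 'b', 'b', undefined]
var fixed = matchAlternation('bar', true);  // ['b', undefined, 'b', undefined]
```

The whole fix is the snapshot and restore around each branch. A recursive matcher gets the same effect by giving each attempt its own copy of the captures instead of mutating shared state.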

I'm pretty sure Microsoft does some testing before it releases software. Didn't anyone notice this?

About mauke

I blog about Perl.