Got more questions? Find advice on: ASP | SQL | XML | Windows
in Search
Welcome to RegexAdvice Sign in | Join | Help

Michael Ash's Regex Blog

Regex Musings

Looking again at the Lookahead bug

A few years ago a wrote about about a bug in Internet Explorer's Regex engine that affected patterns with lookaheads. Well the bug came back in the form of a question on RegexAdvice.com. It too was a password regex, though not as complex as the previous pattern that introduced me to this bug.

The first pattern had three conditions that were being tested for with lookaheads.

^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,15}$

With the current pattern only one lookahead was being used.

^(?=.*?\d)[a-z][a-z0-9]{5,7}$

In both patterns the pattern had a min and max length. In to original attempts of both the length was being checked after the lookahead(s) test. While this is perfectly fine in a non VBScript/JScript world, this is were the bug kicks in those regex engines. Actually it's probably the same regex engine for both languages, which is probably why it only effects IE it's the only browser that uses VBScript and Jscript natively. I don't recall in my original testing for this bug if I tested it server-side given my previous blog comments, mostly likely I only tested client-side. However the recent question it was failing server-side so it's not actually in the browser but more likely the DLL for those languages. Anyway the previous blog article covered the behavior that was happening.

Steve Levithan looked much closer at the problem in general and discussed it on his blog. He came to the conclusion that the qualifiers with a minimum boundary of zero, within the lookahead were the culprit. He provides a couple of simple examples. I think he's partially right but I don't think the 1+ qualifiers are excluded from the problem. I think his provided examples were a little too simple for them to be effected.

OK let look at the regex pattern in the recent question

^(?=.*?\d)[a-z][a-z0-9]{5,7}$

The requirements were 6 to 8 alphanumeric (English) string that started with an alpha character, with at least one digit. Let's use “abc123” as the text string.

Now the above pattern was supplied by one of the resident pros who frequent the message boards. Let's get pass the fact the pattern itself is correct and satisfies the requirements. I'm going to rewrite the pattern to

^(?=\D*\d)[a-z][a-z0-9]{5,7}$

Now this is functionally the same, it's just within the lookahead the pattern is greedy instead of lazy and it will make the point I'm trying to make easier to see (I hope) without having to deal with backtracking. Now this pattern suffers from the bug in VBScript/JScript. Now Steven suggested the one or more qualifier (+) doesn't suffer from the problem but change the to a + , which doesn't effect the match because the first test after the lookahead is for an alpha so the lookahead placed as it is will always match at least one character with the \D* pattern. So now we have

^(?=\D+\d)[a-z][a-z0-9]{5,7}$

which is also bitten by the bug.

I first tried the same approach on my original attempt dealing with the bug, but no luck. I tried using the + qualifier, still didn't work. Now the person posting the question stated without the lookahead the remaining pattern worked fine, excluding the at least one digit test. So I begin testing the part of patterns such as

^a-z][a-z0-9]{5,7}$

^(?=\D*\d)[a-z]

^(?=\D+\d)[a-z]

are just a few attempts, all work as expected. It was only when I put the whole pattern back together that it begin failing. But after some more trial and error I think I've come across a pattern with the bug. Going back to my original examination of the problem I discovered that this pattern ^(?=\D*\d)[a-z][a-z0-9]{2,7}$ matches the test string “abc123” now this doesn't seem to quite fit with my original assessment of what was happening because in that pattern after the lookahead test it's just testing for a certain number of any characters, but it this pattern it's looking for a specific range of characters but if you up the min boundary on the last qualifier by one to ^(?=\D*\d)[a-z][a-z0-9]{3,7}$ the pattern fails again. Now if you change the zero to infinity qualifier in the lookahead to one to infinity qualifier in the modified pattern so you get ^(?=\D+\d)[a-z][a-z0-9]{3,7}$ the pattern matches again. Bump up the min boundary of this new pattern to ^(?=\D+\d)[a-z][a-z0-9]{4,7}$ Bugaboo! The pattern fails again.

Now I don't have access to the source code so this is just supposition but here's what I believe is happening. I think the data examined by the lookahead is being stored in a stack structure. But just looking at the patterns that are working and failing it looks like once the lookahead is satisfied values when a qualifier is encountered in the consuming portion of the pattern, the lookahead's match is reconsidered with the minimum boundary of the qualifier in the lookahead popped off the stack of the lookahead's match. Let's look at the first pattern that worked

^(?=\D*\d)[a-z][a-z0-9]{2,7}$

OK we'll start with after the lookahead is matched. Now the lookahead is supposed to be non-consuming so the pointer should still be at the beginning of the string.

^(?=\D*\d) matches “abc1” Now the rest of the pattern matches normally until we get to the qualifier in the consuming portion of the pattern we are looking for at least two alphanumeric characters. At that point the lookahead match is reconsidered, the lower bound being zero, nothing is remove for the stack but the current character pointer now points to the character after the lookahead's match value of “abc1”.The qualified part of the pattern [a-z0-9]{2,7}$ can be satisfied by “23” so we get a match

Now if we do the same thing with ^(?=\D*\d)[a-z][a-z0-9]{3,7}$ and apply the same logic the regex fails because the qualifier in the consuming portion is look for at least 3 characters and there aren't that many if it tries to satisfy that part of the pattern after “abc1” in the test string.

Now let's look at ^(?=\D+\d)[a-z][a-z0-9]{3,7}$ with the same logic. The only difference from the previous pattern is the lower boundary of the lookahead qualifier. It's now 1. So if we pop 1 character of the stack of the lookahead's match of “abc1” we get “abc” leaving us with “123” to be matched by the consuming qualifier, which is just enough.

Now take ^(?=\D+\d)[a-z][a-z0-9]{4,7}$ apply the same logic, now the pattern fails because the consuming qualifier is look for as least 4 characters after the stack popping of lookahead's match.

If you changed the lookahead qualifier to {2,} the pattern would match. You can continue upping the qualifiers by one in the consuming portion to make the pattern not match then non-consuming part of the pattern to get the match so the behavior seems pretty consistent with my theory which the consuming qualifier is pointing to the end of the lookahead match and moving back the number of character of the lookahead minimum boundary. It also seem to explain the effect of my original encounter with the bug. It may as well as why the workaround of testing the length first with another avoids the bug because that consuming qualifier it that case is usually just , though + seem to work too which doesn't quite fit but consuming qualifiers with minimum boundaries of zero or one don't seem to be effected in any case. In all of the above test cased those values were below the minimal threshold of every successful test. However the test cases above has only one lookahead that doesn't backtrack, who know what influence backtracking, additional lookaheads or how addition qualifiers in the consuming parts pattern would be effected. Now while the test values support my theory I can't say for sure that things are happening exactly the way I've laid out. The actual mechanics may be different but whatever is happening under the hood clearly pointers are being corrupted and the regex engine is loosing it's place

The thing that was so confusing about this was it only kicks in with a qualifier in the consuming portion is encountered but if there was something between the lookahead and the qualified portion it match normally. So this is something it's really hard to make test cases for because you get these ghost value popping up latter on in the test than I expected. Not to mention the pattern itself is correct so even with a tool that will let you step through the matching you don't get this behavior unless that tool is using VBScript's regex engine and I haven't seen such a tool for that engine.

If you are using lookaheads with JavaScript client-side then you are going to be susceptible to the bug in IE because it will use the Jscript engine. And while you should always validate server-side if VBSCript or Jscript is your server-side language you are still at risk. So a platform like classic ASP which uses both of those languages by default is at risk client and server side, but a platform like PHP while it still suffer the bug client-side for IE, should work correctly server-side which is using a different regex engine. Same goes for non-web clients using the JScript/VBScript DLL.

The workaround for the strong password type of regex is when using lookaheads to include the upper boundary test(s) before the no upper boundary test then use .* to consume characters. The bound test should keep the pattern from running forever. However depending on the complexity of your criteria this may not always be an option but try it first anyway.

Published Saturday, February 21, 2009 1:05 PM by mash
Filed under:
Anonymous comments are disabled