.*? The reason to any slow regex!
Hi guys,
We all have been there, we have some text we need to allocate using a regex, and we cannot really define what can be between 2 sub matches, so we simply uses the easiest solution .*?. You will probebly wont notice the problem, until the day when the regex cannot find what we are looking for, and then it's takes hours to execute.
The problem will most likely occur when we use .*? twice or more in the same regex. let's try to understand why this is happening.
The reason to this issue is very simple. .*? means any char, zero or more times, but the shortest available.
If my regex is this one: <html>(.*?)<body>(.*?)<span> and the string I'm looking in has an html tag followed by a body tag followed by a span tag, then the regex will execute fast and I will be happy. but if we don't have a body tag or a span tag than the regex will go nuts trying all the possible combinations. why? because we ask it to do so!
In order to simplify the problem, I will create small sample. my text is: "abcde1234" and my regex is [a-z].*?(a).*?\d now I'll write the candidates the regex can check for the first part [a-z].*?
a, ab, abc, abcd, abcde, b, bc, bcd, bcde, c, cd, cde, d, de, e (15 candidates)
Now lets look at the candidates of the second part .*?\d
abcde1, abcde12, abcde123, abcde1234, bcde1, bcde12, bcde123, bcde1234, cde1, cde12, cde123, cde1234, de1, de12, de123, de1234, e1, e12, e123, e1234, 1, 12, 123, 1234, 2, 23, 234, 3, 34, 4 (30 candidates) 15 option combine with 30 will be 450 options. in case of "abcdef1234" we have 21 combine with 31 that's 651 options. add a number "abcdef12345" and you get 21 times 40 equal to 840 and so on and so on......
You can see for your self how by using simple math we can see how each time we add a new char, we get multiple numbers of new combinations for candidates. when you have 5000 chars in a source of an html page, then we are talking about billions of candidates. When you build a regex you should avoid writing regex that has too many candidates.
Here are some methods to reduce the candidates in your regex
- run the regex on a small part of the code which is relevant. You can do that by using another regex for allocating the area you wish to work with.
- avoid using more than 1 .*? in your regex. remember 2 .*? makes 2 dimension 3 .*? makes 3 etc. the more dimensions the more candidates you have.
- use negative definition instead of .*?
I wrote in the beginning of this post that the problem will be relevant the day we cannot find the match, well the reason for that is that when the regex engine will stop execution it finds the match. so if we have 12 billion candidates, but the match is found within the first 10000 candidates, than we will not notice the performance issue. but the day that the regex will make a full scan of all candidates we will see that the regex is slow.
I hope this information will help you writing efficient regex!
May the force be with you.