Got more questions? Find advice on: ASP | SQL | XML | Windows
in Search
Welcome to RegexAdvice Sign in | Join | Help

any reason why not to (.+?)

Last post 10-20-2008, 5:49 PM by ddrudik. 7 replies.
Sort Posts: Previous Next
  •  08-20-2008, 1:21 PM 45501

    any reason why not to (.+?)

    when regexing for stuff I generally use the format:

     <somthing always before>(.+?)<something always after>

     

    is there any particular reason why this is a bad idea?

  •  08-20-2008, 3:29 PM 45507 in reply to 45501

    Re: any reason why not to (.+?)

    Well, I wouldn't call it a bad idea, but (in certain cases) you could do better. For a detailed explanation, read the paragraphs "Laziness Instead of Greediness" and "An Alternative to Laziness" from http://www.regular-expressions.info/repeat.html
  •  10-09-2008, 2:11 AM 47035 in reply to 45507

    Re: any reason why not to (.+?)

    Hi.I have the following problem. I want to search a string like this "Sub ...<content>(also includes linebreaks)...Sub End". When I used the regexp "^Sub.*?(End Sub)$" everything works fine. Now I read, that this solution isn't good for performance. Therefore I tried to use a regexp like "^Sub[^(End Sub)]+(End Sub)$" but it doesn't work. Where is the mistake in the regexp? Can anybody help? Thanks for solutions.

     Hardie

  •  10-09-2008, 7:06 PM 47054 in reply to 47035

    Re: any reason why not to (.+?)

    You are misusing character classes - an all too common mistake. See Michael's excellent blog entry at http://regexadvice.com/blogs/mash/archive/2008/01/31/A-touch-of-Character-Class.aspx. Consider the subpattern:

    [^(End Sub)]+

    What this character class definition tells the regex engine is to match one or more characters that are not one of '(', ')', 'E','S', 'b', 'd', 'n', 'u' and the space character. This will match "#$%$#%" or "theBearwentoverthemoUntain" as these do not contain any of those characters. However "the bear went over the mountain" would not match because of the lower case 'b', the lower case 'u' and the space characters - actually you would get a match of "the". (This assumes the matches are case sensitive).

    What you are wanting to do is to match anything that is not the string "End Sub". For that you need to use lookaheads in the following way:

    ((?!End Sub).)+

    What this does is to start at the current location within the text and see if the "End Sub" characters follow. If they don't (this is a negative lookahead) then match whatever the next character is (according to the 'singleline' or 'dot matches newline' option setting) which will also advance the current locaiotn within the text. If you are wanting to capture all of the characters matched by this pattern then you will need to insert it within a match group as in

    (((?!End Sub).)+)

    Given the appropriate options settings, try:

    ^Sub((?!End Sub).)+End Sub$

    This will return all of the characters beginning with a "Sub" that starts at the beginning of the text/line ('multiline' option) and grabs everything to the first "End Sub" as long as it is the last thing in the text/line.

    Personally I think forcing the 'Sub' to be at the start of a line and the 'End Sub' to be at the end may be a bit restrictive in general as it does not allow for any text formatting or comments. Also I prefer to make these things case insensitive but I know that some language editors/compilers force the required case. As always, you need to understand the nature of the source you are examining.

    As for why the performance is supposed to not be good, I would check that you have exactly the same situation. In general using the non-greedy '.*?xxx' pattern where you are looking to grab everything until you get to xxx works in basically the same way as the pattern I have provided above with the exception that the regex engine will try to make a match (not just a lookahead) with the 'xxx' part after it matches each character with the '.'. This means that the backtracking will amount to about as much work as my suggestion with the lookahead. If you have text such as:

    sub end suc eeeend sub end sua end sub

    then you will make the regex engine scan/match ahead for 6 characters because it can match the "end su" part before it finds a mismatch. In both cases you would then either reject the match or backtrack over all of those characters before you moved forward 1 character and tried again.

    The subpattern '.*'  can be very bad for performance and cause excessive bactracking (as well as finding the wrong terminating strings) in some situations. Also when you nest subpatterns within each other and use quantifiers, then you can get some very long match times if you are not careful.

    Susan

  •  10-20-2008, 2:52 AM 47348 in reply to 45501

    Re: any reason why not to (.+?)

    While using lazy quantifiers is certainly better than the careless use of (.+), it is preferable to resort to using negated character classes instead (if possible).

    By example:

    $str = '<h1>This is a headline</h1>';

    preg_match('#<h1>([^<]+)</h1>#', $str, $match);

    echo $match[1];

    The issue with using lazy quantifiers involves the matter of jumping back and forth in passing control within regex. So let's re-examine my simple example above using (.+?) instead.

     preg_match('#<h1>(.+?)</h1>#', $str, $match);

    So after the initial <h1> is matched, what happens here is that the regex has to capture the first upcoming character it finds (due to the + quantifier). So in this case, it captures the letter T. But since it is lazy, it has met the minimum requirement (one or more times... well, it found something to capture at least once.). So now regex passes control over to what is after the (.+?) (which in this case is <) in the pattern and checks to see if the next character after T is < ... it isn't,as the next character is an h.. so, now the regex appends the h character into the capture. Again, since it is lazy, it once again passes control over to check if the character after h is < ... and of course, at this point, it isn't.. so it appends the i to the capture... this cycle keeps going until finally, regex has captured 'This is a headline'. once that last e is captured, it will once again pass over control to see if the next character is < and it is! So now, it checks if the character after that is / and it is.. and so forth until the entire </h1> is matched, thus completing the condition.

    As you can see, there is alot of backtracking and the passing of control to check and see what comes next. Conversely, by using ([^<]+), so long as the next character is not <, it is included in the capture (without the need of any backtracking)... so this is far more linear and streamlined. Only once it reaches the < character, does the capture stops, and then a test match for </h1> is done, and if is in fact those following characters, the condition is met.

     I strongly suggest picking up Jeff Friedl's book 'Mastering Regular Expressions'. It is a fantastic book that really goes deep into how regex engines *think*.. A very good way to get a much greater understanding of issues such as this.

    http://www.amazon.com/Mastering-Regular-Expressions-Jeffrey-Friedl/dp/0596528124/ref=pd_bbs_sr_1?ie=UTF8&s=books&qid=1224485329&sr=8-1 

    Cheers,

    NRG 

     


    If preg was a woman, I'd marry her. But I could just see the pattern... she would replace me with someone with money ($1).
  •  10-20-2008, 4:59 PM 47382 in reply to 47348

    Re: any reason why not to (.+?)

    That's a good case for benchmarking, I don't do it nearly enough in the patterns I use, but here's a benchmarking example for the two patterns supplied to illustrate the speed difference:

    <pre>
    <?php
    $str = '<h1>This is a headline</h1>';
    $patterns=array('#<h1>(.+?)</h1>#','#<h1>([^<]+)</h1>#');
    foreach($patterns as $pattern){
      if(preg_match($pattern, $str)){
        $time_start = microtime(true);
        for ($i = 0; $i < 100000; $i++) {
          preg_match($pattern, $str, $match);
       }
       $time_end = microtime(true);
       $elapsed_time = round($time_end-$time_start,4);
       echo "pattern: ".htmlentities($pattern)."<br>iterations: $i<br>match: $match[1]<br>time: $elapsed_time seconds<hr>";
      }
    }
    ?>

     


  •  10-20-2008, 5:21 PM 47383 in reply to 47382

    Re: any reason why not to (.+?)

    Excellent example, ddrudik. Illustrates things nicely. I too am guilty of not benchmarking much...

    Granted, to be fair, using either method on a simple small string will not matter, as the speed difference is negligible at best (we would certainly not perceive the milliseconds difference). But if needed to plow through tons of content, less efficient patterns will start to rear their ugly heads. I keep issues like this in the back of my mind as a matter of coding pactices and principals. Prior to getting Jeff Friedl's book, I used to do even worse and throw around stuff like .+ or .* without knowing the implications. Now I am MUCH more mindful when constructing patterns.

     So I do the following when I build patterns...

    if I can use negated character classes... do that... if not, then at the very least, use lazy quantifiers... only in rare situations will .+ or .* serve well enough without the hinderance of backtracking (like say having .+ or .* at the very end of a pattern where you absolutely want to match the rest of a string regardless of what it matches).

     Nice demo once again :)

    Cheers

    NRG 


    If preg was a woman, I'd marry her. But I could just see the pattern... she would replace me with someone with money ($1).
  •  10-20-2008, 5:49 PM 47385 in reply to 47383

    Re: any reason why not to (.+?)

    Thanks, it's unfortunately true that even an inefficient regex pattern (short of infinite backtracking exceeding the limit in PHP) will likely perform in an acceptable manner.


View as RSS news feed in XML