
replace pattern between anchor tags

Last post 03-10-2010, 5:34 PM by james438. 4 replies.
  •  03-09-2010, 9:35 PM 60617

    replace pattern between anchor tags

    Platform: PCRE

    Purpose: I am trying to teach myself a little PCRE. One expression in particular has been nagging me:

    <?php

    $text="this-that <a href=\"http://www.my-site.com/index.php\">this-that</a> this-that";

    $text=preg_replace('/(<a.+>)(.+)(<\/a>)/Ue',"'$1'.str_replace('-','XX','$2').'$3'",$text);

    echo "$text";

    ?>

    //this-that <a href=\"http://www.my-site.com/index.php\">thisXXthat</a> this-that

    It seems to me that this is something of a cheat, as I am relying on PHP's str_replace rather than on the PCRE pattern itself.

    How should I replace "-" that is found within opening and closing anchor tags without resorting to the /e modifier?

    This is just an exercise to help me better understand an aspect of PCRE.
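
For readers on current PHP: the /e modifier was deprecated in PHP 5.5 and removed in PHP 7.0, so the snippet above no longer runs there. A minimal sketch of the same two-step idea using preg_replace_callback follows. The tightened pattern ('\b' after the tag name, '[^>]*' for the attributes) and the variable names are illustrative assumptions, not part of the original post, and unlike /e the callback does not add backslashes to the quotes in the output.

```php
<?php
// Same input as in the post above.
$text = "this-that <a href=\"http://www.my-site.com/index.php\">this-that</a> this-that";

// Capture the opening tag, the link text and the closing tag separately,
// then rewrite only the link text inside the callback.
$result = preg_replace_callback(
    '~(<a\b[^>]*>)(.*?)(</a>)~is',
    function (array $m): string {
        return $m[1] . str_replace('-', 'XX', $m[2]) . $m[3];
    },
    $text
);

echo $result, "\n";
```

This still does the replacement in PHP rather than in the pattern, so it does not answer the exercise itself; it only shows what the original approach looks like without /e.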

  •  03-09-2010, 11:16 PM 60620 in reply to 60617

    Re: replace pattern between anchor tags

    My standard approach to this is a pattern such as:

    -(?!((?!</?a\b).)*</a)

    with whatever you want as the replacement string.

    You can set the "singleline" option depending on your circumstances - ditto the "ignore case" option if the "a" tag can possibly be in upper case.

    What this does is to locate a "-" character and then start looking forward for either a "<a..." or "</a" sequence (or the end of the line). If there is a match on the "</a" then we must be somewhere after the start of the opening "<a..." tag (assuming the tags are correctly balanced within the line/string you are parsing) and so we don't accept the match. If it is anything else (the "<a..." or end of line/string situations) then we must be outside the paired tags and so the match is accepted.

    Because this is only matching on the "-" itself (the rest of the pattern is a lookahead), it will make the replacement and then move on to look for any subsequent "-" (assuming the replace operation looks for all matches)
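
As a rough illustration of the above, here is the pattern applied to the string from the original question. The "~" delimiter (which avoids escaping the forward slashes) and the "i" and "s" options are choices made for this sketch only.

```php
<?php
$text = "this-that <a href=\"http://www.my-site.com/index.php\">this-that</a> this-that";

// Match a "-" only when scanning forward does NOT reach "</a"
// before hitting another "<a" tag or the end of the string.
$pattern = '~-(?!((?!</?a\b).)*</a)~is';

$result = preg_replace($pattern, 'XX', $text);

echo $result, "\n";
// thisXXthat <a href="http://www.my-site.com/index.php">this-that</a> thisXXthat
```

Note that the "-" inside the href attribute is left alone as well: from that position the next tag boundary found is also "</a", so it counts as being inside the pair.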

    Also watch out for the '.+' parts of your pattern. When '.+' is greedy (the default, i.e. without an ungreedy option), '<a.+>' will actually match from the first "<a" to the last ">" in the string. What happens is that the "<a" text is matched (BTW this will also match "<arrow", "<aardvark" and any other 'tag' that begins with "<a"; generally you should do what I've done and limit the tag name with the '\b' anchor), and then the '.+' matches every character from there to the end of the line/string before the engine tries to match whatever follows. In this case that is the ">" character, so the engine starts backtracking (releasing, one at a time, the characters it previously matched with the '.+') until either it has released everything back to the first character (in which case that part of the match is declared a "fail") or a ">" character has been released and so matches. Unfortunately this means that you are matching the FIRST "<a" with the LAST ">", even if there are many "<a ...>...</a>" sequences in between.

    In your case, the pattern will then use the '(.+)' part to again grab everything to the end of the line/string before it looks for the "</a>" sequence. Again this starts the backtracking process but, because we previously matched the "<a" at the start with the ">" of the "</a>" at the end, the backtracking will run out of characters before it can find a subsequent match. Therefore it will go back to the previous '.+' in the '<a.+>' part and release a single character from there. This allows the '(.+)' part to again grab everything to the end of the line/string, and this part of the backtracking starts all over again.

    I'm sure you can see that this can lead to a LOT of work (in some cases this can lead to what appears to be an infinite loop) by the regex engine, especially if you are scanning a large block of text (say a web page!).

    One way around this is to stop the '+' (or '*') quantifier from being greedy by using the '+?' quantifier. This makes the regex engine match one character and then see if the following part of the pattern will match. If it doesn't then it will match another character and try again. In this way it moves forward (albeit slowly) and so matches the beginning characters with the NEXT matching end characters.
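
A small sketch of the difference, using a made-up two-link string (the sample text and variable names are assumptions for illustration). It also shows PCRE's U option, which swaps the default greediness of quantifiers and so gives the same result here as writing '+?' explicitly.

```php
<?php
$html = '<a href="x">one</a> and <a href="y">two</a>';

// Greedy: ".+" runs to the end and backtracks to the LAST ">".
preg_match('~<a\b.+>~', $html, $greedy);

// Lazy: ".+?" expands one character at a time and stops at the NEXT ">".
preg_match('~<a\b.+?>~', $html, $lazy);

// The U option makes "+" lazy by default.
preg_match('~<a\b.+>~U', $html, $ungreedy);

echo $greedy[0], "\n";   // the whole string
echo $lazy[0], "\n";     // <a href="x">
echo $ungreedy[0], "\n"; // <a href="x">
```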

    Susan

  •  03-10-2010, 11:38 AM 60661 in reply to 60620

    Re: replace pattern between anchor tags

    Thanks for your explanation. I know that I had a similar problem in an earlier post, but for now it takes a bit for me to wrap my mind around many different regular expressions.

    I thought I was avoiding the greediness in my pattern by using the /U modifier, assuming that is what you were referring to in $text=preg_replace('/(<a.+>)(.+)(<\/a>)/Ue',"'$1'.str_replace('-','XX','$2').'$3'",$text);

    but I am trying to move away from that type of PCRE usage and more towards what you mentioned with -(?!((?!</?a\b).)*</a)

    I added backslashes to your pattern so that it looks like -(?!((?!<\/?a\b).)*<\/a) since I suspected they were needed when using "/" as the delimiter. I just tested this and that was indeed the case. I also tried "@", "#", and the double quote as delimiters; they all worked fine and removed the need to escape the forward slash.
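
A quick check of the delimiter point, on a short made-up string (the sample text and variable names are assumptions for illustration): the escaped version with "/" and the unescaped version with "@" should behave identically.

```php
<?php
$text = 'a-b <a href="#">c-d</a> e-f';

// "/" as delimiter: every literal "/" in the pattern must be escaped.
$withSlash = preg_replace('/-(?!((?!<\/?a\b).)*<\/a)/', 'XX', $text);

// "@" as delimiter: the forward slashes can be written as-is.
$withAt = preg_replace('@-(?!((?!</?a\b).)*</a)@', 'XX', $text);

echo $withSlash, "\n"; // aXXb <a href="#">c-d</a> eXXf
echo $withAt, "\n";    // aXXb <a href="#">c-d</a> eXXf
```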

    Your expression matches the pattern found outside of the anchor tags. What if I wanted to do the opposite and replace the pattern found between the anchor tags? How would I do that? The best I could come up with is:

    <?php
    $text="hi-ll-upu<a href=\"http://www.animev-iews.com/index.php\">yo-me-yo</a>tipsy-me";
    $text=preg_replace('/(>.*?)-(.*?<\/a\b)/','$1XX$3',$text);
    echo "$text";
    ?>

    This produces hi-ll-upu<a href="http://www.animev-iews.com/index.php">yoXX>tipsy-me

    as opposed to the desired result:

    hi-ll-upu<a href="http://www.animev-iews.com/index.php">yoXXmeXXyo</a>tipsy-me

    I suspect that I am on the right track though.



    P.S. I do wish that I had the ability in these forums to place my code in [CODE]code[/CODE] tags. It would make reading posts a little easier on the eyes.
  •  03-10-2010, 5:08 PM 60672 in reply to 60661

    Re: replace pattern between anchor tags

    Try:

    -(?=((?!</?a\b|>).)*</a)

    As you can see, this is just a minor variant on my first suggestion, with two changes. First, the outer lookahead is now positive, so a "-" is accepted only when the forward scan does reach "</a" (before, it was accepted only when the scan did not, i.e. outside the tags). Second, there is an additional way to stop the inner loop, namely the ">" character; this stops a match on a "-" that sits inside the opening tag itself, such as one in the href value.
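
Applied to the test string from the previous post, this can be sketched as follows (the "~" delimiter and the "i" and "s" options are choices made for this sketch only):

```php
<?php
$text = "hi-ll-upu<a href=\"http://www.animev-iews.com/index.php\">yo-me-yo</a>tipsy-me";

// Match a "-" only when scanning forward reaches "</a"
// without first crossing a ">" or the start of another "<a"/"</a" tag.
$pattern = '~-(?=((?!</?a\b|>).)*</a)~is';

$result = preg_replace($pattern, 'XX', $text);

echo $result, "\n";
// hi-ll-upu<a href="http://www.animev-iews.com/index.php">yoXXmeXXyo</a>tipsy-me
```

The "-" in "animev-iews" is untouched because the scan from there hits the ">" that closes the opening tag before it can reach "</a".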

    Susan

  •  03-10-2010, 5:34 PM 60673 in reply to 60672

    Re: replace pattern between anchor tags

    Thank you very much for your help and explanation.