Help with Regular Expressions to Replace Chars Found Between HTML Tags

Last post 07-03-2008, 8:00 PM by Aussie Susan. 7 replies.
  •  07-02-2008, 12:59 PM 43694

    Help with Regular Expressions to Replace Chars Found Between HTML Tags

    Hi,

    I am trying to implement the following regular expressions in Javascript but could use some help to get them working properly:

    1. Goal: Replace the single quote character found between HTML tags with &rsquo; special character.
    RegEx:

    value = value.replace(/>(.*?)\'(.*?)</gi, ">$1&rsquo;$2<");

    2. Goal: Replace the double quote character found AFTER A SPACE between HTML tags with space and &ldquo; special character.
    RegEx:

    value = value.replace(/>(.*?)\s|&nbsp;\"(.*?)</gi, ">$1\s&ldquo;$2<");

    3. Goal: Replace the double quote character found BEFORE A SPACE between HTML tags with &rdquo; special character and space.
    RegEx:

    value = value.replace(/>(.*?)\"\s(.*?)</gi, ">$1&rdquo;\s$2<");

    4. Goal: Replace the dash character found BEFORE AND AFTER A SPACE between HTML tags with space and &mdash; special character and space.
    RegEx:

    value = value.replace(/>(.*?)\s|&nbsp;-\s|&nbsp;(.*?)</gi, ">$1\s&mdash;\s$2<");

    These are working somewhat but contain a few bugs and need to be tweaked. Any help would be appreciated. Thank you!
  •  07-02-2008, 7:19 PM 43710 in reply to 43694

    Re: Help with Regular Expressions to Replace Chars Found Between HTML Tags

    An initial comment based on a quick observation:

    In the 2nd and 4th regexes, be careful of the extent of the alternation operator's effect. Most operators work on the single item that immediately precedes them (e.g. the quantifiers) but the alternation operator operates on everything within the group on either side of it. Therefore

    />(.*?)\s|&nbsp;\"(.*?)</gi, ">$1\s&ldquo;$2<"

    has the 2 alternate parts:

    />(.*?)\s
    and
    &nbsp;\"(.*?)</gi, ">$1\s&ldquo;$2<"

    which may not be what you want. To limit the alternation to the '\s' and '&nbsp;' you should use

    (\s|&nbsp;)

    If you don't want to introduce a matching group, then use

    (?:\s|&nbsp;)

    The same goes for the 4th case except your current pattern will be in 3 parts, and the solution is the same.
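
    A minimal sketch of the difference, using a made-up sample string that contains no double quote at all. The variable names are only for illustration:

```javascript
// Ungrouped: "|" splits the WHOLE pattern into two alternatives,
//   >(.*?)\s      or      &nbsp;"(.*?)<
const ungrouped = />(.*?)\s|&nbsp;"(.*?)</;

// Grouped: "|" only chooses between \s and &nbsp;
const grouped = />(.*?)(?:\s|&nbsp;)"(.*?)</;

// Hypothetical sample with no double quote in it
const noQuotes = "<p>no quotes here</p>";

const ungroupedMatch = ungrouped.exec(noQuotes); // succeeds via the first alternative
const groupedMatch = grouped.exec(noQuotes);     // fails: there is no quote to find

console.log(ungroupedMatch[0]); // ">no "
console.log(groupedMatch);      // null
```

    The ungrouped pattern is satisfied by '>' followed by any text and a space, so it reports a match even though there is nothing to replace.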

    Susan
     

  •  07-02-2008, 9:22 PM 43717 in reply to 43710

    Re: Help with Regular Expressions to Replace Chars Found Between HTML Tags

    Susan,

    Thank you. I implemented your suggestion, and I also changed the . to [^<] or [^>] because . was too loose. I now have:

    value = value.replace(/>([^<]*?)\'([^>]*?)</gi, ">$1&rsquo;$2<");

    value = value.replace(/>([^<]*?)(?:\s|&nbsp;)\"([^>]*?)</gi, ">$1\s&ldquo;$2<");

    value = value.replace(/>([^<]*?)\"\s([^>]*?)</gi, ">$1&rdquo;\s$2<");

    value = value.replace(/>([^<]*?)(?:\s|&nbsp;)-(?:\s|&nbsp;)([^>]*?)</gi, ">$1\s&mdash;\s$2<");

    They seem to be working properly now, but I still need to do more testing. Do you see anything else wrong or do they look good to you? 

  •  07-02-2008, 11:36 PM 43719 in reply to 43717

    Re: Help with Regular Expressions to Replace Chars Found Between HTML Tags

    Without a good set of sample text to test against, I am not in a position to comment on whether they 'look good'. I would suggest that you create a set of examples that you know:

    - definitely should match
    - definitely should not match
    - are on the borderline but should match
    - are on the borderline but should not match
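
    One way to keep such a set is a small table of inputs and wanted outputs that can be re-run after every tweak. This is only a sketch: the sample strings, the closeQuote and failingCases names, and the use of a literal space in the replacement are all assumptions made for the example.

```javascript
// The closing-quote rule from the previous post, but with a literal space
// in the replacement string (an assumption made for this sketch).
function closeQuote(html) {
  return html.replace(/>([^<]*?)"\s([^>]*?)</gi, ">$1&rdquo; $2<");
}

// Hypothetical cases, one per category; `expected` is the output we WANT.
const cases = [
  { kind: "should match",
    input: '<p>say "hi" now</p>', expected: '<p>say "hi&rdquo; now</p>' },
  { kind: "should not match",
    input: '<a title="x" href="#">link</a>', expected: '<a title="x" href="#">link</a>' },
  { kind: "borderline, should match (tab after the quote)",
    input: '<p>"hi"\tnow</p>', expected: '<p>"hi&rdquo; now</p>' },
  { kind: "borderline, should not match (no space after the quote)",
    input: '<p>say "hi"</p>', expected: '<p>say "hi"</p>' },
];

// Return the cases whose actual output differs from the expected output.
function failingCases(rule, table) {
  return table.filter((c) => rule(c.input) !== c.expected);
}

console.log(failingCases(closeQuote, cases).map((c) => c.kind));
```

    An empty list means every case behaved as wanted; anything else names the category that broke.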

    Looking at the example of:

    ([^<]*?)\"

    I think there may be an easier way. What you are asking for is a lazy search forward that will stop just before the next '<' OR the next double-quote. The regex engine will be stepping forward a character at a time (effectively) checking for both the '<' and '"' end conditions. Therefore you can make its job more deterministic by using

    ([^<"]*)"                      (I'll leave you to work out the necessary escaping)

    When the first part completes, the next character is either a '<', the end of the text/line or the double-quote character. In the first two cases you are not dealing with a double-quote before the start of the next tag so you want it to fail anyway. The third case is what you want and it will succeed on the following double-quote.
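
    As a rough illustration (the sample strings are made up), both forms capture the same text. In a JavaScript regex literal the double-quote needs no escaping:

```javascript
const lazy = />([^<]*?)"/;    // lazy: checks for '"' after every single step
const negated = />([^<"]*)"/; // greedy, but the class cannot run past '"' or '<'

const withQuote = '<b>hello "world"</b>';
const withoutQuote = "<b>plain</b>";

console.log(lazy.exec(withQuote)[1]);    // "hello "
console.log(negated.exec(withQuote)[1]); // "hello "
console.log(lazy.test(withoutQuote));    // false
console.log(negated.test(withoutQuote)); // false
```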

    Once you have found a 'quote-space' (or whatever) pattern you are then scanning forward to the start of the next tag (with similar comments about the use of the character class actually searching for the '>' at the end of the next tag and backtracking to the '<'). What this will do is miss out those cases where there are two (or more) instances of the pattern between the tags. For example:

    <a>hello "world" and "how are you" today</a>

    would have:

    hello "world

    as match group #1 and:

    and "how are you" today

    as match group #2. Although you have the 'g' option set, the regex engine will not rescan the replaced text and so you will get:

    <a>hello "world&rdquo;\sand "how are you" today</a>

    as the result without multiple passes. Do you really need to match the text to the start of the next tag? You already know that the '" ' text must be after a tag and before the start of the next one from the first match part and, because of the check for the end of the preceding tag, you will need to make multiple passes anyway. Personally, I would use

    "\s(?=[^<>]*(<|$))

    (I'm fairly sure that variable length lookaheads are supported in JavaScript) which only matches the 'double-quote space' pattern but checks ahead for the start of the next tag or the end of the string (assuming the multiline option is off). It will still not find a match within a tag because the character set within the lookahead will stop at either a '<' or '>' (or the end of the text by implication) and, if the match started within a tag, it stops at the closing '>', so the following check for '<' fails.

    By the way, you will not find the patterns you are after before the first or after the last tag. Is that a problem? My suggestion above will get over that as well.
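
    To compare the two approaches on the example above, here is a small sketch. It assumes a literal space in the replacement strings rather than '\s':

```javascript
const sample = '<a>hello "world" and "how are you" today</a>';

// Anchored on the surrounding tags: the first match consumes everything up
// to the next "<", so the second quote-space pair is never examined.
const anchored = sample.replace(/>([^<]*?)"\s([^>]*?)</g, ">$1&rdquo; $2<");

// Lookahead version: only the quote and the space are consumed, so every
// occurrence between the tags is found in a single pass.
const lookahead = sample.replace(/"\s(?=[^<>]*(<|$))/g, "&rdquo; ");

console.log(anchored);  // <a>hello "world&rdquo; and "how are you" today</a>
console.log(lookahead); // <a>hello "world&rdquo; and "how are you&rdquo; today</a>
```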

    Finally, can you put a '\s' into the replacement string in javascript without it putting the literal "\s" characters into the output? Normally '\s'  is a match shortcut for a character set with all of the whitespace characters in it. I suspect that you want a literal space character in the replacement string.
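
    A quick check of what a JavaScript string literal does with '\s' (the sample text is made up). As far as I know, '\s' is not a recognised string escape, so the backslash is simply dropped:

```javascript
const text = 'say "hi" now';

// In a string literal "\s" is just the letter "s"; it only means
// "whitespace" inside a regular expression.
const withBackslashS = text.replace(/"\s/g, "&rdquo;\s");
const withSpace = text.replace(/"\s/g, "&rdquo; ");

console.log("\s" === "s");   // true
console.log(withBackslashS); // say "hi&rdquo;snow
console.log(withSpace);      // say "hi&rdquo; now
```

    So the replacement neither inserts whitespace nor the two characters '\s'; it inserts a stray 's'.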

    Susan

     

  •  07-03-2008, 10:03 AM 43726 in reply to 43719

    Re: Help with Regular Expressions to Replace Chars Found Between HTML Tags

    > Do you really need to match the text to the start of the next tag?

    No. I just want to ensure the text I replace is between HTML tags (and I'm not altering the attributes of the tags themselves).

    Using your suggestions, I've adjusted the 4 regexes to:

    value = value.replace(/'(?=[^<>]*(<|$))/gi, "&rsquo;$1");
    value = value.replace(/(?:\s|&nbsp;)\"(?=[^<>]*(<|$))/gi, " &ldquo; $1");
    value = value.replace(/\"(?:\s|&nbsp;)(?=[^<>]*(<|$))/gi, "&rdquo; $1");
    value = value.replace(/(?:\s|&nbsp;)-(?:\s|&nbsp;)(?=[^<>]*(<|$))/gi, " &mdash; $1");
  •  07-03-2008, 1:52 PM 43730 in reply to 43726

    Re: Help with Regular Expressions to Replace Chars Found Between HTML Tags

    I don't think they are working as intended now. Can you explain:

    (?=[^<>]*(<|$))

    I don't quite understand that. I also don't know if I should only have it after the quote character I am searching for.

     

  •  07-03-2008, 7:31 PM 43739 in reply to 43730

    Re: Help with Regular Expressions to Replace Chars Found Between HTML Tags

    This is a positive lookahead expression. It checks that the characters that follow the current point in the text match a sub-pattern but it does not move the pointer into the text and it does not include the lookahead characters in the overall match. The expression inside the lookahead is exactly the same as if you were doing an actual matching process. The 'positive' aspect refers to the fact that a lookahead can succeed or fail, just like the normal matching process can, with a positive lookahead returning 'success' if the characters DO match and 'fail' if they do not (which of course can trigger a backtracking step). A 'negative' lookahead does the same thing but reverses the logic: 'success' if the following characters do not match the pattern and 'fail' if they do.
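
    A tiny illustration of the two kinds of lookahead, using made-up patterns and words rather than anything from your expressions:

```javascript
// Positive lookahead: match "q" only when a "u" follows; the "u" is checked
// but not consumed.
const positive = /q(?=u)/;
// Negative lookahead: match "q" only when a "u" does NOT follow.
const negative = /q(?!u)/;

console.log("quit".match(positive)[0]); // "q"
console.log(positive.test("Iraq"));     // false
console.log(negative.test("quit"));     // false
console.log(negative.test("Iraq"));     // true
```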

    There are positive and negative lookbehind patterns as well. The capabilities of regex engines to handle lookaround constructs varies considerably. Some allow the lookahead and lookbehind to have a variable length (i.e. use '*' and '+' style quantifiers which can match different numbers of characters) while others restrict either (usually the lookbehind) or both to fixed length matches.

    As for the pattern itself: at this point in the overall process, you have found the 'double-quote space' (or whatever) pattern and you now need to see if this occurs within a tag or between tags. Of course, a tag begins with a '<' and ends with a '>' and the 'rules of the game' indicate that these characters should not occur as literal characters anywhere else (i.e. they should be replaced by the '&xxx;' style constructs). Therefore, we want to scan ahead until we come to one or other of these characters -- that's the '[^<>]*' part.

    There are actually 3 conditions that will stop the character set match: either the '<' or '>' characters or the end of the text. (Note that in this case, the end of a line will not cause the character set quantifier to stop regardless of the 'singleline' and 'multiline' settings, because a negated character set also matches newline characters.) So, when the character set has finished checking characters we know that the next character will be one of '<' or '>' or we will be at the end of the text.

    If the next character is '>' then we must have started from within a tag and this we don't want.

    If the next character is '<' then we must have started outside a tag and this is the start of the next tag: this situation is OK.

    If we are at the end of the text, then again we must have started outside a tag so the world is looking good.

    Therefore we need to check for either the '<' character or the end of the text: hence the (<|$) part (remember that the '$' is not a literal dollar sign in this context - it is a placeholder for the end of the line/text, depending on the 'multiline' option setting).

    Putting it all together:

    (?=[^<>]*(<|$))

    will search forward and only return a success if we are not in the middle of a tag.

    Given the larger context of your pattern, this means that, to get a completely successful match, we are matching the 'double-quote space' (or whatever) characters (only) and know that this is not occurring within a tag. Therefore we can replace the matched character only with our replacement string.
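
    The three outcomes can be tried out with a short sketch. The closeQuotes name and the sample strings are made up, and the replacement uses a literal space:

```javascript
// Replace quote-space with &rdquo; and a space, but only outside tags.
function closeQuotes(html) {
  return html.replace(/"\s(?=[^<>]*(<|$))/g, "&rdquo; ");
}

// The scan stops at ">": we started inside a tag, so the attributes are untouched.
console.log(closeQuotes('<a title="x" href="#">link</a>'));
// <a title="x" href="#">link</a>

// The scan stops at "<": we started between tags, so the quote is replaced.
console.log(closeQuotes('<p>say "hi" now</p>'));
// <p>say "hi&rdquo; now</p>

// The scan stops at the end of the text: also outside a tag.
console.log(closeQuotes('say "hi" now'));
// say "hi&rdquo; now
```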

    Does that help?

    Susan

     

  •  07-03-2008, 8:00 PM 43740 in reply to 43726

    Re: Help with Regular Expressions to Replace Chars Found Between HTML Tags

    Had a closer look at your expressions and I have a few suggestions.

    value = value.replace(/'(?=[^<>]*(<|$))/gi, "&rsquo;$1");

    The '$1' in the replacement string is referring back to the first match group (NOT the whole match). I'm not sure about javascript, but you may or may not actually be capturing anything in match group #1 (which is actually inside the lookahead - I don't think it should be matching). Remember that the lookahead does not contribute to the overall match so the only character in the whole match will be the single quote. Therefore the pattern could be:

    value = value.replace(/'(?=[^<>]*(<|$))/gi, "&rsquo;");
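
    For what it is worth, a quick sketch with a made-up sample string suggests that JavaScript does keep what a group captures inside a positive lookahead, so the '$1' copies the '<' into the output:

```javascript
const sample = "<p>it's</p>";

// The "(<|$)" group sits inside the lookahead but still captures the "<".
const withDollar1 = sample.replace(/'(?=[^<>]*(<|$))/g, "&rsquo;$1");
const withoutDollar1 = sample.replace(/'(?=[^<>]*(<|$))/g, "&rsquo;");

console.log(withDollar1);    // <p>it&rsquo;<s</p>
console.log(withoutDollar1); // <p>it&rsquo;s</p>
```

    Writing the group as (?:<|$) would avoid capturing anything at all, should the '$1' ever be needed for something else.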

    Next, looking at:

    value = value.replace(/\"(?:\s|&nbsp;)(?=[^<>]*(<|$))/gi, "&rdquo; $1");

    Remember that a replace operation will replace the entire match of the expression and that includes any components that you match within a non-capture group. Useful as it might be at times, you cannot pick and choose which parts of the overall match get replaced (to do this you need to create match groups and reconstruct what you want from those parts in the replacement string). Therefore, what is happening here is that you will be matching the double-quote AND the space (in whatever form) and replacing them both with the replacement string. As before, the $1 may or may not be grabbing the '<' character used to terminate the lookahead but it is certain that it is not appropriate here. Therefore I would suggest moving the check for the space inside the lookahead if you don't want to have it replaced, as in:

    value = value.replace(/\"(?=(\s|&nbsp;)[^<>]*(<|$))/gi, "&rdquo;");

    or:

    value = value.replace(/\"(\s|&nbsp;)(?=[^<>]*(<|$))/gi, "&rdquo; ");          // note the space as the last character of the replacement text
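
    The two variants differ in what happens to the space itself, which matters when it is an '&nbsp;'. A small sketch with a made-up sample string:

```javascript
const sample = '<p>"hi"&nbsp;there "ok" now</p>';

// Variant 1: the space is only checked inside the lookahead, so it stays as it was.
const keepSpace = sample.replace(/"(?=(\s|&nbsp;)[^<>]*(<|$))/g, "&rdquo;");

// Variant 2: the space is consumed and a plain space is written back.
const rewriteSpace = sample.replace(/"(\s|&nbsp;)(?=[^<>]*(<|$))/g, "&rdquo; ");

console.log(keepSpace);    // <p>"hi&rdquo;&nbsp;there "ok&rdquo; now</p>
console.log(rewriteSpace); // <p>"hi&rdquo; there "ok&rdquo; now</p>
```

    The first keeps the '&nbsp;' intact while the second turns it into an ordinary space, so the choice depends on whether the original spacing has to be preserved.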

    Minor point, but in the case of:

    value = value.replace(/(?:\s|&nbsp;)\"(?=[^<>]*(<|$))/gi, " &ldquo; $1");

    there is no advantage to be gained by making the leading space check non-capturing as it will be included in the overall capture, and you have included a space in the replacement string, so:

    value = value.replace(/(\s|&nbsp;)\"(?=[^<>]*(<|$))/gi, " &ldquo;");

    should do the trick.

    Susan

     
