
Back Reference Overwrites Changes

Last post 12-01-2007, 11:00 AM by ddrudik. 18 replies.
  •  11-20-2007, 4:15 PM 36728

    Back Reference Overwrites Changes

    I ran into an interesting situation where back references overwrite previous changes in a regular expression.

    I needed a regular expression that updates a pattern but also checks for something that comes before the pattern.  It should leave everything before the pattern unchanged.  There can be multiple matches of the pattern to replace.  I expected replacements made by the regular expression to be visible in later back references, but that doesn't seem to be the case.

    Only the last matched pattern is updated.  (Actually I think all matches are updated; the earlier updates just seem to be overwritten.)

    Is there a way to make these previous updates show up in the future back references? What is the correct way to approach this problem?
     

    Language: JavaScript

    Example Function:

            function replaceTest() {
                var reg, str;
                var test_str = "istriiiiiingi";
                str = "([\\w]+)(i)([\\w]*)"    // find the i character where there is at least one preceding character.
                reg = new RegExp( str, "ig" );     
                test_str = test_str.replace(reg, "$1a$3")
                alert(test_str);    //result is "istriiiiiinga", how do we make this "istraaaaaanga"
                return false;
            }
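
A note on why only the last "i" changes: the greedy `([\w]+)` first consumes the whole string and then backtracks just far enough for `(i)` to match, so even with the `g` flag the pattern matches once, and `$1` reinserts the original, unmodified text. Nothing is overwritten. The sketch below illustrates this; the function names are illustrative only, and the `(?!^)` lookahead is just one possible way to leave the first character alone, not necessarily what the real requirement needs.

```javascript
// Count how many times the original pattern matches with the "g" flag.
function countMatches(text) {
    var reg = new RegExp("([\\w]+)(i)([\\w]*)", "ig");
    var count = 0;
    while (reg.exec(text) !== null) {
        count++;
    }
    return count;
}

// Replace every "i" that is not at the very start of the string.
// (?!^) fails only at position 0, so the first character is left alone.
function replaceAllButFirst(text) {
    return text.replace(/(?!^)i/gi, "a");
}

var sample = "istriiiiiingi";
console.log(countMatches(sample));        // number of matches found by the original pattern
console.log(replaceAllButFirst(sample));  // every "i" after the first character becomes "a"
```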

     

     

  •  11-20-2007, 4:51 PM 36730 in reply to 36728

    Re: Back Reference Overwrites Changes

    <script language="JavaScript">
    var test_str = "istriiiiiingi";
    test_str = test_str.substr(0,1) + test_str.substring(1,test_str.length).replace(/i/gi,'a');
    alert(test_str);
    </script>


  •  11-20-2007, 5:21 PM 36734 in reply to 36730

    Re: Back Reference Overwrites Changes

    This seems fine if all you wish to do is leave the first letter of the word, but what if you want to find a real pattern?

    Can you accomplish the same thing with back references?  My question is mainly about back references and the use of $.

    I attempted to make a simplified example here so that I could provide the code as well as the example text, but I don't think it got the point across. My real situation was that I wanted to bold all occurrences of a word in a paragraph except when the given word is inside of a tag (<tag attr="this is inside">this is not inside</tag>).  The word should be bold if it appears in the inner html of a tag, but not in the attributes or tag name.

    I eventually accomplished this without using the back reference but the thought still remained... What if I needed to use a back reference that included the section where previous changes occur?  Are you out of luck or is there a way to do this?
     

     

  •  11-20-2007, 5:59 PM 36739 in reply to 36734

    Re: Back Reference Overwrites Changes

    This is precisely why the posting guidelines instruct you to post real-world text samples instead of made up ones.  Feel free to post some of your real-world data if you would like a different solution.


  •  11-20-2007, 6:00 PM 36740 in reply to 36734

    Re: Back Reference Overwrites Changes

    I'm sorry to say this, but this is a classic reason why there are posting guidelines at the top of this forum, and why they specifically ask for real examples and not made up ones.

    Having got that off my chest, how about:

    (?<!<[^<>]*)\b(inside)\b(?![^<>]*>)

    with a replacement string of (say)

    <b>$1</b>

    which will turn your 'real' example from

    <tag attr="this is inside">this is not inside</tag> 

    into

    <tag attr="this is inside">this is not <b>inside</b></tag>

    I have used the 'ignore case' option. Obviously you can choose whatever word you want to look for - I have arbitrarily chosen "inside" for this example. By the way, this may not work for all regex engines as it uses a variable length look-behind, and I don't know if JavaScript will allow this.

    One other thing about terminology: from what I understand, a 'back reference' occurs within the pattern such as "<(\w+)>(.*?)</\1>" which will find the opening and corresponding closing tag, whatever the tag name happens to be. What you are using here is a match group reference within the replace text.
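
To make the terminology concrete, here is a small sketch in JavaScript (the thread's language). The variable names and sample strings are made up for illustration.

```javascript
// A backreference lives inside the pattern: \1 must repeat what group 1 captured.
var pairedTag = /<(\w+)>(.*?)<\/\1>/;

var good = pairedTag.exec("<em>text</em>");  // good[1] is the tag name, good[2] the content
var bad = pairedTag.exec("<em>text</b>");    // null: the closing name differs from group 1

// A group reference lives in the replacement text: $1 and $2 insert captured text.
var swapped = "first second".replace(/(\w+) (\w+)/, "$2 $1");

console.log(good[1], good[2], bad, swapped);
```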

    However, I would also suggest that you look at using the HTML/XML DOM for this type of thing - it will automatically separate text that is inside a tag from that which is outside, and will allow you to manipulate the text more easily.
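
A rough sketch of the DOM route, assuming a browser DOM: walk an element's `childNodes` recursively and collect the text nodes (`nodeType === 3`). Attribute values and tag names never appear in a text node's `nodeValue`, so searching only those nodes avoids the in-tag problem. `collectTextNodes` is a hypothetical helper name, not something from the thread.

```javascript
// Recursively gather the text nodes under a DOM node.
// Text nodes have nodeType 3; element nodes have nodeType 1.
function collectTextNodes(node, found) {
    found = found || [];
    if (node.nodeType === 3) {
        found.push(node);
    } else if (node.childNodes) {
        for (var i = 0; i < node.childNodes.length; i++) {
            collectTextNodes(node.childNodes[i], found);
        }
    }
    return found;
}

// In a browser this might be used as:
//   var nodes = collectTextNodes(document.getElementsByTagName("p")[0]);
//   each nodes[i].nodeValue then holds plain text only
```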

    Susan 

  •  11-20-2007, 8:07 PM 36743 in reply to 36740

    Re: Back Reference Overwrites Changes

    As best I understand it, lookbehinds are not supported in JS.

    Pending source text example, here's a shot in the dark:

    My source text:

    this is not inside either <tag attr="this is inside">this is not inside not inside for sure</tag> this is not inside neither

    Match on:

    ([^<]*)(\binside\b)

    Replace with:

    $1<b>$2</b>

    Result:

    this is not <b>inside</b> either <tag attr="this is inside">this is not inside not <b>inside</b> for sure</tag> this is not <b>inside</b> neither

    Note that only the last instance of the keyword in each stretch of text between tags is matched.  Also note that this sort of pattern will match lines of script code as well if they contain the keyword. I would recommend searching the forum for matching outside of html tags; there are a number of previous posts on this subject that may help hurry along the solution.


  •  11-27-2007, 5:41 PM 36994 in reply to 36743

    Re: Back Reference Overwrites Changes

    <h4>Keyword</h4>
    <p class="noindent">This is a paragraph about keywords. I wish i could highlight all the things that say keyword but not when the keyword is inside of a tag, in the attributes or the tag name.</p>
    <p class="image"><img src="/someDir/someOtherDir/images/keyword.jpg" alt="../images/keyword.jpg"></p>
    <p>Everything I try will only highlight the last occurence of a keyword within a paragraph.  I need all keywords in the paragraph hightlighted.  I want to highlight the keyword even if it is only part of the word.  For example all but the last letter of keywords should be hightlighted.  How can I highlight keyword?</p>
    There just may be a keyword here too. keyword.
    </div>
     

    Above is an example of the actual situation I am encountering.  My original problem was that only the last occurrence of the keyword was being highlighted.

    I tried to search the forum for  "matching outside of html tags" and I couldn't find anything that I could make work.
    So far all of my regular expressions either end up highlighting the instance of "keyword" in the file path, only highlight the last occurrence of the word, or just don't work.

    My code takes the inner HTML of the DIV tag and tries to apply a regular expression replacement to highlight the keyword with a span tag around it.

    I appreciate your responses.

    I was having a problem testing the example with the ? at the beginning.  I am getting an "invalid quantifier" error... I was getting this when testing some of the examples found in the matching outside of html discussion.

    I am already using getElementsByTagName to locate the tags that I need to highlight in, but these tags may also contain tags.  I'm not sure how to use the DOM to locate the tags inside of the innerHTML of the containing tag after I locate it with getElementsByTagName.

    Language JavaScript

    Thanks,

    Andy 

  •  11-27-2007, 7:39 PM 37008 in reply to 36994

    Re: Back Reference Overwrites Changes

    JavaScript doesn't support lookbehind, but you can often emulate it. But you don't need to in this case. JavaScript's ability to use functions for replacements and iterate through strings using a while loop and the exec method offers a number of ways you can easily pull this off.

    Try this:

    var newStr = oldStr.replace(/[^<]+|<[^>]+>/g, function ($0) {
        return ($0.charAt(0) == "<") ? $0 : $0.replace(/\bkeyword\b/gi, "<b>$&</b>");
    });

    That is vastly preferable to parsing through the DOM tree, finding each text node, checking its value, creating and inserting new element nodes for highlighting, creating and inserting new text nodes for them, and so forth. A regex against the innerHTML of the elements you want to search within is the way to go if you care about performance (although there are some caveats here) or code brevity, and don't have JavaScript events attached to descendant elements.
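
The same idea can be wrapped for an arbitrary search term. This is only a sketch: `highlightTerm` and `escapeRegExp` are illustrative names, the term is escaped so that regex metacharacters in it are matched literally, and the word boundaries are dropped on the assumption that partial-word hits should be highlighted too.

```javascript
// Escape regex metacharacters so the term is matched literally.
function escapeRegExp(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Split the HTML into tags and text runs; only text runs are searched.
function highlightTerm(html, term) {
    if (!term) {
        return html;                                 // nothing to search for
    }
    var termRe = new RegExp(escapeRegExp(term), "gi");
    return html.replace(/[^<]+|<[^>]+>/g, function (chunk) {
        if (chunk.charAt(0) === "<") {
            return chunk;                            // a tag: leave it untouched
        }
        return chunk.replace(termRe, "<b>$&</b>");   // a text run: wrap each hit
    });
}

var sampleHtml = '<p class="keyword">A keyword and keywords.</p><img src="keyword.jpg">';
console.log(highlightTerm(sampleHtml, "keyword"));
// prints: <p class="keyword">A <b>keyword</b> and <b>keyword</b>s.</p><img src="keyword.jpg">
```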


    My regex-centric blog :: JavaScript regex tester
  •  11-28-2007, 6:52 PM 37053 in reply to 37008

    Re: Back Reference Overwrites Changes

    With JavaScript's capabilities I think I was approaching the problem wrong.  There is probably a better way to do this but in case it is helpful to others facing similar issues, here is my approach.

    Instead of trying to highlight using the Regex Replace method I used the Exec method to locate the search terms inside of each innerHTML block and make substitutions of each match after checking to make sure that the match is not inside of a tag.

    After locating the search term I find the start and end positions of all the tags that are inside the innerHTML block using another regular expression.  For each matched search term I make sure the start position is not between the start and end positions of any tag.  If it's not in a tag then I use the substring method to remove and replace the section of the text matched.  I do the replacement on a temporary copy of the innerHTML and replace the actual innerHTML in the document after all highlighting is complete.
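
The approach described above can be sketched roughly as follows. This is a reconstruction under assumptions, not the poster's actual code: the function names are made up, `<b>` stands in for the highlight span, and the result is built as a new string so that offsets taken from the original stay valid.

```javascript
// Record [start, end) offsets of every tag in the HTML string.
function findTagSpans(html) {
    var spans = [], tagRe = /<[^>]+>/g, t;
    while ((t = tagRe.exec(html)) !== null) {
        spans.push([t.index, t.index + t[0].length]);
    }
    return spans;
}

// Wrap each occurrence of term that does not start inside a tag.
function highlightWithExec(html, term) {
    if (!term) { return html; }   // an empty pattern would never advance exec()
    var spans = findTagSpans(html);
    var escaped = term.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    var termRe = new RegExp(escaped, "gi");
    var out = "", last = 0, m;
    while ((m = termRe.exec(html)) !== null) {
        var inTag = false;
        for (var i = 0; i < spans.length; i++) {
            if (m.index >= spans[i][0] && m.index < spans[i][1]) { inTag = true; break; }
        }
        if (inTag) { continue; }  // match starts inside a tag: skip it
        out += html.substring(last, m.index) + "<b>" + m[0] + "</b>";
        last = m.index + m[0].length;
    }
    return out + html.substring(last);
}
```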

    I won't go into more detail because this isn't about JavaScript.

    Thanks for the help. 

  •  11-28-2007, 8:19 PM 37056 in reply to 37053

    Re: Back Reference Overwrites Changes

    AProgrammer:
    There is probably a better way to do this

    Um, yes. The JavaScript code I posted above is a more efficient and undoubtedly lighter weight approach. You can simply replace "var newStr = oldStr.replace..." with "element.innerHTML = element.innerHTML.replace...".


  •  11-30-2007, 5:06 PM 37146 in reply to 37056

    Re: Back Reference Overwrites Changes

    You must be some sort of JavaScript Regex genius, that is a seriously impressive function.  Every time I look at it I feel like an idiot.

    Your code is definitely a lighter weight approach.  It turns my 20 lines of confusing code into a single line of brilliantly confusing code... nice work. 

    I would like to use your code but there is one slight problem I am having with it.

    My text block is very large and running the line of code you posted on it takes about a minute and a half.  During this time the browser is unresponsive.

    I would like to ask the user if they would like to continue every 20 seconds or so.  If they say no in a confirm dialog I want the regex to stop replacing things. 

    Is there a way to stop the execution of the regex from inside "function($0)"?

     

    Thanks 

  •  11-30-2007, 6:03 PM 37148 in reply to 37146

    Re: Back Reference Overwrites Changes

    AProgrammer, are you just trying to highlight a word in the current HTML page?

    Here's what someone else created for a tooltip highlighter, it might be adaptable for your purpose:

    <style type="text/css">
     <!--
      .Glossary { cursor: pointer;
                  color: #F00;
                  text-decoration: underline;
      }

      #wordMeaning div { padding: .5em;
      }

      #wordMeaning h1  { background: #CC0;
                         border: 1px solid #000;
                         font-size: 1.2em;
                         margin: 0;
                         padding: 0 .3em;
      }

      #wordMeaning     { background: #FF9;
                         border: 1px solid #000;
                         min-width: 250px;
                         position: absolute;
                         display: none;
      }
     -->
    </style>
    <scrip t>
    function addTip(word,tip,style){
     reg=new RegExp('( '+ word+')','gi')
     document.body.innerHTML=document.body.innerHTML.replace(reg,'<span class="'+style+'" title="'+tip+'">$1</span>')
    }

    function init(){
    addTip('rabbits','Rabbits are annoying pests\nsome people like them though','Glossary')
    addTip('frogs','Frogs are best poached in a\nthick cream and garlic sauce.\nRabbits taste better though','Glossary')
    }
    </scrip t>

    <body onload="init()"><img src="http://blog.ewanscorner.com/images/rabbits.jpg">
    <p>Lorem ipsum frogs and rabbits arrabbitsation froggles rabbits dolor sit amet,
    consectetuer Frogs adipiscing elit, sed rabbits diem nonummy nibh euismod tincidunt ut
    lacreet dolore magna aliguam erat volutpat. Ut wisis enim ad minim veniam, quis nostrud
    exerci Rabitts tution ullamcorper suscipit FRogs and rabbits lobortis nisl ut aliquip ex
    ea commodo consequat. Duis te RABBits arrabbitsation froggle feugifacilisi. Duis autem
    dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore frigs eu
    feugiat nulla facilisis at vero eros et accumsan et iusto odio rabbit dignissim qui
    blandit praesent luptatum zril frogs delenit au gue duis dolore te feugat nulla facilisi. </p>
    </body>


  •  11-30-2007, 8:05 PM 37153 in reply to 37148

    Re: Back Reference Overwrites Changes

    Where your code fails is that it only highlights words that begin with the term being searched for.  In my case if the search term is "determine" and the document contains the word "predetermined" then the determine part of predetermined should be highlighted.  If you remove the space before the search term in your regular expression, the image location is modified and you have no more pretty rabbits to look at. 

    That problem has been fixed, highlighting is now working.  Stevezilla00 posted a great example that works.  His example includes word boundaries but I removed them and everything is great.

    My problem now is speed.  I work with eBooks and in some cases my content is large enough so that the highlight function takes a very long time to complete.  For example when highlighting author names of a bibliography with over 9000 entries the function takes about a minute and a half to complete.  Some things just take a long time, but I would like to give the user an option to stop execution of the function.  I saw your request for a URL before you posted the code (space in script trick?).  The content is not currently available to the general public but I am going to create a sample page that is.  When the page is available I will post the URL.

     Thanks for the help.
     

  •  11-30-2007, 8:35 PM 37154 in reply to 37153

    Re: Back Reference Overwrites Changes

    Sounds like you should move from JS (which I assume you are running client-side) to ASP or some other server-side language to speed up the process.


  •  11-30-2007, 9:02 PM 37155 in reply to 37154

    Re: Back Reference Overwrites Changes

    I develop mainly in Firefox.  The slow execution times I reported here are only valid for Firefox.  I tested in IE from an external machine when I set up the publicly available example and was surprised.  IE executed the JavaScript in about 10-15 seconds.  Firefox takes 75-90 seconds.  The slow speed is likely a Firefox problem (I have the most recent version), but if the document were larger (and I do have larger documents) execution could be slow in either browser.  I would like to ask the user if they want to continue every 20-30 seconds.

    The sample has an alert printed before and after the bulk of the highlighting code to illustrate the time it takes to highlight.  The document is rather large so it takes a while to load before the script is executed.  The timer with the confirm box was valid when I was selecting the "P" tag instead of the body.  In that case the loop would execute thousands of times and the timer worked, allowing users to break out of the loop.  The new case only loops once on the Body tag and therefore my timer code becomes useless.  The timer stuff was left as an example of something that I would like to include in the function($0) function but I don't see a way to break out of the execution of the replace code.  Is it possible?

    Here is the URL:
      http://hro.abc-clio.com/regextest/sample.asp

     Thanks
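
On breaking out of the replace callback: `replace` has no cancel hook, but an exception thrown from the callback does propagate out of `replace` and ends it. The sketch below relies on that, assuming the original string should be kept when the user cancels. `highlightWithBudget`, `shouldContinue` and `now` are hypothetical names; in a browser the last two might be `function () { return confirm("Continue?"); }` and a function returning the current time in milliseconds.

```javascript
// Tag/text replacement that checks a time budget inside the callback.
function highlightWithBudget(html, termRe, budgetMs, shouldContinue, now) {
    var STOP = {};                    // unique sentinel used to abort replace()
    var deadline = now() + budgetMs;
    try {
        return html.replace(/[^<]+|<[^>]+>/g, function (chunk) {
            if (now() >= deadline) {
                if (!shouldContinue()) { throw STOP; }  // abandon the whole replace
                deadline = now() + budgetMs;            // user chose to keep going
            }
            return chunk.charAt(0) === "<" ? chunk : chunk.replace(termRe, "<b>$&</b>");
        });
    } catch (e) {
        if (e === STOP) { return html; }  // cancelled: hand back the original
        throw e;                          // anything else is a real error
    }
}
```

Because `replace` runs synchronously, the page will still not repaint between prompts; this only gives the user a way out.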
     
