
Pattern Exclusions

Last post 09-26-2008, 1:51 AM by Aussie Susan. 3 replies.
  •  07-29-2008, 7:02 AM 44686

    Pattern Exclusions

    I have a regex to find quoted strings: \"([^\"]*)\"

    Now that works fine and dandy, however I also now need to exclude when it falls inside a single quote string, for example:

     
    'Hello "how" are you?'

     

    Using my regex it'd still find "how", and I want it simply ignored. Any ideas, anyone? This can be on its own line or as part of another (it's JavaScript).
     

  •  07-29-2008, 7:42 PM 44723 in reply to 44686

    Re: Pattern Exclusions

    Have a look at a current discussion going on in the 'Construction Advice' forum about a very similar problem (ignoring semicolons when in a quoted string rather than double-quotes) at http://regexadvice.com/forums/44722/ShowThread.aspx#44722

    There are two approaches outlined there:

    - mine is to locate the character (a double-quote in your case) and then check for an even number of single-quote characters before and/or after it: if the count is even then you are NOT in a quoted string, if it is odd then you are

    - Sergei Z's approach is an alternation that EITHER locates the start of a single-quoted substring and skips over it regardless of the intervening characters OR looks for the specific character. This uses the fact that most regex engines test alternations in the order they are specified and will accept a match when the first alternate succeeds.
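
    A rough sketch of both approaches, written in Python only for illustration (the patterns themselves should carry over to most engines). The function names and the sample line are made up for this example, and neither version copes with escaped quotes such as \' or \". The parity check is also fooled by an apostrophe inside a double-quoted string, which the alternation handles.

```python
import re

# The pattern from the original question: a double-quoted string.
DOUBLE_QUOTED = re.compile(r"\"([^\"]*)\"")

# Alternation: a whole single-quoted string (not captured) OR a
# double-quoted string (contents captured in group 1).
SKIP_OR_CAPTURE = re.compile(r"'[^']*'|\"([^\"]*)\"")


def by_parity(text):
    """Keep a match only if an even number of ' characters precede it."""
    return [
        m.group(1)
        for m in DOUBLE_QUOTED.finditer(text)
        if text.count("'", 0, m.start()) % 2 == 0
    ]


def by_alternation(text):
    """Let the first alternate swallow '...' strings; keep group 1 hits."""
    return [
        m.group(1)
        for m in SKIP_OR_CAPTURE.finditer(text)
        if m.group(1) is not None
    ]


sample = 'var a = \'Hello "how" are you?\'; var b = "fine";'
print(by_parity(sample))       # ['fine']
print(by_alternation(sample))  # ['fine']
```

    With the alternation you have to check which alternate matched (here, whether group 1 took part in the match), because the single-quoted strings are still reported as matches.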

    If they don't help you, then let us know.

    Susan 

  •  09-26-2008, 12:41 AM 46680 in reply to 44723

    Re: Pattern Exclusions

    Hi Susan, Thanks for your reply.

    I studied regex negative lookaheads last night.

    Basically I am trying to parse C++ code and fish out all occurrences of the pow() function call (starting with "pow" and ending with a ;). I want to exclude C-style comments from the pow occurrences. I have written the expression below.

    The first group is to find the pow occurrence and the second is to exclude the comments from it. Both parenthesised parts work separately but not together :(, i.e. the negation is failing.

    Regex matchpow = new Regex("(pow.*\\s*;)(?! /\\*[^\\*]+\\*/)");

     Anu

     

  •  09-26-2008, 1:51 AM 46683 in reply to 46680

    Re: Pattern Exclusions

    I think you may have replied to the wrong topic, but be that as it may....

    OK, let's try to break down what you are trying to do in a way that may show you why I think you are misunderstanding how lookarounds work.

    Let's start by finding the letters "pow". This is simply the pattern 'pow'.

    The next part is to locate the next semicolon which should indicate that we have reached the end of the function call. The best (in my opinion) pattern to do this is '[^;]*;' which accepts each non-semicolon character and finally the first semicolon that we come to after the "pow". (This gets around the other problem I mentioned to you in the other post where the '*' quantifier is greedy and may well grab too many characters).
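
    To make the difference concrete, here is a small comparison in Python (used purely for illustration; the sample source line is invented). The greedy '.*' runs on to the last semicolon on the line, while '[^;]*;' stops at the first one:

```python
import re

source = "a = pow(x, 2); b = 1; c = pow(y, 3);"

# Greedy '.*' grabs everything up to the LAST semicolon on the line.
greedy = re.findall(r"pow.*;", source)

# '[^;]*;' can never step over a semicolon, so each call is its own match.
bounded = re.findall(r"pow[^;]*;", source)

print(greedy)   # ['pow(x, 2); b = 1; c = pow(y, 3);']
print(bounded)  # ['pow(x, 2);', 'pow(y, 3);']
```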

    If I understand you correctly, you now want to remove any C-style comments from the text we have just matched. In other words, to take:

    xxxx pow( /*arg 1*/ 10.0, /* arg 2 */ 5.0); yyyy

    as the raw input text and end up with:

    pow( 10.0, 5.0);

    There are several ways this can be done, but first a couple of points. Remember that once you have matched a character you cannot 'unmatch' it without backtracking, which also unmatches all of the characters that follow it. This means that there is no way to go back into the total set of characters you have matched and remove some of them from the match. Further, a regex must match each and every character from the first to the last - it cannot skip over characters. Finally, the majority of regex engines cannot record more than one set of text for any match group (.NET being the notable exception).

    Therefore, the easiest way is to use 2 passes with the first being the regex we developed above:

    pow[^;]*;

    We can apply this to our source text and we will end up with one match for each occurrence of the 'pow' function. We can then use a second regex pattern to search for and replace any c-style comments - something like:

    /\*((?!\*/)[\s\S])*\*/

    This looks for the "/*" start sequence and then steps forward a character at a time, provided that we do not see the "*/" sequence; that check uses a negative lookahead. (The '[\s\S]' character set will allow us to pass over newlines regardless of the 'singleline' option setting, whereas the '.' is controlled by that option and may or may not do what you want if the function call can split over several lines.)

    The way any lookahead works is, starting from the current point in the source text, it performs the normal matching process as defined by its sub-pattern. However, it NEVER includes any characters that it uses in this process in the match, and neither does it advance the current pointer in the source text. If it is a positive lookahead, then the overall match will continue if the lookahead match succeeded and fail if it didn't. Negative lookaheads do exactly the opposite in that they allow the overall match to proceed if the sub-match failed and vice versa.

    In this case we want to stop the matching process of the "[\s\S]" subpattern if we come up against the "*/" character sequence. Therefore we perform a negative lookahead just before each attempt at the "[\s\S]" match. When the lookahead rejects a position the repetition stops, and we know that the current point in the source text is immediately before the "*/" sequence, so we match those two characters explicitly.
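
    A tiny demonstration of that "check but don't consume" behaviour, again in Python just as an illustration (the test strings are made up):

```python
import re

# Positive lookahead: "pow" must be followed by "(", but "(" is not consumed.
positive = re.match(r"pow(?=\()", "pow(2, 3)")
print(positive.group(0), positive.end())  # pow 3

# Negative lookahead: "pow" must NOT be followed by "(".
rejected = re.match(r"pow(?!\()", "pow(2, 3)")
accepted = re.match(r"pow(?!\()", "power")
print(rejected)           # None
print(accepted.group(0))  # pow
```

    In both successful cases the match ends right after "pow": the character the lookahead inspected is still available to whatever follows in the pattern.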

    Now that we have found a c-style comment, we can remove that text either by coding this ourselves or using the 'replace' regex function. (As a side note, be careful that you replace the correct text. If you grab all of the matches at once and then try to replace the text yourself, remember to account for the fact that removing any text will alter the position information of all subsequent matches. In these circumstances I tend to work from the last match back to the first so that the match positions the regex engine gives me will always be accurate.)
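
    The two passes might look something like the following in Python (an illustration only; the helper names are invented here). The second function shows the "work from the last match back to the first" idea from the side note, for the case where you remove the text yourself rather than calling the replace function.

```python
import re

# Pass 1: each "pow ... ;" call.  Pass 2: each C-style comment.
POW_CALL = re.compile(r"pow[^;]*;")
C_COMMENT = re.compile(r"/\*((?!\*/)[\s\S])*\*/")


def pow_calls_without_comments(source):
    """Find every pow call, then strip the comments from each one."""
    return [C_COMMENT.sub("", m.group(0)) for m in POW_CALL.finditer(source)]


def strip_comments_back_to_front(text):
    """Delete comments by hand, last match first, so that the start/end
    positions of the earlier matches stay valid while we cut."""
    for m in reversed(list(C_COMMENT.finditer(text))):
        text = text[:m.start()] + text[m.end():]
    return text


raw = "xxxx pow( /*arg 1*/ 10.0, /* arg 2 */ 5.0); yyyy"
print(pow_calls_without_comments(raw))    # ['pow(  10.0,  5.0);']
print(strip_comments_back_to_front(raw))  # xxxx pow(  10.0,  5.0); yyyy
```

    Note the doubled spaces left behind where the comments were; tidy those up separately if they matter. Also be aware that a semicolon inside a comment would end the first-pass match early.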

    If you know that there will only ever be at most a single comment within the "pow....;" sequence, then you can use something like:

    (pow((?!/\*|;)[\s\S])*)((/\*((?!\*/)[\s\S])*\*/)(((?!;)[\s\S])*))?;

    with a replacement string of:

    $1$6;

    to find all of the text from the "pow" until either the semicolon or the start of the comment, and store this into match group #1. If we have found a comment, then we use basically the same code as above to match it and then move on to match everything from there until the semicolon and put that into match group #6 (count the parentheses that are not lookaheads, or use a regex testing engine that provides a breakdown of an expression, to get the correct matching group numbers). Note that there may not be a comment and so the last part of the whole pattern is optional (the trailing '?'). We can then reconstruct the required text from the part before any comment and any part after.
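
    Here is the single-comment pattern tried out in Python (an illustration, assuming Python 3.5 or later, where a group that did not take part in the match is replaced by an empty string). Python writes the group references as \g<1> and \g<6>. The final semicolon is matched outside both groups, so this sketch puts it back literally in the replacement.

```python
import re

SINGLE_COMMENT_POW = re.compile(
    r"(pow((?!/\*|;)[\s\S])*)"      # group 1: "pow" up to a comment or ';'
    r"((/\*((?!\*/)[\s\S])*\*/)"    # group 4: the comment itself
    r"(((?!;)[\s\S])*))?"           # group 6: the rest, up to the ';'
    r";"
)

with_comment = "y = pow( /*base*/ 10.0, 5.0);"
without_comment = "z = pow(2, 3);"

m = SINGLE_COMMENT_POW.search(with_comment)
print(repr(m.group(1)))  # 'pow( '
print(repr(m.group(6)))  # ' 10.0, 5.0)'

# Rebuild each call from group 1 + group 6 + the literal semicolon.
print(SINGLE_COMMENT_POW.sub(r"\g<1>\g<6>;", with_comment))
# y = pow(  10.0, 5.0);
print(SINGLE_COMMENT_POW.sub(r"\g<1>\g<6>;", without_comment))
# z = pow(2, 3);
```

    If there is a second comment, it simply ends up inside group #6 and survives the replacement.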

    Unless you are using the .NET regex engine, then you cannot simply generalise this pattern to handle any number of comments, and even with the .NET engine you cannot do a regex replacement as there is no syntax that gives you access to the individual match captures - you have to do that from your code.

    What I think you were trying to do is to somehow tell the regex engine, once it has found the "pow.....;" sequence, to go back into it and remove any C-style comments. Unfortunately that is not possible, and I hope that my very long explanation gives you some idea as to why.

    Finally, be careful about the pattern you did use to find the end of a comment: namely '[^\\*]+'. Once the C# string escaping is taken into account this is the regex '[^\*]+', which will stop as soon as it finds an asterisk. In this case you really want it to stop at the "*/" sequence, but your subpattern will also stop when it finds something like the first lone '*' inside the comment in:

    asdasd /* asdasd * dsfsdf * sadasd */

    and the following check for "*/" will fail. Unless you KNOW that a lone '*' will NEVER occur in a comment then this may lead to problems in the future.

    Susan
