
Negative Lookahead - my first adventure

Last post 02-09-2010, 12:03 PM by mash. 8 replies.
  •  02-05-2010, 9:25 AM 59322

    Negative Lookahead - my first adventure

    I have finally bitten the bullet and am trying to banish my inherent FEAR of regular expressions. Having started, I now know why I resisted for so long.

    I am doing some URL rewriting and I am convinced regular expressions are the most elegant solution to my problem. But it's really hard...

    The problem:

    I want to write an expression that changes this:

    http://somedomain.com/usa/diy/products/prodtype/productname.aspx

    to

    http://somedomain.com/usa/diy/products/search.aspx?ProductId=3026

    I don't really want to go into detail about the whys and wherefores of how I actually get the string ProductId=3026, but to give some background: I will take /prodtype/productname.aspx and look that up in a list to find the correct product id.

    The problem with this is that a URL could be either:

    http://somedomain.com/usa/diy/products/prodtype/productname.aspx or
    http://somedomain.com/usa/diy/products/prodtype/search.aspx

    My guess is I need to use negative look ahead to match something.aspx but ignore search.aspx?

    I tried many combinations to no avail, but this is what I think is the closest:

    ((.+)(\/products)(\/.+\/)((?<!search).aspx))

    So I want to match:

    anything/products/anything/something.aspx

    but only where the something is not search.

    If it makes any difference, I am doing this in C#. Can anyone please save me from tearing my hair out?

    Thanks

     

  •  02-05-2010, 9:52 AM 59323 in reply to 59322

    Re: Negative Lookahead - my first adventure

    Confusingly, if this helps:

    this works, ignoring any .aspx file named search but bringing back all others:

    ((.+)(?<!search).aspx)

     but as soon as I add this it stops working: 

    ((.+)(\/products)(\/.+\/)(?<!search).aspx)

    Can anyone explain why, and how to correct this problem? It's important because I only want this rule to apply to pages under products/subsite.

  •  02-05-2010, 12:37 PM 59326 in reply to 59323

    Re: Negative Lookahead - my first adventure

    .*/(?!search.aspx)([^/]*?.aspx

    would only match on

    http://somedomain.com/usa/diy/products/prodtype/productname.

    not on

    http://somedomain.com/usa/diy/products/search.aspx?ProductId=3026

    tested OK in Expresso 3.0

  •  02-05-2010, 12:38 PM 59327 in reply to 59326

    Re: Negative Lookahead - my first adventure

    corrected to:

    .*/(?!search.aspx)[^/]*?.aspx

  •  02-08-2010, 11:00 AM 59407 in reply to 59322

    Re: Negative Lookahead - my first adventure

    Thanks mate, this worked great... I managed to implement it in my example, but when I tried it in another circumstance it failed, probably due to my lack of understanding of how regular expressions work.

    I am trying to use the same approach to match any string with ".aspx" in it unless it contains "/Lists/".

    So it must match sitename.com/subsite/subsite/pagename.aspx

    but fail on these:

    sitename.com/Lists/subsite/pagename.aspx

    sitename.com/subsite/Lists/pagename.aspx

    sitename.com/subsite/Lists/subsite/pagename.aspx

    I tried this to no avail:

    ((.+)(?!\/Lists\/)(\/.+\.aspx))

  •  02-08-2010, 5:34 PM 59429 in reply to 59407

    Re: Negative Lookahead - my first adventure

    Try:

    ^(((?!/lists/|\.aspx).)*)\.aspx

    (with the 'ignore case' option on), assuming that the text you want to match starts at the beginning of the string.

    What this does is step through each character looking for either the "/lists/" or ".aspx" sequences. If the following text is neither, it captures the character and moves on. If either is found, then it stops and checks for the ".aspx" sequence which indicates the end of a successful match. If the stepping/checking process stopped on the "/lists/" text, then the final match will fail and that fails the whole pattern.
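    The stepping behaviour can be tried out with a short sketch. Python's re module is an assumption made for illustration (the thread is about .NET), and re.IGNORECASE stands in for the 'ignore case' option; the test strings are the ones from the question.

```python
import re

# Pattern from this post; the lookahead is re-tested before every character.
TEMPERED = re.compile(r"^(((?!/lists/|\.aspx).)*)\.aspx", re.IGNORECASE)

samples = [
    "sitename.com/subsite/subsite/pagename.aspx",
    "sitename.com/Lists/subsite/pagename.aspx",
    "sitename.com/subsite/Lists/pagename.aspx",
    "sitename.com/subsite/Lists/subsite/pagename.aspx",
]

# Keep only the strings the pattern accepts.
accepted = [s for s in samples if TEMPERED.search(s)]
print(accepted)

# Group 1 holds everything consumed before the final ".aspx".
prefix = TEMPERED.search(samples[0]).group(1)
print(prefix)
```

    Only the first string is accepted; in the other three the stepping stops at "/Lists/", where the required ".aspx" is not present.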

    I think the reason your pattern doesn't work is that you have the wrong idea about how regex engines work and/or how lookarounds behave. What the '(.+)' part of your pattern does is start at the beginning of the string and match every character to the end of the string (or of the line if the 'singleline' option is clear). It THEN moves to the next part of the pattern, '(?!/Lists/)', which checks whether the following characters are NOT the "/Lists/" sequence. At this point we are either at the end of the string or at a newline character; in either case the next text is not "/Lists/" (the inner test fails) and so the lookahead succeeds (because it is a negative lookahead).

    It then moves on to the next part of the pattern, '(\/.+\.aspx)', which must begin with a "/". Again we are either at the end of the string or at a newline (remember that the lookahead does not alter the current text position pointer), so this reports a "fail". This makes the regex engine start to backtrack a character at a time, releasing the characters that the '(.+)' part previously matched. As each character is released, the regex engine rechecks the lookahead. Its inner test will generally fail (given your test strings and the fact that we are now working back from the end of the string towards the beginning), which, being a negative lookahead, turns into a "success", and the engine then retries the '(\/.+\.aspx)' part.

    Again, given your test strings, it will repeat this process until it has released the final "/pagename.aspx" at the end of each string, at which point the last part of the pattern matches and the engine declares a match overall.

    The end result is that the negative lookahead is effectively useless. Even if the engine backtracks to the point where the "/Lists/" sequence does match (so the negative lookahead returns a "fail"; negative lookarounds are a bit confusing to describe!), all that happens is that it forces another backtrack to release the character before the "/Lists/". At that position the negative lookahead no longer sees "/Lists/" (i.e. it succeeds), and so the overall matching process continues.
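    The outcome of that backtracking can be observed directly. This is a minimal sketch, assuming Python's re module for illustration rather than .NET; it runs the pattern from the question and prints where '(.+)' ended up after giving characters back.

```python
import re

# Pattern from the question: the lookahead sits after the greedy (.+).
GREEDY_FIRST = re.compile(r"((.+)(?!\/Lists\/)(\/.+\.aspx))")

samples = [
    "sitename.com/subsite/subsite/pagename.aspx",
    "sitename.com/Lists/subsite/pagename.aspx",
    "sitename.com/subsite/Lists/pagename.aspx",
    "sitename.com/subsite/Lists/subsite/pagename.aspx",
]

matches = [GREEDY_FIRST.search(s) for s in samples]

for m in matches:
    # group(2) is what (.+) kept; group(3) is the tail it had to release.
    print(repr(m.group(2)), repr(m.group(3)))
```

    Every string matches, including the three containing "/Lists/": the lookahead is only ever tested at the last "/", where the following text is "/pagename.aspx".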

    Regex engines always work one character at a time. (If you understand such things, they are based on non-deterministic finite state automata [some regex variants are deterministic] with each state transition being driven by the next character in the input string). This means the engine only considers a part of the pattern based on the next character in the input string and where it is currently working within the pattern. I think you were trying to say "match everything with '(.+)' but reject it if you ever find a "/lists/" sequence as you go", and a lookahead placed after '(.+)' cannot express that.

    A lookahead (or lookbehind), whether positive or negative, only tests from the current position within the input string. Lookarounds are one of the "anchors" that many regex engines provide to test the character sequences around the current text position without changing that position. They are NOT free-ranging tests to see if the matching sequence occurs anywhere before or after the current point.
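    To make the "current position only" point concrete, here is a small sketch, again assuming Python's re module for illustration. The pattern and names are hypothetical and chosen only to pin the lookahead to one known position.

```python
import re

# Hypothetical pattern: the lookahead is tested only at the position
# directly after the literal "sitename.com".
AFTER_HOST = re.compile(r"sitename\.com(?!/Lists/)")

# "/Lists/" starts exactly at the tested position, so the match is rejected.
immediately_after = AFTER_HOST.match("sitename.com/Lists/subsite/pagename.aspx")

# "/Lists/" occurs later in the string, where the lookahead never looks.
further_along = AFTER_HOST.match("sitename.com/subsite/Lists/pagename.aspx")

print(immediately_after)
print(further_along.group(0), further_along.end())
```

    The second string still matches even though it contains "/Lists/", and the match ends right after "sitename.com": the lookahead consumed nothing and inspected nothing beyond its own position.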

    Susan

  •  02-09-2010, 4:12 AM 59451 in reply to 59429

    Re: Negative Lookahead - my first adventure

    I see, I think I understand better. So if I am going to use a positive or negative lookahead or lookbehind, the regex before it needs to be sufficiently defined that it stops at the correct place in the string when I do the comparison.

    So preceding it with something too generic would just allow it to keep on succeeding until the end of the string, then do the comparison, check that .aspx doesn't equal /Lists/, and return success.

    So then my problem becomes difficult, because although there are some patterns in the string to determine the place, the /Lists/ could be anywhere going forward, but perhaps not going backwards.

    I tried your example but it didn't give me what I wanted. Would it be easier if I made the expression check from the end of the string instead? For example:

    I know that the /Lists/ will always be followed by a subsite and then the page name, so it will be in the form:

    http://anything/Lists/subsite/pagename.aspx

    Can I do something like:

    (.+\/.+\/.+\.aspx)(?!\/Lists\/.+\/.+.aspx)

    Where I say: match /something/else/pagename.aspx as long as that is not /Lists/something/pagename.aspx.

    I guess the problem I will have there is that .+ will match a slash / successfully.

    Can I do a match that says: match .+ that's not a /?

  •  02-09-2010, 4:34 AM 59452 in reply to 59451

    Re: Negative Lookahead - my first adventure

    I changed the above to look for anything other than a / with [^\/]+

    So I ended up with:

    ((\/[^\/]+\/[^\/]+\/)([^\/]+\.aspx))(?!\/Lists\/[^\/]+\/[^\/]+\.aspx)

    The first part matches /something/something/page.aspx, but I am trying to say: only do that if it's not /Lists/anything/page.aspx.

    In Expresso that doesn't work, but I read somewhere that some implementations of regular expressions don't support complex regex inside a negative lookbehind. Does Expresso? What I am trying to do is in .NET, which apparently does support it, but my testing is in Expresso.

     Thanks

  •  02-09-2010, 12:03 PM 59505 in reply to 59451

    Re: Negative Lookahead - my first adventure

    No, I don't think you do understand lookaheads. Have you ever heard the idiom "closing the barn door after the horse has bolted"? That is what you are doing.

    Think of a lookahead as more like "look both ways before crossing the street".

    • A positive lookahead would be "if no oncoming traffic, then proceed": if the condition is met, proceed.
    • A negative lookahead would be "if oncoming traffic, abandon crossing": if the condition is met, cancel the entire operation.
    Checking afterwards is pointless: either you've made it or you've been hit by oncoming traffic. Read Susan's post again. Look at her and Sergei's patterns. Notice where they are testing the lookahead condition.
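    The difference in where the condition is tested can be sketched as follows. Python's re module is an assumption for illustration (the thread targets .NET). CHECK_AFTER is the pattern from the earlier post; CHECK_BEFORE is a hypothetical variant, not a pattern posted in this thread, and it relies on the stated assumption that /Lists/ is always followed by one subsite and then the page name.

```python
import re

# Lookahead tested AFTER the page name: by then the match is already complete
# and nothing follows ".aspx", so the negative lookahead always succeeds.
CHECK_AFTER = re.compile(
    r"((\/[^\/]+\/[^\/]+\/)([^\/]+\.aspx))(?!\/Lists\/[^\/]+\/[^\/]+\.aspx)"
)

# Hypothetical variant: the lookahead is moved in front of the three segments,
# and "$" pins the page name to the end of the string.
CHECK_BEFORE = re.compile(r"(?!/Lists/)(/[^/]+/[^/]+/)([^/]+\.aspx)$")

lists_url = "http://anything/Lists/subsite/pagename.aspx"
other_url = "http://anything/other/subsite/pagename.aspx"

late = CHECK_AFTER.search(lists_url)          # still matches the /Lists/ URL
early_lists = CHECK_BEFORE.search(lists_url)  # rejected before "crossing"
early_other = CHECK_BEFORE.search(other_url)  # non-Lists URL still matches

print(late.group(1))
print(early_lists)
print(early_other.group(0))
```

    Testing the condition before the segments it is meant to guard is what lets the engine abandon the match; testing it after the page name only inspects whatever follows ".aspx".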

     



    Michael

    "In theory, theory and practice are the same. In practice, they are not."
    Albert Einstein