
Words on either side...

Last post 03-10-2010, 2:01 PM by gmspring. 3 replies.
  •  03-09-2010, 10:04 AM 60538

    Words on either side...

    Another newbie question. I have been programming for years but just discovered Regex and have been using it to really speed up what I'm doing. It's very powerful. However, I still get lost in some of the complexity.

    I have looked to see if this question has been asked before but didn't see anything. If it has, I apologise, and perhaps you could point me in the right direction.

    I am pulling text from a database and, using vbscript in an asp web page (running on an IIS Windows server - not .Net), I need to present the data as 5 words on either side of a search term or word, or part of a word in some cases. For example, from the following text I want instances of the term here:

    "This is the first sentence of the paragraph. A second sentence, here. A third sentence finishes the paragraph.

    There is also a fourth sentence."

    I can get 5 words either side using the following pattern (that I painfully concocted) "(\b(\W+)?(\w+)\.?,?\s){0,5}here(\b(\W+)?(\w+)\.?,?\s){0,5}"

    HOWEVER, this won't work beyond punctuation marks or line breaks (I have set multiline=true, so it's probably the full stop of the previous paragraph that blocks the pattern). The results I would like to see are:

    "... the paragraph. A second sentence, here. A third sentence finishes the ..."

    "... third sentence finishes the paragraph. There is also a fourth sentence..."

    The line break isn't THAT big a deal as I could remove these first and replace with a white space.

    Can someone help me produce 2 separate patterns that would produce the above results? One pattern that finds whole words only, with 5 words either side. And one pattern that finds all instances of here, even if part of another word, with 5 words either side.

    Thanks for any help or pointers.

    Greg

  •  03-09-2010, 6:24 PM 60604 in reply to 60538

    Re: Words on either side...

    A couple of things come to mind.

    Firstly you won't see both of the expected results as they overlap. When a regex completes a match, it has a pointer into the string it is processing that will have been moved to the last matched character. When it restarts the next match (with the vbscript regex object "Global" flag set to true) then it will start immediately after where it finished the last time. Therefore the "A third sentence finishes the" will leave the text pointer after the "e" of "the" and so the next match will start there.
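
    The "start after the previous match" behaviour can be seen with a much smaller example. This is only a sketch in Python's `re` module (an assumption for illustration, not the VBScript RegExp object discussed in this thread), where `finditer` plays the role of the "Global" flag:

```python
import re

text = "one two three four"

# Each match consumes its characters, so the scan resumes after "two"
# and the overlapping pair "two three" is never reported.
pairs = [m.group(0) for m in re.finditer(r"\w+ \w+", text)]
print(pairs)
```

    The same thing happens to the two desired results here: the words they share can only belong to one match.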

    You seem to be getting the "singleline" and "multiline" operations a bit mixed up. They have unfortunate names and so this confusion occurs a lot. The "multiline" option controls the behaviour of the "^" and "$" anchors. If the option is not set, then they refer to the beginning and end (respectively) of the entire string being processed. If the option is set, then they refer to the start and end of each line in a multiline string - with the line breaks being determined by the 'newline' character.
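
    As a rough illustration of what "multiline" changes, here is a sketch using Python's `re` module (an assumption; the flag is spelled `re.MULTILINE` there, while VBScript exposes it as a property). The sample string is made up for the example:

```python
import re

text = "first line\nsecond line"

# Without the option, ^ and $ anchor only to the whole string.
starts_default = re.findall(r"^\w+", text)
ends_default = re.findall(r"\w+$", text)

# With the option, ^ and $ also anchor at every line break.
starts_multiline = re.findall(r"^\w+", text, re.MULTILINE)
ends_multiline = re.findall(r"\w+$", text, re.MULTILINE)

print(starts_default, starts_multiline)
print(ends_default, ends_multiline)
```

    Note that nothing here affects whether a pattern can match across the line break; only the anchors change.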

    On the other hand, the "singleline" option controls how the '.' operator works. If the option is not set, then the '.' operator will match all characters EXCEPT the newline character; if the option is set then it will match all characters INCLUDING the newline character.

    In your case, you want the '.' to match across a line break and so you really want the "singleline" and not the "multiline" option to be set. Unfortunately the regex object in vbscript does not have this option. Therefore you need to use some other technique to define the match you want, such as '(.|\n)' or '[\S\s]', where you are creating a character set that contains all characters ('\s' refers to all whitespace characters and '\S' refers to all non-whitespace characters - together they cover all characters). Note that '[.\n]' will not do the job, because inside a character set the '.' is just a literal full stop.
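
    A small sketch of the difference, again in Python's `re` module as an assumption (it does have the "singleline" option, under the name `re.DOTALL`, which makes the comparison easy to see). The sample string is invented for the example:

```python
import re

text = "end of line one\nstart of line two"

# By default '.' refuses to match the newline between the two lines.
plain_dot = re.search(r"one.start", text)

# With the "singleline" behaviour switched on, '.' matches the newline too.
dotall = re.search(r"one.start", text, re.DOTALL)

# The [\s\S] set matches any character without needing the option at all.
any_char_set = re.search(r"one[\s\S]start", text)

print(plain_dot is None, dotall is not None, any_char_set is not None)
```

    The last form is the one that carries over to an engine with no "singleline" option.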

    You say that "here" can be part of a word. That's OK, but your pattern specifies a space before the keyword and so your second example cannot happen as written. What the engine will do instead is use the fact that there can be 0 "words" before the keyword and so start the match with the keyword itself (remember I'm talking about the "here" being part of a word). The same thing happens when the "here" is not at the end of a word (as in "herein"), because you have both the '\b' anchor and the '\W' at the start of the following word.

    You are assuming that a "word" is fully matched by the '\w' set of characters. This is just a check that you mean that to happen, as "O'Conner" will be seen as 2 words, with the apostrophe acting as a word separator, but "under_score" will be seen as one word.
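
    A quick way to check what '\w' treats as a word, sketched in Python's `re` module (an assumption; for these plain ASCII examples it splits the same way):

```python
import re

# The apostrophe is not a \w character, but the underscore is.
words = re.findall(r"\w+", "O'Conner under_score")
print(words)
```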

    Given the above, I have modified your pattern a little to be:

    (\b\w+\W+){0,5}here(\W+\w+){0,5}

    with the "ignore case" option set, and I end up with the two matches from your test string of:

    the paragraph. A second sentence, here. A third sentence finishes the

    and

    here is also a fourth sentence

    I've not bothered with the punctuation as that is covered by the '\W' part of each definition.
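
    For anyone wanting to reproduce those two matches, here is a sketch in Python's `re` module (an assumption, not the VBScript object; `re.IGNORECASE` stands in for the "ignore case" option and `finditer` for "Global"). The test string is the one from the original post, with the blank line between paragraphs written as "\n\n":

```python
import re

text = (
    "This is the first sentence of the paragraph. A second sentence, here. "
    "A third sentence finishes the paragraph.\n\n"
    "There is also a fourth sentence."
)

# Up to 5 words (each followed by non-word characters), the keyword,
# then up to 5 words (each preceded by non-word characters).
pattern = re.compile(r"(\b\w+\W+){0,5}here(\W+\w+){0,5}", re.IGNORECASE)

matches = [m.group(0) for m in pattern.finditer(text)]
for found in matches:
    print(repr(found))
```

    The second match begins at the "here" inside "There" with no words in front of it, because the first match has already consumed "finishes the" and '\b' cannot start a word in the middle of "There".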

    Also note that you will only be able to recover the last word matched on either side of the keyword because of the way a regex engine uses a single area to save the captured text for each match group. You can get the entire phrase from the regex match, but if you want to get the "before" and "after" parts, then you will need something like:

    ((\b\w+\W+){0,5})here((\W+\w+){0,5})

    where the preceding phrase will be in match group #1 and the following phrase in match group #3.
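
    To see the difference between the inner and outer groups, here is a sketch in Python's `re` module (an assumption; group numbering works the same way there). It looks only at the first match in a shortened copy of the test string:

```python
import re

text = (
    "This is the first sentence of the paragraph. A second sentence, here. "
    "A third sentence finishes the paragraph."
)

pattern = re.compile(r"((\b\w+\W+){0,5})here((\W+\w+){0,5})", re.IGNORECASE)
m = pattern.search(text)

before = m.group(1)       # whole phrase before the keyword
last_before = m.group(2)  # only the last repetition of the inner group
after = m.group(3)        # whole phrase after the keyword
last_after = m.group(4)   # only the last repetition of the inner group

print(repr(before), repr(last_before))
print(repr(after), repr(last_after))
```

    The repeated inner groups (#2 and #4) keep only their final repetition, which is why the outer groups are needed.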

    Susan

     

    Edit: if you want to overcome the problem with "here" being within a word and still capture the (up to) 5 words on each side, then try:

    ((\b\w+\W+){0,5})\w*here\w*((\W+\w+){0,5})
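
    A sketch of what this version returns on the original test string, in Python's `re` module (an assumption, with `re.IGNORECASE` standing in for the "ignore case" option):

```python
import re

text = (
    "This is the first sentence of the paragraph. A second sentence, here. "
    "A third sentence finishes the paragraph.\n\n"
    "There is also a fourth sentence."
)

# \w*here\w* lets the keyword sit anywhere inside a word.
pattern = re.compile(
    r"((\b\w+\W+){0,5})\w*here\w*((\W+\w+){0,5})", re.IGNORECASE
)

results = [(m.group(0), m.group(1), m.group(3)) for m in pattern.finditer(text)]
for whole, before, after in results:
    print(repr(whole))
```

    The second match now takes in the whole of "There" and the word before it, although it still cannot reach back into the text already consumed by the first match.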

  •  03-10-2010, 6:51 AM 60645 in reply to 60604

    Re: Words on either side...

    Susan,

    I haven't tried your solutions yet, or digested what you wrote. I will give that the respect it deserves later today. However, I had to reply now to say thank you very much for such a thorough and considered reply. Not just an answer but a great explanation for someone who is still new to this.

    Can I award you points for being so helpful? 20 out of 10!

    Thanks. I'll reply again later.

    Greg

     

  •  03-10-2010, 2:01 PM 60670 in reply to 60604

    Re: Words on either side...

    Susan,

    Thank you for pointing all that out to me. I'm still getting my head around the various keys and properties (flags) and your explanation has really helped. Your final solution is brilliant and works a dream.

    I suppose my pattern - "(\b(\W+)?(\w+)\.?,?\s){0,5}here(\b(\W+)?(\w+)\.?,?\s){0,5}" - was too specific because of its complexity. I'm still not 100% certain why it broke off the pattern where punctuation occurred, but I will unpick it given your instructions and examples to see where it went wrong.

    Thanks again for your help.

    Greg
