
Word boundaries (PHP)

Last post 01-29-2010, 3:41 PM by mash. 3 replies.
  •  01-29-2010, 9:17 AM 59080

    Word boundaries (PHP)

    Hi all,

     I'm using quite a simple regex with word boundaries: \bword\b . My intention is that "word" should match, but not "xyzword" or "wordxyz". My problem is that "-word" matches (I don't want it to) because the hyphen isn't part of the word boundary class.

     So I tried this:

         \b(^-|)word\b

     i.e., "word boundary followed either by anything that isn't a hyphen, or by nothing", but that didn't work.

     Can anyone help please? Is there a way to override what \b matches, or should I stop using \b and construct my own set of alternatives?

     

     

  •  01-29-2010, 11:22 AM 59082 in reply to 59080

    Re: Word boundaries (PHP)

    OK, first off, let's be clear. The pattern \bword\b only ever matches "word". It does not match "-word"; it matches the substring "word" within that string. Which brings up a very important point in regex construction: context with the input source. http://regexadvice.com/blogs/mash/archive/2007/10/01/Remember-where-you-come-from.aspx

     I'm not sure what you mean by

    womble:

    because the hyphen isn't part of the word boundary class

    The boundary metacharacter does not match any characters; it matches the position between a "word character" (a letter, digit or underscore, i.e. \w) and a "non-word character". The term is a bit misleading because regexes don't actually know whether something is a word or not.
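
    To make that concrete, here is a minimal sketch of where \b lets the match begin. It uses Python's re module purely for illustration (an assumption on my part, since the thread is about PHP; for these plain ASCII samples \b behaves the same way in PCRE). The helper name find_word is made up for the example.

```python
import re

# The pattern from the original question.
PATTERN = re.compile(r"\bword\b")

def find_word(text):
    """Return (start_offset, matched_text) for the first match, or None."""
    m = PATTERN.search(text)
    return (m.start(), m.group()) if m else None

for sample in ["word", "xyzword", "wordxyz", "-word"]:
    print(repr(sample), "->", find_word(sample))
```

    In "-word" the match starts at offset 1: the hyphen is never part of the match, it just happens to sit on the other side of the boundary.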

    Secondly, your attempt does not express what you think it does.

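    (^-|) says: match a hyphen at the beginning of the input, or match nothing. Outside a character class, ^ is the start-of-input anchor; it only means "not" inside square brackets, as in [^-].

    Within the context of the rest of the pattern, only the empty alternative will ever match, because the \b in front of it can never sit directly before a hyphen at the start of the input.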
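
    One way to check this is to look at what the group actually captures. The sketch below uses Python's re module as a stand-in for PHP (an assumption; the anchor and the empty alternative behave the same way in both for these samples), and captured_prefix is a hypothetical helper name.

```python
import re

# The attempted pattern: \b, then either "start of input + hyphen" or nothing.
ATTEMPT = re.compile(r"\b(^-|)word\b")

def captured_prefix(text):
    """Return (start_offset, text captured by the group), or None if no match."""
    m = ATTEMPT.search(text)
    return None if m is None else (m.start(), m.group(1))

for sample in ["word", "-word", "xyzword"]:
    print(repr(sample), "->", captured_prefix(sample))
```

    Whenever the pattern matches, the group has captured the empty string, so the attempt behaves just like plain \bword\b and still finds "word" inside "-word".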

    Unless you are planning to write your own custom regex engine, no, you cannot redefine the metacharacters for custom use. You can define a more specific pattern for what you actually want, but that would depend on your source.

    \b(?<!-)word\b

    would work with the samples you provided in the way you desired, but without knowing your source input I can't say whether it fits your goals.
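
    As a rough check against the samples from the question, the pattern can be run like this. Again this is only a sketch in Python's re module (assumed here for illustration; negative lookbehind is written the same way in PCRE), and has_word is a made-up helper name.

```python
import re

# \b, then a negative lookbehind: the character just before "word" must not be a hyphen.
PATTERN = re.compile(r"\b(?<!-)word\b")

def has_word(text):
    """True if the pattern matches anywhere in text."""
    return PATTERN.search(text) is not None

for sample in ["word", "xyzword", "wordxyz", "-word", "word-"]:
    print(repr(sample), "->", has_word(sample))
```

    Note that this only guards the left-hand side: "word-" still matches, because nothing in the pattern looks at the character after "word".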

     

     


    Michael

    "In theory, theory and practice are the same. In practice, they are not."
    Albert Einstein
  •  01-29-2010, 2:40 PM 59091 in reply to 59082

    Re: Word boundaries (PHP)

    Sorry, yes. I was sloppy with description - I appreciate that it matches "word" within the context of "-word".

     What I'm writing is a basic swear word filter, using preg_replace(). So, eg, I'd want to filter "hell" but not "shell" or "hello".  I'd also want to filter if "hell" had a punctuation mark either side, eg "hell, heaven".  \b mostly seems fine for this. My problem is that I don't want to filter out "hell" if it has a hyphen either side, eg: "hell-bent" or  "something-hell" (can't think of a good example there).

     

    Extending your suggestion, would 

        \b(?<!-)word\b(?<!-)

     be correct?  

  •  01-29-2010, 3:41 PM 59093 in reply to 59091

    Re: Word boundaries (PHP)

    womble:

    Extending your suggestion, would 

        \b(?<!-)word\b(?<!-)

     be correct?  

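    No, that would not be correct. A lookbehind placed after \bword\b inspects the character just before that position, which is always the "d" of "word", so (?<!-) can never fail there. To test the character that follows, you need a negative lookahead, (?!-):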

    \b(?<!-)word\b(?!-)

    would not match "word" with a hyphen on either end.
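
    The difference between the two endings can be seen side by side in a small sketch. It is written with Python's re module for illustration only (an assumption; lookahead and lookbehind use the same syntax in PCRE), and the names BEHIND, AHEAD and compare are invented for the example.

```python
import re

# Trailing lookbehind: checks the character BEFORE the end position (always "d").
BEHIND = re.compile(r"\b(?<!-)word\b(?<!-)")
# Trailing lookahead: checks the character AFTER the end position.
AHEAD = re.compile(r"\b(?<!-)word\b(?!-)")

def compare(text):
    """Return (matches_with_lookbehind, matches_with_lookahead)."""
    return (BEHIND.search(text) is not None, AHEAD.search(text) is not None)

for sample in ["word", "word, etc", "-word", "word-"]:
    print(repr(sample), "->", compare(sample))
```

    Only the lookahead version rejects "word-"; the two agree on the other samples.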

    The problem with that approach is that hyphenated swear words will get through. If the swear word was "beep" (choose your preferred four-letter word), beep-head would get through. You might need to give a little more thought to your criteria, or try to find some existing filters that have tackled this already.
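
    For what it's worth, here is a minimal sketch of the filter described in this thread, including the gap just mentioned. It is in Python rather than PHP (an assumption for illustration; re.sub plays the role preg_replace() would). The function names, the asterisk replacement and the plain alphabetic word list are all invented for the example, and matching is case-sensitive as written.

```python
import re

def build_pattern(word):
    """Whole-word pattern that refuses a hyphen directly before or after.

    Assumes `word` consists of ordinary word characters, so \\b is meaningful.
    """
    return re.compile(r"\b(?<!-)" + re.escape(word) + r"\b(?!-)")

def censor(text, words):
    """Replace each filtered word with asterisks of the same length."""
    for word in words:
        text = build_pattern(word).sub(lambda m: "*" * len(m.group()), text)
    return text

print(censor("hell, heaven", ["hell"]))    # filtered
print(censor("shell hello", ["hell"]))     # untouched: no word boundary
print(censor("hell-bent", ["hell"]))       # untouched: hyphen after
print(censor("something-hell", ["hell"]))  # untouched: hyphen before
print(censor("you beep-head", ["beep"]))   # untouched: the gap described above
```

    The last line is the weakness: any rule that exempts hyphenated forms also exempts hyphenated insults, so the criteria would need refining before relying on something like this.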

     

