
Playstation Parsing Problem

  •  07-26-2007, 12:42 PM


    Hello there, I am using PHP's preg_match to pull the video game platform out of various auction titles for video games. I have a list of valid platforms and am trying to match the titles back to them. However, as this is completely user-entered data, there is an enormous amount of variety in the titles, which leads to some interesting regular expressions.

    I've had a decent amount of experience using regexes for simple tasks, but very little with the advanced features such as lookahead assertions. The most troublesome platform so far has been simply "Playstation". I'll try to explain why I'm having trouble. I think my main problem is that I'm approaching this the wrong way; hopefully somebody can help.

    There are a few reasons this gets complex (for me!) such as:

    • Playstation 1, Playstation 2, Playstation 3, PSP - are all different platforms, and to complicate things they are sometimes backwards compatible (e.g. a PS2 game that can be played on a PS3 - these games should be marked as PS2 - but this might be unrealistic for the first pass)
    • Playstation is equivalent to PS or PS1 or Playstation one or Playstation System
    • Titles also contain the video game name
    • Titles often contain multiple keywords that are significant (e.g. "Test Drive 4 (Playstation) Accolade PS1")


    Some example titles that all should point to Playstation:

    "Playstation 2007 Madden Football"
    "Metal of Honour game for Playstation, 2 controllers included"
    "Playstation Big Bass World Championship 2"
    " Tiger Woods PS-1"
    "Resident Evil for Playstation one"
     

    Some examples that should NOT match Playstation:

    "Namco World Rally Championship PSP"
    "Need for Speed Most Wanted Playstation 2"
    "Silent Hill Playstation PSII Mint"
    "PSP Playstation Portable Classics Collection"


    My attempts have been a bit frustrating: I squash one problem and bring back one I had already solved. Here is the basic idea of what I've been trying to do:

    /\b(Play( )?station|PS)( )?(?(?=(one|1|[\-\,\;\:]|system))|(?:(?!(2|3|II\b|III\b|P\b|Portable)).)*$)/i 

    The idea was to identify the Playstation keyword, then use a conditional to check whether what follows immediately confirms the right version (one, 1, system, or punctuation). If it does not, the rest of the string is scanned for keywords that should fail the match. What this ends up doing is falsely failing on any title that has a 2 or 3 anywhere after the keyword, or any word ending in P (with /i that includes words like "Championship").
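
    To make the failure concrete, here is a rough Python port of the pattern above. Python's re module has no lookahead conditionals, so the conditional is rewritten as a plain alternation, which I am assuming gives the same match/no-match result here because the "yes" branch is empty. The name ATTEMPT and the helper matches_attempt are made up for this illustration.

    ```python
    import re

    # Port of the preg_match attempt. The PCRE conditional (?(?=A)|B) is
    # replaced by the alternation (?:(?=A)|B); this is an assumed equivalent
    # for match/no-match purposes only.
    ATTEMPT = re.compile(
        r"\b(Play( )?station|PS)( )?"
        r"(?:(?=(one|1|[\-,;:]|system))"
        r"|(?:(?!(2|3|II\b|III\b|P\b|Portable)).)*$)",
        re.IGNORECASE,
    )

    def matches_attempt(title):
        """Return True if the attempted pattern accepts the title."""
        return ATTEMPT.search(title) is not None

    # Titles that SHOULD be Playstation but are rejected:
    print(matches_attempt("Playstation 2007 Madden Football"))           # blocked by the "2" in 2007
    print(matches_attempt("Playstation Big Bass World Championship 2"))  # blocked by "p" ending Championship
    # Titles that are accepted as intended:
    print(matches_attempt("Resident Evil for Playstation one"))
    print(matches_attempt(" Tiger Woods PS-1"))
    ```

    The first two calls come back False and the last two True, which is the behaviour described above.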

    So I need a better way to eliminate anything with PSP as a word, or anything pointing to the PS 2 or 3, or another way of attacking the problem.
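
    One other way of attacking it that I can imagine (only a sketch, not something I have working in PHP) is to stop asking one regex to do everything and instead test a short list of patterns in order, most specific platform first, so that plain "Playstation" is only the fallback. The labels, the PLATFORM_PATTERNS list and classify_platform are hypothetical names, and the keyword sets only cover the variants listed in this post. PS2 is tested before PS3 so that a title naming both is marked PS2, per the backwards-compatibility note above.

    ```python
    import re

    # Shared prefix: "Playstation", "Play station" or "PS" at a word start.
    _PS = r"\b(?:Play\s?station|PS)"

    # Order matters: the first pattern that matches wins.
    PLATFORM_PATTERNS = [
        ("PSP", re.compile(r"\b(?:PSP|Play\s?station\s+Portable)\b", re.IGNORECASE)),
        ("Playstation 2", re.compile(_PS + r"[\s-]?(?:2|II)\b", re.IGNORECASE)),
        ("Playstation 3", re.compile(_PS + r"[\s-]?(?:3|III)\b", re.IGNORECASE)),
        # Fallback: bare keyword, optionally followed by 1 / one / system.
        ("Playstation", re.compile(_PS + r"(?:[\s-]?(?:1|one|system))?\b", re.IGNORECASE)),
    ]

    def classify_platform(title):
        """Return the first platform label whose pattern is found, else None."""
        for label, pattern in PLATFORM_PATTERNS:
            if pattern.search(title):
                return label
        return None

    print(classify_platform("Playstation 2007 Madden Football"))
    print(classify_platform("Silent Hill Playstation PSII Mint"))
    print(classify_platform("Namco World Rally Championship PSP"))
    ```

    The trailing \b after the version number is what keeps "Playstation 2007" from being read as Playstation 2, and requiring the number to follow the keyword directly keeps "Playstation, 2 controllers" and "Championship 2" out of the PS2 bucket. The patterns use only \b, \s and non-capturing groups, so they should carry over to preg_match with the /i flag, though I have not tried that.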

    Hope this wasn't too confusing, and thanks to anybody for reading this far!

    Any thoughts/help are appreciated!

     
     -Pat
