Uncommon URL pattern retrieval

  •  08-19-2008, 6:14 AM

    Hi,

    If anybody is out there, I'd be immensely grateful if you could help me fix one of my regular expressions.

    Basically, I've written a regular expression in PHP that identifies a certain type of URL:

    REGEX: /(http\:\/\/dx\.doi\.org\/.*?)(\s|$|;|>)/

    It catches URLs from the 'http://dx.doi.org/' domain, which look like this: http://dx.doi.org/10.1016/j.apcatb.2006.03.022

    Now, the problem is that the characters I treat as the end of such a URL include the semicolon and the closing angle bracket. That works perfectly fine in 99% of the cases in the database I'm working on. In the remaining 1%, however, the semicolon and the angle bracket are actually part of the URL, so the lazy match stops at the first of them and gives me only part of the URL, not all of it.

    This kind of 'uncommon' URL looks like this: http://dx.doi.org/10.1002/1521-3951(200209)233:2<238::AID-PSSB238>3.0.CO;2-E
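To make the failure concrete, here is a minimal sketch that runs the pattern above against both example URLs. It is written in Python rather than PHP purely so it is easy to run on its own; the pattern is the same apart from dropping the `/` delimiters and the escapes they require. The surrounding sample text and the variable names are made up for illustration.

```python
import re

# The original pattern, minus PHP's "/" delimiters and the "\/" escapes.
ORIGINAL = re.compile(r"(http://dx\.doi\.org/.*?)(\s|$|;|>)")

# Hypothetical sample inputs built around the two URLs from this post.
common = "see http://dx.doi.org/10.1016/j.apcatb.2006.03.022; more text"
uncommon = ("http://dx.doi.org/10.1002/1521-3951(200209)233:2"
            "<238::AID-PSSB238>3.0.CO;2-E")

common_match = ORIGINAL.search(common).group(1)
uncommon_match = ORIGINAL.search(uncommon).group(1)

print(common_match)
# -> http://dx.doi.org/10.1016/j.apcatb.2006.03.022
print(uncommon_match)
# -> http://dx.doi.org/10.1002/1521-3951(200209)233:2<238::AID-PSSB238
```

The second URL is cut off at the first '>' because the lazy `.*?` stops at the earliest terminator it can find.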

    1) The question is whether I can allow for a 'probable occurrence' of a semicolon within the regular expression in such a way that it is not recognized as the end of the URL. All of these weird links contain the sequence 'CO;2' followed by two final characters (here '-E'), so these four characters could be used as an exception to the rule that a semicolon signifies the end of the URL.

    2) There is a similar problem with the '>' character. Normally it also marks where the URL has ended, but not in the case mentioned above.
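One possible way to express both exceptions is with lookaround assertions. The sketch below is only an illustration, again in Python so it can be run standalone; the function name `extract_doi_urls` and the sample strings are hypothetical. It rests on two assumptions taken from the single example in this post: a ';' belongs to the URL only when it sits inside 'CO;2', and a '>' belongs to the URL only when 'CO;2' still follows it before the next whitespace. Real data may need different rules.

```python
import re

# A terminator is whitespace, end of input, or:
#   ";"  unless it is the semicolon inside "CO;2"
#        (preceded by "CO" AND followed by "2"),
#   ">"  unless "CO;2" still appears later in the same run of
#        non-whitespace characters (assumption based on one example URL).
DOI_URL = re.compile(
    r"(http://dx\.doi\.org/.*?)"
    r"(\s|$|(?<!CO);|;(?!2)|>(?!\S*CO;2))"
)

def extract_doi_urls(text):
    """Return every dx.doi.org URL found in text (hypothetical helper)."""
    return [m.group(1) for m in DOI_URL.finditer(text)]

common = "http://dx.doi.org/10.1016/j.apcatb.2006.03.022"
uncommon = ("http://dx.doi.org/10.1002/1521-3951(200209)233:2"
            "<238::AID-PSSB238>3.0.CO;2-E")

# Each URL is followed by one of the usual terminators.
print(extract_doi_urls("see " + common + "; and <" + uncommon + ">"))
print(extract_doi_urls(uncommon + "; more text"))
```

Both calls should return the uncommon URL in full, ending in 'CO;2-E', while the common URL is still cut at its trailing ';'. Lookahead and fixed-length lookbehind are also part of the PCRE syntax used by PHP's preg functions, so the same pattern should carry over once the slashes are escaped for the `/` delimiters.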

    I'm not very good with regular expressions at all, so I have only a very vague idea of how this could be done.

    If you have any ideas, please let me know.

    ksyandagger
