
Parse HTML A tag to extract attributes

Last post 06-29-2008, 7:35 PM by Aussie Susan. 2 replies.
  •  06-28-2008, 2:33 PM 43567

    Parse HTML A tag to extract attributes

    Dear all,

    I need help parsing HTML A tags and their attributes. The problem is that the source is not always well formed. I didn't find anything on the RegExLib site.

    I need to extract the URL and the text between the A tags.

    Here are the specifics:

    • Attribute values can use single quotes, double quotes, no quotes, or a combination thereof.
    • Attributes can be in any order.
    • Tags and attributes can be in uppercase or lowercase.
    • Extra spaces should be ignored.
    • The target is the .NET regex engine.

    These are examples of strings that should be properly parsed:

    <A href="http://www.microsoft.com" target=_self>Microsoft</A>

    <a href="http://www.microsoft.com" target=_self>Microsoft</a>

    <A HREF="http://www.microsoft.com" target=_self>Microsoft</A>

    <A target="_self" href="http://www.microsoft.com" >Microsoft</A>

    <A href='http://www.microsoft.com' target=_self>Microsoft</A>

    <A href=http://www.microsoft.com target=_self>Microsoft</A>

    Any help will be greatly appreciated.

    Mike

  •  06-28-2008, 6:21 PM 43569 in reply to 43567

    Re: Parse HTML A tag to extract attributes

    After a long thread I suspect someone will wisely recommend you parse HTML with the HTML DOM instead of regex.
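
    To illustrate the parser-based route suggested above, here is a minimal sketch. It assumes Python and its standard-library html.parser module, which is an event-driven tokenizer rather than a true DOM, and it is not the .NET code the original poster asked for. The class name AnchorCollector and the function extract_links_parser are hypothetical names chosen for this example.

    ```python
    from html.parser import HTMLParser


    class AnchorCollector(HTMLParser):
        """Collects (href, text) pairs for every <a> element it sees."""

        def __init__(self):
            super().__init__()
            self.links = []       # finished (href, text) pairs
            self._in_a = False    # True while inside an open <a> element
            self._href = None
            self._text = []

        def handle_starttag(self, tag, attrs):
            # HTMLParser lowercases tag and attribute names and strips
            # the quotes around attribute values.
            if tag == "a":
                self._in_a = True
                self._href = dict(attrs).get("href")
                self._text = []

        def handle_data(self, data):
            if self._in_a:
                self._text.append(data)

        def handle_endtag(self, tag):
            if tag == "a" and self._in_a:
                self.links.append((self._href, "".join(self._text).strip()))
                self._in_a = False


    def extract_links_parser(html):
        """Return a list of (href, text) tuples found in the given HTML."""
        collector = AnchorCollector()
        collector.feed(html)
        collector.close()
        return collector.links
    ```

    Because the parser normalizes case and quoting itself, the requirements about mixed case, quote styles, and attribute order need no special handling in the calling code.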
  •  06-29-2008, 7:35 PM 43592 in reply to 43569

    Re: Parse HTML A tag to extract attributes

    As always, Doug makes good sense and the DOM is probably the right way to do this (based on the information you provided).

    However, the following is a direct extract of a recent reply I gave to a similar question (http://regexadvice.com/forums/thread/43275.aspx) in this forum:

     

    My effort would be along different lines:

    <img(\s+((\w+)(=("[^"]*"|[^\s>]+)|)))*/?>

    with the 'ignore case' option set as a safety measure in case someone used "<IMG" at the start.

    This pattern will accept any number of attributes within the tag and an optional "/>" or ">" tag ending. Each attribute can optionally be followed by an '=' and a value, which can be a quoted string (without escaped or doubled double quotes) or any string of non-whitespace characters.

    Each capture of match group #2 (you said you were using .NET so captures are available to you) will contain one attribute and its value as a whole. The match group #3 captures will contain just the attribute names in the order they appear, and the captures of match group #4 will contain the corresponding '=' and value (or an empty capture if the attribute has no value associated with it).

    You will need to scan the captures of match group #3 for the attributes you are interested in and then look at the corresponding capture in match group #4 for any possible value.

     

    The only change you may need to make is to allow for single as well as double quotes, as in:

    <img(\s+((\w+)(=(["'][^"']*["']|[^\s>]+)|)))*/?>

    Susan 
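
    As a rough illustration of how the attribute-scanning idea above could be adapted to the A tag from the original question, here is a sketch in Python. Python's re module does not expose the per-repetition captures that .NET provides, so this version swaps in a two-pass approach: one pattern finds each anchor element, and a second pattern walks its attribute list. The names A_TAG, ATTR, and extract_links_regex are hypothetical, and the sketch assumes that no attribute value contains a '>' character and that anchors are not nested.

    ```python
    import re

    # Pass 1: capture the raw attribute text and the link text of each <a> element.
    A_TAG = re.compile(r"<a\b([^>]*)>(.*?)</a\s*>", re.IGNORECASE | re.DOTALL)

    # Pass 2: one attribute name, optionally followed by '=' and a value that is
    # double-quoted, single-quoted, or a run of non-whitespace characters.
    ATTR = re.compile(r"""(\w+)(?:\s*=\s*("[^"]*"|'[^']*'|[^\s>]+))?""")


    def extract_links_regex(html):
        """Return a list of (href, text) tuples, one per <a> element found."""
        results = []
        for tag in A_TAG.finditer(html):
            attrs = {}
            for m in ATTR.finditer(tag.group(1)):
                name, value = m.group(1).lower(), m.group(2)
                # Strip the surrounding quotes from quoted values.
                if value and value[0] in "\"'":
                    value = value[1:-1]
                attrs[name] = value
            results.append((attrs.get("href"), tag.group(2).strip()))
        return results
    ```

    Attributes without a value are stored as None, so callers can tell a missing value apart from an empty quoted one.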
