
To not obtain a tag in html document

Last post 05-27-2008, 5:52 AM by jaloplo. 8 replies.
  •  05-21-2008, 9:30 AM 42439

    To not obtain a tag in html document

    Hi all,

    I want to do something, but I don't know whether it is possible. I have an HTML document from which I extract all the text that is inside tags, for example all the text inside a span, p, a, etc.

    My regex is: (>)[\r\n\t]*([\w\s&;]+)[\r\n\t]*(<) and it works great. Now, I don't want some tags, which differ from the others by an attribute. An example is the following, where the discriminating attribute is "class":

    <body>
      <span>Hello, this text is valid.</span>
      <div class="NoValid">This text is not valid.</div>
      <div>This text is valid.</div>
    </body>

    Which regex gets all the text except that of the first div?

    Thanks in advance.

  •  05-21-2008, 7:43 PM 42455 in reply to 42439

    Re: To not obtain a tag in html document

    Firstly, please read the posting guidelines (the sticky note at the top of this forum): they list important information that we need in order to help you properly, and that you have not provided.

    I tried to use your pattern against the sample text you have provided and it only matched the line breaks and not the text in between (say) the <span> and </span> tags. This is because your sample uses characters that are not included in your pattern. When I added the '.' and ',' characters to the character class, it picked up the text I think you are after (as well as the line breaks).
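The character-class point above can be checked with a minimal sketch. It uses Python's `re` module as a stand-in for the .NET engine discussed in the thread (an assumption; this particular pattern uses only constructs both flavours share). The names `SAMPLE`, `ORIGINAL`, `WIDENED` and `captured_text` are hypothetical helpers, not part of the original posts.

```python
import re

# Sample document from the first post.
SAMPLE = (
    "<body>\n"
    "  <span>Hello, this text is valid.</span>\n"
    '  <div class="NoValid">This text is not valid.</div>\n'
    "  <div>This text is valid.</div>\n"
    "</body>\n"
)

# Pattern as posted: the character class has no '.' or ','.
ORIGINAL = r"(>)[\r\n\t]*([\w\s&;]+)[\r\n\t]*(<)"
# Same pattern with '.' and ',' added to the character class.
WIDENED = r"(>)[\r\n\t]*([\w\s&;.,]+)[\r\n\t]*(<)"


def captured_text(pattern, html):
    """Return the non-blank text captured by group 2 of each match."""
    found = []
    for match in re.finditer(pattern, html):
        text = match.group(2).strip()
        if text:  # skip matches that are only line breaks or indentation
            found.append(text)
    return found


original_hits = captured_text(ORIGINAL, SAMPLE)
widened_hits = captured_text(WIDENED, SAMPLE)
```

With the posted pattern only whitespace runs between tags match, so `original_hits` is empty; the widened class picks up all three sentences, including the one that should be excluded.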

    Having said that, the following pattern locates all of the line breaks and the 'this text is valid' lines:

    (?<!class=[^<>]*)>([^<>]*)<

    The trick here is to use a variable-length lookbehind as part of the test to find the end of a tag. Unfortunately this construct is not widely available (basically .NET is the only one, I think).
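As a hedged illustration of the portability point: Python's standard `re` module, used here only as an example of a flavour other than .NET, refuses to compile a variable-length lookbehind. The sketch also shows one possible workaround, which is an assumption and not something proposed in the thread: capture the inside of the preceding tag and test for `class=` in code. `compiles`, `TAG_THEN_TEXT` and `text_after_classless_tags` are hypothetical names.

```python
import re

LOOKBEHIND = r"(?<!class=[^<>]*)>([^<>]*)<"


def compiles(pattern):
    """Report whether this regex flavour accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Workaround sketch: capture the tag's interior and the text that follows it,
# then reject tags carrying a class attribute in ordinary code.
TAG_THEN_TEXT = re.compile(r"<([^<>]*)>([^<>]*)(?=<)")


def text_after_classless_tags(html):
    results = []
    for inside_tag, text in TAG_THEN_TEXT.findall(html):
        if "class=" in inside_tag:
            continue  # same test the lookbehind would have made
        if text.strip():
            results.append(text.strip())
    return results


SAMPLE = (
    "<body>\n"
    "  <span>Hello, this text is valid.</span>\n"
    '  <div class="NoValid">This text is not valid.</div>\n'
    "  <div>This text is valid.</div>\n"
    "</body>\n"
)

lookbehind_ok = compiles(LOOKBEHIND)
kept = text_after_classless_tags(SAMPLE)
```

Here `lookbehind_ok` is `False` because Python requires a fixed-width lookbehind, while the workaround keeps only the two valid sentences.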

    Susan 

    Edit:

    If you are really only after the TEXT (as opposed to the line breaks), where the text does not contain other tags, and you are using a regex flavour that supports lookaheads and backreferences, then try:

    <(\w+)(?![^>]*?class)[^>]*>([^<]*)</\1>

    This will not find "<span>Hello <b>there</b></span>" as a whole (there again, neither would your pattern: it would split it into several matches), but it will find "there".
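A short sketch of how the lookahead pattern above behaves, again using Python's `re` as a stand-in (this pattern compiles unchanged there). `TEXT_ONLY`, `SAMPLE` and `NESTED` are hypothetical names introduced for the illustration.

```python
import re

# Pattern from the post: a tag without "class" in it, tag-free text, matching close tag.
TEXT_ONLY = re.compile(r"<(\w+)(?![^>]*?class)[^>]*>([^<]*)</\1>")

SAMPLE = (
    "<body>\n"
    "  <span>Hello, this text is valid.</span>\n"
    '  <div class="NoValid">This text is not valid.</div>\n'
    "  <div>This text is valid.</div>\n"
    "</body>\n"
)
NESTED = "<span>Hello <b>there</b></span>"

# Each result is a (tag name, text) pair.
sample_hits = TEXT_ONLY.findall(SAMPLE)
nested_hits = TEXT_ONLY.findall(NESTED)
```

On the sample this yields the span and the second div only; on the nested input only the inner `<b>` element matches, as described above.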

    I forgot to suggest beforehand that you could do this with the HTML DOM which may be a more flexible way to go.

  •  05-22-2008, 4:49 AM 42481 in reply to 42455

    Re: To not obtain a tag in html document

    (?x)
        (

            <(\w++)\b  (?# search the name of tag)

                    (?>(?<!class=)(?>"[^"]*"|'[^']*'|[^<>]))*> (?#search all attributes of the tags only if they are not prefixed by class)

                            ([^<>]+|(?1))*   (?# recursively match the value of the node)

            </\2>  (?#close tag)

        )

  •  05-22-2008, 10:51 AM 42490 in reply to 42481

    Re: To not obtain a tag in html document

    Hi again,

    Thanks to both of you for your responses. Susan, I think your answer helps me a lot. You are right that my regex doesn't match the text I gave in the previous post, and thanks for getting the main idea of my question. I'll try to write my own regex following yours. Thanks for your post.

    Killahbeez, I can't test your regex: Expresso tells me that the fourth line is not well formed. Can you explain exactly why it is not well formed? I tried to work it out myself, but my knowledge of regex is limited. Thanks for your response.

    Once I have it, I'll write another response with my solution.

  •  05-22-2008, 11:36 AM 42493 in reply to 42490

    Re: To not obtain a tag in html document

    Hi again,

    I have a regex that works in Expresso but not in .NET. Well, here is the regex that works fine for me:

    (((?<!class="NoValid"))>)[\r\n\t]*([\w\s&;.,:]+)[\r\n\t]*(<)

    This regex gets all the text between '>' and '<' in an HTML page, avoiding tags with the attribute 'class="NoValid"'. I tested it in Expresso and it works fine, but when I put it in my C# code it doesn't work the same way. The string I'm using in C# is:

    @"(((?<!class=""NoValid""))>)[\r\n\t]*([\w\s&;.,:]+)[\r\n\t]*(<)"

    What is wrong with this regex?

    Thanks again!!!
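For reference, a minimal sketch of what the fixed-width lookbehind in this post does and does not exclude. It is written with Python's `re` as a stand-in for .NET (an assumption; both support fixed-width lookbehind), with the redundant parentheses dropped. It does not reproduce or diagnose the C# behaviour reported above. `NO_VALID`, `texts` and the second input are hypothetical.

```python
import re

# Reject a '>' only when it is immediately preceded by class="NoValid".
NO_VALID = re.compile(r'(?<!class="NoValid")>[\r\n\t]*([\w\s&;.,:]+)[\r\n\t]*<')


def texts(html):
    """Non-blank text captured between '>' and '<'."""
    return [m.group(1).strip() for m in NO_VALID.finditer(html) if m.group(1).strip()]


SAMPLE = (
    "<body>\n"
    "  <span>Hello, this text is valid.</span>\n"
    '  <div class="NoValid">This text is not valid.</div>\n'
    "  <div>This text is valid.</div>\n"
    "</body>\n"
)

# Hypothetical variation: the class attribute is no longer the last thing in the tag.
CLASS_NOT_LAST = '<div class="NoValid" id="x">Hidden.</div>'

sample_texts = texts(SAMPLE)
leaked_texts = texts(CLASS_NOT_LAST)
```

The lookbehind only inspects the characters directly before `>`, so the sample is filtered as intended, but a tag with another attribute after the class slips through (`leaked_texts` contains "Hidden.").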

  •  05-22-2008, 7:30 PM 42522 in reply to 42493

    Re: To not obtain a tag in html document

    1) In what way does it not "work the same way"? What do you get in Expresso and what do you get in your program?

    2) Can you show us the snippet of your code that is doing this work? Your string looks superficially OK (except that you have two sets of parentheses around your lookbehind that can be reduced to one), so it may well be the options that Expresso is using but you are not, or the nature of the text string being scanned, etc.

    Susan 

  •  05-26-2008, 6:13 AM 42613 in reply to 42522

    Re: To not obtain a tag in html document

    Hi Susan,

    I changed my approach. I decided to get all tags and filter their content in code, so I can decide whether the content is good or bad. So, my regex is now this:

    (?xi)
    <(?'tagName'\b\w+(:?\w+)?)[^>]*>(?'content'(?=</\k'tagName'>))</\k'tagName'>
    |
    (?xi)
    <(?'tagName'\b\w+(:?\w+)?)[^>]*>(?'content'.*)</\k'tagName'>

    As you can see, the difference is the 'content' group, but I don't know why this difference matters so much for getting the inner text. The next example shows the output of each regex:

    <span id="control_01"></span><span id="control_02"></span><span id="control_03"></span><span id="control_04"></span>

    If I run the first regex (always using Expresso) the output is four elements, four 'span' elements. But the second regex returns only one element with 'content' equal to

    </span><span id="control_02"></span><span id="control_03"></span>

    Can you explain why this is the expected output?

    Thanks.

  •  05-26-2008, 7:29 PM 42635 in reply to 42613

    Re: To not obtain a tag in html document

    I would like to comment on several things.

    Firstly, as you have written your patterns, it is really a single pattern. I'm assuming that, when you talk about the 'first' and 'second' regexes, you are referring to the first and second alternates. In what follows, I've used just the first 2 lines as the 'first' regex and the last 2 as the 'second' regex.

    With the first regex, looking at the "(?'content'(?=</\k'tagName'>))" part, the 'content' group will never capture any text, and the reason you are getting a match from your sample data is that you have no text between the start and end 'span' tags. What is happening is that you are starting a named group ('content') and then using a lookahead to make sure the next part of the text is the end tag. Remember that lookarounds are placeholders: they do not match any text and do not advance the place in the text that the regex engine is examining.

    Looking at your original post, I believe that you are wanting to capture the text between the tags, in which case you should change this to something like "(?'content'((?!</\k'tagName'>).)*)". What this does is check to see if the next part of the text is the end tag. If it is not, then the '.' consumes 1 character and adds it to the match and then goes back to check the next character. When the lookahead finally does find the end tag, it 'fails', which exits the named match group and moves on to the part that matches the end tag. The way I have written it will allow for no text between the start and end tags, but also matches and captures any text that is there. (Of course there is always the warning about using '.' if the text can span line breaks, in which case the 'singleline' mode will be needed.)

    Now the 'second' regex. The reason it matches everything is that the ".*" part of the 'content' match group is greedy. This means that it will match as many characters as it possibly can, and in your test case (in the above post, not the original post) that means to the end of the string. The regex engine then tries to match the end tag but can't (it's at the end of the string!). Therefore it backtracks by 1 character and tries again. Again it fails, and so it repeats the process until either it has backtracked over all of the characters matched by the 'content' group or it finds a match, which is what happens in this case. However, this is the LAST end tag in the string, not necessarily the end tag that immediately follows the one that started the match in the first place.
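The zero-width behaviour of the lookahead can be seen in a small sketch. Python's `re` is used as a stand-in, so the named groups are written `(?P<name>...)` and the backreference `(?P=name)` instead of the .NET forms `(?'name'...)` and `\k'name'`; the tag-name part is simplified to `\w+`. `EMPTY_ONLY` and the two inputs are hypothetical.

```python
import re

# 'content' wraps only a lookahead, so it can never hold any characters.
EMPTY_ONLY = re.compile(
    r"<(?P<tag>\w+)[^>]*>(?P<content>(?=</(?P=tag)>))</(?P=tag)>"
)

empty_element = EMPTY_ONLY.search('<span id="a"></span>')
filled_element = EMPTY_ONLY.search('<span id="a">hi</span>')

# An empty element matches, and the captured content is the empty string.
captured = empty_element.group("content") if empty_element else None
```

An element with any text between its tags does not match at all, because the lookahead requires the closing tag to start immediately after the opening tag.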
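A minimal sketch of the "check the lookahead, then consume one character" loop described above, in Python syntax (`(?P<name>...)` and `(?P=name)` replace the .NET named-group forms, and the tag-name part is simplified to `\w+`). `TEMPERED` and `elements` are hypothetical names.

```python
import re

# For every character of the content, first make sure the closing tag does not start here.
TEMPERED = re.compile(
    r"<(?P<tag>\w+)[^>]*>(?P<content>(?:(?!</(?P=tag)>).)*)</(?P=tag)>"
)


def elements(html):
    """Return (tag, content) pairs for each matched element."""
    return [(m.group("tag"), m.group("content")) for m in TEMPERED.finditer(html)]


with_and_without_text = elements('<span id="a">hi</span><span id="b"></span>')
with_inner_tag = elements("<p>one <b>two</b></p>")
```

Both the element with text and the empty one are matched, and inner tags with a different name are simply part of the captured content. Pass `re.DOTALL` if the content may span line breaks, which corresponds to the 'singleline' warning above.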

    The straightforward solution is to make the ".*" non-greedy i.e. ".*?". This will then perform as you expect in this case.
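The greedy and non-greedy variants can be compared side by side in a short sketch. Python's `re` stands in for .NET here, with Python named-group syntax and a simplified tag-name part; `GREEDY`, `LAZY`, `FOUR_SPANS` and `contents` are hypothetical names.

```python
import re

FOUR_SPANS = (
    '<span id="control_01"></span><span id="control_02"></span>'
    '<span id="control_03"></span><span id="control_04"></span>'
)

GREEDY = re.compile(r"<(?P<tag>\w+)[^>]*>(?P<content>.*)</(?P=tag)>")
LAZY = re.compile(r"<(?P<tag>\w+)[^>]*>(?P<content>.*?)</(?P=tag)>")


def contents(pattern, html):
    """Return the 'content' capture of every match."""
    return [m.group("content") for m in pattern.finditer(html)]


greedy_contents = contents(GREEDY, FOUR_SPANS)  # one match, ending at the last </span>
lazy_contents = contents(LAZY, FOUR_SPANS)      # one match per span element
```

The greedy version produces a single match whose content runs up to the last closing tag, while the lazy version stops at the first closing tag each time and so returns four matches with empty content.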

    Hope this makes sense.

    Susan

     

  •  05-27-2008, 5:52 AM 42643 in reply to 42635

    Re: To not obtain a tag in html document

    You were right, again.

    Thanks for your extensive response, which made me realize how regex works. Your first two paragraphs were what I meant in my previous post, so thanks again for understanding me (sorry about my English, I'm from Spain and always trying to explain things as best I can).

    Your explanation of my single regex was very helpful, as was how I had to change it to make it work properly. Now I find what I need and not other tags that aren't necessary.

    Thanks for everything, Susan; as far as I'm concerned this post can be closed.

    The final regex is:

    (?xi)
    <(?'tagName'\b\w+(:?\w+)?)[^>]*> (?# Get the tag and all of its attributes, if it has any)
    (?'content'((?!</\k'tagName'>).)*) (?# Get the inner content of the tag, which could contain more tags)
    </\k'tagName'> (?# Get the closing tag)
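As a rough sketch of the "match every tag, then filter in code" approach, the final regex can be translated to Python syntax and combined with a small filter. Several things here are assumptions and not part of the original post: the `attrs` group (added so the code can inspect the attributes), the `valid_texts` helper, and the use of Python's `(?P<name>...)` and `(?P=name)` in place of the .NET `(?'name'...)` and `\k'name'`.

```python
import re

FINAL = re.compile(
    r"""
    <(?P<tagName>\b\w+(:?\w+)?)(?P<attrs>[^>]*)>   # opening tag; the attrs group is an addition
    (?P<content>((?!</(?P=tagName)>).)*)           # inner content, up to the matching close tag
    </(?P=tagName)>                                # closing tag
    """,
    re.VERBOSE | re.IGNORECASE,
)


def valid_texts(html, banned_class="NoValid"):
    """Return the content of every matched element that lacks the banned class."""
    kept = []
    for match in FINAL.finditer(html):
        if f'class="{banned_class}"' in match.group("attrs"):
            continue  # the good/bad decision is made in code, not in the regex
        kept.append(match.group("content"))
    return kept


SAMPLE = (
    "<body>\n"
    "  <span>Hello, this text is valid.</span>\n"
    '  <div class="NoValid">This text is not valid.</div>\n'
    "  <div>This text is valid.</div>\n"
    "</body>\n"
)

kept_texts = valid_texts(SAMPLE)
```

Because '.' does not cross line breaks without the single-line option, the multi-line `<body>` element is not matched as a whole here and the three inner elements are found individually; with `re.DOTALL` the outer element would be matched first instead.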
