
Regexp to retrieve summary info text from a larger html chunk ?

Last post 07-03-2008, 8:49 PM by mash. 7 replies.
  •  06-29-2008, 3:28 PM 43586

    Regexp to retrieve summary info text from a larger html chunk ?

    I am looking for a regexp that extracts a short summary text from a larger chunk of HTML, so that the summary contains no HTML code.

    /Johan   www.nätdejting.nu 


    itansvar.se
  •  06-29-2008, 7:29 PM 43591 in reply to 43586

    Re: Regexp to retrieve summary info text from a larger html chunk?

    Johan,

    Can you please read the sticky note in the companion 'Construction Advice' forum (http://regexadvice.com/forums/thread/43218.aspx) and provide us with the details that are mentioned there?

    However, if you are simply looking to strip out all of the HTML tags and leave the text behind, then I would suggest using the HTML/XML/... DOM and reading the 'InnerText' value.

    Susan 

  •  06-30-2008, 6:25 PM 43621 in reply to 43591

    Re: RegExp to retrieve summary info text from a larger html chunk?

    I am using DotNetNuke, .NET Framework 2.0 and VB.NET, and the task is to publish a summary list of new postings in a web content management system. But InnerText can only be used on well-formed XML, can't it? And I am not sure the HTML will be well formed.

    /Johan    www.brommakontorshotell.se


    itansvar.se
  •  06-30-2008, 7:04 PM 43622 in reply to 43621

    Re: RegExp to retrieve summary info text from a larger html chunk?

    I can only comment on my experience, but I have found that the DOM approach has always worked for the tasks I have had. If the code is not well formed to the extent that it cannot be displayed/interpreted then I doubt if anything is going to help much. Also, why would any (commercial or commercially used) web CMS system be producing badly formed pages if it is expecting to be used for anything useful?

    Also, you have not really told us what you are trying to do. Are you wanting to remove all tags (regardless of what is in them) from a text? What about comments within the file? Are there specific items you are after to create the 'summary'?

    Susan 

  •  07-01-2008, 5:42 AM 43642 in reply to 43622

    Re: RegExp to retrieve summary info text from a larger html chunk?

    Hi Susan

     

    Thanks for your advice. Yes, as a first step I just want to take away all the HTML tags and keep only the text, and then take a small piece of that text to make a summary. So the first thing I want to accomplish is to delete all the HTML tags, even tags that have no closing tag. But as you say, most of the text I am working with is well-formed HTML.

    /Johan www.rekryteringsföretag.com 


    itansvar.se
  •  07-01-2008, 10:20 AM 43661 in reply to 43642

    Re: RegExp to retrieve summary info text from a larger html chunk?

    Hi again all of you and Susan

    This is a sample I am trying to sort out:

    <p>Fredrik Ljungberg is a famous football player</p>
    <p> <font face="Courier New">Age: 31</font></p>
    <p>Team: West Ham</p>
    <p>&#160;</p>
    <p>sdf</p>
    <p>Information about him</p>
    <p>&#160;</p>
    <p>can be easy or though to find</p>
    <p>&#160;</p>

    I tested this with XmlDocument and its InnerText property, but I ran into a lot of errors, so I am still trying to do it with a regexp.

    Could any of you help me with a regexp sample in VB.NET that deletes the markup above (the text shown with a red background), replacing it with just an empty space?

    /Johan   www.vattenskoter.info 


    itansvar.se
  •  07-01-2008, 11:31 AM 43667 in reply to 43661

    Re: RegExp to retrieve summary info text from a larger html chunk?

    This seems to be working for me ... so far.

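                ' Requires: Imports System.Text.RegularExpressions
                ' Match anything that looks like a tag: "<", one non-whitespace character, then everything up to the next ">".
                Dim re As New Regex("<\S[^>]*>")
                ' Remove every tag, keeping only the text between them.
                Dim HtmlText2 As String = re.Replace(HtmlText, "")
                Dim HtmlTextSummary As String = MakeSummary(HtmlText2)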

    I'll post corrections if I run into trouble
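
For readers who want to try the idea outside .NET, here is a minimal Python sketch of the same approach, using the same pattern `<\S[^>]*>`. It is an illustration only: `strip_tags`, `make_summary` and the `max_len` parameter are hypothetical names, and since the thread never shows what `MakeSummary` does, the truncation below is an assumption. The sketch also decodes entities such as `&#160;`, which the tag-stripping regex alone leaves in the text.

```python
import html
import re

# Same pattern as in the post: "<", one non-whitespace character,
# then everything up to the next ">". A lone "<" followed by a space
# (as in "a < b") is deliberately not treated as a tag.
TAG_RE = re.compile(r"<\S[^>]*>")


def strip_tags(markup):
    """Remove tag-like spans, decode entities, and collapse whitespace."""
    text = TAG_RE.sub(" ", markup)   # a space keeps adjacent paragraphs apart
    text = html.unescape(text)       # e.g. "&#160;" becomes a no-break space
    return " ".join(text.split())    # split() also treats the no-break space as whitespace


def make_summary(text, max_len=60):
    """Hypothetical summary step: cut at the last word boundary within max_len."""
    if len(text) <= max_len:
        return text
    cut = text[:max_len].rsplit(" ", 1)[0]
    return cut + "..."


sample = (
    "<p>Fredrik Ljungberg is a famous football player</p>\n"
    '<p> <font face="Courier New">Age: 31</font></p>\n'
    "<p>Team: West Ham</p>\n"
    "<p>&#160;</p>"
)
plain = strip_tags(sample)
print(make_summary(plain, 30))
```

Replacing each tag with a space rather than an empty string avoids gluing the last word of one paragraph to the first word of the next; the extra spaces are collapsed afterwards. A regex like this will still misfire on comments, scripts, or attribute values containing ">", so it suits simple fragments like the sample rather than arbitrary pages.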

    /Johan    www.skribenter.net


    itansvar.se
  •  07-03-2008, 8:49 PM 43744 in reply to 43661

    Re: RegExp to retrieve summary info text from a larger html chunk?


    You have to use the HTML DOM to parse HTML. HTML is not XML. You could use XmlDocument if you are working with valid XHTML.

    Your snippet isn't well-formed XML as it stands (it has no single root element), so it isn't clear whether you have XHTML, but I don't think you do.
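
The difference between an XML parser and an HTML parser can be sketched in Python with the standard library. This is an illustration under assumptions, not the .NET API discussed in the thread: `is_well_formed_xml`, `TextExtractor` and `inner_text` are hypothetical names. It shows an XML parser rejecting a fragment made of several sibling `<p>` elements, while an HTML parser simply reports the text it finds, even when closing tags are missing.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


def is_well_formed_xml(markup):
    """Return True only if the markup parses as a single XML document."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TextExtractor(HTMLParser):
    """Collects the text between tags; tolerant of unclosed tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &#160; and friends
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def inner_text(markup):
    """Rough analogue of reading an 'InnerText' value from an HTML DOM."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered at the end of input
    return " ".join(" ".join(parser.parts).split())


fragment = "<p>Team: West Ham</p>\n<p>&#160;</p>\n<p>sdf</p>"
print(is_well_formed_xml(fragment))  # several top-level elements, so not one XML document
print(inner_text(fragment))
```

Wrapping the fragment in a single root element is enough to make this particular sample parse as XML, which may be one way to keep using an XML-based approach when the content is otherwise clean.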


    Michael

    "In theory, theory and practice are the same. In practice, they are not."
    Albert Einstein