
HTML content, "grouped by" wrapped tags (for lack of a better term)

  •  01-23-2008, 4:27 PM


    Hey all,

    Before I start: the patterns below use the "U" (ungreedy) modifier.  I know a lot of people say it should not be used, so if my issue can be resolved without the U modifier, great!

     Language: PHP (4)

    Users submit content via a WYSIWYG editor we provide, which means we allow HTML markup.  I perform cleanup on what they provide (sanity-type stuff) and then I must split their content into multiple pages every x number of words (I have this part covered).

    I am using a regexp to capture each top-level HTML block, so that I get both the content (the "display" words) and the HTML markup wrapped around it.

    To elaborate, here is what is desired:

    Original:

     <p style='TEXT-ALIGN: center'><strong><span style='TEXT-DECORATION: underline'><span style='FONT-FAMILY: Times New Roman' size='3;'>Strange and Unexplained Happenings</span></span></strong></p> <p><strong><span style='FONT-FAMILY: Times New Roman' size= '3;'>Author: ZeusFluff.</span></strong></p> <p><strong><span style='FONT-FAMILY: Times New Roman' size= '3;'>Date Started: 7/14/06. Date Finished: 7/14/06.</span></strong></p>
    Use regexp to break it into:
    [0] => <p style='TEXT-ALIGN: center'><strong><span style='TEXT-DECORATION: underline'><span style='FONT-FAMILY: Times New Roman' size='3;'>Strange and Unexplained Happenings</span></span></strong></p>
    [1] => <p><strong><span style='FONT-FAMILY: Times New Roman' size= '3;'>Author: ZeusFluff.</span></strong></p>
    [2] => <p><strong><span style='FONT-FAMILY: Times New Roman' size= '3;'>Date Started: 7/14/06. Date Finished: 7/14/06.</span></strong></p>
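For this particular sample, the split above can be sketched without the U modifier, but only under an assumption that real submissions may not satisfy: every top-level block is a `<p>` element, `<p>` elements are never nested, and no attribute value contains a `>`. The helper name `split_paragraph_blocks` and the shortened sample string are made up for illustration.

```php
<?php
// Sketch only: assumes every top-level block is a <p>...</p>, that <p>
// elements never nest, and that no attribute value contains ">".
// split_paragraph_blocks() is a hypothetical helper, not existing code.
function split_paragraph_blocks($html)
{
    // <p\b[^>]*>  opening <p> tag with optional attributes
    // .*?         lazy body (the s flag lets . cross newlines)
    // <\/p>       the first closing </p> after it
    preg_match_all('/<p\b[^>]*>.*?<\/p>/is', $html, $matches);
    return $matches[0]; // full matches, one per paragraph block
}

// Shortened stand-in for the submitted markup.
$sample = "<p style='TEXT-ALIGN: center'><strong>Title</strong></p> "
        . "<p><strong><span size= '3;'>Author: ZeusFluff.</span></strong></p>";

print_r(split_paragraph_blocks($sample));
```

This only covers content shaped like the sample. Blocks wrapped in other tags, or nested tags of the same name, are exactly the cases I still need a general answer for.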
    Here are two methods I've used for the regexp, each with varying success depending on the content submitted.
    $html_markup:
     <p style='TEXT-ALIGN: center'><strong><span style='TEXT-DECORATION: underline'><span style='FONT-FAMILY: Times New Roman' size='3;'>Strange and Unexplained Happenings</span></span></strong></p> <p><strong><span style='FONT-FAMILY: Times New Roman' size= '3;'>Author: ZeusFluff.</span></strong></p> <p><strong><span style='FONT-FAMILY: Times New Roman' size= '3;'>Date Started: 7/14/06. Date Finished: 7/14/06.</span></strong></p>
    --------------------------------------------------------------------------------------
    preg_match_all("/<[^\/]?(\w+)((\s+\w+(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>(.*)<\/\\1>/iusU" ,$html_markup,$arry,PREG_PATTERN_ORDER);
    GIVES:
     <p style='TEXT-ALIGN: center'><strong><span style='TEXT-DECORATION: underline'><span style='FONT-FAMILY: Times New Roman' size='3;'>Strange and Unexplained Happenings</span></span></strong></p> <p><strong><span style='FONT-FAMILY: Times New Roman' size= '3;'>Author: ZeusFluff.</span></strong></p> <p><strong><span style='FONT-FAMILY: Times New Roman' size= '3;'>Date Started: 7/14/06. Date Finished: 7/14/06.</span></strong></p>
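A cut-down reproduction of that result, using a shortened input of my own rather than the real submission. My understanding is that U inverts greediness, so the `.*?` inside the quoted-attribute alternatives becomes greedy and can run from the first quote to the last quote in the string, which would explain why everything comes back as one match.

```php
<?php
// Reduced input (made up for illustration); the pattern is the one above.
$reduced = "<p style='a'><b>x</b></p> <p class='b'>y</p>";

$pattern = "/<[^\/]?(\w+)((\s+\w+(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>(.*)<\/\\1>/iusU";

$count = preg_match_all($pattern, $reduced, $arry, PREG_PATTERN_ORDER);

// Under U, '.*?' is greedy: the style value swallows everything up to the
// last single quote, so both paragraphs end up inside a single match.
echo $count, "\n";       // 1
echo $arry[0][0], "\n";  // the whole of $reduced
```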
    --------------------------------------------------------------------------------------
    This second pattern is VERY close; it gives me almost exactly what I want:
    preg_match_all("/<[^\/](.+)>(.*)<\/(.+)>/iusU",$html_markup,$arry,PREG_PATTERN_ORDER);
    GIVES:
    [0] => <p style='TEXT-ALIGN: center'><strong><span style='TEXT-DECORATION: underline'><span style='FONT-FAMILY: Times New Roman' size='3;'>Strange and Unexplained Happenings</span>
    [1] => <p><strong><span style='FONT-FAMILY: Times New Roman' size= '3;'>Author: ZeusFluff.</span>
    [2] => <p><strong><span style='FONT-FAMILY: Times New Roman' size= '3;'>Date Started: 7/14/06. Date Finished: 7/14/06.</span>
    Notice how each match is missing the remaining closing tags?  For example, the </span></strong></p> that should follow the first </span> is not captured.
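The same truncation shows up on a much smaller input (made up here for illustration). As far as I can tell, the lazy `(.*)` ends at the first `</`, and since the `(.+)` in the closing tag is not tied to the opening tag's name, the match finishes at the innermost closing tag:

```php
<?php
// Reduced input (made up for illustration); the pattern is the one above.
$reduced = "<p><span>Hi</span></p>";

preg_match_all("/<[^\/](.+)>(.*)<\/(.+)>/iusU", $reduced, $arry, PREG_PATTERN_ORDER);

// The lazy (.*) stops at the first "</", and nothing in the pattern ties
// the closing tag name to the opening one, so the match ends at </span>.
echo $arry[0][0], "\n"; // <p><span>Hi</span>
echo $arry[3][0], "\n"; // span
```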
    Any help would be greatly appreciated.