Hey all,
Before I start: the regexps below use the "U" (ungreedy) modifier. I know a lot of people say it should not be used, so if my issue can be resolved without the U modifier, great!
Language: PHP (4)
Users submit content via a WYSIWYG editor we provide, which means we allow HTML markup. I perform cleanup on what they provide (sanity-type stuff) and then I must split their content into multiple pages every x number of words (I have this covered).
To do that, I am using a regexp to capture each top-level HTML block, so that I get the content (the "display" words) together with the HTML markup that wraps it.
To elaborate, here is what is desired:
Original:
<p style='TEXT-ALIGN: center'><strong><span style='TEXT-DECORATION: underline'><span style='FONT-FAMILY: Times New Roman' size='3;'>Strange and Unexplained Happenings</span></span></strong></p> <p><strong><span style='FONT-FAMILY: Times New Roman' size= '3;'>Author: ZeusFluff.</span></strong></p> <p><strong><span style='FONT-FAMILY: Times New Roman' size= '3;'>Date Started: 7/14/06. Date Finished: 7/14/06.</span></strong></p>
Use regexp to break it into:
[0] => <p style='TEXT-ALIGN: center'><strong><span style='TEXT-DECORATION: underline'><span style='FONT-FAMILY: Times New Roman' size='3;'>Strange and Unexplained Happenings</span></span></strong></p>
[1] => <p><strong><span style='FONT-FAMILY: Times New Roman' size= '3;'>Author: ZeusFluff.</span></strong></p>
[2] => <p><strong><span style='FONT-FAMILY: Times New Roman' size= '3;'>Date Started: 7/14/06. Date Finished: 7/14/06.</span></strong></p>
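To make the target behaviour concrete, here is a minimal sketch of the split I am after. It only covers the simple case and rests on assumptions that real submissions may break: every top-level block is a <p> element, <p> elements are never nested, every <p> has its closing </p>, and no attribute value contains a ">" character. The function name split_paragraph_blocks is made up for illustration, and the sample markup is a shortened version of the original above.

```php
<?php
// Hypothetical helper (not part of my real code): returns each top-level
// <p>...</p> block as one string. Assumes <p> is never nested, always
// closed, and that no attribute value contains '>'.
function split_paragraph_blocks($html)
{
    // [^>]* consumes the attributes of the opening <p> tag.
    // .*? is lazy, so it stops at the first </p> that follows.
    // The s flag lets . match newlines; the i flag ignores tag case.
    preg_match_all('/<p\b[^>]*>.*?<\/p>/is', $html, $matches);
    return $matches[0];
}

// Shortened sample input, two top-level blocks separated by a space.
$html_markup = "<p style='TEXT-ALIGN: center'><strong>Title</strong></p> "
             . "<p><strong><span>Author: ZeusFluff.</span></strong></p>";

$blocks = split_paragraph_blocks($html_markup);
print_r($blocks);
```

This does not use the U modifier, but it is not a general solution: it will not work for blocks wrapped in other tags such as <div>, which is why I am still after a regexp that matches any block.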
Here are two regexps I've tried, each with varying success depending on the content submitted.
$html_markup:
<p style='TEXT-ALIGN: center'><strong><span style='TEXT-DECORATION: underline'><span style='FONT-FAMILY: Times New Roman' size='3;'>Strange and Unexplained Happenings</span></span></strong></p> <p><strong><span style='FONT-FAMILY: Times New Roman' size= '3;'>Author: ZeusFluff.</span></strong></p> <p><strong><span style='FONT-FAMILY: Times New Roman' size= '3;'>Date Started: 7/14/06. Date Finished: 7/14/06.</span></strong></p>
--------------------------------------------------------------------------------------
preg_match_all("/<[^\/]?(\w+)((\s+\w+(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>(.*)<\/\\1>/iusU" ,$html_markup,$arry,PREG_PATTERN_ORDER);
GIVES:
<p style='TEXT-ALIGN: center'><strong><span style='TEXT-DECORATION: underline'><span style='FONT-FAMILY: Times New Roman' size='3;'>Strange and Unexplained Happenings</span></span></strong></p> <p><strong><span style='FONT-FAMILY: Times New Roman' size= '3;'>Author: ZeusFluff.</span></strong></p> <p><strong><span style='FONT-FAMILY: Times New Roman' size= '3;'>Date Started: 7/14/06. Date Finished: 7/14/06.</span></strong></p>
--------------------------------------------------------------------------------------
This one is VERY close. This almost gives me exactly what I want!
preg_match_all("/<[^\/](.+)>(.*)<\/(.+)>/iusU",$html_markup,$arry,PREG_PATTERN_ORDER);
GIVES:
[0] => <p style='TEXT-ALIGN: center'><strong><span style='TEXT-DECORATION: underline'><span style='FONT-FAMILY: Times New Roman' size='3;'>Strange and Unexplained Happenings</span>
[1] => <p><strong><span style='FONT-FAMILY: Times New Roman' size= '3;'>Author: ZeusFluff.</span>
[2] => <p><strong><span style='FONT-FAMILY: Times New Roman' size= '3;'>Date Started: 7/14/06. Date Finished: 7/14/06.</span>
Notice how each match is missing the remaining closing tags? For example, the </span></strong></p> that should appear after the first </span> is cut off.
Any help would be greatly appreciated.