
Michael Ash's Regex Blog

Regex Musings

Higher plane solution

I’ve come up with a workaround to the Unicode plane bug I talked about earlier.

 

Originally I was trying to identify Unicode values outside of the ASCII range (code points higher than 127) for a replace.  However, characters in any plane other than plane zero returned two matches per character, meaning I could get two incorrect replacements for those characters.  It turned out the replacement would just not take place, which was still a problem.

 

The following two regexes allowed me to get around this behavior.

            Plane 0 excluding standard ASCII pattern  (?![\uD800-\uDBFF])(?![\uDC00-\uDFFF])[\u0080-\uFFFF]

            Non-Plane 0 pattern [\uD800-\uDBFF][\uDC00-\uDFFF]
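To make the behavior concrete, here is a minimal sketch of the two patterns in TypeScript. It assumes an engine that matches on UTF-16 code units, which is how JavaScript regexes behave without the `u` flag; that may or may not mirror the engine used above. The names `allMatches`, `naive`, `plane0NonAscii`, `supplementary` and `sample`, and the `[P0]`/`[SP]` replacement strings, are made up for illustration.

```typescript
// Return every match of a global regex as a plain array (empty if none).
function allMatches(re: RegExp, s: string): string[] {
  const found = s.match(re);
  return found ? found.slice() : [];
}

// Naive "anything above ASCII" pattern: counts each surrogate separately.
const naive = /[\u0080-\uFFFF]/g;
// Plane 0 excluding standard ASCII, skipping both surrogate ranges.
const plane0NonAscii = /(?![\uD800-\uDBFF])(?![\uDC00-\uDFFF])[\u0080-\uFFFF]/g;
// Non-plane 0: a high surrogate followed by a low surrogate.
const supplementary = /[\uD800-\uDBFF][\uDC00-\uDFFF]/g;

// "a", U+263B (plane 0), U+10308 (plane 1, written as its surrogate pair), "b"
const sample: string = "a\u263B\uD800\uDF08b";

const naiveCount = allMatches(naive, sample).length;
const plane0Count = allMatches(plane0NonAscii, sample).length;
const supplementaryCount = allMatches(supplementary, sample).length;

// Replace each kind of character with a visible marker.
const replaced = sample
  .replace(plane0NonAscii, "[P0]")
  .replace(supplementary, "[SP]");

console.log(naiveCount, plane0Count, supplementaryCount, replaced);
```

In this sketch the naive pattern reports three matches for two characters, while the two workaround patterns report one match each.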

 

The ranges D800-DBFF and DC00-DFFF are the high and low surrogate values, respectively. In UTF-16, surrogate code points are the values of the two 16-bit code units that together encode a supplementary character (a character outside plane 0).
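As a worked example of where those two ranges come from, the standard UTF-16 arithmetic can be sketched in TypeScript. The function name `toSurrogatePair` is hypothetical; the code point used is U+1040F, which is 𐐏.

```typescript
// Split a supplementary code point (U+10000..U+10FFFF) into its
// UTF-16 high and low surrogate code units.
function toSurrogatePair(codePoint: number): [number, number] {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("not a supplementary code point");
  }
  const offset = codePoint - 0x10000;     // 20-bit value
  const high = 0xd800 + (offset >> 10);   // top 10 bits -> D800-DBFF
  const low = 0xdc00 + (offset & 0x3ff);  // bottom 10 bits -> DC00-DFFF
  return [high, low];
}

const pair = toSurrogatePair(0x1040f);
const pairHex = pair.map((unit) => unit.toString(16).toUpperCase());
console.log(pairHex.join(" ")); // D801 DC0F
```

The top 10 bits of the offset always land in D800-DBFF and the bottom 10 bits in DC00-DFFF, which is why a high surrogate followed by a low surrogate identifies exactly one non-plane 0 character.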

 

You may notice that in the plane 0 pattern the two surrogate ranges are in separate lookaheads.  You might think that they could be combined into one lookahead, but this won’t work.  I’m not exactly sure why this is, but doing so will get you a match on the low surrogate of a non-plane 0 character.
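One possible reading of "combined" is a single lookahead holding both ranges in sequence. Under that assumption, and again assuming a code-unit-based engine (JavaScript without the `u` flag, which may differ from the engine used here), the following TypeScript sketch reproduces the stray low-surrogate match. The names `hits`, `combined`, `separate`, `singleClass` and `text` are hypothetical.

```typescript
// Return each match as the hex value of its first code unit.
function hits(re: RegExp, s: string): string[] {
  const found = s.match(re);
  return found ? found.map((m) => m.charCodeAt(0).toString(16)) : [];
}

// One lookahead that rejects only a complete high+low pair.
const combined = /(?![\uD800-\uDBFF][\uDC00-\uDFFF])[\u0080-\uFFFF]/g;
// The original form: two lookaheads, one per surrogate range.
const separate = /(?![\uD800-\uDBFF])(?![\uDC00-\uDFFF])[\u0080-\uFFFF]/g;
// Both ranges folded into a single character class.
const singleClass = /(?![\uD800-\uDFFF])[\u0080-\uFFFF]/g;

// U+10308 written as its surrogate pair.
const text: string = "\uD800\uDF08";

const combinedHits = hits(combined, text);
const separateHits = hits(separate, text);
const singleClassHits = hits(singleClass, text);

console.log(combinedHits, separateHits, singleClassHits);
```

In this sketch the combined form fails at the high surrogate, because a full pair follows, but when the engine retries one position later the lookahead sees a lone low surrogate rather than a pair, so it succeeds and the low surrogate is matched. The separate lookaheads reject either surrogate on its own, and so does the single character class, since the two ranges are adjacent.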

 

I've posted these patterns to RegexLib.  Unfortunately, most of the characters used in the examples can't be displayed on the site. The sample characters I used are:

Plane 0 : ☻

Plane 1:  𐌈, 𐍈 and 𐐏

Published Monday, November 08, 2004 12:45 PM by mash
