
Michael Ash's Regex Blog

Regex Musings

Unicode character class

 

One day a co-worker asked me to look at a regex that wasn’t working. The regex was checking for e-mail addresses and used the character class \p{L}, which I didn’t recognize. Fortunately my co-worker had already worked around the non-working expression, so there was no pressure to fix it, but curiosity about what \p{L} meant got the best of me, so I did some research. I didn’t have to look too far, because RegexLib’s cheat sheet (http://regexlib.com/CheatSheet.htm) mentions it. Basically, the \p{name} character class deals with Unicode.


“Matches any character in the named character class specified by {name}. Supported names are Unicode groups and block ranges. For example, Ll, Nd, Z, IsGreek, IsBoxDrawing.”


Unfortunately, that is about all the detail the description gives, and it is pulled straight from Microsoft’s documentation. The samples are not very helpful. More research revealed that \p{L} matches alpha characters: not just the English alphabet but any letter of any language. I found a few more Unicode groups and ranges and tried them out, without much luck. Most of the regex websites and programs I used to test the \p{name} expression wouldn’t even accept the syntax. With those that did, I couldn’t get a match on almost anything I tried. I tried using

\p{Ll} – lower-case letters

\p{N} – digits and other numeric characters (e.g. Roman numerals)

\p{Basic Latin} – characters in the Basic Latin range
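As a rough illustration of what \p{L} is meant to cover, here is a minimal Python sketch. Python’s standard re module has no \p{...} support, so the sketch looks up each character’s Unicode general category with unicodedata.category instead of running a regex. The helper name is_letter and the sample characters are made up for this example.

```python
import unicodedata

def is_letter(ch: str) -> bool:
    # General categories that start with "L" (Lu, Ll, Lt, Lm, Lo) are letters,
    # which is the set \p{L} is documented to match.
    return unicodedata.category(ch).startswith("L")

# Letters from several scripts, plus a digit, an underscore and a space.
samples = ["a", "Z", "é", "Ж", "中", "7", "_", " "]
letters = [ch for ch in samples if is_letter(ch)]
print(letters)
```

Only the first five samples come back as letters; the digit, the underscore and the space fall into other categories.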


For the longest time I thought that \p{name} just didn’t work. I was partly right and partly wrong.


Of the regex engines that didn’t bark at the syntax, most didn’t recognize the \p class and simply tried to match the character ‘p’. The only regex engine that recognized the character class was the Microsoft VB regex engine, but there was no documentation on which Unicode groups and ranges it accepted. So I just tried my best guesses, working backwards through my test values.


Basic Latin is the Unicode range that includes the English alphabet, yet I kept getting an error saying it was an unknown property. I tried with a space, without one (BasicLatin), and with an underscore (Basic_Latin); nothing worked. Through trial and error, and taking a hint from one of the sample ranges, I learned that the range must be prefixed with the word “Is”. So the correct syntax is \p{IsBasicLatin}. I have to point out two things. First, IsBasicLatin is case-sensitive: Isbasiclatin or isBasicLatin won’t work. Second, the Microsoft documentation points you to an out-of-date link on the Unicode site for the valid ranges and groups. Even if you look at the current documentation (http://www.unicode.org/charts/), note that the Unicode range is NOT named IsBasicLatin but simply “Basic Latin”. Prefixing the range with “Is” and removing the spaces between words, if there are any, seems to work fine with the Microsoft VB regex engine. The “Is” prefix is also an implementation preference. I read somewhere that either “Is” or “In” may prefix the range, if a prefix is used at all. So while some regex engines may accept “IsBasicLatin”, others may take “InBasicLatin” or “BasicLatin”, if or when they accept the \p class.
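The naming rule I stumbled onto (a prefix plus the chart name with its spaces removed) can be sketched in a few lines of Python. This is only an illustration of the rule as described above, not a complete mapping for every block name; the function names block_escape and in_basic_latin are hypothetical, and the second helper simply checks the Basic Latin code point range (U+0000 to U+007F) by hand.

```python
def block_escape(block_name: str, prefix: str = "Is") -> str:
    # "Basic Latin" -> "\p{IsBasicLatin}": prefix plus the name with spaces removed.
    return "\\p{" + prefix + block_name.replace(" ", "") + "}"

def in_basic_latin(ch: str) -> bool:
    # The Basic Latin block covers code points U+0000 through U+007F.
    return 0x0000 <= ord(ch) <= 0x007F

print(block_escape("Basic Latin"))               # form accepted by the Microsoft engine
print(block_escape("Basic Latin", prefix="In"))  # form some other engines may expect
print(in_basic_latin("a"), in_basic_latin("é"))
```

The letter “a” falls inside the range, while “é” (U+00E9) does not, so a Basic Latin class should reject it even though it is a letter.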


My second test, \p{N} (numeric characters), worked fine with the digits 0-9, but when I typed in a Roman numeral (I, V or X) I would not get a match. This was a combination of poor documentation and my own ignorance. When the documentation I found said that \p{N} or \p{Nl} (which matches letter numbers but not the digits 0-9) would match Roman numerals, I thought it meant the alpha characters we use for Roman numerals, which is what I typed in. What it really meant was the Unicode range of Roman numeral characters (U+2160 to U+217F), the dedicated symbols that make up Roman numerals.
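The difference between the letter X and the Roman numeral symbol is easy to see by asking for each character’s general category. The following is a small Python sketch using the standard unicodedata module; the three sample characters are my own choice for the example.

```python
import unicodedata

# A decimal digit, the ordinary letter X, and the dedicated Roman numeral ten.
samples = ["5", "X", "\u2169"]

categories = {}
for ch in samples:
    categories[ch] = unicodedata.category(ch)
    print(f"U+{ord(ch):04X}", categories[ch], unicodedata.name(ch))
```

The digit reports Nd, the typed letter reports Lu (an upper-case letter, so not part of \p{N} at all), and only U+2169 reports Nl.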


Now for my final test, \p{Ll}. This one actually doesn’t work as it is supposed to. The first letter inside the curly braces, capital L (again case-sensitive), is the group, in this case alpha characters. The second letter, lower-case l, tells which case. The other options for case are lower-case u for upper case and lower-case t for title case. The case modifier is optional, so \p{L} matches all cases. That’s how it is supposed to work, from what I have read. The Microsoft engine doesn’t behave that way. Any of the expressions \p{L}, \p{Ll}, \p{Lu} or \p{Lt} will match any case, but only the first one should. The rest should be case-sensitive.
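For reference, here is a minimal Python sketch of how the group and case letters are supposed to combine, based on the Unicode general categories rather than on any particular regex engine. The helper matches_category is a made-up name, and U+01C5 is used as the sample because it is one of the few title-case letters.

```python
import unicodedata

def matches_category(ch: str, name: str) -> bool:
    # A one-letter name ("L") covers the whole group; a two-letter name
    # ("Ll", "Lu", "Lt") has to equal the character's category exactly.
    return unicodedata.category(ch).startswith(name)

# Lower-case a, upper-case A, and the title-case letter U+01C5.
samples = ["a", "A", "\u01C5"]

results = {
    name: [matches_category(ch, name) for ch in samples]
    for name in ["L", "Ll", "Lu", "Lt"]
}
for name, flags in results.items():
    print(name, flags)
```

Under this reading, L accepts all three samples while Ll, Lu and Lt each accept exactly one, which is the case-sensitive behavior I expected from the engine.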


Published Saturday, April 17, 2004 1:47 AM by mash