Regular expressions, Unicode style....

by Michael S. Kaplan, published on 2005/04/23 08:00 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/04/23/411106.aspx


A few days ago, Scott Hanselman asked in the Suggestion Box:

I'm doing an English/Spanish site with ASP.NET using some client side validation with Regular Expressions.

I wanted to write a single Regular Expression for most large text fields:

^[\w\d\s-'.,&#@:?!()$\/]+$

Notice that I'm using \w and \d for WORD characters and DIGITS respectively. I was assuming that JavaScript would allow "áÁéÉíÍóÓúÚñÑüÜ" and characters like it when a browser is configured for Spanish, but it seems only to care about A-Za-z.

I wanted to avoid using A-Za-z, as it's so English-focused.

What's the i18n "right thing to do" when using Regular Expressions?

Ayudame por favor! ;)

Of course he did not wait for me to answer <grin>, instead choosing to post about it on his own blog in a post entitled Internationalized Regular Expressions.
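A quick way to see the behavior Scott describes is to run his pattern against an accented string. This is only a sketch: the variable names are mine, and the two sample strings are made up for illustration.

```javascript
// Scott's pattern from the question, copied unchanged.
// Without the `u` flag, \w means [A-Za-z0-9_] regardless of browser locale.
const asciiPattern = /^[\w\d\s-'.,&#@:?!()$\/]+$/;

// ASCII-only input.
const plain = asciiPattern.test("Ayudame por favor!");

// Same input with U+00FA (LATIN SMALL LETTER U WITH ACUTE).
const accented = asciiPattern.test("Ay\u00FAdame por favor!");

console.log(plain, accented); // true false
```

The accented string is rejected even though every character in it is an ordinary Spanish letter, space, or punctuation mark from the list.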

It's funny, I find myself using Regular Expressions more often in Visual Studio's Find/Replace than I do in actual code using the Regex classes. Not sure what that means, but it is probably bad....

Anyway, the help for Regular Expressions that you reach by clicking the Help button on the Find dialog has the following table in it, which I have used quite a bit:

The following table lists the syntax for matching by standard Unicode character properties. The two-letter abbreviation is the same as listed in the Unicode character properties database. These may be specified as part of a character set. For example, the expression [:Nd:Nl:No] matches any kind of digit.

Expression | Syntax | Description
Uppercase letter | :Lu | Matches any one capital letter. For example, :Luhe matches "The" but not "the".
Lowercase letter | :Ll | Matches any one lower case letter. For example, :Llhe matches "the" but not "The".
Title case letter | :Lt | Matches characters that combine an uppercase letter with a lowercase letter, such as Nj and Dz.
Modifier letter | :Lm | Matches letters or punctuation, such as commas, cross accents, and double prime, used to indicate modifications to the preceding letter.
Other letter | :Lo | Matches other letters, such as gothic letter ahsa.
Decimal digit | :Nd | Matches decimal digits such as 0-9 and their full-width equivalents.
Letter digit | :Nl | Matches letter digits such as roman numerals and ideographic number zero.
Other digit | :No | Matches other digits such as old italic number one.
Open punctuation | :Ps | Matches opening punctuation such as open brackets and braces.
Close punctuation | :Pe | Matches closing punctuation such as closing brackets and braces.
Initial quote punctuation | :Pi | Matches initial double quotation marks.
Final quote punctuation | :Pf | Matches single quotation marks and ending double quotation marks.
Dash punctuation | :Pd | Matches the dash mark.
Connector punctuation | :Pc | Matches the underscore or underline mark.
Other punctuation | :Po | Matches commas (,), ?, ", !, @, #, %, &, *, \, colons (:), semi-colons (;), ', and /.
Space separator | :Zs | Matches blanks.
Line separator | :Zl | Matches the Unicode character U+2028.
Paragraph separator | :Zp | Matches the Unicode character U+2029.
Non-spacing mark | :Mn | Matches non-spacing marks.
Combining mark | :Mc | Matches combining marks.
Enclosing mark | :Me | Matches enclosing marks.
Math symbol | :Sm | Matches +, =, ~, |, <, and >.
Currency symbol | :Sc | Matches $ and other currency symbols.
Modifier symbol | :Sk | Matches modifier symbols such as circumflex accent, grave accent, and macron.
Other symbol | :So | Matches other symbols, such as the copyright sign, pilcrow sign, and the degree sign.
Other control | :Cc | Matches end of line.
Other format | :Cf | Formatting control character such as the bidirectional control characters.
Surrogate | :Cs | Matches one half of a surrogate pair.
Other private-use | :Co | Matches any character from the private-use area.
Other not assigned | :Cn | Matches characters that do not map to a Unicode character.

I use these all the time when I am trying to get behavior that respects more of Unicode.

Not sure if this will help you with what you are looking for, but it is the approach I use to get internationally aware regular expressions.... :-)
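The `:Lu` spelling is specific to the Visual Studio Find dialog; other engines name the same Unicode general categories differently. As a hedged sketch, here is how the idea might look in a JavaScript engine that supports ES2018 Unicode property escapes (`\p{...}` with the `u` flag). That support is an assumption about the runtime, and the older JScript engines do not offer it. The variable names and sample strings are invented for illustration.

```javascript
// Assumes ES2018 Unicode property escapes; they require the `u` flag.

// Counterpart of the table's ":Luhe" example: one uppercase letter, then "he".
const upperThenHe = /\p{Lu}he/u;

// Counterpart of "[:Nd:Nl:No]": a single character from any number category.
const anyNumber = /^[\p{Nd}\p{Nl}\p{No}]$/u;

// A variant of Scott's pattern with \w and \d replaced by the letter, mark,
// decimal-digit, and connector-punctuation categories. The hyphen is escaped
// because `u` mode rejects the ambiguous "\s-'" form.
const textField = /^[\p{L}\p{M}\p{Nd}\p{Pc}\s\-'.,&#@:?!()$\/]+$/u;

console.log(upperThenHe.test("The"), upperThenHe.test("the")); // true false
console.log(anyNumber.test("\u2163")); // U+2163 ROMAN NUMERAL FOUR (Nl): true
console.log(textField.test("Ay\u00FAdame por favor!")); // true
console.log(textField.test("<b>bold</b>")); // false, "<" is a math symbol (Sm)
```

Listing categories explicitly keeps the accepted set visible in the pattern, so widening or narrowing it later is a one-token change.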

 

This post brought to you by "ר" (U+05E8, a.k.a. HEBREW LETTER RESH)
(Because this post is כשר לפסח in anticipation of the festivities that start less than 24 hours from now)


# Jonathan on 23 Apr 2005 7:35 AM:

It's כשר לפסח, not קשר לפסח.
כשר = Kosher
קשר = knot, connection

# Michael S. Kaplan on 23 Apr 2005 8:55 AM:

Interesting... looking at food containers I actually see both spellings. :-(

Not sure what to do with that -- good thing it's not Passover yet I suppose....

# Svend Tofte on 23 Apr 2005 10:10 PM:

I think the linked author's problem is that he is using JScript, probably some 5.6 version, which doesn't have the (.NET) features you list here.

Also, I've never really exercised the regex part of .NET, so I'm no expert there, but my copy of "Mastering Regular Expressions" tells me that by default "\w" matches things such as :Ll, :Lu, and a few others.

I guess they could break backward compatibility when there is no existing code base. That may (what do I know) have been the problem with the regex part of JScript (or wherever the regex code that JScript uses lives): suddenly allowing much wider input may not have flown well with a lot of existing websites relying on \w to mean just [A-Za-z] and no more.

And thanks for a very cool blog btw. I never dabble much in these waters (internationalization, etc), but when I do, it's always scary.

# Michael S. Kaplan on 23 Apr 2005 10:36 PM:

Yep, Scott is indeed looking for a JScript solution. I told him in a comment to his post that there may not be an answer there other than hardcoding the desired characters....

Glad you like the blog! :-)

# Travis Illig on 25 Apr 2005 12:20 PM:

JavaScript supports \u0000-\uFFFF style Unicode character expressions, so you could theoretically expand the character classes into the corresponding \uXXXX-\uYYYY style range(s) on the server side to feed to the client.

I've posted a code sample here:
http://www.paraesthesia.com/blog/comments.php?id=809_0_1_0_C

Obviously something you'd want to cache rather than calculate each time, but it's something to think about.
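A rough sketch of the kind of expansion described above, for readers who want to see its shape. This is not the code from the linked post. It assumes the generating side runs a JavaScript engine with ES2018 `\p{...}` support (the original suggestion was to do this step on the server), it covers only the Basic Multilingual Plane, and the helper names `bmpRanges`, `hex4`, and `toClassSource` are hypothetical.

```javascript
// Collect contiguous runs of BMP code units matched by a one-character regex.
// The regex must not carry the `g` flag, or test() would become stateful.
function bmpRanges(propertyRegex) {
  const ranges = [];
  let start = -1;
  // Run one step past U+FFFF so a range still open at the end gets closed.
  for (let cp = 0; cp <= 0x10000; cp++) {
    const inSet = cp <= 0xffff && propertyRegex.test(String.fromCharCode(cp));
    if (inSet && start < 0) {
      start = cp;
    } else if (!inSet && start >= 0) {
      ranges.push([start, cp - 1]);
      start = -1;
    }
  }
  return ranges;
}

// Format a code unit as a \uXXXX escape (as literal text, not a character).
const hex4 = (cp) => "\\u" + cp.toString(16).toUpperCase().padStart(4, "0");

// Turn ranges into the body of a character class, e.g. \u0041-\u005A...
function toClassSource(ranges) {
  return ranges
    .map(([lo, hi]) => (lo === hi ? hex4(lo) : hex4(lo) + "-" + hex4(hi)))
    .join("");
}

// Compute once and cache; this string is what would be sent to the client.
const letterClass = toClassSource(bmpRanges(/\p{L}/u));

// The resulting pattern uses no `u` flag and no \p, only \uXXXX ranges.
const legacyPattern = new RegExp("^[" + letterClass + "\\d\\s\\-'.,&#@:?!()$\\/]+$");

console.log(legacyPattern.test("Ay\u00FAdame por favor!")); // true
console.log(legacyPattern.test("caf\u00E9 <3")); // false, "<" is not in the class
```

The generated class is several kilobytes long, which is one more reason to cache it rather than rebuild it per request.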
