by Michael S. Kaplan, published on 2005/04/23 08:00 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/04/23/411106.aspx
A few days ago, Scott Hanselman asked in the Suggestion Box:
I'm doing an English/Spanish site with ASP.NET using some client side validation with Regular Expressions.
I wanted to write a single Regular Expression for most large text fields:
^[\w\d\s-'.,&#@:?!()$\/]+$
Notice that I'm using \w and \d for WORD characters and DIGITS respectively. I was assuming that JavaScript would allow "áÁéÉíÍóÓúÚñÑüÜ" and characters like it when a browser is configured for Spanish, but it seems only to care about A-Za-z.
I wanted to avoid using A-Za-z as it's so English-focused.
What's the i18n "right thing to do" when using Regular Expressions?
Ayudame por favor! ;)
Of course he did not wait for me to answer <grin>, instead choosing to post about it on his own blog in a post entitled Internationalized Regular Expressions.
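To make the behavior Scott describes concrete, here is a minimal sketch in Python rather than JavaScript. It assumes that Python's `re.ASCII` flag is a fair stand-in for JavaScript's `\w`, which is defined as `[A-Za-z0-9_]` regardless of how the browser is configured. The sample strings and variable names are mine, purely for illustration.

```python
import re

# With re.ASCII, \w is limited to [A-Za-z0-9_], like JavaScript's \w.
ascii_word = re.compile(r"^\w+$", re.ASCII)
# Without the flag, \w in a str pattern is Unicode-aware.
unicode_word = re.compile(r"^\w+$")

# Escapes keep the accented letters precomposed: U+00F1 is n with tilde,
# U+00FA is u with acute.
samples = ["nino", "ni\u00f1o", "Ay\u00fadame"]

# For each sample: (matches ASCII-only \w, matches Unicode-aware \w)
results = {
    s: (bool(ascii_word.match(s)), bool(unicode_word.match(s)))
    for s in samples
}

for s, (ascii_ok, unicode_ok) in results.items():
    print(f"{s!r}: ascii-only={ascii_ok} unicode-aware={unicode_ok}")
```

The ASCII-only pattern rejects the two accented words while the Unicode-aware one accepts all three, which is the gap the question is about.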
It's funny: I find myself using Regular Expressions more often in Visual Studio's Find/Replace than in actual code using the RegEx classes. Not sure what that means, but it is probably bad....
Anyway, the Regular Expressions help topic, which you reach by clicking the Help button on the Find dialog, includes the following table, which I have used quite a bit:
The following table lists the syntax for matching by standard Unicode character properties. The two-letter abbreviation is the same as listed in the Unicode character properties database. These may be specified as part of a character set. For example, the expression [:Nd:Nl:No] matches any kind of digit.
Expression | Syntax | Description |
---|---|---|
Uppercase letter | :Lu | Matches any one capital letter. For example, :Luhe matches "The" but not "the". |
Lowercase letter | :Ll | Matches any one lower case letter. For example, :Llhe matches "the" but not "The". |
Title case letter | :Lt | Matches characters that combine an uppercase letter with a lowercase letter, such as Nj and Dz. |
Modifier letter | :Lm | Matches letters or punctuation, such as commas, cross accents, and double prime, used to indicate modifications to the preceding letter. |
Other letter | :Lo | Matches other letters, such as gothic letter ahsa. |
Decimal digit | :Nd | Matches decimal digits such as 0-9 and their full-width equivalents. |
Letter digit | :Nl | Matches letter digits such as roman numerals and ideographic number zero. |
Other digit | :No | Matches other digits such as old italic number one. |
Open punctuation | :Ps | Matches opening punctuation such as open brackets and braces. |
Close punctuation | :Pe | Matches closing punctuation such as closing brackets and braces. |
Initial quote punctuation | :Pi | Matches initial double quotation marks. |
Final quote punctuation | :Pf | Matches single quotation marks and ending double quotation marks. |
Dash punctuation | :Pd | Matches the dash mark. |
Connector punctuation | :Pc | Matches the underscore or underline mark. |
Other punctuation | :Po | Matches commas (,), ?, ", !, @, #, %, &, *, \, colons (:), semi-colons (;), ', and /. |
Space separator | :Zs | Matches blanks. |
Line separator | :Zl | Matches the Unicode character U+2028. |
Paragraph separator | :Zp | Matches the Unicode character U+2029. |
Non-spacing mark | :Mn | Matches non-spacing marks. |
Combining mark | :Mc | Matches combining marks. |
Enclosing mark | :Me | Matches enclosing marks. |
Math symbol | :Sm | Matches +, =, ~, |, <, and >. |
Currency symbol | :Sc | Matches $ and other currency symbols. |
Modifier symbol | :Sk | Matches modifier symbols such as circumflex accent, grave accent, and macron. |
Other symbol | :So | Matches other symbols, such as the copyright sign, pilcrow sign, and the degree sign. |
Other control | :Cc | Matches end of line. |
Other format | :Cf | Formatting control character such as the bidirectional control characters. |
Surrogate | :Cs | Matches one half of a surrogate pair. |
Other private-use | :Co | Matches any character from the private-use area. |
Other not assigned | :Cn | Matches characters that do not map to a Unicode character. |
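The two-letter codes in the table are the standard Unicode general categories, so they can be inspected outside Visual Studio as well. As a small sketch (using Python's standard `unicodedata` module, not the Find dialog syntax), this looks up the category of a few characters that correspond to rows above. The sample list and the `is_any_number` helper are my own illustrative choices; the helper is meant as a rough analogue of the `[:Nd:Nl:No]` set from the help text.

```python
import unicodedata

# Sample characters chosen to line up with rows of the table above.
samples = ["T", "t", "\u01c5", "7", "\uff17", "\u2167",
           "(", "_", "$", "+", " ", "\u05e8"]

# Map each character to its two-letter Unicode general category.
categories = {ch: unicodedata.category(ch) for ch in samples}

def is_any_number(ch):
    """Rough analogue of [:Nd:Nl:No]: any of the three number categories."""
    return unicodedata.category(ch) in {"Nd", "Nl", "No"}

for ch, cat in categories.items():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {cat}")
```

Which characters land in which category depends on the Unicode version your Python build ships with, though the common ones shown here are stable.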
I use these all the time when I am trying to get behavior that respects more of Unicode.
Not sure if this will help you with what you are looking for, but it is how I get internationally aware regular expressions.... :-)
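For what it is worth, the same category-based idea can be sketched as a validator for a free-text field. This is only an illustration under my own assumptions: it is written in Python with the standard `unicodedata` module, the function names are hypothetical, and it only loosely mirrors the character class from the question (letters, marks, and numbers of any script, whitespace, plus the punctuation Scott listed).

```python
import unicodedata

# Punctuation copied from the character class in the question.
EXTRA_CHARS = set("-'.,&#@:?!()$/")
# Major classes: letters (L*), marks (M*), and numbers (N*).
ALLOWED_MAJOR = {"L", "M", "N"}
# Whitespace controls that \s would also accept.
WHITESPACE_CONTROLS = set("\t\r\n")

def is_allowed_char(ch):
    """True if ch is a letter, mark, number, space, or listed punctuation."""
    cat = unicodedata.category(ch)
    return (
        cat[0] in ALLOWED_MAJOR
        or cat == "Zs"
        or ch in WHITESPACE_CONTROLS
        or ch in EXTRA_CHARS
    )

def is_valid_text(text):
    """True if text is non-empty and every character is allowed."""
    return len(text) > 0 and all(is_allowed_char(ch) for ch in text)

print(is_valid_text("Ay\u00fadame por favor!"))  # Spanish with U+00FA
print(is_valid_text("<script>"))                 # '<' and '>' are Sm
```

Allowing marks (M*) matters for text that arrives in decomposed form, where an accent is a separate combining character after its base letter. Whether that is the right policy for a given form is a judgment call, not something the table decides for you.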
This post brought to you by "ר" (U+05e8, a.k.a. HEBREW LETTER RESH)
(Because this post is כשר לפסח in anticipation of the festivities that start less than 24 hours from now)