by Michael S. Kaplan, published on 2008/07/25 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/07/25/8771119.aspx
Way back in December of 2007, aaron asked in the Suggestion Box:
Your recent In SQL Server, A-Z [...] might not mean the same thing:
It got me thinking, a whole post dedicated to the problems of mixing regular expressions and i18n would be very interesting. Some questions i've always worried about but never tested:
- '\b' word boundaries, do they incorrectly show up when surrogate pairs or combining characters are involved?
- '\b' word boundaries, are there / should there be characters that form word boundaries only sometimes. It's plausible in some interpretations that "hy-phen" has only two word boundaries, at the beginning and end, but in reality it has 4, as '-' is not a '\w' character. But do other unicode characters have some sort of weird identity.
- If i have an accented character as two code points (combining), does / should '.' (or '?' in Win32 regex) match the character and the accent, or just the base character?
- how wide is the definition of '\w' word character? Does it / should it ever change based on the current user locale/language?
- how likely is your average regular expression going to be i18n unsafe? what are the common pitfalls to avoid?
Note: for 'should / does', i'm asking all of (a) what do you (Michael Kaplan) think it _should_ do, and (b) what do some common implementations do (for instance, the .Net System.Text.RegularExpressions.Regex class, or the new TR1 regex in Visual Studio 2008, or Win32 with FindFirstFile and friends)
(Oh, and your blog is awesome!)
Hopefully the long delay before I got to responding did not change his opinion of the blog. :-)
I'll start off with a quote from Jamie Zawinski:
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
This is to help set expectations realistically. :-)
I'll start by saying that most modern regular expression implementations do have certain features that are particularly good for internationalization, such as Unicode storage semantics. Such features are pretty much essential in most cases. Some of them build in Unicode property type information and most of those keep up with recent versions of Unicode.
With that said, most of the ones that I have worked with are otherwise very primitive, not properly handling Unicode normalization/canonical equivalence, known in .NET as the "text element" semantic. This leads to some problems with Aaron's first and third points above with sequences of characters that should be treated as equivalent to some other character or sequence.
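As a rough illustration of that gap, here is a minimal sketch using Python's standard `re` module. Python is not one of the engines discussed in this post; it is used here only as a convenient example of a code-point-based engine, and the variable names and the normalize-first workaround are assumptions made for the example, not something the post prescribes.

```python
import re
import unicodedata

# The same user-perceived character "é", spelled two canonically equivalent ways.
precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

# '.' consumes a single code point, so it splits the decomposed text element
# and returns only the base letter.
dot_on_decomposed = re.match(r".", decomposed).group()

# A literal pattern does not match its canonical equivalent.
literal_hit = re.search(precomposed, decomposed)

# \w does not cover the combining mark in this engine, so the accent falls
# outside the "word" and a boundary appears in the middle of the text element.
word_in_decomposed = re.findall(r"\w+", "caf" + decomposed)

# Assumed workaround: normalize pattern and input to the same form (NFC here)
# before matching. This only helps where a precomposed form exists.
def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)

normalized_hit = re.search(nfc(precomposed), nfc(decomposed))

print(repr(dot_on_decomposed), literal_hit, word_in_decomposed, normalized_hit)
```

Normalizing first is a workaround on the caller's side rather than engine support for canonical equivalence, and it does nothing for sequences that have no precomposed form.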
Most also have no notion of exceptional cases such as the one in Aaron's second point above -- to support these things you have to build up complex expressions to try to handle the exceptional cases. If one is lucky they are included in samples, but usually only the very simplest ones tend to be built in.
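The "hy-phen" case from Aaron's second point can be sketched the same way, again in Python's `re` purely for illustration. The `HYPHENATED_WORD` pattern is a hypothetical example of the kind of hand-built expression this paragraph describes; it treats only the ASCII hyphen-minus as word-internal and is nowhere near a complete word-breaking rule.

```python
import re

text = "hy-phen"

# Positions where \b matches: the start, either side of '-', and the end.
boundaries = [m.start() for m in re.finditer(r"\b", text)]

# Hypothetical hand-built pattern: runs of \w joined by single hyphens count
# as one word. Every further exception (apostrophes, other dash characters,
# and so on) would need to be added by hand in the same fashion.
HYPHENATED_WORD = re.compile(r"\w+(?:-\w+)*")
words = HYPHENATED_WORD.findall("a hy-phen word")

print(boundaries)
print(words)
```

The built-in `\b` reports four boundaries, and getting the two-boundary reading means replacing it with an expression that spells out the exception explicitly.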
And none that I have ever worked with properly handle locale specific differences, the thing that I have referred to here in the blog as "sort elements" -- what users in a particular language think of as a single character, the kind of thing that Aaron hints at in the fourth point above.
For more on sort elements and text elements, blogs of mine like Sort element vs. text element are a good place to start.
The theory of all of that good property support often runs into the kinds of problems described in No Regex in the Unicode room! (and no sex in the champagne room, either!) and 4400 (*not* 'The 4400') and 'The 44' (*not* 'The 4400'), where what the engine does manages to fall short of what one might expect from an implementer of information coming out of the Unicode Character Database....
And I'm not going to pick on Microsoft's implementation, which is probably about average here. Most suffer from the complicated nature of the data in the UCD when their comparatively simplistic implementation tries to use the data.
Which then leads to the last question, the one about how common the "i18n unsafe" expression problem might be expected to come up. On the whole, I expect it is way more common than people realize, as the nature of the more complicated cases requires built-in expressions much more complicated than the definitions that are usually present....
An ideal implementation plan for such an engine is covered in Unicode Technical Standard #18: Unicode Regular Expressions, whose own summary states: "This document describes guidelines for how to adapt regular expression engines to use Unicode." Many fall short of that ideal, though (the only reason I don't say all is that I have not tested every engine out there, but all the ones I have used, dabbled with, or tested have issues).
Now going back to that original series of blogs about SQL Server, it is clear that the problems I point out in that series, and in posts like Wild[card] thing, You make my CHAR sing and With SQL Server (and SQL itself) comes the illogic of 'trailing spaces' (and the myth of fixed width), have more than anything else to do with SQL Server choosing to draw the line between appropriate behavior and simple definitional consistency in a better place than regular expressions tend to. That choice leads to inconsistencies in the documentation and limitations/flaws in the syntax (which was not made to handle things this complex either).
I must admit that I find myself more comfortable with where SQL Server sits here, rather than where regular expressions do. :-)
This blog brought to you by ঐ (U+0990, aka BENGALI LETTER AI)
# John Cowan on 25 Jul 2008 11:18 AM:
I think the JWZ witticism should be thus:
Some people, when confronted with a problem, think, "I know! I'll use a computer." Now they have 10,000 problems.
It's pretty remarkable that since about 1965 the use of computers has not improved the productivity of the U.S. workforce one iota.
# Michael S. Kaplan on 25 Jul 2008 11:22 AM:
Depends on the definition of improved, maybe? :-)
# Bradley Grainger on 25 Jul 2008 8:23 PM:
To (partly) answer Aaron's question (b), from my experimentation, it appears that the .NET Regex class is pretty closely tied to the UTF-16 encoding used by .NET strings, and doesn't fulfil "Level 1: Basic Unicode Support" as defined by UTS #18. (Thanks for the link to that; it's a great reference when discussing this area.) I just blogged yesterday about problems you'll encounter with .NET regular expressions if you have strings containing non-BMP characters: http://code.logos.com/blog/2008/07/net_regular_expressions_and_unicode.html
# int19h on 26 Jul 2008 6:11 AM:
Looks like we've got a ticket on Connect open about this now: