No way to get that script info I was looking for earlier

by Michael S. Kaplan, published on 2007/12/05 10:01 -05:00, original URI:

I was looking for some Unicode script information today but I couldn't find it. 

Though I may have learned that you may not want to scratch your forearm and say "I need a script, man!" to a law enforcement officer. They don't tend to have a sense of humor about such things and the Unicode explanation is not one they seem to want to buy....

Not that this happened. I am not guilty of moral turpitude or anything like that.

Okay, I am feeling a smidge uncomfortable at this point.

Time to change the topic a bit....

Longtime regular readers may remember running across the blog post The Is* Unicode script ranges in .NET's RegEx from September of 2005.

If not, then you can go read it now and then act snooty to all those hapless SiaO Newbies who haven't seen it yet. :-)

Of course that was long ago.

Since then, tens of thousands of man-hours have produced multiple versions of the .NET Framework.

And of course there have been both Unicode 4.1 and Unicode 5.0.

Some people might be wondering (like developer Charles did) whether the .NET Framework RegEx docs were going to ever be fixed to list these entries. Or failing that whether I would update mine to show all of the ones that had been added.

It would seem, however, that the list has not been updated.

On the bright side, the list is available in documentation now, in the Character Classes topic -- just scroll down a bunch and you will see it.

Also on the bright side, I find it easier to answer the other question from Charles: no functional update available from Microsoft, so no update needed in the help topic or from me. :-(

Personally, I'd like to see this information outside of RegEx or at least into CharUnicodeInfo.

And not just because of the performance issues (the RegEx version of the MSKLC parser was between two and ten times slower!), or the bug Ted Miller pointed out to me previously that I talked about in No Regex in the Unicode room! (and no sex in the champagne room, either!) and 'The 44' (*not* 'The 4400').

It is just that this data really ought to be available for non-RegEx programs too, via the CharUnicodeInfo class.

Not only because it would likely get updated more often, or probably be documented better.

Those theories are side effects of the fact that the group that owns the class is the one most likely to be watching for Unicode updates!

In truth, RegEx uses the .NET and OS casing tables and the property data, which is updated to Unicode 5.0 right now in .NET >= 2.0 and Vista, so this mixed behavior in RegEx is less than ideal anyway.

As is the fact that RegEx does not include any supplementary character info, either! (CharUnicodeInfo does)
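
For anyone wondering what handling supplementary characters involves: .NET strings are UTF-16, so a character beyond U+FFFF is stored as two surrogate code units, and code that looks at one code unit at a time never sees the real code point. Here is a minimal Python sketch of the arithmetic. The helper name combine_surrogates is made up for illustration, and this is not a description of how RegEx or CharUnicodeInfo is implemented.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into a single Unicode code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %#06x" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %#06x" % low)
    # Each surrogate carries 10 bits of the offset above U+FFFF.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+10330 GOTHIC LETTER AHSA is stored in UTF-16 as D800 DF30.
print(hex(combine_surrogates(0xD800, 0xDF30)))
```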

But the real reason to add the info to CharUnicodeInfo is that not every program is a RegEx program, and there is plenty of code that could make good use of such an addition....

I could have used it myself the day before yesterday, when I basically had to write my own version of it for something that did not make sense to do with RegEx!
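
A hand-rolled lookup of this kind can be quite small. What follows is only a minimal Python sketch of the general idea, not the code mentioned above: a sorted table of ranges searched with bisect. The table holds a handful of Unicode block ranges of the kind the Is* names stand for, the names BLOCKS and block_of are hypothetical, and a real table would be generated from the Unicode data files rather than typed in by hand.

```python
from bisect import bisect_right

# (first code point, last code point, block name), sorted by first code point.
# Only a small sample of blocks; a real table would cover all of them.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0590, 0x05FF, "Hebrew"),
    (0x0600, 0x06FF, "Arabic"),
    (0xA840, 0xA87F, "Phags-pa"),
    (0x10330, 0x1034F, "Gothic"),  # a supplementary-plane block
]
_STARTS = [first for first, _, _ in BLOCKS]


def block_of(ch: str):
    """Return the block name for a one-character string, or None if it is not in the table."""
    cp = ord(ch)
    # Find the last range that starts at or before cp, then check its upper bound.
    i = bisect_right(_STARTS, cp) - 1
    if i >= 0:
        first, last, name = BLOCKS[i]
        if first <= cp <= last:
            return name
    return None


print(block_of("\ua846"))      # PHAGS-PA LETTER JA
print(block_of("\U00010330"))  # GOTHIC LETTER AHSA
```

Because the lookup works on code points rather than UTF-16 code units, supplementary characters need no special casing here.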


This post brought to you by ꡆ (U+a846, aka PHAGS-PA LETTER JA)

