For those who enjoy mathematics (or, 'Also new in Vista')

by Michael S. Kaplan, published on 2005/10/25 00:31 -07:00, original URI: http://blogs.msdn.com/michkap/archive/2005/10/25/484430.aspx


Another one of those "new in Vista" posts. :-)

Unicode has added a great deal to support mathematics, from Unicode Technical Report 25 (Unicode Support for Mathematics) to the various Mathematical subranges in Unicode (see the Mathematical Symbols column in the Code Charts for Symbols and Punctuation).

My favorite range is the Mathematical Alphanumeric Symbols block in Unicode, which currently has all of the characters from U+1d400 to U+1d7ff (almost 1000 in all, with some spaces that were left in, as you can see from the code chart).
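For anyone who wants to see those left-in spaces without squinting at the code chart, here is a minimal sketch in Python. It assumes only the standard `unicodedata` module; the exact counts it prints depend on the Unicode version bundled with your Python, so none are quoted here. The helper name `block_holes` is made up for this illustration.

```python
import unicodedata

# Range of the Mathematical Alphanumeric Symbols block, as given in the post.
BLOCK_START, BLOCK_END = 0x1D400, 0x1D7FF


def block_holes():
    """Return the code points in the block that have no character name
    in the Unicode data shipped with this Python build."""
    return [
        cp
        for cp in range(BLOCK_START, BLOCK_END + 1)
        if unicodedata.name(chr(cp), None) is None
    ]


holes = block_holes()
assigned = (BLOCK_END - BLOCK_START + 1) - len(holes)
print(f"{assigned} assigned, {len(holes)} unassigned "
      f"(Unicode {unicodedata.unidata_version})")

# The first character in the block.
print(unicodedata.name(chr(0x1D400)))

# One well-known hole: the slot where an italic small h would go (U+1D455)
# is empty, because that letter already existed elsewhere as U+210E.
print(0x1D455 in holes, unicodedata.name("\u210E"))
```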

Why is it my favorite?

Well, I was having a conversation a few years back with Murray Sargent of Microsoft (one of the representatives of MS at Unicode Technical Committee meetings and a co-author of UTR #25). He was explaining why Unicode, which is generally speaking a plain text standard, was going to approve a block of characters that included many different letters and numbers with bold, italic, and other variations usually reserved for "rich text" outside the scope of Unicode.

"It is all about mathematics, and representing it in plain text," he explained. And he has a point; while I may use bold or italicized text for emphasis, in mathematics there is actual semantic meaning that is expressed in symbols and variables that have such attributes.

At that point, thinking about collation, I asked him if there was ever a time that it would be interesting or important to fold those differences together, for all of the following:

At first, Murray thought I was trying to make them all equal, and objected strenuously to that; luckily I had something different in mind. I pointed out some scenarios:

And he definitely saw the benefit to such a collation.

So, after this conversation (and a few others with other various math experts), in Vista a special LCID is being added:

0x0001007f (MAKELCID(MAKELANGID(LANG_INVARIANT, SUBLANG_NEUTRAL), SORT_INVARIANT_MATH))

It is an alternate sort for the invariant locale, because mathematics is independent of specific locale (kind of like invariant is!).
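To see how that number falls out of the macros, here is a small sketch that mirrors the arithmetic of `MAKELANGID` and `MAKELCID` in Python. The constant values are the ones I understand the Windows headers to use (`LANG_INVARIANT` as 0x7f, `SUBLANG_NEUTRAL` as 0x00, `SORT_INVARIANT_MATH` as 0x1); treat them as assumptions and check them against your own SDK. The function names `make_langid` and `make_lcid` are just stand-ins for the macros.

```python
# Assumed header values; verify against winnt.h in your SDK.
LANG_INVARIANT = 0x7F
SUBLANG_NEUTRAL = 0x00
SORT_INVARIANT_MATH = 0x1


def make_langid(primary, sub):
    """Mirror of MAKELANGID: sublanguage in the upper 6 bits, primary in the lower 10."""
    return (sub << 10) | primary


def make_lcid(langid, sortid):
    """Mirror of MAKELCID: sort ID above the 16-bit language ID."""
    return (sortid << 16) | langid


lcid = make_lcid(make_langid(LANG_INVARIANT, SUBLANG_NEUTRAL), SORT_INVARIANT_MATH)
print(f"0x{lcid:08x}")  # prints 0x0001007f
```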

This locale causes each of the above letters to be a mere secondary and/or tertiary difference away from everything else on the list. The same principles were applied to all of the Greek letters and numbers in the block.
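The idea of "a mere secondary and/or tertiary difference" can be approximated outside Windows. The sketch below is not the Vista sort table. It is a rough stand-in that uses Unicode compatibility normalization (NFKD) plus case folding as the primary level, and the original code points as a tie-breaker. The function name `math_sort_key` is hypothetical.

```python
import unicodedata


def math_sort_key(s):
    """Two-level key: the styled math letters fold to their plain
    counterparts at the primary level, and the original string only
    breaks ties between otherwise-equal entries."""
    primary = unicodedata.normalize("NFKD", s).casefold()
    return (primary, s)


symbols = [
    "b",
    "\U0001D400",  # MATHEMATICAL BOLD CAPITAL A
    "A",
    "\U0001D41B",  # MATHEMATICAL BOLD SMALL B
    "a",
    "\U0001D434",  # MATHEMATICAL ITALIC CAPITAL A
]

# Every flavor of "a" groups together ahead of every flavor of "b".
for ch in sorted(symbols, key=math_sort_key):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

The same folding applies to the Greek letters and digits in the block, since they also carry compatibility decompositions to their plain forms. A real collation would assign proper secondary and tertiary weights rather than falling back on raw code point order.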

Please note that this is not something that can be selected in Regional and Language Options as a locale (neither can invariant, so obviously an alternate sort of invariant cannot be chosen). But it can be used in any programmatic situation where one is looking to compare strings, find within strings, or create sort keys.
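As a rough sketch of what "programmatic" might look like, the following calls the Win32 `CompareStringW` function through Python's `ctypes`, passing the LCID quoted above. It assumes a Windows build where that LCID is accepted; on other platforms, or if the call fails, the hypothetical helper `compare_math` simply returns `None`. No particular comparison result is claimed here.

```python
import ctypes
import sys

LCID_INVARIANT_MATH = 0x0001007F  # the LCID quoted in this post

# CompareStringW return values (0 means the call failed).
CSTR_NAMES = {1: "less than", 2: "equal", 3: "greater than"}


def compare_math(a, b, flags=0):
    """Compare two strings with the invariant math sort.
    Returns a description of the result, or None when the API is
    unavailable or the call fails."""
    if sys.platform != "win32":
        return None
    compare = ctypes.windll.kernel32.CompareStringW
    compare.argtypes = [
        ctypes.c_uint32,   # Locale (LCID)
        ctypes.c_uint32,   # dwCmpFlags
        ctypes.c_wchar_p,  # lpString1
        ctypes.c_int,      # cchCount1 (-1 means null-terminated)
        ctypes.c_wchar_p,  # lpString2
        ctypes.c_int,      # cchCount2
    ]
    compare.restype = ctypes.c_int
    result = compare(LCID_INVARIANT_MATH, flags, a, -1, b, -1)
    return CSTR_NAMES.get(result)


print(compare_math("A", "\U0001D400"))
```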

And it is right there in Vista, for those who are mathematically inclined....

 

This post brought to you by "𝐀" (U+1d400, a.k.a. MATHEMATICAL BOLD CAPITAL A)


# Jon Payne on Tuesday, October 25, 2005 4:46 PM:

Could these new characters have implications for international domain names? "MATHEMATICAL SANS-SERIF CAPITAL A" could look rather similar to "LATIN CAPITAL LETTER A" in a URL. Also, a URL with a mix of mathematical and Latin characters might not be flagged up as a URL containing characters from multiple languages because, in a sense, it doesn't.

# Michael S. Kaplan on Tuesday, October 25, 2005 7:14 PM:

Excellent question, Jon!

For most people, I would say no (since you have to have one of those math-specific fonts and it would be unlikely to display with glyphs). But the math symbols would definitely give cross-script errors since they are from different scripts (one Latin, one Common), unless you pass the flag to ignore the Common script range....

# Nick Lamb on Tuesday, October 25, 2005 10:28 PM:

IDN is a per-registry issue. Each registry must write and enforce its own policies on acceptable names, because obviously allowing the entire Unicode range is simultaneously pointless and dangerous. I could write a long rant here about Network Solutions, but I think it would be redundant. Suffice to say that in a well-run registry IDN fraud attempts are not likely to be a huge problem and that the public gTLD registries are not well run. Look to a European ccTLD registry like Nominet for a contrasting example.

Of course it would probably have been better to leave IDN as an experiment and put up with uninformed Totoro fans whining about why they can't register トトロ.com forever but it's too late for that now. We regret these mistakes at our leisure.

# Robert on Wednesday, October 26, 2005 6:43 AM:

Great news! Math is getting its own LCID. Hopefully it won't take long, and there will be a math IME, too, so we can type equations with those characters.

# Jonathan on Wednesday, October 26, 2005 8:14 AM:

What meaning does Bold/etc exactly have in Math? I've never heard of this before...

# Michael S. Kaplan on Wednesday, October 26, 2005 8:52 AM:

Hi Robert -- I hear you. I know of people who have used MSKLC to create keyboards for them. Although you cannot cover all characters, few mathematicians would need to use all of them at once, so 3 keyboards do the trick....

# Michael S. Kaplan on Wednesday, October 26, 2005 8:54 AM:

Hi Jonathan --

I am going to try to get someone qualified to come up and discuss that point more fully; my knowledge is limited to a few well-known math constants. :-)

# Michael S. Kaplan on Wednesday, October 26, 2005 9:05 AM:

I believe you can also look at the text in UTR25 for some examples of when they are used....

# Andreas Magnusson on Wednesday, October 26, 2005 9:27 AM:

Jonathan: Vectors are one thing that would commonly be written in bold.

Now we just have to wait for the huge braces that can contain several lines to use for matrices...

# Michael S. Kaplan on Wednesday, October 26, 2005 9:34 AM:

Andreas -- there are formatting programs that will properly create such huge braces etc. based on metadata that describes how to best display things.

# Michael S. Kaplan on Wednesday, October 26, 2005 9:37 AM:

Hi Nick -- We cannot just make it a registry issue, we need this covered at all levels. Certainly it must start there, but it cannot end there. As for "would probably have been better to leave IDN as an experiment" I am forced to disagree. We call it WWW - WORLD WIDE web. So we need to support the entire world, something we were not doing previously, and really needed to. Is the problem harder? Sure. But we cannot refuse a problem that must be solved just because it is challenging.

# Murray Sargent III on Wednesday, October 26, 2005 1:11 PM:

As Andreas points out, bold is commonly used for vectors, although some authors prefer to represent vectors by letters with arrows above them. The mathematical alphanumerics, particularly the serifed italic, bold, bold italic, script, and Fraktur sets, are proving to be very useful in a math display and editing system some of us are working on. Such work is complicated a bit by the holes Michael refers to, which were introduced because some of the math alphabetics already existed in the Letterlike Symbols block and the Unicode Technical Committee doesn't like to exacerbate the multiple-character-same-glyph problem. Large braces are nicely handled via glyph variants, along with other special characters like superscripted primes and sub/superscript glyphs in general. Note that pieces to make large braces, brackets, etc., exist in Unicode, such as U+239B - U+23B1.

Re IDN, hopefully the math alphanumerics will be illegal in domain names; there are already plenty of spoofing opportunities without adding any more (although at least some of the math alphas look quite different, e.g., the Fraktur and script symbols).

One way to enter Unicode's vast math symbol set is as in TeX: \alpha inserts an alpha (actually a math italic alpha), \int inserts an integral sign, \fH inserts a Fraktur H, etc. If your editor has an autocorrect facility, you can define your own combinations for keyboard entry.

# Jerry Pisk on Wednesday, October 26, 2005 2:42 PM:

Way off-topic but Michael brought it on himself - (deleted question) Put it in the Suggestion Box, Jerry!

# Richard on Thursday, October 27, 2005 9:44 AM:

Re: IDNs

Unicode has certainly recognised the issue; the table of confusables (found alongside the other character data tables on http://www.unicode.org/) certainly references the mathematical letters pretty heavily.

As to what they mean... depends on the branch of maths you're dealing with; these symbols are pretty heavily overloaded. Notable exception being for blackboard-bold (or double-struck) N, Z, Q, R, C which are used for the sets of natural, integral, rational, real and complex numbers. (And why they appear in the BMP, and leave holes in the Mathematical Alphanumeric Symbols block.)

# Michael S. Kaplan on Thursday, October 27, 2005 10:29 AM:

Hey Richard -- I don't disagree, I was just pointing out that if you use the mitigation tools we provide, then the situation is detected and handle-able.

# Nick Lamb on Thursday, October 27, 2005 8:40 PM:

"We cannot just make it a registry issue, we need this covered at all levels. Certainly it must start there, but it cannot end there."

The question of whether any particular sequence of characters is a permissible domain name is a matter for the relevant registry. They have been instructed to decide on a policy, authorised to enforce it and given the means to do so. They also have a motive to prevent fraud (since the companies being defrauded are their customers). Aside from implementing IDN correctly if you do it at all there's just not much you can do on the client that will be effective.

"I am forced to disagree. We call it WWW - WORLD WIDE web."

First of all DNS is not the "World Wide Web", and IDN is a DNS feature, not just a web feature. Now, that said the web is no less "world wide" just because domain names are restricted to a smaller character set than that used by almost any language in the world. Prior to IDN several changes had been made to restrict the range (of name records in public DNS), eliminating ASCII characters that were confusing or caused interoperability problems. If the theory that a larger character set made it more useful were true, we'd expect to have seen big problems when we did this. But all that happened was our interop problems went away.

Having DNS names which most people can't enter, like トトロ.com just means putting up fences in our "global village", it's Babel all over again. The objective of internationalisation is to bring us all together, and my objection to IDN is that it (in contrast to things like ISO 10646) doesn't help us to do that.

"something we were not doing previously, and really needed to"

We'll have to agree to disagree, just so long as you don't come back complaining when it bites you.

# Michael S. Kaplan on Thursday, October 27, 2005 11:57 PM:

Ok, agree to disagree. I just don't want to tell people that I can have a URL with my name on it but they cannot have one with theirs. YMMV (and probably does).

But I wonder how you would feel if your name did not fit?

# Nick Lamb on Sunday, October 30, 2005 6:40 AM:

To be picky, my name as written doesn't fit, even with IDN. My name is "Nick Lamb" not "NickLamb" or "nick-lamb". If I wanted a vanity domain I would certainly never choose something so tacky, which is fortunate because it seems popular with domain speculators and search engine spammers.

I know people who had to transliterate their names because they wanted a vanity domain and they don't seem bothered about it. More famously, Håkon Lie seems more amused than annoyed that he's found it easier to get along as "Howcome" even without dealing with DNS. Like Håkon I have little use for a name that hardly anyone can pronounce or spell.

The name argument would have more merit if people had stopped calling their children Jean Dupont and Gwen Jones.

# Michael S. Kaplan on Sunday, October 30, 2005 2:38 PM:

Ok, fair enough. But I would feel obnoxious if I had to tell foreign companies, even those who find that internet usage by their customers is predominantly in-country, that they are not allowed to use their company's actual name, and that there were no plans to change that.

Microsoft recognizes that over 60% of its customers are outside the US and that between 70 and 100% of those customers prefer to use their own language (in some cases it is not a choice for them -- for new potential customers this is often the case). So I guess we need to take a wider view on the issue.

referenced by

2006/11/26 Math in Unicode is hard. So let's have Murray make it easier!

2005/11/03 Math is hard, let's do Unicode!
