What is the name of that character?

by Michael S. Kaplan, published on 2006/01/12 15:03 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/01/12/511920.aspx


To date, I haven't spent very much time in the MSDN Forums.

Not that they are not cool or anything like that. I think that the work that Josh Ledgard and others have put into the setup, and that so many community members have put into answering questions, has been very cool.

It is just that there are only so many hours in a day, and between my actual work and the limited time in which I allow myself to have an actual life, reading blogs and posting to this one takes up whatever time is left over. A whole new place to be looking is just a bit more than I can handle in my spare time.

But I think it is an awesome resource because of its differences. The truth is that some people will be comfortable with newsgroups, others with blogs, still others with PSS phone calls, and yes -- some with the MSDN Forums. And so on. Since the goal is to get the questions answered so that people are helped, offering multiple means of providing assistance helps make sure more people can get what they need.

Which is not to say I never make it to the MSDN Forums. Because from time to time, Stephen Fisher will notice an 'international' question that has not yet been answered and he'll have me take a look.... :-)

Anyway, a few weeks ago, Carl M. asked the following question here:

How can I get the description (name) of a char in English? Assume the string it comes from is normalized.

public static string GetDescription(char c){
   
//? how to return the description
}

For example GetDescription('ñ') should return "Latin small letter n with a tilde"

What about composite characters like most of the Hebrew letters?

Thanks in advance

Carl

Unfortunately, there is nothing in the .NET Framework that will return the Unicode character names. These are not produced algorithmically but are instead assigned when characters are added to Unicode. And although they usually seem to follow nice, neat rules, there are plenty of them that are not intuitive or understandable.

Occasionally there is a bug where the name is not even correct! But the rules are clearly laid out in #2 of the Stability Policy for the Unicode Standard:

2. Name Stability

Applicable Version: Unicode 2.0+

Once a character is encoded, its character name will not be changed.

The character names are used to distinguish between characters, and do not always express the full meaning of each character. They are designed to be used programmatically, and thus must be stable.

In some cases the original name chosen to represent the character is inaccurate in one way or another. Any such inaccuracies are dealt with by adding annotations to the character name list (which is printed in the Unicode Standard and provided in a machine-readable format), or by adding descriptive text to the standard.

Note: It is possible to produce translated names for the characters, to make the information conveyed by the name accessible to non-English speakers.

So for every character there is one and only one official name, and to get that name you have to be using something that is storing the actual name.

Which the .NET Framework does not have. It is not even very common functionality in Microsoft products (well, MSKLC has the names and so does Character Map in Windows; both of them use a slightly friendlier proper-cased name rather than the official ALL CAPS one). But there is no public function in the Win32 or .NET APIs to provide the info.

Though of course it is always something that can be considered for a future version if the scenarios are compelling enough -- so if you have a requirement then feel free to explain your scenario here. :-)

On the Unicode side, there are plans afoot to provide mechanisms to fix really awful mistakes without violating those stability guarantees, something that I will talk more about when it becomes a reality.

 

This post brought to you by "ᢆ" (U+1886, a.k.a. MONGOLIAN LETTER ALI GALI THREE BALUDA)


# Dean Harding on 12 Jan 2006 6:14 PM:

It's not that hard to generate the database yourself. You can just download the NamesList.txt file from the unicode character database and parse it. When stored in a binary (and perhaps compressed as well) form, it shouldn't be that big (it's about 800KB uncompressed), and if it's something your app actually needs, then it's probably the "best" solution.

Here's the URL for the most up-to-date NamesList.txt:
http://www.unicode.org/Public/UNIDATA/NamesList.txt
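
As a rough illustration of the parsing Dean describes, here is a minimal sketch in Python (chosen only for brevity; the same approach carries over to C#). It assumes that a character entry in NamesList.txt is a line of the form hex code point, tab, name, and that every other line (headers starting with "@", comments starting with ";", tab-indented annotations) can be skipped. The names `parse_names` and `get_description` and the sample lines are hypothetical, made up for this example.

```python
import re

# Assumption: a character entry looks like "<hex code point><TAB><NAME>",
# e.g. "00F1\tLATIN SMALL LETTER N WITH TILDE".
ENTRY = re.compile(r"^([0-9A-F]{4,6})\t(.+)$")

def parse_names(lines):
    """Build a {code point: name} dict from NamesList.txt-style lines."""
    names = {}
    for line in lines:
        match = ENTRY.match(line.rstrip("\r\n"))
        if match:  # anything that is not a character entry is ignored
            names[int(match.group(1), 16)] = match.group(2)
    return names

def get_description(names, ch):
    """Return the stored name for a single character, or None if unknown."""
    return names.get(ord(ch))

# Hypothetical sample standing in for the downloaded file.
sample = [
    "@@\t0080\tC1 Controls and Latin-1 Supplement\t00FF",
    "00F1\tLATIN SMALL LETTER N WITH TILDE",
    "\t: 006E 0303",
    "05D0\tHEBREW LETTER ALEF",
]
names = parse_names(sample)
print(get_description(names, "\u00f1"))
```

In a real application the dict would be built once from the full file (or from a pre-compiled binary form, as suggested above) rather than from an inline sample.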

# Yosuke HASEGAWA on 12 Jan 2006 8:19 PM:

Unpublished API "GetUName" in GETUNAME.DLL provides that function. Of course, using this function is inadvisable.

# Maurits [MSFT] on 12 Jan 2006 8:33 PM:

Perhaps a compiled version of the nameslist.txt file could be included in Windows?

# Maurits [MSFT] on 12 Jan 2006 8:41 PM:

As to a use-case scenario... consider international URIs. If a URL is blocked because it contains (for example) a CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I masquerading as a LATIN CAPITAL LETTER I, how are you going to communicate that to the user?

Wrong answer: Sorry user, but there was a "І" in the URL which looks very much like a "I"

Better: Sorry user, but there was a
CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I
which looks very much like a
LATIN CAPITAL LETTER I
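
A sketch of how such a message might be built, assuming a runtime that does ship the name table. Python's standard `unicodedata` module is used here as a stand-in, since neither Win32 nor .NET exposes an equivalent; `describe` and `lookalike_message` are hypothetical helper names.

```python
import unicodedata

def describe(ch):
    """Return 'U+XXXX NAME' for one character, with a fallback for unnamed ones."""
    name = unicodedata.name(ch, "<unnamed>")
    return f"U+{ord(ch):04X} {name}"

def lookalike_message(found, expected):
    """Explain a suspicious character by name instead of by its glyph."""
    return (f"Sorry user, but there was a {describe(found)} in the URL, "
            f"which looks very much like a {describe(expected)}.")

# U+0406 is the Cyrillic letter from the example above.
print(lookalike_message("\u0406", "I"))
```

Including the code point alongside the name keeps the message unambiguous even when two names are easy to confuse.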

# Michael S. Kaplan on 12 Jan 2006 8:42 PM:

Right-o, Dean!

Yosuke is correct -- though it is not advised to use undocumented entry points....

Maurits -- this is where the scenarios come in. Getting in features is not about HOW or HOW EASY, it's about WHY. :-)

# Michael S. Kaplan on 12 Jan 2006 8:43 PM:

There you go! It's all about the WHY. :-)

# Robert on 12 Jan 2006 9:03 PM:

It would be nice if GetUName were made public. This is an extremely useful function, and I am using it in my app to let the user pick a symbol, but of course I would be happier if I knew that I am using the right buffer size, whether it will work in future versions of Windows, etc.

# Michael S. Kaplan on 12 Jan 2006 11:12 PM:

Honestly, it is much safer to not call it, because where it sits is not the right place for an NLS API, if something ever were going to be done here.

Better to be safe....

# Dean Harding on 13 Jan 2006 1:30 AM:

I wrote up a little program which downloads NamesList.txt and parses it, just for fun. Only 150 lines of C#, including comments!

http://www.codeka.com/blogs/index.php/dean/2006/01/13/getting_the_name_of_a_unicode_character

:-)

# Abhinaba Basu [MSFT] on 13 Jan 2006 2:30 AM:

I needed the exact same thing before I began working at Microsoft. An additional requirement was that it needed to be cross-platform. So we wrote a script that took NamesList.txt (from http://www.unicode.org/Public/UNIDATA/NamesList.txt), parsed it, and directly created a C++ file out of it. This file was built with our system, and hence the whole list was available as a C++ object. It was a bit heavy on memory but worked like a charm.....

# Abhinaba Basu [MSFT] on 13 Jan 2006 2:37 AM:

You can take a look at the IBM ICU library as well. AFAIK it's free for use (even commercial use). It runs on Windows and gives you this functionality and more. Check out http://www.ibm.com/software/globalization/icu/. It has Doxygen commenting, so you can peek directly into their header files to see the API reference. You can get the name of a character using u_charName http://icu.sourceforge.net/apiref/icu4c/uchar_8h.html#a559 or do a reverse lookup and get the character with a given name.

The only issue I faced with this is that all of these APIs use UTF-32, and I had to keep juggling between UTF-16 and UTF-32.
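
The juggling mentioned here largely comes down to combining UTF-16 surrogate pairs into single code points before calling an API that takes one 32-bit value per character. A minimal sketch of that step, with `code_points` as a hypothetical helper (it is not an ICU function):

```python
def code_points(utf16_units):
    """Yield Unicode code points from a sequence of UTF-16 code units (ints)."""
    units = iter(utf16_units)
    for unit in units:
        if 0xD800 <= unit <= 0xDBFF:  # high surrogate: needs a low surrogate next
            low = next(units, None)
            if low is None or not 0xDC00 <= low <= 0xDFFF:
                raise ValueError("unpaired high surrogate")
            yield 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
        elif 0xDC00 <= unit <= 0xDFFF:  # low surrogate with no preceding high one
            raise ValueError("unpaired low surrogate")
        else:  # a BMP character is its own code point
            yield unit

# 'A' followed by the surrogate pair for U+1F600.
print([hex(cp) for cp in code_points([0x0041, 0xD83D, 0xDE00])])
```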

# Richard Gadsden on 13 Jan 2006 5:46 AM:

Wouldn't a web service, which can be updated as the Unicode standard is updated, be a better solution than a Windows function?

# Abhinaba Basu [MSFT] on 13 Jan 2006 3:13 PM:

Richard, in case Unicode does come up with such a web-service and I use it for my word-processing software, don't you think I'll have in place the world's slowest word-processor?

Unicode character lookup like whether to break a line at a given character happens for each character in a document, when text layout happens. So you are looking towards a million web-service calls each time you open the doc. Oh by the way there's another restriction that you can open the doc only when you are connected to the internet :)

# Michael S. Kaplan on 13 Jan 2006 3:48 PM:

A better model would probably be the one MSKLC uses -- a menu option to download updated Unicode data files.

# CarlMeis on 18 Jan 2006 12:41 PM:

Hi everybody. I am the one that started all this.

It's been a while since I posted it. It got zero mileage in two other forums and was unanswered in the MSDN one too. The alerts don't work too good. I didn't even know Michael Kaplan replied.

I am glad I got you thinking. I have since then parsed the html http://www.unicode.org/Public/UNIDATA/NamesList.txt

I used a dataset so that in the future I (or an admin) could subcategorize at will, not just by the official categories, and without digging into the code. Charmap.exe categorizes poorly.

A practical example is an algorithm that analyzes text and determines the language.

I have one that is based on spelling errors over a certain threshold. The algorithm is much more effective if first characters that are not from the same alphabet are cleaned out.

Operations that depend on string length are improved too if it is a known universe of double or single code points.

I needed the names to plug them into the error messages, much like what Maurits talks about with international URLs.

The utility is not effective as an anti-spam tool where char spoofing is intentional. It works well as a server validator that a user's input is not in French if a textbox is intended to be for "fr" input.

Another nice use is at the QA stage when translating resources. It can search through the resx files and identify untranslated data nodes if their inner text does not match the culture in the file name.

# Michael S. Kaplan on 18 Jan 2006 12:54 PM:

I would not really try to use character NAMES for that, tho -- better to use other Unicode props, all things considered....

# CarlMeis on 18 Jan 2006 1:40 PM:

What could be more friendly to a non-technical user than the char name?

How do we use getuname.dll?

# Michael S. Kaplan on 18 Jan 2006 4:02 PM:

If you have parsed nameslist.txt then you definitely would not be considered a non-technical user. :-)

And of course calling an undocumented function in an undocumented library is also a bit more than a non-technical user would be able to do....

You can look at the names in CharMap, or you can actually parse the names and other info out of the UCD (the database where you got nameslist.txt from) to get all of the properties....

# CarlMeis on 19 Jan 2006 11:59 AM:

Michael,

I think we got some wires crossed here. I am not sure what you meant by "I would not really try to use character NAMES for that" Whatever "THAT" was. I shouldn't have replied to it if I was not sure.

By user I meant user of my code. That is whom I show my error messages identifying the char by name.

I am a geek for sure. So I guess I do qualify to play with matches. How do we use getuname.dll?

I would sure want to check it out and decide for myself. Appreciate a non-judgmental answer.

# Michael S. Kaplan on 19 Jan 2006 12:45 PM:

It is neither documented nor supported to call it, and there is no guarantee that it will work the same, or indeed work at all, in a future version -- so you should not call it.

Will it reformat your hard drive, or physically harm the user who runs the code? Probably not. But if it crashes the application that tries to call it, then that is an application bug.

I am not being judgmental, but I am being realistic about using unsupported technologies.

But Unicode is an available standard and anybody can grab down files from the UCD and use them if they want to show the character names. Until/unless MS does something more here, that is the best MS can provide....

# Đonny on 24 Nov 2011 4:00 PM:

Based on this:

code.google.com/.../getuname.c

GetUName signature is:

int WINAPI GetUName(IN WORD wCharCode, OUT LPWSTR lpBuf);

Which can be imported into VB like

Public Declare Function GetUName Lib "getuname.dll" (ByVal wCharCode As UShort, <MarshalAs(UnmanagedType.LPWStr)> ByVal lpbuf As System.Text.StringBuilder) As Integer

Return value seems to be number of characters written to lpBuf.

I use it like this:

Dim buff As New StringBuilder(1024)

Dim ret = API.Misc.GetUName(codePoint, buff)

If ret <= 0 Then Return Nothing

Return buff.ToString

Thanks to blogs.microsoft.co.il/.../p-invoke-signature-generator.aspx; I had forgotten that I can use StringBuilder for LPWSTR.


