Can I get my characters into Unicode?

by Michael S. Kaplan, published on 2005/02/06 08:03 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/02/06/367985.aspx


The other day, Ivan Petrov pointed out:

...maybe the BIGGEST one, is about the absence of many of the Cyrillic vowel letters with graves in Unicode, respectively in ANSI 1251 Codepage. There are defined only 2+2=4 (CAPITAL and SMALL letters with graves – #CYRILLIC CAPITAL LETTER IE WITH GRAVE, #CYRILLIC CAPITAL LETTER I WITH GRAVE, #CYRILLIC SMALL LETTER IE WITH GRAVE and #CYRILLIC SMALL LETTER I WITH GRAVE) in Unicode.
The whole list of the Cyrillic vowel letters must be:

#CYRILLIC CAPITAL LETTER A WITH GRAVE
#CYRILLIC CAPITAL LETTER IE WITH GRAVE
#CYRILLIC CAPITAL LETTER I WITH GRAVE
#CYRILLIC CAPITAL LETTER O WITH GRAVE
#CYRILLIC CAPITAL LETTER U WITH GRAVE
#CYRILLIC CAPITAL LETTER HARD SIGN WITH GRAVE
#CYRILLIC CAPITAL LETTER YERU WITH GRAVE (only for Russian language)
#CYRILLIC CAPITAL LETTER E WITH GRAVE (only for Russian language)
#CYRILLIC CAPITAL LETTER YU WITH GRAVE
#CYRILLIC CAPITAL LETTER YA WITH GRAVE
#CYRILLIC SMALL LETTER A WITH GRAVE
#CYRILLIC SMALL LETTER IE WITH GRAVE
#CYRILLIC SMALL LETTER I WITH GRAVE
#CYRILLIC SMALL LETTER O WITH GRAVE
#CYRILLIC SMALL LETTER U WITH GRAVE
#CYRILLIC SMALL LETTER HARD SIGN WITH GRAVE
#CYRILLIC SMALL LETTER YERU WITH GRAVE (only for Russian language)
#CYRILLIC SMALL LETTER E WITH GRAVE (only for Russian language)
#CYRILLIC SMALL LETTER YU WITH GRAVE
#CYRILLIC SMALL LETTER YA WITH GRAVE

So my third question is:
“What can be done about this problem?”

For more information you can see:
http://titus.uni-frankfurt.de/unicode/unicsel/unicself.htm#Cyrillic

Well, when I look at the list, I can only think of one thing (well, one stream of things!) to say:

А̀ Ѐ Ѝ О̀ У̀
Ъ̀ Ы̀ Э̀ Ю̀ Я̀
а̀ ѐ ѝ о̀ у̀
ъ̀ ы̀ э̀ ю̀ я̀

or in Unicode code points....

0410 0300 0415 0300 0418 0300 041e 0300 0423 0300
042a 0300 042b 0300 042d 0300 042e 0300 042f 0300
0430 0300 0435 0300 0438 0300 043e 0300 0443 0300
044a 0300 044b 0300 044d 0300 044e 0300 044f 0300

These characters already exist in Unicode, in the composite (decomposed) form: each one is a base letter followed by U+0300 (COMBINING GRAVE ACCENT). Note that they look better in some fonts than they do in others. Fixing that is mainly a matter of letting the font foundries that work to support these languages know that there is a need to give these particular sequences good font hints, so that they look good by design rather than "by accident" of the combining character guessing how best to work with the base characters.
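For anyone who wants to check this on their own machine, here is a minimal Python sketch (not part of the original post) that builds each sequence from the code points listed above and asks the standard unicodedata module whether NFC normalization folds it into a single precomposed character. The names GRAVE, BASES and describe are made up for this illustration. Which sequences compose depends on the Unicode data shipped with your Python version.

```python
import unicodedata

GRAVE = "\u0300"  # COMBINING GRAVE ACCENT

# The ten capital base letters, taken from the code point list in the post.
BASES = "".join(chr(cp) for cp in (
    0x0410, 0x0415, 0x0418, 0x041E, 0x0423,
    0x042A, 0x042B, 0x042D, 0x042E, 0x042F,
))


def describe(base):
    """Return (code points, NFC form, is_precomposed) for base + grave."""
    decomposed = base + GRAVE
    composed = unicodedata.normalize("NFC", decomposed)
    codepoints = " ".join(f"{ord(ch):04X}" for ch in decomposed)
    # A single resulting character means a precomposed form is encoded.
    return codepoints, composed, len(composed) == 1


if __name__ == "__main__":
    for base in BASES + BASES.lower():
        codepoints, composed, precomposed = describe(base)
        status = "precomposed" if precomposed else "decomposed only"
        print(f"{codepoints}  {composed}  {status}")
```

With current Unicode data, only the IE and I sequences (capital and small) should come back as single characters; the rest stay as base letter plus combining grave, which is still perfectly valid Unicode text.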

If you wanted to try to get them added to Unicode in the precomposed form, the submission process for new characters is very straightforward. However, as the proposal information clearly states:

So it would appear that these characters are unlikely to be separately encoded.

As for the request to add these code points to cp1251, I will deal with that in a separate post, perhaps later today (or sometime soon).

 

This post brought to you by "Ѡ" (U+0460, CYRILLIC CAPITAL LETTER OMEGA)


# Mike Dimmick on 6 Feb 2005 10:00 AM:

Pre-empting you slightly: you can't put them into CP1251 without discarding something else. There's only one unassigned code point (0x98), according to http://www.microsoft.com/globaldev/reference/sbcs/1251.htm.

# Michael Kaplan on 6 Feb 2005 11:42 AM:

Hi Mike -- I delayed posting your comment until the answer was up -- you can see it at http://blogs.msdn.com/michkap/archive/2005/02/06/368081.aspx (you are correct, but there are additional reasons!).

# Mikhail Arkhipov (MSFT) on 6 Feb 2005 9:20 PM:

Hmm, I never knew that Cyrillic had 'omega' character :-) Apparently it does... Which language does it come from?

# Michael Kaplan on 6 Feb 2005 9:25 PM:

Looking for comments in the block description (cf: http://www.unicode.org/charts/PDF/U0400.pdf), there are none. Though it is listed with the "Historic letters".

# Ivan Petrov on 8 Feb 2005 3:23 PM:

Hi Michael,

just for completeness the following 4 characters were already added in UNICODE in the precomposed form:

CYRILLIC CAPITAL LETTER IE WITH GRAVE - http://www.fileformat.info/info/unicode/char/0400/index.htm
CYRILLIC CAPITAL LETTER I WITH GRAVE - http://www.fileformat.info/info/unicode/char/040D/index.htm
CYRILLIC SMALL LETTER IE WITH GRAVE - http://www.fileformat.info/info/unicode/char/0450/index.htm
CYRILLIC SMALL LETTER I WITH GRAVE - http://www.fileformat.info/info/unicode/char/045d/index.htm

Regards,
Ivan.

# Michael Kaplan on 8 Feb 2005 3:24 PM:

Ah, interesting info -- someone may want to try to propose the others....

