In Tamil -- sometimes, they are digits; other times, just numbers

by Michael S. Kaplan, published on 2005/01/24 00:04 -08:00, original URI: http://blogs.msdn.com/michkap/archive/2005/01/24/359347.aspx

Early last year, Raymond Chen talked about how Char.IsDigit matches more than just 0 through 9, and later in the year I talked about Crossing the DIGITal divide. But in both cases the conversation was limited to digits, and not the wide world of numbers, which includes a lot more than just different ways of saying 0123456789.

The distinction between digits and numbers in Unicode is an important one, since the formatting and parsing of numeric values is highly dependent on whether a number acts like the ASCII digits 0 - 9 or not.

Now the bulk of the modern number systems use the same Arabic-Indic system conventions to which software developers are accustomed, but others do exist, some of which still see use today.

As an example people can relate to, most of us are aware of the Roman numeral system, where there is no zero and you sometimes have to use a lot of addition and subtraction in a deterministic manner (such that any time a smaller number comes before a larger one, the smaller one is subtracted; otherwise, if they are the same value or the larger one comes first, it is added). Thus Ⅰ is 1, Ⅲ is 3, Ⅳ is 4, Ⅴ is 5, and so on. Although it is not used too much, it is still commonly seen in the credits of movies and television shows for the copyright date (e.g. MCMLXXXIX for 1989). Many people who are not used to Roman numerals breathed a sigh of relief at the year 2000 since MM is so much easier to read....
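The subtract-or-add rule described above can be sketched in a few lines of Python. This is only an illustration: the name `roman_to_int` is made up for this example, it works on the plain Latin letters rather than the dedicated Roman numeral characters, and it does no validation of malformed input.

```python
# Values of the basic Roman numeral letters.
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(text: str) -> int:
    """Apply the rule from the text: a smaller value that comes before a
    larger one is subtracted; every other value is added."""
    values = [ROMAN_VALUES[ch] for ch in text.upper()]
    total = 0
    for i, value in enumerate(values):
        # Look one symbol ahead; treat the end of the string as 0.
        next_value = values[i + 1] if i + 1 < len(values) else 0
        if value < next_value:
            total -= value
        else:
            total += value
    return total


print(roman_to_int("MCMLXXXIX"))  # 1989
print(roman_to_int("MM"))         # 2000
```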

It is of note that the Roman numerals are encoded in Unicode even though they can all be represented as existing letters. The primary reason for this is that there are character properties associated with each encoded character, and these properties are used by many implementations of Unicode to get actual work done. Therefore, the letter V (U+0056, LATIN CAPITAL LETTER V) has a General Category of Lu (Letter, Uppercase) while Ⅴ (U+2164, ROMAN NUMERAL FIVE) has a General Category of Nl (Number, Letter).

And yes, even that claim falls apart a little, since the hexadecimal digits ABCDEF are not separately encoded, for reasons of backwards compatibility with decades of existing practice on computers, which is not the case with Roman numerals. Even the argument for having encoded the Roman numerals is a little specious, since for the most part they have not been encoded, and when they are, the style never seems to be consistent typographically. Though YMMV since you may have better fonts than I do! Try "ⅯⅭⅯⅬⅩⅩⅩⅨ" for the test....
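The property difference between the letter V and the Roman numeral Ⅴ, and the fact that the hexadecimal letters carry no numeric properties at all, can be inspected with Python's standard `unicodedata` module. This is a small sketch; `describe` is a helper name invented here, and the exact property values depend on the Unicode version bundled with the Python build.

```python
import unicodedata


def describe(ch: str):
    """Return (name, general category, numeric value, decimal digit value).

    The last two entries are None when the character has no such property.
    """
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.numeric(ch, None),
        unicodedata.decimal(ch, None),
    )


# U+0056 letter V, U+2164 Roman numeral five, hex "digit" A, ASCII digit 5.
for ch in ("V", "\u2164", "A", "5"):
    print(f"U+{ord(ch):04X}", describe(ch))
```

Only the ASCII 5 reports a decimal digit value; the Roman numeral has a numeric value but is not a digit, and the letters have neither.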

All of this goes to show that Unicode is a very complex standard. In the end, Unicode can always do what it needs to do without fear of the occasional contradiction, since there will always be some precedent with which to be consistent. :-)

Ethiopic numbers are based on a different alternative system, one that can really wreak havoc with a formatting/parsing architecture like that in Windows or the .NET Framework if you try to bring Ethiopic data in without writing code to do the work (just like with Roman numerals). I'll talk about Ethiopic numbers another time....

Yet another system, the one I will talk about here, is that of Tamil numerals. It is an additive and positional system (unlike Roman numerals, there is no subtraction involved) that has no zero but includes characters for 10, 100, and 1000.

In the traditional system the number 3,782 would be represented as ௩௲௭௱௮௰௨ (literally Three-Thousand(s)-Seven-Hundred(s)-Eight-Ten(s)-Two, or மூன்று-ஆயிரத்து-எழு-நூற்று-எண்-பத்து-இரண்டு in Tamil).
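As a rough illustration of that pattern, here is a Python sketch that formats values from 1 to 9,999 in the traditional style. The function name and the range limit are choices made for this example only; larger values stack the multiplier signs in ways this sketch does not attempt, and it assumes that a count of one is left implicit (so 10 is just the ten sign on its own).

```python
# TAMIL DIGIT ONE..NINE are U+0BE7..U+0BEF.
TAMIL_DIGITS = [chr(0x0BE7 + i) for i in range(9)]
# (value, sign) pairs for TAMIL NUMBER ONE THOUSAND, ONE HUNDRED and TEN.
TAMIL_MULTIPLIERS = [(1000, "\u0bf2"), (100, "\u0bf1"), (10, "\u0bf0")]


def to_traditional_tamil(n: int) -> str:
    """Format 1..9999 as digit-then-multiplier pairs, largest unit first."""
    if not 1 <= n <= 9999:
        raise ValueError("this sketch only covers 1..9999")
    parts = []
    for value, sign in TAMIL_MULTIPLIERS:
        count, n = divmod(n, value)
        if count == 0:
            continue  # no zero: an empty position is simply skipped
        if count > 1:
            parts.append(TAMIL_DIGITS[count - 1])
        parts.append(sign)  # a count of one is implied by the bare sign
    if n:
        parts.append(TAMIL_DIGITS[n - 1])  # remaining units, 1..9
    return "".join(parts)


print(to_traditional_tamil(3782))  # ௩௲௭௱௮௰௨
```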

At least since the early 1800s, however, usage of the Tamil numerals as digits has been more and more common. Thus the number 3,782 would often be represented as ௩௭௮௨ (literally 3782).
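When the Tamil numerals are used this way they behave like any other decimal digits, which is easy to see in Python, whose built-in `int()` accepts characters with the Unicode decimal digit property. This says nothing about how Windows or the .NET Framework behave; it is only a quick way to see the digit/number split in action.

```python
modern = "\u0be9\u0bed\u0bee\u0be8"  # ௩௭௮௨, TAMIL DIGIT THREE, SEVEN, EIGHT, TWO
print(modern.isdecimal())  # True
print(int(modern))         # 3782

ten_sign = "\u0bf0"  # ௰, TAMIL NUMBER TEN
print(ten_sign.isdecimal())  # False: it is a number, but not a digit
print(ten_sign.isnumeric())  # True

try:
    int(ten_sign)
except ValueError:
    print("int() rejects the ten sign")
```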

The following table gives a bunch of different numbers and how they are represented in both the older, more traditional style and in the "modern" style where they act as digits. Note that the table is treating U+0be6 as TAMIL DIGIT ZERO even though it is not being added to Unicode until version 4.1. Up until now the ASCII DIGIT ZERO was used as needed, as I do in the table below for display purposes, and if you want to represent these numbers before Unicode 4.1 is released you should likely use U+0030 (DIGIT ZERO). The modern Tamil column uses the LOCALE_SGROUPING setting of Tamil....

Number (ASCII digits)	old style Tamil	modern Tamil	old style Tamil code points	modern Tamil code points
0	(not available)	0	(not available)	0be6
1	௧	௧	0be7	0be7
2	௨	௨	0be8	0be8
3	௩	௩	0be9	0be9
4	௪	௪	0bea	0bea
5	௫	௫	0beb	0beb
6	௬	௬	0bec	0bec
7	௭	௭	0bed	0bed
8	௮	௮	0bee	0bee
9	௯	௯	0bef	0bef
10	௰	௧0	0bf0	0be7 0be6
11	௰௧	௧௧	0bf0 0be7	0be7 0be7
12	௰௨	௧௨	0bf0 0be8	0be7 0be8
13	௰௩	௧௩	0bf0 0be9	0be7 0be9
14	௰௪	௧௪	0bf0 0bea	0be7 0bea
15	௰௫	௧௫	0bf0 0beb	0be7 0beb
16	௰௬	௧௬	0bf0 0bec	0be7 0bec
17	௰௭	௧௭	0bf0 0bed	0be7 0bed
18	௰௮	௧௮	0bf0 0bee	0be7 0bee
19	௰௯	௧௯	0bf0 0bef	0be7 0bef
100	௱	௧00	0bf1	0be7 0be6 0be6
156	௱௫௰௬	௧௫௬	0bf1 0beb 0bf0 0bec	0be7 0beb 0bec
200	௨௱	௨00	0be8 0bf1	0be8 0be6 0be6
300	௩௱	௩00	0be9 0bf1	0be9 0be6 0be6
1,000	௲	௧,000	0bf2	0be7 0be6 0be6 0be6
1,001	௲௧	௧,00௧	0bf2 0be7	0be7 0be6 0be6 0be7
1,040	௲௪௰	௧,0௪0	0bf2 0bea 0bf0	0be7 0be6 0bea 0be6
8,000	௮௲	௮,000	0bee 0bf2	0bee 0be6 0be6 0be6
10,000	௰௲	௧0,000	0bf0 0bf2	0be7 0be6 0be6 0be6 0be6
70,000	௭௰௲	௭0,000	0bed 0bf0 0bf2	0bed 0be6 0be6 0be6 0be6
90,000	௯௰௲	௯0,000	0bef 0bf0 0bf2	0bef 0be6 0be6 0be6 0be6
100,000¹	௱௲	௧,00,000	0bf1 0bf2	0be7 0be6 0be6 0be6 0be6 0be6
800,000	௮௱௲	௮,00,000	0bee 0bf1 0bf2	0bee 0be6 0be6 0be6 0be6 0be6
1,000,000²	௰௱௲	௧0,00,000	0bf0 0bf1 0bf2	0be7 0be6 0be6 0be6 0be6 0be6 0be6
9,000,000	௯௰௱௲	௯0,00,000	0bef 0bf0 0bf1 0bf2	0bef 0be6 0be6 0be6 0be6 0be6 0be6
10,000,000³	௱௱௲	௧,00,00,000	0bf1 0bf1 0bf2	0be7 0be6 0be6 0be6 0be6 0be6 0be6 0be6
100,000,000⁴	௰௱௱௲	௧0,00,00,000	0bf0 0bf1 0bf1 0bf2	0be7 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6
1,000,000,000⁵	௱௱௱௲	௧,00,00,00,000	0bf1 0bf1 0bf1 0bf2	0be7 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6
10,000,000,000⁶	௲௱௱௲	௧0,00,00,00,000	0bf2 0bf1 0bf1 0bf2	0be7 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6
100,000,000,000⁷	௰௲௱௱௲	௧,00,00,00,00,000	0bf0 0bf2 0bf1 0bf1 0bf2	0be7 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6
1,000,000,000,000⁸	௱௲௱௱௲	௧0,00,00,00,00,000	0bf1 0bf2 0bf1 0bf1 0bf2	0be7 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6
100,000,000,000,000⁹	௱௱௲௱௱௲	௧0,00,00,00,00,00,000	0bf1 0bf1 0bf2 0bf1 0bf1 0bf2	0be7 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6 0be6

1 - a.k.a. lakh
2 - a.k.a. 10 lakh
3 - a.k.a. crore
4 - a.k.a. 10 crore
5 - a.k.a. 100 crore
6 - a.k.a. thousand crore
7 - a.k.a. 10 thousand crore
8 - a.k.a. lakh crore
9 - a.k.a. crore crore
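
The grouping shown in the modern Tamil column (one group of three digits on the right, then groups of two) can be reproduced with a short Python sketch. The function names are invented for this example, and the 3-then-2 pattern is taken from the table rather than read from any locale data. Unlike the table, which shows ASCII zeros for display, this sketch emits U+0be6 for zero.

```python
def group_digits(digits: str, first: int = 3, rest: int = 2, sep: str = ",") -> str:
    """Group from the right: one group of `first` digits, then groups of `rest`."""
    if len(digits) <= first:
        return digits
    head, tail = digits[:-first], digits[-first:]
    groups = [tail]
    while head:
        groups.append(head[-rest:])
        head = head[:-rest]
    return sep.join(reversed(groups))


# Translation table from ASCII 0..9 to U+0BE6..U+0BEF (Tamil digits).
TO_TAMIL = {ord("0") + i: 0x0BE6 + i for i in range(10)}


def to_modern_tamil(n: int) -> str:
    """Format a non-negative integer with Tamil digits and 3;2 grouping."""
    if n < 0:
        raise ValueError("this sketch only covers non-negative integers")
    return group_digits(str(n)).translate(TO_TAMIL)


print(group_digits("10000000"))  # 1,00,00,000
print(to_modern_tamil(3782))     # ௩,௭௮௨
```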

Some examples of both types of usage:

Modern practice, using Tamil digits for chapter numbers: mozi varalARu, by munucAmi varatarAcan, published by The South India Saiva Siddhanta Works Publishing Society, Tinnevelly, Limited, November 1954, p. 357-358 (page numbers from 14th Edition, December 1996).
Traditional practice, using the older format (and the source for large parts of the table above!): iniya tamiz ilakkaNam by yokisri cuttAnan~ta pAratiyAr, published by Kavitha Publications, p. 201-204. (you can see the scanned source of some of it here).

Note that the traditional form is not currently handled by any code in either Windows or the .NET Framework, though it is sometimes seen in even modern contexts such as calendars. The system is not too complicated and figuring out the algorithm to parse or format with it seems like the sort of thing that would make an interesting Microsoft interview question. Though perhaps I will post some potential solutions another day....
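
As a starting point, the parsing half might be sketched in Python as follows. This is an illustration under stated limits, not a tested solution: `parse_traditional_tamil` is a made-up name, it only accepts values up to 9,999 where the multiplier signs appear in strictly decreasing order, and it raises an error for the stacked forms used for larger numbers (such as ௰௲ for 10,000) rather than guessing.

```python
DIGIT_VALUES = {chr(0x0BE7 + i): i + 1 for i in range(9)}  # ௧..௯ -> 1..9
MULTIPLIER_VALUES = {"\u0bf0": 10, "\u0bf1": 100, "\u0bf2": 1000}


def parse_traditional_tamil(text: str) -> int:
    """Parse a traditional-style Tamil number in the range 1..9999."""
    total = 0
    pending = None      # a digit waiting to see whether a multiplier follows
    last_unit = 10000   # multipliers must get strictly smaller as we go
    for ch in text:
        if ch in DIGIT_VALUES:
            if pending is not None or last_unit == 1:
                raise ValueError("unexpected digit")
            pending = DIGIT_VALUES[ch]
        elif ch in MULTIPLIER_VALUES:
            unit = MULTIPLIER_VALUES[ch]
            if unit >= last_unit:
                raise ValueError("stacked multipliers are not handled here")
            # A bare multiplier sign stands for one of that unit.
            total += (pending or 1) * unit
            pending, last_unit = None, unit
        else:
            raise ValueError(f"not a Tamil numeral: {ch!r}")
    if not text:
        raise ValueError("empty input")
    return total + (pending or 0)


print(parse_traditional_tamil("\u0be9\u0bf2\u0bed\u0bf1\u0bee\u0bf0\u0be8"))  # 3782
```

Handling the lakh and crore forms from the table would need a rule for when a multiplier applies to everything accumulated so far instead of just the preceding digit, which is where the problem gets interesting.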

Special thanks to Sivaraj Doddannan, Dr. N. Ganesan, and Working Group 02 of INFITT (of which they are both members) for helping to dig up the excellent resources for Tamil numbers. INFITT (International Forum for Information Technology in Tamil) is a liaison member of Unicode and has been instrumental in providing character addition and usage reports to help finish up the Tamil block in Unicode.

This post brought to you by "௧௨௩௪௫௬௭௮௯" (U+0be7 - U+0bef, a.k.a. TAMIL DIGIT ONE - TAMIL DIGIT NINE)
and they all welcome their new compadre U+0be6, which is coming soon to a Unicode near you!

# Bhakthan on Monday, January 24, 2005 7:06 AM:

Thanks Kaplan

# Andrew Quinn on Monday, January 24, 2005 8:15 AM:

In the table, 1,040 in modern Tamil appears to be 4,040.

# Michael Kaplan on Monday, January 24, 2005 8:29 AM:

Whoops, you're right -- fixed now. Thanks!

# Dr. N. Ganesan on Monday, January 24, 2005 11:34 AM:

It appears that the then government of Madras (now Chennai, capital of Tamil Nadu, India) introduced the "zero" (0) in Tamil computations in the early 19th century. I found a school textbook from 1825 CE describing both the traditional and modern (which employs "zero" (0)) methods of writing Tamil numerals. Pages from this 1825 CE textbook are reprinted in a work on the history of Tamil literature in the 19th century (published 1962). Those pages can be seen at
http://www.geocities.com/thamizh@sbcglobal.net/tamil_zero.PDF

Good to know the zero will be available in the Tamil code chart from Unicode 4.1 onwards,
N. Ganesan

# Scott Hanselman on Tuesday, January 25, 2005 3:29 PM:

That would ROCK if you would do Ethiopic sometime.

Tenastilign'

# Ambarish Sridharanarayanan on Saturday, January 29, 2005 9:59 PM:

Fascinating article. I had a question - in the traditional style, why is 1,000,000 ௰௱௲ rather than ௲௲? Similarly for larger powers of 10.

# Michael Kaplan on Saturday, January 29, 2005 10:12 PM:

The numbers came from a classical source, but if I had to guess, I'd say to match the build-up from prior characters.

Though since both forms would work, anyone who recognized one would probably be able to recognize the other...

# Vatsan on Saturday, January 29, 2005 11:15 PM:

Why is 1,000,000 ௰௱௲ and not ௲௲ ? The difference has to do with how the language is spoken.

100,000 is 1 lakh, which is called 'laksham' in Tamil. So, 1 Million is 10 * 100,000 = 10 Lakhs ('Pattu Laksham'). So, the representation of 1,000,000 would be

<symbol for 10> <representation of 100,000>

100,000 is 100 * 1000 , i.e., 'Nooru Aayiram'.
So, representation of 100,000 is

<representation of 100> <representation of 1000>

In effect, 1,000,000 is written as

<ten> <hundred> <thousand> = ௰௱௲.

# Vatsan on Saturday, January 29, 2005 11:25 PM:

"Though since both forms would work, anyone who recognized one would probably be able to recognize the other... "

The chances are, it would. In spoken Tamil, two numbers uttered one after another are considered to have a multiplicative effect. For example, 'eerezhu' (a form of saying 'irandu ezhu', i.e., 'two seven') means 14. Similarly, 'Aayiram Aayiram' (1000 x 1000) could be used to mean 1,000,000. So ௲௲ is probably also correct.

I must add though, this whole idea of the written notation of representing numbers closely following the spoken language is my conjecture, but I haven't been able to find a counterexample that disproves my theory :-)

# Mani on Sunday, January 30, 2005 8:03 AM:

Good article & discussion.
However, has anyone had a chance to explore in the other direction, which is decimals and fractions? In Tamil it is "binnam" (some examples are "arai", "kaal", "araikkaal", "veesam" and so on). Oh yeah, whole new topic (somewhat related to this)...

-மணி

# Michael Kaplan on Sunday, January 30, 2005 8:51 AM:

I have not, myself -- not sure how the items between the integers are handled in the old style....

# Paul on Monday, February 07, 2005 4:26 PM:

Not related to Tamil, but there's an interesting post by Ian Hickson on Traditional Hebrew numbering at http://ln.hixie.ch/?start=1033524738&count=1

Now THAT looks fun...

