On reversing the irreversible (the introduction)

by Michael S. Kaplan, published on 2008/01/13 10:16 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/01/13/7090976.aspx

The other day I was in a meeting where we happened to talk about getting characters from glyphid values. Of course I mentioned that I had a blog entry about this (ref: Documented, schmockumented! It's still kind of cool....), while also pointing out the limitations of the technique (as I mention in the article).

Peter (who was also in the meeting) helped to underscore the point, and then ended by saying that the method was not deterministic.

What he meant was that even if one were to do really full lookups in all of the information in the font, and to replicate all of the work that Uniscribe and its shaping engines do, one would still find lots of different character sequences that end up looking the same. Thus there is no way to retrieve the original character string all of the time -- one might only get a string that looks just like it....

But I had trouble linking that up with determinism in the software development sense, which would mean:

In computer science, a deterministic algorithm is an algorithm which, in informal terms, behaves predictably. Given a particular input, it will always produce the same output, and the underlying machine will always pass through the same sequence of states. Deterministic algorithms are by far the most studied and familiar kind of algorithm, as well as one of the most practical, since they can be run on real machines efficiently.

Now clearly a given sequence of glyphid values can be expected to return the same character string each time the algorithm is used, so I don't think the problem is really about determinism. After all, arithmetic is pretty deterministic, even though many calculations return the same results (e.g. both 1 + 3 and 2 + 2 equal 4), and so one cannot guarantee that one will be able to get back what the original calculation was, after the fact.

One can certainly build a function that would return deterministic results, though. One just has to keep in mind that one will not be able to retain differences that the text rendering process removes....
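To make the distinction concrete, here is a minimal Python sketch. It is not how Uniscribe or any real font lookup works: `toy_render` and `toy_unrender` are hypothetical names, and Unicode NFC normalization stands in (as an assumption, purely for illustration) for a rendering step that makes two different character sequences look the same.

```python
import unicodedata

def toy_render(text: str) -> tuple:
    """Hypothetical stand-in for text rendering: canonically equivalent
    strings collapse to the same sequence of toy 'glyph ids'."""
    return tuple(ord(ch) for ch in unicodedata.normalize("NFC", text))

def toy_unrender(glyph_ids: tuple) -> str:
    """Deterministic reverse step: the same ids always give the same
    string, but not necessarily the string that was originally rendered."""
    return "".join(chr(g) for g in glyph_ids)

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT

# Two different inputs, one output: the forward mapping is many-to-one.
same_glyphs = toy_render(precomposed) == toy_render(decomposed)

# The reverse mapping is still deterministic; it just cannot tell
# which of the two inputs it started from.
recovered = toy_unrender(toy_render(decomposed))
```

Calling `toy_unrender` on the same ids always yields the same string, so the function is deterministic; what is lost is the difference between the two inputs.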

There is a similarly interesting issue that comes up with sort keys, which I have mentioned previously in other posts.

If one wanted to build a function to get the original characters from a sort key, one could do so -- but just as with the glyphid --> string issue, one has to be willing to live with the fact that all strings that sort identically, given the parameters you pass in, will come back as the same string. For example, ABC, abc, ABc, and Abc all have identical sort keys if you pass NORM_IGNORECASE, and they will all return abc for the string....
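The same idea can be sketched for the case-insensitive example. This is a toy under stated assumptions: `toy_sort_key` and `toy_reverse_sort_key` are hypothetical helpers, the `ignore_case` parameter only imitates the effect of NORM_IGNORECASE, and the "key" is just lowercased UTF-8 bytes, which has nothing to do with the real Windows sort key format.

```python
def toy_sort_key(text: str, ignore_case: bool = True) -> bytes:
    """Hypothetical sort key: lowercased UTF-8 bytes when case is ignored.
    NOT the real sort key layout; illustration only."""
    folded = text.lower() if ignore_case else text
    return folded.encode("utf-8")

def toy_reverse_sort_key(key: bytes) -> str:
    """Return one representative string for the key. Every string that
    produced this key maps back to the same representative."""
    return key.decode("utf-8")

variants = ["ABC", "abc", "ABc", "Abc"]

# With case ignored, all four variants share a single key...
distinct_keys = {toy_sort_key(v) for v in variants}

# ...so reversing that key can only ever give back one of them.
recovered = {toy_reverse_sort_key(toy_sort_key(v)) for v in variants}

# With case respected, the keys (and thus the round trips) stay distinct.
exact_keys = {toy_sort_key(v, ignore_case=False) for v in variants}
```

The round trip is lossy only to the extent that the chosen parameters told the key to throw information away.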

Now tomorrow I'm going to get into a bit more detail on how this sort key reversal function would work, and how a developer would build it themselves if they wanted to. If the topic interests you, then stay tuned!


This post brought to you by ⩅ (U+2a45, aka UNION WITH LOGICAL OR)

Christian Kaiser on 13 Jan 2008 4:35 PM:

... or, as the mathematician would say, it's a non-injective, surjective projection, and thus possibly not reversible.


BTW: a mapping into Chinese or Japanese could be as indeterministic as hell, and I wouldn't know ;-)


PS: No, I'm not a mathematician, but as physicists we needed some background in math theory. Not numeric, as you might have noticed lately.

Michael S. Kaplan on 13 Jan 2008 5:21 PM:

In the internationalization sense, it would be "reversible enough". :-)

referenced by

2008/03/03 On reversing the irreversible (grabbing the data, part II: the weirdness not so related to locales)

2008/01/25 On reversing the irreversible (grabbing the data, part I)

2008/01/14 On reversing the irreversible (The Set-Up)
