He had the strength of an OX[IA], I tell you

by Michael S. Kaplan, published on 2007/02/18 19:30 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/02/18/1709479.aspx


Remember the other day when I was talking about the jury giving the string no weight?

Well, it looks like this problem is just going to keep getting worse. Earlier today, Vineet asked:

Hello,

Could someone tell me how to use Unicode text as a parameter in an MS Access 2003 query on a WinXP box?

This is a query that should return a single line from the table.

            SELECT t030_StrongsLexicon.Lemma
            FROM t030_StrongsLexicon
            WHERE (((t030_StrongsLexicon.Lemma)="βρέχω"));

However, it returns two lines: records with βρέχω as well as βρύχω.

Access is ignoring the characters with the accents over them.  It apparently is reading both words as βρ?χω.

Thanks
Vineet

However, the problem is not a lack of Unicode support.

Microsoft Jet 4.0 and thus every version of Access (2000 and later) supports Unicode.

The two characters in question (U+1f73 and U+1f7b, aka GREEK SMALL LETTER EPSILON WITH OXIA and GREEK SMALL LETTER UPSILON WITH OXIA) are not in the Jet, SQL Server, or Windows <= Server 2003 collation tables.

Thus they are equal as they have no weight. Their equality has nothing to do with the idea of them both being converted to question marks.
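One way to picture "no weight" is the toy comparison below. It is only a sketch: the `TOY_WEIGHTS` table and the `toy_sort_key` function are made up for illustration and have nothing to do with the real Jet or Windows tables. The point is just that a character missing from the table contributes nothing to the key, so two different strings can end up with identical keys.

```python
# Toy model of weight-based comparison; NOT the real Jet or Windows tables.
# Hypothetical weights for only the letters used in this example.
TOY_WEIGHTS = {
    "\u03b2": 1,  # GREEK SMALL LETTER BETA
    "\u03c1": 2,  # GREEK SMALL LETTER RHO
    "\u03c7": 3,  # GREEK SMALL LETTER CHI
    "\u03c9": 4,  # GREEK SMALL LETTER OMEGA
}


def toy_sort_key(text):
    """Collect weights, silently skipping characters with no table entry."""
    return [TOY_WEIGHTS[ch] for ch in text if ch in TOY_WEIGHTS]


# The two words from the question, spelled with the OXIA characters.
with_epsilon_oxia = "\u03b2\u03c1\u1f73\u03c7\u03c9"  # U+1F73 has no entry
with_upsilon_oxia = "\u03b2\u03c1\u1f7b\u03c7\u03c9"  # U+1F7B has no entry

keys_match = toy_sort_key(with_epsilon_oxia) == toy_sort_key(with_upsilon_oxia)
print(keys_match)
```

Both keys come out as `[1, 2, 3, 4]` even though the strings themselves differ, and no question mark is involved anywhere.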

Unfortunately there is no easy clever workaround like this one, since neither character decomposes directly into a base letter plus a combining accent, either for canonical equivalence or compatibility.

Though perhaps there is some hope; attend me for a moment:

U+1ffd (GREEK OXIA) has a canonical equivalence to U+00b4 (ACUTE ACCENT).

And U+00b4 (ACUTE ACCENT) has a compatibility decomposition to U+0020 U+0301, meaning it is the spacing form of U+0301 (COMBINING ACUTE ACCENT).

So in theory you could use U+03b5 U+0301 and U+03c5 U+0301 for U+1f73 and U+1f7b.
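If you want to see those two mappings for yourself, a small sketch using Python's standard `unicodedata` module will print the raw decomposition fields. The variable names are mine, and the values come from whatever Unicode Character Database version ships with your Python.

```python
import unicodedata

# Raw decomposition fields from the Unicode Character Database.
oxia = unicodedata.decomposition("\u1ffd")   # GREEK OXIA
acute = unicodedata.decomposition("\u00b4")  # ACUTE ACCENT

print(oxia)   # a canonical mapping carries no <tag> prefix
print(acute)  # a compatibility mapping starts with "<compat>"
```

Note that the first link in the chain is canonical and the second is only a compatibility mapping, which is part of why the chain feels so shaky.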

And they do look kind of alike, though not really (U+03ad έ is not U+1f73 έ, U+03cd ύ is not U+1f7b ύ), since some fonts have the OXIA as a straight line rather than a slanted one as the ACUTE ACCENT pretty much always has.

The most compelling reason to not use this somewhat convoluted logic is that there is no conformant process like Unicode normalization that can lead to equivalence here, and no real promise that such an equivalence will be honored in products.

Luckily, there is a more direct way to get there, using U+03ad (GREEK SMALL LETTER EPSILON WITH TONOS) and U+03cd (GREEK SMALL LETTER UPSILON WITH TONOS) which does have a direct connection via canonical decomposition mappings to U+03b5 U+0301 and U+03c5 U+0301, respectively.
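The canonical decomposition of the two TONOS characters is easy to check. This is a minimal sketch with the standard `unicodedata` module; the variable names are just for illustration.

```python
import unicodedata

epsilon_tonos = "\u03ad"  # GREEK SMALL LETTER EPSILON WITH TONOS
upsilon_tonos = "\u03cd"  # GREEK SMALL LETTER UPSILON WITH TONOS

# NFD applies canonical decomposition mappings only.
nfd_epsilon = unicodedata.normalize("NFD", epsilon_tonos)
nfd_upsilon = unicodedata.normalize("NFD", upsilon_tonos)

# Show the resulting code points for each character.
for decomposed in (nfd_epsilon, nfd_upsilon):
    print(" ".join(f"U+{ord(ch):04X}" for ch in decomposed))
```

Each TONOS character comes back as its base letter followed by U+0301, which is the direct connection the OXIA-by-way-of-ACUTE chain was missing.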

If this workaround is acceptable for your data, most products do support both U+03ad (GREEK SMALL LETTER EPSILON WITH TONOS) and U+03cd (GREEK SMALL LETTER UPSILON WITH TONOS), which gives you a better route to support in both fonts and in collation/casing.

Though I am not claiming that an OXIA is always a TONOS (I believe there are specific meanings in a linguistic framework for these two), so your mileage may vary....
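If you do decide the substitution is acceptable, it could be sketched as a simple character replacement. The `OXIA_TO_TONOS` table and the `oxia_to_tonos` helper are hypothetical names of mine, and the table covers only the two characters discussed here. The same replacement would have to be applied both to the stored text and to the query parameter for the comparison to line up.

```python
# Hypothetical replacement table: only the two characters from this post.
OXIA_TO_TONOS = str.maketrans({
    "\u1f73": "\u03ad",  # EPSILON WITH OXIA -> EPSILON WITH TONOS
    "\u1f7b": "\u03cd",  # UPSILON WITH OXIA -> UPSILON WITH TONOS
})


def oxia_to_tonos(text):
    """Replace the OXIA letters in the table with their TONOS counterparts."""
    return text.translate(OXIA_TO_TONOS)


# The two words from the question, spelled with the OXIA characters.
epsilon_word = oxia_to_tonos("\u03b2\u03c1\u1f73\u03c7\u03c9")
upsilon_word = oxia_to_tonos("\u03b2\u03c1\u1f7b\u03c7\u03c9")

print(epsilon_word == upsilon_word)
```

After the replacement the two words differ in characters that do carry weights, so a comparison has something to tell them apart with.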

 

This post brought to you by έ and έ (U+1f73 and U+03ad, a.k.a. GREEK SMALL LETTER EPSILON WITH OXIA and GREEK SMALL LETTER EPSILON WITH TONOS)

