Ignore what the label says -- it's Japanese

by Michael S. Kaplan, published on 2008/07/03 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/07/03/8682656.aspx


Julien asks via the Contact link:

Dear Mr Kaplan,

I would like to display Japanese Characters in the reporting solution of my company. The DB is SQL Server 2000 and datasource is an ERP wish provide data in DBCS format. Default Collation is French_CI_AS on server and client machines.

Stored data in varchar format appeared on this way in Enterprise Manager :

à_áÉá?áZ áEáìáóá,án
<<strange characters removed due to reports that they break the feed>
Is it possible to display Japanese Character and how?

Thanks a lot for your help and information you broadcast on the web on this topic.

Julien, from France

Well, first of all, I'm not Mr. Kaplan. I was gonna say that my father is Mr. Kaplan but now that I was about to type that I realized that I don't even think of my father as Mr. Kaplan -- I don't even think of the VP name Kaplan as Mr. Kaplan since he sent mail not too long ago to all the Kaplans at Microsoft and he signed the mail with first name. Essentially, there are no Mr. Kaplans in my world at the moment! :-)

Julien, you can call me Michael; all the people I like do....

Okay, now back to the topic.

This is a rough situation.

Basically you have non-Unicode data that is stored in a column with a specific defined code page, in this case that would be COLLATIONPROPERTY('FRENCH_CI_AS', 'CodePage') or 1252.

And the data itself is actually in some other code page.

Offhand, the first problem is the one I mention in 100% roundtrip ASCII? 100% roundtrip ANSI?: that these code pages have byte sequences that are not valid in them. And since their "invalid" bits do not always overlap, one can lose data if one (using Unicode as a pivot) moves in and out of them.

So the only "fix" would be to take the data out as is (say via direct BCP transfer that does not go through COM or anything else that will try to convert it to Unicode), with a plan to later on bring it back in using preferably Unicode columns but if all else fails one that can handle Japanese data.

But the Unicode columns would be best in any case. If you look at all of the characters that the Japanese standards define as characters used by Japanese, only Unicode can store them all in a SQL Server  column. And the Japanese and I have in common a dislike for valid characters being converted into question marks, if you know what I mean.

And now we get to the next problem -- with the small sample Julien provided, I couldn't produce something that looked all that Japanese, testing with either shift-JIS or EUC-JAP. so someone would have to probably give some information about what encoding the data is in for sure, so it could then be converted.

Without knowing the encoding it is in, how could one ever hope for a computer program to read it?

 

This post brought to you by ¿ (U+00bf, INVERTED QUESTION MARK)


Michael Madsen on 3 Jul 2008 5:05 AM:

Actually, testing in JWPce (which lets me choose the character set to use when copy/pasting) EUC does give me something that at least only contains Japanese characters, if you remove the two symbols in either end:

珎疉畤痒 疆瘡癈甃瘤

However, neither Google Translate, Babelfish, nor Excite Japan can make much sense of it.

The characters do seem somewhat related though: they work out to precious/valuable/rare, repeat/duplicate/repetitious, festival grounds, itchy, boundary, wound/boil/syphilis, chronic illness, floor tile, and lump (EDICT doesn't seem to know any kanji compounds that could be made from all this, so if it's the real deal, I believe they should be read individually). With a little imagination, you could work out a way all of these could be related - something about a rare disease resulting in chronic, itchy boils appearing due to an infected floor tile found at the edge of some festival grounds, with a similar event having happened before - but although I'm only a beginner at Japanese, I have a feeling that doesn't add up :)

Michael S. Kaplan on 3 Jul 2008 8:32 AM:

In the end, I think Julien needs to find out exactly what the data is rather than try to reverse engineer it. Even when they are all Han, I am not at all sure they are meaningful Kanji. And there are the first and last characters too....

Michael Madsen on 3 Jul 2008 12:39 PM:

FYI, this post breaks your RSS feed. Those extra symbols get converted to HTML entities &#xE; and &#xF; (which won't work in RSS).

Probably not something you yourself can do much to change it without changing the meaning (except converting them to images, but that's kinda ugly), but seeing as how you work for Microsoft, you probably have better links to the blog team than I do in case they have an idea. :)

http://feedvalidator.org/check.cgi?url=http%3A%2F%2Fblogs.msdn.com%2Fmichkap%2Frss.xml

Michael S. Kaplan on 3 Jul 2008 1:13 PM:

Well, for now I just took the characters out. :-)

Tracy on 4 Aug 2008 9:00 PM:

Those characters don't look Japanese.  Maybe they're Chinese?


go to newer or older post, or back to index or month or day