There's no "I" in IDN, part 4: the 'path' to Hell is paved with IDN bugs

by Michael S. Kaplan, published on 2011/06/17 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2011/06/17/10173790.aspx


Prior blogs in this series:

As you work through the IDN story in your own company, you will likely find an interesting mix in the support story, just as I have.

Perhaps one of the most interesting areas has been in the tiny (one might say "puny" to invoke a groaner of a pun!) detail work.

Like path detection.

There are algorithms in RichEdit that colleague Murray Sargent tells me are quite sophisticated. You get to them in RichEdit with messages like EM_GETAUTOURLDETECT and EM_AUTOURLDETECT and such. He has even mentioned the AutoURL stuff in his blog before (like here).

Or in places like WordPad you just type a URL or a path, or load a file with one, and you can see the effect of turning the behavior on via these programmatic means -- it will detect the path and mark it as if it were a clickable URL/path in a browser.

If you move over to Word it's even more sophisticated, with a config option in the UI:

The good old "Replace Internet and network paths with hyperlinks" feature!

Very cool.

Until you include IDN in the mix, at least.

Let's take four URLs that could easily be created if you have set up a machine to test out IDN (server names/namespaces changed to protect something or other), and try to put them in Word 2010:

which was unable to properly detect one of the URLs out of the four.

Can you guess why it failed?

Or you could try the same four URLs in WordPad on Windows 7:

Wow, 0/4. Not too sophisticated!

I'll point Murray to this post, and within a few hours he'll tell me that the latest version of RichEdit (essentially the one on his machine) supports all four URLs.

Of course not everything on Murray's machine gets checked in without a bug report so I'll work on that too. :-)

Let's try pasting those same four URLs here to see what this Blog Editor does with them:

http://नांदरी.日本国.test.corp.testcompany.com
http://idn-iis1.日本国.test.corp.testcompany.com 
http://idn-iis1.日本国.test.corp.testcompany.com/интернет_страница
http://テストサイト.test.corp.testcompany.com

Wow, that's disappointing.

It looks like there are a bunch of URL detection functions that don't do so well with IDN.
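For anyone wondering why so much detection code still assumes ASCII: on the wire, the host part of each of those URLs travels as ASCII "xn--" labels (Punycode), and a non-ASCII path is percent-encoded as UTF-8. Here is a minimal Python sketch of that conversion for the host names and path above. It leans on Python's built-in idna codec, which implements the older IDNA 2003 rules, and the helper names are made up for illustration, so treat it as a sketch rather than as what any of these products actually do.

```python
from urllib.parse import quote, unquote


def to_a_labels(hostname: str) -> str:
    """Convert a Unicode host name to its ASCII ("xn--") form.

    Uses Python's built-in "idna" codec (IDNA 2003 rules).
    """
    return hostname.encode("idna").decode("ascii")


def to_u_labels(hostname: str) -> str:
    """Convert an ASCII ("xn--") host name back to its Unicode form."""
    return hostname.encode("ascii").decode("idna")


# Host names taken from the four test URLs above.
hosts = [
    "नांदरी.日本国.test.corp.testcompany.com",
    "idn-iis1.日本国.test.corp.testcompany.com",
    "テストサイト.test.corp.testcompany.com",
]

for host in hosts:
    # Only the non-ASCII labels change; "test", "corp", etc. pass through.
    print(to_a_labels(host))

# The path is not covered by IDNA at all; it is percent-encoded as UTF-8.
path = "/интернет_страница"
encoded_path = quote(path)
print(encoded_path)
```

The catch is that people generally type and paste the Unicode form, not the xn-- form, so a detector that only understands ASCII never gets handed the version it could cope with.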

I wonder if UNC paths fare any better?

I'm just kidding, I don't wonder, because I tried it.

\\idn-iis1.日本国.test.corp.testcompany.com\сетевой

The other two (WordPad and Word 2010) behaved about like the third URL did, for reasons that might be obvious if you think about how the autodetect code works (or doesn't, in this case).
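To make that hint a little more concrete, here is a toy detector in Python. This is a hypothetical sketch and not the actual RichEdit or Word logic, which isn't shown anywhere in this post; the pattern names are invented for illustration. The ASCII-only patterns give up at the first non-ASCII character, while the permissive ones keep going until they hit whitespace or a delimiter.

```python
import re

# Hypothetical ASCII-only patterns: host and path may contain only
# ASCII letters, digits, and a few punctuation characters.
ASCII_URL = re.compile(r'https?://[A-Za-z0-9.-]+(?:/[A-Za-z0-9._~/-]*)?')
ASCII_UNC = re.compile(r'\\\\[A-Za-z0-9.-]+(?:\\[A-Za-z0-9._~-]+)*')

# Hypothetical permissive patterns: accept anything that is not
# whitespace or an obvious delimiter.
ANY_URL = re.compile(r'https?://[^\s/?#<>"]+(?:/[^\s<>"]*)?')
ANY_UNC = re.compile(r'\\\\[^\s\\/<>"]+(?:\\[^\s\\/<>"]+)*')


def detect(pattern, text):
    """Return the first span the pattern would turn into a link, or None."""
    match = pattern.search(text)
    return match.group(0) if match else None


url = "http://idn-iis1.日本国.test.corp.testcompany.com/интернет_страница"
unc = r"\\idn-iis1.日本国.test.corp.testcompany.com\сетевой"

# The ASCII-only patterns stop right before the first non-ASCII character.
print(detect(ASCII_URL, url))  # http://idn-iis1.
print(detect(ASCII_UNC, unc))  # \\idn-iis1.

# The permissive patterns pick up the whole thing.
print(detect(ANY_URL, url) == url)  # True
print(detect(ANY_UNC, unc) == unc)  # True
```

The permissive patterns are less a fix than a different trade-off: they will happily swallow trailing punctuation, such as a closing bracket typed right after the URL.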

Now this kind of stuff is obviously not core feature work, it's a nice little "extra", but really it isn't so nice when it screws up.

As it does with IDN on the absolute latest version of every product/app/control I had immediate access to.

Sounds like there are some bugs for people to look at, huh? :-)

In the end, the 'path' to Hell is paved with IDN bugs!


Yuri Khan on 17 Jun 2011 10:33 AM:

IDN already *is* one of those good intentions paving the road to hell.

Quppa on 17 Jun 2011 7:00 PM:

I'm pleased to report that Windows Live Messenger 2011 (v15) picks up those 4 URLs correctly :)

It's one of the first things I noticed when upgrading from version 14 (aside: was there ever an explanation for the jump in version numbers for Wave 3 onwards?). It does have some unpleasant side effects, however: if I write '(テスト:http://www.quppa.net/keiki)', it will pick up the final closing bracket (U+FF09, I think) as being part of the URL.

Quppa on 17 Jun 2011 7:09 PM:

And relatedly, validating IRIs with regular expressions is not trivial: stackoverflow.com/.../190405

Ian Macfarlane on 20 Jun 2011 2:30 AM:

Even search engines are struggling with IDNs - I did a bit of research a while back which showed issues with Google, Bing and Yahoo! see http://goo.gl/cgOLM (first article) and http://goo.gl/pr66K (follow-up piece half a year later which showed that there were still serious IDN-handling bugs).



