There's no "I" in IDN, part 12: Emoji + IDN == U+1F4A9 (PILE OF POO)

by Michael S. Kaplan, published on 2012/02/27 07:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2012/02/27/10273315.aspx


Previous blogs in this series:

Today's blog starts with an email that was sent to The Unicode List by Jeroen Ruigrok van der Werven:

IDN and emoji combined brings you the wonderful domain of:

http://💩.la/        (It should be U+1F4A9.)

I don't know whether to laugh or cry at this marvel of technology. :)

Even though it was the weekend, Stephane Bortzmeyer replied quickly enough:

Note that it is a direct violation of RFC 5892. U+1F4A9, being of
category So, should be DISALLOWED. The registry was wrong to accept
it.

To which Jeroen Ruigrok van der Werven replied:

Oh, this will be fun. So I guess they did not check the codepoint categories
in their validation step then? (I honestly have no idea how NICs do this
nowadays, it's been ages since I messed with stuff on that level.)

Now this is not such a new issue.

The first time I heard of it was from the blog The World’s First Emoji Domain, which was first put up on July 21st last year.

From that blog:

Now that you’ve had a moment to recover, I’d like to give particular thanks to the country of Laos, who run the last remaining domain registrar I’m aware of that still allows international domain names that use any Unicode character. Our sincere thanks must be given to Thongsing Thammavong, the Prime Minister of Laos, for his valuable assistance in making all of this possible.
 
Update: I’ve just got word that, due to intense political unrest in Laos (untrue), they no longer allow Emoji domains! Yes, .la is no more. Fortunately, the territory of Tokelau (!) has stepped in to meet this intense international need! Emoji .tk domains are now available.
 
(Why are they so hard to register? Due to fears of IDN homograph attacks, most registrars, like .com, now only allow specific language sets to be used for Unicode domain names. The days of registering ☃.net — a previous Cabel effort in this series — are long gone. In fact, back in 2007 ICANN expressly recommended that “symbols and icons [...] such as typographic and pictographic dingbats” should not be allowable code points for domain names. Fortunately, Laos didn’t get the memo.)

...

PS: Thanks to iwantmyname.com for doing emoji domain registration, and domai.nr for valuable assistance!

Note that people started registering Emoji domains right in the comments of that blog.

Now, who is right here -- the other blog author or Stephane?

RFC 5892: The Unicode Code Points and Internationalized Domain Names for Applications (IDNA) is not as direct as Stephane suggests, although it does hint at the issue, which you only really hit once you run the process it describes. For example:

   The mechanisms described here allow determination of the value of the
   property for future versions of Unicode (including characters added
   after Unicode 5.2).  Changes in Unicode properties that do not affect
   the outcome of this process do not affect IDN.  For example, a
   character can have its Unicode General_Category value (see
   [Unicode52]) change from So to Sm or from Lo to Ll, without affecting
   the algorithm results.

The rest of the document does not mention So or Sm explicitly at all, and you have to dig into the derived character definitions to get to them -- or even to understand why such references to So or Sm exist.
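To make that digging a little more concrete, here is a minimal Python sketch of just one piece of the RFC 5892 derivation: the LetterDigits rule (Section 2.1), which is defined purely in terms of General_Category. The helper name in_letter_digits is made up for illustration, and the results depend on the Unicode version bundled with your Python's unicodedata module. This is not a full implementation of the algorithm; the Exceptions, LDH, JoinControl, Unstable and other rules are deliberately left out.

```python
import unicodedata

# General_Category values named by the LetterDigits rule (RFC 5892, Section 2.1).
LETTER_DIGITS = frozenset({"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"})


def in_letter_digits(ch: str) -> bool:
    """Hypothetical helper: True if ch's General_Category is a LetterDigits category.

    This checks one rule only; it does not compute the full derived property.
    """
    return unicodedata.category(ch) in LETTER_DIGITS


# PILE OF POO, SNOWMAN, SQUARE ROOT, LATIN SMALL LETTER O WITH DIAERESIS
for ch in ("\U0001F4A9", "\u2603", "\u221A", "\u00F6"):
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} "
          f"LetterDigits={in_letter_digits(ch)}")
```

A code point that fails this test, and that is not picked up by one of the earlier rules, falls through to the final DISALLOWED step of the algorithm in Section 3 of the RFC. That is how a So or Sm character ends up disallowed without either category ever being named. Passing the test is not sufficient on its own, since uppercase letters (Lu) are caught earlier by the Unstable rule.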

A lot of the symbols issue here has much more to do with the move from IDNA2003 to IDNA2008; as the new Unicode 6.1 update to UTS #46: Unicode IDNA Compatibility Processing states:

By using this Compatibility Processing, a domain name such as ÖBB.at will be mapped to the valid domain name öbb.at, thus matching user expectation for case behavior in domain names. For transitional use, the Compatibility Processing also allows domain names containing symbols and punctuation that were valid in IDNA2003, such as √.com (which has an associated web page). Such domain names containing symbols will gradually disappear as registries shift to IDNA2008.

Anyway, what we are largely seeing here is registries that were using the older rules.

Now IE9 on Windows 7 won't go to the "Get Coffee" domain at all:

[Screenshot: Can't use the link in IE]

Even if you do it in the Punycode version directly:

[Screenshot: Can't use the Punycode link in IE]

I won't speculate on the reasons -- beyond an idle guess that it may be following IDNA2008 rules and rejecting the site? Maybe.
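For anyone who wants to see what a Punycode version of such a name looks like, here is a small sketch using Python's built-in punycode codec on the 💩.la label from Jeroen's email. The helper names to_a_label and to_u_label are made up for illustration. The codec does the raw Punycode transformation only, with no IDNA2003 or IDNA2008 mapping or validation at all, so it will happily encode a label that a registry following IDNA2008 ought to refuse.

```python
ACE_PREFIX = "xn--"


def to_a_label(u_label: str) -> str:
    """Hypothetical helper: raw Punycode of one label, with the ACE prefix.

    No mapping or validity checking is done here.
    """
    return ACE_PREFIX + u_label.encode("punycode").decode("ascii")


def to_u_label(a_label: str) -> str:
    """Hypothetical helper: reverse of to_a_label for one label."""
    if not a_label.lower().startswith(ACE_PREFIX):
        raise ValueError("not an ACE-prefixed label")
    return a_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")


label = "\U0001F4A9"  # U+1F4A9 PILE OF POO
print(to_a_label(label) + ".la")  # xn--ls8h.la
print(to_u_label("xn--ls8h") == label)  # True
```

The encoding step is purely mechanical, which is why the question of whether the name is allowed has to be answered somewhere else: by the registry, the browser, or both.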

And I can't get to either of my Windows 8 machines at the moment.

"Luckily" Firefox had no problem with it....

[Screenshot: It works ok in Firefox]

Hmmmm.

How I feel about its success largely depends on the reason for IE's failure.

If you know what I mean.

I want to be clear on what the moral is here -- prefer IDNA2008 whenever you can.

And please never mix IDN and Emoji, even when someone lets you....


RFC Editor on 27 Feb 2012 1:31 PM:

RFC 5892 is just a proposal ;-)

Simon Buchan on 27 Feb 2012 2:20 PM:

I'd feel worse about the disallowing of symbolic IDNs if someone put up a non-incredibly stupid one. ♠.com being (an alternate name for) a poker site? ✈.com being an travel-booking site? ❤.com, ✝.net, ♞.info, ♂.com, ☎.com, ♨.com, etc..., all pretty obviously useful (if still dumb!) domains.

Incidently, but a bit off-topic: while looking for suitable victims, I noticed U+066D "Arabic Five Pointed Star" - which doesn't look to be, you know, five pointed in any font. Google had some interesting results, but I'm still not sure what's up with that....

Steak Styles on 29 Feb 2012 7:13 PM:

Pfft, the Windows 8 keyboard makes such classic domains as �� .com or ��.gov mere taps away. .la is small time.


referenced by

2013/10/17 There's no "I" in IDN, part 19: There's no "I" in IPv6, either!

2013/10/08 There's no "I" in IDN, part 18: There isn't even an "I" in John C. Klensin's name!

2013/09/13 There's no "I" in IDN, part 17: EAI made it to China, and everybody knows it!

2013/04/19 There's no "I" in IDN, part 16: It's a good thing they decided to call it EAI!

2012/10/12 There's no "I" in IDN, part 15: Still no 'I' in EAI.... but we could use an US sometime soon!

2012/08/08 There's no "I" in IDN, part 14: It turns out there's no "I" in IE, either

2012/05/18 There's no "I" in IDN, part 13: Desktop and Managed and Metro; oh my!
