If it isn't English, the WP8 Messaging App can get the numbers all wrong

by Michael S. Kaplan, published on 2013/04/01 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2013/04/01/10406100.aspx


When you use SMS/Text capabilities, you have 160 bytes per text.

This makes English the smallest language, and every other language a little bit bigger.

And it makes some a lotta bit bigger.

If you are using supplementary characters, you'll be a lotta bit biggest!

Thank goodness I have unlimited texting!

I decided to play around with this on my Nokia Lumia 920 running Windows Phone 8.

For this exercise, I spammed my good friend Lauren Witt with a few texts.

Text #1 was 160 ASCII digits.

You know,

0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789

In UTF-8, this is 160 bytes.

Text #2 was 30 emoticons.

You know,

😃😃😃😃😃😞😞😞😞😞😃😃😃😃😃😞😞😞😞😞😃😃😃😃😃😞😞😞😞😞

A stream of five U+1F603 characters followed by five U+1F61E characters, over and over.

In UTF-8, (30 * 4) or 120 bytes.

Notice how it had me at 60/70 characters there for my 120 bytes.

The actual UTF-8 (care of Mark Davis's site), is:

F0 9F 98 83 F0 9F 98 83 F0 9F 98 83 F0 9F 98 83 F0 9F 98 83 F0 9F 98 9E F0 9F 98 9E F0 9F 98 9E F0 9F 98 9E F0 9F 98 9E F0 9F 98 83 F0 9F 98 83 F0 9F 98 83 F0 9F 98 83 F0 9F 98 83 F0 9F 98 9E F0 9F 98 9E F0 9F 98 9E F0 9F 98 9E F0 9F 98 9E F0 9F 98 83 F0 9F 98 83 F0 9F 98 83 F0 9F 98 83 F0 9F 98 83 F0 9F 98 9E F0 9F 98 9E F0 9F 98 9E F0 9F 98 9E F0 9F 98 9E

But that's kinda beside the point.

Basically four bytes per character, which is how well-formed UTF-8 supplementary characters behave.
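For anyone who wants to check the byte count without a conversion site, here is a minimal Python sketch (not part of the original experiment). It rebuilds Text #2 from the two code points named above and counts its UTF-8 bytes. The names HAPPY, SAD, and text2 are just illustrative.

```python
# Rebuild Text #2: five U+1F603 then five U+1F61E, repeated three times.
HAPPY = "\U0001F603"
SAD = "\U0001F61E"
text2 = (HAPPY * 5 + SAD * 5) * 3

utf8 = text2.encode("utf-8")

print(len(text2))                  # number of code points: 30
print(len(utf8))                   # number of UTF-8 bytes: 120
print(utf8[:4].hex(" ").upper())   # first character: F0 9F 98 83
```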

Then, my final spam.

Six ASCII digits followed by five emoticons.

So that is (6 + (5 * 4)) * 4 or 104 bytes.

You know,

012345😃😞😃😞😃012345😃😞😃😞😃012345😃😞😃😞😃012345😃😞😃😞😃

Notice how it had me at 64/70 characters.

The UTF-8 was:

30 31 32 33 34 35 F0 9F 98 83 F0 9F 98 9E F0 9F 98 83 F0 9F 98 9E F0 9F 98 83 30 31 32 33 34 35 F0 9F 98 83 F0 9F 98 9E F0 9F 98 83 F0 9F 98 9E F0 9F 98 83 30 31 32 33 34 35 F0 9F 98 83 F0 9F 98 9E F0 9F 98 83 F0 9F 98 9E F0 9F 98 83 30 31 32 33 34 35 F0 9F 98 83 F0 9F 98 9E F0 9F 98 83 F0 9F 98 9E F0 9F 98 83

which is obviously a little bit smaller.

But it still gave me weird limitation estimates, this time 64 characters out of 70.

So. Smaller, but bigger?

I think their logic is a little off here.

One last text.

150 ASCII digits, a space, and two emoticons.

You know, 

012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789 😃😞

Now, like every other time, this text should fit under the 160 byte limit. The UTF-8 is 150 + 1 + (2 * 4) or 159 bytes, 1 byte short of the limit.
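The same sum can be reproduced in a few lines of Python. This is only a sketch of the UTF-8 arithmetic in the paragraph above; the variable names are made up for the example.

```python
# Last text: 150 ASCII digits, one space, then U+1F603 and U+1F61E.
digits = "0123456789" * 15
emoji = "\U0001F603\U0001F61E"
last_text = digits + " " + emoji

ascii_bytes = len((digits + " ").encode("utf-8"))  # 150 + 1
emoji_bytes = len(emoji.encode("utf-8"))           # 2 * 4
total_bytes = len(last_text.encode("utf-8"))

print(ascii_bytes, emoji_bytes, total_bytes)       # 151 8 159
```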

Let's see what the phone says:

 

Um, 155/201, taking three messages?

What the hell?

Someone has an algorithm bug to fix in the WP8 Messaging app, I think...

In the end, that last text was just one text, containing 159 bytes.

Yet the miscalculation was used and according to Lauren, the phone sent out three texts!

For the record, I'd be more forgiving if they were only off estimating when it was over the limit. There is no excuse to erroneously claim I'm over the limit when I'm not, and definitely no excuse to spam the error out into the world outside!

Note that similar repros can be constructed for actual languages using 2 bytes, 3 bytes, and/or 4 bytes for UTF-8 - all of them put the WP8 messaging app in "4 byte mode".

Not really an April Fools' Day joke, though basic errors in arithmetic read that way to me!

(I originally found the problem in actual texts, not looking for bugs -- so be warned, Ken: if they try to Won't Fix the bug, I'll have many real world examples for the reactivate!)

Either way, I'm glad I have unlimited texting -- if not, I would submit the bill to the Windows Phone team. :-)

And isn't it cool to see the font support? Segoe UI Symbol, I presume. Keyboard support for Emoji is also sublime....

Still loving my WP8 running Nokia Lumia 920.

If they gave me access to newer builds, I could selfhost/dogfood. WP would get the bug reports pre-ship, and NDAs would limit blogs about bugs - everybody would win! ;-)

In any case, I have new respect for the language based pain some languages have based on their arbitrary place in Unicode, something I've both spoken and written about before....

 

Special thanks to Lauren for putting up with my 6 spam texts and still being willing to go to the symphony with me this last weekend. Like all my friends, she puts up with a lot! :-)


Daniel on 1 Apr 2013 8:34 AM:

Your problem is assuming SMS has anything to do with UTF-8. It does not.

Read this and weep:

en.wikipedia.org/.../GSM_03.38

John Cowan on 1 Apr 2013 9:27 AM:

By what I understand, an SMS message is 140 bytes.  If all the characters are in the GSM 03.38 character set (ASCII plus various things replacing the control characters), then you get 160 7-bit characters packed into those bytes.  Otherwise you get 70 UTF-16 characters.  There is no variable-length encoding such as UTF-8 involved.
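The two capacities in this comment fall out of simple arithmetic on the 140-byte payload, which the following Python sketch illustrates. The helper pack_septets is hypothetical, and its low-bits-first packing order reflects a common reading of GSM 03.38 rather than anything stated in the comment, so treat it as an assumption.

```python
PAYLOAD_BYTES = 140

# 7-bit characters: 140 * 8 bits / 7 bits each. UTF-16: 2 bytes per code unit.
gsm7_capacity = PAYLOAD_BYTES * 8 // 7
utf16_capacity = PAYLOAD_BYTES // 2
print(gsm7_capacity, utf16_capacity)   # 160 70

def pack_septets(septets):
    """Pack 7-bit values into bytes, first septet in the low bits (assumed order)."""
    bits = 0      # bit accumulator
    nbits = 0     # number of valid bits currently in the accumulator
    out = bytearray()
    for s in septets:
        bits |= (s & 0x7F) << nbits
        nbits += 7
        while nbits >= 8:
            out.append(bits & 0xFF)
            bits >>= 8
            nbits -= 8
    if nbits:
        out.append(bits & 0xFF)   # flush the final partial byte
    return bytes(out)

# 160 ASCII digits (digits keep their ASCII values in the 7-bit set) fill the payload exactly.
digits = "0123456789" * 16
packed = pack_septets([ord(c) for c in digits])
print(len(packed))   # 140
```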

Michael S. Kaplan on 1 Apr 2013 10:12 AM:

This makes me very sad.

But that last text should still be *2*, not 3 texts!

Michael S. Kaplan on 1 Apr 2013 10:41 AM:

Of course, I'm still a little fuzzy here. Wouldn't UCS-2 mean *two* bytes per character, not FOUR?

Daniel on 1 Apr 2013 11:50 AM:

The last one works like this:

- your message contained those emoticons, so it switches the whole thing to the UCS-2/UTF-16 mode

- the 150 digits plus one space thus take 302 bytes

- the emoticons are in the SMP, so they take 4 bytes each, 8 bytes in total

So you need 310 bytes for the message.

A single SMS has space for 140 bytes, so you need multiples. When you start needing multiples, some bytes are used up marking the sequence, so now you only get 134 bytes per message.

134 + 134 + 42 = 310

Thus you need three messages and you have 134-42 = 92 bytes left before needing another one.

Your phone is an optimist, as is pretty much any other phone, and assumes you are going to be writing BMP characters rather than some more SMP emoji. The 155/201 counter it displays is exactly that, 201-155 = 46. Times two for any BMP character, 92.

First example: fits into the 7-bit SMS character set and you only needed one message, so you could fit 160 characters into the 140 bytes

Second example: the emoji switch it to UTF-16 mode, and 30 of them take 4 bytes each. 140 - 120 = 20 bytes left while still using only one message, so the readout says you could type 10 more BMP characters (60/70).

All three work as expected.
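The counter logic described in this comment can be sketched in Python as below. This is an illustration built only from the figures given here (140 bytes for a single message, 134 bytes per part once the message is split) and is not the phone's actual code. The function name ucs2_counter is hypothetical, and it models only messages that have already been forced into UCS-2/UTF-16 mode.

```python
SINGLE_BYTES = 140   # payload of a lone message, per the comment above
PART_BYTES = 134     # payload per part once the message is split

def ucs2_counter(text):
    """Return (used, capacity, parts) in 2-byte units for a UTF-16 message."""
    used_bytes = len(text.encode("utf-16-be"))   # no BOM; SMP chars take 4 bytes
    if used_bytes <= SINGLE_BYTES:
        parts = 1
        capacity_bytes = SINGLE_BYTES
    else:
        parts = -(-used_bytes // PART_BYTES)      # ceiling division
        capacity_bytes = parts * PART_BYTES
    return used_bytes // 2, capacity_bytes // 2, parts

HAPPY = "\U0001F603"
SAD = "\U0001F61E"

text2 = (HAPPY * 5 + SAD * 5) * 3
text3 = ("012345" + HAPPY + SAD + HAPPY + SAD + HAPPY) * 4
text4 = "0123456789" * 15 + " " + HAPPY + SAD

print(ucs2_counter(text2))   # (60, 70, 1)
print(ucs2_counter(text3))   # (64, 70, 1)
print(ucs2_counter(text4))   # (155, 201, 3)
```

Under these assumptions the sketch reproduces all three readouts from the post: 60/70, 64/70, and 155/201 across three messages.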

Michael S. Kaplan on 1 Apr 2013 6:07 PM:

When it decides to split them, it should re-assess the size of them post-split. Not doing that is a bug, IMHO.

Kalle Olavi Niemitalo on 1 Apr 2013 9:44 PM:

Do you mean the phone should split your last text into two messages: 7-bit for the first 151 characters and UTF-16 for the two emoticons?  This is not recommended by ETSI TS 123 040 V11.4.0 (2013-01) section 9.2.3.24.1 (Concatenated Short Messages): "The TP elements in the SMS-SUBMIT PDU, apart from TP-MR, TP-SRR, TP-UDL and TP-UD, should remain unchanged for each SM which forms part of a concatenated SM, otherwise this may lead to irrational behaviour."  The character-set bits are in the TP-Data-Coding-Scheme field, which is not listed here.

Michael S. Kaplan on 2 Apr 2013 7:36 AM:

Terrible recommendation, by people who clearly either weren't sending other language text or weren't being charged per text message...

Doug Ewell on 2 Apr 2013 8:23 AM:

Back in 1995, before most folks outside the industry had heard of SMS, Tim Garton from Motorola asked for a way to encode Unicode such that the best-case scenario would be 7 bits per character:

www.unicode.org/.../0240.html

The response was to try RCSU, which is 8 bits in the best case, but still better than the adopted approach, which treats Unicode as an emergency exit. Naturally, SCSU is better still, but we're not supposed to say that.

Doug Ewell on 9 Apr 2013 9:24 AM:

> In any case, I have new respect for the language based pain some languages have based on their arbitrary place in Unicode, something I've both spoken and written about before....

You don't really want me to write my long-postponed response to "It is easy (and obnoxious)...", do you?


referenced by

2013/09/10 Is there anyone from Facebook reading this Blog?
