You can't ignore crap and hope it won't cause problems...

by Michael S. Kaplan, published on 2010/12/16 07:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/12/16/10105779.aspx


So the question that came up the other day:

Hi. I met an unexpected problem when using string.IndexOf. The following code demonstrates the problem:

string r = "\ufffd\ufffd\ufffd\ufffd";
string tar = "a";
Console.WriteLine(tar.IndexOf(r));
Can you guess the output? The output is 0, which is very weird to me. Can someone explain why? This has broken my program, which assumes that if a.IndexOf(b) >= 0 then a.Length >= b.Length.

Now the behavior here is by design.

And the reasons for this are covered in my blog post "Microsoft is giving this character nada weight but lotsa importance".
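A minimal sketch of the difference, assuming the collation data gives U+FFFD no weight as described here: a culture-sensitive search then treats the search string as if it were empty, while an ordinal search compares raw UTF-16 code units. The class and method names are hypothetical, and the culture-sensitive result can vary with the .NET version and platform, so no output is claimed for it.

```csharp
using System;

static class IndexOfDemo
{
    // Ordinal search: compares UTF-16 code units, no collation weights involved.
    public static int OrdinalIndexOf(string source, string value) =>
        source.IndexOf(value, StringComparison.Ordinal);

    // Culture-sensitive search: uses the current culture's collation data,
    // which is also what the plain IndexOf(string) overload does.
    public static int CultureIndexOf(string source, string value) =>
        source.IndexOf(value, StringComparison.CurrentCulture);

    static void Main()
    {
        string r = "\ufffd\ufffd\ufffd\ufffd";
        string tar = "a";

        Console.WriteLine(CultureIndexOf(tar, r));  // depends on the collation data in use
        Console.WriteLine(OrdinalIndexOf(tar, r));  // -1: r cannot occur in the shorter tar
    }
}
```

With the ordinal overload, the assumption that a.IndexOf(b) >= 0 implies a.Length >= b.Length holds, because a match requires b to appear in a code unit by code unit.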

Okay, so U+fffd (aka REPLACEMENT CHARACTER) is now the central way that Microsoft deals with text that is illegal/invalid by Unicode rules, whether due to unexpected or unintended corruption of results.

Fine, I can live with that -- conformance is a good thing.

I would prefer to have a way to opt out of the behavior if I am trying to investigate the nature of an intentional attack on security, but I can live with being required to stay outside of the encoding system provided by the globalization team.

BUT....

And this is a huge BUT, in my opinion!

Look at what Unicode does here in the allkeys.txt of UTS #10: Unicode Collation Algorithm:

0000  ; [.0000.0000.0000.0000] # [0000] NULL (in 6429)
0001  ; [.0000.0000.0000.0000] # [0001] START OF HEADING (in 6429)
0002  ; [.0000.0000.0000.0000] # [0002] START OF TEXT (in 6429)
...
FFFC  ; [*1490.0020.0002.FFFC] # OBJECT REPLACEMENT CHARACTER
FFFD  ; [*1491.0020.0002.FFFD] # REPLACEMENT CHARACTER

Unicode gives the two related characters some weight, while Microsoft's collation data does not.
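To make "some weight" concrete, here is a small sketch that reads entries in the format quoted above and reports whether the first three weights are all zero. The class, method, and regular expression are hypothetical illustrations; the sketch only handles single-element lines like these and deliberately ignores the leading `.` or `*` marker.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

static class AllKeysDemo
{
    // Matches one single-element entry such as
    // "FFFD  ; [*1491.0020.0002.FFFD] # REPLACEMENT CHARACTER".
    static readonly Regex Entry = new Regex(
        @"^([0-9A-F]+)\s*;\s*\[[.*]([0-9A-F]{4})\.([0-9A-F]{4})\.([0-9A-F]{4})\.");

    // True when the primary, secondary and tertiary weights are all zero.
    public static bool HasNoWeight(string line)
    {
        Match m = Entry.Match(line);
        if (!m.Success) throw new FormatException("Unrecognized entry: " + line);
        for (int i = 2; i <= 4; i++)
        {
            if (int.Parse(m.Groups[i].Value, NumberStyles.HexNumber) != 0) return false;
        }
        return true;
    }

    static void Main()
    {
        Console.WriteLine(HasNoWeight("0000  ; [.0000.0000.0000.0000] # [0000] NULL (in 6429)"));  // True
        Console.WriteLine(HasNoWeight("FFFD  ; [*1491.0020.0002.FFFD] # REPLACEMENT CHARACTER")); // False
    }
}
```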

Now it is not fair to say that on this basis Microsoft isn't conformant, since Microsoft does not use the Unicode Collation Algorithm.

However, it is clear that Microsoft, while perhaps conformant, is making that conformance meaningless for developers.

Because when you get right down to it, if I am comparing

LIZ

and

<uninterpretable crap>LIZ<uninterpretable crap>

and

L<uninterpretable crap>I<uninterpretable crap>Z

then it is incorrect to consider them to be identical.

They are not.
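The three strings can be put side by side in a short sketch. It does not call the real collation code: DropReplacement is a hypothetical helper that models a comparison in which U+FFFD carries no weight by simply removing it, and U+FFFD stands in for the uninterpretable crap.

```csharp
using System;

static class LizDemo
{
    const char Replacement = '\uFFFD';

    // Hypothetical model of "no weight": drop every U+FFFD before comparing.
    public static string DropReplacement(string s) =>
        s.Replace(Replacement.ToString(), string.Empty);

    public static bool EqualIgnoringReplacement(string a, string b) =>
        string.Equals(DropReplacement(a), DropReplacement(b), StringComparison.Ordinal);

    static void Main()
    {
        string plain = "LIZ";
        string wrapped = "\uFFFDLIZ\uFFFD";
        string interleaved = "L\uFFFDI\uFFFDZ";

        // Ordinal comparison sees three different strings.
        Console.WriteLine(string.Equals(plain, wrapped, StringComparison.Ordinal));      // False
        Console.WriteLine(string.Equals(plain, interleaved, StringComparison.Ordinal));  // False

        // Once U+FFFD counts for nothing, all three collapse into "LIZ".
        Console.WriteLine(EqualIgnoringReplacement(plain, wrapped));      // True
        Console.WriteLine(EqualIgnoringReplacement(plain, interleaved));  // True
    }
}
```

The second pair of results is the problem in miniature: the marker that was inserted to flag invalid input leaves no trace in the comparison.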

Remember that the official change in Unicode was to get away from the tendency of implementations to drop the invalid characters entirely -- due to security concerns.

But if I say "LA LA LA LA I am not listening to you" to ignore crap when it happens, then I am subverting the whole process of inserting this particular bit of crap to be conformant in the first place.

In my opinion, Microsoft's implementation over the last few years is not conformant to the Unicode Standard.

Microsoft can't ignore crap and hope it will go away....


Cheong on 16 Dec 2010 9:31 PM:

Yet in the question, we'd expect tar.IndexOf(r) to return -1 because the content of r does not exist in tar. I can imagine that having it return 0 will cause some infinite loop problem in certain data stream processing functions if they're lazy enough to use string manipulation functions to process data.

Michael S. Kaplan on 16 Dec 2010 10:25 PM:

That weirdness is due to a different issue that I will be covering another day. :-/

Though it too would not be an issue if my advice were taken here and U+fffd was given weight....


