Here is an interview question for you. :-)

by Michael S. Kaplan, published on 2005/09/13 17:17 -07:00, original URI: http://blogs.msdn.com/michkap/archive/2005/09/13/465228.aspx


I am not doing interviews at the moment, but I could actually see using this one (another question will come along by the next time I an interviewing someone, so I don't think I am losing too much). But I thought I would ak it here for those who love a challenge.

Remember the CharNext function? I talked about how it could be useful here and how we broke it here. Yes, it is fixed in Vista....

Anyway, imagine the functionality -- we want a function that will increment us by one character, what the user might think of as a character.

So, let us say that you have full access to the GetStringTypeW function so you can find out all about characters. Now go ahead and write the CharNext function.

And that part is easy enough that I'll make it a little harder. Pretend there is (in addition to the C3_DIACRITIC and C3_NONSPACING flags) flags for C3_HIGHSURROGATE and C3_LOWSURROGATE. So you can take care of supplementary characters as well.

Ok, ready. set. Go!


# Shaun Bedingfield on Tuesday, September 13, 2005 10:12 PM:

What language? Which sort order? (Sql Server: collation?)

When people ask you to do anything, the best response is usually a question.

ie. If someone asks you how to compress something..

What type of compression? Run length encoding? Huffman encoding? Pattern based (vector format encryption ie. bmp to vector format graphics)? What does the data look like? Usage patterns?

# Michael S. Kaplan on Tuesday, September 13, 2005 10:33 PM:

Good questions, Shaun. I guess if I am looking for someone to rewrite CharNext then I am looking for C or C++, but since it is being treated as if it were an interview question I would accept any language. The Win32 function assumes the locale, so the new one probably should too.... No compression needed -- you are returning a pointer inside the string. You can actually look at the CharNext doc topic for spec details if you like. :-)

# James Todd on Tuesday, September 13, 2005 11:48 PM:

Looking at the documentation for GetStringTypeW from the link you provide, it does define flags for C3_HIGHSURROGATE and C3_LOWSURROGATE. Or am I misunderstanding what you mean?

James

# Michael S. Kaplan on Wednesday, September 14, 2005 2:00 AM:

Hi James -- Yep, newly added for Vista. :-)

# Serge Wautier on Wednesday, September 14, 2005 3:16 AM:

> When people ask you to do anything, the best response is usually a question.

Shaun, no offense intended but...
I just plain disagree with you. If I were the interviewee, these questions would just help you get burned. Because these questions show you're a programmer, not a developer: They can all be answered from the context.
And regarding your compression example, I'd rather ask questions such as: "What's more important : compression efficiency or low development budget ?" And then I'd justify the choice for a method accordingly.

# Shaun Bedingfield on Wednesday, September 14, 2005 1:29 PM:

The compression part was just an example of asking questions to make sure you are solving the right problem.

I guess that you could look at questions as superfluous but in development solving the right problem is often more important than solving 'a' problem at least in my opinion.

Finding the next character is kind of silly in many contexts should kanji come after English? Does this make sense at all? Yes, you can assume we mean the next highest character in unicode but that is rather naive. I didn't mean language as far as C++, C#.. I meant language as far as Korean, Hebrew, etc. The next character is highly context sensitive. I guess for character next, I would probably get the user's code page or set language. Then, find the corresponding sort order. If the current character is not a character in their language, I would return an error or else find the next character in the language's sort order. Of course, I could be wrong about what Mike wants which is why I ask questions.

As far as compression efficiency versus low budget.. Efficiency can be defined as time to compress, size of compressed image, etc. Budget may be a buy versus build trade off or it may mean balancing the time to implement the algorithm versus the efficiency in several different terms.

I don't think I would want to encrypt a text file using "lossy" encryption. Of course, maybe encryption in that case could be a tool that presents a summary of a document or rewords sentences to "meaning" equivalent shorter versions. It could also look for sentences that state the same thing (maybe authors don't proofread) and delete the copies. Depending on your customers, any of these might be an innovative solution.

A good developer asks questions and figures out what they need to build before they start coding. A bad developer assumes they know what the interviewer or customer wants and starts coding off the bat. So I guess we would have to agree to disagree. It is easy to assume all questions can be answered from contextual hints but I don't think that is the typical case.

# Mihai on Wednesday, September 14, 2005 1:46 PM:

> the best response is usually a question
Depends on the question :-)
Bad questions show you are playing with smoke and mirrors. Only good questions get you points.

And almost all your questions look like smoke :-)

# Shaun Bedingfield on Wednesday, September 14, 2005 3:39 PM:

That hurts.. even though my questions were only sample questions and most people misread my first post as only the first 3 questions actually corresponded to the question at hand and the others were attempts to show how to ask intelligent questions when given a problem (in this case compressing something). The second was admittedly a very generic problem.

My biggest problem with recruiters and people in general is that most people misread or misinterpret what I say. This really hurts me on interviews even when I do know what I am talking about.

I was told I knew nothing about object oriented analysis and design even though I have been studying it for years and blew everyone else away in a master's level OOA&D class. I also have produced some very good OOA&D designs. Oh, and this was for an entry level job where they were expecting someone with almost no exposure to OOA&D.

I can talk to senior level people and have them in awe about my skills and then be told by a beginner that I don't know ***.

Now is the point where people usually jump at me and say, "No, I understood what you said and it is full of sh*t." Let me put it this way, yes, you probably misunderstood me and no it is probably not your fault.

I have Asperger's syndrome. I think differently than almost everyone else. This means that while what I say is often literally correct, almost everyone misinterprets me because I do not understand the connotations everyone else reads in. I will not tell you how difficult this makes my life. I have heard of people with phDs and Asperger's who have gone over 20 years without a job because they cannot communicate that they know what they know.

I have a genius level IQ and I was told that I was retarded most of my life. I have failed homework assignments where my proof was identical to those who aced them because the graders didn't understand me.

I had friends all through high school and most of college calling me the next Bill Gates and I can't find a job.

Ok, what does this mean to you.. probably nothing and I probably still am full of shit and don't know a damn thing about programming but I really wish I could get someone to believe me.

# Michael S. Kaplan on Wednesday, September 14, 2005 3:48 PM:

I guess I am most hurt by the fact that no one wanted to treat it like a pseudo-interview question to solve. I wanted to compare what I came up with to what other people did. :-)

# Shaun Bedingfield on Wednesday, September 14, 2005 3:55 PM:

I am sorry Michael, I just get upset when people make thinly vieled personal attacks.

I was trying not to move things off subject but I feel obligated to respond to certain items.

# Michael S. Kaplan on Wednesday, September 14, 2005 3:55 PM:

Shaun, no worries. I was not offended anyway. :-)

You don't sound full of it or retarded to me, fwiw.

But I answered your questions and you did not follow up with your answer to the question....

# Shaun Bedingfield on Wednesday, September 14, 2005 3:59 PM:

I am also sorry for not reading it completely..

I just saw increment a character and didn't note that you meant increment a character position.

# Shaun Bedingfield on Wednesday, September 14, 2005 4:32 PM:

I am not sure this is correct but this is my first stab.. I am not a language junky so have fun. I bet I am wrong.

// okay, I hope I know what these characters mean..
// I wonder why I don't work in localization
// I do have a nifty web page pattern for localization in dev at the
// moment though
LPTSTR CharNext( LPCTSTR lpsz)
{
// regular characters so just get the next character
if(sizeof(lpsz[0]) == sizeof(char))
retun lpsz[1];

// we are at the end of the string
if(lpsz[0] == '\0')
return lpsz;

WORD characterinfo;

// See if we have a high surrogate
if(GetStringType1(CT_CTYPE3, lpsz, 1, &characterinfo))
{
// high surrogate must be followed by a low surrogate
if((characterinfo & C3_HIGHSURROGATE) != 0)
if(wcslen(lpsz) >= 2) // remember the null 0
lpsz = lpsz[2]; // eat two characters
else
lpsz = lpsz[1]; // go to the end
else
lpsz = lpsz[1]; // eat one character
}

if(wcslen(lpsz) > 0 && GetStringType(CT_CTYPE3, lpsz, 1, &characterinfo))
{
// eat extra diacritical or nonspacing characters
// I think there can only be 1
// I am just guessing that a diacritic can follow a surrogate pair
if(((characterinfo & C3_DIACRITIC) |
(characterinfo & C3_NONSPACING)) != 0)
lpsz = lpsz[1];
}

return lpsz;
}

# Mihai on Wednesday, September 14, 2005 5:00 PM:

Shaun --
Sorry Shaun, but from my "playing with smoke and mirrors" to your "retarded" is a big difference. This was definitely not my interpretation.

I agree 100% that asking questions is really good when the original question is not clear. Also, sometimes new questions might highlight hidden assumptions in the original question, assumptions that sometimes are not clear the one asking the original question.

Also, thinking differently is sometimes a disadvantage for communication, but is also a big advantage if it means creative, out of the box ideas.
And, in general, I think no good programmer is "normal" by the generally accepted standards :-)

No, let me shortly explain why I had something with your questions:
The request was "we want a function that will increment us by one character, what the user might think of as a character" ... "Now go ahead and write the CharNext function."

- What language? Which sort order?
Language + region + sort == locale.
So you don't ask about language and sort order, you do what the actual CharNext does: use the current locale.
And, really, sort order does not affect things like CharNext.
And yes, Japanese mixed with English very common (see http://www.microsoft.com/japan/msdn/campaign/)
Exactly what Sarge said: "They can all be answered from the context"

(Sql Server: collation?)
We where talking Windows API. Why bring SQL Server here?

But this would not be enough to fail an interview with me. If (after I answer your questions) you come up with something that seem to have reasonable chances to solve the problem. In general I do not expect someone to come up with my solution. It would not be fair. Maybe I am thinking daily about this problem, for the last few weeks. In fact, I think I would even like better to get 2-3 different solutions. And if you don't narrow down the problem with too many questions, then you have more space to develop "if we want something with same signature with CharNext then ...", "if all we care is 'technically correct', meaning don't split before combining characters or in surrogates, then ..."

But in a real interview (not in writing) this is much easyer. It takes 10 seconds to clarify something. Writing is way easier to misinterpret. Especially that I am not a native English speaker (writer :-) I have tried to add smiley(s), but it did not help.

MishKa - "no one wanted to treat it like a pseudo-interview"
I had three things stopping me from doing that:
- not much time
- doing it in writing (see above)
- showing that in fact I know much less than what I seem to know :-)
But if is more like a validation of you ideas than an exercise for your readers, than I can come up with something (maybe over the week-end).

# Shaun Bedingfield on Wednesday, September 14, 2005 6:50 PM:

To be honest, I misread the question.. and I am and was laughing a bit at myself.

I read that CharNext "increments" a character without looking at the actual function which just moves to the next character in a string. In this question locale information is very important and it is possible that one character could be used in multiple locales with a different sort order and GetStringTypeW might not be much of a help.. hence my questions.

I was mainly responding to the point that "these questions would get you burned" as I feel that asking questions is generally something people do too little of.

I think the reason why I eventually posed the questions is because I felt the idea of a CharNext function as stated was kind of ridiculous. Incrementing characters just does not make sense.

In this country in general, people tend to be "polite" which doesn't mean they are really polite but more that we phrase things as being a less severe cricisim than it really is. I would have expected things like "Did you read the question? I might agree with your idea of asking questions but I feel that a lot of what you said lacked relevance." which is a polite way to say that you missed the boat (which I did). Whitty statements in particular are often especially biting but made more "polite".

The problem is that statements which criticize (especially whitty ones) put me on the defensive.. fast. I have had some very mean things said to me over the years and the comment about being retarded has more to do with my actual life than the contents of the board.

Like everyone here, I sometimes don't take proper care when reading things. On top of that, I am definitely "not" a morning person.

Enjoy my short code snippet.. I am not much of a localization type so it should be a fun read to see what mistakes I made.

# Jerry Pisk on Wednesday, September 14, 2005 7:13 PM:

Shaun, we can't enjoy your code much as it won't even compile. The construct:

lpsz = lpsz[x];

assigns a char into a char* variable.

If compiled without UNICODE (i.e. sizeof(lpsz[0] wil be the same as sizeof(char)) your code will overrun the buffer if lpsz already points to the terminating '\0'.

I'll see if I can find more, I'm not sure whether you should be skipping diacritic marks.

# Shaun Bedingfield on Wednesday, September 14, 2005 7:14 PM:

I guess I can be glad that this is not my area of expertise..

My areas of expertise are..

Design - especially hard and bleeding edge, I get bored with routine things unfortunately.. I am kind of a research developer.

Iterative Development

Algorithms and Performance Optimization - I am not the best code optimizer but I am defintely not bad. I am really good at figuring out how to solve new problems.

Software construction - ie. programming (16+ years in C/C++ by 26.. If someone needs a programmer, I am very good). I probably could build software faster than many teams of software engineers.

Innovative ideas and technology - I think up some really great stuff.

Learning - I am very quick at picking up new material.. very very quick.

Areas where I suck..

Communication - I can speak tech but I tend to rub people the wrong way. I also am sometimes a bad listener and I tend to open my mouth before I think (ie. this article). My high school UIL team sans one hated me which is the only reason I was not on it. People also tend to think I am an ignorant jerk.

I spend a huge amount of time trying to improve my social skills but it is frustrating.

UI - I have no idea how the average user uses software. To be honest, I have trouble understanding how they think.. a lot of trouble.

Depression, issues, chronic illness - Yep, I have them and it doesn't help.

If anyone wants the good and can live with bad, give me a call. Sometimes I think I am never going to find employment.

Shaun Bedingfield
blogsb.blogspot.com
shaunbed@houston.rr.com

# Shaun Bedingfield on Wednesday, September 14, 2005 8:06 PM:

Yep.. you are write..
It is meant to be &lpsz[1]..

And yes, I am assuming you are not making char = wchar which can be done with certain flags.. You could also assume sizeof returns the size in bytes and make it equal to 1.

# Shaun Bedingfield on Wednesday, September 14, 2005 8:07 PM:

Yep.. you are right..
It is meant to be &lpsz[1]..
This is C++ trick to move the pointer to the next integer.

And yes, I am assuming you are not making char = wchar which can be done with certain flags.. You could also assume sizeof returns the size in bytes and make it equal to 1.

Please don't shoot me, I wrote the code in about 3 or 4 minutes.

# Shaun Bedingfield on Wednesday, September 14, 2005 8:10 PM:


// okay, I hope I know what these characters mean..
// I wonder why I don't work in localization
// I do have a nifty web page pattern for localization in dev at the
// moment though
LPTSTR CharNext( LPCTSTR lpsz)
{
// regular characters so just get the next character
if((sizeof(lpsz[0]) == sizeof(char)) && (sizeof(lpsz) > 0))
return &lpsz[1];

// we are at the end of the string
if(lpsz[0] == '\0')
return lpsz;

WORD characterinfo;

// See if we have a high surrogate
if(GetStringTypeW(CT_CTYPE3, lpsz, 1, &characterinfo))
{
// high surrogate must be followed by a low surrogate
if((characterinfo & C3_HIGHSURROGATE) != 0)
if(wcslen(lpsz) >= 2) // remember the null 0
lpsz = &lpsz[2]; // eat two characters
else
lpsz = &lpsz[1]; // go to the end
else
lpsz = &lpsz[1]; // eat one character
}

if(wcslen(lpsz) > 0 && GetStringTypeW(CT_CTYPE3, lpsz, 1, &characterinfo))
{
// eat extra diacritical or nonspacing characters
// I think there can only be 1
// I am just guessing that a diacritic can follow a surrogate pair
if(((characterinfo & C3_DIACRITIC) |
(characterinfo & C3_NONSPACING)) != 0)
lpsz = &lpsz[1];
}

return lpsz;
}

# Shaun Bedingfield on Wednesday, September 14, 2005 8:21 PM:

I shouldn't have put it down without testing it first.. I didn't intend that code to actually run, it was just high level pseudocode.

Give me a second while I hit myself in the head and build a copy that will actually compile and please ignore my earlier post.

# Shaun Bedingfield on Wednesday, September 14, 2005 8:29 PM:

Here it is.. This is the last time I post code before testing it.. I thought people would note things like the fact that I never changed
GetStringType1 to GetStringTypeW and realize that it was just low level pseudocode that I had never intended to run as is.

// okay, I hope I know what these characters mean..
// I wonder why I don't work in localization
// I do have a nifty web page pattern for localization in dev at the
// moment though
LPTSTR CharNextTest( LPCTSTR lpsz)
{
// regular characters so just get the next character
if(sizeof(lpsz[0]) == 1 && lpsz[0] != '\0')
{
return (LPTSTR)&lpsz[1];
}

// we are at the end of the string
if(lpsz[0] == '\0')
{
return (LPTSTR)lpsz;
}

WORD characterinfo;

// See if we have a high surrogate
if(GetStringTypeW(CT_CTYPE3, (LPCWSTR)lpsz, 1, &characterinfo))
{
// high surrogate must be followed by a low surrogate
if((characterinfo & C3_HIGHSURROGATE) != 0)
if(wcslen((const wchar_t *)lpsz) >= 2) // remember the null 0
lpsz = &lpsz[2]; // eat two characters
else
lpsz = &lpsz[1]; // go to the end
else
lpsz = &lpsz[1]; // eat one character
}

if(wcslen((const wchar_t *)lpsz) > 0 && GetStringTypeW(CT_CTYPE3, (LPCWSTR)lpsz, 1, &characterinfo))
{
// eat extra diacritical or nonspacing characters
// I think there can only be 1
// I am just guessing that a diacritic can follow a surrogate pair
// diacritics follow the character so this is part of the character we already ate.
if(((characterinfo & C3_DIACRITIC) |
(characterinfo & C3_NONSPACING)) != 0)
lpsz = &lpsz[1];
}

return (LPTSTR)lpsz;
}

# Shaun Bedingfield on Wednesday, September 14, 2005 8:37 PM:

Um.. Mehai et al.. Can I say some really choice things about myself right now?

Let me do this..

Shaun, you are a piece of work. You must be the greatest #$#$%J@(#* I have ever seen. Not only do you defend your ignorance by your talkitive blathering but you feel that your utter nonsense will have persuade us to feel that a kindergartener isn't better equiped to solve this problem than you.

Let me spell this out for you.. GET A LIFE.. I have met window washers with better technical accumen? Have you ever thought of a job as a sanitary engineer? Don't. The idea of you doing engineering on that level is a pathetic.

By the way your code rox my world.. Not only does my compiler feel it is garbage but even with some considerable effort all I could get it to do is display a certain blue screen.. I do not want you anywhere near my computer. I don't even like the color blue and you give it to me in spades.

Let me try to polite.. I really do think there is a future for you. Stick to what you know and you should be alright.. I may even one day figure out something that you do know.. like perhaps how to butcher the english language and waste my time.

# Maurits on Thursday, September 15, 2005 4:03 AM:

Here's my initial thoughts on the matter.

/* TODO:
* What about non-W version(s)?
* How to go boom, exactly?
* What if the string is a row of 50 nonspacing characters?
* Should I really return the null terminator in this case,
* considering them all one mega-character?
* Or should I blow up?
* Or should I return each individually as their own non-combining character?
* Or perhaps I should periodically call FoldString until I get a char-count of 2,
* and then back up one?
* overall this is too generous in its assumptions
*/

LPTSTR CharNext(LPCWSTR ps)
{

if (ps == 0)
{
/* go boom somehow */
}

/* grab GetStringTypeW info three times */
/* first get length of input string in wchars */
int l = wcslen(ps) + 1; /* +1 is for null */

/* allocate an array of 16-bit elements - one for each wchar */
LPWORD t3 = (LPWORD)malloc(sizeof(WORD) * l);

if (
!GetStringTypeW(
CT_CTYPE3,
ps,
l, /* or -1 I suppose */
t3 /* &t3[0] if you like */
)
{
/* go boom somehow */
/* make sure t3 gets free()d */
}

BOOL found = 0;

/* start offset at 1... */
/* i < l will stop walking off the end, even if string is just null */
for (int i = 1; !found && i < l; i++)
{
/* is this offset ... */
if ( t3[i] &
(
C3_NONSPACING | /* a nonspacing character? */
C3_DIACRITIC | /* a nonspacing diacritic? */
C3_VOWELMARK | /* a nonspacing vowel mark? */
C3_LOWSURROGATE /* I ASSUME ... */
/* that the previous character was a high surrogate */
/* so this is the second half of that character */
) != 0 /* for the pedantic */
)
{
/* yup, sure is -- NOT SUITABLE, SKIP */
} else
{
found = 1; /* no problems, passes the test */
}
}

if (!found)
{
/* set offset to position of null character */
i = l - 1; /* should already be this, but better safe than sorry */
}

free(t3);

// next character is "i"th away
// could be 0, if *ps == '\0'
// could be 1, 2, 3, ...
// note it's very possible for i != 0 and *(ps + i) == '\0'
// if ps was last "real" character but there was nonspacing stuff after
// or if ps was a high/low surrogate character combination
return ps + i;

}

# Maurits on Thursday, September 15, 2005 4:05 AM:

ROFL the comments box eats leading whitespace on every line :) :) :)

# Maurits on Thursday, September 15, 2005 4:32 AM:

Hmmm... if the data string is large, GetStringTypeW will take a long time to return, even though I really only care about the second character most of the time.

Also, surrogate characters confuse me.

Suppose I had a high/low pair A/B
Is it possible that this high/low pair's "real" Unicode code point could be a nonspacing character? If so a FoldString call or two may be necessary after all.

# Michael S. Kaplan on Thursday, September 15, 2005 4:53 AM:

There are not currently any non-spacing characters off the BMP, so one could assume that case does not yet need to be handled; it is worth a thought for the future, though.... I agree that GetStringType for the whole string is usually too much; smaller calls are better.

# Maurits on Thursday, September 15, 2005 5:10 AM:

OK, let me turn it around and ask you, then.

Given a properly-formed UTF-16 encoded string, how many wchar's do I need to look ahead to be sure that I see at least the start of the next character? In your linked example you give an example of a Vietnamese character that is three wchar's long. Are there any that are four? Five?

I could easily take this max (say, five) and compare it to the known remaining wchar-length of the string and take the lesser of the two.

(In fact if the known remaining wchar-length of the string is 0 I can just return the input pointer and leave it at that.)

But the thing that really bugs me is...

Suppose I have an invalid Unicode string that is nothing more than a very long series of the same combining-character code point? Should I throw an error?

# Michael S. Kaplan on Thursday, September 15, 2005 5:14 AM:

Well, the best way to handle it (in my opinion) is to go one at a time, always. In the truly abberant case of the long series of diacritics, it will be slightly slower, but there is no need to optimize for the abberant cases....

# Maurits on Thursday, September 15, 2005 5:37 AM:

Take two... if you like I could email you a .txt version of this so you can fully appreciate my magnificent indentation technique ;) ;) ;)

LPTSTR CharNext(LPCWSTR ps)
{

if (ps == 0)
{
/* go boom somehow -- string pointer is null */
}

if (*ps == '\0') /* zero-length string */
{
return ps; /* already at end of string */
}

int l = wcslen(ps); /* no +1 necessary this time around */

/* start checking at wchar after this one (ps + 1) */
/* note i < l condition still holds - just doesn't check the '\0' this time */
for (int i = 1; i < l; i++)
{

WORD t; /* just one please */
if (
!GetStringTypeW(
CT_CTYPE3, /* only CT_CTYPE3 is interesting */
ps + i,
1, /* JUST 1 */
&t /* forgive me for skipping the array nonsense */
)
)
{
/* go boom somehow -- GetStringTypeW failed */
}

/* is this wchar... */
if ( t &
(
C3_NONSPACING | /* a nonspacing character? */
C3_DIACRITIC | /* a nonspacing diacritic? */
C3_VOWELMARK | /* a nonspacing vowel mark? */
C3_LOWSURROGATE /* see following note */

/* I ASSUME, for C3_LOWSURROGATE
* that the previous character was a high surrogate
* so this is the second half of that character
* TODO: production code should check ALSO check for C3_HIGHSURROGATE
* then pull the two halves of the pair together
* and join them (via FoldString)
* then check whether that joined character is a nonspacing character
* there aren't any yet
* but who knows what the Consortium will do in the future?
* probably still within scope of an interview question to expect this
* but I'm a little tired ;)
*/

) != 0 /* for the pedantic */
)
{
/* yup, sure is -- NOT SUITABLE, SKIP */
/* I'm sorely tempted to keep a counter of */
/* consecutive nonspacing or surrogate characters */
/* if it gets higher than (say) seven, go boom */
/* better to error on insane data sooner than later */
} else
{
/* found it! break to avoid i++ */
break; /* is there a "for" break in Microsoft C? */
}
}

/* biggest i could be is l - in this case ps + i is the '\0' wchar */

return ps + i;

}

# Maurits on Thursday, September 15, 2005 5:54 AM:

/* I'm sorely tempted to keep a counter of */
/* consecutive nonspacing or surrogate characters */

Duh, it's called "i" (or technically, i - 1)

So add the following lines of code:

#define MAX_WCHARS_BETWEEN_CHARACTERS 7 /* say */
...
/* MAX_ stops for loop from spinning too many times on big bad data */
for (int i = 1; i < l && i < MAX_WCHARS_...; i++)
{
...
}

if (i >= MAX_WCHARS_...)
{
/* blow up */
}

return ps + i;

# Shaun Bedingfield on Thursday, September 15, 2005 7:28 AM:

I am really sorry for the mood in my posts yesterday. I am near suicidal and I am going to post my farewells before I blow up and tick off too many people.

I am going to take a vacation from everything for a while.

You can read details at blogsb.blogspot.com which I am closing.

Hopefully, I will return someday.

# Michael S. Kaplan on Thursday, September 15, 2005 8:53 AM:

Shaun -- I will say it once more if you are still reading -- there are no worries about anything you said. I assume you don't believe me, but I really hope you will. I've yet to find a line of code worth that high of a price....

# Michael S. Kaplan on Thursday, September 15, 2005 8:57 AM:

Hi Maurits -- by using a wcslen in there, you are walking the entire string when in almost all non-abberant cases you never need to walk past a WCHAR or two. It might be best to make sure that only the people who cause such a need are the ones who have to pay a penalty, from a perf. standpoint....

Also, I am really unconvinced about MAX_WCHARS_BETWEEN_CHARACTERS because if they have 50 discritics after a letter then they may just be wanting to test the limits of the function, and we do not currently doc such a limit....

# Shaun Bedingfield on Thursday, September 15, 2005 9:30 AM:

No, this is personal. I haven't been able to find a job in 4 years and I hate myself, blame myself, etc. I am starting to believe I am a loser and I am being very mean to myself at the moment. It's my problem. I just don't want to involve others in World War 3 (sans world)

Good point by the way on wcslen. If you have to call it, it might be best to only call it once. Though it doesn't have to be as slow as a CharNext walk as you can use SIMD calls to really speed it up. Doing things like bit masks to check multiple bytes at a time. I totally forgot about that in my code. Pascal strings don't have this problem as the length is the first character. Of course since the parameter is a constant, a good compiler might be able to optimize the call to wcslen away. If it is constant in the program, the compiler will be able to supply the length. Also if the code is being used via interop, the compiler may be able to find the string length when passed in.

Recoding the thing in assembly or an unsafe cast to an array of 64 bit integers, you could read multiple WCHARs at once. By avoiding GetStringTypeW and using actual masking to get the diacritics, etc yourself you could also test multiple characters at once. It would be easy to make this point to a hash which points to array (ie. hashtable) and then figure out how many characters to skip.

So 4 characters in one read and a couple logical ops which is really fast.. You can also detect null 0s using binary ands since they are 0s. Basically, first check to see if the run has any 0s and if so find the first null 0. This will work unless the architecture you are building for coughs when the allocated memory for the string is exceeded. In this case you can still use longs (most of the time) but you need additional checking for boundary conditions.

MAXCHAR is beautiful but unfortunately it is not perfect and can still lead to flaws. You really need to check how much memory is allocated for the string and check when that is exceeded for perfect code. Unfortunately, I know of no quick way to do this. So really you are usually stuck depending on the coder having the null zero there. MAXCHAR might keep the program from crashing but you might still get a buffer overflow. Why do you think StrSafe has character counts?

Most of the time, you do not want the routine to blow up.. My advice is have it return the original string and SetLastError if people need to find out if something occurs.

I could rewrite this and make it really fast but I didn't get any sleep last night. I was mad at myself.

# Shaun Bedingfield on Thursday, September 15, 2005 9:37 AM:

Silly me with no sleep.. Use sizeof to find out how big the buffer is...

A good sizeof function does not have to read the string to find the sizeof.. The sizeof is stored in the memory alocation tables.

# Maurits on Thursday, September 15, 2005 11:27 AM:

I know I'm on the right track because my code keeps getting shorter. :)

/*
* TODO: C3_HIGHSURROGATE
* What on earth is C3_NOTAPPLICABLE?
* I can't even test for it, it's a zero bit
* Wait... let me guess...
* GetStringTypeW(CT_CTYPE3, &'\0', 1, &t) puts t = C3_NOTAPPLICABLE
* so it's a null sentinel
*
*/

LPTSTR CharNext(LPCWSTR ps)
{

if (ps == 0)
{
/* go boom somehow -- string pointer is null */
}

/* hope we don't get dropped in a sea of diacritics
* or this "for" could last a long, long time
*/
for(; *ps != '\0'; ps++)
{
/* There are arguments for and against pre-declaring t */
WORD t; /* just one please */
if (
!GetStringTypeW(
CT_CTYPE3, /* only CT_CTYPE3 is interesting */
ps,
1, /* JUST 1 */
&t /* forgive me for skipping the array nonsense */
)
)
{
/* go boom somehow -- GetStringTypeW failed */
}

/* is this wchar... */
if ( t &
(
C3_NONSPACING | /* a nonspacing character? */
C3_DIACRITIC | /* a nonspacing diacritic? */
C3_VOWELMARK | /* a nonspacing vowel mark? */
C3_LOWSURROGATE /* see following note */
/* The beauty of this is it compiles to a constant */
) != 0 /* for the pedantic */
)
/* TODO: production code should check for C3_HIGHSURROGATE and peek
* If the one after a high surrogate is a low surrogate that's a malformed wide string
*/
{
/* semantic continuation of the previous wchar - skip it */
} else
{
/* found it! break to avoid ps++ */
break; /* is there a "for" break in Microsoft C? */
}
}

/* Made it out of the for loop
* this might be the null terminator or it might be the next real character
*/
return ps;

}

# Maurits on Thursday, September 15, 2005 11:36 AM:

Er,
If the one after a high surrogate is a low surrogate that's a malformed wide string
should of course be
If the one after a high surrogate is not a low surrogate that's a malformed wide string

# Maurits on Thursday, September 15, 2005 11:49 AM:

last code is missing a ps++ in the beginning...

LPTSTR CharNext(LPCWSTR ps)
{

if (ps == 0)
{
/* go boom somehow -- string pointer is null */
}

if (*ps == '\0') { return ps; } /* at end of string */

for(ps++; *ps != '\0'; ps++)
{
WORD t; /* just one please */
if (
!GetStringTypeW(
CT_CTYPE3, /* only CT_CTYPE3 is interesting */
ps,
1, /* JUST 1 */
&t /* forgive me for skipping the array nonsense */
)
)
{
/* go boom somehow -- GetStringTypeW failed */
}

/* is this wchar... */
if ( t &
(
C3_NONSPACING | /* a nonspacing character? */
C3_DIACRITIC | /* a nonspacing diacritic? */
C3_VOWELMARK | /* a nonspacing vowel mark? */
C3_LOWSURROGATE /* see following note */
/* The beauty of this is it compiles to a constant */
) != 0 /* for the pedantic */
)
/* TODO: production code should check for C3_HIGHSURROGATE and peek
* If the one after a high surrogate is not a low surrogate that's a malformed wide string
*/
{
/* semantic continuation of the previous wchar - skip it */
} else
{
/* found it! break to avoid ps++ */
break;
}
}

/* Made it out of the for loop
* this might be the null terminator or it might be the next real character
*/
return ps;
}

# Shaun Bedingfield on Thursday, September 15, 2005 1:36 PM:

Since Michael was remarking about performance with wcslen, here is another performance question. Which is faster ++c or c++ and why?
Neither is the wrong answer.

# Shaun Bedingfield on Thursday, September 15, 2005 1:41 PM:

I hate that people always think production code means that something will test for every given eventuallity. This is great but sometimes it is not efficient enough.

Sometimes a malformed input occurs rarely enough and is costly enough that you are better off just documenting the precondition and letting people know that in some cases behavior is undefined.

Don't get me wrong, I think be aware of malformed input is great but there are reasons why products have checked and free builds or even different build configurations sometimes.

# Shaun Bedingfield on Thursday, September 15, 2005 1:52 PM:

In light of recent posts, I would like to introduce a feature that I feel should be added to blogs.

I call this feature an exhibit.

The exhibit is a document,code snippit, or production that parallels blog communication but is not constantly repeated.

Recommended features for exhibits are change/evolution tracking and the ability for people to write comments on the exhibit so as to easily point out details.

Think of an exhibit as a virtual canvas that tells a visual store to parallel the textual story told by the main blog. It would be really neat if products like Sharepoint integrated with .TEXT (Community Server) to support what I term exhibits.

Nested comments would also be a great feature for allowing the management of large blog discussions allowing for enhanced integration of both blog and forum worlds. I like the idea of an exhibit though..

Just my random comments of dubious value.

# Maurits on Thursday, September 15, 2005 2:03 PM:

> Which is faster ++c or c++ and why?
> Neither is the wrong answer...

The answer you're looking for is, ++c is faster because it doesn't make a temporary copy of the variable.

But if you have a decent compiler, and the increment is in a void context, the compiler will auto-convert c++ to ++c for you.

# Shaun Bedingfield on Thursday, September 15, 2005 2:16 PM:

Correct.. That's an easy one but I am not sure people always think about it.

# Maurits on Thursday, September 15, 2005 3:24 PM:

I suppose taking out
for(ps++; *ps != '\0'; ps++)

and replacing it with
while (*++ps != '\0')

might ease the compiler's burden a little :)

# Shaun Bedingfield on Thursday, September 15, 2005 5:12 PM:

Anyone up for writing a nondeterministic version of CharNext? You could use prolog or something. This really wouldn't be efficient.

# Michael S. Kaplan on Friday, September 16, 2005 10:17 AM:

Ok Maurits, I like the approach you took here, it is quite similar to what I was thinking (I had to add a little more stuff for back-compat with the existing function).

# Michael S. Kaplan on Saturday, September 17, 2005 3:12 AM:

Ok, the big problem is that the current code Maurits has here will continue on a low surrogate -- when that is the one where you would want stop. It is a high surrogate that has you wanting to continue in the same way that a diacritic does....

Which then raises the next question -- in the case of invalid surrogate code units like a row of high surrogates, would one skip past them all? Or would one actually stop at the first high surrogate that did not have a low surrogate after it?

# Maurits on Monday, September 19, 2005 7:34 PM:

I made the same goof again...

if (t & (...) != 0) // WRONG

if (t & (...)) // OK

or

if ((t & (...)) != 0) // OK

# Maurits on Thursday, September 22, 2005 7:42 PM:

>Ok, the big problem is that the current code Maurits has here will continue on a low surrogate -- when that is the one where you would want stop. It is a high surrogate that has you wanting to continue in the same way that a diacritic does....

Sorry, I still don't understand that. Can you elaborate? It would be very easy to skip high surrogates too, if that would be more appropriate. I can't see why skipping high surrogates but not low surrogates would be the right thing.

>Which then raises the next question -- in the case of invalid surrogate code units like a row of high surrogates, would one skip past them all? Or would one actually stop at the first high surrogate that did not have a low surrogate after it?

The code I have would return each high surrogate as "the next character", blissfully ignorant that each character is in fact malformed. I did consider how better to handle malformed input and came up with this:
// consider high-elements first
if (this is high)
{

if (next is low)
{
high/low pair -- what to do? Could
ALWAYS SKIP: Consider high/low pair as a non-spacing text element
in which case, skip them both and continue looking
OR
ALWAYS RETURN: Consider high/low pair as the "next character" we're looking for
in which case, return the address of the high part
OR
COME UP WITH A FORMULA: Come up with some means of determining whether this particular pair is "spacing" or not
}
else
{
MALFORMED STRING! Undefined behavior? Could
Blow up
Return the high surrogate
Return this non-low surrogate
}
}

// consider low-elements now
if (this is low)
{
if (previous was high)
{
Yes, I can look at the previous character without worrying about string ranges
Why? Because I did a ps++ to get here in the first place.
Anyway, if this is the second half of a high/low pair, it's certainly not the next character.
Unilaterally skip.
}
else
{
MALFORMED STRING! Undefined behavior as above.
}
}

# Michael S. Kaplan on Thursday, September 22, 2005 10:33 PM:

Hi Maurits!
What I meant was that was what you laid out in the the pseudo code below -- you can't always continue on the low since in the normal case you would return there....

# Maurits on Tuesday, October 04, 2005 2:59 PM:

It strikes me the fundamental problem with my approach is it relies on GetStringTypeW

This is a UTF-16-specific analysis function that only provides full metadata for characters in the BMP!

In order to handle supplementary code points correctly, I need a way to find metadata for all UTF-32 characters, including those > U+FFFF

So I need another helper function that takes UTF-32 data and returns metadata. Or perhaps a series of functions.

Of these flags...

C1_UPPER
C1_LOWER
C1_DIGIT
C1_SPACE
C1_PUNCT
C1_CNTRL
C1_BLANK
C1_XDIGIT
C1_ALPHA
C1_DEFINED
C2_LEFTTORIGHT
C2_RIGHTTOLEFT
C2_EUROPENUMBER
C2_EUROPESEPARATOR
C2_EUROPETERMINATOR
C2_ARABICNUMBER
C2_COMMONSEPARATOR
C2_BLOCKSEPARATOR
C2_SEGMENTSEPARATOR
C2_WHITESPACE
C2_OTHERNEUTRAL
C2_NOTAPPLICABLE
C3_NONSPACING
C3_DIACRITIC
C3_VOWELMARK
C3_SYMBOL
C3_KATAKANA
C3_HIRAGANA
C3_HALFWIDTH
C3_FULLWIDTH
C3_IDEOGRAPH
C3_KASHIDA
C3_LEXICAL
C3_ALPHA
C3_HIGHSURROGATE
C3_LOWSURROGATE
C3_NOTAPPLICABLE

the "odd ones out" are C3_HIGHSURROGATE and C3_LOWSURROGATE. These apply not to Unicode code points in general, but only to the UTF-16 encoding system!

I'd also like to see a Cn_PRIVATE for characters in a private plane range, and a Cn_RESERVED for characters in a reserved plane or range.

I suppose a lot of this comes from the Microsoft commitment to UTF-16 as a default encoding system... instead of picking between UTF-8/UTF-16/UTF-32 "as appropriate"* on a case-by-case basis

* Where the definition of "as appropriate" is left as an exercise ;)

# Michael S. Kaplan on Tuesday, October 04, 2005 3:07 PM:

I am not sure you can call this a flaw since the input to CharNext is UTF-16, so it makes sense that the input to GetStringTypeW is also UTF-16. It is not a flaw....

Every flag there applies to UTF-16 code units, not just the surrogate ones.

The key is the most efficient way to make use of that. :-)

# Maurits on Tuesday, October 04, 2005 4:28 PM:

Alright, so how can I get metadata on supplementary code points?

For example, how can I determine programmatically that OLD ITALIC LETTER A
http://www.fileformat.info/info/unicode/char/010300/index.htm

is a letter (C1_ALPHA) and is not nonspacing (!C3_NONSPACING)?

I notice that the "Microsoft-proprietary dotNet Properties" listed on the fileformat.info page don't display correct answers... and that UrlEncodeUnicode mis-encodes it as a combining grave accent (oops!)
COMBINING GRAVE ACCENT
http://www.fileformat.info/info/unicode/char/0300/index.htm

Or is that data not yet available programmatically?

# Michael S. Kaplan on Tuesday, October 04, 2005 4:36 PM:

You can't, on Windows, exceot by using the new macros (IS_HIGH_SURROGATE, IS_LOW_SURROGATE, IS_SURROGATE_PAIR), but no metadata is availble.

In the .NET Framework 2.0, CharUnicodeInfo is your friend. I suspect he is not using that yet?

# Maurits on Tuesday, October 04, 2005 4:48 PM:

You're referring to

CharUnicodeInfo.GetUnicodeCategory(Char)?
http://msdn2.microsoft.com/en-us/library/h6sx68ke

This returns a UnicodeCategory enumeration, sure enough:

UnicodeCategory
http://msdn2.microsoft.com/en-us/library/3hay9k69

which could easily be checked for
if ((u | UnicodeCategory.NonSpacingMark) != 0)

... but how do I feed the supplemental code point to GetUnicodeCategory in the first place? It only takes a Char, which is a UTF-16 code point:

Char
http://msdn2.microsoft.com/en-us/library/3hay9k69
"a character that is encoded as a base character, *surrogate pair*, and/or combining character sequence is represented by *multiple Char objects*" (my emphasis)

# Maurits on Tuesday, October 04, 2005 4:51 PM:

er, instead of
u | UnicodeCategory.NonSpacingMark
I mean, of course,
u & UnicodeCategory.NonSpacingMark

# Maurits on Tuesday, October 04, 2005 4:54 PM:

Char.ConvertToUtf32
http://msdn2.microsoft.com/en-us/library/wdh8k14a

It returns an int, natch... so I can easily get my supplementary code point... but I can't test it. :'(

# Michael S. Kaplan on Tuesday, October 04, 2005 5:01 PM:

CharUnicodeInfo's methods have overrides that take a string and an index -- if it points to a high surrogate and is followed by a low sdurrogate, you will get the property for the supplementary character.

HOWEVER, since the question was about solving the issue for Win32, the .NET Framework cannot be used. Lucvkily, no supplementary characters are combining at this point, so it is a non-idue at the moment....

# Maurits on Tuesday, October 04, 2005 6:44 PM:

Ah, that's actually quite helpful... but it's not mentioned in the documentation or the examples, only in your blog post of 1/28/05:
http://blogs.msdn.com/michkap/archive/2005/01/28/362305.aspx

So if I'm walking the string in UTF-16 characters, 0 to s.Length - 1, and I run across a high surrogate character at position i..

GetUnicodeCategory(s, i) will give me the categories of the supplementary code point
BUT
GetUnicodeCategory(s[i]) will give me the categories of the high surrogate?

Whereas both
GetUnicodeCategory(s, i + 1)
AND
GetUnicodeCategory(s[i + 1])
will give me the categories of the low surrogate? (That is, UnicodeCategory.Surrogate)

*sigh* once again the UTF-16 nature of the string is artificially exposed...

I'm beginning to think that String.Length is too ambiguous to be useful. How about a family of new properties to replace it...

String.LengthInBytesUnderCurrentInternalRepresentation
String.LengthInBytesAsUTF8
String.LengthInBytesAsUTF16
String.LengthInBytesAsUTF32
String.LengthInUTF32CodePoints
String.LengthInUTF16BytePairs
String.LengthInUTF8Bytes
String.LengthInUnicodeCharacters (= UTF32CodePoints)
String.LengthInTextElements (ParseCombiningCharacters)
String.LengthInCharacters (= number of times CharNext can be called)

(whew!)

# Michael S. Kaplan on Tuesday, October 04, 2005 8:12 PM:

Well, you can think of this blog as being where no doc has gone before? :-)

It is not artificial -- UTF-16 is what it is, and a method that can take a Unicode code point whether it is made up of one or two Unicode c ode units is useful.

No way I am going to touch that list of yours, though -- I don't agree with most of them. :-)

Have we officially abandoned the original Win32 problem?

# Maurits on Wednesday, October 05, 2005 3:22 PM:

Here's my latest iteration of the Win32 problem...
http://www.geocities.com/mvaneerde/charnext.html

I decided on how to handle errors. Still no ANSI version.

I'm about halfway through the VS2005 Beta 2 install to verify the CharUnicodeInfo thing.

# Michael S. Kaplan on Wednesday, October 05, 2005 3:40 PM:

That actually looks pretty amazing, Maurits!

# Michael S. Kaplan on Wednesday, October 05, 2005 3:41 PM:

Note that no change is required for the non-Unicode version, since you cannot have supplementary characters there!

# Maurits on Wednesday, October 05, 2005 5:06 PM:

This is going on my resume :)

"That actually looks pretty amazing, Maurits!"
-- Michael Kaplan

# Michael S. Kaplan on Wednesday, October 05, 2005 5:48 PM:

They may think you mean this guy:

http://www.mathematik.tu-muenchen.de/~kaplan/

or this guy:

http://www.imdb.com/name/nm0438325/

or this guy:

http://med.stanford.edu/ohns/faculty/kaplan.html

etc. :-)

referenced by

2006/07/15 How long is that non-Unicode string?

go to newer or older post, or back to index or month or day