Grease is the word; ░░░░░░ not so much...

by Michael S. Kaplan, published on 2008/11/10 10:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/11/10/9056364.aspx


The question from the other day was an interesting one. It was something like this:

I’m trying to do a word-boundary check, and I noticed regex doesn’t handle boundaries correctly for some extended characters (░╤╞╬═╣ etc.).

A simple example is “\b░” which should match “░” but doesn’t. Any normal character in front (“\bg░” : “g░”) will match correctly.

If I manually check for boundaries (^$\W\s etc.) it works correctly.

I haven’t found any of the regex options fix it.

Is this a known issue?

Does anyone have the equivalent pattern for \b so I can recreate it myself?

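First let's look at those characters. They are:

░ U+2591 LIGHT SHADE
╤ U+2564 BOX DRAWINGS DOWN SINGLE AND HORIZONTAL DOUBLE
╞ U+255E BOX DRAWINGS VERTICAL SINGLE AND RIGHT DOUBLE
╬ U+256C BOX DRAWINGS DOUBLE VERTICAL AND HORIZONTAL
═ U+2550 BOX DRAWINGS DOUBLE HORIZONTAL
╣ U+2563 BOX DRAWINGS DOUBLE VERTICAL AND LEFT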

Did you realize there was all this graphical crap in Unicode? :-)

All of them have a Unicode General Category of So, also known as Symbol, Other, which is what the CharUnicodeInfo class I mentioned earlier would call UnicodeCategory.OtherSymbol.
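If you want to check the category yourself, here is a minimal sketch. It uses Python's standard unicodedata module as a stand-in for CharUnicodeInfo (that swap is my assumption for illustration, not what the original question used), and the helper name general_categories is made up:

```python
import unicodedata

# The characters from the question.
CHARS = "░╤╞╬═╣"

def general_categories(text):
    """Map each character to its Unicode General Category abbreviation."""
    return {ch: unicodedata.category(ch) for ch in text}

for ch, cat in general_categories(CHARS).items():
    # Each line shows the code point, the character, and its category.
    print(f"U+{ord(ch):04X} {ch} {cat}")
```

Every one of them reports So, the two-letter abbreviation for Symbol, Other.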

And then we'll look at how \b is defined when it comes to regular expressions, in topics like Atomic Zero-Width Assertions:

| Assertion | Description |
| --- | --- |
| \b | Specifies that the match must occur on a boundary between \w (alphanumeric) and \W (nonalphanumeric) characters. The match must occur on word boundaries (that is, at the first or last characters in words separated by any nonalphanumeric characters). The match can also occur on a word boundary at the end of the string. |
| \B | Specifies that the match must not occur on a \b boundary. |

There we go -- the explanation!

It would be unrealistic to assume that a regular expression engine even remotely Unicode aware would think that ░ or any other symbol would be a \w character -- because those symbols aren't words!
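The behavior is easy to reproduce. The sketch below uses Python's re module rather than the .NET engine the question was about (an assumption on my part, chosen because Python documents \b the same way, as a boundary between \w and \W); the helper name matches is hypothetical:

```python
import re

def matches(pattern, text):
    """Return True if the pattern is found anywhere in the text."""
    return re.search(pattern, text) is not None

# ░ is not a \w character, so there is no \w/\W boundary in front of it.
print(matches(r"\w", "░"))     # False
print(matches(r"\b░", "░"))    # False

# With a letter in front, the boundary sits before the letter.
print(matches(r"\bg░", "g░"))  # True

# \B succeeds in exactly the spot where \b fails.
print(matches(r"\B░", "░"))    # True
```

Nothing is broken: \b never looks at whether a character is "visible" or "interesting", only at whether it is a word character.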

When this was pointed out, the person asking the question definitely didn't expect anything different here; he said:

That seems reasonable enough.

If I need to support this scenario (probably don’t) I can create my own \w patterns that include those Unicode characters, like [^\p{L}\p{Nd}\p{Pc}…].

which gives the workaround if anyone is looking for it (I suspect the actual need here to treat a symbol as a word would be pretty uncommon in text scenarios, as is the use of these symbols anyway).
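For the curious, a hand-rolled boundary might look roughly like the sketch below. It is in Python, whose standard re module has no \p{...} support, so instead of the categories in the quoted pattern it widens \w with two explicit ranges (Box Drawing, U+2500 to U+257F, and Block Elements, U+2580 to U+259F). That choice of ranges, and the names WORD, BOUNDARY and custom_boundary_search, are assumptions for illustration only. Python's \w is also close to, but not identical to, the .NET one.

```python
import re

# Hypothetical widened "word" class: \w plus the Box Drawing and
# Block Elements blocks (U+2500 through U+259F).
WORD = r"[\w\u2500-\u259F]"

# A \b replacement built from lookarounds: a position where exactly one
# side is a "word" character under the widened definition.
BOUNDARY = rf"(?:(?<={WORD})(?!{WORD})|(?<!{WORD})(?={WORD}))"

def custom_boundary_search(body, text):
    """Search for body preceded by the widened word boundary."""
    return re.search(BOUNDARY + body, text)

print(custom_boundary_search("░", "░") is not None)   # True
print(custom_boundary_search("░", " ░") is not None)  # True
print(custom_boundary_search("░", "a░") is not None)  # False
```

The last case fails on purpose: once ░ counts as a word character, "a░" is a single word and there is no boundary in the middle of it.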

 

This blog brought to you by the previously mentioned symbols, obviously!


# Josh on 10 Nov 2008 1:58 PM:

"Did you realize there was all this graphical crap in Unicode? :-)"

Not only realized it, but created several scripts to automatically generate them in fonts. What is even more funny (as in "sad; pathetic") is that some of our clients, who shall remain nameless, have requested bold and italic versions of these characters...really...

# Michael S. Kaplan on 10 Nov 2008 2:14 PM:

Given who the client is, Josh, I have to agree with the sad/pathetic tags. :-)

# Centaur on 11 Nov 2008 5:15 AM:

I suspect the use case is Ctrl+Left/Right cursor movement. The logic is not quite trivial there.

