What if my strings are > 2 gb?

by Michael S. Kaplan, published on 2005/08/23 15:30 -07:00, original URI: http://blogs.msdn.com/michkap/archive/2005/08/23/455339.aspx

We do get our fair share of silly questions here in NLS.

I should perhaps explain what I mean by silly. :-)

I don't think I'd ever call a question silly when somebody is asking about language and how it might work in a certain situation. I mean, that's how people learn. Those are the kinds of questions that I ask of native speakers and of linguists, and even if they smile or laugh I never get the sense that they are thinking me silly for the question.

But today, somebody who is thinking about 64-bit Windows and who assumed that one day strings that are greater than 2 GB would be common looked at our signature for CompareString:

int CompareString(
    LCID Locale,
    DWORD dwCmpFlags,
    LPCTSTR lpString1,
    int cchCount1,
    LPCTSTR lpString2,
    int cchCount2
);

and suggested that perhaps those int parameters containing the string lengths ought to be size_t instead.
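For a caller who already tracks lengths as size_t, the practical effect of those int parameters is a range check before the call. Here is a minimal sketch of that check; the helper name to_cch_count is made up for illustration and is not part of the NLS API, and it assumes the usual 32-bit int.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of the NLS API: converts a size_t length
   into the int count that CompareString takes. Returns false when the
   length does not fit, so the caller decides what to do instead. */
bool to_cch_count(size_t length, int *out_cch)
{
    assert(out_cch != NULL);
    if (length > (size_t)INT_MAX) {
        return false;  /* more than 2^31 - 1 characters with a 32-bit int */
    }
    *out_cch = (int)length;
    return true;
}
```

A caller would pass the resulting count as cchCount1 or cchCount2 only when the helper returns true.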

Now I would like to forget about the argument that this is a public API that has been around since NT 3.1. It's obviously important here, and makes the suggestion a little bit silly, but not everyone really pays attention to what's in the NLS API or how long it's been there.

I'd also like to forget about the argument that 2 GB strings are uncommon, because one day they may not be. Especially in the 64-bit world. There may be a perfectly valid reason to have huge strings.

The real problem I have here, and what makes the question silly to me, is the notion that you need to do linguistic comparisons on strings that are greater than 2 GB in size.

There is simply no way to justify this as a reasonable use of the collation functionality in the NLS API.

Perhaps some of you may disagree with this notion, and I'll be curious how people respond to this post. If you are somebody who disagrees, please be sure to include information about your "reasonable example" so that people have a chance to appropriately judge the judgment being used. :-)


This post brought to you by "§" (U+00A7, a.k.a. SECTION SIGN)

# shaunbed on Tuesday, August 23, 2005 7:14 PM:

I don't know.. Maybe someone would one day want to compare several hundred copies of "War and Peace" at the same time.


# Wesner Moise on Tuesday, August 23, 2005 7:18 PM:

In the distant future, when AI takes over and the world moves from 32bit to 1024bit computers, and computers being much smarter than humans are able to process a lot (lot) more information, machines may want to talk to other machines.

Of course Windows 2100 will be available and the NLS APIs may long be obsolete.

# CN on Tuesday, August 23, 2005 7:22 PM:

Well, first I want to know:

Does CompareString succeed with 2 GB strings on Win64? Is there some slight non-linearity in the runtime behavior that makes it practically impossible (with expected running time in the range of several years or something)?

That would be the REAL showstopper! What if you can't even do a linguistic comparison of two CDs with a single API call?

# Michael S. Kaplan on Tuesday, August 23, 2005 7:28 PM:

Hi shaunbed -- exactly!

Hi Wesner -- even if we process more information, the need to compare that much information is simply not a huge scenario. :-)

Hi CN -- well, we will have to stop the show, in that case. But I think you are mistaken -- a binary comparison is what you would want in that fairly obscure scenario....

# Dean Harding on Tuesday, August 23, 2005 8:14 PM:

Well, if you think about it, even "War and Peace" is only about 600,000 words, which is probably around 5,000,000 characters (maybe?) - *much* less than the theoretical maximum of 2 billion. You'd have to be comparing something 400 times longer than "War and Peace" to hit the limit!!

But even if you wanted to compare something that long, then surely a more sensible way would be to do it line-by-line. It would seem obvious that any comparison that big would take a loooong time, and at least that way you could provide some feedback to the user that *something* was happening (a progress bar, for example).

# shaunbed on Tuesday, August 23, 2005 8:51 PM:

The compare will not take long enough for a progress bar.

I once had "War and Peace" and all of Shakespeare's plays on my PDA with 8 MB of RAM. That is an easy squeeze for memory.

Efficiency depends on the comparison but with a binary comparison you can compare several bytes at one time. The operation is memory limited as the cpu will outperform the memory accesses. DDR400 has a peak theoretical bandwidth of 3.2 GB/s without considering dual channel. This means that a 2 GB compare will probably run in less than 1 s. Why you would want to do this is a mystery...

What does take time is the time to copy the file to memory. Reading a 2 GB harddrive would probably take about 1 minute on most systems. This operation would need a bar.

# Michael S. Kaplan on Tuesday, August 23, 2005 9:06 PM:

Hey Dean!

Luckily, there is no need -- the first character that has a different primary weight will cause CompareString to exit -- and therefore it is fast.
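To make the early exit concrete, here is a toy sketch in C. It is not how CompareString is implemented: the "primary weight" below is just a case-folded character, secondary and tertiary weights are ignored entirely, and every name in it is made up for illustration. Only the 1/2/3 return convention mirrors CompareString.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Toy stand-in for a primary weight: letters compare without regard to
   case. Real collation weights are far richer than this. */
static int toy_primary_weight(char c)
{
    return tolower((unsigned char)c);
}

/* Returns 1 (less), 2 (equal) or 3 (greater). *examined receives the
   number of character positions that had to be read. */
int toy_compare(const char *a, int cch_a, const char *b, int cch_b,
                size_t *examined)
{
    assert(a != NULL && b != NULL && examined != NULL);
    int shared = cch_a < cch_b ? cch_a : cch_b;
    *examined = 0;
    for (int i = 0; i < shared; i++) {
        int wa = toy_primary_weight(a[i]);
        int wb = toy_primary_weight(b[i]);
        (*examined)++;
        if (wa != wb) {
            return wa < wb ? 1 : 3;  /* first primary difference decides */
        }
    }
    if (cch_a == cch_b) {
        return 2;
    }
    return cch_a < cch_b ? 1 : 3;    /* the shorter string sorts first */
}
```

Comparing "Apple pie" with "Axle" stops after two positions, while two strings with identical weights have to be read to the end.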

The string loading is something the caller controls -- so the caller can put up UI if they want to....

# Dean Harding on Tuesday, August 23, 2005 9:24 PM:

> Luckily, there is no need -- the first character that has a different primary weight will cause CompareString to exit

Well, I was assuming that you would be comparing two 2GB strings that were equal, in which case it'd have to go through the whole thing...

In fact, I had a look at the download for "War and Peace" on Project Gutenberg, and it's only 3.15MB which means my estimate of 5 million characters was a bit over. You'd have to be comparing something *650 times longer* than "War and Peace" to hit the 2GB "limit".

I can't imagine why there would be any reason to have two strings 650 times longer than the entire text of "War and Peace" loaded into memory, ready to be compared with CompareString.

Oh, another thought: if you pass -1 as the sizes, would it work with strings longer than 2GB then? Depends how it works internally, I suppose. At least if it does at the moment, and strings > 2GB really were needed, you could *make* it work in the case of a NULL-terminated string without changing the public interface...

# shaunbed on Tuesday, August 23, 2005 9:48 PM:

Not that it makes any difference but..

Project Gutenberg's W&P is not in Unicode..

# shaunbed on Tuesday, August 23, 2005 9:53 PM:

Actually int can handle strings up to approximately 4 GB in size as long as they are in 16 bit Unicode. It just can't handle strings over ~2^31 characters long :)

# Michael S. Kaplan on Tuesday, August 23, 2005 10:02 PM:

Actually, we were talking about the parameter that controls the length in WCHARs -- we can handle 2^31 - 1 WCHARs, or 2147483647. This is approximately 2gb. :-)
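The two figures in this exchange differ only in units, and the arithmetic is short enough to write down. This is a sketch assuming a 32-bit int and 2 bytes per WCHAR (as on Windows); the function name utf16_bytes is made up for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Bytes occupied by cch UTF-16 code units, assuming 2 bytes per WCHAR. */
uint64_t utf16_bytes(int cch)
{
    assert(cch >= 0);
    return (uint64_t)cch * 2u;
}
```

With a 32-bit int, utf16_bytes(INT_MAX) is 2 * (2^31 - 1) = 4,294,967,294 bytes, which is 4 GB minus two bytes of storage for 2,147,483,647 WCHARs.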

# Piotrek on Wednesday, August 24, 2005 12:37 AM:

"Data density in DNA is also hugely bigger than standard computers. A DNA strand has the bases A, T, C, and G spaced evenly 0.35 nanometers apart on it. This means that, if there is one base per square nanometer, the data density of one square inch is close to a million gigabytes. In a standard computer, data density is close to 100,000 time smaller, around 7 gigabytes per square inch"

but tell him not to use a string :)))

# Michael S. Kaplan on Wednesday, August 24, 2005 12:48 AM:

Hi Piotrek -- Ah yes, but not only is string probably the wrong way to store it, but (more importantly for our purposes) CompareString is definitely not the right way to try to compare it!

# Rosyna on Wednesday, August 24, 2005 1:04 AM:

dur, because I download my stargate SG-1 season sets to a string. Then i compare what I have versus what someone else has using a client-server process on the internets.

# Michael S. Kaplan on Wednesday, August 24, 2005 1:09 AM:

Hi Rosyna -- there is still nothing useful that CompareString can do here -- they will not be equal, ever. You can have the answer that CompareString would give you without even making the call....

To tell the truth, not even a binary compare would help you. To get such a comparison done, you would need something that could read out title info from the file?

# Maurits on Wednesday, August 24, 2005 1:08 PM:

sToday = System.Disk(0).ToString() -> formats contents of C: drive as an XML string

sYesterday = System.History(Date() - 1).Disk(0).ToString() -> formats yesterday's contents of the C: drive as an XML string

bHasAnythingChangedSinceYesterday = (0 <> CompareString(sToday, sYesterday))

Only a little contrived.

# Rosyna on Wednesday, August 24, 2005 1:15 PM:

I was talking more about using CompareString as a really, really lame checksum type check. CompareString wouldn't return two if you loaded the same string twice from the same file?

# Michael S. Kaplan on Wednesday, August 24, 2005 1:16 PM:

Maurits -- A binary comparison is still perfect there -- you expect a rash of people changing to equivalent forms of characters?

Remember the challenge -- a reasonable scenario for a LINGUISTIC comparison....

# Michael S. Kaplan on Wednesday, August 24, 2005 1:17 PM:

Dean -- I somehow doubt that passing -1 would work here. Greater than 2gb is just a bad idea, all the way around.

# Michael S. Kaplan on Wednesday, August 24, 2005 1:19 PM:

Rosyna -- not even a binary comparison would work there -- two people recording the same episode will produce two different binary representations.

A LINGUISTIC comparison will not do any better.

# josh on Wednesday, August 24, 2005 11:26 PM:

shaunbed's right, it can be up to 4GB minus two bytes. #define UNICODE and try the math again. ;)

# Michael S. Kaplan on Wednesday, August 24, 2005 11:35 PM:

Sorry Josh, we are not talking about how much space UTF-16 code points take up. We are talking about the count of WCHARs -- and you can have up to almost 2gb of them. Since the function is not asking for bytes, neither is the blog entry....

# josh on Friday, August 26, 2005 1:04 AM:

Ok, well next time you want GB to mean something other than gigabyte, it'd be nice if you'd say so up front. :P

# Yuhong Bao on Sunday, April 19, 2009 3:52 AM:

Remind me of the REP string instructions on x64. In 64-bit mode they use RCX, which is 64-bit. Problem is, many x64 CPUs have errata relating to RCX values exceeding 32-bit being used with these instructions.
