The Importance of Locale Knowledge: Or how to avoid breaking stuff in Turkey

Sometimes, the little quirks and abstraction fails of a programming language can mess things up in really subtle and specific ways. Especially when it comes to localization.

I spent most of last week wracking my brains over an oddly specific bug in our product. Namely, when you installed our server software on a Turkish copy of Windows...it didn't work. More specifically, everything started up, but nobody could login.

Cutting a long story short (a dull story about a week long mostly involving reading logs) we figured out that this was because some message keys were getting truncated and ignored when parsed in Java. No message keys, no servers can talk to each other, no logins.

A little digging revealed that the truncation was caused by a call to String.toLowerCase(). Quite why were converting a Base-64 id to lower case is lost in the history of our implementation, but we do, and it broke everything.

String.toLowerCase() is locale specific, but because we were doing this to a Base-64 string, we didn't think it would be a problem. The valid character set for the manner of Base-64 we use is as below:

ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/

We expected this would have the same upper/lower case mappings for all languages. Which it does, except in Turkish.

.toLowerCase() for English: abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz0123456789+/
.toLowerCase() for Turkish: abcdefghıjklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz0123456789+/

It's hard to see, but the converted lower case 'i' in Turkish doesn't have a dot above it. A subtle difference, but enough to confuse the heck out of the regex we used to parse our message keys.

The solution was pretty simple: specify an English locale when we manipulate the case of a Base-64 String:

id.toLowerCase(Locale.ENGLISH);

We know the character set in use, so it wouldn't cause any problems forcing the locale in this manner.

Being an unusual and interesting, case, I dug a little deeper.

String.equalsIgnoreCase() uses a completely separate mechanism for testing case matches, namely Character.toUpperCase() and Character.toLowerCase(), which isn't locale specific. So seemingly equivalent calls will give different results:

/* Assuming a Turkish locale */
// Will return true;
"I".equalsIgnoreCase("i");
// Will return False
"I".toLowerCase().equals("i");

So the upshot of all this: if you want someone to use your software in Turkey, be very careful about how you do String manipulation. Istanbul was Constantinople, which was a lot easier to localize.

The Other Tom Elliott

The Importance of Locale Knowledge: Or how to avoid breaking stuff in Turkey

Apps

Speaker Alert

Phone Status Glance

GitHub Bulk Unwatch