Saturday, February 16, 2008

Does Your Code Pass The Turkey Test?

Over the past 6 years or so, I've failed each item on "The Turkey Test." It's very simple: will your code work properly on a person's machine in or around the country of Turkey? Take this simple test.

  1. Parsing dates from a configuration file using DateTime.Parse(string):

    Does it pass "The Turkey Test?"

    Nope:



    Reason: Turkish people write July 4th, 2008 as "04.07.2008"

    Fix: Always specify what format your date is in. In this case, we use DateTimeFormatInfo.InvariantInfo, which just happens to be USA English's format (more or less):



    Which gives us what we were expecting:


    Scott Hanselman likes to talk about DateTimes. (Be sure to see his DateTime interview question).
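
A minimal sketch of the failure and the fix from item 1. The config value "07/04/2008" and the variable names are assumptions made for illustration; the Turkish culture is passed explicitly so the effect shows up on any machine, not only one configured for Turkey.

```csharp
using System;
using System.Globalization;

// Hypothetical config value: July 4th, 2008 written the US way.
string configured = "07/04/2008";

// Culture-sensitive parse. DateTime.Parse(string) with no provider behaves
// like this on a machine whose current culture is Turkish (day comes first).
DateTime asTurkish = DateTime.Parse(configured, new CultureInfo("tr-TR"));

// Explicit provider: the same answer on every machine.
DateTime asInvariant = DateTime.Parse(configured, DateTimeFormatInfo.InvariantInfo);

// Stricter still: state the exact pattern the file is expected to use.
DateTime asExact = DateTime.ParseExact(configured, "MM/dd/yyyy", CultureInfo.InvariantCulture);

// Print in an unambiguous form, again with an explicit provider.
Console.WriteLine(asTurkish.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture));
Console.WriteLine(asInvariant.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture));
Console.WriteLine(asExact.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture));
```

The first line printed should differ from the other two, since the Turkish culture reads the leading "07" as the day. ParseExact is the more defensive choice for config files because it rejects anything that does not match the stated pattern.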
  2. Ok, ok. You knew about dates. I sort of did, but I still got it wrong the first few times. What about this seemingly simple piece of code:



    Does it pass "The Turkey Test?"

    Nope:



    Reason: Turkish people use a period to group digits (like people in the USA use a comma). Instead of getting a 4.5% discount like you intended, Turkish people will be getting a 45% discount.

    Fix: Again, always specify your format explicitly:



    Which saves your company from having to go out of business from having too high of discounts:
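
A minimal sketch of item 2, assuming a hypothetical discount value of "4.5" read from a config file. The variable names are illustrative, and the Turkish culture is passed explicitly to stand in for a Turkish machine's current culture.

```csharp
using System;
using System.Globalization;

// Hypothetical discount percentage as stored in a config file.
string configured = "4.5";

// Culture-sensitive parse: in tr-TR the period is the digit-grouping
// separator and the comma is the decimal separator.
double asTurkish = double.Parse(configured, new CultureInfo("tr-TR"));

// Explicit provider: the period is always the decimal separator.
double asInvariant = double.Parse(configured, NumberFormatInfo.InvariantInfo);

Console.WriteLine(asTurkish.ToString(CultureInfo.InvariantCulture));
Console.WriteLine(asInvariant.ToString(CultureInfo.InvariantCulture));

// Writing the value back out needs the same care as reading it in.
string roundTripped = asInvariant.ToString("0.0", NumberFormatInfo.InvariantInfo);
Console.WriteLine(roundTripped);
```

The point of the last line is consistency: a value written with one provider and read with another can silently change by a factor of ten.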


  3. Say your application reads in some command line parameter:



    Forget about Turkey, this won't even pass in the USA. You need a case insensitive compare. So you try:

    String.Compare(string,string,bool ignoreCase):



    Or using String.ToLower():



    Or String.Equals with CurrentCultureIgnoreCase:


    Or even a trusty Regular Expression:



    Do any of these pass "The Turkey Test?"

    Not a chance!

    Reason: You've been hit with the "Turkish I" problem.






    As discussed by lots and lots of people, the "I" in Turkish behaves differently than in most languages. Per the Unicode standard, our lowercase "i" becomes "İ" (U+0130 "Latin Capital Letter I With Dot Above") when it moves to uppercase. Similarly, our uppercase "I" becomes "ı" (U+0131 "Latin Small Letter Dotless I") when it moves to lowercase.

    Fix: Again, use an ordinal (raw byte) comparer, or invariant culture for comparisons, unless you absolutely need culturally based linguistic comparisons (which give you uppercase I's with dots in Turkey).


    Or

    Or

    And finally, a fix to our Regex friend:
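
A minimal sketch of the "Turkish I" problem and the ordinal and invariant fixes from item 3. The argument value "FILE" and the keyword "file" are assumptions chosen because they contain an "i"; the Turkish culture is passed explicitly rather than relying on the machine's settings.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical command-line value and the keyword it should match.
string arg = "FILE";
var turkish = new CultureInfo("tr-TR");

// Culture-sensitive lowercasing: under tr-TR, "I" lowercases to the
// dotless "ı" (U+0131), so this comparison is expected to fail.
bool viaTurkishLower = arg.ToLower(turkish) == "file";

// Fixes: ordinal or invariant operations ignore the machine's culture.
bool viaEquals = string.Equals(arg, "file", StringComparison.OrdinalIgnoreCase);
bool viaCompare = string.Compare(arg, "file", StringComparison.OrdinalIgnoreCase) == 0;
bool viaInvariantLower = arg.ToLowerInvariant() == "file";
bool viaRegex = Regex.IsMatch(arg, "^file$",
    RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

Console.WriteLine($"ToLower(tr-TR): {viaTurkishLower}");
Console.WriteLine($"Equals ordinal: {viaEquals}");
Console.WriteLine($"Compare ordinal: {viaCompare}");
Console.WriteLine($"ToLowerInvariant: {viaInvariantLower}");
Console.WriteLine($"Regex invariant: {viaRegex}");
```

Of the four fixes, the ordinal comparison is usually the simplest to reason about for keywords, file names, and protocol names, because it never consults culture data at all.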

  4. My final example is especially embarrassing. I was actually smug when I wrote something like this (note the comment):



    Does this simple program pass "The Turkey Test?"

    You're probably hesitant to say "yes" ... and rightly so. Because this too fails the test.

    Reason: As Raymond Chen points out, there are more than 10 digits out there. Here, I use real Arabic digits (see page 4 of this code table):



    Fix: RegexOptions.CultureInvariant won't help you here. The only option is to explicitly specify the character range you mean:



    Or use the RegexOptions.ECMAScript option. In ECMAScript, "\d" means [0-9] which gives us:
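
A minimal sketch of item 4, assuming a hypothetical five-digit postal code check. The Arabic-Indic digits U+0661 through U+0665 are used as the non-ASCII input; the patterns and names are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

string ascii = "12345";
// ARABIC-INDIC DIGITS ONE through FIVE, all in Unicode category Nd.
string arabicIndic = "\u0661\u0662\u0663\u0664\u0665";

// By default, \d in .NET matches any Unicode decimal digit.
bool loose = Regex.IsMatch(arabicIndic, @"^\d{5}$");

// Fix 1: spell out the character range that is actually meant.
bool strict = Regex.IsMatch(arabicIndic, @"^[0-9]{5}$");

// Fix 2: ECMAScript-compliant behavior, where \d means [0-9].
bool ecma = Regex.IsMatch(arabicIndic, @"^\d{5}$", RegexOptions.ECMAScript);

Console.WriteLine($"\\d accepts Arabic-Indic digits: {loose}");
Console.WriteLine($"[0-9] accepts Arabic-Indic digits: {strict}");
Console.WriteLine($"ECMAScript \\d accepts Arabic-Indic digits: {ecma}");
Console.WriteLine($"[0-9] accepts ASCII digits: {Regex.IsMatch(ascii, @"^[0-9]{5}$")}");
```

This matters most when the validated string is later handed to code that only understands ASCII digits, since the regex would otherwise wave through input the rest of the program cannot handle.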

"The Turkey Test" poses a very simple question, but yet is full of surprises for guys like me who didn't realize all the little details. Turkey, as we saw above, is sort of like "New York, New York" in the classic Frank Sinatra song:

"These little town blues, are melting away
I'll make a brand new start of it - in old New York
If I can make it there, I'll make it anywhere
It's up to you - New York, New York"

If your code properly runs in Turkey, it'll probably work anywhere.

This brings us to the logo program:

"Turkey Test" Logo Program Requirements:

  1. Read Joel Spolsky's basic introduction to Unicode to understand the absolute minimum about it.
  2. Read Microsoft's "New Recommendations for Using Strings in Microsoft .NET 2.0" article and this post by the BCL team.
  3. Always specify the culture and number formatter for every string operation, parse, and regular expression you use.
  4. If you read data from the user and want to process it in a language-sensitive manner (e.g. sorting), use the CurrentCulture. If none of that matters, really try to use Ordinal comparisons.
  5. Run FxCop on your code and make sure you have no CA1304 (SpecifyCultureInfo) or CA1305 (SpecifyIFormatProvider) warnings.
  6. Unit test string comparing operations in the "tr-TR" culture as well as your local culture (unless you actually live in Turkey, then use a culture like "en-US").
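
Requirement 6 can be sketched without any particular test framework. The helper UnderCulture and the function under test IsFileScheme are hypothetical names invented for this illustration; the only library feature relied on is the settable CultureInfo.CurrentCulture property.

```csharp
using System;
using System.Globalization;

// Hypothetical code under test: an ordinal, case-insensitive keyword check.
static bool IsFileScheme(string scheme) =>
    string.Equals(scheme, "file", StringComparison.OrdinalIgnoreCase);

// Hypothetical helper: evaluate a check with the current culture switched,
// restoring the previous culture afterwards.
static bool UnderCulture(string cultureName, Func<bool> check)
{
    CultureInfo saved = CultureInfo.CurrentCulture;
    CultureInfo.CurrentCulture = new CultureInfo(cultureName);
    try { return check(); }
    finally { CultureInfo.CurrentCulture = saved; }
}

// Run the same check under a local culture and under Turkish.
foreach (string name in new[] { "en-US", "tr-TR" })
{
    bool ok = UnderCulture(name, () => IsFileScheme("FILE"));
    Console.WriteLine($"{name}: {(ok ? "pass" : "FAIL")}");
}
```

Swapping the body of IsFileScheme for a culture-sensitive version, such as comparing scheme.ToLower() to "file", is a quick way to confirm that the tr-TR run really does catch the "Turkish I" mistake.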

Having successfully passed the above requirements, your software will finally be able to wear "Passed 'The Turkey Test'" logo with pride.

Note: Special thanks to my coworker, Evan, for calling the 3rd type of error "Turkeys." Also, thanks to Chip and Dan Heath for the "Sinatra Test" idea.



67 comments:

Anonymous said...

That's a great way of demonstrating the various locale problems, thanks!

RabidHamster said...

Suggest you make that the Iraq test or the Yemen test. Arabic is more challenging since it's bi-directional (no, it isn't RTL; embedded numbers or latin characters are LTR) and the letters themselves are "shaped", i.e. the precise glyph differs based on preceding and following character(s).

Not to mention that inflective characters are optional, multiplying the regex complexity.

--

tbpmd

ilker Aksu said...

Thank you for invaluable post. I hope that this kind of knowledge may decrease our pains.

I very liked the idea of Turkey Test Logo.

Just a small reminder. In Turkey we do not use Arabic letters.
Just In case someone confuse with last sample.
Regards

Anonymous said...

About the dates, Indians (and probably the British as well) put the day first, it's very strange that American's do it another way!

Anonymous said...

Anonymous, Programmers prefer recording dates in YYYY-MM-DD order, most to least significant. That way, lexical sorting is also chronological. We find it strange that anyone, anywhere would choose to record dates any other way.

Jeff Moser said...

Thanks for the great feedback everyone!

Rabidhamster: I had no idea about the nuances of Arabic bi-directionality. I haven't met that issue yet in my code. I'll have to read up more on that to see how to best handle in code. Any specific advice?

Ilker Aksu: Thanks for pointing out the Arabic numbers. After doing a little more researching, I realized that Turkish people didn't use Arabic numbers. That's why I made a slight update of "in or around the country of Turkey."

In regards to the anonymous comments: I agree that MM/DD/YYYY makes no sense. Often, I resort to writing dates as "16 Feb 2008" to avoid all confusion. I use that when I write things by hand. Otherwise, I'll use the XML date format.

You make a good point about ASCII sorts of dates. I use the YYYYMMDD format for things like status reports since they sort so well.

Huseyin Tufekcilerli said...

As a developer living in Turkey, we always face that dotted and dotless I problem. Most software doesn't care enough about i18n and l10n, at least for Turkish. There is also a wikipedia page about the problem.

A similar issue exists in Excel when you are using your OS with Turkish regional settings. The problem is when you save an Excel Worksheet as a CSV (comma-separated values) file, you get a text file where values are separated with semicolons. Probably Excel uses the "List separator" character for separating values.

Jeff Moser said...

Huseyin: Great feedback! Thanks for the Wikipedia link. Also, I had no idea about the CSV issue. That's still yet another thing to keep in mind when doing i18n.

Paul Kuliniewicz said...

Isn't there a problem assuming Turkey uses five-digit ZIP codes in the first place?

Huseyin Tufekcilerli said...

@Paul: Turkish ZIP codes are 5 digits.

Jeff Moser said...

Paul: The code snippet was just there to present how silly I was that I thought the code was correct because I used a regex to check it.

Huseyin: The fact that Turkey uses 5 digit zipcodes gives me too much credit and might actually mislead people into thinking that I knew what I was doing :) Thanks for the info.

Anonymous said...

For the love of *insert favorite love*, don't do this. What this code will do concretely is make your application behave DIFFERENTLY IN DIFFERENT LOCATIONS AROUND THE GLOBE.

Just use the ISO standard, and assume people know the standard.

Because suppose you get a ticket reservation system in this way. It has servers in, say Europe and in America. One is the fallback for the other.

You travel on 4/7/2008, or, if the European server was down at the time of email, something you won't know and have no control over, on 7/4/2008.

Better hope you have a damn broad ping utility standing by for scanning your emails.

Anonymous said...

Oh and btw, "arabic numbers" is a misnomer. The numbers are Hindu, not arabic. Zero, like the "arabic" numerals, is not arabic at all, but Hindu in origin (it originated in India, in present-day pakistan, before it became islamic, arabic is a double misnomer since they associate themselves as both a religion and an ethnic group, neither of which had anything to do with the idea of Hindu numbers, Hinduism, on the other hand, relates to both the ideology and the ethnicity of the people who came up with the numbers, and the zero)

Jeff Moser said...

I agree with using standard formats for dates. I'd probably use the XML date format so that it's absolutely unambiguous.

Dates are just one thing. With things like string comparisons, it's not intuitive to a guy coming out of college in most places.

As the referenced paper mentions, checking URIs for the "file" protocol(knowingly or unknowingly) using a culture sensitive comparer could be dangerous due to the "Turkish I" issue.

I guess the overarching theme I was going for, and I still stand by it, is that you have to think about the globalization impacts of your code. It's hard to abstract away if your code is to correctly operate around the world.

Niki said...

Great Post! A little not-so-well-known addition: If you parse CSV (comma-separated-values) files, don't assume they're actually separated by commas. In a German locale, for example, they're separated by semicolons. Use CultureInfo.TextInfo.ListSeparator instead of ','.

Rainer said...

I personally use for dates the number of days since the 1 JAN 1968, as in the Pick/Universe world for the internal representation of dates. Makes it easy to calculate 30 or 90 days off a certain date for compares and can be easily converted into any local external date format. Today is day 14658. By the way we had no problems with Y2k at all.

Anonymous said...

MM/DD/YYYY format is misleading and counter-intuitive, abondon on it :) http://en.wikipedia.org/wiki/ISO_8601 defines YYYY-MM-DD which I prefer

Anonymous said...

This might be also called the Polish test with pun intended. We Poles write dates just like Turkish people. DD.MM.YYYY.

Anonymous said...

2 comments:
1) Only 10 digits used in Turkey, so a regex for [0-9] is sufficient. Perhaps you should've done a bit more research about Turkey and what's used there before naming the test as "the Turkey test". I understand that you're trying to make a point in locale problems but I guess you can name it differently (locale test... middle east test.. eastern europe test w/ Arabic numbers.. etc.)

2)
for the ignorant ones.

Jeff Moser said...

Niki: Thanks for the CultureInfo.TextInfo.ListSeparator tip. I didn't know about that member.

Thanks for the concerns about digits in Turkey. As mentioned above in a comment, I qualified the post as "in or around" Turkey since you're absolutely right that most Turkish people use [0-9].

I tried to pick a locale that would highlight a lot of unexpected results and had to stretch it for Turkey for digits.

Again, the overarching theme that I wanted people to take away was the concept of Turkey as a "Sinatra Test" in the sense that if you work hard to make your code work well in Turkey by using the best practices outlined in the string paper, your code will probably work anywhere.

Omur said...

thanks

Mike Petry said...

You need to email me or something. Try my work email, my middle initial is C.

amiroff said...

It would be worth noting that all those cases mentioned above are also valid for other Turkic people using latin alphabet like Azerbaijanis Uyghurs, Tatars etc... (see http://en.wikipedia.org/wiki/Uniform_Turkic_Alphabet)

Great and informative article, right in spot. Thanks!

David Nelson said...

Since everyone likes bashing the "American" date format (MM/DD/YYYY), I would just like to point out that it does have a good reason for existing: it mirrors the way most Americans (and most English speakers as far as I know) pronounce dates aloud, e.g. "March Thirteenth, Two Thousand [and] Eight". Since all such formats have to be interpreted relative to pre-conceived ideas, I don't think that format is any less valid than any other. That said, it is terrible for sorting, which is the only reason that I typically don't use it.

Peter Morris said...

Rainer said "By the way we had no problems with Y2k at all".

No, we had problems with "day ten thousand" causing havoc on alpha-sorted indexes 4.5 years earlier instead. So don't go making out it is perfect :-)

Clifton said...

Aren't most of your examples showing exactly the wrong way to fix it?

Your suggestions address what would be needed to make the application work for an American who happens to be living in some other country and running your software on the operating system version for that country. It would absolutely not work the way most people in that country would want!

For example, your proposed code ensures that wherever in the world someone uses your application, it will require them to input dates in American format, enter numeric values like Americans, and so on. The problem is that if a Turkish, English, or German customer enters the date "07.04.2008", you can be virtually certain they want it to be treated as April 7, 2008; if a Turk (or Frenchman) enters "0,50" for the percentage discount, you can be virtually certain they want to give a 1/2 percent discount, not a 50% discount.

Instead, shouldn't you be showing examples of how to change your displays and input validation code to match the local culture? Then you will have a usefully internationalized application.

Rich M said...

Got here from Scoble!

I'm a web/software developer in the UK. I used to work for a large multinational with sites in both the UK and the US. Whilst the problem of date formatting etc did come up, I solved it by adhering to the standards set by the ISO, which is what I guess I'd urge anyone to do. After all, what use is a standard if no one uses it?

I let users know how to enter data in the application by adding appropriate labels to text boxes (e.g. "Date (YYYY/MM/DD)"). This seemed to work quite well but then I never had to deal with anyone in Turkey!

R

Anonymous said...

@David Nelson -

Not at all. For me (in NZ), today is the 16th of March. Not March 16th. Although I think that varies as much from person to person as it does culturally...

Jeff Moser said...

Clifton: I can see your point to a degree. The examples were primarily focused on classic cases where seemingly innocent lines of code can get you in trouble. While looking a bit contrived, they do tend to show up in production code. But you're right, it makes sense to use ISO standards whenever possible. Especially for clear things like XML formatting standards that build on ISO ones. There will still be times where you might look for a configuration file string literal. In this case, it becomes important to not get hit with culture issues.

The whole point of the article is to make sure you're aware of the issue. That in itself is a huge step towards getting things to work right.

stanleyxu said...

German use the same date and currency format. The 3. and 4. problems are really crazy ^^)

Mert Sakarya said...

The main issue of the "Turkey" test is the "I" and "i" problem ("i" is not lowercase of "I" and vice versa). This is specific to Turkish language. The other problems are specific to more then one countries. There are lots of countries that uses "DD/MM/YYYY" or "DD.MM.YYYY" format.

Until the first release of .Net, Turkish "I" problem could not be solved in Microsoft technologies, and programmers had to do some hardcoding for that problem which led the application to be very specific to Turkey.

There are many other server and development applications that don't work on computers with Turkish region settings. As far as I know Adobe Flex that runs on Eclipse is one of them. When you start coding mxml it has a "Script" tag and the "i" in tag causes compiler error.

I recommend setting your region to Turkey and using "I", "İ", "ı" and "i" everywhere (comments, strings, variable-names etc.) then testing your compiler/environment...

If you are developing international applications and Turkey is one of your targets, be prepared for problems...

One more thing Turkey is in a very strange location, so you have to consider the language problems for Greece, Russia (by the sea), Bulgaria, Armenia, Georgia as well as arabic spoken languages (Syria, Iran and Iraq has borders with Turkey), when you say "Turkey and nearby".

So if you expand your territory to Turkey and nearby region, you'll probably will have to solve all regional issues other than Far East (and may be India region) issues.

"I" and "i" problem is unique to Turkish.

peSHIr said...

In Holland we use a date format of dd-mm-yyyy and have things like IJ and ij ligatures which could mess up your Unicode characters.

Something I have not seen mentioned here are assumptions on other culture specific things, like when concerning yourself with names. In Holland someone named "Jan van Dam" would have a first name "Jan" and last name "van Dam" but in a "LastName, FirstName" list would have to be sorted as "van Dam, Jan" under the letter D... ;-)

SuperJason said...

Where I used to work, we always did QA on a German OS, and a Chinese OS. It's amazing how much you learn by doing that. Chinese is particularly difficult because it's double byte.

The biggest thing that people had to learn, was to always keep everything binary. Only when accepting input or displaying something do you turn it into human readable format.

--SuperJason (blog)

Jeff Moser said...

Mert Sakarya: interesting to hear about Flex. You're right about it being not just Turkey. The idea is to think about many different locales, but Turkey does seem to highlight a lot of the issues itself.

peshir: Yeah, names are tough. Thanks for the info.

superjason: Agreed.

Anonymous said...

good job, this really makes sense

tomwp said...

Thank you for the excellent article and comments.

Worth mentioning the opposite operation for numbers: formatting a number as a string. For example, I'm communicating with some hardware that uses the GB culture for numbers. When I read back variables from the device I use:

number = double.Parse(numberAsString, NumberFormatInfo.InvariantInfo);

However, when I send data to the instrument I must also remember to use:

numberAsString = number.ToString("0.00", NumberFormatInfo.InvariantInfo);

Many thanks,

Tom.

Jeff Moser said...

tomwp:

Right, the biggest part is being consistent in how you write and read something back in.

To partially address these problems, Microsoft has announced that in .NET 4.0, the default for many string operations will be either Ordinal or Invariant cultures.

I agree with Microsoft's recommendation though of always specifying your intentions so that your code is unambiguous.

Thanks for stopping by!

David said...

Re the date format, I think the USA is the main place that uses MM/dd/yyyy. Almost everywhere else uses dd/MM/yyyy. Because of the use of MM/dd/yyyy, it makes both formats useless. It always surprises me when I come across a site with a date like "02/11/2008". No-one can have any idea what this means. And this even happens on www.microsoft.com, which you would think would know better. But then again, they did misspell "Colour" in "System.Drawing.Imaging.ColorMap"! :-)

David said...

Regarding the InvariantCulture Setting, I am very surprised that MS chose a short-date of MM/dd/yyyy.

This is not truly invariant. It is only invariant while it remains in the DateTime object, with full context. If they had made the InvariantCulture short date "yyyy/MM/dd", then converting the date to a short-date string representation would preserve the initial meaning.

This is an issue when using NHibernate, which converts everything to a string to run on SQL Server, instead of passing the DateTime object to SQL Server. There is no Culture I know of that I can use in converting a DateTime object to a short date string that is unambiguous. You either use LongDateFormat, which "should" be safe, or explicitly format it as "yyyy/MM/dd".

Jeff Moser said...

David: Tradition is hard to fight, which is probably why MM/dd/yyyy persists so long. As mentioned above, I'd encourage "4 Dec 2008" or ISO 8601 (e.g. "2008-12-04").

As for "color" spellings... we'll just have to agree to disagree :)

I'm not really surprised that "invariant" just happened to be exactly what the local culture was in Redmond, WA where the code was probably written. Ultimately, I think it doesn't matter - they had to pick one and stick with it. I agree that ISO would have been better from a theoretical perspective, but I'm sure the train had already left the station for existing apps and they couldn't change it even if they wanted to.

About NHibernate: I haven't used it, but it's an interesting issue. I wonder why it doesn't pass the date as parameterized SQL to the database provider which would handle it correctly.

Anonymous said...

It looks like the test forgets to mention one thing: parsing floats. Many European countries use , as the decimal separator instead of . and this causes problems in programs that let you enter such values.

Anonymous said...

"New York, New York" was actually written for Liza Minnelli and Frank Sinatra subsequently appropriated it.

Jeff Moser said...

Anonymous #1: I mentioned floating points in #2 (specifically using System.Double). Did I miss something?

Anonymous #2: Thanks for the clarification! I used Sinatra's name since I got the idea from "Made to Stick" where they refer to the idea as the "Sinatra Test" as mentioned in the post. I didn't check to see if the book ultimately clarified the right source.

Miral said...

That regex in #3 is a bit nasty... it'll not only match "portrait" and "port", but also "portttti" and "portair"...

Jeff Moser said...

Miral: Yeah, it's a bit contrived. I was thinking of matching misspellings like "portirat". This was brought up in a comment on reddit as well.

Prasad Paluri said...

This is a good article but Turkey Test doesnt cover the issues with DBCS character set. In my opinion other than char 'I' issue, international tester should use "Japanese Test" and shouldn't worry about "Turkey Test". This article might be misleading some International tester.

Jeff Moser said...

Prasad Paluri: You're definitely right that this article just covers some basic issues. Encodings can be quite challenging. I found the Joel article that I linked to in this post helpful. It briefly touches on DBCS and its relation to Unicode and encodings like UTF-8.

Roimer said...

This is a Great Article: almost a year and it is still having commentaries. Since last six month I had been trying to write code that work properly in most of our target cultures (Latin America, USA and Spain), I haven't idea that Regex could treat as digit to any digit (it can detect japanese digits too? like the character ichi?). I'm going to read more about all this issues; thansk again to Moserware for the article, and all of you for your commentaries.

PD: Sorry for any misspell7mistake, I'm not a fluent English speaker.

Jeff Moser said...

Roimer:

The "\d" Regex pattern will match any character in the "number, decimal digit" Unicode category. This page lists them. There are about 230 in this category. The first 10 Japanese numbers don't match this category. Perhaps because they can be used for more than just how we use digits. For example, var japaneseDigits = Regex.IsMatch("一二三四五六七八九", @"\d+") returns false while the Regex in this post returns true because the Arabic digits I used were in that "number, decimal digit" category.

Thanks for the international comment!

Javacikiz said...

I am from Turkey too, in addition I use java for my programs. However I do not think I have a problem with this letters or other stuff. We can use English characters rather than Turkish. For example Japanese or Russian people don't use their own language for programming.

On the other hands it can be useful to have a Turkey Test and our Flag on it:) I really like to read your article:) Thanks for it and also for the people who comments:)

okuryazarkişi said...

We're living *around Turkey and we past the test easily :)

http://www.copytaste.com/z0755c2

*around: I mean İstanbul actually since 29.10.1979 :) Did you see?

Jeff Moser said...

Javacikiz: I agree that most people use English when programming. This post refers to potentially unexpected things that can occur with strings in different locales. For example "i" will go to the uppercase "İ" in a Turkish locale/culture using the default ToUpper() function. If you're not aware of this, you can experience problems (like those I faced and mentioned above).

okuryazarkişi: How do you handle right-to-left languages like Arabic? :) There's always a different locale that can cause issues if you're not prepared.

okuryazarkişi said...

So what do you think about our project? Did you submit any post? Did you check it? I think it can be past the test :)

hipero said...

Hi! Great job :). I was looking for something like this to show all the topic in one post. Linked on my blog: http://karolchecinski.spaces.live.com/default.aspx

Your blog is now in my RSS reader :)

Jeff Moser said...

okuryazarkişi: Are you saying you wrote copytaste.com? The date seems to display in my USA locale. Other than that, it doesn't seem that this service uses any localized handling other than using Unicode.

Am I missing something?

hipero: Thanks for stopping by and writing a post about this in Polish!

Anonymous said...

Excellent job in writing about these locale issues.

Anonymous said...

As a Turkish citizen i like the last joke, that Frank Sinatra thing. :)

Anonymous said...

Hi Jeff, thanks for the post which warms up an old pain in (en)coding and gives the opportunity to remember John Cowan's proposal for Resolving dotted and dotless "i" dated as early as 1997. As far as I know this proposal (like the similar ones for ASCII set etc.) found no response from the western-centric circles and in Turkish "cases" we can sort that the big "Ignorance" comes first, then comes the "inequity". -Sahin

Jeff Moser said...

Anonymouses: Thanks!

Sahin: Interesting proposal. Having two "i"'s would indeed provide more context.

aib said...

Actually, Turks don't get 45% discounts. We get %45 discounts - another thing to watch out for.

John Cowan said...

I no longer believe in that 1997 post; there is too much English-Turkish or French-Turkish bilingual text in 8859-9 (Latin-5) to make it feasible, to say nothing of Unicode stability issues.

Jeff Moser said...

aib: Thanks for the extra tidbit :)

John Cowan: It's always great when an author explains his technical position rather than speculating. Thank you for doing that here in response to Sahin's comment.

Anonymous said...

Thank you very much for sharing this, it's great.

I believe in Turkey they have a very similar test called the "American Test" where they do all the same types of tests but with American style inputs.

serhio said...

bfff.... Americans.... Thought that all the world things and does like them....

For me is unnatural to write month then day then year, because I write day/month/year.

all the world use Celsius as degrees, and just Americans (and UK) use Fahrenheit... All the world use Meter and Kilogram, and they use feet and pounds.

Anonymous said...

Great!! you told good programming practise in such a nice and humor style. LIKE!!

gpvos said...

To me, the last example suggests that int.Parse() is broken. If in the current locale, \d matches Arabic numerals, int.Parse() should also accept those, unless you pass it an additional InvariantCulture parameter.

gpvos said...

It's even worse:

int.Parse("\x0661", new CultureInfo("ar-LB"));

returns 1585, because int.Parse() just subtracts 48 from the Unicode value of the Arabic digit 1.

It's probably documented somewhere in the manual pages, but it's pretty counterintuitive.