Saturday, February 16, 2008

Does Your Code Pass The Turkey Test?

Over the past 6 years or so, I've failed each item on "The Turkey Test." It's very simple: will your code work properly on a person's machine in or around the country of Turkey? Take this simple test.

  1. Parsing dates from a configuration file using DateTime.Parse(string):
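    The code snippet that originally appeared here didn't survive in this archive; below is a minimal sketch of the kind of call that fails (the date string and variable names are my own reconstruction):

```csharp
using System;
using System.Globalization;
using System.Threading;

class Program
{
    static void Main()
    {
        // Simulate running on a machine whose regional settings are Turkish
        Thread.CurrentThread.CurrentCulture = new CultureInfo("tr-TR");

        // You meant May 1st, 2008...
        DateTime d = DateTime.Parse("05/01/2008");

        // ...but on a Turkish machine this parses as January 5th, 2008
        Console.WriteLine("{0}/{1}", d.Month, d.Day);
    }
}
```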

    Does it pass "The Turkey Test?"


    Reason: Turkish people write July 4th, 2008 as "04.07.2008"

    Fix: Always specify what format your date is in. In this case, we use DateTimeFormatInfo.InvariantInfo, which just happens to be USA English's format (more or less):
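    A sketch of the fix, using the same hypothetical date string as above:

```csharp
using System;
using System.Globalization;
using System.Threading;

class Program
{
    static void Main()
    {
        Thread.CurrentThread.CurrentCulture = new CultureInfo("tr-TR");

        // The explicit format provider wins over the machine's culture
        DateTime d = DateTime.Parse("05/01/2008", DateTimeFormatInfo.InvariantInfo);

        // Month is 5 (May), even on a Turkish machine
        Console.WriteLine(d.Month);
    }
}
```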

    Which gives us what we were expecting:

    Scott Hanselman likes to talk about DateTimes. (Be sure to see his DateTime interview question).
  2. Ok, ok. You knew about dates. I sort of did, but I still got it wrong the first few times. What about this seemingly simple piece of code:
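    The original snippet is gone; here is a plausible reconstruction of the pitfall (the "4.5" discount string is my example):

```csharp
using System;
using System.Globalization;
using System.Threading;

class Program
{
    static void Main()
    {
        Thread.CurrentThread.CurrentCulture = new CultureInfo("tr-TR");

        // Intended as a 4.5% discount read from a config file...
        double discount = double.Parse("4.5");

        // ...but in Turkey "." groups digits, so this is 45
        Console.WriteLine(discount);
    }
}
```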

    Does it pass "The Turkey Test?"


    Reason: Turkish people use a period to group digits (like people in the USA use a comma). Instead of getting a 4.5% discount like you intended, Turkish people will be getting a 45% discount.

    Fix: Again, always specify your format explicitly:
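    A sketch of the fix:

```csharp
using System;
using System.Globalization;
using System.Threading;

class Program
{
    static void Main()
    {
        Thread.CurrentThread.CurrentCulture = new CultureInfo("tr-TR");

        // The invariant culture pins "." down as a decimal point
        double discount = double.Parse("4.5", CultureInfo.InvariantCulture);

        Console.WriteLine(discount); // the value is 4.5 on every machine
    }
}
```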

    Which saves your company from going out of business due to absurdly high discounts:

  3. Say your application reads in some command line parameter:
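    A sketch of what that naive check might have looked like (the "-file" flag is my invented example):

```csharp
using System;

class Program
{
    static void Main(string[] args)
    {
        // Case-sensitive: "-FILE" and "-File" silently fall through
        if (args.Length > 0 && args[0] == "-file")
        {
            Console.WriteLine("Processing file...");
        }
    }
}
```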

    Forget about Turkey, this won't even pass in the USA. You need a case insensitive compare. So you try:

    String.Compare(string,string,bool ignoreCase):

    Or using String.ToLower():

    Or String.Equals with CurrentCultureIgnoreCase:

    Or even a trusty Regular Expression:
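    Sketched together under a Turkish culture, all four attempts fail for the same input (the flag name and setup are my reconstruction):

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;
using System.Threading;

class Program
{
    static void Main()
    {
        Thread.CurrentThread.CurrentCulture = new CultureInfo("tr-TR");
        string arg = "-FILE";

        // All four print False in tr-TR: 'I' is not the uppercase of 'i' there
        Console.WriteLine(String.Compare(arg, "-file", true) == 0);
        Console.WriteLine(arg.ToLower() == "-file");
        Console.WriteLine(arg.Equals("-file", StringComparison.CurrentCultureIgnoreCase));
        Console.WriteLine(Regex.IsMatch(arg, "^-file$", RegexOptions.IgnoreCase));
    }
}
```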

    Do any of these pass "The Turkey Test?"

    Not a chance!

    Reason: You've been hit with the "Turkish I" problem.

    As discussed by lots and lots of people, the "I" in Turkish behaves differently than in most languages. Per the Unicode standard, our lowercase "i" becomes "İ" (U+0130 "Latin Capital Letter I With Dot Above") when it moves to uppercase. Similarly, our uppercase "I" becomes "ı" (U+0131 "Latin Small Letter Dotless I") when it moves to lowercase.

    Fix: Again, use an ordinal (raw byte) comparer, or the invariant culture, for comparisons unless you absolutely need culturally based linguistic comparisons (which give you uppercase I's with dots in Turkey):
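    A sketch of the ordinal fix (same hypothetical flag as above):

```csharp
using System;
using System.Globalization;
using System.Threading;

class Program
{
    static void Main()
    {
        Thread.CurrentThread.CurrentCulture = new CultureInfo("tr-TR");
        string arg = "-FILE";

        // Ordinal comparisons look at raw character values, so culture can't interfere
        Console.WriteLine(arg.Equals("-file", StringComparison.OrdinalIgnoreCase)); // True, even in Turkey
    }
}
```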



    And finally, a fix to our Regex friend:
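    Something like this (again, the flag is my example):

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;
using System.Threading;

class Program
{
    static void Main()
    {
        Thread.CurrentThread.CurrentCulture = new CultureInfo("tr-TR");

        // CultureInvariant makes the case-insensitive match ignore Turkish casing rules
        bool isFileFlag = Regex.IsMatch("-FILE", "^-file$",
            RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

        Console.WriteLine(isFileFlag); // True
    }
}
```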

  4. My final example is especially embarrassing. I was actually smug when I wrote something like this (note the comment):
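    The snippet went something like this (reconstructed; the surrounding validation code is my guess at the shape):

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        string quantity = "123";

        // This is safe: we only allow digits, so Parse can't blow up... right?
        if (Regex.IsMatch(quantity, @"^\d+$"))
        {
            Console.WriteLine(int.Parse(quantity));
        }
    }
}
```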

    Does this simple program pass "The Turkey Test?"

    You're probably hesitant to say "yes" ... and rightly so. Because this too fails the test.

    Reason: As Raymond Chen points out, there are more than 10 digits out there. Here, I use real Arabic digits (see page 4 of this code table):
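    A short demonstration of why the "safe" check above isn't (Arabic-Indic digit string is my example):

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        string quantity = "١٢٣"; // Arabic-Indic digits for 123

        // True: by default, \d matches any Unicode decimal digit
        Console.WriteLine(Regex.IsMatch(quantity, @"^\d+$"));

        // ...and then this throws a FormatException anyway
        int i = int.Parse(quantity);
    }
}
```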

    Fix: A CultureInvariant won't help you here. The only option is to explicitly specify the character range you mean:
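    A sketch:

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // An explicit character class says exactly which ten digits we mean
        Console.WriteLine(Regex.IsMatch("12345", "^[0-9]+$"));  // True
        Console.WriteLine(Regex.IsMatch("١٢٣٤٥", "^[0-9]+$"));  // False
    }
}
```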

    Or use the RegexOptions.ECMAScript option. In ECMAScript, "\d" means [0-9], which gives us:
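    A sketch of the ECMAScript variant:

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // Under ECMAScript matching rules, \d is just [0-9]
        Console.WriteLine(Regex.IsMatch("12345", @"^\d+$", RegexOptions.ECMAScript));  // True
        Console.WriteLine(Regex.IsMatch("١٢٣٤٥", @"^\d+$", RegexOptions.ECMAScript));  // False
    }
}
```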

"The Turkey Test" poses a very simple question, but yet is full of surprises for guys like me who didn't realize all the little details. Turkey, as we saw above, is sort of like "New York, New York" in the classic Frank Sinatra song:

"These little town blues, are melting away
I'll make a brand new start of it - in old New York
If I can make it there, I'll make it anywhere
It's up to you - New York, New York"

If your code properly runs in Turkey, it'll probably work anywhere.

This brings us to the logo program:

"Turkey Test" Logo Program Requirements:

  1. Read Joel Spolsky's basic introduction to Unicode to understand the absolute minimum about it.
  2. Read Microsoft's "New Recommendations for Using Strings in Microsoft .NET 2.0" article and this post by the BCL team.
  3. Always specify the culture and number format for every string operation, parsing call, and regular expression you use.
  4. If you read data from the user and want to process it in a language-sensitive manner (e.g. sorting), use the CurrentCulture. If none of that matters, really try to use Ordinal comparisons.
  5. Run FxCop on your code and make sure you have no CA1304 (SpecifyCultureInfo) or CA1305 (SpecifyIFormatProvider) warnings.
  6. Unit test string comparing operations in the "tr-TR" culture as well as your local culture (unless you actually live in Turkey, then use a culture like "en-US").

Having successfully passed the above requirements, your software will finally be able to wear the "Passed 'The Turkey Test'" logo with pride.

Note: Special thanks to my coworker, Evan, for calling the 3rd type of error "Turkeys." Also, thanks to Chip and Dan Heath for the "Sinatra Test" idea.


Saturday, February 9, 2008

SKU Driven Development

Here I am chomping down on a massive ice cream cone from the Dairy Bar at the 2007 Indianapolis State Fair. I like ice cream. Sure, I like enjoying a coffee cup filled with ice cream after working out at the gym (and sometimes even when I don't), but I also enjoy its perceived simplicity. All of my favorite flavors have some form of vanilla base with something added to it. For example, Fudge Tracks to me is vanilla + chocolate + peanut butter. Cookie Dough is, unsurprisingly, vanilla + cookie dough + chocolate.

I said "perceived" simplicity because my father-in-law works in the ice cream business and I know there are lots of smart people in the lab working on formulas and many people that design the manufacturing processes to ensure the final result "just right." There are lots of little "gotchas" too. For example, when adding cookie dough, you might have to reduce the butterfat content to keep the nutrition label from scaring people. Also, many people (e.g. me) think that everything starts with vanilla, but it's really a "white base" that may or may not have vanilla in it.

But despite all of the little details under the covers, ice cream still has a simplicity that our grandmas can understand. As I work more with software professionally, I'm becoming more convinced that if your architecture is so complicated that you couldn't realistically explain it at a high level to your grandma (while she's not taking a nap, that is), it's probably too complicated.

My favorite ice cream flavors all start out roughly the same, but then get altered by the addition of things. The end result is a unique product that has its own Universal Product Code (UPC) on its container. Can we make software development that simple? Can we start with some core essence like vanilla ice cream and create various versions of products, each with its own unique Stock Keeping Unit (SKU)?

This idea isn't foreign to our industry. Microsoft has at least five different SKUs of Vista and Visual Studio has over 10. The test is, could we as developers of products with much smaller distribution create different SKUs of our software? More importantly, even if we only planned to sell one version of our software, would it still be worth putting an emphasis on thinking in terms of partitioning our product into several "SKUs?"

I am starting to think that there might be merit in it. What will it take to think that way? I think it requires just a teeny bit of math.

The Calculus of SKU Driven Development

Even as a math major in college, I never really got excited about calculus or its related classes like differential equations. It tended to deal with things that were "continuous" whereas my more dominant computer science mind liked "discrete" things that I could count. With whole numbers, I could do public key cryptography or count the number of steps required in an algorithm. With "continuous" math, I could do things like calculate the exact concentration of salt in a leaky tank of water that was being filled with fresh water. In my mind, salt doesn't compare with keeping secrets between Alice and Bob.

Although I could answer most of the homework and test problems in calculus, I never really internalized or "connected with" it or its notation of things like "derivatives."

That is, until this week.

While trying to find a solution to the nagging feeling that software could be simpler, I came across a fascinating paper hidden away under the title "Feature oriented refactoring of legacy applications." It had a list of eight equations that at first had a scary calculus feel to them. But after the fourth reading or so, they really came alive:

The first equation essentially says that if you want to make something like Cookie Dough Ice Cream (here labeled "H") from your vanilla base product (labeled "B"), you'll need... cookie dough! See? Math is simple. The actual cookie dough is expressed by "h." The "db/dh" part is telling us "here's how you have to modify the vanilla ice cream base when you're making Cookie Dough to keep the nutrition reasonable." The letter "b" is simply the raw vanilla ice cream base. The "*" operator says "take the instructions for how to modify the base and actually do them on the base." Very intuitive if you think about it. The only trick is that uppercase letters represent SKUs (aka "Features") and lowercase letters represent the ingredients (or "modules") in that SKU. The paper was nice enough to include this picture to visualize this:
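The equation image didn't survive in this archive. Based purely on the prose description (so treat this as my reconstruction, not the paper's exact typesetting), the first equation reads something like:

```latex
H(B) \;=\; h \,+\, \frac{db}{dh} \star b
```

That is: the new SKU is the new module $h$ composed with the base $b$ after the change instructions $db/dh$ have been applied to it by the $\star$ operator.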

We'll skip to the last significant equation. It is also the most scary looking, but it's just as simple:

The scary part is that if we want to add chocolate chips, "j", to our existing Cookie Dough Ice Cream, "H(B)", we will start to see a "2" superscript, called a "second order derivative." The "d2b/(dJdH)" just means that "if I have both chocolate chips and cookie dough, I'll need to lower the butterfat content of the vanilla base even more to make the nutrition label not scare people." Then, make the cookie dough healthier to allow for the chocolate chips (dh/dJ) and then finally add the chocolate chips (j). That is, say that if I just added chocolate chips to vanilla (db/dJ), I'd only have to lower the vanilla butterfat by 5%. Similarly, if I just added cookie dough, I'd have to lower the butterfat by 7%. If I have both chocolate chips and cookie dough, I have to lower the butterfat an additional 3% (d2b/(dJdH)) for a total butterfat lowering of 3 + 5 + 7 = 15%.
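The butterfat arithmetic at the end can be written out explicitly (reconstructed from the prose above):

```latex
\Delta b \;=\; \underbrace{\frac{db}{dJ}}_{\text{chips alone: }5\%}
\,+\, \underbrace{\frac{db}{dH}}_{\text{cookie dough alone: }7\%}
\,+\, \underbrace{\frac{d^2b}{dJ\,dH}}_{\text{both together: }3\%}
\;=\; 15\%
```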

Why Software Development Can Be Frustrating

The above calculus shows, in a sort of sterile way, why developing software can frustrate both you as a developer and as a result, your customers. The fundamental philosophy of SKU Driven Development is that you absolutely, positively, must keep your derivatives (first order and higher) as small as humanly possible with zero being your goal. If you don't, you'll feel tension and sometimes even pain.

It starts off with the marketing guys telling you that customers really want SKU "J" of your product because they really need all that "j" will give them. Moreover, your top competitors have promised to have "j" and if you don't add it to your existing product, "H(B)", then customers have threatened to jump ship.

So management gets together and then eventually asks you their favorite question: "how long will it take to make SKU 'J' of our product? How much will it cost?" Enlightened by the calculus above, you look at equation 4 and count every place you see "J." These items denote changes to the existing code. The final "j" represents the actual additional feature that marketing wants.

You'll likely have a conversation like this: "well, we can't just add 'j.' We need to change our core library to get ready for J, and update this user control to get ready for J, oh and then we'd need to update this menu and that toolbar and this context menu and this documentation, oh yeah and then make sure that we didn't cause a regression of any of the functionality in our existing product, H(B), and then finally we could add 'j' and then test and document it."

Sure, you didn't use nice simple letters like "J", but that's what makes math so elegantly expressive. It doesn't matter what "J" is, the math doesn't lie.

Here's another pain point: let's say that in testing feature #4 of your product, a lot of bugs came up and moreover, marketing has proven beyond a reasonable doubt that no one even cares about feature #4. It was just some silly feature that a programmer added to feel more macho. Feature #4 is stopping your company from shipping and making any more money. You have to cut it! "Go, cut it out! Here's a knife" they shout to you.

"But, it's not that simple!" you reply.

"Why not? It's just as if I told you to make a version of a car that has an option to not have a GPS, or a TV, or cruise control, or rear defrost. The auto industry has been doing this for decades. Why can't you do it? Is this too much to ask!? Don't get in the way of my next bonus buddy!"

Ok, it's usually more peaceful than that. Well, usually anyway.

You can't just remove feature #4, "F4", because your product is now P = F6(F5(F4(F3(F2(F1(B)))))). Right about this time, you can feel the weight of all the derivatives. They are going to drive you to do a lot of searching and testing. It's not that you're a bad programmer; it's just that programs that haven't been built with the "SKU Driven Development" philosophy in mind tend to have higher amounts of derivatives. Usually, derivatives that could be almost zero are much higher and therefore cost you time and your company money.

Is There Any Hope?

I use the term "SKU Driven Development" because, as I write this, it has zero matches on Google. This is a good thing because it means I can define it to be what I choose it to mean. To me, SKU Driven Development is a software development philosophy that has the following four principles:

  1. Always have a "get the product to market" mentality by thinking in terms of shipping a new "SKU" that has the additional functionality. This makes it easier to think of your product in a more natural way of adding and removing options like people can do with ice cream and cars.
  2. Build your product in such a way that derivatives are as small as possible. You want them to be zero. This is not always possible, but that's your goal. It is extremely important to keep higher order derivatives small. I think that a developer should seriously think about his or her architecture if it requires significant second order derivatives. Furthermore, a software architect should be forced to give a rigorous defense of why their architecture necessitates third order or higher derivatives and convincingly prove there is no other reasonable way that wouldn't sacrifice something else that is more important.
  3. Your product should be stripped down to some fundamental "core" SKU where further functionality can't be stripped or it ceases to be meaningful. This is sort of like a car with an engine and seats, but no air conditioning, radio, clock, cruise control, etc. Only once you have stripped things down to the core can you start to think in terms of additive SKUs.
  4. Each piece of new functionality and every derivative that it necessitates must be testable. This principle ensures that a quality product continues to be a quality product, even with the additional functionality.

The problem isn't that our industry hasn't thought about this; the problem is that there are many different things to "keep in mind" while trying to achieve it. As developers, we have many tools that can help:

The list could go on and on.

Having all of these things that we have to "keep in mind" as developers makes it hard to keep up. But, it's also why we're paid to do what we do. Marketing wants to give us a Marketing Requirements Document (MRD) with additional features F1, F2, F3, and we're paid to turn our existing product, B, into F3(F2(F1(B))). It's not as easy as we'd like (f1 + f2 + f3) because all of those derivatives get in the way.

There's no silver bullet and it's unlikely that completely derivative-free programming will ever be possible for real applications. SKU Driven Development isn't prescriptive beyond the four principles I outlined above. It's sort of a "try your best, but keep these overarching goals in mind." Academics are already showing inroads into how it might be possible with simple examples and things like AHEAD.

It's going to take time. I'm going to try to head towards the SKU mentality. I'll probably go down the wrong path many times. I'll probably create, or at least suggest, designs that are too complicated and don't stand up to growth and maintainability. Over the years I want to get better, but I don't see a clear path there yet. It will probably involve using some of the tools I outlined above.

Lots of things to think about and "keep in mind" while enjoying my next bowl of ice cream.


Saturday, February 2, 2008

For Loops: Using i++, ++i, Enumerators, or None of the Above?

An IBM 704 mainframe computer. (Image courtesy of Lawrence Livermore National Laboratory)

It was 1958; IBM was the leader in creating computers that took up entire rooms. Their recent creation, the IBM 704, was projected to only sell six units and was priced accordingly. When it ended up selling well over 100 units, it helped propel the careers of its two lead designers. Gene Amdahl, from whom we get Amdahl's Law (which is still a hot topic), led the hardware design. John Backus, the B in BNF, was responsible for leading a team to define and develop an implementation of Fortran that would help ease programming on the system. It offered a choice besides writing in the 704's assembly language.

A for loop in flow-chart notation (Image by Paweł Zdziarski)

One of the newest features of their design was what would later be referred to as the "for loop." I say "later," because at the time of its introduction, it was really a "do loop" in syntax, but it had the modern day meaning of a for loop.

As a programmer, you'd sit down at a typewriter-like device called a keypunch, and would type out a line like "310 DO 400 I=1,50" that would tell the compiler that "starting at line 310, keep executing code until you get to line 400, then increment the 'I' variable and start all over until 'I' exceeds 50, then stop by jumping to the line right after 400."

There are several things to note about that last statement. The first is that the idea of a "for loop" is over 50 years old. The second is that since there were only 72 characters of usable space on the 704's card reader, the syntax had to be incredibly terse. Due to the design of the language, "I" was the first integer variable available for use ("N" was the last; all others were floating point). Like it or not, this is probably the driving reason why you were taught to write statements like "for(int i = 0; i < 50; i++)". It's only convenient that "I" is also the first letter in "index" and "iterator." Computer programming teachers and books inevitably all trace back to Fortran.

I'm sure that as a programmer you've written thousands of for loops in your life. If you're a professional programmer, you've probably read thousands of for loops written by other people. If you do this long enough, you'll likely come across several different styles.

The first style is by far the most popular. We can illustrate it by an example. Let's say that we have a list of presidents of the United States and we want to know how many presidents have a last name that is longer than six letters. You might have a program that has a helper class like this:

You'd then create and initialize a list of Presidents explicitly or from a file/database:

Finally, you'd be able to have your dear friend, the for loop:
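The helper class, the list setup, and the loop itself are missing from this archive; here is a self-contained reconstruction (the class shape, property names, and sample data are my guesses at the original):

```csharp
using System;
using System.Collections.Generic;

class President
{
    public President(string firstName, string lastName)
    {
        FirstName = firstName;
        LastName = lastName;
    }

    public string FirstName { get; private set; }
    public string LastName { get; private set; }
}

class Program
{
    static void Main()
    {
        // In the real program this would come from a file or database
        List<President> presidents = new List<President>
        {
            new President("George", "Washington"),
            new President("John", "Adams"),
            new President("Thomas", "Jefferson"),
            new President("James", "Madison")
        };

        // The classic for loop, post-increment style
        int count = 0;
        for (int i = 0; i < presidents.Count; i++)
        {
            if (presidents[i].LastName.Length > 6)
            {
                count++;
            }
        }

        Console.WriteLine(count); // Washington, Jefferson, Madison -> 3
    }
}
```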

We've all written code like this. It's so easy you don't even have to think about it.

Right? Well, let's think about it just for fun.

The second style would be to do something like this:
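A reconstruction (simplified to a plain list of last names so the snippet stands alone; only the increment differs from the usual version):

```csharp
using System;
using System.Collections.Generic;

class Program
{
    static void Main()
    {
        List<string> lastNames = new List<string> { "Washington", "Adams", "Jefferson", "Madison" };

        int count = 0;
        for (int i = 0; i < lastNames.Count; ++i) // pre-increment instead of i++
        {
            if (lastNames[i].Length > 6)
            {
                count++;
            }
        }

        Console.WriteLine(count); // still 3
    }
}
```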

Did you catch the difference? The first style used the post-increment (i++) operator for the for-iterator and the latter one used the pre-increment (++i) operator. Most of the world uses the post-increment notation because that's what we all saw in our textbooks and online samples. People who use the second notation are usually very particular about it. Almost all of them come from a C++ background. The reason is that the Standard Template Library (roughly the C++ equivalent of the .net Base Class Library) has the concept of iterators that allow you to go through a data structure like a vector. In the implementation of the post-increment operator on the iterator, one needs to preserve a copy of the old value, increment the value, and then return the old value to keep with the operator's proper meaning in the language specification. The post-increment operator is usually implemented by calling the pre-increment operator with the added tax of keeping the old value. Typically users of the "++i" style will come back at you and say something like "i++ is so wasteful! You've got that copy that you aren't even using! Shame on you buddy!"

Ok, maybe that's a bit extreme. But it could happen.

But, let's check out the truth for .net code. In this particular situation, is it wasteful? Let's look at the IL instructions for the post-increment (i++) version of our "for loop" using ILDASM (with my comments on the right):

Now, to save time, let's see how the above compares with the pre-increment (++i) version:

There you have it folks. The code is exactly the same except for how the C# compiler names the throw-away variable used for checking inequalities. This has absolutely no effect on performance. All other instructions are exactly identical. The C# compiler sees that no one cares about preserving a copy of the value, and decides to ignore it. This is so basic a technique that it happens even in debug builds. In case you're wondering, we couldn't have just done a simple diff on the two EXEs even if the C# compiler emitted the same code, because .net compilers place a Module Version Identifier (MVID) GUID in each assembly that is different every time you compile.

Anyway, with all due respect, any C++ compiler worth its file size would perform the same optimization trick (for integer types). Like I said, the only real difference came with iterators. In .net, we don't have the exact equivalent of iterators, but we do have Enumerators. This allows us to change the code above to look like this:
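A sketch of the foreach version (again over a simplified list of last names):

```csharp
using System;
using System.Collections.Generic;

class Program
{
    static void Main()
    {
        List<string> lastNames = new List<string> { "Washington", "Adams", "Jefferson", "Madison" };

        int count = 0;
        foreach (string lastName in lastNames) // no index variable at all
        {
            if (lastName.Length > 6)
            {
                count++;
            }
        }

        Console.WriteLine(count); // 3
    }
}
```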

Notice that we got rid of "i" completely! However, it's just slightly less efficient from a performance perspective. The reason is that the compiler turns this into a call to the List's GetEnumerator() which in turn does this:

Which just punts to this method:

Now, during the actual enumeration/iteration, the code will call MoveNext() that looks like this:
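The decompiled listings are missing from this archive. Rather than reproduce BCL source, here is a deliberately simplified reimplementation that shows the same mechanics: a private _version field bumped on every mutation, an enumerator that snapshots it, and a MoveNext whose core is the same index-versus-size check as a hand-written for loop. Everything below is illustrative, not the actual List<T> code:

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// A stripped-down list that mimics how List<T>'s enumerator behaves.
class MiniList<T> : IEnumerable<T>
{
    private T[] _items = new T[4];
    private int _size;
    private int _version; // bumped on every mutation

    public void Add(T item)
    {
        if (_size == _items.Length) Array.Resize(ref _items, _items.Length * 2);
        _items[_size++] = item;
        _version++; // any change invalidates live enumerators
    }

    public IEnumerator<T> GetEnumerator() { return new Enumerator(this); }
    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }

    private class Enumerator : IEnumerator<T>
    {
        private readonly MiniList<T> _list;
        private readonly int _version; // snapshot taken at creation
        private int _index;

        public Enumerator(MiniList<T> list) { _list = list; _version = list._version; }

        public T Current { get; private set; }
        object IEnumerator.Current { get { return Current; } }

        // The core is the same as a hand-written for loop: check the index
        // against the size, grab the current item, bump the index.
        public bool MoveNext()
        {
            if (_version != _list._version)
                throw new InvalidOperationException(
                    "Collection was modified; enumeration operation may not execute.");
            if (_index < _list._size)
            {
                Current = _list._items[_index];
                _index++;
                return true;
            }
            return false;
        }

        public void Reset() { _index = 0; }
        public void Dispose() { }
    }
}

class Program
{
    static void Main()
    {
        MiniList<int> list = new MiniList<int>();
        list.Add(1); list.Add(2); list.Add(3);

        int sum = 0;
        foreach (int n in list) sum += n;
        Console.WriteLine(sum); // 6
    }
}
```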

There are two things that I found interesting. The first is that there is a check on a "_version" field to see if it's different and if so, throw an error. After doing some reflectoring, it turns out that any time you modify a List through calls to methods like Add or Remove, the "_version" member gets incremented. This tells the class that it's been changed. If you changed the List after creating the Enumerator, you'll get this unfriendly error message:
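A quick repro of that error (the exact wording may vary slightly across framework versions):

```csharp
using System;
using System.Collections.Generic;

class Program
{
    static void Main()
    {
        List<int> numbers = new List<int> { 1, 2, 3 };
        try
        {
            foreach (int n in numbers)
            {
                numbers.Add(n); // mutating the list bumps _version mid-enumeration
            }
        }
        catch (InvalidOperationException ex)
        {
            // Prints: "Collection was modified; enumeration operation may not execute."
            Console.WriteLine(ex.Message);
        }
    }
}
```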

This design is ultimately caused by the second interesting thing in the code above. If you look carefully, you'll see that the core of MoveNext is exactly the same as the "for loop" style mentioned earlier. We have a "this.index" variable that gets incremented on every call and then we check to make sure that the index is less than "list._size". If the List is modified, the "this.index" might not make sense, and therefore an exception needs to be thrown.

After all enumeration is done, there is a call to dispose the Enumerator. Therefore, the "foreach" syntax is roughly rewritten to this code by the compiler:
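Roughly like this (variable names invented; the compiler's actual temporaries are unpronounceable):

```csharp
using System;
using System.Collections.Generic;

class Program
{
    static void Main()
    {
        List<string> lastNames = new List<string> { "Washington", "Adams", "Jefferson", "Madison" };

        int count = 0;
        List<string>.Enumerator e = lastNames.GetEnumerator();
        try
        {
            while (e.MoveNext())
            {
                string lastName = e.Current;
                if (lastName.Length > 6) count++;
            }
        }
        finally
        {
            e.Dispose(); // the foreach statement guarantees disposal
        }

        Console.WriteLine(count); // 3
    }
}
```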

which just makes me even more thankful for the C# "foreach" syntax sugar.

Clearly the "foreach" method takes more instructions and performs a generation 0 memory allocation and therefore is slightly slower than the "for loop" method. However, it's more readable and intuitive. It lets you focus on the problem at hand (enumerating) rather than having to worry about things like the size of the List or being explicit with incrementing an index.

Can we go even better than "foreach"? Absolutely!

I'm sure you've seen "foreach" a lot in production code for the reasons I mentioned. However, it's very repetitive and boring to have to explicitly tell the computer to go through the List looking for a match.

Some programmers might have noticed that List<T> has a ForEach method that executes a delegate on each item in the list. This allows you to write statements like this:
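Something like this, with the C# 2.0 anonymous delegate syntax (same simplified list of last names):

```csharp
using System;
using System.Collections.Generic;

class Program
{
    static void Main()
    {
        List<string> lastNames = new List<string> { "Washington", "Adams", "Jefferson", "Madison" };

        int count = 0;
        lastNames.ForEach(delegate(string lastName)
        {
            if (lastName.Length > 6) count++;
        });

        Console.WriteLine(count); // 3
    }
}
```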

The messy delegate notation in C# 2.0 made this approach hardly more elegant.

With lambda expressions and type inference, we can rewrite the above statement as:
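A sketch (same simplified list):

```csharp
using System;
using System.Collections.Generic;

class Program
{
    static void Main()
    {
        List<string> lastNames = new List<string> { "Washington", "Adams", "Jefferson", "Madison" };

        int count = 0;
        lastNames.ForEach(lastName => { if (lastName.Length > 6) count++; });

        Console.WriteLine(count); // 3
    }
}
```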

This really didn't buy us too much. It's still a bit messy. The final solution would be to use the LINQ Count extension method to get code like this:
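With LINQ's Count extension method, the whole thing collapses to one expression (same simplified list):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    static void Main()
    {
        List<string> lastNames = new List<string> { "Washington", "Adams", "Jefferson", "Madison" };

        int count = lastNames.Count(lastName => lastName.Length > 6);

        Console.WriteLine(count); // 3
    }
}
```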

Now we've actually made some progress! We've compressed all of the work to one clean line of code.

That was a long journey! Congratulations if you made it this far. Was it worth the investment?

Lessons Learned

  1. When you don't care about what the specific item offset/index number is, prefer a foreach since it's cleaner and not that much more expensive.
  2. Sometimes you can't use a foreach because you need to know the index number (e.g. going through two lists of the same size at the same time). This might be a better candidate for the "classic" for loop. However, don't bother with using a pre-increment operator because it's not the way the vast majority of people do it and it doesn't buy you any performance improvement.
  3. If performance is absolutely critical, use a for loop as it avoids the Enumerator allocation. You'll note that the source code for the .net Base Class Library tends to avoid "foreach" because it has been aggressively optimized for performance. It has to sacrifice readability for performance because the BCL is used everywhere. However, it's likely that your code doesn't have that type of strict performance requirement. Therefore, favor readability.
  4. It is worth your time to look at the LINQ extension methods like Count, Sum, Min, Max, Where, OrderBy, and Reverse. It's amazing how these can dramatically simplify 6-7 lines down to a single line.
  5. By using the new extension methods, you'll be able to quickly take advantage of upcoming technologies like Parallel LINQ. Say you had a billion presidents and 4 cores. You'd simply change your code to "presidents.AsParallel().Count(..)" and your code would scale automatically to all processors.
  6. In short, consider thinking outside of "for loops" and prefer to think at a higher level. One day, you just might be able to honestly say that you've written your last "for loop."

One last thing: as Backus was developing Fortran, John McCarthy was developing LISP. Although the two languages started at roughly the same time, they had notably different design philosophies. Fortran was designed with a bottom-up style that was slightly higher than the raw assembler, but it was fast. LISP was deeply entrenched in the symbolic world of lambda calculus. It was powerful, but slow at first. Many of LISP's ideas are just now entering into common practice with popular languages like Python, Erlang, F#, and as we saw in the code above, C# 3.0.

So it seems that in 1958, two roads diverged in a wood and LISP became the road less traveled by. At least now, fifty years later, our industry is starting to bridge those two paths. Maybe Anders is right saying that in 10 years, it'll be hard to categorize languages like C# because they will have deeply incorporated both paths.

But I can't help but think what programming would be like now had Fortran become the road not taken.

P.S. In case you care, there are 23 presidents of the United States who have a last name longer than 6 letters.
