Saturday, February 2, 2008

For Loops: Using i++, ++i, Enumerators, or None of the Above?

An IBM 704 mainframe computer. (Image courtesy of Lawrence Livermore National Laboratory)

It was 1958; IBM was the leader in creating computers that took up entire rooms. Their recent creation, the IBM 704, was projected to only sell six units and was priced accordingly. When it ended up selling well over 100 units, it helped propel the careers of its two lead designers. Gene Amdahl, from whom we get Amdahl's Law (which is still a hot topic), led the hardware design. John Backus, the B in BNF, was responsible for leading a team to define and develop an implementation of Fortran that would help ease programming on the system. It offered a choice besides writing in the 704's assembly language.

A for loop in flow-chart notation (Image by PaweĊ‚ Zdziarski)

One of the newest features of their design was what would later be referred to as the "for loop." I say "later," because at the time of its introduction, it was really a "do loop" in syntax, but it had the modern day meaning of a for loop.

As a programmer, you'd sit down at a typewriter-like device called a keypunch, and would type out a line like "310 DO 400 I=1,50" that would tell the compiler that "starting at line 310, keep executing code until you get to line 400, then increment the 'I' variable and start all over until 'I' exceeds 50, then stop by jumping to the line right after 400."

There are several things to note about that last statement. The first is that the idea of a "for loop" is over 50 years old. The second is that since there were only 72 characters of usable space on the 704's card reader, the syntax had to be incredibly terse. Due to the design of language, "I" was the first integer variable available for use ("N" was the last, all others were floating point). Like it or not, this is probably the driving reason why you were taught to write statements like "for(int i = 0; i < 50; i++)" It's only convenient that "I" is also the first letter in "index" and "iterator." Computer programming teachers and books inevitably all trace back to Fortran.

I'm sure that as a programmer you've written thousands of for loops in your life. If you're a professional programmer, you've probably read thousands of for loops written by other people. If you do this long enough, you'll likely come across several different styles.

The first style is by far the most popular. We can illustrate it by an example. Let's say that we have a list of presidents of the United States and we want to know how many presidents have a last name that is longer than six letters. You might have a program that has a helper class like this:

You'd then create and initialize a list of Presidents explicitly or from a file/database:

Finally, you'd be able to have your dear friend, the for loop:

We've all written code like this. It's so easy you don't even have to think about it.

Right? Well, let's think about it just for fun.

The second style would be to do something like this:

Did you catch the difference? The first style used the post-increment (i++) operator for the for-iterator and the latter one used the pre-increment (++i) operator. Most of the world uses the post-increment notation because that's what we all saw in our textbooks and online samples. People that use the second notation are usually very particular about it. Almost all of them come from a C++ background. The reason is that in the Standard Template Library (the C++ rough equivalent of the .net Base Class Library) has the concept of iterators that allow you to go through a data structure like a vector. In the implementation of the post-increment operator on the iterator, one needs to preserve a copy of the old value and then increment the value and return the old value to keep with the proper meaning of the operator in language specification. The post-increment operator is usually implemented by calling the pre-increment operator with the added tax of keeping the old value. Typically users of the "++i" style will come back at you and say something like "i++ is so wasteful! You've got that copy that you aren't even using! Shame on you buddy!"

Ok, maybe that's a bit extreme. But it could happen.

But, let's check out the truth for .net code. In this particular situation, is it wasteful? Let's look at the IL instructions for the post-increment (i++) version of our "for loop" using ILDASM (with my comments on the right):

Now, to save time, let's see how the above compares with the pre-increment (++i) version:

There you have it folks. The code is the exact same except for how the C# compiler names the throw-away variable used for checking inequalities. This has absolutely no effect on performance. All other instructions are exactly identical. The C# compiler sees that no one cares about preserving a copy of the value, and makes a decision to ignore that. This is so basic of a technique that this happens even in debug builds. In case you're wondering, we couldn't have just done a simple diff on the two EXEs even if the C# compiler emitted the same code because .net compilers place a Module Version Identifier (MVID) GUID that is different every time you compile your assembly.

Anyways, with all due respect, any C++ compiler worth its file size would perform the same optimization trick (for integer types). Like I said, the only real difference came with iterators. In .net, we don't have the exact equivalent of iterators, but we do have Enumerators. This allows us to change the code above to look like this:

Notice that we got rid of "i" completely! However, it's just slightly less efficient from a performance perspective. The reason is that the compiler turns this into a call to the List's GetEnumerator() which in turn does this:

Which just punts to this method:

Now, during the actual enumeration/iteration, the code will call MoveNext() that looks like this:

There are two things that I found interesting. The first is that there is a check on a "_version" field to see if it's different and if so, throw an error. After doing some reflectoring, it turns out that any time you modify a List through calls to methods like Add or Remove, the "_version" member gets incremented. This tells the class that it's been changed. If you changed the List after creating the Enumerator, you'll get this unfriendly error message:

This design is ultimately caused by the second interesting thing in the code above. If you look carefully, you'll see that the core of MoveNext is exactly the same as the "for loop" style mentioned earlier. We have a "this.index" variable that gets incremented on every call and then we check to make sure that the index is less than "list._size". If the List is modified, the "this.index" might not make sense, and therefore an exception needs to be thrown.

After all enumeration is done, there is a call to dispose the Enumerator. Therefore, the "foreach" syntax is roughly rewritten to this code by the compiler:

which just makes me even more thankful for the C# "foreach" syntax sugar.

Clearly the "foreach" method takes more instructions and performs a generation 0 memory allocation and therefore is slightly slower than the "for loop" method. However, it's more readable and intuitive. It lets you focus on the problem at hand (enumerating) rather than having to worry about things like the size of the List or being explicit with incrementing an index.

Can we go even better than "foreach"? Absolutely!

I'm sure you've even seen "foreach" a lot in production code for the reasons I mentioned. However, it's very repetitive and boring to have explicitly tell the computer to go through the List looking for a match.

Some programmers might have noticed that List<T> has a ForEach method that executes a delegate on each item in the list. This allows you to write statements like this:

The messy delegate notation in C# 2.0 made this approach hardly more elegant.

With lambda expressions and type inference, we can rewrite the above statement as:

This really didn't buy us too much. It's still a bit messy. The final solution would be to use the LINQ Count extension method to get code like this:

Now we've actually made some progress! We've compressed all of the work to one clean line of code.

That was a long journey! Congratulations if you made it this far. Was it worth the investment?

Lessons Learned

  1. When you don't care about what the specific item offset/index number is, prefer a foreach since it's cleaner and not that much more expensive.
  2. Sometimes you can't use a foreach because you need to know index number (e.g. going through two lists of the same size at the same time). This might be a better candidate for the "classic" for loop. However, don't bother with using a pre-increment operator because it's not the way the vast majority of people does it and it doesn't buy you any performance improvement.
  3. If performance is absolutely critical, use a for loop as it avoids the Enumerator allocation. You'll note that the source code for the .net Base Class Library tends to avoid "foreach" because it has been aggressively optimized for performance. It has to sacrifice readability for performance because the BCL is used everywhere. However, it's likely that your code doesn't have that type of strict performance requirement. Therefore, favor readability.
  4. It is worth your time to look at the LINQ extension methods like Count, Sum, Min, Max, Where, OrderBy, and Reverse. It's amazing how these can dramatically simplify 6-7 lines down to a single line.
  5. By using the new extension methods, you'll be able to quickly take advantage of upcoming technologies like Parallel LINQ. Say you had a billion presidents and 4 cores. You'd just simply change your code to "presidents.AsParallel().Count(..)" and your code would scale automatically to all processors.
  6. In short, consider thinking outside of "for loops" prefer to think at a higher level. One day, you just might be able to honestly say that you've written your last "for loop."

One last thing: as Backus was developing Fortran, John McCarthy was developing LISP. Although the two languages started at roughly the same time, they had notably different design philosophies. Fortran was designed with a bottom-up style that was slightly higher than the raw assembler, but it was fast. LISP was deeply entrenched in the symbolic world of lambda calculus. It was powerful, but slow at first. Many of LISP's ideas are just now entering into common practice with popular languages like Python, Erlang, F#, and as we saw in the code above, C# 3.0.

So it seems that in 1958, two roads diverged in a wood and LISP became the road less traveled by. At least now, fifty years later, our industry is starting to bridge those two paths. Maybe Anders is right saying that in 10 years, it'll be hard to categorize languages like C# because they will have deeply incorporated both paths.

But I can't help but think what programming would be like now had Fortran become the road not taken.

P.S. In case you care, there are 23 presidents of the United States who have a last name longer than 6 letters.


kick it on DotNetKicks.com

19 comments:

Eelis said...

(In what follows I'll only talk about C++, because I know nothing about C# and Java and whatnot.)

The efficiency argument against i++ vs ++i is indeed very weak. However, there is a stronger argument.

Postfix increment (which makes a copy of the argument, increments the argument, and then returns the copy) has more complex semantics than prefix increment (which just increments the argument and returns it). This is not an implementation issue; we're talking about the semantics specified by the language definition. With this observation, the well known KISS principle is directly applicable: postfix increment's more complex semantics don't win one anything, so prefer the simpler prefix increment solution.

Note that the argument applies not only when i happens to be a C++ iterator, but also when it is an int.

Jeff Moser said...

Eelis: Thanks for sharing the comment.

I absolutely agree that pre-increment semantics are better. I want to use pre-increments from a technical perspective.

It's just that in C# at least, it makes no difference in the case I mentioned. Since pre-increment goes against the norm, it would just serve to make the code-base on a large project less uniform.

So, althought it is better from an imperative purity perspective, I am bound from a practical aspect to go with post-increment since consistency and readability are higher priorities.

Any thoughts on why post-increment became so popular even before objects?

Eelis said...

Jeff:

I agree that working with inexperienced programmers in a crappy codebase can often be a very strong argument for choosing solutions that would be frowned upon by more experienced programmers working in a clean codebase. However, I don't consider this one of those cases.

If a programmer sees prefix increment where he would have expected postfix increment because he doesn't know any better, what's the worst that can happen? Either he will ignore it, or he will decide to investigate, learn the difference, and end up a more informed programmer who has now seen another demonstration of the KISS principle.

As for "popularity", I highly doubt it was ever a very conscious thing, let alone active enthusiasm. Most likely neither the first needless-postfix users nor their countless imitators gave it any thought at all.

Jeff Moser said...

Eelis: interesting thoughts. It's sort of a shame from a language perspective that new programmers need to be concerned about such details.

However, I agree that it would help better inform the programmer if it caused him/her to investigate the issue more.

I also agree that it probably wasn't given much thought at the start and that has just grown (in non-thought that is) as the years have gone by.

For C++ code, I'm still undecided if I would make the push for preincrement notation since it would make the overall codebase more uniform because the preferred syntax for iterators would be preincrement.

For C#, what I work in the most at the moment, it doesn't matter nearly as much and that's why I'm leaning to favoring the community trend of postincrement.

Regardless, it's an interesting to find what's most elegant from a technical, community, and common sense perspective on the issue.

Mike Petry said...

Cool post, I ran across it on the front page of programming.reddit.

One needs to look a bit deeper at the C++ pre/postfix unary operator. The main point is that C++ supports operator overloading. There are semantical differences between the two operators (as you described very lucidly). Languages such a C# and Java do not support operator overloading.

Jeff Moser said...

Mike:

C# supports operator overloading for some operators, but not for ++.

In addition, I think that C# was influenced by having to design features that would play nice with features available in a vast amount of other languages that ride on the CLR. This is what probably led to the push towards an interface/Enumerator based approach.

IEnumerable is a very powerful interface and indeed is what LINQ to Objects is built on.

Having the clear semantics that IEnumerable has made LINQ possible.

Maybe I misunderstood your comment?

robert said...

I've stuck to pre-increment simply on aesthetic grounds: I dislike seeing the ++ immediately next to a close parenthesis.

However, I would point out that in general making such decisions based on a hazy idea of what might be faster is farcical (and unfortunately common). Optimization is not optimization without cold hard data: measure the speed.

For typical tight inner loops, anyhow, the code for incrementing is probably nowhere near as important as memory access concerns---e.g. setting up cache prefetches. Which of course, and very frustatingly, we don't yet have much control over in languages above assembly: I find it amazing that something so critical to performance is still left up to hoping for compiler magic or weird non-portable hacks.

Paolo Bonzini said...

The reason why IJKLMN were used for integer variables in Fortran, by the way, is that those were used often as indexes or sizes (both of which are integers) in math.

Anonymous said...

This is so funny...

I guess you can't do that in C either?

I'm just dumbfounded by this whole blog.

Jeff Moser said...

Robert: Interesting views on aesthetics. I would have thought that i++ would look better since everything else starts with a letter (i =; i <; i..) and you'd have more balance, but I can also see your point.

Jeff Moser said...

Paolo: thanks for the insight! I had a hunch that it might have had to deal with that I,J, and K are used for vectors in physics. However, I didn't find any reference to this in my limited research.

Jeff Moser said...

Anonymous: do what in C? What exactly are you dumbfounded by?

Mike said...

Nice article Jeff. Appreciate the insights.

Bart Czernicki said...

Nice article.

However, I would add that foreach is also avoided when doing programming games in managed C#. It leaves behind objects that need to be GC'ed :)

Having said that the for loop is not faster all the time! Check out this link (guy who created MDX) where in SSAS 2005 the dvelopers screwed up in the implementation of iterators/enumerators...foreach is exponentially faster:
http://sqljunkies.com/WebLog/mosha/archive/2007/04/19/stored_procs_best_practices.aspx

It alawys goes back to that rule: "It Depends" :)

Jeff Moser said...

Mike: Glad you liked it!

Bart: foreach does create an enumerator object, but it's very fast since it's going to use generation 0 memory (quick birth, quick death)

The for(..;..;..) method really requires, like you say, that the underlying data structure has array like semantics rather than linked-list semantics.

Mike Petry said...

Jeff,
Sorry for the nitpick. Your illustration of C++ pre vs post increment was perfect but I didn't think it was clear that STL iterators are implemented by overloading the increment operators. But you clearly pointed out the STL iterators are coded in a way that conforms to the semantics of those operators. Any class that overloads the post-increment operator in C++ must return a copy-constructed version of itself which could be a performance hit if the class held enough data.
Of course the performance hit is negligible. If you are required to program in C++ it is typically because you are coding a performance sensitive capability and ignoring even modest opportunities to improve performance just seems wrong.
Otherwise, this entry is probably the best I have seen from somebody I personally know.

Jeff Moser said...

Mike: thanks for the follow-up response. Your points are well taken.

I think that every language has to make decisions on several areas like syntax, expressiveness, raw power, and what it will protect you from.

I think it's clear that Bjarne choose to write as think of a layer on top of C that would give the majority of OO ideas. To him, I think performance was paramount like you mentioned.

I also think that people write in C++ now since it's the language of their legacy codebase that companies have spent an enormous amount of time creating. (If MS Office was greenfield now, it'd probably be written in C#). The other reason is for cross-platform programs that need good performance. (This will likely become more negligible as the CLR and JVM continue to expand)

However, when they do write in C++, it would be silly to ignore libraries like Boost and ACE that implement a lot of low level stuff for you. By the time you resolve to use smart pointers and structured exception handling, no one is worried excessively on the copy constructor that's wasted.

Still, you made good points. Some places are going to demand perf above all else. That's why I have to acknowledge that it might be worth the battle to fight for unilateral ++i syntax as a coding standard if you're in a pure C++ shop.

However, for the vast majority of cases, readability and maintainability is a trump card. As we move into the dynamic language and functional paradigms more, people will drift away from for loops altogether and start to think more and more in terms of things like Count() and higher order functions like map and reduce.

Kevin said...

A thought: programmers are probably drawn to postfix notation because by definition, the name C++ is itself postfix?

Jeff Moser said...

Kevin: Interesting observation ; I hadn't considered that path. I think a lot of it was influenced by examples, but it could cause a subconscious preference.