Game Development Community

Fastest Iteration Method

by Sean Monahan · in Torque X 2D · 02/05/2009 (7:24 pm) · 18 replies

I'm working on a component that requires a lot of iteration (potentially hundreds of iterations) and I want to get some input on the fastest method for iterating. Some examples:

// listItems is a List (of objects, hence the casts below)

// Assume the order of the items is unimportant

// Method A: foreach
foreach (Item item in listItems)
{
   // Do stuff
}

// Method B: increment
for (int i = 0; i < listItems.Count; i++)
{
   Item item = listItems[i] as Item;
   // Do stuff
}

// Method C: decrement
for (int i = listItems.Count - 1; i >= 0; i--)
{
   Item item = listItems[i] as Item;
   // Do stuff
}

I know method A will be the slowest. My experience with ActionScript taught me that method C is the fastest in that language; is that also true of C#? Remember, I could potentially be doing hundreds of iterations, so even a minuscule speed difference could add up rather quickly.

#1
02/05/2009 (10:46 pm)
Why don't you time each method, run each in your test case (or build a simple but valid test case), and let us know the results? Then all of us will know :)
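For anyone who wants to try this at home, a rough harness along these lines should do. This is a minimal sketch, not Sean's actual component: `Stopwatch` is from `System.Diagnostics`, and the `Item` class and million-element list are stand-ins so the loop body does *something* measurable.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

class Item { public int Value; }

class LoopBenchmark
{
    static void Main()
    {
        var listItems = new List<Item>();
        for (int i = 0; i < 1000000; i++)
            listItems.Add(new Item { Value = i });

        // Method A: foreach
        long sum = 0;
        var sw = Stopwatch.StartNew();
        foreach (Item item in listItems)
            sum += item.Value;
        sw.Stop();
        Console.WriteLine("A: {0} ticks", sw.ElapsedTicks);

        // Method B: increment
        sum = 0;
        sw = Stopwatch.StartNew();
        for (int i = 0; i < listItems.Count; i++)
            sum += listItems[i].Value;
        sw.Stop();
        Console.WriteLine("B: {0} ticks", sw.ElapsedTicks);

        // Method C: decrement
        sum = 0;
        sw = Stopwatch.StartNew();
        for (int i = listItems.Count - 1; i >= 0; i--)
            sum += listItems[i].Value;
        sw.Stop();
        Console.WriteLine("C: {0} ticks", sw.ElapsedTicks);
    }
}
```

Run each loop a few times and ignore the first pass (JIT warm-up), or the numbers will lie to you.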
#2
02/05/2009 (11:18 pm)
@Sean
I noticed you have "The complete guide to Torque X". Check page 199. It has a side note in the grey area about that.

Brian
#3
02/06/2009 (8:11 am)
@Jim

Yeah, I was hoping someone would just know ;) I'll set something up and post back.

@Brian

I looked at that; the sidebar is pretty much the same as my method A vs. method B comparison.
#4
02/06/2009 (10:29 am)
Just a meta-note here: unless // Do stuff is very, very fast, the work inside the loop is going to account for 99.9% of the time, not the iteration itself.
#5
02/06/2009 (10:46 am)
Hi Folks,

My take is closest to option B, with a twist.

Item item = null;
 
for (int i = 0; i < listItems.Count; i++)
{
   item = listItems[i] as Item;
   // Do stuff
}

This way, the space for the item reference is reserved once and won't result in possible garbage collection. Option A is bad because an enumerator object will be created for the loop and will also require garbage collection. In fact, this is just a good tip for everyone: never declare new objects inside of ProcessTick()... create instances on the class and reference them within ProcessTick().

Also, how often do you need to run this code? If it runs on every tick, performance will grind to a halt and you should consider changing the whole design of the feature. Otherwise, consider running it only every few ticks... set a tick counter and execute the code when the counter reaches 10, say. Even if you need to update positions on a map, once per second is probably just as good as once every 2 milliseconds.
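A minimal sketch of that throttling idea. The class name, the `WorkCount` field, and the interval of 10 are illustrative, not from any particular Torque X API:

```csharp
using System;

public class ThrottledComponent
{
    private int _tickCounter = 0;
    private const int TickInterval = 10;  // run the expensive code every 10th tick

    public int WorkCount;                 // how many times the expensive path has run

    public void ProcessTick()
    {
        _tickCounter++;
        if (_tickCounter < TickInterval)
            return;                       // skip the expensive work this tick

        _tickCounter = 0;
        WorkCount++;
        // ... the expensive iteration goes here ...
    }
}

class Demo
{
    static void Main()
    {
        var component = new ThrottledComponent();
        for (int tick = 0; tick < 60; tick++)   // simulate 60 ticks
            component.ProcessTick();
        Console.WriteLine(component.WorkCount); // prints 6: the heavy path ran 6 times, not 60
    }
}
```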

John K.
www.envygames.com
#6
02/06/2009 (11:07 am)
@John

Good point about the declaration. I'm doing that already; I just wanted the examples to be more straightforward.

I also already have a tick counter limiting this component, like you suggested. In fact, I haven't had any performance problems at all yet, but I feel like I very well could, and ultimately I'd like to squeeze every ounce of performance out of my code that I can.

---

Doing a few Google searches I came across this. The result:
Quote:
Decrementing a variable in a loop is twice as fast. On a modern 2 GHz computer, this will save you 1 millisecond per million iterations. That means testing against zero is a millionth of a millisecond faster. This is of extremely minimal value in 99% of programs.

Please don't go out of your way to use this optimization. It is mostly here just for interest and to help gain insight into processor behavior and tricks.

So going back to my original post, method C is fastest but it probably wouldn't matter. There is almost certainly more performance to be wrung from John K's suggestions.
#7
02/06/2009 (7:33 pm)
@Sean - I wonder if the decrementing for loop still holds up on the Xbox 360 itself. Since it's PowerPC based like the old Macintoshes, it uses big-endian byte order (hidden by the .NET Compact Framework during JITting). If you try it, let me know whether it still shows a performance increase. Thanks for the article link!

Brian

XBox360 Processor Reference

Edit:
Definitely try it. I read the entire page you linked, and basically it explains that x86 processors have optimizations for a zero test, which is where the speed boost comes from.
Quote:X86 chips, including Pentium, Celeron, Core 2 Duo, and AMD Athlon, have special instructions and optimizations for testing against zero.
#8
02/06/2009 (7:48 pm)
@Brian

I got started on this idea because it was recommended for ActionScript, and I seemed to recall learning in some intro CS class that computers typically perform subtraction faster than addition (whether that's true, or whether the difference is appreciable, I don't know). Anyway, during some reading I found that the real performance gain is not in incrementing vs. decrementing but in the comparison. Apparently on modern processors it is much, much faster to compare against zero than against some other number, and with a decrementing loop your comparison is always against zero. In the end it seems like any performance increase would be negligible, so I'll probably just go on my merry way unless I run into problems down the line.
#9
02/06/2009 (7:54 pm)
@Sean - Well that's definitely cool. I am interested in trying it on the 360 because of the information I stated above (for my own knowledge) but thanks for the tip!

Brian

#10
02/06/2009 (10:03 pm)
I don't know how x86 processors do it, but I learned SPARC architecture... basically, subtraction is "adding" a negative number.

ex. 4-bit numbers: 4 - 1

4 = 0100
1 = 0001
-1 = 1111 (two's complement)
4 + -1 =
  0100
+ 1111
------
1 0011 → discard the carry bit → 0011 = 3

So I'd imagine it has to be something else going on... but really, it doesn't matter. In modern programming you strive for readability rather than speeding up such minuscule things (within reason; obviously, if you profile your code and it's really holding you back, then go ahead and optimize the crap out of it!). Honestly, I'd stick with option B. It's pretty much the universal way of doing it, and it's easier to read than C. If you're familiar with the syntax of A, it'd be slightly better for readability, but in general programmers will be more familiar with option B.
#11
02/07/2009 (9:00 am)
The reason C is faster than B is that listItems.Count sits in B's loop condition and must be evaluated on every iteration. In method C, listItems.Count is evaluated only once, when the loop variable is initialized, so C will always come out ahead.

I agree with you Chris.
#12
02/07/2009 (11:42 am)
Based on the comments here and my own little bit of research it looks like C is fastest but B is the preferred method unless you have some particular reason to count backward or you are iterating through a massive list on the scale of millions of items.
#13
02/07/2009 (2:38 pm)
@Jim - So could you achieve roughly the same speed with B by reading Count into a variable right before you enter the for loop, so it's only calculated once? I know this is a micro-optimization and probably doesn't matter much in the overall scheme of things, but it could be useful for something.

Brian
#14
02/16/2009 (8:30 pm)
@Sean

Assuming that you know how many iterations you need to do beforehand I believe another possible solution is loop unwinding.

Item item = listItems[0] as Item;
// Do stuff
item = listItems[1] as Item;
// Do stuff

If you don't know the exact count beforehand, but you know it's a multiple of some number, then processing in batches of that multiple should (in most cases) slightly improve performance.

// NOTE: Assuming that listItems.Count is a multiple of 2
for (int i = 0; i < listItems.Count; i += 2)
{
   Item item = listItems[i] as Item;
   // Do stuff
   item = listItems[i + 1] as Item;
   // Do stuff
}

The fewer loop iterations you have the better, since fewer branches mean fewer chances for pipeline stalls from branch mispredictions. You should test these scenarios out, though, as they are not always the quickest in every situation.

Lastly, if // Do stuff ends up being a hefty load of work, you're on a multi-core CPU, and the work is easily split up (no dependent data or shared memory), then I would also consider a multi-threaded solution. Check out .NET's Parallel Extensions and the Parallel.For(...) method.
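A minimal sketch of that last suggestion. It assumes `System.Threading.Tasks` is available (Parallel Extensions / .NET 4, which may not be an option on the 360's Compact Framework), and the `Item` class and doubling work are stand-ins for the real "// Do stuff":

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

class Item { public int Value; public int Result; }

class ParallelExample
{
    static void Main()
    {
        var listItems = new List<Item>();
        for (int i = 0; i < 1000; i++)
            listItems.Add(new Item { Value = i });

        // Each index is independent of the others, so the body
        // can safely run on any thread in any order.
        Parallel.For(0, listItems.Count, i =>
        {
            Item item = listItems[i];
            item.Result = item.Value * 2;   // stand-in for "// Do stuff"
        });

        Console.WriteLine(listItems[7].Result);   // prints 14
    }
}
```

Note this only pays off when the per-item work is heavy enough to outweigh the scheduling overhead; for trivial bodies a plain for loop will usually win.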
#15
04/03/2009 (3:47 pm)
I know this is an old thread, but a few comments.

A) Orion is right.
B) See A.

C) It's dependent on the language.
D) It's dependent on the compiler (if there is one).
E) It might be dependent on the interpreter (if it's interpreted, or if it's compiled and then bytecode interpreted).

None of the answers given is truly correct, nor could they be without knowing the specifics of the language, the compiler and/or interpreter, and the target architecture, and I suspect in a great many cases the answer is "they're all equal".

Even an "optimization" like loop unwinding could actually be a significant deoptimization if interaction with your particular bytecode interpreter causes a ton of cache line misses or your particular compiler for that particular language decides to reroll your loop.

In the end, you always have to measure with the tools you're going to use.

#16
04/04/2009 (9:15 pm)
Quote:Even an "optimization" like loop unwinding could actually be a significant deoptimization if interaction with your particular bytecode interpreter causes a ton of cache line misses or your particular compiler for that particular language decides to reroll your loop.

Yes, this is true. These all factor into how much loop unwinding will help, but honestly it comes down to the hardware and the instruction code. According to the Intel 64 and IA-32 Architecture Optimization Manual, Section 3.4.17 on loop unwinding, it is still slightly beneficial under certain rare scenarios; assuming you are on that hardware, of course.

Since I'm running on an Intel Core 2 Duo, these conditions should be relevant to my hardware. So I went ahead and compiled a small project equivalent to the loop unwinding example in Section 3.4.17. I did see a performance gain from the unwinding, very little, but it was there:

Non-unwound loop: 00.0000023 milliseconds
Unwound loop: 00.0000019 milliseconds

Of course, increasing this to 10,000 iterations makes the unwound loop roughly 2x faster than the non-unwound loop.

Just wanted to clarify that I'm in no way attempting to give *false* information. Yes, this only applies in very rare, very specific scenarios; but trying can't hurt, right? :)
#17
04/06/2009 (10:23 am)
As others have said, it normally doesn't matter. It's more important to pay attention to heap use issues due to object lifetimes and unnecessary object creation and destruction. Therefore A is not ideal if concerned with heap performance at runtime (although in the next post I will discuss where A can be very helpful, even in games).

Regarding comparisons of B and C: B may be clearer, but C is faster, potentially significantly faster depending on what Count is doing. In this case not by much, but Count *is* a property, which means C is faster.

Why?

Remember, this is .NET. Reading a property always means invoking a property accessor, and there is overhead to doing that at runtime (especially if you're doing it millions of times in a tight loop). Even if it's a trivial member (you never wrote an accessor body yourself), one is still generated for you whether you like it or not. ;-)

When dealing with properties, the JIT compiler has little flexibility: the accessor MUST be invoked every time the property is read, whether you wrote it or the compiler wrote it. Since Method B reads this property on every iteration, you pay that accessor overhead N times.

Therefore, you could always just cache the property value, a minor derivation of Method B:

int count = listItems.Count;
for (int i = 0; i < count; i++)
{
   Item item = listItems[i] as Item;
   // Do stuff
}

Of course, Method C does this naturally, which is its major performance advantage over B.

Method C has another advantage, although not nearly as big as hoisting the property read. The JIT compiler can implement Method C at runtime using jump or loop instructions that check the zero flag immediately after the decrement. With Method B, a separate comparison is necessary after performing the increment.

Therefore, what can be expressed with two instructions when decrementing (DEC + Jxx) takes three when incrementing (INC + CMP + Jxx). DEC naturally sets the zero flag, which the jump instruction can test immediately, whereas incrementing toward a nonzero bound requires an additional comparison first. This holds whether you are talking about Jxx or LOOPx (LOOPx only decrements its loop counter).

Of course, let's keep in mind we are only talking about one measly instruction per iteration, which normally won't add up to much actual runtime overhead. But you asked, so yes, Method C is ever-so-slightly faster. ;-) This assumes the JITter isn't junk; Microsoft's is not.

But more importantly, Method C is faster because the property read happens only once, not N times. That will normally *vastly* overshadow the savings from counting backwards. And as shown, you don't have to count backwards to get this benefit: just cache the value in a local variable and you get the advantage without any of Method C's disadvantages.

The disadvantage of C is simply that it is not clear. Without a comment, anybody reading it could easily assume the items are being read backwards because they *need* to be read backwards. This is bad. And even with a comment, there is no guarantee it will be read. At the end of the day, writing code that's harder to understand, or easier to misinterpret, just to save one cycle per iteration is probably *not* worth it.

Hopefully some of this has helped somebody. ;-)

- Ryan
#18
04/06/2009 (11:02 am)
When would you want to use Method A (foreach) which uses iterators?

Let's say you are defining your own data structure that is going to be accessed from multiple threads. This can be very common when writing a TorqueX / XNA game targeting the 360. Let's also keep in mind that desktop processors are getting more cores, not fewer. ;-)

The generic collections are not thread-safe. This is actually a good thing, but it means if you want to access or alter them from multiple threads, you'll need to synchronize that somehow.

The generic collection enumerators are thread-safe for read access only. Multiple threads can enumerate a collection concurrently with no problems, as long as none of them modifies it. If the collection is modified, the enumerators will throw an exception on the next access. This is better than having undefined behavior, which is what you get if you are using the this[] accessor and Count while the collection is modified. Those bugs can be a nightmare to track down.
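For illustration, here is that fail-fast behavior in a minimal, self-contained form (the list contents are arbitrary):

```csharp
using System;
using System.Collections.Generic;

class EnumeratorInvalidation
{
    static void Main()
    {
        var items = new List<int> { 1, 2, 3 };
        bool caught = false;

        try
        {
            foreach (int i in items)
            {
                if (i == 2)
                    items.Add(4);   // structural change invalidates the enumerator
            }
        }
        catch (InvalidOperationException)
        {
            // The next MoveNext() after the modification throws,
            // rather than silently producing undefined results.
            caught = true;
        }

        Console.WriteLine(caught);   // prints True
    }
}
```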

The most common approach is to simply lock the entire structure. While simple, it is often too heavy-handed and introduces a bottleneck into the application, or worse, creates deadlock conditions that can be difficult to debug.

If you look at all of the points where the data structure needs to be accessed, it's not uncommon that many of the readers won't care whether an item has been added during their current pass; they only need to see it the next time they iterate through all objects. It is also not uncommon for data structures to only be added to, never removed from.

Therefore, depending on the usage needs of the various threads, it may be possible to implement a custom enumerator that provides the actual synchronization implementation, allowing the threads themselves to consider the data structure to be inherently threadsafe for their limited scope of use. This can make implementing proper synchronization a much simpler (or even automatic) task, without the overhead and/or danger of introducing pessimistic locking everywhere.

However, this mechanism is only easily implemented if you use iterators (foreach). If you are using .Count and indexing, it is *much* harder to implement. In fact, it is impossible to support all conditions, because an enumerator instance can be disposed separately from the collection instance itself; Count and this[] cannot.

Of course, when you implement the custom enumerator, you can also cache it so it is never actually disposed. This in turn removes the negative heap impact of using foreach itself... you won't be garbage collecting your enumerators no matter how often you use them.

Ryan