Game Development Community

VTune

by Orion Elenzil · in Torque Game Engine · 05/24/2007 (4:09 pm) · 9 replies

I realize a lot of folks can't afford the $700 price tag,
but, wow, VTune is an awesome tool.

Not only can it tell you things like "you spent 50% of the time in ShapeBase::UpdateMesh()",
but you can even explore the call-graph.

In our case, when players came into scope, we were making an awful lot of calls to dStricmp(),
and good luck figuring out which of about a zillion call sites of dStricmp() is the really expensive one.

With VTune properly configured, it told me right away that in certain situations "you called dStricmp about a million times!" and then the really amazing part is you just click on "show me the most expensive call path" and voila - it's being called 900,000 times by TSShape::findName(), which is being called 100,000 times by foo(), which is being called 100,000 times by bar(), etc.

In my case this led me to refactor all the calls to TSShape::findName() to only accept StringTableEntries instead of arbitrary const char*'s. (which of course increases the entries in the stringtable and the number of calls to StringTable->insert(), but the vast majority of those calls are only during mission loading, so this is a win)
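
To illustrate why interned names help, here is a minimal sketch. NameTable, InternedName and Shape are hypothetical stand-ins, not Torque's StringTable or TSShape, and unlike dStricmp() this sketch is case-sensitive. The idea is that equal strings share one pointer, so findName() compares pointers instead of characters.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical interned-string handle: two handles are equal exactly
// when the strings they were created from are equal.
using InternedName = const char*;

// Minimal interning table (an assumption, not Torque's StringTable).
class NameTable {
public:
    // Returns a stable pointer shared by all equal strings.
    InternedName insert(const std::string& s) {
        // unordered_set never relocates its elements, so c_str() stays valid.
        return mStrings.insert(s).first->c_str();
    }
private:
    std::unordered_set<std::string> mStrings;
};

// Hypothetical shape that stores interned node names.
struct Shape {
    std::vector<InternedName> names;

    // One pointer comparison per entry; no per-character string compare.
    int findName(InternedName name) const {
        for (std::size_t i = 0; i < names.size(); ++i)
            if (names[i] == name) return static_cast<int>(i);
        return -1;
    }
};
```

The cost moves to insert(), which hashes and compares the string once. That matches the trade-off described above: more work at load time, less work per lookup.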

#1
05/24/2007 (4:46 pm)
Yes, VTune is great. I actually find the built-in Torque profiler to be incredibly useful too (different strengths and different weaknesses).

BTW, in the case of TSShape::findName, to get the best performance you probably want to use the index lookup method and store off the index for repeated use. But maybe you can't in your current situation.
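
A rough sketch of "look up once, store the index". Shape, NodeRef and the lookups counter are hypothetical and only there to make the effect visible; this is not the TSShape API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical shape: node names plus per-node data in parallel arrays.
struct Shape {
    std::vector<std::string> names;
    std::vector<float> nodeValues;
    mutable int lookups = 0;  // counts name searches, for illustration only

    // Linear search by name; this is the expensive call to avoid repeating.
    int findName(const std::string& name) const {
        ++lookups;
        for (std::size_t i = 0; i < names.size(); ++i)
            if (names[i] == name) return static_cast<int>(i);
        return -1;
    }
};

// Resolves the name once at construction and reuses the index afterwards.
class NodeRef {
public:
    NodeRef(const Shape& shape, const std::string& name)
        : mShape(shape), mIndex(shape.findName(name)) {}

    bool valid() const { return mIndex >= 0; }

    // Per-frame access: plain array indexing, no string work.
    float value() const {
        assert(valid());
        return mShape.nodeValues[static_cast<std::size_t>(mIndex)];
    }
private:
    const Shape& mShape;
    int mIndex;
};
```

The cached index is only valid while the shape's node list stays unchanged, so it would need to be re-resolved if the shape is reloaded.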

Anyway, feel free to forward any performance issues you see to me, to Andy Maloney, or just post them here.
#2
05/24/2007 (5:01 pm)
.. Or use a Hash table ! ;)
#3
05/24/2007 (8:38 pm)
Here's a small but easy optimization this has turned up.

i had set up a mission with about 150 AIPlayers in view,
but with their LOD level artificially dialed down so far that they had zero polygons.

i noticed that the framerate in this situation was much lower than i would have expected,
and VTune has finally allowed me to get in there and pry.

turns out that most of the time in this situation was in Quat16::getQuatF().
it was using about 10% of CPU time (i think. i'm still getting used to VTune),
followed by several at 5% and a bunch below that.

by making the following changes and rerunning the same test
i was able to get getQuatF()'s CPU time down to about 3%.

a very surprising thing i found is that although i have Visual Studio set to "Maximize Speed" (/O2),
dividing a float by a constant (f / c) was significantly slower than multiplying by its reciprocal (f * (1 / c)).
- i had been assuming that the compiler optimized the divide into the multiply.

the ultimate impact on framerate from the change below is pretty small, but hey, i'll take it.

a much better optimization would be to get rid of the integer-to-float cast in there.
my first thought was to keep separate member variables for the float version of the values,
but that was quickly complicated because the Quat16 structure is directly read from binary files in TSShape::readOldShape(). so that should probably still be done, but it will involve some actual programming.

anyhow, the optimization.

change this:
QuatF & Quat16::getQuatF( QuatF * q ) const
{
   q->x = float( x ) / float(MAX_VAL);
   q->y = float( y ) / float(MAX_VAL);
   q->z = float( z ) / float(MAX_VAL);
   q->w = float( w ) / float(MAX_VAL);
   return *q;
}

to this:
QuatF & Quat16::getQuatF( QuatF * q ) const
{
   q->x = float( x ) * (1.0f / MAX_VAL);
   q->y = float( y ) * (1.0f / MAX_VAL);
   q->z = float( z ) * (1.0f / MAX_VAL);
   q->w = float( w ) * (1.0f / MAX_VAL);
   return *q;
}

edit: originally had * 0.0000305185095f. realized 1.0f / MAX_VAL optimizes to the same thing.
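
A likely reason the compiler does not make this substitution on its own: 1 / MAX_VAL is not exactly representable as a float, so f / c and f * (1 / c) can round differently, and under the default floating-point settings the compiler has to preserve the result of the division as written. Relaxed modes such as MSVC's /fp:fast are allowed to make this kind of substitution. The sketch below assumes MAX_VAL is 0x7fff (consistent with the 0.0000305185095f constant mentioned above); the function names are made up for illustration. It measures how far apart the two forms can get over the whole 16-bit range.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Assumption: 16-bit signed range, so 1 / MAX_VAL is about 0.0000305185.
const int MAX_VAL = 0x7fff;

// Original form: one floating-point divide per component.
float unpackDivide(std::int16_t v) {
    return float(v) / float(MAX_VAL);
}

// Rewritten form: the reciprocal is a compile-time constant,
// leaving one multiply per component.
float unpackMultiply(std::int16_t v) {
    return float(v) * (1.0f / MAX_VAL);
}

// Largest absolute difference between the two forms over all inputs.
float maxDifference() {
    float worst = 0.0f;
    for (int v = -MAX_VAL; v <= MAX_VAL; ++v) {
        const std::int16_t s = static_cast<std::int16_t>(v);
        const float d = std::fabs(unpackDivide(s) - unpackMultiply(s));
        if (d > worst) worst = d;
    }
    return worst;
}
```

Any difference is on the order of the last bit of a float, which should be harmless for unpacking quaternion components, but it is enough to stop the compiler from treating the two forms as interchangeable.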
#4
05/25/2007 (4:32 am)
Seems like VTune could be a really useful tool. Unfortunately, from reading the requirements, it only runs on Intel CPUs. Does anyone know of a similar kind of tool that would work for AMD processors?
#5
05/25/2007 (8:30 am)
Hm - i'm not sure about that. VTune is an Intel product, and has some features specifically for optimizing for Intel processors, but the general profiling features should be CPU-agnostic, i think. It has a 30-day free trial..
#6
05/25/2007 (8:48 am)
You can perform a few tests. Which tests were you using to find out you called dStricmp about a million times? Most of the ones I tried just told me it couldn't detect my CPU architecture.
#7
05/25/2007 (9:14 am)
Really? huh. i may be wrong. the two tests (aka "activities") i've used so far are Sampling and CallGraph. When you (i) first launch it, it has a little panel giving you the option of setting up a new sampling activity. - Does it even get that far for you?
#8
05/25/2007 (9:17 am)
Ah yes, activities ;) that's the word I was looking for... I can get as far as selecting them, but when I try to start the activity I get a pop-up message saying it cannot detect the CPU architecture. I've just tried it on my laptop, which is Intel, and that works, although the laptop is so old these days I can barely move around in Torque.
#9
05/25/2007 (9:21 am)
Good info on the get quat. Just shows the importance of profiling your own game and not just relying on the engine team. The above would never show up in most profiles because ts animation is not a bottleneck. But once you get rid of rendering and have 100 animating shapes, things change and new bottlenecks appear.