Improving cache utilization of animation keyframe data
by Manoel Neto · in Torque 3D Professional · 02/26/2011 (12:10 pm) · 0 replies
The game I'm working on has quite a lot of animated characters with lots of joints and my profiler dumps show TSShapeInstance_animateNodes as the top CPU-time waster, surpassing even TSSkinMesh_UpdateSkin (the models still need more aggressive LODs) on most test machines. I found that the way keyframe data is stored and accessed is *very* cache-unfriendly. I'll explain how so:
Keyframe data is stored in separate vectors for rotations, translations, etc. They're organized like this:
A 10 frame animation for 5 nodes has 50 entries. Items 0-9 are the keyframes for node 0, items 10-19 are the keyframes for node 1, 20-29 for node 2, etc. The array is indexed as follow:
The core animation code works like this:
So, to animate all 5 nodes at frame 0.5, the animation code reads 10 keyframes form the array, using the following indices: 0, 1, 10, 11, 20, 21, 30, 31, 40 and 41. See how the index skips 9 items every two? That's the cache problem.
Reading from RAM is absurdly slow. To make things faster, CPUs cache data into themselves. When the CPU is told to read something from RAM, it stores some the data right next to the requested address into the cache. The amount of data is defined by the size of the cache line. When the requested data is in the cache already, we have a "cache-hit" (GOOD). If it is not cached and must be fetched from RAM, you have a "cache miss" (BAD).
For illustration, let's assume we can fit 10 keyframes in a cache line. Let's see how the cache behaves when animating the 5 example nodes:
- Read at index 0: cache-miss, copy items 0 to 9 to cache.
- Read at index 1: cache-hit
- Read at index 10: cache-miss, copy items 10 to 19 to cache.
- Read at index 11: cache-hit
- Read at index 20: cache-miss, copy items 20 to 29 to cache.
- Read at index 21: cache-hit
- Read at index 30: cache-miss, copy items 30 to 39 to cache.
- Read at index 31: cache-hit
- Read at index 40: cache-miss, copy items 30 to 39 to cache.
- Read at index 41: cache-hit
As seen above, if the current animation is long enough we get a cache-miss for every node! And by "long enough", I mean more than 8 frames (a rotation keyframe has 8 bytes, and nowadays CPUs have 64 byte cache lines). Most animations (idle, run, etc) are much longer than this.
How to fix this? Changing the order keyframes are stored and indexed in the TSShape to be like this:
I haven't coded this yet (still very busy with other aspects of the game), but I'm leaving this posted here so the GG guys can give it some thought. When I get around to implementing this change, I'll report its impact on my profilings.
Keyframe data is stored in separate vectors for rotations, translations, etc. They're organized like this:
[all keyframes for node 0][all keyframes for node 1][all keyframes for node 2]...
A 10 frame animation for 5 nodes has 50 entries. Items 0-9 are the keyframes for node 0, items 10-19 are the keyframes for node 1, 20-29 for node 2, etc. The array is indexed as follow:
// [node #, keyframe #] [0,0],[0,1],[0,2],[0,3],[0,4],[0,5],[0,6],[0,7],[0,8],[0,9], [1,0],[1,1],[1,2],[1,3],[1,4],[1,5],[1,6],[1,7],[1,8],[1,9], [2,0],[2,1],[2,2],[2,3],[2,4],[2,5],[2,6],[2,7],[2,8],[2,9], [3,0],[3,1],[3,2],[3,3],[3,4],[3,5],[3,6],[3,7],[3,8],[3,9], [4,0],[4,1],[4,2],[4,3],[4,4],[4,5],[4,6],[4,7],[4,8],[4,9]
The core animation code works like this:
for each node read current keyframe for node read next keyframe for node interpolate between current and next keyframe
So, to animate all 5 nodes at frame 0.5, the animation code reads 10 keyframes form the array, using the following indices: 0, 1, 10, 11, 20, 21, 30, 31, 40 and 41. See how the index skips 9 items every two? That's the cache problem.
Reading from RAM is absurdly slow. To make things faster, CPUs cache data into themselves. When the CPU is told to read something from RAM, it stores some the data right next to the requested address into the cache. The amount of data is defined by the size of the cache line. When the requested data is in the cache already, we have a "cache-hit" (GOOD). If it is not cached and must be fetched from RAM, you have a "cache miss" (BAD).
For illustration, let's assume we can fit 10 keyframes in a cache line. Let's see how the cache behaves when animating the 5 example nodes:
- Read at index 0: cache-miss, copy items 0 to 9 to cache.
- Read at index 1: cache-hit
- Read at index 10: cache-miss, copy items 10 to 19 to cache.
- Read at index 11: cache-hit
- Read at index 20: cache-miss, copy items 20 to 29 to cache.
- Read at index 21: cache-hit
- Read at index 30: cache-miss, copy items 30 to 39 to cache.
- Read at index 31: cache-hit
- Read at index 40: cache-miss, copy items 30 to 39 to cache.
- Read at index 41: cache-hit
As seen above, if the current animation is long enough we get a cache-miss for every node! And by "long enough", I mean more than 8 frames (a rotation keyframe has 8 bytes, and nowadays CPUs have 64 byte cache lines). Most animations (idle, run, etc) are much longer than this.
How to fix this? Changing the order keyframes are stored and indexed in the TSShape to be like this:
[all keyframes for frame 0][all keyframes for frame 1][all keyframes for frame 2]...
// [node #, keyframe #] [0,0],[1,0],[2,0],[3,0],[4,0], [0,1],[1,1],[2,1],[3,1],[4,1], [0,2],[1,2],[2,2],[3,2],[4,2], [0,3],[1,3],[2,3],[3,3],[4,3], [0,4],[1,4],[2,4],[3,4],[4,4], [0,5],[1,5],[2,5],[3,5],[4,5], [0,6],[1,6],[2,6],[3,6],[4,6], [0,7],[1,7],[2,7],[3,7],[4,7], [0,8],[1,8],[2,8],[3,8],[4,8], [0,9],[1,9],[2,9],[3,9],[4,9]Using this scheme, animating all nodes at frame 0.5 would read keyframes using indices: 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9. It's obvious that, assuming 10 keyframes can fit a cache line, all 5 nodes would be animated with only the first cache-miss. In real-life, with a 64-byte cache line and 8-byte rotation keyframes, the animation would have a cache-miss at every 8 joints, as opposed to every 1 joint. That's 87.5% less cache-misses!
I haven't coded this yet (still very busy with other aspects of the game), but I'm leaving this posted here so the GG guys can give it some thought. When I get around to implementing this change, I'll report its impact on my profilings.
About the author