Altivec optimized point3F_normalize
by asmaloney (Andy) · in Torque Game Engine · 01/11/2007 (1:56 pm) · 29 replies
In my quest to make the Stronghold mission actually playable on my PowerBook G4, my Shark profile showed that m_point3F_normalize_C() accounted for 1.3% of the time. [It showed uses mainly in the precipitation code.] So here's an altivec implementation which reduced that to 0.4% - not a bad savings. [I used journal playback so I can compare profile results directly.]
In math/mMathAltiVec.cc add:
// ------------
// The following code is from here: http://developer.apple.com/hardwaredrivers/ve/algorithms.html
//result = v^-0.5
inline vector float ReciprocalSquareRoot( vector float v )
{
//Get the square root reciprocal estimate
vector float zero = (vector float)(0);
vector float oneHalf = (vector float)(0.5);
vector float one = (vector float)(1.0);
vector float estimate = vec_rsqrte( v );
//One round of Newton-Raphson refinement
vector float estimateSquared = vec_madd( estimate, estimate, zero );
vector float halfEstimate = vec_madd( estimate, oneHalf, zero );
return vec_madd( vec_nmsub( v, estimateSquared, one ), halfEstimate, estimate );
}
// ------------
static void vec_point3F_normalize(F32 *p)
{
vector float zeros = (vector float)(0.0f);
vector float pv;
F32 *loader = (F32*) &pv;
loader[0] = p[0];
loader[1] = p[1];
loader[2] = p[2];
loader[3] = 0.0f;
// Square each of the components of the vector, then add them together
// After this, all components of the vector are the same value
// This is the recommended way to operate across a vector
vector float squared = vec_madd( pv, pv, zeros );
squared = vec_add( squared, vec_sld( squared, squared, 8 ) );
squared = vec_add( squared, vec_sld( squared, squared, 4 ) );
if ( vec_all_eq( squared, zeros ) )
{
p[0] = 0;
p[1] = 0;
p[2] = 1;
}
else
{
// If you're happy with a faster estimated reciprocal sqrt, use vec_rsqrte() instead of ReciprocalSquareRoot()
// pv = vec_madd( pv, vec_rsqrte( squared ), zeros );
pv = vec_madd( pv, ReciprocalSquareRoot( squared ), zeros );
p[0] = loader[0];
p[1] = loader[1];
p[2] = loader[2];
}
}

Then in mInstallLibrary_Vec(), add this:
m_point3F_normalize = vec_point3F_normalize;
If there are any Altivec masters out there - I'm not - please let me know how this can be improved!
#2
01/11/2007 (4:22 pm)
Looks spiffy! I certainly don't see any way to improve it, seeing as 90% of it is from Apple's developer page on Altivec, and we're dealing with 12 bytes and not 16 so there isn't a faster way to load or store the result.

I'm pretty sure the same essential algorithm here can be used for normalizing Point2Fs and Point4Fs, with the added advantage that we could use vec_ld and vec_st with the Point4F, which would help speed things up a bit (assuming 16 byte alignment).
#3
01/11/2007 (4:41 pm)
Yes the main problem is of course having to marshal the data since it's not stored in a vector processor-friendly way. I'd tried altivec-ing [!] a couple of other functions, but the cost of doing the conversion cancelled out the speed advantage... Oh well - there are others I might look at.
#4
01/11/2007 (11:36 pm)
Well, I've been wanting to mess around with altivec and CPU optimization for a while now, and this became an excellent excuse to go hog wild and try out everything to see how much performance could be squeezed out of this.

First off, I wrote a quick console function that would create a Point3F, normalize it 10,000,000 times, and quit. Very handy for performance testing.
Stock Torque ran through that in about 2.35 seconds, and the code here took 1.75 seconds.
As stated though, the altivec implementation suffers in getting data in and back out. Putting the point in a vector was about 20% of the function's time, and getting the data back out another 15%. Still a fair bit we can fix up.
Loading Data
Apple has a very nice bit on loading and storing unaligned data, which I liberally borrowed from.
The initial go using LoadUnaligned showed little improvement: 1.71 seconds. At this point, Shark became my best friend in finding the little itty bitty things gcc likes to toss in to make code slow. As expected, gcc took great pains to store every vector after every operation, even if it has to load it up again for the next instruction (way to go gcc!).
Liberal application of the register qualifier reduced time to 1.56 seconds, much of this coming from the ReciprocalSquareRoot function, which ran 40% faster (307ms vs. 515ms). Adding a const qualifier to the constant vectors also helped gcc, and dropped total time to 1.47 seconds.
Issues with 12 bytes
LoadUnaligned loads 16 bytes, but we only want the first 12 of those. To zero out the last 4 bytes, I use a madd. Removing that madd drops total time to 1.38 seconds. Too bad it's necessary. However, this is a good indication that when we need 16 bytes (like in the altivec matrix multiply) LoadUnaligned is about 25% - 35% faster than trying to load the data by hand.
Conclusion
It probably isn't going to get much faster than 1.47 seconds. However, some useful things popped up through this experimentation.
1) Use the const and register qualifiers liberally when dealing with AltiVec. gcc is obsessed with storing and loading vectors, and this adds up very quickly. const and register tell gcc that we will be using this variable quite a bit, and it cuts down on the storing and loading (some, not all of it. Even with every vector labelled register gcc would still routinely store a vector and load it two ASM instructions later).
2) LoadUnaligned is faster than writing in the data manually, and using an aligned union is faster than writing out the data manually. The AltiVec matrix multiply would probably see a significant performance gain from using LoadUnaligned if the data isn't already aligned, as will any AltiVec math functions using matrices.
3) Write your own ASM. A large chunk of the time spent is gcc doing random stuff for who knows what reason, namely storing vectors, loading them, and swapping around which vectors are in which registers. Hand tuned ASM will be significantly faster still.
4) Prefetching data might be useful. It probably wouldn't have any effect for Point3F normalization (beyond putting 4 bytes of data we don't care about in L1 cache), but again this could arguably help the matrix multiplication and AltiVec blender. I'll mess around with it a bit, at some point.
5) I like GPU SIMD more. The instruction set is a bit more useful (wouldn't a dot product AltiVec instruction be awesome for this, or just about any other math function?), I don't have to worry about registers and cache and RAM (it's all registers), and swizzling is just plain awesome (though the vec_perm instruction is pretty nifty too).
#5
01/12/2007 (6:45 am)
Hey - thanks for the info! I just 'discovered' the unaligned data trick on another site last night and applied it to a different function I was working on. Of course, like you, the matrix multiplication sprang to mind because it seems to deal with unaligned data a lot [at least in my tests].

The code generated by GCC can often be braindead. Unfortunately the last time I wrote ASM it was for CISC processors - RISC frightens me with all its 'prefetching' and 'scheduling' :-)
#6
01/12/2007 (10:17 am)
Great thread. Does anyone know if there's an approximate sqrt() function for general C++ or Windows in particular?
#7
01/12/2007 (10:32 am)
I agree, this is great stuff. Thanks for working on this, guys!
#8
01/12/2007 (12:14 pm)
Yes - there's a fast reciprocal square root approximation you can use on any IEEE compliant platform.

www.lomont.org/Math/Papers/2003/InvSqrt.pdf is a nice write-up on the technique (popularized as Carmack's Square Root). www.codemaestro.com/reviews/review00000105.html has some more info on it as well.
#9
01/12/2007 (12:30 pm)
Cool, thanks ben.
#10
01/12/2007 (3:47 pm)
Fwiw, I replaced all the calls to 1/mSqrt() in mMath_C.cc with the following approximator and noticed no discrepancies running TGE. I haven't done any performance measurements though.
static F32 m_ApproxInvSqrtF32(F32 x)
{
F32 xhalf = 0.5f*x;
S32 i = *(S32*)&x; // get bits for floating value
i = 0x5f375a86 - (i>>1); // gives initial guess y0
x = *(F32*)&i; // convert bits back to float
x = x * (1.5f - xhalf * x * x); // Newton step, repeating increases accuracy
return x;
}
#11
01/12/2007 (6:31 pm)
@Alex: I tried constifying and register-ing ReciprocalSquareRoot(), but performance decreased for me. Could you post what you came up with for that function? [Edit: Are you running a G4 or a G5?]

@Ben: I like Duff's Device on Code Maestro. Now that's some code-fu!
[Of course I mean 'like' in a stare-at-the-screen-squint-turn-my-head-to-the-left-poke-my-eye-with-a-fork kinda way.]
@Orion: I'd be curious what kind of performance you get with that if you do some measurements. I like the comment in the code on Code Maestro for Carmack's Constant.
#12
01/12/2007 (6:41 pm)
I liked that too... ;)

Thanks for those links, Ben. I agree with Andy, it's blank-stare-level stuff for me, but I'm enjoying the education.
#13
01/12/2007 (6:46 pm)
I might have to use Duff's Beer, er.. Device just to screw with people's minds. "What! That can't possibly be legal code! I'm sending it to The Daily WTF!"

Uh... getting off topic.
#14
01/12/2007 (9:52 pm)
@Andy: I am running a G4+ (7450/7455) clocked at 1.42GHz.

Here's my ReciprocalSquareRoot function:
//Get the square root reciprocal estimate
const register vector float zero = (vector float)(0);
const register vector float oneHalf = (vector float)(0.5);
const register vector float one = (vector float)(1.0);
const register vector float estimate = vec_rsqrte( v );
//One round of Newton-Raphson refinement
const register vector float estimateSquared = vec_madd( estimate, estimate, zero );
const register vector float halfEstimate = vec_madd( estimate, oneHalf, zero );
return vec_madd( vec_nmsub( v, estimateSquared, one ), halfEstimate, estimate );
Worth noting is that GCC produces much better code using full optimization, and that if your altivec implementation runs slower than the original scalar, compile it in a release build and see how it compares. I saw a huge difference between debug and release with my go at altivec'ing m_matF_x_point4F. In debug the scalar code executed 10,000,000 runs in 2.0 seconds, while the altivec code took 2.3. In a release build, the scalar code executed in 792ms while the altivec code executed in 642ms (!). Essentially I went from a 20% performance hit to a 20% performance boost.
Other things I have picked up:
Functions that use altivec
For all sorts of mystical reasons, functions that use altivec take much much longer to set up than functions which do not. For example, in the optimized altivec matF_x_point4F, more than 20% of the function's execution is simply setting up the stack. Therefore, for small things which won't see much boost from altivec anyways, that function setup cost is likely to negate what little boost is given and hurt performance.
Matrix loading
A MatrixF, aligned or unaligned, can be loaded using 5 vec_ld's, 4 vec_perm's, and one vec_lvsl.
// Load up the matrix, with a bit extra in case we aren't aligned
register vector float m0, m1, m2, m3, me;
register vector unsigned char mask;
m0 = vec_ld(0, m);
m1 = vec_ld(16, m);
m2 = vec_ld(32, m);
m3 = vec_ld(48, m);
me = vec_ld(63, m);
// Align the matrix. If we already are aligned, mask is the identity permute, and me == m3
mask = vec_lvsl(0, m);
m0 = vec_perm(m0, m1, mask);
m1 = vec_perm(m1, m2, mask);
m2 = vec_perm(m2, m3, mask);
m3 = vec_perm(m3, me, mask);
While this looks like significantly more than four vec_ld calls, in reality it isn't. A vec_ld and vec_perm can execute simultaneously. vec_ld requires 3 cycles, vec_perm requires 2 (on my G4+). GCC, quite intelligently, intermingled the vec_ld and vec_perm calls, aligning the data for free. In the case of matrixF_x_point4F, we need to transpose the matrix, and GCC kindly deals with the data in the order it is being used, so we are simultaneously loading, permuting, and merging vectors. Essentially, for this specific function, the above code requires no more time than four straight vec_ld's.
Why AltiVec is faster
For this function, AltiVec runs faster for two key reasons. Obviously, SIMD reduces computation time. Of equal importance though is intelligent pipelining. The altivec function stalls for a mere 9 cycles, while the C function stalls for a whopping 56. Because AltiVec gives us access to numerous execution units which run in parallel, we can push more data through.
On CPU optimization
I could really get into this kind of thing. It's kinda fun knowing how the CPU is allocating its time, and what instructions are executing where and when they'll return. With GPU/shader optimization, I'm mostly limited to stuff that should work and hoping I'm still GPU bound so I'll notice the fps difference if it did or didn't.
#15
01/13/2007 (7:23 am)
Good stuff Alex. That's what I had for my ReciprocalSquareRoot(). Maybe the difference is I'm measuring it in real use, not just a test case. [I am only using release builds for optimization.]

Here's a paper by Ian Ollmann I ran across that you [and others] might find useful:
Practical Altivec Strategies: The Why, How and When of Optimization Success and Failure Using Altivec [it's a PDF].
#16
01/13/2007 (3:35 pm)
Based on Alex's comments and the paper I mentioned, here's another version:

//result = v^-0.5
inline vector float ReciprocalSquareRoot( const vector float &v )
{
//Get the square root reciprocal estimate
const register vector float zero = (vector float)(0);
const register vector float oneHalf = (vector float)(0.5);
const register vector float one = (vector float)(1.0);
const register vector float estimate = vec_rsqrte( v );
//One round of Newton-Raphson refinement
const register vector float estimateSquared = vec_madd( estimate, estimate, zero );
const register vector float halfEstimate = vec_madd( estimate, oneHalf, zero );
return vec_madd( vec_nmsub( v, estimateSquared, one ), halfEstimate, estimate );
}
// Convert a Point3F to a vector, filling in 0.0 for the 4th element
// Safe for aligned and unaligned addresses
inline vector float Point3FToVector0( F32 *p )
{
register vector float MSQ = vec_ld( 0, p );
register vector float LSQ = vec_ld( 15, p );
register vector unsigned char mask = vec_lvsl( 0, p );
register vector float temp = vec_perm( MSQ, LSQ, mask );
const register vector float zeroLast = (vector float)(1.0f, 1.0f, 1.0f, 0.0f);
const register vector float zeros = (vector float)(0.0f);
return vec_madd( temp, zeroLast, zeros );
}
static void vec_point3F_normalize(F32 *p)
{
register vector float pv = Point3FToVector0( p );
register const F32 *result = (F32*) &pv;
p[0] = 0;
p[1] = 0;
p[2] = 1;
// Square each of the components of the vector, then add them together
const register vector float zeros = (vector float)(0.0f);
register vector float squared = vec_madd( pv, pv, zeros );
squared = vec_add( squared, vec_sld( squared, squared, 8 ) );
squared = vec_add( squared, vec_sld( squared, squared, 4 ) );
if ( vec_any_ne( squared, zeros ) )
{
// If you're happy with a faster estimated reciprocal sqrt, use vec_rsqrte() instead of ReciprocalSquareRoot()
// pv = vec_madd( pv, vec_rsqrte( squared ), zeros );
pv = vec_madd( pv, ReciprocalSquareRoot( squared ), zeros );
p[0] = result[0];
p[1] = result[1];
p[2] = result[2];
}
}

Running my profile test journal:
m_point3F_normalize_C: 682ms
1st vec version: 236ms
2nd vec version: 215ms
I'm sure there are other funky scheduling things we can do, but I think I'm at the point of diminishing returns with this one. That speedup is pretty good! I hope I don't become obsessed with this...
#17
01/22/2007 (5:00 pm)
Y'know, another cool thing about using this Newton's method approximate inverse square root stuff is that approximateSqrt(0) != NaN... it's just a very large number, like 1.9e19.
which means that normalizeSafe() could be sped up as well, because you could avoid the call to mSqrt.
something like this:
inline void Point3F::normalizeSafe()
{
F32 invVmag = approximateInverseSqrt(x*x + y*y + z*z);
if (invVmag < HUGE)
{
*this *= invVmag;
}
}here's the original:
inline void Point3F::normalizeSafe()
{
F32 vmag = magnitudeSafe();
if( vmag > POINT_EPSILON )
{
*this *= F32(1.0 / vmag);
}
}
#18
01/23/2007 (10:45 am)
More notes on this - if you're using the Newton/Raphson thing, be aware that it may be platform-dependent! On our Linux servers the following code just returns zero for all the values I tested:
F32 xhalf = 0.5f*x;
S32 i = *(S32*)&x; // get bits for floating value
i = 0x5f375a86 - (i>>1); // gives initial guess y0
x = *(F32*)&i; // convert bits back to float
x = x * (1.5f - xhalf * x * x); // Newton step, repeating increases accuracy
return x;
#19
01/23/2007 (11:12 am)
The technique works on every platform that supports IEEE floating point numbers. So unless you're running some very weird hardware you're probably ok. But there could be an endian issue or some other "minor detail" like a compiler setting preventing it from working.
At what line do the values go to zero?
#20
01/23/2007 (11:22 am)
Hey ben - it turns out that release builds return bad results, the debug is fine. We're looking into it more...

Heh - kind of a conundrum, eh? An optimized routine in debug versus an unoptimized one in release.
Associate Kyle Carter