Game Development Community

m_matF_x_point3F() and m_matF_x_vectorF()

by asmaloney (Andy) · in Torque Game Engine · 01/20/2007 (8:36 am) · 14 replies

Here's a nice, simple optimization...

[I thought it was PPC only since there's already an asm version of these functions for x86, but it looks like the x86 Mac build does not define TORQUE_SUPPORTS_GCC_INLINE_X86_ASM, so it should apply there as well.]

In math/mathFn.h, change the code in m_matF_x_point3F() from this:
presult[0] = m[0]*p[0] + m[1]*p[1] + m[2]*p[2]  + m[3];
presult[1] = m[4]*p[0] + m[5]*p[1] + m[6]*p[2]  + m[7];
presult[2] = m[8]*p[0] + m[9]*p[1] + m[10]*p[2] + m[11];

to this:
const F32 p0 = p[0], p1 = p[1], p2 = p[2];
const F32 m0 = m[0], m1 = m[1], m2  = m[2];
const F32 m3 = m[3], m4 = m[4], m5  = m[5];
const F32 m6 = m[6], m7 = m[7], m8  = m[8];
const F32 m9 = m[9], m10 = m[10], m11 = m[11];

presult[0] = m0*p0 + m1*p1 + m2*p2  + m3;
presult[1] = m4*p0 + m5*p1 + m6*p2  + m7;
presult[2] = m8*p0 + m9*p1 + m10*p2 + m11;

And change the code in m_matF_x_vectorF() from this:
vresult[0] = m[0]*v[0] + m[1]*v[1] + m[2]*v[2];
vresult[1] = m[4]*v[0] + m[5]*v[1] + m[6]*v[2];
vresult[2] = m[8]*v[0] + m[9]*v[1] + m[10]*v[2];

to this:
const F32 v0 = v[0], v1 = v[1], v2 = v[2];
const F32 m0 = m[0], m1 = m[1], m2  = m[2];
const F32 m4 = m[4], m5 = m[5], m6  = m[6];
const F32 m8 = m[8], m9 = m[9], m10 = m[10];

vresult[0] = m0*v0 + m1*v1 + m2*v2;
vresult[1] = m4*v0 + m5*v1 + m6*v2;
vresult[2] = m8*v0 + m9*v1 + m10*v2;

What the hell? Adding all that makes it faster?

Read on for the...

Gory Details
------------


What's going on here? Let's look at m_matF_x_point3F() as an example. The compiler has no way of knowing that presult, m, and p do not point to the same place in memory [this is called aliasing]. We assert that p != presult at the top of the function, but the compiler doesn't understand the assert, so it must assume each store to presult[] could modify p[] and m[], and reload those values from memory for every row. Copying them into locals first tells the compiler they can't change, so it can keep them in registers.
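Another way to make the same no-alias promise, if you don't mind compiler extensions, is to mark the pointers with __restrict [supported by both gcc and VC++]. This is just a sketch of the idea, not the change above:

```cpp
#include <cassert>

typedef float F32;

// Sketch: the non-standard (but gcc- and VC++-supported) __restrict
// extension promises the compiler that these pointers never alias,
// so it may keep p[0..2] in registers instead of reloading them
// after each store to presult.
inline void m_matF_x_point3F_restrict(const F32* __restrict m,
                                      const F32* __restrict p,
                                      F32*       __restrict presult)
{
   presult[0] = m[0]*p[0] + m[1]*p[1] + m[2]*p[2]  + m[3];
   presult[1] = m[4]*p[0] + m[5]*p[1] + m[6]*p[2]  + m[7];
   presult[2] = m[8]*p[0] + m[9]*p[1] + m[10]*p[2] + m[11];
}
```

The tradeoff is that __restrict is a promise, not a check: if someone does pass aliasing pointers, the results are silently wrong, whereas the hand-hoisted locals are always correct.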

This optimization is difficult to profile because it's pervasive. Since m_matF_x_point3F() is inlined in other inline functions [MatrixF::mulP()] which are used throughout, changing this code has a general impact on the topology of the profile. It might look like some functions are slower, but they may just be called more often now.

So let's look at the PPC instructions gcc generated!

Before - 36 instructions
lfs f0,4(r4)
lfs f12,4(r3)
lfs f9,0(r4)
lfs f13,0(r3)
lfs f11,8(r4)
fmuls f12,f12,f0
lfs f0,8(r3)
lfs f10,12(r3)
fmadds f13,f13,f9,f12
fmadds f0,f0,f11,f13
fadds f0,f0,f10
stfs f0,0(r5)
lfs f0,4(r4)
lfs f12,20(r3)
lfs f9,0(r4)
lfs f13,16(r3)
lfs f11,8(r4)
fmuls f12,f12,f0
lfs f0,24(r3)
lfs f10,28(r3)
fmadds f13,f13,f9,f12
fmadds f0,f0,f11,f13
fadds f0,f0,f10
stfs f0,4(r5)
lfs f0,4(r4)
lfs f12,36(r3)
lfs f13,32(r3)
lfs f9,0(r4)
lfs f10,8(r4)
fmuls f12,f12,f0
lfs f0,40(r3)
lfs f11,44(r3)
fmadds f13,f13,f9,f12
fmadds f0,f0,f10,f13
fadds f0,f0,f11
stfs f0,8(r5)
blr

After - 30 instructions
lfs f0,4(r4)
lfs f13,4(r3)
lfs f12,20(r3)
lfs f11,36(r3)
lfs f10,0(r3)
fmuls f13,f13,f0
lfs f9,16(r3)
fmuls f12,f12,f0
lfs f8,32(r3)
fmuls f11,f11,f0
lfs f0,0(r4)
fmadds f10,f10,f0,f13
lfs f13,8(r3)
fmadds f9,f9,f0,f12
lfs f12,24(r3)
fmadds f8,f8,f0,f11
lfs f0,8(r4)
lfs f11,40(r3)
fmadds f13,f13,f0,f10
lfs f10,28(r3)
fmadds f12,f12,f0,f9
lfs f9,44(r3)
fmadds f11,f11,f0,f8
lfs f0,12(r3)
fadds f12,f12,f10
fadds f13,f13,f0
fadds f11,f11,f9
stfs f12,4(r5)
stfs f13,0(r5)
stfs f11,8(r5)
blr

Not a bad savings when applied everywhere MatrixF::mulP() is called [similar savings on MatrixF::mulV()].

[Aside: I had tried in the past to Altivec these [that can be a verb, right?], but I could not realize any savings because of all the manipulations to get the data in the right formats.]

[Edit: missed a word or two...]

#1
01/20/2007 (9:56 am)
Your recent posts have been so awesome.
#2
01/20/2007 (11:25 am)
I agree, these are awesome...it's getting tough to keep track of all of them, though. It would be great to start putting these into a resource or two...
#3
01/20/2007 (11:39 am)
@Orion: Thanks - I appreciate it!

@Rubes: Thanks! You're the second to request a resource... I've been considering how I might do that. Stay tuned - same bat-time, same bat-channel! [sorry - couldn't be helped]
#4
01/20/2007 (11:45 am)
I have been following your threads with great interest; however, I have no clue how to see the asm code of a given function, so I can't help you test or I would.
#5
01/22/2007 (9:09 am)
Um... anyone know how to view assembly in VS 2005?
#6
01/22/2007 (10:20 am)
Hey Orion, while debugging, right-click -> Go To Disassembly.
#7
01/22/2007 (1:39 pm)
For any that are interested,
I measured the effect of this change on a Win32 stand-alone client,
and it seemed to be a very slight gain when rendering mostly DIFs,
and a somewhat less slight loss when rendering mostly players.
Why it should both help and hinder, I'm not sure;
I have the most confidence in the slight-loss number.

many DIFs: 38.8 FPS -> 39.7
56 Players: 24.7 FPS -> 22.8

Also curious: the file size was reduced by 8 KB with this change.
#8
01/22/2007 (1:52 pm)
That's interesting!

24.7 FPS -> 22.8 FPS is quite serious though.
#9
01/22/2007 (2:29 pm)
@Orion: Thanks for the check! I will relegate this type of alias optimization to the GCC-only bin for now :-)
#10
01/22/2007 (2:40 pm)
A teammate did a closer analysis,
and it looks like the modified version does 12 additional "fstp" ops,
due to the creation of the const storage.

He's still fooling around with it, though.
#11
01/23/2007 (11:52 am)
Have you tried const-ing the inputs rather than allocating new const variables?
#12
01/23/2007 (12:28 pm)
The inputs are already const, but that doesn't solve the alias problem. Let's say v and vresult point to the same place in memory. Then:

vresult[0] = m[0]*v[0] + m[1]*v[1] + m[2]*v[2];
vresult[1] = m[4]*[b]v[0][/b] + m[5]*v[1] + m[6]*v[2];
vresult[2] = m[8]*[b]v[0][/b] + m[9]*[b]v[1][/b] + m[10]*v[2];

The compiler has no way of knowing it does not need to refetch the ones in bold, so it can't just load them once into registers. [That's why I'm surprised about the VC++ compiler. Orion - can you post the before and after asm?]

Actually, looking at this again, the m values don't need to be hoisted, since they can't overlap:
const F32 v0 = v[0], v1 = v[1], v2 = v[2];

vresult[0] = m[0]*v0 + m[1]*v1 + m[2]*v2;
vresult[1] = m[4]*v0 + m[5]*v1 + m[6]*v2;
vresult[2] = m[8]*v0 + m[9]*v1 + m[10]*v2;

Likewise for m_matF_x_point3F():
const F32 p0 = p[0], p1 = p[1], p2 = p[2];

presult[0] = m[0]*p0 + m[1]*p1 + m[2]*p2  + m[3];
presult[1] = m[4]*p0 + m[5]*p1 + m[6]*p2  + m[7];
presult[2] = m[8]*p0 + m[9]*p1 + m[10]*p2 + m[11];
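If you want to convince yourself the pared-down version computes exactly the same thing as the original, here's a throwaway standalone comparison [a sketch for testing, not engine code]:

```cpp
#include <cassert>

typedef float F32;

// Original version, straight from mathFn.h.
inline void mulV_original(const F32* m, const F32* v, F32* vresult)
{
   vresult[0] = m[0]*v[0] + m[1]*v[1] + m[2]*v[2];
   vresult[1] = m[4]*v[0] + m[5]*v[1] + m[6]*v[2];
   vresult[2] = m[8]*v[0] + m[9]*v[1] + m[10]*v[2];
}

// Pared-down version: only v[0..2] hoisted into locals, since only
// v can alias vresult.
inline void mulV_hoisted(const F32* m, const F32* v, F32* vresult)
{
   const F32 v0 = v[0], v1 = v[1], v2 = v[2];

   vresult[0] = m[0]*v0 + m[1]*v1 + m[2]*v2;
   vresult[1] = m[4]*v0 + m[5]*v1 + m[6]*v2;
   vresult[2] = m[8]*v0 + m[9]*v1 + m[10]*v2;
}
```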
#13
01/23/2007 (4:43 pm)
FWIW, the Visual Studio compiler produces ~27 instructions (vs. 36 for the original) for this:

inline void m_matF_x_point3F(const F32* m, const F32* p, F32* presult)
{
   AssertFatal(p != presult, "Error, aliasing matrix mul pointers not allowed here!");
   const F32* pt = p;
   *presult++ = *m++ * *pt++ + *m++ * *pt++ + *m++ * *pt + *m++;
   pt = p;
   *presult++ = *m++ * *pt++ + *m++ * *pt++ + *m++ * *pt + *m++;
   pt = p;
   *presult   = *m++ * *pt++ + *m++ * *pt++ + *m++ * *pt + *m;
}

I'm guessing that is still between 40 and 50 cycles.

With SIMD/SSE2 you could probably do it in 30 cycles, but I'm not sure it's worth it.
#14
01/23/2007 (5:04 pm)
Wow Tim. That's something I hadn't considered - ugly, but sweet:

lfs f12,0(r3)
addi r3,r3,12
lfs f13,0(r4)
fmuls f13,f12,f13
fadds f0,f13,f13
fadds f0,f0,f13
fadds f0,f0,f12
stfs f0,0(r5)
lfsu f12,4(r3)
lfs f0,0(r4)
fmuls f0,f12,f0
fadds f13,f0,f0
fadds f13,f13,f0
fadds f13,f13,f12
stfsu f13,4(r5)
lfs f12,16(r3)
lfs f0,0(r4)
fmuls f0,f12,f0
fadds f13,f0,f0
fadds f13,f13,f0
fadds f13,f13,f12
stfs f13,4(r5)
blr

22 instructions

Edit: Except that asm is wrong... It gives incorrect results on gcc. Perhaps the optimizer is borked?
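One likely culprit: all those m++ and pt++ increments happen inside a single expression, and C++ leaves the relative order of those side effects unsequenced [multiple unsequenced modifications of m in one expression are actually undefined behavior], so gcc is within its rights to produce different results than VC++. An indexed version [just a sketch] keeps the pointer-walking spirit but is well-defined on every compiler:

```cpp
#include <cassert>

typedef float F32;

// Sketch: same arithmetic as the pointer-walking version, but with
// explicit indices instead of increments inside the expressions,
// so the result cannot depend on the compiler's evaluation order.
inline void m_matF_x_point3F_indexed(const F32* m, const F32* p, F32* presult)
{
   for (int row = 0; row < 3; ++row)
   {
      const F32* mr = m + row*4;   // rows are 4 floats apart
      presult[row] = mr[0]*p[0] + mr[1]*p[1] + mr[2]*p[2] + mr[3];
   }
}
```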