Double vs. float constants
by asmaloney (Andy) · in Torque Game Engine · 01/19/2007 (4:42 pm) · 6 replies
As I've been working my way around the TGE codebase, I've noticed many, many places where double constants are used instead of floats in float calculations - e.g.:
pmid->x = (p1->x + p2->x) * [b]0.5[/b];
for which gcc generated the following PPC asm:
...
[b]lfd f31,-160(r2)[/b]
lfs f13,0(r25)
lfs f12,0(r24)
lfs f11,4(r24)
lfs f0,4(r25)
fadds f13,f13,f12
fadds f0,f0,f11
[b]fmul f13,f13,f31[/b]
fmul f0,f0,f31
[b]frsp f11,f13[/b]
frsp f12,f0
stfs f11,52(r23)
stfs f12,4(r31)
...
If we fix this:
pmid->x = (p1->x + p2->x) * [b]0.5f[/b];
We get much nicer asm [if it can ever be called nice]:
...
[b]lfs f31,-3192(r2)[/b]
lfs f13,0(r25)
lfs f12,0(r24)
lfs f11,4(r24)
lfs f0,4(r25)
fadds f13,f13,f12
fadds f0,f0,f11
[b]fmuls f11,f13,f31[/b]
fmuls f12,f0,f31
[b]stfs f11,52(r23)[/b]
stfs f12,4(r31)
...
I don't know what other compilers do with this - I'd be curious if someone with VC++ could see what it does.
As I mentioned, this is throughout the codebase and could improve things if it were fixed systematically...
[Edit: Sorry - should have pointed out that the code is doing more than just that one expression - it's actually interleaved the next expression - pmid->y = ...]
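The extra instructions follow from the language rules rather than from anything specific to gcc: an unsuffixed literal such as 0.5 has type double, so the float operand is promoted, the multiply is done in double precision, and the result has to be rounded back to float on assignment (the frsp above). A minimal sketch that checks this at compile time, assuming a C++11 compiler; Point, midXDouble and midXFloat are hypothetical stand-ins for illustration, not TGE names.

```cpp
#include <type_traits>

// Hypothetical stand-in for the point type used in the snippet above.
struct Point { float x, y; };

// float * double literal: the whole product has type double.
static_assert(std::is_same<decltype(1.0f * 0.5), double>::value,
              "0.5 promotes the expression to double");

// float * float literal: the product stays in single precision.
static_assert(std::is_same<decltype(1.0f * 0.5f), float>::value,
              "0.5f keeps the expression in float");

// Double constant: the product is narrowed back to float on return.
inline float midXDouble(const Point& p1, const Point& p2) {
    return static_cast<float>((p1.x + p2.x) * 0.5);
}

// Float constant: no widening or narrowing is involved.
inline float midXFloat(const Point& p1, const Point& p2) {
    return (p1.x + p2.x) * 0.5f;
}
```

Both functions return the same value for ordinary inputs; the difference is only in the type the intermediate product is computed in.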
#2
01/19/2007 (6:20 pm)
Sure.
float foo( float input ) { return( input * [b]0.5[/b] ); }
becomes:
lis r2,ha16(LC0)
lfd f0,lo16(LC0)(r2)
fmul f1,f1,f0
frsp f1,f1
blr
Whereas:
float foo( float input ) { return( input * [b]0.5f[/b] ); }
becomes:
lis r2,ha16(LC0)
lfs f0,lo16(LC0)(r2)
fmuls f1,f1,f0
blr
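One thing worth keeping in mind for a systematic fix: adding the f suffix is value-preserving for 0.5, because 0.5 is a power of two and is represented exactly in both float and double. That is not true of every constant. A small sketch of the distinction, assuming IEEE 754 float and double; the function names are hypothetical and only mirror foo() above.

```cpp
// Same shape as foo() above, one version per constant type.
inline float halveDouble(float input) { return static_cast<float>(input * 0.5); }
inline float halveFloat(float input)  { return input * 0.5f; }

// 0.5 is a power of two, so the float and double constants are equal.
inline bool halfIsExact()  { return 0.5 == static_cast<double>(0.5f); }

// 0.1 has no exact binary representation; 0.1f is a coarser
// approximation than 0.1, so the two constants are not equal.
inline bool tenthIsExact() { return 0.1 == static_cast<double>(0.1f); }
```

So swapping 0.5 for 0.5f should not change results, while swapping something like 0.1 for 0.1f can shift the last bits of a result and may deserve a second look.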
#3
01/19/2007 (8:53 pm)
I'm not sure if this will help you or not, but the MSVC6 compiler generated this for pmid->x = (p1->x + p2->x) * 0.5;
0082F1FA mov edx,dword ptr [ebp-14h]
0082F1FD mov eax,dword ptr [ebp-18h]
0082F200 fld dword ptr [edx]
0082F202 fadd dword ptr [eax]
0082F204 fmul qword ptr [__real@8@3ffe8000000000000000 (00c88010)]
0082F20A mov ecx,dword ptr [ebp-10h]
0082F20D fstp dword ptr [ecx]
and it generated this using a float (0.5f rather than 0.5):
0082C35A mov edx,dword ptr [ebp-14h]
0082C35D mov eax,dword ptr [ebp-18h]
0082C360 fld dword ptr [edx]
0082C362 fadd dword ptr [eax]
0082C364 fmul dword ptr [__real@4@3ffe8000000000000000 (00c789c4)]
0082C36A mov ecx,dword ptr [ebp-10h]
0082C36D fstp dword ptr [ecx]
#4
01/19/2007 (9:19 pm)
Thanks Chris - that's interesting. So the VC++ compiler is giving the same number of instructions for both; only the fmul happens with a different argument. So this change will have no effect on Windows [unless an fmul with a qword is slower than an fmul with a dword - no idea there].
Now I'm curious what gcc gives when it generates x86 asm for this...
#5
01/19/2007 (11:14 pm)
I have next to no assembly experience but the second version should result in moving 4 bytes rather than 8 (__real@4 rather than __real@8, I think). I don't know what kind of performance gain that actually nets but it certainly can't hurt.
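The 4 versus 8 byte guess can be checked directly, since sizeof works on literals. A minimal sketch; the helper names are hypothetical, and the 4 and 8 byte figures are an assumption that holds on the usual x86 and PPC ABIs rather than something the C++ standard guarantees.

```cpp
#include <cstddef>

// Size of the unsuffixed literal, which has type double
// (8 bytes on the usual x86 and PPC ABIs).
inline std::size_t doubleLiteralSize() { return sizeof(0.5); }

// Size of the f-suffixed literal, which has type float
// (4 bytes on the usual x86 and PPC ABIs).
inline std::size_t floatLiteralSize() { return sizeof(0.5f); }
```

This only says how large the constant is in memory; whether the narrower load is measurably faster is a separate question.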
#6
01/20/2007 (5:42 am)
Yes, you are correct - that's what I was trying to get at with my 'unless' comment.
Out of curiosity, do any x86 experts know if fmul qword is slower than fmul dword, or do they take the same number of cycles?
In the end, it wouldn't have the potential impact that it does with gcc for PPC. Most of those frsp [Floating Round to Single-Precision] instructions stall the processor. Also, because it's actually removing these instructions, it would give the optimizer a chance to schedule things differently.