Game Development Community

Crash course on assembly ? - processTriFan

by Orion Elenzil · in Torque Game Engine · 01/13/2009 (10:21 am) · 5 replies

Howdy all -

processTriFan is showing up as a hotspot on our Windows build,
and i notice in the assembly code for it the comment "This could be faster".

i'd love to go and make it faster,
but my assembly is a bit rusty.

i coded my share of assembly on the 68000 (amiga),
but have never looked at modern x86 code.

is there a recommended cheat-sheet somewhere ?

e.g., my guess is that the following code copies 16 bytes from one memory location to another through four registers. what's the significance of putting something in brackets?

    mov     eax, [esi + 0]              ; x
    mov     ebx, [esi + 4]              ; y
    mov     ecx, [esi + 8]              ; z
    mov     edx, [esi + 12]             ; f
    mov     [edi + 0],  eax             ; <- x
    mov     [edi + 4],  ebx             ; <- y
    mov     [edi + 8],  ecx             ; <- z
    mov     [edi + 12], edx             ; <- f

has anyone looked at caching the results of processTriFan ?
it looks like it gets called every frame.
is it responsible for fogging on Interiors ?

many tia,
ooo

#1
01/13/2009 (10:32 am)
The brackets mean the operand is memory, not the register itself: [esi + 4] is the value stored at the address esi + 4, rather than the contents of esi.
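For anyone more comfortable in C, here is a rough analogue of the eight mov instructions from the original post. It is only a sketch: the function name copy_xyzf is made up for illustration, and the variables are named after the registers purely to show the correspondence. The register acts like a byte pointer, and the brackets act like a dereference.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical C analogue of the mov sequence (not engine code).
   esi and edi act as byte pointers; "[reg + n]" reads or writes
   the 32-bit value stored n bytes past the address in reg. */
static void copy_xyzf(const unsigned char *esi, unsigned char *edi)
{
    uint32_t eax, ebx, ecx, edx;
    assert(esi != NULL && edi != NULL);

    memcpy(&eax, esi + 0,  4);   /* mov eax, [esi + 0]   ; x */
    memcpy(&ebx, esi + 4,  4);   /* mov ebx, [esi + 4]   ; y */
    memcpy(&ecx, esi + 8,  4);   /* mov ecx, [esi + 8]   ; z */
    memcpy(&edx, esi + 12, 4);   /* mov edx, [esi + 12]  ; f */

    memcpy(edi + 0,  &eax, 4);   /* mov [edi + 0],  eax  ; <- x */
    memcpy(edi + 4,  &ebx, 4);   /* mov [edi + 4],  ebx  ; <- y */
    memcpy(edi + 8,  &ecx, 4);   /* mov [edi + 8],  ecx  ; <- z */
    memcpy(edi + 12, &edx, 4);   /* mov [edi + 12], edx  ; <- f */
}
```

Without the brackets, "mov eax, esi" would copy the address held in esi, which in C terms is assigning the pointer rather than the value it points to.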

edit:

That looks like it could be replaced by a rep movsd, but I'm not sure that would give you much benefit.
If it gets copied a lot, perhaps replacing it with SSE (or even MMX) would be a good idea.

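My SSE is rusty, but that would be something like this (movdqa is an SSE2 instruction and needs both addresses to be 16-byte aligned; movdqu or movups handle unaligned data):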

movdqa xmm0, [esi]
movdqa [edi], xmm0
#2
01/13/2009 (11:13 am)
Many thanks, Jaimi!

the following compiles and has identical output:

    movups  xmm0, [esi]
    movups  [edi], xmm0

w00t!

i'll be going through the rest of that routine and seeing what i can do.
#3
01/13/2009 (12:43 pm)
Sys64738

;-)
#4
01/13/2009 (12:51 pm)
I'm halting this for the moment,
but if anyone's interested, here are a couple small optimizations to processTriFan:

part 1,
let's get rid of the "*4" in "lea esi, [esi + ebp*4]":

just after the lines:
    mov     [srcIndices], eax
    mov     eax, in_numpoints
add:
    shl     eax, 2                 ; numPoints *= 4

change this line:
    lea     esi, [esi + ebp*4]
to this:
    lea     esi, [esi + ebp]

at the bottom of the routine, change this:
    inc     ebp
to this:
    add     ebp, 4

.. i'm not sure how much of an optimization that really is, but who knows.
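In C terms, part 1 swaps an element index that is scaled on every access for a byte offset that advances by 4. The sketch below is hypothetical: the full routine isn't shown in this thread, so the function names, the uint32_t element type, and the summing loop body are stand-ins chosen just to make the two loop shapes comparable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Before: keep an element index and scale it by 4 on every access,
   like "lea esi, [esi + ebp*4]" followed by "inc ebp". */
static uint32_t sum_with_scaled_index(const uint32_t *srcIndices, size_t numPoints)
{
    const unsigned char *base = (const unsigned char *)srcIndices;
    uint32_t sum = 0;
    assert(srcIndices != NULL);
    for (size_t i = 0; i < numPoints; ++i)
        sum += *(const uint32_t *)(base + i * 4);
    return sum;
}

/* After: keep a byte offset instead, like "lea esi, [esi + ebp]"
   followed by "add ebp, 4". The loop bound is scaled once up front,
   which is the job of "shl eax, 2". */
static uint32_t sum_with_byte_offset(const uint32_t *srcIndices, size_t numPoints)
{
    const unsigned char *base = (const unsigned char *)srcIndices;
    size_t byteLimit = numPoints * 4;          /* shl eax, 2 */
    uint32_t sum = 0;
    assert(srcIndices != NULL);
    for (size_t off = 0; off < byteLimit; off += 4)
        sum += *(const uint32_t *)(base + off);
    return sum;
}
```

Both loops visit the same elements. Scaled-index addressing is normally handled by the address-generation hardware on x86, so the gain from this change may be small; profiling before and after is the only way to know.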

part 2,
convert the copying of memory to SSE:

change these lines:
    mov     eax, [esi + 0]              ; x
    mov     ebx, [esi + 4]              ; y
    mov     ecx, [esi + 8]              ; z
    mov     edx, [esi + 12]             ; f
    mov     [edi + 0],  eax             ; <- x
    mov     [edi + 4],  ebx             ; <- y
    mov     [edi + 8],  ecx             ; <- z
    mov     [edi + 12], edx             ; <- f
to these lines:
    movups  xmm0, [esi]                 ; copy xyzf
    movups  [edi], xmm0
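
The same 16-byte copy can also be written in C with compiler intrinsics instead of hand-written assembly. This is a sketch under assumptions: copy_xyzf_sse and the HAVE_SSE macro are names invented here, and the memcpy fallback only exists so the snippet still builds on a target without SSE.

```c
#include <assert.h>
#include <string.h>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

/* Hypothetical helper: copy one x/y/z/f record (four floats). */
static void copy_xyzf_sse(const float *src, float *dst)
{
    assert(src != NULL && dst != NULL);
#if HAVE_SSE
    __m128 v = _mm_loadu_ps(src);   /* movups xmm0, [esi] */
    _mm_storeu_ps(dst, v);          /* movups [edi], xmm0 */
#else
    memcpy(dst, src, 4 * sizeof(float));  /* portable fallback */
#endif
}
```

_mm_loadu_ps and _mm_storeu_ps are the unaligned forms and match movups. If both pointers are known to be 16-byte aligned, _mm_load_ps and _mm_store_ps (movaps) could be used instead, but they fault on unaligned addresses.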

part 3,
convert the math to SSE.

.. this is where it's going to take me a long time to learn, so i'm moving on to other stuff.
but in case anyone is reading this and feels like taking it on,
here's the C:
    dst->texCoord.x = (texGen0[0]*x)
                    + (texGen0[1]*y)
                    + (texGen0[2]*z)
                    + (texGen0[3]  );
and here's the regular x86 assembly which might benefit from SSE-ification:
    ; tc0.s
    fld     dword [_texGen0 + 0]   ; tg0.s.x
    fmul    dword [esi + 0]
    fld     dword [_texGen0 + 4]   ; tg0.s.y
    fmul    dword [esi + 4]
    fld     dword [_texGen0 + 8]   ; tg0.s.z
    fmul    dword [esi + 8]
    fld     dword [_texGen0 + 12]  ; tg0.s.w
    
    faddp   st3, st0
    faddp   st1, st0
    faddp   st1, st0
    fstp    dword [edi + 16]    ; tc0.s
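
As a possible starting point for part 3, here is one way the texture-coordinate expression might look with SSE intrinsics, next to a scalar reference. It is a sketch only: texgen_scalar, texgen_sse and HAVE_SSE are names made up for illustration, and it handles a single coordinate rather than the whole routine.

```c
#include <assert.h>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

/* Scalar reference, same expression as the C shown above. */
static float texgen_scalar(const float tg[4], float x, float y, float z)
{
    return tg[0]*x + tg[1]*y + tg[2]*z + tg[3];
}

/* Hypothetical SSE version: one 4-wide multiply, then a horizontal sum. */
static float texgen_sse(const float tg[4], float x, float y, float z)
{
    assert(tg != NULL);
#if HAVE_SSE
    __m128 t = _mm_loadu_ps(tg);               /* tg0 tg1 tg2 tg3 */
    __m128 p = _mm_set_ps(1.0f, z, y, x);      /* lanes: x y z 1  */
    __m128 prod = _mm_mul_ps(t, p);            /* four products   */

    /* Swap neighbours and add: lanes 0,1 = p0+p1, lanes 2,3 = p2+p3. */
    __m128 shuf = _mm_shuffle_ps(prod, prod, _MM_SHUFFLE(2, 3, 0, 1));
    __m128 sums = _mm_add_ps(prod, shuf);

    /* Bring the upper pair down to lane 0 and add it to the lower pair. */
    shuf = _mm_movehl_ps(shuf, sums);
    sums = _mm_add_ss(sums, shuf);
    return _mm_cvtss_f32(sums);
#else
    return texgen_scalar(tg, x, y, z);         /* portable fallback */
#endif
}
```

The SSE version adds the products in a different order than both the scalar C and the x87 code, so results can differ in the last bit for some inputs. The fourth lane is set to 1.0 so that tg[3] passes through the multiply unchanged.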
#5
01/14/2009 (1:11 pm)
test comment, pls ignore.