Enabling ASM on OSX Intel

by Gary "ChunkyKs" Briggs · in Torque Game Engine · 01/05/2007 (12:54 pm) · 19 replies

Beware: This technique will probably cause you problems with ppc, since it adds x86 asm into previously-universal builds.

Thanks mostly to discussions in this thread, specifically Paul and Ben showing the way and substantial input from Andy, here's an XCode project and a diff.

torque-with-asm.xcodeproj.tar.gz
enable-asm.diff.gz

Uh. Hope it's of use, or interest, to someone. If you wanna do this from the command line on a fresh copy of TGE1.5:
1) Extract your TGE1.5 SDK. The one I downloaded is called TorqueSDKMac_150.zip
2) Assuming you extracted it in your home dir:
cd "Torque Game Engine 1.5 SDK/Torque SDK/"
curl 'http://icculus.org/~chunky/stuff/osxasm/patches/enable-asm.diff.gz' | gzip -cd | patch -p0
cd xcode
curl 'http://icculus.org/~chunky/stuff/osxasm/patches/torque-with-asm.xcodeproj.tar.gz' | gzip -cd | tar -xvf-
open 'torque-with-asm.xcodeproj'

Gary (-;

#1
01/05/2007 (3:26 pm)
Holy crap! This is great! I need to read the engine forum more. Gary, recently you've been on fire. Way to go!

Are there obvious issues with the textures being in the wrong format?
#2
01/05/2007 (3:47 pm)
No issues that I've seen. Perhaps they manifest in strange scenarios, or only if you don't have the other asm stuff enabled, or perhaps paul's on crack or something :-)

Gary (-;
#3
01/09/2007 (12:10 pm)
Heh. So then I poke about with shark a little more, and find that interiors processTriFan is a big cpu chomper on stronghold [see here]. Lo and behold, there's an assembly version.

Open up XCode, go into your interiors dir, "Add Existing File", add itfdump.asm. Rightclick it, get info, set type to sourcecode.nasm. Remove itfdump_c.cc from the target [no need to delete it, just uncheck it], and thar goes that bottleneck.

Gary (-;
#4
01/09/2007 (7:19 pm)
Wow...those are two intense bottlenecks...I'll be eager to check this out. Thanks!
#5
01/12/2007 (9:25 pm)
Did Paul give this his blessing yet? I remember he was saying there was issue with the output texture format from the assembly code.
#6
01/18/2007 (12:16 pm)
Oh, and HUGE bug [it doesn't manifest anywhere you'd see if you're trying this, but it's huge]. In CPUInfo, I used this:

unsigned int twoints[2];
size_t twoints_s = sizeof(twoints);

That should be a long long, not two ints. So the code would be:
long long sysctlll;
size_t sysctlll_s = sizeof(sysctlll);

if(0 != sysctlbyname("hw.memsize", (void *)&sysctlll, &sysctlll_s, NULL, 0)) {
   // lookup failed, nothing to print
} else {
   Con::printf("System memory size: %iMB", (unsigned int)(sysctlll/(1024*1024)));
}

And a little later:

if(0 != sysctlbyname("hw.cpufrequency", (void *)&sysctlll, &sysctlll_s, NULL, 0)) {
   Platform::SystemInfo.processor.mhz = 1000;
   Con::errorf("Couldn't detect CPU Frequency, assuming 1GHz");
} else {
   Platform::SystemInfo.processor.mhz = (unsigned int)(sysctlll/(1000*1000));
   Con::printf("CPU Frequency: %2.3fGHz", (float)sysctlll/(1000*1000*1000));
}
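
To see why the width matters, here is a minimal standalone sketch of the two conversions used above. It makes no sysctl call, so it compiles anywhere. The helper names (bytesToMB, hzToMHz, low32) are made up for illustration and are not Torque API, and the 4GB and 2.16GHz figures are example inputs, not measurements.

```cpp
#include <cstdint>

// Hypothetical helpers, not Torque API.
// Convert a raw byte count (the kind of value hw.memsize yields) to whole megabytes.
constexpr std::uint32_t bytesToMB(std::uint64_t bytes) {
    return static_cast<std::uint32_t>(bytes / (1024ULL * 1024ULL));
}

// Convert a raw frequency in Hz (the kind of value hw.cpufrequency yields) to whole MHz.
constexpr std::uint32_t hzToMHz(std::uint64_t hz) {
    return static_cast<std::uint32_t>(hz / (1000ULL * 1000ULL));
}

// Keep only the low 32 bits, i.e. all that a single 32-bit int can hold.
constexpr std::uint32_t low32(std::uint64_t v) {
    return static_cast<std::uint32_t>(v & 0xFFFFFFFFULL);
}

// 4GB of RAM needs 33 bits: the full value converts fine, the low half reads as zero.
static_assert(bytesToMB(4ULL << 30) == 4096, "full 64-bit value");
static_assert(bytesToMB(low32(4ULL << 30)) == 0, "low 32 bits only");
static_assert(hzToMHz(2160000000ULL) == 2160, "2.16GHz expressed in MHz");
```

The checks run at compile time, so a successful build is the test. The second assert is the failure mode to watch for: a machine with 4GB would report 0MB if only 32 bits of the value were looked at.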

Three cheers for hungarian notation!

Gary (-;
#7
01/18/2007 (2:22 pm)
Here I am, eating my words.

Was talking with Paul earlier, and he pointed out that torque, after slapping the blender asm in, is spending a lot of time in glgProcessPixels. Paul's original comment was:
Quote:As stated on some other threads, the intel mac version of torque doesn't use the mmx blender because that blender outputs the wrong texture format, and I've yet to untangle the asm code to find out how it works. Previous attempts were unfruitful
And my comment earlier in this thread:
Quote:or perhaps paul's on crack or something :-)

A short amount of reading and you find this Apple page, which speaks of glgProcessPixels:

Quote:Costly data conversions. If you notice the glgProcessPixels call (in the libGLImage.dylib library) showing up in the analysis, it's an indication that the driver is not handling a texture upload optimally. The call is used when your application makes a glTexImage or glTexSubImage call using data that is in a nonnative format for the driver, which means the data must be converted before the driver can upload it. You can improve performance by changing your data so that it is in a native format for the driver. See "Use Optimal Data Types and Formats".

Note: If your data needs only to be swizzled, glgProcessPixels performs the swizzling reasonably fast although not as fast as if the data didn't need swizzling. But non-native data formats are converted one byte at a time and will incur a performance cost that is best to avoid.

All credit for this goes to Paul, not me, I'm just posting some text for anyone interested. :-)

In short: Sure, this blend-in-asm thing makes torque go a lot faster, but that's nothing on how fast it *should* be going, because, as Paul said, "the blender outputs the wrong texture format"

Reading this page, the implication is that windows performance would improve if the asm blender output a different format, too...

Gary (-;
#8
01/18/2007 (2:27 pm)
Interesting stuff...I don't understand most of it, but interesting nonetheless.

I'd put a nice little bounty on this solution if there was a way....
#9
01/18/2007 (3:01 pm)
Mmmh... I may be doing something wrong, because I do not see any boost from this new version.
Gary, could you put a compiled version on the mighty Icculus server?
#10
01/19/2007 (6:46 pm)
@All:
I'd like to thank Gary (ChunkyKs) for the code he posted. It bumped this bug up to the top of my task list, even though it did not fully solve the problem, because it was a great head start.

So, I've got the terrain blender problem fixed, for intel macs. I still have yet to make everything backward compatible to other macs. The good news is, there's a huge speed boost coming for TGE on Intel Macs.

Here's a quick summary:
$pref::terrain::textureCacheSize needs to be bumped up to 512, in the prefs.cs files.
This will make the blender cache 512 textures. At 128x128 x 2 bytes per texel, this uses 16MB of texture memory on the graphics card for textures. New low-end Macs have 64MB of texture memory. Slightly older low end cards have 32MB. So, the cost is justified, and the sourcecode side of Torque sets this limit to 512... but it gets overwritten by old prefs.cs scripts. Gotta change 'em.
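
The 16MB figure works out as follows. This is a minimal sketch of the arithmetic only, assuming square textures and ignoring mip levels or any driver overhead. The function name blendCacheBytes is made up for illustration and is not part of Torque.

```cpp
#include <cstdint>

// Hypothetical helper, not Torque API: bytes needed for a cache of square textures.
constexpr std::uint64_t blendCacheBytes(std::uint64_t textures,
                                        std::uint64_t edgeTexels,
                                        std::uint64_t bytesPerTexel) {
    return textures * edgeTexels * edgeTexels * bytesPerTexel;
}

// One 128x128 texture at 2 bytes per texel is 32KB.
static_assert(blendCacheBytes(1, 128, 2) == 32 * 1024, "one texture is 32KB");

// 512 of them come to 16MB, a quarter of a 64MB card or half of a 32MB one.
static_assert(blendCacheBytes(512, 128, 2) == 16 * 1024 * 1024, "cache is 16MB");
```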

Blend maps and white maps must be swizzled into GL_BGRA_EXT / GL_UNSIGNED_SHORT_1_5_5_5_REV format
As documented by Apple, and as I've mentioned, 16 bit textures need to be in a particular format to take advantage of fast texture upload. Here's a non-optimized code snippet that fixes the blend maps once the blender is done creating them.
static void fixcolors(GBitmap* bmp)
{
   U32 _bpp = bmp->bytesPerPixel;
   for(int miplevel = 0; miplevel < bmp->getNumMipLevels(); miplevel++)
   {
      U16* pixels = (U16*)bmp->getWritableBits(miplevel);
      U32 numpixels = bmp->getWidth(miplevel) * bmp->getHeight(miplevel);
      
      for( int i = 0; i < numpixels; i++ )
      {
         register const U16 c = *(pixels + i);
//         from:
//         *sourceFormat = GL_RGBA;
//         *byteFormat   = GL_UNSIGNED_SHORT_5_5_5_1;
//         to:
//         *sourceFormat = GL_BGRA_EXT;
//         *byteFormat   = GL_UNSIGNED_SHORT_1_5_5_5_REV;

         //static U16 __color = 0xF800;
         // in:  rrrrr ggggg bbbbb a
         // out: a rrrrr ggggg bbbbb
         register U16 red   = ( c & 0xf800 ) >> 11;
         register U16 green = ( c & 0x07C0 ) >> 6;
         register U16 blue  = ( c & 0x003e ) >> 1;
         register U16 alpha = ( c & 0x0001 );
         
         *(pixels + i) = alpha << 15 | red << 10 | green << 5 | blue;
      }
   }
}

void TerrainRender::buildBlendMap(AllocatedTexture *tex)
{
   // snip...
   mCurrentBlock->mBlender->blend(x, y, level, (const U16*)lightmap->getBits(), mips);
   fixcolors(bmp);   // the added call
   // snip...
}
I may or may not actually spend time optimizing this, as it hasn't shown up significantly on any time profiles yet.
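
For anyone who wants to check the bit shuffling in isolation, here is a standalone sketch of the per-pixel step from fixcolors(), with no GBitmap dependency. The function name swizzle5551To1555Rev is made up for illustration. The output order (alpha, red, green, blue from the top bit down) is read straight off the shifts in the code above.

```cpp
#include <cstdint>

// Standalone version of the per-pixel step in fixcolors(); hypothetical name.
// Input  (GL_RGBA / GL_UNSIGNED_SHORT_5_5_5_1):         rrrrr ggggg bbbbb a
// Output (GL_BGRA_EXT / GL_UNSIGNED_SHORT_1_5_5_5_REV): a rrrrr ggggg bbbbb
constexpr std::uint16_t swizzle5551To1555Rev(std::uint16_t c) {
    return static_cast<std::uint16_t>(
        ((c & 0x0001u) << 15) |          // alpha: bit 0      -> bit 15
        (((c & 0xF800u) >> 11) << 10) |  // red:   bits 15-11 -> bits 14-10
        (((c & 0x07C0u) >> 6) << 5) |    // green: bits 10-6  -> bits 9-5
        ((c & 0x003Eu) >> 1));           // blue:  bits 5-1   -> bits 4-0
}

// One full-intensity channel at a time, then everything at once.
static_assert(swizzle5551To1555Rev(0xF800) == 0x7C00, "pure red");
static_assert(swizzle5551To1555Rev(0x07C0) == 0x03E0, "pure green");
static_assert(swizzle5551To1555Rev(0x003E) == 0x001F, "pure blue");
static_assert(swizzle5551To1555Rev(0x0001) == 0x8000, "alpha only");
static_assert(swizzle5551To1555Rev(0xFFFF) == 0xFFFF, "opaque white is unchanged");
```

Each colour channel moves down by one bit and alpha jumps from the bottom bit to the top, so the swizzle only pays off if the upload is also told about the new layout.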

This code swizzles the white maps:
In terrainRender.cc, in TerrainRender::buildBlendMap(), near line 2580:
#if defined(TORQUE_OS_MAC)
			   // a bbbbb ggggg rrrrr
			   lmbits[i] = (alpha << 15) | (blue << 10) | (green << 5) | red;
#else
			   // bbbbb ggggg rrrrr a
			   lmbits[i] = (blue << 11) | (green << 6) | (red << 1) | alpha;
#endif
The only change here from the current code is on the first line, changing TORQUE_BIG_ENDIAN to TORQUE_OS_MAC.

And this code will tell the texture manager to use the faster texture format when it uploads the texture:
In gTexManager.cc, in getSourceDestByteFormat(), near line 655:
#if defined(TORQUE_OS_MAC)
         *sourceFormat = GL_BGRA_EXT;
         *byteFormat   = GL_UNSIGNED_SHORT_1_5_5_5_REV;
#else
         *sourceFormat = GL_RGBA;
         *byteFormat   = GL_UNSIGNED_SHORT_5_5_5_1;
#endif
The change here is, once again, from TORQUE_BIG_ENDIAN to TORQUE_OS_MAC.


So, this will go out in an update to 1.5 as soon as I can get everything synced up & compatible.

Share and Enjoy.

/Paul

Edit: added the bit in gTexManager.cc -- sorry for the confusion.
#11
01/19/2007 (6:51 pm)
Oh sweet mama, that's music to my ears. I'm all over this like a bad dress on a bridesmaid.

Thanks, Paul! Can't wait to play around with it.

Edit: Thanks to Gary too, of course!
#12
01/22/2007 (11:47 am)
Awesome beans, Paul.

Only problem is, when I made these changes, I end up with this: icculus.org/~chunky/stuff/osxasm/likeacidbutnot.png

My first idea is that this is because, after making the above changes, the textures are being swizzled appropriately, but are still being presented to the graphics driver as GL_RGBA instead of GL_BGRA_EXT... except that doesn't entirely make sense, since glgProcessPixels is no longer appearing in my shark profile.
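
As a rough illustration of what that first guess would mean for a single texel, here is a sketch that decodes the same 16 bits under both layouts. It only models the bit layouts discussed in this thread and says nothing about what the driver or the rest of Torque actually does. The struct and function names are made up for illustration.

```cpp
#include <cstdint>

// Hypothetical helper type: 5-bit colour channels plus 1-bit alpha.
struct Channels { unsigned r, g, b, a; };

// Read a texel as GL_RGBA / GL_UNSIGNED_SHORT_5_5_5_1 (rrrrr ggggg bbbbb a).
constexpr Channels decodeAs5551(std::uint16_t c) {
    return Channels{ (c >> 11) & 0x1Fu, (c >> 6) & 0x1Fu, (c >> 1) & 0x1Fu, c & 0x1u };
}

// Read a texel as GL_BGRA_EXT / GL_UNSIGNED_SHORT_1_5_5_5_REV (a rrrrr ggggg bbbbb).
constexpr Channels decodeAs1555Rev(std::uint16_t c) {
    return Channels{ (c >> 10) & 0x1Fu, (c >> 5) & 0x1Fu, c & 0x1Fu, (c >> 15) & 0x1u };
}

// Opaque pure red is 0xF801 in the old layout and 0xFC00 after the swizzle.
// Read with the matching layout, the swizzled value is still opaque pure red.
static_assert(decodeAs1555Rev(0xFC00).r == 31, "red survives");
static_assert(decodeAs1555Rev(0xFC00).g == 0, "no green");
static_assert(decodeAs1555Rev(0xFC00).a == 1, "still opaque");

// Read with the old layout, the same bits pick up green and lose their alpha.
static_assert(decodeAs5551(0xFC00).r == 31, "red looks intact");
static_assert(decodeAs5551(0xFC00).g == 16, "spurious green");
static_assert(decodeAs5551(0xFC00).a == 0, "alpha bit lost");
```

So a swizzled texture uploaded under the old format would come out with shifted colours and broken alpha, which is the kind of damage a format mismatch produces.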

My graphics knowledge here is pretty heavily limited, but I looked in gBitmap.h and there's this enum:

enum BitmapFormat {
      Palettized = 0,
      Intensity  = 1,
      RGB        = 2,
      RGBA       = 3,
      Alpha      = 4,
      RGB565     = 5,
      RGB5551    = 6,
      Luminance  = 7
   };

Which doesn't have a mention of the format that Paul's code is swizzling to. So, uh, mostly I'm just confused?

On the other hand, I'm seeing even less stuttering than before :-)

Oh, and:
Quote:$pref::terrain::textureCacheSize needs to be bumped up to 512, in the prefs.cs files.

Probably better to open up terrRender.cc starting around line 82, and change the relevant section thus:
#if defined(TORQUE_OS_LINUX) || defined(TORQUE_OS_OPENBSD)
// Texture slop isn't necessary on Linux
U32 TerrainRender::mTextureSlopSize = 512;
#elif defined(TORQUE_OS_MAC_OSX)
// No reason not to increase this significantly on macs that support torque
U32 TerrainRender::mTextureSlopSize = 512;
#else
U32 TerrainRender::mTextureSlopSize = 220;
#endif

Honestly, I suspect it might be time to get rid of that whole define and just set it to 512 by default for all systems?

Gary (-;
#13
01/22/2007 (12:10 pm)
Ah. Right. Forgot a couple of things here:

1) Upping the hardcoded value to 512 is fine. Be sure to change the prefs.cs file too, or your hardcoded value won't take effect.
2) I forgot to post the code snippet from the texture manager, that alters the actual texture format we use to upload the terrain textures. Edited the above post to include it. Thanks for catching that, Gary!

/Paul
#14
01/22/2007 (12:13 pm)
@Gary:

I've noticed the same thing. I haven't gotten around yet to posting it, but that's pretty much exactly the way it looks for me, too.

I'm also noticing considerably less stuttering, but that seems to be from setting the textureCacheSize to 512. I did that for both my PPC and Intel builds and noticed less stuttering, but that's the case even with the build that does not incorporate these changes.

Edit: Great, thanks Paul. I'll give it a shot.
#15
01/22/2007 (12:40 pm)
Heh. It's still suffering minor sniglets, /a la/ groundbubbles.png, but getting there! [yes, I tried relighting. No, that wasn't the problem]

Gary (-;
#16
01/23/2007 (7:05 pm)
Another screenie that shows the same problem more clearly is here:
purpleground.png

Grepping the code, there's a lot of TORQUE_BIG_ENDIAN defines around. I suspect one or more of them may be to blame, but I'm never really clear on which changes are actually appropriate for OSX, or which are appropriate for PPC, or... or... or...

Gary (-;
#17
04/23/2007 (9:52 am)
Does anyone know if these fixes were rolled into 1.5.1?
#18
04/23/2007 (1:41 pm)
They were. My name's even there :-)

*glee*

Gary (-;
#19
04/23/2007 (1:52 pm)
Ah excellent :)