Game Development Community

Multi-processor parallel programming in TGE

by Duncan Gray · in Torque Game Engine · 05/10/2007 (12:17 am) · 83 replies

I just tried running 4 threads on a single CPU and I did not get the above predicted clash in the assembly code.
In fact I get no problems at all.

Although I only adapted the updateSkin method for this test, this multiprocess approach can also be used to increase performance in collision, physics, animation calculations, AI, etc.

If you want to try the demo, please post here or email me.
#41
05/24/2007 (9:37 am)
Hi Duncan, thanks for the reply, I understand. Was just surprised at exactly how much slower it was on my desktop machine.

thanks for sharing the code!
#42
05/24/2007 (8:14 pm)
@Clint This link has source code and an executable for detecting how many CPUs you have and what type they are.

@Sebastian,
Quote:It works fine with the Intel Core Duo laptops with AMD 1400 mobilities

Does that mean you got a performance increase?

I don't know what the cause of your performance loss is unless those cores also use hyperthreading.
But it's good to get feedback so that we know where to expect problems.

It would be nice if ALL the many other people who have downloaded the binary and source would give some feedback...
#43
05/25/2007 (2:55 am)
First things first... Duncan, you are a champion! :D

My system: Core2Duo 2.4GHz E6600, 2GB DDR2 RAM, 8800GTX - Windows Vista

Primer - I always find that TGE runs pretty poorly on my machine (~80fps on starter.fps, 200ish when staring at the ground; silly environment generators), but TGEA runs exceptionally well (around 290fps), due to my DX10 hardware and operating system :)

--------------------

Test #1: vanilla TGE1.5.2 - starter.fps; OpenGL, 1280x720 resolution [windowed]

Single thread: ~52% on single core, ~6% on secondary
Multi threaded: ~47% on single core, ~8% on secondary

Frame rate pretty steady on either version. Also tried D3D but no change in fps/CPU usage.

--------------------

Test #2: vanilla TGE1.5.2 - starter.fps; OpenGL, 1280x720 resolution [windowed]
+ with L3DT creating an 8km x 8km texture, Constructor open, WMP playing music

Single thread: ~100% on single core, ~98% on secondary
Multi threaded: ~100% on single core, ~98% on secondary

Frame rate pretty steady on either version (granted it dropped to about 45fps compared to the first test). Also tried D3D but no change in fps/CPU usage.
#44
05/25/2007 (5:00 am)
Thanks Gavin. Yeah, the multi-thread part was only added to the skinmesh and wheeled vehicle sections, so there won't be much visible difference unless you have a lot of player characters on screen, as in the pictures Orion posted. Or perhaps drop a few vehicles into the starter.racing mission.

It would be nice if someone who knew the collision and terrain code well enough could data-parallelize (is that a word?) those sections as well.
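For anyone wondering what data-parallelizing a section would involve, here is a minimal sketch in standard C++11 rather than TGE's own threading classes. The name parallelFor and its parameters are hypothetical and not part of the resource; the idea is only to cut an index range into contiguous slices and hand one slice to each thread, which is safe as long as the slices do not touch each other's data.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper, not TGE code: calls work(begin, end) on contiguous
// slices of [0, count), one slice per thread, then waits for all of them.
void parallelFor(std::size_t count, unsigned threadCount,
                 const std::function<void(std::size_t, std::size_t)>& work)
{
    assert(threadCount > 0);

    // Slice size, rounded up so the last slice picks up any remainder.
    const std::size_t chunk = (count + threadCount - 1) / threadCount;

    std::vector<std::thread> workers;
    for (std::size_t begin = 0; begin < count; begin += chunk) {
        const std::size_t end = std::min(begin + chunk, count);
        workers.emplace_back(work, begin, end);
    }

    // Nothing may read the results until every slice has finished.
    for (std::thread& t : workers) {
        t.join();
    }
}
```

A per-vertex or per-object loop fits this shape when each iteration writes only to its own element. Creating threads on every call is expensive, so a real implementation would more likely keep a pool of threads alive and feed them slices.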

I hope to get to it at some point, but I've only got so much time in a day and so much to do.

@Orion or Clint, can you make available a test mission similar to your pictures?
#45
05/25/2007 (6:06 am)
Duncan, I believe the Pentium D is the dual-core Pentium 4 and does not actually use hyperthreading; the AMD machine for sure does not. It's just a "standard" dual CPU rig with dual-core Athlons in it. Actually, I think he's using Opterons. The artist who owns that particular machine has too much disposable cash. The common component to both machines is the nvidia 7900 cards. It'd be nice if other people with those cards provided feedback :)

I actually haven't noticed an increase yet, but that's because the test scenes being used don't use wheeled vehicles or skinmeshes yet. Although the lack of a performance decrease was more important in this case : )
#46
05/25/2007 (11:06 am)
Duncan -
sure. the mission file which generated the screenshot below is here.

you might want to flatten the terrain and remove some buildings tho.

there's a bit of magic in the bottom which does the work:
new SimGroup(NPCGroup) {
   };
};
//--- OBJECT WRITE END ---


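// Fills this SimGroup with a grid of AIPlayers: %iNum columns along x and
// %jNum rows along y, starting at %corner and spaced by %spacing ("dx dy").
function SimGroup::FillWithAIPlayers(%this, %jNum, %iNum, %corner, %spacing)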
{
   %dx     = getWord(%spacing, 0);
   %dy     = getWord(%spacing, 1);
   %n      = 0;
   
   for (%i = 0; %i < %iNum; %i++)
   {
      for (%j = 0; %j < %jNum; %j++)
      {
         %pos = (%dx * %i) SPC (%dy * %j) SPC "0";
         %pos = vectorAdd(%pos, %corner);
         %npc = new AIPlayer() {
            position  = %pos;
            dataBlock = DemoPlayer;
         };
//       looks better w/o names on these guys..
//       %npc.setShapeName("kork" SPC %n);
         %this.add(%npc);
         %n++;
      }
   }
}

npcgroup.FillWithAIPlayers(16, 16, "240 60 400", "3 3");

elenzil.com/gg/images/lotsakorks.jpg
#47
05/25/2007 (3:05 pm)
Thanks Orion, it did not occur to me to add script to a mission file, and it's a really good idea. Thanks for making me smarter.

I updated it to work with default 1.5.2

Let's use that as a benchmark. I'm going to look at adding a timed test to the resource as described to Clint earlier.

I'm also going to make a benchmark mission for starter.racing

See my lousy single cpu frame-rate
www.ultrabizz.com/biz/fps1.jpg
#48
05/25/2007 (4:46 pm)
Hmm just reran with the lots-a-kork mission; still the same results - actually noticed a ~1 fps drop when switching to Threaded. CPU usage still around ~52% / 8%. FPS at ~5.

My guess would be that the CPU can simply handle this many calculations operating in single core mode; so splitting off to multiple threads just adds overhead (hence the fps drop when in threaded).

Would be interesting to see what sort of improvements can be gained in TGEA - not sure how many threads/if they are running at the moment...
#49
05/25/2007 (4:57 pm)
Quote:
My guess would be that the CPU can simply handle this many calculations operating in single core mode

hmm - well try editing the mission file and changing it from 16x16 korks to say 2x2.

if you see a framerate improvement, then there's an argument for optimization..
#50
05/25/2007 (5:33 pm)
Yep, can definitely see a bigger jump with fewer models...

In threaded, fps around 55; in single, it jumps up to ~65!

www.irombu.net/lotsakork_threaded.jpg
www.irombu.net/lotsakork_single.jpg
#51
05/25/2007 (6:43 pm)
I'm confused as to why Gavin's second core did not take more of the load, as compared with Lee's test results.

The vehicle test should be more intensive; here are both mission files.

starter.fps benchmark mission file TGE 1.5.2
starter.racing benchmark mission file TGE 1.5.2

www.ultrabizz.com/biz/cars.jpg
#52
05/25/2007 (7:09 pm)
Tried with starter.racing - even pushed the vehicle grid to 7x7... still the same as before: about 4-5fps in Threaded, 6-7 in Single.

I just ported it over to TGEA (1.01)

Now that's a lot of Space Korks...
www.irombu.net/lotsakork_tgea.jpg


Threaded...
www.irombu.net/lotsakork_tgea_threaded.jpg


Single (on CPU usage graph, just after the spike; spike from alt tabbing into photoshop)
www.irombu.net/lotsakork_tgea_single.jpg

I tried disabling the thread affinity masking, but it doesn't change anything in terms of performance... now I'm really confused...
#53
05/25/2007 (7:40 pm)
Some profiler dumps in TGEA [5s each] - definitely a lot of semaphore handling time...

www.irombu.net/profilerDump_Single.txt
www.irombu.net/profilerDump_Threaded.txt
#54
05/25/2007 (9:15 pm)
I'm not seeing any changes between the 2... still getting about 8-9 in either mode using that mission file with lots of korks. This is with them all in view, centered. Personally (and I'm no pro), on Windows I don't think SetThreadAffinityMask is the way to go... A quote from MSDN...

"Setting thread affinity should generally be avoided, because it can interfere with the scheduler's ability to schedule threads effectively across processors. This can decrease the performance gains produced by parallel processing. An appropriate use of thread affinity is testing each processor." ie coinit
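A minimal sketch of the alternative that quote points toward: size the worker pool from the reported hardware thread count and leave placement to the OS scheduler instead of pinning threads. It uses standard C++11; defaultWorkerCount is a hypothetical name and nothing here comes from the resource itself.

```cpp
#include <thread>

// Hypothetical helper, not TGE code: choose how many worker threads to
// create without setting any affinity mask. The scheduler decides which
// core each thread runs on.
unsigned defaultWorkerCount()
{
    // hardware_concurrency() is only a hint and may return 0 when the
    // value cannot be determined, so fall back to a single worker.
    const unsigned reported = std::thread::hardware_concurrency();
    return reported == 0 ? 1u : reported;
}
```

Whether dropping the affinity mask helps is something only a measurement on the affected machines can show.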

Pretty much, MP usage takes a slightly different twist when you bypass the scheduler's standard thread calls. I think if you want to go outside the standard scheduler, you're pretty much stuck with writing the code without using the standard libs. I'm sure there are resources out there for this, just not sure where...

Moreover, it's not so much a thread issue as it is the old gl2 libs Torque is built around. And you for sure won't see better results with this using DirectX (TGEA); if anything I'd expect a major drop in performance using these calls. I think Gavin's example shows that.

Without a doubt, I'd bet a lot of Torque code should be rewritten to take advantage of more modern GPU and CPU threading boosts. Getting OpenSceneGraph and the 2.1 libs into the code would be a harder but better solution. Rendering that many korks in a scene graph that can use far more than a 50% cap on processing (on MP machines), you would see very workable frame rates.
#55
05/25/2007 (9:59 pm)
Good work Gavin. One possible problem: did you wrap the PROFILE_START/PROFILE_END directives around the semaphores inside the processThreaded method?

If you did, keep in mind that there are several threads running inside that loop and the profiler itself is probably not thread safe. That could skew the figures you got and even interfere with the comparison between single and threaded mode, because the processThreaded method is now doing additional work compared to the startSingle method.

For best comparison, just wrap profiling directives around the entire contents of the updateSkin method and the updateforces method and then do profile dumps.
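To illustrate the idea of timing a whole method from the outside, here is a minimal sketch of a scoped timer in standard C++11. It is not the Torque profiler and does not use PROFILE_START/PROFILE_END; ProfileTotals and ScopedTimer are hypothetical names. The totals are atomic so the timer stays correct even if it ends up being used from more than one thread.

```cpp
#include <atomic>
#include <chrono>

// Hypothetical accumulator: total time and call count for one measured method.
struct ProfileTotals {
    std::atomic<long long> nanoseconds{0};
    std::atomic<long long> calls{0};
};

// Hypothetical RAII timer: starts on construction and adds the elapsed time
// to the totals on destruction, so one object at the top of a method covers
// the method's entire body, including every return path.
class ScopedTimer {
public:
    explicit ScopedTimer(ProfileTotals& totals)
        : totals_(totals), start_(std::chrono::steady_clock::now()) {}

    ~ScopedTimer()
    {
        const auto elapsed = std::chrono::steady_clock::now() - start_;
        totals_.nanoseconds +=
            std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count();
        ++totals_.calls;
    }

    ScopedTimer(const ScopedTimer&) = delete;
    ScopedTimer& operator=(const ScopedTimer&) = delete;

private:
    ProfileTotals& totals_;
    std::chrono::steady_clock::time_point start_;
};
```

Placing one such timer as the first statement of the method being compared measures the same span in both single and threaded mode, without adding work inside the threaded loop.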

It would be good to get results from Mac or Linux builds to confirm whether it's a problem with the multi-process implementation or with the Windows scheduler overriding the thread affinity.

I know Lee Latham managed to do a Linux compile but it was crashing with a segfault so no progress there yet.

I'm getting behind in other projects and I'm all out of ideas and hardware on this matter so I'm going to leave it with you guys, but I'll contribute any useful input I can.
#56
05/26/2007 (6:21 am)
I got a 4-5 fps boost with Duncan's starter.racing mission on my core2duo (WinXP). About a 3fps boost on lots-a-korks.
#57
05/26/2007 (5:11 pm)
I added a built-in benchmark test to safely profile how many threads to create and then set up the engine to use the thread pool size which yielded the best results.

While testing it I got VERY interesting results. I hardcoded 4 threads and the test showed that my single CPU AMD Athlon 3000+ did the test in half the time with two threads compared to single thread mode.
Parallel Processing Actvated on 4 CPU's
Parallel Processing disabled by user request
Benchmark test took 31 milliseconds in single thread mode
Parallel Processing activated by user request
Benchmark test took 16 milliseconds with 2 CPU count
Benchmark test took 31 milliseconds with 3 CPU count
Benchmark test took 32 milliseconds with 4 CPU count
Best performance [16] was with 2 CPU count
setting thread pool accordingly

Perhaps AMD slipped in a second core and did not tell anyone. Very odd.

I updated the resource and the Windows binary

The test runs automatically when the multi-process class is created. Check the very top of your log file for the results (or the console window)

You can also re-run the test by typing MP_doBenchmark() in the console window.

Now at least we have a standard test to compare apples with apples on different CPUs.
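The selection step that the log output above suggests can be sketched as follows, in standard C++11. This is a guess at the logic from the log lines, not the resource's actual code; pickBestThreadCount and timeRun are hypothetical names, with timeRun(n) standing for "run the benchmark with n threads and return the elapsed milliseconds".

```cpp
#include <cassert>
#include <functional>

// Hypothetical sketch, not the resource's code: time the benchmark with
// 1..maxThreads threads and return the count with the lowest time.
// Ties keep the smaller count, since extra threads bought nothing.
unsigned pickBestThreadCount(unsigned maxThreads,
                             const std::function<long long(unsigned)>& timeRun)
{
    assert(maxThreads >= 1);

    unsigned best = 1;
    long long bestMs = timeRun(1);   // single thread is the baseline

    for (unsigned n = 2; n <= maxThreads; ++n) {
        const long long ms = timeRun(n);
        if (ms < bestMs) {
            bestMs = ms;
            best = n;
        }
    }
    return best;
}
```

With the timings from the log above (31, 16, 31 and 32 ms) this picks 2 threads. A selection like this is only as good as the timer behind it, so very short runs can produce a misleading winner.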
#58
05/26/2007 (8:36 pm)
As a result of my single cpu performance discovery above ......

Added MP_forceThreads(count) console method to give more flexibility to test options
Above links and resource have been updated

Try doing MP_forceThreads(32) followed by
MP_doBenchmark()
#59
05/28/2007 (10:34 am)
From AMD Dual CPU, Dual Core machine:
Parallel Processing activated
Benchmark test took 15 milliseconds with 2 threads
Benchmark test took 16 milliseconds with 3 threads
Benchmark test took 0 milliseconds with 4 threads
Best performance [0] was with 4 threads

Debug mode:

1047ms single, 547ms 2, 0ms 3, 266ms 4, selected 3 threads

Still under 10 fps ingame though with the set affinity code
#60
05/28/2007 (1:13 pm)
@Sebastien. I think there is a bug in my timing code or something. I probably need to add more loop iterations so that the test takes longer and you don't get the 0 milliseconds problem.

In MultiProcess::doBenchmark() find the line start(this,0,1000,NULL); and change the 1000 to 10000 and see if you get better figures.

[update] Yeah, my single CPU anomaly above disappears with 10000 iterations because the timing test takes longer. I just figured out why, I think... The tick time in TGE is 32 ms; perhaps that affects the accuracy of the Platform::getRealMilliseconds() method if less than a tick passes between calls.

Perhaps it would be best to replace getRealMilliseconds with ctime() or similar
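One way to make the benchmark independent of the timer's step size is sketched below in standard C++11, as an illustration rather than a drop-in replacement for getRealMilliseconds. It reads a monotonic clock (std::chrono::steady_clock) and keeps repeating the work until a minimum total time has passed, so the measured span is always much longer than one timer step. The names averageMicroseconds, work and minTotal are hypothetical.

```cpp
#include <chrono>
#include <functional>

// Hypothetical sketch, not TGE code: repeats work() until at least minTotal
// has elapsed on a monotonic clock, then returns the average time per call
// in microseconds. work() always runs at least once.
double averageMicroseconds(const std::function<void()>& work,
                           std::chrono::milliseconds minTotal)
{
    using Clock = std::chrono::steady_clock;

    const Clock::time_point start = Clock::now();
    unsigned long long runs = 0;
    Clock::duration elapsed{};

    do {
        work();
        ++runs;
        elapsed = Clock::now() - start;
    } while (elapsed < minTotal);

    return std::chrono::duration<double, std::micro>(elapsed).count()
           / static_cast<double>(runs);
}
```

This has the same effect as raising the iteration count by hand, except that the run length adapts to the machine: a fast CPU simply performs more repetitions before the minimum time is reached.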

I now get, on my single cpu
==>mp_forcethreads(4);
Parallel Processing activated
==>mp_dobenchmark();
Setting single thread modet
Benchmark test took 15 milliseconds in single thread mode
Parallel Processing activated
Benchmark test took 282 milliseconds with 2 threads
Benchmark test took 281 milliseconds with 3 threads
Benchmark test took 281 milliseconds with 4 threads
Best performance [15] was with 1 threads
setting thread pool accordingly

I updated the binary file and resource to use 10000, but I think it should be set higher still, to about 50000.

Quote:Still under 10 fps ingame though with the set affinity code
I wish I knew why. Perhaps there is some thread conflict in the matrix calcs.