Game Development Community


Multi-Threaded Mesh Skinning Showing Promise

by Pat Wilson · 06/17/2008 (10:52 am) · 20 comments

It's been almost 2 years since my last blog post, so I guess it was about time.

Rokkitball is a bad case for mesh skinning. The maximum game size, right now, is 8 players. These players are all in a small arena, the camera is 3rd person, and so you are frequently skinning high-LOD meshes, and lots of them. While there are still some art-side changes we can make to reduce the load on the CPU, I decided to revisit a problem which had previously frustrated me: optimizing mesh skinning.

This time through, I decided that the way to go was to multi-thread the skinning code. Mesh skinning happens as part of the rendering of an object; take a look at TSSkinMesh::render(). It calls UpdateSkin(), and then calls up to Parent::render(), which creates render instances. While the vertex buffer is created there, it is not used immediately. This parallels discussions I had previously with Mark Frohnmayer about the idea of using a 'Promise' for things which aren't ready right away, but will be eventually. That describes many networking cases, and, as it turns out, skinned meshes too.

So I've already written too much without pasting any code. I like code.
#ifndef _PROMISE_H_
#define _PROMISE_H_

// This is a rough draft

template<typename T>
class Promise
{
protected:
   /// Return the promised data. Undefined results if isReady() is false.
   virtual T *get() = 0;

public:
   virtual ~Promise() { }

   /// Returns true if the promised data is ready.
   virtual bool isReady() const = 0;

   /// Blocks until the promised data is ready, and returns it.
   virtual T *resolve() { while( !isReady() ); return get(); }
};

#endif
Pretty simple. So after that I changed render instances to store a Promise* instead of a vertex buffer pointer, and added proper methods to GFXDevice so that calling resolve() on a promise is delayed until the last possible second, in updateStates(). After that I needed to test it, so I wrote this class:
#ifndef _SIMPLE_PROMISE_H_
#define _SIMPLE_PROMISE_H_

#include "core/promise.h"

template<typename T>
class SimplePromise : public Promise<T>
{
protected:
   T *mPtr;
   virtual T *get() { return mPtr; }

public:
   SimplePromise( T *ptr ) : mPtr( ptr ) {}

   virtual bool isReady() const { return true; }

   virtual void set( T *ptr ) { mPtr = ptr; }
};

#endif
This worked fine and dandy, so it was time to thread the promise fulfillment. The ThreadedPromise class is not nearly as simple as the above class, so pasting it would just make a wall-o-text, but the concept should be apparent: a ThreadedPromise is created and will, at some point, be serviced by a thread, fulfilling the promise.
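Though the real class is longer, the core idea can be shown in a minimal sketch. Everything here is illustrative, not the engine's actual API: the promise carries a work function plus its data, and whichever thread fulfills it first publishes the result through an atomic flag (modern C++ atomics stand in for the engine's own wrappers).

```cpp
#include <atomic>

// Hypothetical sketch of a ThreadedPromise: a work function is attached at
// creation, and a worker thread (or the main thread) later runs it and
// publishes the result.
template<typename T>
class ThreadedPromiseSketch
{
public:
   typedef T *(*FulfillerFn)( void *data );

   ThreadedPromiseSketch( FulfillerFn fn, void *data )
      : mFn( fn ), mData( data ), mResult( 0 ), mReady( false ) {}

   bool isReady() const { return mReady.load( std::memory_order_acquire ); }

   // Run the deferred work and publish the result. In the real system this
   // is what a worker thread does when it services the promise.
   void fulfill()
   {
      T *r = mFn( mData );
      mResult = r;
      mReady.store( true, std::memory_order_release );
   }

   // Block (spin) until the result is published, then return it.
   T *resolve()
   {
      while( !isReady() ) { /* spin; fine for short waits */ }
      return mResult;
   }

private:
   FulfillerFn mFn;
   void       *mData;
   T          *mResult;
   std::atomic<bool> mReady;
};
```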

The way I chose to approach this was to create a PromiseFulfillerThread class, which would operate on a provided list of ThreadedPromises. This was implemented, initially, using a VectorPtr with a mutex on it. These were the initial results, run on a Core2 Quad. Each mesh was 3k+ polys and had no LODs.
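That initial mutex-guarded worklist can be sketched roughly like this (modern C++ for brevity; the engine's actual VectorPtr and Mutex classes differ, and the names are hypothetical):

```cpp
#include <functional>
#include <mutex>
#include <vector>

// Sketch of the first approach: worker threads pop skinning jobs off a
// shared vector, with a mutex serializing every push and pop.
class PromiseWorklist
{
public:
   void push( std::function<void()> work )
   {
      std::lock_guard<std::mutex> lock( mMutex );
      mWork.push_back( work );
   }

   // Pop and run one job. Returns false when the list is empty.
   bool serviceOne()
   {
      std::function<void()> work;
      {
         std::lock_guard<std::mutex> lock( mMutex );
         if( mWork.empty() )
            return false;
         work = mWork.back();
         mWork.pop_back();
      }
      work();   // Do the actual skinning outside the lock.
      return true;
   }

private:
   std::mutex mMutex;
   std::vector< std::function<void()> > mWork;
};
```

Every push and pop pays for a lock here, which is exactly the overhead the next step removes.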
[Chart: spreadsheets.google.com/pub?key=pzUp1A26JGMqsOmSNZr-fZA&oid=5&output=image]

Well, that's pretty cool, but as you can see by the orange bar, one additional thread did not provide much of an improvement at all. Since the target was a Core 2 Duo, not a Quad, this wasn't good enough.

The next step I took was to remove the mutex. Locking and unlocking a mutex takes time, and it really isn't appropriate for the simple manipulation of a worklist. So I revisited something I'd been meaning to look at for a while: interlocked operations. I rewrote the ThreadedPromise to use a lockless linked list. This provided a noticeable speed increase, but it still wasn't what I was looking for. What I found was that, with 2 cores, the main thread would batch up all the render instances, start drawing, and then sit and spin the first time it ran into a promise that wasn't ready. I suspected that if the main thread could "take" promises back and fulfill them as it needed them, it would provide the speed increase that was needed, as well as flexibility for future improvements.
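The lockless-list idea can be sketched as a Treiber-style stack, where push and pop use compare-and-swap instead of a mutex. This is a simplified illustration (modern C++ atomics, hypothetical names): a production version would also need ABA protection, which the sketch omits.

```cpp
#include <atomic>

// Intrusive node: each promise would embed one of these.
struct Node
{
   Node *next;
   int   payload;
};

// Sketch of a lock-free LIFO worklist built on compare-and-swap.
class LocklessStack
{
public:
   LocklessStack() : mHead( 0 ) {}

   void push( Node *n )
   {
      Node *head = mHead.load();
      do {
         n->next = head;   // head is reloaded by the CAS on failure
      } while( !mHead.compare_exchange_weak( head, n ) );
   }

   // Returns the most recently pushed node, or null if the stack is empty.
   Node *pop()
   {
      Node *head = mHead.load();
      while( head && !mHead.compare_exchange_weak( head, head->next ) )
         ;   // CAS failure reloads head; retry
      return head;
   }

private:
   std::atomic<Node *> mHead;
};
```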

This turned out to be true, and the metrics proved it. These are metrics dumped to the console.log from a 2-core AMD:
TSSkinBatcher -- Threaded Skin Promise Metrics
   - Promises batched to threads: 134050
   - Fulfilled before resolve: 91382
   - Promises blocked on resolve: 42668 [31.83%]
      + Worker thread processed 15065 [35.31%]
      + Main thread processed 27603 [64.69%]
So out of 134k skin requests, 91k of them were processed by worker threads before the main thread even asked for them. There were still 42k promises that the main thread had to block on. Of those 42k, 15k were being processed by a worker thread when the main thread tried to resolve() them, so the main thread had to sit and spin; but 27k of them hadn't been started on by the worker threads yet, so the main thread processed those itself while letting the worker threads continue on other promises.

This was a very good return on the work done. I have been seeing a ~20% framerate improvement on 2-core systems, for about 3.5 days of work, including the rewrite of the skinning loop and the creation of the TSSkinBatcher. The really surprising thing was how the implementation just fell into place once I wrote that tiny, abstract, 4-method class.

And now, the wrap-up!

Promises are really useful and powerful constructs. They are not a substitute for proper architecture, but they are a tool in the toolbox, and they are useful for more than threading and networking. Although I haven't coded it yet, one could easily make a deferred function-call promise to be returned by a method which has many callers. Execution of the function could be delayed until the value was actually used, and then, once computed, the value could be stored and returned to subsequent callers until it was reset.
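A minimal single-threaded sketch of that deferred function-call idea might look like the following (hypothetical names, not engine code): the computation runs only on first use, the value is cached, and reset() invalidates it.

```cpp
#include <functional>

// Sketch of a deferred function-call promise: the wrapped function runs
// lazily on the first resolve(), the result is memoized, and reset()
// forces recomputation on the next use.
template<typename T>
class DeferredCallPromise
{
public:
   explicit DeferredCallPromise( std::function<T()> fn )
      : mFn( fn ), mComputed( false ) {}

   const T &resolve()
   {
      if( !mComputed )
      {
         mValue = mFn();      // Computed at most once per reset() cycle.
         mComputed = true;
      }
      return mValue;
   }

   void reset() { mComputed = false; }

private:
   std::function<T()> mFn;
   T    mValue;
   bool mComputed;
};
```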

So, yet another tool to throw into the toolbox for optimization! Next blog entry will probably be about deferred rendering.

#1
06/17/2008 (11:09 am)
An excellent blog to come back to life with!
#2
06/17/2008 (11:14 am)
OMG, THANK YOU PAT!! This blog rocked.
#3
06/17/2008 (1:51 pm)
Hehe, awesome! Really interesting blog, too.
#4
06/17/2008 (2:15 pm)
i was paying attention until you posted code. then my brain switched focus from reading the plan to thinking about tacos
#5
06/17/2008 (5:00 pm)
Hehe - Indeed Adam, +1 to that.

Pat, you are always walking the thin line between excellence and madness. I wish I could understand all what you are talking about, but even when I dont completely, I can see the enormous effort here. Awesome.
#6
06/17/2008 (5:24 pm)
Multithreading = Shiva making tacos.

With all those hands, just think of how fast he could make tacos.
#7
06/17/2008 (6:18 pm)
Multithreading is the next BIG thing for indie game engines!
#8
06/18/2008 (1:55 am)
Wow very cool! Are you planning on sharing this back to not-so-clever-people?
#9
06/18/2008 (3:24 am)
Quote:
Multithreading is the next BIG thing for indie game engines!

Oh...
#10
06/19/2008 (1:58 pm)
I'll make ya'all a deal. If someone commits to fixing the extreme corner case (which can't be reproduced reliably) that has prevented it from shipping so far, then I'll release it as a resource. Otherwise, it'll be rolled into an engine update sometime in the future.

Also, all the code is lockless, so there are no critical sections. Non-reproducible bug, with interlocked operations. Debugging it is like trying to nail Jello to a wall.
#11
06/24/2008 (8:38 am)
Wouldn't it be easier to just nail the wall to the jello?
#12
06/26/2008 (5:19 am)
I posted this in another thread but I'll cross post it here:

@Pat - Great blog! It's a very interesting approach.

But...

You're using the Promise (aka delayed execution) design pattern incorrectly here.

The spinlock isn't appropriate either. You end up chewing up a whole core, which I'm sure you know already.

There are several things you can do to multi-thread the skinning process... in fact, the code would scale even better if you also multi-threaded the collision detection. You could run the collision and the skinning threads simultaneously and max out 8 cores (assuming the client machine has them) with a nearly linear improvement in performance.

I would be interested in helping you fix this if you're not completely po'd at me for critiquing your code.
#13
06/26/2008 (8:39 am)
Not at all, Tony. In fact, this isn't at all the optimal design for threaded tasks in the game engine.

The spinlock is only the default implementation. The actual threaded implementation is messy, but looks like this:
virtual T *resolve()
   {
      PROFILE_SCOPE(ThreadedPromise_Resolve); 

#ifdef ENABLE_THREADED_PROMISE_METRICS
      if( _isReady() )
      {
         Atomic::Value32::Increment( &smNoBlockResolves );
         return get();
      }
#else
      if( _isReady() )
         return get();
#endif


      // Interlock op, take control of promise
      if( mInterlockState == ilsPromiseNoState &&
         ( Atomic::Value32::Increment( &mInterlockState ) == ilsPromiseAcquired ) )
      {
#ifdef ENABLE_THREADED_PROMISE_METRICS
         Atomic::Value32::Increment( &smMainThreadResolveBlocks );
#endif
         return set( reinterpret_cast<T *>( mFulfillerFunction( mFunctionData ) ) );
      }

      while( !_isReady() )
         Platform::sleep( 0 );

#ifdef ENABLE_THREADED_PROMISE_METRICS
      Atomic::Value32::Increment( &smWorkerThreadResolveBlocks );
#endif

      return get();
   }
The spinlock still exists right at the end, but I think that it's appropriate. The spinlock case only happens if the main thread has requested a Promise which is currently being worked on by a worker thread. Ideally, the main thread should process a small worker task here, but since this is pretty much a hack-retrofit to just solve skinning bottlenecking, I left it at that.

The "real" implementation is actually being spearheaded by Alex, and it is more of a generic task construct where tasks can depend on other tasks, and the data is stored a bit better. I've been poking at Intel's Threading Building Blocks and they seem to be much improved from the last time I looked at them. Right now the plan is to use the TBB low-level constructs to build the higher level Torque classes. The goal is that the "game programmer" doesn't write threaded code, or worry about synchronization (unless it's some super-specific code).
#14
06/26/2008 (9:23 am)
Phew... you had me worried there for a bit :P

That looks a whole lot better, but I can see why you didn't want to post it... it adds a lot of noise to a beautiful blog :P

Instead of the sleep, couldn't you use a condition variable and wait on it? I know it's a little heavier weight, but it won't sit there and spin.

Another enhancement might be to prioritize the order in which the skinning is done... possibly related to the order in which the objects will be rendered?

Side thoughts...

What is the license for TBB? Isn't it $299? Wouldn't that be a bad idea unless you're not planning on distributing the source code?

I have my own threading library that my mentor and I have been creating over the past few years. It's very stable, fast and pretty easy to use.

Some of it is open source, and eventually all of it will be open source once I get the time to reformat the code.

It's all cross platform and runs on Solaris (on Intel and Sparc), Mac OS X (I've only tested on Intel), Linux (Intel, AMD, and even PS3), Windows, and even Cygwin.

You're welcome to take a look at it and see if it suits your needs.... much better than requiring Torque users to purchase TBB.

If it's missing anything that you require, let me know and I'll get it in there immediately if I have it already implemented... it's just a matter of copy / paste and editing the coding style and adding the ZLib license.

If you need something that's not implemented, also let me know... I love extending it and adding new functionality when it's needed.

You can browse the source here.

It's free, open source and you can use it commercially without having to redistribute the source code since it's licensed under the ZLib license.
#15
06/26/2008 (10:41 am)
TBB is GPL v2 with the runtime exception, which is a really confusing way of saying, "Link it however you want, it won't make your stuff GPL." Here is the official version of that:

www.cs.huji.ac.il/~etsman/Docs/gcc-3.4-base/libstdc++/html/17_intro/license.html

TBB recently got a lot of work done on it (well recently relative to the last time I looked at it). The DOxygen is here: www.threadingbuildingblocks.org/files/documentation/index.html

I'm leaning towards TBB right now because there is a contact at Intel working on their game threading stuff who used to work at Dynamix, so it's a good resource. They came to GG a year or two ago, but we really weren't in a position to start doing research work. Luckily, we are now... but there's a bunch of catch-up to be done.
#16
06/26/2008 (7:37 pm)
Wow, nice... I will have to take another look at TBB now too :-D

Thanks for the information.
#17
06/27/2008 (8:32 am)
Hey Tony, for sanity's sake...what exactly is the difference between a Promise and a Future? Or am I using neither, in this case?
#18
06/27/2008 (10:24 am)
Futures and Promises are very closely related multi-threading constructs. I'll do my best to explain the differences, although I'm probably only going to be about 90% accurate in my explanation, so take it with a grain of salt. :-D

It's difficult to give a good definition for a Promise due to some ambiguity which I'll discuss later.... part of that ambiguity is what caused me to say you're not using them correctly.

Futures have a less ambiguous definition since most programming languages address them the same way (albeit very few of them address them correctly since very few languages outside of certain scripting languages can handle the constructs).

Futures are constructs that allow for deferred responses. Generally a future is assigned a thread immediately upon creation, although that's not a firm rule. If a thread is assigned to a Future when the value is first needed, this is called a "Lazy Future". If a thread is assigned to a Future before the value is needed, this is called an "Eager Future".

In this example, x will eventually be assigned a value:
x = future(factorial(100000));

In a slightly more complex example:
x = future(factorial(100000)); 
y = x + 3;
print(123);
print (y);
print(321);

In this case, the language should automatically interpret it as:
x = future(factorial(100000)); 
y = future(x) + 3;
print(123);
print (future(y));
print(321);

And the output will be
Quote:
123
321
(and then whatever factorial(100000) + 3 is)

Notice the output is out of sequence (it doesn't have to be but it can be, depending on how quickly your CPU can calculate factorial(100000) + 3) :-D

Futures generally have constructs that allow you to wait for the result, so you can solve the out-of-order execution if it would cause problems.

In C++ you can implement Futures, but only if every class you're using provides support for futures and delayed responses and your classes can queue up the requests of "future(x) + 3" as another future.... not an easy task. Not impossible since you can overload operators in C++.

But... I don't think a Future is what you want.

Now, back to the ambiguity related to the definition of a Promise.

In a lot of software, a Promise is used for deferred or lazy evaluation. Up until today and yesterday after discussing this with you, this was the only definition I had ever seen, hence my reason for saying you're using them incorrectly.

Other software designers use Promises the way you're using them, albeit your implementation is a little... (trying to put it politely) irregular.

Possibly after all of this, you are right, although in my defense, I clarified my definition of a Promise:
Quote:You're using the Promise (aka delayed execution) design pattern incorrectly here.

Which is the correct use of the term? I don't know... although the "children" typically use the "deferred evaluation" definition and "adults" typically use the definition you're using. (/me blushes)

I hate being wrong, ya know? :-P

If you want, we could agree to reduce the ambiguity when we're speaking. :-P

We could borrow from the terms generally used with a Future.

If you like:

An Eager Promise is preemptive evaluation. This is what you want.

A Lazy Promise is deferred evaluation. This is what I have typically seen implemented.

A good example of a Lazy Promise that's related to the previous example and ignoring threading issues:
class X
{
public:

...

    int getFactorial()
    {
        if (!m_bEvaluated)
        {
            m_factorial = ::factorial(m_x);
            m_bEvaluated = true;
        }
        return m_factorial;
    } 

...

private:
    bool m_bEvaluated;
    int m_x;
    int m_factorial;
};

This is definitely not what you're doing nor is it what you want to do.

MultiLisp and Act 1 are the only languages I know that implement Eager Promises the way you're using them, although there could possibly be others.

The way I would implement an Eager Promise in C++ would be an through an asynchronous task handler.

The requests for values (or the result of a task) would include a callback. The request object would check to see if the result had already been computed, and if not, the request would be queued. If the result was available, the request callback would be called immediately.

A thread pool of 1 or more threads would service the queue of requests and as each item was computed then the callback would be called with the results.

That's a very high-level, generic explanation... obviously there's a whole lot more to it (e.g. multiple requests for the same value could cause a dispatch of the result to all waiting callbacks). You also have the problem that the dispatcher for the results should probably be put into a queue so it can be serviced by the main rendering thread... not a difficult task, but you just have to be aware of the requirement.
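A rough single-threaded skeleton of that request/callback flow (hypothetical names, and ignoring the locking a real multi-threaded version needs): requests either fire immediately when the result is cached, or are queued until a worker publishes the value.

```cpp
#include <functional>
#include <vector>

// Sketch of a callback-based eager-promise handler: consumers register
// callbacks, and the producer's publish() dispatches the result to every
// waiter. Later requests see the cached result immediately.
class AsyncResultHandler
{
public:
   typedef std::function<void(int)> Callback;

   // Ask for the value: fire the callback now if it's cached, else queue it.
   void request( Callback cb )
   {
      if( mHasResult )
         cb( mResult );
      else
         mWaiters.push_back( cb );
   }

   // Conceptually called from a worker thread once the value is computed;
   // caches the result and dispatches it to all waiting callbacks.
   void publish( int result )
   {
      mResult = result;
      mHasResult = true;
      for( size_t i = 0; i < mWaiters.size(); ++i )
         mWaiters[i]( mResult );
      mWaiters.clear();
   }

private:
   bool mHasResult = false;
   int  mResult = 0;
   std::vector<Callback> mWaiters;
};
```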

What you need is a prioritized Eager Promise plus a single threaded result handler.

The priority should be based on the order that you're wanting to render the skinned objects.... if you could sort the objects and throw them into a queue in the order that you're going to need them, go off and do some other processing, then come back and pop the results off of the result queue then you should see some fantastic, nearly linear speed improvements (up to the point until you've exhausted your cores).

Maybe that's exactly what you implemented and I'm just confused since I'm only looking at a subset of your code....

English is a second language for me... C++ is my primary language and I communicate much better in that language than English...

Hopefully that is a decent enough English explanation.
#19
06/28/2008 (4:07 pm)
The real cool thing about Pat's implementation of threaded skinning is the small amount of code/framework it actually took. The other cool thing I learned was the idea of the main thread taking work back from the thread pool instead of blocking on it. Totally smart. Using your terminology above, his system is implementing both Eager and Lazy promises. So the generic title of "Promises" is probably good. ;)
#20
06/14/2014 (7:04 pm)
I found this blog again looking for a T3D multithreading example. I really want to resurrect the work Pat did on this and see if we can implement it in MIT3D!