Game Development Community

SimEvents posted from threads have old times, causing asserts

by Drew Parker · in Torque Game Engine Advanced · 01/04/2007 (9:59 am) · 15 replies

The Overview:
It seems posting simEvents from a thread other then the main TGE thread, using the current game time, can cause a fatal assert. This occurs because when the posted event (lets say event A) is processed, the TGE game time has already been advanced ahead of event A's time. The assert checks to see if any posted event has an event time behind the current game time. Since event A is behind the current game time, a fatal assert is thrown.


The long explanation:
I am working on a project that has TSE running in a multi-threaded environment. There is a thread dedicated to processing different kinds of network traffic, in addition to the thread running TSE. When certain kinds of traffic comes in, the network thread posts a simEvent (or a console function is called using executef() )


The network thread:
// event handler which is called upon receiving network traffic....

case DATA_EVENT:

Con::executef(5, "scriptDataAlert", name, buffer1, buffer2, eventName);


executef() calls execute()

\engine\console\console.cpp
const char *executef(S32 argc, ...)
{
   const char *argv[128];

   va_list args;
   va_start(args, argc);
   for(S32 i = 0; i < argc; i++)
      argv[ i ] = va_arg(args, const char *);
   va_end(args);
   return execute(argc, argv);
}


execute() runs, and posts a simEvent to process the console function (it appers to be thread safe from the code, and as discussed in this TDN article: Torque/1.4/MultiThread)

\engine\console\console.cpp
const char *execute(S32 argc, const char *argv[])
{
#ifdef TORQUE_MULTITHREAD
   if(isMainThread())
   {
#endif
      Namespace::Entry *ent;
      StringTableEntry funcName = StringTable->insert(argv[0]);
      ent = Namespace::global()->lookup(funcName);

      if(!ent)
      {
         warnf(ConsoleLogEntry::Script, "%s: Unknown command.", argv[0]);

         // Clean up arg buffers, if any.
         STR.clearFunctionOffset();
         return "";
      }
      return ent->execute(argc, argv, &gEvalState);
#ifdef TORQUE_MULTITHREAD
   }
   else
   {
      SimConsoleThreadExecCallback cb;
      SimConsoleThreadExecEvent *evt = new SimConsoleThreadExecEvent(argc, argv, false, &cb);
      Sim::postEvent(Sim::getRootGroup(), evt, Sim::getCurrentTime());
      
      return cb.waitForResult();
   }
#endif
}

The bottom of the function is where the simEvent is posted. The simEvent is posted with the current game time:
SimConsoleThreadExecCallback cb;
      SimConsoleThreadExecEvent *evt = new SimConsoleThreadExecEvent(argc, argv, false, &cb);
      Sim::postEvent(Sim::getRootGroup(), evt, Sim::getCurrentTime());


At this point, the network thread has finished processing the executef() call that started it.


... continued in next post...

#1
01/04/2007 (10:00 am)
... continued from above ...

The TGE thread
In the TGE thread, advanceToTime() is running to process events:


\engine\console\simManager.cpp
void advanceToTime(SimTime targetTime)
{
   AssertFatal(targetTime >= gCurrentTime, "EventQueue::process: cannot advance to time in the past.");

   Mutex::lockMutex(gEventQueueMutex);
   gTargetTime = targetTime;
   while(gEventQueue && gEventQueue->time <= targetTime)
   {
      SimEvent *event = gEventQueue;
      gEventQueue = gEventQueue->nextEvent;
      AssertFatal(event->time >= gCurrentTime,
			"SimEventQueue::pop: Cannot go back in time (flux capacitor not installed - BJG).");
      gCurrentTime = event->time;
      SimObject *obj = event->destObject;

      if(!obj->isDeleted())
         event->process(obj);
      delete event;
   }
	gCurrentTime = targetTime;
   Mutex::unlockMutex(gEventQueueMutex);
}


A fatal assert occurs, because the event time posted from the other thread (which was the current game time then) is now behind the current game time by about 30-60ms
AssertFatal(event->time >= gCurrentTime,
			"SimEventQueue::pop: Cannot go back in time (flux capacitor not installed - BJG).");



A brief example - trace through the debugger:

- The network thread calls the event handler upon receiving a packet over the network, which the event handler then calls a script function (executef)
- executef() calls execute(), which posts a simEvent. So far so good. The simEvent is posted with the current game time (gCurrentTime), for example, 12742. In other words, the event is posted to run immediately (the next tick)
- In the meantime, the TGE thread is still running. advanceToTime() processes SOME OTHER EVENT, which pushes the time forward to about 13000.
- The next go around when our posted simEvent is processed by advanceToTime(), the game time has passed the simEvent time (13000 > 12742), throwing a fatal assert.


Any thoughts on the correct way to deal with this issue? Thanks to any who reply!
#2
01/04/2007 (10:25 am)
A couple of high level strategies come to mind:

--Admit (in a programming sense) that even though multi-threaded, you need to block the main thread when processing network events. This might include something like running threaded until a fully formed event is ready to post, then blocking the main thread until the networking thread posts, then continuing the main thread.

--implement a processing type on cross-thread posted events called "just do it next". Basically, that assert is simply a safety valve to make sure something wonky isn't going on, but it very much implies contiguous time based processing--which you obviously don't have in this case. You would want to be very very careful with this, because obviously the most critical issue in multi-threaded applications is knowing when to synchronize and when not to, but in this case we do want to process these net based posted events when they arrive.

I am extremely interested in this technique--it's one that I've suggested to many commercial studios using Torque, but have yet to see anyone do actual R&D. If you have any further questions, please don't hesitate to ask!

Oh, and I'm pinging Ben on this thread, since he has done the most multi-threaded work with TGE/TSE(Atlas), as well as Tom Bampton who is just a game programming god.

Also, outstanding question writeup--probably the best I've seen in quite a while!
#3
01/04/2007 (10:31 am)
Ok, here is a solution that appears to be working at the moment, though it could cause problems elsewhere that I am unaware of. I extended the simEvent logic so it can check if you want the event processing to ignore the event time. If you want the event time to be ignored, then the simEvent will still run with no assert.

Here are the code changes I made:

Find the class SimEvent definition in simBase.h

\engine\console\simBase.h
class SimEvent
{
  public:
   SimEvent *nextEvent;     ///< Linked list details - pointer to next item in the list.
   SimTime startTime;       ///< When the event was posted.
   SimTime time;            ///< When the event is scheduled to occur.
   U32 sequenceCount;       ///< Unique ID. These are assigned sequentially based on order
                            ///  of addition to the list.
   SimObject *destObject;   ///< Object on which this event will be applied.

Below SimObject *destObject; , add:

bool ignoreTime;         ///< If set, the SimEvent will not throw an assert if it has an old time.  --dp

Just below that, change the line:
SimEvent() { destObject = NULL; }
To this:
SimEvent() { destObject = NULL; ignoreTime = false; }      //--dp



Find the function advanceToTime() in simManager.cpp

\engine\console\simManager.cpp
void advanceToTime(SimTime targetTime)

Find this line:
AssertFatal(event->time >= gCurrentTime,
			"SimEventQueue::pop: Cannot go back in time (flux capacitor not installed - BJG).");

Replace it with this:
// Only throw asserts on events that have old times and are not ignoring the time setting. --dp
      if (event->time < gCurrentTime && !event->ignoreTime)
      {
         Con::printf("SimEventQueue::pop: Cannot go back in time --  event->Time:%i, currentTime:%i", 
                     event->time, gCurrentTime);

         AssertFatal(event->time >= gCurrentTime,
			   "SimEventQueue::pop: Cannot go back in time (flux capacitor not installed - BJG).");
      }


This checks the event to see if the event time should be ignored. If the event time should be ignored, we don't throw an assert.


Find the function postEvent () in simManager.cpp
U32 postEvent(SimObject *destObject, SimEvent* event,U32 time)

The first line of the function is:
AssertFatal(time >= gCurrentTime,
		"Sim::postEvent: Cannot go back in time. (flux capacitor unavailable -- BJG)");

Replace that with this:
// Only throw asserts on events that have old times and are not ignoring the time setting. --dp
   if (time < gCurrentTime && !event->ignoreTime)
   {
	   AssertFatal(time >= gCurrentTime,
		   "Sim::postEvent: Cannot go back in time. (flux capacitor unavailable -- BJG)");
   }

Now, when posting an event from a different thread that you don't want restricted by the game time (like for our project, the simEvents posted from the networking thread), just make sure the simEvent's ignoreTime variable is set to true on creation.

That's it. This seems to be working now (as in, there are no more fatal asserts) but there may be undesireable behavior that arises from this change, I don't know. I'll try to post to this thread in the future if I learn more regarding this.

It would be great to have GG's feedback on the best approach to fixing this problem, whenever someone gets the time. Thanks!
#4
01/04/2007 (10:50 am)
I don't see any major issues with what you implemented--and it's the correct solution for simply getting rid of the original design assumption (single threaded, contiguous time processing) in a relatively safe manner. Of course, to be completely paranoid, you should be 100% validating your events before actually processing them on your main thread, but safety is always a trade off on performance, and the entire idea here is to optimize performance. You may want to simply modify the assert and give it a bigger threshold to check against--hey, we don't want this network event, it's 3 seconds old! type of thing. That would be a big task however, and should be covered by the already existing network code I'd think.

I'm not a multi-threading guru, but I would suggest that in 100% pure MT design, you really would want to block the main thread, or at least synchronize much better, when posting a fully formed network event from your background thread. As far as i am aware, unexpected race conditions are the bane of multi-threaded application testing/debugging, and without a synchronized post of some sort, my gut tells me a possible race condition could exist, but I can't tell you what it may be!
#5
01/04/2007 (10:58 am)
Hey Stephen,

Thanks for both of your replies! You are fast, let me tell you - I really appreciate it! This is the second time I've been writing a post to find you have already replied to my previous one. :)

In regards to your posts: That's some great feedback - I'll definitely be sharing it with the people I work with so we can discuss it.

I don't have much experience working with threads, so I didn't know the full ramifications of what I did in terms of threading (I put that in there so I could continue my development for the mean time, and it seemed to make sense) - and that's why it's great to hear your feedback. Other people on the same project are the experts on threading and wrote the related code, so I'll be sharing all of this with them for sure.

Thanks again for your reply, and passing along this thread (forum :) info to Ben and Tom. If they have time to look at it, it would be great to hear what their thoughts are as well.

Glad you liked the writeup! I tried to put as much info in there as possible to make it clear.
#6
01/04/2007 (11:14 am)
Adding a unique ID seems like a workable solution for non-time-dependent events. You also might try doing something where when you post from the non-main thread it shuffles the new event into the right place in the queue to avoid problems.

The big issue is you probably don't want your time to jump around in odd ways, but I'm unsure of the exact ramifications of that - it might not actually be a big issue as long as when you get to rendering you have a correct time.

Why can't you just populate the time when you actually queue the event? Wouldn't that always work?
#7
01/04/2007 (11:50 am)
@Ben - I think the real question is why did you leave execute() broken? ;)

Explain what you mean by "queue the event"... in Sim::postEvent()? I guess it's possible that by the time gEventQueueMutex is locked gCurrentTime could have moved forward. Maybe having Sim::postEvent() respond to time == 0 by setting gCurrentTime after the gEventQueueMutex is locked will solve this.
#8
01/04/2007 (1:22 pm)
Right - in postEvent. You probably need to put a mutex around gCurrentTime as well, to avoid races.
#9
01/04/2007 (1:51 pm)
Actually gCurrentTime is only changed within the mutex lock in Sim::advanceToTime()... so once Sim::postEvent() locks the mutex it has gCurrentTime to itself.
#10
01/04/2007 (2:11 pm)
Very good. But what about reading? :)
#11
01/04/2007 (2:25 pm)
You got me there... why would two threads reading the same memory location be an issue?

Oh duh... yea... Sim::postEvent() is being changed to only read gCurrentTime from inside the gEventQueueMutex lock. Drew should be posting his findings shortly with the code.
#12
01/04/2007 (11:29 pm)
Ok, taking what Ben and Tom said here, and with some guidance from Tom over IM, here is what I've got (it seems to be working so far)

First, take out any changes made for the ignoreTime fix I proposed earlier in this thread (Tom didn't like that fix, and I agree with him, since it is just a hack around the assert).


Find the function postEvent() in simManager.cpp

\engine\console\simManager.cpp
U32 postEvent(SimObject *destObject, SimEvent* event,U32 time)

Go to the first line
AssertFatal(time >= gCurrentTime,
		"Sim::postEvent: Cannot go back in time. (flux capacitor unavailable -- BJG)");
and erase it. We are going to move it down (and it will be included in the next added code block)

Find the line:
event->time = time;

Replace it with this:
// If the time is 0, set the event to have the current time --dp
   if (time != 0)
      event->time = time;
   else
      event->time = gCurrentTime;

AssertFatal(event->time >= gCurrentTime,
	"Sim::postEvent: Cannot go back in time. (flux capacitor unavailable -- BJG)");

There is a mutex lock line that is right above the old event->time = time; line:
Mutex::lockMutex(gEventQueueMutex);

You need to ensure this added codeblock remains AFTER this mutex line.


Now go find the function execute() in console.cpp

\engine\console\console.cpp
const char *execute(S32 argc, const char *argv[])

At the bottom of the function, find the line:
Sim::postEvent(Sim::getRootGroup(), evt, Sim::getCurrentTime());

And replace it with this:
// Post with a time of zero - tells postEvent() to populate the time when the event is queued
      Sim::postEvent(Sim::getRootGroup(), evt, 0);


That should do it. I've tested it and it seems to work fine. It also works with posting other simEvents with a time of zero.

In regards to handling the read of gCurrentTime you guys are talking about above, I'm not sure what to do (my lack of threaded programming experience). I'll need some help there - some explicit instructions or for someone else to please jump in. :)
#13
01/05/2007 (9:06 am)
@Drew - What you did should work... Ben was just alluding to the fact that you need to protect the reading of gCurrentTime (protect it from the main thread writing to it). You've done this by only reading from gCurrentTime inside of the gEventQueueMutex lock since gCurrentTime is only changed by the main thread behind that same lock. So if i read your description above correctly i think your just fine.
#14
01/05/2007 (9:52 am)
Doesnt that make the assert you have in the event post code redundant then?

If you have the gCurrentTime locked by the mutex above it, then the assert cant be triggered, because you cant get a changing gCurrentTime, hence event->time will ALWAYS be gCurrentTime as youve set it above).

This would change if you'd done the event->time = gCurrentTime AFTER the release of the mutex..

Or did I misread something? :)
#15
01/05/2007 (10:09 am)
Of course it should never fire. It's an assert! Nevertheless it helped us catch and analyze what would otherwise be problematic behavior. :)

You're right - in the zero case it can't fire (hopefully) but in other cases still can.

These fixes seem good although I'm not gonna swear to it - haven't had my morning coffee yet. :(