WritePacketData readPacketData and latency
by Tom Spilman · in Torque Game Engine · 04/14/2008 (10:11 am) · 13 replies
We're currently working on networking some physics and I've realized that I know less than I thought I did about it. Sometimes I find that writing it out helps me solidify my understanding of things, and posting it here will allow me to get feedback from others and hopefully be useful to someone in the future.
What I understand...
The client processes the input Move during a tick which changes the state of the control object. writePacketData() is then called on the control object and a CRC is generated from the packed data. This CRC and the Move are sent to the server. This is all "client side prediction" and the server will have the final say.
The server gets the Move and CRC from the client and then processes that input on the server object that the client is controlling. It then does writePacketData() and generates the server CRC which is compared to the client CRC. If they match then nothing else needs to be done... the client is in sync with the server. If the CRCs do not match then the server must correct the client: the data written by the server's writePacketData() is sent to the client as the correction.
Before the client processes any ticks it handles incoming data from the server. If the server sent a correction it calls readPacketData() on the client control object. The readPacketData() makes changes to the client object to put it in sync with the server. The client can then process the next tick from the correct server position.
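To make the three steps above concrete, here is a minimal sketch of the predict / compare / correct decision. Everything in it (Move, ControlState, checksum, serverNeedsCorrection) is a made-up stand-in for illustration, not Torque's real classes, and the FNV-1a hash simply stands in for the engine's CRC.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-ins for illustration; these are NOT Torque's real types.
struct Move { float dx; float dy; };
struct ControlState { float x = 0.0f; float y = 0.0f; };

// One deterministic simulation tick: apply the move to the state.
inline void processTick(ControlState& s, const Move& m) {
    s.x += m.dx;
    s.y += m.dy;
}

// Toy checksum over the raw bytes of the state (FNV-1a standing in for the CRC).
inline std::uint32_t checksum(const ControlState& s) {
    unsigned char bytes[sizeof(ControlState)];
    std::memcpy(bytes, &s, sizeof(s));
    std::uint32_t h = 2166136261u;
    for (unsigned char b : bytes) {
        h ^= b;
        h *= 16777619u;
    }
    return h;
}

// Server side: apply the client's move to the authoritative state, then
// compare checksums. Returns true when a correction must be sent back.
inline bool serverNeedsCorrection(ControlState& serverState,
                                  const Move& m,
                                  std::uint32_t clientCrc) {
    processTick(serverState, m);
    return checksum(serverState) != clientCrc;
}
```

If the client and server start from the same state and run the same deterministic tick, serverNeedsCorrection() returns false and nothing extra goes over the wire; any divergence in the server's state flips it to true.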
So... now what I don't understand.
Let's say in a good case you have a little over 32ms of network latency from the server to the client. This means that when the client receives the correction packet from the server the correction data is for a previous tick. If the client applies this correction it would move it back to an old position... from then on the client and server would be out of sync with each other. From what I can tell this is possible in all of the current control objects (Player, Vehicle, etc) as they just do a setPosition() on the correction data. Is there something in the bowels of the GameConnection code that throws out old corrections? How do corrections arrive in time at all?
Also this is all for the control object. Non-control objects rely on packUpdate/unpackUpdate to get periodic positional data from the server and use "warpTicks" to smooth the updates out. It seems that the client should be in sync with the server more often than not, so sending corrections would make sense for all active objects. So why is it that CRCd "corrections" are not used for non-control objects?
About the author
Tom is a programmer and co-owner of Sickhead Games, LLC.
#2
04/14/2008 (11:41 am)
Juicy!
#3
04/14/2008 (12:04 pm)
@Stephen - Thanks for the reply.
So where is this timestamp and moving history kept? It's not in Player, ShapeBase, GameBase, or GameConnection that I can see. Also where does the unwind and replay on the client control object occur? Seeing the code would help me tremendously in understanding it all.
As far as I know none of this is documented in Torque. The old networking docs barely cover write/readPacketData, much less how corrections really work. It would be a huge help to get a little of this into TDN or into the docs in the next Torque release.
---
While poking around the code just now it finally clicked for me how BitStream compression points work.
GameConnection::writePacket() first calls writePacketData() on the control object before network events and ghosted objects are written. As long as you always call setCompressionPoint() in your writePacketData(), all other objects can use writeCompressedPoint(). This encodes the positions relative to the control object, which saves a lot of network bandwidth when the objects are near the player (which is very common).
Again... another really cool feature that isn't documented.
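A rough sketch of the idea behind compression points, under stated assumptions: PointCodec, EncodedPoint, kUnit and kRange are all invented for this example and this is not how Torque's BitStream actually lays out its bits. It only shows why positions near the reference point are cheap: the offset fits in three 16-bit integers, while far-away points fall back to full floats.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

struct Point3 { float x, y, z; };

// Assumed values for this sketch only.
constexpr float kUnit  = 0.01f;   // size of one quantization step
constexpr float kRange = 300.0f;  // largest offset that still fits in 16 bits

struct EncodedPoint {
    bool compressed;      // true: q[] holds offsets from the compression point
    std::int16_t q[3];    // quantized offsets, valid when compressed == true
    Point3 full;          // full-precision point, valid when compressed == false
};

// Hypothetical codec; the compression point plays the role of the
// control object's position.
struct PointCodec {
    Point3 compressionPoint{0.0f, 0.0f, 0.0f};

    EncodedPoint encode(const Point3& p) const {
        const float d[3] = {p.x - compressionPoint.x,
                            p.y - compressionPoint.y,
                            p.z - compressionPoint.z};
        EncodedPoint e{};
        e.compressed = std::fabs(d[0]) < kRange &&
                       std::fabs(d[1]) < kRange &&
                       std::fabs(d[2]) < kRange;
        if (e.compressed) {
            // Nearby: store small integer offsets instead of three floats.
            for (int i = 0; i < 3; ++i)
                e.q[i] = static_cast<std::int16_t>(std::lround(d[i] / kUnit));
        } else {
            // Far away: fall back to the uncompressed point.
            e.full = p;
        }
        return e;
    }

    Point3 decode(const EncodedPoint& e) const {
        if (!e.compressed) return e.full;
        return Point3{compressionPoint.x + e.q[0] * kUnit,
                      compressionPoint.y + e.q[1] * kUnit,
                      compressionPoint.z + e.q[2] * kUnit};
    }
};
```

Both ends have to agree on the compression point before any compressed point is read, which is presumably why the control object's data is written first in the packet.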
#4
04/14/2008 (12:26 pm)
You're going to make me dig down deep into the code myself, hehe.
Let me get up to speed (been more than a year since I looked at this in depth), and I'll get back to you :)
#5
04/14/2008 (12:37 pm)
Well... I assumed you knew already. I looked and couldn't find it. ;)
#6
04/14/2008 (1:15 pm)
I don't have a ton of time right now to go through the full flow, but some key methods to look at:
gameProcess.cpp:
   ClientProcessList::onTickObject
   ServerProcessList::onTickObject
   ClientProcessList::clientCatchup
player.cpp:
   Player::processTick
   specifically this section:
      // Warp to catch up to server
      if (delta.warpTicks > 0) {
         delta.warpTicks--;
   Player::unpackUpdate
   --again, specifically on the delta/backstepping section
The big conceptual barrier to keep in mind is that ServerProcessList::onTickObject will call processTick multiple times in one advanceObjects cycle depending on the number of moves in the Connection's move queue, and that ClientProcessList::clientCatchup will do the same.
Also, try compiling with TORQUE_DEBUG_NET_MOVES defined, which will provide a lot of debugging state information that can help track the whole process.
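The "processTick multiple times per cycle" point can be pictured with a small sketch. The names here (ControlObject, advanceControlObject) are hypothetical and the real process lists do much more, including limits on how many moves they will consume, but the core loop is assumed to look roughly like this: one processTick() per queued move, not one per frame.

```cpp
#include <cassert>
#include <deque>

// Hypothetical move and control object, for illustration only.
struct Move { float dx; };

struct ControlObject {
    float x = 0.0f;
    int ticks = 0;
    void processTick(const Move& m) {
        x += m.dx;
        ++ticks;
    }
};

// Sketch of one advance cycle for a connection's control object:
// every move waiting in the queue gets its own tick.
// Returns the number of ticks that were run.
inline int advanceControlObject(ControlObject& obj, std::deque<Move>& moveQueue) {
    int processed = 0;
    while (!moveQueue.empty()) {
        obj.processTick(moveQueue.front());
        moveQueue.pop_front();
        ++processed;
    }
    return processed;
}
```

So if three moves arrived in one packet, the object ticks three times in that cycle; if none arrived, it does not tick at all in this simplified version.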
#7
04/14/2008 (1:22 pm)
PS: I slightly mis-spoke when I used the term "timestamp", I should have been more explicit and talked about the move queue that is stored on the connection, and used in the methods above.
#8
04/14/2008 (2:41 pm)
Thanks for pointing me in the right direction Stephen. Having the right terminology makes it easier to search the code for things.
So you can start by looking at GameConnection::writePacket()...
   // assume that the control object will write in a compression point
   if(bstream->writeFlag(mMoveList.isMismatch() || mControlForceMismatch))
   {
The "mismatch" in isMismatch() means the CRC from the client didn't match the CRC on the server. As you can see GameConnection::writePacket() calls writePacketData on your control object in this case. This is the correction packet.
The "timestamp" you spoke of Stephen is the last move index processed by the server. In MoveList::clientReadMovePacket() you see...
   mLastMoveAck = bstream->readInt(32);
   if(mLastMoveAck > mLastClientMove)
      mLastClientMove = mLastMoveAck;
This is how the server rolls the client back to an old move when the CRCs do not match. Then in ClientProcessList::clientCatchup() it calls processTick() on your control object until all the rolled back Moves are processed.
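Here is a small sketch of the bookkeeping as I read it. PendingMoves and its members are invented names, and it assumes the acknowledged index means "the server has processed every move before this one"; the real MoveList may differ in the details. The point is that the ack index decides which moves can be forgotten, and whatever is left is what the catch-up has to re-run.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical move type for illustration.
struct Move { float dx; };

// Client-side list of moves the server has not acknowledged yet.
struct PendingMoves {
    std::uint32_t firstIndex = 0;   // move index of moves.front()
    std::deque<Move> moves;

    // Record a newly generated move and return its index.
    std::uint32_t push(const Move& m) {
        moves.push_back(m);
        return firstIndex + static_cast<std::uint32_t>(moves.size()) - 1;
    }

    // Assumption: the server has processed all moves with index < lastMoveAck.
    // Drop those and return how many moves remain to be replayed.
    std::size_t acknowledge(std::uint32_t lastMoveAck) {
        while (!moves.empty() && firstIndex < lastMoveAck) {
            moves.pop_front();
            ++firstIndex;
        }
        return moves.size();
    }
};
```

A stale ack (an index lower than one already seen) changes nothing in this sketch, which is one simple way an old packet ends up being harmless.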
#9
04/14/2008 (2:46 pm)
This is a fascinating conversation and I am learning a lot. This kind of thing has always been a bit of a mystery to me. Keep us posted Tom.
#10
04/14/2008 (3:14 pm)
By the way, it can be very useful to think of things in the worst case scenario, not the best case one. For example, Tom's original assumptions stated:
Quote:
Lets say in a good case you have a little over 32ms of network latency from the server to the client. This means that when the client receives the correction packet from the server the correction data is for a previous tick. If the client applies this correction it would move it back to an old position... from then on the client and server would be out of sync to each other. From what i can tell this is possible in all of the current control objects (Player, Vehicle, etc) as they just do a setPosition() on the correction data. Is there something in the bowels of the GameConnection code that throws out old corrections? How do corrections arrive in time at all?
The hidden (and false) assumption here is that the client and server are "in step", or in fact coupled at all, when they are not. The concept of "arrive in time at all" is a red herring, because the entire system is set up with the known fact that the client and server are -never- in synchronization, and are playing a tug of war with the control object, with the server having ultimate authority.
"Worst case" in the asynchronous connection is a combination of variable latency (some packets arrive fast, some slow, with the arrival rate changing often), combined with dropped packets, which means that some arrive out of order.
The underlying ack/nack protocol (application implemented on top of UDP, this is what GameConnection, NetConnection, and NetInterface take care of) handles dropped packets, the bit stream implementation handles resend of data (we don't want to send old data, so we send latest state data instead), but this section we're talking about here (the move list) is what is responsible for handling variable latency (or, more importantly, the fact that we have latency at all, which we can't get around), and it does this through the wind/unwind (or as Tom says, the rollbacks).
Tom, I think when you spotted the setPosition() call, it was probably within the "ok, client is so desynchronized we have to take command" section, which is only reached when things get totally out of synch. Otherwise, server corrections will be blended in to the client's simulation via rolling back to the appropriate time state, interpolating the correction, and then re-applying moves that have been made on the client simulation, but not yet factored in to the server's simulation.
This is probably the most complex portion of Torque that exists, and in fact drives a lot of the things that people claim are "outdated" (physics limitations being a primary one)--the networking synchronization drives the entire process, and is responsible for maintaining both synchronization between the server and client, as well as immediate response capability on the client when a human enters a move.
I've been studying it off and on for almost 3 years now, and a lot of it is still over my head--I understand the theory, but digging into the code is complex :)
#11
04/14/2008 (4:02 pm)
Quote:
The hidden (and false) assumption here is that the client and server are "in step", or in fact coupled at all, when they are not.
I kept my example simple is all. While some network implementations use a lockstep update, none of the machines involved are in sync in the default Torque networking implementation.
The only best case is that the client and server CRCs are equal... else a correction occurs and that always means a rollback and reprocessing of moves on the client. So ensuring your control object simulation is deterministic is extremely important to high performance and low bandwidth usage.
Even then none of the ghosted objects are in sync and they all constantly require updates from the server.
Quote:
Tom, I think when you spotted the setPosition() call...
I was looking at Vehicle and WheeledVehicle, which use the same basic techniques as Player but are infinitely easier to understand.
When readPacketData() is called it is because the server detected the local client control object is out of sync. readPacketData() uses the data from the server to move the client control object into the corrected past state. Then the last processed move index is moved back to the past move which matches the server state received. Now several processTick() calls will occur in quick succession to "catch up" the client control object to where it should be at. Normal tick processing can then continue.
As far as I can see no "warp ticks" occur under a correction to a local control object via readPacketData(). I assume that is because corrections of this sort shouldn't occur often or be radically divergent.
unpackUpdate(), on the other hand, is called for ghosted client objects. This calculates the "warpOffset" (the distance between the current position and the sent server position) and will either apply the warp over several ticks (aka warpTicks) or within the current tick if it's short enough. No rollback of moves is done for ghosted objects.
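The ghost-side warp can be sketched like this. It is a one-dimensional toy with made-up constants (kMaxWarpPerTick, kMaxWarpTicks) and a made-up Ghost struct, so treat it as the shape of the idea and not what Player or Vehicle actually compute: small errors are applied at once, medium ones are spread over a few ticks, and huge ones just snap.

```cpp
#include <cassert>
#include <cmath>

// Assumed tuning values for this sketch only.
constexpr float kMaxWarpPerTick = 1.0f;  // largest correction applied in one tick
constexpr int   kMaxWarpTicks   = 8;     // beyond this, give up and snap

// Hypothetical ghosted object, one-dimensional for brevity.
struct Ghost {
    float pos = 0.0f;
    float warpOffset = 0.0f;  // correction applied per tick while warping
    int   warpTicks = 0;      // ticks of warping still to run

    // Called when a positional update arrives from the server.
    void onServerUpdate(float serverPos) {
        const float delta = serverPos - pos;
        const int ticks =
            static_cast<int>(std::ceil(std::fabs(delta) / kMaxWarpPerTick));
        if (ticks <= 1 || ticks > kMaxWarpTicks) {
            // Short enough to apply now, or too far off to smooth: snap.
            pos = serverPos;
            warpOffset = 0.0f;
            warpTicks = 0;
        } else {
            // Spread the correction evenly over the coming ticks.
            warpTicks = ticks;
            warpOffset = delta / static_cast<float>(ticks);
        }
    }

    // Warp to catch up to the server, one slice per tick.
    void processTick() {
        if (warpTicks > 0) {
            --warpTicks;
            pos += warpOffset;
        }
    }
};
```

Note there is no move replay anywhere in this path; the ghost only ever chases the most recent server position.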
Quote:
This is probably the most complex portion of Torque that exists, and in fact drives a lot of the things that people claim are "outdated" (physics limitations being a primary one)
I'd argue that the bowels of the interior code or the collision code are just as confusing... but yeah... it's hard to get your head around this stuff.
I had hoped when TDN was opened up to see people collaboratively build articles on these subjects, but it hasn't happened so far.
#12
04/16/2008 (11:39 am)
Another interesting lost tidbit of Torque.
In our project the CRC generation was too accurate. By that I mean that little deviations 4 or 5 decimal places down in a floating point number would cause a correction to be sent from the server. Corrections are expensive in that they eat server upload and client download bandwidth as well as CPU time generating and applying the correction.
The way to solve this is to generate a CRC that is specialized to the error in the control object that you can tolerate.
Normally writePacketData() is called to generate the CRC data, but what is not well known is that you can override this behavior. See GameBase::getPacketDataChecksum()...
U32 GameBase::getPacketDataChecksum(GameConnection * connection)
{
   // just write the packet data into a buffer
   // then we can CRC the buffer. This should always let us
   // know when there is a checksum problem.
   static U8 buffer[1500] = { 0, };
   BitStream stream(buffer, sizeof(buffer));
   writePacketData(connection, &stream);
   U32 byteCount = stream.getPosition();
   U32 ret = calculateCRC(buffer, byteCount, 0xFFFFFFFF);
   dMemset(buffer, 0, byteCount);
   return ret;
}
You can override this function in your control object to generate a more forgiving CRC. For instance...
U32 MyVehicle::getPacketDataChecksum(GameConnection * connection)
{
   static U8 buffer[1500] = { 0, };
   BitStream stream(buffer, sizeof(buffer));
   // position: keep 2 decimal places
   stream.write( (S32)(mRigid.linPosition.x * 100.0f) );
   stream.write( (S32)(mRigid.linPosition.y * 100.0f) );
   stream.write( (S32)(mRigid.linPosition.z * 100.0f) );
   // orientation quaternion: keep 3 decimal places
   stream.write( (S32)(mRigid.angPosition.x * 1000.0f) );
   stream.write( (S32)(mRigid.angPosition.y * 1000.0f) );
   stream.write( (S32)(mRigid.angPosition.z * 1000.0f) );
   stream.write( (S32)(mRigid.angPosition.w * 1000.0f) );
   // CRC only the quantized values, then clear the scratch buffer
   U32 byteCount = stream.getPosition();
   U32 ret = calculateCRC(buffer, byteCount, 0xFFFFFFFF);
   dMemset(buffer, 0, byteCount);
   return ret;
}
In this code I'm cutting the vehicle's rigid position down to 2 decimal places and the orientation to 3 decimal places (strictly speaking the S32 cast truncates rather than rounds). I then write them as S32s to the stream and CRC it. This means that the server will generally only correct the client when the position is off by about 0.01 or the orientation is off by about 0.001. The fact I'm writing S32s here doesn't matter as none of this is transmitted over the network... it's just for generating the CRC.
Also note that the correction itself still remains in writePacketData() unchanged. So while the CRC uses rounded values and tolerates some error, the correction is accurate to the state of the server.
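A standalone sketch of why the quantized checksum is more forgiving, using plain C++ in place of the engine types. quantize(), mix() and tolerantChecksum() are names I made up, and FNV-1a stands in for calculateCRC(); only the quantize-then-hash idea is taken from the code above.

```cpp
#include <cassert>
#include <cstdint>

// Same idea as the (S32)(value * scale) casts above: truncates toward zero.
inline std::int32_t quantize(float v, float scale) {
    return static_cast<std::int32_t>(v * scale);
}

// Fold one 32-bit value into a running FNV-1a hash, byte by byte.
inline std::uint32_t mix(std::uint32_t h, std::int32_t v) {
    const std::uint32_t u = static_cast<std::uint32_t>(v);
    for (int shift = 0; shift < 32; shift += 8) {
        h ^= (u >> shift) & 0xFFu;
        h *= 16777619u;
    }
    return h;
}

// Hypothetical tolerant checksum over a position, 2 decimal places kept.
inline std::uint32_t tolerantChecksum(float x, float y, float z) {
    std::uint32_t h = 2166136261u;
    h = mix(h, quantize(x, 100.0f));
    h = mix(h, quantize(y, 100.0f));
    h = mix(h, quantize(z, 100.0f));
    return h;
}
```

One caveat worth knowing: because the cast truncates, two values that straddle a step boundary (say 0.9999 and 1.0) still land in different buckets even though they differ by far less than 0.01, so a tolerant checksum reduces corrections without eliminating the tiny-difference case entirely.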
#13
05/28/2014 (3:37 pm)
This thread's pointed me in the right direction as far as resolving the conflict between physics plugins and read/writePacketData.
Anyone who's ever messed with the PhysX implementation in T3D might be aware that players with active physics representations (mPhysicsRep) call writePacketData/readPacketData literally every tick. I ran into this same problem while attempting to create vehicles based on PhysX actors (easier than it sounded aside from this issue).
I believe the problem is that the physics simulation is not in any way connected to the ticking of Torque objects. The physics sim ticks once per global tick, while Torque objects tick individually (allowing everything discussed above to work).
Basically the way the server handles incoming moves doesn't work with objects connected to the physics layer, so the objects are always farther out of sync than they should be. With an object using Rigid the incoming move events can trigger additional physics ticks as needed to catch up, but this is impossible with a physics engine like PhysX because you can't tell individual objects to tick forward without running the entire simulation forward.
When the checksums are generated they're basically always off by some amount. The faster the object is moving the farther off the checksums will be since, again, the incoming move events can't tick the physics sim forward.
I used a variation of the code above to increase the error required to trigger a correction, but in the end there are always going to be corrections (especially at high velocities). And what this really points out is simply that the two systems aren't really going to work together; a library like PhysX isn't designed to work with Torque's elaborate networking.
The most absurd aspect of all of this is that the primary goal of implementing PhysX vehicles is simply to get better collision solutions. PhysX has that excellent interpenetration handling that prevents all the crazy situations you encounter with Torque's Rigid class.
Blah blah. The point is that the physics layer code in T3D really doesn't work with networking at all and basically can never be used in multiplayer in any capacity.
The only way to use PhysX for vehicles with the existing networking would be to completely skip client side prediction and handle the moves on the server, then let packUpdate send the positional updates and set up the interpolation like it does for non-control-object ghosts. This model may be familiar because that failed GTA online style MMO "APB" used PhysX for its vehicles and had to resort to the same solution, resulting in vehicle driving that had no client side prediction (so all inputs had latency based on your ping). Under certain circumstances this might be a desirable trade (much better collision handling), but the overall driving/flying experience will be laggy for any player without an excellent ping.
Torque 3D Owner Stephen Zepp
The client will then continue to proceed with interpolation corrections to the future projected state that should be aligned with the server's update, including both prediction of moves that are currently in the move window but have not been cleared by the server, and therefore still exist on the client, as well as additional inputs.
The way I keep this clear in my head is keeping in mind that regardless of what the latency is, the client is always out of synchronization, by definition. You can think of the client's simulation state as a "window into the past" of a previous state the server was in, combined with a "window into the future" for the control object, since client side moves are applied in a predictive manner before the server even finds out about them.
The client has to "unwind" control object state to the state it was at the timestamp received by the server (subtract all moves that haven't been marked as processed/cleared yet in the update packet), apply the correction, and then add the moves back in to the "current" position from the player's perspective, so it can continue to apply player moves.
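The unwind / apply / re-add sequence described above can be sketched in a few lines. State, Move and replayAfterCorrection are hypothetical names for illustration, and the sketch assumes the list passed in holds exactly the moves the server has not yet processed.

```cpp
#include <cassert>
#include <vector>

// Hypothetical one-dimensional state and move, for illustration only.
struct Move { float dx; };
struct State { float x = 0.0f; };

// One deterministic simulation tick.
inline void processTick(State& s, const Move& m) {
    s.x += m.dx;
}

// Start from the server's authoritative (past) state, then re-apply every
// move the server has not seen yet to arrive back at the client's "now".
inline State replayAfterCorrection(const State& serverState,
                                   const std::vector<Move>& unackedMoves) {
    State s = serverState;          // unwind: adopt the corrected past state
    for (const Move& m : unackedMoves)
        processTick(s, m);          // replay: re-add the pending moves
    return s;
}
```

For example, if the server reports x = 2.5 for the point where the client had predicted 2.0, and two moves of +1 are still unacknowledged, the client ends up at 4.5 instead of its predicted 4.0 without ever being yanked back to the old position on screen.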