Nasty platformMemory bug (and possible fix..)
by William Todd Scott · in Torque Game Engine Advanced · 07/10/2007 (12:05 am) · 8 replies
Hi,
This one is a bit nasty, as such, I would really appreciate if some of the big guns could take a look at this to verify or refute the fix...
On line 1035 of platformMemory.cpp is the following bit of code:
If TORQUE_DEBUG_GUARD is not set, the code is fine, because the Header structure will take up exactly 16 bytes and we have padded our memory allocation out by 16 bytes.
However, if TORQUE_DEBUG_GUARD is set, then we are only padding by 4 bytes. Even without TORQUE_DEBUG_GUARD, the Header struct is still 16 bytes. With the guard the header grows to 64 bytes at a minimum and 68 bytes if TORQUE_ENABLE_PROFILE_PATH is set.
So, I believe this code should look like this:
Since we can't pad to non-power of 2 numbers, the Header struct also needs to be changed for the TORQUE_ENABLE_PROFILE case, to be not 68 bytes, but 128 bytes.
It would be very helpful if someone could either verify this fix or explain to me why the current code is correct.
Thanks for the help.
Todd
edited for mistakes
This one is a bit nasty, as such, I would really appreciate if some of the big guns could take a look at this to verify or refute the fix...
On line 1035 of platformMemory.cpp is the following bit of code:
#ifdef TORQUE_DEBUG_GUARD // if we're guarding, round up to the nearest DWORD size = ((size + 3) & ~0x3); #else // round up size to nearest 16 byte boundary (cache lines and all...) size = ((size + 15) & ~0xF); #endif
If TORQUE_DEBUG_GUARD is not set, the code is fine, because the Header structure will take up exactly 16 bytes and we have padded our memory allocation out by 16 bytes.
However, if TORQUE_DEBUG_GUARD is set, then we are only padding by 4 bytes. Even without TORQUE_DEBUG_GUARD, the Header struct is still 16 bytes. With the guard the header grows to 64 bytes at a minimum and 68 bytes if TORQUE_ENABLE_PROFILE_PATH is set.
So, I believe this code should look like this:
#ifdef TORQUE_DEBUG_GUARD
#ifdef TORQUE_ENABLE_PROFILE
size = ((size + 127) & ~0x7F);
#else
size = ((size + 63) & ~0x3F);
#endif
#else
size = ((size + 15) & ~0xF);
#endifSince we can't pad to non-power of 2 numbers, the Header struct also needs to be changed for the TORQUE_ENABLE_PROFILE case, to be not 68 bytes, but 128 bytes.
It would be very helpful if someone could either verify this fix or explain to me why the current code is correct.
Thanks for the help.
Todd
edited for mistakes
#2
But all that said, who knows. If you have particular problem cases it might help debugging any issues.
07/11/2007 (2:43 pm)
@Todd - do you have a crash case by any chance? This code is certainly hard to follow so it is not easy to be definitive. But I think it is ok. The size parameter is really just the requested size, not size + header. The padding is added to keep allocations on good boundaries (either dword or cache aligned). I also think that we'd have seen a lot of crashes in the memory manager if these numbers were off since I've tested with and with profile path a fair amount.But all that said, who knows. If you have particular problem cases it might help debugging any issues.
#3
07/11/2007 (8:59 pm)
Deleted because of its incorrectness.
#4
I deleted my last post because I was so far off I didn't want to leave the misinformation up...
Clark, thank you for your help. I spent the last hour or so going through the red-black tree and I see what you are saying and agree that my change is not helpful.
Unfortunately, there is a bug somewhere. According to the minidump I am working off of, the program crashed on line 505 in the treeRemove() function executing this code:
hdr->treeNode->queueHead = next;
Unfortunately, I don't know how the tree got corrupted.
Thanks
Todd
07/11/2007 (9:51 pm)
Ok,I deleted my last post because I was so far off I didn't want to leave the misinformation up...
Clark, thank you for your help. I spent the last hour or so going through the red-black tree and I see what you are saying and agree that my change is not helpful.
Unfortunately, there is a bug somewhere. According to the minidump I am working off of, the program crashed on line 505 in the treeRemove() function executing this code:
hdr->treeNode->queueHead = next;
Unfortunately, I don't know how the tree got corrupted.
Thanks
Todd
#5
Set a data breakpoint at that address when running the app, it should break when it gets hammered and you can check the call stack etc to find the offending code.
07/12/2007 (3:24 am)
@William:Set a data breakpoint at that address when running the app, it should break when it gets hammered and you can check the call stack etc to find the offending code.
#6
I have a full call stack from the minidump and I examined the registers. I know exactly why it is crashing, the treeNode variable is pointing to garbage.
Unfortunately, the corruption happens long before the actual crash.
Setting a data breakpoint isn't feasible because memory corruption crashes are notoriously difficult to reproduce. So, I can't start the engine and set a breakpoint because I don't know what memory location will be corrupted.
The crash is occurring at the same point indicated by some people on the TGE Crash forum, so its not a one-off problem, it just can't be reproduced reliably.
Todd
07/12/2007 (9:26 am)
Hi Neo,I have a full call stack from the minidump and I examined the registers. I know exactly why it is crashing, the treeNode variable is pointing to garbage.
Unfortunately, the corruption happens long before the actual crash.
Setting a data breakpoint isn't feasible because memory corruption crashes are notoriously difficult to reproduce. So, I can't start the engine and set a breakpoint because I don't know what memory location will be corrupted.
The crash is occurring at the same point indicated by some people on the TGE Crash forum, so its not a one-off problem, it just can't be reproduced reliably.
Todd
#7
But that doesn't mean you aren't on to a bug with the memory system. So if you don't use debug guard or profile path you don't get the error, or do you get it no matter what? In the first case I'd tend to think it might be an error in the memory manager whereas in the second I'd lean towards a corruption error elsewhere.
One thing you might try is disabling the memory manager altogether. We've found that extremely helpful in the past for finding memory corruption errors. The Torque memory manager won't release old memory to the os, allowing you to corrupt it all you want. If you turn off the memory manager, writing to old memory blocks (or even currently unallocated blocks) will cause a crash right away.
07/12/2007 (10:01 am)
The problem with this kind of bug is you can never tell if it's in the system experiencing the crash or some random place that's corrupting memory. Some systems are just extremely vulnerable to corruption, so they tend to be the first to show it.But that doesn't mean you aren't on to a bug with the memory system. So if you don't use debug guard or profile path you don't get the error, or do you get it no matter what? In the first case I'd tend to think it might be an error in the memory manager whereas in the second I'd lean towards a corruption error elsewhere.
One thing you might try is disabling the memory manager altogether. We've found that extremely helpful in the past for finding memory corruption errors. The Torque memory manager won't release old memory to the os, allowing you to corrupt it all you want. If you turn off the memory manager, writing to old memory blocks (or even currently unallocated blocks) will cause a crash right away.
#8
Thanks again for your help.
It is hard for me to answer your questions definitively because it is one of those crashes that shows up VERY infrequently. In fact, if we weren't using minidumps I wouldn't even realize that it occurred more than once.
Those few times that the crash has occurred happened only when both debug guard and profile path were turned off.
I will give turning the memory manager off a try. That is a REALLY good idea and addresses one of the issues I was having in trying to create repro.
As always, if I find anything useful I'll post here.
Thanks
Todd
07/12/2007 (1:03 pm)
Hi Clark,Thanks again for your help.
It is hard for me to answer your questions definitively because it is one of those crashes that shows up VERY infrequently. In fact, if we weren't using minidumps I wouldn't even realize that it occurred more than once.
Those few times that the crash has occurred happened only when both debug guard and profile path were turned off.
I will give turning the memory manager off a try. That is a REALLY good idea and addresses one of the issues I was having in trying to create repro.
As always, if I find anything useful I'll post here.
Thanks
Todd
Torque Owner William Todd Scott
I do not believe that any change to the Header struct is necessary, the unused space from padding out to 128 bytes will not be used.
Todd