A better Interior::setupActivePolyList()
by asmaloney (Andy) · in Torque Game Engine · 01/14/2007 (3:56 pm) · 41 replies
In my continuing quest for a Stronghold which runs at a reasonable speed on my PowerBook G4, I've improved Interior::setupActivePolyList() which was near the top of my profile as a heavy function. The changes should be useful for everyone so I'm posting them here instead of in the Mac forum.
Interior::setupActivePolyList() is a fairly long function. To simplify profiling and to allow the compiler a better chance to optimize at the function level, I added a new function [doFogActive()] to separate out the cases when we have fog to deal with.
The main change though is the way activePoints is handled. This is used to mark points we've already visited so we don't have to do calculations on them again. In the original version, it was an array of U8s and was allocated as part of the frame using FrameAllocator::alloc(). The two problems with this are (1) activePoints is only used in this function, so there's no need to add it as part of the frame memory and (2) indexing into an array of U8s just for a 0-or-1 comparison is quite expensive because of the alignment of the data. So this version replaces activePoints with a bit vector and checks bits instead.
The initial version accounted for 8% of my run time on my profile journal. These changes reduced it to 6%.
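To make the visited-point marking a bit more concrete, here is a minimal stand-alone sketch of the test-then-set pattern on packed bits. It is not the engine's BitVector: VisitedBits and countFirstVisits are names made up for this illustration, it assumes a C++11 compiler, and the "work" done on the first visit is just a counter.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a bit vector: one bit per point, packed into 32-bit words.
class VisitedBits
{
public:
   explicit VisitedBits(std::size_t numBits)
      : mWords((numBits + 31) / 32, 0u), mSize(numBits) {}

   // True if bit i has already been set.
   bool test(std::size_t i) const
   {
      assert(i < mSize);
      return ((mWords[i >> 5] >> (i & 31)) & 1u) != 0;
   }

   // Marks bit i as visited.
   void set(std::size_t i)
   {
      assert(i < mSize);
      mWords[i >> 5] |= (std::uint32_t(1) << (i & 31));
   }

private:
   std::vector<std::uint32_t> mWords;
   std::size_t mSize;
};

// Walks a list of point indices and does the per-point work only the first
// time each index is seen. Returns how many first visits occurred.
std::size_t countFirstVisits(const std::vector<std::uint32_t>& windings,
                             std::size_t numPoints)
{
   VisitedBits visited(numPoints);
   std::size_t firstVisits = 0;
   for (std::uint32_t index : windings)
   {
      if (!visited.test(index))
      {
         visited.set(index);
         ++firstVisits;   // the per-point fog calculation would go here
      }
   }
   return firstVisits;
}
```

With the index list {0, 1, 2, 1, 0, 3} the work runs four times, once per distinct point, no matter how many surfaces share a point.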
If you try it, please let me know what kind of results you get [I'm interested in how it affects the Windows build too]. If you have suggestions for further improvement - I'm sure there are other things to do here - please post!
Aside: One of the things I see in profiling is that some classes/structs are not aligned. I think this is the cause of a lot of inefficient loads and stores, but some class/structs seem to be sensitive to their data layout, so I'm going to have to look at this more carefully.
-----
In interior/interior.h around line 708 add the function header for doFogActive():
void traverseZone(const RectD* inRects, const U32 numInputRects, U32 currZone, Vector<U32>& zoneStack);
[b] void doFogActive( bool environmentActive,
SceneState* state,
U32 outputCount, U16* output,
U16* planeSides,
const PlaneF &distPlane,
const F32 distOffset,
const Point3F &worldP,
const Point3F& osZVec,
const F32 worldZ );
[/b]
void setupActivePolyList(ZoneVisDeterminer&, SceneState*,
const Point3F&, const Point3F& rViewVector,
const Point3F&,
const F32 worldz, const Point3F& scale);
continued...
[Edit: include no longer needed]
#2
01/14/2007 (3:58 pm)
In interior/interior.cc, add this function:
void Interior::doFogActive( bool environmentActive,
SceneState* state,
U32 outputCount, U16* output,
U16* planeSides,
const PlaneF &distPlane,
const F32 distOffset,
const Point3F &worldP,
const Point3F& osZVec,
const F32 worldZ )
{
// Point setup
BitVector activePoints( mPoints.size() );
activePoints.clear();
sgActivePolyList = (U16*)FrameAllocator::alloc(mSurfaces.size() * sizeof(U16));
sgEnvironPolyList = NULL;
sgActivePolyListSize = 0;
sgEnvironPolyListSize = 0;
sgFogPolyList = (U16*)FrameAllocator::alloc(mSurfaces.size() * sizeof(U16));
sgFogTexCoords = (Point2F*)FrameAllocator::alloc(mPoints.size() * sizeof(Point2F));
sgFogPolyListSize = 0;
if (useFogCoord())
{
if (environmentActive)
{
// Environ, fc fog
sgEnvironPolyList = (U16*)FrameAllocator::alloc(mSurfaces.size() * sizeof(U16));
for (U32 i = 0; i < outputCount; i++)
{
const U16 oIndex = output[i];
const Surface& rSurface = mSurfaces[oIndex];
// Not back faced? Add it to the list
if (((planeSides[getPlaneIndex(rSurface.planeIndex)] ^ rSurface.planeIndex) & 0x8000) == 0)
continue;
sgActivePolyList[sgActivePolyListSize++] = oIndex;
if (mEnvironMaps[rSurface.textureIndex] != NULL)
sgEnvironPolyList[sgEnvironPolyListSize++] = oIndex;
const U32 count = rSurface.windingStart + rSurface.windingCount;
for (U32 j = rSurface.windingStart; j < count; j++)
{
const U32 index = mWindings[j];
if ( !activePoints.test( index ) )
{
activePoints.set( index );
const Point3F &point = mPoints[index].point;
const F32 z = worldZ + mDot(point, osZVec);
mPoints[index].fogCoord = state->getHazeAndFog(mFabs(distPlane.distToPlane(point)) + distOffset,
z - worldP.z);
AssertFatal(mPoints[index].fogCoord >= 0.0f, "Error, neg fog coord!");
}
}
}
}
else
{
// No environ, FC fog
for (U32 i = 0; i < outputCount; i++)
{
const U16 oIndex = output[i];
const Surface& rSurface = mSurfaces[oIndex];
// Not back faced? Add it to the list
if (((planeSides[getPlaneIndex(rSurface.planeIndex)] ^ rSurface.planeIndex) & 0x8000) == 0)
continue;
sgActivePolyList[sgActivePolyListSize++] = oIndex;
const U32 count = rSurface.windingStart + rSurface.windingCount;
for (U32 j = rSurface.windingStart; j < count; j++)
{
const U32 index = mWindings[j];
if ( !activePoints.test( index ) )
{
activePoints.set( index );
const Point3F &point = mPoints[index].point;
const F32 z = worldZ + mDot(point, osZVec);
mPoints[index].fogCoord = state->getHazeAndFog(mFabs(distPlane.distToPlane(point)) + distOffset,
z - worldP.z);
AssertFatal(mPoints[index].fogCoord >= 0.0f, "Error, neg fog coord!");
}
}
}
}
return;
}
// Environment, textured fog
if (environmentActive)
{
sgEnvironPolyList = (U16*)FrameAllocator::alloc(mSurfaces.size() * sizeof(U16));
for (U32 i = 0; i < outputCount; i++)
{
const U16 oIndex = output[i];
const Surface& rSurface = mSurfaces[oIndex];
// Not back faced? Add it to the list
if (((planeSides[getPlaneIndex(rSurface.planeIndex)] ^ rSurface.planeIndex) & 0x8000) == 0)
continue;
sgActivePolyList[sgActivePolyListSize++] = oIndex;
if (mEnvironMaps[rSurface.textureIndex] != NULL)
sgEnvironPolyList[sgEnvironPolyListSize++] = oIndex;
// Fog the unfogged points...
const U32 count = rSurface.windingStart + rSurface.windingCount;
for (U32 j = rSurface.windingStart; j < count; j++)
{
const U32 pIndex = mWindings[j];
if ( !activePoints.test( pIndex ) )
{
activePoints.set( pIndex );
// fog.
const Point3F &point = mPoints[pIndex].point;
const F32 dist = distPlane.distToPlane(point) + distOffset;
const F32 z = worldZ + mDot(point, osZVec);
Point2F &fogTexCoord = sgFogTexCoords[pIndex];
gClientSceneGraph->getFogCoordPair(dist, z, fogTexCoord.x, fogTexCoord.y);
}
}
}
}
else
{
// no environment, textured fog
for (U32 i = 0; i < outputCount; i++)
{
const U16 oIndex = output[i];
const Surface& rSurface = mSurfaces[oIndex];
// Not back faced? Add it to the list
if (((planeSides[getPlaneIndex(rSurface.planeIndex)] ^ rSurface.planeIndex) & 0x8000) == 0)
continue;
sgActivePolyList[sgActivePolyListSize++] = oIndex;
// Fog the unfogged points...
const U32 count = rSurface.windingStart + rSurface.windingCount;
for (U32 j = rSurface.windingStart; j < count; j++)
{
const U32 pIndex = mWindings[j];
if ( !activePoints.test( pIndex ) )
{
activePoints.set( pIndex );
// fog.
const Point3F &point = mPoints[pIndex].point;
const F32 dist = distPlane.distToPlane(point) + distOffset;
const F32 z = worldZ + mDot(point, osZVec);
Point2F &fogTexCoord = sgFogTexCoords[pIndex];
gClientSceneGraph->getFogCoordPair(dist, z, fogTexCoord.x, fogTexCoord.y);
}
}
}
}
}
[Edit: Ben's suggested change BitMatrix -> BitVector]
#3
01/14/2007 (11:29 pm)
Thanks for the write-up here! I have a few questions based on reading over this code...
1. How much time is spent calculating envmap or fog values? The calculations involved are quite simplistic - it might be possible to vectorize these calculations, and if they are a big chunk of the work then it might be smart to consider it.
2. Any reason you chose BitMatrix? I believe there's a BitVector you can use, which should be a bit (no pun intended :P) faster than the BitMatrix, since it doesn't consider the second dimension.
3. This code looks like it might be memory bound. Would it be possible to get a win out of prefetching? Does Shark show any memory stalls? If so, where?
4. Depending on what you're targeting, have you considered generating a fixed vertex buffer for each zone in the interior, and simply drawing that (potentially doing fog/envmap calculations on the CPU) rather than doing all this dynamic stuff? We do this in TSE to good effect.
#4
01/15/2007 (6:22 am)
Ben: Thanks for the input!
I should have stated my goals a bit [haha - I can do it too!] better. I'm trying to make some general, non-radical changes to the stock TGE to make it run better on my PB G4. I'm using this in part to learn the internals of TGE, so I don't want to make any sweeping changes. I would also like others to benefit, so I want to keep the changes as simple as possible while still providing some improvements. I'm hoping that this will encourage others to jump in and learn, and that you [GG] might consider my changes [or something like them] for the main code base since they're easy to understand [no Duff's Device here], generally applicable, and non-threatening :-)
To answer your questions:
1) About a third - see 3. I had thought of vectorizing as well, which might be a bit easier to tackle now that the fog stuff is in a separate function.
2) Just my inexperience with TGE. :-) [I've changed the code for doFogActive() above...]
3) Yes it is stalling all over getFogCoordPair(). The compiler cannot optimize this because we're using a global, non-const object [gClientSceneGraph]. Manually inlining and moving invariants outside the loops would probably help here. I'll see what happens.
4) Honestly, I don't understand TGE enough to do this yet :-)
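For anyone following along, here is a rough sketch of what "moving invariants outside the loops" means in answer 3. Everything in it is hypothetical: FogSettings, gSettings and the two functions are made-up stand-ins, and the arithmetic is only illustrative, not the engine's code. The point is just that values read through a global, non-const pointer can be copied into const locals once, so the compiler does not have to assume that a store to the output array changed them.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a globally reachable, non-const settings object.
struct FogSettings
{
   float invVisibleDistance;
   float visibleDistanceMod;
};

FogSettings  gSettingsStorage = { 0.5f, 100.0f };
FogSettings* gSettings = &gSettingsStorage;

// Version 1: every iteration reads through the global pointer. Since 'out'
// could in principle alias the settings, the compiler may reload them each time.
void fogCoordsNaive(const float* dist, float* out, std::size_t n)
{
   for (std::size_t i = 0; i < n; ++i)
      out[i] = (gSettings->visibleDistanceMod - dist[i]) * gSettings->invVisibleDistance;
}

// Version 2: the loop invariants are copied into const locals before the loop.
void fogCoordsHoisted(const float* dist, float* out, std::size_t n)
{
   const float mod = gSettings->visibleDistanceMod;
   const float inv = gSettings->invVisibleDistance;
   for (std::size_t i = 0; i < n; ++i)
      out[i] = (mod - dist[i]) * inv;
}
```

Both versions compute the same values; whether the second is measurably faster depends on the compiler and target, so it needs to be profiled like everything else here.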
#5
01/15/2007 (10:16 am)
1&3) Hmm - well I'd try breaking that function out. It's pretty simple code - should be easy to break it out and test. If it's effective, it might be worthwhile making a FogCalculator class that lives on the stack and does the calcs - that might be a clean way to get the compiler win w/o uglifying your code.
2) Was there much of a win from that change? (We _are_ optimizing after all... always gotta measure changes to see if they do anything! :P)
4) It's a pretty big set of changes, and as you say, isn't really in line with your goals - makes sense to me to leave it to TSE. (Besides, that way we sell more TSE licenses... :P) And this code _is_ smarter for slow GPU/fast CPU situations.
Thanks for stepping up to the plate and hacking on this stuff... great work so far!
#6
01/15/2007 (3:41 pm)
Thanks Ben - appreciate the feedback!
1) I'll revisit this when I get a bit more time to look at it. I keep getting distracted every time I zone-in on this ;-)
2) Reduced time spent in doFogActive() from 6.0% to 5.9%
4) Just need TSE *cough* TGEA support on Mac OS X... :-)
#7
01/15/2007 (4:33 pm)
1) I think it sounds like a smart move... look forward to hearing what sort of win this is for you! :)
2) Hah - well, good to know there WAS a change!
4) We're closer than ever... :)
#8
01/16/2007 (12:30 pm)
I made the following three changes:
1) Added a fog calculator class for textured fog calculations based on Ben's suggestion [at least I think I'm doing what he suggested :-)]
2) Manually inlined and jiggered the textured fog calculation code within that class [uglified!]
3) Moved the planeSides() checks out of the inner loops all the way back to the creation of the mergeArray to reduce the inner loop iterations
...and the result: 2.3%. From the original 8%, that makes a nice little optimization.
There are still some stalls in the fog calc and register spills, so it can be improved even more, but I'm going to leave it as is for now.
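Change 3 boils down to an in-place filter over the merged surface list, so the fog loops only ever see front-facing surfaces. Here is a stand-alone sketch of that idea. It is not the engine code: planeOf and compactFrontFacing are made-up names, std::vector stands in for the frame-allocated arrays, and it assumes the low 15 bits of a surface's plane index select the plane while the high bit is a flip flag.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption for this sketch: low 15 bits select the plane, high bit means "flipped".
inline std::uint16_t planeOf(std::uint16_t planeIndex)
{
   return planeIndex & 0x7FFF;
}

// Keeps only the surfaces whose flip bit differs from the side bit stored for
// their plane, compacting 'surfaces' in place. Returns the new count.
std::size_t compactFrontFacing(std::vector<std::uint16_t>& surfaces,
                               const std::vector<std::uint16_t>& surfacePlaneIndex,
                               const std::vector<std::uint16_t>& planeSides)
{
   std::size_t pos = 0;
   for (std::size_t i = 0; i < surfaces.size(); ++i)
   {
      const std::uint16_t s = surfaces[i];
      assert(s < surfacePlaneIndex.size());
      const std::uint16_t pi = surfacePlaneIndex[s];
      // XOR the camera-side bit with the flip bit; a set high bit means "keep".
      if ((planeSides[planeOf(pi)] ^ pi) & 0x8000)
         surfaces[pos++] = s;   // pos never runs ahead of i, so this is safe in place
   }
   surfaces.resize(pos);
   return pos;
}
```

Doing this once up front means the test no longer has to be repeated in each fog code path.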
Code to follow...
In interior/interior.h around line 708 add the function header for doFogActive():
void doFogActive( const bool environmentActive,
const SceneState* state,
const U32 mergeArrayCount, const U16* mergeArray,
const PlaneF &distPlane,
const F32 distOffset,
const Point3F &worldP,
const Point3F &osZVec,
const F32 worldZ );
Continued...
#9
01/16/2007 (12:31 pm)
In interior/interior.cc, replace the function Interior::setupActivePolyList() with this:
void Interior::setupActivePolyList(ZoneVisDeterminer& zoneDeterminer,
SceneState* state,
const Point3F& rPoint,
const Point3F& osCamVector,
const Point3F& osZVec,
const F32 worldZ,
const Point3F& scale)
{
PROFILE_START(InteriorSetupActivePolyList);
// Here's the deal. We loop through each of the zones, and create a merged master
// list of polygons that are the union of all the zones render sets. While we're
// doing this, we'll be setting up each zone's list of planes. I've got these
// processes separated out for now, but they could be merged. After we have the
// master list of polys, and the list of active planes, we'll copy the list
// (culling the backfaces) into the ActivePolyList.
// There's some trickiness here. We use the high bit of this U16 to test against the
// flip bit in the surfaces planeindex.
U16* planeSides = (U16*)FrameAllocator::alloc(sizeof(U16) * mPlanes.size());
// We'll never need more than twice the number of zones for merging...
ItrMergeStruct* mergeArray = (ItrMergeStruct*)FrameAllocator::alloc((mZones.size() * 2) * sizeof(ItrMergeStruct));
U32 numMergeStructs = 0;
PROFILE_START(ISAPL_Merge);
for (U32 i = 0; i < mZones.size(); i++)
{
if (zoneDeterminer.isZoneVisible(i) == false)
continue;
// Setup the plane directionals
for (U32 j = mZones[ i ].planeStart; j < mZones[ i ].planeStart + mZones[ i ].planeCount; j++) {
if (getPlane(mZonePlanes[j]).distToPlane(rPoint) >= 0.0f)
planeSides[mZonePlanes[j]] = 0x8000;
else
planeSides[mZonePlanes[j]] = 0x0000;
}
// Create a merge struct for this zone
ItrMergeStruct& rMerge = mergeArray[numMergeStructs++];
rMerge.size = mZones[ i ].surfaceCount;
rMerge.array = (U16*)FrameAllocator::alloc(rMerge.size * sizeof(U16));
dMemcpy( rMerge.array, &mZoneSurfaces[mZones[ i ].surfaceStart], rMerge.size * sizeof(U16) );
}
AssertFatal(numMergeStructs > 0, "Error, no rendered zones, big problem.");
// Merge the arrays into the final version
U32 finalArray = 0xFFFFFFFF;
{
U32 begin = 0;
U32 end = numMergeStructs;
while ((end - begin) > 1)
{
U32 newEnd = end;
U32 i;
for (i = begin; (i + 1) < end; i += 2)
{
const ItrMergeStruct& rMerge0 = mergeArray[ i ];
const ItrMergeStruct& rMerge1 = mergeArray[i+1];
// Create the new structure to merge into
ItrMergeStruct& rMergeOut = mergeArray[newEnd++];
rMergeOut.array = (U16*)FrameAllocator::alloc(mSurfaces.size() * sizeof(U16));
mergeSurfaceVectors(rMerge0.array, rMerge0.size,
rMerge1.array, rMerge1.size,
rMergeOut.array,
&rMergeOut.size);
}
begin = i;
end = newEnd;
}
finalArray = begin;
}
AssertFatal(finalArray < mZones.size() * 2, "Error, final array out of bounds!");
U16* output = mergeArray[finalArray].array;
U32 outputCount = mergeArray[finalArray].size;
U32 pos = 0;
// remove back faced polys from list
for (U32 i = 0; i < outputCount; i++)
{
const U16 oIndex = output[ i ];
const U16 rSurfacePlaneIndex = mSurfaces[oIndex].planeIndex;
if ( (planeSides[getPlaneIndex(rSurfacePlaneIndex)] ^ rSurfacePlaneIndex) & 0x8000 )
output[pos++] = oIndex;
}
outputCount = pos;
PROFILE_END();
// Before we go and fog this object, we'll test the points of our bounding box. If
// they are all unfogged, then we have no need to do any fogging. If there are
// all fogged though, we cannot turn off rendering of the object, as it's possible
// that they extend into fog planes.
PlaneF distPlane;
// Setup the dist plane
const Point3F &closest = getBoundingBox().getClosestPoint(rPoint);
Point3F n = ( closest - rPoint );
n.convolve( scale );
const F32 distOffset = n.len();
if (distOffset != 0)
{
distPlane.set(closest, n);
}
else
{
// Oops, we're inside the bounding box. distnormal is the view vector in object space
distPlane.set(closest, osCamVector);
}
distPlane.x /= scale.x;
distPlane.y /= scale.y;
distPlane.z /= scale.z;
const Point3F &worldP = state->getCameraPosition();
F32 maxFog = -1;
const Point3F fp[2] = { getBoundingBox().min, getBoundingBox().max };
for (U32 i = 0; i < 8; i++)
{
Point3F test;
if (i & 0x1) test.x = fp[0].x;
else test.x = fp[1].x;
if (i & 0x2) test.y = fp[0].y;
else test.y = fp[1].y;
if (i & 0x4) test.z = fp[0].z;
else test.z = fp[1].z;
F32 hazeVal = state->getHazeAndFog(mFabs(distPlane.distToPlane(test)) + distOffset,
(mDot(test, osZVec) + worldZ) - worldP.z);
if (hazeVal > maxFog)
maxFog = hazeVal;
}
PROFILE_START(ISAPL_Setup);
bool environmentActive = (dglDoesSupportARBMultitexture() &&
smRenderEnvironmentMaps &&
mValidEnvironMaps != 0);
if (maxFog >= 1.0/255.0f)
{
// Sigh. Gotta do it
sgFogActive = true;
doFogActive( environmentActive,
state,
outputCount, output,
distPlane,
distOffset,
worldP,
osZVec,
worldZ );
PROFILE_END();
PROFILE_END();
return;
}
// Unfogged. We can turn off this part of the setup...
sgFogActive = false;
// Point setup
sgActivePolyList = (U16*)FrameAllocator::alloc(mSurfaces.size() * sizeof(U16));
sgEnvironPolyList = NULL;
sgActivePolyListSize = 0;
sgEnvironPolyListSize = 0;
sgFogPolyList = NULL;
sgFogTexCoords = NULL;
sgFogPolyListSize = 0;
// No Fog
if (environmentActive)
{
sgEnvironPolyList = (U16*)FrameAllocator::alloc(mSurfaces.size() * sizeof(U16));
// Environ
for (U32 i = 0; i < outputCount; i++)
{
const U16 oIndex = output[ i ];
sgActivePolyList[sgActivePolyListSize++] = oIndex;
const Surface& rSurface = mSurfaces[oIndex];
if (mEnvironMaps[rSurface.textureIndex] != NULL)
sgEnvironPolyList[sgEnvironPolyListSize++] = oIndex;
}
}
else
{
for (U32 i = 0; i < outputCount; i++)
{
sgActivePolyList[sgActivePolyListSize++] = output[ i ];
}
}
PROFILE_END();
PROFILE_END();
}
Continued...
#10
01/16/2007 (12:33 pm)
In interior/interior.cc, add this:
// This class is a fancy inlined implementation of the fog calculations with all invariants moved
// into the constructor.
class FogCalc
{
public:
FogCalc( const PlaneF &dPlane, F32 distOffset, const Point3F &zVec, F32 worldZ, F32 worldPz, const SceneState *state )
: fc_distOffset( distOffset ),
fc_newWorldZ( worldZ - worldPz ),
distPlane( dPlane ),
osZVec( zVec ),
sState( state )
{
// For Textured
F32 heightOffset;
gClientSceneGraph->getFogCoordData( tex_invVisibleDistance, heightOffset, tex_invHeightRange );
tex_visibleDistanceMod = gClientSceneGraph->getVisibleDistanceMod() - distOffset - distPlane.d;
tex_newWorldZ = worldZ - heightOffset;
}
inline F32 CalcFC( const Point3F &point ) const
{
return( sState->getHazeAndFog( mFabs( distPlane.distToPlane(point) ) + fc_distOffset,
fc_newWorldZ + mDot(point, osZVec) ) );
}
inline const Point2F CalcTextured( const Point3F &point ) const
{
return( Point2F(
(tex_visibleDistanceMod - (point.x * distPlane.x + point.y * distPlane.y + point.z * distPlane.z)) * tex_invVisibleDistance,
(tex_newWorldZ + point.x * osZVec.x + point.y * osZVec.y + point.z * osZVec.z) * tex_invHeightRange
) );
}
private:
const F32 fc_distOffset;
const F32 fc_newWorldZ;
F32 tex_invVisibleDistance;
F32 tex_invHeightRange;
F32 tex_visibleDistanceMod;
F32 tex_newWorldZ;
const PlaneF distPlane;
const Point3F osZVec;
const SceneState *sState;
};
void Interior::doFogActive( const bool environmentActive,
const SceneState* state,
const U32 mergeArrayCount, const U16* mergeArray,
const PlaneF &distPlane,
const F32 distOffset,
const Point3F &worldP,
const Point3F &osZVec,
const F32 worldZ )
{
// Point setup
BitVector activePoints( mPoints.size() );
activePoints.clear();
sgActivePolyList = (U16*)FrameAllocator::alloc(mSurfaces.size() * sizeof(U16));
sgEnvironPolyList = NULL;
sgActivePolyListSize = 0;
sgEnvironPolyListSize = 0;
sgFogPolyList = (U16*)FrameAllocator::alloc(mSurfaces.size() * sizeof(U16));
sgFogTexCoords = (Point2F*)FrameAllocator::alloc(mPoints.size() * sizeof(Point2F));
sgFogPolyListSize = 0;
const FogCalc fogCalc( distPlane, distOffset, osZVec, worldZ, worldP.z, state );
if (useFogCoord())
{
// FC fog
for (U32 i = 0; i < mergeArrayCount; ++i)
{
const U16 oIndex = mergeArray[ i ];
sgActivePolyList[sgActivePolyListSize++] = oIndex;
const Surface& rSurface = mSurfaces[oIndex];
if (environmentActive && (mEnvironMaps[rSurface.textureIndex] != NULL))
{
if ( sgEnvironPolyList == NULL )
sgEnvironPolyList = (U16*)FrameAllocator::alloc(mSurfaces.size() * sizeof(U16));
sgEnvironPolyList[sgEnvironPolyListSize++] = oIndex;
}
// Fog the unfogged points...
const U32 count = rSurface.windingStart + rSurface.windingCount;
for (U32 j = rSurface.windingStart; j < count; ++j)
{
const U32 index = mWindings[j];
if ( !activePoints.test( index ) )
{
activePoints.set( index );
mPoints[index].fogCoord = fogCalc.CalcFC( mPoints[index].point );
AssertFatal(mPoints[index].fogCoord >= 0.0f, "Error, neg fog coord!");
}
}
}
return;
}
// Textured fog
for (U32 i = 0; i < mergeArrayCount; ++i)
{
const U16 oIndex = mergeArray[ i ];
sgActivePolyList[sgActivePolyListSize++] = oIndex;
const Surface& rSurface = mSurfaces[oIndex];
if (environmentActive && (mEnvironMaps[rSurface.textureIndex] != NULL))
{
if ( sgEnvironPolyList == NULL )
sgEnvironPolyList = (U16*)FrameAllocator::alloc(mSurfaces.size() * sizeof(U16));
sgEnvironPolyList[sgEnvironPolyListSize++] = oIndex;
}
// Fog the unfogged points...
const U32 count = rSurface.windingStart + rSurface.windingCount;
for (U32 j = rSurface.windingStart; j < count; ++j)
{
const U32 pIndex = mWindings[j];
if ( !activePoints.test( pIndex ) )
{
activePoints.set( pIndex );
sgFogTexCoords[pIndex] = fogCalc.CalcTextured( mPoints[pIndex].point );
}
}
}
}
End
[Edit: fix sign problem - see below]
#11
01/16/2007 (12:40 pm)
That's a pretty huge difference...can't wait to get home and try this out!
#12
01/16/2007 (1:05 pm)
Nice optimization work.
For the method:
inline const Point2F CalcTextured( const Point3F &point ) const
{
return( Point2F(
(visibleDistanceMod + point.x * distPlane.x + point.y * distPlane.y + point.z * distPlane.z) * invVisibleDistance,
(newWorldZ + point.x * osZVec.x + point.y * osZVec.y + point.z * osZVec.z) * invHeightRange
) );
}
What does the assembly look like? Is GCC actually copying the resultant Point2F around or storing it directly into place? You might consider doing:
inline void CalcTextured( const Point3F &point, Point2F &out ) const
{
out.x = (visibleDistanceMod + point.x * distPlane.x + point.y * distPlane.y + point.z * distPlane.z) * invVisibleDistance;
out.y = (newWorldZ + point.x * osZVec.x + point.y * osZVec.y + point.z * osZVec.z) * invHeightRange;
}
As this would avoid a copy. But it comes down to what the optimizer is doing. Avoiding memory read/writes in those inner loops is probably going to be the biggest win you can get.
That fog calculator class might be handy elsewhere, too. :) Nice bit of utility coding there.
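To put the two call shapes side by side outside the engine, here is a small self-contained sketch. Vec2, Vec3 and TinyFogCalc are made-up stand-ins (not Point2F, Point3F or the FogCalc class), the arithmetic is only illustrative, and it assumes a C++11 compiler. Both methods produce the same result; whether the by-value form costs an extra copy depends on what the optimizer does with the return value, which is why looking at the assembly or profiling is the only real answer.

```cpp
#include <cassert>

struct Vec2 { float x, y; };      // hypothetical stand-in for Point2F
struct Vec3 { float x, y, z; };   // hypothetical stand-in for Point3F

struct TinyFogCalc
{
   Vec3  axis;
   float offset;
   float scale;

   float dot(const Vec3& p) const
   {
      return p.x * axis.x + p.y * axis.y + p.z * axis.z;
   }

   // Shape 1: build the result and return it by value.
   Vec2 calcByValue(const Vec3& p) const
   {
      const float d = dot(p);
      return Vec2{ (offset - d) * scale, d * scale };
   }

   // Shape 2: write straight into storage supplied by the caller.
   void calcInto(const Vec3& p, Vec2& out) const
   {
      const float d = dot(p);
      out.x = (offset - d) * scale;
      out.y = d * scale;
   }
};
```

In a loop the first shape reads as coords[i] = calc.calcByValue(pt) and the second as calc.calcInto(pt, coords[i]).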
#13
01/16/2007 (1:10 pm)
Oh yeah - also don't forget to use the fog calculator for all fog calcs in there, it'll make the code much more maintainable (and probably more performant on all code paths).
Great work!
#14
01/16/2007 (2:34 pm)
@Rubes: Let me know how it goes!
@Ben: Thanks! The CalcTextured() change - I tried that and it resulted in quite a slowdown - not quite sure why. There is still some copying going on, but the optimizer seems to be doing well here. I also tried adding a Point2F to the class and just returning a ref to it to avoid any copying, but I didn't gain anything from that.
As to the other fog calculations, I started to do CalcFC(), but ran out of steam. It would just require a couple of additional args to FogCalc's constructor [state and worldP.z] - and some rejiggery.
Oh, now that I look at the code again, I can make doFogActive() a bit simpler...
Oh heck - I'll look at it later - one more round I guess :-)
#15
01/16/2007 (3:52 pm)
Oops. Forgot one piece of code - we need access to some of SceneGraph's protected vars, so we grab them with this function.
In sceneGraph/sceneGraph.h, add this function:
...
void buildFogTextureSpecial( SceneState *pState );
[b] void getFogCoordData(F32 &invVisibleDistance, F32 &heightOffset, F32 &invHeightRange) const;[/b]
void getFogCoordPair(F32 dist, F32 z, F32 &x, F32 &y) const;
...
inline void SceneGraph::getFogCoordData(F32 &invVisibleDistance, F32 &heightOffset, F32 &invHeightRange) const
{
invVisibleDistance = mInvVisibleDistance;
heightOffset = mHeightOffset;
invHeightRange = mInvHeightRange;
}
#16
01/16/2007 (5:19 pm)
Instead of making this into The Longest Thread ever, I've gone back and edited the code with the FogCalc class [six posts up].
1) I added the method for calculating using fog coords called CalcFC()
2) I changed doFogActive() to be a lot cleaner and easier to maintain
3) Speed is the same
About fog coord: when I went to test this on my Mac, I found that in game.cc, FC is disabled on the Mac, so even though EXT_fog_coord is available, it is never used in the Mac build. Searching the forums, I found that there have been problems with it in the past. When I enabled it, recompiled, and ran Stronghold, everything seemed to work fine - at the same speed as the textured fog. When I tried to run the lighting demo, it asserted in DepthSortList::depthPartition(). Further investigation required :-)
#17
01/16/2007 (7:26 pm)
@Andy: I added this code and then used journaling with Shark to do some profiling, and compared it with my app prior to adding these edits.
First thing I noticed was that Interior::setupActivePolyList() was way down the list on Shark (0.5%). Next thing was that these changes didn't have any effect on this or anything else.
I'm still spending all my time in Blender::blend_vec().
#18
01/16/2007 (8:11 pm)
@Rubes: What was your journal of? In a complex interior situation these changes should make a difference. If you're zooming around outdoors with the C blender, then it'll get drowned out by the much costlier other activities you're engaging in.
#19
01/16/2007 (8:16 pm)
Sorry, should have been more specific.
It was, in fact, a complex interior situation. My game starts off with the player inside a large DIF structure (though with a few windows/doors to the outside). The journal is basically me spinning around twice, moving into an adjacent room, and spinning around a couple more times.
I suppose the spinning will have a large effect on the processes that are called, but I'm still inside a large, complex DIF the whole time.
#20
01/16/2007 (8:23 pm)
When do you start/end profiling? If you profile the load sequence or the first few frames of the scene, loading time will dominate the profile results.
A more accurate way to profile this code would probably be to compare a fixed view, before/after, or one with no exterior rendering at all, starting and stopping the profiler while in this fixed view. Spinning will make the blender dominate the profile, hiding any performance gain from this change.
Torque Owner asmaloney (Andy)
void Interior::setupActivePolyList(ZoneVisDeterminer& zoneDeterminer,
                                   SceneState*        state,
                                   const Point3F&     rPoint,
                                   const Point3F&     osCamVector,
                                   const Point3F&     osZVec,
                                   const F32          worldZ,
                                   const Point3F&     scale)
{
   PROFILE_START(InteriorSetupActivePolyList);

   // Here's the deal. We loop through each of the zones, and create a merged master
   // list of polygons that are the union of all the zones' render sets. While we're
   // doing this, we'll be setting up each zone's list of planes. I've got these
   // processes separated out for now, but they could be merged. After we have the
   // master list of polys, and the list of active planes, we'll copy the list
   // (culling the backfaces) into the ActivePolyList.

   // There's some trickiness here. We use the high bit of this U16 to test against the
   // flip bit in the surface's planeIndex.
   U16* planeSides = (U16*)FrameAllocator::alloc(sizeof(U16) * mPlanes.size());

   // We'll never need more than twice the number of zones for merging...
   ItrMergeStruct* mergeArray = (ItrMergeStruct*)FrameAllocator::alloc((mZones.size() * 2) * sizeof(ItrMergeStruct));
   U32 numMergeStructs = 0;

   PROFILE_START(ISAPL_Merge);
   for (U32 i = 0; i < mZones.size(); i++)
   {
      if (zoneDeterminer.isZoneVisible(i) == false)
         continue;

      // Setup the plane directionals
      for (U32 j = mZones[ i ].planeStart; j < mZones[ i ].planeStart + mZones[ i ].planeCount; j++)
      {
         if (getPlane(mZonePlanes[j]).distToPlane(rPoint) >= 0.0f)
            planeSides[mZonePlanes[j]] = 0x8000;
         else
            planeSides[mZonePlanes[j]] = 0x0000;
      }

      // Create a merge struct for this zone
      ItrMergeStruct& rMerge = mergeArray[numMergeStructs++];
      rMerge.array = &mZoneSurfaces[mZones[ i ].surfaceStart];
      rMerge.size  = mZones[ i ].surfaceCount;
   }
   AssertFatal(numMergeStructs > 0, "Error, no rendered zones, big problem.");

   // Merge the arrays into the final version
   U32 finalArray = 0xFFFFFFFF;
   {
      U32 begin = 0;
      U32 end   = numMergeStructs;

      while ((end - begin) > 1)
      {
         U32 newEnd = end;
         U32 i;
         for (i = begin; (i + 1) < end; i += 2)
         {
            const ItrMergeStruct& rMerge0 = mergeArray[ i ];
            const ItrMergeStruct& rMerge1 = mergeArray[i+1];

            // Create the new structure to merge into
            ItrMergeStruct& rMergeOut = mergeArray[newEnd++];
            rMergeOut.array = (U16*)FrameAllocator::alloc(mSurfaces.size() * sizeof(U16));

            mergeSurfaceVectors(rMerge0.array, rMerge0.size,
                                rMerge1.array, rMerge1.size,
                                rMergeOut.array, &rMergeOut.size);
         }

         begin = i;
         end   = newEnd;
      }

      finalArray = begin;
   }
   AssertFatal(finalArray < mZones.size() * 2, "Error, final array out of bounds!");

   U16* output      = mergeArray[finalArray].array;
   U32  outputCount = mergeArray[finalArray].size;
   PROFILE_END();

   // Before we go and fog this object, we'll test the points of our bounding box. If
   // they are all unfogged, then we have no need to do any fogging. If they are
   // all fogged though, we cannot turn off rendering of the object, as it's possible
   // that they extend into fog planes.
   PlaneF distPlane;

   // Setup the dist plane
   const Point3F &closest = getBoundingBox().getClosestPoint(rPoint);
   Point3F n = ( closest - rPoint );
   n.convolve( scale );

   const F32 distOffset = n.len();

   if (distOffset != 0)
   {
      distPlane.set(closest, n);
   }
   else
   {
      // Oops, we're inside the bounding box. distnormal is the view vector in object space
      distPlane.set(closest, osCamVector);
   }

   distPlane.x /= scale.x;
   distPlane.y /= scale.y;
   distPlane.z /= scale.z;

   const Point3F &worldP = state->getCameraPosition();

   F32 maxFog = -1;
   const Point3F fp[2] = { getBoundingBox().min, getBoundingBox().max };

   for (U32 i = 0; i < 8; i++)
   {
      Point3F test;

      if (i & 0x1) test.x = fp[0].x;
      else         test.x = fp[1].x;
      if (i & 0x2) test.y = fp[0].y;
      else         test.y = fp[1].y;
      if (i & 0x4) test.z = fp[0].z;
      else         test.z = fp[1].z;

      F32 hazeVal = state->getHazeAndFog(mFabs(distPlane.distToPlane(test)) + distOffset,
                                         (mDot(test, osZVec) + worldZ) - worldP.z);

      if (hazeVal > maxFog)
         maxFog = hazeVal;
   }

   PROFILE_START(ISAPL_Setup);

   bool environmentActive = (dglDoesSupportARBMultitexture() && smRenderEnvironmentMaps && mValidEnvironMaps != 0);

   if (maxFog >= 1.0/255.0f)
   {
      // Sigh. Gotta do it
      sgFogActive = true;

      doFogActive( environmentActive, state, outputCount, output, planeSides,
                   distPlane, distOffset, worldP, osZVec, worldZ );

      PROFILE_END();
      PROFILE_END();
      return;
   }

   // Unfogged. We can turn off this part of the setup...
   sgFogActive = false;

   // Point setup
   sgActivePolyList      = (U16*)FrameAllocator::alloc(mSurfaces.size() * sizeof(U16));
   sgEnvironPolyList     = NULL;
   sgActivePolyListSize  = 0;
   sgEnvironPolyListSize = 0;

   sgFogPolyList     = NULL;
   sgFogTexCoords    = NULL;
   sgFogPolyListSize = 0;

   // No Fog
   if (environmentActive)
   {
      sgEnvironPolyList = (U16*)FrameAllocator::alloc(mSurfaces.size() * sizeof(U16));

      // Environ
      for (U32 i = 0; i < outputCount; i++)
      {
         const U16 oIndex = output[ i ];
         const Surface& rSurface = mSurfaces[oIndex];

         // Not back faced? Add it to the list
         if (((planeSides[getPlaneIndex(rSurface.planeIndex)] ^ rSurface.planeIndex) & 0x8000) == 0)
            continue;

         sgActivePolyList[sgActivePolyListSize++] = oIndex;

         if (mEnvironMaps[mSurfaces[output[ i ]].textureIndex] != NULL)
            sgEnvironPolyList[sgEnvironPolyListSize++] = oIndex;
      }
   }
   else
   {
      for (U32 i = 0; i < outputCount; i++)
      {
         const U16 oIndex = output[ i ];
         const Surface& rSurface = mSurfaces[oIndex];

         // Not back faced? Add it to the list
         if (((planeSides[getPlaneIndex(rSurface.planeIndex)] ^ rSurface.planeIndex) & 0x8000) == 0)
            continue;

         sgActivePolyList[sgActivePolyListSize++] = oIndex;
      }
   }

   PROFILE_END();
   PROFILE_END();
}
continued...
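For anyone puzzling over the planeSides "trickiness" in the function above, here is a minimal standalone sketch of the back-face test. It is not engine code: kFlipBit, planeIndexOf(), sideBits(), facesCamera() and cullBackfaces() are hypothetical names made up for illustration, and planeIndexOf() assumes that getPlaneIndex() simply masks off the high (flip) bit, which the listing above does not show. The idea is that each plane stores 0x8000 when the camera is on its front side, each surface stores 0x8000 in its plane index when it uses the plane flipped, and XORing the two leaves the high bit set only when the surface faces the camera.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the 0x8000 literal used in the listing.
constexpr std::uint16_t kFlipBit = 0x8000;

// ASSUMPTION: mirrors what getPlaneIndex() is presumed to do (strip the flip bit).
inline std::uint16_t planeIndexOf(std::uint16_t packed)
{
    return static_cast<std::uint16_t>(packed & ~kFlipBit);
}

// 0x8000 when the camera is on the plane's front side (dist >= 0), otherwise 0.
inline std::uint16_t sideBits(float distToPlane)
{
    return distToPlane >= 0.0f ? kFlipBit : static_cast<std::uint16_t>(0);
}

// A surface faces the camera when its flip bit differs from the plane's side bit.
inline bool facesCamera(const std::vector<std::uint16_t>& planeSides, std::uint16_t packed)
{
    const std::uint16_t plane = planeIndexOf(packed);
    assert(plane < planeSides.size());
    return ((planeSides[plane] ^ packed) & kFlipBit) != 0;
}

// Copy the candidate surface indices that survive the back-face test.
inline std::vector<std::uint16_t> cullBackfaces(const std::vector<std::uint16_t>& planeSides,
                                                const std::vector<std::uint16_t>& surfacePlanes,
                                                const std::vector<std::uint16_t>& candidates)
{
    std::vector<std::uint16_t> active;
    active.reserve(candidates.size());
    for (std::size_t i = 0; i < candidates.size(); ++i)
    {
        const std::uint16_t surface = candidates[i];
        if (facesCamera(planeSides, surfacePlanes[surface]))
            active.push_back(surface);
    }
    return active;
}
```

The point of packing it this way is that the per-surface test is one load, one XOR and one mask, with no branch on "is this surface flipped?".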