Batching and Render Performance
by Jason Gossiaux · in Torque 2D Beginner · 11/08/2015 (8:15 pm) · 14 replies
Howdy folks. After reading all the articles and guides for composite sprites and batching, I've been running some benchmarks and trying to better understand why just displaying two composite sprites, one for a tilemap and one for a fog of war layer, is killing my performance.
Each composite sprite is 10000 rectangular size 12 tiles. Originally I just had the one and the performance was acceptable on Windows and iOS devices. Zoomed out about half-way I was getting 2500 render calls with the map composite sprite and about 450fps. After adding the fog of war composite sprite this doubled, and my fps dropped to about 300fps. On mobile devices this performance hit is even more severe and becomes a real concern.
I've tried every setting on both composite sprites for batching and sorting without any reduction in the render calls or improvement with the fps. All I want is for the engine to recognize the obscured sprites and omit them from rendering. Any other ideas on how to handle this are welcome too :)
Thanks!
#2
11/11/2015 (5:00 pm)
Just to cover the bases:
1). Are you running a release build? A debug build will be significantly slower.
2). Is batching enabled at the scene-level? (You can verify this by typing <Scene>.getBatchingEnabled() in the console).
3). What kind of info are you getting from the debug overlay? Type <SceneWindow>.setDebugOn("metrics"); to get the full overlay.
Post a screenshot of the scene and debug overlay if possible.
#3
11/13/2015 (8:51 am)
@Chris - All good tips/questions. One thing worth noting:
Quote: Type <SceneWindow>.setDebugOn("metrics");
This will increase the number of UI elements on screen and can affect performance. The issue with the Torque UI system is that it is not hooked up to the batching system(s). As such, the moment a GUI is pushed to the Canvas there will be a performance hit. Just keep that in mind.
#4
05/09/2016 (12:22 am)
I need to dig this back up in hopes of learning more about the T2D render pipeline.
1). Yes, I am running a release build.
2). Batching is enabled at the scene level.
3). My debug overlay is showing 2500 render requests and 420 flushes when zoomed out on a rather busy dungeon environment. My FPS is about 250 on a very high end PC.
The number of flush requests seems to be a problem. Every tile in the tilemap showing a unique texture seems to increase the flush count. Sometimes even for the same type of texture. I don't really understand that.
I've played with all the various sorting and batching commands at the scene and composite sprite level. I'm not seeing anything that affects the flush count though.
Thanks!
#5
05/09/2016 (1:20 am)
After some playing with the engine, I found that changing setStrictOrderMode in the scene rendering code to always be false reduced the flushes down to just 4 and improved fps by about 10%.
Now, this did break the 'Y' depth sorting of my units as expected, but I am confused why sorting inherently breaks batching for the layer it is enabled on. 200 of the flushes were listed as TexFlush too. So maybe I am still missing something.
The fps gain wasn't large enough for this to be a smoking gun. In the current state of my game, performance problems are really only apparent on tablets/phones from ~2012 when on large maps with lots of units. However, I expect things to degrade as I begin adding particle effects and other enhancements.
One design goal is to also offer a map view that shows the entire map in a low-res state. I can't see how this is possible using the built-in object constructions given the rendering limitations. A 10000 tile composite sprite using single-pixel textures would still crater the engine. How the heck do other games do overland maps so well?
Would I be better off trying to generate a texture from the composite sprite map in real time and displaying that one texture instead? Can't imagine that would be faster...but who knows.
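One way to see why strict ordering and batching pull against each other is to count flushes in a toy model. The sketch below is not Torque 2D's BatchRender code; the `Quad` struct and the three functions are hypothetical names, and it assumes a batcher that can only merge consecutive quads sharing a texture. Under that assumption, keeping an exact depth order forces a flush on every texture change, while relaxing the order lets quads be grouped by texture first.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a queued sprite quad; not a Torque 2D type.
struct Quad {
    int textureId;  // which texture the quad samples from
    float depth;    // draw-order key, e.g. derived from the Y position
};

// Number of batches submitted if quads are drawn in the given order and
// only consecutive quads with the same texture can share a batch.
std::size_t countFlushes(const std::vector<Quad>& quads) {
    if (quads.empty()) return 0;
    std::size_t flushes = 1;  // the last open batch always has to be flushed
    for (std::size_t i = 1; i < quads.size(); ++i) {
        if (quads[i].textureId != quads[i - 1].textureId) ++flushes;
    }
    return flushes;
}

// Strict order: the depth order must be preserved exactly.
std::size_t flushesStrictOrder(std::vector<Quad> quads) {
    std::stable_sort(quads.begin(), quads.end(),
                     [](const Quad& a, const Quad& b) { return a.depth < b.depth; });
    return countFlushes(quads);
}

// Relaxed order: quads may be regrouped by texture before drawing,
// which gives up the exact depth order.
std::size_t flushesGroupedByTexture(std::vector<Quad> quads) {
    std::stable_sort(quads.begin(), quads.end(),
                     [](const Quad& a, const Quad& b) { return a.textureId < b.textureId; });
    return countFlushes(quads);
}
```

With quads that alternate between two textures along the depth axis, the strict version flushes once per quad while the grouped version flushes once per texture, which is the same trade-off as losing 'Y' sorting in exchange for fewer flushes.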
#6
05/09/2016 (4:11 pm)
Well, remember that CompositeSprite was a fairly late addition and might need some optimization....
#7
05/09/2016 (6:00 pm)
My experience has been that large numbers of sprites rendered inside of a composite sprite are more efficient than static sprites rendered on the scene.
This can be demonstrated using the SpriteStressToy fairly easily on my PCs.
100 static sprites and I get 2000 fps.
10,000 static sprites gets me 48fps.
10,000 animated composite sprites and I get 200fps.
In the case of 10k static vs 10k animated composite sprites the render requests and flush values are identical. The speed difference must be attributed to the # of objects. Which is 10k in the static/animated sprite case, and only 1 in the composite sprite case.
#8
05/10/2016 (3:36 pm)
Yeah, the underlying containers are probably at least partly responsible, along with the fact that Sprites are full-blown scene objects (along with all of the updating that goes with this) and the CompositeSprite "sprites" are crazy-tiny objects designed to be used exactly how they are being used while the CompositeSprite itself is the only actual scene object (and therefore only hits that processTick() call once per tick).
My main point is that, because it's relatively new there is probably plenty of room for further optimization/improvement.
#9
05/10/2016 (5:56 pm)
I tried running gDEBugger on the running executable and it is telling me that 50% of the state change functions called are redundant and wasting performance.
Every call to glPolygonMode, glBlendFunc, and glTexEnvi was considered redundant.
Almost every call to glColor4f and glTexCoordPointer was redundant.
I'm not entirely sure what I can do with this information :chuckle: But it at least provides some insight into where the cycles are going.
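A common way to act on a report like this is to cache the last value sent to the driver and skip calls that would not change it. The sketch below shows the idea for the blend function only. It is an illustration, not engine code: `BlendStateCache` is a hypothetical class, the real `glBlendFunc` call is left as a comment so the example compiles without an OpenGL context, and the constants simply mirror the standard OpenGL enum values.

```cpp
#include <cassert>
#include <cstdint>

// Values mirror the standard OpenGL enums so no GL header is needed here.
constexpr std::uint32_t kSrcAlpha         = 0x0302;  // GL_SRC_ALPHA
constexpr std::uint32_t kOneMinusSrcAlpha = 0x0303;  // GL_ONE_MINUS_SRC_ALPHA
constexpr std::uint32_t kOne              = 1;       // GL_ONE

// Hypothetical cache that filters redundant blend-function changes.
class BlendStateCache {
public:
    // Returns true if the change had to be forwarded to the driver,
    // false if it matched the cached state and was skipped.
    bool setBlendFunc(std::uint32_t src, std::uint32_t dst) {
        if (valid_ && src == src_ && dst == dst_) {
            ++skipped_;
            return false;
        }
        // A real renderer would call glBlendFunc(src, dst) here.
        src_ = src;
        dst_ = dst;
        valid_ = true;
        ++issued_;
        return true;
    }

    int issued() const { return issued_; }    // calls that reached the driver
    int skipped() const { return skipped_; }  // redundant calls filtered out

private:
    bool valid_ = false;  // false until the first call sets a known state
    std::uint32_t src_ = 0;
    std::uint32_t dst_ = 0;
    int issued_ = 0;
    int skipped_ = 0;
};
```

The same pattern extends to the other calls flagged as redundant, at the cost of having to invalidate the cache whenever something outside the renderer touches the GL state.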
On the IRC channel it was suggested I render the composite sprite map to a texture and then display that texture to greatly boost performance. I think it could work, though it would be a somewhat complicated effort due to the max texture size being 4K and each of my tiles being 64 pixels. I'd effectively need to render the composite sprite to a set of 157 textures and then arrange them properly to represent the whole map.
Alternatively I could build a much simpler texture from the composite sprite. But even if each 64-pixel tile turned into a 1-pixel square on the map, I would still need multiple textures to properly display the whole thing.
Until then I'll keep learning :)
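The number of textures needed for a render-to-texture map follows from the map size in tiles, the pixels per tile, and the maximum texture size. The sketch below works that out as a grid of texture "pages". `PageGrid` and `pagesNeeded` are hypothetical names, and since the thread does not give the full map's width and height in tiles, any dimensions passed in are assumptions rather than the actual map.

```cpp
#include <cassert>

// Hypothetical result type: how many texture pages are needed per axis.
struct PageGrid {
    int columns;
    int rows;
    int total() const { return columns * rows; }
};

// Number of texture pages needed to cover a tile map when each tile is
// rendered at pixelsPerTile and no texture may exceed maxTextureSize
// pixels on a side. Tiles are never split across two pages.
PageGrid pagesNeeded(int mapTilesWide, int mapTilesHigh,
                     int pixelsPerTile, int maxTextureSize) {
    assert(mapTilesWide > 0 && mapTilesHigh > 0);
    assert(pixelsPerTile > 0 && pixelsPerTile <= maxTextureSize);

    // Whole tiles that fit along one side of a single page.
    const int tilesPerPage = maxTextureSize / pixelsPerTile;

    // Ceiling division: a partially filled page still counts as a page.
    const int columns = (mapTilesWide + tilesPerPage - 1) / tilesPerPage;
    const int rows = (mapTilesHigh + tilesPerPage - 1) / tilesPerPage;
    return PageGrid{columns, rows};
}
```

For example, `pagesNeeded(w, h, 64, 4096)` gives the page count for 64-pixel tiles with a 4K limit, and `pagesNeeded(w, h, 1, 4096)` gives it for the one-pixel-per-tile overview, for whatever width `w` and height `h` the map actually has.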
#10
05/10/2016 (8:23 pm)
Max texture size is set in code - I'll try to remember where I did that and let you know (though when I did it I set it to 2048x2048). Optimally this should look at available VRAM and adjust accordingly.
#11
05/11/2016 (6:50 pm)
I believe max texture size is set using #defines named:
MaximumProductSupportedTextureWidth
MaximumProductSupportedTextureHeight
There's also some code in the image loaders (bmpPng.cc, etc.) regarding mipmaps that assume the textures are no larger than 2048x2048.
These really should be changed to use the maximum hardware-supported texture size reported by the driver. We're already querying OpenGL for this info anyway (see: line 204 of platformVideo.cc) so might as well use it.
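A minimal sketch of that change, assuming the driver limit has already been read with `glGetIntegerv(GL_MAX_TEXTURE_SIZE, &value)`: the function below is hypothetical and not part of the engine. It takes the driver-reported size as a parameter so it can be compiled and tested without an OpenGL context, and treats the product-level cap as optional.

```cpp
#include <algorithm>
#include <cassert>

// Pick the texture size limit to use at runtime.
// driverMax:  value reported by the driver, e.g. from
//             glGetIntegerv(GL_MAX_TEXTURE_SIZE, &value).
// productCap: optional product-level limit (such as a 2048 default);
//             zero or negative means "no cap, trust the driver".
int effectiveMaxTextureSize(int driverMax, int productCap) {
    assert(driverMax > 0);  // a valid GL context always reports a positive size
    if (productCap <= 0) {
        return driverMax;
    }
    // Never exceed what the hardware supports, even if the cap is larger.
    return std::min(driverMax, productCap);
}
```

Keeping the cap as a parameter leaves room for a conservative default on older mobile hardware while letting desktop builds use whatever the driver reports.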
#12
05/11/2016 (8:57 pm)
That's where I was going - but it's been literally 5 years since I had my nose in there.
#13
05/12/2016 (7:13 am)
Sure. I can likely increase it a bit. However I believe iOS devices are limited to 4K x 4K?
#14
05/12/2016 (5:57 pm)
Might be now - the original 2048x2048 restriction was imposed because of iOS devices.
Torque Owner Jason Gossiaux
Indie Dev
Still, this won't cull other stuff on the dungeon map obscured by fog of war (enemy critters, particles, decorations, etc) so performance will be affected in some situations.
I'm completely failing to see any performance improvement from batching as was shown on the T2D release notes. Just wondering if something broke along the way...
For example: https://www.garagegames.com/community/blogs/view/22130
There is an example given of a particle emitter with batching and no batching. This would indicate the engine is capable of >1000 fps with ~2000 sprites on screen at a time. I am seeing closer to 300 fps with a similar number of sprites.
I can't discount some aspect of my computer playing a part here. It is a fast Core i7 with a GeForce GTX 780 - and I've tried using the integrated video for comparison, and confirmed it is slower. I've tried on a number of other, slower PCs and the performance is always lower.
In looking through Fog of War resources I found an older version of BatchRender.cc that lacked all the Quad code the new version does. It was still using Tris everywhere. Not sure if that is a clue. I'll keep digging!