Amazon EC2 Torque Test Follow Up

by David Wyand · 01/24/2008 (2:43 pm) · 12 comments

DISCLAIMER: This is a personal project undertaken by me, David Wyand. It does not represent work being performed by GarageGames, nor is it an indication of any future product releases.

Introduction
Last week I wrote a blog describing how I set up a dedicated TGE 1.5.2 server using Amazon's EC2 service. I then went on to ask the community to help in stress testing the server. This blog gives the results of this testing.

Executive Summary
The Amazon EC2 virtual server hosting the TGE 1.5.2 dedicated game server remained operational for the testing period (and has been up for nearly a month), and the TGE server process itself ran without any issues. Unfortunately only 4 (!) people showed up for the stress test. The good news is that it appears the Amazon EC2 server had no problems with hosting an active FPS game for that number of players and provided an experience on par with a dedicated server.

Server Data Collection
To help collect game server statistics during the stress test, I set up a separate Amazon EC2 server with MySQL and an Apache web server to display the real time data. This was accessible through the forwarding address of www.torquetest.info for anyone to see.

www.gnometech.com/torque/images/ec2/2008-01-24-AmazonEC2WebCharts.jpg
Charts of the data collected during the stress testing hour

The data was collected on the game server and sent to the MySQL server every 10 seconds using a separate Python process. A diagram of this setup may be found on my previous blog. I'll talk a little about each of the charts and what they are showing.
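For readers who want to try something similar, here is a minimal sketch of a 10-second collector loop. It is not the collector used for this test. It uses Python's built-in sqlite3 module as a stand-in for MySQL so it can run anywhere, and the table name, column names and `fake_sample()` values are all hypothetical.

```python
import sqlite3
import time


def run_collector(sample_fn, conn, interval_s=10.0, iterations=None,
                  sleep_fn=time.sleep):
    """Call sample_fn() every interval_s seconds and store each sample as a row.

    iterations=None runs forever; a number stops after that many samples.
    Returns the number of samples stored.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS server_stats ("
        "collected_at REAL, player_count INTEGER, "
        "tx_bytes INTEGER, rx_bytes INTEGER)"
    )
    done = 0
    while iterations is None or done < iterations:
        sample = sample_fn()  # dict with player_count, tx_bytes, rx_bytes
        conn.execute(
            "INSERT INTO server_stats VALUES (?, ?, ?, ?)",
            (time.time(), sample["player_count"],
             sample["tx_bytes"], sample["rx_bytes"]),
        )
        conn.commit()
        done += 1
        # Only wait if another sample is still to come.
        if iterations is None or done < iterations:
            sleep_fn(interval_s)
    return done


def fake_sample():
    # Placeholder values; a real sampler would query the game server.
    return {"player_count": 4, "tx_bytes": 12000, "rx_bytes": 8000}


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    run_collector(fake_sample, db, interval_s=0.0, iterations=3)
    print(db.execute("SELECT COUNT(*) FROM server_stats").fetchone()[0])
```

Passing the sampler and the sleep function in as arguments keeps the loop easy to test without waiting on a real clock.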

Torque EC2 Network Traffic
This graph shows the transmitted and received bytes per second for the gaming server itself (the server's eth0 port). This is compared against the number of connected players as reported by the TGE server process (the $Server::PlayerCount global script variable). You may see the expected correlation between connected players and bandwidth used.
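As an illustration of how bytes per second can be derived for an interface, the sketch below assumes a Linux host where the cumulative counters are read from /proc/net/dev (the post does not say which source was actually used). Two readings taken 10 seconds apart are turned into average rates. The sample numbers are made up.

```python
def read_iface_bytes(proc_net_dev_text, iface="eth0"):
    """Return (rx_bytes, tx_bytes) for iface from /proc/net/dev contents."""
    for line in proc_net_dev_text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and name.strip() == iface:
            fields = rest.split()
            # Field 0 is received bytes, field 8 is transmitted bytes.
            return int(fields[0]), int(fields[8])
    raise ValueError(f"interface {iface!r} not found")


def bytes_per_second(prev, curr, interval_s):
    """Convert two (rx, tx) counter readings into average rates."""
    rx_rate = (curr[0] - prev[0]) / interval_s
    tx_rate = (curr[1] - prev[1]) / interval_s
    return rx_rate, tx_rate


# Made-up readings in the /proc/net/dev layout, taken 10 seconds apart.
HEADER = (
    "Inter-|   Receive                          |  Transmit\n"
    " face |bytes packets errs drop fifo frame compressed multicast"
    "|bytes packets errs drop fifo colls carrier compressed\n"
)
SAMPLE_T0 = HEADER + "  eth0: 1000 10 0 0 0 0 0 0 2000 20 0 0 0 0 0 0\n"
SAMPLE_T10 = HEADER + "  eth0: 51000 90 0 0 0 0 0 0 122000 200 0 0 0 0 0 0\n"

if __name__ == "__main__":
    before = read_iface_bytes(SAMPLE_T0)
    after = read_iface_bytes(SAMPLE_T10)
    print(bytes_per_second(before, after, 10))
```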

www.gnometech.com/torque/images/ec2/2008-01-24-AmazonEC2ServerBandwidthChart.jpg
Bandwidth and player chart produced from the collected statistics in Microsoft Excel

During the testing hour we tried to keep fairly active, constantly firing crossbow bolts and running around. You'll notice a drop in the bandwidth used around 18:25. I believe during this time we stopped shooting at each other and had a chat. Of course the truce didn't last long.

www.gnometech.com/torque/images/ec2/2008-01-24-AmazonEC2Orcs.jpg
Are you feeling lucky, punk?

This ended up as the only server-side graph to produce any practical results.

Torque Server Packet Round Trip Time
The idea behind this chart is to display the minimum and maximum round trip (ping) packet time for all clients. I was hoping it would provide some indication of when a player could be lagging. Unfortunately it didn't work out as well as I hoped.

The main problem lies with the fact that if the TGE client is not the front window it introduces a Sleep() into the event loop. This causes it to process and acknowledge network packets at a much slower rate than normal. So if someone brings a web page window forward for example, then from the server's point of view that game client will begin to lag. This makes it impossible to differentiate between true client lag, and just an end user's normal use of their computer, using this method.

Torque Server Main Loop
This chart displays how each 10 second time slice is divided between the tasks of the server's main event loop as found in DemoGame::main()'s while(Game->isRunning()){} section. The thought behind this is that as the game server is loaded down, some of these tasks will take more than their fair share of the processing time. In the chart the greedy task's colour area would increase at the expense of the other tasks.

Unfortunately with the tiny number of players during the stress test time I was never able to test this hypothesis. As you may see in the chart above, the time allocated to each event loop task remained fairly constant throughout the hour.
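To make the idea of per-task time shares concrete, here is a small sketch of the bookkeeping in Python rather than in the engine's C++. The task names ("network", "time", "sleep") are hypothetical and do not correspond to the actual sections of the TGE main loop.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class LoopProfiler:
    """Accumulates time spent per main-loop task over one reporting window."""

    def __init__(self):
        self.totals = defaultdict(float)

    def add(self, task, seconds):
        self.totals[task] += seconds

    @contextmanager
    def timed(self, task):
        # Wrap one task's work in the loop: with profiler.timed("network"): ...
        start = time.perf_counter()
        try:
            yield
        finally:
            self.add(task, time.perf_counter() - start)

    def shares(self):
        """Return each task's percentage of the total measured time."""
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {task: 100.0 * t / total for task, t in self.totals.items()}

    def reset(self):
        # Call at the end of each window (for example every 10 seconds).
        self.totals.clear()


if __name__ == "__main__":
    profiler = LoopProfiler()
    profiler.add("network", 2.0)  # hypothetical task names and timings
    profiler.add("time", 6.0)
    profiler.add("sleep", 2.0)
    print(profiler.shares())
```

A task that starts taking more than its usual share under load would show up as a growing percentage from one window to the next.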

Torque Server Tick Load
This one is similar to the main loop chart above but focuses on 'ticks'. Over any given 10 second period the number of ticks processed on the server should be fairly consistent. If for some reason a tick has been missed, then all queued ticks are processed at once (but in order) until the server catches up. This is done in ProcessList::advanceServerTime() where advanceObjects() is called. What I was attempting to determine here was when the server was slipping due to load and had to catch up.

Interestingly when you look at the chart the number of tick loop calls is consistently less than the number of ticks processed. This means that more than one tick is processed per loop call as the server catches up. I hadn't expected this and don't know if this is a normal operation. I would have to test this on a dedicated server to know for sure.
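The catch-up behaviour can be sketched with a simple fixed-timestep loop. This is an illustrative Python model, not the TGE code, and it does not explain why the server behaved this way here. The 32 ms tick length is my understanding of TGE's default and should be treated as an assumption, as should the 70 ms loop time in the example.

```python
TICK_MS = 32  # assumed tick length in milliseconds


class TickClock:
    """Processes whole ticks from accumulated time, catching up when behind."""

    def __init__(self, tick_ms=TICK_MS):
        self.tick_ms = tick_ms
        self.backlog_ms = 0
        self.loop_calls = 0
        self.ticks_processed = 0

    def advance_time(self, elapsed_ms):
        """Add elapsed time and run every tick that is now due, in order."""
        self.loop_calls += 1
        self.backlog_ms += elapsed_ms
        ticks = 0
        while self.backlog_ms >= self.tick_ms:
            self.backlog_ms -= self.tick_ms
            ticks += 1  # a real server would advance its objects here
        self.ticks_processed += ticks
        return ticks


if __name__ == "__main__":
    clock = TickClock()
    # Ten loop iterations that each take 70 ms, longer than two ticks.
    for _ in range(10):
        clock.advance_time(70)
    print(clock.loop_calls, clock.ticks_processed)  # 10 21
```

Whenever a loop iteration takes longer than one tick, the number of ticks processed exceeds the number of loop calls, which is the pattern a ticks-versus-calls chart would show.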

Unfortunately as with the main loop chart above, the number of players was not enough to place a real load on the server. So I don't know if this set of statistics would provide any useful information while under load.

Client Data Collection
For the stress test I modified my TGE 1.5.2 client to write out all data as produced by NetGraph::updateStats() for the NetGraph GUI (press 'n' to open and close the graph). This was written to a comma delimited file and included the following information:

- Active Ghosts
- Ghost Updates
- Bits Sent
- Bits Received
- Ping
- Packet Loss

I imported this data into Excel and decided to produce a graph of Active Ghosts, Ping and number of clients on the server to see if there was any correlation. It is worth noting that Packet Loss was zero during the stress test.
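If you'd rather not use Excel, a comma delimited file like this can also be summarized with a few lines of Python. The column names below are hypothetical (the actual header row of the file is not shown in this post) and the sample rows are made up.

```python
import csv
import io


def summarize_netgraph_csv(csv_file):
    """Summarize ping, packet loss and active ghosts from an open CSV file."""
    pings, losses, ghosts = [], [], []
    for row in csv.DictReader(csv_file):
        pings.append(float(row["ping"]))
        losses.append(float(row["packet_loss"]))
        ghosts.append(int(row["active_ghosts"]))
    if not pings:
        raise ValueError("no data rows found")
    return {
        "samples": len(pings),
        "ping_min": min(pings),
        "ping_max": max(pings),
        "ping_mean": sum(pings) / len(pings),
        "max_packet_loss": max(losses),
        "max_active_ghosts": max(ghosts),
    }


# Made-up sample data using the hypothetical column names.
SAMPLE_CSV = """active_ghosts,ghost_updates,bits_sent,bits_received,ping,packet_loss
40,3,512,2048,60,0
44,5,640,2304,80,0
42,4,600,2100,70,0
"""

if __name__ == "__main__":
    print(summarize_netgraph_csv(io.StringIO(SAMPLE_CSV)))
```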

www.gnometech.com/torque/images/ec2/2008-01-24-AmazonEC2ClientActivityChart.jpg
Client data plotted with number of connected clients on the server

As I would expect, the number of active ghosts fluctuated during the test as people fired their weapons and caused general mayhem. Also my ping (as calculated by TGE from the time taken for packets to be acknowledged) didn't appear to vary with the number of players. Note that the gaps in my client's ping data are where the next mission was loading. So we played two full Stronghold missions during the stress test.

As with all of the other graphs, the number of players doesn't appear to have loaded my client to any degree. So the collected data doesn't help with determining the Amazon virtual server's capabilities.

TGE Issues
Two issues did crop up with TGE during my month long run with the Amazon EC2 servers, but I have not investigated them:

Negative Connected Players
While leaving the TGE 1.5.2 game server up and running for a number of days, I would see a phantom disconnect in the console.log when no one is actually connected. This would drop $Server::PlayerCount to a negative value. When this occurred the server would no longer show up in a TGE client's server list.

In looking on my GG Master Server Monitor at www.gnometech.com the server would show up as '255/64' players. I often see a number of servers in this state within my monitor, so I suspect this is a general problem somewhere in the net code.
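One possible reading of the '255/64' display, which I have not verified against the TGE or master server code, is that a player count of -1 ends up stored or transmitted as an unsigned 8-bit value. The arithmetic of that wrap-around is easy to show:

```python
import struct


def as_unsigned_byte(value):
    """Reinterpret an integer as an unsigned 8-bit value (two's complement)."""
    return value & 0xFF


def roundtrip_signed_to_unsigned(value):
    """Pack a value as a signed byte, then read it back as an unsigned byte."""
    return struct.unpack("B", struct.pack("b", value))[0]


if __name__ == "__main__":
    print(as_unsigned_byte(-1))              # 255
    print(roundtrip_signed_to_unsigned(-1))  # 255
```

If that assumption holds, a single phantom disconnect on an empty server would be enough to produce the 255 figure.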

Ghost Count Always Increasing Between Missions
If you take a look at my Client Activity graph above you'll notice something wrong with the Active Ghosts value. It does not reset itself between mission restarts. I don't know if this is an actual problem with not releasing ghost objects or just a reporting issue with the NetGraph code. I only came across this while producing my graph.

Conclusions
I've come to three conclusions based on the stress test:

1. It is possible to run a TGE 1.5.2 dedicated server on the Amazon EC2 service.
2. The small instance Amazon EC2 virtual server successfully handles up to 4 players.
3. The upper limits of the small instance Amazon EC2 virtual server were not reached.

Available Data
I'm making available the data I used to produce my graphs above. It is in Microsoft Excel format and includes all collected server data for January 17th (24 hours' worth) and all of my collected client data during the stress test. You may download the ZIP file here.

Thank-you
I wanted to thank those that came out for the stress test:

- warme (MEX)
- Dunsany
- JeremyA
- Gnometech (that's me! :o)

Looking to the Future
I'd like to attempt another stress test in the near future and hopefully get more people involved. I have another project that requires my attention for the next month or two, so any additional testing will have to wait until then.

I'll leave my two servers up until Monday for anyone that wants to try them out. The game server is Amazon EC2 Test and the real-time web statistics are at www.torquetest.info where you'll be forwarded to the actual Amazon web server.

Thanks!

- LightWave Dave

About the author

A long-time Associate of the GarageGames community and author of the Torque 3D Game Development Cookbook.


#1
01/24/2008 (3:00 pm)
Cool information Dave.

I am wondering what it might be like if you covertly set up, say, a Think Tanks or Lore Invasion dedicated server on the Amazon EC2 system in order to gather more data. I am pretty interested to see what type of upper limits there might be on this setup.
#2
01/24/2008 (3:16 pm)
Dave: make it a bit earlier (like before 2am like last time) and I'll be there for sure :)
#3
01/24/2008 (3:26 pm)
Oops. I was going to be there and forgot. I'll staple a sticky note to my forehead next time.
#4
01/24/2008 (7:41 pm)
For the game Think Tanks we had a Linux dedicated server running on a similar setup out of a data center in Georgia for a bit. For whatever reason (I believe it was the sleep()), the server would just start eating up the CPU and would cause some server lag. If we reset the server, all would be fine for a bit. We also learned that if [Bots] were enabled they would "play on" after people left, meaning that CPU usage would stay high, at 70-80%. For a Linux server running TT it's normally under 5% with no one in it. A full server of 8-10 players would use 70 to 100% CPU depending on the type of game, map, objects, etc. For TT the sleep function cannot be turned off in Linux, and I believe in Windows it cannot be turned on. The sleep time seems to vary between distros: my Ubuntu sleep time is 1 ms, Slackware is 5 to 6 ms. Linux seems to be the preferred server setup because Windows servers seem to get bogged down much more easily. I am not sure how you are getting the data (I'd love to see that code, I am a Python newb). It would be interesting if you were able to pull the info from a TT team scrum game. Put up a server and they will come ;)
#5
01/25/2008 (12:49 am)
"Ghost Count Always Increasing Between Missions"
I'm glad I'm not the only one seeing that. We have the same issue on Penguins Arena.

By the way, how much does it cost to run a TGE dedicated server on Amazon EC2? Our players really want a dedicated server and it might be a good solution.
#6
01/25/2008 (4:37 am)
Quote: Ghost Count Always Increasing Between Missions
If you take a look at my Client Activity graph above you'll notice something wrong with the Active Ghosts value. It does not reset itself between mission restarts. I don't know if this is an actual problem with not releasing ghost objects or just a reporting issue with the NetGraph code. I only came across this while producing my graph.

I fixed this in one of our builds. Assuming it's the same problem you're seeing, the fix is documented here
#7
01/25/2008 (6:48 am)
I saw the invitation for the stress test too late :(
Maybe next time I'll be there :)
Good work David! ;)
#8
01/25/2008 (1:47 pm)
Logan and NUTS!:
While using a Think Tanks or Lore server may help to get the user count up, the test would only confirm if those Torque builds work, and not Torque in general. I'm sure both of those games have a number of changes to get the most out of how they operate, which would not be available to the common Torque developer.

Also without source code access to those games, I cannot add any game process statistic collection, which I'm hoping will help point to possible bottlenecks that could be tweaked if necessary.

That said, it may still be a worthwhile test just to exercise the Amazon EC2 hardware. I guess for that matter one could also use another commercial game, such as Battlefield 2, and compare EC2's operation with that of a known dedicated server. I don't have time for that right now, but setting this up should be straightforward for anyone else in the community.

Phil:
Time zones suck and it is nigh impossible to find a time that works for Europe and North America's West coast (with me on the East coast). I'd really like to know how Amazon's network fares across the pond, so I would like to include you. I may need to go with two stress test sessions, although it was hard enough to have people show up for just one. I'm open to suggestions...

- LightWave Dave
#9
01/25/2008 (2:18 pm)
Mathieu:
The cost of running an Amazon EC2 virtual server may be found on their web page. If you run a small instance for an entire month it works out to be $72 plus bandwidth. For that price and usage pattern you could likely find a better deal. Where the service pays off is in being able to rapidly start and stop a server as required.

Here's one possibility of how this service could work out (just making this up as I go):

You have customers that would like to run their own dedicated server (to improve their experience as running a server from home hasn't worked out) but they may only play their buddies a couple of times a week, with each session lasting two hours. Fortunately for them you've integrated Amazon's EC2 API into your game (internally or as a separate utility application).

The player that will start and stop the dedicated server (and will be charged for it) signs up at Amazon. This is a one time deal. They then use the keys Amazon gave them in your game to start up a server. When they see it show in the server list (usually takes a minute or less) they may start playing. When their play session is over, the player shuts down the server. Total cost for a week's worth of play:

$0.10 per hour for a small instance * 4 hours = $0.40 per week + bandwidth usage (likely a few pennies)
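The arithmetic is simple enough to put into a small helper. The sketch below uses the $0.10 per hour small instance rate quoted here and leaves bandwidth out, since that depends on usage. The function name is just for illustration.

```python
from decimal import Decimal

# USD per small-instance hour, as quoted in this comment (bandwidth is extra).
SMALL_INSTANCE_RATE = Decimal("0.10")


def instance_cost(hours, rate=SMALL_INSTANCE_RATE):
    """Return the instance-hour cost in USD for the given number of hours."""
    return Decimal(hours) * rate


if __name__ == "__main__":
    print(instance_cost(4))        # two 2-hour sessions a week: 0.40
    print(instance_cost(24 * 30))  # running for a 30-day month: 72.00
```

Using Decimal keeps the currency values exact instead of relying on floating point rounding.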

From your (the game developer) point of view, you set up a public AMI that includes the Linux OS and your game server. This is stored on Amazon's S3 service where you pay for monthly storage ($0.15 per GB). I do not believe you pay for someone using it on EC2. When the player starts a server from your game/utility/whatever, this AMI is used. Your application may even pass parameters to it, such as a password provided by the player to make the game private.

And that would be it. You don't need to worry about server overhead or player billing. All for what would be less than $0.15 per month. If you need to patch the game server, just create a new AMI and tell the game client to use that one instead (such as through the Master Server, client side patch, etc).

You could also flip this around and have you, the developer, control the dedicated servers. You could add or remove servers from a 'pool' based on the number of people playing at any time. With the quick server start up time, I believe this would be quite viable.
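As a rough sketch of the pool idea, the sizing decision could be as simple as the function below. The players-per-server capacity is a hypothetical number you would have to measure for your own game (the stress test above never reached the upper limit), and starting or stopping the actual instances is left out.

```python
import math


def servers_needed(player_count, players_per_server, min_servers=1):
    """Return how many servers the pool should have for the current players."""
    if players_per_server <= 0:
        raise ValueError("players_per_server must be positive")
    return max(min_servers, math.ceil(player_count / players_per_server))


def pool_adjustment(current_servers, player_count, players_per_server):
    """Positive result: servers to start. Negative: servers to shut down."""
    return servers_needed(player_count, players_per_server) - current_servers


if __name__ == "__main__":
    # Hypothetical capacity of 16 players per server.
    print(servers_needed(33, 16))      # 3
    print(pool_adjustment(5, 33, 16))  # -2
```

In practice you would probably want some spare capacity and a delay before shutting servers down, so that short dips in player numbers don't cause servers to start and stop repeatedly.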

Based on what I just wrote you'd think that I work for Amazon. :o) Anyway, just food for thought.

- LightWave Dave
#10
01/25/2008 (2:20 pm)
Gary:
Thanks! As I mentioned, I only noticed this issue when I made my graphs above. It's good to know that it is likely a reporting issue rather than a memory leak. I'll have to make your change for my next stress test and see how it goes.

- LightWave Dave
#11
01/25/2008 (6:49 pm)
David

We had the negative player thing in one of our projects last year. I can't remember how it was fixed, but I will find out for you.
#12
01/26/2008 (2:23 pm)
David, I'm sorry, I totally forgot about the stress test. If you go for a second one, I'll do my best to be there.