Simulating max clients on a server
by Igor G · in Torque Game Engine · 04/11/2007 (8:42 am) · 34 replies
Hi all,
I'm building an MMO world and would like to simulate and get an idea of how many clients my server can handle before performance starts to degrade. I have a dedicated server set up on one of my machines, and now my problem is getting 100 or more clients to connect to it. What would be the best way to do this simulation?
I'm thinking of spawning many Torque processes on multiple machines, but with graphics disabled.
How can I modify Torque so that I can create multiple Torque processes on my machine? And how do I disable the graphics on Torque so this is doable?
Thanks!
#2
04/12/2007 (4:13 pm)
There's some old discussion here.
at my work, we did exactly what you're after by running as many clients as possible on a windows box and have them log into the server. the difficult part was increasing the number of clients which a single machine can run simultaneously. as you suggest, we increased that number by disabling graphics. unfortunately i didn't do that work and can't really describe it. yay! i think we're able to get fifty or seventy headless clients on a single box.
another option is bots of course,
altho it doesn't test everything real clients do.
#3
04/13/2007 (11:28 am)
Hi guys,
Thanks for the replies. I finally modified my server and client so that I can run an unlimited number (constrained only by system resources) of clients on a remote server. These clients run with graphics disabled and I have 1 client with graphics enabled to see the action in the world.
My clients log into the server and start running around wildly in random directions. This is done by a script that does a commandToServer('RunAroundRandomly') every few seconds. Now, each client has a GameConnection associated with it, so it's not a dumb NPC in the world.
I'm running into some performance issues on an extremely powerful machine - Windows 2003 Server Enterprise, Intel Xeon MP 3.16 GHz (8 processors), 8 GB RAM. Since Torque isn't multithreaded, I doubt the extra cores would help. In any case, after 100 or so client connections, the server just starts slowing down to a crawl. The rate at which it accepts new connections drops, and extreme lag is experienced on my client (the one with graphics). Would you say this is normal? I've heard that people have tested Torque with up to 500 clients, but it doesn't seem to be the case for me. Sometimes my server just gets killed (by the OS?) and I don't get any errors as to why this is. The maximum # of client connections I've been able to make is 147.
I used Torque's profiler and noticed that a lot of time was being spent in updating the player's physics and position. In games such as The Lounge, I've seen instances of up to 200 players. Does anybody have any clue as to why my server just stops responding after 100 connections?
Thanks,
#4
04/13/2007 (1:24 pm)
Hey Igor -
nice job building a test environment!
The Lounge servers run on fairly hefty linux boxes,
it seems like you should be able to get more than 100 or so connections.
I'll forward your post around and see if anyone has some specific suggestions.
You might also try running a profiler like VTune against the server.
It's pretty pricey but they have a free one-month trial.
#5
04/13/2007 (1:33 pm)
Try upping the server's tick interval from 32ms to 64ms (that is, halving the tick rate). You typically don't need the high precision in an MMO and it will save a lot of CPU time.
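To put rough numbers on that, here is a back-of-the-envelope sketch in Python (not TorqueScript). It assumes the server does one player update per connected player per tick, which is a simplification, and the function name is made up for illustration.

```python
def player_updates_per_second(num_players, tick_ms):
    """Rough count of server-side player updates per second.

    Assumes exactly one update per player per tick (a simplification).
    """
    ticks_per_second = 1000.0 / tick_ms
    return num_players * ticks_per_second


if __name__ == "__main__":
    for tick_ms in (32, 64):
        # Compare the stock 32ms tick with the suggested 64ms tick.
        print(tick_ms, "ms tick:", player_updates_per_second(100, tick_ms))
```

Under that assumption, doubling the tick interval halves the per-second update count for the same number of players, which is where the CPU saving would come from.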
#6
04/13/2007 (1:55 pm)
Uhm. Unless you also make those clients move and do whatever real clients do, the accuracy of your simulation will suffer. Physics are a good example. Non-moving players do not process physics at all, unless you have them stacked on other objects.
Edit: Changed because Orion reads things too closely. =)
#7
04/13/2007 (2:23 pm)
Quote:
My clients log into the server and start running around wildly in random directions. This is done by a script that does a commandToServer('RunAroundRandomly') every few seconds.
granted this isn't exactly testing the regular move stuff, but it's pretty good.
i believe for our load tests we had each client pretend to press the forward key for a while,
then pretend to press Turn Left for a while, then repeat.
#8
04/13/2007 (2:24 pm)
Quote:
Non-moving players do not process physics at all.
this is not true.
go into the mission editor sometime and put a bot on top of an object, then move the object.
#9
04/14/2007 (10:50 am)
@Igor: That's a pretty ballsy server and I cannot imagine that it cannot handle over 100 clients... or 300 for that matter. We've got a server half that beefy (quad core) and it can handle 150 clients with nary a stutter. I'm gonna guess that there's a flaw in your network topology that is causing a bottleneck.
If you're running all those automated clients on the same machine... does that machine have a 1GB network connection? It could very well be that it's just the client machine choking out. You didn't specify that in your message, so I just wanted to clarify.
When the clients experience the slow-down, how much CPU does the server process use? Is it maxed?
Is your server utilizing load balanced 1GB NICs?
Is your connection to your internal network from your server a 1GB connection? (Common mistake)
Have you verified that the time for packets leaving your internal network is < 5ms?
Is the switch reporting collisions?
Have you modified your network registry settings in W2k3 to allow for maximum MTU?
Are all non-critical W2k3 services disabled?
Is your server operating on a DMZ but clients attempting to access it via internal addressing? (I've seen this add significant latency with some router/firewall combinations)
I carried a screwdriver for many, many years. The first question I always ask is "Is it plugged in and turned on?"... even to the most advanced user. We all overlook the simple now and then, and it's always a good idea to make sure that your infrastructure isn't what's degrading performance before you start tearing apart your beloved engine. You may just save yourself some headaches.
#10
04/14/2007 (1:32 pm)
Quote:
this is not true.
go into the mission editor sometime and put a bot on top of an object, then move the object.
Well, if someone goes through the trouble of moving underlying objects so their player can process physics to simulate players moving around, when they instead can just remove the vel.lin () check or simply move the player instead - yes. Does it seem very likely? No.
My comment was not meant to be absolute, but I edited it for clarity. Thanks :)
#11
04/16/2007 (7:54 am)
@Bryce - There shouldn't be any issues with the connection to/from the server. The setup in windows seems normal. The problem that I'm seeing is the server choking on processing all the players. My world is empty, to minimize collisions. Everything seems fine, until there are 100+ players on the server. After 100+ players, the server seems to queue up connections (I see the challenge request message piling up), and the server slows down in accepting the connections. My clients are creating a connection every 5 seconds, and they're run from multiple remote machines.
It seems like everything is pointing towards the server - it just doesn't seem to handle many players moving around once the number of players hits 100+. Again, I used Torque's profiler and it seems to be spending all the time processing player positions in my empty world. And the fact that there's only 1 thread for the Torque process would be more evidence against the server. Perhaps changing the client behavior (reducing the amount of movement) would give better results?
#12
04/16/2007 (8:06 am)
It would certainly be interesting to stop the players from running around wildly every few seconds.
in actual use, clients move via a different mechanism than commandToServer,
so perhaps 100s of those spread out over a few seconds is a problem ?
#13
04/16/2007 (8:27 am)
I haven't really got any technical help for you but just something that comes to mind with your situation.
I have played many MMORPG games (daoc, swg, wow, neocron etc) and as far as I have seen nearly all the mmorpg engines have issues with 100's of players within the same location.
Is this what your test is doing? placing all your pretend players in the same sort of place?
Is it having to process and send data to keep each of the other 100 players updated to the position on screen of the other players?
Most mmorpg (I'm guessing) only get high populations on instanced or multi-sectioned game areas so you hardly ever see more than a 100 or so in the same unique zone...
Just as an example I remember playing wow when the gates of Ahn'Qiraj were opened and there were probably 400-500 in the same zone then and it was such a lagfest - taking 20-40 seconds to respond to anything - it wasn't worth playing....
Probably not a very useful post to you but just some real world experience of mmorpg engines... :D
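To illustrate why crowding hurts, here is a tiny Python sketch of the worst case, where every player is in range of every other player and the server has to tell each client about all the others. The function name is made up, and a real engine will normally send less than this because it limits what is in scope and how much fits in each packet.

```python
def worst_case_updates_per_tick(num_players):
    """Position updates per tick if every client must hear about every other player."""
    # Each of the N clients receives an update for the other N - 1 players.
    return num_players * (num_players - 1)


if __name__ == "__main__":
    for n in (50, 100, 200, 400):
        print(n, "players ->", worst_case_updates_per_tick(n), "updates per tick")
```

The count grows with the square of the player count, so going from 100 to 200 players in one spot roughly quadruples the work rather than doubling it.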
#14
04/16/2007 (8:53 am)
@Orion - I agree with you that maybe the commandToServer commands may be slowing down the server drastically. In your simulations, how were you able to simulate player movement? Did you do a direct manipulation of the $mv...Action variables?
#15
04/16/2007 (9:13 am)
I'll double-check today, but i believe that's the case, yes.
i think we just called moveforward(1)
then wait a bit then call moveforward(0)
then wait a bit then call turnLeft(1)
then wait a bit then call turnLeft(0)
etc
hm but you know we also have them doing occasional commandToServer()s to stress the chat system,
i think they're sending one every five seconds or so.
one way to help decide if it's network or physics
would be to use Bots instead of real clients.
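For anyone wanting to plan the timing of such a fake-input loop before wiring it into the client, here is a rough Python sketch (not TorqueScript). It only generates the schedule of pretend key presses and releases; the action names mirror the calls mentioned above, while the function name and the hold time are assumptions for illustration.

```python
def input_script(hold_s, cycles):
    """Yield (time_s, action, value) events for a forward / turn-left loop.

    value 1 is a pretend key press, value 0 a pretend key release.
    hold_s is the "wait a bit" delay between consecutive calls.
    """
    t = 0.0
    for _ in range(cycles):
        for action in ("moveforward", "turnLeft"):
            yield (t, action, 1)  # press
            t += hold_s
            yield (t, action, 0)  # release
            t += hold_s


if __name__ == "__main__":
    for event in input_script(hold_s=2.0, cycles=2):
        print(event)
```

Each event would then be handed to whatever scheduling mechanism the client uses to call the real bind functions at the given time.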
#16
04/16/2007 (9:35 am)
This is a very interesting read.
Wanted to share something with you.
Here is what I've posted in my last .blog:
Quote:
After some time we had about 40 players online on our "test server". Huh? Not much you say... Yeah, if you take into account that it was a PIII 1.3GHz computer with 512 RAM.. with 250 AI objects running on a single mission....
After 8 hours we found that most of our traffic had been eaten by our "testers". Blah! Total amount of registrations at that time was about 500...
So, we decided to move the game server from "home computer" to the co-location on local ISP.
During the "move" the database suddenly got damaged (yeah, my bad, I hadn't made a backup) so we reverted a bit "back" and started again on the new server.
P4, 3.2GHz, 2GB RAM on 100mbit pipe
Right after server startup we got 60++ players. In a few hours the limit of 80 was reached, and here problems began...
I've reset AI object to amount of 127 and..... server was handling 120 players with NO problems. The MAX amount of players online was 121... So..
Taking into account that it was not optimized network on heavily modified TGE1.5 (actually 1.4 + TLK + manual merge into 1.4.2 then 1.5 with TGB and ArcaneFX), and ALL AI "thinking" is done via scripts... All I can say - Torque FOREVER!
In this "test" there were REAL players (120 connections) with 127 AIPlayer objects (every AI object scheduled to "think" every 3 seconds). All of them were playing: running, hunting, chatting, trading, etc. I assume if I totally remove AIPlayers from the mission Torque can handle 200 players.
P.S. It was dedicated "release" build of Torque running under OpenSuSE 10.1. The same physical computer running the MySQL database of the game (everything is DB-based, including all datablocks - loaded from DB on server start-up).
P.P.S. "Clients" were connecting from all around the world including Australia. Server is located in Moscow, Russia.
The average "latency" (on client, reported by Torque "N" graph) at the peak time was not more than 250-300ms for Americans and 200-250ms for Europeans. Not sure about AU's, sorry.
For me (connecting from one ISP to another in same city) the latency was about 120-150ms at peak times.
Hope this info helps.
#17
04/16/2007 (1:19 pm)
So, I put 300 AI players in my empty world - everything was smooth.
I then tried to connect my fake clients into my empty world. I had each client connect every 2-3 seconds. I got up to 192 clients before the server just died - it just got killed (by the OS?) with no error messages. I am not using the commandToServer to send client movement, but instead my server controls all the client movement as it did with my 300 AI players. Once past 100+ players, I noticed considerable server lag (I had a graphical client connected to the server). Physics is then ruled out as the cause, and it seems like the network is the problem here.
The server handled 300 AI Players perfectly fine, but was unable to handle more than 100 real clients. The physical network should be able to handle the traffic and messages, so it seems like Torque doesn't like all the network connections? Perhaps my clients are connecting too fast? Any thoughts?
Thanks.
#18
04/16/2007 (1:36 pm)
Great thread. how does it roll if you have say 300 AIPlayers and limit the real players to forty or fifty ?
connecting every 2-3 seconds is pretty fast. .. actually i just talked to the guy who launches the tests when we run them here, and apparently we tell five machines to each start connecting clients at about once per second, so the TGE server is getting hit with like five or eight connections per second, on average.
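For comparing ramp-up rates, a quick Python sketch of the arithmetic. It assumes each launcher machine starts clients at a fixed interval and that connections never fail or queue; the function name and the 200-client target are made up for illustration.

```python
def seconds_to_connect(target_clients, machines, interval_s):
    """Time to launch target_clients if each machine starts one client every interval_s."""
    # Combined rate is machines / interval_s connections per second.
    return target_clients * interval_s / machines


if __name__ == "__main__":
    # Five launcher machines, one new client per second each.
    print(seconds_to_connect(200, machines=5, interval_s=1.0))
    # One new client every 2.5 seconds overall (midpoint of "every 2-3 seconds").
    print(seconds_to_connect(200, machines=1, interval_s=2.5))
```

If the second case is read as one connection every 2-3 seconds in total, that ramp is several times gentler than five per second, which suggests the connection rate alone is unlikely to explain the difference.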
we've never had the server die as a result of load stress. it usually just bogs down as the numbers get very high.
on a not-actually-related note, we recently got a very nice under-load performance increase by reducing the size of the collision bins. i think stock torque was such that our entire environment took up like four or six bins. we cut the bins in four or eight in each dimension and got much better performance, which in turn affected how many clients we could connect.
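A toy Python illustration of why smaller bins can help. It uses a generic uniform 2D grid, not Torque's actual container code, and counts how many pairs of objects land in the same bin (and would therefore need a closer collision check). The world size, object count, and function name are all assumptions, and objects are treated as points, so nothing spans more than one bin.

```python
import random
from collections import Counter


def same_bin_pairs(points, world_size, bins_per_side):
    """Count pairs of points that share a grid bin (candidates for a detailed check)."""
    bin_size = world_size / bins_per_side
    counts = Counter(
        # Clamp so a point exactly on the far edge still falls in the last bin.
        (min(int(x / bin_size), bins_per_side - 1),
         min(int(y / bin_size), bins_per_side - 1))
        for x, y in points
    )
    return sum(n * (n - 1) // 2 for n in counts.values())


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    points = [(rng.uniform(0, 1000), rng.uniform(0, 1000)) for _ in range(300)]
    coarse = same_bin_pairs(points, 1000, 2)  # whole world covered by 4 bins
    fine = same_bin_pairs(points, 1000, 8)    # each bin cut in four per dimension
    print("coarse grid pairs:", coarse, "fine grid pairs:", fine)
```

With objects spread out, the finer grid leaves far fewer candidate pairs per bin. The trade-off is that very small bins make large objects straddle many bins, so there is a sweet spot rather than "smaller is always better".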
#19
04/16/2007 (1:43 pm)
One more thing just came to mind: a long time ago I found that if many player objects are stuck in each other (spawned on the same spot for example) and all of them are trying to move, it will degrade server performance a lot.
Be sure that the player objects are "free to move around".
#20
04/16/2007 (1:48 pm)
Yeah, all my clients are roaming around freely in the world, so no problems there.
Torque Owner Justin White
Get 10 friends to log into your server.
Have your server process each query 10 times.
This will simulate having 100 processes.