Weekly Update - 2015-08-13 - Server Performance
CapnJosh has posted his weekly update:
I figured I'd describe the hosting situation and what we've seen with regard to reported "server performance problems":
1. I have personally obtained dedicated physical servers in all locations. They are generally running Xeon E3-1270 processors. We cannot really get much faster CPUs.
2. All gameserver hosts are on at least 1Gb internet uplinks.
3. We have had gameserver hosts in 13 different datacenters and we've used no fewer than 6 independent hosting companies. We have tried both C-States enabled and disabled, various power configurations, hyperthreading enabled and disabled, turbo enabled and disabled, spread spectrum enabled and disabled.
4. Several times now when we have received Wireshark capture files, we have seen very high levels of "Duplicate ACKs" and "retransmission retries", both of which usually indicate packet loss.
5. CPU utilization on the gameserver hosts is relatively low.
6. We have built out reporting that shows all client-reported ping times by server. The ping times are as expected, given geographical proximity. There are exceptional cases too, where a single user is showing either high pings or wild ping ranges even when they are geographically right next to many users who report low, steady pings.
7. The survey results came back showing that only 4% of respondents indicated they had very bad network problems, with another 11% or so indicating slightly higher levels of performance issues. The remaining 89% indicated no performance issues or they "sometimes" experienced them.
8. Gameserver telemetry has not been showing high tick processing times (which would indicate server CPU overload).
9. Gameserver management systems are exactly as they were before we ever worked on Hawken.
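To illustrate the kind of per-server ping reporting described in item 6, here is a minimal Python sketch of how a single user with high or wildly varying pings could be picked out from neighbors on the same server. It is not the actual reporting system: the function name, the data layout, and the `ratio` and `jitter_ms` thresholds are all assumptions made up for this example.

```python
from statistics import median, pstdev

def flag_ping_outliers(samples, ratio=2.0, jitter_ms=30.0):
    """Flag players whose pings stand out from others on the same server.

    samples: {server_name: {player_name: [ping_ms, ...]}}  (hypothetical layout)
    ratio: a player is "high ping" if their median exceeds ratio * server baseline
    jitter_ms: a player is "unstable ping" if their ping std. deviation exceeds this
    Returns a sorted list of (server, player, reason) tuples.
    """
    flagged = []
    for server, players in samples.items():
        # Median ping per player, ignoring players with no samples.
        player_medians = {p: median(v) for p, v in players.items() if v}
        if len(player_medians) < 3:
            continue  # too few players to form a meaningful baseline
        # Server baseline: the median of the per-player medians.
        baseline = median(player_medians.values())
        for player, med in player_medians.items():
            if med > ratio * baseline:
                flagged.append((server, player, "high ping"))
            elif pstdev(players[player]) > jitter_ms:
                flagged.append((server, player, "unstable ping"))
    return sorted(flagged)

# Example with made-up numbers: "d" is consistently slow, "e" is erratic.
example = {
    "us-west-1": {
        "a": [40, 42, 41],
        "b": [38, 39, 40],
        "c": [45, 44, 46],
        "d": [180, 190, 185],
        "e": [40, 150, 35, 42, 38],
    }
}
outliers = flag_ping_outliers(example)
```

Using the median as the baseline keeps one bad connection from dragging the server-wide figure upward, which matters when most players on a server report low, steady pings.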
Now, there have been cases where gameserver hosts have been overloaded, or we have received splash damage from DDoS activity, or some routing peer has had network problems, or I have screwed up some server configuration, or the back-end data layer has had problems.
However, from all I have seen, the problem is very likely *not* the hosting, and that's based on the above data.
Remaining things to pursue may be:
1. optimizations to replication data size
2. adding further network monitoring features to the client and server (to comprehensively monitor all routes for all players, e.g. WinMTR and UDP delivery rate software that can run in the background on all clients all the time or as desired)
3. adding client-side network configuration reporting features (to be able to see what configs may or may not predict network performance results)
4. investigating what else could have changed over the past year
5. collecting more WinMTR routes from client machines while they are experiencing what looks like server performance issues (these will show whether it's due to packet loss and at which router packets are being lost)
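As a rough idea of what the "UDP delivery rate" monitoring in item 2 could measure, here is a minimal Python sketch. It assumes a probe that sends numbered UDP packets (0 through `sent_count - 1`) and records the sequence numbers in the order they arrive; the function name and the output fields are hypothetical, and no such tool exists in the client today.

```python
def udp_delivery_stats(sent_count, received_seqs):
    """Summarize one probe run of numbered UDP packets.

    sent_count: how many packets were sent, numbered 0..sent_count-1
    received_seqs: sequence numbers in arrival order (may contain duplicates)
    """
    if sent_count <= 0:
        raise ValueError("sent_count must be positive")

    # Distinct, valid sequence numbers that actually arrived.
    unique = {s for s in received_seqs if 0 <= s < sent_count}
    # Extra copies of packets that were already received.
    duplicates = len(received_seqs) - len(set(received_seqs))

    # Count first-time arrivals that show up after a higher sequence number.
    reordered = 0
    highest = -1
    seen = set()
    for s in received_seqs:
        if s in seen:
            continue
        seen.add(s)
        if s < highest:
            reordered += 1
        else:
            highest = s

    return {
        "delivery_rate": len(unique) / sent_count,
        "lost": sent_count - len(unique),
        "duplicates": duplicates,
        "reordered": reordered,
    }

# Made-up run: 10 packets sent, 6 and 7 never arrive, 3 arrives late, 5 arrives twice.
stats = udp_delivery_stats(10, [0, 1, 2, 4, 3, 5, 5, 8, 9])
```

Numbers like these, collected per player alongside WinMTR routes, would show whether loss lines up with the times a player reports what looks like server lag.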
Regarding the high number of gameserver instances, we will be reducing those, but we want to gather data from all the gameserver hosts during the same period of time so we can compare "apples to apples".
The dev team hiring process is still ongoing. I know for a fact there is frustration that we have not yet staffed up and released the first patches. Hey, I wish we had already reached that point, so I feel your pain. It's moving, and it's definitely happening. Once again (as we all feel is stated too often), it's a matter of "when", not "if".
-capnjosh