War Thunder
Technical issues over the past weekend

Like other client/server products, our game runs on a complex, distributed infrastructure made up of many different kinds of servers: authorization servers, servers that store user profiles, battle servers, a squad server and a voice communication server. None of these runs on a single machine; each is spread across dozens of them. On top of that comes the matchmaking server, which consists of many physical machines acting as entrance gateways (proxies that eliminate single points of failure) and a server that actually assembles battles from the players in the queue.
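
To illustrate why several gateways remove a single point of failure, here is a minimal client-side sketch in Python. It assumes a client simply tries gateways in random order until one accepts; `connect_via_gateways`, `try_connect` and the gateway names are hypothetical and do not describe the game's actual connection protocol.

```python
import random
from typing import Callable, Optional, Sequence


class AllGatewaysDown(Exception):
    """Raised when no gateway in the pool accepted the connection."""


def connect_via_gateways(
    gateways: Sequence[str],
    try_connect: Callable[[str], bool],
    rng: Optional[random.Random] = None,
) -> str:
    """Return the first gateway (in random order) that accepts the connection."""
    rng = rng or random.Random()
    order = list(gateways)
    rng.shuffle(order)  # randomize so clients spread across the pool
    for gateway in order:
        if try_connect(gateway):  # a dead gateway just means "try the next one"
            return gateway
    raise AllGatewaysDown(f"none of {len(order)} gateways accepted the connection")


alive = {"gw-2", "gw-3"}  # pretend gw-1 is down
print(connect_via_gateways(["gw-1", "gw-2", "gw-3"], lambda g: g in alive))
```

If one gateway is down, the client lands on another; only when every gateway fails does the player see an error.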

Now, a little more detail on what happened this past weekend. One of the gateway services had been running for 375 days, and during a scheduled reboot a misconfigured version was loaded. It used only a single CPU core (we discovered this only on Sunday) and proxied IP addresses incorrectly. Initially it seemed to us that the server was overloaded because of the newly introduced vehicles and gaming nation, as well as the community's attention to the major update. We decided to increase capacity as quickly as possible by moving to the most powerful and expensive machines available on Amazon. However, this cannot be done instantly, and the move itself took time. Even once it was complete, the misconfiguration and the single-threading remained. The server saw every player entering through this service as having the same IP address, could not find them quickly, and began lagging. At approximately 9:00 PM GMT the incorrect proxying was localized; it was corrected soon afterwards, and battles began to match again. During the night, however, the number of users dropped, so the fact that the proxies were running on a single core no longer interfered with operation.
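
The exact lookup structure on the server is not described here, so the following is only a hypothetical Python illustration of why identical IPs hurt. It assumes sessions are grouped by client IP and then scanned by player id (`SessionIndex` and `ip_for` are made-up names). When the proxy reports the same address for everyone, all players collapse into one group and each lookup becomes a scan of the whole population.

```python
from collections import defaultdict
from typing import DefaultDict, List


class SessionIndex:
    """Hypothetical lookup: sessions grouped by client IP, then scanned by player id."""

    def __init__(self) -> None:
        self.by_ip: DefaultDict[str, List[int]] = defaultdict(list)

    def add(self, client_ip: str, player_id: int) -> None:
        self.by_ip[client_ip].append(player_id)

    def find(self, client_ip: str, player_id: int) -> int:
        """Return how many entries had to be scanned to find player_id."""
        for scanned, pid in enumerate(self.by_ip.get(client_ip, []), start=1):
            if pid == player_id:
                return scanned
        raise KeyError(player_id)


def ip_for(player_id: int) -> str:
    """Give each simulated player a distinct address."""
    return f"10.0.{player_id // 256}.{player_id % 256}"


correct, broken = SessionIndex(), SessionIndex()
for pid in range(10_000):
    correct.add(ip_for(pid), pid)   # proxy forwards each player's real IP
    broken.add("10.0.0.1", pid)     # misconfigured proxy: everyone shares one IP

print(correct.find(ip_for(9_999), 9_999))  # 1 entry scanned
print(broken.find("10.0.0.1", 9_999))      # 10000 entries scanned
```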

On Sunday, as the number of users and battles grew, the load increased again. We finally located the second configuration error and, during the afternoon, fixed the single-threaded proxying by gradually removing the misconfigured machines and introducing new ones, so as not to deny service to players who were already in the game. This process also took some time. It should be noted that there was plenty of capacity available even before the more powerful servers were introduced: a single core was not enough, but there were many cores. After the upgrade we used only about 6% of the overall capacity (i.e. roughly a 20-fold reserve over the peak load).
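
The "remove the bad machines without cutting anyone off" step is a rolling drain-and-replace. Below is a minimal Python sketch of that pattern, assuming hypothetical `provision` and `wait_until_drained` hooks supplied by the deployment tooling; it is not the team's actual procedure.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Machine:
    name: str
    correct_config: bool
    active_sessions: int = 0
    accepting_new: bool = True


def rolling_replace(
    pool: List[Machine],
    provision: Callable[[str], Machine],
    wait_until_drained: Callable[[Machine], None],
) -> List[str]:
    """Swap out misconfigured machines one at a time without cutting off live sessions."""
    log: List[str] = []
    for old in [m for m in pool if not m.correct_config]:
        new = provision(f"{old.name}-new")  # 1. add the replacement first, so capacity never dips
        pool.append(new)
        log.append(f"added {new.name}")
        old.accepting_new = False           # 2. route no new players to the old machine
        log.append(f"draining {old.name}")
        wait_until_drained(old)             # 3. let players already connected finish
        if old.active_sessions:
            raise RuntimeError(f"{old.name} still has {old.active_sessions} sessions")
        pool[:] = [m for m in pool if m is not old]  # 4. retire the old machine
        log.append(f"removed {old.name}")
    return log


def drain(machine: Machine) -> None:
    machine.active_sessions = 0  # stand-in for waiting until its battles end


pool = [Machine("gw-1", correct_config=False, active_sessions=120),
        Machine("gw-2", correct_config=True, active_sessions=80)]
print(rolling_replace(pool, lambda name: Machine(name, correct_config=True), drain))
print([m.name for m in pool])  # ['gw-2', 'gw-1-new']
```

Adding the replacement before draining the old machine is what keeps total capacity from dropping during the swap.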

Conclusions

Based on the results, we drew conclusions and planned improvements to both our operations and the matchmaking code. First, we plan to test the fault tolerance of the service by restarting all servers that have been up for excessively long, following the experience of large streaming services such as Netflix, which use special bots that check server operating time.
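
An uptime check of this kind can be very small. The Python sketch below assumes boot times are already collected per server and uses a hypothetical 30-day threshold (`MAX_UPTIME` and the server names are illustrative, not real policy). It lists the servers overdue for a planned restart, oldest first.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List

# Hypothetical policy: anything up longer than this gets a planned restart.
MAX_UPTIME = timedelta(days=30)


def restart_candidates(
    boot_times: Dict[str, datetime],
    now: datetime,
    max_uptime: timedelta = MAX_UPTIME,
) -> List[str]:
    """Return servers whose uptime exceeds max_uptime, longest-running first."""
    overdue = [
        (now - booted, name)
        for name, booted in boot_times.items()
        if now - booted > max_uptime
    ]
    overdue.sort(reverse=True)  # restart the oldest ones first
    return [name for _, name in overdue]


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
boot_times = {
    "gateway-a": now - timedelta(days=375),
    "battle-7": now - timedelta(days=2),
    "profiles-3": now - timedelta(days=45),
}
print(restart_candidates(boot_times, now))  # ['gateway-a', 'profiles-3']
```

Restarting on a regular schedule means a bad configuration tends to surface during a quiet, planned restart rather than after a year of uptime.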

We have also already introduced improvements to the matchmaking code that will let us keep the game operating at an acceptable level even under heavy load: queueing may take longer, but the service will stay operational and keep responding to players.
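
One common way to get "slower queue, but still responsive" behaviour is to give each matchmaking pass a fixed time budget. The Python sketch below is a hypothetical illustration of that idea, not the game's matchmaking code: `matching_tick` forms battles until its deadline and leaves everyone else queued for the next pass. A fake clock makes the demo deterministic.

```python
import itertools
import time
from collections import deque
from typing import Callable, Deque, List


def matching_tick(
    queue: Deque[int],
    battle_size: int,
    budget_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> List[List[int]]:
    """Form battles until the time budget is spent; everyone else stays queued for the next tick."""
    deadline = clock() + budget_s
    battles: List[List[int]] = []
    while len(queue) >= battle_size and clock() < deadline:
        battles.append([queue.popleft() for _ in range(battle_size)])
    return battles


# Deterministic demo: a fake clock that advances one "second" per call.
ticks = itertools.count()


def fake_clock() -> float:
    return float(next(ticks))


queue = deque(range(100))  # 100 queued player ids
battles = matching_tick(queue, battle_size=10, budget_s=3.0, clock=fake_clock)
print(len(battles), len(queue))  # 2 80
```

Because each pass does bounded work, a surge in players lengthens the queue instead of stalling the server.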

In addition, the number of gaming nations and possible battle configurations has grown large enough that algorithmic optimizations are required. Evaluating every possible combination of all players and all gaming nations has quadratic complexity, so we need optimizations that find a match that is perhaps not absolutely perfect, but at least good enough.
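
To show the trade-off between exhaustive and "good enough" matching, here is a hypothetical Python sketch that swaps pairwise comparison (n(n−1)/2 checks) for a sort-and-scan heuristic. It assumes each player is reduced to a single numeric rating, a simplification, since the real matcher also has to weigh nations and battle configurations; `greedy_groups` and `max_spread` are illustrative names.

```python
from typing import List, Tuple


def greedy_groups(
    players: List[Tuple[str, float]], size: int, max_spread: float
) -> List[List[str]]:
    """Group players with similar ratings: sort once, then only look at neighbours.

    Cost is O(n log n) for the sort plus one linear pass, instead of the
    n * (n - 1) / 2 comparisons of checking every pair. The result is
    "good enough" (each group's spread <= max_spread), not globally optimal.
    """
    ordered = sorted(players, key=lambda p: p[1])
    groups: List[List[str]] = []
    i = 0
    while i + size <= len(ordered):
        window = ordered[i:i + size]
        if window[-1][1] - window[0][1] <= max_spread:
            groups.append([name for name, _ in window])
            i += size  # these players are matched; move past them
        else:
            i += 1     # the lowest-rated player doesn't fit here; leave them queued
    return groups


players = [("a", 5.0), ("b", 5.3), ("c", 7.0), ("d", 7.3), ("e", 5.7), ("f", 9.0)]
print(greedy_groups(players, size=2, max_spread=0.5))  # [['a', 'b'], ['c', 'd']]
```

Players `e` and `f` stay in the queue for a later pass, which is the price of not searching every combination.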

More good news: log in to the game between November 4th (11:00 GMT) and November 8th (11:00 GMT) to receive a Premium booster, +30% RP for 5 battles!

 
