Patching a live game server

This blog is part of our ongoing Essential Guide to Game Servers series.

Rolling out a patch can be a pain in the neck. Too much downtime frustrates your fan-base and, if patches are rolled out too often, can push players away from your game for good. Whether it’s fixing a bug, addressing a security flaw or rolling out your latest DLC, the aim is always to get it done quickly and with minimal disruption.

So how do you achieve this?

We spoke with our Global Support Manager for Multiplay, Michael Assis, and one of our senior developers, Andrew Montgomery, to get the inside track on the best ways to roll out a patch.

The support team at Multiplay rolls out dozens of patches a week for clients and Andrew has worked closely on some of the technology Multiplay has developed to solve the unique challenges involved in patching games of all shapes and sizes.

This blog covers the basics of what patching is, running through a couple of different approaches and some best practice advice. We then take a technical deep dive into Query and RCON, with Andrew talking you through how Multiplay mitigates risks and rolls out patches with minimal disruption to gamers worldwide.

What is patching?

In general terms, a patch is a set of changes to a computer program or its supporting data designed to update, fix, or improve it.

When it comes to patching a game, we’re not talking about the entire game, but rather a set of files that need updating in order to change certain aspects of the game server, whether that be new settings, maps, features or game modes.

Patching is an important part of hosting any game, as most game developers will be releasing new features for their games over time in order to keep players interested and engaged.

Scheduled maintenance vs Zero downtime patching

Scheduled maintenance

The most common type of patching seen across the gaming industry is where the entire game or service is taken offline, the files are updated, the new changes are tested and then those changes are rolled out to the remaining offline services. This is often the approach taken for Massively Multiplayer Online games (MMOs) and other large games. This approach is also known as “scheduled maintenance”.

Once the rollout is completed, these services are then brought back online and players are given access once again. This type of patching is simple to implement and usually pretty easy to manage.

The key issue with this approach of course is the significant customer impact, as the entire player base will be unable to connect and play a game. Everything is offline and there is no redundancy which, depending on the size of your game and how it is monetized, can mean a significant amount of lost revenue, not to mention unhappy gamers!

Larger MMOs, like World of Warcraft, use this system because having all the services offline for a period gives them time to make bigger changes to background systems (databases, etc.) and allows a number of tests and checks to be run to ensure that the players’ experience is optimal.

Zero downtime patching

A relatively new way of patching is the idea of “zero downtime” patching. This means that the game server is left online and players can still join and continue their games.

Once a client side patch is rolled out, players will restart their side and, once they’ve received the update, they will then connect to a server that is already running the latest version.

The servers receive the latest patch once they restart (usually after a game/match has ended). This is usually achieved by using a profile system or two versions of a game, which are updated independently from each other.

The good thing about this system is that there is minimal player disruption, which is a better experience for the gamer and means they spend longer playing your game!

The challenges arise with the more complex setup, management of two versions of a game and having a matchmaker with logic that can put players on the correct server, depending on what version their client is running.

Another thing to consider is the number of servers needed for this kind of patching. If you’re using Amazon Web Services (AWS), you’ll need to deploy the patched version of your game on a new server and wait for sessions to finish on the old server before shutting it down. That means paying for extra capacity while both run side by side, which can be expensive. Multiplay gets around this by patching existing virtual and physical servers in the background, then switching the version in use without interrupting players, so there is no need for extra servers.

Tips and tricks for flawless patching

  1. Test everything – Whether a patch changes a single file or every single file, make sure you test it first. There is nothing worse than rolling out a broken patch at 4am to hundreds of servers, and then receiving an equal number of support requests complaining that the user’s server is down and they want everything fixed right now.
  2. Always check you are rolling out the correct version – A great way to annoy any player base is to roll out an older version of the game.
  3. Make a copy of the current version before you update – If you need to do a rollback, it is much quicker to just rename a folder, rather than having to download or roll back to a previous version.
  4. Test again – Once you’ve rolled out a patch, make sure you test it again. You want to ensure that servers receive all the file changes and don’t have any new issues.
  5. Take your time – Players will always want to play their game, regardless of the time of day. It is better in the long run that you take your time to ensure that your server patches are rolled out correctly the first time, even if it does mean a longer downtime. If you have to interrupt players more than once during your patching process, you’re probably doing something wrong, and your players won’t appreciate it!
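
Tip 3 is easy to automate. Here is a minimal Python sketch, assuming the game install lives in a single directory; the function names and the ".previous" and ".broken" folder suffixes are our own illustrative choices, not part of any Multiplay tooling.

```python
import shutil
from pathlib import Path


def backup_current(install_dir: Path) -> Path:
    """Copy the live install to a sibling '<name>.previous' folder before patching."""
    backup_dir = install_dir.with_name(install_dir.name + ".previous")
    if backup_dir.exists():
        shutil.rmtree(backup_dir)  # this sketch keeps only one previous version
    shutil.copytree(install_dir, backup_dir)
    return backup_dir


def rollback(install_dir: Path) -> None:
    """Swap the backup back in with two renames; nothing has to be downloaded."""
    backup_dir = install_dir.with_name(install_dir.name + ".previous")
    if not backup_dir.exists():
        raise FileNotFoundError(f"no backup found at {backup_dir}")
    broken_dir = install_dir.with_name(install_dir.name + ".broken")
    if broken_dir.exists():
        shutil.rmtree(broken_dir)
    install_dir.rename(broken_dir)   # keep the failed patch around for inspection
    backup_dir.rename(install_dir)   # the previous version is live again
```

You would call backup_current() just before applying a patch and rollback() only if the post-patch tests from tip 4 fail. The game server should be stopped while the folders are renamed.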

 

Technical deep dive – Utilizing Query and RCON for patching

Query is a system in your game server that allows external agents to query the state of the game server. Typically, this is implemented as a network service that responds to specific network packets with encoded data about the current game session and the server itself, often via UDP.

An external agent will make a request to the game server on its query port and the game server will reply with information such as the number of connected players, the map, current game mode, player scores and other similar information.
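
To make that exchange a little more concrete, here is a minimal Python sketch of how a query reply might be encoded by the game server and decoded by the agent. The field layout is made up for illustration; it is not A2S, SQP or any other real protocol, and the names ServerInfo, encode_info and decode_info are hypothetical.

```python
import struct
from dataclasses import dataclass
from typing import Tuple

# Hypothetical layout (network byte order): current players (uint16),
# max players (uint16), then map name and game mode as length-prefixed UTF-8.
_HEADER = struct.Struct("!HH")


@dataclass
class ServerInfo:
    players: int
    max_players: int
    map_name: str
    game_mode: str


def _pack_str(value: str) -> bytes:
    raw = value.encode("utf-8")
    if len(raw) > 255:
        raise ValueError("string too long for a one-byte length prefix")
    return bytes([len(raw)]) + raw


def _unpack_str(data: bytes, offset: int) -> Tuple[str, int]:
    length = data[offset]
    start = offset + 1
    end = start + length
    if end > len(data):
        raise ValueError("truncated string field")
    return data[start:end].decode("utf-8"), end


def encode_info(info: ServerInfo) -> bytes:
    """What the game server would send back from its query port."""
    return (_HEADER.pack(info.players, info.max_players)
            + _pack_str(info.map_name) + _pack_str(info.game_mode))


def decode_info(data: bytes) -> ServerInfo:
    """What the external agent (monitor, matchmaker) does with the reply."""
    players, max_players = _HEADER.unpack_from(data, 0)
    map_name, offset = _unpack_str(data, _HEADER.size)
    game_mode, _ = _unpack_str(data, offset)
    return ServerInfo(players, max_players, map_name, game_mode)
```

A compact binary layout like this keeps replies small, which matters for UDP and for the amplification concerns discussed later.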

Query is useful for external monitoring services, which can include your own matchmaker. It allows consumers to know essential information like the number of active players or whether a session is active, which is required to be able to shut down a server without impacting players.

It can also be used to monitor the behavior of the game itself in various ways. For example, if query responses are handled as part of the main game loop, then you can also tell if the game has crashed without exiting, because Query will no longer respond.
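
As a rough sketch of that idea, a monitor could treat a run of unanswered queries as a sign that the game has hung. The helper names udp_probe and looks_hung, the payload and the default of three attempts are all assumptions for illustration. Because UDP can drop packets, a single missed reply shouldn't be enough to declare a server dead.

```python
import socket
from typing import Callable


def udp_probe(host: str, port: int, payload: bytes, timeout: float = 1.0) -> bool:
    """Send one query packet and report whether any reply arrived in time."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(payload, (host, port))
            sock.recvfrom(4096)
            return True
        except OSError:  # covers timeouts and "port unreachable" errors
            return False


def looks_hung(probe: Callable[[], bool], attempts: int = 3) -> bool:
    """True only if every one of `attempts` consecutive probes goes unanswered."""
    # any() stops at the first successful probe, so a healthy server costs one query.
    return not any(probe() for _ in range(attempts))
```

In use, you would pass something like `lambda: udp_probe(host, query_port, request_bytes)` as the probe, where request_bytes is whatever request your chosen query protocol expects.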

Network protocols

In terms of the network protocol for Query, you can use your own, but to have the best compatibility with existing third-party tools, you’ll likely want to use a well-defined existing protocol (e.g. A2S or BattlEye, which we’ve blogged about here). Some of these protocols have various limitations and most are UDP-based.

At Multiplay, we’ve had a lot of experience dealing with a large number of varying query protocol techniques, which led us to develop our own Server Query Protocol (SQP) and tooling. We also maintain an open source querying tool called Qstat, which supports a large number of protocols.

Preventing Distributed Denial of Service (DDoS) attacks

One of the important things to consider in the design of a query protocol is DDoS packet amplification attacks. Ideally your protocol needs to:

  • Do basic checks so that not just anyone can request any information
  • Send responses only to the party that actually asked for them
  • Try to ensure responses are smaller in size than the requests that generate them

If you don’t do any of these, then rogue actors on the internet can spoof packets to your game server and get it to send packets to their DDoS targets. Amplification occurs when they can send a small spoofed packet to your game, and your game then sends a large packet (or even multiple packets) to their target in response.

At Multiplay, we’ve seen this exact kind of attack in use several times and while it can be partially mitigated at the network layer, making your game itself harder to abuse this way can save you money, since hosting becomes easier and less likely to require an investment in expensive packet filtering technology.

SQP solves these issues by using a challenge/response system, so that packets are dropped unless a small handshake is completed first. This means a spoofed source address is of no use to an attacker: the challenge is sent to the spoofed address, so the attacker never sees it and cannot complete the handshake needed to request the larger responses.
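
The general idea can be sketched in a few lines of Python. To be clear, this is not SQP’s wire format: it is a generic challenge token derived from the sender’s address with an HMAC, and the packet names ("CHAL", "INFO"), the token length and the handle_packet function are all assumptions made for illustration.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

SECRET = os.urandom(32)                      # per-process secret; rotate it in practice
CHALLENGE_REQUEST = b"CHAL\x00\x00\x00\x00"  # padded so the reply is never larger
Addr = Tuple[str, int]


def make_challenge(addr: Addr) -> bytes:
    """Derive a short token bound to the sender's claimed IP and port."""
    msg = f"{addr[0]}:{addr[1]}".encode()
    return hmac.new(SECRET, msg, hashlib.sha256).digest()[:8]


def handle_packet(addr: Addr, packet: bytes, full_info: bytes) -> Optional[bytes]:
    """Return the reply to send to `addr`, or None to drop the packet silently."""
    if packet == CHALLENGE_REQUEST:
        # 8 bytes in, 8 bytes out: a spoofed request gains no amplification.
        return make_challenge(addr)
    if packet.startswith(b"INFO") and len(packet) == 12:
        token = packet[4:]
        if hmac.compare_digest(token, make_challenge(addr)):
            return full_info  # the sender proved it can receive packets at addr
    return None
```

Because the token is recomputed from the address on every packet, the server doesn’t need to remember outstanding challenges, which keeps the query handler cheap even under attack.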

So how is any of this useful for patching?

When patching your game, you want to do so with the minimal impact to players. This is impossible without being able to tell whether a game server is currently hosting a game session or not.

Query allows you to provide a mechanism to determine whether players are connected or a game session is currently active. Furthermore, it allows your matchmaker to have a mechanism for querying information about game servers it may want to place sessions on.

This means it can determine whether or not that game server is ready or willing to accept a session (e.g. if the game server is being patched, it may want to refuse sessions), or what version of the game server is running, so the matchmaker can assign players only to new versions as the older ones are scaled down.
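
Here is a minimal sketch of that matchmaker logic in Python. The QueryResult fields and the pick_server function are hypothetical; they stand in for whatever your query protocol actually reports. Preferring the fullest eligible server is just one possible policy, chosen here because it lets emptier servers drain sooner.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QueryResult:
    address: str
    version: str
    accepting_sessions: bool   # False while the server is draining for a patch
    players: int
    max_players: int


def pick_server(servers: List[QueryResult], client_version: str) -> Optional[QueryResult]:
    """Choose a server that matches the client's version and is willing to host."""
    candidates = [
        s for s in servers
        if s.version == client_version
        and s.accepting_sessions
        and s.players < s.max_players
    ]
    # Prefer the fullest eligible server so emptier ones can drain and be patched.
    return max(candidates, key=lambda s: s.players, default=None)
```

If pick_server returns None, there is no compatible server with room, which is the cue to scale up servers on that version or ask the player to update their client.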

Query and RCON to avoid downtime

Query alone only gets you part of the way through the journey of eliminating downtime that will affect your players. To completely avoid downtime due to patching, your game server also needs the ability to be told it should no longer accept game sessions.

This can be implemented in any number of ways, such as:

  • Local process signalling (platform-specific)
  • RCON commands
  • Local console commands (via STDIN)

The most portable of these is RCON commands, and it is likely RCON is a requirement for your game anyway, if you want to allow admins to control the game and kick/ban players (or other similar administrative tasks).

The downside of RCON is that there is no standardized command set or protocol, so it is hard to expect third-party applications to know what specific commands you implement to safely disable sessions on your game server.

Local process signalling requires platform-specific solutions (Linux can handle signals such as TERM or USR2, whilst Windows can support WM_QUIT via the Windows message pump or CTRL_C via the console control handler), but the options are relatively few, so third parties can reasonably be expected to support them.

Typically, making your game server respond to TERM/CTRL_C by capturing the signal, gracefully cleaning up any existing sessions and then exiting allows it to be patched without causing fleet-wide downtime.
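
A minimal sketch of that behavior in Python, using the standard signal module. The GameServer class and its method names are hypothetical; a real game server would hook the same logic into its own main loop and session code.

```python
import signal


class GameServer:
    def __init__(self) -> None:
        self.accepting_sessions = True
        self.session_active = False
        self.should_exit = False   # the main loop checks this and exits cleanly

    def request_drain(self, signum=None, frame=None) -> None:
        """Signal handler: stop taking new sessions and exit once idle."""
        self.accepting_sessions = False
        if not self.session_active:
            self.should_exit = True

    def start_session(self) -> bool:
        if not self.accepting_sessions:
            return False           # draining: refuse new sessions
        self.session_active = True
        return True

    def end_session(self) -> None:
        self.session_active = False
        if not self.accepting_sessions:
            self.should_exit = True  # drained: safe to exit and restart on the new patch

    def install_signal_handlers(self) -> None:
        # SIGTERM is what a process manager typically sends; SIGINT is Ctrl+C.
        # signal.signal() must be called from the main thread.
        signal.signal(signal.SIGTERM, self.request_drain)
        signal.signal(signal.SIGINT, self.request_drain)
```

The handler only flips flags, which keeps it safe to run at any point; the actual cleanup and exit happen in the main loop once should_exit is set.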

This can be achieved by having the management software running your game server send signals when it knows a patch is being rolled out, allowing your game server to shut down after a session is completed and then be restarted on the new patch.

At Multiplay, we use this method, in combination with custom support added for some RCON-supporting games, to do rolling patches without causing fleet-wide downtime, gradually upgrading the whole fleet by replacing servers with the new version as their sessions naturally finish. You push out a patch, and our management layer tells the running servers to stop accepting sessions and exit after they finish their current one (via TERM/RCON). The system then restarts them on the new patch version.
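
The flow can be illustrated with a toy simulation in Python. This is not Multiplay’s management layer: the Server class, the tick-based loop and the sessions_left counter are all assumptions that stand in for real signalling, real session tracking and real process restarts.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Server:
    name: str
    version: str
    sessions_left: int = 0   # ticks until the current session ends; 0 means idle
    draining: bool = False


def rolling_patch_tick(fleet: List[Server], target: str) -> None:
    """One pass of the management loop over the fleet."""
    for server in fleet:
        if server.version == target:
            continue
        server.draining = True           # the TERM/RCON "stop accepting sessions" step
        if server.sessions_left > 0:
            server.sessions_left -= 1    # the running session carries on untouched
        if server.sessions_left == 0:
            server.version = target      # restart on the new patch version
            server.draining = False


def patch_fleet(fleet: List[Server], target: str) -> int:
    """Run ticks until every server is on `target`; returns the ticks used."""
    ticks = 0
    while any(s.version != target for s in fleet):
        rolling_patch_tick(fleet, target)
        ticks += 1
    return ticks
```

Idle servers move to the new version on the first pass, while busy ones follow as their sessions end, so at no point is the whole fleet offline.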

Disruption from this can be mitigated further by the use of A/B patching, where instead of updating the game in-place, your patch applies to alternate A/B branches and each patch update runs the A or B branch as appropriate, eliminating any possible issues from the game being modified by a patch whilst it is hosting a session.
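
A minimal sketch of A/B patching in Python, assuming each server has a root folder containing "A" and "B" directories plus a small "ACTIVE" pointer file. That layout and the function names are illustrative assumptions rather than a description of Multiplay’s implementation.

```python
import shutil
from pathlib import Path
from typing import Dict


def active_branch(root: Path) -> str:
    """Read which branch new server processes should launch from."""
    return (root / "ACTIVE").read_text().strip()


def apply_patch_ab(root: Path, files: Dict[str, bytes]) -> str:
    """Build the patched version in the inactive branch, then flip the pointer."""
    current = active_branch(root)
    target = "B" if current == "A" else "A"
    target_dir = root / target
    if target_dir.exists():
        shutil.rmtree(target_dir)
    shutil.copytree(root / current, target_dir)  # start from the live version
    for name, content in files.items():
        (target_dir / name).write_bytes(content)  # overlay only the changed files
    # Write then rename, so a reader never sees a half-written pointer file.
    tmp = root / "ACTIVE.tmp"
    tmp.write_text(target)
    tmp.replace(root / "ACTIVE")
    return target
```

Servers that are mid-session keep running from the old branch, which is never modified, and pick up the new branch the next time they restart.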

You can read more about how Multiplay utilizes this knowledge and technology in its hybrid scaling game server solution here.

February 8th, 2019 | Essential Guide to Game Servers