In part one, we explored reactive vs proactive monitoring, tools, and some general best practice advice.
In part two, one of Multiplay’s senior developers, Lewis Waddicor, is going to take a deeper dive into monitoring, with a focus on machine health, recovery and what you can do with the data.
Since every organisation will have different requirements when it comes to monitoring, we will mainly run through how we handle things at Multiplay.
This is possibly over and above what you might need to do, as you may not need to monitor thousands of machines in dozens of locations, but it’ll give you insight into some areas to explore.
What needs monitoring?
Our service runs in two main places:
Core
If you’re a Multiplay customer, you’ll know all about our core servers, as you’ll be interacting with them regularly. They host our API, website, and the brains of our Hybrid Scale technology.
Edge nodes (Machines)
Edge nodes (or often just Machines) are either bare-metal boxes or machines we have provisioned from a cloud provider. We try to strike a balance between the two for maximum cost effectiveness, and we vary the amount of cloud capacity based on demand throughout the day.
For me, the term ‘edge nodes’ drives home the point that these really live at the edge of our system. We need to consider data transfer costs, latency, and all manner of communication issues that can arise from them being spread across the world.
Both ‘core’ and ‘edge node’ machines require monitoring, but as ‘core’ servers are common to many organisations, I will mainly be discussing how edge nodes are monitored and what we do with that information.
Keeping track of many thousands of machines is impossible without good monitoring. We consistently report back details about them to identify potential problems. There are a variety of different metrics that we use for this, including:
- Disk usage
- Swap usage
- Network usage
- Server query (game state, player information)
For example, if a new game version has been rolled out and RAM usage starts going up over time, it may be a sign that a memory leak has been introduced.
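As a rough illustration of the memory leak example, the sketch below fits a straight line to a series of RAM samples and flags a sustained upward trend. It is not how Multiplay's system works; the function names, the sample format (one reading in MB per reporting interval) and the 1 MB threshold are all assumptions made for the example.

```python
from statistics import mean


def ram_trend_slope(samples):
    """Least-squares slope of RAM usage, in MB per sample interval.

    `samples` is a hypothetical list of RAM readings (MB), oldest first.
    """
    n = len(samples)
    if n < 2:
        return 0.0
    xs = range(n)
    x_mean = mean(xs)
    y_mean = mean(samples)
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, samples))
    denominator = sum((x - x_mean) ** 2 for x in xs)
    return numerator / denominator


def looks_like_leak(samples, min_slope_mb=1.0):
    """Flag a possible leak when usage grows by at least `min_slope_mb`
    per interval. The threshold is an assumed value, not a recommendation."""
    return ram_trend_slope(samples) >= min_slope_mb
```

In practice you would compare the slope before and after a rollout, since a healthy server can also grow for a while as players join.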
Game server health and behavior
To maximize cost-effectiveness, a single machine will often run many game servers (you can read more about game server density in my colleague’s excellent piece).
To prevent a single bad server ruining the experience for players on the same machine, process-level monitoring is vital.
For example, if a game server were to suddenly misbehave and begin consuming a huge amount of memory or CPU, players on the other game servers on that machine would likely notice it as a huge lag spike. The machine statistics we collect are vital to detecting problems like these.
The main executables may also spawn many sub-processes. It’s important to monitor all of these children to make correct decisions. Again, similar sorts of information are collected (CPU, RAM, etc.).
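To show why the children matter, here is a minimal sketch that adds up CPU and RAM for a game server process and everything it has spawned. The snapshot format (a list of dicts with `pid`, `ppid`, `cpu_percent` and `rss_mb`) is a hypothetical one chosen for the example; a real collector would fill it from the operating system.

```python
from collections import defaultdict


def tree_usage(processes, root_pid):
    """Sum CPU and RAM for `root_pid` and all of its descendants.

    `processes` is an assumed snapshot format: dicts with the keys
    pid, ppid, cpu_percent and rss_mb.
    """
    by_pid = {}
    children = defaultdict(list)
    for proc in processes:
        by_pid[proc["pid"]] = proc
        children[proc["ppid"]].append(proc["pid"])

    total_cpu = 0.0
    total_rss = 0.0
    seen = set()
    stack = [root_pid]
    while stack:
        pid = stack.pop()
        # Skip unknown pids and guard against cycles in bad data.
        if pid in seen or pid not in by_pid:
            continue
        seen.add(pid)
        total_cpu += by_pid[pid]["cpu_percent"]
        total_rss += by_pid[pid]["rss_mb"]
        stack.extend(children[pid])

    return {
        "cpu_percent": total_cpu,
        "rss_mb": total_rss,
        "process_count": len(seen),
    }
```

Looking only at the root process in a snapshot like this would miss any load generated by its children, which is exactly the case where a wrong decision gets made.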
One of the most important tools we use is server query. We use this to ask the game server directly for information about itself. We can then see if it responds, the number of players it has connected, game state, and any other information it sends us.
How do we make use of monitoring information?
The machines collect information and report it back to our core services. We developed a really powerful ‘server check’ service running on all the edge nodes which does this. This application is designed to work with all the platforms we support and report in a unified way.
Our system has to be resilient to any connection issues as clients expect peak performance. Core will keep track of servers that have not reported in recently and the edge nodes will send previous statistics to fill in any gaps in monitoring reports. Alerts may also be raised if needed (for more on alerts, check out part one on monitoring).
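The two halves of that behaviour can be sketched as follows: an edge-side buffer that holds on to reports until they have been delivered, and a core-side check for servers that have gone quiet. This is an illustrative sketch only; `ReportBuffer`, `stale_servers`, the `send` callback and the 60 second cut-off are assumptions, not the actual server check implementation.

```python
from collections import deque


class ReportBuffer:
    """Edge-side buffer: keeps reports until core has accepted them."""

    def __init__(self, max_reports=1000):
        # Bounded so a long outage cannot exhaust memory; when full,
        # the oldest reports are dropped first.
        self._pending = deque(maxlen=max_reports)

    def record(self, report):
        self._pending.append(report)

    def flush(self, send):
        """Send oldest reports first. `send` is a hypothetical callable
        returning True on success. Stops at the first failure so the
        remaining reports are retried on the next flush."""
        sent = 0
        while self._pending:
            if not send(self._pending[0]):
                break
            self._pending.popleft()
            sent += 1
        return sent

    def __len__(self):
        return len(self._pending)


def stale_servers(last_seen, now, max_age_seconds=60):
    """Core-side check: ids of servers with no report in `max_age_seconds`.

    `last_seen` maps server id to the timestamp of its latest report.
    """
    return sorted(
        server_id
        for server_id, timestamp in last_seen.items()
        if now - timestamp > max_age_seconds
    )
```

Sending the oldest report first means that once the connection returns, the gap in the monitoring history is filled in order.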
At scale, the core services will be receiving and processing a huge amount of information from every server check instance running. We use this information in a number of ways, including:
Recovery is perhaps the most important of these. We will always try to recover if we have an issue. Often, though, we don’t have to, as the system does a really good job of attempting to fix itself if it has problems. Based on the data it collects, it has a constantly evolving view of the world and is able to make better decisions.
In the case of a runaway executable, the system can use the collected data to compare the expected resource profile against a previous one and take action to try to recover it. In a lot of cases this is enough to permanently fix the problem without manual intervention.
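One simple way to picture a resource profile comparison is to check each measured metric against an expected value with some headroom. The sketch below does that; the metric names, the dict format and the 1.5x tolerance are assumptions for illustration, and the real system's analysis is likely more involved.

```python
def exceeded_metrics(current, expected, tolerance=1.5):
    """Return the names of metrics that exceed their expected value by
    more than the `tolerance` factor.

    `current` and `expected` are hypothetical dicts of metric name to
    value, for example {"cpu_percent": 40, "rss_mb": 2000}.
    """
    return [
        name
        for name, limit in expected.items()
        if current.get(name, 0.0) > limit * tolerance
    ]


def is_runaway(current, expected, tolerance=1.5):
    """True when at least one metric is well outside its profile."""
    return bool(exceeded_metrics(current, expected, tolerance))
```

Returning the list of offending metrics, rather than just a yes or no, lets the recovery step choose an action that fits the problem.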
If there are back-to-back failures, we may take further action. One option is to disable the machine and let the Hybrid Scaler automatically provision a new machine, if the capacity is needed. An alert can also be set up to allow someone to investigate, to prevent it happening in future.
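The escalation described above can be sketched as a counter of consecutive failed recoveries per machine. The class name, the returned action strings and the limit of three failures are assumptions for the example, not Multiplay's actual policy.

```python
class RecoveryPolicy:
    """Track back-to-back recovery failures per machine and decide
    when to stop retrying and disable the machine instead."""

    def __init__(self, max_consecutive_failures=3):
        self.max_consecutive_failures = max_consecutive_failures
        self._failures = {}

    def record(self, machine_id, recovered):
        """Record one recovery attempt and return the next action.

        Returns "ok", "retry_recovery" or "disable_machine"
        (hypothetical action names).
        """
        if recovered:
            # A success resets the streak for this machine.
            self._failures[machine_id] = 0
            return "ok"
        count = self._failures.get(machine_id, 0) + 1
        self._failures[machine_id] = count
        if count >= self.max_consecutive_failures:
            return "disable_machine"
        return "retry_recovery"
```

A "disable_machine" result is the point at which a scaler could provision a replacement and an alert could be raised for someone to investigate.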
Our service reports back a huge amount of data, given the number of machines and game servers it runs. This information is stored in various services we use to analyze the performance of the fleet and even help diagnose customer problems.
We use a number of services to do this (e.g. InfluxDB, logz.io and Slack), and this storage gives us the ability to pinpoint any issue that we see.
Teams here at Multiplay have built a number of specialist tools to make use of this information. Data is used to catch issues before they become real problems.
The level of data we’re able to provide customers varies depending on the level of integration, but for many of our customers we regularly share trend data. For example, you might want to understand whether a new map is using more CPU than an old one, or whether a patch has been rolled out to all of your players.
We can also break the data down to show you where your players are, and what game modes they’re playing, from a global level right down to individual region or location.
This is part four of our ongoing Essential Guide to Game Servers.