Monitoring game servers can be an absolutely monumental task, and is extremely easy to do wrong, but it is an essential component of game server management.
Fear not: after five years of evolving our monitoring system, we’re still figuring things out.
So what are the best ways to monitor your servers? What should you prioritize? What are the best tools to use?
We spoke with Global Support Manager for Multiplay, Michael Assis, to answer these questions and more.
Reactive vs Proactive monitoring
Monitoring can be broken down into two types: reactive and proactive. Both types should be utilized if possible, as this gives the customer and players the best experience overall. In terms of priority, reactive monitoring should be given top billing.
Reactive monitoring means using a system that notifies a support or operations person when “something” happens, either on a display or through an alert.
Using logic, you can set up a system that regularly checks your active servers for specific “events” and then alerts you in a specific place.
Reactive monitoring is a higher priority in terms of support, as “something” has already happened, which now needs to be investigated and hopefully resolved.
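To make the idea concrete, here is a minimal sketch of a reactive check in Python. It is an illustration only, not a description of Multiplay's system: the `ServerStatus` fields, the `Alert` type, and the two events it looks for (an unreachable server and crashed instances) are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ServerStatus:
    """Hypothetical snapshot of one game server, as a poller might report it."""
    name: str
    reachable: bool
    crashed_instances: int


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert record: which server, and what happened."""
    server: str
    event: str


def check_servers(statuses: Iterable[ServerStatus]) -> List[Alert]:
    """Return one alert per server where a known "event" has already happened."""
    alerts: List[Alert] = []
    for status in statuses:
        if not status.reachable:
            # An unreachable server hides everything else, so report only that.
            alerts.append(Alert(status.name, "unreachable"))
        elif status.crashed_instances > 0:
            alerts.append(
                Alert(status.name, f"{status.crashed_instances} instance(s) crashed")
            )
    return alerts
```

In practice the list of statuses would come from whatever polls your fleet, and the returned alerts would be sent to the one place your support team actually watches.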
Proactive monitoring is fixing an issue or a problem before it causes any damage or an outage.
An example of this would be a system that alerts you when your server is using 90% of its disk space. Left unchecked, the disk would fill up and possibly cause an outage.
These types of alerts can usually be fed into a smart system, which can then use automation to take care of them.
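The disk space example can be sketched in a few lines of Python using the standard library's `shutil.disk_usage`. The function names and the message format are assumptions made for this example. The 90% threshold is the figure used above, not a universal recommendation.

```python
import shutil
from typing import Optional


def disk_usage_percent(path: str = "/") -> float:
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def disk_alert(used_percent: float, threshold: float = 90.0) -> Optional[str]:
    """Return an alert message once usage reaches the threshold, else None."""
    if used_percent >= threshold:
        return f"disk usage at {used_percent:.1f}% (threshold {threshold:.0f}%)"
    return None
```

Calling `disk_alert(disk_usage_percent("/"))` on a schedule gives you the warning before the disk fills. An automated cleanup could then act on any message that is not `None`.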
One use case for proactive monitoring for Multiplay: we alert customers that a new game version is using more resources than before, which in turn allows us to advise the relevant team to lower the number of instances they are running per machine.
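As a rough illustration of that kind of advice, the sketch below estimates how many instances fit on a machine from per-instance CPU and RAM figures. The function name, the parameters, and the 25% headroom are assumptions for the example, and real capacity planning would look at more than two resources.

```python
def max_instances_per_machine(
    machine_cpu_cores: float,
    machine_ram_gb: float,
    cpu_per_instance: float,
    ram_per_instance_gb: float,
    headroom: float = 0.25,
) -> int:
    """Estimate how many instances fit on one machine.

    `headroom` is the fraction of each resource kept free for the OS and
    for load spikes (an assumed safety margin, not a recommended value).
    """
    usable_cpu = machine_cpu_cores * (1 - headroom)
    usable_ram = machine_ram_gb * (1 - headroom)
    # The scarcer resource decides how many instances fit.
    return int(min(usable_cpu // cpu_per_instance, usable_ram // ram_per_instance_gb))
```

On a hypothetical machine with 16 cores and 64 GB of RAM, a version that needs 0.5 cores and 2 GB per instance fits 24 instances. If a new version needs 0.75 cores and 3 GB, the same calculation gives 16, which is the signal to advise running fewer instances per machine.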
Top tools for monitoring
Having tried a number of options for monitoring, here are our preferred choices:
- Slack: The IM tool favoured by most modern teams. It’s great because of its open API and integration with other apps but, as I discuss below, caution is needed!
- Email: The old favourite but still a great tool for stakeholder visibility.
- Grafana: For a monitoring glance board, Grafana is our main choice. It does require some customization and apps added to it, but works perfectly.
- Zabbix: One of the most popular monitoring tools available. Its automation options and event management make it a must-have for us.
Key takeaways when monitoring game servers
The lessons we learned the hard way, so you don’t have to!
- Do not use Slack for alerting. It will spam your teams and, if you have more than one “thing” alerting at once, you’re in for a bad time.
- No, really, do not use Slack for alerting. When you open up your work’s Slack and six of the alerting channels have over 1000 messages in each, with no way of seeing if an alert was actioned, you’re in for a really bad time.
- Do not give your developers the ability to add alerts into Slack or anywhere else without an approval process. Devs are amazing at creating tools to help us. They aren’t so good at creating documentation or explanations for teams that aren’t developers. This means you will end up with hundreds of complex alerts that need to be actioned. The problem is that no one in your support or operations teams will have any clue how to action them!
- Automate as much as possible. The more data points your monitoring setup tracks, the more emphasis you should put on automating the fixes. If you have to fix an alert once, you should document it and then automate it if possible. This saves time in the long run and also allows other members of your team to figure out what an alert is for and how to resolve it if needs be.
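The last two takeaways can be combined into one small sketch: a registry that refuses any alert without documentation and an approver, and that tries an automated fix before escalating to a person. Everything here (`AlertDefinition`, `AlertRegistry`, the field names, the message wording) is hypothetical and only meant to show the shape of the idea.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class AlertDefinition:
    """Hypothetical alert definition with the documentation support needs."""
    name: str
    runbook: str                       # plain-language steps to action the alert
    approved_by: Optional[str] = None  # who signed off on adding the alert
    auto_fix: Optional[Callable[[str], bool]] = None  # returns True if fixed


class AlertRegistry:
    """Accepts only documented, approved alerts and automates what it can."""

    def __init__(self) -> None:
        self._alerts: Dict[str, AlertDefinition] = {}

    def register(self, definition: AlertDefinition) -> None:
        if not definition.runbook.strip():
            raise ValueError(f"alert {definition.name!r} has no runbook")
        if not definition.approved_by:
            raise ValueError(f"alert {definition.name!r} has not been approved")
        self._alerts[definition.name] = definition

    def handle(self, name: str, server: str) -> str:
        definition = self._alerts[name]
        # Try the automated fix first; escalate with the runbook if it
        # is missing or reports failure.
        if definition.auto_fix is not None and definition.auto_fix(server):
            return f"auto-resolved {name} on {server}"
        return f"escalate {name} on {server}: {definition.runbook}"
```

The point of the design is that an alert nobody can explain never reaches the support team, and every alert that does reach them arrives with instructions attached.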
This is part three of our ongoing Essential Guide to Game Servers, which includes:
- Patching a live game server
- Game server player density: tips and tricks to keep costs down
- Monitoring game servers – part one
- Monitoring game servers – part two
For help with your game servers, check out our professional services.