Monitoring services isn’t rocket science, but that doesn’t mean you can just turn on a simple dashboard and hope for the best. To make sure you don’t make some beginner mistakes, check out these seven tips for better service monitoring.
Tip #1: Backups Are Your Friend
Monitoring is one thing, but before you start doing anything, make sure you have a reliable way of doing backups. These should follow the 3-2-1 Rule of Backups (three copies of your data, on two different media, with at least one at a different location) when at all possible.
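If you want to sanity-check a backup plan against the 3-2-1 rule, a minimal sketch could look like the following. The BackupCopy type, the medium and location labels, and the satisfies_3_2_1 helper are all hypothetical names made up for illustration, and the check assumes that "a different location" means anywhere other than where your primary system lives.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    """One copy of the data (hypothetical record for illustration)."""
    name: str
    medium: str      # e.g. "local-disk", "nas", "object-storage"
    location: str    # e.g. "office", "cloud"


def satisfies_3_2_1(copies, primary_location):
    """Return True if the copies meet the 3-2-1 rule.

    At least three copies, on at least two different media, with at
    least one copy stored away from the primary location.
    """
    media = {copy.medium for copy in copies}
    offsite = [copy for copy in copies if copy.location != primary_location]
    return len(copies) >= 3 and len(media) >= 2 and len(offsite) >= 1


# Example inventory: the live data plus two backups.
copies = [
    BackupCopy("live", "local-disk", "office"),
    BackupCopy("nightly", "nas", "office"),
    BackupCopy("weekly", "object-storage", "cloud"),
]
print(satisfies_3_2_1(copies, primary_location="office"))
```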
What does this have to do with monitoring?
It’s simple. When there are issues, you’re going to want to jump in there and change things, but sometimes you’ll just need to roll the whole system back to get a better idea of what’s going on with your code. That means your ability to monitor your services effectively depends on your ability to go back in time to an earlier backup point and restore your services from that point. It won’t happen all the time, but when it does happen you want to make sure you don’t suffer any data loss.
This ties into a key part of great monitoring: understanding what the end result of that monitoring is going to be. What’s that? Fixing unexpected issues.
Tip #2: Make Sure You Can Check Per-User Usage
When you Google “software monitoring” you’re going to be faced with a thousand opinions on a wide range of topics. Most of these won’t be useful.
Just to set some definitions, let’s recognize that monitoring your services could be as high-level as tracking how much electricity you’re using, or as low-level as reviewing every ‘0’ and ‘1’ that gets processed. Both could be useful in certain contexts, but unless you’re either building your own servers or running your own Google-scale data centers, you won’t need information at either extreme.
The ideal level of monitoring is to be able to connect performance changes to specific users.
That means you can see an uptick in load time on a chart and quickly find out which user (or group of users) is the driving force behind that change.
Users are unpredictable, so you should be able to quickly find out the specific account that is causing a performance bottleneck and either cut off or throttle their access.
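As a rough illustration of connecting a performance change to specific accounts, here is a minimal sketch that ranks users by the total request time they account for. It assumes your request logs can be reduced to (user_id, duration_seconds) pairs; the function name and the sample data are made up for the example.

```python
from collections import defaultdict


def top_users_by_latency(requests, limit=3):
    """Rank users by the total request time they account for.

    `requests` is an iterable of (user_id, duration_seconds) pairs.
    Returns (user_id, total_seconds, request_count) tuples, with the
    largest total first.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for user_id, duration in requests:
        totals[user_id] += duration
        counts[user_id] += 1
    ranked = sorted(totals, key=lambda user: totals[user], reverse=True)
    return [(user, totals[user], counts[user]) for user in ranked[:limit]]


# Hypothetical sample pulled from a request log.
sample = [
    ("alice", 0.5),
    ("bob", 2.0),
    ("alice", 0.25),
    ("bob", 3.0),
    ("carol", 0.125),
]
for user, total, count in top_users_by_latency(sample, limit=2):
    print(user, total, count)
```

Once the heaviest accounts are visible like this, deciding whether to throttle or cut off access becomes a policy question rather than a guessing game.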
Tip #3: API-Specific Performance
Much like tracking users, you should be able to find out which API calls are causing issues within seconds. No, this isn’t a memory trick where you memorize every single API call and what its signature is.
This means being able to go from the symptom that caught your attention, down to the API call that is being made, and on to the service that is calling that API. It forms a sort of tree that you can trace all the way to the root of a problem. (Of course, this also brings up the fact that you need good documentation of all of your API services. But that’s another story for another day.)
Without being able to do this you’re going to be left searching for an answer without a path to that answer, right in the moment when fast action is required. It’s a terrible situation to be in, especially when the issues are hurting the bottom line of your business.
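One way to picture that path from symptom to API call to calling service is to group timing samples by endpoint and caller. The sketch below assumes each sample is an (endpoint, caller, duration_seconds) tuple; the function name and the example services are hypothetical.

```python
from collections import defaultdict


def slowest_call_paths(samples, limit=3):
    """Group timing samples by (endpoint, calling service).

    `samples` is an iterable of (endpoint, caller, duration_seconds).
    Returns (endpoint, caller, mean_seconds, call_count) tuples, with
    the slowest mean first.
    """
    durations = defaultdict(list)
    for endpoint, caller, duration in samples:
        durations[(endpoint, caller)].append(duration)
    rows = [
        (endpoint, caller, sum(values) / len(values), len(values))
        for (endpoint, caller), values in durations.items()
    ]
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows[:limit]


# Hypothetical samples: which service called which endpoint, and how long it took.
samples = [
    ("/orders", "checkout", 1.0),
    ("/orders", "checkout", 3.0),
    ("/orders", "reports", 0.5),
    ("/users", "checkout", 0.25),
]
for row in slowest_call_paths(samples, limit=2):
    print(row)
```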
Tip #4: Record, Record, Record
Monitoring performance is only useful if you can detect anomalies. You won’t know that something is wrong unless you have a rough idea of what is “right.”
But, to know that, you need to have something to compare it to. That comparison should be to the record of how your services normally operate.
Some companies do just the bare minimum when recording their logs, but in reality there’s no excuse for that. Storage space is effectively free (okay, it’s about two cents per gigabyte per month on AWS), and that amount is so small that recording your logs for future comparison is well worth the cost.
This is important because when you do detect an issue, it’s going to be useful to be able to go back in time and see when this issue may have appeared in the past. You can only do that if you’re diligently recording all sessions and activity.
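To make the idea of comparing against a recorded baseline concrete, here is a minimal sketch that flags a new measurement when it sits far from the recorded history. The three-standard-deviation threshold is an assumption chosen for illustration, not a recommendation, and real traffic often calls for something more robust.

```python
from statistics import mean, stdev


def is_anomalous(history, current, threshold=3.0):
    """Flag `current` if it is more than `threshold` standard
    deviations away from the mean of the recorded `history`."""
    if len(history) < 2:
        return False  # not enough recorded data to define "normal"
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Every recorded value was identical, so any change stands out.
        return current != baseline
    return abs(current - baseline) / spread > threshold


# Hypothetical response times in milliseconds from past recordings.
history = [100, 102, 98, 101, 99]
print(is_anomalous(history, 101))
print(is_anomalous(history, 150))
```

The first guard is the whole point of the tip: without a recorded history, there is nothing to compare against.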
Tip #5: Use Prometheus
Many companies rely on having scripts written that test their services, believing that these pre-written scripts are crucial to effective monitoring. They believe that these scripts take the burden off their shoulders and allow them to focus more on the issues that matter.
We have a different perspective.
We use Prometheus, which easily plugs into any development environment. It is built around the idea of scraping all available metrics exposed at an HTTP endpoint and then storing them for future evaluation as needed. Compared to pre-written scripts, it takes much of the pain out of reliably testing your services.
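To give a feel for what "metrics exposed at an endpoint" means, here is a minimal sketch that serves a counter in the Prometheus text exposition format using only the Python standard library. The metric name app_requests_total, the port, and the in-process REQUESTS counter are assumptions made for the example; in practice you would more likely reach for an official Prometheus client library, which handles details such as label escaping for you.

```python
from collections import Counter
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process request counter, keyed by endpoint path.
REQUESTS = Counter()


def render_metrics(counts):
    """Render counts in the Prometheus text exposition format.

    Label values are written as-is, so this sketch assumes they contain
    no quotes, backslashes, or newlines.
    """
    lines = [
        "# HELP app_requests_total Requests handled, by endpoint.",
        "# TYPE app_requests_total counter",
    ]
    for endpoint, value in sorted(counts.items()):
        lines.append(f'app_requests_total{{endpoint="{endpoint}"}} {value}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serve the current counters at /metrics for Prometheus to scrape."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUESTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, serving metrics on the given port."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling serve() starts the endpoint; a Prometheus server configured to scrape that host and port would then collect and store the values over time.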
Tip #6: Triaging Using PagerDuty
PagerDuty is one of the most valuable tools you can invest in. If you ask any engineering manager from a world-class technology company, they’ll quickly nod in agreement.
Without this turning into an advertisement for a specific tool (there are alternatives, though they are less widely used), PagerDuty is almost required to bring your monitoring services up to par with your competition.
It will notify you when incidents occur, give you pertinent information so you can resolve the issue, and allow you to focus on solving issues rather than figuring out what the issues are. That saves you time and money.
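As a sketch of how a monitoring check might hand an incident over to PagerDuty, the example below builds a trigger event for the PagerDuty Events API v2 and posts it with the standard library. The routing key is a placeholder you would replace with the integration key of your own service, and the summary and source values are made up for illustration.

```python
import json
import urllib.request

EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(routing_key, summary, source, severity="error"):
    """Build the JSON body for an Events API v2 "trigger" event."""
    if severity not in SEVERITIES:
        raise ValueError(f"unsupported severity: {severity}")
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,
        },
    }


def send_event(event, timeout=10):
    """POST the event to PagerDuty and return the HTTP status code."""
    request = urllib.request.Request(
        EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Placeholder key and made-up incident details; nothing is sent here.
event = build_trigger_event(
    routing_key="YOUR_INTEGRATION_KEY",
    summary="High latency on /orders",
    source="api-gateway",
    severity="critical",
)
print(json.dumps(event, indent=2))
```

Keeping the event construction separate from the network call makes it easy to test the payload without paging anyone.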
Tip #7: Know What You’re Tracking
At the end of the day, none of these tips are going to be useful if you don’t identify which key metrics are worth tracking. Depending on your service and how your users use it, these could be memory, CPU, and disk usage, or network bandwidth, or any number of other variables.
On a more granular level, you should also make sure you know what application-specific metrics need to be tracked. These will be different for every application, but if you closely examine how your users are using your service you’ll quickly be able to see where potential bottlenecks are.
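Whatever metrics you settle on, it helps to write them down next to the limits you care about. The sketch below collects a couple of host-level values with the standard library and reports which ones exceed a limit. The metric names and the limits are assumptions for illustration, and memory usage is left out because the standard library has no portable way to read it.

```python
import os
import shutil


def resource_snapshot(path="."):
    """Collect a few host-level metrics using only the standard library."""
    usage = shutil.disk_usage(path)
    snapshot = {
        "cpu_count": os.cpu_count(),
        "disk_used_fraction": usage.used / usage.total if usage.total else 0.0,
    }
    try:
        # Load average is only available on Unix-like systems.
        snapshot["load_1m"] = os.getloadavg()[0]
    except (AttributeError, OSError):
        pass
    return snapshot


def breached(snapshot, limits):
    """Return the names of metrics whose value exceeds its limit."""
    return sorted(
        name for name, limit in limits.items()
        if snapshot.get(name, 0) > limit
    )


# Hypothetical limits; tune these to how your own service behaves.
limits = {"disk_used_fraction": 0.9, "load_1m": 4.0}
print(breached(resource_snapshot(), limits))
```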
Bonus: Communicating Service Outages with Users
You need a service uptime dashboard. It’s a requirement for any business whose customers will notice an outage. This isn’t strictly related to how you monitor your services, but it should be considered part of the same meta-problem: building infrastructure that is ready for whatever is thrown at it.
There are plenty of options for providing this dashboard and these answers, from building it in-house to outsourcing to a third-party tool. Whatever you choose, remember: your users will want answers, so anticipate how you’re going to provide them.