In IT monitoring, performance and health metrics both serve a purpose.
When system administrators are given a choice between a healthy server and one that performs well, they opt for performance. The reason is simple enough: performance pays. Being able to recover a failed server may let sysadmins keep their jobs, but we all know that we get paid for performance; health is always secondary.
When the same choice is applied to humans, you get a different perspective. I may be perfectly healthy, but there is no way my legs could run a four-minute mile. You’d be hard pressed to find someone who wouldn’t take health over performance for their own body. Most of us know not to confuse our health with our ability to perform.
So why don’t we value server health over server performance?
Well, as I stated before, one reason is that IT pros are not compensated for the health of a server. But it's also because we rarely measure the health of a server except through metrics based on performance. As a result, a myriad of tools on the market jumble together server health and server performance metrics.
The truth is that my server could be healthy, but not able to meet the performance demands of end users. Even if performance demands are met, the server could be on the verge of breaking down beyond repair.
Looking at the difference between the two, we can see where each has a purpose in a hierarchy of monitoring needs. Performance metrics measure throughput and give us an idea of how to tune a workload or query. Health metrics measure resource capacity and tell us whether hardware components are on the verge of failure.
Let’s take a common example: a simple database query that is consuming 6% of the overall CPU. Performance metrics allow us to tune the query to use less CPU. Health metrics help us understand whether that 6% CPU usage is causing issues for other processes. In addition, a baseline of the query's past performance helps us understand whether 6% CPU usage is typical.
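To make the baseline idea concrete, here is a minimal Python sketch of one way to decide whether a reading such as 6% is typical. It is an illustration rather than a recommended method: the function name `is_typical`, the sample history, and the two-standard-deviation tolerance are all assumptions made for this example.

```python
from statistics import mean, stdev


def is_typical(current, history, tolerance=2.0):
    """Return True if `current` is within `tolerance` standard
    deviations of the mean of `history` (a hypothetical baseline rule)."""
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Every past sample was identical, so only that value is typical.
        return current == baseline
    return abs(current - baseline) <= tolerance * spread


# Hypothetical CPU percentages recorded for the same query on earlier runs.
history = [5.5, 6.2, 5.9, 6.1, 5.8, 6.4]

print(is_typical(6.0, history))   # True: in line with the baseline
print(is_typical(15.0, history))  # False: far above the baseline
```

A real monitoring tool would likely account for time of day and workload mix, but even a simple rule like this separates "6% is normal for this query" from "6% is new and worth a look."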
Putting all of this together will help us understand if we need to spend time tuning the workload, or if the time is right to scale up and/or out. Also, when performance and health metrics are combined, you can build actionable alerts that have the potential to eliminate the many hours spent in a reactive mode, fighting fires.
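The decision described above can be sketched as a small rule that looks at both kinds of metrics before recommending an action. This is only a sketch under stated assumptions: the `Snapshot` fields, the `recommend` function, and the thresholds (a 1.5x regression factor and 85% CPU saturation) are hypothetical and not taken from any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    query_cpu_pct: float     # performance: CPU used by the workload now
    baseline_cpu_pct: float  # performance: typical CPU for that workload
    host_cpu_pct: float      # health: total CPU in use on the server
    disk_errors: int         # health: hardware errors reported by the server


def recommend(s, regression_factor=1.5, saturation_pct=85.0):
    """Combine health and performance signals into one suggested action."""
    # Failing hardware outranks everything else (assumed priority).
    if s.disk_errors > 0:
        return "investigate hardware"
    regressed = s.query_cpu_pct > regression_factor * s.baseline_cpu_pct
    saturated = s.host_cpu_pct >= saturation_pct
    if regressed:
        # The workload uses more than usual: tuning is the first step.
        return "tune workload"
    if saturated:
        # The workload is behaving normally but the server is full.
        return "scale up or out"
    return "no action"


print(recommend(Snapshot(6.0, 6.0, 40.0, 0)))   # no action
print(recommend(Snapshot(12.0, 6.0, 90.0, 0)))  # tune workload
print(recommend(Snapshot(6.0, 6.0, 92.0, 0)))   # scale up or out
```

The point of the sketch is that neither metric alone picks the right action: the same 6% query reading leads to "no action" on an idle server and "scale up or out" on a saturated one.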
When you combine both health and performance metrics, you will get the right data, at the right time, so you can perform the right actions for your end users.