In this article I hope to give the reader a small history lesson as well as some advice on how to build a useful monitoring system for your platform. First, it’s key to understand where we came from. Before cloud computing, every company owned its own infrastructure and therefore needed to monitor it. This kind of monitoring checked for things like hard disk failures, memory usage and CPU overload. This was very important because if you had a faulty hard disk within a disk array, you really needed to know about it before the whole array collapsed. These systems were typically not online, and everything was managed by the system administrators in the building. Nagios was the preferred tool then and is still heavily used by many organisations today, but as the world has migrated to the cloud, this style of hardware monitoring has become less necessary. If you use Azure, AWS or GCP you don’t have to worry about faulty disks, memory, network switches etc. All the fundamental computer hardware is very well looked after by such providers; it is no longer your responsibility. This leads to one of the biggest mistakes we currently see in monitoring: using old approaches to monitor a new world. Let’s take a closer look.
How to monitor in the ‘new’ world
Now that the primary hardware is monitored by someone else, organisations can focus on what matters to them. In my experience (as the head of support for a global NOC) I can tell you this is the number one area where businesses fail. Over and over, companies will ask us to monitor, and when we ask what needs monitoring, they cannot tell us. Even worse, they expect us to tell them what needs monitoring! This is total madness. Every company has a unique system and different metrics that are important to them. If you are in this position then I urge you to go back and think about what is important to you. Say, for example, that the shopping cart on the website is the most income-generating feature. In this situation you should map all of the services that make up the ‘shopping cart’ feature. Once you know all the subsystems that support ‘shopping cart’, you can then look to monitor all of them, and by doing so you will have also built a way of monitoring the shopping cart more fully. As cloud computing scales up and down with demand, you should not focus on monitoring CPU and RAM, but instead look to run ‘synthetic tests’ which attempt to run transactions through your system. These synthetic tests mimic the behaviour of users more realistically and are more likely to tell you if your service is up or down; they are also much more useful for alerting. For example, you could have a synthetic test that runs a purchase on your site every minute, which would confirm that the shopping cart is working correctly.
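To make the idea of a synthetic purchase test more concrete, here is a minimal Python sketch. It is an illustration only, not a finished tool: the step functions (open_product_page, add_item_to_cart, pay_with_test_card) are hypothetical placeholders that in a real setup would call your own site with a test account, and the simulated payment failure is made up so that the output shows what a broken cart looks like.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StepResult:
    name: str
    ok: bool
    seconds: float
    error: str = ""


def run_synthetic_transaction(
    steps: List[Tuple[str, Callable[[], None]]]
) -> List[StepResult]:
    """Run each step in order and stop at the first failure."""
    results: List[StepResult] = []
    for name, action in steps:
        started = time.monotonic()
        try:
            action()  # a step signals failure by raising an exception
            results.append(StepResult(name, True, time.monotonic() - started))
        except Exception as exc:
            results.append(
                StepResult(name, False, time.monotonic() - started, str(exc))
            )
            break  # later steps depend on this one, so do not run them
    return results


def transaction_passed(results: List[StepResult], expected_steps: int) -> bool:
    """The purchase worked only if every expected step ran and succeeded."""
    return len(results) == expected_steps and all(r.ok for r in results)


# Hypothetical stand-ins for real calls against a test account.
def open_product_page() -> None:
    pass


def add_item_to_cart() -> None:
    pass


def pay_with_test_card() -> None:
    # Simulated failure, purely for illustration.
    raise RuntimeError("payment service timed out")


purchase_steps = [
    ("open product page", open_product_page),
    ("add item to cart", add_item_to_cart),
    ("pay with test card", pay_with_test_card),
]

results = run_synthetic_transaction(purchase_steps)
for r in results:
    print(f"{r.name}: {'ok' if r.ok else 'FAILED ' + r.error} ({r.seconds:.3f}s)")
print("purchase passed:", transaction_passed(results, len(purchase_steps)))
```

Each step maps to one of the subsystems behind the shopping cart, so a failed run tells you not only that the purchase is broken but also which part of the chain broke. Scheduling the run every minute is left to whatever scheduler or monitoring platform you already use.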
The Benefits of Cloud Monitoring Tools
In the cloud you have lots of tools that come with inbuilt integrations, such as DataDog. This is important because you no longer need to build the monitoring yourself like you used to in the days of Nagios. DataDog has 500+ integrations ready to use. Trying to build monitoring solutions on your own makes no sense, so do not try to reinvent the wheel! By taking cloud (or hosted) solutions you also get much more added benefit. Companies like DataDog use Machine Learning and Artificial Intelligence, which can help you find the root cause of issues more quickly. The more data you push to them, the smarter they become (beware of the additional costs though). Cloud tools also benefit from constant upgrades and feature releases. With Nagios you could have been on the same version for months or even years, whereas DataDog adds new features all the time. Hosted monitoring solutions also scale with you. You don’t have to worry that they will run out of space or have service interruptions, as all of these problems are pushed onto the vendor.
Do’s & Don’ts for Smart Monitoring
Do reduce the noise.
Think about what really counts as an alert for you. It is tempting to add alerts for all kinds of things, but if you add too many, everyone will learn to ignore them and you will miss the important one. Only send alerts to engineers when there is a real problem. It is also key to work out what the business sees as important. An engineer’s view of the world is not the same as a product manager’s: a system may appear totally healthy from a monitoring point of view, but be 100% broken from the user perspective.
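One common way to cut noise is to alert only when a check has failed several times in a row, and then to stay quiet until it recovers. The sketch below is a minimal illustration of that idea. The class name ConsecutiveFailureAlert and the threshold of three are my own assumptions, not a feature of any particular product; most hosted tools offer an equivalent setting.

```python
from collections import deque


class ConsecutiveFailureAlert:
    """Alert once after `threshold` failed checks in a row, then stay quiet
    until a successful check resets the state."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.recent = deque(maxlen=threshold)  # the last `threshold` outcomes
        self.alerting = False

    def record(self, ok: bool) -> bool:
        """Store one check result; return True only when someone should be alerted."""
        self.recent.append(ok)
        failing = len(self.recent) == self.threshold and not any(self.recent)
        should_alert = failing and not self.alerting
        if failing:
            self.alerting = True
        elif ok:
            self.alerting = False  # recovered, so the next outage alerts again
        return should_alert


alert = ConsecutiveFailureAlert(threshold=3)
# A single blip (the first False) is ignored; a sustained outage alerts once.
outcomes = [False, True, False, False, False, False, True]
decisions = [alert.record(ok) for ok in outcomes]
print(decisions)
```

With these sample outcomes only the fifth check triggers an alert: the isolated failure at the start is ignored, and the fourth failure in a row does not alert a second time.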
Do look for existing integrations.
If you need to monitor a system, always look for existing integrations as it is most likely you are not the first person to do this. Take the lessons other people have learned and save yourself time and effort.
Do try to build a ‘single pane of glass’
Modern monitoring systems will allow you to build up time-based data sets from multiple sources, which means that you can view CDN logs next to application logs very easily. The more sources you can pull together in one place, the better your monitoring will be.
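As a rough illustration of what sits behind a single pane of glass, the sketch below interleaves two time-sorted sources into one timeline using Python's standard heapq.merge. The log entries are invented for the example, and a hosted platform does this for you at far greater scale; the point is only to show why a shared timestamp is what makes the combined view useful.

```python
import heapq
from datetime import datetime
from typing import Iterable, List, Tuple

Event = Tuple[datetime, str, str]  # (timestamp, source, message)


def merged_timeline(*sources: Iterable[Event]) -> List[Event]:
    """Interleave sources that are each already sorted by time."""
    return list(heapq.merge(*sources, key=lambda event: event[0]))


# Made-up sample data, for illustration only.
cdn_logs = [
    (datetime(2024, 1, 1, 12, 0, 1), "cdn", "GET /cart 200"),
    (datetime(2024, 1, 1, 12, 0, 4), "cdn", "GET /checkout 502"),
]
app_logs = [
    (datetime(2024, 1, 1, 12, 0, 3), "app", "payment client timeout"),
    (datetime(2024, 1, 1, 12, 0, 5), "app", "payment client recovered"),
]

timeline = merged_timeline(cdn_logs, app_logs)
for timestamp, source, message in timeline:
    print(timestamp.isoformat(), source, message)
```

Seen side by side, the application timeout appears one second before the CDN reports a 502, which is exactly the kind of connection that is hard to spot when the two logs live in separate tools.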
Don’t monitor everything
Whilst it is tempting to monitor everything, this typically means that you have too much data to sort through. If you have a staging system creating millions of lines of logs in an hour, ask yourself ‘is this really necessary?’. Also, be very aware that modern systems are priced by volume, so if you start to push unnecessary data you will end up paying for this.
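A quick back-of-the-envelope calculation can show what a noisy system costs before the invoice does. The sketch below is only that: the price per gigabyte, the average line size and the 730 hours per month are assumptions for illustration, so substitute the figures from your own vendor's price list.

```python
def monthly_log_cost(
    lines_per_hour: float,
    avg_bytes_per_line: float,
    price_per_gb: float,
    hours_per_month: float = 730,
) -> float:
    """Estimate the monthly ingestion cost for one log source.

    Uses decimal gigabytes (1 GB = 1e9 bytes). All inputs are estimates.
    """
    gigabytes = lines_per_hour * avg_bytes_per_line * hours_per_month / 1e9
    return gigabytes * price_per_gb


# Hypothetical staging system: 2 million lines per hour at about 200 bytes
# each, with an assumed price of 0.10 per ingested GB.
staging_cost = monthly_log_cost(2_000_000, 200, 0.10)
print(f"estimated monthly cost: {staging_cost:.2f}")
```

In this made-up case the staging system produces about 292 GB a month. Whether that is worth paying for is exactly the ‘is this really necessary?’ question.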
Don’t rely on the out-of-box settings
I have already mentioned that it is a good idea to find existing integrations for your systems. This remains true, but you should be aware of what your system comprises and what makes it unique. The existing integrations are usually a very good start, but don’t expect them to be perfect. You will need to do some fine tuning to get the best experience.
Don’t make assumptions
Even though monitoring systems are much more advanced with Machine Learning and Artificial Intelligence, do not expect that the system will know what is important to you. Perhaps you have some system folders that should only hold files on a temporary basis before those files are deleted. There is no way for a monitoring platform to know such things; you must teach it. Understand your system before you look to monitor it.
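The temporary folder example is a good case of a rule you have to write yourself. A minimal sketch of such a check is shown below; the function name stale_files and the age limit are hypothetical, and in practice you would report the result to your monitoring platform as a custom metric or check rather than handle it locally.

```python
import time
from pathlib import Path
from typing import List, Optional


def stale_files(
    folder: str, max_age_seconds: float, now: Optional[float] = None
) -> List[Path]:
    """Return files directly inside `folder` last modified longer ago than
    `max_age_seconds`.

    `now` can be passed in to make the check repeatable; it defaults to the
    current time.
    """
    current = time.time() if now is None else now
    return sorted(
        path
        for path in Path(folder).iterdir()
        if path.is_file() and current - path.stat().st_mtime > max_age_seconds
    )
```

For example, calling stale_files on your temporary folder with a limit of 15 * 60 seconds returns any file that has been sitting there for more than fifteen minutes. An empty list means the clean-up job is doing its work; anything else is worth an alert.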
Hopefully, this introduction to monitoring will help you on your journey. I shall leave you with a quote:
“If you can’t measure it, you can’t manage it”.
Keep this in mind when building your systems and you will have an easier life monitoring them.
Monitoring can be worry-free, even on hybrid or multi-cloud. Contact us to get the latest monitoring solutions.