Episode 8 – How to Build a Cloud-Scale Monitoring System

In Episode 8 of the Infralytics Show, Shankar interviewed Molly Struve. Molly is the Lead Site Reliability Engineer for DEV Community, an online portal designed as a place where programmers can exchange ideas to help each other. The discussion focused on two topics, “How to build a cloud-scale monitoring system” and “How to scale your Elastic Stack for cloud-scale monitoring.” 

How Molly started working in software engineering and cloud-scale monitoring

Molly earned an aerospace degree from MIT after originally thinking she would study software engineering. She said that all engineering degrees provide students with the same core problem-solving skills, so when she later decided to work in software engineering, she already had the background she needed to make the transition. She didn’t end up going the aerospace route because you have to be located in California or Washington, where the aerospace industry is, and she is from Chicago and didn’t really want to move. It’s good to know that people with different educational backgrounds can still find success in software engineering!

Let’s jump into the discussion of cloud-scale monitoring! Here are the key points Molly made in reference to the topics listed above.

The Interview – building a cloud-scale monitoring system

What are some of the key requirements to look for when you build out a large cloud-scale monitoring system?

When you start monitoring, you just want coverage, so you often start adding different tools, and before you know it you have 6, 7, or 8 different tools doing all this monitoring. However, when the time comes to use them, you have to open all these different windows in your browser just to piece together what is actually going on in your system. So, one of the key things Molly tells people who are building a monitoring system is to consolidate all of the reporting. You can have different tools, but you need to consolidate the reporting in a single place, so it’s a one-stop shop for all the information you need.

When an alert triggers, it must require an action. Alert fatigue is a big problem in many monitoring systems. When you have a small team, it might seem fine to have known exceptions where you don’t respond to certain alerts, but as your team gets larger you have to tell new engineers what the exceptions are, and this process simply doesn’t scale. So you have to be very disciplined in responding to alerts.

The goal is to get to a point where whoever is on call, whether it’s one person, two people, or three people, can handle the error workload that is coming into the system by way of alerts. 

In the beginning, when you are setting up a monitoring system you might have a lot of errors, and you just have to fix stuff and the improvement of the system comes with time. The ideal metric is zero errors, so you need to be aware of when errors get to a point where they need to be addressed.

Monitoring from an infrastructure perspective is different from monitoring from a security perspective

Trying to figure out what to monitor is also very challenging. You have to set up your monitoring and adjust it as you go, depending on what perspective you are monitoring for. Knowing what to monitor is partly a matter of trial and error. When an error occurs and there is data you wish you had been monitoring, you can address the error and then add the necessary code so that the data is there in the future. After you do that a few times, you will end up with a really robust system, so the next time an error occurs, all the information you need will be there and it might only take you a few minutes to figure out what’s wrong.

Beyond bringing the data together and optimizing alerting, what are the other best practices?

Another best practice is tracking monitoring history. When trying to solve the error from an alert, you will want to know what the past behavior was. Past behavior can help you debug a problem. What were you alerted about in the past and how was the problem addressed then?

Also, you have to remove all manual monitoring for your monitoring system to be truly scalable. Some systems require employees to check a dashboard every few hours, but this task is easily forgotten. So, if you want a monitoring system to scale you have to remove all manual monitoring. You don’t want to rely on someone opening up a file or checking a dashboard to find a problem. The problem should automatically come to you or whoever is tasked with addressing it. 
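To illustrate the idea of problems coming to you instead of waiting on a dashboard, here is a minimal Python sketch of an automated threshold check. The function name, metric names, and thresholds are all hypothetical, and the notify callback stands in for whatever paging or chat tool a team actually uses.

```python
def check_thresholds(metrics, thresholds, notify):
    """Notify about every metric above its threshold; return the breached names."""
    breached = []
    for name, value in metrics.items():
        limit = thresholds.get(name)
        # Metrics without a configured threshold are ignored.
        if limit is not None and value > limit:
            notify(f"{name} is {value}, above threshold {limit}")
            breached.append(name)
    return breached


# Hypothetical usage: collect notifications in a list instead of paging someone.
sent = []
breached = check_thresholds(
    {"error_rate": 0.07, "queue_depth": 120},
    {"error_rate": 0.05, "queue_depth": 500},
    notify=sent.append,
)
```

Run on a schedule, a check like this replaces the "someone looks at the dashboard every few hours" step.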

What tools did you use to automate?

At Kenna we used Datadog. It’s super simple, and it integrates really easily with Ruby, which is the language I primarily work with.

Anything else important on the topic of best practices for cloud-scale monitoring?

Having the ability to mute alerts when you are in the process of fixing them is important. When a developer is trying to fix a problem, it’s distracting to have an alert going off repeatedly every half hour. Having the ability to mute an alert for a set amount of time like an hour or a day can be very helpful. 
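As a rough sketch of time-limited muting, the Python below tracks an expiry time per alert and suppresses notifications until it passes. The class and alert names are hypothetical and this is not how Datadog or any specific tool implements muting; it only shows the idea.

```python
from datetime import datetime, timedelta


class AlertMuter:
    """Tracks which alerts are muted and until when (hypothetical helper)."""

    def __init__(self):
        self._muted_until = {}  # alert name -> datetime when the mute expires

    def mute(self, alert_name, duration, now):
        """Mute an alert for a fixed duration starting at `now`."""
        self._muted_until[alert_name] = now + duration

    def should_notify(self, alert_name, now):
        """Return True unless the alert is currently muted."""
        expires = self._muted_until.get(alert_name)
        if expires is not None and now < expires:
            return False
        # The mute has expired (or was never set), so clean up and notify.
        self._muted_until.pop(alert_name, None)
        return True


start = datetime(2019, 12, 1, 9, 0)
muter = AlertMuter()
muter.mute("disk_usage_high", timedelta(hours=1), now=start)
while_muted = muter.should_notify("disk_usage_high", now=start + timedelta(minutes=30))
after_expiry = muter.should_notify("disk_usage_high", now=start + timedelta(hours=2))
```

Because the mute expires on its own, nobody has to remember to turn the alert back on once the fix is in.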

What else is part of your monitoring stack?

The list goes on and on. You can use Honeybadger for application errors, AWS metrics for your low-level infrastructure metrics, StatusCake for your APIs to make sure your actual site is up, Elasticsearch for monitoring, and CircleCI for continuous integration. It’s a large list of many different tools, but we consolidated them all through Datadog.

What kind of metrics did your management team look for?

Having a great monitoring system allows you to catch incidents and problems before they become massive problems. It’s best to be able to fix issues before the point at which you would have to alert users to the problem. You want to solve problems before they impact your user base. That way on the front-end it looks to the user like your product is 100% reliable, but it’s just because developers have a system on the backend that alerts them to problems so they can stop them before they directly impact users. Upper management obviously wants the app to run well because that’s what they are selling and the monitoring system allows for that to happen.

How big was the Elasticsearch cluster where you worked before?

The logging cluster that we used at Kenna had 10 data nodes. The cluster we used for searching client data was even bigger. It was a 21-node cluster.

What were some of the problems when it came to managing this large cluster?

You want to define what you are logging and make it systematic. Early on at Kenna, when we were logging user information, we would end up with a ton of different keys, which created more work for Elasticsearch. This also makes searching and using the data nearly impossible. To avoid this, you need to come up with a logging system by defining keys and making sure that everyone uses those keys when they are logging data.
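One way to enforce a shared set of keys is to route all logging through a small helper that rejects anything outside the agreed list. The Python sketch below assumes a hypothetical key set and function name; the keys Kenna actually used are not given in the interview.

```python
import json

# Hypothetical set of agreed-upon log keys; a real team would define its own.
ALLOWED_KEYS = {"timestamp", "level", "message", "user_id", "request_id"}


def build_log_entry(**fields):
    """Return a JSON log line, rejecting any key outside the agreed set."""
    unknown = set(fields) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"Unknown log keys: {sorted(unknown)}")
    return json.dumps(fields, sort_keys=True)


entry = build_log_entry(level="info", message="login succeeded", user_id=42)

# A key nobody agreed on is refused instead of silently creating a new field.
try:
    build_log_entry(level="info", usr="42")
    rejected = False
except ValueError:
    rejected = True
```

Failing loudly at log time keeps the number of distinct keys in the index from growing unnoticed.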

We set up our indexes by date, which is common. When you get a month out from the date on a specific index, you want to shrink it to a single shard, which decreases the resources that Elasticsearch needs in order to use that index. Even further out than that, you should eventually close the index so that Elasticsearch doesn’t need to use any resources for it.
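The age-based policy can be sketched as a small Python function that decides what to do with each date-named index. The index naming pattern (logs-YYYY.MM.DD) and the 90-day close cutoff are assumptions for illustration; the interview only says "a month out" for shrinking and "even further out" for closing. The sketch only makes the decision and does not call Elasticsearch; the actual work would go through its shrink index and close index APIs.

```python
from datetime import date, datetime


def index_date(index_name):
    """Parse the date from a name like 'logs-2019.11.01' (naming is an assumption)."""
    return datetime.strptime(index_name.split("-", 1)[1], "%Y.%m.%d").date()


def lifecycle_action(index_name, today, shrink_after_days=30, close_after_days=90):
    """Return 'keep', 'shrink', or 'close' based on the age of the index."""
    age_days = (today - index_date(index_name)).days
    if age_days >= close_after_days:
        return "close"
    if age_days >= shrink_after_days:
        return "shrink"
    return "keep"


today = date(2019, 12, 15)
actions = {
    name: lifecycle_action(name, today)
    for name in ["logs-2019.12.10", "logs-2019.11.01", "logs-2019.08.01"]
}
```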

Any other best practices for cloud-scale monitoring?

Keep your mapping strict, which can help you avoid problems. If you are doing the searching yourself, try to use filters rather than queries. Filters run a lot faster and are easier on Elasticsearch, so you want to use them when you are searching through data.
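For readers who want to see what these two tips look like in practice, the Python sketch below builds the request bodies as plain dictionaries. The field names are hypothetical, and the mapping layout assumes the typeless format used by Elasticsearch 7 and later. Setting "dynamic" to "strict" makes Elasticsearch reject documents with unmapped fields, and clauses placed under "filter" in a bool query are not scored, which is a large part of why they are cheaper.

```python
# Index settings body with a strict mapping: unknown fields are rejected.
strict_mapping = {
    "mappings": {
        "dynamic": "strict",
        "properties": {
            "timestamp": {"type": "date"},
            "level": {"type": "keyword"},
            "message": {"type": "text"},
        },
    }
}


def build_filtered_search(level, since):
    """Build a search body that uses only filter clauses (no relevance scoring)."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"level": level}},
                    {"range": {"timestamp": {"gte": since}}},
                ]
            }
        }
    }


# Hypothetical usage: all error-level log lines from the last hour.
search_body = build_filtered_search("error", "now-1h")
```

These dictionaries could be sent as JSON to the create index and search endpoints with whichever HTTP or Elasticsearch client a team already uses.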

Finally, educating your users on how to use Elasticsearch is important. If developers don’t know how to use it correctly, Elasticsearch will time out. So, teach users how to search keys, analyzed fields, unanalyzed fields, etc. This will also help your users get the targeted, accurate data they are looking for, so the education is for their benefit as well. The users referred to here are internal users at Kenna, who conducted searches through Kibana. Clients, after training, worked with the data relevant to them through an interface that the Kenna team built, which prevented them from doing things that could take down the entire cluster.

So are you using Elasticsearch in your current role at DEV?

DEV is currently using a paid search tool, but we hope to switch to Elasticsearch because it is open source and will give us more control over our data and how we search it.

There’s an affordable solution for achieving the best practices described

Molly described the importance of consolidating reporting, responding to alerts, avoiding alert fatigue, automating alerts and reports, and tracking monitoring history. Just two weeks prior to this interview, Shankar gave a presentation about avoiding alert fatigue, and this relevant topic keeps becoming a focus of discussions. Many of the points Molly made, from the importance of automating alerts and reports to the importance of consolidating reporting, are the reasons we started Skedler. 

Are you looking for an affordable way to send periodic reports from Elasticsearch to users when they need them? Try Skedler Reports for free!

Do you want to automate the monitoring of Elasticsearch data and notify users of anomalies in the data even when they aren’t in front of their dashboards? Sign up for a free trial of Skedler Alerts!

We hope you are enjoying our podcast so far. Happy holidays to all of our listeners. We will be taking a short break, but will be back with new episodes of The Infralytics Show in 2020!