Observability 101 – How is it Different from Monitoring

Monitoring IT infrastructure used to be a fairly complicated task because it required constant vigilance: software continuously scanned a network for outages, inefficiencies, and other potential problems, and then logged them. A qualified SOC team then had to check each of these logs to identify any issues. This led to several common problems, such as alert fatigue and false flags – both of which we’ll discuss more later – and burnout was prevalent. In fact, these three issues (fatigue, flags, and burnout) have only grown as our interconnectivity has increased. Much as the airline industry has had to answer increased security risks with tougher identification and authorization measures, our increasing connectivity brings security risks that require more stringent identification and authorization measures, adding to the workload of SOC teams.


What does monitoring do? It lets us know if there are latency issues; it lets us know if we’ve had a jump in TCP connections. And while these are important notifications, they are no longer enough. Secure systems do not remain secure unless they are also maintained. Security teams need a system that can monitor all of these interconnected components. This is where observability comes in.

What is monitoring?

Observability is the capacity to deduce a system’s internal states. Monitoring consists of the actions involved in observability: perceiving the quality of system performance over a period of time. The tools and processes that support monitoring can deduce the performance, health, and other relevant criteria of a system’s internal states. Monitoring specifically refers to the process of analyzing infrastructure logs and metrics data.

A system’s observability tells you how well the infrastructure logs and metrics can reveal the performance criteria connected with critical components. Monitoring analyzes those logs and metrics in order to take actions and deliver insights.

If you want to monitor your system and keep all the important data in one place, Grafana will help you organize and visualize your data! To learn more about Grafana, check this blog.

What is Observability?

Observability is the capacity to deduce the internal states of a system based on the system’s external outputs. In control theory, observability is a mathematical dual to controllability, which is the ability to control the internal states of a system by influencing external inputs. 
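
For readers who want the control-theory version spelled out, the textbook case is a linear time-invariant system. The block below states the standard rank test for observability. The symbols A, B, C, x, u, y, and n are not from this article; they are the usual notation, introduced here only for illustration.

```latex
\dot{x}(t) = A\,x(t) + B\,u(t), \qquad y(t) = C\,x(t), \qquad x(t) \in \mathbb{R}^{n}

\mathcal{O} =
\begin{bmatrix} C \\ CA \\ CA^{2} \\ \vdots \\ CA^{n-1} \end{bmatrix},
\qquad
\text{the system is observable} \iff \operatorname{rank}(\mathcal{O}) = n
```

In words: if the outputs y carry enough information to pin down all n internal state variables, the system is observable. The duality mentioned above shows up here too, since the pair (A, C) is observable exactly when the pair (A^T, C^T) is controllable.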

Distributed infrastructure components operate in multiple conceptual layers of software and virtualization. Therefore it is challenging, and often not feasible, to analyze and compute system controllability.

Observability has three basic pillars: metrics, logs, and tracing. Observability employs all three of these to create a more holistic, end-to-end look at an entire system, using multiple point tools to accomplish this.

Comparing observability and monitoring

People are always curious about observability and its difference from monitoring. Let’s take a large, complex data center infrastructure system that is monitored using log analysis, monitoring, and ITSM tools. Monitoring multiple data points continuously will create a large number of unnecessary alerts, data, and red flags. Unless the correct metrics are evaluated and the redundant noise is carefully filtered out by the monitoring solutions, the infrastructure may have low observability characteristics.

A single server machine can be easily monitored using metrics and parameters like energy consumption, temperature, transfer rates, and speed. The health of internal system components is highly correlated with these parameters. Therefore, the system has high observability. Even with some basic monitoring criteria, such as energy and temperature measurement, the performance, life expectancy, and risk of potential performance incidents can be evaluated.

Observability in DevOps

The concept of observability is very important in DevOps methodologies. In earlier frameworks like waterfall and agile, developers created new features and product lines while separate teams worked on testing and operations for software dependability. This compartmentalized approach meant that operations and monitoring activities were outside the scope of development. Projects were designed for success and not for failure, i.e., debugging of the code was rarely a primary consideration. Developers had no proper understanding of infrastructure dependencies and application semantics. Apps and services were built with low dependability.

Monitoring ultimately failed to give sufficient information about the familiar unknowns of a distributed infrastructure system, let alone the unfamiliar unknowns.

The popularity of DevOps has transformed the SDLC. Monitoring is no longer limited to just collecting and processing log data, metrics, and event traces but is now used to make the system more transparent, i.e., observable.

The scope of observability encapsulates the development segment which is also aided by people, processes, and technologies operating across the pipeline.

Conclusion

Collaboration of cross-functional teams such as Devs, ITOps, and QA personnel is very important when designing a dependable system. Communication and feedback between developers and operations teams are necessary to achieve observability targets of the system that will help QA yield correct and insightful monitoring during the testing phase. In turn, DevOps teams can test systems and solutions for true real-world performance. Constant iteration based on feedback can further enhance IT’s ability to identify potential issues in the systems before the impact reaches end-users.

Observability has a strong human component involved, similar to DevOps. It’s not limited to technologies but also covers the approach, organizational culture, and priorities in reaching appropriate observability targets, and hence, the value of monitoring initiatives.

Keep your system as transparent as possible, track your system health, and monitor your data with Grafana or Kibana. Also, keep your stakeholders happy with professional reporting! Try our new and improved Skedler for custom-generated Grafana reports for free!


Episode 8 – How to Build a Cloud-Scale Monitoring System

In Episode 8 of the Infralytics Show, Shankar interviewed Molly Struve. Molly is the Lead Site Reliability Engineer for DEV Community, an online portal designed as a place where programmers can exchange ideas to help each other. The discussion focused on two topics, “How to build a cloud-scale monitoring system” and “How to scale your Elastic Stack for cloud-scale monitoring.” 


How Molly started working in software engineering and cloud-scale monitoring

Molly earned an aerospace degree from MIT after originally thinking she would study software engineering. She said that all engineering degrees provide students with the same core problem-solving skills, so when she later decided to work in the software engineering field, she already had the problem-solving background she needed to make the transition. She didn’t end up going the aerospace route because you have to be located in California or Washington, where the aerospace industry is, but she is from Chicago and didn’t really want to move. It’s good to know that people with different educational backgrounds have still been able to find success in software engineering!

Let’s jump into the discussion of cloud-scale monitoring! Here are the key points Molly made in reference to the topics listed above.

The Interview – building a cloud-scale monitoring system

What are some of the key requirements to look for when you build out a large cloud-scale monitoring system?

When you start monitoring, you just want coverage, and to do that you often start adding all of these different tools and before you know it you have 6, 7, or 8 different tools doing all this monitoring. However, when the time comes to use it you have to open up all these different windows in your browser just to piece together what is actually going on in your system. So, one of the key things she tells people when they are building a monitoring system is that they have to consolidate all of the reporting. You can have different tools, but you need to consolidate the reporting to a single place. Make sure everything’s in one place so it’s a one stop shop to go and find all the information you need.
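
To make the "one place" idea concrete, here is a minimal sketch of what consolidation can look like in code. It is not how Molly's team or any particular tool does it. The `Event` shape, the two adapter functions, and the payload fields are all hypothetical; the point is only that each tool's output gets normalized into one common record and merged into a single feed.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Event:
    source: str    # which tool reported it
    severity: str  # normalized to "info", "warning", or "critical"
    message: str


# Hypothetical payload shapes; every real tool has its own format.
def from_error_tracker(payload: dict[str, Any]) -> Event:
    return Event("error_tracker", "critical", payload["error"])


def from_uptime_check(payload: dict[str, Any]) -> Event:
    severity = "info" if payload["up"] else "critical"
    return Event("uptime_check", severity, f"{payload['url']} up={payload['up']}")


def consolidate(events: list[Event]) -> list[Event]:
    """Return one feed, most severe first, keeping arrival order within a severity."""
    rank = {"critical": 0, "warning": 1, "info": 2}
    return sorted(events, key=lambda e: rank[e.severity])


feed = consolidate([
    from_uptime_check({"url": "https://example.com", "up": True}),
    from_error_tracker({"error": "NoMethodError in checkout"}),
])
for event in feed:
    print(event.severity, event.source, event.message)
```

Adding another tool then means writing one more small adapter, while the place people look for information stays the same.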

When an alert triggers, it must require an action; otherwise alert fatigue sets in, and it is a big problem in many monitoring systems. When you have a small team it might seem fine to have exceptions that everyone knows about, where you don’t respond to certain alerts, but as your team gets larger you have to tell new engineers what the exceptions are, and this process simply doesn’t scale. So you have to be very disciplined in responding to alerts.
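
One way to enforce that discipline is to refuse alert definitions that do not say what the on-call engineer should do. The sketch below is an illustration under our own assumptions: the `AlertRule` fields and the example rules are made up, not taken from the interview.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    name: str
    condition: str    # human-readable trigger, e.g. "error rate > 1%"
    action: str = ""  # what the on-call engineer must do when it fires


def rules_without_action(rules: list[AlertRule]) -> list[str]:
    """Names of rules that would fire without telling anyone what to do."""
    return [r.name for r in rules if not r.action.strip()]


# Hypothetical rules for illustration only.
rules = [
    AlertRule("queue_backlog", "jobs waiting > 10000", "scale workers, then check the slowest job"),
    AlertRule("disk_70_percent", "disk usage > 70%"),
]
print(rules_without_action(rules))  # ['disk_70_percent']
```

A check like this could run whenever alert definitions change, so a rule with no action gets fixed or deleted instead of becoming one more exception that new engineers have to be told about.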

The goal is to get to a point where whoever is on call, whether it’s one person, two people, or three people, can handle the error workload that is coming into the system by way of alerts. 

In the beginning, when you are setting up a monitoring system you might have a lot of errors, and you just have to fix stuff and the improvement of the system comes with time. The ideal metric is zero errors, so you need to be aware of when errors get to a point where they need to be addressed.

Monitoring from an infrastructure perspective is different from monitoring from a security perspective

Trying to figure out what to monitor is also very challenging. You have to set up your monitoring and adjust it as you go, depending on what perspective you are monitoring for. Knowing what to monitor is a little bit based on trial and error. If an error occurs and there is data you wish you had been monitoring, you can address the error and then go in and add the necessary code so that the data is there in the future. After you do that a few times you will end up with a really robust system, so the next time an error occurs, all the information you need will be there and it might only take you a few minutes to figure out what’s wrong.

Beyond bringing the data together and optimizing alerting, what are the other best practices?

Another best practice is tracking monitoring history. When trying to solve the error from an alert, you will want to know what the past behavior was. Past behavior can help you debug a problem. What were you alerted about in the past and how was the problem addressed then?
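
As a rough sketch of what tracking history can mean in practice, the snippet below keeps past incidents per alert and returns the most recent ones with how they were resolved. The class and field names are hypothetical, and a real system would persist this somewhere rather than hold it in memory.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Incident:
    alert_name: str
    fired_at: datetime
    resolution: str


class AlertHistory:
    def __init__(self) -> None:
        self._by_alert: dict[str, list[Incident]] = defaultdict(list)

    def record(self, incident: Incident) -> None:
        self._by_alert[incident.alert_name].append(incident)

    def past(self, alert_name: str, limit: int = 3) -> list[Incident]:
        """Most recent incidents for this alert, newest first."""
        incidents = sorted(
            self._by_alert.get(alert_name, []),
            key=lambda i: i.fired_at,
            reverse=True,
        )
        return incidents[:limit]


history = AlertHistory()
history.record(Incident("queue_backlog", datetime(2019, 10, 2, 14, 0), "restarted stuck worker"))
history.record(Incident("queue_backlog", datetime(2019, 11, 20, 9, 30), "added a worker"))
for incident in history.past("queue_backlog"):
    print(incident.fired_at.date(), incident.resolution)
```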

Also, you have to remove all manual monitoring for your monitoring system to be truly scalable. Some systems require employees to check a dashboard every few hours, but this task is easily forgotten. So, if you want a monitoring system to scale you have to remove all manual monitoring. You don’t want to rely on someone opening up a file or checking a dashboard to find a problem. The problem should automatically come to you or whoever is tasked with addressing it. 

What tools did you use to automate?

At Kenna we used Datadog. It’s super simple, and it integrates really easily with Ruby, which is the language I primarily work with.

Anything else important on the topic of best practices for cloud-scale monitoring?

Having the ability to mute alerts when you are in the process of fixing them is important. When a developer is trying to fix a problem, it’s distracting to have an alert going off repeatedly every half hour. Having the ability to mute an alert for a set amount of time like an hour or a day can be very helpful. 
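
Here is a minimal sketch of a time-boxed mute, to show the idea rather than any specific tool's feature. The `MuteList` class is hypothetical. The current time is passed in explicitly so the behavior is easy to follow and test.

```python
from datetime import datetime, timedelta


class MuteList:
    def __init__(self) -> None:
        self._muted_until: dict[str, datetime] = {}

    def mute(self, alert_name: str, duration: timedelta, now: datetime) -> None:
        self._muted_until[alert_name] = now + duration

    def is_muted(self, alert_name: str, now: datetime) -> bool:
        until = self._muted_until.get(alert_name)
        if until is None:
            return False
        if now >= until:
            del self._muted_until[alert_name]  # mute expired; the alert resumes
            return False
        return True


mutes = MuteList()
start = datetime(2019, 12, 1, 9, 0)
mutes.mute("queue_backlog", timedelta(hours=1), now=start)
print(mutes.is_muted("queue_backlog", now=start + timedelta(minutes=30)))  # True
print(mutes.is_muted("queue_backlog", now=start + timedelta(minutes=61)))  # False
```

Because the mute always expires, a forgotten mute cannot silence an alert forever, which keeps muting from turning into another kind of exception.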

What else is part of your monitoring stack?

The list goes on and on. You can use Honeybadger for application errors, AWS metrics for your low-level infrastructure metrics, StatusCake for your APIs to make sure your actual site is up, Elasticsearch for monitoring, and CircleCI for continuous integration. It’s a large list of many different tools, but we consolidated them all through Datadog.

What kind of metrics did your management team look for?

Having a great monitoring system allows you to catch incidents and problems before they become massive problems. It’s best to be able to fix issues before the point at which you would have to alert users to the problem. You want to solve problems before they impact your user base. That way on the front-end it looks to the user like your product is 100% reliable, but it’s just because developers have a system on the backend that alerts them to problems so they can stop them before they directly impact users. Upper management obviously wants the app to run well because that’s what they are selling and the monitoring system allows for that to happen.

How big was the Elasticsearch cluster where you worked before?

The logging cluster that we used at Kenna had 10 data nodes. The cluster we used for searching client data was even bigger. It was a 21-node cluster.

What were some of the problems when it came to managing this large cluster?

You want to define what you are logging and make it systematic. Early on at Kenna, when we were logging user information, we would end up with a ton of different keys, which created more work for Elasticsearch. This also makes searching and using the data nearly impossible. To avoid this you need to come up with a logging system by defining keys and making sure that everyone uses those keys when they are in the system and logging data.
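
A simple way to hold everyone to the agreed keys is to build log lines through one helper that rejects anything else. This is a sketch under our own assumptions: the key names below are invented for illustration, and each team would define its own list.

```python
import json

# Hypothetical agreed-upon keys; a real team would define its own list.
ALLOWED_KEYS = frozenset({"timestamp", "level", "message", "user_id", "request_id"})


def build_log_line(**fields: object) -> str:
    """Serialize a log entry, refusing any key that has not been defined."""
    unknown = set(fields) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"undefined log keys: {sorted(unknown)}")
    return json.dumps(fields, sort_keys=True)


print(build_log_line(level="info", message="login", user_id=42))
# {"level": "info", "message": "login", "user_id": 42}
```

With a helper like this, a call such as `build_log_line(level="info", userId=42)` raises an error at the call site instead of quietly adding yet another key to the index.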

We set up our indexes by date, which is common. When you get a month out from the date on a specific index, you want to shrink it to a single shard, which decreases the resources that Elasticsearch needs in order to use that index. Even further out than that, you eventually should close the index so that Elasticsearch doesn’t need to use any resources for it.
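
The decision itself is easy to automate. The sketch below only decides what should happen to an index based on its age; the actual shrink and close requests to Elasticsearch are deliberately left out. The index naming scheme (`logs-2019.11.01`) and the 90-day close threshold are assumptions for illustration, since the interview only says "a month out" and "even further out".

```python
from datetime import date

SHRINK_AFTER_DAYS = 30  # "a month out", per the interview
CLOSE_AFTER_DAYS = 90   # assumed; the interview only says "even further out"


def index_date(index_name: str) -> date:
    """Parse the date from a name like 'logs-2019.11.01' (hypothetical scheme)."""
    year, month, day = index_name.rsplit("-", 1)[1].split(".")
    return date(int(year), int(month), int(day))


def lifecycle_action(index_name: str, today: date) -> str:
    """Return 'keep', 'shrink', or 'close' depending on the index's age."""
    age_days = (today - index_date(index_name)).days
    if age_days >= CLOSE_AFTER_DAYS:
        return "close"
    if age_days >= SHRINK_AFTER_DAYS:
        return "shrink"
    return "keep"


today = date(2019, 12, 15)
for name in ["logs-2019.12.01", "logs-2019.11.01", "logs-2019.08.01"]:
    print(name, lifecycle_action(name, today))
```

Run on a schedule, a function like this removes one more manual chore from cluster maintenance.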

Any other best practices for cloud-scale monitoring?

Keep your mapping strict, which can help you avoid problems. If you are doing the searching yourself, try to use filters rather than queries. Filters run a lot faster and are easier on Elasticsearch, so you want to use them when you are searching through data.
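
To show roughly what these two tips look like, here are the request bodies written as plain Python dictionaries, not calls to any client library. The `"dynamic": "strict"` mapping setting and the `bool` query's `filter` clause are part of Elasticsearch's mapping and query DSL; the field names (`level`, `message`, `timestamp`) are made up for this example.

```python
# Mapping body that rejects documents containing undeclared fields.
strict_mapping = {
    "dynamic": "strict",
    "properties": {
        "level": {"type": "keyword"},
        "message": {"type": "text"},
        "timestamp": {"type": "date"},
    },
}


def filtered_search(level: str, since: str) -> dict:
    """Build a search body whose clauses all run in filter context."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"level": level}},
                    {"range": {"timestamp": {"gte": since}}},
                ]
            }
        }
    }


body = filtered_search("error", "now-1h")
print(body["query"]["bool"]["filter"])
```

Clauses in filter context only answer yes or no for each document and skip relevance scoring, which is the main reason they are lighter on the cluster than scored queries.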

Finally, educating your users on how to use Elasticsearch is important. If developers don’t know how to use it correctly, Elasticsearch will time out. So, teach users how to search keys, analyzed fields, unanalyzed fields, etc. This also helps your users get the targeted, accurate data they are looking for, so educating them is for their benefit as well. The internal users at Kenna referred to here were conducting searches through Kibana. Clients, after training, would work with the data relevant to them through an interface that the Kenna team built, which prevented them from doing things that could take down the entire cluster.

So are you using Elasticsearch in your current role at DEV?

DEV is currently using a paid search tool, but we hope to switch to Elasticsearch because it is open source and will give us more control over our data and how we search it.

There’s an affordable solution for achieving the best practices described

Molly described the importance of consolidating reporting, responding to alerts, avoiding alert fatigue, automating alerts and reports, and tracking monitoring history. Just two weeks prior to this interview, Shankar gave a presentation about avoiding alert fatigue, and this relevant topic keeps becoming a focus of discussions. Many of the points Molly made, from the importance of automating alerts and reports to the importance of consolidating reporting, are the reasons we started Skedler. 

Are you looking for an affordable way to send periodic reports from Elasticsearch to users when they need them? Try Skedler Reports for free!

Do you want to automate the monitoring of Elasticsearch data and notify users of anomalies in the data even when they aren’t in front of their dashboards? Sign up for a free trial of Skedler Alerts!

We hope you are enjoying our podcast so far. Happy holidays to all of our listeners. We will be taking a short break, but will be back with new episodes of The Infralytics Show in 2020!
