
DevOps Best Practices for Observability: A Complete Guide

What is Observability in DevOps?

DevOps is a set of technical, process, and cultural capabilities that enable IT teams to improve organizational performance and deliver software faster. The DevOps Research and Assessment (DORA) program at Google Cloud has identified twenty-seven distinct capabilities that are critical for a successful DevOps operation.

Monitoring and observability are key technical capabilities for DevOps teams. They enable organizations to understand their production systems, maintain SLAs, and debug problems faster.

Monitoring relies on collecting logs and metrics, which lets IT teams observe the state of their systems and receive alerts when a problem arises.

Observability, on the other hand, consists of the tools and methodologies that allow IT teams to actively troubleshoot and debug today’s distributed applications. The term “observability” was coined by the Hungarian-American engineer Rudolf E. Kálmán as a measure of how well the internal state of a system can be inferred from its external outputs.

The Importance of Observability in DevOps

Before we delve deeper into the realm of observability, let us first look at why you need it and what benefits it brings.

The main benefit of observability is that it enables DevOps teams to maintain service level agreements (SLAs) for business applications. Observability, when implemented well, helps IT teams gain a deeper understanding of their systems.

Observability tools help teams identify problems when they occur and find their root cause faster. This reduces the Mean Time to Detection (MTTD) and Mean Time to Resolution (MTTR), two key metrics for managing SLAs. Application and system downtime is minimized and stays within the SLA.

The utopian state of observability is when teams are able to identify, troubleshoot, and resolve problems before end users experience them. Imagine a scenario where the system is able to predict a potential bottleneck, identify its root cause, isolate the problematic process, fix it (with or without manual intervention), and bring the process back online without any service level impact on end users. We are not yet in the age of self-healing systems, but observability is a critical step in that direction.

Observability, when done right, can translate into higher revenue, customer satisfaction, and employee productivity for businesses.

Three Pillars of Observability and Their Fatal Flaw

Logs, metrics, and traces are often described as the three pillars of observability. If your application runs at Google or Facebook scale, the three-pillar approach works great. However, it is very likely that your application is not at that scale and might require an alternative approach, which we will discuss later. Let’s first understand the three pillars:

Logs are written records of events occurring in the system. They give you a view of the events and errors the system has experienced, providing context for the problem at hand.
Metrics are aggregated data that show the performance of a system. Changes in metrics help identify common problems in the system.
Traces provide a detailed view of what happens in the various processes of a system when a request or transaction is processed. They help you follow the flow of a request through the system and identify bottlenecks or errors along the way.
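
To make the distinction concrete, here is a minimal Python sketch in which one request emits all three signals. Everything in it is an assumption for illustration: the `handle_checkout` function, the field names, and the in-memory lists that stand in for a real log store, metrics backend, and trace store.

```python
import json
import time
import uuid
from collections import Counter

# Hypothetical in-memory sinks standing in for a log store, a metrics
# backend, and a trace store.
logs = []
metrics = Counter()
spans = []


def handle_checkout(user_id, fail=False):
    """Process one hypothetical request and emit all three signals."""
    trace_id = uuid.uuid4().hex
    start = time.perf_counter()
    status = "error" if fail else "ok"

    # Log: a discrete event record that carries context.
    logs.append(json.dumps({
        "trace_id": trace_id,
        "user_id": user_id,
        "event": "checkout",
        "status": status,
    }))

    # Metric: an aggregate counter, cheap to store but stripped of context.
    metrics[f"checkout.{status}"] += 1

    # Trace span: timing for one step of the request, keyed by trace_id.
    spans.append({
        "trace_id": trace_id,
        "name": "handle_checkout",
        "duration_s": time.perf_counter() - start,
    })
    return trace_id


handle_checkout("u1")
handle_checkout("u2", fail=True)
```

Note that the metric alone can say that one checkout failed, but only the log and the span, joined on `trace_id`, can say which one and where the time went.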

Ben Sigelman, co-founder of LightStep (an observability platform acquired by ServiceNow), argues that the three-pillar approach has fatal flaws: it leaves the complex job of analyzing the data to the analyst or engineer, and that task isn’t trivial.

The goal of an observability solution should be to make it easy for the analyst/engineer to follow the breadcrumbs, understand what’s going on with their system (without making any changes to the system) and troubleshoot problems. The team should be able to keep the systems and business running smoothly with minimal interruption and without backbreaking effort.

So, how can DevOps teams achieve observability for their IT systems? Let’s look at what is needed.

How Can DevOps Teams Achieve Observability for Their Systems?

For most companies, a separate observability team is overkill. Since the goal of such a team would be to deliver the observability capability described above to the entire engineering organization, DevOps teams can allocate resources of their own to achieve it.

The team tasked with building and evangelizing the observability capability should not only build the tools, methodology, and infrastructure for achieving observability, but also train the end users, in this case the engineers. The engineers should learn how to use the capability to troubleshoot, incorporate it into the code they write, and take ownership of maintaining observability for the systems they build.

Therefore, the best way to build the capability is to start with a small team of observability experts who put together the framework for instrumentation that will enable users to understand the state of a system without making any changes to it. Once the team is able to demonstrate the capability for a particular use case or system, they can train the end users and then expand it to other use cases. The key is to start small, deploy, refine, iterate and scale.

Building a Winning Observability Stack for DevOps

Observability is the ability to explain the unknown unknowns. It is less about dashboards and alerts and more about data exploration and debugging by following the breadcrumbs wherever they take the end user.

Therefore, it is not just a solution that provides logging, metrics and tracing capability. If you implement all these capabilities and you are still unable to understand what’s going on in a system, you do not have observability.

A true observability solution has the following characteristics:

Access to raw events
High cardinality data
Ad-hoc aggregation capability
Transaction/request lifecycle visibility
Contextual data
Not static dashboards, but exploratory capability
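
As a rough illustration of the first three characteristics, the Python sketch below keeps raw events with high-cardinality fields and aggregates them along whatever dimensions a question calls for. The event fields, the sample values, and the `aggregate` helper are all made up for this example and are not taken from any particular tool.

```python
from collections import defaultdict
from statistics import median

# Hypothetical raw request events; field names and values are illustrative.
events = [
    {"request_id": "r1", "customer": "acme", "region": "eu", "latency_ms": 120, "status": 200},
    {"request_id": "r2", "customer": "acme", "region": "us", "latency_ms": 950, "status": 500},
    {"request_id": "r3", "customer": "globex", "region": "eu", "latency_ms": 110, "status": 200},
    {"request_id": "r4", "customer": "acme", "region": "us", "latency_ms": 870, "status": 500},
]


def aggregate(events, by, value, fn):
    """Group raw events by any combination of fields and apply fn to value."""
    groups = defaultdict(list)
    for event in events:
        key = tuple(event[field] for field in by)
        groups[key].append(event[value])
    return {key: fn(values) for key, values in groups.items()}


# A question nobody planned a dashboard for:
# is latency bad for one customer in one region?
slow = aggregate(events, by=("customer", "region"), value="latency_ms", fn=median)
```

Because the raw events are retained, the same data can be regrouped by `status`, by `customer` alone, or by any other field without deploying new instrumentation. A pre-aggregated metric would have fixed those dimensions in advance.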

While logging, metrics, and tracing are necessary, they are not sufficient to make a system observable. A number of tools such as Datadog, Grafana, Splunk, New Relic, AppDynamics, LightStep, and more provide observability capabilities of varying depth. The goal of this article is not to shortlist any vendor solution, but rather to help DevOps and observability teams build observability capability in their organization.

Measuring the Success of Your DevOps Observability Implementation

A simple test of any observability effort is to compare the before and after values of Mean Time To Detection (MTTD) and Mean Time To Resolution (MTTR). Your production engineering team must be able to identify and resolve issues faster, with less effort, and without having to change the system in order to find the root cause of a problem.
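
The arithmetic is simple enough to sketch in Python. The incident records below are invented sample data, and the sketch assumes MTTR is measured from the start of the incident to its resolution; some teams measure it from detection instead, so use whichever definition your SLA uses.

```python
from datetime import datetime, timedelta


def mean_delta(incidents, start_key, end_key):
    """Average the time between two timestamps across all incidents."""
    total = sum((i[end_key] - i[start_key] for i in incidents), timedelta())
    return total / len(incidents)


# Invented sample incidents, for illustration only.
incidents = [
    {
        "started": datetime(2024, 1, 5, 10, 0),
        "detected": datetime(2024, 1, 5, 10, 30),
        "resolved": datetime(2024, 1, 5, 12, 0),
    },
    {
        "started": datetime(2024, 1, 9, 14, 0),
        "detected": datetime(2024, 1, 9, 14, 10),
        "resolved": datetime(2024, 1, 9, 15, 0),
    },
]

mttd = mean_delta(incidents, "started", "detected")  # (30 + 10) / 2 = 20 minutes
mttr = mean_delta(incidents, "started", "resolved")  # (120 + 60) / 2 = 90 minutes
```

Running the same calculation on incidents from before and after the observability rollout gives the comparison described above.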

In more mature implementations, organizations should be able to scale their systems by gaining a deep understanding of how they currently behave and using this knowledge to extend them to meet the growing needs of the business.

Best Practices for Enhancing Observability

When it comes to observability, DevOps teams should use the characteristics of a true observability solution listed in the previous sections to develop best practices. Here are a few things to keep in mind.

Start with one or more business or transaction flows.
Let end users traverse all the data related to a specific request or transaction.
Collect all the data that is needed for the flow.
Provide the capability to define metrics as needed to understand the system.
Use dashboards and alerts as just the starting point for ad-hoc exploration.
Empower the end user. Do not make the observability team the bottleneck.
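
Traversing all the data for a single request usually depends on every service tagging its events with a shared identifier. The Python sketch below assumes such a `request_id` field exists, along with a comparable `ts` timestamp. Both field names, the service names, and the sample events are hypothetical.

```python
def request_timeline(events, request_id):
    """Return every event for one request, ordered by timestamp."""
    matching = [e for e in events if e.get("request_id") == request_id]
    return sorted(matching, key=lambda e: e["ts"])


# Hypothetical events collected from several services, in arrival order.
events = [
    {"ts": 3, "service": "payments", "request_id": "r42", "msg": "card declined"},
    {"ts": 1, "service": "gateway", "request_id": "r42", "msg": "POST /checkout"},
    {"ts": 2, "service": "cart", "request_id": "r7", "msg": "cart loaded"},
    {"ts": 2, "service": "orders", "request_id": "r42", "msg": "order created"},
]

timeline = request_timeline(events, "r42")
path = [e["service"] for e in timeline]
```

Here `path` lists the services the request passed through in order, and the last entry in `timeline` carries the error that ended it. If any service drops the identifier, the trail breaks at that point, which is why collecting the data for the whole flow matters.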

Pitfalls to Avoid

It’s very easy for DevOps teams to get started on observability and spend a ton of time implementing platforms but still end up with the same MTTD and MTTR values as before. This can be prevented by avoiding the following pitfalls.

Do not let the DevOps observability team own or monopolize observability. Its members are the champions and enablers for building observability capability across the organization.
Do not start wide across a number of transactional flows. Instead, focus on a few flows and build deep observability into them. Once that succeeds, other flows can adopt the approach.
Do not leave gaps in the data. Collect all the relevant data and context needed for troubleshooting the specific flows.
Avoid rigid tools with static dashboards and alerts that are hard to customize. Add tools that provide flexibility, data exploration, and ease of use to the end users.

Conclusion

Observability is a holistic approach to provide users with the ability to explore and gain a deeper understanding of the system. Contrary to what it may seem, it is not limited to the implementation of tools such as Grafana.

DevOps teams can achieve their goal of delivering software more efficiently and maintaining systems effectively by making their systems observable. They can help scale the business by incorporating and evangelizing observability across the engineering teams.

If you are using Grafana or Elastic Stack as the observability platform for your organization, your management teams need reporting capability to track efficiency metrics as well as key system metrics. Skedler provides an easy-to-use reporting solution that can be added to your observability platform to deliver automated reports to your management team and clients.