Three Pillars of Observability – Metrics (Part 2)

Introduction

In distributed systems, services and servers are spread across multiple clouds, and the users who consume those services keep growing in number, choice of device, and location. Having visibility into the client’s experience while using the application – i.e., observability – is now a vital part of operating the applications in your infrastructure.

What Are Metrics?

A metric is a quantifiable value measured over a period of time and includes specific characteristics like a timestamp, a name, KPIs, and a value. Unlike logs, metrics are structured by default, which makes them easier to query and to optimize for storage, giving you the ability to retain them for more extended periods.

Metrics help answer some of the most fundamental questions of the IT department. Is there a performance issue affecting customers? Are employees having trouble accessing the application? Is traffic volume unusually high? Is the rate of customer churn going up?

Standard metrics include:

  1. System metrics such as CPU usage, memory usage, and disk I/O,
  2. Application metrics such as request rate, error count, and response time,
  3. Business metrics such as revenue, signups, bounce rate, cart abandonment, etc.
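To make the shape of a metric concrete, here is a minimal sketch in Python of a single metric data point; the field names (name, value, timestamp, labels) are illustrative, not tied to any particular tool.

```python
import time

# A hypothetical metric data point: a name, a numeric value, a timestamp,
# and labels that identify where the measurement came from.
metric = {
    "name": "cpu_usage_percent",  # a system metric
    "value": 73.4,
    "timestamp": time.time(),     # seconds since the epoch
    "labels": {"host": "web-01", "region": "us-east-1"},
}
print(metric)
```

This fixed, structured shape is what makes metrics cheap to store and fast to query compared with free-form log lines.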

Different Components of Metrics

Metrics are arguably the most valuable of the three pillars because they’re generated very often and by every module, from operating systems to applications. Correlating them can give you a complete view of an issue, but doing that correlation by hand is a huge and tedious task for human operators.

Data Collection

Most metrics are small and do not consume much space. You can gather them cheaply and store them for an extended period. On their own, they give you a general overview of the whole system without deeper insights.

So, metrics answer the question, “How does my system performance change over time?”

Data Storage

Most people used StatsD along with Graphite as the storage backend. Many now prefer Prometheus, an open-source, metrics-based monitoring system. It does one thing pretty well: with a simple yet powerful data model and query language, it lets you analyze how your applications and infrastructure perform.
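As a rough illustration of that model, here is a minimal sketch using the official prometheus_client Python library; the metric names, label, and port are made up for the example.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metric names and labels, not a production naming scheme.
REQUESTS = Counter("app_requests_total", "Total requests served", ["endpoint"])
LATENCY = Histogram("app_request_latency_seconds", "Request latency", ["endpoint"])

# Expose /metrics on port 8000; Prometheus scrapes it (pull model).
start_http_server(8000)

while True:
    with LATENCY.labels(endpoint="/checkout").time():  # observe the duration
        time.sleep(random.random() / 10)               # simulated work
    REQUESTS.labels(endpoint="/checkout").inc()
```

Once Prometheus scrapes this endpoint, a query such as rate(app_requests_total[5m]) shows the request rate over time.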

Visualization and Reporting

I would also consider visualization a part of this pillar, as it goes hand in hand with metrics.

Grafana is commonly used to visualize the data scraped by sources like Prometheus, which acts as a data source for Grafana and works on a pull model. You can also use Kibana as your visualization tool; it primarily supports the Elastic Stack.

And you can use Skedler to generate reports from these visualizations to share with your stakeholders.

Skedler offers a simple and effective way to add reporting to your Elasticsearch Kibana (including Open Distro for Elasticsearch) or Grafana applications deployed on Kubernetes.

You can deploy Skedler on air-gapped, private, or public cloud environments with Docker or a VM on various flavors of Linux.

Skedler is easy to install, configure, and use with Kibana or Grafana. Skedler’s no-code drag-n-drop UI generates PDF, CSV, or Excel reports from Kibana or Grafana in minutes and saves up to 10 hours per week.

Try our new and improved Skedler for custom-generated Grafana or Kibana reports for free!

Download Skedler

Conclusion

Metrics are the entry point to all monitoring platforms, built on data collected from CPU, memory, disk, networks, etc. They no longer belong only to operations: metrics can be created by anyone and any system in the distributed network. For instance, a developer may opt to expose application-specific data such as the number of tasks performed, the time required to complete them, and their status. The objective is to link these data points across different levels of the system and define an application profile that identifies the architecture the distributed system needs. This leads to improved performance, reliability, and security system-wide.

Metrics that development teams use to identify points in the source code needing improvement can also be used by operators to assess system requirements, plan the capacity needed to support user demand, and help the team control and enhance the adoption and use of the application.

Three Pillars of Observability – Logs

Introduction

Observability evaluates what’s happening in your software from the outside. The term describes one cohesive capability. The goal of observability is to help you see the condition of your entire system.

Observability needs information from metrics, traces, and logs – the three pillars. When you combine these three “pillars,” a remarkable ability to understand the whole state of your system emerges – insight that might go unnoticed within any pillar on its own. Some observability solutions put all this information together but present it as separate capabilities, leaving it to the observer to draw the connections. Observability isn’t just about monitoring each of these pillars one at a time; it’s the ability to see the whole picture, how the pieces fit together like a puzzle, to show you the actual state of your system.

The Three Pillars of Observability

As mentioned earlier, there are three pillars of observability: Logs, Metrics, and Traces.

Logs are the archival records of your system’s functions and errors. They are always time-stamped and come as binary, plain text, or a structured format that combines text and metadata. Logs allow you to look back and see what went wrong and where within a system.

Metrics can be a wide range of values monitored over time. Metrics are often vital performance indicators such as CPU capacity, memory usage, latency, or anything else that provides insight into the health and performance of your system. Changes in these metrics allow teams to understand the system’s end performance better. Metrics offer modern businesses a measurable means to improve the user experience.

Traces are a method to follow a user’s journey through your application. A trace documents the user’s interactions and requests within the system, from the user interface through the backend systems and back to the user once the request is processed.

This is a three-part blog series on these three pillars of observability. In this first part, we will dive into logs.

Check out this article to learn more about observability.

The First Pillar – Logs

In this part of the blog, we will go through the first pillar of Observability – Logs. 

Logs consist of the structured and unstructured data a system produces when specific programs run. Overall, you can think of a log as a database of events within an application. Logs help diagnose unpredictable and irregular behaviors of the components in a system.

They are relatively easy to generate. Almost all application frameworks, libraries, and languages support logging. In a distributed system, every component generates logs of actions and events at any point.

Log files contain complete system details, such as a fault and the specific time the fault occurred. By examining the logs, you can troubleshoot your program and identify where and why the error occurred. Logs are also helpful for troubleshooting security incidents in load balancers, caches, and databases.

Logs play a crucial role in understanding your system’s performance and health. Good logging practice is essential to power a good observability platform across your system design. Monitoring involves the collection and analysis of logs and system metrics, and log analysis is the process of deriving information from those logs. To conduct a proper log analysis, you first need to generate the logs, collect them, and store them. Two things developers need to get better at are what to log and how to log it.

But one problem with logging is the sheer amount of logged data and the inability to search through it all efficiently. Storing and analyzing logs is expensive, so it’s essential to log only the information necessary to identify and manage issues. It also helps to categorize log messages into priority buckets called logging levels, such as Error, Warn, Info, Debug, and Trace. Logging helps us understand the system better and helps set up the necessary monitoring alerts.
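As a small sketch of logging levels in practice, here is how Python’s standard logging module filters messages by severity; the logger name and threshold are illustrative, and note that stock Python has no Trace level below Debug.

```python
import logging

# Illustrative configuration: keep Info and above, drop Debug to save storage.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
log = logging.getLogger("checkout-service")  # hypothetical service name

log.debug("cart contents: %s", ["sku-123"])  # filtered out at the INFO level
log.info("order received order_id=%s", 42)
log.warning("payment retry attempt=%d", 2)
log.error("payment failed order_id=%s", 42)
```

Raising the threshold to WARNING in production is one simple way to trade detail for lower storage cost.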

Insights from Logs

You need to know what happened in the software to troubleshoot system- or software-level issues. Logs give information about what happened before, during, and after an error occurred.

A trained eye monitoring the logs can tell what went wrong during a specific time segment within a particular piece of software.

Of the three pillars, logs offer analysis at the most granular level. You can use logs to discover the primary causes of your system’s errors and find out why they occurred. There are many tools available for log management.

You can then monitor logs using Grafana, Kibana, or any other visualization tool.

The Logs app in Kibana helps you search, filter, and tail all your logs stored in Elasticsearch. Log panels in Grafana are also very useful when you want to see correlations between visualized data and logs at a given time. You can also filter your logs for a specific term, label, or time period.
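To make that concrete, here is a hedged sketch of pulling the last hour of error-level logs out of Elasticsearch with the official Python client (8.x-style API); the index pattern and field names (logs-*, level, @timestamp) are assumptions about your setup.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# Fetch recent error logs: match on a level field, filter by time range.
resp = es.search(
    index="logs-*",
    query={
        "bool": {
            "must": [{"match": {"level": "error"}}],
            "filter": [{"range": {"@timestamp": {"gte": "now-1h"}}}],
        }
    },
    size=20,
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"])
```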

Check out the 3 best Grafana reporting tools here.

Limitations of Logs

Logs show what is happening in a specific program. For companies running microservices, the issue may not lie within a given service but in how different services connect. Logs alone may show the problem but not how often it has occurred. Saving logs that go back a long time can also increase costs because of the storage required to keep all the information.

Similarly, spinning up new containers or instances to handle client activity means increased logging and storage costs.

To solve this issue, you need to look to another of the three pillars of observability: metrics. We will cover metrics in the second part of our observability series. Stay tuned to learn more about observability.

Try our new and improved Skedler for custom-generated Grafana reports for free!

Download Skedler

OpenTelemetry 101

With a mindset shift in most organizations adopting DevOps & Agile practices, one of the usual starting points in their transformation journey was to break down their monolith into several microservices. This not only helped with the continuous integration, delivery, and testing tenet but also expedited changes that previously took weeks or even months to execute. However, this architectural transformation presented its own set of monitoring challenges. For traditional architectures, monitoring was relegated to understanding a known set of failures based on usage thresholds & parsing content from logs. With the architectural shift, monitoring alone no longer served the purpose of understanding the state of the system, given that failures within the newer systems were never linear.

Observability & Telemetry

Thus was born observability as a discipline to complement monitoring with its data-driven approach. In short, monitoring systems didn’t die out but were supplemented with more data to understand the internal state of the architecture & navigate from effect to cause more easily with the discipline of observability. Founded on the three pillars of logs, metrics, and tracing, commonly known as telemetry data, observability systems enabled us to understand our systems & their failures better.

However, with an increasing demand for observability systems, there also was another challenge on the rise – the lack of standardization in the offerings. In addition to this, the ones that were adopted lacked portability across languages. A combination of the above two challenges resulted in the overhead of implementations being maintained by the developer/SRE staff within the organization contributing to an increase in complexity & workload.

Thus was born OpenTelemetry: Built-in, high-quality telemetry for all

In 2019, the maintainers of OpenTracing (a CNCF vendor-agnostic tracing project) & OpenCensus (a vendor-agnostic tracing & metrics library led by Google) merged the two projects to solve some of these challenges and standardize the telemetry ecosystem with OpenTelemetry. As outlined in this excellent announcement post, the vision of the project was to provide a unified set of instrumentation libraries and specifications towards providing built-in, high-quality telemetry for all.

With an open, vendor-agnostic standard that was backward-compatible with both of its founding projects, OpenTelemetry’s aim was to allow for cross-platform & streamlined observability, letting teams focus on delivering reliable software without getting mired in the various available options. Because in the end, isn’t that the end goal of every business?

The nitty-gritty

A CNCF incubating project as of writing this post, OpenTelemetry is composed of the following main components as of v0.11.0, released on 8th October 2021:

  1. Proto files to define language-independent interface types such as collectors, instrumentation libraries, etc.
  2. Specifications to describe the cross-language requirements for all implementations. 
  3. APIs containing the interfaces & implementations of the specifications
  4. SDKs implementing the APIs with processing & exporting capabilities
  5. Collectors to receive, process, and export the telemetry data in a vendor-agnostic manner
  6. Instrumentation libraries towards enabling observability for other libraries via manual & automatic instrumentation

As aforementioned, both manual & automatic instrumentation are supported. Automatic instrumentation, being the simpler of the two, involves only the addition of dependencies and configuration via environment variables or language-specific means such as system properties in Java. Manual instrumentation, on the other hand, involves taking code dependencies on the API & SDK components in addition to actually creating & exporting the telemetry data. While extremely useful, a significant drawback is that manual instrumentation can lead to redundancies & inconsistencies in how we treat observability data, along with being a massive expenditure of manual effort.
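For a feel of what manual instrumentation involves, here is a minimal sketch using the OpenTelemetry Python API & SDK; the span name, attribute, and console exporter are illustrative choices, and a real deployment would typically export to a collector instead.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Wire up the SDK: a provider with a processor that exports finished spans.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # hypothetical service name

# Create and export a span around a unit of work.
with tracer.start_as_current_span("process-order") as span:
    span.set_attribute("order.id", 42)  # illustrative attribute
    # ... the actual work would happen here ...
```

Automatic instrumentation achieves a similar result for supported libraries without these explicit code changes.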

So where are we headed?

As of today, 14 vendors support OpenTelemetry. With a focus on developing the project on a signal-by-signal basis, the project aims to stabilize & improve LTS for instrumentation. With support for over 11 languages, efforts are also underway to expand & improve instrumentation across a wider variety of libraries, as well as to incorporate testing & CI/CD tooling for writing & verifying the quality of the instrumentation offered.

With a vibrant community & extensive documentation around the project, there has never been a better time to get involved in the push towards standardized, built-in, high-quality telemetry.

Keep your system as transparent as possible, track your system health, and monitor your data with Grafana or Kibana. Also, keep your stakeholders happy with professional reporting! Try our new and improved Skedler for custom-generated Grafana reports for free!

Download Skedler

Observability 101 – How is it Different from Monitoring

Monitoring IT infrastructure was, in the past, a fairly complicated thing because it required constant vigilance: software continuously scanned a network, looking for outages, inefficiencies, and other potential problems, and then logged them. Each of these logs had to be checked by a qualified SOC team, which would then identify any issues. This led to several common problems, such as alert fatigue and false flags – both of which we’ll discuss more later – and burnout was prevalent. In fact, these three issues (fatigue, false flags, and burnout) have only grown as our interconnectivity has grown. Much like the airline industry, where growing connectivity brought increased security risks and tougher identification and authorization measures, our increasing connectivity presents security risks that require more stringent identification and authorization measures, adding to the workload of SOC teams.


What does monitoring do? It lets us know if there are latency issues; it lets us know if we’ve had a jump in TCP connections. And while these are important notifications, they are no longer enough. Secure systems do not remain secure unless they are also maintained. Security teams need a system that can monitor all of these interconnected components. This is where observability comes in.

What is Monitoring?

Observability is the capacity to deduce a system’s internal states. Monitoring comprises the actions involved in observability: perceiving the quality of system performance over a period of time. The tools and processes that support monitoring can deduce the performance, health, and other relevant criteria of a system’s internal states. Monitoring specifically refers to the process of analyzing infrastructure logs and metrics data.

A system’s observability reflects how well its infrastructure logs and metrics can surface the performance criteria connected with critical components. Monitoring helps analyze those logs and metrics to take action and deliver insights.

If you want to monitor your system and keep all the important data in one place, Grafana will help you organize and visualize your data! To learn more about Grafana, check out this blog.

What is Observability?

Observability is the capacity to deduce the internal states of a system based on the system’s external outputs. In control theory, observability is a mathematical dual to controllability, which is the ability to control the internal states of a system by influencing external inputs. 
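As a hedged aside not spelled out in the post: in the textbook linear time-invariant setting, with state $x$, input $u$, and output $y$,

$$\dot{x} = Ax + Bu, \qquad y = Cx,$$

the system is observable exactly when the observability matrix has full rank:

$$\mathcal{O} = \begin{bmatrix} C \\ CA \\ \vdots \\ CA^{n-1} \end{bmatrix}, \qquad \text{observable} \iff \operatorname{rank}(\mathcal{O}) = n.$$

Intuitively, full rank means the external outputs carry enough information to reconstruct the internal state – the same idea the software sense of observability borrows.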

Infrastructure components that are distributed operate in multiple conceptual layers of software and virtualization. Therefore, it is challenging, and often not feasible, to analyze and compute system controllability.

Observability has three basic pillars: metrics, logs, and tracing. As we noted a moment ago, observability employs all three of these to create a more holistic, end-to-end view of an entire system, using multiple point tools to accomplish this.

Comparing observability and monitoring

People are always curious about observability and its difference from monitoring. Let’s take a large, complex data center infrastructure system that is monitored using log analysis, monitoring, and ITSM tools. Monitoring multiple data points continuously will create a large number of unnecessary alerts, data, and red flags. Unless the correct metrics are evaluated and the redundant noise is carefully filtered by the monitoring solutions, the infrastructure may have low observability characteristics.

A single server machine can be easily monitored using metrics and parameters like energy consumption, temperature, transfer rates, and speed. The health of internal system components is highly correlated with these parameters, so the system has high observability. From some basic monitoring criteria, such as energy and temperature measurements, you can evaluate the machine’s performance, life expectancy, and risk of potential performance incidents.

Observability in DevOps

The concept of observability is very important in DevOps methodologies. In earlier frameworks like waterfall and agile, developers created new features and product lines while separate teams worked on testing and operations for software dependability. This compartmentalized approach meant that operations and monitoring activities were outside the development team’s scope. Projects were aimed at success, not at failure, i.e., debugging the code was rarely a primary consideration. Developers had no proper understanding of infrastructure dependencies and application semantics, and apps and services were built with low dependability.

Monitoring ultimately failed to give sufficient information about the distributed infrastructure system’s known unknowns, let alone its unknown unknowns.

The popularity of DevOps has transformed the SDLC. Monitoring is no longer limited to just collecting and processing log data, metrics, and event traces; it is now used to make the system more transparent, i.e., observable.

The scope of observability now encompasses the development segment as well, aided by people, processes, and technologies operating across the pipeline.

Conclusion

Collaboration among cross-functional teams such as developers, ITOps, and QA personnel is very important when designing a dependable system. Communication and feedback between developers and operations teams are necessary to achieve the system’s observability targets, which help QA yield correct and insightful monitoring during the testing phase. In turn, DevOps teams can test systems and solutions for true real-world performance. Constant iteration based on feedback can further enhance IT’s ability to identify potential issues in the systems before the impact reaches end-users.

Observability has a strong human component involved, similar to DevOps. It’s not limited to technologies but also covers the approach, organizational culture, and priorities in reaching appropriate observability targets, and hence, the value of monitoring initiatives.

Keep your system as transparent as possible, track your system health, and monitor your data with Grafana or Kibana. Also, keep your stakeholders happy with professional reporting! Try our new and improved Skedler for custom-generated Grafana reports for free!

Download Skedler

Episode 7 – Best Practices for Implementing Observability in Microservices Environments

In this episode of Infralytics, Shankar interviewed Stefan Thies, the DevOps Evangelist at Sematext, a provider of infrastructure and application performance monitoring and log management solutions including consulting services for Elastic Stack and Solr. Stefan also has extensive experience as a product manager and pre-sales engineer in the Telecom domain. Here are some of the key discussion points from our interview with Stefan on implementing observability in microservices!

[Video: hY1gkea4LDo]

Microservices based on containers have become widely popular as the platform for deploying solutions in public, private, or hybrid clouds. What are the top monitoring and management challenges faced by organizations deploying container-based microservices that want to implement observability?

There are quite a lot of challenges. Some people start simply with a single host and later use orchestration tools like Kubernetes, and what we see is that containers add another infrastructure layer and a new kind of resource management. At the same time, we are monitoring performance with a new kind of metrics. What we developed in the past were special monitoring agents to collect these new kinds of metrics on all layers: we have a cluster node with performance metrics for the specific node, on top of that Kubernetes pods, and in a pod you can have several containers and multiple processes. So first, new monitoring agents need to be container-aware, so they can collect metrics from all of the layers.

The second challenge is the new way of dynamic deployment and orchestration. You deal with more objects than just servers and your services, because you also deal with cluster nodes, containers, and the deployment status of your containers. This can be very dynamic, and orchestrators like Kubernetes move your applications around: maybe an application fails on one node and then the cluster shifts it to another node. It’s very hard to track errors and failures in your application. So the new orchestration tools add additional challenges for DevOps people, because they need to see not only what happens in the applications but also at the cluster level. Additional challenges come from things moving around. There is now another layer of complexity added to the process.

What are additional challenges that come with containers? What should administrators be looking for?

There are metrics on every layer: servers, clusters, pods, containers, deployment status. Another challenge is that log management has also changed completely. You need a logging agent that’s able to collect the container logs. With every log line we add information on which node it is on, in which pod it is deployed, and which container and container image, so we have better visibility. The next thing that comes with container deployment is microservices. Typical architectures today are more distributed and split into little services that work closely together, but it’s harder to trace transactions that go through multiple services. Transaction tracing is a new pillar of observability, but it requires more work to implement the necessary code.

Basically, log management becomes a challenge because of all of these microservices and you are also doing the tracing not just on the metrics and events, but you are also now looking at all of the trace files. So having more data requires people to have larger data stores.

How do you consolidate the different datasets?

We use monitoring agents and logagents. Both tools use the same tags so the logs and metrics can be correlated.

How do you standardize the different standards and practices?

With open source, it’s a lot of do-it-yourself which means you need to sit down and think what metrics do I have, what labels do I need, and do the same for the logging and for the monitoring.

What are your recommended strategies for organizations?

More and more users expect 24/7 services because they are used to getting that from Google and Facebook. All the big vendors offer 24/7 services. Smaller software vendors really have a challenge to be on the same level and to be aware of any problem as soon as possible.

What you need to do is first start with availability monitoring, then add metrics to it for infrastructure monitoring. Are your servers healthy? Are all the processes running? Then the next level is application monitoring: checking the performance of your databases, your message queues, and the other tools you use in your stack, and finally the performance of your own applications.

When it comes to troubleshooting and you recognize some service is not performing well, then you need the logs. In the initial stage typically people use SSH, log into the server, try to find the log file, and look for errors. You need to collect the logs from all your servers, from all of your processes, and from all of your containers. Index the data and make it searchable and accessible. If you want to be really advanced you go to the level of code implementation and tracing. 

What is observability? How is it different from monitoring?

Observability is the whole process. In monitoring, you have metrics, you have logs, and transaction tracing gives you code-level visibility. This process allows you to pinpoint where exactly the failure happens, so it’s easier to fix. When you have more information, it’s much faster to solve the problem.

How would an organization move from just monitoring to observability?

At Sematext, our log management is very well accepted, so people typically start with collecting the logs because it’s the first challenge they have: Where do I store all these logs? Should I set up a separate server for it, or do I go for Software as a Service? These are the types of questions people are asking, so we see that people start collecting logs, then they discover more features, see that we offer monitoring, start installing monitoring agents, and ask about specific applications. They naturally take more and more steps. That is the process our customers normally follow.

Are you interested in learning more about what Stefan’s company offers? Go to www.sematext.com. Are you looking for an easy-to-use system for data monitoring that provides automated delivery of metrics and code-free, easy-to-manage alerts? Check out Skedler!

If you want to learn more tips from experts like Stefan, you can read more articles about the Infralytics video podcast on our blog!
