Episode 8 – How to Build a Cloud-Scale Monitoring System

In Episode 8 of the Infralytics Show, Shankar interviewed Molly Struve. Molly is the Lead Site Reliability Engineer for DEV Community, an online portal designed as a place where programmers can exchange ideas to help each other. The discussion focused on two topics, “How to build a cloud-scale monitoring system” and “How to scale your Elastic Stack for cloud-scale monitoring.” 

[video_embed video=”8bzSK3EiIPw” parameters=”” mp4=”” ogv=”” placeholder=”” width=”700″ height=”400″]

How Molly started working in software engineering and cloud-scale monitoring

Molly earned an aerospace degree from MIT after originally thinking she would study software engineering. She said that since all engineering degrees provide students with the same core problem-solving skills, so when she later decided to work in the software engineering field, she already had the problem-solving background she needed in order to make the transition. The reason she didn’t end up going the aerospace route is that you have to be located in California or Washington where the aerospace industry is but she is from Chicago and didn’t really want to move. It’s good to know that people with various different educational backgrounds have still been able to find success in software engineering!

Let’s jump into the discussion of cloud-scale monitoring! Here are the key points Molly made in reference to the topics listed above.

The Interview – building a cloud-scale monitoring system

What are some of the key requirements to look for when you build out a large cloud-scale monitoring system?

When you start monitoring, you just want coverage, and to do that you often start adding all of these different tools and before you know it you have 6, 7, or 8 different tools doing all this monitoring. However, when the time comes to use it you have to open up all these different windows in your browser just to piece together what is actually going on in your system. So, one of the key things she tells people when they are building a monitoring system is that they have to consolidate all of the reporting. You can have different tools, but you need to consolidate the reporting to a single place. Make sure everything’s in one place so it’s a one stop shop to go and find all the information you need.

When an alert triggers, it must require an action so alert fatigue is a big problem in many monitoring systems. When you have a small team it might seem fine to have exceptions that everyone knows when you don’t respond to certain alerts, but as your team gets larger you have to tell new engineers what the exceptions are, and this process just simply doesn’t scale. So you have to be very disciplined in responding to alerts.

The goal is to get to a point where whoever is on call, whether it’s one person, two people, or three people, can handle the error workload that is coming into the system by way of alerts. 

In the beginning, when you are setting up a monitoring system you might have a lot of errors, and you just have to fix stuff and the improvement of the system comes with time. The ideal metric is zero errors, so you need to be aware of when errors get to a point where they need to be addressed.

Monitoring from an infrastructure perspective is different from monitoring from a security perspective

Trying to figure out what to monitor is also very challenging. You have to set up your monitoring and adjust it as you go depending on what perspective you are monitoring for. Knowing what to monitor is a little bit based on trial and error. That way, if there is data that you wish you had monitoring for, you can address the error and then go in and add the necessary code so that it’s there in the future. After you do that a few times you will end up with a really robust system so the next time an error occurs, all the information you need will be there and it might only take you a few minutes to figure out what’s wrong.

Beyond bringing the data together and optimizing alerting, what are the other best practices?

Another best practice is tracking monitoring history. When trying to solve the error from an alert, you will want to know what the past behavior was. Past behavior can help you debug a problem. What were you alerted about in the past and how was the problem addressed then?

Also, you have to remove all manual monitoring for your monitoring system to be truly scalable. Some systems require employees to check a dashboard every few hours, but this task is easily forgotten. So, if you want a monitoring system to scale you have to remove all manual monitoring. You don’t want to rely on someone opening up a file or checking a dashboard to find a problem. The problem should automatically come to you or whoever is tasked with addressing it. 

What tools did you use to automate?

At Kenna we used datadog. It’s super simple, it integrates really easily with ruby which is the language I primarily work with.

Anything else important on the topic of best practices for cloud-scale monitoring?

Having the ability to mute alerts when you are in the process of fixing them is important. When a developer is trying to fix a problem, it’s distracting to have an alert going off repeatedly every half hour. Having the ability to mute an alert for a set amount of time like an hour or a day can be very helpful. 

What else is part of your monitoring stack?

The list goes on and on. You can use honeybadger for application errors, AWS metrics for your low-end infrastructure metrics, StatusCake for your APIs to make sure your actual site is up, Elasticsearch for monitoring, circleci for continuous integration. It’s a large list of many different tools, but we consolidated them all through datadog. 

What kind of metrics did your management team look for?

Having a great monitoring system allows you to catch incidents and problems before they become massive problems. It’s best to be able to fix issues before the point at which you would have to alert users to the problem. You want to solve problems before they impact your user base. That way on the front-end it looks to the user like your product is 100% reliable, but it’s just because developers have a system on the backend that alerts them to problems so they can stop them before they directly impact users. Upper management obviously wants the app to run well because that’s what they are selling and the monitoring system allows for that to happen.

How big was the elasticsearch cluster where you worked before?

The logging cluster that we used at Kenna had 10 data nodes. The cluster we used for searching client data was even bigger. It was a 21 node cluster. 

What were some of the problems when it came to managing this large cluster?

You want to be defining what you are logging. and make it systematic. Early on at Kenna we would be logging user information we would end up with a ton of different keys which created more work for elasticsearch. This also makes searching and using the data nearly impossible. To avoid this you need to come up with a logging system by defining keys and making sure that everyone is using those keys when they are in the system and logging data. 

We set up our indexes by date, which is common. When you get a month out from the date on a specific index, you want to shrink them to a single shard, which will decrease the number of resources that elasticsearch needs in order to use that index. Even further out than that, you eventually should close that index so that elasticsearch doesn’t need to use any resources for it. 

Any other best practices for cloud-scale monitoring?

Keep your mapping strict and that can help you to avoid problems. If you are doing the searching yourself, try to use filters rather than queries. Filters run a lot faster and are easier on elasticsearch so you want to use them when you are searching through data.

Finally, educating your users on how to use elasticsearch is important. If developers don’t know how to use it correctly, elasticsearch will time out. So, teach users how to search keys, analyzed fields, unanalyzed fields, etc. Also, this will help your users get the targeted, accurate data they are looking for so educating them on how to use elasticsearch is for their benefit as well. Internal users at Kenna (which is who is being referred to here) were conducting searches through Kibana. Clients would interface with the data relevant to them (after training) through an interface that the Kenna team built which prevented clients from doing things that could take down the entire cluster. 

So are you using elasticsearch in your current role at DEV?

DEV is currently using a paid search tool, but we hope to switch to elasticsearch because elasticsearch is open source and it will give us more control over our data and how we search it.

There’s an affordable solution for achieving the best practices described

Molly described the importance of consolidating reporting, responding to alerts, avoiding alert fatigue, automating alerts and reports, and tracking monitoring history. Just two weeks prior to this interview, Shankar gave a presentation about avoiding alert fatigue, and this relevant topic keeps becoming a focus of discussions. Many of the points Molly made, from the importance of automating alerts and reports to the importance of consolidating reporting, are the reasons we started Skedler. 

Are you looking for an affordable way to send periodic reports from elasticsearch to users when they need it? Try Skedler Reports for free! 

Do you want to automate the monitoring of elasticsearch data and notify users of anomalies in the data even when they aren’t in front of their dashboards? Sign up for a free trial of Skedler Alerts!

We hope you are enjoying our podcast so far. Happy holidays to all of our listeners. We will be taking a short break, but will be back with new episodes of The Infralytics Show in 2020!

Episode 6 – Cybersecurity Alerts: 6 Steps To Manage Them

Is your Security Ops team overwhelmed by cybersecurity alerts? In this episode of The Infralytics Show, Shankar, Founder, and CEO of Skedler, describes the seemingly endless number of cybersecurity alerts that security ops teams encounter. 

[video_embed video=”7nul5V5pM9o” parameters=”” mp4=”” ogv=”” placeholder=”” width=”700″ height=”400″]

The Problem Of Too Many Cybersecurity Alerts

Just to give you an understanding of how far-reaching this problem is, here are some facts. According to information published in a recent study by Bitdefender, 72% of CISOs reported alert or agent fatigue. So, don’t worry, you aren’t alone. A report published by Critical Start found that 70% of SOC Analysts who responded to the study said they investigate 10+ cybersecurity alerts every day. This is a dramatic increase from just last year when only 45% said they investigate more than 10 alerts each day. 78% spend more than 10 minutes investigating each of the cybersecurity alerts, and 45% reported that they get a rate of 50% or more false positives. 

When asked the question, If your SOC has too many alerts for the analysts to process, what do you do? 38% said they turn off high-volume alerts and the same percentage said that they hire more analysts. However, the problem that arises with the need to hire more analysts, is that more than three quarters of respondents reported an analyst turnover rate of more than 10% with at least half reporting a 10-25% rate. This turnover rate is directly impacted by the overwhelming number of cybersecurity alerts, but it raises the question, what do you do if you need to hire more analysts to handle the endless number of alerts, but the cybersecurity alerts themselves are contributing to a high SOC analyst turnover rate. It seems a situation has been created where there are never enough SOC analysts to meet the demand. 

To make matters worse, more than 50% of respondents reported that they have experienced a security breach in the past year! Thankfully, you can eliminate alert fatigue and manage alerts effectively with these 6 simple steps.

A woman looks overwhelmed over cybersecurity alerts on her laptop.

The Solution To Being Overwhelmed By Cybersecurity Alerts

1. Prioritize Detection and Alerting

According to Shankar’s research, step 1 is that business and security goals and the available resources that you have at your disposal to use to achieve them must prioritize threat detection and alerting. Defining what your goals are is a great way to start. Use knowledge of your available resources to better plan how you are going to respond to alerts and how many you will be able to manage per day. 

2. Map Required Data

Step 2 is to map your goals and what you are trying to achieve to the data that you are already capturing. Then you can see if you are collecting all of the required data to adequately monitor and meet your security requirements. Identify the gaps in your data by completing a gap analysis to see what information you are not collecting that needs to be collected, and then set up your telemetry architecture to collect the data that is needed.

3. Define Metrics Based Cybersecurity Alerts

Step 3 is to define metrics based alerts. What type of alerts are you going to monitor? Look for metric-based alerts that often search for variations in events. Metric based alerts are more efficient than other types of alerts, so Shankar recommends this to those of you who are at this step. You should augment your alerts with machine learning.

Definitely avoid cookie cutter detection. The cookie cutter approach is more of a one size fits all organizations approach that most definitely will not be the best approach for YOUR organization. Each organization has its own unique setup, and you need to have your own setup that is derived from your own security goals.  Also, optimize event-based detection but keep these to a minimum so that your analysts do not end up getting overwhelmed by the alerts.

4. Automate Enrichment and Post-Alert Standard Analysis

Once you have set up these rules, the next step is to see how you can automate some of the additional data that your analysts need for their analysis. Can you automate the enrichment of the alert data so that your analysts don’t have to go and manually look for additional data to provide more context to the alerts? Also, 70-80% of the analysis that an analyst goes through as part of the investigation of an alert is very standard. So ask yourself, is it possible to automate it?

5. Setup a Universal Work Bench

  • Use a setup similar to what Kanban or Trello uses where you have a queue and the alerts that need to be investigated are moved from one stage to the next. This will help you keep everything organized. This can help you arrange the alerts in order of importance so that your analysts know which alerts to address first.
  • Add enriched data to these alerts, so automate the enrichment process to make sure it is readily available for your analysts through the work bench.
  • Provide more intelligence to the alerts (adding data or whatever else is needed to provide context). This will help you provide a narrative for the alerts and this will help you use immersion learning to come up with recommendations that your security analysts can investigate.

These first five steps are not intended to be a one time initiative but rather a repetitive process where each step can be perfected over a long period. 

6. Measure and Refine

  • Continuous improvement – measure the effectiveness of your alert system. How many alerts are flowing into the system, how much time is it taking for your analysts to investigate each of the alerts, and what is the false-positive rate vs. the true-positive rate.
  • Iterative approach- Think of a sprint-based approach? What changes can you make to improve your results in the next sprint iteration? Add more data or change your alert algorithms for different results and be more precise.

By making regular changes to improve your results, you can reduce the operations costs of your organization and provide more security coverage, reducing the overall likelihood of a major cybersecurity breach.

If you are looking for alerting and reporting for ELK SIEM or Grafana that is easy to use check out Skedler. Interested in other episodes of the Infralytics Show? Check out our blog for the Infralytics Show videos and articles in addition to other informative articles that may be relevant to your business!

An easy way to add alerting to Elasticsearch on Kubernetes with Skedler Alerts

There is a simple and effective way to add alerting for your Elasticsearch applications that are deployed to Kubernetes. Skedler Alerts offers no-code alerting for Elasticsearch and reduces the time, effort, and cost of monitoring your machine data for anomalies.   In this article, you are going to learn how to deploy Skedler Alerts for Elasticsearch applications to Kubernetes with ease.

What is Kubernetes?

For those that haven’t ventured into container orchestration, you’re probably wondering what Kubernetes is. Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications.

Kubernetes (“k8s” for short), was a project originally started at, and designed by Google, and is heavily influenced by Google’s large scale cluster management system, Borg. More simply, k8s gives you a platform for managing and running your applications at scale across multiple physical (or virtual) machines.

Kubernetes offers the following benefits:

  • Workload Scalability
  • High Availability
  • Designed for deployment

Deploying Skedler Alerts to Kubernetes

If you haven’t already downloaded Skedler Alerts, please download it from www.skedler.com.  Review the documentation to get started.   

Creating a K8s ConfigMap

Kubernetes ConfigMaps allows a containerized application to become portable without worrying about configurations. Users and system components can store configuration data in ConfigMap. In Skedler Alerts ConfigMaps can be used to store database connection string information such as datastore settings, port number, server information and files locations, log directory etc.

If Skedler Alerts defaults are not enough, one may want to customize alertconfig.yml through a ConfigMap. Please refer to Alertconfig.yml Configuration for all available attributes.

1.Create a file called alerts-configmap.yaml in your project directory and paste the following

alerts-configmap.yaml:

2. To deploy your configmap, execute the following command

Creating Deployment and Service

To deploy Skedler Alerts, we’re going to use the “skedler-deployment” pod type. A deployment wraps the functionality of Pods and Replica Sets to allow you to update your application. Now that our Skedler Alerts application is deployed, we need a way to expose it to traffic from outside the cluster. To this, we’re going to add a Service inside the skedler-deployment.yaml file. We’re going to open up a NodePort directly to our application on port 30001.

1.Create a file called alerts-deployment.yaml in your project directory and paste the following

alerts-deployment.yaml:

2. For deployment, execute the following command,

3. To get your deployment with kubectl, execute the following command,

4. We can get the service details by executing the following command,

Now, Skedler Alerts will be deployed in 30001 port.

Accessing Skedler Alerts

Skedler Alerts can be accessed from the following URL, http://<hostIP>:30001

To learn more about creating Skedler Alerts, visit Skedler documentation site.

Summary

This blog was a very quick overview of how to get Skedler Alerts for Elasticsearch application up and running on Kubernetes with the least amount of configuration possible. Kubernetes is an incredibly powerful platform that has many more features than we used today.  We hope that this article gave a headstart and saved you time.

Webinar: Save Time and Money With Automated Reports & Alerts

How do you stay up to date on the critical events in your log analytics platform? Do you spend tens of thousands of dollars and countless hours to create reports and alerts from your Elastic Stack or Grafana application?

Whatever critical scenario arises, receiving the right information at the right time can ultimately be the difference between success and failure. Therefore, it’s important to be constantly aware of every situation, whether it be business partners, operations, customers, or employees, is crucial. The faster a possible issue is identified the faster it can be solved.

Benefits of Automation

Join us in the upcoming webinar on Tuesday, December 18th, 2018 @10AM PST to learn how Skedler, which installs in minutes, can help you save time & money with automated reports and alerts for Elastic Stack & Grafana.

Watch Our Webinar Here

You’ll learn how to quickly add reporting and alerting for Elastic Stack and Grafana while seeing how Skedler can provide a flexible framework to meet your complex monitoring requirements. Be ready with your questions and we’ll be more than happy to discuss them in the webinar Q&A session.

Watch Our Webinar Here

Graph Source: https://www.statista.com/chart/10659/risks-and-advantages-to-automation-at-work/

Skedler Update: Version 3.9 Released

Skedler Update: Version 3.9 Released

Here’s everything you need to know about the new Skedler v3.9. Download the update now to take advantage of its new features for both Skedler Reports and Alerts.

What’s New With Skedler Reports v3.9

  • Support for:
    • ReadOnlyRest Elasticsearch/Kibana Security Plugin.
    • Chromium web browser for Skedler report generation.
    • Report bursting in Grafana reports if the Grafana dashboard is set with Template Variables.
    • Elasticsearch version 6.4.0 and Kibana version 6.4.0.
  • Ability to install Skedler Reports through Debian and RPM packages.
  • Simplified installation levels of Skedler Reports here.
  • Upgraded license module
    • NOTE: License reactivation is required when you upgrade Skedler Reports from the older version to the latest v3.8. Refer to this URL to reactivate the Skedler Reports license key.
    • Deactivation of Skedler license key in UI

What’s New With Skedler Alerts v3.9

  • Support for:
    • Installing Skedler Alerts via Debian and RPM packages.
    • GET method type in Webhook.
    • Elasticsearch 6.4.0.
  • Simplified installation levels of Skedler. Refer to this URL for installation guides.
  • Upgraded license module:
    • NOTE: License reactivation is required when you upgrade Skedler Alerts from the older version to the latest v3.8. Refer to this URL to reactivate the Skedler Alerts license key.
  • Deactivation of Skedler Alerts license key in UI

 

Get Skedler Reports

Download Skedler Reports

Get Skedler Alerts

Download Skedler Alerts

 

Alerting for Elasticsearch – How to install with docker

Skedler Alerts provides an easy to use alerting solution for Elasticsearch data.  It is designed for users who find:

  • yaml based rules are difficult to manage and time consuming
  • alternative pack based plugins are cost prohibitive

Powerful, Yet Easy to Use Alerting for Elasticsearch

Skedler Alerts simplifies how you create and manage alert rules for Elasticsearch.  In addition, it provides a flexible approach to notification.  You can send notifications via email or slack.  You can also integrate alert notifications to your application using webhooks.  Last but not the least, with just a few button clicks you can drill down from notifications to root cause documents.

Easy drilldown from notification to root cause event

Alerts Installation

Installation of Skedler Alerts is equally simple.  Of course, the easiest way to is install Skedler Alerts with Docker.  Please note that Skedler Alerts is available for Linux 7 only.

[video_embed video=”EuNKgZPki8c” parameters=”” mp4=”” ogv=”” placeholder=”” width=”700″ height=”400″]

If you would like to install Skedler on a VM, reach out to us via the free trial page.  You will hear back from us within 24 hours to help you get started.

Questions/Comments?

Reach out to us via Skedler Forum  with your questions and comments.   Happy, Easy Alerting!

Translate »