Use Data to Trim Alerts

Problem

I use Sensu alerts that flow through to PagerDuty and wake people up when necessary. This lets us respond quickly to customer issues. A false positive, however, means a frustrating phone call for nothing. Even worse, false positives train team members to ignore alerts. Once alerts become background noise, having them is barely more useful than having none.

Solution

Ensure that your alerts are meaningful and actionable. Follow the steps below.

Export data from PagerDuty
1. Click Analytics
2. Click Incident Volume

From there you can download CSVs of your incident data.

Concatenate all the CSVs so you have 6 months of data

cat *.csv > combined.csv

Grep for your service

grep "my-service" combined.csv > filtered_combined.csv
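The cat and grep commands above are fine for a quick look, but they repeat each file's header row in combined.csv, usually drop it from the filtered file, and match text anywhere on a line. As a minimal sketch of the same two steps in Python, assuming every export shares an identical header row (the function name combine_and_filter is hypothetical, not part of any PagerDuty tooling):

```python
import csv


def combine_and_filter(paths, service, out_path):
    """Merge CSV exports into one file, keeping a single header row and
    only the data rows that mention `service` in some column.
    Returns the number of data rows kept."""
    header_written = False
    kept = 0
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for path in sorted(paths):
            with open(path, newline="") as f:
                reader = csv.reader(f)
                header = next(reader, None)
                if header is None:
                    continue  # skip empty files
                if not header_written:
                    # Assumption: all exports have the same columns.
                    writer.writerow(header)
                    header_written = True
                for row in reader:
                    if any(service in cell for cell in row):
                        writer.writerow(row)
                        kept += 1
    return kept


# Example call (file names are placeholders for your own exports):
# combine_and_filter(["jan.csv", "feb.csv"], "my-service", "filtered_combined.csv")
```

Keeping the header row makes the later spreadsheet step easier, because the columns stay labeled after filtering.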

Upload filtered_combined.csv to Google Sheets.
Get the average time to resolve.
Hide any rows that aren’t relevant so they’re omitted from the average. Use SUBTOTAL(101, range) rather than AVERAGE, since a plain AVERAGE still counts hidden rows.
Then use that information to adjust occurrences and interval in your Sensu checks.
You’ll find this configuration in /etc/sensu/conf.d/checks/my_check.json

"occurrences": 3,
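The averaging step can also be scripted instead of done in Google Sheets. The following is a minimal sketch, assuming the filtered file still has its header row (a plain grep usually removes it) and that the export has a column holding the resolution time in seconds. The column name seconds_to_resolve in the usage comment is an assumption, so check the header of your own export before relying on it.

```python
import csv
from statistics import mean


def mean_seconds(csv_path, column):
    """Average the numeric values in `column`, skipping blank or
    non-numeric cells. Returns None if no usable values are found."""
    values = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            cell = (row.get(column) or "").strip()
            try:
                values.append(float(cell))
            except ValueError:
                continue  # blank cell, unresolved incident, or stray text
    return mean(values) if values else None


# Example call (column name is an assumption about the export format):
# mean_seconds("filtered_combined.csv", "seconds_to_resolve")
```

Incidents that resolve themselves within a minute or two are the likely false positives, so an average close to your check interval suggests raising occurrences.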

Additionally, you can configure the warning and critical levels for many Sensu plugins.
Take, for example, the check below, with a warning at 80% disk usage and a critical at 90% disk usage:

{
  "checks": {
    "check-disk-usage-linux": {
      "handlers": ["mailer"],
      "command": "/opt/sensu/embedded/bin/check-disk-usage.rb -w 80 -c 90",
      "interval": 60,
      "occurrences": 5,
      "subscribers": ["linux"]
    }
  }
}
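To see what a given interval and occurrences pair means in practice, it helps to multiply them. The sketch below assumes the handler only acts after occurrences consecutive failing results, so a problem has to persist for roughly interval * occurrences seconds before anyone is notified. The exact delay depends on when the first failing run lands and on how your handler applies occurrences. The helper name seconds_before_handling is hypothetical.

```python
import json

# The check definition from above, inlined so the example is self-contained.
CHECK_JSON = """
{
  "checks": {
    "check-disk-usage-linux": {
      "handlers": ["mailer"],
      "command": "/opt/sensu/embedded/bin/check-disk-usage.rb -w 80 -c 90",
      "interval": 60,
      "occurrences": 5,
      "subscribers": ["linux"]
    }
  }
}
"""


def seconds_before_handling(config_text):
    """Map each check name to interval * occurrences, in seconds.
    Assumption: a check without `occurrences` is handled on the first failure."""
    checks = json.loads(config_text)["checks"]
    return {
        name: check["interval"] * check.get("occurrences", 1)
        for name, check in checks.items()
    }


print(seconds_before_handling(CHECK_JSON))
# {'check-disk-usage-linux': 300}
```

With these settings the disk check has to fail for about five minutes before the handler runs. If most of your false positives resolve faster than that, the current values are probably reasonable.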

Cultivating Software
