Problem

I use Sensu alerts that flow through to PagerDuty and wake people up when necessary. This lets us respond quickly to customer issues. False positives, however, are really frustrating: nobody wants a phone call for a non-issue. Even worse, false positives train team members to ignore alerts and stop worrying about them. Once the presence of alerts becomes normal, having them is barely more useful than having none.

Solution

Ensure that your alerts are meaningful and actionable. Follow the steps below.

  1. Export data from PagerDuty
    1. Click Analytics
    2. Click Incident Volume
  2. From there, download the CSVs to get your data
  3. Concatenate all the CSVs so you have 6 months of data
     cat *.csv > combined.csv
  4. Grep for your service
     grep "my-service" combined.csv > filtered_combined.csv
  5. Upload filtered_combined.csv to Google Sheets
  6. Get the average time to resolve
  7. Hide any rows that aren’t relevant so they’re omitted
  8. Use that information to adjust occurrences and interval in your Sensu checks
  9. You’ll find this configuration in /etc/sensu/conf.d/checks/my_check.json
     "occurrences": 3,
  10. Additionally, you can configure the warning and critical levels for many Sensu plugins
  11. Take, for example, the check below, with a warning at 80% disk usage and a critical at 90% disk usage
     {
       "checks": {
         "check-disk-usage-linux": {
           "handlers": ["mailer"],
           "command": "/opt/sensu/embedded/bin/check-disk-usage.rb -w 80 -c 90",
           "interval": 60,
           "occurrences": 5,
           "subscribers": ["linux"]
         }
       }
     }
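
If you’d rather skip the spreadsheet, the averaging step can be sketched in a few lines of Python. This is only a rough sketch under assumptions: the column names `service_name` and `seconds_to_resolve` are hypothetical stand-ins for whatever headers your export actually contains, the sample data is made up, and turning the average into an occurrences value by dividing by the interval is one possible heuristic rather than an official Sensu recommendation.

```python
import csv
import io
import math
from statistics import mean


def mean_resolve_seconds(csv_text, service,
                         service_col="service_name",
                         resolve_col="seconds_to_resolve"):
    """Average resolve time for one service.

    Both column names are assumptions; check them against your export.
    Returns None when no rows match the service.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    durations = [
        float(row[resolve_col])
        for row in reader
        # Repeated header rows from `cat` never match the service name,
        # so they are skipped here as well.
        if row.get(service_col) == service and row.get(resolve_col)
    ]
    return mean(durations) if durations else None


def suggest_occurrences(mean_seconds, interval):
    """Heuristic: require enough consecutive failures to cover mean_seconds."""
    return max(1, math.ceil(mean_seconds / interval))


# Made-up sample standing in for a PagerDuty export.
SAMPLE = """service_name,description,seconds_to_resolve
my-service,"disk usage, /var",240
my-service,disk usage /,300
other-service,cpu,30
"""

avg = mean_resolve_seconds(SAMPLE, "my-service")
print(avg, suggest_occurrences(avg, interval=60))  # prints: 270.0 5
```

To try it on real data, pass in the contents of combined.csv rather than filtered_combined.csv, since grep drops the header row that `csv.DictReader` relies on. Using the `csv` module instead of splitting on commas keeps quoted descriptions that contain commas intact.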

