host_alert_rules

Configure and manage Prometheus host alert rules for effective monitoring. Learn to define alerts for node down, low disk space, and more with clear annotations and labels.

Prometheus Host Alert Rules

This document outlines example Prometheus alert rules for monitoring host systems that are scraped through node_exporter. The rules detect two conditions: a node whose exporter can no longer be scraped, and a root filesystem that is running low on free space. Each rule carries labels and annotations that give the alert context (severity, environment, dashboard and runbook links), which speeds up diagnosis and resolution.

Node Down Alert Rule

This rule fires when Prometheus has failed to scrape a node-exporter target (up == 0) for 1 minute, which usually means the host or the exporter on it is down.

  - alert: node_down
    expr: up{job="node-exporter"} == 0
    for: 1m
    labels:
      severity: warning
      environment: prod
      alert_target: "{{ $labels.host }}"
    annotations:
      summary: "Job {{ $labels.job }} is down on {{ $labels.instance }}"
      description: "Failed to scrape {{ $labels.job }} on {{ $labels.instance }} for more than 1 minute. Node might be down."
      impact: "Any metrics from {{ $labels.job }} on {{ $labels.instance }} will be missing"
      action: "Check on {{ $labels.instance }} if {{ $labels.job }} is running"
      dashboard: http://grafana.localdns.xyz/d/pjhLJOzmk/infrastructure-hosts-stats
      runbook: http://wiki.localdns.xyz
      priority: P2
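
The rule above is a list item, so it needs to sit under a rule group before Prometheus can load it. The following is a minimal sketch of that wrapper, not the author's actual file: the file name host_alerts.yml and the group name host_alerts are assumptions made for illustration, and the rule is shortened to its core fields.

```yaml
# host_alerts.yml (hypothetical file name)
groups:
  - name: host_alerts            # hypothetical group name
    rules:
      # Shortened copy of the node_down rule; the remaining labels
      # and annotations go in the same place.
      - alert: node_down
        expr: up{job="node-exporter"} == 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Job {{ $labels.job }} is down on {{ $labels.instance }}"
```

A file like this is referenced from the rule_files list in prometheus.yml and can be syntax-checked with promtool check rules host_alerts.yml before reloading Prometheus.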

Low Disk Space Alert Rule

This rule fires when the available space on a host's root filesystem (mountpoint "/") stays below 20% of its total size for 1 minute, signaling an impending storage issue.

  - alert: debug_instance_hard_disk_low
    expr: (node_filesystem_avail_bytes{mountpoint="/"} * 100) / node_filesystem_size_bytes{mountpoint="/"} < 20
    for: 1m
    labels:
      severity: warning
      alert_channel: notifications
      environment: prod
      team: devops
      aws_region: eu-west-1
    annotations:
      title: "[TEST] Disk Space is Low on {{ $labels.instance }}"
      description: "Instance {{ $labels.instance }} has only {{ humanize $value }}% available on mount {{ $labels.mountpoint }}"
      summary: "Low Disk Space Available"
      dashboard: http://grafana.localdns.xyz/d/pjhLJOzmk/infrastructure-hosts-stats
      runbook: http://wiki.localdns.xyz
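
To see how the expression behaves, it can be exercised with promtool's unit-test format. The sketch below tests only the PromQL expression, not the full alert. The file name disk_expr_test.yml, the instance names, and the sample values are all made up for illustration. One host has 10 of 100 bytes free (10%), the other 50 of 100 (50%), so only the first should pass the < 20 filter.

```yaml
# disk_expr_test.yml (hypothetical file name)
# Run with: promtool test rules disk_expr_test.yml
evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # host1: 10% of the root filesystem is available (below threshold)
      - series: 'node_filesystem_avail_bytes{instance="host1:9100", mountpoint="/"}'
        values: '10+0x3'
      - series: 'node_filesystem_size_bytes{instance="host1:9100", mountpoint="/"}'
        values: '100+0x3'
      # host2: 50% of the root filesystem is available (above threshold)
      - series: 'node_filesystem_avail_bytes{instance="host2:9100", mountpoint="/"}'
        values: '50+0x3'
      - series: 'node_filesystem_size_bytes{instance="host2:9100", mountpoint="/"}'
        values: '100+0x3'
    promql_expr_test:
      - expr: (node_filesystem_avail_bytes{mountpoint="/"} * 100) / node_filesystem_size_bytes{mountpoint="/"} < 20
        eval_time: 1m
        exp_samples:
          # Only host1 survives the filter; the sample value is the
          # percentage of space still available.
          - labels: '{instance="host1:9100", mountpoint="/"}'
            value: 10
```

The value carried by the alert is the percentage still available, which is why the description annotation can print it directly with humanize $value.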