pushover_host_alert_rules

Monitor node status with Prometheus host alert rules. Get a critical alert when a node-exporter target has been unreachable for more than a minute.

Host Alert Rules

These Prometheus alert rules monitor the status of nodes and fire an alert when a node's node-exporter target can no longer be scraped, which usually means the node is unreachable. Load them into Prometheus to be notified promptly when a host drops out of monitoring.

Alert Rules Configuration

Below is the configuration for Prometheus alert rules to monitor host status. These rules define when and how alerts are triggered based on node availability.

groups:
- name: host_alert_rules.yml
  rules:

  # Alert for any node that is unreachable for > 1 minute.
  - alert: node_down
    expr: up{job="node-exporter"} == 0
    for: 1m
    labels:
      severity: critical
      environment: env-production
    annotations:
      summary: "Job {{ $labels.job }} is down on {{ $labels.instance }}"
      description: "Failed to scrape {{ $labels.job }} on {{ $labels.instance }} for more than 1 minute. Node might be down."
      impact: "Any metrics from {{ $labels.job }} on {{ $labels.instance }} will be missing"
      action: "Check on {{ $labels.instance }} if {{ $labels.job }} is running"
      dashboard: https://grafana.localdns.xyz
      runbook: https://runbooks.localdns.xyz
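
For the rules above to take effect, Prometheus has to load the rule file and scrape a job whose name matches the job label used in the expression. The excerpt below is a minimal sketch of a prometheus.yml under a few assumptions: the rules are saved as rules/host_alert_rules.yml (the file path is not stated in this document), and node1.example.internal:9100 is a hypothetical node-exporter target.

```yaml
# prometheus.yml (excerpt) - illustrative sketch, adjust paths and targets

# Assumed location of the rule file shown above.
rule_files:
  - "rules/host_alert_rules.yml"

scrape_configs:
  # job_name becomes the job label, so it must match
  # the selector up{job="node-exporter"} in the alert rule.
  - job_name: node-exporter
    static_configs:
      - targets:
          - "node1.example.internal:9100"  # hypothetical target
```

If the job is named differently in your setup, the selector in expr needs to change to match; otherwise the rule never matches any series and the alert never fires.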

Understanding the Alert

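The node_down alert fires when the up metric for the node-exporter job has been 0 for at least 1 minute. Prometheus sets up to 0 whenever a scrape of the target fails, so the alert means the exporter could not be reached: the node may be down, or the exporter process or the network path to it may have failed. Either way, the host is no longer being monitored and needs attention.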
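
One way to check this behaviour without taking a node offline is a promtool rule unit test. The sketch below assumes the rules are saved as host_alert_rules.yml next to the test file, and uses a hypothetical instance name, node1:9100. It feeds a synthetic up series that drops to 0 after the first sample and expects the alert to be firing two minutes later.

```yaml
# host_alert_rules_test.yml - illustrative sketch
# Run with: promtool test rules host_alert_rules_test.yml

rule_files:
  - host_alert_rules.yml   # assumed file name for the rules above

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # One sample per minute: healthy at 0m, failing from 1m onwards.
      - series: 'up{job="node-exporter", instance="node1:9100"}'
        values: '1 0 0 0'

    alert_rule_test:
      # At 1m the condition has only just become true, so the alert
      # is still pending and nothing is expected to be firing.
      - eval_time: 1m
        alertname: node_down
        exp_alerts: []

      # At 3m the condition has held for longer than "for: 1m".
      - eval_time: 3m
        alertname: node_down
        exp_alerts:
          - exp_labels:
              severity: critical
              environment: env-production
              job: node-exporter
              instance: node1:9100
            exp_annotations:
              summary: "Job node-exporter is down on node1:9100"
              description: "Failed to scrape node-exporter on node1:9100 for more than 1 minute. Node might be down."
              impact: "Any metrics from node-exporter on node1:9100 will be missing"
              action: "Check on node1:9100 if node-exporter is running"
              dashboard: https://grafana.localdns.xyz
              runbook: https://runbooks.localdns.xyz
```

The expected annotations are the rule's templates with the job and instance labels filled in, so they need updating whenever the annotation text in the rule changes.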

Explanation of Key Parameters

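  • expr: up{job="node-exporter"} == 0: Matches every target of the node-exporter job whose up metric is 0, meaning the most recent scrape of that target failed.
  • for: 1m: The condition must hold continuously for 1 minute before the alert fires. Until then the alert stays in the pending state, which filters out brief scrape failures.
  • labels: severity: critical and environment: env-production are attached to every alert this rule fires, marking it as critical and as belonging to the production environment.
  • annotations: Provide additional information about the alert: a summary, description, impact, and recommended action, plus links to a dashboard and a runbook.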
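
Labels and annotations matter once the alert reaches Alertmanager: labels can drive routing, and annotations can fill in the notification text. The excerpt below is a hedged sketch of an Alertmanager configuration that sends alerts labelled severity: critical to Pushover. The receiver names and the placeholder credentials are assumptions, not part of the original setup.

```yaml
# alertmanager.yml (excerpt) - illustrative sketch

route:
  receiver: default            # hypothetical catch-all receiver
  routes:
    # Route anything labelled severity="critical", such as node_down.
    - matchers:
        - severity="critical"
      receiver: pushover-critical

receivers:
  - name: default
  - name: pushover-critical    # hypothetical receiver name
    pushover_configs:
      - user_key: "<your-pushover-user-key>"   # placeholder
        token: "<your-pushover-api-token>"     # placeholder
        # Reuse the annotations defined on the alert rule.
        title: '{{ .CommonAnnotations.summary }}'
        message: '{{ .CommonAnnotations.description }}'
```

Routing on the severity label rather than on the alert name means any other rule labelled critical would reach the same receiver without further changes.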

Further Reading

For more information on Prometheus alerting and node monitoring, refer to the following resources: