Prometheus Host Alert Rules - Monitor Node Status - alertmanager Cheatsheets

These Prometheus alert rules monitor the status of nodes and fire an alert when the node-exporter target on a node cannot be scraped for more than 1 minute. Load them into Prometheus to be notified of unreachable hosts instead of discovering gaps in the metrics later.

Alert Rules Configuration

Below is the configuration for Prometheus alert rules to monitor host status. These rules define when and how alerts are triggered based on node availability.

groups:
- name: host_alert_rules.yml
  rules:

  # Alert for any node that is unreachable for > 1 minute.
  - alert: node_down
    expr: up{job="node-exporter"} == 0
    for: 1m
    labels:
      severity: critical
      environment: env-production
    annotations:
      summary: "Job {{ $labels.job }} is down on {{ $labels.instance }}"
      description: "Failed to scrape {{ $labels.job }} on {{ $labels.instance }} for more than 1 minute. Node might be down."
      impact: "Any metrics from {{ $labels.job }} on {{ $labels.instance }} will be missing"
      action: "Check on {{ $labels.instance }} if {{ $labels.job }} is running"
      dashboard: https://grafana.localdns.xyz
      runbook: https://runbooks.localdns.xyz
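
Prometheus only evaluates these rules once the file is referenced from its main configuration. The fragment below is a minimal sketch of that wiring. The file path rules/host_alert_rules.yml, the 30s evaluation interval, and the Alertmanager address alertmanager:9093 are assumptions for illustration and do not come from the original configuration.

```yaml
# prometheus.yml (fragment); paths and addresses are assumed, adjust to your setup
global:
  evaluation_interval: 30s          # how often rules are evaluated (assumed value)

rule_files:
  - "rules/host_alert_rules.yml"    # assumed location of the rules shown above

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]   # assumed Alertmanager address
```

Relative paths under rule_files are resolved relative to the directory that contains prometheus.yml. Running promtool check rules against the rules file before reloading Prometheus catches syntax errors early.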

Understanding the Alert

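The node_down alert fires when the up metric for the node-exporter job has been 0 for at least 1 minute. Prometheus sets up to 0 whenever a scrape of the target fails, so the alert means the exporter could not be reached: the node may be down, but the exporter process may also have stopped or the network path to it may be broken.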
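
The firing behaviour can be checked without a live outage by using a promtool unit test. The sketch below assumes the rules are saved as host_alert_rules.yml next to the test file and uses a hypothetical instance host1:9100. The expected labels and annotations are the rule's templates rendered for that instance.

```yaml
# node_down_test.yml; run with: promtool test rules node_down_test.yml
rule_files:
  - host_alert_rules.yml            # assumed file name for the rules above

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Hypothetical target that fails every scrape
      - series: 'up{job="node-exporter", instance="host1:9100"}'
        values: '0 0 0 0'
    alert_rule_test:
      # At 0m the condition has just become true, so the alert is only pending
      - eval_time: 0m
        alertname: node_down
        exp_alerts: []
      # By 2m the condition has held longer than "for: 1m", so the alert fires
      - eval_time: 2m
        alertname: node_down
        exp_alerts:
          - exp_labels:
              severity: critical
              environment: env-production
              job: node-exporter
              instance: host1:9100
            exp_annotations:
              summary: "Job node-exporter is down on host1:9100"
              description: "Failed to scrape node-exporter on host1:9100 for more than 1 minute. Node might be down."
              impact: "Any metrics from node-exporter on host1:9100 will be missing"
              action: "Check on host1:9100 if node-exporter is running"
              dashboard: https://grafana.localdns.xyz
              runbook: https://runbooks.localdns.xyz
```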

Explanation of Key Parameters

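expr: up{job="node-exporter"} == 0: Selects every target of the node-exporter job whose last scrape failed. Prometheus sets up to 1 after a successful scrape and to 0 after a failed one.
for: 1m: The condition must hold continuously for 1 minute before the alert moves from pending to firing, so scrape failures that recover within a minute do not fire it.
labels: severity: critical and environment: env-production are attached to the alert in addition to the labels of the matching series (job, instance). Alertmanager can use them for routing and grouping.
annotations: Provide human-readable context for the alert: a summary, description, impact, and recommended action, plus links to a dashboard and a runbook. The {{ $labels.job }} and {{ $labels.instance }} templates are filled in with the labels of the affected target.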
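
To illustrate how the labels above are typically used, the following is a minimal sketch of an Alertmanager routing fragment that sends critical production alerts to a dedicated receiver. The receiver names default and oncall are hypothetical, and the matchers syntax assumes Alertmanager 0.22 or later. The notification integrations for each receiver are left out.

```yaml
# alertmanager.yml (fragment); receiver names are hypothetical
route:
  receiver: default                 # fallback for alerts that match no child route
  group_by: ["alertname", "instance"]
  routes:
    - matchers:
        - severity="critical"
        - environment="env-production"
      receiver: oncall              # node_down alerts from the rule above land here

receivers:
  - name: default
  - name: oncall                    # add email, webhook, or pager settings here
```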
