
Alerting

Architecture

flowchart LR
    VM["VictoriaMetrics<br/>(metrics)"]
    VMA["vmalert<br/>(rule evaluation)"]
    AM["Alertmanager<br/>(routing)"]
    Ntfy["ntfy.sh<br/>(push notifications)"]
    Phone["Phone<br/>(ntfy app)"]

    VM -->|"PromQL query results"| VMA
    VMA -->|"firing alerts"| AM
    AM -->|"webhook"| Ntfy
    Ntfy -->|"push"| Phone

    style VMA fill:#7b241c,stroke:#c0392b,color:#fff
    style Ntfy fill:#1e8449,stroke:#27ae60,color:#fff

vmalert

vmalert evaluates alerting rules against VictoriaMetrics and sends firing alerts to Alertmanager.

| Parameter | Value |
| --- | --- |
| Evaluation interval | 30s |
| Data source | VictoriaMetrics (`http://victoriametrics:8428`) |
| Alert destination | Alertmanager (`http://alertmanager:9093`) |
| Rule files | `/etc/vmalert/rules/*.yml` |
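
As a rough sketch, the parameters above map onto vmalert command-line flags as follows. The Compose-style service layout, the untagged image reference, and the host path `./rules` are assumptions for illustration; only the flag values come from the table.

```yaml
# Hypothetical Compose service; only the flag values are taken from the table above.
services:
  vmalert:
    image: victoriametrics/vmalert   # assumption: pin a specific tag in practice
    command:
      - "-datasource.url=http://victoriametrics:8428"   # where rule expressions are evaluated
      - "-notifier.url=http://alertmanager:9093"        # where firing alerts are sent
      - "-rule=/etc/vmalert/rules/*.yml"                # glob of rule files to load
      - "-evaluationInterval=30s"                       # default interval between evaluations
    volumes:
      - ./rules:/etc/vmalert/rules:ro   # assumption: rule files live in ./rules on the host
```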

Alert Rules

Infrastructure Alerts

| Alert | Condition | For | Priority |
| --- | --- | --- | --- |
| NodeDown | `up{job="node-exporter"} == 0` | 2m | Critical |
| NodeHighCPU | `(1 - avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.90` | 10m | Warning |
| NodeHighMemory | `(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.90` | 5m | Warning |
| NodeDiskFull | `(node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.10` | 5m | Critical |
| NodeDiskPrediction | `predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24*3600) < 0` | 30m | Warning |
| SystemdUnitFailed | `node_systemd_unit_state{state="failed"} == 1` | 1m | Warning |
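
To illustrate how rows in this table could translate into a rule file, here is a minimal sketch of two of them in the Prometheus-compatible format that vmalert loads. The file name, group name, and annotation text are assumptions; the `severity` label values are chosen to line up with the Priority column and the Alertmanager routes.

```yaml
# Hypothetical file: /etc/vmalert/rules/infrastructure.yml
groups:
  - name: infrastructure   # group name is an assumption
    rules:
      - alert: NodeDown
        expr: up{job="node-exporter"} == 0
        for: 2m               # condition must hold for 2 minutes before firing
        labels:
          severity: critical  # matched by the Alertmanager route for critical alerts
        annotations:
          summary: "Node {{ $labels.instance }} is down"

      - alert: NodeHighCPU
        expr: (1 - avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.90
        for: 10m
        labels:
          severity: warning   # no dedicated route, so this uses the default receiver
        annotations:
          summary: "CPU usage above 90% on {{ $labels.instance }}"
```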

Kubernetes Alerts

| Alert | Condition | For | Priority |
| --- | --- | --- | --- |
| PodCrashLooping | `rate(kube_pod_container_status_restarts_total[15m]) > 0` | 5m | Warning |
| PodNotReady | `kube_pod_status_phase{phase=~"Pending\|Unknown"} == 1` | 10m | Warning |
| PodOOMKilled | `kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1` | 0m | Critical |
| DeploymentReplicasMismatch | `kube_deployment_spec_replicas != kube_deployment_status_ready_replicas` | 10m | Warning |
| K3sAPIDown | `up{job="k3s"} == 0` | 1m | Critical |
| K3sAPILatencyHigh | `histogram_quantile(0.99, rate(apiserver_request_duration_seconds_bucket[5m])) > 1` | 5m | Warning |
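
Two rows here behave a little differently from the rest, and the sketch below tries to show how. A `for` of `0m` means the alert fires on the first evaluation where the expression is true. The latency rule is written as a variant that sums buckets by `le`, which yields a single cluster-wide p99 rather than one value per label set; that aggregation is an assumption and not the expression from the table. The group name and annotation text are also illustrative.

```yaml
groups:
  - name: kubernetes   # group name is an assumption
    rules:
      - alert: PodOOMKilled
        expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
        for: 0m   # no waiting period: fires on the first matching evaluation
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.container }} in {{ $labels.namespace }}/{{ $labels.pod }} was OOM-killed"

      - alert: K3sAPILatencyHigh
        # Variant of the table's expression: buckets are summed by `le`
        # so the quantile is computed once for the whole API server.
        expr: histogram_quantile(0.99, sum by (le) (rate(apiserver_request_duration_seconds_bucket[5m]))) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'API server p99 latency is {{ printf "%.2f" $value }}s'
```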

Security Alerts

| Alert | Condition | For | Priority |
| --- | --- | --- | --- |
| CrowdSecBanSpike | `delta(cs_active_decisions[1h]) > 20` | 0m | Warning |
| SSHAuthFailure | `increase(node_logind_sessions_total{type="failed"}[5m]) > 5` | 0m | Warning |
| CertExpiringSoon | `(x509_cert_not_after - time()) / 86400 < 14` | 1h | Warning |
| CertExpired | `(x509_cert_not_after - time()) < 0` | 0m | Critical |
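
The certificate rules rely on a small piece of arithmetic: `x509_cert_not_after - time()` is the number of seconds until expiry, and dividing by 86400 converts that to days, so the 14-day threshold corresponds to 1,209,600 seconds. A possible rule-file sketch follows; the group name and annotation wording are assumptions.

```yaml
groups:
  - name: security   # group name is an assumption
    rules:
      - alert: CertExpiringSoon
        # Seconds until expiry divided by 86400 gives days remaining.
        expr: (x509_cert_not_after - time()) / 86400 < 14
        for: 1h
        labels:
          severity: warning
        annotations:
          # $value is the left-hand side of the comparison, i.e. days remaining.
          summary: 'Certificate expires in {{ printf "%.1f" $value }} days'

      - alert: CertExpired
        expr: (x509_cert_not_after - time()) < 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Certificate has expired"
```

An already-expired certificate also satisfies the first expression, so with these conditions both alerts would fire for it.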

Beast-Specific Alerts

| Alert | Condition | For | Priority |
| --- | --- | --- | --- |
| BeastLongSession | Custom: session > 8h | 0m | Warning |
| BeastOrphan | `absent(up{instance=~".*10.0.1.3.*"})` and no drain event | 30m | Warning |
| BeastBudgetAlert | Custom: monthly hours > 50 | 0m | Info |

Observability Meta-Alerts

| Alert | Condition | For | Priority |
| --- | --- | --- | --- |
| VictoriaMetricsDown | `up{job="victoriametrics"} == 0` | 1m | Critical |
| LokiDown | `up{job="loki"} == 0` | 2m | Critical |
| GrafanaDown | `up{job="grafana"} == 0` | 2m | Warning |
| AlertmanagerDown | `up{job="alertmanager"} == 0` | 1m | Critical |
| ScrapeTargetDown | `up == 0` and not Beast-related | 5m | Warning |
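
The "not Beast-related" condition on ScrapeTargetDown is described informally. One way it might be expressed is with a negative regex matcher on the `instance` label, as sketched below. This assumes Beast targets are exactly those whose `instance` label has the form `10.0.1.3:<port>`, based on the address in the BeastOrphan row; the group name and annotation text are also assumptions.

```yaml
groups:
  - name: observability-meta   # group name is an assumption
    rules:
      - alert: ScrapeTargetDown
        # Assumption: Beast targets have instance labels of the form 10.0.1.3:<port>.
        # Dots are escaped so they match literally; the regex is anchored on both ends.
        expr: up{instance!~"10\\.0\\.1\\.3:.*"} == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Scrape target {{ $labels.instance }} (job {{ $labels.job }}) is down"
```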

ntfy.sh Webhook Relay

Alertmanager sends notifications to ntfy.sh via webhook. ntfy.sh delivers push notifications to the phone app.

Alertmanager Configuration

route:
  receiver: ntfy
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

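  # Alerts that match no child route, including warnings, use the default ntfy receiver.
  routes:
    # Critical alerts go to a dedicated topic and repeat hourly.
    - matchers:
        - severity = "critical"
      receiver: ntfy-critical
      repeat_interval: 1h

    # Info alerts go to a quiet topic and repeat daily.
    - matchers:
        - severity = "info"
      receiver: ntfy-info
      repeat_interval: 24h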

receivers:
  - name: ntfy
    webhook_configs:
      - url: https://ntfy.sh/lron-alerts
        send_resolved: true

  - name: ntfy-critical
    webhook_configs:
      - url: https://ntfy.sh/lron-alerts-critical
        send_resolved: true

  - name: ntfy-info
    webhook_configs:
      - url: https://ntfy.sh/lron-alerts-info
        send_resolved: false

Priority Mapping

| Alert Priority | ntfy Priority | ntfy Topic | Phone Behavior |
| --- | --- | --- | --- |
| Critical | urgent (5) | lron-alerts-critical | Sound + vibrate + persistent notification |
| Warning | high (4) | lron-alerts | Sound + notification |
| Info | default (3) | lron-alerts-info | Silent notification |

ntfy.sh is free

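ntfy.sh is a free, open-source push notification service that requires no account. Topics on the public server are not access-controlled: anyone who knows a topic name can subscribe or publish to it, so use long random suffixes in production rather than guessable names. For true privacy, self-host ntfy on the Hub node.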
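
For the self-hosting option, a minimal Compose-style sketch might look like the following. The hostname, host port, and volume path are placeholders, and the choice to deny access by default is an assumption about what "true privacy" should mean here.

```yaml
# Hypothetical Compose service for a self-hosted ntfy server on the Hub node.
services:
  ntfy:
    image: binwiederhier/ntfy
    command:
      - serve
    environment:
      NTFY_BASE_URL: https://ntfy.example.internal   # assumption: placeholder hostname
      NTFY_AUTH_FILE: /var/lib/ntfy/user.db          # user and access-control database
      NTFY_AUTH_DEFAULT_ACCESS: deny-all             # topics are private unless access is granted
    volumes:
      - ./ntfy-data:/var/lib/ntfy   # assumption: local directory for persistent data
    ports:
      - "8080:80"                   # assumption: host port 8080; ntfy listens on 80 in the container
```

With access denied by default, the Alertmanager receivers would need to point at the self-hosted URL and supply credentials, for example through each webhook's `http_config`.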

Alert fatigue

The repeat_interval settings are tuned to limit alert fatigue: an alert that is still firing is re-sent every hour if critical, every 4 hours if warning (the default route), and every 24 hours if info. Adjust these values based on operational experience.