Alerting

Architecture

flowchart LR
    VM["VictoriaMetrics<br/>(metrics)"]
    VMA["vmalert<br/>(rule evaluation)"]
    AM["Alertmanager<br/>(routing)"]
    Ntfy["ntfy.sh<br/>(push notifications)"]
    Phone["Phone<br/>(ntfy app)"]

    VM -->|"PromQL queries"| VMA
    VMA -->|"firing alerts"| AM
    AM -->|"webhook"| Ntfy
    Ntfy -->|"push"| Phone

    style VMA fill:#7b241c,stroke:#c0392b,color:#fff
    style Ntfy fill:#1e8449,stroke:#27ae60,color:#fff

vmalert

vmalert evaluates alerting rules against VictoriaMetrics and sends firing alerts to Alertmanager.

Parameter	Value
Evaluation interval	30s
Data source	VictoriaMetrics (http://victoriametrics:8428)
Alert destination	Alertmanager (http://alertmanager:9093)
Rule files	`/etc/vmalert/rules/*.yml`
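
The parameters above correspond to vmalert command-line flags. The following is a minimal sketch of how they might be passed, written as a Docker Compose service. The service name, image tag, and host-side `./rules` path are assumptions for illustration and are not taken from this deployment.

```yaml
# Sketch only: service name, image tag, and host paths are assumptions.
services:
  vmalert:
    image: victoriametrics/vmalert:latest
    command:
      - "-datasource.url=http://victoriametrics:8428"  # where rule expressions are evaluated
      - "-notifier.url=http://alertmanager:9093"       # where firing alerts are sent
      - "-rule=/etc/vmalert/rules/*.yml"               # rule files (glob pattern)
      - "-evaluationInterval=30s"                      # default evaluation interval
    volumes:
      - ./rules:/etc/vmalert/rules:ro                  # hypothetical host directory
```

The same four flags apply if vmalert runs as a Kubernetes container; only the wrapper around them changes.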

Alert Rules

Infrastructure Alerts

Alert	Condition	For	Priority
NodeDown	`up{job="node-exporter"} == 0`	2m	Critical
NodeHighCPU	`(1 - avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.90`	10m	Warning
NodeHighMemory	`(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.90`	5m	Warning
NodeDiskFull	`(node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.10`	5m	Critical
NodeDiskPrediction	`predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24*3600) < 0`	30m	Warning
SystemdUnitFailed	`node_systemd_unit_state{state="failed"} == 1`	1m	Warning
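
Each row in the table becomes one entry in a vmalert rule file, which uses the Prometheus-compatible rule format. Below is a sketch of two of the rules. The file name, group name, and annotation text are assumptions, and the lowercase `severity` label values are chosen to line up with the Alertmanager routes in this document.

```yaml
# Sketch of /etc/vmalert/rules/infrastructure.yml (file and group names are assumptions).
groups:
  - name: infrastructure
    interval: 30s
    rules:
      - alert: NodeDown
        expr: up{job="node-exporter"} == 0
        for: 2m
        labels:
          severity: critical   # label the Alertmanager routes match on
        annotations:
          summary: "Node {{ $labels.instance }} is down"

      - alert: NodeDiskFull
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"}
            / node_filesystem_size_bytes{mountpoint="/"}) < 0.10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Root filesystem on {{ $labels.instance }} has less than 10% free"
```

The `for` value is the time the condition must hold continuously before the alert moves from pending to firing.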

Kubernetes Alerts

Alert	Condition	For	Priority
PodCrashLooping	`rate(kube_pod_container_status_restarts_total[15m]) > 0`	5m	Warning
PodNotReady	`kube_pod_status_phase{phase=~"Pending|Unknown"} == 1`	10m	Warning
PodOOMKilled	`kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1`	0m	Critical
DeploymentReplicasMismatch	`kube_deployment_spec_replicas != kube_deployment_status_ready_replicas`	10m	Warning
K3sAPIDown	`up{job="k3s"} == 0`	1m	Critical
K3sAPILatencyHigh	`histogram_quantile(0.99, rate(apiserver_request_duration_seconds_bucket[5m])) > 1`	5m	Warning

Security Alerts

Alert	Condition	For	Priority
CrowdSecBanSpike	`delta(cs_active_decisions[1h]) > 20`	0m	Warning
SSHAuthFailure	`increase(node_logind_sessions_total{type="failed"}[5m]) > 5`	0m	Warning
CertExpiringSoon	`(x509_cert_not_after - time()) / 86400 < 14`	1h	Warning
CertExpired	`(x509_cert_not_after - time()) < 0`	0m	Critical

Beast-Specific Alerts

Alert	Condition	For	Priority
BeastLongSession	Custom: session > 8h	0m	Warning
BeastOrphan	`absent(up{instance=~".*10.0.1.3.*"})` and no drain event	30m	Warning
BeastBudgetAlert	Custom: monthly hours > 50	0m	Info

Observability Meta-Alerts

Alert	Condition	For	Priority
VictoriaMetricsDown	`up{job="victoriametrics"} == 0`	1m	Critical
LokiDown	`up{job="loki"} == 0`	2m	Critical
GrafanaDown	`up{job="grafana"} == 0`	2m	Warning
AlertmanagerDown	`up{job="alertmanager"} == 0`	1m	Critical
ScrapeTargetDown	`up == 0` and not Beast-related	5m	Warning
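
The "not Beast-related" condition on ScrapeTargetDown can be expressed as a negative label matcher. The sketch below assumes Beast targets are identified by the `10.0.1.3` address that appears in the BeastOrphan rule; the group name and the exact regex are illustrative and should be adapted to the real instance labels.

```yaml
# Sketch only: assumes Beast targets carry an instance label starting with 10.0.1.3.
groups:
  - name: observability-meta
    rules:
      - alert: ScrapeTargetDown
        # !~ excludes Beast, which is expected to be offline between sessions.
        # Dots are escaped so they match literally; (:.*)? allows an optional port.
        expr: |
          up{instance!~"10\\.0\\.1\\.3(:.*)?"} == 0
        for: 5m
        labels:
          severity: warning
```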

ntfy.sh Webhook Relay

Alertmanager sends notifications to ntfy.sh via webhook. ntfy.sh delivers push notifications to the phone app.

Alertmanager Configuration

route:
  receiver: ntfy
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    - match:
        severity: critical
      receiver: ntfy-critical
      repeat_interval: 1h

    - match:
        severity: info
      receiver: ntfy-info
      repeat_interval: 24h

receivers:
  - name: ntfy
    webhook_configs:
      - url: https://ntfy.sh/lron-alerts
        send_resolved: true

  - name: ntfy-critical
    webhook_configs:
      - url: https://ntfy.sh/lron-alerts-critical
        send_resolved: true

  - name: ntfy-info
    webhook_configs:
      - url: https://ntfy.sh/lron-alerts-info
        send_resolved: false
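
Recent Alertmanager releases deprecate `match` in favor of `matchers`. As a sketch, and assuming an Alertmanager version that supports `matchers` (0.22 or later), the same routing tree could be written as follows. The receivers are unchanged.

```yaml
# Sketch: equivalent routes using the newer `matchers` syntax.
route:
  receiver: ntfy
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    - matchers:
        - 'severity = "critical"'
      receiver: ntfy-critical
      repeat_interval: 1h

    - matchers:
        - 'severity = "info"'
      receiver: ntfy-info
      repeat_interval: 24h
```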

Priority Mapping

Alert Priority	ntfy Priority	ntfy Topic	Phone Behavior
Critical	`urgent` (5)	`lron-alerts-critical`	Sound + vibrate + persistent notification
Warning	`high` (4)	`lron-alerts`	Sound + notification
Info	`default` (3)	`lron-alerts-info`	Silent notification
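
The receiver URLs in the Alertmanager configuration select a topic but do not themselves set an ntfy priority. One possible way to apply the mapping above is to pass the priority as a query parameter on each webhook URL. This sketch assumes ntfy accepts `priority` in the query string as an alternative to the `X-Priority` header, which should be checked against the ntfy publishing documentation for the version in use.

```yaml
# Sketch only: assumes ntfy honours ?priority=... on the publish URL.
receivers:
  - name: ntfy-critical
    webhook_configs:
      - url: https://ntfy.sh/lron-alerts-critical?priority=urgent
        send_resolved: true

  - name: ntfy
    webhook_configs:
      - url: https://ntfy.sh/lron-alerts?priority=high
        send_resolved: true

  - name: ntfy-info
    webhook_configs:
      - url: https://ntfy.sh/lron-alerts-info?priority=default
        send_resolved: false
```

Alertmanager's webhook receiver posts its standard JSON payload, so without further templating the notification body is that raw JSON.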

ntfy.sh is free

ntfy.sh is a free, open-source push notification service. No account is required. Topics are public: anyone who knows a topic name can subscribe and publish to it, so in production use topic names with long random suffixes that are hard to guess. For true privacy, self-host ntfy on the Hub node.
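
For the self-hosted option, a minimal ntfy `server.yml` might look like the sketch below. The hostname, port, and file paths are assumptions for illustration; with access control enabled, Alertmanager would also need credentials for the topics it publishes to.

```yaml
# Sketch of /etc/ntfy/server.yml (hostname, port, and paths are assumptions).
base-url: "https://ntfy.example.internal"   # hypothetical URL of the Hub-hosted instance
listen-http: ":8080"
auth-file: "/var/lib/ntfy/user.db"          # enables user and access-control management
auth-default-access: "deny-all"             # topics are private unless access is granted
```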

Alert fatigue

The `repeat_interval` settings are tuned to limit alert fatigue: critical alerts repeat every hour, warnings every 4 hours (the default route), and info alerts every 24 hours. Adjust these values as operational experience accumulates.