Uptime Monitoring¶
Two-Layer Approach¶
Uptime monitoring uses two layers: an internal one (Uptime Kuma, self-hosted) and an external one (UptimeRobot, third-party). This keeps visibility even when the cluster itself is down.
```mermaid
flowchart TB
    subgraph Internal["Internal (Uptime Kuma on Hub)"]
        UK["Uptime Kuma"]
        UK -->|"probe"| T1["Caddy (DMZ)"]
        UK -->|"probe"| T2["Rancher (Hub)"]
        UK -->|"probe"| T3["Grafana (Hub)"]
        UK -->|"probe"| T4["VictoriaMetrics"]
        UK -->|"probe"| T5["Loki"]
        UK -->|"probe"| T6["K3s API"]
        UK -->|"probe"| T7["WireGuard"]
    end
    subgraph External["External (UptimeRobot)"]
        UR["UptimeRobot"]
        UR -->|"HTTP check"| E1["ttyd.vdhome.be"]
        UR -->|"HTTP check"| E2["auth.vdhome.be"]
        UR -->|"ping"| E3["Hub IP"]
        UR -->|"ping"| E4["DMZ IP"]
    end
    UK -->|"ntfy webhook"| Ntfy["ntfy.sh"]
    UR -->|"email"| Email["dev@vdhome.be"]
    style UK fill:#1a5276,stroke:#2980b9,color:#fff
    style UR fill:#1e8449,stroke:#27ae60,color:#fff
```
Uptime Kuma (Self-Hosted)¶
Uptime Kuma runs on the Hub node as a Kubernetes deployment, monitoring all internal services.
| Parameter | Value |
|---|---|
| Deployment | Hub node (lron/role=hub) |
| Namespace | monitoring |
| Port | 3001 |
| Check interval | 60s |
| Retry count | 3 |
| Notification | ntfy.sh webhook |
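Under those parameters, the Deployment might look like the following sketch. The image tag, PVC name, and volume path are assumptions; the namespace, port, and node selector come from the table above:

```yaml
# Sketch of the Uptime Kuma Deployment (image tag and PVC name assumed)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: uptime-kuma
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: uptime-kuma
  template:
    metadata:
      labels:
        app: uptime-kuma
    spec:
      nodeSelector:
        lron/role: hub          # pin to the Hub node
      containers:
        - name: uptime-kuma
          image: louislam/uptime-kuma:1   # assumed tag
          ports:
            - containerPort: 3001
          volumeMounts:
            - name: data
              mountPath: /app/data        # Uptime Kuma's state directory
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: uptime-kuma-data   # assumed PVC name
```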
Monitors¶
| Monitor | Type | Target | Interval | Timeout |
|---|---|---|---|---|
| Caddy HTTPS | HTTP(S) | https://ttyd.vdhome.be (expect 401/302) | 60s | 10s |
| Authelia | HTTP(S) | https://auth.vdhome.be/api/health | 60s | 10s |
| Rancher UI | HTTP(S) | https://rancher.vdhome.be/ping | 60s | 10s |
| Grafana | HTTP | http://grafana.monitoring:3000/api/health | 60s | 5s |
| VictoriaMetrics | HTTP | http://victoriametrics.monitoring:8428/-/healthy | 60s | 5s |
| Loki | HTTP | http://loki.monitoring:3100/ready | 60s | 5s |
| K3s API | TCP | 10.0.1.1:6443 | 60s | 5s |
| WireGuard | Ping | 10.0.2.2 (home peer) | 120s | 10s |
| DMZ SSH | TCP | 10.0.1.2:2222 | 60s | 5s |
| Beast SSH | TCP | 10.0.1.3:2222 | 60s | 5s |
| Hub disk | Push | Agent script checks disk > 90% | 300s | -- |
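The "Hub disk" push monitor implies a small agent on the Hub. A minimal cron-driven sketch, assuming the per-monitor push URL (with its token, taken from the monitor's settings page in Uptime Kuma) is exported as `KUMA_PUSH_URL`:

```python
#!/usr/bin/env python3
"""Heartbeat agent for the "Hub disk" push monitor (sketch).

Run from cron (e.g. */5 * * * *). A heartbeat is sent only while
root-disk usage is below the threshold; once usage crosses 90%, no
heartbeat arrives and Uptime Kuma marks the monitor down after the
300s window expires.
"""
import os
import shutil
import urllib.parse
import urllib.request

THRESHOLD_PCT = 90
# Assumption: push URL provided via the environment, e.g.
# http://uptime.monitoring:3001/api/push/<token>
PUSH_URL = os.environ.get("KUMA_PUSH_URL")


def disk_used_pct(path: str = "/") -> float:
    """Percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def send_heartbeat(url: str, used: float) -> None:
    """Report 'up' to Uptime Kuma's push endpoint."""
    msg = urllib.parse.quote(f"disk {used:.0f}% used")
    urllib.request.urlopen(f"{url}?status=up&msg={msg}", timeout=10)


if __name__ == "__main__":
    used = disk_used_pct()
    if PUSH_URL and used < THRESHOLD_PCT:
        send_heartbeat(PUSH_URL, used)
```

Silence-on-failure (rather than pushing `status=down`) keeps the agent dead-simple and also covers the case where the Hub is too wedged to run the script at all.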
**Beast monitoring**

The Beast SSH monitor will be in "down" state whenever Beast is not running. This is expected and does not trigger alerts: the monitor is paused via the Beast lifecycle scripts.
Status Page¶
Uptime Kuma provides a status page at http://uptime.monitoring:3001/status/lron (accessible via WireGuard). It shows:
- Current status of all monitors
- 30-day uptime percentage
- Average response time
- Incident history
UptimeRobot (External)¶
UptimeRobot provides monitoring from outside the Hetzner infrastructure. This catches scenarios where the entire Hetzner project is unreachable.
| Parameter | Value |
|---|---|
| Plan | Free (50 monitors, 5-min interval) |
| Check interval | 5 minutes |
| Alert contacts | dev@vdhome.be (email) |
External Monitors¶
| Monitor | Type | Target | Expected |
|---|---|---|---|
| ttyd.vdhome.be | HTTPS | https://ttyd.vdhome.be | 302 (redirect to auth) |
| auth.vdhome.be | HTTPS | https://auth.vdhome.be | 200 |
| Hub ping | Ping | Hub public IP | ICMP reply |
| DMZ ping | Ping | DMZ public IP | ICMP reply |
Why Both Internal and External?¶
| Scenario | Uptime Kuma | UptimeRobot |
|---|---|---|
| Single service crash | Detects (1 min) | Detects (5 min) |
| Hub VM down | Down itself | Detects (5 min) |
| Hetzner network issue | May not detect | Detects (5 min) |
| Internet routing issue (path-specific) | Detects (1 min) | May not detect |
| DNS failure | Detects (uses IPs internally) | Detects (uses domains) |
**Single point of failure**

Uptime Kuma runs on the Hub. If the Hub dies, internal monitoring dies with it. UptimeRobot is the safety net for this scenario.