Backup & Recovery

Backup Strategy

| Data | Location | Backup Method | Frequency | Retention |
|---|---|---|---|---|
| Infrastructure code | Git repo | GitLab + GDrive mirror | Every push + daily | Unlimited (git history) |
| SOPS-encrypted secrets | Git repo | GitLab + GDrive mirror | Every push + daily | Unlimited |
| OpenTofu state | Git repo (SOPS) | GitLab + GDrive mirror | After each apply | Unlimited |
| Bitwarden vault | Cloud | Bitwarden cloud sync | Continuous | Bitwarden retention |
| K3s etcd data | Hub VM `/var/lib/rancher/k3s/server/` | k3s etcd-snapshot | Every 12h | 5 snapshots |
| Grafana dashboards | Git repo `dashboards/` | GitLab | Every push | Unlimited |
| VictoriaMetrics data | Hub VM `/var/lib/victoria-metrics/` | None (expendable) | -- | 30 days |
| Loki log data | Hub VM `/var/lib/loki/` | None (expendable) | -- | 14 days |
| Beast VM | None | Not backed up (cattle) | -- | -- |
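The 12-hour / 5-snapshot etcd policy in the table maps directly onto k3s's built-in snapshot flags (these values are in fact the k3s defaults); a sketch of the relevant server flags, should the policy ever need tuning:

```shell
# k3s server flags matching the snapshot policy in the table
# (12-hourly snapshots, keep the 5 most recent; both are k3s defaults).
k3s server \
  --etcd-snapshot-schedule-cron '0 */12 * * *' \
  --etcd-snapshot-retention 5 \
  --etcd-snapshot-dir /var/lib/rancher/k3s/server/db/snapshots
```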

Metrics and logs are expendable

VictoriaMetrics and Loki data are not backed up. If lost, historical metrics and logs are gone, but the monitoring stack is rebuilt automatically. This is an acceptable trade-off for a personal lab.

Loss Scenarios

| Scenario | Impact | RTO | RPO | Recovery Path |
|---|---|---|---|---|
| Beast VM dies | Zero -- cattle | 5 min | N/A | Run `beast-up.sh` |
| DMZ VM dies | Web access down | 30 min | 0 (stateless) | `tofu apply` + `ansible-playbook` |
| Hub VM dies | Full cluster down | 2-3 hours | 12h (etcd snapshot) | Full CX32 recovery procedure |
| Local workstation dies | Dev environment lost | 1 hour | Last push | Clone from GitLab, restore keys from Bitwarden |
| GitLab unavailable | No push/pull, Fleet paused | Varies | 0 | Wait for recovery or restore from GDrive mirror |
| SOPS key lost | Cannot decrypt secrets | 15 min | 0 | Restore age key from Bitwarden |
| Hetzner account compromise | Full infrastructure loss | 4+ hours | 12h | Revoke token, rebuild from code + Bitwarden |
| Hetzner region failure | Full infrastructure loss | 4+ hours | 12h | Rebuild in a different region from code |

CX32 Hub Recovery Procedure

This is the worst-case scenario: the Hub VM is completely lost and must be rebuilt.

This is the most complex recovery

The Hub is the only VM with meaningful state (K3s server data, etcd). All other VMs are stateless and can be recreated trivially.

Prerequisites

  • Access to GitLab repo (or GDrive mirror)
  • Access to Bitwarden (age key, API tokens)
  • Local workstation with OpenTofu, Ansible, kubectl, sops, age installed
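The prerequisites above can be verified before starting; a small preflight sketch (the tool names are the common binaries for each prerequisite, and may differ if installed under other names):

```shell
# Preflight check: confirm each tool required for recovery is on PATH.
for tool in git tofu ansible-playbook kubectl sops age; do
  if command -v "$tool" >/dev/null 2>&1; then
    printf 'ok      %s\n' "$tool"
  else
    printf 'MISSING %s\n' "$tool"
  fi
done
```

Running this first avoids discovering a missing binary halfway through the rebuild.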

Steps

| Step | Action | Details |
|---|---|---|
| 1 | Export age key from Bitwarden | Save to `~/.config/sops/age/keys.txt` |
| 2 | Clone repo | `git clone <repo-url>` (or copy from GDrive) |
| 3 | Decrypt secrets | `sops --decrypt secrets/hetzner.yaml` to get API token |
| 4 | Create new Hub VM | `cd tofu/ && tofu apply -target=hcloud_server.hub` |
| 5 | Attach network | `tofu apply -target=hcloud_server_network.hub` |
| 6 | Bootstrap OS | `cd ansible/ && ansible-playbook -l hub site.yml` |
| 7 | Install K3s server | K3s install script with server flags (see Kubernetes) |
| 8 | Install Cilium | Helm install Cilium with cluster config |
| 9 | Install Rancher | Helm install Rancher (new bootstrap password) |
| 10 | Restore etcd (if available) | `k3s server --cluster-reset --cluster-reset-restore-path=<snapshot>` |
| 11 | Rejoin agent nodes | Update K3s token on DMZ and Beast (if running), restart agents |
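Step 10 is worth expanding: the cluster-reset restore must run while the k3s service is stopped, and k3s exits once the reset completes. A sketch, assuming a systemd-managed k3s and the default snapshot directory (`<snapshot>` stays a placeholder for the chosen snapshot file):

```shell
# Expanded version of step 10 (assumes systemd-managed k3s).
systemctl stop k3s
k3s server \
  --cluster-reset \
  --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/<snapshot>
# k3s exits after the reset completes; then start the service normally.
systemctl start k3s
```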

Post-Recovery Verification

  • kubectl get nodes shows Hub as Ready
  • DMZ and Beast (if running) rejoin as Ready
  • Rancher UI accessible via WireGuard
  • Fleet syncs and deploys workloads
  • VictoriaMetrics scraping resumes (historical data lost)
  • Loki receiving logs (historical data lost)
  • CrowdSec enrolled and parsing logs
  • WireGuard tunnel re-established from home
  • Alerting via ntfy functional

If etcd Snapshot Is Unavailable

If no etcd snapshot exists (RPO exceeded):

  1. Install K3s server fresh (new cluster)
  2. Rejoin DMZ and Beast as new agent nodes
  3. Re-install Rancher (new instance)
  4. Fleet will redeploy all workloads from GitLab
  5. Grafana dashboards are restored from dashboards/ in the repo
  6. Kubernetes Secrets must be recreated from their SOPS-encrypted sources
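Step 6 can be done per secret by piping the decrypted manifest straight into kubectl, so plaintext never touches disk; a sketch, assuming the repo keeps SOPS-encrypted Kubernetes Secret manifests under `secrets/` (the filename here is a hypothetical example):

```shell
# Recreate one Secret from its SOPS-encrypted manifest without
# writing plaintext to disk (file path is illustrative).
sops --decrypt secrets/app-credentials.yaml | kubectl apply -f -
```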

Fleet makes this survivable

Because all workloads are defined in Git and deployed via Fleet, a full cluster rebuild only loses runtime state (metrics, logs, running pods). The desired state is always recoverable from the repository.