# Backup & Recovery

## Backup Strategy
| Data | Location | Backup Method | Frequency | Retention |
|---|---|---|---|---|
| Infrastructure code | Git repo | GitLab + GDrive mirror | Every push + daily | Unlimited (git history) |
| SOPS-encrypted secrets | Git repo | GitLab + GDrive mirror | Every push + daily | Unlimited |
| OpenTofu state | Git repo (SOPS) | GitLab + GDrive mirror | After each apply | Unlimited |
| Bitwarden vault | Cloud | Bitwarden cloud sync | Continuous | Bitwarden retention |
| K3s etcd data | Hub VM `/var/lib/rancher/k3s/server/` | `k3s etcd-snapshot` | Every 12h | 5 snapshots |
| Grafana dashboards | Git repo `dashboards/` | GitLab | Every push | Unlimited |
| VictoriaMetrics data | Hub VM `/var/lib/victoria-metrics/` | None (expendable) | -- | 30 days |
| Loki log data | Hub VM `/var/lib/loki/` | None (expendable) | -- | 14 days |
| Beast VM | None | Not backed up (cattle) | -- | -- |
**Metrics and logs are expendable.** VictoriaMetrics and Loki data are not backed up. If lost, historical metrics and logs are gone, but the monitoring stack is rebuilt automatically. This is an acceptable trade-off for a personal lab.
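The 12-hour etcd snapshot schedule in the table maps to two K3s server settings. A minimal sketch, written to a temp file here for illustration; on the Hub these keys would live in `/etc/rancher/k3s/config.yaml` (or be passed as the equivalent `--etcd-snapshot-*` flags):

```shell
# Sketch: the etcd snapshot schedule from the table as K3s config keys.
# Written to a temp file for illustration only; on the Hub this belongs
# in /etc/rancher/k3s/config.yaml.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
# take an etcd snapshot every 12 hours...
etcd-snapshot-schedule-cron: "0 */12 * * *"
# ...and keep the 5 most recent
etcd-snapshot-retention: 5
EOF
cat "$cfg"
```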
## Loss Scenarios
| Scenario | Impact | RTO | RPO | Recovery Path |
|---|---|---|---|---|
| Beast VM dies | Zero -- cattle | 5 min | N/A | Run `beast-up.sh` |
| DMZ VM dies | Web access down | 30 min | 0 (stateless) | `tofu apply` + `ansible-playbook` |
| Hub VM dies | Full cluster down | 2-3 hours | 12h (etcd snapshot) | Full CX32 recovery procedure |
| Local workstation dies | Dev environment lost | 1 hour | Last push | Clone from GitLab, restore keys from Bitwarden |
| GitLab unavailable | No push/pull, Fleet paused | Varies | 0 | Wait for recovery or restore from GDrive mirror |
| SOPS key lost | Cannot decrypt secrets | 15 min | 0 | Restore age key from Bitwarden |
| Hetzner account compromise | Full infrastructure loss | 4+ hours | 12h | Revoke token, rebuild from code + Bitwarden |
| Hetzner region failure | Full infrastructure loss | 4+ hours | 12h | Rebuild in different region from code |
## CX32 Hub Recovery Procedure
This is the worst-case scenario: the Hub VM is completely lost and must be rebuilt.
**This is the most complex recovery.** The Hub is the only VM with meaningful state (K3s server data, etcd). All other VMs are stateless and can be recreated trivially.
### Prerequisites
- Access to GitLab repo (or GDrive mirror)
- Access to Bitwarden (age key, API tokens)
- Local workstation with OpenTofu, Ansible, kubectl, sops, age installed
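A small preflight script can confirm the toolchain before starting. The binary names below are the upstream defaults; adjust if your distro packages them differently:

```shell
# Preflight sketch: check that the recovery toolchain from the
# prerequisites list is installed. Binary names are the upstream
# defaults and may differ per distro.
missing=0
for tool in git tofu ansible-playbook kubectl sops age; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "ok: $tool"
    else
        echo "MISSING: $tool"
        missing=$((missing + 1))
    fi
done
echo "$missing tool(s) missing"
```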
### Steps
| Step | Action | Details |
|---|---|---|
| 1 | Export age key from Bitwarden | Save to `~/.config/sops/age/keys.txt` |
| 2 | Clone repo | `git clone <repo-url>` (or copy from GDrive) |
| 3 | Decrypt secrets | `sops --decrypt secrets/hetzner.yaml` to get API token |
| 4 | Create new Hub VM | `cd tofu/ && tofu apply -target=hcloud_server.hub` |
| 5 | Attach network | `tofu apply -target=hcloud_server_network.hub` |
| 6 | Bootstrap OS | `cd ansible/ && ansible-playbook -l hub site.yml` |
| 7 | Install K3s server | K3s install script with server flags (see Kubernetes) |
| 8 | Install Cilium | Helm install Cilium with cluster config |
| 9 | Install Rancher | Helm install Rancher (new bootstrap password) |
| 10 | Restore etcd (if available) | `k3s server --cluster-reset --cluster-reset-restore-path=<snapshot>` |
| 11 | Rejoin agent nodes | Update K3s token on DMZ and Beast (if running), restart agents |
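Steps 1-6 can be condensed into a single shell function. This is a sketch, not the procedure verbatim: `REPO_URL` stands in for the `<repo-url>` placeholder above, and the Bitwarden item name `sops-age-key` is hypothetical.

```shell
# Steps 1-6 condensed into one function; it is defined here but not run.
# REPO_URL stands in for the <repo-url> placeholder, and the Bitwarden
# item name "sops-age-key" is hypothetical.
recover_hub() {
    set -eu
    # 1. Restore the age key from Bitwarden to the sops default path
    mkdir -p ~/.config/sops/age
    bw get notes sops-age-key > ~/.config/sops/age/keys.txt
    # 2. Clone the infrastructure repo (or copy from the GDrive mirror)
    git clone "$REPO_URL" homelab
    cd homelab
    # 3. Decrypt secrets to recover the Hetzner API token
    sops --decrypt secrets/hetzner.yaml
    # 4-5. Recreate the Hub VM, then its private network attachment
    (cd tofu && tofu apply -target=hcloud_server.hub)
    (cd tofu && tofu apply -target=hcloud_server_network.hub)
    # 6. Bootstrap the OS
    (cd ansible && ansible-playbook -l hub site.yml)
}
```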
### Post-Recovery Verification
- `kubectl get nodes` shows Hub as Ready
- DMZ and Beast (if running) rejoin as Ready
- Rancher UI accessible via WireGuard
- Fleet syncs and deploys workloads
- VictoriaMetrics scraping resumes (historical data lost)
- Loki receiving logs (historical data lost)
- CrowdSec enrolled and parsing logs
- WireGuard tunnel re-established from home
- Alerting via ntfy functional
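The first few checks can be scripted; a sketch (defined, not run here) that assumes Fleet's default `fleet-local` namespace and a `monitoring` namespace for the metrics stack, both of which are assumptions:

```shell
# Sketch of the first checks from the list above; defined but not run.
# Assumes Fleet's default fleet-local namespace and a "monitoring"
# namespace for VictoriaMetrics/Loki (both assumptions).
verify_recovery() {
    # Hub shows Ready (and DMZ/Beast rejoin if running)
    kubectl get nodes
    # Fleet bundles have reconciled
    kubectl get bundles -n fleet-local
    # Monitoring workloads are back up
    kubectl get pods -n monitoring
}
```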
### If etcd Snapshot Is Unavailable
If no etcd snapshot exists (RPO exceeded):
- Install K3s server fresh (new cluster)
- Rejoin DMZ and Beast as new agent nodes
- Re-install Rancher (new instance)
- Fleet will redeploy all workloads from GitLab
- Grafana dashboards restore from `dashboards/` in repo
- Kubernetes Secrets must be recreated from SOPS-encrypted sources
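Recreating a Secret from its SOPS-encrypted source can look like the following sketch; the file path, the assumption that the encrypted file is a full Secret manifest, and the function name are all illustrative:

```shell
# Sketch: decrypt a SOPS-encrypted Secret manifest and apply it. The
# path, the manifest assumption, and the function name are illustrative;
# it is defined here but not run.
recreate_app_secret() {
    sops --decrypt secrets/app-credentials.yaml | kubectl apply -f -
}
```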
**Fleet makes this survivable.** Because all workloads are defined in Git and deployed via Fleet, a full cluster rebuild loses only runtime state (metrics, logs, running pods). The desired state is always recoverable from the repository.