# Backup & Recovery

## Backup Strategy
| Data | Location | Backup Method | Frequency | Retention |
|---|---|---|---|---|
| Infrastructure code | Git repo | GitLab + GDrive mirror | Every push + daily | Unlimited (git history) |
| SOPS-encrypted secrets | Git repo | GitLab + GDrive mirror | Every push + daily | Unlimited |
| OpenTofu state | Git repo (SOPS) | GitLab + GDrive mirror | After each apply | Unlimited |
| Bitwarden vault | Cloud | Bitwarden cloud sync | Continuous | Bitwarden retention |
| K3s etcd data | Hub VM `/var/lib/rancher/k3s/server/` | `k3s etcd-snapshot` | Every 12h | 5 snapshots |
| Grafana dashboards | Git repo `dashboards/` | GitLab | Every push | Unlimited |
| VictoriaMetrics data | Hub VM `/var/lib/victoria-metrics/` | None (expendable) | -- | 30 days |
| Loki log data | Hub VM `/var/lib/loki/` | None (expendable) | -- | 14 days |
| Beast VM | None | Not backed up (cattle) | -- | -- |
**Metrics and logs are expendable.** VictoriaMetrics and Loki data are not backed up. If lost, historical metrics and logs are gone, but the monitoring stack is rebuilt automatically. This is an acceptable trade-off for a personal lab.
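The 12-hour etcd snapshot schedule in the table maps to two K3s server settings. A minimal sketch, written to a temp file here for illustration; on the Hub these keys would live in `/etc/rancher/k3s/config.yaml` (or be passed as the equivalent `--etcd-snapshot-*` flags):

```shell
# Sketch: the etcd snapshot schedule from the table as K3s config keys.
# Written to a temp file for illustration only; on the Hub this belongs
# in /etc/rancher/k3s/config.yaml.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
# take an etcd snapshot every 12 hours...
etcd-snapshot-schedule-cron: "0 */12 * * *"
# ...and keep the 5 most recent
etcd-snapshot-retention: 5
EOF
cat "$cfg"
```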
## Loss Scenarios
| Scenario | Impact | RTO | RPO | Recovery Path |
|---|---|---|---|---|
| Beast VM dies | Zero -- cattle | 5 min | N/A | Run `beast-up.sh` |
| DMZ VM dies | Web access down | 30 min | 0 (stateless) | `tofu apply` + `ansible-playbook` |
| Hub VM dies | Full cluster down | 2-3 hours | 12h (etcd snapshot) | Full CX32 recovery procedure |
| Local workstation dies | Dev environment lost | 1 hour | Last push | Clone from GitLab, restore keys from Bitwarden |
| GitLab unavailable | No push/pull, Fleet paused | Varies | 0 | Wait for recovery or restore from GDrive mirror |
| SOPS key lost | Cannot decrypt secrets | 15 min | 0 | Restore age key from Bitwarden |
| Hetzner account compromise | Full infrastructure loss | 4+ hours | 12h | Revoke token, rebuild from code + Bitwarden |
| Hetzner region failure | Full infrastructure loss | 4+ hours | 12h | Rebuild in different region from code |
## CX32 Hub Recovery Procedure
This is the worst-case scenario: the Hub VM is completely lost and must be rebuilt.
**This is the most complex recovery.** The Hub is the only VM with meaningful state (K3s server data, etcd). All other VMs are stateless and can be recreated trivially.
### Prerequisites
- Access to GitLab repo (or GDrive mirror)
- Access to Bitwarden (age key, API tokens)
- Local workstation with OpenTofu, Ansible, kubectl, sops, age installed
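A small preflight script can confirm the toolchain before starting. The binary names below are the upstream defaults; adjust if your distro packages them differently:

```shell
# Preflight sketch: check that the recovery toolchain from the
# prerequisites list is installed. Binary names are the upstream
# defaults and may differ per distro.
missing=0
for tool in git tofu ansible-playbook kubectl sops age; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "ok: $tool"
    else
        echo "MISSING: $tool"
        missing=$((missing + 1))
    fi
done
echo "$missing tool(s) missing"
```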
### Steps
| Step | Action | Details |
|---|---|---|
| 1 | Export age key from Bitwarden | Save to `~/.config/sops/age/keys.txt` |
| 2 | Clone repo | `git clone <repo-url>` (or copy from GDrive) |
| 3 | Decrypt secrets | `sops --decrypt secrets/hetzner.yaml` to get API token |
| 4 | Create new Hub VM | `cd tofu/ && tofu apply -target=hcloud_server.hub` |
| 5 | Attach network | `tofu apply -target=hcloud_server_network.hub` |
| 6 | Bootstrap OS | `cd ansible/ && ansible-playbook -l hub site.yml` |
| 7 | Install K3s server | K3s install script with server flags (see Kubernetes) |
| 8 | Install Cilium | Helm install Cilium with cluster config |
| 9 | Install Rancher | Helm install Rancher (new bootstrap password) |
| 10 | Restore etcd (if available) | `k3s server --cluster-reset --cluster-reset-restore-path=<snapshot>` |
| 11 | Rejoin agent nodes | Update K3s token on DMZ and Beast (if running), restart agents |
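Steps 1-6 can be condensed into a single shell function. This is a sketch, not the procedure verbatim: `REPO_URL` stands in for the `<repo-url>` placeholder above, and the Bitwarden item name `sops-age-key` is hypothetical.

```shell
# Steps 1-6 condensed into one function; it is defined here but not run.
# REPO_URL stands in for the <repo-url> placeholder, and the Bitwarden
# item name "sops-age-key" is hypothetical.
recover_hub() {
    set -eu
    # 1. Restore the age key from Bitwarden to the sops default path
    mkdir -p ~/.config/sops/age
    bw get notes sops-age-key > ~/.config/sops/age/keys.txt
    # 2. Clone the infrastructure repo (or copy from the GDrive mirror)
    git clone "$REPO_URL" homelab
    cd homelab
    # 3. Decrypt secrets to recover the Hetzner API token
    sops --decrypt secrets/hetzner.yaml
    # 4-5. Recreate the Hub VM, then its private network attachment
    (cd tofu && tofu apply -target=hcloud_server.hub)
    (cd tofu && tofu apply -target=hcloud_server_network.hub)
    # 6. Bootstrap the OS
    (cd ansible && ansible-playbook -l hub site.yml)
}
```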
### Post-Recovery Verification
- `kubectl get nodes` shows Hub as Ready
- DMZ and Beast (if running) rejoin as Ready
- Rancher UI accessible via WireGuard
- Fleet syncs and deploys workloads
- VictoriaMetrics scraping resumes (historical data lost)
- Loki receiving logs (historical data lost)
- CrowdSec enrolled and parsing logs
- WireGuard tunnel re-established from home
- Alerting via ntfy functional
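The first few checks can be scripted; a sketch (defined, not run here) that assumes Fleet's default `fleet-local` namespace and a `monitoring` namespace for the metrics stack, both of which are assumptions:

```shell
# Sketch of the first checks from the list above; defined but not run.
# Assumes Fleet's default fleet-local namespace and a "monitoring"
# namespace for VictoriaMetrics/Loki (both assumptions).
verify_recovery() {
    # Hub shows Ready (and DMZ/Beast rejoin if running)
    kubectl get nodes
    # Fleet bundles have reconciled
    kubectl get bundles -n fleet-local
    # Monitoring workloads are back up
    kubectl get pods -n monitoring
}
```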
### If etcd Snapshot Is Unavailable
If no etcd snapshot exists (RPO exceeded):
- Install K3s server fresh (new cluster)
- Rejoin DMZ and Beast as new agent nodes
- Re-install Rancher (new instance)
- Fleet will redeploy all workloads from GitLab
- Grafana dashboards restore from `dashboards/` in repo
- Kubernetes Secrets must be recreated from SOPS-encrypted sources
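Recreating a Secret from its SOPS-encrypted source can look like the following sketch; the file path, the assumption that the encrypted file is a full Secret manifest, and the function name are all illustrative:

```shell
# Sketch: decrypt a SOPS-encrypted Secret manifest and apply it. The
# path, the manifest assumption, and the function name are illustrative;
# it is defined here but not run.
recreate_app_secret() {
    sops --decrypt secrets/app-credentials.yaml | kubectl apply -f -
}
```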
**Fleet makes this survivable.** Because all workloads are defined in Git and deployed via Fleet, a full cluster rebuild loses only runtime state (metrics, logs, running pods). The desired state is always recoverable from the repository.