Risk Register
Resolved Risks
Risks that were identified during design and have been mitigated.
| ID | Risk | Mitigation | Status |
|---|---|---|---|
| R-01 | SSH brute-force on public IPs | endlessh tarpit on port 22, real SSH on 2222, CrowdSec IDS, key-only auth | Resolved |
| R-02 | Unauthorized web access | Authelia TOTP on all public endpoints, Caddy forward_auth | Resolved |
| R-03 | Secrets committed in plaintext | SOPS+age encryption for all secrets in repo | Resolved |
| R-04 | Beast VM forgotten running (cost overrun) | 8h warning, 12h auto-destroy, monthly hour tracking, budget hard stop | Resolved |
| R-05 | Configuration drift on long-lived VMs | Ansible idempotent playbooks, periodic re-runs, K3s + Fleet GitOps | Resolved |
| R-06 | No visibility into cluster health | Full observability stack: VictoriaMetrics, Loki, Grafana, Uptime Kuma | Resolved |
| R-07 | Alert fatigue from noisy alerts | Tuned repeat intervals, priority-based routing, Beast-aware suppression | Resolved |
| R-08 | DMZ compromise exposes management plane | Dedicated VM + Cilium NetworkPolicy isolation, no direct DMZ-to-monitoring path | Resolved |
| R-09 | TLS certificate expiration | Caddy auto-renewal via Let's Encrypt DNS-01, cert expiry alert rule | Resolved |
| R-10 | Single point of monitoring failure | UptimeRobot external monitoring as backup for Uptime Kuma | Resolved |
| R-11 | Lateral movement after node compromise | Cilium NetworkPolicy default-deny per namespace, UFW egress restrictions | Resolved |
| R-12 | Kubernetes secrets readable at rest | K3s --secrets-encryption flag (AES-CBC), encryption config backed up | Resolved |
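The R-04 mitigation describes a two-stage uptime policy for the Beast VM: a warning at 8 hours and automatic destruction at 12 hours. The following is a minimal sketch of that decision logic only. The names `BeastAction` and `beast_action` are hypothetical and do not come from the lab's code, and the monthly hour tracking and budget hard stop are not modelled.

```python
from enum import Enum


class BeastAction(Enum):
    """Possible outcomes of the uptime check (illustrative names)."""
    NONE = "none"
    WARN = "warn"
    DESTROY = "destroy"


# Thresholds taken from the R-04 mitigation in the table above.
WARN_AFTER_HOURS = 8
DESTROY_AFTER_HOURS = 12


def beast_action(uptime_hours: float) -> BeastAction:
    """Return the action the uptime policy calls for at a given uptime."""
    if uptime_hours < 0:
        raise ValueError("uptime_hours must be non-negative")
    # Check the stricter threshold first so 12h+ never degrades to a warning.
    if uptime_hours >= DESTROY_AFTER_HOURS:
        return BeastAction.DESTROY
    if uptime_hours >= WARN_AFTER_HOURS:
        return BeastAction.WARN
    return BeastAction.NONE
```

Keeping the decision in a pure function like this would make the thresholds easy to test independently of whatever mechanism actually sends the warning or destroys the VM.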
Remaining Risks
Risks that are accepted or partially mitigated.
| ID | Risk | Severity | Likelihood | Impact | Current Mitigation | Residual |
|---|---|---|---|---|---|---|
| R-20 | Hub VM total loss (disk failure, Hetzner issue) | High | Low | Full cluster down, 12h RPO | etcd snapshots every 12h, full rebuild procedure documented | 2-3h RTO, possible 12h data loss |
| R-21 | age key loss (locked out of all SOPS secrets) | Critical | Very Low | Cannot decrypt any secrets, full rebuild impossible without key | Bitwarden backup of age key | If Bitwarden also lost, unrecoverable |
| R-22 | Hetzner account compromise | Critical | Very Low | Attacker controls all infrastructure | Strong password + 2FA on Hetzner, API token scoped, CrowdSec on VMs | Attacker could destroy all VMs; rebuild from code + Bitwarden |
| R-23 | Supply chain attack via container images | Medium | Low | Compromised workloads | Use official images, pin versions, no third-party registries | No image scanning (Trivy not deployed) |
| R-24 | Hetzner region outage (FSN1) | Medium | Very Low | All VMs unavailable | No multi-region redundancy | Accept downtime, rebuild in NBG1/HEL1 if prolonged |
| R-25 | WireGuard key compromise | High | Very Low | Attacker gains tunnel access to private network | Key rotation procedure documented, WireGuard listens on non-standard port | Manual detection only (no WG auth logging) |
| R-26 | ntfy.sh service outage | Low | Low | Alert notifications not delivered | Alerts still visible in Grafana/vmalert UI | No backup notification channel |
| R-27 | K3s zero-day vulnerability | High | Low | Cluster compromise | Timely upgrades, CrowdSec, network segmentation | Manual patching, no auto-update |
| R-28 | OpenTofu state corruption | Medium | Very Low | Cannot manage infrastructure via IaC | State committed to Git (version history), tofu import as fallback | Manual state reconstruction possible but tedious |
| R-29 | Caddy/Authelia bypass vulnerability | High | Low | Unauthenticated access to internal services | Keep up to date, CrowdSec HTTP scenarios, minimal attack surface | Single-user lab, limited blast radius |
| R-30 | Google Drive sync exposes encrypted secrets | Low | Very Low | SOPS files visible on GDrive | Secrets are SOPS-encrypted (ciphertext only), GDrive access requires Google auth | Encrypted at rest, no cleartext exposure |
Risk Matrix
| Impact ↓ / Likelihood → | Very Low | Low | Medium | High |
|---|---|---|---|---|
| Critical | R-21, R-22 | | | |
| High | R-25 | R-20, R-27, R-29 | | |
| Medium | R-24, R-28 | R-23 | | |
| Low | R-30 | R-26 | | |
Accepted risks
R-20 (Hub total loss) and R-21 (age key loss) are the two highest-impact risks. Both are mitigated but not eliminated. The residual risk is accepted for a personal R&D lab. In a production Defence environment, these would require additional controls (HA cluster, HSM-backed keys, multi-region deployment).
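The risk matrix can be derived mechanically from the Remaining Risks table by grouping risk IDs on two columns. The sketch below assumes that the matrix's impact axis corresponds to the table's Severity column. The names `REMAINING_RISKS` and `build_matrix` are illustrative; the tuples are copied from the table.

```python
from collections import defaultdict

# (ID, severity, likelihood) copied from the Remaining Risks table.
REMAINING_RISKS = [
    ("R-20", "High", "Low"),
    ("R-21", "Critical", "Very Low"),
    ("R-22", "Critical", "Very Low"),
    ("R-23", "Medium", "Low"),
    ("R-24", "Medium", "Very Low"),
    ("R-25", "High", "Very Low"),
    ("R-26", "Low", "Low"),
    ("R-27", "High", "Low"),
    ("R-28", "Medium", "Very Low"),
    ("R-29", "High", "Low"),
    ("R-30", "Low", "Very Low"),
]


def build_matrix(risks):
    """Group risk IDs by (severity, likelihood) cell, keeping input order."""
    matrix = defaultdict(list)
    for risk_id, severity, likelihood in risks:
        matrix[(severity, likelihood)].append(risk_id)
    return dict(matrix)


if __name__ == "__main__":
    for cell, ids in sorted(build_matrix(REMAINING_RISKS).items()):
        print(cell, ", ".join(ids))
```

With the table as given, the `("High", "Low")` cell holds R-20, R-27, and R-29, and the `("Critical", "Very Low")` cell holds R-21 and R-22. Cells with no risks are simply absent from the result.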
Risk Review Schedule
| Review | Frequency | Actions |
|---|---|---|
| Monthly | After each Beast budget review | Check R-04 cost controls, review alert noise (R-07) |
| Quarterly | Every 3 months | Review remaining risks, check for new CVEs (R-27, R-29), rotate keys (R-25) |
| After incident | As needed | Add new risk entries, update mitigations |
| After architecture change | As needed | Re-evaluate all risks against new topology |
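The quarterly row is the only review tied to a fixed calendar interval ("every 3 months"). As a small sketch, the next due date could be computed from the date of the last review as shown below. The function name `next_quarterly_review` is hypothetical, and clamping to the end of a shorter month is an assumption about how such dates should be handled.

```python
import calendar
from datetime import date


def next_quarterly_review(last_review: date) -> date:
    """Return the date three calendar months after the last review."""
    # Work with a zero-based month index so year rollover is simple arithmetic.
    month_index = last_review.month - 1 + 3
    year = last_review.year + month_index // 12
    month = month_index % 12 + 1
    # Clamp the day for shorter months (for example, Nov 30 -> Feb 28).
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(last_review.day, last_day))
```

For example, `next_quarterly_review(date(2024, 1, 15))` returns `date(2024, 4, 15)`.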