Risk Register

Resolved Risks

Risks that were identified during design and have been mitigated.

ID	Risk	Mitigation	Status
R-01	SSH brute-force on public IPs	endlessh tarpit on port 22, real SSH on 2222, CrowdSec IDS, key-only auth	Resolved
R-02	Unauthorized web access	Authelia TOTP on all public endpoints, Caddy forward_auth	Resolved
R-03	Secrets committed in plaintext	SOPS+age encryption for all secrets in repo	Resolved
R-04	Beast VM forgotten running (cost overrun)	8h warning, 12h auto-destroy, monthly hour tracking, budget hard stop	Resolved
R-05	Configuration drift on long-lived VMs	Ansible idempotent playbooks, periodic re-runs, K3s + Fleet GitOps	Resolved
R-06	No visibility into cluster health	Full observability stack: VictoriaMetrics, Loki, Grafana, Uptime Kuma	Resolved
R-07	Alert fatigue from noisy alerts	Tuned repeat intervals, priority-based routing, Beast-aware suppression	Resolved
R-08	DMZ compromise exposes management plane	Dedicated VM + Cilium NetworkPolicy isolation, no direct DMZ-to-monitoring path	Resolved
R-09	TLS certificate expiration	Caddy auto-renewal via Let's Encrypt DNS-01, cert expiry alert rule	Resolved
R-10	Single point of monitoring failure	UptimeRobot external monitoring as backup for Uptime Kuma	Resolved
R-11	Lateral movement after node compromise	Cilium NetworkPolicy default-deny per namespace, UFW egress restrictions	Resolved
R-12	Kubernetes secrets readable at rest	K3s `--secrets-encryption` flag (AES-CBC), encryption config backed up	Resolved
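
The R-04 mitigation is a time-threshold policy, so it can be sketched in a few lines. The 8h and 12h values come from the table above; the function name and the three return values are illustrative assumptions, not the lab's actual automation, and the monthly hour tracking and budget hard stop are left out.

```python
from datetime import datetime, timedelta, timezone

# Thresholds from R-04. Everything else in this sketch is hypothetical.
WARN_AFTER = timedelta(hours=8)
DESTROY_AFTER = timedelta(hours=12)


def beast_action(started_at: datetime, now: datetime) -> str:
    """Return 'ok', 'warn', or 'destroy' for a Beast VM started at `started_at`."""
    uptime = now - started_at
    if uptime >= DESTROY_AFTER:
        return "destroy"
    if uptime >= WARN_AFTER:
        return "warn"
    return "ok"


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 8, 0, tzinfo=timezone.utc)
    for hours in (1, 8, 12):
        print(hours, beast_action(start, start + timedelta(hours=hours)))
```

Checking the destroy threshold first keeps the two conditions from overlapping: a VM that has passed 12h is never reported as merely "warn".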

Remaining Risks

Risks that are accepted or partially mitigated.

ID	Risk	Severity	Likelihood	Impact	Current Mitigation	Residual
R-20	Hub VM total loss (disk failure, Hetzner issue)	High	Low	Full cluster down, 12h RPO	etcd snapshots every 12h, full rebuild procedure documented	2-3h RTO, possible 12h data loss
R-21	age key loss (locked out of all SOPS secrets)	Critical	Very Low	Cannot decrypt any secrets, full rebuild impossible without key	Bitwarden backup of age key	If Bitwarden also lost, unrecoverable
R-22	Hetzner account compromise	Critical	Very Low	Attacker controls all infrastructure	Strong password + 2FA on Hetzner, API token scoped, CrowdSec on VMs	Attacker could destroy all VMs; rebuild from code + Bitwarden
R-23	Supply chain attack via container images	Medium	Low	Compromised workloads	Use official images, pin versions, no third-party registries	No image scanning (Trivy not deployed)
R-24	Hetzner region outage (FSN1)	Medium	Very Low	All VMs unavailable	No multi-region redundancy	Accept downtime, rebuild in NBG1/HEL1 if prolonged
R-25	WireGuard key compromise	High	Very Low	Attacker gains tunnel access to private network	Key rotation procedure documented, WireGuard listens on non-standard port	Manual detection only (no WG auth logging)
R-26	ntfy.sh service outage	Low	Low	Alert notifications not delivered	Alerts still visible in Grafana/vmalert UI	No backup notification channel
R-27	K3s zero-day vulnerability	High	Low	Cluster compromise	Timely upgrades, CrowdSec, network segmentation	Manual patching, no auto-update
R-28	OpenTofu state corruption	Medium	Very Low	Cannot manage infrastructure via IaC	State committed to Git (version history), `tofu import` as fallback	Manual state reconstruction possible but tedious
R-29	Caddy/Authelia bypass vulnerability	High	Low	Unauthenticated access to internal services	Keep up to date, CrowdSec HTTP scenarios, minimal attack surface	Single-user lab, limited blast radius
R-30	Google Drive sync exposes encrypted secrets	Low	Very Low	SOPS files visible on GDrive	Secrets are SOPS-encrypted (ciphertext only), GDrive access requires Google auth	Encrypted at rest, no cleartext exposure
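
The 12h RPO in R-20 only holds while snapshots keep arriving on schedule. A minimal sketch of a freshness check follows, assuming the timestamp of the last etcd snapshot is available from somewhere (file mtime, job log). The one-hour grace period and both function names are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

SNAPSHOT_INTERVAL = timedelta(hours=12)  # from R-20
GRACE = timedelta(hours=1)  # assumed slack for a snapshot job that runs late


def snapshot_is_stale(last_snapshot: datetime, now: datetime) -> bool:
    """True when the newest snapshot is older than the interval plus grace."""
    return now - last_snapshot > SNAPSHOT_INTERVAL + GRACE


def data_loss_window(last_snapshot: datetime, failure_time: datetime) -> timedelta:
    """Data written after the last snapshot is lost if the Hub VM dies now."""
    return failure_time - last_snapshot


if __name__ == "__main__":
    last = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
    now = last + timedelta(hours=14)
    print("stale:", snapshot_is_stale(last, now))
    print("would lose:", data_loss_window(last, now))
```

A stale result means the real exposure already exceeds the documented 12h, which is worth an alert of its own.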

Risk Matrix

Likelihood →   Very Low    Low         Medium      High
Impact ↓
──────────────────────────────────────────────────────────
Critical        R-21,R-22                           
High            R-25        R-20,R-27,R-29
Medium          R-28,R-24   R-23                    
Low             R-30        R-26
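
A hand-maintained matrix can drift from the register, so it may help to derive it instead. The sketch below groups the risk IDs by the Severity and Likelihood columns of the Remaining Risks table. It assumes the matrix's Impact axis corresponds to the Severity column; the function names and column widths are illustrative.

```python
from collections import defaultdict

# (ID, severity, likelihood) copied from the Remaining Risks table.
RISKS = [
    ("R-20", "High", "Low"),
    ("R-21", "Critical", "Very Low"),
    ("R-22", "Critical", "Very Low"),
    ("R-23", "Medium", "Low"),
    ("R-24", "Medium", "Very Low"),
    ("R-25", "High", "Very Low"),
    ("R-26", "Low", "Low"),
    ("R-27", "High", "Low"),
    ("R-28", "Medium", "Very Low"),
    ("R-29", "High", "Low"),
    ("R-30", "Low", "Very Low"),
]

SEVERITIES = ["Critical", "High", "Medium", "Low"]
LIKELIHOODS = ["Very Low", "Low", "Medium", "High"]


def build_matrix(risks):
    """Group risk IDs into (severity, likelihood) cells."""
    cells = defaultdict(list)
    for risk_id, severity, likelihood in risks:
        cells[(severity, likelihood)].append(risk_id)
    return cells


def render_matrix(cells) -> str:
    """Render the cells as fixed-width text, one row per severity level."""
    header = f"{'Severity':<12}" + "".join(f"{name:<16}" for name in LIKELIHOODS)
    rows = [header.rstrip()]
    for severity in SEVERITIES:
        row = f"{severity:<12}"
        for likelihood in LIKELIHOODS:
            ids = ",".join(cells.get((severity, likelihood), []))
            row += f"{ids:<16}"
        rows.append(row.rstrip())
    return "\n".join(rows)


if __name__ == "__main__":
    print(render_matrix(build_matrix(RISKS)))
```

Regenerating the matrix whenever a row's severity or likelihood changes keeps the two views consistent.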

Accepted risks

R-20 (Hub total loss) and R-21 (age key loss) are the two highest-impact risks. Both are mitigated but not eliminated. The residual risk is accepted for a personal R&D lab. In a production Defence environment, these would require additional controls (HA cluster, HSM-backed keys, multi-region deployment).

Risk Review Schedule

Review	Frequency	Actions
Monthly	After each Beast budget review	Check R-04 cost controls, review alert noise (R-07)
Quarterly	Every 3 months	Review remaining risks, check for new CVEs (R-27, R-29), rotate keys (R-25)
After incident	As needed	Add new risk entries, update mitigations
After architecture change	As needed	Re-evaluate all risks against new topology
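
To keep the quarterly review from slipping, the next due date can be computed from the last one. This is a small sketch using only the standard library; the helper names are hypothetical, and clamping to the last day of a shorter month is an assumed convention.

```python
import calendar
from datetime import date


def add_months(day: date, months: int) -> date:
    """Shift a date by whole months, clamping the day to the target month's length."""
    index = day.year * 12 + (day.month - 1) + months
    year, month_index = divmod(index, 12)
    month = month_index + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(day.day, last_day))


def next_quarterly_review(last_review: date) -> date:
    """Quarterly means every 3 months, per the schedule table."""
    return add_months(last_review, 3)


if __name__ == "__main__":
    print(next_quarterly_review(date(2024, 1, 15)))
```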