
Architecture Decision Records

All ADRs follow the format: Context (why the decision was needed), Decision (what was chosen), Consequences (trade-offs accepted).


ADR-001: Hetzner Cloud as Provider

Status: Accepted

Context: Need a European cloud provider with ARM support, hourly billing, and low cost for a personal R&D lab.

Decision: Use Hetzner Cloud (FSN1 datacenter, Falkenstein, Germany).

Consequences:

  • Best EUR/vCPU ratio in Europe -- entire lab under EUR 15/month
  • ARM64 via Ampere Altra (CAX line) at commodity pricing
  • EU data residency (Germany)
  • No managed Kubernetes -- must self-manage K3s
  • Limited managed services compared to hyperscalers (no managed DB, no IAM)
  • Hetzner private networking is free but basic (no VPC peering, no transit gateway)

ADR-002: OpenTofu for Infrastructure as Code

Status: Accepted

Context: Infrastructure must be reproducible and version-controlled. Terraform is the standard but has BSL licensing concerns.

Decision: Use OpenTofu (open-source Terraform fork) with the Hetzner provider.

Consequences:

  • MPL-2.0 license -- no licensing risk
  • Full compatibility with existing Terraform providers and HCL syntax
  • State stored locally (encrypted with SOPS) -- no remote backend cost
  • Community-driven roadmap, may diverge from Terraform over time
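
In practice this is plain HCL against the Hetzner provider; a minimal sketch (resource names, server type, and the variable wiring are illustrative):

```hcl
terraform {
  required_providers {
    hcloud = {
      source = "hetznercloud/hcloud"
    }
  }
}

provider "hcloud" {
  token = var.hcloud_token   # decrypted from the SOPS-encrypted vars file
}

resource "hcloud_server" "hub" {
  name        = "hub"        # illustrative
  server_type = "cax21"      # illustrative CAX (ARM64) type
  image       = "debian-12"
  location    = "fsn1"
}
```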

ADR-003: SOPS + age for Secrets Encryption

Status: Accepted

Context: Secrets (API tokens, WireGuard keys, kubeconfig) must be stored in the repository encrypted.

Decision: Use SOPS (originally a Mozilla project, now under the CNCF) with age (not GPG) for encryption.

Consequences:

  • Simple key management -- single age key, no GPG keyring complexity
  • SOPS integrates with YAML/JSON natively (encrypts values, not keys)
  • Age key backed up in Bitwarden
  • No cloud KMS dependency
  • Single point of failure: if age key is lost and Bitwarden backup is inaccessible, all secrets are lost
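
The encryption rules live in a .sops.yaml at the repository root; a sketch (the recipient key and paths are placeholders):

```yaml
creation_rules:
  - path_regex: secrets/.*\.ya?ml$
    age: age1examplepublickeyxxxxxxxxxxxxxxxxxxxx   # placeholder recipient key
    encrypted_regex: ^(data|stringData)$            # encrypt values under these keys only
```

With this in place, `sops -e -i secrets/tokens.yaml` encrypts in place, and `sops -d` decrypts for tooling.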

ADR-004: DMZ Isolation via Dedicated VM

Status: Accepted

Context: Public-facing services (web terminal, reverse proxy) must be isolated from the management plane.

Decision: Run a separate cx23 VM as a DMZ tier, joined to the K3s cluster as an agent node with Cilium NetworkPolicy enforcement.

Consequences:

  • Hard VM boundary between internet-facing and management workloads
  • Additional EUR 3.99/month cost
  • Cilium NetworkPolicies enforce pod-level isolation on top of VM separation
  • DMZ compromise does not directly expose Rancher, observability, or cluster secrets
  • Requires node affinity/selectors to pin workloads to the correct VM
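
The pinning can be done with a node label plus a matching nodeSelector and toleration; a sketch assuming the DMZ node is labeled and tainted tier=dmz (all names illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-terminal              # illustrative
spec:
  replicas: 1
  selector:
    matchLabels: {app: web-terminal}
  template:
    metadata:
      labels: {app: web-terminal}
    spec:
      nodeSelector:
        tier: dmz                 # assumes: kubectl label node <dmz-node> tier=dmz
      tolerations:
        - {key: tier, value: dmz, effect: NoSchedule}
      containers:
        - name: ttyd
          image: tsl0922/ttyd     # upstream ttyd image
```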

ADR-005: Rancher CE for Cluster Management

Status: Accepted

Context: Need a cluster management UI and GitOps deployment mechanism. Options: Rancher CE, ArgoCD, plain kubectl.

Decision: Use Rancher CE with Fleet for GitOps.

Consequences:

  • Single pane of glass for cluster, workloads, and RBAC
  • Fleet provides native GitOps without a separate ArgoCD install
  • Rancher adds ~1.5 GB RAM overhead
  • CE edition has no SLA or enterprise support
  • Validates HDCP concept: Rancher as the management plane for federated clusters
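
Fleet watches a Git repository through a GitRepo resource; a sketch (repository URL and paths are placeholders):

```yaml
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: lab
  namespace: fleet-local          # targets the local (Rancher) cluster
spec:
  repo: https://github.com/example/lab-gitops   # placeholder URL
  branch: main
  paths:
    - deploy/
```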

ADR-006: tc netem for WAN Emulation

Status: Accepted

Context: HDCP scenarios require testing under degraded network conditions (high latency, packet loss, bandwidth limits).

Decision: Use Linux tc netem with predefined latency profiles applied via network namespaces.

Consequences:

  • Zero-cost, built into the Linux kernel
  • Can simulate DDIL (Denied, Disrupted, Intermittent, Limited) conditions
  • Profiles are scriptable and reproducible
  • Only affects traffic within the configured namespace -- production traffic is not impacted
  • No hardware WAN emulator needed for basic testing
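
A sketch of one such profile (interface and namespace names are examples; requires root):

```shell
# DDIL-ish profile: 300 ms +/- 50 ms latency, 5% loss, 1 Mbit/s cap,
# applied inside a dedicated namespace so host traffic is untouched
ip netns add ddil
ip netns exec ddil tc qdisc add dev veth0 root netem \
    delay 300ms 50ms distribution normal loss 5% rate 1mbit

# Tear down
ip netns exec ddil tc qdisc del dev veth0 root
```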

ADR-007: ttyd for Browser-Based Terminal Access

Status: Accepted

Context: Defence workstations cannot install VPN clients or SSH tools. Need browser-only access to cluster management.

Decision: Deploy ttyd behind Caddy reverse proxy with Authelia TOTP authentication.

Consequences:

  • Access from any browser -- no client software needed
  • TOTP authentication prevents unauthorized access
  • Forced command limits what the ttyd session can do
  • WebSocket-based -- works through most corporate proxies
  • No clipboard sharing by default (security feature)
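
The chain is Caddy → Authelia → ttyd; a Caddyfile sketch (hostnames and ports are illustrative, and the forward-auth endpoint path differs on older Authelia releases, which use /api/verify):

```caddyfile
term.example.com {
    forward_auth authelia:9091 {
        uri /api/authz/forward-auth
        copy_headers Remote-User Remote-Groups
    }
    reverse_proxy ttyd:7681   # ttyd's default port
}
```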

ADR-008: Cilium as CNI

Status: Accepted

Context: Need a CNI that provides NetworkPolicy enforcement, observability (network flows), and kube-proxy replacement.

Decision: Replace K3s default Flannel with Cilium.

Consequences:

  • eBPF-based: faster than iptables, lower overhead
  • CiliumNetworkPolicy CRDs for L3/L4/L7 policy
  • Hubble provides network flow visibility without tcpdump
  • kube-proxy replacement reduces iptables rule count
  • More complex than Flannel -- requires understanding of eBPF, CiliumNetworkPolicy syntax
  • Validates HDCP concept: Cilium as the micro-segmentation layer
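
As an example of the CRD syntax, a default-deny egress policy for a namespace that still permits DNS (namespace name illustrative):

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: default-deny-allow-dns
  namespace: dmz                # illustrative
spec:
  endpointSelector: {}          # selects every pod in the namespace
  egress:
    - toEndpoints:
        - matchLabels:
            k8s:io.kubernetes.pod.namespace: kube-system
            k8s:k8s-app: kube-dns
      toPorts:
        - ports:
            - {port: "53", protocol: ANY}
```

Because the policy selects all endpoints and lists only a DNS egress rule, every other egress flow from the namespace is denied by default.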

ADR-009: x86 Beast as Ephemeral Dev Node

Status: Accepted (revised 2026-04-15 -- ARM to x86)

Context: Need burst compute for builds and experiments. Cannot justify a permanent third VM. Originally planned as CAX31 (ARM64), but first deployment revealed ARM compatibility issues with several Docker images.

Decision: Use Hetzner cx53 (x86_64, 16 vCPU, 32 GB RAM) as a cattle-not-pets ephemeral node, created and destroyed per session.

Rationale for ARM-to-x86 switch:

  • cx53 x86 costs more per hour than cax31 ARM (EUR 0.036/h vs EUR 0.0045/h), but offers 16 vCPU / 32 GB against cax31's 4 vCPU / 8 GB -- roughly 4x the compute -- and at ephemeral usage levels the absolute cost stays negligible
  • Eliminates all ARM compatibility issues (some Docker images are x86-only)
  • At 40h/month ephemeral usage, cx53 costs ~EUR 1.44/month
  • No QEMU emulation overhead for multi-arch builds

Consequences:

  • ~EUR 0.036/hour -- only pay for usage (~EUR 1.44/month at 40h)
  • x86_64 native -- universal Docker image compatibility
  • 16 vCPU / 32 GB provides substantial headroom for dev workloads
  • Forces cattle-not-pets discipline: no persistent state on Beast
  • Node join/leave takes ~3 minutes (K3s agent bootstrap)
  • Requires scripted spin-up/spin-down automation
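
The lifecycle reduces to a create/join and a drain/delete; a sketch with the hcloud CLI (server name, SSH key, and network are placeholders; cloud-init is assumed to install the K3s agent and join via the Hub's URL and token):

```shell
# Spin up
hcloud server create --name beast --type cx53 --image debian-12 \
    --location fsn1 --ssh-key lab --network lab-net

# Spin down: drain first so pods reschedule, then remove the node and VM
kubectl drain beast --ignore-daemonsets --delete-emptydir-data
kubectl delete node beast
hcloud server delete beast
```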

ADR-010: CrowdSec for Intrusion Detection

Status: Accepted

Context: Public-facing SSH and HTTPS need protection against brute-force and automated attacks.

Decision: Deploy CrowdSec on Hub and DMZ with the firewall bouncer.

Consequences:

  • Community threat intelligence sharing (collective blocklists)
  • Progressive ban duration for repeat offenders
  • Lower resource usage than Fail2ban (Go-based, single binary)
  • Integrates with Caddy, SSH, and system auth logs
  • Requires CrowdSec Central API enrollment (free tier)
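
Setup on each node is mostly cscli-driven; a sketch (collection names are the upstream hub ones):

```shell
# Parsers and scenarios for the exposed services
cscli collections install crowdsecurity/sshd crowdsecurity/linux
cscli collections install crowdsecurity/caddy

# Inspect currently active decisions (bans)
cscli decisions list
```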

ADR-011: VictoriaMetrics over Prometheus

Status: Accepted

Context: Need Prometheus-compatible metrics storage. Prometheus itself is RAM-hungry for a small cluster.

Decision: Use VictoriaMetrics (single-node) as a drop-in Prometheus replacement.

Consequences:

  • 2-5x less RAM than Prometheus for the same dataset
  • Full PromQL compatibility
  • Single binary, simple to operate
  • Long-term storage with compression
  • Less community momentum than Prometheus -- some dashboards need minor adjustments
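
Operationally it is one binary with a handful of flags; a sketch (paths are illustrative; retention is in months by default):

```shell
./victoria-metrics-prod \
    -storageDataPath=/var/lib/victoria-metrics \
    -retentionPeriod=12 \
    -promscrape.config=/etc/victoria-metrics/scrape.yml
```

Grafana then consumes it through the standard Prometheus datasource type.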

ADR-012: Cattle-Not-Pets for Beast VM

Status: Accepted

Context: The Beast VM must be treated as disposable infrastructure. Any state on it will be lost.

Decision: Enforce cattle-not-pets discipline: Beast has no persistent volumes, no local state, and is rebuilt from scratch each spin-up.

Consequences:

  • Forces all state into Git, Bitwarden, or Hub-hosted services
  • Spin-up is fully automated (10-step script)
  • No backup needed for Beast
  • Cannot run stateful workloads (databases, registries) on Beast
  • Validates HDCP concept: recoverable infrastructure from code

ADR-013: WireGuard for Management Access

Status: Accepted

Context: Need secure remote access to the cluster from the home workstation. Options: WireGuard, OpenVPN, SSH tunnels, Tailscale.

Decision: Use WireGuard (kernel module) on the Hub node.

Consequences:

  • Kernel-space: minimal overhead, high throughput
  • Simple configuration: single config file, no PKI
  • UDP-based: works through most NATs
  • Provides full L3 access to private and service networks
  • No central coordination server (unlike Tailscale) -- fully self-hosted
  • Single point of entry: if Hub is down, tunnel is down
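
A wg0.conf sketch for the Hub side (keys and subnets are placeholders):

```ini
[Interface]
Address    = 10.8.0.1/24
ListenPort = 51820
PrivateKey = <hub-private-key>

# Home workstation peer
[Peer]
PublicKey  = <workstation-public-key>
AllowedIPs = 10.8.0.2/32
```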

ADR-014: TOTP via Authelia for Web Access

Status: Accepted

Context: Web-exposed services (ttyd, dashboards) need authentication. Options: Authelia, Authentik, Keycloak, HTTP basic auth.

Decision: Use Authelia with TOTP (time-based one-time password) for all web-exposed services behind Caddy.

Consequences:

  • Lightweight: single Go binary, ~128 MB RAM
  • TOTP is hardware-independent (works with any authenticator app)
  • No external IdP dependency
  • Forward-auth integration with Caddy is simple
  • No SSO/OIDC federation (acceptable for single-user lab)
  • User database is a local YAML file (backed up via SOPS)
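
A sketch of that user database (username, hash, and address are placeholders; the hash can be produced with `authelia crypto hash generate argon2`):

```yaml
users:
  admin:                          # illustrative username
    displayname: Lab Admin
    password: "$argon2id$..."     # placeholder hash
    email: admin@example.com
    groups:
      - admins
```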

ADR-015: Cilium k8sServiceHost Must Be Server IP

Status: Accepted (2026-04-15, first deployment lesson)

Context: During first deployment, Cilium was configured with k8sServiceHost=127.0.0.1 (localhost). This worked on the Hub (server node) because the K3s API server runs locally. However, when agent nodes (DMZ, Beast) joined the cluster, Cilium crashed on init because their localhost is NOT the API server.

Decision: Always set k8sServiceHost to the Hub's real IP address (91.98.121.97 or the private IP 10.0.1.1, depending on which network the agents use to reach the API server).

Consequences:

  • Cilium init succeeds on all nodes (server and agents)
  • The value must be updated if the Hub's IP changes
  • This is a common multi-node K3s + Cilium pitfall not well documented upstream
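
In Helm values form (IPs taken from this ADR; the kubeProxyReplacement syntax varies slightly across Cilium versions):

```yaml
k8sServiceHost: 10.0.1.1     # Hub private IP -- agents reach the API over the private net
k8sServicePort: 6443
kubeProxyReplacement: true
```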

ADR-016: Create VMs Without Firewall During Bootstrap

Status: Accepted (2026-04-15, first deployment lesson)

Context: During first deployment, VMs were created with Hetzner Cloud Firewall attached from the start. Cloud-init could not complete because the firewall blocked necessary bootstrap traffic (including port 22 for initial SSH access before the SSH port move to 2222). Debugging was difficult because the VM was unreachable.

Decision: Create all VMs without a Hetzner Cloud Firewall attached. Apply the firewall only after cloud-init has completed and SSH access on port 2222 is confirmed working.

Alternatives considered:

  • Use bootcmd to move SSH to 2222 before the main cloud-init stages run: fragile -- cloud-init bootcmd executes under sh rather than bash, so bash-style brace expansion such as {22,2222} does not work
  • Pre-open port 22 in the firewall: defeats the purpose of the firewall during bootstrap

Consequences:

  • Brief window (~2-3 minutes) where VMs are unprotected during bootstrap
  • Acceptable risk: cloud-init completes quickly, and the window is short
  • Eliminates the chicken-and-egg problem of needing network access to configure network access
  • Firewall attachment is a separate OpenTofu step after server creation
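
In OpenTofu this maps naturally onto a separate attachment resource applied in a later pass (resource names illustrative):

```hcl
resource "hcloud_firewall_attachment" "hub" {
  firewall_id = hcloud_firewall.lab.id
  server_ids  = [hcloud_server.hub.id]

  # Applied only after cloud-init is confirmed complete -- e.g. via a
  # second `tofu apply` pass or an explicit depends_on gate.
}
```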