
Architecture Decision Records

All ADRs follow the format: Context (why the decision was needed), Decision (what was chosen), Consequences (trade-offs accepted).


ADR-001: Hetzner Cloud as Provider

Status: Accepted

Context: Need a European cloud provider with ARM support, hourly billing, and low cost for a personal R&D lab.

Decision: Use Hetzner Cloud (FSN1 datacenter, Falkenstein, Germany).

Consequences:

  • Best EUR/vCPU ratio in Europe -- entire lab under EUR 15/month
  • ARM64 via Ampere Altra (CAX line) at commodity pricing
  • EU data residency (Germany)
  • No managed Kubernetes -- must self-manage K3s
  • Limited managed services compared to hyperscalers (no managed DB, no IAM)
  • Hetzner private networking is free but basic (no VPC peering, no transit gateway)

ADR-002: OpenTofu for Infrastructure as Code

Status: Accepted

Context: Infrastructure must be reproducible and version-controlled. Terraform is the standard but has BSL licensing concerns.

Decision: Use OpenTofu (open-source Terraform fork) with the Hetzner provider.

Consequences:

  • MPL-2.0 license -- no licensing risk
  • Full compatibility with existing Terraform providers and HCL syntax
  • State stored locally (encrypted with SOPS) -- no remote backend cost
  • Community-driven roadmap, may diverge from Terraform over time
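
In practice this is plain HCL against the Hetzner provider; a minimal sketch (resource names, server type, and the variable wiring are illustrative):

```hcl
terraform {
  required_providers {
    hcloud = {
      source = "hetznercloud/hcloud"
    }
  }
}

provider "hcloud" {
  token = var.hcloud_token   # decrypted from the SOPS-encrypted vars file
}

resource "hcloud_server" "hub" {
  name        = "hub"        # illustrative
  server_type = "cax21"      # illustrative CAX (ARM64) type
  image       = "debian-12"
  location    = "fsn1"
}
```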

ADR-003: SOPS + age for Secrets Encryption

Status: Accepted

Context: Secrets (API tokens, WireGuard keys, kubeconfig) must be stored in the repository encrypted.

Decision: Use SOPS (originally a Mozilla project, now under the CNCF) with age (not GPG) for encryption.

Consequences:

  • Simple key management -- single age key, no GPG keyring complexity
  • SOPS integrates with YAML/JSON natively (encrypts values, not keys)
  • Age key backed up in Bitwarden
  • No cloud KMS dependency
  • Single point of failure: if age key is lost and Bitwarden backup is inaccessible, all secrets are lost
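
The encryption rules live in a .sops.yaml at the repository root; a sketch (the recipient key and paths are placeholders):

```yaml
creation_rules:
  - path_regex: secrets/.*\.ya?ml$
    age: age1examplepublickeyxxxxxxxxxxxxxxxxxxxx   # placeholder recipient key
    encrypted_regex: ^(data|stringData)$            # encrypt values under these keys only
```

With this in place, `sops -e -i secrets/tokens.yaml` encrypts in place, and `sops -d` decrypts for tooling.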

ADR-004: DMZ Isolation via Dedicated VM

Status: Accepted

Context: Public-facing services (web terminal, reverse proxy) must be isolated from the management plane.

Decision: Run a separate cx23 VM as a DMZ tier, joined to the K3s cluster as an agent node with Cilium NetworkPolicy enforcement.

Consequences:

  • Hard VM boundary between internet-facing and management workloads
  • Additional EUR 3.99/month cost
  • Cilium NetworkPolicies enforce pod-level isolation on top of VM separation
  • DMZ compromise does not directly expose Rancher, observability, or cluster secrets
  • Requires node affinity/selectors to pin workloads to the correct VM
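
The pinning can be done with a node label plus a matching nodeSelector and toleration; a sketch assuming the DMZ node is labeled and tainted tier=dmz (all names illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-terminal              # illustrative
spec:
  replicas: 1
  selector:
    matchLabels: {app: web-terminal}
  template:
    metadata:
      labels: {app: web-terminal}
    spec:
      nodeSelector:
        tier: dmz                 # assumes: kubectl label node <dmz-node> tier=dmz
      tolerations:
        - {key: tier, value: dmz, effect: NoSchedule}
      containers:
        - name: ttyd
          image: tsl0922/ttyd     # upstream ttyd image
```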

ADR-005: Rancher CE for Cluster Management

Status: Accepted

Context: Need a cluster management UI and GitOps deployment mechanism. Options: Rancher CE, ArgoCD, plain kubectl.

Decision: Use Rancher CE with Fleet for GitOps.

Consequences:

  • Single pane of glass for cluster, workloads, and RBAC
  • Fleet provides native GitOps without a separate ArgoCD install
  • Rancher adds ~1.5 GB RAM overhead
  • CE edition has no SLA or enterprise support
  • Validates HDCP concept: Rancher as the management plane for federated clusters
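
Fleet watches a Git repository through a GitRepo resource; a sketch (repository URL and paths are placeholders):

```yaml
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: lab
  namespace: fleet-local          # targets the local (Rancher) cluster
spec:
  repo: https://github.com/example/lab-gitops   # placeholder URL
  branch: main
  paths:
    - deploy/
```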

ADR-006: tc netem for WAN Emulation

Status: Accepted

Context: HDCP scenarios require testing under degraded network conditions (high latency, packet loss, bandwidth limits).

Decision: Use Linux tc netem with predefined latency profiles applied via network namespaces.

Consequences:

  • Zero-cost, built into the Linux kernel
  • Can simulate DDIL (Denied, Disrupted, Intermittent, Limited) conditions
  • Profiles are scriptable and reproducible
  • Only affects traffic within the configured namespace -- production traffic is not impacted
  • No hardware WAN emulator needed for basic testing
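
A sketch of one such profile (interface and namespace names are examples; requires root):

```shell
# DDIL-ish profile: 300 ms +/- 50 ms latency, 5% loss, 1 Mbit/s cap,
# applied inside a dedicated namespace so host traffic is untouched
ip netns add ddil
ip netns exec ddil tc qdisc add dev veth0 root netem \
    delay 300ms 50ms distribution normal loss 5% rate 1mbit

# Tear down
ip netns exec ddil tc qdisc del dev veth0 root
```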

ADR-007: ttyd for Browser-Based Terminal Access

Status: Accepted

Context: Defence workstations cannot install VPN clients or SSH tools. Need browser-only access to cluster management.

Decision: Deploy ttyd behind Caddy reverse proxy with Authelia TOTP authentication.

Consequences:

  • Access from any browser -- no client software needed
  • TOTP authentication prevents unauthorized access
  • Forced command limits what the ttyd session can do
  • WebSocket-based -- works through most corporate proxies
  • No clipboard sharing by default (security feature)
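
The chain is Caddy → Authelia → ttyd; a Caddyfile sketch (hostnames and ports are illustrative, and the forward-auth endpoint path differs on older Authelia releases, which use /api/verify):

```caddyfile
term.example.com {
    forward_auth authelia:9091 {
        uri /api/authz/forward-auth
        copy_headers Remote-User Remote-Groups
    }
    reverse_proxy ttyd:7681   # ttyd's default port
}
```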

ADR-008: Cilium as CNI

Status: Accepted

Context: Need a CNI that provides NetworkPolicy enforcement, observability (network flows), and kube-proxy replacement.

Decision: Replace K3s default Flannel with Cilium.

Consequences:

  • eBPF-based: faster than iptables, lower overhead
  • CiliumNetworkPolicy CRDs for L3/L4/L7 policy
  • Hubble provides network flow visibility without tcpdump
  • kube-proxy replacement reduces iptables rule count
  • More complex than Flannel -- requires understanding of eBPF, CiliumNetworkPolicy syntax
  • Validates HDCP concept: Cilium as the micro-segmentation layer
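
As an example of the CRD syntax, a default-deny egress policy for a namespace that still permits DNS (namespace name illustrative):

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: default-deny-allow-dns
  namespace: dmz                # illustrative
spec:
  endpointSelector: {}          # selects every pod in the namespace
  egress:
    - toEndpoints:
        - matchLabels:
            k8s:io.kubernetes.pod.namespace: kube-system
            k8s:k8s-app: kube-dns
      toPorts:
        - ports:
            - {port: "53", protocol: ANY}
```

Because the policy selects all endpoints and lists only a DNS egress rule, every other egress flow from the namespace is denied by default.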

ADR-009: x86 Beast as Ephemeral Dev Node

Status: Accepted (revised 2026-04-15 -- ARM to x86)

Context: Need burst compute for builds and experiments. Cannot justify a permanent third VM. Originally planned as CAX31 (ARM64), but first deployment revealed ARM compatibility issues with several Docker images.

Decision: Use Hetzner cx53 (x86_64, 16 vCPU, 32 GB RAM) as a cattle-not-pets ephemeral node, created and destroyed per session.

Rationale for ARM-to-x86 switch:

  • cx53 x86 costs more per hour than cax31 ARM (EUR 0.036/h vs EUR 0.0045/h), but offers 16 vCPU / 32 GB against cax31's 4 vCPU / 8 GB -- roughly 4x the compute -- and at ephemeral usage levels the absolute cost stays negligible
  • Eliminates all ARM compatibility issues (some Docker images are x86-only)
  • At 40h/month ephemeral usage, cx53 costs ~EUR 1.44/month
  • No QEMU emulation overhead for multi-arch builds

Consequences:

  • ~EUR 0.036/hour -- only pay for usage (~EUR 1.44/month at 40h)
  • x86_64 native -- universal Docker image compatibility
  • 16 vCPU / 32 GB provides substantial headroom for dev workloads
  • Forces cattle-not-pets discipline: no persistent state on Beast
  • Node join/leave takes ~3 minutes (K3s agent bootstrap)
  • Requires scripted spin-up/spin-down automation
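
The lifecycle reduces to a create/join and a drain/delete; a sketch with the hcloud CLI (server name, SSH key, and network are placeholders; cloud-init is assumed to install the K3s agent and join via the Hub's URL and token):

```shell
# Spin up
hcloud server create --name beast --type cx53 --image debian-12 \
    --location fsn1 --ssh-key lab --network lab-net

# Spin down: drain first so pods reschedule, then remove the node and VM
kubectl drain beast --ignore-daemonsets --delete-emptydir-data
kubectl delete node beast
hcloud server delete beast
```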

ADR-010: CrowdSec for Intrusion Detection

Status: Accepted

Context: Public-facing SSH and HTTPS need protection against brute-force and automated attacks.

Decision: Deploy CrowdSec on Hub and DMZ with the firewall bouncer.

Consequences:

  • Community threat intelligence sharing (collective blocklists)
  • Progressive ban duration for repeat offenders
  • Lower resource usage than Fail2ban (Go-based, single binary)
  • Integrates with Caddy, SSH, and system auth logs
  • Requires CrowdSec Central API enrollment (free tier)
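
Setup on each node is mostly cscli-driven; a sketch (collection names are the upstream hub ones):

```shell
# Parsers and scenarios for the exposed services
cscli collections install crowdsecurity/sshd crowdsecurity/linux
cscli collections install crowdsecurity/caddy

# Inspect currently active decisions (bans)
cscli decisions list
```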

ADR-011: VictoriaMetrics over Prometheus

Status: Accepted

Context: Need Prometheus-compatible metrics storage. Prometheus itself is RAM-hungry for a small cluster.

Decision: Use VictoriaMetrics (single-node) as a drop-in Prometheus replacement.

Consequences:

  • 2-5x less RAM than Prometheus for the same dataset
  • Full PromQL compatibility
  • Single binary, simple to operate
  • Long-term storage with compression
  • Less community momentum than Prometheus -- some dashboards need minor adjustments
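
Operationally it is one binary with a handful of flags; a sketch (paths are illustrative; retention is in months by default):

```shell
./victoria-metrics-prod \
    -storageDataPath=/var/lib/victoria-metrics \
    -retentionPeriod=12 \
    -promscrape.config=/etc/victoria-metrics/scrape.yml
```

Grafana then consumes it through the standard Prometheus datasource type.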

ADR-012: Cattle-Not-Pets for Beast VM

Status: Accepted

Context: The Beast VM must be treated as disposable infrastructure. Any state on it will be lost.

Decision: Enforce cattle-not-pets discipline: Beast has no persistent volumes, no local state, and is rebuilt from scratch each spin-up.

Consequences:

  • Forces all state into Git, Bitwarden, or Hub-hosted services
  • Spin-up is fully automated (10-step script)
  • No backup needed for Beast
  • Cannot run stateful workloads (databases, registries) on Beast
  • Validates HDCP concept: recoverable infrastructure from code

ADR-013: WireGuard for Management Access

Status: Accepted

Context: Need secure remote access to the cluster from the home workstation. Options: WireGuard, OpenVPN, SSH tunnels, Tailscale.

Decision: Use WireGuard (kernel module) on the Hub node.

Consequences:

  • Kernel-space: minimal overhead, high throughput
  • Simple configuration: single config file, no PKI
  • UDP-based: works through most NATs
  • Provides full L3 access to private and service networks
  • No central coordination server (unlike Tailscale) -- fully self-hosted
  • Single point of entry: if Hub is down, tunnel is down
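
A wg0.conf sketch for the Hub side (keys and subnets are placeholders):

```ini
[Interface]
Address    = 10.8.0.1/24
ListenPort = 51820
PrivateKey = <hub-private-key>

# Home workstation peer
[Peer]
PublicKey  = <workstation-public-key>
AllowedIPs = 10.8.0.2/32
```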

ADR-014: TOTP via Authelia for Web Access

Status: Accepted

Context: Web-exposed services (ttyd, dashboards) need authentication. Options: Authelia, Authentik, Keycloak, HTTP basic auth.

Decision: Use Authelia with TOTP (time-based one-time password) for all web-exposed services behind Caddy.

Consequences:

  • Lightweight: single Go binary, ~128 MB RAM
  • TOTP is hardware-independent (works with any authenticator app)
  • No external IdP dependency
  • Forward-auth integration with Caddy is simple
  • No SSO/OIDC federation (acceptable for single-user lab)
  • User database is a local YAML file (backed up via SOPS)
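
A sketch of that user database (username, hash, and address are placeholders; the hash can be produced with `authelia crypto hash generate argon2`):

```yaml
users:
  admin:                          # illustrative username
    displayname: Lab Admin
    password: "$argon2id$..."     # placeholder hash
    email: admin@example.com
    groups:
      - admins
```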

ADR-015: Cilium k8sServiceHost Must Be Server IP

Status: Accepted (2026-04-15, first deployment lesson)

Context: During first deployment, Cilium was configured with k8sServiceHost=127.0.0.1 (localhost). This worked on the Hub (server node) because the K3s API server runs locally. However, when agent nodes (DMZ, Beast) joined the cluster, Cilium crashed on init because their localhost is NOT the API server.

Decision: Always set k8sServiceHost to the Hub's real IP address (91.98.121.97 or the private IP 10.0.1.1, depending on which network the agents use to reach the API server).

Consequences:

  • Cilium init succeeds on all nodes (server and agents)
  • The value must be updated if the Hub's IP changes
  • This is a common multi-node K3s + Cilium pitfall not well documented upstream
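
In Helm values form (IPs taken from this ADR; the kubeProxyReplacement syntax varies slightly across Cilium versions):

```yaml
k8sServiceHost: 10.0.1.1     # Hub private IP -- agents reach the API over the private net
k8sServicePort: 6443
kubeProxyReplacement: true
```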

ADR-016: Create VMs Without Firewall During Bootstrap

Status: Accepted (2026-04-15, first deployment lesson)

Context: During first deployment, VMs were created with Hetzner Cloud Firewall attached from the start. Cloud-init could not complete because the firewall blocked necessary bootstrap traffic (including port 22 for initial SSH access before the SSH port move to 2222). Debugging was difficult because the VM was unreachable.

Decision: Create all VMs without a Hetzner Cloud Firewall attached. Apply the firewall only after cloud-init has completed and SSH access on port 2222 is confirmed working.

Alternatives considered:

  • Use bootcmd to move SSH to 2222 before the main cloud-init stages run: fragile -- cloud-init bootcmd executes under sh rather than bash, so bash-style brace expansion such as {22,2222} does not work
  • Pre-open port 22 in the firewall: defeats the purpose of the firewall during bootstrap

Consequences:

  • Brief window (~2-3 minutes) where VMs are unprotected during bootstrap
  • Acceptable risk: cloud-init completes quickly, and the window is short
  • Eliminates the chicken-and-egg problem of needing network access to configure network access
  • Firewall attachment is a separate OpenTofu step after server creation
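
In OpenTofu this maps naturally onto a separate attachment resource applied in a later pass (resource names illustrative):

```hcl
resource "hcloud_firewall_attachment" "hub" {
  firewall_id = hcloud_firewall.lab.id
  server_ids  = [hcloud_server.hub.id]

  # Applied only after cloud-init is confirmed complete -- e.g. via a
  # second `tofu apply` pass or an explicit depends_on gate.
}
```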