# Monitoring

## Stack Overview
```mermaid
flowchart TB
    subgraph Targets["Scrape Targets"]
        NE1["node_exporter<br/>(Hub)"]
        NE2["node_exporter<br/>(DMZ)"]
        NE3["node_exporter<br/>(Beast)"]
        KSM["kube-state-metrics"]
        K3s["K3s metrics"]
        VM_self["VictoriaMetrics<br/>(self-scrape)"]
    end
    subgraph Stack["Monitoring Stack (Hub)"]
        VictM["VictoriaMetrics<br/>(TSDB)"]
        Graf["Grafana<br/>(Dashboards)"]
        VMA["vmalert<br/>(Alert rules)"]
    end
    NE1 -->|"/metrics"| VictM
    NE2 -->|"/metrics"| VictM
    NE3 -->|"/metrics"| VictM
    KSM -->|"/metrics"| VictM
    K3s -->|"/metrics"| VictM
    VM_self -->|"/metrics"| VictM
    VictM -->|"PromQL"| Graf
    VictM -->|"PromQL"| VMA
    VMA -->|"alerts"| Ntfy["ntfy.sh<br/>(notifications)"]
    style VictM fill:#1a5276,stroke:#2980b9,color:#fff
    style Graf fill:#7d6608,stroke:#f1c40f,color:#fff
    style VMA fill:#7b241c,stroke:#c0392b,color:#fff
```
## VictoriaMetrics
VictoriaMetrics (single-node) replaces Prometheus as the metrics TSDB. Its MetricsQL query language is backwards-compatible with PromQL, and its resource usage is significantly lower.
| Parameter | Value |
|---|---|
| Mode | Single-node |
| Retention | 30 days |
| Scrape interval | 30s |
| Storage path | /var/lib/victoria-metrics/ |
| Memory limit | 512 MB |
| Listen port | 8428 |
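The parameters above map directly onto VictoriaMetrics startup flags. A minimal sketch as a docker-compose service (the deployment method, image tag, and config-file path are assumptions; the flag names are VictoriaMetrics' own, and `-memory.allowedBytes` is a soft hint rather than a hard limit):

```yaml
# Hypothetical docker-compose deployment; adapt to the actual runtime on Hub.
services:
  victoriametrics:
    image: victoriametrics/victoria-metrics:latest
    command:
      - -retentionPeriod=30d                                 # 30-day retention
      - -storageDataPath=/var/lib/victoria-metrics           # storage path
      - -promscrape.config=/etc/victoria-metrics/scrape.yml  # scrape config (assumed path)
      - -memory.allowedBytes=512MiB                          # soft memory cap
      - -httpListenAddr=:8428
    ports:
      - "8428:8428"
    volumes:
      - /var/lib/victoria-metrics:/var/lib/victoria-metrics
      - ./scrape.yml:/etc/victoria-metrics/scrape.yml:ro
```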
### Scrape Configuration
```yaml
scrape_configs:
  - job_name: node-exporter
    static_configs:
      - targets:
          - 10.0.1.1:9100  # Hub
          - 10.0.1.2:9100  # DMZ
          - 10.0.1.3:9100  # Beast (when running)
    scrape_interval: 30s

  - job_name: kube-state-metrics
    static_configs:
      - targets:
          - kube-state-metrics.kube-system:8080
    scrape_interval: 30s

  - job_name: k3s
    static_configs:
      - targets:
          - 10.0.1.1:10250  # kubelet Hub
          - 10.0.1.2:10250  # kubelet DMZ
          - 10.0.1.3:10250  # kubelet Beast
    scheme: https
    tls_config:
      insecure_skip_verify: true

  - job_name: victoriametrics
    static_configs:
      - targets:
          - localhost:8428
    scrape_interval: 60s
```
> **Beast target:** When Beast is not running, VictoriaMetrics logs a scrape error for `10.0.1.3:9100` every 30s. This is expected and does not trigger an alert (the alert fires on node absence, not scrape failure).
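The node-absence behavior described above can be sketched as a vmalert rule. The group name, 5-minute hold, and labels are assumptions, not the deployed rules; the key point is that Beast's instance is excluded from the expression so its expected downtime never pages:

```yaml
# Hypothetical vmalert rule file; names and thresholds are illustrative.
groups:
  - name: node-health
    rules:
      - alert: NodeDown
        # Hub and DMZ must always be scrapable; Beast (10.0.1.3) is
        # intentionally excluded because it is powered off most of the time.
        expr: up{job="node-exporter", instance!~"10\\.0\\.1\\.3.*"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "node_exporter on {{ $labels.instance }} unreachable for 5m"
```

vmalert forwards firing alerts to ntfy.sh via its notifier configuration, per the stack diagram above.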
## Grafana
Grafana provides dashboards with VictoriaMetrics as the data source.
| Parameter | Value |
|---|---|
| Version | Latest OSS |
| Port | 3000 |
| Auth | Authelia (via Caddy forward_auth) |
| Data source | VictoriaMetrics (http://victoriametrics:8428) |
| Dashboard provisioning | Git repo dashboards/ |
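Because VictoriaMetrics speaks the Prometheus query API, it is registered in Grafana as a Prometheus-type data source. A provisioning sketch (the file path and `isDefault` choice are assumptions):

```yaml
# /etc/grafana/provisioning/datasources/victoriametrics.yml (path assumed)
apiVersion: 1
datasources:
  - name: VictoriaMetrics
    type: prometheus   # VictoriaMetrics is queried via the Prometheus API
    access: proxy
    url: http://victoriametrics:8428
    isDefault: true
```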
### Dashboards
#### 1. Cluster Overview
Purpose: Single-pane-of-glass view of the entire cluster.
| Panel | Metric |
|---|---|
| Node count (up/down) | up{job="node-exporter"} |
| Total CPU usage | sum(rate(node_cpu_seconds_total{mode!="idle"}[5m])) |
| Total memory usage | sum(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) |
| Pod count by status | kube_pod_status_phase |
| K3s API latency | apiserver_request_duration_seconds |
| Cluster events | kube_event_count |
#### 2. Node Detail
Purpose: Per-node resource utilization for capacity planning.
| Panel | Metric |
|---|---|
| CPU usage per core | rate(node_cpu_seconds_total{mode!="idle"}[5m]) |
| Memory usage breakdown | node_memory_MemTotal_bytes, MemAvailable, Buffers, Cached |
| Disk I/O | rate(node_disk_read_bytes_total[5m]), write_bytes_total |
| Disk usage | node_filesystem_size_bytes, avail_bytes |
| Network throughput | rate(node_network_receive_bytes_total[5m]), transmit |
| System load | node_load1, node_load5, node_load15 |
#### 3. Pod Resources
Purpose: Kubernetes workload resource consumption and limits.
| Panel | Metric |
|---|---|
| CPU usage vs request vs limit | container_cpu_usage_seconds_total, kube_pod_container_resource_* |
| Memory usage vs request vs limit | container_memory_working_set_bytes, kube_pod_container_resource_* |
| Pod restarts | kube_pod_container_status_restarts_total |
| OOMKilled events | kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} |
#### 4. Network (Cilium + Hubble)
Purpose: Network flow visibility and policy enforcement.
| Panel | Metric |
|---|---|
| Flows per namespace | hubble_flows_processed_total |
| Policy verdicts (allow/deny) | hubble_policy_verdicts_total |
| DNS queries | hubble_dns_queries_total |
| TCP connections | hubble_tcp_flags_total |
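These `hubble_*` families are only emitted when the corresponding metric names are enabled in Cilium's configuration. A Helm values sketch (merging this into the actual values file is assumed; the metric names themselves are Hubble's own):

```yaml
# Cilium Helm values fragment (sketch)
hubble:
  enabled: true
  metrics:
    # each entry enables one hubble_* metric family
    enabled:
      - flow     # hubble_flows_processed_total
      - policy   # hubble_policy_verdicts_total
      - dns      # hubble_dns_queries_total
      - tcp      # hubble_tcp_flags_total
```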
#### 5. Beast Session
Purpose: Beast VM lifecycle tracking and cost monitoring.
| Panel | Metric |
|---|---|
| Beast up/down | up{instance=~".*10.0.1.3.*"} |
| Session duration | Custom metric from session log |
| CPU during session | rate(node_cpu_seconds_total{instance=~".*10.0.1.3.*"}[5m]) |
| Estimated cost | Session hours × EUR 0.0045 |
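Session hours and cost can be derived from `up` samples rather than a custom metric. A hedged sketch as vmalert recording rules (the rule names are invented, the subquery step must match the 30s scrape interval, and the EUR 0.0045/h rate comes from the table above):

```yaml
# Hypothetical recording rules; adjust the window to the retention period.
groups:
  - name: beast-session
    rules:
      - record: beast:session_hours_30d
        # each up==1 sample represents one 30s scrape interval
        expr: count_over_time((up{instance=~".*10.0.1.3.*"} == 1)[30d:30s]) * 30 / 3600
      - record: beast:estimated_cost_eur_30d
        expr: beast:session_hours_30d * 0.0045
```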
#### 6. CrowdSec
Purpose: Security events and threat intelligence.
| Panel | Metric |
|---|---|
| Active bans | cs_active_decisions |
| Alerts by scenario | cs_alerts_total |
| Bouncer decisions | cs_bouncer_decisions_total |
| Parsed log lines | cs_parsed_lines_total |
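CrowdSec only exposes its `cs_*` metrics when the Prometheus endpoint is enabled in its configuration. A sketch of the relevant fragment (the listen address and detail level shown are assumptions about this deployment):

```yaml
# Fragment of /etc/crowdsec/config.yaml (sketch)
prometheus:
  enabled: true
  level: full            # per-scenario detail rather than aggregated counters
  listen_addr: 127.0.0.1
  listen_port: 6060
```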
## kube-state-metrics
Exposes Kubernetes object state as Prometheus metrics:
| Metric Family | Purpose |
|---|---|
| kube_pod_* | Pod status, phase, restarts, container states |
| kube_node_* | Node conditions, allocatable resources |
| kube_deployment_* | Deployment status, replicas |
| kube_namespace_* | Namespace status |
| kube_daemonset_* | DaemonSet status |
## node_exporter
Runs as a DaemonSet on all nodes, exposing host-level metrics:
| Collector | Metrics |
|---|---|
| CPU | node_cpu_seconds_total |
| Memory | node_memory_* |
| Disk | node_disk_*, node_filesystem_* |
| Network | node_network_* |
| Load | node_load1, node_load5, node_load15 |
| Systemd | node_systemd_unit_state |
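Note that the systemd collector is disabled by default in node_exporter and must be enabled explicitly; as a DaemonSet, the exporter also needs the host filesystem mounted to report meaningful disk and filesystem metrics. A pod-spec fragment sketching this (container name, image tag, and mount paths are assumptions):

```yaml
# Fragment of a node_exporter DaemonSet pod spec (sketch)
containers:
  - name: node-exporter
    image: quay.io/prometheus/node-exporter:latest
    args:
      - --path.rootfs=/host/root
      - --collector.systemd   # node_systemd_unit_state is opt-in
    ports:
      - containerPort: 9100
    volumeMounts:
      - name: root
        mountPath: /host/root
        readOnly: true
```

The systemd collector additionally needs access to the host's D-Bus/systemd socket from inside the container, which is a deployment detail not shown here.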