
Monitoring

Stack Overview

flowchart TB
    subgraph Targets["Scrape Targets"]
        NE1["node_exporter<br/>(Hub)"]
        NE2["node_exporter<br/>(DMZ)"]
        NE3["node_exporter<br/>(Beast)"]
        KSM["kube-state-metrics"]
        K3s["K3s metrics"]
        VM_self["VictoriaMetrics<br/>(self-scrape)"]
    end

    subgraph Stack["Monitoring Stack (Hub)"]
        VictM["VictoriaMetrics<br/>(TSDB)"]
        Graf["Grafana<br/>(Dashboards)"]
        VMA["vmalert<br/>(Alert rules)"]
    end

    NE1 -->|"/metrics"| VictM
    NE2 -->|"/metrics"| VictM
    NE3 -->|"/metrics"| VictM
    KSM -->|"/metrics"| VictM
    K3s -->|"/metrics"| VictM
    VM_self -->|"/metrics"| VictM

    VictM -->|"PromQL"| Graf
    VictM -->|"PromQL"| VMA
    VMA -->|"alerts"| Ntfy["ntfy.sh<br/>(notifications)"]

    style VictM fill:#1a5276,stroke:#2980b9,color:#fff
    style Graf fill:#7d6608,stroke:#f1c40f,color:#fff
    style VMA fill:#7b241c,stroke:#c0392b,color:#fff

VictoriaMetrics

VictoriaMetrics (single-node) replaces Prometheus as the metrics TSDB. It accepts PromQL queries (through MetricsQL, its backward-compatible query language) and is chosen here for its lower memory and disk footprint compared with Prometheus.

| Parameter | Value |
| --- | --- |
| Mode | Single-node |
| Retention | 30 days |
| Scrape interval | 30s |
| Storage path | `/var/lib/victoria-metrics/` |
| Memory limit | 512 MB |
| Listen port | 8428 |
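
As a rough illustration, the parameters above could map onto a container spec like the one below. This is a sketch, not the deployed manifest: it assumes VictoriaMetrics runs as a Kubernetes container, that the memory limit is enforced as a container resource limit, and that the scrape configuration is mounted at the hypothetical path `/etc/victoria-metrics/scrape.yml`.

```yaml
# Hypothetical container spec fragment. The image reference and the
# scrape config path are assumptions, not taken from this page.
containers:
  - name: victoriametrics
    image: victoriametrics/victoria-metrics
    args:
      - "-retentionPeriod=30d"                        # Retention: 30 days
      - "-storageDataPath=/var/lib/victoria-metrics/" # Storage path
      - "-httpListenAddr=:8428"                       # Listen port
      - "-promscrape.config=/etc/victoria-metrics/scrape.yml"
    ports:
      - containerPort: 8428
    resources:
      limits:
        memory: 512M                                  # Memory limit: 512 MB
```

The scrape interval is not a command-line parameter in this sketch; it lives in the scrape configuration file.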

Scrape Configuration

scrape_configs:
  - job_name: node-exporter
    static_configs:
      - targets:
          - 10.0.1.1:9100  # Hub
          - 10.0.1.2:9100  # DMZ
          - 10.0.1.3:9100  # Beast (when running)
    scrape_interval: 30s

  - job_name: kube-state-metrics
    static_configs:
      - targets:
          - kube-state-metrics.kube-system:8080
    scrape_interval: 30s

  - job_name: k3s
    static_configs:
      - targets:
          - 10.0.1.1:10250  # kubelet Hub
          - 10.0.1.2:10250  # kubelet DMZ
          - 10.0.1.3:10250  # kubelet Beast
    scheme: https
    tls_config:
      insecure_skip_verify: true

  - job_name: victoriametrics
    static_configs:
      - targets:
          - localhost:8428
    scrape_interval: 60s
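
The `k3s` job above sets no `scrape_interval`, so it falls back to the global default. A minimal sketch, assuming the 30s interval from the parameter table is meant to apply to every job that does not override it, would add a `global` section at the top of the same file:

```yaml
# Assumption: 30s is the intended default for all jobs without an override.
global:
  scrape_interval: 30s
```

With this in place, the per-job `scrape_interval: 30s` lines become redundant, and only the 60s self-scrape override is still needed.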

Beast target

When Beast is not running, VictoriaMetrics logs a scrape error for 10.0.1.3:9100 every 30s. This is expected and does not trigger an alert (the alert fires on node absence, not scrape failure).
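
The alert rule itself is not shown on this page. One way to get the behaviour described above is sketched below, assuming node health is judged from the Kubernetes `Ready` condition reported by kube-state-metrics and that only the always-on nodes are matched. The node names `hub` and `dmz`, the group name, and the `for` duration are hypothetical.

```yaml
# Hypothetical vmalert rule file; node names and timings are assumptions.
groups:
  - name: node-availability
    rules:
      - alert: NodeNotReady
        # Only the always-on nodes are matched, so Beast being powered off
        # never satisfies this expression.
        expr: 'kube_node_status_condition{condition="Ready", status="true", node=~"hub|dmz"} == 0'
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.node }} is not Ready"
```

Because the expression does not look at `up`, a failed scrape of `10.0.1.3:9100` has no effect on it.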

Grafana

Grafana provides dashboards with VictoriaMetrics as the data source.

| Parameter | Value |
| --- | --- |
| Version | Latest OSS |
| Port | 3000 |
| Auth | Authelia (via Caddy `forward_auth`) |
| Data source | VictoriaMetrics (`http://victoriametrics:8428`) |
| Dashboard provisioning | Git repo `dashboards/` |
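
For reference, the data source and dashboard settings above could be expressed as Grafana provisioning files along these lines. This is a sketch under assumptions: the data source name, the provider name, and the mount path for the Git repo's `dashboards/` directory are hypothetical. VictoriaMetrics is registered with the `prometheus` data source type because it serves the Prometheus query API.

```yaml
# Data source provisioning file (name and location are assumptions).
apiVersion: 1
datasources:
  - name: VictoriaMetrics
    type: prometheus
    access: proxy
    url: http://victoriametrics:8428
    isDefault: true
```

```yaml
# Dashboard provider file. The path is a hypothetical mount point
# for the Git repo's dashboards/ directory.
apiVersion: 1
providers:
  - name: git-dashboards
    type: file
    options:
      path: /var/lib/grafana/dashboards
```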

Dashboards

1. Cluster Overview

Purpose: Single-pane-of-glass view of the entire cluster.

| Panel | Metric |
| --- | --- |
| Node count (up/down) | `up{job="node-exporter"}` |
| Total CPU usage | `sum(rate(node_cpu_seconds_total{mode!="idle"}[5m]))` |
| Total memory usage | `sum(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)` |
| Pod count by status | `kube_pod_status_phase` |
| K3s API latency | `apiserver_request_duration_seconds` |
| Cluster events | `kube_event_count` |

2. Node Detail

Purpose: Per-node resource utilization for capacity planning.

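| Panel | Metric |
| --- | --- |
| CPU usage per core | `rate(node_cpu_seconds_total{mode!="idle"}[5m])` |
| Memory usage breakdown | `node_memory_MemTotal_bytes`, `node_memory_MemAvailable_bytes`, `node_memory_Buffers_bytes`, `node_memory_Cached_bytes` |
| Disk I/O | `rate(node_disk_read_bytes_total[5m])`, `rate(node_disk_written_bytes_total[5m])` |
| Disk usage | `node_filesystem_size_bytes`, `node_filesystem_avail_bytes` |
| Network throughput | `rate(node_network_receive_bytes_total[5m])`, `rate(node_network_transmit_bytes_total[5m])` |
| System load | `node_load1`, `node_load5`, `node_load15` |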

3. Pod Resources

Purpose: Kubernetes workload resource consumption and limits.

| Panel | Metric |
| --- | --- |
| CPU usage vs request vs limit | `container_cpu_usage_seconds_total`, `kube_pod_container_resource_*` |
| Memory usage vs request vs limit | `container_memory_working_set_bytes`, `kube_pod_container_resource_*` |
| Pod restarts | `kube_pod_container_status_restarts_total` |
| OOMKilled events | `kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}` |

4. Network (Cilium + Hubble)

Purpose: Network flow visibility and policy enforcement.

| Panel | Metric |
| --- | --- |
| Flows per namespace | `hubble_flows_processed_total` |
| Policy verdict (allow/deny) | `hubble_policy_verdict` |
| DNS queries | `hubble_dns_queries_total` |
| TCP connections | `hubble_tcp_flags_total` |

5. Beast Session

Purpose: Beast VM lifecycle tracking and cost monitoring.

| Panel | Metric |
| --- | --- |
| Beast up/down | `up{instance="10.0.1.3:9100"}` |
| Session duration | Custom metric from session log |
| CPU during session | `rate(node_cpu_seconds_total{instance="10.0.1.3:9100"}[5m])` |
| Estimated cost | Session hours × EUR 0.0045 per hour |
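
The cost could also be approximated from scrape data alone, without the session log. The sketch below is a swapped-in approach, not the dashboard's actual query: it assumes each successful 30s scrape of Beast's node_exporter stands for 30 seconds of uptime and that EUR 0.0045 is an hourly rate. The group and record names are hypothetical.

```yaml
# Hypothetical vmalert recording rule; an approximation, not the
# session-log metric used by the dashboard.
groups:
  - name: beast-cost
    interval: 5m
    rules:
      - record: beast:estimated_cost_eur:30d
        # Successful scrapes over 30 days * 30 s per scrape / 3600 s per hour
        # gives uptime hours, which is then multiplied by the hourly rate.
        expr: 'sum_over_time(up{job="node-exporter", instance="10.0.1.3:9100"}[30d]) * 30 / 3600 * 0.0045'
```

Failed scrapes record `up` as 0, so downtime adds nothing to the sum. The result drifts from the real session time if scrapes are missed while Beast is running.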

6. CrowdSec

Purpose: Security events and threat intelligence.

| Panel | Metric |
| --- | --- |
| Active bans | `cs_active_decisions` |
| Alerts by scenario | `cs_alerts_total` |
| Bouncer decisions | `cs_bouncer_decisions_total` |
| Parsed log lines | `cs_parsed_lines_total` |

kube-state-metrics

Exposes Kubernetes object state as Prometheus metrics:

| Metric Family | Purpose |
| --- | --- |
| `kube_pod_*` | Pod status, phase, restarts, container states |
| `kube_node_*` | Node conditions, allocatable resources |
| `kube_deployment_*` | Deployment status, replicas |
| `kube_namespace_*` | Namespace status |
| `kube_daemonset_*` | DaemonSet status |

node_exporter

Runs as a DaemonSet on all nodes, exposing host-level metrics:

| Collector | Metrics |
| --- | --- |
| CPU | `node_cpu_seconds_total` |
| Memory | `node_memory_*` |
| Disk | `node_disk_*`, `node_filesystem_*` |
| Network | `node_network_*` |
| Load | `node_load1`, `node_load5`, `node_load15` |
| Systemd | `node_systemd_unit_state` |