# Monitoring

## Stack Overview
```mermaid
flowchart TB
    subgraph Targets["Scrape Targets"]
        NE1["node_exporter<br/>(Hub)"]
        NE2["node_exporter<br/>(DMZ)"]
        NE3["node_exporter<br/>(Beast)"]
        KSM["kube-state-metrics"]
        K3s["K3s metrics"]
        VM_self["VictoriaMetrics<br/>(self-scrape)"]
    end
    subgraph Stack["Monitoring Stack (Hub)"]
        VictM["VictoriaMetrics<br/>(TSDB)"]
        Graf["Grafana<br/>(Dashboards)"]
        VMA["vmalert<br/>(Alert rules)"]
    end
    NE1 -->|"/metrics"| VictM
    NE2 -->|"/metrics"| VictM
    NE3 -->|"/metrics"| VictM
    KSM -->|"/metrics"| VictM
    K3s -->|"/metrics"| VictM
    VM_self -->|"/metrics"| VictM
    VictM -->|"PromQL"| Graf
    VictM -->|"PromQL"| VMA
    VMA -->|"alerts"| Ntfy["ntfy.sh<br/>(notifications)"]
    style VictM fill:#1a5276,stroke:#2980b9,color:#fff
    style Graf fill:#7d6608,stroke:#f1c40f,color:#fff
    style VMA fill:#7b241c,stroke:#c0392b,color:#fff
```
## VictoriaMetrics
VictoriaMetrics (single-node) replaces Prometheus as the metrics TSDB. Its MetricsQL query language is backwards-compatible with PromQL, and its resource usage is significantly lower.
| Parameter | Value |
|---|---|
| Mode | Single-node |
| Retention | 30 days |
| Scrape interval | 30s |
| Storage path | /var/lib/victoria-metrics/ |
| Memory limit | 512 MB |
| Listen port | 8428 |
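The parameters above map directly onto VictoriaMetrics startup flags. A minimal sketch as a docker-compose service (the deployment method, image tag, and config-file path are assumptions; the flag names are VictoriaMetrics' own, and `-memory.allowedBytes` is a soft hint rather than a hard limit):

```yaml
# Hypothetical docker-compose deployment; adapt to the actual runtime on Hub.
services:
  victoriametrics:
    image: victoriametrics/victoria-metrics:latest
    command:
      - -retentionPeriod=30d                                 # 30-day retention
      - -storageDataPath=/var/lib/victoria-metrics           # storage path
      - -promscrape.config=/etc/victoria-metrics/scrape.yml  # scrape config (assumed path)
      - -memory.allowedBytes=512MiB                          # soft memory cap
      - -httpListenAddr=:8428
    ports:
      - "8428:8428"
    volumes:
      - /var/lib/victoria-metrics:/var/lib/victoria-metrics
      - ./scrape.yml:/etc/victoria-metrics/scrape.yml:ro
```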
### Scrape Configuration
```yaml
scrape_configs:
  - job_name: node-exporter
    static_configs:
      - targets:
          - 10.0.1.1:9100  # Hub
          - 10.0.1.2:9100  # DMZ
          - 10.0.1.3:9100  # Beast (when running)
    scrape_interval: 30s

  - job_name: kube-state-metrics
    static_configs:
      - targets:
          - kube-state-metrics.kube-system:8080
    scrape_interval: 30s

  - job_name: k3s
    static_configs:
      - targets:
          - 10.0.1.1:10250  # kubelet Hub
          - 10.0.1.2:10250  # kubelet DMZ
          - 10.0.1.3:10250  # kubelet Beast
    scheme: https
    tls_config:
      insecure_skip_verify: true

  - job_name: victoriametrics
    static_configs:
      - targets:
          - localhost:8428
    scrape_interval: 60s
```
> **Beast target:** When Beast is not running, VictoriaMetrics logs a scrape error for `10.0.1.3:9100` every 30s. This is expected and does not trigger an alert (the alert fires on node absence, not scrape failure).
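The node-absence behavior described above can be sketched as a vmalert rule. The group name, 5-minute hold, and labels are assumptions, not the deployed rules; the key point is that Beast's instance is excluded from the expression so its expected downtime never pages:

```yaml
# Hypothetical vmalert rule file; names and thresholds are illustrative.
groups:
  - name: node-health
    rules:
      - alert: NodeDown
        # Hub and DMZ must always be scrapable; Beast (10.0.1.3) is
        # intentionally excluded because it is powered off most of the time.
        expr: up{job="node-exporter", instance!~"10\\.0\\.1\\.3.*"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "node_exporter on {{ $labels.instance }} unreachable for 5m"
```

vmalert forwards firing alerts to ntfy.sh via its notifier configuration, per the stack diagram above.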
## Grafana
Grafana provides dashboards with VictoriaMetrics as the data source.
| Parameter | Value |
|---|---|
| Version | Latest OSS |
| Port | 3000 |
| Auth | Authelia (via Caddy forward_auth) |
| Data source | VictoriaMetrics (http://victoriametrics:8428) |
| Dashboard provisioning | Git repo dashboards/ |
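Because VictoriaMetrics speaks the Prometheus query API, it is registered in Grafana as a Prometheus-type data source. A provisioning sketch (the file path and `isDefault` choice are assumptions):

```yaml
# /etc/grafana/provisioning/datasources/victoriametrics.yml (path assumed)
apiVersion: 1
datasources:
  - name: VictoriaMetrics
    type: prometheus   # VictoriaMetrics is queried via the Prometheus API
    access: proxy
    url: http://victoriametrics:8428
    isDefault: true
```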
### Dashboards
#### 1. Cluster Overview
Purpose: Single-pane-of-glass view of the entire cluster.
| Panel | Metric |
|---|---|
| Node count (up/down) | up{job="node-exporter"} |
| Total CPU usage | sum(rate(node_cpu_seconds_total{mode!="idle"}[5m])) |
| Total memory usage | sum(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) |
| Pod count by status | kube_pod_status_phase |
| K3s API latency | apiserver_request_duration_seconds |
| Cluster events | kube_event_count |
#### 2. Node Detail
Purpose: Per-node resource utilization for capacity planning.
| Panel | Metric |
|---|---|
| CPU usage per core | rate(node_cpu_seconds_total{mode!="idle"}[5m]) |
| Memory usage breakdown | node_memory_MemTotal_bytes, MemAvailable, Buffers, Cached |
| Disk I/O | rate(node_disk_read_bytes_total[5m]), write_bytes_total |
| Disk usage | node_filesystem_size_bytes, avail_bytes |
| Network throughput | rate(node_network_receive_bytes_total[5m]), transmit |
| System load | node_load1, node_load5, node_load15 |
#### 3. Pod Resources
Purpose: Kubernetes workload resource consumption and limits.
| Panel | Metric |
|---|---|
| CPU usage vs request vs limit | container_cpu_usage_seconds_total, kube_pod_container_resource_* |
| Memory usage vs request vs limit | container_memory_working_set_bytes, kube_pod_container_resource_* |
| Pod restarts | kube_pod_container_status_restarts_total |
| OOMKilled events | kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} |
#### 4. Network (Cilium + Hubble)
Purpose: Network flow visibility and policy enforcement.
| Panel | Metric |
|---|---|
| Flows per namespace | hubble_flows_processed_total |
| Policy verdicts (allow/deny) | hubble_policy_verdicts_total |
| DNS queries | hubble_dns_queries_total |
| TCP connections | hubble_tcp_flags_total |
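These `hubble_*` families are only emitted when the corresponding metric names are enabled in Cilium's configuration. A Helm values sketch (merging this into the actual values file is assumed; the metric names themselves are Hubble's own):

```yaml
# Cilium Helm values fragment (sketch)
hubble:
  enabled: true
  metrics:
    # each entry enables one hubble_* metric family
    enabled:
      - flow     # hubble_flows_processed_total
      - policy   # hubble_policy_verdicts_total
      - dns      # hubble_dns_queries_total
      - tcp      # hubble_tcp_flags_total
```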
#### 5. Beast Session
Purpose: Beast VM lifecycle tracking and cost monitoring.
| Panel | Metric |
|---|---|
| Beast up/down | up{instance=~".*10.0.1.3.*"} |
| Session duration | Custom metric from session log |
| CPU during session | rate(node_cpu_seconds_total{instance=~".*10.0.1.3.*"}[5m]) |
| Estimated cost | Session hours × EUR 0.0045 |
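Session hours and cost can be derived from `up` samples rather than a custom metric. A hedged sketch as vmalert recording rules (the rule names are invented, the subquery step must match the 30s scrape interval, and the EUR 0.0045/h rate comes from the table above):

```yaml
# Hypothetical recording rules; adjust the window to the retention period.
groups:
  - name: beast-session
    rules:
      - record: beast:session_hours_30d
        # each up==1 sample represents one 30s scrape interval
        expr: count_over_time((up{instance=~".*10.0.1.3.*"} == 1)[30d:30s]) * 30 / 3600
      - record: beast:estimated_cost_eur_30d
        expr: beast:session_hours_30d * 0.0045
```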
#### 6. CrowdSec
Purpose: Security events and threat intelligence.
| Panel | Metric |
|---|---|
| Active bans | cs_active_decisions |
| Alerts by scenario | cs_alerts_total |
| Bouncer decisions | cs_bouncer_decisions_total |
| Parsed log lines | cs_parsed_lines_total |
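CrowdSec only exposes its `cs_*` metrics when the Prometheus endpoint is enabled in its configuration. A sketch of the relevant fragment (the listen address and detail level shown are assumptions about this deployment):

```yaml
# Fragment of /etc/crowdsec/config.yaml (sketch)
prometheus:
  enabled: true
  level: full            # per-scenario detail rather than aggregated counters
  listen_addr: 127.0.0.1
  listen_port: 6060
```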
## kube-state-metrics
Exposes Kubernetes object state as Prometheus metrics:
| Metric Family | Purpose |
|---|---|
| kube_pod_* | Pod status, phase, restarts, container states |
| kube_node_* | Node conditions, allocatable resources |
| kube_deployment_* | Deployment status, replicas |
| kube_namespace_* | Namespace status |
| kube_daemonset_* | DaemonSet status |
## node_exporter
Runs as a DaemonSet on all nodes, exposing host-level metrics:
| Collector | Metrics |
|---|---|
| CPU | node_cpu_seconds_total |
| Memory | node_memory_* |
| Disk | node_disk_*, node_filesystem_* |
| Network | node_network_* |
| Load | node_load1, node_load5, node_load15 |
| Systemd | node_systemd_unit_state |
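Note that the systemd collector is disabled by default in node_exporter and must be enabled explicitly; as a DaemonSet, the exporter also needs the host filesystem mounted to report meaningful disk and filesystem metrics. A pod-spec fragment sketching this (container name, image tag, and mount paths are assumptions):

```yaml
# Fragment of a node_exporter DaemonSet pod spec (sketch)
containers:
  - name: node-exporter
    image: quay.io/prometheus/node-exporter:latest
    args:
      - --path.rootfs=/host/root
      - --collector.systemd   # node_systemd_unit_state is opt-in
    ports:
      - containerPort: 9100
    volumeMounts:
      - name: root
        mountPath: /host/root
        readOnly: true
```

The systemd collector additionally needs access to the host's D-Bus/systemd socket from inside the container, which is a deployment detail not shown here.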