Monitoring
The monitoring stack runs on the dedicated monitoring server, a repurposed Lenovo laptop running headless with the lid closed. It covers metrics, logs, alerting, and endpoint availability across the entire homelab.
Sleep is disabled via systemd-logind so the laptop stays on regardless of lid state. Ansible deploys this setting when disable_sleep: true is set in the host's variables.
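The sleep override could be deployed roughly as follows. This is a minimal sketch, not the actual playbook: the inventory group `monitoring`, the drop-in file name `10-disable-sleep.conf`, and the handler name are assumptions. The `HandleLidSwitch*` keys are standard logind.conf options, and `disable_sleep` is the host variable described above.

```yaml
# Sketch only: group name, file name, and handler name are hypothetical.
- name: Keep the monitoring laptop awake with the lid closed
  hosts: monitoring            # assumed inventory group
  become: true
  tasks:
    - name: Ensure the logind drop-in directory exists
      ansible.builtin.file:
        path: /etc/systemd/logind.conf.d
        state: directory
        mode: "0755"
      when: disable_sleep | default(false) | bool

    - name: Tell logind to ignore the lid switch
      ansible.builtin.copy:
        dest: /etc/systemd/logind.conf.d/10-disable-sleep.conf
        owner: root
        group: root
        mode: "0644"
        content: |
          [Login]
          HandleLidSwitch=ignore
          HandleLidSwitchExternalPower=ignore
          HandleLidSwitchDocked=ignore
      when: disable_sleep | default(false) | bool
      notify: Restart systemd-logind

  handlers:
    # logind only reads its configuration at start, so it is restarted on change
    - name: Restart systemd-logind
      ansible.builtin.systemd:
        name: systemd-logind
        state: restarted
```

Using a drop-in rather than editing /etc/systemd/logind.conf keeps the change isolated and easy to remove. Hosts without `disable_sleep` set skip both tasks.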
| Service | Purpose |
|---|---|
| Prometheus | Metrics collection and storage |
| Grafana | Dashboards and visualization |
| Alertmanager | Alert routing and notifications |
| Loki | Log aggregation |
| Grafana Alloy | Log collection from systemd journal |
| cAdvisor | Container resource metrics |
| Node Exporter | Host system metrics |
| PVE Exporter | Proxmox VE metrics |
| SNMP Exporter | Synology NAS metrics |
| Blackbox Exporter | HTTP, ICMP, and TCP endpoint probing |
| Pushgateway | Metrics from batch jobs (e.g. backups) |
All services run as Podman containers on a shared internal network, provisioned with Ansible. Containers use system-level Quadlets (/etc/containers/systemd/) and run as root.
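To illustrate the Quadlet layout, the sketch below uses Ansible to write one network unit and one container unit into /etc/containers/systemd/. The unit file names, the network name `monitoring`, the inventory group, and the choice of Blackbox Exporter as the example service are assumptions for illustration; the real playbook may differ.

```yaml
# Sketch only: file names, network name, and image tag are hypothetical.
- name: Deploy a system-level Quadlet
  hosts: monitoring            # assumed inventory group
  become: true
  tasks:
    - name: Define the shared internal network
      ansible.builtin.copy:
        dest: /etc/containers/systemd/monitoring.network
        mode: "0644"
        content: |
          [Network]
          NetworkName=monitoring
      notify: Reload systemd

    - name: Define the Blackbox Exporter container
      ansible.builtin.copy:
        dest: /etc/containers/systemd/blackbox-exporter.container
        mode: "0644"
        content: |
          [Unit]
          Description=Prometheus Blackbox Exporter

          [Container]
          ContainerName=blackbox-exporter
          Image=quay.io/prometheus/blackbox-exporter:latest
          Network=monitoring.network

          [Service]
          Restart=always

          [Install]
          WantedBy=multi-user.target
      notify: Reload systemd

  handlers:
    # Quadlet generates blackbox-exporter.service during daemon-reload
    - name: Reload systemd
      ansible.builtin.systemd:
        daemon_reload: true
```

After the reload, the generated service can be started like any other systemd unit. Because the files live under /etc/containers/systemd/, the containers run as root.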
Metrics
Prometheus scrapes metrics every 15 seconds from targets across the homelab:
| Job | Source | What it covers |
|---|---|---|
| node | Node Exporter on each host | CPU, memory, disk, network |
| cadvisor | cAdvisor on monitoring and kontti | Per-container resource usage |
| pve | PVE Exporter | Proxmox VMs, storage, cluster |
| snmp | SNMP Exporter | Synology NAS health and storage |
| blackbox | Blackbox Exporter | HTTP/TCP availability of internal services |
| blackbox_icmp | Blackbox Exporter | ICMP ping to servers |
| pushgateway | Pushgateway | Backup job results and durations |
| alertmanager | Alertmanager itself | Alertmanager health |
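A scrape configuration along these lines could produce the jobs in the table. This is a sketch under assumptions: every hostname and port is a placeholder, and only two jobs are shown. The relabeling in the `blackbox` job is the usual Blackbox Exporter pattern, where the probed URL is passed as a parameter and the exporter itself becomes the scrape address.

```yaml
# prometheus.yml (sketch): hostnames and ports are placeholders
global:
  scrape_interval: 15s         # scrape targets every 15 seconds
  evaluation_interval: 15s     # evaluate rules every 15 seconds

scrape_configs:
  - job_name: node
    static_configs:
      - targets:
          - monitoring.example.lan:9100   # hypothetical host
          - kontti.example.lan:9100       # hypothetical host

  - job_name: blackbox
    metrics_path: /probe
    params:
      module: [http_2xx]       # module from the default blackbox.yml
    static_configs:
      - targets:
          - https://grafana.example.lan   # hypothetical probed URL
    relabel_configs:
      # Pass the listed target to the exporter as ?target=...
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # Scrape the exporter instead of the probed endpoint
      - target_label: __address__
        replacement: blackbox-exporter:9115
```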
Alert rules
Prometheus evaluates alert rules every 15 seconds. Rules are organized by category:
- node.rules.yml: CPU, memory, disk usage thresholds
- container.rules.yml: container restarts and resource limits
- probe.rules.yml: endpoint availability failures
- proxmox.rules.yml: Proxmox VM and storage alerts
- backup.rules.yml: missed or failed backups via Pushgateway
- synology.rules.yml: NAS health and disk status
groups:
  - name: node
    rules:
      - alert: NodeDown
        expr: up{job=~"node|opnsense|proxmox|kontti"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Node unreachable"
          description: "Node Exporter on {{ $labels.instance }} has been unreachable for 2 minutes."

      - alert: DiskSpaceLow
        expr: >
          (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay|squashfs",mountpoint=~"/|/sysroot|/var"}
          / node_filesystem_size_bytes{fstype!~"tmpfs|overlay|squashfs",mountpoint=~"/|/sysroot|/var"}) * 100 < 20
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk space low"
          description: "{{ $labels.instance }} mountpoint {{ $labels.mountpoint }} has {{ $value | printf \"%.1f\" }}% free."

      # NAS uses absolute thresholds: on a 47 TB volume, percentages are too coarse
      - alert: NasDiskSpaceLow
        expr: node_filesystem_avail_bytes{mountpoint="/var/mnt/nfs-data"} < 2199023255552
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "NAS disk space low"
          description: "{{ $labels.instance }} NAS has {{ $value | humanize1024 }}B free (under 2 TB)."

      - alert: HighCPULoad
        expr: >
          (1 - avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU load"
          description: "{{ $labels.instance }} CPU usage is {{ $value | printf \"%.1f\" }}% (over 85% for 10 min)."
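The backup rules follow the same shape. The sketch below shows how a missed backup might be detected from a timestamp pushed to Pushgateway. The metric name `backup_last_success_timestamp_seconds`, the alert name, and the 26-hour threshold (93600 seconds) are assumptions, not taken from backup.rules.yml.

```yaml
groups:
  - name: backup
    rules:
      - alert: BackupMissed
        # Hypothetical metric: Unix time of the last successful backup,
        # pushed by the backup job to Pushgateway.
        expr: time() - backup_last_success_timestamp_seconds > 93600
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Backup missed"
          description: "Last successful backup was {{ $value | humanizeDuration }} ago."
```

A threshold slightly above 24 hours leaves room for a daily job that runs late. Note that the rule stays silent if the metric was never pushed at all, so a companion `absent()` rule may be worth adding.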
Logs
Grafana Alloy runs on the monitoring server and collects logs from the systemd journal, forwarding them to Loki. Noisy but harmless log lines are filtered out before reaching Loki to keep the log volume manageable. Logs are tagged with host, systemd unit, and container name for easy filtering in Grafana.
Alertmanager
Alertmanager receives alerts from Prometheus and handles deduplication, grouping, and routing. Notifications are sent to Telegram.
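A minimal Alertmanager configuration for this setup might look like the sketch below. The bot token, chat ID, grouping labels, and timing values are placeholders chosen for illustration; `telegram_configs` is the built-in Telegram receiver available in recent Alertmanager releases.

```yaml
# alertmanager.yml (sketch): token, chat ID, and timings are placeholders
route:
  receiver: telegram
  group_by: [alertname, instance]   # alerts sharing these labels are batched
  group_wait: 30s                   # wait before sending the first notification
  group_interval: 5m                # wait before sending updates for a group
  repeat_interval: 4h               # re-notify while an alert keeps firing

receivers:
  - name: telegram
    telegram_configs:
      - bot_token: "<bot-token>"    # placeholder, keep the real token secret
        chat_id: 123456789          # placeholder chat ID
        parse_mode: HTML
        send_resolved: true         # also notify when an alert clears
```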