Monitoring

The monitoring stack runs on the dedicated monitoring server — a repurposed Lenovo laptop running headless with the lid closed. Sleep is disabled via systemd-logind so it stays on regardless of lid state:

/etc/systemd/logind.conf

[Login]
HandleLidSwitch=ignore

This setting is deployed by Ansible when disable_sleep: true is set in the host's variables. The monitoring stack itself covers metrics, logs, alerting, and endpoint availability across the entire homelab.
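
As a rough sketch of how a conditional task like this could look in Ansible: the play layout, host group, task names, and handler below are assumptions for illustration, not the actual role. Only the file path, the HandleLidSwitch=ignore setting, and the disable_sleep variable come from the text above.

```yaml
# Hypothetical play; the host group, task names, and handler are
# illustrative assumptions, not the homelab's actual role.
- name: Keep the monitoring laptop awake with the lid closed
  hosts: monitoring
  become: true
  tasks:
    - name: Ignore the lid switch in logind.conf
      ansible.builtin.lineinfile:
        path: /etc/systemd/logind.conf
        # Replace an existing (possibly commented-out) setting,
        # otherwise add the line under the [Login] section.
        regexp: '^#?HandleLidSwitch='
        line: HandleLidSwitch=ignore
        insertafter: '^\[Login\]'
      # Only applies to hosts that opt in via their variables.
      when: disable_sleep | default(false) | bool
      notify: Restart systemd-logind

  handlers:
    - name: Restart systemd-logind
      ansible.builtin.systemd:
        name: systemd-logind
        state: restarted
```

Using a handler means logind is restarted only when the line actually changes, so repeated runs stay idempotent.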

Service	Purpose
Prometheus	Metrics collection and storage
Grafana	Dashboards and visualization
Alertmanager	Alert routing and notifications
Loki	Log aggregation
Grafana Alloy	Log collection from systemd journal
cAdvisor	Container resource metrics
Node Exporter	Host system metrics
PVE Exporter	Proxmox VE metrics
SNMP Exporter	Synology DS920+ metrics
Blackbox Exporter	HTTP, ICMP, and TCP endpoint probing
Pushgateway	Metrics from batch jobs (e.g. backups)

All services run as Podman containers on a shared internal network, provisioned with Ansible. Containers are defined as Quadlets under /etc/containers/systemd/ and run rootless as the unprivileged core user — the observability stack needs no host-level privileges, so keeping it rootless limits the blast radius of a container compromise. (The media server kontti runs Podman as root; the difference is explained on the uCore page.)

The full stack

Two collection paths — metrics and logs — converge on the monitoring server, where Grafana reads both and Alertmanager fans out anything that breaches a rule:

graph LR
    subgraph sources[Sources across the homelab]
        ne[Node Exporter<br/>hosts + OPNsense]
        cad[cAdvisor<br/>monitoring + kontti]
        pve[PVE Exporter]
        snmp[SNMP Exporter<br/>Synology DS920+]
        bb[Blackbox Exporter<br/>HTTP / ICMP / TCP]
        pg[Pushgateway<br/>backup jobs]
        alloy[Grafana Alloy<br/>monitoring, kontti, HA]
    end

    subgraph mon[monitoring server]
        prom[Prometheus]
        loki[Loki]
        graf[Grafana]
        am[Alertmanager]
    end

    ne & cad & pve & snmp & bb & pg --> prom
    alloy -->|logs| loki
    alloy -->|metrics| prom
    prom --> graf
    loki --> graf
    prom -->|alert rules| am
    am -->|Telegram| tg[Phone]

Metrics

Prometheus scrapes metrics every 15 seconds from targets across the homelab:

Job	Source	What it covers
`node`	Node Exporter on each host	CPU, memory, disk, network
`opnsense`	Node Exporter on OPNsense	Router CPU, memory, network
`cadvisor`	cAdvisor on monitoring & kontti	Per-container resource usage
`pve`	PVE Exporter	Proxmox VMs, storage, cluster
`snmp`	SNMP Exporter	Synology DS920+ health and storage
`blackbox`	Blackbox Exporter	HTTP/TCP availability of internal services
`blackbox_icmp`	Blackbox Exporter	ICMP ping to servers
`pushgateway`	Pushgateway	Backup job results and durations
`alertmanager`	Alertmanager itself	Alertmanager health
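
A minimal sketch of how a few of these jobs might be declared in prometheus.yml. The 15-second interval and the job names come from the table above. Every hostname, container name, and probe target is a placeholder assumption, and the ports are the exporters' upstream defaults, not necessarily the ones used here.

```yaml
# Sketch only: hostnames and targets are placeholders, ports are upstream defaults.
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: node
    static_configs:
      - targets: ["monitoring.example.internal:9100"]

  - job_name: pushgateway
    # Keep the job/instance labels that the backup jobs pushed,
    # instead of overwriting them with the Pushgateway's own.
    honor_labels: true
    static_configs:
      - targets: ["pushgateway:9091"]

  - job_name: blackbox
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets: ["https://grafana.example.internal"]
    relabel_configs:
      # Pass the listed URL to the exporter as ?target=...
      - source_labels: [__address__]
        target_label: __param_target
      # Show the probed URL, not the exporter, as the instance label.
      - source_labels: [__param_target]
        target_label: instance
      # Send the actual scrape to the Blackbox Exporter itself.
      - target_label: __address__
        replacement: blackbox-exporter:9115
```

The relabeling in the blackbox job is the standard pattern for that exporter: Prometheus scrapes the exporter, and the exporter probes the target on its behalf.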

Alert rules

Prometheus evaluates alert rules every 15 seconds. Rules are organized by category:

node.rules.yml — CPU, memory, disk usage thresholds
container.rules.yml — container restarts and resource limits
probe.rules.yml — endpoint availability failures
proxmox.rules.yml — Proxmox VM and storage alerts
backup.rules.yml — missed or failed backups via Pushgateway
synology.rules.yml — NAS health and disk status

node.rules.yml

groups:
  - name: node
    rules:
      - alert: NodeDown
        expr: up{job=~"node|opnsense|proxmox|kontti"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Node unreachable"
          description: "Node Exporter on {{ $labels.instance }} has been unreachable for 2 minutes."

      - alert: DiskSpaceLow
        expr: >
          (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay|squashfs",mountpoint=~"/|/sysroot|/var"}
          / node_filesystem_size_bytes{fstype!~"tmpfs|overlay|squashfs",mountpoint=~"/|/sysroot|/var"}) * 100 < 20
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk space low"
          description: "{{ $labels.instance }} mountpoint {{ $labels.mountpoint }} has {{ $value | printf \"%.1f\" }}% free."

      # NAS uses absolute thresholds — on a 47 TB volume, percentages are too coarse
      - alert: NasDiskSpaceLow
        expr: node_filesystem_avail_bytes{mountpoint="/var/mnt/nfs-data"} < 2199023255552
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "NAS disk space low"
          description: "{{ $labels.instance }} NAS has {{ $value | humanize1024 }}B free (under 2 TiB)."

      - alert: HighCPULoad
        expr: >
          (1 - avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU load"
          description: "{{ $labels.instance }} CPU usage is {{ $value | printf \"%.1f\" }}% (over 85% for 10 min)."
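
The other rule files are not reproduced on this page. As a hedged sketch of what entries in probe.rules.yml and backup.rules.yml could look like: probe_success is the metric Blackbox Exporter exposes for every probe, while the backup metric name, the alert names, and all thresholds are assumptions made up for illustration.

```yaml
# Sketch only: alert names, thresholds, and the backup metric are assumptions.
groups:
  - name: probe
    rules:
      - alert: EndpointDown
        # probe_success comes from Blackbox Exporter: 1 = probe passed, 0 = failed.
        expr: probe_success{job=~"blackbox|blackbox_icmp"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Endpoint unreachable"
          description: "Probe of {{ $labels.instance }} has been failing for 2 minutes."

  - name: backup
    rules:
      - alert: BackupMissed
        # backup_last_success_timestamp_seconds is a hypothetical gauge that a
        # backup job would push to Pushgateway after each successful run.
        expr: time() - backup_last_success_timestamp_seconds > 36 * 3600
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Backup overdue"
          description: "No successful backup has been reported in over 36 hours."
```

Alerting on the age of the last success, rather than on a failure flag, also catches the case where the backup job never ran at all.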

Logs

Grafana Alloy runs on monitoring, kontti, and the Home Assistant host. On each host it collects logs from the systemd journal and forwards them to Loki on the monitoring server. The agent on kontti is deployed by the kontti_monitoring role alongside cAdvisor and Node Exporter, so the application server's logs and container metrics reach Grafana even though Loki and Prometheus run on the monitoring server. The Home Assistant host runs its own Alloy agent, which ships journal logs (filtered down to the Home Assistant unit) to Loki and system metrics to Prometheus via remote write. Noisy but harmless log lines are dropped before they reach Loki to keep the log volume manageable, and logs are labeled with host, systemd unit, and container name for easy filtering.

Alertmanager

Alertmanager receives alerts from Prometheus and handles deduplication, grouping, and routing. Notifications are sent to Telegram.

The metrics-to-notification path looks like this:

graph LR
    ne[Node Exporter] --> prom[Prometheus]
    cad[cAdvisor] --> prom
    bb[Blackbox Exporter] --> prom
    prom --> graf[Grafana]
    prom -->|alert rules| am[Alertmanager]
    am -->|Telegram| tg[Phone]

Alert policy

I'm the only on-call engineer, so the routing is tuned to stay useful without being noisy. Alerts are grouped by alertname, held for 30s so related firings are batched together, and repeated at most once an hour while a condition persists. A mute interval named yö (Finnish for "night") silences Telegram notifications between 20:00 and 07:00. Alerts still fire and resolve in Alertmanager during that window, but nothing wakes me; the only thing that would justify it is an outage, and I'd notice that anyway. In practice I treat a daytime critical (e.g. NodeDown) as immediate and warnings (disk, CPU) as something to look at within a few hours.
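
The policy above maps fairly directly onto Alertmanager's routing configuration. The following is a sketch, not the actual alertmanager.yml: the receiver name, bot token, chat ID, and time zone are placeholders or assumptions, while the grouping, timings, and the yö interval follow the description.

```yaml
# Sketch of alertmanager.yml; receiver name, token, chat ID, and time zone
# are placeholders or assumptions.
route:
  receiver: telegram
  group_by: [alertname]
  group_wait: 30s       # batch related firings before the first notification
  repeat_interval: 1h   # re-notify at most hourly while an alert keeps firing
  routes:
    # Alertmanager rejects mute intervals on the root route, so a catch-all
    # child route (no matchers) carries the mute instead.
    - mute_time_intervals: [yö]

time_intervals:
  - name: yö
    time_intervals:
      # A single range cannot cross midnight, so the night is split in two.
      - times:
          - start_time: "20:00"
            end_time: "24:00"
          - start_time: "00:00"
            end_time: "07:00"
        location: Europe/Helsinki  # assumption; times are UTC when omitted

receivers:
  - name: telegram
    telegram_configs:
      - bot_token: "<bot token>"  # placeholder
        chat_id: 123456789        # placeholder
```

Muting affects notifications only: alerts remain visible in the Alertmanager UI throughout the interval, which matches the behaviour described above.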

Grafana dashboards

Grafana runs the Enterprise edition image (used here without a paid licence, so it behaves as open-source Grafana). It is provisioned with a fixed set of dashboards so the visualisations are reproducible from Ansible rather than clicked together by hand. Community dashboards are downloaded at a pinned revision and patched (datasource, mountpoint, variables) to fit this environment; the rest are maintained locally as JSON.

Dashboard	Source
Node Exporter Full	grafana.com/dashboards/1860
OPNsense	grafana.com/dashboards/19366 (rev 7)
cAdvisor	grafana.com/dashboards/14282
Loki	grafana.com/dashboards/13639
Blackbox Exporter	grafana.com/dashboards/7587 (rev 3)
Error Logs	Local JSON
Synology NAS Details	Local JSON
Proxmox via Prometheus	Local JSON
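
One way the "download at a pinned revision, then patch" step could be expressed in Ansible, shown for the OPNsense dashboard at the revision listed in the table. The grafana_dashboards_dir variable, the file name, and the datasource name are hypothetical, and the ${DS_PROMETHEUS} placeholder is a common convention in community dashboards, not something every dashboard uses.

```yaml
# Hypothetical tasks; grafana_dashboards_dir, the file name, and the
# datasource name are assumptions. Dashboard ID and revision are from the table.
- name: Download the OPNsense dashboard at a pinned revision
  ansible.builtin.get_url:
    url: https://grafana.com/api/dashboards/19366/revisions/7/download
    dest: "{{ grafana_dashboards_dir }}/opnsense.json"
    mode: "0644"

- name: Point the dashboard at the provisioned Prometheus datasource
  ansible.builtin.replace:
    path: "{{ grafana_dashboards_dir }}/opnsense.json"
    # Community dashboards often ship with an input placeholder
    # where the datasource name should go.
    regexp: '\$\{DS_PROMETHEUS\}'
    replace: Prometheus
```

Pinning the revision in the URL keeps the patched result stable: an upstream update cannot silently change the JSON that the patch step expects.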

My experience

I previously used Zabbix for monitoring and wanted to move to more modern tooling. Prometheus, Alertmanager and Loki are common in the Kubernetes world, and Loki is significantly lighter than the Elastic stack. I used AI to help set up the stack.

Alerts have been useful since I tuned them to my liking; the initial configuration took some iteration. The Grafana dashboards were more of a learning project. I rarely look at them in practice because the environment is stable enough that troubleshooting is infrequent. Log collection has been similar: mostly a learning exercise so far, but it's convenient to have metrics and logs in the same system if I ever need to dig into an issue.