Monitoring

The monitoring stack runs on the dedicated monitoring server — a repurposed Lenovo laptop running headless with the lid closed. Sleep is disabled via systemd-logind so it stays on regardless of lid state:

/etc/systemd/logind.conf
[Login]
HandleLidSwitch=ignore

This setting is deployed by Ansible when disable_sleep: true is set in the host's variables.

The monitoring stack covers metrics, logs, alerting, and endpoint availability across the entire homelab.
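The lid-switch override shown above could be applied with a play along the following lines. This is only a sketch, not the actual role: the host group name `monitoring` and the handler name are assumptions, and only the `disable_sleep` variable and the `HandleLidSwitch=ignore` setting come from this page.

```yaml
# Hypothetical play; the real Ansible role may be structured differently.
- hosts: monitoring            # assumed inventory group name
  become: true
  tasks:
    - name: Ignore the lid switch so the laptop keeps running with the lid closed
      ansible.builtin.lineinfile:
        path: /etc/systemd/logind.conf
        # Matches the commented-out default as well as an existing value,
        # but not HandleLidSwitchDocked= or HandleLidSwitchExternalPower=
        regexp: '^#?HandleLidSwitch='
        line: HandleLidSwitch=ignore
        insertafter: '^\[Login\]'
      when: disable_sleep | default(false) | bool
      notify: Restart systemd-logind

  handlers:
    - name: Restart systemd-logind
      ansible.builtin.service:
        name: systemd-logind
        state: restarted
```

Guarding the task with `when` keeps the change opt-in per host, so other machines managed by the same play keep their default lid behaviour.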

Service            Purpose
Prometheus         Metrics collection and storage
Grafana            Dashboards and visualization
Alertmanager       Alert routing and notifications
Loki               Log aggregation
Grafana Alloy      Log collection from systemd journal
cAdvisor           Container resource metrics
Node Exporter      Host system metrics
PVE Exporter       Proxmox VE metrics
SNMP Exporter      Synology NAS metrics
Blackbox Exporter  HTTP, ICMP, and TCP endpoint probing
Pushgateway        Metrics from batch jobs (e.g. backups)

All services run as Podman containers on a shared internal network, provisioned with Ansible. Containers use system-level Quadlets (/etc/containers/systemd/) and run as root.


Metrics

Prometheus scrapes metrics every 15 seconds from targets across the homelab:

Job            Source                           What it covers
node           Node Exporter on each host       CPU, memory, disk, network
cadvisor       cAdvisor on monitoring & kontti  Per-container resource usage
pve            PVE Exporter                     Proxmox VMs, storage, cluster
snmp           SNMP Exporter                    Synology NAS health and storage
blackbox       Blackbox Exporter                HTTP/TCP availability of internal services
blackbox_icmp  Blackbox Exporter                ICMP ping to servers
pushgateway    Pushgateway                      Backup job results and durations
alertmanager   Alertmanager itself              Alertmanager health
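To make the job table more concrete, here is a rough sketch of how a few of these jobs might look in prometheus.yml. The 15-second interval and the job names come from this page. The container hostnames, the rule file path, and the probed URL are assumptions for illustration; the ports are the exporters' usual defaults.

```yaml
global:
  scrape_interval: 15s        # scrape every target every 15 seconds
  evaluation_interval: 15s    # evaluate alert rules every 15 seconds

rule_files:
  - /etc/prometheus/rules/*.rules.yml   # assumed path

scrape_configs:
  - job_name: node
    static_configs:
      - targets: ["monitoring:9100", "kontti:9100"]   # assumed hostnames

  - job_name: blackbox
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets: ["https://grafana.example.internal"]  # placeholder URL
    relabel_configs:
      # Pass the listed target to the exporter as ?target=...
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # Send the actual scrape to the Blackbox Exporter itself
      - target_label: __address__
        replacement: blackbox-exporter:9115            # assumed hostname

  - job_name: pushgateway
    honor_labels: true   # keep the job/instance labels pushed by batch jobs
    static_configs:
      - targets: ["pushgateway:9091"]                  # assumed hostname
```

The relabeling in the blackbox job is the standard pattern for that exporter: Prometheus scrapes the exporter, and the exporter probes the real endpoint on its behalf.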

Alert rules

Prometheus evaluates alert rules every 15 seconds. Rules are organized by category:

  • node.rules.yml — CPU, memory, disk usage thresholds
  • container.rules.yml — container restarts and resource limits
  • probe.rules.yml — endpoint availability failures
  • proxmox.rules.yml — Proxmox VM and storage alerts
  • backup.rules.yml — missed or failed backups via Pushgateway
  • synology.rules.yml — NAS health and disk status
node.rules.yml
groups:
  - name: node
    rules:
      - alert: NodeDown
        expr: up{job=~"node|opnsense|proxmox|kontti"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Node unreachable"
          description: "Node Exporter on {{ $labels.instance }} has been unreachable for 2 minutes."

      - alert: DiskSpaceLow
        expr: >
          (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay|squashfs",mountpoint=~"/|/sysroot|/var"}
          / node_filesystem_size_bytes{fstype!~"tmpfs|overlay|squashfs",mountpoint=~"/|/sysroot|/var"}) * 100 < 20
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk space low"
          description: "{{ $labels.instance }} mountpoint {{ $labels.mountpoint }} has {{ $value | printf \"%.1f\" }}% free."

      # NAS uses an absolute threshold: on a 47 TB volume, percentages are too coarse
      # 2199023255552 bytes = 2 TiB (2 * 1024^4)
      - alert: NasDiskSpaceLow
        expr: node_filesystem_avail_bytes{mountpoint="/var/mnt/nfs-data"} < 2199023255552
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "NAS disk space low"
          description: "{{ $labels.instance }} NAS has {{ $value | humanize1024 }}B free (under 2 TiB)."

      - alert: HighCPULoad
        expr: >
          (1 - avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU load"
          description: "{{ $labels.instance }} CPU usage is {{ $value | printf \"%.1f\" }}% (over 85% for 10 min)."
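
The other rule files are not reproduced here. As an illustration only, rules for probe failures and missed backups might look roughly like the sketch below. `probe_success` is the standard Blackbox Exporter metric, but `backup_last_success_timestamp_seconds` is a hypothetical metric name that a backup job would have to push to the Pushgateway, and the 26-hour threshold is an assumption based on a daily backup schedule.

```yaml
groups:
  - name: probe
    rules:
      - alert: EndpointDown
        expr: probe_success == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Endpoint probe failing"
          description: "Probe of {{ $labels.instance }} has been failing for 2 minutes."

  - name: backup
    rules:
      - alert: BackupMissed
        # Hypothetical metric: Unix timestamp of the last successful run,
        # pushed by the backup job. 93600 s = 26 h (assumes daily backups).
        expr: time() - backup_last_success_timestamp_seconds > 93600
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Backup overdue"
          description: "Last successful backup for {{ $labels.job }} was {{ $value | humanizeDuration }} ago."
```

Alerting on the age of the last success, rather than on an explicit failure metric, also catches backups that never ran at all.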

Logs

Grafana Alloy runs on the monitoring server and collects logs from the systemd journal, forwarding them to Loki. Noisy but harmless log lines are filtered out before reaching Loki to keep the log volume manageable. Logs are tagged with host, systemd unit, and container name for easy filtering in Grafana.


Alertmanager

Alertmanager receives alerts from Prometheus and handles deduplication, grouping, and routing. Notifications are sent to Telegram.
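
A minimal alertmanager.yml for this kind of setup could look like the sketch below. It assumes Alertmanager's built-in Telegram receiver (available since version 0.24) rather than a webhook bridge, which this page does not specify. The grouping labels, timings, bot token, and chat ID are all placeholders.

```yaml
route:
  receiver: telegram
  group_by: [alertname, instance]   # alerts sharing these labels are batched together
  group_wait: 30s                   # wait before sending the first notification for a group
  group_interval: 5m                # wait before notifying about new alerts in a group
  repeat_interval: 4h               # re-send while an alert keeps firing
  routes:
    # Critical alerts use the same receiver but repeat more often
    - matchers:
        - severity="critical"
      repeat_interval: 1h

receivers:
  - name: telegram
    telegram_configs:
      - bot_token: "<bot-token>"    # placeholder; keep the real token out of version control
        chat_id: 123456789          # placeholder chat ID
        parse_mode: HTML
        send_resolved: true         # also notify when an alert clears
```

The `severity` labels set in the alert rules are what the child route matches on, so critical and warning alerts can be given different repeat behaviour without defining separate receivers.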