Tech Guide
Monitoring Stack: Grafana + Prometheus for Small Teams
A monitoring stack is essential for understanding system behavior. This guide covers a minimal, production-ready setup with Prometheus and Grafana.
Architecture Overview
- Prometheus: Scrapes metrics from endpoints, stores time-series data
- Grafana: Visualizes Prometheus data, manages alerting rules
- Node Exporter: Collects host-level metrics (CPU, disk, memory)
- Alertmanager: Handles alert routing and deduplication
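To make the scrape model above concrete, here is a minimal sketch of what a metrics endpoint hands back to Prometheus: plain text, one sample per line, in the text exposition format. The function name `render_metrics` and the sample metric names are hypothetical, chosen only for illustration; a real exporter would normally rely on an official client library instead.

```python
def _escape_label_value(value):
    """Escape backslashes, double quotes, and newlines in a label value."""
    return (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def render_metrics(samples):
    """Render samples as Prometheus text exposition lines.

    samples: iterable of (metric_name, labels_dict, value) tuples.
    """
    lines = []
    for name, labels, value in samples:
        if labels:
            # Sort labels so the output is deterministic.
            pairs = [
                '{}="{}"'.format(key, _escape_label_value(val))
                for key, val in sorted(labels.items())
            ]
            lines.append("{}{{{}}} {}".format(name, ",".join(pairs), value))
        else:
            lines.append("{} {}".format(name, value))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical application metrics, for illustration only.
    print(render_metrics([
        ("app_requests_total", {"method": "GET"}, 42),
        ("app_up", {}, 1),
    ]), end="")
```

Prometheus requests this text from each target on every scrape and stores each line as one sample in a time series identified by the metric name and its labels.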
Prometheus Configuration
Set up a prometheus.yml scrape config:
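scrape_configs:
  # Node Exporter: host-level metrics (default port 9100)
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
  # Docker daemon metrics (requires metrics-addr to be enabled in the daemon config)
  - job_name: 'docker'
    static_configs:
      - targets: ['localhost:9323']
  # Port 8006 serves the Proxmox web UI and API, not Prometheus metrics.
  # Replace this target with the address of a Proxmox metrics exporter.
  - job_name: 'proxmox'
    static_configs:
      - targets: ['proxmox-host:8006']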
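Store metrics on a dedicated, high-IOPS volume. Retention policy: 30 days for raw data (set with --storage.tsdb.retention.time=30d), 1 year for aggregates. Prometheus applies a single retention period to everything it stores and does not downsample, so the 1-year aggregates need recording rules whose output goes to a separate long-term store, such as a remote-write backend or a second Prometheus instance with longer retention.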
Grafana Dashboards
Import pre-built dashboards from the community or build custom ones:
- System health (CPU, memory, disk usage)
- Network throughput and error rates
- Application response times
- Container and VM resource usage
Configure notification channels:
- Slack for critical alerts
- Email for weekly summaries
- PagerDuty for on-call escalation
Alerting Rules
Define alert thresholds based on your SLA:
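groups:
  - name: infrastructure
    rules:
      - alert: HighCPU
        # node_cpu_seconds_total is a cumulative counter, so compare the
        # non-idle fraction over the last 5 minutes against the threshold.
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.8
        for: 5m
        annotations:
          summary: "High CPU on {{ $labels.instance }}"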
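A threshold of 0.8 only makes sense as a fraction of CPU time, while node_cpu_seconds_total is a counter that grows for as long as the host is up. The sketch below shows the arithmetic a rate-based expression performs, assuming two readings of the idle-mode counter taken a known number of seconds apart. The function name and the sample values are hypothetical.

```python
def cpu_utilization(idle_start, idle_end, interval_seconds, cores=1):
    """Fraction of CPU time spent non-idle between two counter readings.

    idle_start, idle_end: idle-mode CPU seconds summed over all cores,
    read at the start and end of the interval.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    # Idle seconds accumulated per wall-clock second, summed over cores.
    idle_rate = (idle_end - idle_start) / interval_seconds
    return 1.0 - idle_rate / cores


if __name__ == "__main__":
    # Hypothetical readings: 60 idle seconds accumulated over 300 s on one core.
    usage = cpu_utilization(10000.0, 10060.0, 300, cores=1)
    print(f"{usage:.2f}")  # 0.80
    # The raw counter (10060.0) exceeds 0.8 regardless of load, which is why
    # it cannot be compared against the threshold directly.
```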
Test alerts regularly to ensure they reach the right teams.
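One way to exercise the routing path end to end is to post a synthetic alert straight to Alertmanager and confirm it arrives in the expected channel. The sketch below assumes Alertmanager is reachable at http://localhost:9093 and uses its v2 API endpoint /api/v2/alerts. The alert name, the labels, and the function names are placeholders, not part of the setup described here.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_test_alert(instance, now=None, duration_minutes=5):
    """Build a one-element alert list in the shape the v2 API accepts."""
    now = now or datetime.now(timezone.utc)
    return [{
        "labels": {
            "alertname": "TestAlert",  # placeholder name
            "instance": instance,
            "severity": "test",        # placeholder label for routing
        },
        "annotations": {
            "summary": f"Test alert for {instance}, safe to ignore",
        },
        "startsAt": now.isoformat(),
        # The alert resolves on its own after endsAt.
        "endsAt": (now + timedelta(minutes=duration_minutes)).isoformat(),
    }]


def send_alert(payload, base_url="http://localhost:9093"):
    """POST the payload to Alertmanager and return the HTTP status code."""
    request = urllib.request.Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


if __name__ == "__main__":
    status = send_alert(build_test_alert("localhost:9100"))
    print("Alertmanager responded with HTTP", status)
```

Routing the placeholder severity label to a dedicated test receiver keeps these drills from paging the on-call rotation.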
Capacity Planning
Review Prometheus storage needs monthly. For a typical small team setup:
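- 20 targets × 1000 metrics each × 30-day retention = ~7 GB of storage, assuming a 15-second scrape interval and about 2 bytes per sample; provision several times that to leave headroom for the write-ahead log, series churn, and growth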
- CPU: modest (2 cores sufficient)
- Memory: 4 GB minimum, 8 GB recommended
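The storage figure can be sanity-checked with the rule of thumb from the Prometheus storage documentation: disk space ≈ retention time in seconds × ingested samples per second × bytes per sample. The sketch below assumes a 15-second scrape interval and 2 bytes per sample (the documentation cites roughly 1 to 2). Neither value comes from this guide, so substitute your own.

```python
def estimate_storage_bytes(
    targets,
    series_per_target,
    retention_days,
    scrape_interval_seconds=15,  # assumption: not specified in this guide
    bytes_per_sample=2.0,        # assumption: upper end of the 1-2 byte range
):
    """Estimate Prometheus TSDB disk usage in bytes."""
    samples_per_second = targets * series_per_target / scrape_interval_seconds
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    estimate = estimate_storage_bytes(
        targets=20, series_per_target=1000, retention_days=30
    )
    print(f"{estimate / 1e9:.1f} GB")  # 6.9 GB
```

Under these assumptions the raw estimate is close to 7 GB. A much larger provisioned volume implies a shorter scrape interval, more series per target, or generous headroom for growth.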