Incident Response
Severity Levels
| Level | Description | Example |
|---|---|---|
| P1 — Critical | Total loss of core service or security breach | Host down, VPN broken |
| P2 — High | Degraded service, data risk | Container crash loop, disk near full |
| P3 — Medium | Non-critical service down | Non-essential container down |
| P4 — Low | Minor issue, no user impact | Config drift, outdated image |
Response Procedure
Step 1 — Detect & Assess
docker ps -a # Container states
htop # CPU / RAM
df -h # Disk
docker logs <container> --tail 100
journalctl -xe --since "30 minutes ago"
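The checks above can be captured in one pass so the evidence survives a later restart. This is a minimal Bash sketch, not part of the original runbook: the function name `snapshot` and the output path in the usage comment are hypothetical. `htop` is interactive, so it is left out of the capture.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run each command string and append its output,
# under a header, to the file given as the first argument.
# A command that fails is recorded with its exit code instead of
# aborting the rest of the snapshot.
snapshot() {
  local out="$1"; shift
  local cmd
  for cmd in "$@"; do
    {
      printf '=== %s ===\n' "$cmd"
      bash -c "$cmd" 2>&1 || printf '[exit %d]\n' "$?"
    } >>"$out"
  done
}

# Example usage (the file path is an assumption):
# snapshot /tmp/incident.txt 'docker ps -a' 'df -h' \
#   'journalctl -xe --since "30 minutes ago" --no-pager'
```

Keeping each command as a plain string means the same list can be pasted into the incident notes afterwards.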
Step 2 — Isolate
ping <host-ip> # Is the host reachable?
sudo systemctl status docker
sudo wg show # WireGuard tunnel status
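To narrow down which layer is failing, the isolation checks above can be run in order, each with a one-line verdict. The sketch below assumes Bash. The helper name `run_check`, the `HOST_IP` variable, and the `wg0` interface name are illustrative assumptions, not values from this runbook.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run one check quietly and print OK or FAIL with a label.
# Returns the check's success or failure so callers can stop at the first break.
run_check() {
  local label="$1"; shift
  if "$@" >/dev/null 2>&1; then
    printf 'OK    %s\n' "$label"
  else
    printf 'FAIL  %s\n' "$label"
    return 1
  fi
}

# Example usage, working from the network inward.
# HOST_IP and wg0 are placeholders; substitute your own values.
# run_check "host reachable"   ping -c 1 -W 2 "$HOST_IP"
# run_check "docker daemon"    systemctl is-active --quiet docker
# run_check "wireguard tunnel" sudo wg show wg0
```

The first FAIL line usually marks the layer to investigate; everything above it in the list can be treated as working.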
Step 3 — Resolve
- Restart the container → docker restart <container>
- Fix config → docker compose up -d --force-recreate
- Pull fresh image → docker compose pull && docker compose up -d
- Restore from backup → see Backup & Restore
- Rebuild from scratch using runbooks
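After any of the recovery options above, it helps to confirm the container actually stays up before moving to the next option. The following is a sketch under assumptions: `wait_until` and `is_running` are hypothetical helper names, and the attempt count and delay are arbitrary. A container in a crash loop can briefly report as running, so treat a pass as provisional and recheck the logs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: wait_until <attempts> <delay-seconds> <command...>
wait_until() {
  local attempts="$1" delay="$2"; shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if ((i < attempts)); then sleep "$delay"; fi
  done
  return 1
}

# Succeeds only if Docker reports State.Running as true for the container.
is_running() {
  [ "$(docker inspect -f '{{.State.Running}}' "$1" 2>/dev/null)" = "true" ]
}

# Example usage: restart, then allow up to 10 checks, 3 seconds apart.
# docker restart <container> && wait_until 10 3 is_running <container>
```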
Step 4 — Verify & Log
Incident Log
| Date | Severity | Service | Root Cause | Resolution | Duration |
|---|---|---|---|---|---|
| YYYY-MM-DD | P2 | Grafana | Disk 100% full | Cleared old Prometheus data | 45 min |
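Log entries are easier to keep consistent when they are generated rather than typed. The sketch below formats one incident as a row with the same six columns as the table above. The function name `incident_row` and the `incidents.md` file in the usage comment are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one incident as a pipe-table row.
# Arguments: severity service root-cause resolution duration [date]
incident_row() {
  local severity="$1" service="$2" cause="$3" resolution="$4" duration="$5"
  local day="${6:-$(date +%F)}"   # defaults to today in YYYY-MM-DD form
  printf '| %s | %s | %s | %s | %s | %s |\n' \
    "$day" "$severity" "$service" "$cause" "$resolution" "$duration"
}

# Example usage (the log file name is an assumption):
# incident_row P2 Grafana "Disk 100% full" "Cleared old Prometheus data" "45 min" >> incidents.md
```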
Common Issues & Fixes
| Symptom | Likely Cause | Fix |
|---|---|---|
| Container restart loop | Bad env var or config | Check logs, fix .env |
| Port already in use | Conflicting service | ss -tulnp \| grep <PORT> |
| DNS not resolving | Pi-hole down | Restart pihole container |
| WireGuard not connecting | Key mismatch or firewall | sudo wg show, check NSG/UFW |
| Disk full | Log or volume growth | docker system prune |
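For the restart-loop symptom, a quick way to find the affected containers is to filter `docker ps` output by status. This sketch assumes tab-separated name and status fields on standard input. The function name `restarting_containers` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "name<TAB>status" lines on stdin and print the
# names of containers whose status begins with "Restarting".
restarting_containers() {
  awk -F '\t' '$2 ~ /^Restarting/ { print $1 }'
}

# Example usage with live data (requires Docker):
# docker ps -a --format '{{.Names}}\t{{.Status}}' | restarting_containers
```

Each name it prints is a candidate for the log check described in the Fix column.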