Scenario: Your monitoring alerts you that disk usage on /var/log has exceeded 90%. Application logs are failing to write, and critical services start crashing. This is a common production incident that demands immediate but careful action.
Symptoms:
- df -h shows /var/log at 100% usage.
- Application errors: 'No space left on device' in logs.
- Logrotate fails with disk full errors.
- Services like rsyslog or journald stop functioning.
Diagnosis:
1. Check disk usage: df -h /var/log
2. Identify large files: du -sh /var/log/* | sort -rh | head -10
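3. Check for deleted but open files: lsof | grep '(deleted)' (or lsof +L1, which lists open files with a link count of zero). These files still consume space until the process holding them closes the file descriptor, and they do not show up in du output.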
4. Examine logrotate status: cat /var/lib/logrotate/status
5. For systemd journals: journalctl --disk-usage
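The size check in step 2 can be wrapped in a small helper. This is a minimal sketch, not a finished tool: the function name largest_entries is hypothetical, and it uses du -k instead of -h so that a plain numeric sort works, plus -x so the scan stays on one filesystem.

```shell
# Hypothetical helper: print the N largest entries directly under a directory.
# Sizes are reported in KiB so that sort -rn orders them correctly.
largest_entries() {
    le_dir="${1:-/var/log}"
    le_count="${2:-10}"
    # -x: do not cross filesystem boundaries; -s: one total per entry.
    # Errors from unreadable paths are discarded.
    du -xsk "$le_dir"/* 2>/dev/null | sort -rn | head -n "$le_count"
}

# Example (assumed path): largest_entries /var/log 10
```

If the total reported here is far smaller than what df shows for the same filesystem, deleted-but-open files (step 3) are a likely explanation.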
Commands and Recovery Steps:
1. Safely truncate large log files that are not actively written to. For example: truncate -s 0 /var/log/large-old.log
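2. For deleted files that a process still holds open, truncate through the file descriptor: : > /proc/<PID>/fd/<N> (be extremely cautious; first confirm with ls -l /proc/<PID>/fd/<N> that the descriptor points to a log file).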
3. Force logrotate: logrotate -f /etc/logrotate.conf
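4. Clean old rotated logs: find /var/log -name '*.gz' -mtime +30 -delete (run it once with -print in place of -delete to review what will be removed).
5. If using journald, shrink the archived journals: journalctl --vacuum-size=500M (this is a one-time cleanup; for a persistent cap, set SystemMaxUse= in /etc/systemd/journald.conf).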
6. Restart logging services: systemctl restart rsyslog or systemctl restart systemd-journald
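The cleanup of old rotated logs in step 4 is the easiest step to get wrong, so a dry-run-first wrapper may help. The sketch below is an assumption about how you might structure it; clean_old_archives and its --delete switch are hypothetical names, while the find options are the ones used above plus -type f and -xdev.

```shell
# Hypothetical helper: list compressed rotated logs older than DAYS under DIR.
# Files are deleted only when the third argument is exactly "--delete".
clean_old_archives() {
    coa_dir="$1"
    coa_days="$2"
    coa_mode="${3:-}"
    if [ "$coa_mode" = "--delete" ]; then
        # -print before -delete records every file that is removed.
        find "$coa_dir" -xdev -type f -name '*.gz' -mtime +"$coa_days" -print -delete
    else
        # Dry run: show the candidates and change nothing.
        find "$coa_dir" -xdev -type f -name '*.gz' -mtime +"$coa_days" -print
    fi
}

# Example (assumed values):
#   clean_old_archives /var/log 30            # review the list
#   clean_old_archives /var/log 30 --delete   # then remove
```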
Risk Controls:
- Never delete a log file that is still open; always truncate or use copytruncate in logrotate.
- Before any cleanup, snapshot the filesystem if possible.
- Test logrotate configuration on a non-production system first.
- Monitor disk space after cleanup to ensure it doesn't refill quickly.
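To illustrate why truncating is preferred over deleting: truncation empties the file but keeps the same inode, so a process that has it open keeps writing to a file that is still visible on disk. The helper below is a minimal sketch under that assumption; truncate_log is a hypothetical name.

```shell
# Hypothetical helper: empty a log file in place without replacing it.
truncate_log() {
    tl_target="$1"
    # Refuse anything that is not a regular file (or a symlink to one),
    # such as a directory, socket or device node.
    if [ ! -f "$tl_target" ]; then
        echo "not a regular file: $tl_target" >&2
        return 1
    fi
    # ':' produces no output; the redirection truncates the file to zero
    # bytes while the inode, owner and permissions stay the same.
    : > "$tl_target"
}

# Example (assumed path): truncate_log /var/log/large-old.log
```

This works cleanly when the writer opened the file in append mode, which is typical for loggers. A writer that is not in append mode continues at its old offset, which leaves a sparse file whose apparent size looks unchanged.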
Rollback:
- If a service fails after truncation, restart the service to reopen the logs.
- If logrotate was misconfigured, restore the original config from backup and rerun logrotate.
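Restoring a config from backup presumes a backup was taken before the change. One way to do that is sketched below; backup_file and restore_file are hypothetical names, and the backup directory is an assumption. Keep it outside /etc/logrotate.d, since logrotate may read extra files placed in an included directory.

```shell
# Hypothetical helper: copy a file into a backup directory with a timestamp
# suffix and print the path of the copy.
backup_file() {
    bf_src="$1"
    bf_dir="$2"
    bf_dest="$bf_dir/$(basename "$bf_src").$(date +%Y%m%d%H%M%S)"
    # -p preserves mode and timestamps (and ownership where permitted).
    cp -p "$bf_src" "$bf_dest" && printf '%s\n' "$bf_dest"
}

# Hypothetical helper: put a saved copy back in place.
restore_file() {
    cp -p "$1" "$2"
}

# Example (assumed paths):
#   saved=$(backup_file /etc/logrotate.conf /root/config-backups)
#   restore_file "$saved" /etc/logrotate.conf
```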
Verification:
- Check disk usage: df -h /var/log
- Verify service health: systemctl status <service> for affected services.
- Confirm log writing: tail -f /var/log/syslog (on RHEL-family systems the equivalent file is /var/log/messages)
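The disk usage check can also be scripted so that it returns a pass or fail result. The sketch below assumes the POSIX output format of df -P, where the fifth column is the capacity percentage; usage_pct, below_threshold and the threshold of 80 in the example are hypothetical.

```shell
# Hypothetical helper: print the used-capacity percentage for the filesystem
# that holds the given path, without the trailing % sign.
usage_pct() {
    # df -P prints one header line and one data line per filesystem.
    df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical helper: succeed if usage is strictly below the threshold.
# Returns 2 if the percentage could not be determined.
below_threshold() {
    bt_pct=$(usage_pct "$1")
    case "$bt_pct" in
        ''|*[!0-9]*) return 2 ;;
    esac
    [ "$bt_pct" -lt "$2" ]
}

# Example (assumed threshold):
#   below_threshold /var/log 80 && echo "disk usage OK"
```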
When to Submit an OpsGlobal Ticket:
- If the root cause is unclear (e.g., logs are filling up faster than expected).
- If disk space is exhausted on a non-log partition.
- If you need assistance tuning logrotate policies or setting up monitoring thresholds.
- If the incident recurs despite cleanup, indicating a systemic issue.
Use cases
Useful for teams handling DevOps issues and needing a clear troubleshooting and delivery workflow.
Problem background
A step-by-step guide for SREs to troubleshoot disk space exhaustion on /var/log, including safe log cleanup, risk mitigation, and when to escalate.
Troubleshooting steps
Confirm impact and recent changes, collect logs, configuration and metrics, then apply fixes from low to high risk.
Command examples
Replace sample resource names with real values and store passwords, tokens and keys in environment variables.
Risks
Before production changes, confirm backups, access boundaries, change windows and rollback paths.
Rollback plan
Keep original configuration and release versions; roll back config, images or database changes if metrics degrade.
Deliverables
Root-cause notes, key commands, remediation steps, verification results and follow-up recommendations.