Linux SRE Runbook: Diagnosing and Resolving Out-of-Disk Space on /var/log

DevOps | 6 min read | 2026-06-18

Tags: Linux, SRE, Troubleshooting, Log Management

Scenario: Your monitoring alerts you that disk usage on /var/log has exceeded 90%. Application logs are failing to write, and critical services are starting to crash. This is a common production incident that demands immediate but careful action.

Symptoms:
- df -h shows /var/log at 100% usage.
- Application errors: 'No space left on device' in logs.
- Logrotate fails with disk-full errors.
- Services such as rsyslog or journald stop functioning.

Diagnosis:
1. Check disk usage: df -h /var/log
2. Identify large files: du -sh /var/log/* | sort -rh | head -10
3. Check for deleted but open files: lsof | grep '(deleted)'. These files still consume space until the process holding them closes the file descriptor.
4. Examine logrotate status: cat /var/lib/logrotate/status (the state file path varies by distribution; some use /var/lib/logrotate/logrotate.status)
5. For systemd journals: journalctl --disk-usage
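Where lsof is not installed, or its output is too noisy, the check in step 3 can be approximated by reading /proc directly. The following is a minimal sketch, assuming a Linux /proc filesystem and enough privileges (typically root) to inspect other processes' descriptors. The function name deleted_open_files is illustrative and not part of any standard tool.

```shell
# List file descriptors that still point at deleted files under a directory.
# Assumes Linux /proc; run as root to see descriptors of other users' processes.
deleted_open_files() {
    prefix=${1:-/var/log}
    for fd in /proc/[0-9]*/fd/*; do
        # readlink fails for descriptors we cannot inspect or that have vanished.
        target=$(readlink "$fd" 2>/dev/null) || continue
        case "$target" in
            "$prefix"/*" (deleted)") printf '%s -> %s\n' "$fd" "$target" ;;
        esac
    done
}

# Example: deleted_open_files /var/log
```

Each output line starts with a /proc/<PID>/fd/<N> path, so the owning process and descriptor number can be read straight from it.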

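Commands and Recovery Steps:
1. Safely truncate large log files that are not actively written to. For example: truncate -s 0 /var/log/large-old.log
2. For deleted files that a process still holds open, empty them through the file descriptor: : > /proc/<PID>/fd/<N> (be extremely cautious and confirm that the descriptor points to a log file first, for example with ls -l /proc/<PID>/fd/<N>).
3. Once some space is free, force logrotate: logrotate -f /etc/logrotate.conf
4. Clean old rotated logs: find /var/log -type f -name '*.gz' -mtime +30 -delete
5. If using journald, shrink the journal: journalctl --vacuum-size=500M (a one-time cleanup; set SystemMaxUse= in /etc/systemd/journald.conf for a persistent limit)
6. Restart logging services: systemctl restart rsyslog or systemctl restart systemd-journald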
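The truncation and archive cleanup steps can be wrapped in small guard functions so that a mistyped path does less damage. This is a sketch rather than a definitive tool: truncate_log and purge_old_archives are hypothetical helper names, and the code assumes GNU find.

```shell
# Empty a log file in place; the inode is kept, so writers keep a valid descriptor.
# Refuses symlinks and anything that is not a regular file.
truncate_log() {
    file=$1
    if [ -L "$file" ] || [ ! -f "$file" ]; then
        echo "refusing to truncate: $file is not a regular file" >&2
        return 1
    fi
    : > "$file"
}

# List rotated archives older than DAYS; delete them only when asked explicitly.
purge_old_archives() {
    dir=$1
    days=$2
    mode=${3:-list}
    if [ "$mode" = "delete" ]; then
        find "$dir" -xdev -type f -name '*.gz' -mtime +"$days" -print -delete
    else
        find "$dir" -xdev -type f -name '*.gz' -mtime +"$days" -print
    fi
}

# Example:
#   purge_old_archives /var/log 30          # review the list first
#   purge_old_archives /var/log 30 delete   # then remove the same files
```

Listing by default and deleting only on an explicit third argument keeps the destructive path opt-in, which suits an incident where commands are typed under pressure.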

Risk Controls:
- Never delete a log file that is still open: the space is not released until the process closes it. Truncate the file instead, or use copytruncate in logrotate.
- Before any cleanup, snapshot the filesystem if possible.
- Test logrotate configuration on a non-production system first.
- Monitor disk space after cleanup to ensure it doesn't refill quickly.
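To make the copytruncate recommendation concrete, the sketch below generates a logrotate stanza that can be checked in debug mode before it is installed. The function name write_rotation_policy, the service name myapp, and the retention values (daily, 7 rotations, 100M) are illustrative assumptions, not values taken from this runbook.

```shell
# Write a logrotate stanza that rotates without asking the application to reopen its log.
# The log path and limits are placeholders; adjust them per service.
write_rotation_policy() {
    log_glob=$1
    out_file=$2
    cat > "$out_file" <<EOF
$log_glob {
    daily
    rotate 7
    maxsize 100M
    compress
    missingok
    notifempty
    copytruncate
}
EOF
}

# Example (hypothetical service name "myapp"):
#   write_rotation_policy '/var/log/myapp/*.log' /tmp/myapp.logrotate
#   logrotate -d /tmp/myapp.logrotate    # debug mode: reports what would happen, changes nothing
```

Note that copytruncate copies the file and then truncates the original, so lines written in the short window between the two steps can be lost. For services that can reopen their logs on a signal, a postrotate script is usually the safer choice.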

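Rollback:
- If a service fails after truncation, restart the service so that it reopens its logs.
- If logrotate was misconfigured, restore the original config from backup and rerun logrotate.
- Truncated or deleted log data cannot be restored unless a snapshot or copy was taken beforehand.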
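Restoring the original config is only possible if a copy was taken before the change. A minimal sketch of that habit follows. The helper names backup_config and restore_config are hypothetical, and the backup is written next to the original file, which assumes that filesystem still has free space.

```shell
# Copy a config file next to itself with a timestamp suffix and print the backup path.
backup_config() {
    src=$1
    dest="$src.bak.$(date +%Y%m%d%H%M%S)"
    cp -p "$src" "$dest" && printf '%s\n' "$dest"
}

# Put a previously saved copy back in place.
restore_config() {
    backup=$1
    target=$2
    cp -p "$backup" "$target"
}

# Example:
#   saved=$(backup_config /etc/logrotate.conf)
#   ... edit and test ...
#   restore_config "$saved" /etc/logrotate.conf
```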

Verification:
- Check disk usage: df -h /var/log
- Verify service health: systemctl status <service> for affected services.
- Confirm log writing: tail -f /var/log/syslog (on RHEL-family systems, /var/log/messages)
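The disk usage check can also be expressed as a pass/fail test, which is handy when the verification is repeated over the next hours. This sketch assumes POSIX df and awk. The function names and the 80% threshold in the example are illustrative choices.

```shell
# Print the used percentage (integer, no % sign) of the filesystem holding a path.
usage_pct() {
    # -P forces the POSIX one-line-per-filesystem layout; column 5 is the capacity.
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed only when usage is below the given threshold.
usage_below() {
    path=$1
    limit=$2
    pct=$(usage_pct "$path") || return 1
    [ -n "$pct" ] && [ "$pct" -lt "$limit" ]
}

# Example: usage_below /var/log 80 && echo "disk usage OK"
```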

When to Submit an OpsGlobal Ticket:
- If the root cause is unclear (e.g., logs are filling up faster than expected).
- If disk space is exhausted on a non-log partition.
- If you need assistance tuning logrotate policies or setting up monitoring thresholds.
- If the incident recurs despite cleanup, indicating a systemic issue.
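When logs are filling up faster than expected, it helps to attach evidence of which files are growing right now. The sketch below lists recently modified files by size. It assumes GNU find (for -printf), and recently_written is a hypothetical helper name.

```shell
# List the ten largest files modified in the last N minutes (size in bytes, then path).
# Uses GNU find's -printf; unreadable entries are skipped silently.
recently_written() {
    dir=${1:-/var/log}
    minutes=${2:-10}
    find "$dir" -xdev -type f -mmin -"$minutes" -printf '%s\t%p\n' 2>/dev/null |
        sort -rn | head -n 10
}

# Example: recently_written /var/log 10
```

Running it twice a few minutes apart and comparing the sizes gives a rough growth rate per file to include in the ticket.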

Use cases

Useful for teams that handle DevOps incidents and need a clear troubleshooting and delivery workflow.

Problem background

A step-by-step guide for SREs to troubleshoot disk space exhaustion on /var/log, including safe log cleanup, risk mitigation, and when to escalate.

Troubleshooting steps

Confirm impact and recent changes, collect logs, configuration and metrics, then apply fixes from low to high risk.

Command examples

Replace sample resource names with real values and store passwords, tokens and keys in environment variables.

Risks

Before production changes, confirm backups, access boundaries, change windows and rollback paths.

Rollback plan

Keep original configuration and release versions; roll back config, images or database changes if metrics degrade.

Deliverables

Root-cause notes, key commands, remediation steps, verification results and follow-up recommendations.

Related service CTA

If you are facing a similar out-of-disk-space issue on /var/log, submit a ticket for remote OpsGlobal support.
