Scenario
Production databases often face performance degradation during backup and recovery operations, especially when running on Kubernetes where storage, network, and resource limits add complexity. Backup windows may stretch, and restore times can exceed SLAs.
Symptoms
- Backup duration consistently exceeds the allowed window (e.g., >4 hours for 500GB).
- Restore operations are extremely slow; a 500GB backup may take over 6 hours.
- High I/O wait, low CPU utilization, and high disk latency observed in monitoring tools.
- Backup tools (pg_dump, XtraBackup) time out or fail with cryptic errors.
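To put the thresholds above in perspective, the sustained throughput implied by a backup or restore window can be worked out directly. The helper below is a minimal sketch: the function name required_mbps is hypothetical, and it assumes 1 GB = 1024 MB and a constant transfer rate.

```shell
# required_mbps SIZE_GB HOURS
# Print the average throughput (MB/s) needed to move SIZE_GB within HOURS.
# Assumption: 1 GB = 1024 MB and the transfer rate is constant.
required_mbps() {
  awk -v gb="$1" -v h="$2" 'BEGIN { printf "%.1f\n", (gb * 1024) / (h * 3600) }'
}

required_mbps 500 4   # 500 GB inside a 4-hour backup window
required_mbps 500 6   # 500 GB restored in 6 hours
```

For the figures in the symptom list this prints 35.6 and 23.7 MB/s. Rates this modest typically suggest that a bottleneck (I/O limits, throttling, single-threaded tools) matters more than raw data volume.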
Diagnosis
- Check resource constraints:
    kubectl top pods -n database
  Verify that CPU and memory limits are adequate.
- Analyze I/O performance:
    iostat -x 1
  Look for high %util and await values.
- Database internals:
  - MySQL:
      SHOW ENGINE INNODB STATUS\G
    Check for adaptive hash index contention and log waits.
  - PostgreSQL:
      SELECT * FROM pg_stat_activity WHERE state = 'active';
    Identify long-running queries that may be holding locks.
- Review backup tool logs:
  - XtraBackup: Check xtrabackup_log for errors.
  - pg_dump: Use --verbose to see progress.
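The iostat check above can be partly automated. The following is a rough sketch, not a definitive tool: flag_busy_devices is a hypothetical helper name, the 80% default threshold is an assumption, and the column position of %util is looked up from the header because it differs between sysstat versions.

```shell
# flag_busy_devices [THRESHOLD]
# Read `iostat -x` output on stdin and print "device %util" for every device
# whose %util is at or above THRESHOLD (assumed default: 80).
flag_busy_devices() {
  awk -v limit="${1:-80}" '
    # Header row: remember which column holds %util.
    /^Device/ { for (i = 1; i <= NF; i++) if ($i == "%util") col = i; next }
    # Data rows: only rows long enough to contain the %util column.
    col && NF >= col && $col + 0 >= limit { print $1, $col }
  '
}

# Example usage (requires the sysstat package on the node or in the pod):
#   iostat -x 1 3 | flag_busy_devices 80
```

Devices that show up repeatedly across samples are the first candidates when looking for the volume that limits backup speed.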
Commands
MySQL Backup Optimization
- Parallel backup:
    xtrabackup --backup --parallel=4 --target-dir=/backup
- Compressed backup:
    xtrabackup --backup --compress --compress-threads=4 --target-dir=/backup
- Throttling to limit production impact:
    xtrabackup --backup --throttle=100 --target-dir=/backup
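The three options above can be combined in a single invocation. The sketch below only prints the resulting command so it can be reviewed before it is run; build_xtrabackup_cmd is a hypothetical helper, and using the same thread count for --parallel and --compress-threads is an assumption, not a recommendation from Percona.

```shell
# build_xtrabackup_cmd TARGET_DIR [THREADS] [THROTTLE]
# Print an xtrabackup command line that combines parallel copy, compression
# and (optionally) throttling. Nothing is executed.
build_xtrabackup_cmd() (
  target_dir=$1
  threads=${2:-4}      # assumed default, matches the examples above
  throttle=${3:-}      # empty means: do not throttle
  set -- xtrabackup --backup "--parallel=$threads" \
         --compress "--compress-threads=$threads" "--target-dir=$target_dir"
  if [ -n "$throttle" ]; then
    set -- "$@" "--throttle=$throttle"
  fi
  echo "$*"
)

build_xtrabackup_cmd /backup 4 100
```

Printing first keeps the change reviewable; start with a low thread count and raise it only while production latency stays acceptable.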
PostgreSQL Backup Optimization
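- Directory format for parallelism:
    pg_dump -Fd -j 4 -f /backup mydb
- Compressed custom format (level 9 gives the smallest file but costs the most CPU time; the output is a pg_dump archive, not a gzip file):
    pg_dump -Fc -Z 9 -f /backup/mydb.dump mydb
- Selective backup: Exclude large or non-essential tables with --exclude-table or --exclude-table-data. Unnecessary indexes can be skipped at restore time with a pg_restore list file (-l / -L).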
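As an illustration of a selective, parallel dump, the sketch below assembles a directory-format pg_dump command that keeps table definitions but skips the row data of chosen tables via --exclude-table-data. The helper name build_pg_dump_cmd and the table name audit_log are hypothetical, and names containing spaces are not handled.

```shell
# build_pg_dump_cmd DB DIR JOBS [TABLE...]
# Print a directory-format, parallel pg_dump command that skips the data of
# the listed tables (their definitions are still dumped). Nothing is executed.
build_pg_dump_cmd() (
  if [ "$#" -lt 3 ]; then
    echo "usage: build_pg_dump_cmd DB DIR JOBS [TABLE...]" >&2
    exit 2
  fi
  db=$1; dir=$2; jobs=$3
  shift 3
  cmd="pg_dump -Fd -j $jobs -f $dir"
  for table in "$@"; do
    cmd="$cmd --exclude-table-data=$table"
  done
  echo "$cmd $db"
)

build_pg_dump_cmd mydb /backup 4 audit_log
```

Excluded data has to be backed up or regenerated some other way, so this only suits tables whose contents are genuinely disposable.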
Restore Optimization
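- MySQL: Use --prepare (the xtrabackup equivalent of innobackupex --apply-log) with increased memory:
    xtrabackup --prepare --use-memory=4G --target-dir=/backup
- PostgreSQL: Increase maintenance_work_mem before the restore:
    SET maintenance_work_mem = '1GB';
  SET affects only the current session, so the value must be set in the sessions that actually perform the restore. Use parallel restore with pg_restore -j (custom or directory format archives only).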
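The restore steps can be laid out in order as follows. This is a sketch under stated assumptions: restore_plan is a hypothetical helper that only prints commands, the --decompress step applies only if the backup was taken with --compress, the database name mydb is the sample name used earlier, and PGOPTIONS is used so that every pg_restore worker session starts with the larger maintenance_work_mem.

```shell
# restore_plan ENGINE BACKUP_DIR [JOBS]
# Print the restore commands for ENGINE (mysql or postgres) in the order
# they would be run. Nothing is executed.
restore_plan() (
  engine=$1; dir=$2; jobs=${3:-4}
  case "$engine" in
    mysql)
      # Only needed when the backup was created with --compress.
      echo "xtrabackup --decompress --target-dir=$dir"
      # Apply the redo log so the data files are consistent.
      echo "xtrabackup --prepare --use-memory=4G --target-dir=$dir"
      # Requires a stopped MySQL server and an empty data directory.
      echo "xtrabackup --copy-back --target-dir=$dir"
      ;;
    postgres)
      # PGOPTIONS is read by libpq, so each parallel worker gets the setting.
      echo "PGOPTIONS='-c maintenance_work_mem=1GB' pg_restore -j $jobs -d mydb $dir"
      ;;
    *)
      echo "unknown engine: $engine" >&2
      exit 1
      ;;
  esac
)

restore_plan mysql /backup
restore_plan postgres /backup 4
```

The memory values are the ones used in this section; size them against the pod's memory limit, since each parallel pg_restore job can use up to maintenance_work_mem on its own.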
Risk Controls
- Schedule backups during low traffic.
- Use I/O throttling to prevent impacting online services.
- Store backups on dedicated persistent volumes to avoid contention.
- Always test restore performance in a staging environment first.
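When testing restore performance in staging, it helps to record elapsed time and throughput for every run so that tuning changes can be compared. The wrapper below is a minimal sketch; measure is a hypothetical helper, the data size is supplied by the caller, and timing has one-second resolution.

```shell
# measure SIZE_MB COMMAND [ARGS...]
# Run COMMAND, then print elapsed seconds, average MB/s and the exit status.
# The command's own stdout is redirected to stderr so the summary line is
# the only thing written to stdout.
measure() (
  size_mb=$1
  shift
  start=$(date +%s)
  status=0
  "$@" >&2 || status=$?
  elapsed=$(( $(date +%s) - start ))
  if [ "$elapsed" -gt 0 ]; then
    rate=$(awk -v mb="$size_mb" -v s="$elapsed" 'BEGIN { printf "%.1f", mb / s }')
  else
    rate="n/a"   # finished in under one second
  fi
  echo "elapsed_s=$elapsed throughput_mb_s=$rate exit=$status"
  exit "$status"
)

# Example usage in staging (512000 MB = 500 GB):
#   measure 512000 pg_restore -j 4 -d mydb /backup
```

Keeping the summary lines from successive runs gives a simple baseline against which to judge whether a change to parallelism or memory settings actually helped.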
Rollback
- If backup fails due to parameter changes, revert to defaults (e.g., disable parallelism).
- If a restore yields corrupt data, immediately stop and restore from a known-good snapshot or full backup.
Verification
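- After restore, run consistency checks:
  - MySQL: CHECKSUM TABLE or pt-table-checksum.
  - PostgreSQL: pg_checksums (requires data checksums to be enabled and the server to be shut down cleanly). Run ANALYZE afterwards to refresh planner statistics; it does not verify data integrity by itself.
- Validate that business queries return correct results.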
When to Submit an OpsGlobal Ticket
- Backup/restore times persistently exceed SLA (e.g., >4 hours).
- Data corruption or failed restoration occurs.
- You need architectural recommendations (e.g., storage class, database configuration tuning).
- You are uncertain about risk control configurations.
These steps resolve most backup and restore performance issues. If problems persist, OpsGlobal's SRE team can provide deeper analysis and optimization.
Use cases
Useful for teams that handle database issues and need a clear troubleshooting and delivery workflow.
Problem background
A deep dive into diagnosing and improving backup/restore speed for MySQL and PostgreSQL databases running on Kubernetes, with actionable commands and risk controls.
Troubleshooting steps
Confirm impact and recent changes, collect logs, configuration and metrics, then apply fixes from low to high risk.
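Collecting the same evidence every time makes incidents easier to compare. The following is a sketch of a collection step, assuming kubectl access to the database namespace and a metrics server for kubectl top; collect_diag and the output file names are hypothetical.

```shell
# collect_diag NAMESPACE OUTDIR
# Save a snapshot of pod resource usage, pod placement, recent events and
# PVC details for NAMESPACE into OUTDIR.
collect_diag() (
  ns=$1; out=$2
  if ! command -v kubectl >/dev/null 2>&1; then
    echo "kubectl not found" >&2
    exit 127
  fi
  mkdir -p "$out" || exit 1
  # Errors are captured in the files too, so a failed call is still visible.
  kubectl top pods -n "$ns"                            > "$out/top-pods.txt" 2>&1
  kubectl get pods -n "$ns" -o wide                    > "$out/pods.txt"     2>&1
  kubectl get events -n "$ns" --sort-by=.lastTimestamp > "$out/events.txt"   2>&1
  kubectl describe pvc -n "$ns"                        > "$out/pvc.txt"      2>&1
  echo "diagnostics written to $out"
)

# Example usage:
#   collect_diag database ./diag-$(date +%Y%m%d-%H%M%S)
```

All four calls are read-only, which fits the low-to-high risk ordering: gather first, change later.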
Command examples
Replace sample resource names with real values and store passwords, tokens and keys in environment variables.
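One way to follow this advice is to fail early when a required variable is missing instead of embedding secrets in commands. The helper below is a minimal sketch; require_env is a hypothetical name, and PGPASSWORD is shown because PostgreSQL client tools read it from the environment.

```shell
# require_env NAME [NAME...]
# Return non-zero and name the first variable that is unset or empty.
# Variable names are expected to be literal identifiers, not user input.
require_env() (
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required environment variable: $name" >&2
      exit 1
    fi
  done
)

# Example usage before a backup run:
#   require_env PGPASSWORD || exit 1
#   pg_dump -Fd -j 4 -f /backup mydb
```

On Kubernetes such variables are typically injected from a Secret rather than typed into a shell, which also keeps them out of shell history.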
Risks
Before production changes, confirm backups, access boundaries, change windows and rollback paths.
Rollback plan
Keep original configuration and release versions; roll back config, images or database changes if metrics degrade.
Deliverables
Root-cause notes, key commands, remediation steps, verification results and follow-up recommendations.