Scenario

Production databases often face performance degradation during backup and recovery operations, especially when running on Kubernetes where storage, network, and resource limits add complexity. Backup windows may stretch, and restore times can exceed SLAs.

Symptoms

Backup duration consistently exceeds the allowed window (e.g., >4 hours for 500GB).
Restore operations are extremely slow; a 500GB backup may take over 6 hours.
High I/O wait, low CPU utilization, and high disk latency observed in monitoring tools.
Backup tools (pg_dump, XtraBackup) time out or fail with cryptic errors.
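
To put the figures above in perspective, it helps to convert a backup window into the sustained throughput it implies. The following is a minimal sketch; the function name required_mbps is made up for illustration, and it treats 1 GB as 1024 MiB.

```shell
#!/bin/sh
# required_mbps SIZE_GB WINDOW_HOURS
# Prints the sustained throughput (MiB/s) needed to move SIZE_GB within the window.
# Hypothetical helper: the name and the GB -> MiB conversion are assumptions.
required_mbps() {
  awk -v gb="$1" -v h="$2" 'BEGIN { printf "%.1f\n", (gb * 1024) / (h * 3600) }'
}

# Figures taken from the symptoms listed above.
echo "500GB within a 4h backup window: $(required_mbps 500 4) MiB/s"
echo "500GB restored in 6h:            $(required_mbps 500 6) MiB/s"
```

If the volume's measured throughput is well above these numbers, the bottleneck is more likely in the tool settings, CPU limits, or the network path than in raw disk bandwidth.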

Diagnosis

Check resource constraints:
    kubectl top pods -n database
Verify that CPU and memory usage stay comfortably below the pods' limits.

Analyze I/O performance:
    iostat -x 1
Look for high %util and await values (reported as r_await and w_await in recent sysstat versions).

Database internals:
MySQL:
    SHOW ENGINE INNODB STATUS\G
Check adaptive hash index contention and log waits.
PostgreSQL:
    SELECT * FROM pg_stat_activity WHERE state = 'active';
Identify long-running queries that may be holding locks or saturating I/O.

Review backup tool logs:
XtraBackup: It logs to stderr, so redirect the output to a file and check it for errors. (The xtrabackup_logfile in the backup directory is the copied redo log, not an error log.)
pg_dump: Use --verbose to see progress.
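
The resource check above can be partly automated. The sketch below filters `kubectl top pods --no-headers` output for pods at or above a CPU threshold. The function name flag_busy_pods and the 900m threshold are assumptions; pick a threshold relative to the limits actually configured on your pods. It also assumes CPU is reported in millicores (values ending in "m").

```shell
#!/bin/sh
# flag_busy_pods THRESHOLD_MILLICORES
# Reads lines shaped like `kubectl top pods --no-headers` output on stdin:
#   NAME  CPU(cores)  MEMORY(bytes)     e.g.  mysql-0  950m  3800Mi
# Prints the pods whose CPU usage is at or above the threshold.
# Hypothetical helper, not part of kubectl.
flag_busy_pods() {
  awk -v limit="$1" '
    $2 ~ /m$/ {
      cpu = $2
      sub(/m$/, "", cpu)                  # strip the millicore suffix
      if (cpu + 0 >= limit) print $1, $2, $3
    }'
}

# Usage against a live cluster (namespace taken from the text above):
#   kubectl top pods -n database --no-headers | flag_busy_pods 900
```

A pod that sits near its CPU limit during a backup is a candidate for throttling, which can look like slow I/O from the database's point of view.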

Commands

MySQL Backup Optimization

Parallel backup:
    xtrabackup --backup --parallel=4 --target-dir=/backup
Compressed backup:
    xtrabackup --backup --compress --compress-threads=4 --target-dir=/backup
Throttling to limit production impact:
    xtrabackup --backup --throttle=100 --target-dir=/backup
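
The three options are not mutually exclusive. As a rough sketch, the helper below prints (without running) a single command that combines them, so the values can be reviewed before use. The function name build_xtrabackup_cmd and its argument order are assumptions; the flags are the ones already shown above.

```shell
#!/bin/sh
# build_xtrabackup_cmd TARGET_DIR THREADS THROTTLE
# Prints an xtrabackup command that combines parallel copy, compression and throttling.
# Hypothetical helper: it only builds the command line, it does not execute it.
build_xtrabackup_cmd() {
  printf 'xtrabackup --backup --parallel=%s --compress --compress-threads=%s --throttle=%s --target-dir=%s\n' \
    "$2" "$2" "$3" "$1"
}

build_xtrabackup_cmd /backup 4 100
```

Note that a compressed backup has to be decompressed with xtrabackup --decompress before it can be prepared, which adds to the restore time.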

PostgreSQL Backup Optimization

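Directory format for parallelism (the target directory must not already contain files):
    pg_dump -Fd -j 4 -f /backup mydb
Compressed custom format (the output is a pg_dump archive for pg_restore, not a gzip file):
    pg_dump -Fc -Z 9 -f /backup/mydb.dump mydb
Selective backup: Skip large tables that can be rebuilt or archived separately, using --exclude-table or --exclude-table-data.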
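
A selective, parallel dump might be assembled as in the sketch below. The helper name build_pg_dump_cmd, the output path and the audit_log* table pattern are hypothetical; --exclude-table-data keeps the table definition but skips its rows.

```shell
#!/bin/sh
# build_pg_dump_cmd DATABASE OUTPUT_DIR JOBS TABLE_PATTERN
# Prints a directory-format pg_dump command that skips the rows of matching tables.
# Hypothetical helper: it only builds the command line, it does not execute it.
build_pg_dump_cmd() {
  printf "pg_dump -Fd -j %s --exclude-table-data='%s' -f %s %s\n" "$3" "$4" "$2" "$1"
}

# 'audit_log*' is an example pattern, not a table from the original text.
build_pg_dump_cmd mydb /backup/mydb.dir 4 'audit_log*'
```

Excluded data is not recoverable from this dump, so record which tables were skipped and how they are protected otherwise.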

Restore Optimization

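MySQL: Run the prepare step with more memory (--prepare replaces the older innobackupex --apply-log option):
    xtrabackup --prepare --use-memory=4G --target-dir=/backup
PostgreSQL: Raise maintenance_work_mem for the restore so index builds run faster. A SET issued in an interactive session only affects that session, so pass the setting to the connections that pg_restore opens, for example through PGOPTIONS:
    PGOPTIONS='-c maintenance_work_mem=1GB' pg_restore -j 4 -d mydb /backup
Parallel restore with pg_restore -j requires a custom-format or directory-format dump.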
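
For PostgreSQL, a session-level SET does not reach the separate connections that pg_restore opens, so one option is to pass the setting through the PGOPTIONS environment variable. The sketch below prints such a command for review; the helper name build_pg_restore_cmd and its argument order are assumptions.

```shell
#!/bin/sh
# build_pg_restore_cmd DATABASE DUMP_PATH JOBS MAINTENANCE_WORK_MEM
# Prints a parallel pg_restore command whose connections inherit maintenance_work_mem.
# Hypothetical helper: it only builds the command line, it does not execute it.
build_pg_restore_cmd() {
  printf "PGOPTIONS='-c maintenance_work_mem=%s' pg_restore -j %s -d %s %s\n" "$4" "$3" "$1" "$2"
}

build_pg_restore_cmd mydb /backup 4 1GB
```

Each parallel job can use up to maintenance_work_mem for its own index build, so size the value against the pod's memory limit multiplied by the job count.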

Risk Controls

Schedule backups during low traffic.
Use I/O throttling to prevent impacting online services.
Store backups on dedicated persistent volumes to avoid contention.
Always test restore performance in a staging environment first.
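
Scheduling during low traffic can be enforced with a simple guard at the top of a backup script. This is a sketch only: the function name in_backup_window and the 22:00 to 04:00 window in the usage comment are assumptions, and the check uses the local clock of wherever the script runs.

```shell
#!/bin/sh
# in_backup_window CURRENT_HOUR START_HOUR END_HOUR
# Succeeds if CURRENT_HOUR (0-23) falls in [START_HOUR, END_HOUR).
# Handles windows that wrap past midnight. Hypothetical helper.
in_backup_window() {
  if [ "$2" -le "$3" ]; then
    [ "$1" -ge "$2" ] && [ "$1" -lt "$3" ]
  else
    # Window wraps past midnight, e.g. 22 -> 4.
    [ "$1" -ge "$2" ] || [ "$1" -lt "$3" ]
  fi
}

# Usage in a backup script:
#   in_backup_window "$(date +%H)" 22 4 || { echo "outside backup window" >&2; exit 1; }
```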

Rollback

If backup fails due to parameter changes, revert to defaults (e.g., disable parallelism).
If a restore yields corrupt data, immediately stop and restore from a known-good snapshot or full backup.

Verification

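After restore, run consistency checks:
MySQL: CHECKSUM TABLE on the source and the restored copy, or pt-table-checksum in a replication setup.
PostgreSQL: pg_checksums --check (requires a cleanly shut-down cluster with data checksums enabled) or pg_amcheck on PostgreSQL 14 and later. Run ANALYZE afterwards to refresh planner statistics; it is not itself a consistency check.
Validate that business queries return correct results.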
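
One way to act on the MySQL checksum step is to capture CHECKSUM TABLE output on the source and on the restored instance and compare the two. The sketch below assumes each file holds one "table checksum" pair per line, which is the shape produced by mysql -N -e 'CHECKSUM TABLE ...'; the function name compare_checksums and the table names in the usage comment are hypothetical.

```shell
#!/bin/sh
# compare_checksums SOURCE_FILE RESTORED_FILE
# Each file: one "table checksum" pair per line (whitespace separated).
# Prints tables that differ or that exist on one side only; prints nothing on a match.
# Hypothetical helper; assumes SOURCE_FILE is not empty.
compare_checksums() {
  awk '
    NR == FNR { src[$1] = $2; next }            # first file: source checksums
    {
      seen[$1] = 1
      if (!($1 in src))        print "MISSING_IN_SOURCE", $1
      else if (src[$1] != $2)  print "MISMATCH", $1
    }
    END { for (t in src) if (!(t in seen)) print "MISSING_IN_RESTORE", t }
  ' "$1" "$2"
}

# Usage (shop.orders and shop.users are example tables):
#   mysql -N -e 'CHECKSUM TABLE shop.orders, shop.users' > source.txt     # on the source
#   mysql -N -e 'CHECKSUM TABLE shop.orders, shop.users' > restored.txt   # on the restore
#   compare_checksums source.txt restored.txt
```

Checksums only match if no writes reached the source between the backup and the check, so compare against a quiesced source or a replica stopped at the same position.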

When to Submit an OpsGlobal Ticket

Backup/restore times persistently exceed SLA (e.g., >4 hours).
Data corruption or failed restoration occurs.
You need architectural recommendations (e.g., storage class, database configuration tuning).
You are uncertain about risk control configurations.

These steps resolve most backup and restore performance issues. If problems persist, OpsGlobal's SRE team can provide deeper analysis and optimization.

Use cases

Useful for teams that handle database issues and need a clear troubleshooting and delivery workflow.

Problem background

A deep dive into diagnosing and improving backup/restore speed for MySQL and PostgreSQL databases running on Kubernetes, with actionable commands and risk controls.

Troubleshooting steps

Confirm impact and recent changes, collect logs, configuration and metrics, then apply fixes from low to high risk.

Command examples

Replace sample resource names with real values and store passwords, tokens and keys in environment variables.
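
As a small illustration of keeping secrets in the environment, a script can refuse to run when a required variable is missing instead of falling back to a hard-coded value. The helper name require_env is an assumption; PGPASSWORD in the usage comment is the standard libpq variable.

```shell
#!/bin/sh
# require_env NAME
# Succeeds if the environment variable NAME is set and non-empty.
# Hypothetical helper; NAME must be a trusted, literal variable name because eval is used.
require_env() {
  eval "val=\${$1:-}"
  if [ -z "$val" ]; then
    echo "missing required environment variable: $1" >&2
    return 1
  fi
}

# Usage:
#   require_env PGPASSWORD && pg_dump -Fd -j 4 -f /backup mydb
```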

Risks

Before production changes, confirm backups, access boundaries, change windows and rollback paths.

Rollback plan

Keep original configuration and release versions; roll back config, images or database changes if metrics degrade.

Deliverables

Root-cause notes, key commands, remediation steps, verification results and follow-up recommendations.

Related service CTA

If you are facing a similar issue with optimizing backup and recovery performance for MySQL and PostgreSQL in Kubernetes, submit a ticket for remote OpsGlobal support.
