Scenario
A SaaS company needs to migrate 500GB core databases from on-premises to the cloud. Backup window is 2 hours. Currently MySQL 8.0 (InnoDB) backup via mysqldump takes >3h, and recovery is unpredictable. PostgreSQL 15 backup with pg_dump takes 2.8h, recovery exceeding 4h. Both overrun the SLA.
Symptoms
- MySQL:
mysqldumpbackup 3.5h, recovery 4h; XtraBackup backup 1.5h, but recovery spikes disk I/O causing application latency. - PostgreSQL:
pg_dumpbackup 2.8h, recovery slow due to WAL replay;pg_basebackupbackup 1.2h, but recovery requires manual WAL application.
Diagnosis
MySQL
- Backup bottleneck: Check if
mysqldumpuses--single-transaction(avoids table locks) and--quick(row-by-row). Useperf topto identify CPU spent on compression (default gzip). - Recovery bottleneck: High I/O wait due to single-threaded
mysql < dump.sqland non-striped disks.
PostgreSQL
- Backup:
pg_dumpruns serially by default; use-jfor parallelism. Checkshared_buffersandwork_mem. - Recovery:
pg_restorealso serial; specify-j. Improper WAL archiving causes excessive WAL replay.
Commands
MySQL Fast Backup with XtraBackup
xtrabackup --backup --parallel=4 --compress --compress-threads=4 --target-dir=/backup/mysql/full
MySQL Restore
xtrabackup --prepare --target-dir=/backup/mysql/full
xtrabackup --copy-back --target-dir=/backup/mysql/full
PostgreSQL Parallel Backup
pg_dump -j 4 -Fd -f /backup/pg/dump mydb
PostgreSQL Parallel Restore
pg_restore -j 4 -d mydb /backup/pg/dump
Risk Controls
- Verify free disk space before backup to avoid filling the volume.
- Use
--single-transactionor--lock-wait-timeouton production to prevent long waits. - Never restore directly to production; always test on a staging instance.
- Checksum backup files and perform periodic restore drills.
Rollback
- On backup failure, immediately revert to the last successful backup.
- MySQL: Use XtraBackup incremental backups (
--incremental-basedir) for rollback. - PostgreSQL: Use PITR to restore to a specific point in time.
Verification
- Run table-level checksums:
MySQL:
CHECKSUM TABLE table_name;PostgreSQL: Usepg_surgery(carefully) or custom hash-based verification. - Execute key queries comparing row counts and business summaries.
- Monitor application logs for duplicate key errors or missing data.
When to Submit an OpsGlobal Ticket
- Backup duration consistently exceeds the window by >50%.
- Restore fails or data inconsistency cannot be resolved internally.
- Need to tune backup strategy (compression algorithm, parallelism, etc.).
- Disk or network I/O bottlenecks affecting other workloads and requiring infrastructure support.
This guide provides technical reference only. Always follow your change management process. For expert assistance, contact OpsGlobal.
Use cases
Useful for teams handling Database issues and needing a clear troubleshooting and delivery workflow.
Problem background
Deep dive into performance bottlenecks, diagnostic methods, and optimization strategies for MySQL and PostgreSQL backup and recovery, covering scenarios, symptoms, commands, risk controls, rollback, verification, and when to escalate to OpsGlobal.
Troubleshooting steps
Confirm impact and recent changes, collect logs, configuration and metrics, then apply fixes from low to high risk.
Command examples
Replace sample resource names with real values and store passwords, tokens and keys in environment variables.
Risks
Before production changes, confirm backups, access boundaries, change windows and rollback paths.
Rollback plan
Keep original configuration and release versions; roll back config, images or database changes if metrics degrade.
Deliverables
Root-cause notes, key commands, remediation steps, verification results and follow-up recommendations.
Need help with a similar technical issue?
If your servers, Kubernetes, Docker, CI/CD, databases or monitoring systems have similar issues, submit logs and config files for remote diagnosis.