Scenario
During routine SRE operations, PostgreSQL database suddenly exhibits a high volume of slow queries, causing increased application latency and potential cascading failures. Role: SRE Engineer. Environment: PostgreSQL 15 deployed in Kubernetes via StatefulSet.
Symptoms
- Elevated application error rates (e.g., HTTP 5xx)
- Query response times spiking from 50ms to over 5s
- Database CPU utilization near 100%
- Connection pool exhaustion
Diagnosis
1. Enable Slow Query Logging
-- Temporarily log queries exceeding 1 second
ALTER SYSTEM SET log_min_duration_statement = 1000;
SELECT pg_reload_conf();
2. Identify Running Queries
SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state
FROM pg_stat_activity
WHERE state != 'idle' AND query NOT LIKE '%pg_stat%'
ORDER BY duration DESC LIMIT 5;
3. Detect Blocking and Locks
SELECT blocked_locks.pid AS blocked_pid,
blocked_activity.usename AS blocked_user,
blocking_locks.pid AS blocking_pid,
blocked_activity.query AS blocked_statement
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
AND blocking_locks.pid != blocked_locks.pid
WHERE NOT blocked_locks.granted;
4. Analyze with EXPLAIN
EXPLAIN (ANALYZE, BUFFERS, TIMING) SELECT * FROM orders WHERE customer_id = 12345;
Look for full table scans, missing indexes, or suboptimal join orders.
Risk Controls During Diagnosis
- Avoid killing sessions directly (may cause long rollbacks).
- Use
pg_cancel_backend(pid)to cancel queries instead ofpg_terminate_backendwhere possible. - If termination is necessary, record the transaction ID and verify no critical row locks are held.
Safe Rollback
If the issue is caused by a recent schema change (e.g., index creation, query modification), roll back as follows:
1. Rollback Index Creation
DROP INDEX IF EXISTS idx_orders_customer_id_new;
2. Rollback DDL Changes
Use DDL transactions where supported or manually reverse operations.
ALTER TABLE orders DROP COLUMN IF EXISTS temp_column;
3. Rollback Application Code
Revert to previous query version or use pg_hint_plan to force old plan.
Verification
- Monitor slow query logs: confirm no queries exceed threshold.
- Check performance metrics: CPU, IO, connections return to baseline.
- Run test queries:
SELECT * FROM orders WHERE customer_id = 12345; -- expected <100ms
- Application health checks: verify error rate drop.
When to Submit an OpsGlobal Ticket
- Root cause unclear; require expert analysis.
- Issue persists after rollback; need deep debugging (e.g., VACUUM issues, configuration).
- Cluster-level problems: replication lag, disk faults.
- High-risk operations needed (e.g., manual system table repair).
OpsGlobal provides 24/7 remote DBA support with 30-minute response time.
Use cases
Useful for teams handling Database issues and needing a clear troubleshooting and delivery workflow.
Problem background
A deep dive into diagnosing PostgreSQL slow queries and performing safe rollbacks to maintain production stability.
Troubleshooting steps
Confirm impact and recent changes, collect logs, configuration and metrics, then apply fixes from low to high risk.
Command examples
Replace sample resource names with real values and store passwords, tokens and keys in environment variables.
Risks
Before production changes, confirm backups, access boundaries, change windows and rollback paths.
Rollback plan
Keep original configuration and release versions; roll back config, images or database changes if metrics degrade.
Deliverables
Root-cause notes, key commands, remediation steps, verification results and follow-up recommendations.
Need help with a similar technical issue?
If your servers, Kubernetes, Docker, CI/CD, databases or monitoring systems have similar issues, submit logs and config files for remote diagnosis.