PostgreSQL Slow Query Diagnosis and Safe Rollback SRE Guide

PostgreSQL Slow Query Diagnosis and Safe Rollback Practices for SRE Operations

Database 6min 70 views 2026-06-12

PostgreSQLSREDatabase PerformanceRollback

Scenario

During routine SRE operations, PostgreSQL database suddenly exhibits a high volume of slow queries, causing increased application latency and potential cascading failures. Role: SRE Engineer. Environment: PostgreSQL 15 deployed in Kubernetes via StatefulSet.

Symptoms

Elevated application error rates (e.g., HTTP 5xx)
Query response times spiking from 50ms to over 5s
Database CPU utilization near 100%
Connection pool exhaustion

Diagnosis

1. Enable Slow Query Logging

-- Temporarily log queries exceeding 1 second
ALTER SYSTEM SET log_min_duration_statement = 1000;
SELECT pg_reload_conf();

2. Identify Running Queries

SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state
FROM pg_stat_activity
WHERE state != 'idle' AND query NOT LIKE '%pg_stat%'
ORDER BY duration DESC LIMIT 5;

3. Detect Blocking and Locks

SELECT blocked_locks.pid AS blocked_pid,
       blocked_activity.usename AS blocked_user,
       blocking_locks.pid AS blocking_pid,
       blocked_activity.query AS blocked_statement
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype
    AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
    AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
    AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
    AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
    AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
    AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
    AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
    AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
    AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
    AND blocking_locks.pid != blocked_locks.pid
WHERE NOT blocked_locks.granted;

4. Analyze with EXPLAIN

EXPLAIN (ANALYZE, BUFFERS, TIMING) SELECT * FROM orders WHERE customer_id = 12345;

Look for full table scans, missing indexes, or suboptimal join orders.

Risk Controls During Diagnosis

Avoid killing sessions directly (may cause long rollbacks).
Use pg_cancel_backend(pid) to cancel queries instead of pg_terminate_backend where possible.
If termination is necessary, record the transaction ID and verify no critical row locks are held.

Safe Rollback

If the issue is caused by a recent schema change (e.g., index creation, query modification), roll back as follows:

1. Rollback Index Creation

DROP INDEX IF EXISTS idx_orders_customer_id_new;

2. Rollback DDL Changes

Use DDL transactions where supported or manually reverse operations.

ALTER TABLE orders DROP COLUMN IF EXISTS temp_column;

3. Rollback Application Code

Revert to previous query version or use pg_hint_plan to force old plan.

Verification

Monitor slow query logs: confirm no queries exceed threshold.
Check performance metrics: CPU, IO, connections return to baseline.
Run test queries:

SELECT * FROM orders WHERE customer_id = 12345; -- expected <100ms

Application health checks: verify error rate drop.

When to Submit an OpsGlobal Ticket

Root cause unclear; require expert analysis.
Issue persists after rollback; need deep debugging (e.g., VACUUM issues, configuration).
Cluster-level problems: replication lag, disk faults.
High-risk operations needed (e.g., manual system table repair).

OpsGlobal provides 24/7 remote DBA support with 30-minute response time.

Use cases

Useful for teams handling Database issues and needing a clear troubleshooting and delivery workflow.

Problem background

A deep dive into diagnosing PostgreSQL slow queries and performing safe rollbacks to maintain production stability.

Troubleshooting steps

Confirm impact and recent changes, collect logs, configuration and metrics, then apply fixes from low to high risk.

Command examples

Replace sample resource names with real values and store passwords, tokens and keys in environment variables.

Risks

Before production changes, confirm backups, access boundaries, change windows and rollback paths.

Rollback plan

Keep original configuration and release versions; roll back config, images or database changes if metrics degrade.

Deliverables

Root-cause notes, key commands, remediation steps, verification results and follow-up recommendations.

Related service CTA

If you are facing a similar PostgreSQL Slow Query Diagnosis and Safe Rollback Practices for SRE Operations issue, submit a ticket for remote OpsGlobal support.

Need help with a similar technical issue?

If your servers, Kubernetes, Docker, CI/CD, databases or monitoring systems have similar issues, submit logs and config files for remote diagnosis.

Submit Incident Ticket Book Technical Consultation

Book Technical Consultation Back to Blog