GCP Cloud SQL Disaster Recovery: A Practical Guide for Developers

When a production database goes down – whether from a bad migration, an accidental DROP TABLE, or a rogue script – the clock starts ticking. Every minute of downtime is lost revenue, broken trust, and a very stressful Slack channel.

This post walks through how Google Cloud SQL’s backup and recovery features work, common disaster scenarios, and the recovery playbook a developer should follow for each. The examples use a typical SaaS application backed by PostgreSQL on Cloud SQL, but the principles apply broadly.

Cloud SQL Backup Fundamentals

Before anything goes wrong, you need to understand what Cloud SQL gives you out of the box and what you need to configure yourself.

Automated Backups

Cloud SQL can take daily automated backups of your instance. These are full backups of the entire instance, and the number retained is configurable (default 7, up to 365).

# gcloud: verify automated backups are enabled
gcloud sql instances describe my-instance \
  --format="value(settings.backupConfiguration)"

Key settings to configure:

Setting | Recommendation | Why
backupConfiguration.enabled | true | Non-negotiable for production
backupConfiguration.startTime | Off-peak hours (e.g. 04:00 UTC) | Minimizes performance impact
backupConfiguration.backupRetentionSettings.retainedBackups | 14-30 | Gives you a wider recovery window
backupConfiguration.pointInTimeRecoveryEnabled | true | Enables PITR (see below)
backupConfiguration.transactionLogRetentionDays | 7 | How far back PITR can reach

Point-in-Time Recovery (PITR)

Automated backups give you daily snapshots. PITR fills the gaps by continuously archiving write-ahead logs (WAL for PostgreSQL, binary logs for MySQL). This lets you restore to any second within the retention window — not just to the time of the last backup.

# Enable PITR on an existing instance
gcloud sql instances patch my-instance \
  --enable-point-in-time-recovery \
  --retained-transaction-log-days=7

PITR is the single most important setting for disaster recovery. Without it, you lose every write between your last automated backup and the incident.

On-Demand Backups

You can trigger a backup manually before risky operations:

gcloud sql backups create --instance=my-instance \
  --description="pre-migration-backup-2026-04-08"

Rule of thumb: always take an on-demand backup before running migrations, bulk data operations, or any ad-hoc SQL against production.
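If your migrations run through Rake, you can bake this rule into the workflow itself. A minimal sketch, assuming the gcloud CLI is available wherever migrations run (the task name and environment variable are hypothetical):

# lib/tasks/safe_migrate.rake
namespace :db do
  desc "Take an on-demand Cloud SQL backup, then run migrations"
  task safe_migrate: :environment do
    instance = ENV.fetch("CLOUDSQL_INSTANCE", "my-instance")
    stamp = Time.now.utc.strftime("%Y%m%d-%H%M%S")

    backup_cmd = "gcloud sql backups create --instance=#{instance} " \
                 "--description=pre-migration-#{stamp}"
    system(backup_cmd) or abort("Backup failed; refusing to migrate")

    Rake::Task["db:migrate"].invoke
  end
end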


Disaster Scenarios and Recovery Playbooks

Scenario 1: Accidental Table Drop or Data Deletion

What happened: A developer ran a DROP TABLE or DELETE FROM without a WHERE clause against production. Maybe it was a script meant for staging. Maybe an AI-generated SQL statement was executed without review.

Impact: One or more tables are gone or empty. The application is throwing 500s.

Recovery options:

Option A: PITR (best if available)

Restore to the moment just before the destructive command. You’ll need the approximate timestamp; Cloud SQL’s database logs in Cloud Logging (or your application logs) can pin down exactly when the statement ran.

# Restore to a clone instance first — never restore directly over production
gcloud sql instances clone my-instance my-instance-recovery \
  --point-in-time="2026-04-08T10:59:00Z"

This creates a new instance with the database state at that exact second. You can then:

  1. Verify the data on the clone
  2. Export the affected tables from the clone
  3. Import them back into the production instance

# Export a specific table from the recovery clone
gcloud sql export sql my-instance-recovery gs://my-bucket/recovery/users-table.sql \
  --database=myapp_production \
  --table=users

# Import into production
gcloud sql import sql my-instance gs://my-bucket/recovery/users-table.sql \
  --database=myapp_production

Option B: Restore from automated backup

If PITR is not enabled, restore the most recent automated backup that predates the incident.

# List available backups
gcloud sql backups list --instance=my-instance

# Restore a specific backup (this overwrites the instance)
gcloud sql backups restore BACKUP_ID --restore-instance=my-instance

Warning: Restoring a backup directly onto your production instance overwrites everything. All writes since that backup are lost. Prefer cloning to a recovery instance first.

The data gap problem:

When you restore from a backup taken at, say, 4:00 AM, but the incident happened at 11:00 AM, you lose 7 hours of data. This is the gap you’ll need to address manually. Common strategies:

  • Application-level event logs: If your app publishes events to a message queue (Kafka, Pub/Sub), you can replay them.
  • Analytics replicas: If you replicate data to BigQuery, Snowflake, or another analytics store, you can query the missing records from there and re-import them.
  • Audit tables: If your application logs changes to an audit table in a separate database, those records survive.

-- Example: querying BigQuery for records created during the gap window
SELECT *
FROM `project.dataset.user_actions`
WHERE created_at BETWEEN TIMESTAMP('2026-04-08 04:00:00', 'America/Vancouver')
  AND TIMESTAMP('2026-04-08 11:00:00', 'America/Vancouver')
  AND action_type = 'account_status_change'

You then re-ingest these records into production, typically via a script run in your application’s console or through a migration task.
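As a concrete sketch of that last step, here is what re-ingesting the BigQuery results above might look like from a Rails console, assuming the gap rows were exported to CSV (the model and association names are hypothetical):

require "csv"

# Re-ingest gap records exported from the analytics replica
CSV.foreach("gap_user_actions.csv", headers: true) do |row|
  user = User.find_by(id: row["user_id"])
  next unless user # the user itself may have been created during the gap

  # Idempotent guard: skip anything that already made it back into production
  next if user.account_events.exists?(occurred_at: row["created_at"])

  user.account_events.create!(
    action_type: row["action_type"],
    occurred_at: row["created_at"]
  )
end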


Scenario 2: Interrupted Background Job

What happened: A critical scheduled job — say, one that generates weekly records for all active users — was running when the incident occurred. The database was restored from backup, but the job was killed mid-execution. Some users got their records; others didn’t.

Impact: No application errors (the data that exists is valid), but there’s a silent gap. Some users are missing records they should have.

Recovery playbook:

Step 1 – Quantify the gap

Before doing anything, measure what’s missing:

# Find users who should have a record but don't
target_date = Date.parse('2026-05-30')
users_missing = User.where(status: ['active', 'subscribed'])
  .where.not(id: WeeklyRecord.where(week_date: target_date).select(:user_id))
users_missing.count

Record the count. You’ll need it for verification later.

Step 2 – Understand the generation logic

Before re-running anything, understand what the job does:

  • Does it check for existing records before creating? (idempotent?)
  • Does it behave differently based on user status? (e.g., suspended users get a different treatment)
  • Does it trigger side effects? (emails, webhooks, billing)

If the job is idempotent — meaning running it twice for the same user produces the same result without duplicates — you can safely re-run it for all users, not just the ones missing records. This is simpler and safer than trying to target only the gap.
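In practice, idempotency usually comes down to a guard clause on existing records. A sketch of what that looks like inside the generator (only the class name comes from the script in Step 3; the internals are illustrative):

class WeeklyRecordGenerator
  def initialize(user)
    @user = user
  end

  def generate(week_date)
    # The idempotency guard: re-running for the same user and week
    # is a no-op rather than a duplicate
    return if WeeklyRecord.exists?(user_id: @user.id, week_date: week_date)

    WeeklyRecord.create!(
      user_id: @user.id,
      week_date: week_date,
      status: initial_status
    )
  end

  private

  # Status-dependent logic: suspended users get a different treatment
  def initial_status
    @user.status == "suspended" ? "on_hold" : "pending"
  end
end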

Step 3 – Re-run with guardrails

Write a targeted script rather than re-triggering the entire job:

target_date = Date.parse('2026-05-30')

# Pre-check
baseline_count = WeeklyRecord.where(week_date: target_date).count
puts "Records before: #{baseline_count}"

# Find and process missing users
users_missing = User.where(status: ['active', 'subscribed'])
  .where.not(id: WeeklyRecord.where(week_date: target_date).select(:user_id))
puts "Users missing records: #{users_missing.count}"

users_missing.find_each do |user|
  WeeklyRecordGenerator.new(user).generate(target_date)
rescue => e
  puts "Failed for User ##{user.id}: #{e.message}"
end

# Post-check
new_count = WeeklyRecord.where(week_date: target_date).count
puts "Records after: #{new_count}"
puts "Delta: #{new_count - baseline_count}"

Step 4 – Verify

Check that:

  • The record count increased by the expected amount
  • No duplicates were created
  • No users are still missing records
  • Any status-dependent logic was applied correctly (e.g., suspended users got the right treatment)
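The duplicate and still-missing checks translate directly into console queries, reusing the models from the script above (a sketch):

target_date = Date.parse('2026-05-30')

# Duplicates: more than one record per user for the week is a red flag
dupes = WeeklyRecord.where(week_date: target_date)
                    .group(:user_id)
                    .having("count(*) > 1")
                    .count
puts "Users with duplicate records: #{dupes.size}"

# Users still missing records
still_missing = User.where(status: ['active', 'subscribed'])
  .where.not(id: WeeklyRecord.where(week_date: target_date).select(:user_id))
puts "Users still missing records: #{still_missing.count}"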

Scenario 3: Corrupted Data from a Bad Migration

What happened: A migration altered a column type, dropped a constraint, or backfilled data incorrectly. The application is running but producing wrong results.

Impact: Data is present but incorrect. This is often harder to detect than missing data.

Recovery playbook:

  1. Don’t panic-restore. If the app is functional (just producing wrong data), you have time to assess.
  2. Clone to a recovery instance at a point in time predating the migration:

gcloud sql instances clone my-instance pre-migration-clone \
  --point-in-time="2026-04-07T23:00:00Z"

  3. Diff the data between production and the clone to understand exactly what changed. The queries below assume both datasets are reachable from one session (e.g. the clone’s tables pulled into a backup_clone schema, or attached via postgres_fdw):

-- Compare row counts
SELECT 'production' as source, count(*) FROM production.orders
UNION ALL
SELECT 'backup' as source, count(*) FROM backup_clone.orders;

-- Find rows that differ
SELECT p.id, p.amount as prod_amount, b.amount as backup_amount
FROM production.orders p
JOIN backup_clone.orders b ON p.id = b.id
WHERE p.amount != b.amount;

  4. Write a targeted fix rather than a full restore (which would lose legitimate post-migration writes); see the sketch after this list.
  5. Write a rollback migration if the schema change itself was the problem.
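For the amount corruption in the diff above, a targeted fix might be a console script that repairs just the affected column from values exported off the clone (the CSV path is illustrative):

require "csv"

# clone_order_amounts.csv: id,amount exported from pre-migration-clone
Order.transaction do
  CSV.foreach("clone_order_amounts.csv", headers: true) do |row|
    order = Order.find(row["id"])
    next if order.amount.to_d == row["amount"].to_d # already correct

    # update_column skips validations and callbacks, which is what you want
    # for a surgical repair; verify there are no dependent side effects first
    order.update_column(:amount, row["amount"].to_d)
  end
end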

Scenario 4: Full Instance Failure

What happened: The Cloud SQL instance is unreachable – maybe a zone outage, maybe accidental instance deletion.

Recovery options:

If the instance still exists (zone outage):

Cloud SQL instances configured for high availability will automatically fail over to a standby in another zone. If you don’t have HA enabled:

# Enable HA (requires instance restart)
gcloud sql instances patch my-instance --availability-type=REGIONAL

If the instance was deleted:

By default, deleting a Cloud SQL instance deletes its automated backups along with it, so don’t count on recovering after the fact. The real defense is deletion protection, which makes the delete fail up front:

# Enable deletion protection
gcloud sql instances patch my-instance --deletion-protection

If the instance is truly gone, you need a backup that outlived it, such as a final backup taken at deletion time or a SQL export sitting in Cloud Storage. To restore a surviving backup, create a fresh instance and restore into it:

# Create the target instance first; backups restore into an existing instance.
# The database version must match the source (POSTGRES_15 is a placeholder).
gcloud sql instances create my-instance-restored \
  --database-version=POSTGRES_15 \
  --tier=db-custom-4-16384 \
  --region=us-west1

gcloud sql backups restore BACKUP_ID \
  --restore-instance=my-instance-restored \
  --backup-instance=my-instance

Then update your application’s database connection string to point to the new instance.
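Once the connection string is updated, a quick console check against the new instance is worth the extra minute (a sketch; compare the counts against values you recorded earlier):

# Confirm the app is talking to the restored instance and the data looks sane
conn = ActiveRecord::Base.connection
puts conn.execute("SELECT current_database(), inet_server_addr()").first
puts "Users: #{User.count}" # compare against your last known-good count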


Prevention Checklist

The best disaster recovery is the one you never need. Here’s what to set up before things go wrong:

Cloud SQL Configuration

# The production-ready configuration checklist
gcloud sql instances patch my-instance \
  --backup-start-time=04:00 \
  --enable-point-in-time-recovery \
  --retained-transaction-log-days=7 \
  --retained-backups-count=30 \
  --deletion-protection \
  --availability-type=REGIONAL

Operational Practices

1. Never run ad-hoc SQL directly against production

Use a read replica for investigative queries. If you must write, use a transaction with a manual ROLLBACK checkpoint:

BEGIN;

-- Your change here
UPDATE users SET status = 'inactive' WHERE last_login < '2025-01-01';

-- Verify before committing
SELECT count(*) FROM users WHERE status = 'inactive';

-- Only if the count looks right:
COMMIT;
-- Otherwise:
ROLLBACK;
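The same guardrail works from a Rails console: raising ActiveRecord::Rollback undoes the write unless the numbers check out (expected_count is a value you work out beforehand):

ActiveRecord::Base.transaction do
  affected = User.where("last_login < ?", Date.new(2025, 1, 1))
                 .update_all(status: "inactive")
  puts "Deactivated #{affected} users"

  # Abort unless the change matches what you expected
  raise ActiveRecord::Rollback unless affected == expected_count
end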

2. Take on-demand backups before risky operations

gcloud sql backups create --instance=my-instance \
  --description="pre-bulk-update-$(date +%Y%m%d-%H%M%S)"

3. Review AI-generated SQL before executing

AI tools are excellent at generating SQL, but they don’t understand your data invariants. A syntactically correct DROP TABLE or DELETE without a WHERE clause is still catastrophic. Always:

  • Read the generated SQL line by line
  • Run it on staging first
  • Wrap destructive operations in a transaction
  • Have a second pair of eyes for DDL changes

4. Maintain an analytics replica

Replicate critical tables to BigQuery or another analytics store. This serves as both an analytics platform and a recovery source. If your primary database loses data, you can query the replica for the gap window and re-ingest.

# Sketch: schedule an hourly federated query that copies from Cloud SQL into
# BigQuery. Assumes a BigQuery connection to the instance already exists;
# the connection and table names are placeholders.
bq mk --transfer_config \
  --target_dataset=sql_replica \
  --display_name="Production SQL Replica" \
  --data_source=scheduled_query \
  --schedule="every 1 hours" \
  --params='{
    "query": "SELECT * FROM EXTERNAL_QUERY(\"my-project.us-west1.cloudsql-conn\", \"SELECT * FROM user_actions\")",
    "destination_table_name_template": "user_actions",
    "write_disposition": "WRITE_TRUNCATE"
  }'

5. Use IAM to restrict destructive operations

Not every developer needs cloudsql.instances.delete or direct SQL access to production:

# Create a read-only role for most developers
gcloud projects add-iam-policy-binding my-project \
  --member="group:developers@company.com" \
  --role="roles/cloudsql.viewer"

# Grant write access only to the ops team
gcloud projects add-iam-policy-binding my-project \
  --member="group:database-ops@company.com" \
  --role="roles/cloudsql.admin"

The Recovery Timeline: What Happens in Practice

Here’s what a real recovery typically looks like, end to end:

T+0min Incident detected (alerts fire, app errors spike)
T+5min Confirm the issue — is it a code bug or data loss?
T+10min Identify the last good backup / PITR target
T+15min Clone instance from backup (takes 5-30 min depending on size)
T+45min Verify restored data on the clone
T+60min Restore production from clone or selectively import tables
T+90min Identify the data gap (writes between backup and incident)
T+120min Query analytics replica / event logs for gap data
T+150min Re-ingest gap data, verify counts
T+180min Re-run interrupted jobs with verification
T+210min Final validation — all counts match, no duplicates, app healthy
T+240min Post-incident review

The total time depends on database size, gap complexity, and whether you had PITR enabled. With PITR, the gap is seconds. Without it, you could be looking at hours of manual data reconciliation.


Key Takeaways

  1. Enable PITR. It’s the difference between losing seconds of data and losing hours.
  2. Always clone to a recovery instance first. Never restore directly over production unless you have no other option.
  3. Maintain an analytics replica. It’s your insurance policy for the data gap.
  4. Quantify before you fix. Record counts before and after every recovery step. You can’t verify what you didn’t measure.
  5. Understand your jobs’ idempotency. If a background job was interrupted, knowing whether it’s safe to re-run is the difference between a smooth recovery and creating a bigger mess.
  6. Take on-demand backups before risky operations. The 30 seconds it takes could save you 4 hours of recovery.
  7. Review all SQL before execution. Especially AI-generated SQL. Trust, but verify.

Production incidents are stressful, but with the right configuration and a clear playbook, they don’t have to be catastrophic. Set up your backups today — future you will be grateful.

Happy fixing!


Author: Abhilash

Hi, I’m Abhilash! A seasoned web developer with 15 years of experience specializing in Ruby and Ruby on Rails. Since 2010, I’ve built scalable, robust web applications and worked with frameworks like Angular, Sinatra, Laravel, Node.js, Vue and React. Passionate about clean, maintainable code and continuous learning, I share insights, tutorials, and experiences here. Let’s explore the ever-evolving world of web development together!
