Start with RTO and RPO
Before choosing any technology, define your targets. Every DR decision flows from these two numbers:
- RTO (Recovery Time Objective) — how long can the business tolerate downtime? 1 hour? 4 hours? 24 hours?
- RPO (Recovery Point Objective) — how much data loss is acceptable? Zero? 15 minutes? 1 hour?
| Strategy | Typical RPO | Typical RTO | Cost |
|---|---|---|---|
| Cold backup (restore from file) | Hours | Hours | Low |
| Log shipping / WAL archiving | Minutes | Minutes–Hours | Low–Medium |
| Streaming replication | <1 min | Minutes | Medium |
| Synchronous HA (AG / Patroni) | Zero | Seconds | High |
A backup you've never tested is not a backup — it's a hope. Schedule quarterly restore drills to a test environment and measure actual RTO. You'll almost always discover something unexpected.
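A cheap way to make the drill measurable is to time the whole procedure end to end. A minimal sketch, assuming a bash environment; restore_procedure.sh is a hypothetical stand-in for your own documented restore steps:
# Timed restore drill; restore_procedure.sh is a placeholder for your runbook
start=$(date +%s)
./restore_procedure.sh test-environment
end=$(date +%s)
echo "Measured RTO: $(( (end - start) / 60 )) minutes"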
SQL Server
Backup, Log Shipping, Always On AG
Backup Strategy
A solid SQL Server backup strategy follows the 3-2-1 rule: 3 copies of data, on 2 different media types, with 1 copy offsite. For most production databases, the schedule looks like this:
- Full backup — weekly (Sunday night)
- Differential backup — nightly (Mon–Sat)
- Transaction log backup — every 15–30 minutes
-- Full backup (TO DISK accepts only a literal or a variable, so build the path first)
DECLARE @FullPath nvarchar(260) =
    N'D:\Backups\YourDB_Full_' + FORMAT(GETDATE(), 'yyyyMMdd_HHmm') + N'.bak';
BACKUP DATABASE [YourDB]
TO DISK = @FullPath
WITH
    COMPRESSION,
    CHECKSUM,
    STATS = 10,
    DESCRIPTION = 'Weekly full backup';
-- Verify backup integrity
RESTORE VERIFYONLY
FROM DISK = 'D:\Backups\YourDB_Full_20260328_0200.bak'
WITH CHECKSUM;
-- Transaction log backup (same pattern: path built in a variable)
DECLARE @LogPath nvarchar(260) =
    N'D:\Backups\YourDB_Log_' + FORMAT(GETDATE(), 'yyyyMMdd_HHmm') + N'.trn';
BACKUP LOG [YourDB]
TO DISK = @LogPath
WITH COMPRESSION, CHECKSUM;
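The schedule above also calls for nightly differentials. A sketch following the same pattern:
-- Differential backup (changes since the last full backup)
DECLARE @DiffPath nvarchar(260) =
    N'D:\Backups\YourDB_Diff_' + FORMAT(GETDATE(), 'yyyyMMdd_HHmm') + N'.bak';
BACKUP DATABASE [YourDB]
TO DISK = @DiffPath
WITH DIFFERENTIAL, COMPRESSION, CHECKSUM;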
Log Shipping for DR
Log shipping copies transaction log backups to a secondary server and restores them there continuously, keeping the secondary typically 15–30 minutes behind the primary. It's simple, reliable, and works even across SQL Server versions (the secondary must be the same version or newer), which makes it ideal for DR sites where some data loss is tolerable.
-- Run on the secondary
SELECT
primary_server,
primary_database,
last_copied_file,
last_copied_date,
last_restored_file,
last_restored_date,
DATEDIFF(MINUTE, last_restored_date, GETDATE()) AS lag_minutes
FROM msdb.dbo.log_shipping_monitor_secondary
ORDER BY lag_minutes DESC;
Always On AG for Zero-RPO DR
For RPO = 0, Always On Availability Groups with synchronous commit is the answer. See our dedicated article on Always On AG architecture for full setup details.
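Whatever the topology, verify that replicas really are synchronized before you count on RPO = 0. One way to check, using the standard HADR DMVs:
-- Synchronization state per replica; send/redo queues should be near zero
SELECT
    ar.replica_server_name,
    drs.synchronization_state_desc,
    drs.log_send_queue_size,
    drs.redo_queue_size
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
    ON drs.replica_id = ar.replica_id;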
Point-in-Time Restore
-- Step 1: Restore full backup (NORECOVERY keeps it in restoring state)
RESTORE DATABASE [YourDB]
FROM DISK = 'D:\Backups\YourDB_Full_20260323_0200.bak'
WITH NORECOVERY, REPLACE;
-- Step 2: Restore differential (if available)
RESTORE DATABASE [YourDB]
FROM DISK = 'D:\Backups\YourDB_Diff_20260327_0200.bak'
WITH NORECOVERY;
-- Step 3: Restore log backups in sequence up to the target time (repeat for each file)
RESTORE LOG [YourDB]
FROM DISK = 'D:\Backups\YourDB_Log_20260328_0945.trn'
WITH
NORECOVERY,
STOPAT = '2026-03-28 09:43:00'; -- restore to just before the incident
-- Step 4: Bring database online
RESTORE DATABASE [YourDB] WITH RECOVERY;
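To find the exact files for steps 1 to 3, query msdb, which records every backup taken on the instance:
-- Backup history for the database (type: D = full, I = differential, L = log)
SELECT TOP (50)
    bs.backup_start_date,
    bs.type,
    bmf.physical_device_name
FROM msdb.dbo.backupset AS bs
JOIN msdb.dbo.backupmediafamily AS bmf
    ON bs.media_set_id = bmf.media_set_id
WHERE bs.database_name = 'YourDB'
ORDER BY bs.backup_start_date DESC;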
PostgreSQL
pg_dump, WAL Archiving, pgBackRest, Streaming Replication
Logical Backups with pg_dump
For smaller databases or selective restores, pg_dump creates portable logical backups. It's not suitable as your sole DR strategy for large databases — use it alongside physical backups.
# Full database dump in custom format (recommended)
pg_dump \
--host=localhost \
--port=5432 \
--username=postgres \
--format=custom \
--compress=9 \
--file=/backups/yourdb_$(date +%Y%m%d_%H%M).dump \
yourdb
# Restore from custom format dump
pg_restore \
--host=localhost \
--port=5432 \
--username=postgres \
--dbname=yourdb_restored \
--jobs=4 \
/backups/yourdb_20260328_0200.dump
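The custom format also carries a table of contents, so single objects can be restored without a separate export. A sketch restoring one table; the table and target database names are illustrative:
# Restore only the "orders" table into a scratch database
# (the target database must already exist)
pg_restore \
--host=localhost \
--port=5432 \
--username=postgres \
--dbname=yourdb_scratch \
--table=orders \
/backups/yourdb_20260328_0200.dump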
Physical Backups with pgBackRest
pgBackRest is the industry standard for production PostgreSQL backups. It supports full, differential, and incremental backups, parallel processing, compression, encryption, and S3/Azure Blob storage.
# /etc/pgbackrest/pgbackrest.conf
[global]
repo1-path=/var/lib/pgbackrest
repo1-retention-full=2
repo1-retention-diff=14
compress-type=lz4
log-level-console=info
[yourdb]
pg1-path=/var/lib/postgresql/15/main
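Before the first backup, the stanza must be created and the configuration verified; both are one-off commands:
# One-time setup: create the stanza, then check archiving and repository access
pgbackrest --stanza=yourdb stanza-create
pgbackrest --stanza=yourdb check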
# Take a full backup
pgbackrest --stanza=yourdb backup --type=full
# Take a differential backup
pgbackrest --stanza=yourdb backup --type=diff
# Take an incremental backup
pgbackrest --stanza=yourdb backup --type=incr
# Point-in-time restore (--type=time is required for a time target;
# stop PostgreSQL before restoring, then start it to begin recovery)
pgbackrest --stanza=yourdb restore \
--type=time \
--target="2026-03-28 09:43:00" \
--target-action=promote
WAL Archiving for PITR
Enable WAL archiving in postgresql.conf to continuously ship WAL segments to a backup location. Combined with a base backup, this enables point-in-time recovery to any moment since the last base backup.
wal_level = replica
archive_mode = on
archive_command = 'pgbackrest --stanza=yourdb archive-push %p'
archive_timeout = 300   # force WAL switch every 5 minutes (max RPO)
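To confirm archiving is actually succeeding, check the built-in pg_stat_archiver view; a rising failed_count means archive_command is failing:
-- Run on the primary
SELECT archived_count, last_archived_wal, last_archived_time,
       failed_count, last_failed_wal
FROM pg_stat_archiver;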
Streaming Replication for HA/DR
PostgreSQL streaming replication sends WAL records from primary to standby in real time. Combined with Patroni for automated failover, this is the standard production HA setup.
# On the standby server
pg_basebackup \
--host=primary-server \
--port=5432 \
--username=replicator \
--pgdata=/var/lib/postgresql/15/main \
--wal-method=stream \
--checkpoint=fast \
--progress \
--verbose
# postgresql.conf on standby
primary_conninfo = 'host=primary-server port=5432 user=replicator'
recovery_target_timeline = 'latest'
# Create standby.signal to activate standby mode
touch /var/lib/postgresql/15/main/standby.signal
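Once the standby is streaming, watch replication lag against your RPO. A monitoring query using pg_stat_replication:
-- Run on the primary: per-standby lag, in bytes and as elapsed time
SELECT
    application_name,
    client_addr,
    state,
    pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
    replay_lag
FROM pg_stat_replication;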
MongoDB
mongodump, Replica Sets, Ops Manager, Atlas
Logical Backups with mongodump
mongodump creates BSON exports of your data. It's suitable for smaller datasets and selective restores but is not recommended as the sole DR strategy for large or high-write databases: writes continue while collections are dumped, so without --oplog the export is not a consistent point-in-time snapshot. On a replica set, add --oplog to capture the operations that occur during the dump.
# Dump with oplog (ensures consistency across collections)
mongodump \
--host="replicaset/mongo1:27017,mongo2:27017,mongo3:27017" \
--oplog \
--gzip \
--out=/backups/mongo_$(date +%Y%m%d_%H%M)
# Restore with oplog replay
mongorestore \
--host="localhost:27017" \
--oplogReplay \
--gzip \
/backups/mongo_20260328_0200
Replica Sets — The Foundation of MongoDB DR
Every production MongoDB deployment should use a replica set. A replica set is a group of mongod instances that maintain the same dataset — one primary accepts writes, secondaries replicate asynchronously and can become primary on failover.
// Initiate the replica set (run on primary)
rs.initiate({
_id: "rs0",
members: [
{ _id: 0, host: "mongo1:27017", priority: 2 },
{ _id: 1, host: "mongo2:27017", priority: 1 },
{ _id: 2, host: "mongo3:27017", priority: 0, hidden: true, votes: 1 }
// hidden member: good for backups, doesn't serve reads
]
});
// Check replica set status
rs.status();
// Check replication lag
rs.printSecondaryReplicationInfo();
Point-in-Time Restore with Oplog
MongoDB's oplog (operations log) is a capped collection that records all write operations. By combining a base backup with oplog replay, you can restore to any point in time within the oplog window.
# Restore base backup first
mongorestore \
--host="localhost:27017" \
--gzip \
/backups/mongo_20260327_0200
# Replay oplog entries up to (but not including) a specific timestamp
# Format: epochSeconds:ordinal (1774690980 = 2026-03-28 09:43:00 UTC)
# The dump directory must contain an oplog.bson at its top level
mongorestore \
--host="localhost:27017" \
--oplogReplay \
--oplogLimit="1774690980:1" \
/backups/mongo_oplog
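Finding the right cutoff usually means inspecting the oplog around the incident. A sketch in the mongo shell; the namespace yourdb.orders is a hypothetical example:
// Run against the local database on a replica set member
use local
db.oplog.rs.find({ ns: "yourdb.orders" }).sort({ ts: -1 }).limit(5)
// Each entry's ts field is the Timestamp(epochSeconds, ordinal) to use as the limit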
MongoDB Atlas Built-in DR
If you use MongoDB Atlas, cloud backup and point-in-time restore are built in. Atlas provides continuous cloud backups with configurable retention (up to 35 days) and restores to any second within the retention window — no infrastructure to manage.
For on-premises MongoDB, Ops Manager (included with MongoDB Enterprise) provides the same automated backup, PITR, and monitoring capabilities as Atlas, with full control over your infrastructure.
The DR Runbook Checklist
Whatever platform you use, every DR plan needs a documented runbook. Here's the minimum checklist:
- Define RTO and RPO for each database and application tier
- Document the restore procedure step by step — assume the person running it has never done it before
- Test restores quarterly to a non-production environment and measure actual RTO
- Verify backup integrity automatically after every backup job (CHECKSUM for SQL Server, pgBackRest verify for PostgreSQL)
- Store backups offsite — Azure Blob, AWS S3, or a geographically separate location
- Encrypt backups at rest and in transit — a stolen backup tape is a data breach
- Document the failover decision tree — who authorises a failover? Who executes it? Who communicates to stakeholders?
- Monitor replication lag continuously — alert when lag exceeds your RPO threshold
The most common DR failure is not a technical one — it's an organisational one. Backups exist, but nobody knows how to restore them. Document and drill the procedure, not just the infrastructure.