What Are Always On Availability Groups?
Introduced in SQL Server 2012, Always On Availability Groups (AG) is a high availability and disaster recovery feature that allows a set of databases to fail over together as a unit. It builds on Windows Server Failover Clustering (WSFC) but extends far beyond traditional failover clustering — you don't need shared storage, and secondaries can serve as readable replicas, offloading read workloads from the primary.
┌─────────────────────────────────────────────────────────┐
│ Windows Server Failover Cluster │
│ │
│ ┌──────────────────┐ ┌──────────────────────────┐ │
│ │ PRIMARY NODE │ │ SECONDARY NODE(S) │ │
│ │ │ │ │ │
│ │ SQL Server │ ───► │ SQL Server │ │
│ │ (Read/Write) │ │ (Read-Only / Standby) │ │
│ │ │ ◄─── │ │ │
│ │ ┌────────────┐ │ │ ┌────────────────────┐ │ │
│ │ │ DB_1 │ │ │ │ DB_1 (replica) │ │ │
│ │ │ DB_2 │ │ │ │ DB_2 (replica) │ │ │
│ │ │ DB_3 │ │ │ │ DB_3 (replica) │ │ │
│ │ └────────────┘ │ │ └────────────────────┘ │ │
│ └──────────────────┘ └──────────────────────────┘ │
│ │ │ │
│ └──────────┬─────────────────┘ │
│ │ │
│ AG LISTENER (Virtual IP) │
│ Applications connect here → │
└─────────────────────────────────────────────────────────┘
Prerequisites
Before configuring Always On AG, your environment must meet these requirements:
- SQL Server Enterprise Edition (or Standard for Basic AG — limited to 2 replicas, 1 database per AG)
- Windows Server Failover Cluster (WSFC) configured and healthy across all nodes
- All nodes in the same WSFC — can span different subnets for multi-site AG
- SQL Server service account with the same credentials (or Managed Service Accounts) on all nodes
- Databases in FULL recovery model — Simple and Bulk-Logged are not supported
- A full backup taken on the primary before adding to the AG
- All databases must have the same name on primary and secondary
- Endpoint port open (default 5022) between all SQL Server instances
SQL Server 2022 introduced Contained Availability Groups, which include their own system databases (logins, SQL Agent jobs, etc.), eliminating the need to manually synchronise these objects across replicas.
Step 1 — Configure the Windows Failover Cluster
Always On AG requires an underlying WSFC even if you don't use shared storage. All SQL Server nodes must be cluster members.
Install-WindowsFeature -Name Failover-Clustering -IncludeManagementTools
# Validate cluster configuration (run before creating)
Test-Cluster -Node "SQL-NODE1", "SQL-NODE2" -Include "Inventory", "Network", "System Configuration"
# Create the cluster (no shared storage needed for AG)
New-Cluster -Name "SQL-CLUSTER" `
-Node "SQL-NODE1", "SQL-NODE2" `
-StaticAddress "10.0.0.50" `
-NoStorage
Step 2 — Enable Always On on Each SQL Instance
Enable-SqlAlwaysOn -ServerInstance "SQL-NODE1" -Force Enable-SqlAlwaysOn -ServerInstance "SQL-NODE2" -Force # Restart SQL Server service after enabling Restart-Service -Name "MSSQLSERVER" -Force
Step 3 — Create Database Mirroring Endpoints
Each SQL Server instance needs a dedicated endpoint for AG traffic. This is the communication channel between replicas.
-- Create the endpoint (run on SQL-NODE1)
CREATE ENDPOINT [Hadr_endpoint]
STATE = STARTED
AS TCP (LISTENER_PORT = 5022)
FOR DATABASE_MIRRORING (
ROLE = ALL,
AUTHENTICATION = WINDOWS NEGOTIATE,
ENCRYPTION = REQUIRED ALGORITHM AES
);
-- Grant CONNECT permission to the SQL Server service account
GRANT CONNECT ON ENDPOINT::[Hadr_endpoint]
TO [DOMAIN\SQLServiceAccount];
Synchronisation Modes
Understanding synchronisation modes is critical — the wrong choice can mean data loss or performance degradation.
Synchronous Commit
- Transaction commits only after log is hardened on secondary
- Zero data loss (RPO = 0)
- Automatic failover supported
- Higher latency on primary writes
- Best for: same datacenter or low-latency links (<5ms RTT)
Asynchronous Commit
- Transaction commits without waiting for secondary
- Possible data loss (RPO > 0)
- Only manual failover (forced)
- No latency impact on primary
- Best for: DR sites, high-latency WAN links
A common mistake is configuring synchronous commit to a geographically distant secondary over a high-latency WAN link. Every write transaction pays the RTT penalty. If latency to the secondary is >5ms, use asynchronous commit for that replica.
Step 4 — Create the Availability Group
-- Take full backup first (required)
BACKUP DATABASE [YourDB]
TO DISK = '\\backup-share\YourDB_full.bak'
WITH FORMAT, COMPRESSION;
BACKUP LOG [YourDB]
TO DISK = '\\backup-share\YourDB_log.bak';
-- Create the Availability Group
CREATE AVAILABILITY GROUP [AG_Production]
WITH (
AUTOMATED_BACKUP_PREFERENCE = SECONDARY,
DB_FAILOVER = ON,
FAILURE_CONDITION_LEVEL = 3
)
FOR DATABASE [YourDB], [YourDB2]
REPLICA ON
N'SQL-NODE1' WITH (
ENDPOINT_URL = N'TCP://SQL-NODE1.domain.local:5022',
AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
FAILOVER_MODE = AUTOMATIC,
SEEDING_MODE = AUTOMATIC,
SECONDARY_ROLE (ALLOW_CONNECTIONS = NO)
),
N'SQL-NODE2' WITH (
ENDPOINT_URL = N'TCP://SQL-NODE2.domain.local:5022',
AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
FAILOVER_MODE = AUTOMATIC,
SEEDING_MODE = AUTOMATIC,
SECONDARY_ROLE (ALLOW_CONNECTIONS = READ_ONLY)
);
-- Join secondary to the AG (run on SQL-NODE2)
ALTER AVAILABILITY GROUP [AG_Production] JOIN;
ALTER AVAILABILITY GROUP [AG_Production] GRANT CREATE ANY DATABASE;
Step 5 — Configure the AG Listener
The listener is a virtual network name (VNN) and IP address that applications connect to. It abstracts the primary replica — after a failover, applications reconnect to the same listener and are automatically routed to the new primary.
ALTER AVAILABILITY GROUP [AG_Production]
ADD LISTENER N'AG-LISTENER' (
WITH IP (
(N'10.0.0.60', N'255.255.255.0') -- listener IP and subnet mask
),
PORT = 1433
);
-- Verify listener
SELECT dns_name, port, ip_configuration_string_from_cluster
FROM sys.availability_group_listeners;
Applications should connect to the listener name, not the individual server names. Include ApplicationIntent=ReadOnly in the connection string to route read workloads to the secondary replica automatically.
Step 6 — Monitor AG Health
-- Overall AG health
SELECT
ag.name AS ag_name,
ar.replica_server_name,
ars.role_desc,
ars.operational_state_desc,
ars.synchronization_health_desc,
ars.connected_state_desc
FROM sys.availability_groups ag
JOIN sys.availability_replicas ar ON ag.group_id = ar.group_id
JOIN sys.dm_hadr_availability_replica_states ars ON ar.replica_id = ars.replica_id;
-- Redo queue and send queue (lag indicators)
SELECT
db_name(drs.database_id) AS database_name,
ar.replica_server_name,
drs.synchronization_state_desc,
drs.log_send_queue_size AS send_queue_kb,
drs.redo_queue_size AS redo_queue_kb,
drs.log_send_rate AS send_rate_kb_s,
drs.redo_rate AS redo_rate_kb_s,
drs.last_commit_time
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar ON drs.replica_id = ar.replica_id
ORDER BY drs.redo_queue_size DESC;
Failover Types
Automatic Failover
Occurs when the primary becomes unavailable and a synchronous secondary detects the failure via the WSFC health check. No data loss. Requires synchronous commit mode and automatic failover mode configured on both replicas.
Manual Planned Failover
Used for maintenance windows — patch the primary, failover to secondary, patch the old primary. Zero data loss.
-- Verify secondary is synchronised first SELECT synchronization_state_desc FROM sys.dm_hadr_database_replica_states WHERE is_local = 1; -- Perform planned failover (run on the secondary you want to promote) ALTER AVAILABILITY GROUP [AG_Production] FAILOVER;
Forced Failover (DR)
Used when the primary is unrecoverable and the secondary is asynchronous. May result in data loss. Use only when no other option is available.
-- Only use when primary is permanently lost
ALTER AVAILABILITY GROUP [AG_Production]
FORCE_FAILOVER_ALLOW_DATA_LOSS;
After a forced failover, the old primary (if it comes back online) will be in a SUSPENDED state. Do not bring it back into the AG until you've verified data consistency and resolved any divergence between the two nodes.
Common Pitfalls to Avoid
- Forgetting to synchronise logins — logins exist at the instance level, not database level. Script them regularly and apply to secondaries, or use Contained AGs in SQL Server 2022
- Orphaned jobs on failover — SQL Agent jobs don't fail over with the AG. Maintain identical jobs on all replicas, enabled/disabled based on role
- TEMPDB not on local disk — TEMPDB must be on local storage on each node, never on shared or replicated storage
- Missing DNS registration — the AG listener requires a computer object in Active Directory. Ensure the cluster service account has permission to create it
- Not testing failover — schedule quarterly planned failover drills to ensure the process works before you need it in a real incident
Summary
Always On Availability Groups is the most robust HA/DR solution for SQL Server, but it requires careful planning — from WSFC configuration to choosing the right synchronisation mode for each replica. With the architecture correctly in place, you gain automatic failover, readable secondaries, and a solid foundation for zero-downtime maintenance.
Need help designing or implementing an Always On cluster for your environment? Contact our SQL Server team — we've designed and deployed AG clusters from 2-node setups to multi-site enterprise configurations.