Lease Timeouts and Health Checks in SQL

Daniel AG by Daniel A G

Lease Timeouts and Health Checks in SQL Server Always On Availability Groups


In the article for SQL Server Always On Availability Groups, we are going to explore lease timeouts and health checks.

Introduction

SQL Server Always On Availability Groups provides resilience for high availability and disaster recovery solution in a multi-node architecture. Suppose you have an availability group configured for automatic failover. In this synchronized availability group, we can have the following types of failover.

  • Automatic
  • Manual
  • You can do forced failover with possible data loss

In the case of automatic failover, the availability group fails over to a synchronized secondary replica without any data loss. However, an automatic failover requires the satisfied conditions defined in the flexible failover policy.

The flexible failover policy depends on the Health-check timeout thresholdFailure-Condition Level and Cluster timeouts.

It is an important aspect to determine the auto-failover causes in a production database environment. The logs are useful factors for monitoring and investigating the failures.

  • SQL Server error logs
  • Windows cluster logs
  • Cluster event logs
  • SQL Server failover diagnostics (sp_server_diagnostics) logs
  • AlwaysOn_Health extended events output
  • System_health extended events
  • Application and system logs

The availability group feature requires an underlying Windows Failover Clusters. The cluster requires cluster services on all participating nodes.

The resource host acts as an interface between the cluster resource and service. SQL Server uses resource DLLs for the resource monitor. These DLLs are responsible for health monitoring and offline-online resource management.

In the below image, we see two levels of monitoring in an AG configuration.

  • Looks-Alive
  • Is-Alive

Looks-Alive mechanism

SQL Server Always On Availability Groups levels of monitoring

In a Windows failover cluster, the cluster service continuously transmits heartbeat to other cluster nodes and waits for the acknowledgment. If it does not receive a heartbeat acknowledgment after a series of heartbeats, it assumes that the corresponding node is down. Subsequently, it broadcasts a message to other cluster nodes, and a node takes over the cluster ownership based on the quorum votes and majority configuration. If due to any reason, the cluster could not determine the quorum majority, all nodes in a cluster go into the resolving state and all cluster resources are taken offline. This process is known as Looks-Alive and runs every 5 seconds. The Windows failover cluster uses the Looks-Alive mechanism for an integration between the cluster resource host and SQL process. On the primary replica, SQL Server uses a dedicated lease worker thread.

dedicated lease worker thread

Go to Windows Failover Cluster and right-click on the cluster resource. In the properties, you get configured values for LeaseTimeout. By default, it is set to 20000 milliseconds (20 seconds).

Lease Timeout

The lease worker thread and resource host work circularly, as shown in the below image.

lease worker thread and resource host

In the SQL Server logs on the primary replica, you get a message from the lease worker. The lease worker thread and resource host maintains a time-to-leave (TTL), and it gets updated each time threads wait up after a  signal from another resource.

SQL Server Always On Availability Groups SQL Server logs

If the lease timeout period elapsed while waiting for the signal, the AG resource goes into the resolving state. In the case of AG failover, SQL Server logs another entry to stop the lease renewal.

lease timeout period

Is-Alive mechanism

The SQL Server Always on resource DLL uses the sp_server_diagnostics stored procedure for the health of SQL Service. The stored procedure reports the status for the system, resource, IO subsystem, query processing and events. It uses a FAILURE_CONDITION_LEVEL for implementing defining a condition in which automatic failover can happen.

Failover Condition Level

Description

1

Automatic failover occurred due to server down or lease expiry.

2

No data returned from sp_server_diagnostics. It can be related to the hung server issue. The automatic failover can happen due to SQL Server is not connected to the cluster or due to user-defined health check timeout.

3 (Default level)

System error ( stack dumps, orphaned spinlocks, serious write-access)

4

Resource error (For example- persistent out-of-memory condition)

5

Query processing (for example – Scheduler deadlock or unresolvable deadlock)

A failover condition encompasses conditions lower than configured conditions as well. For example, the default condition 3 also uses condition levels 1 and 2.

You can use the ALTER AVAILABILITY GROUP to set a specific failover condition level.

But how long resource DLL with wait for the output from the stored procedure sp_server_diagnostics? The resource DLL uses the timeout properties for this purpose. By default, the timeout is 30,000 milliseconds, i.e. 30 seconds. You can set the value for a timeout from 15 seconds to 4294967 seconds.

You can modify the health check timeout from the following ALTER AVAILABILITY GROUP command.

We can query sys.dm_xe_session to check the extended events for the sp_server_diagnostics session.

Is-Alive mechanism

The output of these events is stored in the default log directory. For example, in my environment, these files are available in the C:\Program Files\Microsoft SQL Server\MSSQL15.INST1\MSSQL\Log directory.

default log directory

Lease Time out in SQL Server Always On Availability Groups

As shown below, the lease timeout is 20 seconds. By default, the Windows failover cluster renewals the lease every 5 seconds, i.e. ¼ time of the lease timeout.

SQL Server Always On Availability Groups Lease Timeout

If lease timeout occurs ( >20 seconds), the rhs.exe reports an error in the Windows cluster, and it starts taking preventive action.

You can get the following error messages for a least timeout issues:

error messages

SQL Server Always On Availability Groups Health Check timeout

SQL Server Always On Availability Group performs a health check of the primary replica using the sp_server_diagnostics stored procedure. The sp_server_diagnostics executes every 10 seconds. The health check timeout is 30 seconds.

Therefore, the stored procedure returns the result on 1/3 * health checks current threshold. If the SP does not return any results, AG refers to the previous state for determining the instance health until the health-check timeout threshold. The next Is-Alive determines that the primary replica is unresponsive, and it initiates the automatic failover.

  • Note:
    • The sp_server_diagnostics does not perform health checks at the database level
    • It does not consider DB level health detection until we explicitly enabled it. 

WSFC configuration and cluster timeouts for SQL Server Always On Availability Groups

In a Windows failover cluster, the following parameters control the cluster timeout values:

  • SameSubnetDelay
  • SameSubnetThreshold
  • CrossSubnetDelay
  • CrossSubnetThreshold

You can check these parameter values using the Windows PowerShell cmdlet get-cluster().

WSFC configuration and cluster timeouts

In my lab environment, the servers are located in the same subnets. Therefore, the following table shows the default values for Windows Server 2012 and 2016.

Parameter

Windows Server 2012 R2

Windows Server 2016

SameSubnetDelay

1 second

1 Second

SameSubnetThreshold

5 heartbeats

10 heartbeats

The SameSubnetDelay defines the wait time between the heartbeats. It is 1 second for both Windows Server 2012 and 2016.

The SameSubnetThreshold defines the number of missing heartbeats a cluster can tolerate, and if it does not receive the acknowledgement, the target node is declared dead.

Recap of LeaseTimeout, Session-Timeout and Health check timeouts

  • Lease timeout:
    • Default value: 20000 milliseconds, i.e. 20 seconds
    • It is required to prevent a split-brain scenario in the Windows failover cluster
    • It can trigger an AG failover or offline-online
    • It is used in both Is-Alive and the Looks-alive mechanism
  • Session Timeout:
    • It is used to safeguard against soft errors between AG replicas
    • default value: 10000 milliseconds or 10 seconds
    • It is not part of the Is-Alive or the Looks-alive mechanism
    • Secondary replica becomes DISCONNECTED status due to session timeout
    • You can configure the session timeouts in the availability group properties from the primary replica instance
  • Health Check timeout
    • Default value: 30000 milliseconds, i.e. 30 seconds
    • It is the timeout if the sp_server_diagnostics does not return any data or reports errors in the health check
    • It is used in both Is-Alive and the Looks-alive mechanism
    • It also depends on the failover condition levels defined from 1-5, as explained earlier. The default configuration is level 3

Conclusion

In this article, we figured SQL Server Always On Availability Groups flexible failover policy using the lease, health check timeout and Failover Condition Levels. You should review the logs to determine the scenarios where the automatic failover occurred or failed.



Statistics Says It All

16

Years of Experience

3000

Gratified Students

100

Training Batches

9600

Training Hours


Subscribe

Please subscribe our technical blog to get recent updates.

Empire Data Systems

Social Links