Bug 2224249 - Manual and Automatic failover issues in RHEL 9
Summary: Manual and Automatic failover issues in RHEL 9
Keywords:
Status: NEW
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: pacemaker
Version: 9.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Ken Gaillot
QA Contact: cluster-qe
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-07-20 09:21 UTC by Aravind Mahadevan
Modified: 2023-08-16 19:41 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Bug
Target Upstream Version:
Embargoed:


Attachments
screenshot (246.21 KB, image/png)
2023-07-20 09:21 UTC, Aravind Mahadevan
pacemaker logs (1.70 MB, application/zip)
2023-07-20 09:24 UTC, Aravind Mahadevan
rhel9manualfailover (108.68 KB, image/png)
2023-07-27 16:21 UTC, Aravind Mahadevan
rhel9manualfailover (250.80 KB, image/png)
2023-07-27 16:21 UTC, Aravind Mahadevan
LatestLogs_post_call_with_MS&Rhel (208.85 KB, text/plain)
2023-08-02 18:22 UTC, Aravind Mahadevan
LatestLogs_post_call_with_MS&Rhel (324.99 KB, text/plain)
2023-08-02 18:23 UTC, Aravind Mahadevan
LatestLogs_post_call_with_MS&Rhel (301.18 KB, text/plain)
2023-08-02 18:23 UTC, Aravind Mahadevan


Links
System: Red Hat Issue Tracker
ID: RHELPLAN-162834
Private: 0
Priority: None
Status: None
Summary: None
Last Updated: 2023-07-20 09:26:14 UTC

Description Aravind Mahadevan 2023-07-20 09:21:59 UTC
Created attachment 1976688 [details]
screenshot

Description of problem:
We work with the MS engineering team to run HA tests for SQL Server on RHEL 9.
We have observed that automatic failover fails on RHEL 9.

We have an upcoming CU release and want to fix this before then, so that we can prepare the internal release pipelines.

Version-Release number of selected component (if applicable):
RHEL 9. I think the pacemaker version is 0.11, but I am not sure.

How reproducible:
Set up a SQL Server HA configuration on RHEL 9 and shut down the primary to reproduce the automatic failover failure.

Actual results:

Manual failover passed, but automatic failover failed. We waited more than half an hour for the master switch to happen, but it did not take place.

Attaching logs and a screenshot for reference.

Expected results:
Auto failover should happen without issues.

Additional info:

Comment 1 Aravind Mahadevan 2023-07-20 09:24:37 UTC
Created attachment 1976689 [details]
pacemaker logs

Comment 2 Ken Gaillot 2023-07-24 20:15:37 UTC
Hi,

Looking at the "second_run" logs, the problem is that start and promote actions repeatedly fail.

As an aside, it is not necessary to set a low cluster-recheck-interval. Many years ago, failure timeouts were only guaranteed to be checked as often as the recheck interval, so it was customary to set them similarly. However, with modern Pacemaker, failure timeouts expire at their exact time, regardless of cluster-recheck-interval.

As another aside, it is necessary to enable fencing. Pacemaker is unable to recover from certain failures, such as a stop failure, without fencing, so even in a test cluster it is worth configuring fencing to simulate production behavior.

Interesting logs:

    Jun 30 02:51:11.015 rhel-9-3 pacemaker-schedulerd[1198] (remap_operation)       info: Probe found agCluster:0 active and promoted on rhel-9-1 at Jun 30 02:51:04 2023

^ When the resource is first added to the cluster, it is found to be already running. That is not a problem, and Pacemaker will "adopt" the existing instance.

    Jun 30 02:51:22.814 rhel-9-3 pacemaker-schedulerd[1198] (rsc_action_default)    info: Leave   agCluster:0       (Unpromoted rhel-9-2)
    Jun 30 02:51:22.814 rhel-9-3 pacemaker-schedulerd[1198] (rsc_action_default)    info: Leave   agCluster:1       (Promoted rhel-9-1)
    Jun 30 02:51:22.814 rhel-9-3 pacemaker-schedulerd[1198] (rsc_action_default)    info: Leave   agCluster:2       (Unpromoted rhel-9-3)
    Jun 30 02:51:22.815 rhel-9-3 pacemaker-schedulerd[1198] (pcmk__log_transition_summary)  notice: Calculated transition 6, saving inputs in /var/lib/pacemaker/pengine/pe-input-6.bz2

^ At this point, right after the resource was added, agCluster is happy, and promoted on rhel-9-1.

    Jun 30 04:03:02  ag(agCluster)[52851]:    INFO: monitor: ERROR: 2023/06/30 04:03:02 Instance is unhealthy: status 3 is at or below monitor policy 3

^ The agent on rhel-9-2 reports a failed monitor (and sets the failed node's promotion score to -INFINITY). At this point, the cluster responds appropriately, wanting to recover the instance on rhel-9-2 (the failed node) to the unpromoted role, and promote the instance on rhel-9-1.

    Jun 30 04:03:24  ag(agCluster)[53220]:    ERROR: SQL Server isn't running.

^ The agent on rhel-9-2 reports that the resource failed to restart. The cluster responds by stopping the instance there and leaving it stopped.

    Jun 30 04:03:52  ag(agCluster)[56097]:    INFO: promote: 2023/06/30 04:03:51 Failed action promote: One or more DBs are unsynchronized or not joined to the AG: Error 41142, State 17: mssql: The availability replica for availability group 'ag1' on this instance of SQL Server cannot become the primary replica. One or more databases are not synchronized or have not joined the availability group. If the availability replica uses the asynchronous-commit mode, consider performing a forced manual failover (with possible data loss). Otherwise, once all local secondary databases are joined and synchronized, you can perform a planned manual failover to this secondary replica (without data loss). For more information, see SQL Server Books Online.

^ The agent on rhel-9-1 reports that the promotion attempt failed there. The cluster responds by wanting to restart the instance on rhel-9-1 to the unpromoted role, leaving the instance on rhel-9-2 stopped, and promoting the instance on rhel-9-3.

    Jun 30 04:04:36  ag(agCluster)[59101]:    INFO: promote: 2023/06/30 04:04:36 Failed action promote: One or more DBs are unsynchronized or not joined to the AG: Error 41142, State 34: mssql: The availability replica for availability group 'ag1' on this instance of SQL Server cannot become the primary replica. One or more databases are not synchronized or have not joined the availability group. If the availability replica uses the asynchronous-commit mode, consider performing a forced manual failover (with possible data loss). Otherwise, once all local secondary databases are joined and synchronized, you can perform a planned manual failover to this secondary replica (without data loss). For more information, see SQL Server Books Online.

^ The promotion fails on rhel-9-3 as well. Due to a low failure-timeout of 60s, the various failures quickly expire, and the same sequence repeats endlessly. (Usually failure-timeout is much higher, but that is not a problem.)

To sum up, Pacemaker is responding appropriately to what the agent is telling it. The agent is unable to successfully start or promote the resource after the first failure.
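
The sequence of agent failures described above can be pulled out of the agent log lines mechanically. The following is only a minimal Python sketch, not part of Pacemaker or the ag agent: the regular expression is inferred from the excerpts quoted in this comment, the category names and function names are made up for illustration, and the sample messages are shortened copies of those excerpts. It assumes two-digit days, as in the quoted lines.

```python
import re

# Assumed layout of the agent lines quoted above:
#   "<Mon DD HH:MM:SS>  ag(<resource>)[<pid>]:    <LEVEL>: <message>"
AGENT_LINE = re.compile(
    r"^(?P<stamp>\w{3} \d{2} \d{2}:\d{2}:\d{2})\s+"
    r"ag\((?P<resource>[^)]+)\)\[(?P<pid>\d+)\]:\s+"
    r"(?P<level>INFO|ERROR):\s+(?P<message>.*)$"
)

# Hypothetical category names, keyed by substrings seen in the excerpts.
CATEGORIES = [
    ("Instance is unhealthy", "monitor-unhealthy"),
    ("SQL Server isn't running", "not-running"),
    ("Failed action promote", "promote-failed"),
]


def classify(message):
    """Map an agent message to one of the illustrative categories."""
    for needle, category in CATEGORIES:
        if needle in message:
            return category
    return "other"


def timeline(lines):
    """Return (timestamp, pid, category) for every line that parses."""
    events = []
    for line in lines:
        match = AGENT_LINE.match(line.strip())
        if match:
            events.append(
                (
                    match.group("stamp"),
                    int(match.group("pid")),
                    classify(match.group("message")),
                )
            )
    return events


# Shortened copies of the agent lines quoted in this comment.
SAMPLE = [
    "Jun 30 04:03:02  ag(agCluster)[52851]:    INFO: monitor: ERROR: "
    "2023/06/30 04:03:02 Instance is unhealthy: status 3 is at or below "
    "monitor policy 3",
    "Jun 30 04:03:24  ag(agCluster)[53220]:    ERROR: SQL Server isn't running.",
    "Jun 30 04:03:52  ag(agCluster)[56097]:    INFO: promote: 2023/06/30 "
    "04:03:51 Failed action promote: One or more DBs are unsynchronized",
    "Jun 30 04:04:36  ag(agCluster)[59101]:    INFO: promote: 2023/06/30 "
    "04:04:36 Failed action promote: One or more DBs are unsynchronized",
]

if __name__ == "__main__":
    for event in timeline(SAMPLE):
        print(event)
```

Run against a full agent log, a listing like this would make the repeating monitor, restart, and promote failures easier to spot. The agent lines do not carry the node name, so that would still have to come from whichever node's log is being read.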

Comment 3 Aravind Mahadevan 2023-07-27 16:20:21 UTC
Thanks Ken for the update.
Sorry to change the topic of the issue. I have learned that manual failover also has issues on RHEL 9.

I'd like to add the details here, for your reference.

Comment 4 Aravind Mahadevan 2023-07-27 16:21:10 UTC
Created attachment 1980299 [details]
rhel9manualfailover

Comment 5 Aravind Mahadevan 2023-07-27 16:21:32 UTC
Created attachment 1980300 [details]
rhel9manualfailover

Comment 6 Aravind Mahadevan 2023-07-28 08:35:52 UTC
@kgaillot, I'd like to request a call between you, me, and our engineering dev Yunxi Jia, as there is some confusion and disagreement about this ticket as well as https://bugzilla.redhat.com/show_bug.cgi?id=2221772 . Since you are handling both, please let us know whether we could have a call this coming Monday or Tuesday, and share your preferred time slots and working time zone so that we can schedule it accordingly. Without a call, things would be delayed and we would not be able to finish HA testing for our upcoming SQL on Linux CU release.

Comment 7 Aravind Mahadevan 2023-08-02 18:21:51 UTC
Sharing logs of the latest run:

Comment 8 Aravind Mahadevan 2023-08-02 18:22:59 UTC
Created attachment 1981341 [details]
LatestLogs_post_call_with_MS&Rhel

Comment 9 Aravind Mahadevan 2023-08-02 18:23:27 UTC
Created attachment 1981342 [details]
LatestLogs_post_call_with_MS&Rhel

Comment 10 Aravind Mahadevan 2023-08-02 18:23:56 UTC
Created attachment 1981343 [details]
LatestLogs_post_call_with_MS&Rhel

Comment 11 Aravind Mahadevan 2023-08-03 18:24:19 UTC
@kgaillot: Do you have any insights from the latest attached logs, correlated with the agent code that was shared?

Comment 12 Aravind Mahadevan 2023-08-03 18:39:33 UTC
Also, do you have recommended steps for configuring an AG setup with SQL Server on RHEL 9?

https://learn.microsoft.com/en-us/sql/linux/sql-server-linux-availability-group-cluster-pacemaker?view=sql-server-ver16&tabs=rhel
We follow the steps at the link above. The agent code is at https://github.com/microsoft/mssql-server-ha/blob/master/ag/ag

Comment 13 Ken Gaillot 2023-08-03 22:53:48 UTC
(In reply to Aravind Mahadevan from comment #11)
> @kgaillot : Would you be having any insights from the latest logs
> attached correlating it with the agent code that was shared ?

I wasn't able to get that far this week but will pick it up again on Monday.

The agent definitely won't work on RHEL 9 without the "Promoted" change. I saw in the new logs:

    Aug 01 11:03:47.289 rhel91 pacemaker-execd     [3734570] (log_op_output)        info: ag_cluster_monitor_0[3734822] error output [ resource ag_cluster is NOT running ]
    Aug 01 11:03:47.289 rhel91 pacemaker-controld  [3734573] (log_executor_event)   notice: Result of probe operation for ag_cluster on rhel91: promoted | CIB update 15, graph action confirmed; call=6 key=ag_cluster_monitor_0 rc=8

which I thought was odd -- the agent's probe action output that it's not running, but returned promoted? I suspect I'm misunderstanding what the output means.
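
For reference when reading the rc= field in the second log line above: in the OCF resource agent convention that Pacemaker uses, exit code 7 means "not running" and exit code 8 means "running in the promoted role", which is why the probe result is reported as promoted. The Python sketch below only illustrates that lookup. The table covers the standard codes 0 to 9 under their current Pacemaker names, and the helper name and the regular expression are assumptions made for this example.

```python
import re

# Standard OCF resource agent exit codes 0-9 (current Pacemaker naming).
OCF_RETURN_CODES = {
    0: "OCF_SUCCESS",
    1: "OCF_ERR_GENERIC",
    2: "OCF_ERR_ARGS",
    3: "OCF_ERR_UNIMPLEMENTED",
    4: "OCF_ERR_PERM",
    5: "OCF_ERR_INSTALLED",
    6: "OCF_ERR_CONFIGURED",
    7: "OCF_NOT_RUNNING",
    8: "OCF_RUNNING_PROMOTED",
    9: "OCF_FAILED_PROMOTED",
}

# Assumed form of the field as it appears in the controller log line above.
RC_FIELD = re.compile(r"\brc=(\d+)\b")


def describe_rc(log_line):
    """Return (code, name) for the rc= field in a log line, or None."""
    match = RC_FIELD.search(log_line)
    if match is None:
        return None
    code = int(match.group(1))
    return code, OCF_RETURN_CODES.get(code, "unknown")


# Tail of the pacemaker-controld line quoted above.
LINE = (
    "Result of probe operation for ag_cluster on rhel91: promoted "
    "| CIB update 15, graph action confirmed; call=6 "
    "key=ag_cluster_monitor_0 rc=8"
)

if __name__ == "__main__":
    print(describe_rc(LINE))
```

This explains only the reported result. Whether the agent should have returned 8 while also printing "NOT running" is a question about the agent's probe logic that the sketch does not answer.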

> and would you have the recommended steps to configure RHEL 9 AG setup with SQL Server .. ? 
> https://learn.microsoft.com/en-us/sql/linux/sql-server-linux-availability-group-cluster-pacemaker?view=sql-server-ver16&tabs=rhel 
> The steps we follow are above and agent code : https://github.com/microsoft/mssql-server-ha/blob/master/ag/ag

I believe the steps are identical for RHEL 8 and 9. There's no need to change cluster-recheck-interval on RHEL 8 or 9 (though it's useful for RHEL 7). Note that the "Clusters from Scratch" link goes to the 1.1 version, whereas https://clusterlabs.org/pacemaker/doc/2.1/Clusters_from_Scratch/html/ would be for RHEL 8 and 9.

Comment 14 Ken Gaillot 2023-08-07 21:47:16 UTC
(In reply to Aravind Mahadevan from comment #11)
> @kgaillot : Would you be having any insights from the latest logs
> attached correlating it with the agent code that was shared ?

I didn't see any failover or errors in the latest logs.

I suspect the (Master|Promoted) change in the agent is the main thing needed for failover to work, and may even be sufficient by itself.
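
To illustrate the idea behind a (Master|Promoted) pattern: the RHEL 9 scheduler logs earlier in this bug use the role names Promoted and Unpromoted, so any check that looks only for "Master" would not match them. The following is a minimal Python sketch of a role check that accepts either name. It is not the agent's actual code (the agent is a shell script), the function name is hypothetical, and the "Master" sample line is an invented older-style line shown only for comparison.

```python
import re

# Accept both the older role name and the one used in the RHEL 9 logs.
# The match is case-sensitive, so "Unpromoted" does not match "Promoted".
PROMOTED_ROLE = re.compile(r"\b(?:Master|Promoted)\b")


def is_promoted(status_line):
    """Return True if the line names the promoted role under either name."""
    return PROMOTED_ROLE.search(status_line) is not None


# The first three lines follow the scheduler log format quoted in comment 2.
# The last line is an invented older-style line for comparison.
SAMPLE = [
    "Leave   agCluster:0       (Unpromoted rhel-9-2)",
    "Leave   agCluster:1       (Promoted rhel-9-1)",
    "Leave   agCluster:2       (Unpromoted rhel-9-3)",
    "Leave   agCluster:1       (Master rhel-9-1)",
]

if __name__ == "__main__":
    for line in SAMPLE:
        print(is_promoted(line), line)
```

A pattern of this shape lets the same check work with either naming scheme. Which output the agent actually parses, and where the pattern would need to change, has to be confirmed against the agent source.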

Comment 15 Ken Gaillot 2023-08-07 21:57:19 UTC
(In reply to Aravind Mahadevan from comment #12)
> and would you have the recommended steps to configure RHEL 9 AG setup with
> SQL Server .. ? 
> 
> https://learn.microsoft.com/en-us/sql/linux/sql-server-linux-availability-
> group-cluster-pacemaker?view=sql-server-ver16&tabs=rhel 
> The steps we follow are above and agent code :
> https://github.com/microsoft/mssql-server-ha/blob/master/ag/ag

BTW, I don't think that page needs the bias warning anymore. I don't see any usage that would require it, unless you need it for links to RHEL or upstream documentation that might.

