Bug 2224249
| Summary: | Manual and Automatic failover issues in RHEL 9 | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Aravind Mahadevan <armaha> |
| Component: | pacemaker | Assignee: | Ken Gaillot <kgaillot> |
| Status: | NEW --- | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 9.0 | CC: | amitkh, cluster-maint, dyeisley, kgaillot |
| Target Milestone: | rc | Keywords: | Triaged |
| Target Release: | --- | | |
| Hardware: | Unspecified | OS: | Unspecified |
| Type: | Bug | Regression: | --- |
Created attachment 1976689 [details]
pacemaker logs
Hi,
Looking at the "second_run" logs, the problem is that start and promote actions repeatedly fail.
As an aside, it is not necessary to set a low cluster-recheck-interval. Many years ago, failure timeouts were only guaranteed to be checked as often as the recheck interval, so it was customary to set the two similarly. With modern Pacemaker, however, failure timeouts expire at their exact time, regardless of cluster-recheck-interval.
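As an illustration (standard pcs syntax; the agCluster resource name comes from the logs, and the values are only examples), a failure timeout is set as a resource meta-attribute, while cluster-recheck-interval can stay at its default:

```shell
# Set how long failures are retained before expiring (resource meta-attribute)
pcs resource update agCluster meta failure-timeout=300s

# The cluster-wide recheck interval can remain at its default (15 minutes);
# modern Pacemaker expires failure timeouts on time regardless of this value.
pcs property set cluster-recheck-interval=15min
```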
As another aside, it is necessary to enable fencing. Pacemaker is unable to recover from certain failures, such as a stop failure, without fencing, so even in a test cluster it is worth configuring fencing to simulate production behavior.
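For example, a minimal fencing setup might look like the following sketch, assuming IPMI-capable nodes; the device name, address, and credentials here are placeholders, not values from this cluster:

```shell
# Create a fence device for one node (repeat per node with its own BMC address)
pcs stonith create fence-rhel-9-1 fence_ipmilan \
    pcmk_host_list=rhel-9-1 ip=192.0.2.1 \
    username=admin password=secret lanplus=1

# Make sure fencing is actually enabled cluster-wide
pcs property set stonith-enabled=true
```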
Interesting logs:
Jun 30 02:51:11.015 rhel-9-3 pacemaker-schedulerd[1198] (remap_operation) info: Probe found agCluster:0 active and promoted on rhel-9-1 at Jun 30 02:51:04 2023
^ When the resource is first added to the cluster, it is found to be already running. That is not a problem, and Pacemaker will "adopt" the existing instance.
Jun 30 02:51:22.814 rhel-9-3 pacemaker-schedulerd[1198] (rsc_action_default) info: Leave agCluster:0 (Unpromoted rhel-9-2)
Jun 30 02:51:22.814 rhel-9-3 pacemaker-schedulerd[1198] (rsc_action_default) info: Leave agCluster:1 (Promoted rhel-9-1)
Jun 30 02:51:22.814 rhel-9-3 pacemaker-schedulerd[1198] (rsc_action_default) info: Leave agCluster:2 (Unpromoted rhel-9-3)
Jun 30 02:51:22.815 rhel-9-3 pacemaker-schedulerd[1198] (pcmk__log_transition_summary) notice: Calculated transition 6, saving inputs in /var/lib/pacemaker/pengine/pe-input-6.bz2
^ At this point, right after the resource was added, agCluster is happy, and promoted on rhel-9-1.
Jun 30 04:03:02 ag(agCluster)[52851]: INFO: monitor: ERROR: 2023/06/30 04:03:02 Instance is unhealthy: status 3 is at or below monitor policy 3
^ The agent on rhel-9-2 reports a failed monitor (and sets the failed node's promotion score to -INFINITY). At this point, the cluster responds appropriately, wanting to recover the instance on rhel-9-2 (the failed node) to the unpromoted role, and promote the instance on rhel-9-1.
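For reference, OCF promotable agents typically report such a failure by lowering the node's promotion score; a sketch of the usual pattern (the attribute name follows the conventional master-&lt;resource&gt; naming and is assumed here, not taken from the mssql agent):

```shell
# Inside the agent, on a monitor failure, ban promotion on the local node:
crm_master -l reboot -v -INFINITY
# crm_master is a thin wrapper that sets a per-node attribute
# (conventionally "master-agCluster" for this resource), which the
# scheduler reads as the node's promotion score.
```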
Jun 30 04:03:24 ag(agCluster)[53220]: ERROR: SQL Server isn't running.
^ The agent on rhel-9-2 reports that the resource failed to restart. The cluster responds by stopping the instance there and leaving it stopped.
Jun 30 04:03:52 ag(agCluster)[56097]: INFO: promote: 2023/06/30 04:03:51 Failed action promote: One or more DBs are unsynchronized or not joined to the AG: Error 41142, State 17: mssql: The availability replica for availability group 'ag1' on this instance of SQL Server cannot become the primary replica. One or more databases are not synchronized or have not joined the availability group. If the availability replica uses the asynchronous-commit mode, consider performing a forced manual failover (with possible data loss). Otherwise, once all local secondary databases are joined and synchronized, you can perform a planned manual failover to this secondary replica (without data loss). For more information, see SQL Server Books Online.
^ The agent on rhel-9-1 reports that the promotion attempt failed there. The cluster responds by wanting to restart the instance on rhel-9-1 to the unpromoted role, leaving the instance on rhel-9-2 stopped, and promoting the instance on rhel-9-3.
Jun 30 04:04:36 ag(agCluster)[59101]: INFO: promote: 2023/06/30 04:04:36 Failed action promote: One or more DBs are unsynchronized or not joined to the AG: Error 41142, State 34: mssql: The availability replica for availability group 'ag1' on this instance of SQL Server cannot become the primary replica. One or more databases are not synchronized or have not joined the availability group. If the availability replica uses the asynchronous-commit mode, consider performing a forced manual failover (with possible data loss). Otherwise, once all local secondary databases are joined and synchronized, you can perform a planned manual failover to this secondary replica (without data loss). For more information, see SQL Server Books Online.
^ The promotion fails on rhel-9-3 as well. Due to the low failure-timeout of 60s, the various failures quickly expire, and the same sequence repeats endlessly. (failure-timeout is usually set much higher, but the low value is not itself a problem.)
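When debugging a loop like this, the fail counts that drive the failure-timeout logic can be inspected and cleared with standard tools (resource name taken from the logs):

```shell
# Query the current fail count for the resource on this node
crm_failcount --query --resource agCluster

# Or list fail counts via pcs
pcs resource failcount show agCluster

# Once the underlying agent problem is fixed, clear the history manually
pcs resource cleanup agCluster
```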
To sum up, Pacemaker is responding appropriately to what the agent is telling it. The agent is unable to successfully start or promote the resource after the first failure.
Thanks, Ken, for the update. Sorry to change the topic of the issue: I've learned that manual failover also has issues in RHEL 9, and I'd like to add the details here for your reference.

Created attachment 1980299 [details]
rhel9manualfailover
Created attachment 1980300 [details]
rhel9manualfailover
@kgaillot, I'd like to request a call between yourself, me, and our engineering developer Yunxi Jia, as there is some confusion and disagreement about this ticket as well as https://bugzilla.redhat.com/show_bug.cgi?id=2221772 . Since both are handled by you, please let us know whether we could have a call this coming Monday or Tuesday, and share your preferred time slots and working timezone so that we can schedule accordingly. Without a call, things would be delayed and we would not be able to finish our HA testing for our upcoming SQL on Linux CU release.

Sharing logs of the latest run:

Created attachment 1981341 [details]
LatestLogs_post_call_with_MS&Rhel
Created attachment 1981342 [details]
LatestLogs_post_call_with_MS&Rhel
Created attachment 1981343 [details]
LatestLogs_post_call_with_MS&Rhel
@kgaillot: Would you have any insights from the latest logs attached, correlating them with the agent code that was shared? And would you have the recommended steps to configure a RHEL 9 AG setup with SQL Server? The steps we follow are at https://learn.microsoft.com/en-us/sql/linux/sql-server-linux-availability-group-cluster-pacemaker?view=sql-server-ver16&tabs=rhel and the agent code is https://github.com/microsoft/mssql-server-ha/blob/master/ag/ag

(In reply to Aravind Mahadevan from comment #11)
> @kgaillot: Would you have any insights from the latest logs attached, correlating them with the agent code that was shared?

I wasn't able to get that far this week but will pick it up again on Monday. The agent definitely won't work on RHEL 9 without the "Promoted" change.

I saw in the new logs:

Aug 01 11:03:47.289 rhel91 pacemaker-execd [3734570] (log_op_output) info: ag_cluster_monitor_0[3734822] error output [ resource ag_cluster is NOT running ]
Aug 01 11:03:47.289 rhel91 pacemaker-controld [3734573] (log_executor_event) notice: Result of probe operation for ag_cluster on rhel91: promoted | CIB update 15, graph action confirmed; call=6 key=ag_cluster_monitor_0 rc=8

which I thought was odd -- the agent's probe action output said the resource is not running, yet it returned "promoted" (rc=8)? I suspect I'm misunderstanding what the output means.

> and would you have the recommended steps to configure a RHEL 9 AG setup with SQL Server?
> https://learn.microsoft.com/en-us/sql/linux/sql-server-linux-availability-group-cluster-pacemaker?view=sql-server-ver16&tabs=rhel
> The steps we follow are above, and the agent code: https://github.com/microsoft/mssql-server-ha/blob/master/ag/ag

I believe the steps are identical for RHEL 8 and 9. There is no need to change cluster-recheck-interval on RHEL 8 or 9 (though it's useful for RHEL 7).
Note that the "Clusters from Scratch" link goes to the 1.1 version, whereas https://clusterlabs.org/pacemaker/doc/2.1/Clusters_from_Scratch/html/ would be the version for RHEL 8 and 9.

(In reply to Aravind Mahadevan from comment #11)
> @kgaillot: Would you have any insights from the latest logs attached, correlating them with the agent code that was shared?

I didn't see any failover or errors in the latest logs. I suspect the (Master|Promoted) change in the agent is the main thing needed for failover to work, and may even be sufficient by itself.

(In reply to Aravind Mahadevan from comment #12)
> and would you have the recommended steps to configure a RHEL 9 AG setup with SQL Server?
> https://learn.microsoft.com/en-us/sql/linux/sql-server-linux-availability-group-cluster-pacemaker?view=sql-server-ver16&tabs=rhel
> The steps we follow are above, and the agent code: https://github.com/microsoft/mssql-server-ha/blob/master/ag/ag

BTW, I don't think that page needs the bias warning anymore. I don't see any usage that would require it, unless it's needed for links to RHEL or upstream documentation that might.
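For context on the (Master|Promoted) change: Pacemaker 2.1 (RHEL 9) reports the promoted role as "Promoted" where earlier releases said "Master", so any agent logic that greps status output for the role name must match both spellings. A hypothetical sketch of the pattern (the sample status line is fabricated for illustration, not taken from the attached logs):

```shell
#!/bin/sh
# A crm_mon-style status line as Pacemaker 2.1 on RHEL 9 prints it;
# Pacemaker 1.1/2.0 would print "Master" in the same position.
status_line='    * agCluster    (ocf:mssql:ag):    Promoted rhel-9-1'

# Matching only "Master" silently fails on RHEL 9; matching both
# spellings keeps the check working on RHEL 7, 8, and 9.
if printf '%s\n' "$status_line" | grep -Eq '(Master|Promoted)'; then
    echo "promoted instance detected"
fi
```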
Created attachment 1976688 [details]
screenshot

Description of problem:
We work with the MS engineering team to run HA tests for SQL Server on RHEL 9. We have observed automatic failover failing on RHEL 9. We have an upcoming CU release and want to fix this before then, in order to prepare the release pipelines internally.

Version-Release number of selected component (if applicable):
RHEL 9, I think pacemaker 0.11 version?

How reproducible:
Set up a SQL Server HA configuration on RHEL 9 and shut down the primary, to reproduce the automatic-failover failure.

Actual results:
Manual failover passed but automatic failover failed. We waited more than half an hour for the master switch to happen, but it did not take place. Attaching logs and a screenshot for reference.

Expected results:
Automatic failover should happen without issues.

Additional info: