Description of problem:
In RHEL 8.7, with pacemaker 2.0+ managing SQL Server Availability Groups, automatic failover between the nodes keeps happening. With a 3-replica setup, when SQL Server is shut down on the primary, failover bounces back and forth between the replicas. This appears to be caused by the pacemaker behavior change that introduced the "on-fail=demote" property.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Set up a SQL Server 3-node replica in RHEL 8.7 with pacemaker 2.0+
2. Stop SQL Server on the primary
3. Observe that pacemaker fails the HA resource over back and forth and attempts to start SQL Server indefinitely

Actual results:

Expected results:

Additional info:
Hi, Could you attach the output of "pcs cluster report" covering the time of one such incident? It sounds like a monitor is failing repeatedly.
Created attachment 1976118 [details] Pacemaker logs
Pacemaker logs when we observed the issue:

May 25 13:51:20 ag(ag_cluster)[458795]: INFO: monitor: 2023/05/25 13:51:20 ag-helper invoked with required-synchronized-secondaries-to-commit [-1]; current-master [server3
May 25 13:51:20 ag(ag_cluster)[458795]: INFO: monitor: server2]; disable-primary-on-quorum-timeout-after [60]; primary-write-lease-duration [73]; monitor-interval-timeout [60]
May 25 13:51:20 ag(ag_cluster)[458795]: INFO: monitor: 2023/05/25 13:51:20 From RetryExecuteWithTimeout - Attempt 1 to connect to the instance at localhost:1433
May 25 13:51:20 ag(ag_cluster)[458795]: INFO: monitor: 2023/05/25 13:51:20 Connected to the instance at localhost:1433
May 25 13:51:25 ag(ag_cluster)[458795]: INFO: monitor: 2023/05/25 13:51:25 Monitor Caller is: monitor.
May 25 13:51:25 ag(ag_cluster)[458795]: INFO: monitor: 2023/05/25 13:51:25 [DEBUG] AG Helper Monitor Role info: AVAILABILITY GROUP ag1 on instance server2
May 25 13:51:25 ag(ag_cluster)[458795]: INFO: monitor: 2023/05/25 13:51:25 Replica is PRIMARY (1)
May 25 13:51:25 ag(ag_cluster)[458795]: INFO: monitor: 2023/05/25 13:51:25 Offlining replica...
May 25 13:51:25 ag(ag_cluster)[458795]: INFO: monitor: 2023/05/25 13:51:25 Replica is RESOLVING (0)
May 25 13:51:25 ag(ag_cluster)[458795]: INFO: monitor: 2023/05/25 13:51:25 Instance name is server2.
May 25 13:51:25 ag(ag_cluster)[458795]: INFO: monitor: 2023/05/25 13:51:25 Current master is server3
May 25 13:51:25 ag(ag_cluster)[458795]: INFO: monitor: server2.
May 25 13:51:25 ag(ag_cluster)[458795]: INFO: monitor: 2023/05/25 13:51:25 Setting the role to Secondary.
May 25 13:51:25 ag(ag_cluster)[458795]: INFO: monitor: 2023/05/25 13:51:25 Setting replica to SECONDARY role...
May 25 13:51:25 ag(ag_cluster)[458795]: INFO: monitor: 2023/05/25 13:51:25 Could not set replica to SECONDARY role. Failover Failed.
May 25 13:51:25 ag(ag_cluster)[458795]: INFO: monitor: PROMOTION_SCORE: -INFINITY
May 25 13:51:25 ag(ag_cluster)[458795]: INFO: lease_expiry after monitor update:
Created attachment 1976119 [details] Pacemaker logs
Created attachment 1976120 [details] Pacemaker logs
More details/explanation for the logs attached:

Timestamp 1: Before shutting down SQL Server on the primary node (node 1). Pacemaker1.log is the primary. Pacemaker2 and pacemaker3 are the secondaries.

Timestamp 2: After shutting down SQL Server on the primary node (node 1). Pacemaker1 is in an unavailable status. Pacemaker2 is the new primary. Pacemaker3 is the secondary.
Hi, Those look like logs from the agent only. Do you have /var/log/pacemaker/pacemaker.log from all nodes covering the same time? If logging to that file is disabled, any lines from the system log (/var/log/syslog or systemd journal) containing "pacemaker" should help.
These logs are from /var/log/pacemaker; attaching them for your reference.

We cannot find a file or directory named /var/log/syslog in RHEL, and systemd is not a built-in command in RHEL.

In pacemaker2_7_19.log, please take a look at:
Jul 19 14:41:37 ag(ag_cluster)[44254]: INFO: monitor: 2023/07/19 14:41:37 Current master is server1
Jul 19 14:41:37 ag(ag_cluster)[44254]: INFO: monitor: server2.
Created attachment 1976684 [details] pacemaker logs_updated
Created attachment 1976685 [details] pacemaker logs_updated
Created attachment 1976686 [details] pacemaker logs_updated
What we are observing is that "start-failure-is-fatal=true" does not work as expected / is not being adhered to.

The expected behavior is that the cluster will try to start the resource on another node that has a higher score, but instead the old primary resource itself is being promoted even after the start failure.

So start-failure-is-fatal=true is not being adhered to for automatic failover when the old primary SQL Server is shut down.
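For reference, a minimal sketch of how the property in question is typically set and inspected with pcs; the values and commands below are only illustrative and are not taken from this cluster's configuration:

    # Set the cluster-wide property (the behavior we expect to be honored)
    pcs property set start-failure-is-fatal=true

    # Confirm the effective value, including defaults
    pcs property list --all | grep start-failure-is-fatal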
(In reply to Aravind Mahadevan from comment #8)
> These logs are from /var/log/pacemaker; attaching them for your reference.
>
> We cannot find a file or directory named /var/log/syslog in RHEL, and
> systemd is not a built-in command in RHEL.
>
> In pacemaker2_7_19.log, please take a look at:
> Jul 19 14:41:37 ag(ag_cluster)[44254]: INFO: monitor: 2023/07/19
> 14:41:37 Current master is server1
> Jul 19 14:41:37 ag(ag_cluster)[44254]: INFO: monitor: server2.

Apologies, I meant /var/log/messages.

For the systemd journal, the command is like:

journalctl --since "YYYY-MM-DD HH:MM:SS" --until "YYYY-MM-DD HH:MM:SS"

replacing the date/times appropriately. Either /var/log/messages or that output would be sufficient.
(In reply to Aravind Mahadevan from comment #12)
> What we are observing is that "start-failure-is-fatal=true" does not work as
> expected / is not being adhered to.
>
> The expected behavior is that the cluster will try to start the resource on
> another node that has a higher score, but instead the old primary resource
> itself is being promoted even after the start failure.
>
> So start-failure-is-fatal=true is not being adhered to for automatic
> failover when the old primary SQL Server is shut down.

Hi Aravind,

I don't need logs anymore; I believe I understand what's happening.

on-fail="demote" can be set for promote actions and for monitor actions for the promoted role, so one of those is what is actually failing. Start is technically a separate action (a "promote" is done after the start completes), so start-failure-is-fatal is not used.

When the action fails, the resource is demoted, then the cluster recalculates whether and where a new instance should be promoted. The node with the failure *is* eligible, so if promotion scores have not changed, it will be promoted again. (This is similar to a regular monitor failing, and recovery being attempted with a restart on the same node.)

There is no equivalent of start-failure-is-fatal or migration-threshold for promotion failures, but the same effect can be achieved with rule-based location constraints. For example:

pcs constraint location <resource> rule role="Promoted" score="-INFINITY" \
    "fail-count-<resource>#promote_0" gt 0 \
    or "fail-count-<resource>#monitor_<interval>" gt 0

replacing <resource> with the resource name and <interval> with the configured monitor interval in milliseconds (so, if you have interval="10", use 10000).

That tells the cluster to disallow the resource for promotion on any node where promotion has previously failed. If you want the node to still be eligible for promotion but just at a lower preference, you could use a finite number instead of -INFINITY. Cleaning up the fail count will make the node eligible again.

Let me know if that works for you.
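For concreteness, a sketch of the suggested constraint with assumed values filled in; the resource name ag_cluster and the 10-second promoted-role monitor interval (10000 ms) are assumptions for illustration, not taken from the actual configuration:

    pcs constraint location ag_cluster rule role="Promoted" score="-INFINITY" \
        "fail-count-ag_cluster#promote_0" gt 0 \
        or "fail-count-ag_cluster#monitor_10000" gt 0

    # Once the underlying problem is addressed, clearing the fail count makes
    # the node eligible for promotion again
    pcs resource cleanup ag_cluster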
https://bugzilla.redhat.com/show_bug.cgi?id=2224249 and https://bugzilla.redhat.com/show_bug.cgi?id=2221772 are handled by @
@kgaillot, I'd request a call to be scheduled between yourself, me, and our engineering dev Yunxi Jia, as there is some confusion and disagreement about this ticket, https://bugzilla.redhat.com/show_bug.cgi?id=2221772, as well as https://bugzilla.redhat.com/show_bug.cgi?id=2224249. Since both are handled by you, kindly let us know if we could have a call today or tomorrow at the earliest and share your preferred timeslots/working timezone so that we can schedule it accordingly. Without a call, things would be delayed and we wouldn't be able to finish our HA testing for our upcoming SQL on Linux CU release.
(In reply to Aravind Mahadevan from comment #16)
> @kgaillot, I'd request a call to be scheduled between yourself, me, and our
> engineering dev Yunxi Jia, as there is some confusion and disagreement about
> this ticket, https://bugzilla.redhat.com/show_bug.cgi?id=2221772, as well as
> https://bugzilla.redhat.com/show_bug.cgi?id=2224249. Since both are handled
> by you, kindly let us know if we could have a call today or tomorrow at the
> earliest and share your preferred timeslots/working timezone so that we can
> schedule it accordingly. Without a call, things would be delayed and we
> wouldn't be able to finish our HA testing for our upcoming SQL on Linux CU
> release.

Certainly. I am available the rest of today (until around 6PM US Central / 11PM UTC) and tomorrow morning (around 9AM to noon US Central / 2PM to 5PM UTC).
(In reply to Ken Gaillot from comment #17)
> (In reply to Aravind Mahadevan from comment #16)
> > @kgaillot, I'd request a call to be scheduled between yourself, me, and our
> > engineering dev Yunxi Jia, as there is some confusion and disagreement about
> > this ticket, https://bugzilla.redhat.com/show_bug.cgi?id=2221772, as well as
> > https://bugzilla.redhat.com/show_bug.cgi?id=2224249. Since both are handled
> > by you, kindly let us know if we could have a call today or tomorrow at the
> > earliest and share your preferred timeslots/working timezone so that we can
> > schedule it accordingly. Without a call, things would be delayed and we
> > wouldn't be able to finish our HA testing for our upcoming SQL on Linux CU
> > release.
>
> Certainly. I am available the rest of today (until around 6PM US Central /
> 11PM UTC) and tomorrow morning (around 9AM to noon US Central / 2PM to 5PM
> UTC).

Hi Aravind,

Not sure if my attempt to comment via Google Calendar went anywhere :) I have to leave early tomorrow, so the meeting would have to start no later than noon US Central time (5PM UTC).
8:00am PST works for you? @
10am CST tomorrow (8am PST) works for you hopefully? Sent an updated invite @kgaillot
(In reply to Aravind Mahadevan from comment #20)
> 10am CST tomorrow (8am PST) works for you hopefully? Sent an updated invite
> @kgaillot

Yes, see you then
Agent code : https://github.com/microsoft/mssql-server-ha/blob/master/ag/ag
(In reply to Aravind Mahadevan from comment #22)
> Agent code : https://github.com/microsoft/mssql-server-ha/blob/master/ag/ag

As discussed, the most important change needed is that the regular expression used when setting current_master at lines 94 and 301 should use (Master|Promoted) instead of Master. That will handle the RHEL 9 output as well as the output from earlier versions. I'm out of time today, so I'll investigate further tomorrow.

FYI, agent tracing can be a big help during testing. You can turn it on with "pcs resource update <resource> trace_ra=1". It logs every single line that the agent executes, so the logs grow really quickly; I would only use it briefly. The output will be in the /var/lib/heartbeat/trace_ra/ag directory.
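To illustrate the role-name difference that the (Master|Promoted) change accounts for: newer Pacemaker status output (RHEL 9) reports the promoted instance's role as "Promoted", while earlier releases report "Master", so a pattern meant to match the promoted instance needs to accept both. A rough sketch only; the agent's actual parsing code may use a different command or output format:

    # Illustrative only: match either role spelling in one-shot cluster status output
    crm_mon -1 | grep -E '(Master|Promoted)'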
I'll ask our team to try agent tracing while testing once. @kgaillot: Any further insights from your investigation? Please do let us know; your inputs are very much appreciated!