
Bug 2232244

Summary: Setting priority-fencing-delay results in delay in failover of resource in SAP ENSA2 configuration
Product: Red Hat Enterprise Linux 8
Reporter: dennispadia <depadia>
Component: pacemaker
Assignee: Ken Gaillot <kgaillot>
Status: CLOSED NOTABUG
QA Contact: cluster-qe <cluster-qe>
Severity: high
Docs Contact:
Priority: unspecified
Version: 8.8
CC: cfeist, cluster-maint, kris.shawcross, ksatarin, radeltch
Target Milestone: rc
Flags: pm-rhel: mirror+
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-09-06 20:31:17 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description dennispadia 2023-08-15 21:59:08 UTC
Description of problem:

While testing the priority-fencing-delay parameter in the RHEL 8.8 for SAP HA & Update Services OS image on Azure, we encountered unusual failover behavior of the ASCS resource when the node crashes. Details of the issue are below:

I have configured an ENSA2 (ASCS/ERS) cluster on two VMs (rh8scs00l795, rh8scs01l795). The SAP kernel level is 789 Patch 52, so sapstartsrv is integrated with systemd by default. I have set up priority-fencing-delay as described in https://access.redhat.com/solutions/5110521. Because sapstartsrv is integrated with systemd, I have also followed the steps described in https://access.redhat.com/articles/6884531.

When I crash the VM where the ASCS resource is running with "echo b > /proc/sysrq-trigger", the cluster takes a long time to perform any action on the ASCS resource. Details are below.

Cluster resource state before performing the crash. The ASCS resource is running on node rh8scs00l795:

Node List:
  * Online: [ rh8scs00l795 rh8scs01l795 ]

Full List of Resources:
  * rsc_app_azr_agt     (stonith:fence_azure_arm):       Started rh8scs00l795
  * Resource Group: g-RH8_ASCS:
    * fs_RH8_ASCS       (ocf::heartbeat:Filesystem):     Started rh8scs00l795
    * vip_RH8_ASCS      (ocf::heartbeat:IPaddr2):        Started rh8scs00l795
    * nc_RH8_ASCS       (ocf::heartbeat:azure-lb):       Started rh8scs00l795
    * rsc_sap_RH8_ASCS00        (ocf::heartbeat:SAPInstance):    Started rh8scs00l795
  * Resource Group: g-RH8_AERS:
    * fs_RH8_AERS       (ocf::heartbeat:Filesystem):     Started rh8scs01l795
    * vip_RH8_AERS      (ocf::heartbeat:IPaddr2):        Started rh8scs01l795
    * nc_RH8_AERS       (ocf::heartbeat:azure-lb):       Started rh8scs01l795
    * rsc_sap_RH8_ERS01 (ocf::heartbeat:SAPInstance):    Started rh8scs01l795

Crash the rh8scs00l795 node using the command below:

[root@rh8scs00l795 ~]# date; echo b > /proc/sysrq-trigger
Tue Aug 15 20:12:38 UTC 2023

On the other node (rh8scs01l795), /var/log/messages captures the behavior below:

   - The token timeout is set to 30000 ms and the consensus timeout to 36000 ms. After 30+36 seconds, the cluster forms a new membership:

	Aug 15 20:12:58 rh8scs01l795 corosync[1479]:  [TOTEM ] Token has not been received in 22500 ms
	Aug 15 20:13:05 rh8scs01l795 corosync[1479]:  [TOTEM ] A processor failed, forming new configuration: token timed out (30000ms), waiting 36000ms for consensus.

	Aug 15 20:13:41 rh8scs01l795 corosync[1479]:  [QUORUM] Sync members[1]: 2
	Aug 15 20:13:41 rh8scs01l795 corosync[1479]:  [QUORUM] Sync left[1]: 1
	Aug 15 20:13:41 rh8scs01l795 corosync[1479]:  [TOTEM ] A new membership (2.560) was formed. Members left: 1
	Aug 15 20:13:41 rh8scs01l795 corosync[1479]:  [TOTEM ] Failed to receive the leave message. failed: 1
	Aug 15 20:13:41 rh8scs01l795 pacemaker-attrd[1623]: notice: Node rh8scs00l795 state is now lost
	Aug 15 20:13:41 rh8scs01l795 pacemaker-attrd[1623]: notice: Removing all rh8scs00l795 attributes for peer loss
	Aug 15 20:13:41 rh8scs01l795 pacemaker-attrd[1623]: notice: Purged 1 peer with id=1 and/or uname=rh8scs00l795 from the membership cache
	Aug 15 20:13:41 rh8scs01l795 corosync[1479]:  [QUORUM] Members[1]: 2
	Aug 15 20:13:41 rh8scs01l795 corosync[1479]:  [MAIN  ] Completed service synchronization, ready to provide service.
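
	For reference, the 30000 ms token and 36000 ms consensus timeouts above correspond to totem settings in /etc/corosync/corosync.conf along these lines (a fragment for illustration; only the timing-related keys are shown, other keys in the real file are omitted):

	```
	totem {
	    version: 2
	    token: 30000
	    consensus: 36000
	}
	```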

    - Because priority=10 is set on rsc_sap_RH8_ASCS00, the fence agent introduces a delay of 15s, as priority-fencing-delay is configured to 15s:

	Aug 15 20:13:42 rh8scs01l795 pacemaker-fenced[1621]: notice: Delaying 'reboot' action targeting rh8scs00l795 using rsc_app_azr_agt for 15s

    - After the 15s delay, the cluster is expected to perform some action on the resources. Instead, the cluster checks the totem again and performs no action on the ASCS resources (which are in a stopped state at this time):

	Aug 15 20:14:25 rh8scs01l795 corosync[1479]:  [TOTEM ] Token has not been received in 22500 ms
	Aug 15 20:14:32 rh8scs01l795 corosync[1479]:  [TOTEM ] A processor failed, forming new configuration: token timed out (30000ms), waiting 36000ms for consensus.

    - It waits another 30+36 seconds to form a new cluster membership:

	Aug 15 20:15:08 rh8scs01l795 corosync[1479]:  [QUORUM] Sync members[1]: 2
	Aug 15 20:15:08 rh8scs01l795 corosync[1479]:  [QUORUM] Sync left[1]: 1
	Aug 15 20:15:08 rh8scs01l795 corosync[1479]:  [TOTEM ] A new membership (2.569) was formed. Members left: 1
	Aug 15 20:15:08 rh8scs01l795 corosync[1479]:  [TOTEM ] Failed to receive the leave message. failed: 1
	Aug 15 20:15:08 rh8scs01l795 pacemaker-attrd[1623]: notice: Node rh8scs00l795 state is now lost
	Aug 15 20:15:08 rh8scs01l795 pacemaker-attrd[1623]: notice: Removing all rh8scs00l795 attributes for peer loss
	Aug 15 20:15:08 rh8scs01l795 pacemaker-attrd[1623]: notice: Purged 1 peer with id=1 and/or uname=rh8scs00l795 from the membership cache
	Aug 15 20:15:08 rh8scs01l795 corosync[1479]:  [QUORUM] Members[1]: 2
	Aug 15 20:15:08 rh8scs01l795 corosync[1479]:  [MAIN  ] Completed service synchronization, ready to provide service.

    - So the cluster performs no action on the resource for (30+36) + 15 + (30+36) = 147 seconds, about 2 minutes 27 seconds.

    - The crashed node rh8scs00l795 comes back online and rejoins the cluster. After it rejoins, the ASCS resources are either failed over or started on the same node.

This is not normal behavior; the cluster should not go through totem loss detection twice for a single crash. In a typical ENSA2 setup, the cluster either performs a failover or starts the resources on the same node (if the VM comes up fast). The expected behavior is that the cluster performs an action on the resources after (30+36) + 15 seconds. Instead, there is an additional delay before any action is taken because the cluster checks the totem twice. We encountered this behavior only after setting priority-fencing-delay.
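
The arithmetic behind the observed delay can be sketched as follows (values in seconds, taken from the timeouts and fencing delay described above):

```python
# Delays before the cluster acts on the ASCS resource, per the logs above
token = 30        # corosync token timeout (s)
consensus = 36    # corosync consensus timeout (s)
fence_delay = 15  # priority-fencing-delay (s)

# Expected: one membership change plus the priority fencing delay
expected = (token + consensus) + fence_delay
# Observed: the cluster forms a new membership a second time before acting
observed = expected + (token + consensus)
print(expected, observed)  # 81 147
```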

We have not encountered this behavior in an ENSA1 configuration with priority-fencing-delay.

We have also encountered the same behavior in RHEL 8.6: we set up the same ENSA2 environment on RHEL 8.6 in Azure and saw similar behavior.


Version-Release number of selected component (if applicable):

RHEL 8.8 for SAP HA & Update Services

pacemaker-schemas-2.1.5-8.1.el8_8.noarch
pacemaker-cluster-libs-2.1.5-8.1.el8_8.x86_64
pacemaker-cli-2.1.5-8.1.el8_8.x86_64
pacemaker-2.1.5-8.1.el8_8.x86_64
pacemaker-libs-2.1.5-8.1.el8_8.x86_64

How reproducible:

Set up an ENSA2 system (S/4HANA 2022) with sapstartsrv integrated with systemd.

Steps to Reproduce:
1. Set up an ENSA2 system (S/4HANA 2022) with sapstartsrv integrated with systemd on RHEL 8.8
2. Configure priority-fencing-delay as described in https://access.redhat.com/solutions/5110521
3. Perform the additional steps for clusters managing sapstartsrv, per https://access.redhat.com/articles/6884531
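
For reference, the priority-fencing-delay configuration from the linked solution boils down to commands along these lines (a sketch; the resource name and the priority/15s values are taken from this cluster and may differ in other environments):

```
# Give the ASCS instance resource a higher priority than other resources
pcs resource update rsc_sap_RH8_ASCS00 meta priority=10
# Delay fencing of the node currently running the higher-priority resource
pcs property set priority-fencing-delay=15s
```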

Actual results:
 

Expected results:
The cluster should perform some action on the ASCS resource after (30+36) + 15 seconds. Instead, it checks the totem again, which adds an additional delay of (30+36) seconds.

Additional info:

Comment 1 dennispadia 2023-08-15 22:34:12 UTC
You can find SOS reports for both VMs at https://1drv.ms/f/s!Aqy5CvX16plyvzgMtno-OcIr0uGa?e=scD4gS

Comment 2 Ken Gaillot 2023-08-22 16:02:41 UTC
(In reply to dennispadia from comment #1)
> You can find SOS report for both VMs in
> https://1drv.ms/f/s!Aqy5CvX16plyvzgMtno-OcIr0uGa?e=scD4gS

Hi,

I don't see any files there. Do I need a special login or permissions?

Comment 3 dennispadia 2023-08-22 16:32:38 UTC
(In reply to Ken Gaillot from comment #2)
> (In reply to dennispadia from comment #1)
> > You can find SOS report for both VMs in
> > https://1drv.ms/f/s!Aqy5CvX16plyvzgMtno-OcIr0uGa?e=scD4gS
> 
> Hi,
> 
> I don't see any files there. Do I need a special login or permissions?

Hi,

It doesn't require a login; can you try this link instead: https://1drv.ms/f/s!Aqy5CvX16plyvziczbnSgnxXM7K2?e=WGi6c2

Regards,
Dennis Padia

Comment 4 Ken Gaillot 2023-08-28 21:53:53 UTC
(In reply to dennispadia from comment #0)
>     - As priority=10 is set on rsc_sap_RH8_ASCS00, fence agent introduce a
> delay of 15s as priority-fencing-delay is configured as 15s.
> 
> 	Aug 15 20:13:42 rh8scs01l795 pacemaker-fenced[1621]: notice: Delaying
> 'reboot' action targeting rh8scs00l795 using rsc_app_azr_agt for 15s
> 
>     - After a delay of 15s, it is expected that cluster to perform some
> action on the resource. Instead, the cluster again checks the totem and do
> not perform any action on ASCS resource (which are in stopped state at this
> time)
> 
> 	Aug 15 20:14:25 rh8scs01l795 corosync[1479]:  [TOTEM ] Token has not been
> received in 22500 ms
> 	Aug 15 20:14:32 rh8scs01l795 corosync[1479]:  [TOTEM ] A processor failed,
> forming new configuration: token timed out (30000ms), waiting 36000ms for
> consensus.

Hi,

What happened here is that the node came back up and rejoined the cluster at 20:13:50, while fencing was still waiting out the delay. The Corosync membership was re-formed, and then fencing succeeded at 20:14:18, which Corosync saw as the new node-loss event above.

Coincidentally, the node was fenced the second time at the same moment the surviving node was updating the CIB status for the rejoin. As a result, the status updates failed, because Corosync could not deliver the request to the other node:

    Aug 15 20:14:36 rh8scs01l795 pacemaker-controld[993208]: error: Node update 89 failed: Timer expired (-62)
    Aug 15 20:14:36 rh8scs01l795 pacemaker-controld[993208]: error: Node update 90 failed: Timer expired (-62)

That caused the controller on the surviving node to exit, and pacemakerd respawned it. That is correct recovery, but because the other node was still down when the controller respawned, an election failed after 20 seconds, resulting in another controller exit and respawn. By that time the other node was back, so everything proceeded normally.

This is just unfortunate timing but ultimately results in correct (if slow) recovery. Some users get around this by putting a delay (longer than the maximum time fencing is expected to take) in each node's boot sequence before starting Pacemaker. That ensures if a node leaves and rejoins, any fencing scheduled for it can complete before it tries to rejoin the cluster.