Bug 2232244 - Setting priority-fencing-delay results in delay in failover of resource in SAP ENSA2 configuration
Summary: Setting priority-fencing-delay results in delay in failover of resource in SAP ENSA2 configuration
Keywords:
Status: NEW
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: pacemaker
Version: 8.8
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Assignee: Ken Gaillot
QA Contact: cluster-qe
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-08-15 21:59 UTC by dennispadia
Modified: 2023-08-15 22:38 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Bug
Target Upstream Version:
Embargoed:


Links
Red Hat Issue Tracker RHELPLAN-165880 (Last Updated: 2023-08-15 22:01:10 UTC)

Description dennispadia 2023-08-15 21:59:08 UTC
Description of problem:

While testing priority-fencing-delay parameter in RHEL 8.8 for SAP HA & Update service OS image in Azure, we encountered some unusual behavior in failover behavior of ASCS resource when the node crashes. Below are the details of the issue:

I have configured an ENSA2 (ASCS/ERS) cluster on two VMs (rh8scs00l795, rh8scs01l795). The SAP kernel level is 789 Patch 52, so sapstartsrv is integrated with systemd by default. I have set up priority-fencing-delay as described in https://access.redhat.com/solutions/5110521. Because sapstartsrv is integrated with systemd, I have also followed the steps in https://access.redhat.com/articles/6884531.
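
The priority fencing setup from those articles boils down to a resource priority plus a cluster property. A minimal sketch of the pcs commands, assuming the resource name and values used in this cluster (the KB articles remain the authoritative procedure):

	# give the ASCS instance a higher priority than the other resources
	pcs resource meta rsc_sap_RH8_ASCS00 priority=10
	# delay fencing that targets the node hosting the higher-priority resource
	pcs property set priority-fencing-delay=15s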

When I crash the VM where the ASCS resource is running with "echo b > /proc/sysrq-trigger", the cluster takes an unexpectedly long time to act on the ASCS resource. Details below.

Cluster resource state before the crash; the ASCS resource is running on node rh8scs00l795:

Node List:
  * Online: [ rh8scs00l795 rh8scs01l795 ]

Full List of Resources:
  * rsc_app_azr_agt     (stonith:fence_azure_arm):       Started rh8scs00l795
  * Resource Group: g-RH8_ASCS:
    * fs_RH8_ASCS       (ocf::heartbeat:Filesystem):     Started rh8scs00l795
    * vip_RH8_ASCS      (ocf::heartbeat:IPaddr2):        Started rh8scs00l795
    * nc_RH8_ASCS       (ocf::heartbeat:azure-lb):       Started rh8scs00l795
    * rsc_sap_RH8_ASCS00        (ocf::heartbeat:SAPInstance):    Started rh8scs00l795
  * Resource Group: g-RH8_AERS:
    * fs_RH8_AERS       (ocf::heartbeat:Filesystem):     Started rh8scs01l795
    * vip_RH8_AERS      (ocf::heartbeat:IPaddr2):        Started rh8scs01l795
    * nc_RH8_AERS       (ocf::heartbeat:azure-lb):       Started rh8scs01l795
    * rsc_sap_RH8_ERS01 (ocf::heartbeat:SAPInstance):    Started rh8scs01l795

Crash node rh8scs00l795 using the command below:

[root@rh8scs00l795 ~]# date; echo b > /proc/sysrq-trigger
Tue Aug 15 20:12:38 UTC 2023

On the surviving node (rh8scs01l795), /var/log/messages captures the following behavior:

   - The token timeout is set to 30000 ms and the consensus timeout to 36000 ms, so after 30+36 seconds the cluster forms a new membership (the timeouts can be cross-checked as sketched after these log lines):

	Aug 15 20:12:58 rh8scs01l795 corosync[1479]:  [TOTEM ] Token has not been received in 22500 ms
	Aug 15 20:13:05 rh8scs01l795 corosync[1479]:  [TOTEM ] A processor failed, forming new configuration: token timed out (30000ms), waiting 36000ms for consensus.

	Aug 15 20:13:41 rh8scs01l795 corosync[1479]:  [QUORUM] Sync members[1]: 2
	Aug 15 20:13:41 rh8scs01l795 corosync[1479]:  [QUORUM] Sync left[1]: 1
	Aug 15 20:13:41 rh8scs01l795 corosync[1479]:  [TOTEM ] A new membership (2.560) was formed. Members left: 1
	Aug 15 20:13:41 rh8scs01l795 corosync[1479]:  [TOTEM ] Failed to receive the leave message. failed: 1
	Aug 15 20:13:41 rh8scs01l795 pacemaker-attrd[1623]: notice: Node rh8scs00l795 state is now lost
	Aug 15 20:13:41 rh8scs01l795 pacemaker-attrd[1623]: notice: Removing all rh8scs00l795 attributes for peer loss
	Aug 15 20:13:41 rh8scs01l795 pacemaker-attrd[1623]: notice: Purged 1 peer with id=1 and/or uname=rh8scs00l795 from the membership cache
	Aug 15 20:13:41 rh8scs01l795 corosync[1479]:  [QUORUM] Members[1]: 2
	Aug 15 20:13:41 rh8scs01l795 corosync[1479]:  [MAIN  ] Completed service synchronization, ready to provide service.
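
    For reference, the token and consensus values in these messages can be cross-checked on either node; a minimal sketch using standard corosync tooling (if consensus is not set explicitly in corosync.conf, corosync derives it as 1.2 x token):

	# configured values
	grep -E 'token|consensus' /etc/corosync/corosync.conf
	# runtime values as seen by corosync
	corosync-cmapctl | grep -E 'totem\.token|totem\.consensus'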

    - Because priority=10 is set on rsc_sap_RH8_ASCS00 and priority-fencing-delay is configured as 15s, the fencing action targeting rh8scs00l795 is delayed by 15s (a verification sketch follows the log line):

	Aug 15 20:13:42 rh8scs01l795 pacemaker-fenced[1621]: notice: Delaying 'reboot' action targeting rh8scs00l795 using rsc_app_azr_agt for 15s
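
    Both settings can be verified with read-only pcs commands; a sketch, assuming the pcs version shipped with RHEL 8:

	# cluster-wide property that introduces the delay
	pcs property show priority-fencing-delay
	# resource configuration, including the priority meta attribute
	pcs resource config rsc_sap_RH8_ASCS00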

    - After the 15s delay, the cluster is expected to act on the resources. Instead, corosync starts another totem timeout cycle and no action is taken on the ASCS resources (which are in Stopped state at this point):

	Aug 15 20:14:25 rh8scs01l795 corosync[1479]:  [TOTEM ] Token has not been received in 22500 ms
	Aug 15 20:14:32 rh8scs01l795 corosync[1479]:  [TOTEM ] A processor failed, forming new configuration: token timed out (30000ms), waiting 36000ms for consensus.

    - The cluster again waits 30+36 seconds to form a new membership:

	Aug 15 20:15:08 rh8scs01l795 corosync[1479]:  [QUORUM] Sync members[1]: 2
	Aug 15 20:15:08 rh8scs01l795 corosync[1479]:  [QUORUM] Sync left[1]: 1
	Aug 15 20:15:08 rh8scs01l795 corosync[1479]:  [TOTEM ] A new membership (2.569) was formed. Members left: 1
	Aug 15 20:15:08 rh8scs01l795 corosync[1479]:  [TOTEM ] Failed to receive the leave message. failed: 1
	Aug 15 20:15:08 rh8scs01l795 pacemaker-attrd[1623]: notice: Node rh8scs00l795 state is now lost
	Aug 15 20:15:08 rh8scs01l795 pacemaker-attrd[1623]: notice: Removing all rh8scs00l795 attributes for peer loss
	Aug 15 20:15:08 rh8scs01l795 pacemaker-attrd[1623]: notice: Purged 1 peer with id=1 and/or uname=rh8scs00l795 from the membership cache
	Aug 15 20:15:08 rh8scs01l795 corosync[1479]:  [QUORUM] Members[1]: 2
	Aug 15 20:15:08 rh8scs01l795 corosync[1479]:  [MAIN  ] Completed service synchronization, ready to provide service.

    - As a result, the cluster takes no action on the resources for (30+36) + 15 + (30+36) = 147 seconds, roughly 2 minutes 27 seconds.

    - The crashed node rh8scs00l795 comes back online and rejoins the cluster. Only after it rejoins are the ASCS resources either failed over or started on the same node.

This is not normal behavior; the cluster should not run through the totem timeout twice for a single crash. In a typical ENSA2 setup, the cluster either performs a failover or restarts the resources on the same node (if the VM comes back up quickly). The expected behavior is that the cluster acts on the resources after (30+36) + 15 seconds, but instead there is an additional delay because the totem timeout runs a second time. We encountered this behavior only after setting priority-fencing-delay.

We have not encountered this behavior in an ENSA1 configuration with priority-fencing-delay.

We have also encountered the same behavior on RHEL 8.6: we set up the same ENSA2 environment on RHEL 8.6 on Azure and observed similar behavior.


Version-Release number of selected component (if applicable):

RHEL 8.8 for SAP HA & Update Services

pacemaker-schemas-2.1.5-8.1.el8_8.noarch
pacemaker-cluster-libs-2.1.5-8.1.el8_8.x86_64
pacemaker-cli-2.1.5-8.1.el8_8.x86_64
pacemaker-2.1.5-8.1.el8_8.x86_64
pacemaker-libs-2.1.5-8.1.el8_8.x86_64

How reproducible:

Reproducible on an ENSA2 system (S/4HANA 2022) with sapstartsrv integrated with systemd.

Steps to Reproduce:
1. Set up an ENSA2 system (S/4HANA 2022) with sapstartsrv integrated with systemd on RHEL 8.8.
2. Configure priority-fencing-delay as described in https://access.redhat.com/solutions/5110521.
3. Perform the additional steps for a cluster managing systemd-based sapstartsrv described in https://access.redhat.com/articles/6884531.
4. Crash the node running the ASCS resource and observe the failover on the surviving node (see the sketch below).
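
A sketch of step 4 and of what to watch on the surviving node (the crash command is the one from the description above; the grep is just one convenient way to follow the relevant messages):

	# on the node currently running the ASCS group
	date; echo b > /proc/sysrq-trigger

	# on the surviving node, follow membership and fencing activity
	grep -E 'TOTEM|QUORUM|fenced' /var/log/messages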

Actual results:
The cluster takes no action on the ASCS resources for (30+36) + 15 + (30+36) = 147 seconds after the crash, because the totem/consensus timeout runs twice. The resources are only recovered after the crashed node rejoins the cluster.

Expected results:
The cluster should act on the ASCS resources after (30+36) + 15 seconds. Instead, the totem timeout runs a second time, adding an extra (30+36) seconds of delay.

Additional info:

Comment 1 dennispadia 2023-08-15 22:34:12 UTC
The sosreports for both VMs are available at https://1drv.ms/f/s!Aqy5CvX16plyvzgMtno-OcIr0uGa?e=scD4gS

