Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 2118337

Summary: SAPInstance cluster resources not getting enabled after implementing sap-cluster-connector
Product: Red Hat Enterprise Linux 8 Reporter: Ken Gaillot <kgaillot>
Component: pacemakerAssignee: Ken Gaillot <kgaillot>
Status: CLOSED ERRATA QA Contact: cluster-qe <cluster-qe>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 8.6CC: amemon, cfeist, cluster-maint, cluster-qe, fdanapfe, hbiswas, jfuchs, jserrano, mjuricek, mmoster, msmazova, oalbrigt, sbradley
Target Milestone: rcKeywords: Regression, Triaged, ZStream
Target Release: 8.7Flags: pm-rhel: mirror+
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: pacemaker-2.1.4-5.el8 Doc Type: Bug Fix
Doc Text:
Cause: The Pacemaker versions in RHEL 8.5 and 9.0 beta and later, for the crm_resource -O command to list operations, use a minimum of 0 for execution status. Consequence: A pending command, which has an execution status of -1, is shown as complete (0). Fix: crm_resource -O now uses a minimum of -1 for execution status. Result: Pending commands are appropriately shown as pending.
Story Points: ---
Clone Of: 2089353
: 2118745 (view as bug list) Environment:
Last Closed: 2022-11-08 09:42:30 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 2089353    
Bug Blocks: 2118745    

Description Ken Gaillot 2022-08-15 14:22:02 UTC
+++ This bug was initially created as a clone of Bug #2089353 +++

Description of problem:
Cluster Resources managed by SAPInstance resource agents are getting failed and disabled after sap-cluster-connector is implemented.

Version-Release number of selected component (if applicable):
sap-cluster-connector-3.0.1-7.el9.1.noarch

How reproducible: Fresh Install S/4HANA system on RHEL 9, configure the HA cluster normally and implement the sap-cluster-connector as mentioned  in https://access.redhat.com/solutions/3606101


Steps to Reproduce:
1. Install sap-cluster-connector on both nodes:
[root@s4hana06 ~]# rpm -qa | grep connector
sap-cluster-connector-3.0.1-7.el9.1.noarch

[root@s4hana05 heartbeat]# rpm -qa | grep connector
sap-cluster-connector-3.0.1-7.el9.1.noarch


2. Disable SAP Instance Resources
[root@s4hana06 ~]# pcs resource disable s4h_ers29; pcs resource disable s4h_ascs20


3. # Add the following lines to the end of file of their respective Instances
[root@s4hana06 ~]# vim /sapmnt/S4H/profile/S4H_ASCS20_s4ascs
service/halib = $(DIR_CT_RUN)/saphascriptco.so
service/halib_cluster_connector = /usr/bin/sap_cluster_connector

[root@s4hana06 ~]# vim /sapmnt/S4H/profile/S4H_ERS29_s4ers
service/halib = $(DIR_CT_RUN)/saphascriptco.so
service/halib_cluster_connector = /usr/bin/sap_cluster_connector

4. kill the remaining processes on both nodes:
[root@s4hana06 ~]# ps aux|grep sapstartsrv|grep s4hadm
s4hadm   3128462  0.0  0.0 848668 92040 ?        Ssl  13:50   0:01 /usr/sap/S4H/ERS29/exe/sapstartsrv pf=/sapmnt/S4H/profile/S4H_ERS29_s4ers -D -u s4hadm
[root@s4hana06 ~]# kill 3128462

[root@s4hana05 heartbeat]# ps aux|grep sapstartsrv|grep s4hadm
s4hadm    985652  0.0  0.0 991344 96984 ?        Ssl  13:49   0:02 /usr/sap/S4H/ASCS20/exe/sapstartsrv pf=/sapmnt/S4H/profile/S4H_ASCS20_s4ascs -D -u s4hadm


5. [root@s4hana06 ~]# pcs resource enable s4h_ascs20; pcs resource enable s4h_ers29


Actual results:
After executing step 5 the resources show starting and then the following errors occur:

Failed Resource Actions:
  * s4h_ascs20_start_0 on s4hana05 'not running' (7): call=90, status='complete', exitreason='', last-rc-change='2022-05-23 15:08:26 +02:00', queued=0ms, exec=12798ms
  * s4h_ers29_start_0 on s4hana05 'not running' (7): call=102, status='complete', exitreason='', last-rc-change='2022-05-23 15:08:42 +02:00', queued=0ms, exec=12884ms

And the resources: 
    * s4h_ascs20        (ocf:heartbeat:SAPInstance):     Stopped (disabled)
    * s4h_ers29 (ocf:heartbeat:SAPInstance):     Stopped (disabled)

Any attempts to cleanup and/or enable the resource doesn't help and same issue repeats.


Expected results:
After enabling the resource it should start without such errors:


Additional info:
Error messages in /var/log/messages
May 23 15:21:07 s4hana06 SAPInstance(s4h_ascs20)[3519701]: ERROR: SAP Instance S4H-ASCS20 start failed: #01223.05.2022 15:21:07#012WaitforStarted#012FAIL: process msg_server MessageServer not running
May 23 15:21:07 s4hana06 pacemaker-controld[3117621]: notice: Result of start operation for s4h_ascs20 on s4hana06: not running
May 23 15:21:07 s4hana06 pacemaker-attrd[3117619]: notice: Setting fail-count-s4h_ascs20#start_0[s4hana06]: (unset) -> INFINITY
May 23 15:21:07 s4hana06 pacemaker-attrd[3117619]: notice: Setting last-failure-s4h_ascs20#start_0[s4hana06]: (unset) -> 1653312067
May 23 15:21:07 s4hana06 pacemaker-controld[3117621]: notice: Requesting local execution of stop operation for s4h_ascs20 on s4hana06

### error messages in /usr/sap/S4H/ERS29/work/sapstartsrv.log
Initiating start via cluster API at 2022/05/23 15:21:17
trusted unix domain socket user is stopping SAP System at 2022/05/23 15:21:28
SAP HA Trace: Mon May 23 15:21:28 2022
SAP HA Trace: === SAP_HA_FindSAPInstance ===
SAP HA Trace: Fire system command /usr/bin/sap_cluster_connector lsr ...
SAP HA Trace: searchClusterFile: S4H:29 found
SAP HA Trace: Mon May 23 15:21:28 2022
SAP HA Trace: --- SAP_HA_FindSAPInstance Exit-Code: SAP_HA_OK ---
SAP HA Trace: Mon May 23 15:21:28 2022
SAP HA Trace: === SAP_HA_StopCluster ===
SAP HA Trace: Fire system command /usr/bin/sap_cluster_connector cpa ...
SAP HA Trace: SAP_HA_StopCluster: DID NOT FOUND A PENDING ACTION -> SAP_HA_OK
SAP HA Trace: SAP_HA_StopCluster: calling fire_resource_action
SAP HA Trace: Fire system command /usr/bin/sap_cluster_connector fra ...
SAP HA Trace: Mon May 23 15:21:28 2022
SAP HA Trace: --- SAP_HA_StopCluster Exit-Code: SAP_HA_OK ---


### Errors in file /usr/sap/S4H/ASCS20/work/sapstartsrv.log
Initiating start via cluster API at 2022/05/23 15:20:57
trusted unix domain socket user is stopping SAP System at 2022/05/23 15:21:08
SAP HA Trace: Mon May 23 15:21:08 2022
SAP HA Trace: === SAP_HA_FindSAPInstance ===
SAP HA Trace: Fire system command /usr/bin/sap_cluster_connector lsr ...
SAP HA Trace: searchClusterFile: S4H:20 found
SAP HA Trace: Mon May 23 15:21:08 2022
SAP HA Trace: --- SAP_HA_FindSAPInstance Exit-Code: SAP_HA_OK ---
SAP HA Trace: Mon May 23 15:21:08 2022
SAP HA Trace: === SAP_HA_StopCluster ===
SAP HA Trace: Fire system command /usr/bin/sap_cluster_connector cpa ...
SAP HA Trace: SAP_HA_StopCluster: DID NOT FOUND A PENDING ACTION -> SAP_HA_OK
SAP HA Trace: SAP_HA_StopCluster: calling fire_resource_action
SAP HA Trace: Fire system command /usr/bin/sap_cluster_connector fra ...
SAP HA Trace: Mon May 23 15:21:08 2022
SAP HA Trace: --- SAP_HA_StopCluster Exit-Code: SAP_HA_OK ---



Recovery Steps: Remove the implementation of sap-cluster-connector,kill the remaining sapstartsrv processes, cleanup and enable the resources:
1. Comment the last 2 lines:
[root@s4hana06 ]# vim /sapmnt/S4H/profile/S4H_ERS29_s4ers 
# service/halib = $(DIR_CT_RUN)/saphascriptco.so
# service/halib_cluster_connector = /usr/bin/sap_cluster_connector

[root@s4hana06 ]# vim /sapmnt/S4H/profile/S4H_ASCS20_s4ascs
# service/halib = $(DIR_CT_RUN)/saphascriptco.so
# service/halib_cluster_connector = /usr/bin/sap_cluster_connector

2. Kill the processes on both nodes
[root@s4hana06 ]# ps aux | grep sapstart | grep s4hadm

3. Enable the resources
[root@s4hana06 ]# pcs resource enable s4h_ascs20; pcs resource enable s4h_ers29

    * s4h_ascs20        (ocf:heartbeat:SAPInstance):     Started s4hana05
    * s4h_ers29 (ocf:heartbeat:SAPInstance):     Started s4hana06

<snip>

--- Additional comment from Ken Gaillot on 2022-08-09 20:35:44 UTC ---

Good catch ... this is a regression introduced in upstream Pacemaker version 2.1.0 (RHEL 8.5 / 9.0beta).

Would you like z-streams for the fix, or is 8.8 / 9.2 acceptable? I might be able to squeeze it in 8.7 / 9.1 ...

(In reply to Janine Fuchs from comment #13)
> Hi Ken,
> 
> I have actually spotted what is causing the SAP startup process to mess up
> prematurely.
> (Not attaching a report because the following details do not appear to be
> part of the collected data.)
> 
> After the SAPInstance resource initiates the start it triggers
> sap_cluster_connector (when enabled) to check for pending <res>_start_0
> resources using "crm_resource -O".
> It is looking for the related "_start_0" line ending with ": pending". 
> However, in the RHEL 9 cluster we used for testing yesterday and today these
> lines actually end with ": complete".
> The missing "pending" activity apparently leads to premature monitors that
> find a process not yet started and this is returning the failure and
> stopping all related processes again as a consequence.
> 
> * "crm_resource -O | grep s4h_ascs20" example on RHEL 9
> (pacemaker-2.1.2-4.el9.x86_64):
> 
> s4h_ascs20      (ocf:heartbeat:SAPInstance):     Starting:
> s4h_ascs20_start_0 (node=s4hana08, call=-1, rc=193, last-rc-change=Tue Aug 
> 9 20:21:34 2022, exec=0ms): complete
> s4h_ascs20      (ocf:heartbeat:SAPInstance):     Starting:
> s4h_ascs20_monitor_0 (node=s4hana07, call=524, rc=7, last-rc-change=Tue Aug 
> 9 20:21:15 2022, exec=2239ms): complete
> 
> 
> * "crm_resource -O | grep s4h_ascs20" example on RHEL 8.2
> (pacemaker-2.0.3-5.el8_2.5.x86_64):
> 
> s4h_ascs20      (ocf::heartbeat:SAPInstance):   Starting: s4h_ascs20_start_0
> (node=s4hana02, call=-1, rc=193, last-rc-change=Tue Aug  9 20:31:21 2022,
> exec=0ms): pending
> s4h_ascs20      (ocf::heartbeat:SAPInstance):   Starting:
> s4h_ascs20_monitor_0 (node=s4hana01, call=32, rc=7, last-rc-change=Mon Jul
> 25 17:25:12 2022, exec=105ms): complete
> 
> 
> The changed pattern in this crm_resource output is causing the issue due to
> the connector providing that information to the SAP service.
> Temporarily removing the "pending" string from the search pattern in the
> connector script helped passing this check correctly.
> There is no "pending" status at the end of the *_monitor_0 lines and that
> was already the case in older versions, as shown in above examples.
> 
> Is this an intentional change in the output?
> If so, how does this "Starting: ... : complete" have to be interpreted
> correctly? 
> Since the line turns into "Started: ..." when actually done it is confusing
> to show "complete" already for the 10-20 seconds while "Starting".
> 
> ... I have no idea why setting the target-role explicitly worked yesterday
> (multiple times). It failed when I wanted to include the output in a 2nd
> report for comparison and that's when I started digging again...
> 
> Regards,
> Janine

--- Additional comment from Ken Gaillot on 2022-08-09 20:52:53 UTC ---

QA: The simplest reproducer is:

    CIB_file=/usr/share/pacemaker/tests/scheduler/xml/probe-1.xml crm_resource -O

Before:

    DcIPaddr        (ocf::heartbeat:IPaddr):         Stopped: DcIPaddr_monitor_0 (node=c001n05, call=2, rc=-1): complete

After:

    DcIPaddr        (ocf::heartbeat:IPaddr):         Stopped: DcIPaddr_monitor_0 (node=c001n05, call=2, rc=-1): pending

"crm_resource -O" in a live cluster with a pending operation would also show the issue.

Comment 4 Ken Gaillot 2022-08-15 14:26:27 UTC
Fixed in upstream main branch by commit fe9150bc

Comment 13 errata-xmlrpc 2022-11-08 09:42:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:7573