1573723 – SAPInstance stops both Master and Slave when only Slave detects missing binaries

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	resource-agents
Sub Component:
Version:	7.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Oyvind Albrigtsen
QA Contact:	cluster-qe@redhat.com
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1612799
Depends On:
Blocks:	1018952

Reported:	2018-05-02 06:57 UTC by Ondrej Faměra
Modified:	2022-03-13 14:56 UTC
CC List:	12 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2018-10-19 09:02:20 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	ClusterLabs resource-agents pull 1155	'None'	closed	SAPInstance improvements 2018/05 (2/4) - make missing binaries on Slave non-fatal issue	2021-01-28 11:12:12 UTC
Red Hat Knowledge Base (Solution)	3430711	None	None	None	2018-05-03 07:03:59 UTC
Red Hat Knowledge Base (Solution)	3552761	None	None	None	2018-10-01 16:09:40 UTC

Internal Links: 1612799

Description Ondrej Faměra 2018-05-02 06:57:01 UTC

=== Description of problem:
SAPInstance is not resilient when the binaries it uses go missing.
Missing binaries on the Slave node cause both the Slave and the Master to be stopped.

=== Version-Release number of selected component (if applicable):
resource-agents-sap-3.9.5-124.el7.x86_64

=== How reproducible:
always

=== Steps to Reproduce:
1. Set up the ASCS/ERS SAPInstance resource in the cluster as described here
https://access.redhat.com/articles/3150081#configure-ascs-ers-sapinstance-cluster-resource
# pcs resource create rh1_ascs_ers SAPInstance InstanceName="RH1_ASCS00_rh1-ascs" DIR_PROFILE=/sapmnt/RH1/profile START_PROFILE=/sapmnt/RH1/profile/RH1_ASCS00_rh1-ascs ERS_InstanceName="RH1_ERS10_rh1-ers" ERS_START_PROFILE=/sapmnt/RH1/profile/RH1_ERS10_rh1-ers master master-max="1" clone-max="2" notify="true" interleave="true"
2. On the node running the Slave, make the binaries needed by this resource agent non-executable
(this makes the `have_binary` function fail)
# chmod -x /usr/sap/RH1/ASCS00/exe/sapstartsrv
# chmod -x /usr/sap/RH1/ERS10/exe/sapstartsrv
# chmod -x /usr/sap/RH1/SYS/exe/run/sapstartsrv
3. Wait for the 'monitor' action to run on the node with the non-executable binaries
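
Before waiting for the monitor, it can help to confirm on each node which copies of the binaries are actually executable. The following is only a rough sketch: `check_binaries` is a hypothetical helper, not part of resource-agents, and it merely approximates what a `have_binary`-style check cares about (an executable regular file). The numeric value 5 is the standard OCF_ERR_INSTALLED code.

```shell
#!/bin/sh
# Hypothetical helper (not part of resource-agents): report whether each
# given path is an executable regular file and return an OCF-style code.
OCF_SUCCESS=0
OCF_ERR_INSTALLED=5

check_binaries() {
    _rc=$OCF_SUCCESS
    for _bin in "$@"; do
        if [ -f "$_bin" ] && [ -x "$_bin" ]; then
            echo "ok       $_bin"
        else
            echo "MISSING  $_bin"
            _rc=$OCF_ERR_INSTALLED
        fi
    done
    return "$_rc"
}

# Example invocation with the paths from step 2 (run on each node):
# check_binaries /usr/sap/RH1/ASCS00/exe/sapstartsrv \
#                /usr/sap/RH1/ERS10/exe/sapstartsrv \
#                /usr/sap/RH1/SYS/exe/run/sapstartsrv
```

A non-zero exit status on one node only suggests that the reproduction affects just that node, which is the case this bug is about.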

=== Actual results:
The resource agent detects that the resource running as Slave is missing binaries and returns $OCF_ERR_INSTALLED,
which causes both the Slave and the Master to be stopped and prevents the resource from being started anywhere in the cluster.
The Master has all the binaries it needs, but it is stopped as well.
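
When reading the cluster logs for this failure, the monitor result shows up as a numeric return code. As a convenience, here is a small sketch that maps the numeric OCF codes to their symbolic names; `ocf_rc_name` is a hypothetical helper name, while the code values themselves are the standard OCF ones (5 being OCF_ERR_INSTALLED).

```shell
#!/bin/sh
# Map a numeric OCF return code to its symbolic name.
# ocf_rc_name is an illustrative helper, not shipped with resource-agents.
ocf_rc_name() {
    case "$1" in
        0) echo OCF_SUCCESS ;;
        1) echo OCF_ERR_GENERIC ;;
        2) echo OCF_ERR_ARGS ;;
        3) echo OCF_ERR_UNIMPLEMENTED ;;
        4) echo OCF_ERR_PERM ;;
        5) echo OCF_ERR_INSTALLED ;;
        6) echo OCF_ERR_CONFIGURED ;;
        7) echo OCF_NOT_RUNNING ;;
        8) echo OCF_RUNNING_MASTER ;;
        9) echo OCF_FAILED_MASTER ;;
        *) echo "UNKNOWN($1)" ;;
    esac
}
```

For example, `ocf_rc_name 5` prints `OCF_ERR_INSTALLED`, the code returned by the Slave monitor in this report.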

=== Expected results:
The cluster detects the problem on the affected node and prevents the resource from starting only on
that node, keeping the Master running and possibly starting the Slave elsewhere, where the binaries are available and executable.

=== Additional info:
Upstream PR to address this: https://github.com/ClusterLabs/resource-agents/pull/1144

I have an internal 2-node cluster with the latest RHEL 7.5 and an SAP NetWeaver setup that can be used
for testing the changes.

Additional notes for testing:
- /usr/sap/$SID/SYS/exe/run/ is on a shared NFS share, so changes made there affect all nodes
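
To see in advance which of the paths would affect every node, one could check the filesystem type behind each path. This is a minimal sketch under assumptions: `fs_scope` and `report_scope` are hypothetical helpers, the list of network filesystem types is illustrative rather than exhaustive, and `stat -f -c %T` assumes GNU coreutils.

```shell
#!/bin/sh
# Classify a filesystem type name as shared (network) or local.
# The set of network types below is an illustrative assumption.
fs_scope() {
    case "$1" in
        nfs|cifs) echo shared ;;
        unknown)  echo unknown ;;
        *)        echo local ;;
    esac
}

# Print the filesystem type and scope for each given path.
# Relies on GNU coreutils `stat -f -c %T` for the type name.
report_scope() {
    for _p in "$@"; do
        _t=$(stat -f -c %T "$_p" 2>/dev/null) || _t=unknown
        echo "$_p: $_t ($(fs_scope "$_t"))"
    done
}

# Example (paths from the reproduction steps):
# report_scope /usr/sap/RH1/ASCS00/exe /usr/sap/RH1/ERS10/exe /usr/sap/RH1/SYS/exe/run
```

Paths reported as shared should be left alone when the intent is to break the binaries on a single node only.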

The PR also addresses the detection of binaries for the Master/Slave kind of SAPInstance to make it more consistent.

Comment 30 Frank Danapfel 2018-10-01 16:09:40 UTC

*** Bug 1612799 has been marked as a duplicate of this bug. ***

Comment 31 Frank Danapfel 2018-10-19 09:02:20 UTC

As agreed between Arne Arnold, Red Hat PM for SAP, and Joachim Boyer, Red Hat Escalation Manager for the support cases related to this bugzilla, we won't be implementing the fixes proposed in this bugzilla in the SAPInstance resource agent shipped via the resource-agents-sap packages, for the following reasons:

- implementing the changes is considered too risky, since it is not clear what effects the changes would have on existing customer implementations

- the root cause of the issue that this bugzilla is trying to fix is that the instance directory/file system of either the (A)SCS or the ERS instance becomes inaccessible. The issue can therefore also be avoided by adapting the cluster configuration so that the cluster manages the instance directories as well: if an instance directory/file system becomes inaccessible, the cluster will then either attempt to restore access or fence the node if access can't be restored. Customers who do not want the file systems managed by the cluster need some alternative method of monitoring that the file systems needed by the SAP instances managed by the cluster are accessible, and need procedures in place to make the file systems accessible again as soon as possible if they become unavailable on a node

- a new architecture for managing the SAP (A)SCS/ERS instances is currently being developed, which will eliminate the issue that this bugzilla is trying to fix, since it will implement the following changes:
   - the (A)SCS and ERS instances will be managed as separate instances and will no longer use the master/slave approach
   - the instance directories/file systems will be managed by the cluster by default
  See the following KBase for further details on the new setup:
  'Configuring ASCS/ERS for SAP NetWeaver with standalone resources in RHEL 7.5'
  (https://access.redhat.com/articles/3569681)
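
To illustrate what "instance directories managed by the cluster" could look like, here is a dry-run sketch that only prints `pcs` commands for Filesystem resources and executes nothing. It is not taken from the KBase: `fs_resource_cmd` is a hypothetical helper, and the NFS server name, export paths, resource IDs, and group names are placeholders that must be replaced with site-specific values. The Filesystem agent parameters `device`, `directory`, and `fstype` are real; operation timeouts and other options are deliberately left out.

```shell
#!/bin/sh
# Dry-run helper (hypothetical): print a pcs command that would place one
# SAP instance directory under cluster control as a Filesystem resource.
# Nothing is executed; review the output before running any of it.
fs_resource_cmd() {
    _sid=$1      # SAP system ID, e.g. RH1 (from the report)
    _inst=$2     # instance directory name, e.g. ASCS00
    _device=$3   # NFS export; placeholder, site-specific
    _group=$4    # resource group name; placeholder
    printf '%s\n' "pcs resource create ${_sid}_fs_${_inst} Filesystem device='${_device}' directory=/usr/sap/${_sid}/${_inst} fstype=nfs --group ${_group}"
}

# nfs.example.com and the export paths below are assumptions for illustration.
fs_resource_cmd RH1 ASCS00 'nfs.example.com:/export/RH1/ASCS00' RH1_ASCS00_group
fs_resource_cmd RH1 ERS10  'nfs.example.com:/export/RH1/ERS10'  RH1_ERS10_group
```

The KBase article referenced above remains the authoritative source for the complete resource definitions.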

Ondrej, since the issues you discovered when trying to use the test provided by SAP as part of the SAP NetWeaver HA Interface certification are not related to the issue discussed in this thread, we should discuss them separately.