Bug 1573723

Summary:	SAPInstance stops both Master and Slave when only Slave detects missing binaries
Product:	Red Hat Enterprise Linux 7	Reporter:	Ondrej Faměra <ofamera>
Component:	resource-agents	Assignee:	Oyvind Albrigtsen <oalbrigt>
Status:	CLOSED WONTFIX	QA Contact:	cluster-qe <cluster-qe>
Severity:	high	Docs Contact:
Priority:	high
Version:	7.5	CC:	aarnold, agk, bjarolim, cfeist, cluster-maint, fbroussy, fdanapfe, fdinitto, jruemker, nwahl, ofamera, takirby
Target Milestone:	rc
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Last Closed:	2018-10-19 09:02:20 UTC	Type:	Bug
Bug Blocks:	1018952

Description Ondrej Faměra 2018-05-02 06:57:01 UTC

=== Description of problem:
SAPInstance is not resilient when the binaries that it uses go missing.
Missing binaries on the Slave node cause both the Slave and the Master to be stopped.

=== Version-Release number of selected component (if applicable):
resource-agents-sap-3.9.5-124.el7.x86_64

=== How reproducible:
always

=== Steps to Reproduce:
1. Set up an ASCS/ERS SAPInstance resource in the cluster as described here
https://access.redhat.com/articles/3150081#configure-ascs-ers-sapinstance-cluster-resource
# pcs resource create rh1_ascs_ers SAPInstance InstanceName="RH1_ASCS00_rh1-ascs" DIR_PROFILE=/sapmnt/RH1/profile START_PROFILE=/sapmnt/RH1/profile/RH1_ASCS00_rh1-ascs ERS_InstanceName="RH1_ERS10_rh1-ers" ERS_START_PROFILE=/sapmnt/RH1/profile/RH1_ERS10_rh1-ers master master-max="1" clone-max="2" notify="true" interleave="true"
2. On the node running the Slave, make the binaries needed by this resource agent non-executable
(this makes the function `have_binary` fail)
# chmod -x /usr/sap/RH1/ASCS00/exe/sapstartsrv
# chmod -x /usr/sap/RH1/ERS10/exe/sapstartsrv
# chmod -x /usr/sap/RH1/SYS/exe/run/sapstartsrv
3. Wait for the 'monitor' action to run on the node with the non-executable binaries
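
Step 2 can be wrapped in a small helper so that the execute bits are easy to restore after the test. This is only a sketch: the function name `set_exec_bit` is hypothetical and is not part of resource-agents or of the KBase article.

```shell
#!/bin/sh
# Hypothetical helper (not part of resource-agents): remove ("off") or
# restore ("on") the execute bit on each given path. Paths that do not
# exist are skipped with a message on stderr.
set_exec_bit() {
    mode=$1
    shift
    case "$mode" in
        off) flag=-x ;;
        on)  flag=+x ;;
        *)   echo "usage: set_exec_bit on|off PATH..." >&2; return 2 ;;
    esac
    for path in "$@"; do
        if [ -e "$path" ]; then
            chmod "$flag" "$path"
        else
            echo "skipping missing path: $path" >&2
        fi
    done
}

# Example (run on the Slave node only):
#   set_exec_bit off /usr/sap/RH1/ASCS00/exe/sapstartsrv /usr/sap/RH1/ERS10/exe/sapstartsrv
# and after the test:
#   set_exec_bit on  /usr/sap/RH1/ASCS00/exe/sapstartsrv /usr/sap/RH1/ERS10/exe/sapstartsrv
```

If /usr/sap/RH1/SYS/exe/run/sapstartsrv is on shared storage, changing its mode is visible on every node, not just the one running the Slave.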

=== Actual results:
The cluster detects that the resource running as Slave is missing binaries, and the 'monitor' action returns $OCF_ERR_INSTALLED.
This causes both the Slave and the Master to be stopped and prevents the resource from being started anywhere in the cluster.
The Master has all the binaries that it needs, but the Master doesn't run.
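
The failure path described above can be sketched roughly as follows. This is not the SAPInstance code: `binary_is_executable` and `check_binaries` are hypothetical names used for illustration only. The numeric values are the standard OCF return codes (OCF_SUCCESS is 0, OCF_ERR_INSTALLED is 5).

```shell
#!/bin/sh
# Standard OCF return codes.
OCF_SUCCESS=0
OCF_ERR_INSTALLED=5

# Hypothetical stand-in for the agent's binary check: succeeds only if
# the given path exists and is executable.
binary_is_executable() {
    [ -x "$1" ]
}

# Hypothetical monitor-style check: returns OCF_ERR_INSTALLED as soon as
# one of the listed binaries is missing or not executable.
check_binaries() {
    for bin in "$@"; do
        if ! binary_is_executable "$bin"; then
            echo "not executable: $bin" >&2
            return "$OCF_ERR_INSTALLED"
        fi
    done
    return "$OCF_SUCCESS"
}

# Example:
#   check_binaries /usr/sap/RH1/ASCS00/exe/sapstartsrv; echo $?
```

After the `chmod -x` from step 2, a check like this returns 5 on the affected node even though the binaries are still executable on the node running the Master.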

=== Expected results:
The cluster detects the problem on the affected node and prevents the resource from starting only on
that node, keeping the Master running and possibly starting the Slave elsewhere, where the binaries are available and executable.

=== Additional info:
Upstream PR to address this: https://github.com/ClusterLabs/resource-agents/pull/1144

I have an internal 2-node cluster with the latest RHEL 7.5 and an SAP NetWeaver setup that can be used
for testing the changes.

Additional notes for testing:
- /usr/sap/$SID/SYS/exe/run/ is on a shared NFS share, so changes made there affect all nodes

The PR also addresses the detection of binaries for the Master/Slave kind of SAPInstance to make it more consistent.

Comment 30 Frank Danapfel 2018-10-01 16:09:40 UTC

*** Bug 1612799 has been marked as a duplicate of this bug. ***

Comment 31 Frank Danapfel 2018-10-19 09:02:20 UTC

As agreed between Arne Arnold, Red Hat PM for SAP, and Joachim Boyer, Red Hat Escalation Manager for the support cases related to this bugzilla, we won't be implementing the fixes proposed in this bugzilla in the SAPInstance resource agent shipped via the resource-agents-sap packages, for the following reasons:

- implementing the changes is considered too risky, since it is not clear what effects the change would have on existing customer implementations

- the root cause of the issue that this bugzilla is trying to fix is that the instance directory/file system of either the (A)SCS or the ERS instance becomes inaccessible. The issue can therefore also be avoided by adapting the cluster configuration so that the cluster manages the instance directories: if an instance directory/file system becomes inaccessible, the cluster will either attempt to restore access or fence the node if access can't be restored. Customers who do not want the file systems managed by the cluster need to make sure they have an alternative method of monitoring that the file systems needed by the SAP instances managed by the cluster are accessible, and need to have procedures in place to make the file systems accessible again as soon as possible if they become unavailable on a node

- a new architecture for managing the SAP (A)SCS/ERS instances is currently being developed, which will eliminate the issue that this bugzilla is trying to fix, since it will implement the following changes:
   - the (A)SCS and ERS instances will be managed as separate instances and will no longer use the master/slave approach
   - the instance directories/file systems will be managed by the cluster by default
  See the following KBase for further details on the new setup:
  'Configuring ASCS/ERS for SAP NetWeaver with standalone resources in RHEL 7.5'
  (https://access.redhat.com/articles/3569681)
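
As a rough illustration of the second point above (letting the cluster manage the instance directories), the sketch below prints, but does not run, `pcs` commands that would create Filesystem resources for the instance directories. The resource names, group names, and the NFS server/export are hypothetical placeholders, and the helper `fs_resource_cmd` is not part of pcs; the KBase article referenced above remains the authoritative description of the supported setup.

```shell
#!/bin/sh
# Print (do not execute) a pcs command that would put an instance
# directory under cluster control as a Filesystem resource.
# Arguments: resource name, NFS device, mount point, resource group.
# All values passed below are hypothetical placeholders.
fs_resource_cmd() {
    name=$1
    device=$2
    directory=$3
    group=$4
    printf 'pcs resource create %s Filesystem device=%s directory=%s fstype=nfs --group %s\n' \
        "$name" "$device" "$directory" "$group"
}

# Review the output before running any of it on a cluster node.
fs_resource_cmd rh1_fs_ascs00 nfs.example.com:/export/RH1/ASCS00 /usr/sap/RH1/ASCS00 rh1_ASCS00_group
fs_resource_cmd rh1_fs_ers10 nfs.example.com:/export/RH1/ERS10 /usr/sap/RH1/ERS10 rh1_ERS10_group
```

Printing the commands rather than running them keeps the sketch safe to try on a machine that is not a cluster node.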

Ondrej, since the issues you discovered when trying to use the test provided by SAP as part of the SAP NetWeaver HA Interface certification are not related to the issue discussed in this thread, we should discuss them separately.