Bug 1573723 - SAPInstance stops both Master and Slave when only Slave detects missing binaries
Summary: SAPInstance stops both Master and Slave when only Slave detects missing binaries
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: resource-agents
Version: 7.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Assignee: Oyvind Albrigtsen
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Duplicates: 1612799
Depends On:
Blocks: 1018952
Reported: 2018-05-02 06:57 UTC by Ondrej Faměra
Modified: 2020-01-06 00:17 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-10-19 09:02:20 UTC
Target Upstream Version:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | ClusterLabs resource-agents pull 1155 | 0 | 'None' | closed | SAPInstance improvements 2018/05 (2/4) - make missing binaries on Slave non-fatal issue | 2021-01-28 11:12:12 UTC
Red Hat Knowledge Base (Solution) | 3430711 | 0 | None | None | None | 2018-05-03 07:03:59 UTC
Red Hat Knowledge Base (Solution) | 3552761 | 0 | None | None | None | 2018-10-01 16:09:40 UTC

Internal Links: 1612799

Description Ondrej Faměra 2018-05-02 06:57:01 UTC
=== Description of problem:
SAPInstance is not very resilient when the binaries it uses go missing.
Missing binaries on the Slave cause both the Slave and the Master to stop.

=== Version-Release number of selected component (if applicable):
resource-agents-sap-3.9.5-124.el7.x86_64

=== How reproducible:
always

=== Steps to Reproduce:
1. Setup ASCS/ERS SAPInstance resource in cluster as described here
https://access.redhat.com/articles/3150081#configure-ascs-ers-sapinstance-cluster-resource
# pcs resource create rh1_ascs_ers SAPInstance InstanceName="RH1_ASCS00_rh1-ascs" DIR_PROFILE=/sapmnt/RH1/profile START_PROFILE=/sapmnt/RH1/profile/RH1_ASCS00_rh1-ascs ERS_InstanceName="RH1_ERS10_rh1-ers" ERS_START_PROFILE=/sapmnt/RH1/profile/RH1_ERS10_rh1-ers master master-max="1" clone-max="2" notify="true" interleave="true"
2. On the node running the Slave, make the binaries needed by this resource agent non-executable
(this will cause the function `have_binary` to fail)
# chmod -x /usr/sap/RH1/ASCS00/exe/sapstartsrv
# chmod -x /usr/sap/RH1/ERS10/exe/sapstartsrv
# chmod -x /usr/sap/RH1/SYS/exe/run/sapstartsrv
3. Wait for the 'monitor' action to run on the node with the non-executable binaries
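
The effect of step 2 can be reproduced in isolation. The sketch below uses a simplified stand-in for `have_binary` (the real helper ships with the resource-agents shell functions and may differ in detail) and a throwaway file rather than a real SAP binary; `check_executable` and the temporary path are illustrative names only.

```shell
#!/bin/sh
# Simplified stand-in for the agent's have_binary check.
# Assumption: the real helper may resolve names via PATH and differ in detail.
check_executable() {
    [ -f "$1" ] && [ -x "$1" ]
}

demo_dir=$(mktemp -d)
demo_bin="$demo_dir/sapstartsrv"    # throwaway file, not a real SAP binary
printf '#!/bin/sh\nexit 0\n' > "$demo_bin"

chmod +x "$demo_bin"
if check_executable "$demo_bin"; then before=found; else before=missing; fi

chmod -x "$demo_bin"                # same change as in step 2
if check_executable "$demo_bin"; then after=found; else after=missing; fi

echo "executable bit set:     $before"
echo "executable bit removed: $after"
rm -rf "$demo_dir"
```

The file is still present after `chmod -x`, but a check based on the executable bit treats it the same way as a missing binary.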

=== Actual results:
The monitor action on the node running the Slave detects that binaries are missing and returns $OCF_ERR_INSTALLED.
This stops both the Slave and the Master and prevents the resource from being started anywhere in the cluster.
The node running the Master has all the binaries it needs, yet the Master is no longer running.
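
The failure path behind these results can be sketched as follows. The return-code values are the standard OCF ones; `monitor_strict` and its argument are hypothetical names and are not taken from the SAPInstance agent.

```shell
#!/bin/sh
# Standard OCF return codes (normally provided by the resource-agents shell library).
OCF_SUCCESS=0
OCF_ERR_INSTALLED=5

# Hypothetical monitor check: a missing or non-executable binary is reported
# as "not installed", regardless of whether the instance is Master or Slave.
monitor_strict() {
    binary=$1
    if [ ! -x "$binary" ]; then
        return $OCF_ERR_INSTALLED
    fi
    return $OCF_SUCCESS
}

monitor_strict /nonexistent/sapstartsrv
echo "monitor rc for missing binary: $?"
monitor_strict /bin/sh
echo "monitor rc for present binary: $?"
```

Because the check does not look at the role, a Slave with missing binaries reports the same return code a Master would.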

=== Expected results:
The cluster detects the problem on the affected node and prevents the resource from starting only on
that node, keeping the Master running and possibly starting the Slave on another node where the binaries are available and executable.
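
One possible shape of the expected behaviour, as a minimal sketch: the check takes the role into account and treats missing binaries as a hard error only on the Master. The function name, the `role` argument, and the choice of `OCF_NOT_RUNNING` for the Slave case are assumptions made for illustration; the upstream PR may implement this differently.

```shell
#!/bin/sh
# Standard OCF return codes (normally provided by the resource-agents shell library).
OCF_SUCCESS=0
OCF_ERR_INSTALLED=5
OCF_NOT_RUNNING=7

# Hypothetical role-aware check; not the agent's actual code.
monitor_role_aware() {
    role=$1      # "Master" or "Slave"
    binary=$2
    if [ -x "$binary" ]; then
        return $OCF_SUCCESS
    fi
    if [ "$role" = "Slave" ]; then
        # Assumption: report the Slave as not running instead of "not installed"
        return $OCF_NOT_RUNNING
    fi
    return $OCF_ERR_INSTALLED
}

monitor_role_aware Slave /nonexistent/sapstartsrv
echo "Slave, missing binary:  rc=$?"
monitor_role_aware Master /nonexistent/sapstartsrv
echo "Master, missing binary: rc=$?"
```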

=== Additional info:
Upstream PR to address this: https://github.com/ClusterLabs/resource-agents/pull/1144

I have an internal 2-node cluster with the latest RHEL 7.5 and an SAP NetWeaver setup that can be used
for testing the changes.

Additional notes for testing:
- /usr/sap/$SID/SYS/exe/run/ is on a shared NFS share, so changes made there affect all nodes

PR also addresses the detection of binaries for Master/Slave kind of SAPInstance to make it more consistent.

Comment 30 Frank Danapfel 2018-10-01 16:09:40 UTC
*** Bug 1612799 has been marked as a duplicate of this bug. ***

Comment 31 Frank Danapfel 2018-10-19 09:02:20 UTC
As agreed between Arne Arnold, Red Hat PM for SAP, and Joachim Boyer, Red Hat escalation manager for the support cases related to this bugzilla, we won't be implementing the fixes proposed in this bugzilla in the SAPInstance resource agent shipped via the resource-agents-sap packages, for the following reasons:

- implementing the changes is considered too risky, since it is not clear what effects the change would have on existing customer implementations

- the root cause of the issue that this bugzilla is trying to fix is that the instance directory/file system of either the (A)SCS or the ERS instance becomes inaccessible. The issue can therefore also be avoided by adapting the cluster configuration so that the cluster manages the instance directories as well: if an instance directory/file system becomes inaccessible, the cluster will either attempt to restore access or fence the node if access can't be restored. Customers who do not want the file systems managed by the cluster need some alternative method of monitoring that the file systems needed by the SAP instances managed by the cluster are accessible, and need procedures in place to make those file systems accessible again as soon as possible if they become unavailable on a node

- a new architecture for managing the SAP (A)SCS/ERS instances is currently being developed; it will eliminate the issue that this bugzilla is trying to fix, since it implements the following changes:
   - the (A)SCS and ERS instances will be managed as separate instances and will no longer use the master/slave approach
   - the instance directories/file systems will be managed by the cluster by default
  See the following KBase for further details on the new setup:
  'Configuring ASCS/ERS for SAP NetWeaver with standalone resources in RHEL 7.5'
  (https://access.redhat.com/articles/3569681)
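
For illustration, the direction described in the last two points (cluster-managed instance directories, and (A)SCS and ERS as separate resources) might look roughly like the following. This is a sketch only: the group and resource IDs and the `<NFS_server>:<..._export>` device placeholders are assumptions, the instance names and profile paths are reused from the reproduction steps in the description, and operation timeouts and other options are omitted. The KBase article remains the reference for the supported configuration.

```shell
# (A)SCS: instance directory and instance managed together in one group
pcs resource create rh1_fs_ascs00 Filesystem \
    device='<NFS_server>:<ascs_export>' directory=/usr/sap/RH1/ASCS00 fstype=nfs \
    --group rh1_ASCS00_group
pcs resource create rh1_ascs00 SAPInstance \
    InstanceName="RH1_ASCS00_rh1-ascs" \
    START_PROFILE=/sapmnt/RH1/profile/RH1_ASCS00_rh1-ascs \
    --group rh1_ASCS00_group

# ERS: same pattern, as a separate group instead of a Slave clone instance
pcs resource create rh1_fs_ers10 Filesystem \
    device='<NFS_server>:<ers_export>' directory=/usr/sap/RH1/ERS10 fstype=nfs \
    --group rh1_ERS10_group
pcs resource create rh1_ers10 SAPInstance \
    InstanceName="RH1_ERS10_rh1-ers" \
    START_PROFILE=/sapmnt/RH1/profile/RH1_ERS10_rh1-ers \
    --group rh1_ERS10_group

# Prefer running the two groups on different nodes (score is an assumption)
pcs constraint colocation add rh1_ERS10_group with rh1_ASCS00_group -5000
```

With each Filesystem resource placed first in its group, an inaccessible instance directory is reported as a failure of that resource, and the cluster can react to it before the SAPInstance monitor has to.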

Ondrej, since the issues you discovered when trying to use the test provided by SAP as part of the SAP NetWeaver HA Interface certification are not related to the issue discussed in this thread, we should discuss them separately.

