Bug 1855888

Summary: SAPHana: check_for_primary() uses mode instead of actual mode in global.ini as fallback [RHEL 7]
Product: Red Hat Enterprise Linux 7 Reporter: Reid Wahl <nwahl>
Component: resource-agents Assignee: Frank Danapfel <fdanapfe>
Status: CLOSED ERRATA QA Contact: cluster-qe <cluster-qe>
Severity: urgent Docs Contact:
Priority: high    
Version: 7.8 CC: aarnold, agk, amemon, cfeist, cluster-maint, cnewsom, cpelland, cww, dkinkead, fdanapfe, fdinitto, jodonnel, jreznik, krohlfs, kwalker, oalbrigt, phagara, revijaya, sbradley
Target Milestone: rc Keywords: OtherQA, Reopened, Triaged, ZStream
Target Release: 7.9 Flags: aarnold: mirror+
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: resource-agents-4.1.1-61.el7_9.13 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1855885
: 1918784 1918786 1943756 Environment:
Last Closed: 2021-08-31 09:11:01 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1855885    
Bug Blocks: 1918784, 1918786, 1943756    

Description Reid Wahl 2020-07-10 20:05:13 UTC
+++ This bug was initially created as a clone of Bug #1855885 +++

Description of problem:

The SAPHana resource agent uses the `system_replication/mode` attribute from global.ini as a fallback if the `$hdbState` command fails. The expectation is that a takeover event updates the `mode` parameter so that it's usually a valid representation of which node is currently primary.

However, `mode` is a static parameter that does not change with a takeover event. Instead, the takeover updates the `actual mode` parameter.

Our resource agent needs to be updated to query the correct parameter in the event that `$hdbState` fails. This way, we can respond more appropriately to edge-case situations like missing hdb* binaries.
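
The intended fallback order can be sketched roughly as follows. This is a minimal Python illustration, not the agent's shell code. The function names `read_sr_params` and `fallback_role` are hypothetical, and accepting both the `actual mode` and `actual_mode` spellings, as well as falling back to `mode` when the other key is absent, are assumptions made for the example.

```python
def read_sr_params(ini_text):
    """Collect key/value pairs from the [system_replication] section."""
    params = {}
    section = None
    for raw in ini_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].strip()
            continue
        if section == "system_replication" and "=" in line:
            key, value = line.split("=", 1)
            # Assumption: treat "actual mode" and "actual_mode" as the same key.
            params[key.strip().replace(" ", "_")] = value.strip()
    return params


def fallback_role(ini_text):
    """Derive the node role from global.ini when $hdbState is unavailable."""
    params = read_sr_params(ini_text)
    # Prefer the value a takeover updates; use the static one only if it is missing.
    mode = params.get("actual_mode") or params.get("mode")
    if mode is None:
        return "unknown"
    return "primary" if mode == "primary" else "secondary"


# node2 after a takeover, using the values quoted in this report
sample = """[system_replication]
  mode  = sync
  actual mode = primary
  operation_mode = logreplay
"""
print(fallback_role(sample))  # primary
```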


Adapted from an SAP engineer in a support collaboration email:
~~~
# # Before takeover
# # node1 is primary, node2 is secondary
	
global.ini on node1:
  mode  = primary
  actual mode = primary
  operation_mode = logreplay

global.ini on node2:
  mode  = sync
  actual mode = sync
  operation_mode = logreplay

hdbnsutil -sr_state on node1:
  mode: primary
  operation mode: primary

hdbnsutil -sr_state on node2:
  mode: sync
  operation mode: logreplay


# # After takeover/failover
# # node1 is secondary, node2 is primary

global.ini on node1:
  mode  = primary
  actual mode = sync
  operation_mode = logreplay

global.ini on node2:
  mode  = sync
  actual mode = primary
  operation_mode = logreplay

hdbnsutil -sr_state on node1:
  mode: sync
  operation mode: logreplay

hdbnsutil -sr_state on node2:
  mode: primary
  operation mode: primary


Just have a look how we change the parameter values depending on the source – global.ini or hdbnsutil -sr_state. I highlighted major differences.

This behavior doesn’t depend on removing binaries, it’s normal HANA parameter change after the takeover/re-registering secondaries to new primary. Confirmation is provided in SAP Note 1999880 - FAQ: SAP HANA System Replication:

47. Why do I see deviating values in the system replication mode parameter?

The value in parameter global.ini -> [system_replication] -> mode depends on the original role of the site:

· If site was originally configured as primary site: mode = 'primary'

· If site was originally configured as secondary / tertiary site: mode = 'sync', 'async', ... (dependent on the system replication mode)

As a consequence the mode value can be different in two identically configured systems if a takeover happened in one system, but not in the other.
~~~
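
To make the comparison in the email concrete, the four states can be tabulated and checked mechanically. The sketch below copies the `mode`, `actual mode`, and `hdbnsutil -sr_state` mode values from the email; the data layout and the helper name `keys_matching_sr_state` are hypothetical and only serve the illustration.

```python
# (node, phase) -> values copied from the email above.
# "sr_state" is the "mode:" line reported by hdbnsutil -sr_state.
STATES = {
    ("node1", "before"): {"mode": "primary", "actual mode": "primary", "sr_state": "primary"},
    ("node2", "before"): {"mode": "sync", "actual mode": "sync", "sr_state": "sync"},
    ("node1", "after"): {"mode": "primary", "actual mode": "sync", "sr_state": "sync"},
    ("node2", "after"): {"mode": "sync", "actual mode": "primary", "sr_state": "primary"},
}


def keys_matching_sr_state(states):
    """Return the global.ini keys that agree with hdbnsutil in every state."""
    candidates = ("mode", "actual mode")
    return [
        key
        for key in candidates
        if all(state[key] == state["sr_state"] for state in states.values())
    ]


print(keys_matching_sr_state(STATES))  # ['actual mode']
```

Only `actual mode` tracks the live role across the takeover; `mode` diverges on both nodes afterwards.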

-----

Version-Release number of selected component (if applicable):

resource-agents-sap-hana-4.1.1-53.el7

-----

How reproducible:

Always

-----

Steps to Reproduce:

Assuming SAP's description of the parameters is correct, it's trivial to look at the check_for_primary() function and see that we're apparently querying the wrong one.

I believe the following steps will reproduce an issue occurring as a result of this:

1. Make the node with `mode = sync` the primary node via takeover.
2. Move the hdb* binaries to another location on that node so that the binaries are "missing."

-----

Actual results:

Both nodes end up in demoted state because the RA reads "mode = sync" from the global.ini file as a fallback on the primary.

-----

Expected results:

Pacemaker does not take any corrective action because the RA reads "actual mode = primary" from the global.ini file as a fallback.
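
The difference between the actual and expected results can be modeled in a few lines. This is a simplified sketch under stated assumptions: the Python `check_for_primary` below is a hypothetical stand-in for the agent's shell function, it returns plain strings rather than the agent's return codes, and it treats any value other than `primary` as secondary.

```python
def check_for_primary(hdb_state_output, ini_params, fallback_key):
    """Return 'primary' or 'secondary' for one node.

    hdb_state_output is None when the $hdbState command fails, in which
    case the value is read from global.ini using fallback_key.
    """
    if hdb_state_output is not None:
        mode = hdb_state_output
    else:
        mode = ini_params.get(fallback_key)
    # Simplification: anything that is not 'primary' counts as secondary.
    return "primary" if mode == "primary" else "secondary"


# State after the takeover: node2 is primary and its hdb* binaries are missing.
NODES = {
    "node1": {"hdb_state": "sync", "ini": {"mode": "primary", "actual mode": "sync"}},
    "node2": {"hdb_state": None, "ini": {"mode": "sync", "actual mode": "primary"}},
}


def cluster_view(fallback_key):
    """Role each node would report for a given fallback key."""
    return {
        name: check_for_primary(node["hdb_state"], node["ini"], fallback_key)
        for name, node in NODES.items()
    }


print(cluster_view("mode"))         # {'node1': 'secondary', 'node2': 'secondary'}
print(cluster_view("actual mode"))  # {'node1': 'secondary', 'node2': 'primary'}
```

With `mode` as the fallback, no node reports primary, which corresponds to both nodes ending up demoted. With `actual mode`, node2 still reports primary.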

-----

Additional info:

Related to closed BZ1783581.

--- Additional comment from Reid Wahl on 2020-07-10 20:02:01 UTC ---

I don't think this is an issue in the Scale Out RAs, but it wouldn't hurt to confirm.

Comment 9 Chris Williams 2020-11-11 21:50:06 UTC
Red Hat Enterprise Linux 7 shipped its final minor release on September 29th, 2020. 7.9 was the last minor release scheduled for RHEL 7.
From initial triage it does not appear the remaining Bugzillas meet the inclusion criteria for Maintenance Phase 2 and will now be closed.

From the RHEL life cycle page:
https://access.redhat.com/support/policy/updates/errata#Maintenance_Support_2_Phase
"During Maintenance Support 2 Phase for Red Hat Enterprise Linux version 7, Red Hat defined Critical and Important impact Security Advisories (RHSAs) and selected (at Red Hat discretion) Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available."

If this BZ was closed in error and meets the above criteria, please re-open it, flag it for 7.9.z, provide suitable business and technical justifications, and follow the process for Accelerated Fixes:
https://source.redhat.com/groups/public/pnt-cxno/pnt_customer_experience_and_operations_wiki/support_delivery_accelerated_fix_release_handbook  

Feature Requests can be re-opened and moved to RHEL 8 if the desired functionality is not already present in the product.

Please reach out to the applicable Product Experience Engineer[0] if you have any questions or concerns.  

[0] https://bugzilla.redhat.com/page.cgi?id=agile_component_mapping.html&product=Red+Hat+Enterprise+Linux+7

Comment 34 errata-xmlrpc 2021-08-31 09:11:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (resource-agents bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3332