
Bug 1943756

Summary: SAPHana: check_for_primary() uses mode instead of actual mode in global.ini as fallback [RHEL 7] [rhel-7-4.z]
Product: Red Hat Enterprise Linux 7
Reporter: Reid Wahl <nwahl>
Component: resource-agents
Assignee: Oyvind Albrigtsen <oalbrigt>
Status: CLOSED NOTABUG
QA Contact: cluster-qe <cluster-qe>
Severity: urgent
Docs Contact:
Priority: high
Version: 7.8
CC: aarnold, agk, cfeist, cluster-maint, cluster-qe, cpelland, cww, dkinkead, fdanapfe, fdinitto, jodonnel, jreznik, krohlfs, kwalker, oalbrigt, phagara, revijaya, sbradley
Target Milestone: rc
Keywords: OtherQA, Reopened, Triaged, ZStream
Target Release: ---
Flags: pm-rhel: mirror+
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1855888
Environment:
Last Closed: 2021-07-28 14:57:34 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1855888
Bug Blocks:

Description Reid Wahl 2021-03-27 05:13:54 UTC
+++ This bug was initially created as a clone of Bug #1855888 +++

+++ This bug was initially created as a clone of Bug #1855885 +++

Description of problem:

The SAPHana resource agent uses the `system_replication/mode` attribute from global.ini as a fallback if the `$hdbState` command fails. The expectation is that a takeover event updates the `mode` parameter so that it's usually a valid representation of which node is currently primary.

However, `mode` is a static parameter that does not change with a takeover event. Instead, the takeover updates the `actual mode` parameter.

Our resource agent needs to be updated to query the correct parameter in the event that `$hdbState` fails. This way, we can respond more appropriately to edge-case situations like missing hdb* binaries.
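
For illustration, here is a minimal sketch of the kind of fallback logic being described. This is not the shipped SAPHana agent code; the global.ini path, the SID "RH1", and the exact key spelling are assumptions for the example only.

~~~
#!/bin/sh
# Sketch only: read the replication role from global.ini when hdbnsutil is
# unavailable. Prefer actual_mode, which a takeover updates, and fall back
# to the static mode entry only if actual_mode is missing.
GLOBAL_INI="/hana/shared/RH1/global/hdb/custom/config/global.ini"   # placeholder path

srmode_fallback() {
    awk -F'=' '
        /^[ \t]*\[system_replication\]/ { in_sr = 1; next }
        /^[ \t]*\[/                     { in_sr = 0 }
        in_sr && $1 ~ /^[ \t]*actual[_ ]?mode[ \t]*$/ { gsub(/[ \t]/, "", $2); actual = $2 }
        in_sr && $1 ~ /^[ \t]*mode[ \t]*$/            { gsub(/[ \t]/, "", $2); static = $2 }
        END { print (actual != "" ? actual : static) }
    ' "$GLOBAL_INI"
}

[ "$(srmode_fallback)" = "primary" ] && echo "this node is the current primary"
~~~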


Adapted from an SAP engineer in a support collaboration email:
~~~
# # Before takeover
# # node1 is primary, node2 is secondary
	
global.ini on node1:
  mode  = primary
  actual mode = primary
  operation_mode = logreplay

global.ini on node2:
  mode  = sync
  actual mode = sync
  operation_mode = logreplay

hdbnsutil -sr_state on node1:
  mode: primary
  operation mode: primary

hdbnsutil -sr_state on node2:
  mode: sync
  operation mode: logreplay


# # After takeover/failover
# # node1 is secondary, node2 is primary

global.ini on node1:
  mode  = primary
  actual mode = sync
  operation_mode = logreplay

global.ini on node2:
  mode  = sync
  actual mode = primary
  operation_mode = logreplay

hdbnsutil -sr_state on node1:
  mode: sync
  operation mode: logreplay

hdbnsutil -sr_state on node2:
  mode: primary
  operation mode: primary


Just have a look at how the parameter values change depending on the source (global.ini or hdbnsutil -sr_state). I highlighted the major differences.

This behavior doesn't depend on removing binaries; it's a normal HANA parameter change after a takeover or after re-registering secondaries to the new primary. Confirmation is provided in SAP Note 1999880 - FAQ: SAP HANA System Replication:

47. Why do I see deviating values in the system replication mode parameter?

The value in parameter global.ini -> [system_replication] -> mode depends on the original role of the site:

- If site was originally configured as primary site: mode = 'primary'

- If site was originally configured as secondary / tertiary site: mode = 'sync', 'async', ... (dependent on the system replication mode)

As a consequence the mode value can be different in two identically configured systems if a takeover happened in one system, but not in the other.
~~~
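
For reference, the two sources can be compared directly on a node. This is only a hedged illustration; the SID "RH1", the <sid>adm user "rh1adm", and the global.ini path are placeholders, not values from this report.

~~~
# hdbnsutil view (requires the hdb* binaries to be reachable):
su - rh1adm -c "hdbnsutil -sr_state" | grep -E '^(mode|operation mode):'

# global.ini view (what the resource agent falls back to):
awk '/^[ \t]*\[system_replication\]/ {p=1; next} /^[ \t]*\[/ {p=0}
     p && /^[ \t]*(actual[_ ]?mode|mode)[ \t]*=/' \
    /hana/shared/RH1/global/hdb/custom/config/global.ini
~~~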

-----

Version-Release number of selected component (if applicable):

resource-agents-sap-hana-4.1.1-53.el7

-----

How reproducible:

Always

-----

Steps to Reproduce:

Assuming SAP's description of the parameters is correct, it's trivial to look at the check_for_primary() function and see that we're apparently querying the wrong one.

I believe the following steps will reproduce an issue occurring as a result of this:

1. Make the node with `mode = sync` the primary node via takeover.
2. Move the hdb* binaries to another location on that node so that the binaries are "missing."
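
In shell terms, a hedged sketch of those two steps (the SID "RH1", admin user "rh1adm", instance "00", and the exe path are placeholders, not values from this report):

~~~
# 1. On the current secondary (global.ini shows mode = sync), take over the
#    primary role.
su - rh1adm -c "hdbnsutil -sr_takeover"

# 2. On that new primary, make the hdb* binaries unreachable so the resource
#    agent has to use its global.ini fallback.
mv /usr/sap/RH1/HDB00/exe /usr/sap/RH1/HDB00/exe.hidden
~~~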

-----

Actual results:

Both nodes end up in demoted state because the RA reads "mode = sync" from the global.ini file as a fallback on the primary.

-----

Expected results:

Pacemaker does not take any corrective action because the RA reads "actual mode = primary" from the global.ini file as a fallback.

-----

Additional info:

Related to closed BZ1783581.

--- Additional comment from Reid Wahl on 2020-07-10 20:02:01 UTC ---

I don't think this is an issue in the Scale Out RAs, but it wouldn't hurt to confirm.

--- Additional comment from RHEL Program Management on 2020-07-10 20:05:21 UTC ---

Since this bug report was entered in Red Hat Bugzilla, the release flag has been set to ? to ensure that it is properly evaluated for this release.

--- Additional comment from Frank Danapfel on 2020-08-18 12:51:04 UTC ---



--- Additional comment from Frank Danapfel on 2020-09-24 14:44:53 UTC ---

Received feedback from CSM for customer that testing of preliminary fix was successful:
---
From: "Keri Rohlfs" <krohlfs>
To: "Frank Danapfel" <fdanapfe>
Cc: "Kyle Drehwing" <kdrehwin>, "Mark Tonneson" <mark.tonneson>, "James O'Donnell" <jodonnel>, "Reid Wahl" <nwahl>, "Christoph Brune" <cbrune>, "Arne Arnold" <aarnold>
Sent: Wednesday, 16 September, 2020 4:20:02 PM
Subject: Re: Mars - Bugs 1855885 and 1855888

Hello Frank,

I heard back from Mars today and good news, they completed their testing and it was positive.
---

Will now continue the discussions with upstream on what the final patch should look like (see https://github.com/SUSE/SAPHanaSR/issues/40). When this has been decided, we can start working on creating the final patch.

--- Additional comment from Chris Williams on 2020-10-23 19:42:34 UTC ---

Red Hat Enterprise Linux 7 shipped its final minor release on September 29th, 2020. 7.9 was the last minor release scheduled for RHEL 7.
Unless the remaining Bugzillas meet the inclusion criteria for Maintenance Phase 2, this BZ will be closed during the week of November 9th, 2020.

From the RHEL life cycle page:
https://access.redhat.com/support/policy/updates/errata#Maintenance_Support_2_Phase
"During Maintenance Support 2 Phase for Red Hat Enterprise Linux version 7,Red Hat defined Critical and Important impact Security Advisories (RHSAs) and selected (at Red Hat discretion) Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available."

If this BZ meets the above criteria please flag it for 7.9.z, provide suitable business and technical justifications, and follow the process for Accelerated Fixes:
https://mojo.redhat.com/docs/DOC-80399  

Feature Requests can be moved to RHEL 8 if the desired functionality is not already present in the product. 

Please reach out to the applicable Product Experience Engineer[0] if you have any questions or concerns.  

[0] https://bugzilla.redhat.com/page.cgi?id=agile_component_mapping.html&product=Red+Hat+Enterprise+Linux+7

--- Additional comment from Shane Bradley on 2020-10-29 12:57:02 UTC ---

We have multiple tickets attached to this one. I am flagging for 7.9.z inclusion.

--- Additional comment from Reid Wahl on 2020-11-02 23:09:33 UTC ---

I second the request to include this in RHEL 7. The Mars account in particular has expressed that this is of high importance for them, so that they won't automatically face an outage if hdb* binaries are lost.

However, AFAIK RHEL 7.7 is the latest RHEL 7 minor release to be supported for SAP HANA. If my understanding is correct, then 7.9.z will not be sufficient. We will need z-streams to at least 7.7. RHEL 7.6 is preferred, as that's the version Mars is on and it's not clear (@krohlfs, can you find out for us?) whether they're able to upgrade to 7.7.

--- Additional comment from Arne Arnold on 2020-11-04 13:50:00 UTC ---

(In reply to Chris Williams from comment #4)
> If this BZ meets the above criteria please flag if for 7.9.z, provide
> suitable business and technical justifications

Echoing what Shane and Reid already stated: this fix is required to achieve the intended behaviour of our HA solutions.
The corresponding hot patch has already been shared with and tested by the customer.
We should make sure that, once it is accepted upstream, it can still make its way into RHEL 7.

While RHEL 7.9z is in fact planned to be certified for SAP HANA, it may take time until we get there.
Currently, RHEL 7.7 is the last SAP HANA certified release of RHEL 7, while most of our CCSPs are still offering RHEL 7.6.
I hence echo Reid's comment that we should consider a backport into RHEL 7.6 and 7.7, depending on when we can actually get the fix pulled downstream.

--- Additional comment from Kyle Walker on 2020-11-04 15:36:10 UTC ---

After further discussion with the Mars account team, I'm nominating this for 7.6.z and 7.7.z inclusion as well. I will follow up with the SST internally to get the requisite ACKs.

--- Additional comment from Chris Williams on 2020-11-11 21:50:06 UTC ---

Red Hat Enterprise Linux 7 shipped its final minor release on September 29th, 2020. 7.9 was the last minor release scheduled for RHEL 7.
From initial triage it does not appear that the remaining Bugzillas meet the inclusion criteria for Maintenance Phase 2, and they will now be closed.

From the RHEL life cycle page:
https://access.redhat.com/support/policy/updates/errata#Maintenance_Support_2_Phase
"During Maintenance Support 2 Phase for Red Hat Enterprise Linux version 7,Red Hat defined Critical and Important impact Security Advisories (RHSAs) and selected (at Red Hat discretion) Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available."

If this BZ was closed in error and meets the above criteria, please re-open it, flag it for 7.9.z, provide suitable business and technical justifications, and follow the process for Accelerated Fixes:
https://source.redhat.com/groups/public/pnt-cxno/pnt_customer_experience_and_operations_wiki/support_delivery_accelerated_fix_release_handbook  

Feature Requests can be re-opened and moved to RHEL 8 if the desired functionality is not already present in the product.

Please reach out to the applicable Product Experience Engineer[0] if you have any questions or concerns.  

[0] https://bugzilla.redhat.com/page.cgi?id=agile_component_mapping.html&product=Red+Hat+Enterprise+Linux+7

--- Additional comment from Reid Wahl on 2020-11-11 22:10:03 UTC ---

It's not clear to me why this Bugzilla was closed when multiple parties (GSS, the Mars account team, the SAP product team, and the HA product team) are all seeking to get this approved for RHEL 7.6.z, 7.7.z, and 7.9.z. The recent BZ comments attest to this. The 7.9.z flag was set on 29 Oct, and the 7.6.z and 7.7.z flags were set on 4 Nov.

I'm reopening it.

--- Additional comment from RHEL Program Management on 2020-11-11 22:10:13 UTC ---

This bug was reopened or transitioned from a non-RHEL to RHEL product.  The stale date has been reset to +6 months.

--- Additional comment from Chris Williams on 2020-11-11 23:11:16 UTC ---

(In reply to Reid Wahl from comment #10)
> It's not clear to me why this Bugzilla was closed when multiple parties
> (GSS, the Mars account team, the SAP product team, and the HA product team)
> are all seeking to get this approved for RHEL 7.6.z, 7.7.z, and 7.9.z. The
> recent BZ comments attest to this. The 7.9.z flag was set on 29 Oct, and the
> 7.6.z and 7.7.z flags were set on 4 Nov.
> 
> I'm reopening it.

Yes, my bad. Thanks for re-opening it. I was just about to correct this error.

--- Additional comment from Reid Wahl on 2020-11-11 23:27:53 UTC ---

(In reply to Chris Williams from comment #12)
> Yes, my bad. Thanks for re-opening it. I was just about to correct this
> error.

NP. Easy to get into a groove when working through a large batch of these.

--- Additional comment from  on 2020-11-20 14:14:28 UTC ---

Apologies for my delay in responding to the question (@krohlfs, can you find out for us?) of whether they're able to upgrade to 7.7.

Mars is currently testing 7.7 and has run into concerns; they are working with support. I will provide an update once we know if they can upgrade to 7.7.

--- Additional comment from Jaroslav Reznik on 2020-11-23 14:06:37 UTC ---

Please make sure the bug has all pm/devel/qa acks before I can ack for 7.9.z and clone to other streams.

--- Additional comment from Frank Danapfel on 2020-12-07 12:28:19 UTC ---

Since I'm seeing more and more support cases being attached to this bugzilla I'd like to clarify something:

As I've just mentioned in a private comment on https://access.redhat.com/solutions/4657331, the purpose of this Bugzilla is to provide a fix for the fallback mechanism used to determine the current status of HANA System Replication, where the wrong status is returned in a very specific customer scenario:
---
ACT: Using getParameter.py as fallback - node_status=<wrong_status>
---

It will NOT help customers where the SAPHana/SAPHanaTopology resource agents return the following error:
---
ERROR: ACT: check_for_primary:  we didn't expect srmode to be: DUMP:
---

Therefore, please DO NOT attach any customer cases to this bugzilla where the error mentioned above is reported. Instead, work with the customer, the hardware partner, the cloud provider, or SAP to try to determine why HANA wasn't able to return the proper value for the current HANA System Replication status (for example, this might be due to performance issues, a bug on the HANA side, a problem with the infrastructure on which HANA is running, ...).
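
To tell the two situations apart before attaching a case, the agent's log messages can be checked. A hedged illustration follows; the log location is an assumption and depends on the cluster's logging setup.

~~~
# Covered by this bugzilla (wrong status returned by the global.ini fallback):
grep 'Using getParameter.py as fallback' /var/log/messages

# NOT covered by this bugzilla (HANA itself reported an unexpected srmode):
grep "we didn't expect srmode to be" /var/log/messages
~~~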

--- Additional comment from Chris Feist on 2021-01-09 03:19:29 UTC ---

Dropping priority from urgent to high, as a high priority will still get a z-stream, but urgent is causing this bz to get flagged on several program lists.  Please let me know if we have a pressing need to keep this flagged as urgent and we can change it back.

--- Additional comment from Jaroslav Reznik on 2021-01-13 13:13:16 UTC ---

This bug was approved on RHEL Blocker and Exception meeting for inclusion in RHEL 7.9.z. Please proceed with qa and devel acks when possible.

--- Additional comment from Patrik Hagara on 2021-01-13 13:29:47 UTC ---

qa_ack+, to be verified by SAP QE

--- Additional comment from Oyvind Albrigtsen on 2021-01-13 14:59:25 UTC ---

Bumping back to ASSIGNED, as we haven't got the patch yet.

--- Additional comment from RAD team bot copy to z-stream on 2021-01-21 14:44:33 UTC ---

This bug has been copied as 7.6 z-stream (EUS) bug #1918784 and now must be
resolved in the current update release, set blocker flag.

--- Additional comment from RAD team bot copy to z-stream on 2021-01-21 14:44:55 UTC ---

This bug has been copied as 7.7 z-stream (EUS) bug #1918786 and now must be
resolved in the current update release, set blocker flag.

--- Additional comment from Frank Danapfel on 2021-03-17 09:48:31 UTC ---

Patch for this is now available upstream (https://github.com/SUSE/SAPHanaSR/commit/ec9fd4e526e572fe9bc0070186fa584b032eac22).