Bug 2233803 - [IBM Z]-[Regional DR]-[HUB Recovery] - DR Policy remains in 'Not Validated' state after switching to secondary hub cluster.
Summary: [IBM Z]-[Regional DR]-[HUB Recovery] - DR Policy remains in 'Not Validated' s...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: documentation
Version: 4.13
Hardware: s390x
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ODF 4.13.3
Assignee: Olive Lakra
QA Contact: Parikshith
URL:
Whiteboard:
Depends On: 2178304
Blocks:
Reported: 2023-08-23 13:53 UTC by Raghavendra Talur
Modified: 2023-12-13 11:35 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2178304
Environment:
Last Closed: 2023-12-13 11:35:36 UTC
Embargoed:



Description Raghavendra Talur 2023-08-23 13:53:07 UTC
+++ This bug was initially created as a clone of Bug #2178304 +++

Description of problem (please be as detailed as possible and provide log
snippets):
DR Policy remains in 'Not Validated' state after switching to the secondary hub cluster;
the status reason reports 'DRClustersUnavailable'.
No application is deployed.

```
[root@m4216001 ~]# oc get drpolicy ocsm1301015-ocsm4204001-5m -o jsonpath='{.status.conditions[].reason}{"\n"}'
DRClustersUnavailable
[root@m4216001 ~]#
```
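The reason field alone is terse; a hedged sketch of commands (using the DRPolicy name from this environment) that surface the full condition messages and the per-cluster validation status, which usually point at the underlying failure:

```shell
# Print every DRPolicy condition with its reason and full message:
oc get drpolicy ocsm1301015-ocsm4204001-5m \
  -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.reason}{"\t"}{.message}{"\n"}{end}'

# Check the Validated condition of each DRCluster referenced by the policy;
# the failing cluster's reason/message narrows down the root cause:
oc get drcluster \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Validated")].reason}{"\n"}{end}'
```

Both commands are read-only and must be run against the hub currently managing the DR resources (here, the recovered secondary hub).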

Version of all relevant components (if applicable):
OCP : 4.12.0
ODF: 4.12.1-19
MCO: 4.12.1-19

```
[root@m1301015 ~]# oc -n openshift-dr-system get csv --show-labels
NAME                           DISPLAY                         VERSION   REPLACES                       PHASE       LABELS
odr-cluster-operator.v4.12.1   Openshift DR Cluster Operator   4.12.1    odr-cluster-operator.v4.12.0   Succeeded   operators.coreos.com/odr-cluster-operator.openshift-dr-system=
volsync-product.v0.6.1         VolSync                         0.6.1     volsync-product.v0.6.0         Succeeded   olm.copiedFrom=openshift-operators,operatorframework.io/arch.amd64=supported,operatorframework.io/arch.arm64=supported,operatorframework.io/arch.ppc64le=supported,operatorframework.io/arch.s390x=supported,operatorframework.io/os.linux=supported
[root@m1301015 ~]#
```

```
[root@m4216001 ~]# oc -n openshift-operators get csv --show-labels
NAME                                    DISPLAY                         VERSION   REPLACES                   PHASE       LABELS
odf-multicluster-orchestrator.v4.12.1   ODF Multicluster Orchestrator   4.12.1                               Succeeded   operators.coreos.com/odf-multicluster-orchestrator.openshift-operators=
odr-hub-operator.v4.12.1                Openshift DR Hub Operator       4.12.1    odr-hub-operator.v4.12.0   Succeeded   operators.coreos.com/odr-hub-operator.openshift-operators=
[root@m4216001 ~]#
```




Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Cannot manage applications after hub recovery.


Is there any workaround available to the best of your knowledge?
No


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
yes


Is this issue reproducible from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Set up a Regional DR environment with two hub clusters and two managed clusters.
2. Perform hub recovery, and create the auto-import-secret so that the managed clusters show as imported.
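For reference, the auto-import-secret in step 2 follows the ACM convention of a Secret named `auto-import-secret` in the managed cluster's namespace on the hub. The sketch below is a hedged illustration; values in angle brackets are placeholders for this environment, not values from the bug:

```shell
# Create the auto-import secret on the (recovered) hub so ACM re-imports the
# managed cluster; run once per managed cluster, in that cluster's namespace.
oc create secret generic auto-import-secret -n <managed-cluster-name> \
  --from-literal=autoImportRetry=2 \
  --from-literal=token=<managed-cluster-API-token> \
  --from-literal=server=https://<managed-cluster-API-server>:6443
```

Once the import succeeds, the corresponding ManagedCluster should report Available on the hub.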


Actual results:
DR Policy remains in 'Not Validated' state.


Expected results:
DR Policy is in 'Validated' state.

Additional info:
Must-gather from all the clusters: https://drive.google.com/file/d/1P0ywE194nlJ-jJ5AphLPTrw8saOIybZw/view?usp=sharing

--- Additional comment from Abdul Kandathil (IBM) on 2023-03-15 05:25:50 EDT ---

Followed instructions from the MDR team: https://docs.google.com/document/d/1DOlkuKpbZJyzWnhll1-pj0jL3dJSzPFY8rvWieMqZU8/edit#

Also tried deleting pods as mentioned in: https://docs.google.com/document/d/1DbTvTgzwWvS3Gupyj7vl6Toa8BuZadkcIoPnnWkWqo4/edit#

--- Additional comment from RHEL Program Management on 2023-03-25 03:26:55 EDT ---

This bug having no release flag set previously, is now set with release flag 'odf-4.13.0' to '?', and so is being proposed to be fixed at the ODF 4.13.0 release. Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any previously set while release flag was missing, have now been reset since the Acks are to be set against a release flag.

--- Additional comment from Shyamsundar on 2023-04-20 12:37:24 EDT ---

@akandath I was debugging this based on the logs that you had uploaded with Elena in our team and our questions and observations are as follows:

Questions:
----------

1) In the initial configuration, for the primary hub did you set it up with SSL certificates as detailed in [1]?
2) Post hub recovery for the standby hub, did you repeat the SSL steps on the standby hub?

We do understand that [2] does not cover redoing the SSL portion as instructions during hub recovery, so that is something we should fix in documentation, but based on the logs we think this is the root cause for the issue that you have filed.

Observations:

- Based on the hub ramen logs, we see that the DRCluster reconciliation initially reported that the ramen config did not yet contain the S3Profile that the DRCluster refers to. This is understood, as the multicluster-operator fills those details in subsequently.

Example log:
```
2023-03-14T12:16:18.122141840Z 2023-03-14T12:16:18.122Z	INFO	controllers.drcluster	util/conditions.go:46	condition append	{"name": "ocsm4204001", "rid": "2913282a-c4ea-43a8-b1b8-d51c750abfba", "type": "Validated", "status": "False", "reason": "s3ConnectionFailed", "message": "s3profile-ocsm4204001-ocs-storagecluster: failed to get profile s3profile-ocsm4204001-ocs-storagecluster for caller drpolicy validation, s3 profile s3profile-ocsm4204001-ocs-storagecluster not found in RamenConfig", "generation": 1}
```

- Subsequently, the hub operator finds the updated config and then fails to connect to the S3 store, reporting certificate trust issues as follows:

```
2023-03-14T12:23:08.835173201Z 2023-03-14T12:23:08.834Z	INFO	controllers.drcluster	util/conditions.go:38	condition update	{"name": "ocsm4204001", "rid": "79a92075-9d5b-4dfc-b268-b6ee7a479c57", "type": "Validated", "old status": "False", "new status": "False", "old reason": "s3ConnectionFailed", "new reason": "s3ListFailed", "old message": "s3profile-ocsm4204001-ocs-storagecluster: failed to get profile s3profile-ocsm4204001-ocs-storagecluster for caller drpolicy validation, s3 profile s3profile-ocsm4204001-ocs-storagecluster not found in RamenConfig", "new message": "s3profile-ocsm4204001-ocs-storagecluster: failed to list objects in bucket odrbucket-307198fdb79a:/ocsm4204001, RequestError: send request failed\ncaused by: Get \"https://s3-openshift-storage.apps.ocsm4204001.lnxero1.boe/odrbucket-307198fdb79a?list-type=2&prefix=%2Focsm4204001\": x509: certificate signed by unknown authority", "old generation": 1, "new generation": 1}
```

- The above continues for 20+ minutes and never recovers:

```
2023-03-14T12:43:56.285650153Z 2023-03-14T12:43:56.268Z	INFO	controllers.drcluster	util/conditions.go:31	condition unchanged	{"name": "ocsm4204001", "rid": "9d6b9511-3e50-4352-9648-869ea06277e2", "type": "Validated", "status": "False", "reason": "s3ListFailed", "message": "s3profile-ocsm4204001-ocs-storagecluster: failed to list objects in bucket odrbucket-307198fdb79a:/ocsm4204001, RequestError: send request failed\ncaused by: Get \"https://s3-openshift-storage.apps.ocsm4204001.lnxero1.boe/odrbucket-307198fdb79a?list-type=2&prefix=%2Focsm4204001\": x509: certificate signed by unknown authority", "generation": 1}
```

- On inspecting the config maps in the ramen hub operator namespace, we do not see any user-created config map for certificates as specified in [1]; hence this step was potentially missed, as the documentation does not cover it either.

[1] Configuring SSL across clusters: https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.12/html/configuring_openshift_data_foundation_disaster_recovery_for_openshift_workloads/metro-dr-solution#configuring-ssl-access-across-clusters_mdr

[2] Failover prerequisites dealing with hub recovery: https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.12/html/configuring_openshift_data_foundation_disaster_recovery_for_openshift_workloads/metro-dr-solution#application-failover-between-managed-clusters_manage-dr
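For orientation, the SSL trust setup from [1] that needs to be repeated on the standby hub can be sketched roughly as below. The ConfigMap name (`user-ca-bundle`) and file names follow the referenced documentation's convention; verify against [1] for the exact procedure on your ODF version before running anything.

```shell
# On each managed cluster, extract the ingress CA certificate
# (the default-ingress-cert ConfigMap in openshift-config-managed holds it):
oc get cm default-ingress-cert -n openshift-config-managed \
  -o jsonpath='{.data.ca-bundle\.crt}' > primary.crt
# ...repeat on the peer cluster, saving to secondary.crt.

# On the standby hub, bundle the certificates and point the cluster-wide
# proxy at the resulting ConfigMap so S3 connections trust both ingress CAs:
cat primary.crt secondary.crt > ca-bundle.crt
oc create configmap user-ca-bundle -n openshift-config --from-file=ca-bundle.crt
oc patch proxy cluster --type=merge \
  --patch='{"spec":{"trustedCA":{"name":"user-ca-bundle"}}}'
```

Without this, the ramen hub operator's S3 list calls fail with exactly the "x509: certificate signed by unknown authority" error seen in the logs above.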

--- Additional comment from Karolin Seeger on 2023-05-02 11:36:58 EDT ---

@akandath please re-test.
Changing to ON_QA.

--- Additional comment from Abdul Kandathil (IBM) on 2023-05-04 13:35:07 EDT ---

After creating SSL config, the DR policy turned to validated state.
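A quick hedged check that the fix took effect, using the DRPolicy name from this environment; the Validated condition should report True once validation succeeds:

```shell
# Prints the status of the DRPolicy's Validated condition ('True' when healthy):
oc get drpolicy ocsm1301015-ocsm4204001-5m \
  -o jsonpath='{.status.conditions[?(@.type=="Validated")].status}{"\n"}'
```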

--- Additional comment from Shyamsundar on 2023-05-05 08:43:46 EDT ---

@olakra we should document a note stating that SSL configuration, if done manually earlier during setup, must be repeated during hub recovery. Reopening this and marking it for documentation.

(In reply to Shyamsundar from comment #3)
> 1) In the initial configuration, for the primary hub did you set it up with
> SSL certificates as detailed in [1]?
> 2) Post hub recovery for the standby hub, did you repeat the SSL steps on
> the standby hub?
> 
> We do understand that [2] does not cover redoing the SSL portion as
> instructions during hub recovery, so that is something we should fix in
> documentation, but based on the logs we think this is the root cause for the
> issue that you have filed.

> [1] Configuring SSL across clusters:
> https://access.redhat.com/documentation/en-us/
> red_hat_openshift_data_foundation/4.12/html/
> configuring_openshift_data_foundation_disaster_recovery_for_openshift_workloa
> ds/metro-dr-solution#configuring-ssl-access-across-clusters_mdr
> 
> [2] Failover prerequisites dealing with hub recovery:
> https://access.redhat.com/documentation/en-us/
> red_hat_openshift_data_foundation/4.12/html/
> configuring_openshift_data_foundation_disaster_recovery_for_openshift_workloa
> ds/metro-dr-solution#application-failover-between-managed-clusters_manage-dr

--- Additional comment from Olive Lakra on 2023-05-11 08:54:45 EDT ---

Hi Shyam,

* Does this note apply to both MDR and RDR solution? or only to MDR solution?

* Also will this note be applicable only for 4.12 onwards documentation or to any of the previous versions as well?

--- Additional comment from Shyamsundar on 2023-05-11 10:59:27 EDT ---

(In reply to Olive Lakra from comment #7)
> Hi Shyam,
> 
> * Does this note apply to both MDR and RDR solution? or only to MDR solution?

Both (as the certificate trust step is common to both setups)

> 
> * Also will this note be applicable only for 4.12 onwards documentation or
> to any of the previous versions as well?

We can do this since 4.12, as hub recovery made its appearance in that release.

--- Additional comment from Red Hat Bugzilla on 2023-08-03 04:30:16 EDT ---

Account disabled by LDAP Audit

--- Additional comment from RHEL Program Management on 2023-08-17 03:29:06 EDT ---

This BZ is being approved for an ODF 4.12.z z-stream update, upon receipt of the 3 ACKs (PM,Devel,QA) for the release flag 'odf-4.12.z', and having been marked for an approved z-stream update

--- Additional comment from RHEL Program Management on 2023-08-17 03:29:06 EDT ---

Since this bug has been approved for ODF 4.12.6 release, through release flag 'odf-4.12.z+', and appropriate update number entry at the 'Internal Whiteboard', the Target Release is being set to 'ODF 4.12.6'

