Bug 2216440 - [IBM Z /MDR]: Failover of application fails when OpenShift is reinstalled on one of the managed clusters after a disaster [NEEDINFO]
Summary: [IBM Z /MDR]: Failover of application fails when OpenShift is reinstalled on ...
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-dr
Version: 4.13
Hardware: Unspecified
OS: Unspecified
medium
unspecified
Target Milestone: ---
Assignee: Nir Soffer
QA Contact: Harish NV Rao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-06-21 11:38 UTC by Sravika
Modified: 2024-09-10 10:43 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:
rtalur: needinfo? (hnallurv)



Description Sravika 2023-06-21 11:38:05 UTC
Created attachment 1971845 [details]
vrg_managed_clsuter1.log

Description of problem (please be as detailed as possible and provide log snippets):

One of the managed clusters (mc2) went down (due to a disaster/network issue) and OCP was reinstalled on mc2 to bring the cluster back up. SSL access across the clusters was configured with the new ingress certificate. In the hub cluster, the old mc2 was detached and the reinstalled mc2 was imported again. The OpenShift DR Cluster operator was installed on mc2 automatically once the cluster was imported into the hub cluster. Failover of the application from mc1 to mc2 is stuck in the "FailingOver" state with the following error message:



# oc describe drpc busybox-placement-1-drpc -nbusybox-sample
.....
......
Events:
  Type     Reason                   Age                  From                           Message
  ----     ------                   ----                 ----                           -------
  Warning  unknown state            16m                  controller_DRPlacementControl  next state not known
  Warning  DRPCFailingOver          16m                  controller_DRPlacementControl  Failing over the application and VRG
  Warning  DRPCClusterSwitchFailed  16m                  controller_DRPlacementControl  failed to get VRG busybox-placement-1-drpc from cluster ocpm4202001 (err: getManagedClusterResource results:  "requested resource not found in ManagedCluster" not found)
  Warning  DRPCClusterSwitchFailed  6m48s (x5 over 16m)  controller_DRPlacementControl  Waiting for App resources to be restored...)


The VRG log on mc1 reports the following error:

2023-06-15T16:53:37.670Z        ERROR   controllers.VolumeReplicationGroup.vrginstance  controllers/vrg_vrgobject.go:50 VRG Kube object protect error   {"VolumeReplicationGroup": "busybox-appset-sample/appset1-busybox-placement-drpc", "rid": "69471d6d-6a0e-450b-b5ef-887595f196b1", "State": "primary", "profile": "s3profile-ocpm4202001-ocs-external-storagecluster", "error": "failed to upload data of odrbucket-373521917843:busybox-appset-sample/appset1-busybox-placement-drpc/v1alpha1.VolumeReplicationGroup/a, InvalidAccessKeyId: The AWS access key Id you provided does not exist in our records.\n\tstatus code: 403, request id: lixdr8jk-dg4r8q-1ddq, host id: lixdr8jk-dg4r8q-1ddq"}
github.com/ramendr/ramen/controllers.(*VRGInstance).vrgObjectProtect
        /remote-source/app/controllers/vrg_vrgobject.go:50
github.com/ramendr/ramen/controllers.(*VRGInstance).reconcileAsPrimary
        /remote-source/app/controllers/volumereplicationgroup_controller.go:918
github.com/ramendr/ramen/controllers.(*VRGInstance).processAsPrimary
        /remote-source/app/controllers/volumereplicationgroup_controller.go:889
github.com/ramendr/ramen/controllers.(*VRGInstance).processVRGActions
        /remote-source/app/controllers/volumereplicationgroup_controller.go:551
github.com/ramendr/ramen/controllers.(*VRGInstance).processVRG
        /remote-source/app/controllers/volumereplicationgroup_controller.go:524
github.com/ramendr/ramen/controllers.(*VolumeReplicationGroupReconciler).Reconcile
        /remote-source/app/controllers/volumereplicationgroup_controller.go:413
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
        /remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.5/pkg/internal/controller/controller.go:122
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.5/pkg/internal/controller/controller.go:323
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.5/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.5/pkg/internal/controller/controller.go:235
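

The InvalidAccessKeyId / 403 above suggests that the object store no longer recognizes the access key stored in the s3 secret for profile s3profile-ocpm4202001-ocs-external-storagecluster (plausible if the reinstall recreated the store or its credentials). A minimal sketch for checking this outside Ramen with aws-sdk-go; the endpoint, region, and keys below are placeholders, only the bucket name comes from the log above:

// Sketch: check whether the credentials from the Ramen s3 profile are still
// accepted by the object store. Fill in the endpoint, region, and keys from
// the s3 secret referenced by the profile; they are placeholders here.
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/credentials"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	sess, err := session.NewSession(&aws.Config{
		Endpoint:         aws.String("https://s3-endpoint.example.com"), // placeholder: endpoint from the s3 profile
		Region:           aws.String("us-east-1"),                       // placeholder
		Credentials:      credentials.NewStaticCredentials("ACCESS_KEY", "SECRET_KEY", ""), // placeholders from the s3 secret
		S3ForcePathStyle: aws.Bool(true),
	})
	if err != nil {
		log.Fatal(err)
	}

	// A stale access key fails here with a 403, matching the
	// InvalidAccessKeyId error in the VRG log.
	_, err = s3.New(sess).HeadBucket(&s3.HeadBucketInput{
		Bucket: aws.String("odrbucket-373521917843"), // bucket name from the log above
	})
	if err != nil {
		log.Fatalf("credentials rejected by the object store: %v", err)
	}
	fmt.Println("credentials accepted by the object store")
}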



Version of all relevant components (if applicable):
OCP: 4.13.0
ODF on hub, mc1, mc2: 4.13.0-218 



Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?


Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Configure an MDR environment with 1 hub cluster and 2 managed clusters (Hub, mc1 and mc2)
2. Deploy an application and perform failover and relocate operations
3. Bring down one of the managed clusters (mc2) and reinstall OCP on mc2
4. Configure SSL access across the clusters with the new ingress cert from mc2
5. In the hub cluster, detach the old mc2 and import the reinstalled mc2
6. The OpenShift DR Cluster operator is installed on mc2 automatically once the cluster is imported into the hub cluster
7. Perform failover of the application from mc1 to mc2


Actual results:
Failover of the application is stuck in the "FailingOver" state

Expected results:
Failover of the application should succeed

Additional info:

Attaching the VRG logs of the managed clusters to Bugzilla and uploading the must-gather logs of all the clusters to Google Drive:

https://drive.google.com/file/d/1NmvKrORqcX-17Bd8YfOoLTHYwGRqbLX8/view?usp=sharing

Comment 8 Nir Soffer 2024-01-10 16:44:06 UTC
Based on a discussion with Shyam, a possible solution is to add a reference to the managed cluster
when handling a new DRCluster. The reference can include the ManagedCluster UID, which changes when
a cluster is replaced, even if the replacement cluster uses the same name. Replacing a cluster would
then cause the DRCluster to be invalidated with a clear error message.

With this change the admin will not be able to add new DRPCs or perform operations with existing
DRPCs, helping the admin detect the issue early and fix it by following the documented procedure.
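
A rough sketch of that check, assuming the ManagedCluster UID is recorded in an annotation on the
DRCluster; the annotation key and helper below are hypothetical, not the final implementation:

// Sketch: record the ManagedCluster UID on the DRCluster the first time it is
// reconciled, and invalidate the DRCluster if the UID later changes (i.e. the
// cluster was replaced under the same name).
package controllers

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clusterv1 "open-cluster-management.io/api/cluster/v1"
)

// Hypothetical annotation key; the real implementation may use a status field instead.
const managedClusterUIDAnnotation = "drcluster.ramendr.openshift.io/managed-cluster-uid"

// validateManagedClusterUID compares the UID recorded on the DRCluster with the
// UID of the ManagedCluster currently registered under the same name.
// drClusterMeta is the DRCluster's ObjectMeta.
func validateManagedClusterUID(drClusterMeta *metav1.ObjectMeta, mc *clusterv1.ManagedCluster) error {
	recorded, ok := drClusterMeta.Annotations[managedClusterUIDAnnotation]
	if !ok {
		// First reconcile of this DRCluster: remember the UID.
		if drClusterMeta.Annotations == nil {
			drClusterMeta.Annotations = map[string]string{}
		}
		drClusterMeta.Annotations[managedClusterUIDAnnotation] = string(mc.GetUID())
		return nil
	}

	if recorded != string(mc.GetUID()) {
		// The ManagedCluster was detached and re-imported (e.g. OCP was
		// reinstalled), so the recorded DR state no longer applies.
		return fmt.Errorf("managed cluster %q was replaced (recorded UID %s, current UID %s); "+
			"follow the documented replacement procedure before using this DRCluster",
			mc.GetName(), recorded, mc.GetUID())
	}

	return nil
}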

