Created attachment 1971845 [details]
vrg_managed_clsuter1.log

Description of problem (please be as detailed as possible and provide log snippets):

One of the managed clusters (mc2) went down (due to a disaster/network issue) and OCP was reinstalled on mc2 to bring the cluster back up. SSL access was configured across the clusters with the new ingress cert. On the Hub cluster, the old mc2 was detached and the reinstalled mc2 was imported again. The OpenShift DR Cluster operator was installed on mc2 automatically once the cluster was imported into the Hub cluster.

Failover of the application from mc1 to mc2 is stuck in the "Failing over" state with the following error message:

# oc describe drpc busybox-placement-1-drpc -nbusybox-sample
.....
......
Events:
  Type     Reason                   Age                  From                           Message
  ----     ------                   ----                 ----                           -------
  Warning  unknown state            16m                  controller_DRPlacementControl  next state not known
  Warning  DRPCFailingOver          16m                  controller_DRPlacementControl  Failing over the application and VRG
  Warning  DRPCClusterSwitchFailed  16m                  controller_DRPlacementControl  failed to get VRG busybox-placement-1-drpc from cluster ocpm4202001 (err: getManagedClusterResource results: "requested resource not found in ManagedCluster" not found)
  Warning  DRPCClusterSwitchFailed  6m48s (x5 over 16m)  controller_DRPlacementControl  Waiting for App resources to be restored...)

The vrg log on mc1 reports the following error:

2023-06-15T16:53:37.670Z ERROR controllers.VolumeReplicationGroup.vrginstance controllers/vrg_vrgobject.go:50 VRG Kube object protect error {"VolumeReplicationGroup": "busybox-appset-sample/appset1-busybox-placement-drpc", "rid": "69471d6d-6a0e-450b-b5ef-887595f196b1", "State": "primary", "profile": "s3profile-ocpm4202001-ocs-external-storagecluster", "error": "failed to upload data of odrbucket-373521917843:busybox-appset-sample/appset1-busybox-placement-drpc/v1alpha1.VolumeReplicationGroup/a, InvalidAccessKeyId: The AWS access key Id you provided does not exist in our records.\n\tstatus code: 403, request id: lixdr8jk-dg4r8q-1ddq, host id: lixdr8jk-dg4r8q-1ddq"}
github.com/ramendr/ramen/controllers.(*VRGInstance).vrgObjectProtect
	/remote-source/app/controllers/vrg_vrgobject.go:50
github.com/ramendr/ramen/controllers.(*VRGInstance).reconcileAsPrimary
	/remote-source/app/controllers/volumereplicationgroup_controller.go:918
github.com/ramendr/ramen/controllers.(*VRGInstance).processAsPrimary
	/remote-source/app/controllers/volumereplicationgroup_controller.go:889
github.com/ramendr/ramen/controllers.(*VRGInstance).processVRGActions
	/remote-source/app/controllers/volumereplicationgroup_controller.go:551
github.com/ramendr/ramen/controllers.(*VRGInstance).processVRG
	/remote-source/app/controllers/volumereplicationgroup_controller.go:524
github.com/ramendr/ramen/controllers.(*VolumeReplicationGroupReconciler).Reconcile
	/remote-source/app/controllers/volumereplicationgroup_controller.go:413
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.5/pkg/internal/controller/controller.go:122
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.5/pkg/internal/controller/controller.go:323
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.5/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.5/pkg/internal/controller/controller.go:235

Version of all relevant components (if applicable):
OCP: 4.13.0
ODF on hub, mc1, mc2: 4.13.0-218

Does this issue impact your ability to continue to work with the product (please explain in detail what the user impact is)?

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Configure an MDR environment with 1 hub cluster and 2 managed clusters (Hub, mc1, and mc2).
2. Deploy an application and perform failover and relocate operations.
3. Bring down one of the managed clusters (mc2) and reinstall OCP on mc2.
4. Configure SSL access across the clusters with the new ingress cert from mc2.
5. On the Hub cluster, detach the old mc2 and import the reinstalled mc2.
6. The OpenShift DR Cluster operator is installed on mc2 automatically once the cluster is imported into the Hub cluster.
7. Perform failover of the application from mc1 to mc2.

Actual results:
Failover of the application is stuck in the "Failing over" state.

Expected results:
Failover of the application should be successful.

Additional info:
Attaching the vrg logs of the managed clusters to Bugzilla and uploading the must-gather logs of all the clusters to Google Drive:
https://drive.google.com/file/d/1NmvKrORqcX-17Bd8YfOoLTHYwGRqbLX8/view?usp=sharing
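The InvalidAccessKeyId error in the vrg log above suggests the S3 credentials recorded in the DR configuration predate the reinstall of mc2 and no longer exist in the new MCG instance. A minimal standalone check, assuming the endpoint, bucket name, and keys are extracted from the failing s3 profile's secret (the values below are placeholders, not taken from this environment), can confirm whether the stored credentials are still accepted:

package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/credentials"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	// Placeholder values: substitute the endpoint, bucket, and keys from the
	// s3 profile that the VRG log reports as failing.
	endpoint := "https://s3-openshift-storage.apps.example.com"
	bucket := "odrbucket-373521917843"
	accessKey := "REPLACE_ME"
	secretKey := "REPLACE_ME"

	sess, err := session.NewSession(&aws.Config{
		Endpoint:         aws.String(endpoint),
		Region:           aws.String("us-east-1"), // MCG ignores the region, but the SDK requires one
		Credentials:      credentials.NewStaticCredentials(accessKey, secretKey, ""),
		S3ForcePathStyle: aws.Bool(true),
	})
	if err != nil {
		log.Fatalf("failed to create session: %v", err)
	}

	// HeadBucket fails with a 403 (e.g. InvalidAccessKeyId) if the stored
	// credentials were invalidated by the reinstall of the cluster.
	if _, err := s3.New(sess).HeadBucket(&s3.HeadBucketInput{Bucket: aws.String(bucket)}); err != nil {
		log.Fatalf("credential check failed: %v", err)
	}

	fmt.Println("stored credentials are still accepted by the s3 endpoint")
}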
Based on discussion with Shyam, a possible solution is to record a reference to the managed cluster when handling a new drcluster. The reference can include the managedcluster UID, which changes when a cluster is replaced, even if the replacement uses the same name. Replacing a cluster would then cause the drcluster to be invalidated with a clear error message. With this change the admin will not be able to add new DRPCs or perform operations with existing DRPCs, helping the admin detect the issue early and fix it by following the documented procedure. A rough sketch of this check follows.
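A minimal sketch of how such a check might look, assuming a hypothetical annotation key and helper name (the actual ramen types, storage location for the UID, and reconcile flow may differ):

package controllers

import (
	"fmt"

	ramendrv1alpha1 "github.com/ramendr/ramen/api/v1alpha1"
	clusterv1 "open-cluster-management.io/api/cluster/v1"
)

// Hypothetical annotation key used to remember the ManagedCluster identity.
const managedClusterUIDAnnotation = "drcluster.ramendr.openshift.io/managed-cluster-uid"

// validateManagedClusterIdentity records the ManagedCluster UID the first time
// a DRCluster is reconciled and returns an error if the UID later changes,
// which happens when a cluster with the same name is reinstalled and re-imported.
// The caller would mark the DRCluster invalid on error, blocking new DRPCs and
// operations on existing DRPCs until the documented replacement procedure is followed.
func validateManagedClusterIdentity(drCluster *ramendrv1alpha1.DRCluster, managedCluster *clusterv1.ManagedCluster) error {
	current := string(managedCluster.GetUID())

	recorded, ok := drCluster.GetAnnotations()[managedClusterUIDAnnotation]
	if !ok {
		// First reconcile for this DRCluster: remember the cluster identity.
		if drCluster.Annotations == nil {
			drCluster.Annotations = map[string]string{}
		}
		drCluster.Annotations[managedClusterUIDAnnotation] = current

		return nil
	}

	if recorded != current {
		// Same cluster name, different UID: the managed cluster was replaced.
		return fmt.Errorf("ManagedCluster %q was replaced (UID %s -> %s); DRCluster must be reconfigured",
			managedCluster.GetName(), recorded, current)
	}

	return nil
}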