Description of problem (please be as detailed as possible and provide log snippets):

After zone failure (c1 + h1) and hub recovery, wait until the restore is completed on the new hub and validate that the c2 cluster is imported, the DRPolicy is validated, and the DRPC statuses are correct. Then initiate failover of applications from c1 to c2. The failover succeeds, but note that a few applications show the wrong status after failover: the Progression should be "Cleaning Up" (as the c1 cluster is still down), yet it says "Completed", as shown in the sample output:

$ date; date --utc; oc get drpc -A -owide
Friday 13 October 2023 12:43:33 PM IST
Friday 13 October 2023 07:13:33 AM UTC
NAMESPACE          NAME                         AGE    PREFERREDCLUSTER  FAILOVERCLUSTER  DESIREDSTATE  CURRENTSTATE  PROGRESSION  START TIME            DURATION       PEER READY
openshift-gitops   appset-a2-placement-drpc     2d20h  pbyregow-clu1     pbyregow-clu2    Failover      FailedOver    Completed    2023-10-10T10:45:06Z  19.287096857s  True
sub-a2             busybox-2-placement-1-drpc   2d20h  pbyregow-clu1     pbyregow-clu2    Failover      FailedOver    Completed    2023-10-10T10:45:42Z  9.907681063s   True

Later, bring c1 and the Ceph nodes that were down back up. After waiting for a day, a few failed-over applications are still present on c1 and not deleted, as shown in the sample output below.

P.S. - The "appset-a5" application shows the correct DRPC status, but after c1 is brought up its PV, PVC and VRG are not deleted:

openshift-gitops   appset-a5-placement-drpc   2d20h  pbyregow-clu1  pbyregow-clu2  Failover  FailedOver  Cleaning Up  2023-10-10T10:45:19Z       False

c1$ date; date --utc; oc get pod,pvc,vrg -n appset-a5
Friday 13 October 2023 12:51:44 PM IST
Friday 13 October 2023 07:21:44 AM UTC
NAME                                        READY  STATUS             RESTARTS  AGE
pod/busybox-cephfs-pod-5-7898cb6b59-qs5kg   0/1    ContainerCreating  1         2d21h
pod/busybox-rbd-pod-5-5bd797b7-2xf97        0/1    ContainerCreating  1         2d21h

NAME                                         STATUS  VOLUME                                    CAPACITY  ACCESS MODES  STORAGECLASS                          AGE
persistentvolumeclaim/busybox-cephfs-pvc-5   Bound   pvc-338e4d99-8a49-4641-927a-fe40e104cadc  100Gi     RWO           ocs-external-storagecluster-cephfs    6d23h
persistentvolumeclaim/busybox-rbd-pvc-5      Bound   pvc-f462cb0f-3361-4e33-9cd9-58cdfc671ed6  100Gi     RWO           ocs-external-storagecluster-ceph-rbd  6d23h

NAME                                                                    DESIREDSTATE  CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/appset-a5-placement-drpc   primary       Primary

- For "appset-a2" and "sub-a2" the DRPC shows the wrong status, and the applications are also not deleted:

NAMESPACE          NAME                         AGE    PREFERREDCLUSTER  FAILOVERCLUSTER  DESIREDSTATE  CURRENTSTATE  PROGRESSION  START TIME            DURATION       PEER READY
openshift-gitops   appset-a2-placement-drpc     2d20h  pbyregow-clu1     pbyregow-clu2    Failover      FailedOver    Completed    2023-10-10T10:45:06Z  19.287096857s  True
sub-a2             busybox-2-placement-1-drpc   2d20h  pbyregow-clu1     pbyregow-clu2    Failover      FailedOver    Completed    2023-10-10T10:45:42Z  9.907681063s   True

c1$ date; date --utc; oc get pod,pvc,vrg -n sub-a2
Friday 13 October 2023 12:43:43 PM IST
Friday 13 October 2023 07:13:43 AM UTC
NAME                                                                      DESIREDSTATE  CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/busybox-2-placement-1-drpc   primary       Primary

c1$ date; date --utc; oc get pod,pvc,vrg -n appset-a2
Friday 13 October 2023 12:53:02 PM IST
Friday 13 October 2023 07:23:02 AM UTC
NAME                                        READY  STATUS             RESTARTS  AGE
pod/busybox-cephfs-pod-2-586db567cb-6nmdz   0/1    ContainerCreating  1         2d21h
pod/busybox-rbd-pod-2-5774dd5b6d-46zq6      0/1    ContainerCreating  1         2d21h

NAME                                         STATUS  VOLUME                                    CAPACITY  ACCESS MODES  STORAGECLASS                          AGE
persistentvolumeclaim/busybox-cephfs-pvc-2   Bound   pvc-6c48bec5-90c5-4bb6-a296-dd2f420f7e5d  100Gi     RWO           ocs-external-storagecluster-cephfs    6d23h
persistentvolumeclaim/busybox-rbd-pvc-2      Bound   pvc-0ab6c08d-5986-4c80-8878-2d42a223920d  100Gi     RWO           ocs-external-storagecluster-ceph-rbd  6d23h

NAME                                                                    DESIREDSTATE  CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/appset-a2-placement-drpc   primary       Primary

Version of all relevant components (if applicable):
OCP: 4.14.0-0.nightly-2023-10-06-234925
ODF (upgraded): 4.14.0-145.stable
ACM (upgraded): 2.9.0-DOWNSTREAM-2023-10-08-08-16-57
CEPH: 17.2.6-146.el9cp (1d01c2b30b5fd39787bb8804707c4b2e52e30137) quincy (stable)

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
a. If we unfence the c1 cluster, IOs would be running from both managed clusters, which is not correct.
b. At this stage, it might affect relocation, and relocation may not succeed.
(Severity is kept high because of these two reasons; if they do not apply, the severity can be reduced.)

Is there any workaround available to the best of your knowledge?
Not sure.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
4

Is this issue reproducible?
1/1

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Configure an MDR cluster with ODF build 4.14.0-137.stable
   zone a: arbiter
   zone b: hub1 (active), c1, Ceph nodes: osd-0, osd-1, osd-2
   zone c: hub2 (passive), c2, Ceph nodes: osd-3, osd-4, osd-5
2. Install subscription apps and appset apps and have them in the Deployed, FailedOver and Relocated states
3. Upgrade ODF and ACM to the latest builds, i.e. ODF 4.14.0-145.stable and ACM 2.9.0-180
4. Deploy a few more subscription apps and appset apps and have them in the Deployed state
5. Bring zone b down
6. Perform hub recovery, i.e. restore the data on the passive hub
7. Wait 3-7 minutes and verify that the c2 cluster is imported, the DRPolicy is validated, and the DRPC statuses are correct
8. Perform failover of applications from c1 to c2. The failover succeeds, but for the "appset-a2" and "sub-a2" applications the DRPC shows the wrong status after failover: the Progression should be "Cleaning Up" but it says "Completed", as shown in the sample output in the description
9. Later bring c1 and the 3 Ceph nodes up and wait until c1 is imported and healthy
10. Wait for a day; the failed-over applications are still present and not deleted, as shown in the sample output:

c1$ date; date --utc; oc get pod,pvc,vrg -n appset-a5
Friday 13 October 2023 12:51:44 PM IST
Friday 13 October 2023 07:21:44 AM UTC
NAME                                        READY  STATUS             RESTARTS  AGE
pod/busybox-cephfs-pod-5-7898cb6b59-qs5kg   0/1    ContainerCreating  1         2d21h
pod/busybox-rbd-pod-5-5bd797b7-2xf97        0/1    ContainerCreating  1         2d21h

NAME                                         STATUS  VOLUME                                    CAPACITY  ACCESS MODES  STORAGECLASS                          AGE
persistentvolumeclaim/busybox-cephfs-pvc-5   Bound   pvc-338e4d99-8a49-4641-927a-fe40e104cadc  100Gi     RWO           ocs-external-storagecluster-cephfs    6d23h
persistentvolumeclaim/busybox-rbd-pvc-5      Bound   pvc-f462cb0f-3361-4e33-9cd9-58cdfc671ed6  100Gi     RWO           ocs-external-storagecluster-ceph-rbd  6d23h

c1$ date; date --utc; oc get pod,pvc,vrg -n sub-a2
Friday 13 October 2023 12:43:43 PM IST
Friday 13 October 2023 07:13:43 AM UTC
NAME                                                                      DESIREDSTATE  CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/busybox-2-placement-1-drpc   primary       Primary

Actual results:
1. The DRPC shows the wrong status
2. Application leftovers are present even after c1 is up

Expected results:
1. In the DRPC status, Progression should be "Cleaning Up"
2. Once c1 is brought up, the application leftovers should be deleted.
Additional info:
- A small observation is that all of these applications were in the Relocated state before hub recovery
- Another is that all 3 of these applications existed before the ODF upgrade; in between, at step (3), hub1 was brought down and the data was restored on the new hub
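For reference, step 7 of the reproducer can be verified from the new hub with something along these lines (a minimal sketch; it does not assume any particular condition type names on the DRPolicy, it just dumps the conditions array):

# Both managed clusters should show as imported and Available on the new hub
$ oc get managedclusters

# DRPolicy validation: print the reported conditions for each policy
$ oc get drpolicy -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.conditions}{"\n"}{end}'

# DRPC desired/current state and Progression for all protected workloads
$ oc get drpc -A -o wide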
Tested versions:
----------------
OCP - 4.14.0-0.nightly-2023-10-08-220853
ODF - 4.14.0-146.stable
ACM - 2.9.0-180

Post hub recovery I tried a failover of 1 subscription app and 1 appset app.

* Failover of the subscription app took almost 24 minutes; it was stuck in the Cleaning Up phase for a long time, but cleanup and failover eventually succeeded

$ oc get drpc -n cephfs1 -o wide
NAME                       AGE  PREFERREDCLUSTER  FAILOVERCLUSTER  DESIREDSTATE  CURRENTSTATE  PROGRESSION  START TIME            DURATION         PEER READY
cephfs1-placement-3-drpc   16h  sraghave-c1-oct   sraghave-c2-oct  Failover      FailedOver    Completed    2023-10-19T19:00:01Z  24m4.988864425s  True

* Failover of the appset app is stuck in the Cleaning Up phase, and it has been almost 40 minutes now

sraghave:~$ oc get drpc rbd-sample-placement-drpc -n openshift-gitops -o wide
NAME                        AGE  PREFERREDCLUSTER  FAILOVERCLUSTER  DESIREDSTATE  CURRENTSTATE  PROGRESSION  START TIME            DURATION  PEER READY
rbd-sample-placement-drpc   16h  sraghave-c1-oct   sraghave-c2-oct  Failover      FailedOver    Cleaning Up  2023-10-20T08:31:45Z            False
sraghave:~$
sraghave:~$ date --utc
Fri Oct 20 09:11:02 AM UTC 2023

Leftovers from C1:
-------------------
$ oc get pods,pvc,vrg -n multiple-appsets
NAME                               READY  STATUS   RESTARTS  AGE
pod/busybox-rbd-5d6cc5f8b9-lrltp   0/1    Pending  0         42m

NAME                                     STATUS       VOLUME                                    CAPACITY  ACCESS MODES  STORAGECLASS                          AGE
persistentvolumeclaim/busybox-rbd-pvc    Terminating  pvc-7dadb197-1361-4bd6-97f4-fdc42a6f0500  5Gi       RWO           ocs-external-storagecluster-ceph-rbd  3d18h

NAME                                                                     DESIREDSTATE  CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/rbd-sample-placement-drpc    secondary     Primary

O/P from C2:
-------------
$ oc get pods,pvc,vrg -n multiple-appsets
NAME                               READY  STATUS   RESTARTS  AGE
pod/busybox-rbd-5d6cc5f8b9-p7n54   1/1    Running  0         52s

NAME                                     STATUS  VOLUME                                    CAPACITY  ACCESS MODES  STORAGECLASS                          AGE
persistentvolumeclaim/busybox-rbd-pvc    Bound   pvc-7dadb197-1361-4bd6-97f4-fdc42a6f0500  5Gi       RWO           ocs-external-storagecluster-ceph-rbd  3m

NAME                                                                     DESIREDSTATE  CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/rbd-sample-placement-drpc    primary       Primary

VRG status from C1:
--------------------
$ oc describe volumereplicationgroup.ramendr.openshift.io/rbd-sample-placement-drpc -n multiple-appsets
Name:         rbd-sample-placement-drpc
Namespace:    multiple-appsets
Labels:       <none>
Annotations:  <none>
API Version:  ramendr.openshift.io/v1alpha1
Kind:         VolumeReplicationGroup
Metadata:
  Creation Timestamp:  2023-10-17T07:50:36Z
  Finalizers:
    volumereplicationgroups.ramendr.openshift.io/vrg-protection
  Generation:  2
  Owner References:
    API Version:  work.open-cluster-management.io/v1
    Kind:         AppliedManifestWork
    Name:         8bc873eb592f07d550f95e8aa51b95cda8e5f8355dc9283229799add471e0d8c-rbd-sample-placement-drpc-multiple-appsets-vrg-mw
    UID:          7d12332f-6719-4977-9222-fb509efa9c87
  Resource Version:  10852123
  UID:               d736ba25-eae5-4082-ad06-d5691096c271
Spec:
  Action:  Failover
  Pvc Selector:
    Match Labels:
      Appname:  busybox-rbd
  Replication State:  secondary
  s3Profiles:
    s3profile-sraghave-c1-oct-ocs-external-storagecluster
    s3profile-sraghave-c2-oct-ocs-external-storagecluster
  Sync:
  Vol Sync:
    Disabled:  true
Status:
  Conditions:
    Last Transition Time:  2023-10-20T08:32:04Z
    Message:               VolumeReplicationGroup is progressing
    Observed Generation:   2
    Reason:                Progressing
    Status:                False
    Type:                  DataReady
    Last Transition Time:  2023-10-20T08:32:04Z
    Message:               VolumeReplicationGroup is replicating
    Observed Generation:   2
    Reason:                Replicating
    Status:                False
    Type:                  DataProtected
    Last Transition Time:  2023-10-17T07:50:36Z
    Message:               Restored cluster data
    Observed Generation:   1
    Reason:                Restored
    Status:                True
    Type:                  ClusterDataReady
    Last Transition Time:  2023-10-20T08:32:04Z
    Message:               Cluster data of all PVs are protected
    Observed Generation:   2
    Reason:                Uploaded
    Status:                True
    Type:                  ClusterDataProtected
  Kube Object Protection:
  Last Update Time:     2023-10-20T08:35:08Z
  Observed Generation:  2
  Protected PV Cs:
    Conditions:
      Last Transition Time:  2023-10-20T08:32:04Z
      Message:               Secondary transition failed as PVC is potentially in use by a pod
      Observed Generation:   2
      Reason:                Progressing
      Status:                False
      Type:                  DataReady
      Last Transition Time:  2023-10-17T07:50:36Z
      Message:               PVC in the VolumeReplicationGroup is ready for use
      Observed Generation:   1
      Reason:                Replicating
      Status:                False
      Type:                  DataProtected
      Last Transition Time:  2023-10-17T07:50:40Z
      Message:               Done uploading PV/PVC cluster data to 2 of 2 S3 profile(s): [s3profile-sraghave-c1-oct-ocs-external-storagecluster s3profile-sraghave-c2-oct-ocs-external-storagecluster]
      Observed Generation:   1
      Reason:                Uploaded
      Status:                True
      Type:                  ClusterDataProtected
    Name:  busybox-rbd-pvc
    Replication ID:
      Id:
    Resources:
    Storage ID:
      Id:
  State:  Primary
Events:
  Type     Reason           Age                  From                               Message
  ----     ------           ----                 ----                               -------
  Warning  VrgUploadFailed  50m (x773 over 13h)  controller_VolumeReplicationGroup  (combined from similar events): failed to upload data of odrbucket-11bea101f6d8:multiple-appsets/rbd-sample-placement-drpc/v1alpha1.VolumeReplicationGroup/a, InternalError: We encountered an internal error. Please try again. status code: 500, request id: lnycmay1-3cw5la-6yn, host id: lnycmay1-3cw5la-6yn
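To see at a glance why the VRG on C1 is not transitioning to Secondary, the conditions can be dumped by type instead of reading the whole describe output (a sketch; the jsonpath expression and event filter below are assumptions, not taken from the logs above):

$ oc get vrg rbd-sample-placement-drpc -n multiple-appsets \
    -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{" ("}{.reason}{"): "}{.message}{"\n"}{end}'

# The VrgUploadFailed warnings point at the DR object bucket; list only those events
$ oc get events -n multiple-appsets --field-selector reason=VrgUploadFailed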
Applied the workaround on the cluster; noted all the observations below.

* Applying the WA and getting the resources cleaned up took around 54 hours
* Failover and relocate succeeded

$ oc get drpc rbd-sample-placement-drpc -n openshift-gitops -o wide
NAME                        AGE    PREFERREDCLUSTER  FAILOVERCLUSTER  DESIREDSTATE  CURRENTSTATE  PROGRESSION  START TIME            DURATION             PEER READY
rbd-sample-placement-drpc   3d21h  sraghave-c1-oct   sraghave-c2-oct  Failover      FailedOver    Completed    2023-10-20T08:31:45Z  54h40m26.854129648s  True

* Resources cleaned up from cluster c1 as expected

sraghave:~$ oc get pod,pvc,vrg -n multiple-appsets
No resources found in multiple-appsets namespace.

* Resources found on C2 as expected

$ oc get pods,pvc,vrg -n multiple-appsets
NAME                               READY  STATUS   RESTARTS  AGE
pod/busybox-rbd-5d6cc5f8b9-p7n54   1/1    Running  0         3d6h

NAME                                     STATUS  VOLUME                                    CAPACITY  ACCESS MODES  STORAGECLASS                          AGE
persistentvolumeclaim/busybox-rbd-pvc    Bound   pvc-7dadb197-1361-4bd6-97f4-fdc42a6f0500  5Gi       RWO           ocs-external-storagecluster-ceph-rbd  3d6h

NAME                                                                     DESIREDSTATE  CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/rbd-sample-placement-drpc    primary       Primary

Note:
-----
I am quite unsure why I am seeing the status as Cleaning on the DRCluster:

sraghave:~$ oc get drcluster sraghave-c1-oct -o jsonpath='{.status.conditions[2].reason}{"\n"}'
Cleaning
sraghave:~$ oc get drcluster sraghave-c2-oct -o jsonpath='{.status.conditions[2].reason}{"\n"}'
Clean

* Unable to delete appset apps after failover/relocate (multiple appset apps installed in the namespace, leftovers on c1, DRPCs got deleted)

$ oc get pods,pvc,vrg -n multiple-appsets1
NAME                                       STATUS       VOLUME                                    CAPACITY  ACCESS MODES  STORAGECLASS                          AGE
persistentvolumeclaim/busybox-cephfs-pvc   Terminating  pvc-bc67ba04-3277-4127-b422-5929b5d97872  5Gi       RWO           ocs-external-storagecluster-cephfs    76m
persistentvolumeclaim/busybox-rbd-pvc      Terminating  pvc-10c7f9cc-da55-42ab-882a-558eccf864cb  5Gi       RWO           ocs-external-storagecluster-ceph-rbd  74m
persistentvolumeclaim/helloworld-pv-claim  Terminating  pvc-7610e25b-8ed4-4a56-9139-5ee644e2353e  10Gi      RWO           ocs-external-storagecluster-cephfs    75m

NAME                                                                         DESIREDSTATE  CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/cephfs-appset1-placement-drpc    primary       Primary
volumereplicationgroup.ramendr.openshift.io/hello-appsets1-placement-drpc    primary       Primary
volumereplicationgroup.ramendr.openshift.io/rbd-appset1-placement-drpc       primary       Primary

Live cluster available to debug
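An aside on the Cleaning/Clean check above: indexing the condition by position ([2]) is fragile if the condition ordering ever changes. Selecting the condition by type should be more robust (a sketch, assuming the relevant DRCluster condition type is "Clean", which matches the Clean/Cleaning reasons shown above):

$ oc get drcluster sraghave-c1-oct -o jsonpath='{.status.conditions[?(@.type=="Clean")].reason}{"\n"}'
$ oc get drcluster sraghave-c2-oct -o jsonpath='{.status.conditions[?(@.type=="Clean")].reason}{"\n"}'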
We are concerned about the WA mentioned in comment 10, which may not be applicable when the customer loses access to the old cluster. We need an alternative WA to move forward with hub recovery cases on the active site when an entire zone is down. @bmekhiss
This requires, at present, that we move to the pull model for gitops from ACM, rather than the current push model. In the pull model, the managed cluster has the ArgoCD Application resource created via a ManifestWork, based on a PlacementDecision. So post hub recovery the manifest work operator would garbage collect work that was deployed by the older hub (as it already does for Subscription-based applications), ensuring that the failed cluster is eventually cleaned up. The gitops model is described here: https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes/2.9/html/gitops/gitops-overview#gitops-push-pull
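For anyone wanting to observe this mechanism on a live setup, a rough sketch (the cluster namespace below is from this bug's environment and is only illustrative):

# On the hub: in the pull model the ArgoCD Application is delivered to the managed
# cluster wrapped in a ManifestWork created in that cluster's namespace on the hub
$ oc get manifestwork -n pbyregow-clu2

# On the managed cluster: the work agent records what it applied on behalf of the hub
# as AppliedManifestWork objects; these are what get garbage collected when the owning
# hub goes away (note the VRG in the describe output above is owned by one of them)
$ oc get appliedmanifestwork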
My two cents on this:

1. We should retest with the latest changes that were made for hub recovery.
2. Even if the WA is not good for the customer, can it at least help QE progress with happy-path validation of the feature? That way we can save some time and keep the feature in the release while working on improving the WA or providing a proper fix in parallel.
I am not sure I understand the ask here: https://bugzilla.redhat.com/show_bug.cgi?id=2243804#c13
This is an ACM hub recovery issue; ODF has nothing to do with it. The solution for it is mentioned by Shyam in comment 14.

Now, thinking a little more about the problem: the workaround provided in comment 10 will only work if customers still have access to the old active hub. So before recovering a hub, it is important for users to ensure that the currently active (failed) hub does not have network access to the managed clusters. Keep in mind that the workaround in comment 10 is a basic solution and only effective when access to the failed active hub is still possible.

Again, I don't understand the issue. When the hub cluster fails, customers need to ensure that that cluster no longer has network access to the managed clusters.
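One way to sanity-check this (a sketch, not taken from the logs in this bug): if the old hub ever comes back, confirm from that hub whether it still considers the managed clusters available before anything is unfenced; if they still show as Available, the old hub can still push work to them:

# Run against the OLD (failed) hub, if it is reachable again
$ oc get managedclusters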
(In reply to Mudit Agarwal from comment #15)
> My two cents on this:
>
> 1. We should retest with the latest changes which were made for hub recovery.
> 2. Even if the WA is not good for customer, can it at least help QE to
> progress with happy path validation of the feature?

Yes, we are bringing up 4.15 clusters for happy-path testing now.
(In reply to Benamar Mekhissi from comment #16)
> I am not sure I understand the ask here
> https://bugzilla.redhat.com/show_bug.cgi?id=2243804#c13
> This is an ACM hub recovery issue. ODF has nothing to do with it. The
> solution for it is mentioned by Shyam in comment 14.
>
> Now, thinking little bit more about the problem, the workaround provided in
> comment 10 is simply a workaround that will work if customers still have
> access to the old active hub. So before recovering a hub, it's important for
> users to ensure that the current active hub doesn't have network access to
> the managed clusters. Keep in mind that the workaround in comment 10 is a
> basic solution and only effective when access to the failed active hub is
> still possible.
>
> Again, I don't understand the issue. When the hub cluster fails, customers
> need to ensure that that cluster is no longer have network access to the
> managed clusters.

Hi Benamar, I have scheduled a meeting at 7.30 PM IST on 8th Jan to discuss this BZ with you. I hope it will help us understand the issue better.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days