Description of problem (please be as detailed as possible and provide log snippets):

Version of all relevant components (if applicable):
OCP 4.14.0-0.nightly-2023-10-30-170011
advanced-cluster-management.v2.9.0-188
ODF 4.14.0-157
ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable)
ACM 2.9.0-DOWNSTREAM-2023-10-18-17-59-25
Submariner brew.registry.redhat.io/rh-osbs/iib:607438

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Can this issue be reproduced?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. On a hub recovery RDR setup, ensure backups are being created on the active and passive hub clusters. Failover and relocate different workloads so that each of them is running on the primary managed cluster after the failover or relocate operation completes. Ensure the latest backups are taken and that no action is in progress on any of the workloads (CephFS or RBD backed, AppSet or Subscription type).
2. Collect the drpc status (the drpc shorthand is assumed to be a shell alias; see the sketch after these steps). Bring the primary managed cluster down, and then bring the active hub down.
3. Ensure the secondary managed cluster is properly imported on the passive hub and that the DRPolicy gets validated.
4. Check the drpc status from the passive hub and compare it with the output taken from the active hub while it was up.

We notice that post hub recovery, a sanity check is run for all the workloads that were failed over or relocated: the same action that was last performed from the active hub is performed again on those workloads, which marks Peer Ready as False for them.
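Note: the drpc command shown in the outputs below is not a standard client; it is assumed here to be a local shell alias along the following lines (the columns match oc get drpc -o wide output):

    # Assumed alias for the 'drpc' shorthand used throughout this report.
    alias drpc='oc get drpc -o wide --all-namespaces'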
From active hub-

NAMESPACE             NAME                                   AGE   PREFERREDCLUSTER    FAILOVERCLUSTER     DESIREDSTATE   CURRENTSTATE   PROGRESSION   START TIME             DURATION             PEER READY
busybox-workloads-2   subscription-cephfs-placement-1-drpc   9h    amagrawa-31o-prim   amagrawa-passivee   Relocate       Relocated      Completed     2023-11-01T17:54:21Z   30.282249722s        True
busybox-workloads-5   subscription-rbd1-placement-1-drpc     9h    amagrawa-31o-prim   amagrawa-31o-prim   Failover       FailedOver     Completed     2023-11-01T13:57:37Z   47m3.364814169s      True
busybox-workloads-6   subscription-rbd2-placement-1-drpc     9h    amagrawa-31o-prim   amagrawa-passivee   Relocate       Relocated      Completed     2023-11-01T14:16:28Z   3h17m50.318760845s   True
openshift-gitops      appset-cephfs-placement-drpc           9h    amagrawa-31o-prim   amagrawa-passivee   Failover       FailedOver     Completed     2023-11-01T13:20:45Z   5m59.4021061s        True
openshift-gitops      appset-rbd1-placement-drpc             9h    amagrawa-31o-prim   amagrawa-31o-prim   Failover       FailedOver     Completed     2023-11-01T14:15:30Z   41m2.588884417s      True
openshift-gitops      appset-rbd2-placement-drpc             9h    amagrawa-passivee                                      Deployed       Completed                                                 True

From passive hub-

amagrawa:~$ drpc
NAMESPACE             NAME                                   AGE   PREFERREDCLUSTER    FAILOVERCLUSTER     DESIREDSTATE   CURRENTSTATE   PROGRESSION                           START TIME             DURATION   PEER READY
busybox-workloads-2   subscription-cephfs-placement-1-drpc   57m   amagrawa-31o-prim   amagrawa-passivee   Relocate       Relocating                                           2023-11-01T18:59:35Z              False
busybox-workloads-5   subscription-rbd1-placement-1-drpc     57m   amagrawa-31o-prim   amagrawa-31o-prim   Failover       FailingOver    WaitForStorageMaintenanceActivation   2023-11-01T18:59:36Z              False
busybox-workloads-6   subscription-rbd2-placement-1-drpc     57m   amagrawa-31o-prim   amagrawa-passivee   Relocate                                                                                              True
openshift-gitops      appset-cephfs-placement-drpc           57m   amagrawa-31o-prim   amagrawa-passivee   Failover       FailedOver     EnsuringVolSyncSetup                                                    True
openshift-gitops      appset-rbd1-placement-drpc             57m   amagrawa-31o-prim   amagrawa-31o-prim   Failover       FailingOver    FailingOverToCluster                  2023-11-01T18:59:36Z              False
openshift-gitops      appset-rbd2-placement-drpc             57m   amagrawa-passivee                                      Deployed       Completed                                                               True

Since Peer Ready is now marked False due to the sanity check, subscription-cephfs-placement-1-drpc, subscription-rbd1-placement-1-drpc and appset-rbd1-placement-drpc cannot be failed over in this example. This sanity check is needed per the Kubernetes recommended guidelines, and the CURRENTSTATE of the workloads should not be backed up (as confirmed by @bmekhiss), so the issue will always persist. As of now, the only option is to trigger a failover by editing the drpc YAML (which would be addressed by BZ2247537); a sketch follows below. All these apps were therefore failed over via the CLI to the secondary managed cluster, which was available, but the failover didn't succeed for the RBD-backed workloads because the volumereplicationclass was not backed up/got deleted. Benamar tried a WA which created the volumereplicationclass on the available secondary managed cluster. This let the failover proceed and created the workload pods, but not the VRs for the RBD-backed workloads, so the VRG CURRENTSTATE couldn't be marked as Primary. VRs need to be created for RBD-backed workloads, so the workaround didn't work as expected, as recorded in https://bugzilla.redhat.com/show_bug.cgi?id=2246084#c8. Since a volumereplicationclass is not needed for CephFS-based workloads, their failover didn't succeed for some other reason, which is under RCA; a separate BZ is being opened for it.
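For reference, a minimal sketch of the CLI-based failover and of the volumereplicationclass workaround described above, assuming the resource names used in this report. The VolumeReplicationClass fields are illustrative only; in practice the object should be copied from a cluster where it still exists rather than authored by hand.

    # Trigger failover by editing the DRPC spec directly (same spec fields as
    # in the DRPC YAML shown later in this report).
    oc patch drpc subscription-rbd1-placement-1-drpc -n busybox-workloads-5 \
      --type merge -p '{"spec":{"action":"Failover","failoverCluster":"amagrawa-passivee"}}'

    # vrc.yaml - hypothetical VolumeReplicationClass recreated on the surviving
    # managed cluster as the workaround for RBD-backed workloads. All names and
    # parameter values below are placeholders; copy the real object from a
    # cluster where it still exists.
    apiVersion: replication.storage.openshift.io/v1alpha1
    kind: VolumeReplicationClass
    metadata:
      name: rbd-volumereplicationclass        # hypothetical name
    spec:
      provisioner: openshift-storage.rbd.csi.ceph.com
      parameters:
        schedulingInterval: "5m"              # must match the DRPolicy sync interval
        replication.storage.openshift.io/replication-secret-name: rook-csi-rbd-provisioner
        replication.storage.openshift.io/replication-secret-namespace: openshift-storage

    oc apply -f vrc.yaml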
From passive hub after triggering failover from CLI-

amagrawa:~$ drpc
NAMESPACE             NAME                                   AGE     PREFERREDCLUSTER    FAILOVERCLUSTER     DESIREDSTATE   CURRENTSTATE   PROGRESSION                 START TIME             DURATION   PEER READY
busybox-workloads-2   subscription-cephfs-placement-1-drpc   3h21m   amagrawa-31o-prim   amagrawa-passivee   Failover       FailingOver    WaitingForResourceRestore   2023-11-01T18:59:35Z              False
busybox-workloads-5   subscription-rbd1-placement-1-drpc     3h21m   amagrawa-31o-prim   amagrawa-passivee   Failover       FailedOver     WaitForReadiness            2023-11-01T18:59:36Z              True
busybox-workloads-6   subscription-rbd2-placement-1-drpc     3h21m   amagrawa-31o-prim   amagrawa-passivee   Failover       FailedOver     WaitForReadiness            2023-11-01T20:12:09Z              True
openshift-gitops      appset-cephfs-placement-drpc           3h21m   amagrawa-31o-prim   amagrawa-passivee   Failover       FailedOver     EnsuringVolSyncSetup                                          True
openshift-gitops      appset-rbd1-placement-drpc             3h21m   amagrawa-31o-prim   amagrawa-passivee   Failover       FailedOver     WaitForReadiness            2023-11-01T18:59:36Z              True
openshift-gitops      appset-rbd2-placement-drpc             3h21m   amagrawa-passivee                                      Deployed       Completed                                                     True

From secondary available managed cluster to which failover was triggered-

amagrawa:~$ busybox-2
Already on project "busybox-workloads-2" on server "https://api.amagrawa-passivee.qe.rh-ocs.com:6443".
NAME                                                                                DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/subscription-cephfs-placement-1-drpc   primary

amagrawa:~$ oc describe vrg
Events:
  Type     Reason           Age                 From                               Message
  ----     ------           ----                ----                               -------
  Warning  VrgUploadFailed  11s (x33 over 91m)  controller_VolumeReplicationGroup  failed to upload data of odrbucket-93489a7b9ef9:busybox-workloads-2/subscription-cephfs-placement-1-drpc/v1alpha1.VolumeReplicationGroup/a, RequestError: send request failed caused by: Put "https://s3-openshift-storage.apps.amagrawa-31o-prim.qe.rh-ocs.com/odrbucket-93489a7b9ef9/busybox-workloads-2/subscription-cephfs-placement-1-drpc/v1alpha1.VolumeReplicationGroup/a": dial tcp 10.19.98.14:443: i/o timeout

From passive hub during this time-

amagrawa:~$ oc get drpc -o yaml -n busybox-workloads-2
apiVersion: v1
items:
- apiVersion: ramendr.openshift.io/v1alpha1
  kind: DRPlacementControl
  metadata:
    annotations:
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-2
      drplacementcontrol.ramendr.openshift.io/last-app-deployment-cluster: amagrawa-31o-prim
    creationTimestamp: "2023-11-01T18:05:45Z"
    finalizers:
    - drpc.ramendr.openshift.io/finalizer
    generation: 2
    labels:
      cluster.open-cluster-management.io/backup: resource
      velero.io/backup-name: acm-resources-generic-schedule-20231101180053
      velero.io/restore-name: restore-acm-acm-resources-generic-schedule-20231101180053
    name: subscription-cephfs-placement-1-drpc
    namespace: busybox-workloads-2
    ownerReferences:
    - apiVersion: cluster.open-cluster-management.io/v1beta1
      blockOwnerDeletion: true
      controller: true
      kind: Placement
      name: subscription-cephfs-placement-1
      uid: 0ca2e6ac-5942-43f9-8c55-272b1b70a919
    resourceVersion: "1079179"
    uid: b7024c21-2bdf-4f43-8577-db3505a89104
  spec:
    action: Failover
    drPolicyRef:
      apiVersion: ramendr.openshift.io/v1alpha1
      kind: DRPolicy
      name: my-drpolicy-10
    failoverCluster: amagrawa-passivee
    placementRef:
      apiVersion: cluster.open-cluster-management.io/v1beta1
      kind: Placement
      name: subscription-cephfs-placement-1
      namespace: busybox-workloads-2
    preferredCluster: amagrawa-31o-prim
    pvcSelector:
      matchLabels:
        appname: busybox_app1_cephfs
  status:
    actionStartTime: "2023-11-01T18:59:35Z"
    conditions:
    - lastTransitionTime: "2023-11-01T20:10:24Z"
      message: Waiting for App resources to be restored...)
      observedGeneration: 2
      reason: FailingOver
      status: "False"
      type: Available
    - lastTransitionTime: "2023-11-01T20:10:24Z"
      message: Started failover to cluster "amagrawa-passivee"
      observedGeneration: 2
      reason: NotStarted
      status: "False"
      type: PeerReady
    lastUpdateTime: "2023-11-01T20:22:55Z"
    phase: FailingOver
    preferredDecision:
      clusterName: amagrawa-31o-prim
      clusterNamespace: amagrawa-31o-prim
    progression: WaitingForResourceRestore
    resourceConditions:
      resourceMeta:
        generation: 0
        kind: ""
        name: ""
        namespace: ""
kind: List
metadata:
  resourceVersion: ""

Actual results:
Failover didn't succeed for CephFS-backed workloads.

Logs are being uploaded here (collected a few hours after triggering failover from CLI)-
http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/02nov23-2/

Expected results:
Failover should complete without cleanup while the older primary cluster is still down. When the older primary cluster becomes reachable again, cleanup should eventually run, the VRG there should be marked as Secondary, and data sync should resume as expected.

Additional info:
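A minimal hedged sketch for confirming which workloads lost failover readiness after hub recovery (the field paths follow the DRPC status shown above; adjust to your environment):

    # List the PeerReady condition for every DRPC on the hub; workloads hit by
    # the post-recovery sanity check report "False" here.
    oc get drpc --all-namespaces -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="PeerReady")].status}{"\n"}{end}'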
Moving hub recovery issues out to 4.15 based on offline discussion.
BZ2258351 is ON_QA now, please retry
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:1383