Bug 2241329
Summary: | [RDR] [Hub recovery] [4.15 clone] With passive hub, sync stops for all rbd and cephfs workloads, rgw on one of the managed clusters goes down | | | |
---|---|---|---|---|
Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Aman Agrawal <amagrawa> | |
Component: | odf-dr | Assignee: | Shyamsundar <srangana> | |
odf-dr sub component: | ramen | QA Contact: | Aman Agrawal <amagrawa> | |
Status: | CLOSED ERRATA | Docs Contact: | ||
Severity: | high | |||
Priority: | unspecified | CC: | bmekhiss, ebenahar, edonnell, etamir, kramdoss, kseeger, muagarwa, sraghave | |
Version: | 4.14 | |||
Target Milestone: | --- | |||
Target Release: | ODF 4.17.0 | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | 4.15.0-123 | Doc Type: | No Doc Update | |
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 2266006 (view as bug list) | Environment: | ||
Last Closed: | 2024-10-30 14:25:45 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 2244409, 2266006 |
Description
Aman Agrawal
2023-09-29 10:54:46 UTC
In the bug triage meeting held on 3rd Oct, it was decided that Aman will retest and share the backup info with Benamar. This will help engineering decide whether a fix is needed. The bug will continue to be proposed as a blocker.

Several issues need attention, but I'll focus on the situation where both workloads (specifically, the ApplicationSet workload) are running on both clusters. While I'm not entirely certain about my findings, it appears that the PlacementDecision was backed up and subsequently restored on the passive hub before it transitioned to being the active one.

```
oc get placementdecision -n openshift-gitops busybox-3-placement-decision-1 -o yaml

apiVersion: cluster.open-cluster-management.io/v1beta1
kind: PlacementDecision
metadata:
  creationTimestamp: "2023-09-28T10:59:51Z"
  generation: 1
  labels:
    cluster.open-cluster-management.io/decision-group-index: "0"
    cluster.open-cluster-management.io/decision-group-name: ""
    cluster.open-cluster-management.io/placement: busybox-3-placement
    velero.io/backup-name: acm-resources-schedule-20230928100034
    velero.io/restore-name: restore-acm-acm-resources-schedule-20230928100034
  name: busybox-3-placement-decision-1
  namespace: openshift-gitops
  resourceVersion: "1960594"
  uid: c8878f28-2594-4c12-9495-30e86fd94cf4
status:
  decisions:
  - clusterName: amagrawa-c1
    reason: ""
```

The label `velero.io/restore-name: restore-acm-acm-resources-schedule-20230928100034` indicates that the decision was restored from that backup. Although we lack access to the backup for verification, the application log indicates that following the hub restore, "amagrawa-c2" was initially chosen as the target cluster for the application. At a later stage, when the DRPC had the opportunity to select the correct cluster, the decision was modified to "amagrawa-c1" as observed above. However, the openshift-gitops operator failed to remove the application from "amagrawa-c2". We need to reproduce this issue in order to gather the logs before and after the hub recovery is started.

PR posted to fix a ramen issue with the PlacementDecision being backed up by the hub recovery backup routine: https://github.com/RamenDR/ramen/pull/1092

Tested with:
OCP 4.14.0-0.nightly-2023-10-18-004928
advanced-cluster-management.v2.9.0-188
ODF 4.14.0-156
ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable)
Submariner image: brew.registry.redhat.io/rh-osbs/iib:599799
ACM 2.9.0-DOWNSTREAM-2023-10-18-17-59-25
Latency: 50ms RTT

Outputs from active hub-

amagrawa:acm$ drpc
NAMESPACE NAME AGE PREFERREDCLUSTER FAILOVERCLUSTER DESIREDSTATE CURRENTSTATE PROGRESSION START TIME DURATION PEER READY
busybox-workloads-1 busybox-workloads-1-placement-1-drpc 2d1h amagrawa-1st Deployed Completed True
busybox-workloads-3 busybox-sub-cephfs-placement-1-drpc 47h amagrawa-2nd Relocate Relocated Completed 2023-10-23T17:35:35Z 4m40.644361827s True
openshift-gitops busybox-appset-cephfs-placement-drpc 47h amagrawa-2nd Deployed Completed True
openshift-gitops busybox-workloads-2-placement-drpc 47h amagrawa-1st amagrawa-2nd Failover FailedOver Cleaning Up 2023-10-23T17:35:45Z False

Here, busybox-sub-cephfs-placement-1-drpc was relocated, and busybox-workloads-2-placement-drpc was failed over and still cleaning up, but resources were successfully created on the failover cluster. It was ensured that no new backup was created and that the active hub was completely brought down before the next scheduled backup.
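One way to sanity-check that both hubs have the same latest scheduled backups before taking the active hub down is sketched below; it is only a sketch, and the `active-hub`/`passive-hub` kubeconfig context names are hypothetical, used here to illustrate the comparison:

```
# Print the newest ACM/Velero backups on each hub so the two lists can be
# compared before shutting down the active hub; context names are illustrative.
for ctx in active-hub passive-hub; do
  echo "== ${ctx} =="
  oc --context "${ctx}" -n open-cluster-management-backup get backups \
    --sort-by=.metadata.creationTimestamp | tail -n 5
done
```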
amagrawa:acm$ group
drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-1
namespace: busybox-workloads-1
namespace: busybox-workloads-1
lastGroupSyncTime: "2023-10-23T17:40:00Z"
namespace: busybox-workloads-1
drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-3
namespace: busybox-workloads-3
namespace: busybox-workloads-3
lastGroupSyncTime: "2023-10-23T17:41:44Z"
namespace: busybox-workloads-3
drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-4
namespace: openshift-gitops
namespace: openshift-gitops
lastGroupSyncTime: "2023-10-23T17:40:40Z"
namespace: busybox-workloads-4
drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-2
namespace: openshift-gitops
namespace: openshift-gitops
namespace: busybox-workloads-2

amagrawa:acm$ date -u
Monday 23 October 2023 05:49:25 PM UTC

amagrawa:acm$ oc get backups -A
NAMESPACE NAME AGE
open-cluster-management-backup acm-credentials-schedule-20231023110057 6h48m
open-cluster-management-backup acm-credentials-schedule-20231023120057 5h48m
open-cluster-management-backup acm-credentials-schedule-20231023130057 4h48m
open-cluster-management-backup acm-credentials-schedule-20231023140057 3h48m
open-cluster-management-backup acm-credentials-schedule-20231023150057 168m
open-cluster-management-backup acm-credentials-schedule-20231023160057 108m
open-cluster-management-backup acm-credentials-schedule-20231023170057 48m
open-cluster-management-backup acm-managed-clusters-schedule-20231023110057 6h48m
open-cluster-management-backup acm-managed-clusters-schedule-20231023120057 5h48m
open-cluster-management-backup acm-managed-clusters-schedule-20231023130057 4h48m
open-cluster-management-backup acm-managed-clusters-schedule-20231023140057 3h48m
open-cluster-management-backup acm-managed-clusters-schedule-20231023150057 168m
open-cluster-management-backup acm-managed-clusters-schedule-20231023160057 108m
open-cluster-management-backup acm-managed-clusters-schedule-20231023170057 48m
open-cluster-management-backup acm-resources-generic-schedule-20231023110057 6h48m
open-cluster-management-backup acm-resources-generic-schedule-20231023120057 5h48m
open-cluster-management-backup acm-resources-generic-schedule-20231023130057 4h48m
open-cluster-management-backup acm-resources-generic-schedule-20231023140057 3h48m
open-cluster-management-backup acm-resources-generic-schedule-20231023150057 168m
open-cluster-management-backup acm-resources-generic-schedule-20231023160057 108m
open-cluster-management-backup acm-resources-generic-schedule-20231023170057 48m
open-cluster-management-backup acm-resources-schedule-20231023110057 6h48m
open-cluster-management-backup acm-resources-schedule-20231023120057 5h48m
open-cluster-management-backup acm-resources-schedule-20231023130057 4h48m
open-cluster-management-backup acm-resources-schedule-20231023140057 3h48m
open-cluster-management-backup acm-resources-schedule-20231023150057 168m
open-cluster-management-backup acm-resources-schedule-20231023160057 108m
open-cluster-management-backup acm-resources-schedule-20231023170057 48m
open-cluster-management-backup acm-validation-policy-schedule-20231023170057 48m

amagrawa:acm$ date -u
Monday 23 October 2023 05:50:07 PM UTC

From passive hub-

amagrawa:~$ oc get backups -A
NAMESPACE NAME AGE
open-cluster-management-backup acm-credentials-schedule-20231023110057 6h48m
open-cluster-management-backup acm-credentials-schedule-20231023120057 5h47m
open-cluster-management-backup acm-credentials-schedule-20231023130057 4h47m
open-cluster-management-backup acm-credentials-schedule-20231023140057 3h48m
open-cluster-management-backup acm-credentials-schedule-20231023150057 167m
open-cluster-management-backup acm-credentials-schedule-20231023160057 108m
open-cluster-management-backup acm-credentials-schedule-20231023170057 48m
open-cluster-management-backup acm-managed-clusters-schedule-20231023110057 6h48m
open-cluster-management-backup acm-managed-clusters-schedule-20231023120057 5h47m
open-cluster-management-backup acm-managed-clusters-schedule-20231023130057 4h47m
open-cluster-management-backup acm-managed-clusters-schedule-20231023140057 3h48m
open-cluster-management-backup acm-managed-clusters-schedule-20231023150057 168m
open-cluster-management-backup acm-managed-clusters-schedule-20231023160057 107m
open-cluster-management-backup acm-managed-clusters-schedule-20231023170057 48m
open-cluster-management-backup acm-resources-generic-schedule-20231023110057 6h46m
open-cluster-management-backup acm-resources-generic-schedule-20231023120057 5h47m
open-cluster-management-backup acm-resources-generic-schedule-20231023130057 4h47m
open-cluster-management-backup acm-resources-generic-schedule-20231023140057 3h47m
open-cluster-management-backup acm-resources-generic-schedule-20231023150057 167m
open-cluster-management-backup acm-resources-generic-schedule-20231023160057 107m
open-cluster-management-backup acm-resources-generic-schedule-20231023170057 47m
open-cluster-management-backup acm-resources-schedule-20231023110057 6h46m
open-cluster-management-backup acm-resources-schedule-20231023120057 5h47m
open-cluster-management-backup acm-resources-schedule-20231023130057 4h47m
open-cluster-management-backup acm-resources-schedule-20231023140057 3h47m
open-cluster-management-backup acm-resources-schedule-20231023150057 167m
open-cluster-management-backup acm-resources-schedule-20231023160057 108m
open-cluster-management-backup acm-resources-schedule-20231023170057 47m
open-cluster-management-backup acm-validation-policy-schedule-20231023170057 47m

amagrawa:~$ date -u
Monday 23 October 2023 05:50:02 PM UTC

Then the active hub was completely brought down and we waited for 5-10 mins. Restored the backup on the passive hub, and ensured that both managed clusters are up and running. DRPolicy gets validated. Checked drpc status-

amagrawa:~$ drpc
NAMESPACE NAME AGE PREFERREDCLUSTER FAILOVERCLUSTER DESIREDSTATE CURRENTSTATE PROGRESSION START TIME DURATION PEER READY
busybox-workloads-1 busybox-workloads-1-placement-1-drpc 10h amagrawa-1st Deployed Completed True
busybox-workloads-3 busybox-sub-cephfs-placement-1-drpc 10h amagrawa-1st Deployed Completed True
openshift-gitops busybox-appset-cephfs-placement-drpc 10h amagrawa-2nd Deployed Completed True
openshift-gitops busybox-workloads-2-placement-drpc 10h amagrawa-1st Deployed Completed True

All workloads moved back to their last backed-up state, which is Deployed. But the individual workload resource statuses are completely messed up.

busybox-workloads-1-placement-1-drpc is fine
==============================================================================================================================================================================
For busybox-workloads-2-placement-drpc, which was failed over to C2 (amagrawa-2nd):

From C1-
amagrawa:~$ busybox-2
Now using project "busybox-workloads-2" on server "https://api.amagrawa-1st.qe.rh-ocs.com:6443".
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE VOLUMEMODE
persistentvolumeclaim/dd-io-pvc-1 Terminating pvc-e8ad08e3-a6ca-4232-81ba-31d31811b255 117Gi RWO ocs-storagecluster-ceph-rbd 2d21h Filesystem
persistentvolumeclaim/dd-io-pvc-2 Terminating pvc-d5cf5619-671f-4fd8-8415-82c6398d0ed4 143Gi RWO ocs-storagecluster-ceph-rbd 2d21h Filesystem
persistentvolumeclaim/dd-io-pvc-3 Terminating pvc-e9561849-5093-49b2-9a1b-4fe75fb91535 134Gi RWO ocs-storagecluster-ceph-rbd 2d21h Filesystem
persistentvolumeclaim/dd-io-pvc-4 Terminating pvc-970dff84-721b-4c6f-9829-42869812c523 106Gi RWO ocs-storagecluster-ceph-rbd 2d21h Filesystem
persistentvolumeclaim/dd-io-pvc-5 Terminating pvc-d50dc5b9-b1db-45f2-a60d-145eb34fcacb 115Gi RWO ocs-storagecluster-ceph-rbd 2d21h Filesystem
persistentvolumeclaim/dd-io-pvc-6 Terminating pvc-d80390eb-717a-4b25-8e42-dc6517be610d 129Gi RWO ocs-storagecluster-ceph-rbd 2d21h Filesystem
persistentvolumeclaim/dd-io-pvc-7 Terminating pvc-b3d544ea-1c03-4206-9383-816c611467df 149Gi RWO ocs-storagecluster-ceph-rbd 2d21h Filesystem

NAME AGE VOLUMEREPLICATIONCLASS PVCNAME DESIREDSTATE CURRENTSTATE
volumereplication.replication.storage.openshift.io/dd-io-pvc-1 2d21h rbd-volumereplicationclass-2263283542 dd-io-pvc-1 secondary Secondary
volumereplication.replication.storage.openshift.io/dd-io-pvc-2 2d21h rbd-volumereplicationclass-2263283542 dd-io-pvc-2 secondary Secondary
volumereplication.replication.storage.openshift.io/dd-io-pvc-3 2d21h rbd-volumereplicationclass-2263283542 dd-io-pvc-3 secondary Secondary
volumereplication.replication.storage.openshift.io/dd-io-pvc-4 2d21h rbd-volumereplicationclass-2263283542 dd-io-pvc-4 secondary Secondary
volumereplication.replication.storage.openshift.io/dd-io-pvc-5 2d21h rbd-volumereplicationclass-2263283542 dd-io-pvc-5 secondary Secondary
volumereplication.replication.storage.openshift.io/dd-io-pvc-6 2d21h rbd-volumereplicationclass-2263283542 dd-io-pvc-6 secondary Secondary
volumereplication.replication.storage.openshift.io/dd-io-pvc-7 2d21h rbd-volumereplicationclass-2263283542 dd-io-pvc-7 secondary Secondary

NAME DESIREDSTATE CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/busybox-workloads-2-placement-drpc primary Unknown

NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/dd-io-1-854f867867-zp9hr 0/1 Pending 0 20h <none> <none> <none> <none>
pod/dd-io-2-56679fb667-gz8jf 0/1 Pending 0 21h <none> <none> <none> <none>
pod/dd-io-3-5757659b99-p6rjh 0/1 Pending 0 21h <none> <none> <none> <none>
pod/dd-io-4-75bd89888c-tslbd 0/1 Pending 0 21h <none> <none> <none> <none>
pod/dd-io-5-86c65fd579-kq68b 0/1 Pending 0 21h <none> <none> <none> <none>
pod/dd-io-6-fd8994467-9ln28 0/1 Pending 0 21h <none> <none> <none> <none>
pod/dd-io-7-685b4f6699-q97rs 0/1 Pending 0 21h <none> <none> <none> <none>

From C2-
amagrawa:~$ busybox-2
Now using project "busybox-workloads-2" on server "https://api.amagrawa-2nd.qe.rh-ocs.com:6443".

NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE VOLUMEMODE
persistentvolumeclaim/dd-io-pvc-1 Terminating pvc-e8ad08e3-a6ca-4232-81ba-31d31811b255 117Gi RWO ocs-storagecluster-ceph-rbd 22h Filesystem
persistentvolumeclaim/dd-io-pvc-2 Terminating pvc-d5cf5619-671f-4fd8-8415-82c6398d0ed4 143Gi RWO ocs-storagecluster-ceph-rbd 22h Filesystem
persistentvolumeclaim/dd-io-pvc-3 Terminating pvc-e9561849-5093-49b2-9a1b-4fe75fb91535 134Gi RWO ocs-storagecluster-ceph-rbd 22h Filesystem
persistentvolumeclaim/dd-io-pvc-4 Terminating pvc-970dff84-721b-4c6f-9829-42869812c523 106Gi RWO ocs-storagecluster-ceph-rbd 22h Filesystem
persistentvolumeclaim/dd-io-pvc-5 Terminating pvc-d50dc5b9-b1db-45f2-a60d-145eb34fcacb 115Gi RWO ocs-storagecluster-ceph-rbd 22h Filesystem
persistentvolumeclaim/dd-io-pvc-6 Terminating pvc-d80390eb-717a-4b25-8e42-dc6517be610d 129Gi RWO ocs-storagecluster-ceph-rbd 22h Filesystem
persistentvolumeclaim/dd-io-pvc-7 Terminating pvc-b3d544ea-1c03-4206-9383-816c611467df 149Gi RWO ocs-storagecluster-ceph-rbd 22h Filesystem

NAME AGE VOLUMEREPLICATIONCLASS PVCNAME DESIREDSTATE CURRENTSTATE
volumereplication.replication.storage.openshift.io/dd-io-pvc-1 22h rbd-volumereplicationclass-2263283542 dd-io-pvc-1 primary Primary
volumereplication.replication.storage.openshift.io/dd-io-pvc-2 22h rbd-volumereplicationclass-2263283542 dd-io-pvc-2 primary Primary
volumereplication.replication.storage.openshift.io/dd-io-pvc-3 22h rbd-volumereplicationclass-2263283542 dd-io-pvc-3 primary Primary
volumereplication.replication.storage.openshift.io/dd-io-pvc-4 22h rbd-volumereplicationclass-2263283542 dd-io-pvc-4 primary Primary
volumereplication.replication.storage.openshift.io/dd-io-pvc-5 22h rbd-volumereplicationclass-2263283542 dd-io-pvc-5 primary Primary
volumereplication.replication.storage.openshift.io/dd-io-pvc-6 22h rbd-volumereplicationclass-2263283542 dd-io-pvc-6 primary Primary
volumereplication.replication.storage.openshift.io/dd-io-pvc-7 22h rbd-volumereplicationclass-2263283542 dd-io-pvc-7 primary Primary

NAME DESIREDSTATE CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/busybox-workloads-2-placement-drpc primary Primary
==============================================================================================================================================================================
busybox-sub-cephfs-placement-1-drpc, which had been relocated to C2 but whose backup wasn't taken, initially had issues but recovered after a few hours.

From C1-
amagrawa:~$ busybox-3
Now using project "busybox-workloads-3" on server "https://api.amagrawa-1st.qe.rh-ocs.com:6443".
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE VOLUMEMODE persistentvolumeclaim/busybox-pvc-1 Bound pvc-caeb302f-c863-4646-b8ea-22e4dd57ac63 94Gi RWX ocs-storagecluster-cephfs 2d21h Filesystem persistentvolumeclaim/busybox-pvc-10 Bound pvc-b832628a-632c-4268-9195-5ae88e13eae0 87Gi RWX ocs-storagecluster-cephfs 2d21h Filesystem persistentvolumeclaim/busybox-pvc-11 Bound pvc-6ae0f2f7-ac8b-4a70-b090-fe52e40274a8 33Gi RWX ocs-storagecluster-cephfs 2d21h Filesystem persistentvolumeclaim/busybox-pvc-12 Bound pvc-1462dddf-3b95-4cc2-bd22-c85c20eafa51 147Gi RWX ocs-storagecluster-cephfs 2d21h Filesystem persistentvolumeclaim/busybox-pvc-13 Bound pvc-6e37fa31-ffd7-4b89-9d5b-9d349f1d4496 77Gi RWX ocs-storagecluster-cephfs 2d21h Filesystem persistentvolumeclaim/busybox-pvc-14 Bound pvc-039641c0-425c-4fa8-89af-6eac7da8cda6 70Gi RWX ocs-storagecluster-cephfs 2d21h Filesystem persistentvolumeclaim/busybox-pvc-15 Bound pvc-0a9dfa24-4476-4fb1-a99e-0a64cb0d7c56 131Gi RWX ocs-storagecluster-cephfs 2d21h Filesystem persistentvolumeclaim/busybox-pvc-16 Bound pvc-79875a2c-c924-4faf-8fb9-9b7b8ddd7dcf 127Gi RWX ocs-storagecluster-cephfs 2d21h Filesystem persistentvolumeclaim/busybox-pvc-17 Bound pvc-2a8d4574-1c08-4fdf-88d2-8c16b62ffd99 58Gi RWX ocs-storagecluster-cephfs 2d21h Filesystem persistentvolumeclaim/busybox-pvc-18 Bound pvc-d4b20c4d-b7a1-4943-b258-f8590e424aee 123Gi RWX ocs-storagecluster-cephfs 2d21h Filesystem persistentvolumeclaim/busybox-pvc-19 Bound pvc-e4597e84-d1ca-4853-a2b7-3174fbf18428 61Gi RWX ocs-storagecluster-cephfs 2d21h Filesystem persistentvolumeclaim/busybox-pvc-2 Bound pvc-cb019838-799b-46cc-852e-c5a6c3becbb5 44Gi RWX ocs-storagecluster-cephfs 2d21h Filesystem persistentvolumeclaim/busybox-pvc-20 Bound pvc-e1d05fff-eb05-4598-93b2-1985cfdbcaaa 33Gi RWX ocs-storagecluster-cephfs 2d21h Filesystem persistentvolumeclaim/busybox-pvc-3 Bound pvc-89a1de2a-bebc-4a04-859d-52b9ca8fd4c9 76Gi RWX ocs-storagecluster-cephfs 2d21h Filesystem persistentvolumeclaim/busybox-pvc-4 Bound pvc-406070d0-351b-4386-8dbc-649da2806533 144Gi RWX ocs-storagecluster-cephfs 2d21h Filesystem persistentvolumeclaim/busybox-pvc-5 Bound pvc-7f5ef6c5-8ea3-4e46-b943-7481fb5200c1 107Gi RWX ocs-storagecluster-cephfs 2d21h Filesystem persistentvolumeclaim/busybox-pvc-6 Bound pvc-1fb622b3-ba12-4210-8b23-c7fe570926d7 123Gi RWX ocs-storagecluster-cephfs 2d21h Filesystem persistentvolumeclaim/busybox-pvc-7 Bound pvc-825fefd8-3a21-495f-b91e-38972b3487cf 90Gi RWX ocs-storagecluster-cephfs 2d21h Filesystem persistentvolumeclaim/busybox-pvc-8 Bound pvc-1b691bde-e343-48f9-83a9-13f905a901d2 91Gi RWX ocs-storagecluster-cephfs 2d21h Filesystem persistentvolumeclaim/busybox-pvc-9 Bound pvc-535d2f02-905b-4307-a966-08e766768def 111Gi RWX ocs-storagecluster-cephfs 2d21h Filesystem NAME DESIREDSTATE CURRENTSTATE volumereplicationgroup.ramendr.openshift.io/busybox-sub-cephfs-placement-1-drpc primary Primary NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES pod/busybox-1-7f7bf8c5d9-g5vh5 1/1 Running 0 21h 10.131.2.18 compute-4 <none> <none> pod/busybox-10-7b7bddddf8-sjq6m 1/1 Running 0 21h 10.128.3.254 compute-3 <none> <none> pod/busybox-11-6c4cf4bfb-8mwt4 1/1 Running 0 21h 10.128.2.12 compute-3 <none> <none> pod/busybox-12-7968f7d4bb-9jbvw 1/1 Running 0 21h 10.131.2.27 compute-4 <none> <none> pod/busybox-13-674b97b564-b66hr 1/1 Running 0 21h 10.128.5.218 compute-5 <none> <none> pod/busybox-14-f59899658-bcdtg 1/1 Running 0 21h 10.128.5.219 compute-5 <none> <none> pod/busybox-15-867dd79cbd-bgljp 1/1 Running 0 21h 
10.128.5.220 compute-5 <none> <none> pod/busybox-16-866d576d54-n22d6 1/1 Running 0 21h 10.128.2.14 compute-3 <none> <none> pod/busybox-17-8d7df8b76-4cmrr 1/1 Running 0 21h 10.128.2.15 compute-3 <none> <none> pod/busybox-18-75cdf6f4c4-scwq7 1/1 Running 0 21h 10.128.2.19 compute-3 <none> <none> pod/busybox-19-6bcbc84d68-jsx22 1/1 Running 0 21h 10.128.5.221 compute-5 <none> <none> pod/busybox-2-5cffb67686-z79cb 1/1 Running 0 21h 10.131.2.28 compute-4 <none> <none> pod/busybox-20-fdbd78dbd-d8s57 1/1 Running 0 21h 10.131.2.29 compute-4 <none> <none> pod/busybox-3-7ffc7c8fbb-dfjmv 1/1 Running 0 21h 10.131.2.30 compute-4 <none> <none> pod/busybox-4-66688c494b-5pmh9 1/1 Running 0 21h 10.128.2.20 compute-3 <none> <none> pod/busybox-5-56978ff94-rjm5d 1/1 Running 0 21h 10.131.2.31 compute-4 <none> <none> pod/busybox-6-57544b458b-ccxm8 1/1 Running 0 21h 10.128.5.222 compute-5 <none> <none> pod/busybox-7-77ff998b8b-qzh56 1/1 Running 0 21h 10.128.5.223 compute-5 <none> <none> pod/busybox-8-6d5cdc5678-gmgbx 1/1 Running 0 21h 10.128.5.224 compute-5 <none> <none> pod/busybox-9-79c789995d-nnw24 1/1 Running 0 21h 10.131.2.32 compute-4 <none> <none> From C2- amagrawa:~$ busybox-3 Now using project "busybox-workloads-3" on server "https://api.amagrawa-2nd.qe.rh-ocs.com:6443". NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE VOLUMEMODE persistentvolumeclaim/busybox-pvc-1 Bound pvc-47521762-e4a6-4af5-b662-72e274db8b86 94Gi RWX ocs-storagecluster-cephfs 20h Filesystem persistentvolumeclaim/busybox-pvc-10 Bound pvc-b3f62011-3d44-4920-9dff-da20ae34af5f 87Gi RWX ocs-storagecluster-cephfs 20h Filesystem persistentvolumeclaim/busybox-pvc-11 Bound pvc-cd44860a-a275-480d-98d1-0919a1095b8a 33Gi RWX ocs-storagecluster-cephfs 20h Filesystem persistentvolumeclaim/busybox-pvc-12 Bound pvc-72e8f30c-ddd5-48f8-8202-76f7163c38a7 147Gi RWX ocs-storagecluster-cephfs 20h Filesystem persistentvolumeclaim/busybox-pvc-13 Bound pvc-2111d09c-6439-4503-9643-22cfaa792e02 77Gi RWX ocs-storagecluster-cephfs 20h Filesystem persistentvolumeclaim/busybox-pvc-14 Bound pvc-2acb5e8d-6f09-4acb-9e1a-dc2843ed8c2e 70Gi RWX ocs-storagecluster-cephfs 20h Filesystem persistentvolumeclaim/busybox-pvc-15 Bound pvc-1d5d4c70-7330-4cd5-a75e-d2202d05ef7c 131Gi RWX ocs-storagecluster-cephfs 20h Filesystem persistentvolumeclaim/busybox-pvc-16 Bound pvc-05101e71-18c6-42bd-858f-5c5bc4e5b5ad 127Gi RWX ocs-storagecluster-cephfs 20h Filesystem persistentvolumeclaim/busybox-pvc-17 Bound pvc-21aef07c-7856-454b-b15e-c16ad2f3737e 58Gi RWX ocs-storagecluster-cephfs 20h Filesystem persistentvolumeclaim/busybox-pvc-18 Bound pvc-6389abd3-cfdb-42e4-a2de-73ad8094b5c5 123Gi RWX ocs-storagecluster-cephfs 20h Filesystem persistentvolumeclaim/busybox-pvc-19 Bound pvc-d366cbdc-c245-4b4d-932b-dd87d17d260e 61Gi RWX ocs-storagecluster-cephfs 20h Filesystem persistentvolumeclaim/busybox-pvc-2 Bound pvc-703f3ec0-a2d7-450a-8149-3ed6c04d1ec6 44Gi RWX ocs-storagecluster-cephfs 20h Filesystem persistentvolumeclaim/busybox-pvc-20 Bound pvc-d06e06c9-83ea-488a-86c2-355ef516c30d 33Gi RWX ocs-storagecluster-cephfs 20h Filesystem persistentvolumeclaim/busybox-pvc-3 Bound pvc-0c502b44-622c-45b1-9d7b-38b1dabe9d1d 76Gi RWX ocs-storagecluster-cephfs 20h Filesystem persistentvolumeclaim/busybox-pvc-4 Bound pvc-d3e2be6b-f2a7-4207-81ce-0346de6e5cee 144Gi RWX ocs-storagecluster-cephfs 20h Filesystem persistentvolumeclaim/busybox-pvc-5 Bound pvc-939e1c17-6c05-4589-af01-c36096252c20 107Gi RWX ocs-storagecluster-cephfs 20h Filesystem persistentvolumeclaim/busybox-pvc-6 Bound 
pvc-b58508cf-0d81-4d2b-96df-8030816c8330 123Gi RWX ocs-storagecluster-cephfs 20h Filesystem persistentvolumeclaim/busybox-pvc-7 Bound pvc-76c12792-f94e-4f7a-9bf8-c8febc83ee80 90Gi RWX ocs-storagecluster-cephfs 20h Filesystem persistentvolumeclaim/busybox-pvc-8 Bound pvc-5a88c698-f7fa-4536-9103-4f085cbcaa8e 91Gi RWX ocs-storagecluster-cephfs 20h Filesystem persistentvolumeclaim/busybox-pvc-9 Bound pvc-2687e03c-fdea-4594-b7f5-48080f06bac9 111Gi RWX ocs-storagecluster-cephfs 20h Filesystem

NAME DESIREDSTATE CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/busybox-sub-cephfs-placement-1-drpc secondary Secondary

NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/volsync-rsync-tls-dst-busybox-pvc-1-9bh6q 1/1 Running 0 102s 10.131.0.37 compute-1 <none> <none>
pod/volsync-rsync-tls-dst-busybox-pvc-10-2fm7r 1/1 Running 0 103s 10.131.0.36 compute-1 <none> <none>
pod/volsync-rsync-tls-dst-busybox-pvc-11-c27rx 1/1 Running 0 2m1s 10.131.0.31 compute-1 <none> <none>
pod/volsync-rsync-tls-dst-busybox-pvc-12-wp9zz 1/1 Running 0 116s 10.131.0.35 compute-1 <none> <none>
pod/volsync-rsync-tls-dst-busybox-pvc-13-jwcpp 1/1 Running 0 2m38s 10.131.0.10 compute-1 <none> <none>
pod/volsync-rsync-tls-dst-busybox-pvc-14-zrwqn 1/1 Running 0 94s 10.131.0.39 compute-1 <none> <none>
pod/volsync-rsync-tls-dst-busybox-pvc-15-q884c 1/1 Running 0 2m33s 10.131.0.12 compute-1 <none> <none>
pod/volsync-rsync-tls-dst-busybox-pvc-16-ng9xb 1/1 Running 0 94s 10.131.0.38 compute-1 <none> <none>
pod/volsync-rsync-tls-dst-busybox-pvc-17-t947z 1/1 Running 0 2m18s 10.131.0.23 compute-1 <none> <none>
pod/volsync-rsync-tls-dst-busybox-pvc-18-bnklh 1/1 Running 0 2m28s 10.131.0.19 compute-1 <none> <none>
pod/volsync-rsync-tls-dst-busybox-pvc-19-ml7zl 1/1 Running 0 2m16s 10.131.0.27 compute-1 <none> <none>
pod/volsync-rsync-tls-dst-busybox-pvc-2-6lkcb 1/1 Running 0 93s 10.131.0.40 compute-1 <none> <none>
pod/volsync-rsync-tls-dst-busybox-pvc-20-wghxd 1/1 Running 0 2m6s 10.131.0.30 compute-1 <none> <none>
pod/volsync-rsync-tls-dst-busybox-pvc-3-mrblc 1/1 Running 0 2m28s 10.131.0.16 compute-1 <none> <none>
pod/volsync-rsync-tls-dst-busybox-pvc-4-vqz66 1/1 Running 0 119s 10.131.0.32 compute-1 <none> <none>
pod/volsync-rsync-tls-dst-busybox-pvc-5-9zc6c 1/1 Running 0 2m14s 10.131.0.28 compute-1 <none> <none>
pod/volsync-rsync-tls-dst-busybox-pvc-6-8hcsd 1/1 Running 0 117s 10.131.0.33 compute-1 <none> <none>
pod/volsync-rsync-tls-dst-busybox-pvc-7-z6qsc 1/1 Running 0 93s 10.131.0.41 compute-1 <none> <none>
pod/volsync-rsync-tls-dst-busybox-pvc-8-bsf7k 1/1 Running 0 2m22s 10.131.0.21 compute-1 <none> <none>
pod/volsync-rsync-tls-dst-busybox-pvc-9-5h8bn 1/1 Running 0 117s 10.131.0.34 compute-1 <none> <none>

==============================================================================================================================================================================
busybox-appset-cephfs-placement-drpc, which was in Deployed state and running on C2, is also fine.
==============================================================================================================================================================================

Rest everything seems fine. Ceph is as it was earlier. All ODF pods are up and running. DRPC shows the last backed-up state. DRPolicy is validated. Mirroring status is healthy, which it shouldn't be, but that will be fixed when the issue with busybox-workloads-2-placement-drpc is fixed. DR monitoring dashboard was backed up as is.
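When verifying state on the recovered hub, one way to see which PlacementDecisions were recreated by the Velero restore (as in the analysis earlier in this bug) rather than freshly computed is sketched below; it is only a sketch and assumes `oc` access to the hub and the Velero labels shown above:

```
# List PlacementDecisions in all namespaces with the Velero restore they were
# recreated from; an empty restore value means the decision was not restored
# from a backup. Useful for spotting stale decisions after hub recovery.
oc get placementdecision -A \
  -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"  restore="}{.metadata.labels.velero\.io/restore-name}{"\n"}{end}'
```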
Sync is working fine for the remaining 3 workloads-

amagrawa:acm$ group
drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-1
namespace: busybox-workloads-1
namespace: busybox-workloads-1
lastGroupSyncTime: "2023-10-24T15:41:13Z"
namespace: busybox-workloads-1
drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-3
namespace: busybox-workloads-3
namespace: busybox-workloads-3
lastGroupSyncTime: "2023-10-24T15:41:29Z"
namespace: busybox-workloads-3
drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-4
namespace: openshift-gitops
namespace: openshift-gitops
lastGroupSyncTime: "2023-10-24T15:40:40Z"
namespace: busybox-workloads-4
drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-2
namespace: openshift-gitops
namespace: openshift-gitops
namespace: busybox-workloads-2

amagrawa:acm$ date -u
Tuesday 24 October 2023 03:51:49 PM UTC

Failing QA because of the issue with the workload busybox-workloads-2-placement-drpc, which was failed over to C2 (amagrawa-2nd) but whose backup wasn't taken.

So, definitely not a blocker for 4.14.0?

Best case, we can add this as a known issue.

(In reply to Mudit Agarwal from comment #12)
> So, definitely not a blocker for 4.14.0?
>
> Best case, we can add this as a known issue.

Please note that this could possibly happen on a production setup, even though the chances are on the lower side. If failover was performed because the primary cluster went down, there is a high chance that the hub also goes down after that (or maybe at the same time, if it is in the same data centre/zone), and new backups are not taken, which would eventually end up in the same situation.

I agree with it not being a 4.14 blocker, but it's a real use case and we should still consider it.

Benamar, let us add this as a known issue for now.

I can live with not a blocker at this point of time :)

As Aman highlighted, we should anticipate that some operations may occur on the cluster after a backup has been taken. Could we please document the steps on how to recover the cluster in such cases?

Yes @sheggodu do we need to move this to ON_QA manually?

Verification of this bug will be done as part of Hub recovery testing post 4.15. No regression seen so far. Moving the bug to 4.16.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.17.0 Security, Enhancement, & Bug Fix Update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:8676

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days