Bug 2246084
| Summary: | [RDR] [Hub recovery] Failover doesn't complete | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Aman Agrawal <amagrawa> |
| Component: | odf-dr | Assignee: | Shyamsundar <srangana> |
| odf-dr sub component: | ramen | QA Contact: | krishnaram Karthick <kramdoss> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | urgent | | |
| Priority: | unspecified | CC: | bmekhiss, kramdoss, kseeger, muagarwa |
| Version: | 4.14 | Flags: | kramdoss: needinfo+ |
| Target Milestone: | --- | | |
| Target Release: | ODF 4.15.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2024-03-19 15:28:05 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Aman Agrawal 2023-10-25 10:35:58 UTC
This is not always reproducible and we have a workaround as mentioned by Benamar in https://bugzilla.redhat.com/show_bug.cgi?id=2246186#c3. IMO, we should move both these BZs to 4.14.z as this is a corner case and it might require some code restructuring in MCO.

(In reply to Mudit Agarwal from comment #5)
> This is not always reproducible and we have a workaround as mentioned by
> Benamar in https://bugzilla.redhat.com/show_bug.cgi?id=2246186#c3
> IMO, we should move both these BZs to 4.14.z as this is a corner case and it
> might require some code restructuring in MCO

Actually, no. The workaround did not work as expected, and Benamar knows this. On reproducibility, I am sure this is reproducible: the workloads were in Deployed state before the active hub went down, and this is a normal failover scenario that is blocked by this BZ, so it is certainly a hub-recovery blocker BZ.

This issue was hit again with:

- OCP 4.14.0-0.nightly-2023-10-30-170011
- advanced-cluster-management.v2.9.0-188
- ODF 4.14.0-157
- ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable)
- ACM 2.9.0-DOWNSTREAM-2023-10-18-17-59-25
- Submariner brew.registry.redhat.io/rh-osbs/iib:607438

Steps:

1. On a hub recovery RDR setup, ensure backups are being created on the active and passive hub clusters. Failover and relocate different workloads so that each is running on the primary managed cluster after the failover and relocate operations complete. Ensure the latest backups are taken and no action on any of the workloads (CephFS or RBD, ApplicationSet or Subscription type) is in progress.
2. Collect the drpc status (see the sketch after this list). Bring the primary managed cluster down, and then bring the active hub down.
3. Ensure the secondary managed cluster is properly imported on the passive hub and that the DRPolicy gets validated.
4. Check the drpc status from the passive hub and compare it with the output taken from the active hub while it was up.

We notice that post hub recovery, a sanity check is run for all the workloads that were failed over or relocated: the same action that was performed from the active hub is performed again on those workloads, which marks Peer Ready as false for them.
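For reference, a minimal sketch of collecting the drpc status mentioned in steps 2 and 4, assuming the `drpc` shorthand seen in the captures below is simply an alias around `oc get drpc` across all namespaces:

```sh
# On the hub cluster: list DRPlacementControl resources for all DR-protected
# workloads; the wide output carries the PROGRESSION, START TIME, DURATION and
# PEER READY columns shown in the captures below.
oc get drpc --all-namespaces -o wide

# Save a copy so it can be compared against the passive hub after recovery.
oc get drpc --all-namespaces -o wide > drpc-active-hub.txt
```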
From active hub-

```
NAMESPACE             NAME                                   AGE   PREFERREDCLUSTER    FAILOVERCLUSTER     DESIREDSTATE   CURRENTSTATE   PROGRESSION   START TIME             DURATION             PEER READY
busybox-workloads-2   subscription-cephfs-placement-1-drpc   9h    amagrawa-31o-prim   amagrawa-passivee   Relocate       Relocated      Completed     2023-11-01T17:54:21Z   30.282249722s        True
busybox-workloads-5   subscription-rbd1-placement-1-drpc     9h    amagrawa-31o-prim   amagrawa-31o-prim   Failover       FailedOver     Completed     2023-11-01T13:57:37Z   47m3.364814169s      True
busybox-workloads-6   subscription-rbd2-placement-1-drpc     9h    amagrawa-31o-prim   amagrawa-passivee   Relocate       Relocated      Completed     2023-11-01T14:16:28Z   3h17m50.318760845s   True
openshift-gitops      appset-cephfs-placement-drpc           9h    amagrawa-31o-prim   amagrawa-passivee   Failover       FailedOver     Completed     2023-11-01T13:20:45Z   5m59.4021061s        True
openshift-gitops      appset-rbd1-placement-drpc             9h    amagrawa-31o-prim   amagrawa-31o-prim   Failover       FailedOver     Completed     2023-11-01T14:15:30Z   41m2.588884417s      True
openshift-gitops      appset-rbd2-placement-drpc             9h    amagrawa-passivee                                      Deployed       Completed                                                 True
```

From passive hub-

```
amagrawa:~$ drpc
NAMESPACE             NAME                                   AGE   PREFERREDCLUSTER    FAILOVERCLUSTER     DESIREDSTATE   CURRENTSTATE   PROGRESSION                           START TIME             DURATION   PEER READY
busybox-workloads-2   subscription-cephfs-placement-1-drpc   57m   amagrawa-31o-prim   amagrawa-passivee   Relocate       Relocating                                           2023-11-01T18:59:35Z              False
busybox-workloads-5   subscription-rbd1-placement-1-drpc     57m   amagrawa-31o-prim   amagrawa-31o-prim   Failover       FailingOver    WaitForStorageMaintenanceActivation   2023-11-01T18:59:36Z              False
busybox-workloads-6   subscription-rbd2-placement-1-drpc     57m   amagrawa-31o-prim   amagrawa-passivee   Relocate                                                                                              True
openshift-gitops      appset-cephfs-placement-drpc           57m   amagrawa-31o-prim   amagrawa-passivee   Failover       FailedOver     EnsuringVolSyncSetup                                                    True
openshift-gitops      appset-rbd1-placement-drpc             57m   amagrawa-31o-prim   amagrawa-31o-prim   Failover       FailingOver    FailingOverToCluster                  2023-11-01T18:59:36Z              False
openshift-gitops      appset-rbd2-placement-drpc             57m   amagrawa-passivee                                      Deployed       Completed                                                               True
```

Since Peer Ready is now marked as false due to the sanity check, subscription-cephfs-placement-1-drpc, subscription-rbd1-placement-1-drpc and appset-rbd1-placement-drpc cannot be failed over in this example. This sanity check is needed as per the Kubernetes recommended guidelines, and we should not back up the CURRENTSTATE of the workloads, as confirmed by @bmekhiss, so the issue will always persist. As of now, the only option is to trigger a failover by editing the drpc YAML (which would be addressed by BZ 2247537); see the sketch below.

So all these apps were failed over via the CLI to the secondary managed cluster, which was available, but the failover did not succeed for the RBD-backed workloads because the volumereplicationclass was not backed up / got deleted. @bmekhiss tried a workaround that created the volumereplicationclass on the available secondary managed cluster. This let the failover proceed and created the workload pods, but not the VRs for the RBD-backed workloads, so the VRG CURRENTSTATE could not be marked as Primary. We need VRs to be created for the RBD-backed workloads, so the workaround did not work as expected.
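A minimal sketch of triggering the failover from the CLI by editing the drpc spec, as was done here. The cluster and resource names are taken from this setup; the exact spec fields (`action`, `failoverCluster`) are an assumption based on the DRPlacementControl API and should be verified against the deployed CRD:

```sh
# From the hub: set the DR action to Failover and point the workload at the
# surviving managed cluster. Equivalent to editing the drpc YAML via `oc edit`.
oc patch drpc subscription-rbd1-placement-1-drpc -n busybox-workloads-5 \
  --type merge \
  -p '{"spec":{"action":"Failover","failoverCluster":"amagrawa-passivee"}}'
```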
From passive hub after triggering failover from CLI-

```
amagrawa:~$ drpc
NAMESPACE             NAME                                   AGE     PREFERREDCLUSTER    FAILOVERCLUSTER     DESIREDSTATE   CURRENTSTATE   PROGRESSION                 START TIME             DURATION   PEER READY
busybox-workloads-2   subscription-cephfs-placement-1-drpc   3h21m   amagrawa-31o-prim   amagrawa-passivee   Failover       FailingOver    WaitingForResourceRestore   2023-11-01T18:59:35Z              False
busybox-workloads-5   subscription-rbd1-placement-1-drpc     3h21m   amagrawa-31o-prim   amagrawa-passivee   Failover       FailedOver     WaitForReadiness            2023-11-01T18:59:36Z              True
busybox-workloads-6   subscription-rbd2-placement-1-drpc     3h21m   amagrawa-31o-prim   amagrawa-passivee   Failover       FailedOver     WaitForReadiness            2023-11-01T20:12:09Z              True
openshift-gitops      appset-cephfs-placement-drpc           3h21m   amagrawa-31o-prim   amagrawa-passivee   Failover       FailedOver     EnsuringVolSyncSetup                                          True
openshift-gitops      appset-rbd1-placement-drpc             3h21m   amagrawa-31o-prim   amagrawa-passivee   Failover       FailedOver     WaitForReadiness            2023-11-01T18:59:36Z              True
openshift-gitops      appset-rbd2-placement-drpc             3h21m   amagrawa-passivee                                      Deployed       Completed                                                     True
```

From secondary available managed cluster to which failover was triggered-

```
amagrawa:~$ busybox-5
Now using project "busybox-workloads-5" on server "https://api.amagrawa-passivee.qe.rh-ocs.com:6443".

NAME                                     STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE   VOLUMEMODE
persistentvolumeclaim/busybox-pvc-21     Bound    pvc-81ff5583-61e1-45fd-a739-0ad850f9d803   43Gi       RWO            ocs-storagecluster-ceph-rbd   70m   Filesystem
persistentvolumeclaim/busybox-pvc-22     Bound    pvc-b14f6c3b-f1ed-42dd-b658-abaaf3e77a3d   43Gi       RWO            ocs-storagecluster-ceph-rbd   70m   Filesystem
persistentvolumeclaim/busybox-pvc-23     Bound    pvc-345815af-9b83-4e27-b8fa-6946f638e3c6   52Gi       RWO            ocs-storagecluster-ceph-rbd   70m   Filesystem
persistentvolumeclaim/busybox-pvc-24     Bound    pvc-3345a8f9-4552-4f2e-80ad-670088e3334a   20Gi       RWO            ocs-storagecluster-ceph-rbd   70m   Filesystem
persistentvolumeclaim/busybox-pvc-25     Bound    pvc-7088a4bf-5607-4b71-b578-7682ecd6fe24   45Gi       RWO            ocs-storagecluster-ceph-rbd   70m   Filesystem

NAME                                                                              DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/subscription-rbd1-placement-1-drpc   primary

NAME                              READY   STATUS    RESTARTS   AGE   IP            NODE        NOMINATED NODE   READINESS GATES
pod/busybox-21-7d6dfb858-qdkqn    1/1     Running   0          70m   10.129.3.25   compute-2   <none>           <none>
pod/busybox-22-6cf5dcc584-b9lwx   1/1     Running   0          70m   10.129.3.26   compute-2   <none>           <none>
pod/busybox-23-5bf89b9cc8-g62tl   1/1     Running   0          70m   10.131.0.97   compute-0   <none>           <none>
pod/busybox-24-6d5bc476dd-sx9xt   1/1     Running   0          70m   10.129.3.28   compute-2   <none>           <none>
pod/busybox-25-84d6dd6dc4-jqth2   1/1     Running   0          70m   10.131.0.98   compute-0   <none>           <none>

amagrawa:~$ busybox-6
Now using project "busybox-workloads-6" on server "https://api.amagrawa-passivee.qe.rh-ocs.com:6443".

NAME                                    STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE   VOLUMEMODE
persistentvolumeclaim/mysql-pv-claim    Bound    pvc-6ea645c2-b6f8-44d2-9526-9911282aa487   24Gi       RWO            ocs-storagecluster-ceph-rbd   70m   Filesystem

NAME                                                                              DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/subscription-rbd2-placement-1-drpc   primary

NAME                                   READY   STATUS      RESTARTS      AGE   IP            NODE        NOMINATED NODE   READINESS GATES
pod/data-viewer-1-build                0/1     Completed   0             70m   10.129.3.24   compute-2   <none>           <none>
pod/data-viewer-775bb7cb4d-zvgt5       1/1     Running     0             69m   10.129.3.29   compute-2   <none>           <none>
pod/io-writer-mysql-68475c9785-bxvpp   1/1     Running     0             70m   10.129.3.22   compute-2   <none>           <none>
pod/io-writer-mysql-68475c9785-q74zw   1/1     Running     0             70m   10.131.0.96   compute-0   <none>           <none>
pod/io-writer-mysql-68475c9785-qgdh7   1/1     Running     0             70m   10.129.3.23   compute-2   <none>           <none>
pod/io-writer-mysql-68475c9785-qkhck   1/1     Running     0             70m   10.131.0.95   compute-0   <none>           <none>
pod/io-writer-mysql-68475c9785-ttmzv   1/1     Running     0             70m   10.128.3.88   compute-1   <none>           <none>
pod/mysql-7c88dd4dff-gsvcr             1/1     Running     1 (69m ago)   70m   10.129.3.27   compute-2   <none>           <none>

amagrawa:~$ busybox-3
Now using project "busybox-workloads-3" on server "https://api.amagrawa-passivee.qe.rh-ocs.com:6443".

NAME                                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE   VOLUMEMODE
persistentvolumeclaim/dd-io-pvc-1     Bound    pvc-4cb8fad8-cd23-4e25-a6df-e8f00e2583a1   117Gi      RWO            ocs-storagecluster-ceph-rbd   69m   Filesystem
persistentvolumeclaim/dd-io-pvc-2     Bound    pvc-eef9d77b-d0bf-4b0b-9b67-cf1df477fdfc   143Gi      RWO            ocs-storagecluster-ceph-rbd   69m   Filesystem
persistentvolumeclaim/dd-io-pvc-3     Bound    pvc-ed60b47a-1724-4685-bf72-2925535114df   134Gi      RWO            ocs-storagecluster-ceph-rbd   69m   Filesystem
persistentvolumeclaim/dd-io-pvc-4     Bound    pvc-e56afbd0-65d3-4c67-b64d-24a5c301a65d   106Gi      RWO            ocs-storagecluster-ceph-rbd   69m   Filesystem
persistentvolumeclaim/dd-io-pvc-5     Bound    pvc-4e9e86a1-75d3-463a-ba9e-79abe33512aa   115Gi      RWO            ocs-storagecluster-ceph-rbd   69m   Filesystem
persistentvolumeclaim/dd-io-pvc-6     Bound    pvc-e541b7b9-36e4-4572-87aa-4276e7267b3e   129Gi      RWO            ocs-storagecluster-ceph-rbd   69m   Filesystem
persistentvolumeclaim/dd-io-pvc-7     Bound    pvc-075a6bca-0c69-47c9-8e37-9a79a8f10f29   149Gi      RWO            ocs-storagecluster-ceph-rbd   69m   Filesystem

NAME                                                                      DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/appset-rbd1-placement-drpc   primary

NAME                           READY   STATUS    RESTARTS   AGE   IP             NODE        NOMINATED NODE   READINESS GATES
pod/dd-io-1-854f867867-rcfd5   1/1     Running   0          69m   10.129.3.45    compute-2   <none>           <none>
pod/dd-io-2-56679fb667-7bjb7   1/1     Running   0          69m   10.129.3.44    compute-2   <none>           <none>
pod/dd-io-3-5757659b99-2th5r   1/1     Running   0          69m   10.131.0.100   compute-0   <none>           <none>
pod/dd-io-4-75bd89888c-x9rrv   1/1     Running   0          69m   10.129.3.47    compute-2   <none>           <none>
pod/dd-io-5-86c65fd579-8c6m7   1/1     Running   0          69m   10.129.3.46    compute-2   <none>           <none>
pod/dd-io-6-fd8994467-rcrkt    1/1     Running   0          69m   10.131.0.102   compute-0   <none>           <none>
pod/dd-io-7-685b4f6699-l7lb8   1/1     Running   0          69m   10.131.0.101   compute-0   <none>           <none>
```

Benamar, could you please check why VRs were not created for any of these workloads?
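To illustrate what is being asked about above, a hedged sketch of the checks on the surviving managed cluster. The resource kinds are the ones referenced in this report (VRG, VR, volumereplicationclass) and the namespace is one of the affected workloads; exact shortnames and output columns may differ by version:

```sh
# The VolumeReplicationGroup exists, but its CURRENTSTATE is not reported as primary:
oc get volumereplicationgroup -n busybox-workloads-5

# Per-PVC VolumeReplication resources that Ramen is expected to create for
# RBD-backed PVCs; in this failure they are missing:
oc get volumereplication -n busybox-workloads-5

# Cluster-scoped VolumeReplicationClass that the VRs reference; this is the
# resource the workaround recreated manually after it was lost:
oc get volumereplicationclass
```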
Logs collected before applying the workaround to create the volumereplicationclass-
http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/02nov23-1/

Logs collected a few hours after triggering failover from the CLI are kept here-
http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/02nov23-2/

Moving Hub Recovery issues to 4.14.z based on offline discussion.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.