Description of problem (please be as detailed as possible and provide log snippets):

Observing an issue with subscription apps after MDR co-situated hub recovery (c1 + active hub + ceph (zone b) went down). AppSet pull and discovered apps failed over successfully using the new hub, but subscription app pods do not come up after failover from c1 to c2, even though the PVCs and VRGs of these apps did fail over. The DRPC of each subscription app reports a successful failover, yet the respective app pods are missing on c2:

NAMESPACE         NAME                               AGE   PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION   START TIME             DURATION             PEER READY
busybox-sub-1     busybox-sub-1-placement-1-drpc     17h   pbyregow-cl1       pbyregow-cl2      Failover       FailedOver     Completed     2024-07-03T16:04:38Z   2h0m45.152881171s    True
vm-pvc-acm-sub1   vm-pvc-acm-sub1-placement-1-drpc   17h   pbyregow-cl1       pbyregow-cl2      Failover       FailedOver     Completed     2024-07-03T16:17:57Z   2h14m58.850396117s   True
vm-pvc-acm-sub2   vm-pvc-acm-sub2-placement-1-drpc   17h   pbyregow-cl1       pbyregow-cl2      Failover       FailedOver     Completed     2024-07-03T16:18:03Z   2h14m52.041023629s   True

On c2, no pods are listed in any of the three app namespaces, while the PVCs and VRGs are present:

$ for i in {busybox-sub-1,vm-pvc-acm-sub1,vm-pvc-acm-sub2}; do oc get pod,pvc,vrg -n $i; done

NAME                                         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                           VOLUMEATTRIBUTESCLASS   AGE
persistentvolumeclaim/busybox-cephfs-pvc-1   Bound    pvc-cba9f468-46ee-41de-a6a5-0650e9235b8b   100Gi      RWO            ocs-external-storagecluster-cephfs     <unset>                 19h
persistentvolumeclaim/busybox-rbd-pvc-1      Bound    pvc-4be77410-ef6b-454f-9835-2b8c111f88c6   100Gi      RWO            ocs-external-storagecluster-ceph-rbd   <unset>                 19h

NAME                                                                          DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/busybox-sub-1-placement-1-drpc    primary        Primary

NAME                             STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                           VOLUMEATTRIBUTESCLASS   AGE
persistentvolumeclaim/vm-1-pvc   Bound    pvc-96184450-4ed0-4879-84a7-76fd3407af7a   512Mi      RWX            ocs-external-storagecluster-ceph-rbd   <unset>                 19h

NAME                                                                          DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/vm-pvc-acm-sub1-placement-1-drpc  primary        Primary

NAME                             STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                           VOLUMEATTRIBUTESCLASS   AGE
persistentvolumeclaim/vm-1-pvc   Bound    pvc-584707a8-81af-4994-9f08-90556b4f26a7   512Mi      RWX            ocs-external-storagecluster-ceph-rbd   <unset>                 19h

NAME                                                                          DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/vm-pvc-acm-sub2-placement-1-drpc  primary        Primary

Seeing this error on the subscription in the ACM console for the busybox-sub-1 app:

{ggithubcom-red-hat-storage-ocs-workloads-ns/ggithubcom-red-hat-storage-ocs-workloads <nil> [] 0xc0025bd470 [] <nil> nil [] [] false} { 0001-01-01 00:00:00 +0000 UTC { [] []} map[]}}: channels.apps.open-cluster-management.io "ggithubcom-red-hat-storage-ocs-workloads" is forbidden: User "system:open-cluster-management:cluster:pbyregow-cl2:addon:application-manager:agent:application-manager" cannot get resource "channels" in API group "apps.open-cluster-management.io" in the namespace "ggithubcom-red-hat-storage-ocs-workloads-ns"
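The "forbidden" error suggests the application-manager addon agent for pbyregow-cl2 lost RBAC on the channel resource after the hub restore. A minimal diagnostic sketch, run against the new hub (assuming cluster-admin access; the namespace and user string are copied verbatim from the error above):

# Does the channel object itself exist after the restore?
$ oc get channels.apps.open-cluster-management.io -n ggithubcom-red-hat-storage-ocs-workloads-ns

# Can the agent identity from the error actually read it?
$ oc auth can-i get channels.apps.open-cluster-management.io \
    -n ggithubcom-red-hat-storage-ocs-workloads-ns \
    --as="system:open-cluster-management:cluster:pbyregow-cl2:addon:application-manager:agent:application-manager"

If the first command shows the channel but the second returns "no", the missing piece is likely the agent's RoleBinding in the channel namespace rather than the channel itself.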
Version of all relevant components (if applicable):
OCP: 4.16.0-0.nightly-2024-06-27-091410
ODF: 4.16.0-134
ACM: 2.11.0-137
OADP: 1.4 (latest) on hub/managed clusters

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Configure an MDR cluster with the versions listed above.
2. Deploy subscription, AppSet pull, and discovered apps; apply DR policies and leave the apps in different states (Deployed/FailedOver/Relocated) on both clusters.
3. Configure backup and wait ~2 hours for the latest backup to be taken, with no changes to any app in between.
4. Bring down c1 + active hub + 3 ceph nodes.
5. Restore on the new hub (restore completed successfully), then follow the hub recovery doc to apply appliedManifestWorkEvictionGracePeriod: "24h" (hedged sketches for steps 5, 6, and 8 follow Additional info below).
6. Wait for the DRPolicy to reach the Validated state.
7. Remove appliedManifestWorkEvictionGracePeriod after the DRPolicy and DRPCs recover.
8. Fail over the apps from c1 to c2.

Actual results:
Subscription app pods did not come up after failover post hub recovery.

Expected results:
Subscription app pods should come up along with the rest of the resources.

Additional info:
The rest of the apps (AppSet pull & discovered) failed over to c2 successfully.
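For step 5, a minimal sketch of the eviction grace period setting from the hub recovery doc (assuming the global KlusterletConfig documented by ACM; the exact API version and object shape may differ by doc/ACM version):

$ oc apply -f - <<EOF
apiVersion: config.open-cluster-management.io/v1alpha1
kind: KlusterletConfig
metadata:
  name: global
spec:
  # Keep AppliedManifestWorks (and thus workloads) from being evicted
  # while the managed clusters re-register against the restored hub.
  appliedManifestWorkEvictionGracePeriod: "24h"
EOF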
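For step 6, one way to confirm the DRPolicy recovered (a sketch; <policy-name> is a placeholder for the actual DRPolicy name):

$ oc get drpolicy <policy-name> -o jsonpath='{.status.conditions[?(@.type=="Validated")].status}'
# Expect "True" once the policy is validated on the new hub.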
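Step 8 can be triggered from the ACM console or, equivalently, by patching the DRPC on the hub. A sketch using the DRPC name/namespace from the output above (ramen acts on spec.action and spec.failoverCluster):

$ oc patch drpc busybox-sub-1-placement-1-drpc -n busybox-sub-1 \
    --type merge -p '{"spec":{"action":"Failover","failoverCluster":"pbyregow-cl2"}}'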
Moving this non-blocker BZ out of ODF 4.16.0. If this is a blocker, feel free to propose it back with a justification note.