Bug 2295782

Summary: [RDR][MDR][Tracker ACM-12448] Post hub recovery, subscription app pods are not coming up after Failover from c1 to c2.
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Parikshith <pbyregow>
Component: odf-dr Assignee: Elena Gershkovich <egershko>
odf-dr sub component: unclassified QA Contact: Aman Agrawal <amagrawa>
Status: ON_QA --- Docs Contact:
Severity: high    
Priority: unspecified CC: amagrawa, edonnell, egershko, hnallurv, kbg, kramdoss, kseeger, muagarwa, sheggodu
Version: 4.16 Keywords: Tracking
Target Milestone: --- Flags: kseeger: needinfo? (amagrawa)
Target Release: ODF 4.18.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 4.17.0-105 Doc Type: Bug Fix
Doc Text:
.Post hub recovery, subscription app pods now come up after failover
Previously, post hub recovery, the subscription application pods did not come up after failover from the primary to the secondary managed cluster. This was caused by an RBAC error on the AppSub subscription resource on the managed cluster, due to a timing issue in the backup and restore scenario. With this fix, the subscription app pods come up after failover from the primary to the secondary managed cluster.
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2260844, 2281703    

Description Parikshith 2024-07-04 12:09:02 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
Observing an issue with subscription apps after MDR co-situated hub recovery (c1 + active hub + ceph in zone b were down). Appset-pull and discovered apps were failed over successfully using the new hub.
However, sub app pods are not showing up after failover from c1 to c2, even though the PVCs and VRGs of these apps have failed over.

The DRPC of each sub app shows it has failed over successfully, but the respective app pods are missing on c2:
busybox-sub-1      busybox-sub-1-placement-1-drpc     17h   pbyregow-cl1       pbyregow-cl2      Failover       FailedOver     Completed     2024-07-03T16:04:38Z   2h0m45.152881171s    True
vm-pvc-acm-sub1    vm-pvc-acm-sub1-placement-1-drpc   17h   pbyregow-cl1       pbyregow-cl2      Failover       FailedOver     Completed     2024-07-03T16:17:57Z   2h14m58.850396117s   True
vm-pvc-acm-sub2    vm-pvc-acm-sub2-placement-1-drpc   17h   pbyregow-cl1       pbyregow-cl2      Failover       FailedOver     Completed     2024-07-03T16:18:03Z   2h14m52.041023629s   True
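
(For reference, DRPC status like the above is typically collected on the active hub with something like the following; the exact command and flags are assumed here, not copied from this report:)
# List DRPlacementControls across all namespaces with the wide status columns
oc get drpc -A -o wide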

for i in busybox-sub-1 vm-pvc-acm-sub1 vm-pvc-acm-sub2; do oc get pod,pvc,vrg -n "$i"; done
NAME                                         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                           VOLUMEATTRIBUTESCLASS   AGE
persistentvolumeclaim/busybox-cephfs-pvc-1   Bound    pvc-cba9f468-46ee-41de-a6a5-0650e9235b8b   100Gi      RWO            ocs-external-storagecluster-cephfs     <unset>                 19h
persistentvolumeclaim/busybox-rbd-pvc-1      Bound    pvc-4be77410-ef6b-454f-9835-2b8c111f88c6   100Gi      RWO            ocs-external-storagecluster-ceph-rbd   <unset>                 19h

NAME                                                                         DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/busybox-sub-1-placement-1-drpc   primary        Primary
NAME                             STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                           VOLUMEATTRIBUTESCLASS   AGE
persistentvolumeclaim/vm-1-pvc   Bound    pvc-96184450-4ed0-4879-84a7-76fd3407af7a   512Mi      RWX            ocs-external-storagecluster-ceph-rbd   <unset>                 19h

NAME                                                                           DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/vm-pvc-acm-sub1-placement-1-drpc   primary        Primary
NAME                             STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                           VOLUMEATTRIBUTESCLASS   AGE
persistentvolumeclaim/vm-1-pvc   Bound    pvc-584707a8-81af-4994-9f08-90556b4f26a7   512Mi      RWX            ocs-external-storagecluster-ceph-rbd   <unset>                 19h

NAME                                                                           DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/vm-pvc-acm-sub2-placement-1-drpc   primary        Primary


Seeing this error in the subscription resource in the ACM console for the busybox-sub-1 app:

{ggithubcom-red-hat-storage-ocs-workloads-ns/ggithubcom-red-hat-storage-ocs-workloads  
    <nil> [] 0xc0025bd470 [] <nil> nil [] [] false} {    0001-01-01 00:00:00
    +0000 UTC { []  []} map[]}}: channels.apps.open-cluster-management.io
    "ggithubcom-red-hat-storage-ocs-workloads" is forbidden: User
    "system:open-cluster-management:cluster:pbyregow-cl2:addon:application-manager:agent:application-manager"
    cannot get resource "channels" in API group
    "apps.open-cluster-management.io" in the namespace
    "ggithubcom-red-hat-storage-ocs-workloads-ns"


Version of all relevant components (if applicable):
OCP: 4.16.0-0.nightly-2024-06-27-091410
ODF: 4.16.0-134
ACM: 2.11.0-137
OADP: 1.4 (latest) hub/managed cluster

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Configured the MDR cluster as per the versions listed.
2. Deployed sub, appset-pull, and discovered apps; applied policies and had them in different states (Deployed/FailedOver/Relocated) on both clusters.
3. Configured backup and waited ~2 hrs for the latest backup to be taken; no changes were made to any apps in between.
4. Brought down c1 + active hub + 3 ceph nodes.
5. Restored on the new hub; the restore completed successfully. Followed the hub recovery doc to apply appliedManifestWorkEvictionGracePeriod: "24h" (see the sketch after these steps).
6. DRPolicy reached the Validated state.
7. Removed appliedManifestWorkEvictionGracePeriod after the DRPolicy and DRPCs recovered.
8. Failed over the apps from c1 to c2.
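
(Not from this report: a minimal sketch of how step 5 is typically applied, assuming the setting lives on the OCM Klusterlet resource of each surviving managed cluster; the resource name "klusterlet" and the field path are assumptions, so follow the hub recovery doc for the authoritative procedure:)
# Run against each surviving managed cluster: extend the AppliedManifestWork
# eviction grace period so restored workloads are not garbage-collected before
# the new hub re-establishes ownership.
oc patch klusterlet klusterlet --type merge \
  -p '{"spec":{"workConfiguration":{"appliedManifestWorkEvictionGracePeriod":"24h"}}}'
# Revert once the DRPolicy and DRPCs have recovered (step 7).
oc patch klusterlet klusterlet --type json \
  -p '[{"op":"remove","path":"/spec/workConfiguration/appliedManifestWorkEvictionGracePeriod"}]'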

Actual results:
Subscription app pods did not come up after failover post hub recovery.

Expected results:
Sub app pods should come up along with the rest of the resources.

Additional info:
The rest of the apps (appset-pull & discovered) failed over to c2 successfully.

Comment 5 Sunil Kumar Acharya 2024-07-05 09:36:36 UTC
Moving the non-blocker BZ out of ODF 4.16.0. If this is a blocker, feel free to propose it back with a justification note.