Bug 2211643

Summary: [MDR][ACM Tracker] After zone failure(c1+h1 cluster) and hub recovery, apps on c2 cluster are cleaned up as application namespace Manifestwork isn't backed up
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Parikshith <pbyregow>
Component: odf-dr Assignee: Benamar Mekhissi <bmekhiss>
odf-dr sub component: ramen QA Contact: Shrivaibavi Raghaventhiran <sraghave>
Status: CLOSED ERRATA Docs Contact:
Severity: urgent    
Priority: unspecified CC: akandath, bmekhiss, edonnell, hnallurv, jpacker, kseeger, leyan, muagarwa, odf-bz-bot, owasserm, rtalur, xiangli
Version: 4.13 Keywords: Regression
Target Milestone: --- Flags: xiangli: needinfo-
Target Release: ODF 4.14.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 4.13.0-218 Doc Type: Bug Fix
Doc Text:
Previously, during hub recovery, OpenShift Data Foundation encountered a known issue with Red Hat Advanced Cluster Management version 2.7.4 (or higher) where certain managed resources associated with the subscription-based workload might have been unintentionally deleted. This issue has been fixed, and no managed resources are deleted during hub recovery.
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-11-08 18:50:55 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2154341, 2173907, 2173997, 2176028, 2183153, 2213472, 2244409    

Description Parikshith 2023-06-01 10:33:43 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
After shutting down the zone hosting the c1 and h1 clusters and performing hub recovery to h2, apps (subscription and appset) deployed on the c2 managed cluster are deleted automatically.
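
For reference, the missing piece called out in the summary can be checked directly on the hub: each protected application's namespace ManifestWork should carry the ACM backup label (see the workaround in comment 13). A minimal sketch, assuming the cluster and ManifestWork names from this environment:
```
# Hedged illustration; the ManifestWork name format and the backup label key
# are described in comment 13 below.
# List the per-application namespace ManifestWorks for managed cluster c2:
oc get manifestwork -n pbyregow-c2 | grep -- '-ns-mw'

# Inspect one of them; if the label below is missing, the Namespace (and with it
# the workload) is not backed up and is cleaned up after hub recovery:
oc get manifestwork b-sub-3-placement-1-drpc-b-sub-3-ns-mw -n pbyregow-c2 -o yaml \
  | grep 'cluster.open-cluster-management.io/backup'
```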

Drpc of apps before hub recovery:
---------------------------------
oc get drpc -A -o wide
NAMESPACE          NAME                             AGE     PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION   START TIME             DURATION          PEER READY
b-sub-1            b-sub-1-placement-1-drpc         23h     pbyregow-c1        pbyregow-c2       Relocate       Relocated      Completed     2023-05-31T11:53:34Z   2m5.010223347s    True
b-sub-2            b-sub-2-placement-1-drpc         23h     pbyregow-c1        pbyregow-c2       Relocate       Relocated      Completed     2023-05-31T11:53:46Z   1m31.99599474s    True
b-sub-3            b-sub-3-placement-1-drpc         23h     pbyregow-c2        pbyregow-c1       Relocate       Relocated      Completed     2023-05-31T13:48:24Z   2m16.995339752s   True
b-sub-4            b-sub-4-placement-1-drpc         23h     pbyregow-c2        pbyregow-c1       Relocate       Relocated      Completed     2023-05-31T13:48:39Z   18m7.137529589s   True
cronjob-sub-1      cronjob-sub-1-placement-1-drpc   23h     pbyregow-c1        pbyregow-c2       Relocate       Relocated      Completed     2023-05-31T11:53:59Z   1m59.004503184s   True
cronjob-sub-2      cronjob-sub-2-placement-1-drpc   23h     pbyregow-c2        pbyregow-c1       Relocate       Relocated      Completed     2023-05-31T13:48:59Z   2m13.949180869s   True
job-sub-1          job-sub-1-placement-1-drpc       23h     pbyregow-c1        pbyregow-c2       Relocate       Relocated      Completed     2023-05-31T11:54:34Z   1m40.973458756s   True
job-sub-2          job-sub-2-placement-1-drpc       23h     pbyregow-c2        pbyregow-c1       Relocate       Relocated      Completed     2023-05-31T13:49:16Z   2m24.759375791s   True
new-sub-1          new-sub-1-placement-1-drpc       3h11m   pbyregow-c1                                         Deployed       Completed     2023-06-01T05:58:49Z   2.032041553s      True
new-sub-2          new-sub-2-placement-1-drpc       3h11m   pbyregow-c2                                         Deployed       Completed     2023-06-01T05:58:38Z   19.043831186s     True
openshift-gitops   b-app-1-placement-drpc           23h     pbyregow-c1        pbyregow-c2       Relocate       Relocated      Completed     2023-05-31T11:53:20Z   1m41.015799768s   True
openshift-gitops   b-app-2-placement-drpc           23h     pbyregow-c1        pbyregow-c2       Relocate       Relocated      Completed     2023-05-31T11:53:24Z   5m0.009409233s    True
openshift-gitops   b-app-3-placement-drpc           23h     pbyregow-c2        pbyregow-c1       Relocate       Relocated      Completed     2023-05-31T13:48:10Z   5m30.927637297s   True
openshift-gitops   b-app-4-placement-drpc           23h     pbyregow-c2        pbyregow-c1       Relocate       Relocated      Completed     2023-05-31T13:48:15Z   4m55.943941384s   True
openshift-gitops   cronjob-app-1-placement-drpc     23h     pbyregow-c1        pbyregow-c2       Relocate       Relocated      Completed     2023-05-31T11:53:52Z   1m49.956000232s   True
openshift-gitops   cronjob-app-2-placement-drpc     23h     pbyregow-c2        pbyregow-c1       Relocate       Relocated      Completed     2023-05-31T13:48:50Z   1m53.3372577s     True
openshift-gitops   job-app-1-placement-drpc         23h     pbyregow-c1        pbyregow-c2       Relocate       Relocated      Completed     2023-05-31T11:54:05Z   1m57.943351562s   True
openshift-gitops   job-app-2-placement-drpc         23h     pbyregow-c2        pbyregow-c1       Relocate       Relocated      Completed     2023-05-31T13:49:07Z   2m3.812597752s    True
openshift-gitops   new-app-1-placement-drpc         3h11m   pbyregow-c1                                         Deployed       Completed     2023-06-01T05:58:37Z   29.01439644s      True
openshift-gitops   new-app-2-placement-drpc         3h10m   pbyregow-c2                                         Deployed       Completed     2023-06-01T05:59:14Z   1.013864968s      True

Drpc of apps after hub recovery:
----------------------------
oc get drpc -A -o wide
NAMESPACE          NAME                             AGE   PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION      START TIME             DURATION       PEER READY
b-sub-1            b-sub-1-placement-1-drpc         59m   pbyregow-c1        pbyregow-c2       Relocate                                                                             Unknown
b-sub-2            b-sub-2-placement-1-drpc         59m   pbyregow-c1        pbyregow-c2       Relocate                                                                             Unknown
b-sub-3            b-sub-3-placement-1-drpc         59m   pbyregow-c2        pbyregow-c1       Relocate       Relocated      Cleaning Up                                            True
b-sub-4            b-sub-4-placement-1-drpc         59m   pbyregow-c2        pbyregow-c1       Relocate       Relocated      Cleaning Up                                            True
cronjob-sub-1      cronjob-sub-1-placement-1-drpc   59m   pbyregow-c1        pbyregow-c2       Relocate                                                                             Unknown
cronjob-sub-2      cronjob-sub-2-placement-1-drpc   59m   pbyregow-c2        pbyregow-c1       Relocate       Relocated      Cleaning Up                                            True
job-sub-1          job-sub-1-placement-1-drpc       59m   pbyregow-c1        pbyregow-c2       Relocate                                                                             Unknown
job-sub-2          job-sub-2-placement-1-drpc       59m   pbyregow-c2        pbyregow-c1       Relocate       Relocated      Cleaning Up                                            True
new-sub-1          new-sub-1-placement-1-drpc       59m   pbyregow-c1                                         Deployed       UpdatingPlRule   2023-06-01T10:21:16Z                  True
new-sub-2          new-sub-2-placement-1-drpc       59m   pbyregow-c2                                         Deployed       Completed        2023-06-01T09:40:40Z   116.057682ms   True
openshift-gitops   b-app-1-placement-drpc           59m   pbyregow-c1        pbyregow-c2       Relocate                                                                             Unknown
openshift-gitops   b-app-2-placement-drpc           59m   pbyregow-c1        pbyregow-c2       Relocate                                                                             Unknown
openshift-gitops   b-app-3-placement-drpc           59m   pbyregow-c2        pbyregow-c1       Relocate       Relocated      Cleaning Up                                            True
openshift-gitops   b-app-4-placement-drpc           59m   pbyregow-c2        pbyregow-c1       Relocate       Relocated      Cleaning Up                                            True
openshift-gitops   cronjob-app-1-placement-drpc     59m   pbyregow-c1        pbyregow-c2       Relocate                                                                             Unknown
openshift-gitops   cronjob-app-2-placement-drpc     59m   pbyregow-c2        pbyregow-c1       Relocate       Relocated      Cleaning Up      2023-06-01T09:40:39Z                  True
openshift-gitops   job-app-1-placement-drpc         59m   pbyregow-c1        pbyregow-c2       Relocate                                                                             Unknown
openshift-gitops   job-app-2-placement-drpc         59m   pbyregow-c2        pbyregow-c1       Relocate       Relocated      Cleaning Up                                            True
openshift-gitops   new-app-1-placement-drpc         59m   pbyregow-c1                                         Deployed       UpdatingPlRule   2023-06-01T10:21:20Z                  True
openshift-gitops   new-app-2-placement-drpc         59m   pbyregow-c2                                         Deployed       Completed        2023-06-01T09:40:41Z   376.606272ms   True

Version of all relevant components (if applicable):
ocp: 4.13.0-0.nightly-2023-05-30-074322
odf/mco: 4.13.0-207
ACM: 2.7.4

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
yes, applications on the surviving managed cluster should not be deleted

Is there any workaround available to the best of your knowledge?
no

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
4

Is this issue reproducible?
not sure

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:
yes

Steps to Reproduce:
1. Create 4 OCP clusters: 2 hubs (h1, h2) and 2 managed clusters (c1, c2), plus one stretched RHCS cluster.
   Distribute them across zones as follows:
	zone a: arbiter ceph node
	zone b: c1, h1, 3 ceph nodes
	zone c: c2, h2, 3 ceph nodes
2. Configure MDR and deploy 20 applications on each managed cluster
3. Initiate a backup process so that the active and passive hubs are in sync (see the illustrative CR sketch after these steps)
4. Bring zone b down, i.e. c1, h1 and the 3 ceph nodes
5. Initiate the restore process on h2
6. Verify the restore succeeds on the new hub and that the DR policy on h2 is in the Validated state
7. Check the applications on the c2 cluster
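
For steps 3 and 5, a minimal sketch of the ACM hub backup and restore custom resources (the schedule, TTL and other field values are illustrative; the cluster-backup operator with its BackupSchedule/Restore CRDs is assumed to be enabled on both hubs):
```
# Hedged sketch, not the exact CRs used in this run.
# On the active hub (h1): schedule periodic backups so the passive hub stays in sync.
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: BackupSchedule
metadata:
  name: schedule-acm
  namespace: open-cluster-management-backup
spec:
  veleroSchedule: "*/30 * * * *"   # illustrative interval
  veleroTtl: 120h
---
# On the passive hub (h2), after zone b goes down: restore the latest backups.
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Restore
metadata:
  name: restore-acm
  namespace: open-cluster-management-backup
spec:
  cleanupBeforeRestore: CleanupRestored
  veleroManagedClustersBackupName: latest
  veleroCredentialsBackupName: latest
  veleroResourcesBackupName: latest
```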


Actual results:
Applications present on the c2 managed cluster are deleted after hub recovery

Expected results:
Applications on the c2 managed cluster should be present and in a running state after hub recovery.


Additional info:
Validated the status of the apps on c2 shortly before hub recovery:
------------------------------------------------------------------------

$for i in {b-sub-3,b-sub-4,b-app-3,b-app-4,cronjob-sub-2,job-sub-2,cronjob-app-2,job-app-2,new-app-2,new-sub-2}; do oc get pod,pvc -n $i; done
NAME                               READY   STATUS    RESTARTS   AGE
pod/busybox-rbd-5f46b79479-h8pdn   1/1     Running   0          19h

NAME                                    STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                           AGE
persistentvolumeclaim/busybox-rbd-pvc   Bound    pvc-3800a556-9776-4216-b8b1-c48b4989308e   5Gi        RWO            ocs-external-storagecluster-ceph-rbd   19h
NAME                                  READY   STATUS    RESTARTS   AGE
pod/busybox-cephfs-7bd55bcb67-9qn6c   1/1     Running   0          19h

NAME                                       STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                         AGE
persistentvolumeclaim/busybox-cephfs-pvc   Bound    pvc-8fe28974-f5ae-4848-95d9-763a9e7457e5   5Gi        RWO            ocs-external-storagecluster-cephfs   19h
NAME                               READY   STATUS    RESTARTS   AGE
pod/busybox-rbd-5f46b79479-n5jj8   1/1     Running   0          19h

NAME                                    STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                           AGE
persistentvolumeclaim/busybox-rbd-pvc   Bound    pvc-27f7ab03-88c8-47e2-b362-e6886ae4ad22   5Gi        RWO            ocs-external-storagecluster-ceph-rbd   19h
NAME                                  READY   STATUS    RESTARTS   AGE
pod/busybox-cephfs-7bd55bcb67-282cz   1/1     Running   0          19h

NAME                                       STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                         AGE
persistentvolumeclaim/busybox-cephfs-pvc   Bound    pvc-28174687-31a2-4b65-947a-368655475774   5Gi        RWO            ocs-external-storagecluster-cephfs   19h
NAME                                        READY   STATUS      RESTARTS   AGE
pod/hello-world-job-cephfs-28093508-p7hkv   0/1     Completed   0          2m14s
pod/hello-world-job-cephfs-28093509-skd98   0/1     Completed   0          74s
pod/hello-world-job-cephfs-28093510-kgv9w   0/1     Completed   0          14s
pod/hello-world-job-rbd-28093508-wdxrn      0/1     Completed   0          2m14s
pod/hello-world-job-rbd-28093509-vzsg6      0/1     Completed   0          74s
pod/hello-world-job-rbd-28093510-xjkxx      0/1     Completed   0          14s

NAME                                       STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                           AGE
persistentvolumeclaim/hello-world-cephfs   Bound    pvc-c9fe7e99-cd5a-4777-8711-42dcdeacb3c2   10Gi       RWO            ocs-external-storagecluster-cephfs     19h
persistentvolumeclaim/hello-world-rbd      Bound    pvc-7c0ee582-284c-4397-a2aa-57c7cf769568   10Gi       RWO            ocs-external-storagecluster-ceph-rbd   19h
NAME                         READY   STATUS    RESTARTS   AGE
pod/countdown-cephfs-sdvg5   1/1     Running   0          19h
pod/countdown-cephfs-sztzq   1/1     Running   0          19h
pod/countdown-cephfs-wctxf   1/1     Running   0          19h
pod/countdown-rbd-2cqd6      1/1     Running   0          19h
pod/countdown-rbd-g5mnl      1/1     Running   0          19h
pod/countdown-rbd-jbflw      1/1     Running   0          19h

NAME                                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                           AGE
persistentvolumeclaim/job-cephfspvc   Bound    pvc-2e71aa9f-5eb2-4098-80e0-6fe602c5a0b4   5Gi        RWO            ocs-external-storagecluster-cephfs     19h
persistentvolumeclaim/job-rbdpvc      Bound    pvc-8b3af01c-b138-4882-94bd-742b7656c699   5Gi        RWO            ocs-external-storagecluster-ceph-rbd   19h
NAME                                        READY   STATUS      RESTARTS   AGE
pod/hello-world-job-cephfs-28093508-9zs4k   0/1     Completed   0          2m17s
pod/hello-world-job-cephfs-28093509-lvcp2   0/1     Completed   0          77s
pod/hello-world-job-cephfs-28093510-5fr6p   0/1     Completed   0          17s
pod/hello-world-job-rbd-28093508-v6bql      0/1     Completed   0          2m17s
pod/hello-world-job-rbd-28093509-kvgtr      0/1     Completed   0          77s
pod/hello-world-job-rbd-28093510-xsskr      0/1     Completed   0          17s

NAME                                       STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                           AGE
persistentvolumeclaim/hello-world-cephfs   Bound    pvc-c840a012-b91c-4be0-aeef-bbe6c976e52e   10Gi       RWO            ocs-external-storagecluster-cephfs     19h
persistentvolumeclaim/hello-world-rbd      Bound    pvc-e9f70091-2bad-4f66-90ac-ac09927f9aa7   10Gi       RWO            ocs-external-storagecluster-ceph-rbd   19h
NAME                         READY   STATUS    RESTARTS   AGE
pod/countdown-cephfs-l2wqv   1/1     Running   0          19h
pod/countdown-cephfs-mxxct   1/1     Running   0          19h
pod/countdown-cephfs-wzt87   1/1     Running   0          19h
pod/countdown-rbd-89hb6      1/1     Running   0          19h
pod/countdown-rbd-gghfs      1/1     Running   0          19h
pod/countdown-rbd-l6h5v      1/1     Running   0          19h

NAME                                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                           AGE
persistentvolumeclaim/job-cephfspvc   Bound    pvc-f533982a-9613-4aad-93f1-7ae1be45d4b3   5Gi        RWO            ocs-external-storagecluster-cephfs     19h
persistentvolumeclaim/job-rbdpvc      Bound    pvc-e463888c-04e4-4278-969c-9bbea4148294   5Gi        RWO            ocs-external-storagecluster-ceph-rbd   19h
NAME                               READY   STATUS    RESTARTS   AGE
pod/busybox-rbd-5f46b79479-52mht   1/1     Running   0          3h12m

NAME                                    STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                           AGE
persistentvolumeclaim/busybox-rbd-pvc   Bound    pvc-37dc7b45-96b8-430d-b1c6-44b72b489805   5Gi        RWO            ocs-external-storagecluster-ceph-rbd   3h12m
NAME                               READY   STATUS    RESTARTS   AGE
pod/busybox-rbd-5f46b79479-zlwj5   1/1     Running   0          3h12m

NAME                                    STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                           AGE
persistentvolumeclaim/busybox-rbd-pvc   Bound    pvc-0e1d7acd-c676-4b10-9b08-cae9c4847426   5Gi        RWO 

After hub recovery:
-----------------  
$for i in {b-sub-3,b-sub-4,b-app-3,b-app-4,cronjob-sub-2,job-sub-2,cronjob-app-2,job-app-2,new-app-2,new-sub-2}; do oc get pod,pvc -n $i; done
No resources found in b-sub-3 namespace.
No resources found in b-sub-4 namespace.
No resources found in b-app-3 namespace.
No resources found in b-app-4 namespace.
No resources found in cronjob-sub-2 namespace.
No resources found in job-sub-2 namespace.
No resources found in cronjob-app-2 namespace.
No resources found in job-app-2 namespace.
No resources found in new-app-2 namespace.
No resources found in new-sub-2 namespace.

Comment 12 Harish NV Rao 2023-06-08 08:10:08 UTC
Hi Benamar

Does this fix cover the upgrade scenario that we discussed in yesterday's meeting?

If not, what are the steps to cover it manually? Please let us know.

Harish

Comment 13 Benamar Mekhissi 2023-06-08 19:19:25 UTC
@hnallurv To update the namespace ManifestWork after upgrade, follow these steps:
0. Find where the application is running:
```
oc get drpc -A

NAMESPACE        NAME           AGE   PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE
busybox-sample   busybox-drpc   68d   c1                 c2                Failover       FailedOver
```

1. Find the ManifestWork for the namespace
```
oc get manifestwork -n c2 | grep ns
NAME                                         AGE
busybox-drpc-busybox-sample-ns-mw            69d
```
2. Identify the namespace ManifestWork for the application. It is named using the format "%1-%2-%3-mw", where:
    - %1: Name of the application's DRPC (busybox-drpc in the example)
    - %2: Namespace of the application
    - %3: the literal string 'ns'
    Example: busybox-drpc-busybox-sample-ns-mw --> [busybox-drpc]-[busybox-sample]-ns-mw

3. Edit the ManifestWork:
```
oc edit manifestwork -n c2 busybox-drpc-busybox-sample-ns-mw
```

4. Add the following label to the Namespace entry under the .spec.workload.manifests section:
```
labels:
  cluster.open-cluster-management.io/backup: resource
```
5. Here is an example:
```
apiVersion: work.open-cluster-management.io/v1
kind: ManifestWork
metadata:
  annotations:
    drplacementcontrol.ramendr.openshift.io/drpc-name: busybox-drpc
    drplacementcontrol.ramendr.openshift.io/drpc-namespace: busybox-sample
  creationTimestamp: "2023-03-30T19:50:25Z"
  finalizers:
  - cluster.open-cluster-management.io/manifest-work-cleanup
  generation: 2
  name: busybox-drpc-busybox-sample-ns-mw
  namespace: c2
  resourceVersion: "910332"
  uid: 788ff2c3-4d2e-49dc-b222-61f581131866
spec:
  workload:
    manifests:
    - apiVersion: v1
      kind: Namespace
      labels:
        cluster.open-cluster-management.io/backup: resource
      metadata:
        name: busybox-sample
      spec: {}
      status: {}
```
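
If editing by hand is inconvenient, the same label can be added with a JSON patch. This is a sketch equivalent to the manual edit above; it assumes the Namespace is the first (index 0) entry under .spec.workload.manifests, so adjust the index if the ManifestWork carries more than one manifest:
```
oc patch manifestwork busybox-drpc-busybox-sample-ns-mw -n c2 --type=json -p='[
  {"op": "add",
   "path": "/spec/workload/manifests/0/labels",
   "value": {"cluster.open-cluster-management.io/backup": "resource"}}
]'
```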

Comment 24 Mudit Agarwal 2023-07-25 07:18:31 UTC
The ACM issue https://issues.redhat.com/browse/ACM-5795 is fixed in ACM 2.7.7.

Comment 33 Raghavendra Talur 2023-09-06 12:41:13 UTC
*** Bug 2222706 has been marked as a duplicate of this bug. ***

Comment 35 Shrivaibavi Raghaventhiran 2023-10-20 16:15:03 UTC
Tested versions:
----------------
OCP - 4.14.0-0.nightly-2023-10-08-220853
ODF - 4.14.0-146.stable
ACM - 2.9.0-180

Steps performed:
-----------------
1. Configured a 4.14 MetroDR setup with ACM 2.9.0 (zone-a: c1, hub-active; zone-b: c2, hub-passive)
2. Deployed subscription and appset apps on both managed clusters (c1, c2)
3. Applied the DR policy to the apps and had apps in Deployed, FailedOver and Relocated states
4. Created backup
5. Brought down zone-a (c1, hub-active, ceph nodes)
6. Restored on hub-passive

Observations:
--------------
1. Post restore, the c1 managed cluster had to be imported manually (using the auto-import-secret; an illustrative example follows this list).
2. After a few minutes the DR policy reached the Validated state. All applications were running and were not cleaned up on the managed clusters.
3. The openshift-storage namespace and other app resources were intact.
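
For observation 1, a minimal sketch of such an auto-import secret (created on the recovered hub in the managed cluster's namespace, c1 here; the retry count and kubeconfig path are placeholders, and a token/server pair can be used instead of a kubeconfig):
```
# Hedged example; exact values depend on the environment.
oc create secret generic auto-import-secret \
  -n c1 \
  --from-literal=autoImportRetry=5 \
  --from-file=kubeconfig=/path/to/c1-kubeconfig
```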

Based on the above observations, moving the BZ to Verified state.

Comment 37 errata-xmlrpc 2023-11-08 18:50:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.14.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6832

Comment 38 Red Hat Bugzilla 2024-03-08 04:25:46 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days