Bug 2243804
| Summary: | [MDR] : After zone failure and hub recovery, on failover applications DRPC reporting 'Progression:Completed' when cluster has leftovers of PVC, PV, VRG | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | akarsha <akrai> |
| Component: | odf-dr | Assignee: | Shyamsundar <srangana> |
| odf-dr sub component: | ramen | QA Contact: | akarsha <akrai> |
| Status: | CLOSED WONTFIX | Docs Contact: | |
| Severity: | high | | |
| Priority: | unspecified | CC: | amagrawa, bmekhiss, hnallurv, kramdoss, kseeger, muagarwa, sraghave |
| Version: | 4.14 | | |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2024-09-09 10:11:59 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
akarsha
2023-10-13 07:47:36 UTC
Tested versions:
----------------
OCP - 4.14.0-0.nightly-2023-10-08-220853
ODF - 4.14.0-146.stable
ACM - 2.9.0-180
Post hub recovery, I tried a failover of one subscription app and one appset app.
* Failover of the subscription app took almost 24 minutes; it was stuck in the cleaning-up phase for a long time, but cleanup eventually completed and the failover succeeded.
$ oc get drpc -n cephfs1 -o wide
NAME AGE PREFERREDCLUSTER FAILOVERCLUSTER DESIREDSTATE CURRENTSTATE PROGRESSION START TIME DURATION PEER READY
cephfs1-placement-3-drpc 16h sraghave-c1-oct sraghave-c2-oct Failover FailedOver Completed 2023-10-19T19:00:01Z 24m4.988864425s True
* Failover of the appset app is stuck in the Cleaning Up phase, and it's been almost 40 minutes now.
sraghave:~$ oc get drpc rbd-sample-placement-drpc -n openshift-gitops -o wide
NAME AGE PREFERREDCLUSTER FAILOVERCLUSTER DESIREDSTATE CURRENTSTATE PROGRESSION START TIME DURATION PEER READY
rbd-sample-placement-drpc 16h sraghave-c1-oct sraghave-c2-oct Failover FailedOver Cleaning Up 2023-10-20T08:31:45Z False
sraghave:~$
sraghave:~$ date --utc
Fri Oct 20 09:11:02 AM UTC 2023
Leftovers from C1:
-------------------
$ oc get pods,pvc,vrg -n multiple-appsets
NAME READY STATUS RESTARTS AGE
pod/busybox-rbd-5d6cc5f8b9-lrltp 0/1 Pending 0 42m
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
persistentvolumeclaim/busybox-rbd-pvc Terminating pvc-7dadb197-1361-4bd6-97f4-fdc42a6f0500 5Gi RWO ocs-external-storagecluster-ceph-rbd 3d18h
NAME DESIREDSTATE CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/rbd-sample-placement-drpc secondary Primary
Output from C2:
--------------
$ oc get pods,pvc,vrg -n multiple-appsets
NAME READY STATUS RESTARTS AGE
pod/busybox-rbd-5d6cc5f8b9-p7n54 1/1 Running 0 52s
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
persistentvolumeclaim/busybox-rbd-pvc Bound pvc-7dadb197-1361-4bd6-97f4-fdc42a6f0500 5Gi RWO ocs-external-storagecluster-ceph-rbd 3m
NAME DESIREDSTATE CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/rbd-sample-placement-drpc primary Primary
VRG status from C1:
--------------------
$ oc describe volumereplicationgroup.ramendr.openshift.io/rbd-sample-placement-drpc -n multiple-appsets
Name: rbd-sample-placement-drpc
Namespace: multiple-appsets
Labels: <none>
Annotations: <none>
API Version: ramendr.openshift.io/v1alpha1
Kind: VolumeReplicationGroup
Metadata:
Creation Timestamp: 2023-10-17T07:50:36Z
Finalizers:
volumereplicationgroups.ramendr.openshift.io/vrg-protection
Generation: 2
Owner References:
API Version: work.open-cluster-management.io/v1
Kind: AppliedManifestWork
Name: 8bc873eb592f07d550f95e8aa51b95cda8e5f8355dc9283229799add471e0d8c-rbd-sample-placement-drpc-multiple-appsets-vrg-mw
UID: 7d12332f-6719-4977-9222-fb509efa9c87
Resource Version: 10852123
UID: d736ba25-eae5-4082-ad06-d5691096c271
Spec:
Action: Failover
Pvc Selector:
Match Labels:
Appname: busybox-rbd
Replication State: secondary
s3Profiles:
s3profile-sraghave-c1-oct-ocs-external-storagecluster
s3profile-sraghave-c2-oct-ocs-external-storagecluster
Sync:
Vol Sync:
Disabled: true
Status:
Conditions:
Last Transition Time: 2023-10-20T08:32:04Z
Message: VolumeReplicationGroup is progressing
Observed Generation: 2
Reason: Progressing
Status: False
Type: DataReady
Last Transition Time: 2023-10-20T08:32:04Z
Message: VolumeReplicationGroup is replicating
Observed Generation: 2
Reason: Replicating
Status: False
Type: DataProtected
Last Transition Time: 2023-10-17T07:50:36Z
Message: Restored cluster data
Observed Generation: 1
Reason: Restored
Status: True
Type: ClusterDataReady
Last Transition Time: 2023-10-20T08:32:04Z
Message: Cluster data of all PVs are protected
Observed Generation: 2
Reason: Uploaded
Status: True
Type: ClusterDataProtected
Kube Object Protection:
Last Update Time: 2023-10-20T08:35:08Z
Observed Generation: 2
Protected PV Cs:
Conditions:
Last Transition Time: 2023-10-20T08:32:04Z
Message: Secondary transition failed as PVC is potentially in use by a pod
Observed Generation: 2
Reason: Progressing
Status: False
Type: DataReady
Last Transition Time: 2023-10-17T07:50:36Z
Message: PVC in the VolumeReplicationGroup is ready for use
Observed Generation: 1
Reason: Replicating
Status: False
Type: DataProtected
Last Transition Time: 2023-10-17T07:50:40Z
Message: Done uploading PV/PVC cluster data to 2 of 2 S3 profile(s): [s3profile-sraghave-c1-oct-ocs-external-storagecluster s3profile-sraghave-c2-oct-ocs-external-storagecluster]
Observed Generation: 1
Reason: Uploaded
Status: True
Type: ClusterDataProtected
Name: busybox-rbd-pvc
Replication ID:
Id:
Resources:
Storage ID:
Id:
State: Primary
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning VrgUploadFailed 50m (x773 over 13h) controller_VolumeReplicationGroup (combined from similar events): failed to upload data of odrbucket-11bea101f6d8:multiple-appsets/rbd-sample-placement-drpc/v1alpha1.VolumeReplicationGroup/a, InternalError: We encountered an internal error. Please try again.
status code: 500, request id: lnycmay1-3cw5la-6yn, host id: lnycmay1-3cw5la-6yn
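The repeated VrgUploadFailed events above indicate the S3 endpoint backing the DR object bucket is returning 500 (InternalError). A possible way to sanity-check the MCG (NooBaa) service that serves the odrbucket on the managed cluster; this is only an illustrative sketch, and namespaces/resource names may differ per deployment:
$ oc get noobaa -n openshift-storage
$ oc get obc -A | grep odrbucket
$ oc get pods -n openshift-storage | grep noobaa
If the NooBaa pods or the odrbucket ObjectBucketClaim are unhealthy, that would explain why the VRG cannot upload cluster data to the S3 profile.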
Applied the workaround on the cluster; noted down all the observations below.
* Applying the WA and getting the resources cleaned up took, I think, around 54 hours.
* Failover and relocate succeeded:
$ oc get drpc rbd-sample-placement-drpc -n openshift-gitops -o wide
NAME AGE PREFERREDCLUSTER FAILOVERCLUSTER DESIREDSTATE CURRENTSTATE PROGRESSION START TIME DURATION PEER READY
rbd-sample-placement-drpc 3d21h sraghave-c1-oct sraghave-c2-oct Failover FailedOver Completed 2023-10-20T08:31:45Z 54h40m26.854129648s True
* Resources cleaned up from cluster C1 as expected:
sraghave:~$ oc get pod,pvc,vrg -n multiple-appsets
No resources found in multiple-appsets namespace.
* Resources found on C2 as expected
$ oc get pods,pvc,vrg -n multiple-appsets
NAME READY STATUS RESTARTS AGE
pod/busybox-rbd-5d6cc5f8b9-p7n54 1/1 Running 0 3d6h
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
persistentvolumeclaim/busybox-rbd-pvc Bound pvc-7dadb197-1361-4bd6-97f4-fdc42a6f0500 5Gi RWO ocs-external-storagecluster-ceph-rbd 3d6h
NAME DESIREDSTATE CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/rbd-sample-placement-drpc primary Primary
Note:
-----
I am quite unsure why I am seeing the status as Cleaning on the DRCluster:
sraghave:~$ oc get drcluster sraghave-c1-oct -o jsonpath='{.status.conditions[2].reason}{"\n"}'
Cleaning
sraghave:~$ oc get drcluster sraghave-c2-oct -o jsonpath='{.status.conditions[2].reason}{"\n"}'
Clean
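A side note on the jsonpath used above: indexing conditions positionally (`[2]`) is fragile because condition ordering is not guaranteed. A sketch that selects the condition by type instead (assuming the DRCluster cleanup condition type is named Clean; the second command lists all conditions and works regardless of the type name):
$ oc get drcluster sraghave-c1-oct -o jsonpath='{.status.conditions[?(@.type=="Clean")].reason}{"\n"}'
$ oc get drcluster sraghave-c1-oct -o jsonpath='{range .status.conditions[*]}{.type}{": "}{.reason}{"\n"}{end}'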
* Unable to delete appset apps after failover/relocate (multiple appset apps installed in the namespace, leftovers on C1, DRPCs got deleted):
$ oc get pods,pvc,vrg -n multiple-appsets1
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
persistentvolumeclaim/busybox-cephfs-pvc Terminating pvc-bc67ba04-3277-4127-b422-5929b5d97872 5Gi RWO ocs-external-storagecluster-cephfs 76m
persistentvolumeclaim/busybox-rbd-pvc Terminating pvc-10c7f9cc-da55-42ab-882a-558eccf864cb 5Gi RWO ocs-external-storagecluster-ceph-rbd 74m
persistentvolumeclaim/helloworld-pv-claim Terminating pvc-7610e25b-8ed4-4a56-9139-5ee644e2353e 10Gi RWO ocs-external-storagecluster-cephfs 75m
NAME DESIREDSTATE CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/cephfs-appset1-placement-drpc primary Primary
volumereplicationgroup.ramendr.openshift.io/hello-appsets1-placement-drpc primary Primary
volumereplicationgroup.ramendr.openshift.io/rbd-appset1-placement-drpc primary Primary
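To see what is holding these resources, one option is to inspect the finalizers on the stuck PVCs/VRGs and the AppliedManifestWork that owns the VRG (as seen in the ownerReferences of the VRG described earlier). The commands below are a sketch reusing the resource names from the output above; the grep pattern is illustrative:
$ oc get pvc busybox-rbd-pvc -n multiple-appsets1 -o jsonpath='{.metadata.finalizers}{"\n"}'
$ oc get vrg rbd-appset1-placement-drpc -n multiple-appsets1 -o jsonpath='{.metadata.finalizers}{"\n"}'
$ oc get vrg rbd-appset1-placement-drpc -n multiple-appsets1 -o jsonpath='{range .metadata.ownerReferences[*]}{.kind}{"/"}{.name}{"\n"}{end}'
$ oc get appliedmanifestwork | grep rbd-appset1
A PVC stuck in Terminating with a VRG protection finalizer, plus a VRG still owned by an AppliedManifestWork from the old hub, would match the leftover behaviour reported here.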
Live cluster available to debug
We are concerned about the WA mentioned in comment 10, which might not be applicable when the customer loses access to the old cluster. We need an alternative WA to move forward with hub recovery cases on the active site when an entire zone is down. @bmekhiss

This requires at present that we move to the pull model for gitops from ACM, rather than the current push model. In the pull model, the managed cluster has the ArgoCD Application resource created using a ManifestWork, based on a PlacementDecision. So post hub recovery the manifest work operator would garbage collect work that was deployed by the older hub (as it does for Subscription based applications at present), ensuring successful cleanup of the failed cluster eventually. The gitops model is described here: https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes/2.9/html/gitops/gitops-overview#gitops-push-pull

My two cents on this:
1. We should retest with the latest changes which were made for hub recovery.
2. Even if the WA is not good for the customer, can it at least help QE to progress with happy path validation of the feature? That way we can save some time and keep the feature in the release while working on improving the WA or providing a proper fix in parallel.

I am not sure I understand the ask here: https://bugzilla.redhat.com/show_bug.cgi?id=2243804#c13
This is an ACM hub recovery issue. ODF has nothing to do with it. The solution for it is mentioned by Shyam in comment 14.

Now, thinking a little bit more about the problem, the workaround provided in comment 10 is simply a workaround that will work if customers still have access to the old active hub. So before recovering a hub, it's important for users to ensure that the current active hub doesn't have network access to the managed clusters. Keep in mind that the workaround in comment 10 is a basic solution and only effective when access to the failed active hub is still possible.

Again, I don't understand the issue. When the hub cluster fails, customers need to ensure that that cluster no longer has network access to the managed clusters.

(In reply to Mudit Agarwal from comment #15)
> My two cents on this:
> 1. We should retest with the latest changes which were made for hub recovery.
> 2. Even if the WA is not good for the customer, can it at least help QE to progress with happy path validation of the feature?

Yes, we are bringing up 4.15 clusters for happy path testing now.

(In reply to Benamar Mekhissi from comment #16)
> I am not sure I understand the ask here: https://bugzilla.redhat.com/show_bug.cgi?id=2243804#c13
> This is an ACM hub recovery issue. ODF has nothing to do with it. The solution for it is mentioned by Shyam in comment 14.
>
> Now, thinking a little bit more about the problem, the workaround provided in comment 10 is simply a workaround that will work if customers still have access to the old active hub. So before recovering a hub, it's important for users to ensure that the current active hub doesn't have network access to the managed clusters. Keep in mind that the workaround in comment 10 is a basic solution and only effective when access to the failed active hub is still possible.
>
> Again, I don't understand the issue. When the hub cluster fails, customers need to ensure that that cluster no longer has network access to the managed clusters.

Hi Benamar, I have scheduled a meeting at 7.30 PM IST on 8th Jan to discuss this BZ with you. I hope it will help us understand the issue better.
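For reference, a rough illustration of the pull model discussed above: the Argo CD Application is wrapped in a ManifestWork in the managed cluster's namespace on the hub, so the work agent on the managed cluster creates it locally and can garbage-collect it after hub recovery. This is only a hand-written sketch; the ManifestWork name, repo URL, and paths are hypothetical, and in practice these resources are generated by the ACM gitops addon rather than authored manually:
$ cat > /tmp/pull-model-application-mw.yaml <<'EOF'
apiVersion: work.open-cluster-management.io/v1
kind: ManifestWork
metadata:
  name: busybox-rbd-application          # hypothetical name
  namespace: sraghave-c2-oct              # managed cluster namespace on the hub
spec:
  workload:
    manifests:
    - apiVersion: argoproj.io/v1alpha1
      kind: Application
      metadata:
        name: busybox-rbd                 # hypothetical Application
        namespace: openshift-gitops       # Argo CD namespace on the managed cluster
      spec:
        project: default
        source:
          repoURL: https://github.com/example/dr-workloads.git   # hypothetical repo
          path: busybox-rbd
          targetRevision: main
        destination:
          server: https://kubernetes.default.svc   # local (managed) cluster
          namespace: multiple-appsets
        syncPolicy:
          automated: {}
EOF
Because the Application is delivered through a ManifestWork, cleanup of the failed cluster no longer depends on the old hub pushing a delete, which is the gap behind the leftovers reported in this bug.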
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days