Bug 2268594
| Summary: | [RDR] [Hub recovery] [Co-situated] Cleanup and data sync for appset workloads remain stuck after older primary is recovered post failover | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Aman Agrawal <amagrawa> |
| Component: | odf-dr | Assignee: | Shyamsundar <srangana> |
| odf-dr sub component: | ramen | QA Contact: | krishnaram Karthick <kramdoss> |
| Status: | CLOSED MIGRATED | Docs Contact: | |
| Severity: | high | ||
| Priority: | unspecified | CC: | hnallurv, kbg, kramdoss, kseeger, muagarwa, srangana |
| Version: | 4.15 | ||
| Target Milestone: | --- | ||
| Target Release: | ODF 4.16.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | Known Issue |
| Doc Text: |
.Cleanup and data synchronization for ApplicationSet workloads remain stuck after older primary managed cluster is recovered post the failover
ApplicationSet based workload deployments to the managed clusters are not garbage collected when the hub cluster fails and is recovered to a standby hub cluster while the workload has been failed over to a surviving managed cluster. The cluster that the workload was failed over from rejoins the new recovered standby hub.
ApplicationSets that are disaster recovery (DR) protected with a regional DRPolicy hence start firing the VolumeSynchronizationDelay alert. Further, such DR protected workloads cannot be failed over to the peer cluster or relocated to the peer cluster as data is out of sync between the two clusters.
For a workaround, see the Troubleshooting section for Regional-DR in Configuring OpenShift Data Foundation Disaster Recovery for OpenShift Workloads.
|
| Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2024-05-09 12:55:44 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 2246375 | ||
Description
Aman Agrawal
2024-03-08 15:02:38 UTC
ApplicationSet based workload deployments to managed clusters are not garbage collected in cases when the hub cluster fails and is recovered to a standby hub cluster. This is because ArgoCD does not have a means to track older clusters where the workload was placed and garbage collect the same.
NOTE: For Subscriptions this works after the AppliedManifestWork eviction timeout (1h currently with ACM 4.10/24h ideally), which garbage collects the Subscription that was applied to the managed cluster by the original hub. For ApplicationSets the resolution is to move to the ACM pull model of ApplicationSet deployment rather than the current push model, as the pull model uses ManifestWork (and hence AppliedManifestWork) to create the Application on the managed cluster, and so the Application would automatically be garbage collected in hub recovery scenarios.
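As an illustration of that pull-model direction (not part of the workaround below), the ApplicationSet template would carry the ACM pull-model label and annotations so that the generated Application is delivered via ManifestWork. The keys shown here are a hedged sketch based on the ACM GitOps pull-model documentation and should be treated as assumptions to verify against the ACM version in use:

# Hedged sketch: ApplicationSet spec.template metadata for the ACM pull model
# (verify these label/annotation keys against the ACM documentation for your version)
spec:
  template:
    metadata:
      labels:
        apps.open-cluster-management.io/pull-to-ocm-managed-cluster: "true"  # deliver the generated Application via ManifestWork
      annotations:
        apps.open-cluster-management.io/ocm-managed-cluster: "{{name}}"      # target managed cluster name from the cluster generator
        argocd.argoproj.io/skip-reconcile: "true"                            # hub Argo CD skips reconciling; the managed cluster applies it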
Ramen DRPC does not progress to the Cleaned state as the Application resources are still active on the recovered managed cluster; primarily, the PVCs are found to be still in use by the respective pods, or not deleted.
The above can be seen as DR protected applications reporting sync time alerts, or DRPCs not reporting a status.lastGroupSyncTime. Also, for RBD backed workloads, DRPCs would remain stuck in the "Cleaning Up" progression and not report the PeerReady condition as true.
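These symptoms can be checked directly on the hub; a minimal sketch using the DRPC status fields named above (namespaces and names are placeholders, and the progression field name is assumed from the wording in this report):

# List lastGroupSyncTime and progression for all DRPCs on the hub
$ oc get drpc -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.lastGroupSyncTime}{"\t"}{.status.progression}{"\n"}{end}'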
The current workaround for this issue is as follows:
- Instruct ArgoCD/openshift-gitops to place the workload to the recovered cluster
- This is to ensure ArgoCD assumes ownership of the orphaned workload resources on the failed cluster that was recovered
- Once ArgoCD places the workload to the recovered cluster, remove the placement of the workload from that cluster such that ArgoCD can garbage collect the workload
- Wait and ensure lastGroupSyncTime alerts are no longer firing for these workloads
Steps to do the same:
1) Determine the Placement that is in use by the ArgoCD ApplicationSet resource on the hub cluster in the openshift-gitops namespace
- Inspect the placement label value for the ApplicationSet in this field: spec.generators.clusterDecisionResource.labelSelector.matchLabels; this is the name of the Placement resource (<placement-name>)
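For example, the matchLabels map can be printed straight from the ApplicationSet; a sketch assuming the first generator holds the clusterDecisionResource and using a placeholder ApplicationSet name:

# Print the Placement selector labels of the ApplicationSet's first generator
$ oc get applicationset -n openshift-gitops <applicationset-name> -o jsonpath='{.spec.generators[0].clusterDecisionResource.labelSelector.matchLabels}{"\n"}'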
2) Ensure that there exists a PlacementDecision for the ApplicationSet referenced Placement
- $ oc get placementdecision -n openshift-gitops --selector cluster.open-cluster-management.io/placement=<placement-name>
- This should result in a single PlacementDecision that places the workload to the currently desired failover cluster
3) Create a new PlacementDecision for the ApplicationSet pointing to the cluster where it should be cleaned up, e.g.:
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: PlacementDecision
metadata:
  labels:
    cluster.open-cluster-management.io/decision-group-index: "1" # Typically one higher than the same value in the existing PlacementDecision determined at step (2)
    cluster.open-cluster-management.io/decision-group-name: ""
    cluster.open-cluster-management.io/placement: <placement-name>
  name: <placement-name>-decision-<n> # <n> should be one higher than the existing PlacementDecision as determined in step (2)
  namespace: openshift-gitops
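The manifest above can be saved to a file (the file name here is only an example) and created on the hub, then listed to confirm it exists:

$ oc apply -f placementdecision-cleanup.yaml
$ oc get placementdecision -n openshift-gitops <placement-name>-decision-<n>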
4) Update the created PlacementDecision with a status subresource:
decision-status.yaml:
status:
  decisions:
    - clusterName: <managedcluster-name-to-clean-up> # This would be the cluster from where the workload was failed over, NOT the current workload cluster
      reason: FailoverCleanup
$ oc patch placementdecision -n openshift-gitops <placement-name>-decision-<n> --patch-file=decision-status.yaml --subresource=status --type=merge
5) Watch and ensure the Application resource for the ApplicationSet has been placed on the desired cluster
$ oc get application -n openshift-gitops <applicationset-name>-<managedcluster-name-to-clean-up>
In the output, SYNC STATUS should be Synced and HEALTH STATUS should be Healthy
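The same two fields can be read from the Application status; a sketch assuming the standard Argo CD Application status fields:

# Print the sync and health status of the placed Application
$ oc get application -n openshift-gitops <applicationset-name>-<managedcluster-name-to-clean-up> -o jsonpath='{.status.sync.status}{" "}{.status.health.status}{"\n"}'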
6) Delete the PlacementDecision that was created in step (3), such that ArgoCD can garbage collect the workload resources on the <managedcluster-name-to-clean-up>
$ oc delete placementdecision -n openshift-gitops <placement-name>-decision-<n>
7) Ensure that after the current replication schedule duration (as set in the DRPolicy) the alerts for lastGroupSyncTime are no longer present
- Further, for RBD backed workloads, the DRPC can be inspected to ensure that the PeerReady condition is reported as true
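A sketch of checking that condition on the DRPC (namespace and name are placeholders):

# Print the status of the DRPC PeerReady condition; the expected value is True
$ oc get drpc -n <drpc-namespace> <drpc-name> -o jsonpath='{.status.conditions[?(@.type=="PeerReady")].status}{"\n"}'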
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days