Bug 2268594
| Summary: | [RDR] [Hub recovery] [Co-situated] Cleanup and data sync for appset workloads remain stuck after older primary is recovered post failover | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Aman Agrawal <amagrawa> |
| Component: | odf-dr | Assignee: | Shyamsundar <srangana> |
| odf-dr sub component: | ramen | QA Contact: | krishnaram Karthick <kramdoss> |
| Status: | CLOSED MIGRATED | Docs Contact: | |
| Severity: | high | ||
| Priority: | unspecified | CC: | hnallurv, kbg, kramdoss, kseeger, muagarwa, srangana |
| Version: | 4.15 | ||
| Target Milestone: | --- | ||
| Target Release: | ODF 4.16.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | Known Issue |
| Doc Text: |
.Cleanup and data synchronization for ApplicationSet workloads remain stuck after older primary managed cluster is recovered post the failover
ApplicationSet based workload deployments to the managed clusters are not garbage collected when the hub cluster fails and is recovered to a standby hub cluster while the workload has been failed over to a surviving managed cluster. The cluster that the workload was failed over from rejoins the new recovered standby hub.
ApplicationSets that are disaster recovery (DR) protected with a regional DRPolicy hence start firing the VolumeSynchronizationDelay alert. Further, such DR protected workloads cannot be failed over to the peer cluster or relocated to the peer cluster as data is out of sync between the two clusters.
For a workaround, see the Troubleshooting section for Regional-DR in Configuring OpenShift Data Foundation Disaster Recovery for OpenShift Workloads.
|
| Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2024-05-09 12:55:44 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 2246375 | ||
Description
Aman Agrawal
2024-03-08 15:02:38 UTC
ApplicationSet based workload deployments to managed clusters are not garbage collected in cases when the hub cluster fails and is recovered to a standby hub cluster. This is because ArgoCD does not have a means to track older clusters where the workload was placed and garbage collect the same.
NOTE: For Subscriptions this works after the AppliedManifestWork eviction timeout (1h currently with ACM 4.10/24h ideally), which garbage collects the Subscription that was applied to the managed cluster by the original hub. For ApplicationSets the resolution is to move to the ACM pull model of ApplicationSet deployment rather than the current push model, as the pull model uses ManifestWork (and hence AppliedManifestWork) to create the Application on the managed cluster, and so the Application would automatically be garbage collected in hub recovery scenarios.
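As an illustration of that pull-model direction (not part of the workaround below), the ApplicationSet template would carry the ACM pull-model label and annotations so that the generated Application is delivered via ManifestWork. The keys shown here are a hedged sketch based on the ACM GitOps pull-model documentation and should be treated as assumptions to verify against the ACM version in use:

# Hedged sketch: ApplicationSet spec.template metadata for the ACM pull model
# (verify these label/annotation keys against the ACM documentation for your version)
spec:
  template:
    metadata:
      labels:
        apps.open-cluster-management.io/pull-to-ocm-managed-cluster: "true"  # deliver the generated Application via ManifestWork
      annotations:
        apps.open-cluster-management.io/ocm-managed-cluster: "{{name}}"      # target managed cluster name from the cluster generator
        argocd.argoproj.io/skip-reconcile: "true"                            # hub Argo CD skips reconciling; the managed cluster applies it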
Ramen DRPC does not progress to the Cleaned state as the Application resources are still active on the recovered managed cluster; primarily, the PVCs are found to be still in use by the respective pods, or not deleted.
The above can be seen as DR protected applications reporting sync time alerts, or DRPCs not reporting a status.lastGroupSyncTime. Also, for RBD backed workloads, DRPCs would remain stuck in the "Cleaning Up" progression and not report the PeerReady condition as true.
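These symptoms can be checked directly on the hub; a minimal sketch using the DRPC status fields named above (namespaces and names are placeholders, and the progression field name is assumed from the wording in this report):

# List lastGroupSyncTime and progression for all DRPCs on the hub
$ oc get drpc -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.lastGroupSyncTime}{"\t"}{.status.progression}{"\n"}{end}'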
The current workaround for this issue is as follows:
- Instruct ArgoCD/openshift-gitops to place the workload to the recovered cluster
- This is to ensure ArgoCD assumes ownership of the orphaned workload resources on the failed cluster that was recovered
- Once ArgoCD places the workload to the recovered cluster, remove the placement of the workload from that cluster such that ArgoCD can garbage collect the workload
- Wait and ensure lastGroupSyncTime alerts are no longer firing for these workloads
Steps to do the same:
1) Determine the Placement that is in use by the ArgoCD ApplicationSet resource on the hub cluster in the openshift-gitops namespace
- Inspect the placement label value for the ApplicationSet in this field: spec.generators.clusterDecisionResource.labelSelector.matchLabels; this is the name of the Placement resource (<placement-name>)
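For example, the matchLabels map can be printed straight from the ApplicationSet; a sketch assuming the first generator holds the clusterDecisionResource and using a placeholder ApplicationSet name:

# Print the Placement selector labels of the ApplicationSet's first generator
$ oc get applicationset -n openshift-gitops <applicationset-name> -o jsonpath='{.spec.generators[0].clusterDecisionResource.labelSelector.matchLabels}{"\n"}'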
2) Ensure that there exists a PlacementDecision for the ApplicationSet referenced Placement
- $ oc get placementdecision -n openshift-gitops --selector cluster.open-cluster-management.io/placement=<placement-name>
- This should result in a single PlacementDecision that places the workload to the currently desired failover cluster
3) Create a new PlacementDecision for the ApplicationSet pointing to the cluster where it should be cleaned up, e.g.:
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: PlacementDecision
metadata:
  labels:
    cluster.open-cluster-management.io/decision-group-index: "1" # Typically one higher than the same value in the existing PlacementDecision determined at step (2)
    cluster.open-cluster-management.io/decision-group-name: ""
    cluster.open-cluster-management.io/placement: <placement-name>
  name: <placement-name>-decision-<n> # <n> should be one higher than the existing PlacementDecision as determined in step (2)
  namespace: openshift-gitops
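The manifest above can be saved to a file (the file name here is only an example) and created on the hub, then listed to confirm it exists:

$ oc apply -f placementdecision-cleanup.yaml
$ oc get placementdecision -n openshift-gitops <placement-name>-decision-<n>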
4) Update the created PlacementDecision with a status subresource:
decision-status.yaml:
status:
  decisions:
    - clusterName: <managedcluster-name-to-clean-up> # This would be the cluster from where the workload was failed over, NOT the current workload cluster
      reason: FailoverCleanup
$ oc patch placementdecision -n openshift-gitops <placement-name>-decision-<n> --patch-file=decision-status.yaml --subresource=status --type=merge
5) Watch and ensure the Application resource for the ApplicationSet has been placed on the desired cluster
$ oc get application -n openshift-gitops <applicationset-name>-<managedcluster-name-to-clean-up>
In the output, SYNC STATUS should be Synced and HEALTH STATUS should be Healthy
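The same two fields can be read from the Application status; a sketch assuming the standard Argo CD Application status fields:

# Print the sync and health status of the placed Application
$ oc get application -n openshift-gitops <applicationset-name>-<managedcluster-name-to-clean-up> -o jsonpath='{.status.sync.status}{" "}{.status.health.status}{"\n"}'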
6) Delete the PlacementDecision that was created in step (3), such that ArgoCD can garbage collect the workload resources on the <managedcluster-name-to-clean-up>
$ oc delete placementdecision -n openshift-gitops <placement-name>-decision-<n>
7) Ensure that after the current replication schedule duration (as set in the DRPolicy) the alerts for lastGroupSyncTime are no longer present
- Further, for RBD backed workloads, the DRPC can be inspected to ensure that the PeerReady condition is reported as true
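A sketch of checking that condition on the DRPC (namespace and name are placeholders):

# Print the status of the DRPC PeerReady condition; the expected value is True
$ oc get drpc -n <drpc-namespace> <drpc-name> -o jsonpath='{.status.conditions[?(@.type=="PeerReady")].status}{"\n"}'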
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days