Bug 2243804
| Summary: | [MDR] : After zone failure and hub recovery, on failover applications DRPC reporting 'Progression:Completed' when cluster has leftovers of PVC, PV, VRG | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | akarsha <akrai> |
| Component: | odf-dr | Assignee: | Shyamsundar <srangana> |
| odf-dr sub component: | ramen | QA Contact: | akarsha <akrai> |
| Status: | CLOSED WONTFIX | Docs Contact: | |
| Severity: | high | | |
| Priority: | unspecified | CC: | amagrawa, bmekhiss, hnallurv, kramdoss, kseeger, muagarwa, sraghave |
| Version: | 4.14 | | |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2024-09-09 10:11:59 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
akarsha
2023-10-13 07:47:36 UTC
Tested versions:
----------------
OCP - 4.14.0-0.nightly-2023-10-08-220853
ODF - 4.14.0-146.stable
ACM - 2.9.0-180
Post hub recovery, I tried a failover of one subscription app and one appset app.
* Failover of the subscription app took almost 24 minutes; it was stuck in the cleaning-up phase for a long time, but cleanup eventually completed and the failover succeeded.
$ oc get drpc -n cephfs1 -o wide
NAME AGE PREFERREDCLUSTER FAILOVERCLUSTER DESIREDSTATE CURRENTSTATE PROGRESSION START TIME DURATION PEER READY
cephfs1-placement-3-drpc 16h sraghave-c1-oct sraghave-c2-oct Failover FailedOver Completed 2023-10-19T19:00:01Z 24m4.988864425s True
* Failover of the appset app is stuck in the Cleaning Up phase, and it's been almost 40 minutes now.
sraghave:~$ oc get drpc rbd-sample-placement-drpc -n openshift-gitops -o wide
NAME AGE PREFERREDCLUSTER FAILOVERCLUSTER DESIREDSTATE CURRENTSTATE PROGRESSION START TIME DURATION PEER READY
rbd-sample-placement-drpc 16h sraghave-c1-oct sraghave-c2-oct Failover FailedOver Cleaning Up 2023-10-20T08:31:45Z False
sraghave:~$
sraghave:~$ date --utc
Fri Oct 20 09:11:02 AM UTC 2023
Leftovers from C1:
-------------------
$ oc get pods,pvc,vrg -n multiple-appsets
NAME READY STATUS RESTARTS AGE
pod/busybox-rbd-5d6cc5f8b9-lrltp 0/1 Pending 0 42m
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
persistentvolumeclaim/busybox-rbd-pvc Terminating pvc-7dadb197-1361-4bd6-97f4-fdc42a6f0500 5Gi RWO ocs-external-storagecluster-ceph-rbd 3d18h
NAME DESIREDSTATE CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/rbd-sample-placement-drpc secondary Primary
Output from C2:
--------------
$ oc get pods,pvc,vrg -n multiple-appsets
NAME READY STATUS RESTARTS AGE
pod/busybox-rbd-5d6cc5f8b9-p7n54 1/1 Running 0 52s
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
persistentvolumeclaim/busybox-rbd-pvc Bound pvc-7dadb197-1361-4bd6-97f4-fdc42a6f0500 5Gi RWO ocs-external-storagecluster-ceph-rbd 3m
NAME DESIREDSTATE CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/rbd-sample-placement-drpc primary Primary
VRG status from C1:
--------------------
$ oc describe volumereplicationgroup.ramendr.openshift.io/rbd-sample-placement-drpc -n multiple-appsets
Name: rbd-sample-placement-drpc
Namespace: multiple-appsets
Labels: <none>
Annotations: <none>
API Version: ramendr.openshift.io/v1alpha1
Kind: VolumeReplicationGroup
Metadata:
Creation Timestamp: 2023-10-17T07:50:36Z
Finalizers:
volumereplicationgroups.ramendr.openshift.io/vrg-protection
Generation: 2
Owner References:
API Version: work.open-cluster-management.io/v1
Kind: AppliedManifestWork
Name: 8bc873eb592f07d550f95e8aa51b95cda8e5f8355dc9283229799add471e0d8c-rbd-sample-placement-drpc-multiple-appsets-vrg-mw
UID: 7d12332f-6719-4977-9222-fb509efa9c87
Resource Version: 10852123
UID: d736ba25-eae5-4082-ad06-d5691096c271
Spec:
Action: Failover
Pvc Selector:
Match Labels:
Appname: busybox-rbd
Replication State: secondary
s3Profiles:
s3profile-sraghave-c1-oct-ocs-external-storagecluster
s3profile-sraghave-c2-oct-ocs-external-storagecluster
Sync:
Vol Sync:
Disabled: true
Status:
Conditions:
Last Transition Time: 2023-10-20T08:32:04Z
Message: VolumeReplicationGroup is progressing
Observed Generation: 2
Reason: Progressing
Status: False
Type: DataReady
Last Transition Time: 2023-10-20T08:32:04Z
Message: VolumeReplicationGroup is replicating
Observed Generation: 2
Reason: Replicating
Status: False
Type: DataProtected
Last Transition Time: 2023-10-17T07:50:36Z
Message: Restored cluster data
Observed Generation: 1
Reason: Restored
Status: True
Type: ClusterDataReady
Last Transition Time: 2023-10-20T08:32:04Z
Message: Cluster data of all PVs are protected
Observed Generation: 2
Reason: Uploaded
Status: True
Type: ClusterDataProtected
Kube Object Protection:
Last Update Time: 2023-10-20T08:35:08Z
Observed Generation: 2
Protected PV Cs:
Conditions:
Last Transition Time: 2023-10-20T08:32:04Z
Message: Secondary transition failed as PVC is potentially in use by a pod
Observed Generation: 2
Reason: Progressing
Status: False
Type: DataReady
Last Transition Time: 2023-10-17T07:50:36Z
Message: PVC in the VolumeReplicationGroup is ready for use
Observed Generation: 1
Reason: Replicating
Status: False
Type: DataProtected
Last Transition Time: 2023-10-17T07:50:40Z
Message: Done uploading PV/PVC cluster data to 2 of 2 S3 profile(s): [s3profile-sraghave-c1-oct-ocs-external-storagecluster s3profile-sraghave-c2-oct-ocs-external-storagecluster]
Observed Generation: 1
Reason: Uploaded
Status: True
Type: ClusterDataProtected
Name: busybox-rbd-pvc
Replication ID:
Id:
Resources:
Storage ID:
Id:
State: Primary
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning VrgUploadFailed 50m (x773 over 13h) controller_VolumeReplicationGroup (combined from similar events): failed to upload data of odrbucket-11bea101f6d8:multiple-appsets/rbd-sample-placement-drpc/v1alpha1.VolumeReplicationGroup/a, InternalError: We encountered an internal error. Please try again.
status code: 500, request id: lnycmay1-3cw5la-6yn, host id: lnycmay1-3cw5la-6yn
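The repeated VrgUploadFailed events above indicate the S3 endpoint backing the DR object bucket is returning 500 (InternalError). A possible way to sanity-check the MCG (NooBaa) service that serves the odrbucket on the managed cluster; this is only an illustrative sketch, and namespaces/resource names may differ per deployment:
$ oc get noobaa -n openshift-storage
$ oc get obc -A | grep odrbucket
$ oc get pods -n openshift-storage | grep noobaa
If the NooBaa pods or the odrbucket ObjectBucketClaim are unhealthy, that would explain why the VRG cannot upload cluster data to the S3 profile.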
Applied the workaround on the cluster; noted down all the observations below.
* Applying the WA and getting the resources cleaned up took, I think, around 54 hours.
* Failover and relocate succeeded:
$ oc get drpc rbd-sample-placement-drpc -n openshift-gitops -o wide
NAME AGE PREFERREDCLUSTER FAILOVERCLUSTER DESIREDSTATE CURRENTSTATE PROGRESSION START TIME DURATION PEER READY
rbd-sample-placement-drpc 3d21h sraghave-c1-oct sraghave-c2-oct Failover FailedOver Completed 2023-10-20T08:31:45Z 54h40m26.854129648s True
* Resources cleaned up from cluster C1 as expected:
sraghave:~$ oc get pod,pvc,vrg -n multiple-appsets
No resources found in multiple-appsets namespace.
* Resources found on C2 as expected
$ oc get pods,pvc,vrg -n multiple-appsets
NAME READY STATUS RESTARTS AGE
pod/busybox-rbd-5d6cc5f8b9-p7n54 1/1 Running 0 3d6h
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
persistentvolumeclaim/busybox-rbd-pvc Bound pvc-7dadb197-1361-4bd6-97f4-fdc42a6f0500 5Gi RWO ocs-external-storagecluster-ceph-rbd 3d6h
NAME DESIREDSTATE CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/rbd-sample-placement-drpc primary Primary
Note:
-----
I am quite unsure why I am seeing the status as Cleaning on the DRCluster:
sraghave:~$ oc get drcluster sraghave-c1-oct -o jsonpath='{.status.conditions[2].reason}{"\n"}'
Cleaning
sraghave:~$ oc get drcluster sraghave-c2-oct -o jsonpath='{.status.conditions[2].reason}{"\n"}'
Clean
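A side note on the jsonpath used above: indexing conditions positionally (`[2]`) is fragile because condition ordering is not guaranteed. A sketch that selects the condition by type instead (assuming the DRCluster cleanup condition type is named Clean; the second command lists all conditions and works regardless of the type name):
$ oc get drcluster sraghave-c1-oct -o jsonpath='{.status.conditions[?(@.type=="Clean")].reason}{"\n"}'
$ oc get drcluster sraghave-c1-oct -o jsonpath='{range .status.conditions[*]}{.type}{": "}{.reason}{"\n"}{end}'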
* Unable to delete appset apps after failover/relocate (multiple appset apps installed in the namespace, leftovers on C1, DRPCs got deleted):
$ oc get pods,pvc,vrg -n multiple-appsets1
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
persistentvolumeclaim/busybox-cephfs-pvc Terminating pvc-bc67ba04-3277-4127-b422-5929b5d97872 5Gi RWO ocs-external-storagecluster-cephfs 76m
persistentvolumeclaim/busybox-rbd-pvc Terminating pvc-10c7f9cc-da55-42ab-882a-558eccf864cb 5Gi RWO ocs-external-storagecluster-ceph-rbd 74m
persistentvolumeclaim/helloworld-pv-claim Terminating pvc-7610e25b-8ed4-4a56-9139-5ee644e2353e 10Gi RWO ocs-external-storagecluster-cephfs 75m
NAME DESIREDSTATE CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/cephfs-appset1-placement-drpc primary Primary
volumereplicationgroup.ramendr.openshift.io/hello-appsets1-placement-drpc primary Primary
volumereplicationgroup.ramendr.openshift.io/rbd-appset1-placement-drpc primary Primary
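To see what is holding these resources, one option is to inspect the finalizers on the stuck PVCs/VRGs and the AppliedManifestWork that owns the VRG (as seen in the ownerReferences of the VRG described earlier). The commands below are a sketch reusing the resource names from the output above; the grep pattern is illustrative:
$ oc get pvc busybox-rbd-pvc -n multiple-appsets1 -o jsonpath='{.metadata.finalizers}{"\n"}'
$ oc get vrg rbd-appset1-placement-drpc -n multiple-appsets1 -o jsonpath='{.metadata.finalizers}{"\n"}'
$ oc get vrg rbd-appset1-placement-drpc -n multiple-appsets1 -o jsonpath='{range .metadata.ownerReferences[*]}{.kind}{"/"}{.name}{"\n"}{end}'
$ oc get appliedmanifestwork | grep rbd-appset1
A PVC stuck in Terminating with a VRG protection finalizer, plus a VRG still owned by an AppliedManifestWork from the old hub, would match the leftover behaviour reported here.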
Live cluster available to debug
We are concerned about the WA mentioned in comment 10, which might not be applicable when the customer loses access to the old cluster. We need an alternative WA to move forward with hub recovery cases on the active site when an entire zone is down. @bmekhiss

This requires at present that we move to the pull model for gitops from ACM, rather than the current push model. In the pull model, the managed cluster has the ArgoCD Application resource created using a ManifestWork, based on a PlacementDecision. So post hub recovery the manifest work operator would garbage collect work that was deployed by the older hub (as it does for Subscription based applications at present), ensuring successful cleanup of the failed cluster eventually. The gitops model is described here: https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes/2.9/html/gitops/gitops-overview#gitops-push-pull

My two cents on this:
1. We should retest with the latest changes which were made for hub recovery.
2. Even if the WA is not good for the customer, can it at least help QE to progress with happy path validation of the feature? That way we can save some time and keep the feature in the release while working on improving the WA or providing a proper fix in parallel.

I am not sure I understand the ask here: https://bugzilla.redhat.com/show_bug.cgi?id=2243804#c13
This is an ACM hub recovery issue. ODF has nothing to do with it. The solution for it is mentioned by Shyam in comment 14.

Now, thinking a little bit more about the problem, the workaround provided in comment 10 is simply a workaround that will work if customers still have access to the old active hub. So before recovering a hub, it's important for users to ensure that the current active hub doesn't have network access to the managed clusters. Keep in mind that the workaround in comment 10 is a basic solution and only effective when access to the failed active hub is still possible.

Again, I don't understand the issue. When the hub cluster fails, customers need to ensure that that cluster no longer has network access to the managed clusters.

(In reply to Mudit Agarwal from comment #15)
> My two cents on this:
> 1. We should retest with the latest changes which were made for hub recovery.
> 2. Even if the WA is not good for the customer, can it at least help QE to progress with happy path validation of the feature?

Yes, we are bringing up 4.15 clusters for happy path testing now.

(In reply to Benamar Mekhissi from comment #16)
> I am not sure I understand the ask here: https://bugzilla.redhat.com/show_bug.cgi?id=2243804#c13
> This is an ACM hub recovery issue. ODF has nothing to do with it. The solution for it is mentioned by Shyam in comment 14.
>
> Now, thinking a little bit more about the problem, the workaround provided in comment 10 is simply a workaround that will work if customers still have access to the old active hub. So before recovering a hub, it's important for users to ensure that the current active hub doesn't have network access to the managed clusters. Keep in mind that the workaround in comment 10 is a basic solution and only effective when access to the failed active hub is still possible.
>
> Again, I don't understand the issue. When the hub cluster fails, customers need to ensure that that cluster no longer has network access to the managed clusters.

Hi Benamar, I have scheduled a meeting at 7.30 PM IST on 8th Jan to discuss this BZ with you. I hope it will help us understand the issue better.
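For reference, a rough illustration of the pull model discussed above: the Argo CD Application is wrapped in a ManifestWork in the managed cluster's namespace on the hub, so the work agent on the managed cluster creates it locally and can garbage-collect it after hub recovery. This is only a hand-written sketch; the ManifestWork name, repo URL, and paths are hypothetical, and in practice these resources are generated by the ACM gitops addon rather than authored manually:
$ cat > /tmp/pull-model-application-mw.yaml <<'EOF'
apiVersion: work.open-cluster-management.io/v1
kind: ManifestWork
metadata:
  name: busybox-rbd-application          # hypothetical name
  namespace: sraghave-c2-oct              # managed cluster namespace on the hub
spec:
  workload:
    manifests:
    - apiVersion: argoproj.io/v1alpha1
      kind: Application
      metadata:
        name: busybox-rbd                 # hypothetical Application
        namespace: openshift-gitops       # Argo CD namespace on the managed cluster
      spec:
        project: default
        source:
          repoURL: https://github.com/example/dr-workloads.git   # hypothetical repo
          path: busybox-rbd
          targetRevision: main
        destination:
          server: https://kubernetes.default.svc   # local (managed) cluster
          namespace: multiple-appsets
        syncPolicy:
          automated: {}
EOF
Because the Application is delivered through a ManifestWork, cleanup of the failed cluster no longer depends on the old hub pushing a delete, which is the gap behind the leftovers reported in this bug.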
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days