Description of problem (please be as detailed as possible and provide log snippets):

This issue was found because, for MDR, ramen adds "kind: PersistentVolumeClaim" to the PV claimRef when the PV is created after failover, even though the PV stored as metadata in the object bucket (from when the VM was first created and a DRPolicy applied) does not include "kind: PersistentVolumeClaim". When the generation changes, this error appears in the associated VRG status field:

  - lastTransitionTime: "2024-02-01T20:37:44Z"
    message: 'Failed to restore PVs: failed to restore ClusterData for VolRep (failed
      to restore PVs and PVCs using profile list ([s3profile-perf8-ocs-storagecluster]):
      failed to restore all []v1.PersistentVolume. Total/Restored 1/0)'
    observedGeneration: 2
    reason: Error
    status: "False"
    type: ClusterDataReady

Version of all relevant components (if applicable):
CNV 4.14.3
OCP 4.14.7
ODF 4.15 (build 129 pre-release)
ACM 2.9.2

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?
Yes, delete "kind: PersistentVolumeClaim" from the claimRef definition of the PV after failover.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
4

Is this issue reproducible?
Yes

Steps to Reproduce:
1. Using ACM, create a VM using https://github.com/nirs/ocm-kubevirt-samples (branch: odr-metro, path: odr-vm-pvc-metro)
2. After the VM is created, check the claimRef for the PV

Actual results:
[...]
  claimRef:
    name: sample-vm-pvc
    namespace: vm-test
[...]

Expected results:
[...]
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: sample-vm-pvc
    namespace: vm-test
[...]

Additional info:
Annette, can you explain what the user-visible issue is? It is clear that there is a mismatch in the PVC claimRef between the backup in s3 and the actual resource, and we think this mismatch is a bug in ramen (it should ignore the mismatch if kind is missing). But it is not clear what the user-visible effect is, or how severe this issue is. Removed the devel ack for now until we have more info on the actual issue.
Info from internal discussion:

kind in ObjectReference, which is the type of claimRef, is optional (omitempty):
https://github.com/kubernetes/api/blob/f3648a53522eb60ea75d70d36a50c799f7e4e23b/core/v1/types.go#L342

So whatever is causing kind to be missing on the PV, we can ignore the kind comparison and move forward.

The failing line in ramen:
https://github.com/RamenDR/ramen/blob/9320b5e171baf8e9b0aee756c3538f245d431c9d/controllers/vrg_volrep.go#L2164

kind does not exist when ramen uploads the PVC to s3; it seems to be added after the PVC is restored.
Based on discussion with Benamar, this issue breaks any flow - once ramen validation breaks, ramen will not make any progress with the DRPC. The workaround is to remove the optional "kind" field from the claimRef.

Annette reports that this does not happen with the busybox application. We don't know why this happens only with the PVC from the kubevirt sample application.

To reproduce:
- create a VM with one PVC
- remove kind from the PV claimRef
- enable DR
- trigger a generation change in the VRG
  - adding an annotation may trigger it (what Annette did)
  - changing the VRG spec will trigger it
- failover (or relocate?) to the other cluster
- the restored PV will have "kind" in the claimRef
- validation should fail since the PVC in the s3 store does not have a kind and the restored PVC does
@nsoffer I tested this fix using your patched ODF 4.15 image quay.io/nirsof/ramen-operator:release-4.15-validate-pvc-v1 in all ramen pods (hub + managed clusters) with a VM workload. The VM workload was created using my repo https://github.com/netzzer/ocm-kubevirt-samples, branch odf-rdr, path odr-vm-pvc-regional.

After the VM was created and the DRPolicy applied, I checked the PV object uploaded to the noobaa bucket and found this:

  "claimRef": {
      "namespace": "vm-02",
      "name": "sample-vm-pvc",

I then failed over to the alternate cluster and failover was successful with VRG generation = 1.

$ oc get vrg vm-01-placement-drpc -o yaml
apiVersion: ramendr.openshift.io/v1alpha1
kind: VolumeReplicationGroup
metadata:
  annotations:
    drplacementcontrol.ramendr.openshift.io/destination-cluster: perf3
    drplacementcontrol.ramendr.openshift.io/do-not-delete-pvc: ""
    drplacementcontrol.ramendr.openshift.io/drpc-uid: b30ae87d-c99f-45c1-aea5-eb0424b8e53e
  creationTimestamp: "2024-02-22T17:30:15Z"
  finalizers:
  - volumereplicationgroups.ramendr.openshift.io/vrg-protection
  generation: 1
  name: vm-01-placement-drpc
  namespace: vm-01
[...]

To change the generation, I updated the VRG with the value "true" for the "do-not-delete-pvc" annotation.

$ oc get vrg vm-01-placement-drpc -o yaml
apiVersion: ramendr.openshift.io/v1alpha1
kind: VolumeReplicationGroup
metadata:
  annotations:
    drplacementcontrol.ramendr.openshift.io/destination-cluster: perf3
    drplacementcontrol.ramendr.openshift.io/do-not-delete-pvc: "true"
    drplacementcontrol.ramendr.openshift.io/drpc-uid: 54b1f9c9-d350-498d-ac6c-9f3ac3115e27
  creationTimestamp: "2024-02-22T01:25:53Z"
  finalizers:
  - volumereplicationgroups.ramendr.openshift.io/vrg-protection
  generation: 2
  name: vm-01-placement-drpc
  namespace: vm-01
[...]

No errors for the VRG even with "kind" missing from the PV object in the noobaa bucket:

  conditions:
  - lastTransitionTime: "2024-02-22T17:21:17Z"
    message: PVC in the VolumeReplicationGroup is ready for use
    observedGeneration: 4
    reason: Ready
    status: "True"
    type: DataReady
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:4591