+++ This bug was initially created as a clone of Bug #2262455 +++

Description of problem (please be detailed as possible and provide log snippets):

This issue was found because, for MDR, ramen adds "kind: PersistentVolumeClaim" to the PV's claimRef when the PV is created after failover, even though the PV stored as metadata in the object buckets does not contain "kind: PersistentVolumeClaim" from when the VM was first created and a DRPolicy applied. When the generation changes, this error appears in the associated VRG status field:

  - lastTransitionTime: "2024-02-01T20:37:44Z"
    message: 'Failed to restore PVs: failed to restore ClusterData for VolRep (failed
      to restore PVs and PVCs using profile list ([s3profile-perf8-ocs-storagecluster]):
      failed to restore all []v1.PersistentVolume. Total/Restored 1/0)'
    observedGeneration: 2
    reason: Error
    status: "False"
    type: ClusterDataReady

Version of all relevant components (if applicable):
CNV 4.14.3
OCP 4.14.7
ODF 4.15 (build 129 pre-release)
ACM 2.9.2

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?
Yes, delete "kind: PersistentVolumeClaim" in the claimRef definition of the PV after failover.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
4

Can this issue be reproduced?
Yes

Steps to Reproduce:
1. Using ACM, create a VM using https://github.com/nirs/ocm-kubevirt-samples branch: odr-metro path: odr-vm-pvc-metro
2. After the VM is created, check the claimRef for the PV

Actual results:
[...]
  claimRef:
    name: sample-vm-pvc
    namespace: vm-test
[...]

Expected results:
[...]
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: sample-vm-pvc
    namespace: vm-test
[...]
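A minimal sketch of the comparison direction the later comments suggest for ramen: treat the claimRef as matching when the only difference is a "kind" that is empty on one side. The `claimRefsMatch` function and the `ClaimRef` struct below are hypothetical illustrations, not the actual ramen code.

```go
package main

import "fmt"

// ClaimRef is a simplified stand-in for the fields of a PV claimRef
// relevant here (real PVs use core/v1 ObjectReference).
type ClaimRef struct {
	Kind      string
	Name      string
	Namespace string
}

// claimRefsMatch compares two claimRefs but ignores Kind when either
// side leaves it empty, since Kind is optional in ObjectReference.
func claimRefsMatch(stored, restored ClaimRef) bool {
	if stored.Name != restored.Name || stored.Namespace != restored.Namespace {
		return false
	}
	if stored.Kind == "" || restored.Kind == "" {
		return true // optional field missing on one side: not a conflict
	}
	return stored.Kind == restored.Kind
}

func main() {
	// claimRef as stored in s3 (no kind) vs. on the restored PV (with kind).
	s := ClaimRef{Name: "sample-vm-pvc", Namespace: "vm-test"}
	r := ClaimRef{Kind: "PersistentVolumeClaim", Name: "sample-vm-pvc", Namespace: "vm-test"}
	fmt.Println(claimRefsMatch(s, r)) // an empty Kind is not treated as a mismatch
}
```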
Additional info:

--- Additional comment from RHEL Program Management on 2024-02-02 22:38:18 UTC ---

This bug, having no release flag set previously, now has release flag 'odf-4.15.0' set to '?', and so is being proposed to be fixed at the ODF 4.15.0 release. Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any were previously set while the release flag was missing, have now been reset, since the Acks are to be set against a release flag.

--- Additional comment from Nir Soffer on 2024-02-05 15:31:31 UTC ---

Annette, can you explain what the user-visible issue is? It is clear that there is a mismatch in the PV claimRef between the backup in s3 and the actual resource, and we think this mismatch is a bug in ramen (it should ignore the mismatch if kind is missing). But it is not clear what the user-visible effect and the severity of this issue are.

Removed the devel ack for now until we have more info on the actual issue.

--- Additional comment from Nir Soffer on 2024-02-05 15:39:43 UTC ---

Info from internal discussion:

kind in ObjectReference, which is the type for claimRef, is optional (omitempty):
https://github.com/kubernetes/api/blob/f3648a53522eb60ea75d70d36a50c799f7e4e23b/core/v1/types.go#L342

So whatever is causing kind on the PV to be missing, we can ignore the kind comparison and move forward. The failing line in ramen:
https://github.com/RamenDR/ramen/blob/9320b5e171baf8e9b0aee756c3538f245d431c9d/controllers/vrg_volrep.go#L2164

The kind does not exist when ramen uploads the PVC to s3; it seems to be added after the PVC is restored.

--- Additional comment from Karolin Seeger on 2024-02-12 16:42:44 UTC ---

Decision has been taken to fix this in 4.

--- Additional comment from Karolin Seeger on 2024-02-12 16:44:05 UTC ---

Decision has been taken to fix this in Ramen instead of CNV, because "kind" can be empty.
--- Additional comment from Nir Soffer on 2024-02-12 16:55:50 UTC ---

Based on discussion with Benamar, this issue breaks any flow - once ramen validation breaks, ramen will not make any progress with the drpc.

The workaround is to remove the optional "kind" field from the claimRef.

Annette reports that this does not happen with the busybox application. We don't know why this happens only with the PVC from the kubevirt sample application.

To reproduce:
- create a VM with one PVC
- remove the kind from the PV claimRef
- enable DR
- trigger a generation change in the VRG
  - adding an annotation may trigger it (what Annette did)
  - changing the VRG spec will trigger it
- failover (or relocate?) to the other cluster
- the restored PV will have "kind" in the claimRef
- validation should fail, since the PVC in the s3 store does not have a kind and the restored PVC has a kind
I'm not sure this flow reproduced the issue. In my tests, after deployment we do have a claimRef *without* kind in s3, and after failover we do have a claimRef *with* kind in the system. But after failover ramen uploads the PV again to s3, so we also have "kind" in s3, and there is no conflict when changing the generation.

I reproduced the issue locally by doing:
1. failover
2. edit the PV and remove "kind"
3. edit the drpc and change do-not-delete-pvc: "true"

With this, the validation error reproduced, and then replacing the ramen image fixed the issue.
Moving the bug to 4.15.4. We need to understand why this bug needs to be backported.