Bug 2265147 - [4.15 clone][MDR] VirtualMachine PV claimRef definition does not include "kind: PersistentVolumeClaim"
Summary: [4.15 clone][MDR] VirtualMachine PV claimRef definition does not include "kin...
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-dr
Version: 4.15
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Nir Soffer
QA Contact: Annette Clewett
URL:
Whiteboard:
Depends On: 2262455
Blocks:
 
Reported: 2024-02-20 17:16 UTC by Karolin Seeger
Modified: 2024-09-04 09:49 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2262455
Environment:
Last Closed: 2024-09-04 09:49:46 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github red-hat-storage ramen pull 200 0 None open Bug 2265147: Release 4.15 validate pvc 2024-02-21 16:14:42 UTC

Description Karolin Seeger 2024-02-20 17:16:05 UTC
+++ This bug was initially created as a clone of Bug #2262455 +++

Description of problem (please be as detailed as possible and provide log
snippets):
This issue was found because, for MDR, ramen adds "kind: PersistentVolumeClaim" when the PV is created after failover, even though the PV was stored as metadata in the object buckets without "kind: PersistentVolumeClaim" when the VM was first created and a DRPolicy applied. When the generation changes, this error appears in the status field of the associated VRG:

- lastTransitionTime: "2024-02-01T20:37:44Z"
  message: 'Failed to restore PVs: failed to restore ClusterData for VolRep (failed
    to restore PVs and PVCs using profile list ([s3profile-perf8-ocs-storagecluster]):
    failed to restore all []v1.PersistentVolume. Total/Restored 1/0)'
  observedGeneration: 2
  reason: Error
  status: "False"
  type: ClusterDataReady

Version of all relevant components (if applicable):
CNV 4.14.3
OCP 4.14.7
ODF 4.15 (build 129 pre-release)
ACM 2.9.2

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?
Yes, delete "kind: PersistentVolumeClaim" in the claimRef definition of the PV after failover.

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
4

Is this issue reproducible?
Yes

Steps to Reproduce:
1. Using ACM, create a VM using
https://github.com/nirs/ocm-kubevirt-samples
branch: odr-metro
path: odr-vm-pvc-metro
2. After the VM is created, check the claimRef of the PV


Actual results:
[...]
  claimRef:
    name: sample-vm-pvc
    namespace: vm-test
[...]


Expected results:
[...]
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim     
    name: sample-vm-pvc
    namespace: vm-test 
[...]

Additional info:

--- Additional comment from RHEL Program Management on 2024-02-02 22:38:18 UTC ---

This bug having no release flag set previously, is now set with release flag 'odf‑4.15.0' to '?', and so is being proposed to be fixed at the ODF 4.15.0 release. Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any previously set while release flag was missing, have now been reset since the Acks are to be set against a release flag.

--- Additional comment from Nir Soffer on 2024-02-05 15:31:31 UTC ---

Annette, can you explain what the user-visible issue is? It is clear that there is a mismatch
in the claimRef between the backup in s3 and the actual resource, and we think this
mismatch is a bug in ramen (it should ignore the mismatch if kind is missing). But it is not
clear what the user-visible effect is or how severe this issue is.

Removed the devel ack for now until we have more info on the actual issue.

--- Additional comment from Nir Soffer on 2024-02-05 15:39:43 UTC ---

Info from internal discussion:

kind in ObjectReference, which is the type of claimRef, is optional (omitempty):
https://github.com/kubernetes/api/blob/f3648a53522eb60ea75d70d36a50c799f7e4e23b/core/v1/types.go#L342

So whatever is causing kind on the PV to be missing, we can ignore kind comparison and move forward.

The failing line in ramen:
https://github.com/RamenDR/ramen/blob/9320b5e171baf8e9b0aee756c3538f245d431c9d/controllers/vrg_volrep.go#L2164

The kind does not exist when ramen uploads the PVC to s3; it seems to be added after the PVC is
restored.
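
A minimal sketch of the "ignore kind if it is missing" comparison discussed above. This is not the actual ramen code or the actual fix; `ClaimRef`, `optionalEqual` and `claimRefsMatch` are hypothetical names, and `ClaimRef` is a trimmed-down stand-in for the Kubernetes ObjectReference type. It also assumes that apiVersion should be treated as optional in the same way as kind, since both are absent from the claimRef in the actual results.

```go
package main

import "fmt"

// ClaimRef is a hypothetical stand-in holding only the ObjectReference
// fields that appear in the claimRef shown in this bug.
type ClaimRef struct {
	APIVersion string
	Kind       string
	Namespace  string
	Name       string
}

// optionalEqual treats an empty value on either side as "not specified",
// so it only reports a mismatch when both values are set and differ.
func optionalEqual(a, b string) bool {
	return a == "" || b == "" || a == b
}

// claimRefsMatch reports whether two claimRefs point at the same claim.
// Namespace and name must be identical; kind and apiVersion are compared
// only when both sides set them (an assumption made for this sketch).
func claimRefsMatch(a, b ClaimRef) bool {
	return a.Namespace == b.Namespace &&
		a.Name == b.Name &&
		optionalEqual(a.Kind, b.Kind) &&
		optionalEqual(a.APIVersion, b.APIVersion)
}

func main() {
	// claimRef as stored in s3 (no kind, no apiVersion).
	stored := ClaimRef{Namespace: "vm-test", Name: "sample-vm-pvc"}
	// claimRef on the PV after restore (kind and apiVersion present).
	restored := ClaimRef{
		APIVersion: "v1",
		Kind:       "PersistentVolumeClaim",
		Namespace:  "vm-test",
		Name:       "sample-vm-pvc",
	}

	fmt.Println("strict equality:", stored == restored)
	fmt.Println("lenient match:  ", claimRefsMatch(stored, restored))
}
```

With the two claimRefs from this report, a strict field-by-field comparison reports a mismatch, while the lenient comparison accepts them as the same claim. A claimRef with a different name or namespace is still rejected.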

--- Additional comment from Karolin Seeger on 2024-02-12 16:42:44 UTC ---

Decision has been taken to fix this in 4.

--- Additional comment from Karolin Seeger on 2024-02-12 16:44:05 UTC ---

Decision has been taken to fix this in Ramen instead of CNV, because "kind" can be empty.

--- Additional comment from Nir Soffer on 2024-02-12 16:55:50 UTC ---

Based on discussion with Benamar, this issue breaks any flow - once ramen validation breaks,
ramen will not make any progress with the drpc.

The workaround is to remove the optional "kind" field from the claimRef.

Annette claims that this does not happen with the busybox application. We don't know why
this happens only with the pvc from the kubevirt sample application.

To reproduce:
- create a vm with one pvc
- remove the kind from pv claimref
- enable dr
- trigger a generation change in the vrg
  - adding an annotation may trigger it (what Annette did)
  - changing the vrg spec will trigger it
- failover (or relocate?) to the other cluster
- the restored pv will have "kind" in the claimref
- validation should fail since the pvc in s3 store does not have a kind and the restored pvc has a kind

Comment 4 Nir Soffer 2024-02-22 19:11:11 UTC
I'm not sure this flow did reproduce the issue. In my tests, after deployment we do have
a claimRef *without* kind in s3, and after failover we do have a claimRef *with* kind
in the system. But - after failover ramen uploads the pv again to s3, and we have also
"kind" in s3, so there is no conflict when changing the generation.

I reproduced the issue locally by doing:

1. failover
2. edit the pv and remove "kind"
3. edit the drpc and change do-not-delete-pvc: "true"

With this the validation error reproduced, and then replacing the ramen image fixed the issue.

Comment 5 krishnaram Karthick 2024-05-02 11:41:19 UTC
Moving the bug to 4.15.4. We need to understand why this bug needs to be backported.

