Bug 2262455 - [MDR] VirtualMachine PV claimRef definition does not include "kind: PersistentVolumeClaim"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-dr
Version: 4.15
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ODF 4.16.0
Assignee: Nir Soffer
QA Contact: Annette Clewett
URL:
Whiteboard:
Depends On:
Blocks: 2265147
Reported: 2024-02-02 22:38 UTC by Annette Clewett
Modified: 2024-07-17 13:12 UTC
CC List: 6 users

Fixed In Version: 4.16.0-85
Doc Type: No Doc Update
Doc Text:
Clone Of:
Cloned To: 2265147
Environment:
Last Closed: 2024-07-17 13:12:56 UTC
Embargoed:




Links:
- GitHub: RamenDR/ramen pull 1211 (Draft), "Fix PV validation", last updated 2024-02-20 16:21:00 UTC
- Red Hat Issue Tracker: CNV-37937, last updated 2024-02-02 22:45:55 UTC
- Red Hat Product Errata: RHSA-2024:4591, last updated 2024-07-17 13:12:59 UTC

Description Annette Clewett 2024-02-02 22:38:10 UTC
Description of problem (please be detailed as possible and provide log
snippets):
This issue was found because, for MDR, ramen adds "kind: PersistentVolumeClaim" to the PV claimRef when the PV is created after failover, even though the PV is stored as metadata in the object buckets without "kind: PersistentVolumeClaim" from when the VM was first created and a DRPolicy applied. When the generation changes, this error appears in the associated VRG status field:

- lastTransitionTime: "2024-02-01T20:37:44Z"
      message: 'Failed to restore PVs: failed to restore ClusterData for VolRep (failed
        to restore PVs and PVCs using profile list ([s3profile-perf8-ocs-storagecluster]):
        failed to restore all []v1.PersistentVolume. Total/Restored 1/0)'
      observedGeneration: 2
      reason: Error
      status: "False"
      type: ClusterDataReady

Version of all relevant components (if applicable):
CNV 4.14.3
OCP 4.14.7
ODF 4.15 (build 129 pre-release)
ACM 2.9.2

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?
Yes, delete "kind: PersistentVolumeClaim" from the claimRef definition of the PV after failover.
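For illustration, the workaround amounts to dropping one optional field from the restored PV's claimRef. A minimal Python sketch over the exported object (field values taken from this report; on a live cluster the same edit would be made with `oc edit pv` or a JSON patch):

```python
# PV spec as restored after failover (abridged to the relevant fields)
pv = {
    "spec": {
        "claimRef": {
            "apiVersion": "v1",
            "kind": "PersistentVolumeClaim",
            "name": "sample-vm-pvc",
            "namespace": "vm-test",
        }
    }
}

# The workaround: drop the optional "kind" entry so the claimRef
# matches the copy ramen stored in the s3 bucket (which lacks it)
pv["spec"]["claimRef"].pop("kind", None)

print(pv["spec"]["claimRef"])
```

The equivalent in-cluster edit would be a JSON patch removing /spec/claimRef/kind, e.g. `oc patch pv <pv-name> --type=json -p '[{"op": "remove", "path": "/spec/claimRef/kind"}]'` (command shape assumed from standard kubectl/oc JSON-patch usage, not taken from this report).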

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
4

Can this issue reproducible?
Yes

Steps to Reproduce:
1. Using ACM create VM using 
https://github.com/nirs/ocm-kubevirt-samples
branch: odr-metro  
path: odr-vm-pvc-metro  
2. After the VM is created, check the claimRef of its PV


Actual results:
[...]
  claimRef:
    name: sample-vm-pvc
    namespace: vm-test
[...]


Expected results:
[...]
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim     
    name: sample-vm-pvc
    namespace: vm-test 
[...]

Additional info:

Comment 2 Nir Soffer 2024-02-05 15:31:31 UTC
Annette, can you explain what the user-visible issue is? It is clear that there is a mismatch
in the PVC claimRef between the backup in s3 and the actual resource, and we think this
mismatch is a bug in ramen (it should ignore the mismatch if kind is missing). But it is not
clear what the user-visible effect and the severity of this issue are.

Removed the devel ack for now until we have more info on the actual issue.

Comment 3 Nir Soffer 2024-02-05 15:39:43 UTC
Info from internal discussion:

kind in ObjectReference, which is the type of claimRef, is optional (omitempty):
https://github.com/kubernetes/api/blob/f3648a53522eb60ea75d70d36a50c799f7e4e23b/core/v1/types.go#L342

So whatever is causing kind to be missing on the PV, we can ignore the kind comparison and move forward.

The failing line in ramen:
https://github.com/RamenDR/ramen/blob/9320b5e171baf8e9b0aee756c3538f245d431c9d/controllers/vrg_volrep.go#L2164

The kind does not exist when ramen uploads the PVC to s3; it seems to be added after the PVC is
restored.
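The fix direction described above (ignore the kind comparison when the field is absent) can be sketched as follows. This is an illustrative Python version of the idea, with made-up function and variable names, not ramen's actual Go code from the linked PR:

```python
def claim_refs_match(backup: dict, restored: dict) -> bool:
    """Compare two claimRef objects, skipping optional fields
    (kind, apiVersion) when either side omits them.

    Mirrors the idea behind the fix: kind in ObjectReference is
    optional, so a missing value should not fail PV validation
    after failover.
    """
    optional = {"kind", "apiVersion"}
    for key in set(backup) | set(restored):
        a, b = backup.get(key), restored.get(key)
        if key in optional and (a is None or b is None):
            continue  # optional field missing on one side: ignore it
        if a != b:
            return False
    return True

# claimRef as stored in the s3 bucket (no kind) vs. the restored PV
backup = {"namespace": "vm-test", "name": "sample-vm-pvc"}
restored = {"apiVersion": "v1", "kind": "PersistentVolumeClaim",
            "namespace": "vm-test", "name": "sample-vm-pvc"}
print(claim_refs_match(backup, restored))  # True
```

With this relaxation, the Total/Restored 1/0 validation failure from the description would not occur, since the only differing fields are the optional ones absent from the s3 copy.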

Comment 6 Nir Soffer 2024-02-12 16:55:50 UTC
Based on discussion with Benamar, this issue breaks any flow: once ramen validation breaks,
ramen will not make any progress with the drpc.

The workaround is to remove the optional "kind" field from the claimRef.

Annette claims that this does not happen with the busybox application. We don't know why
this happens only with the PVC from the kubevirt sample application.

To reproduce:
- create a VM with one PVC
- remove the kind from the PV claimRef
- enable DR
- trigger a generation change in the VRG
  - adding an annotation may trigger it (what Annette did)
  - changing the VRG spec will trigger it
- failover (or relocate?) to the other cluster
- the restored PV will have "kind" in the claimRef
- validation should fail since the PVC in the s3 store does not have a kind and the restored PVC has one

Comment 7 Annette Clewett 2024-02-22 18:21:45 UTC
@nsoffer I tested this fix using your patched ODF 4.15 image quay.io/nirsof/ramen-operator:release-4.15-validate-pvc-v1 in all ramen pods (hub + managedclusters) using a VM workload. The VM workload was created using my repo https://github.com/netzzer/ocm-kubevirt-samples, branch odf-rdr, path odr-vm-pvc-regional.

After the VM was created and the DRPolicy applied, I checked the PV object uploaded to the noobaa bucket and found this:

    "claimRef": {
      "namespace": "vm-02",
      "name": "sample-vm-pvc",

I then failed over to the alternate cluster; failover was successful with vrg generation = 1.

$ oc get vrg vm-01-placement-drpc -o yaml

apiVersion: ramendr.openshift.io/v1alpha1
kind: VolumeReplicationGroup
metadata:
  annotations:
    drplacementcontrol.ramendr.openshift.io/destination-cluster: perf3
    drplacementcontrol.ramendr.openshift.io/do-not-delete-pvc: ""
    drplacementcontrol.ramendr.openshift.io/drpc-uid: b30ae87d-c99f-45c1-aea5-eb0424b8e53e
  creationTimestamp: "2024-02-22T17:30:15Z"
  finalizers:
  - volumereplicationgroups.ramendr.openshift.io/vrg-protection
  generation: 1
  name: vm-01-placement-drpc
  namespace: vm-01
[...]

To change the generation, I updated the VRG, setting the "do-not-delete-pvc" annotation to "true".

$ oc get vrg vm-01-placement-drpc -o yaml

apiVersion: ramendr.openshift.io/v1alpha1
kind: VolumeReplicationGroup
metadata:
  annotations:
    drplacementcontrol.ramendr.openshift.io/destination-cluster: perf3
    drplacementcontrol.ramendr.openshift.io/do-not-delete-pvc: "true"
    drplacementcontrol.ramendr.openshift.io/drpc-uid: 54b1f9c9-d350-498d-ac6c-9f3ac3115e27
  creationTimestamp: "2024-02-22T01:25:53Z"
  finalizers:
  - volumereplicationgroups.ramendr.openshift.io/vrg-protection
  generation: 2
  name: vm-01-placement-drpc
  namespace: vm-01
[...]

No errors for the VRG even with "kind" missing from the PV object in the noobaa bucket:

    conditions:
    - lastTransitionTime: "2024-02-22T17:21:17Z"
      message: PVC in the VolumeReplicationGroup is ready for use
      observedGeneration: 4
      reason: Ready
      status: "True"
      type: DataReady

Comment 17 errata-xmlrpc 2024-07-17 13:12:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:4591

