Bug 2172612

Summary: [4.13] VMSnapshot and WaitForFirstConsumer storage: VMRestore is not Complete
Product: Container Native Virtualization (CNV)
Reporter: Jenia Peimer <jpeimer>
Component: Storage
Assignee: skagan
Status: CLOSED ERRATA
QA Contact: Jenia Peimer <jpeimer>
Severity: high
Priority: unspecified
Version: 4.13.0
CC: akalenyu, alitke, apinnick, ngavrilo, skagan, yadu
Target Release: 4.13.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: CNV-v4.13.0.rhel9-1808
Doc Type: If docs needed, set a value
Clone Of: 2149654
Last Closed: 2023-05-18 02:57:49 UTC
Bug Depends On: 2149654

Description Jenia Peimer 2023-02-22 17:41:55 UTC
+++ This bug was initially created as a clone of Bug #2149654 +++

Description of problem:
VMRestore doesn't get to the Complete state:
the restore DV stays in WaitForFirstConsumer,
the restore PVC is Pending,
and the restore VM is Stopped and not Ready.

Version-Release number of selected component (if applicable):
4.12

How reproducible:
Always on an SNO cluster with snapshot-capable storage that uses the WaitForFirstConsumer volumeBindingMode (TopoLVM storage in our case, lvms-vg1).
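
For reference, the relevant storage setup is a StorageClass that binds on first consumer. A minimal sketch of such a StorageClass follows; the provisioner and parameters are illustrative assumptions for an LVMS/TopoLVM setup, not taken from this cluster:

   apiVersion: storage.k8s.io/v1
   kind: StorageClass
   metadata:
     name: lvms-vg1
   provisioner: topolvm.io                   # assumed TopoLVM/LVMS CSI provisioner
   parameters:
     csi.storage.k8s.io/fstype: xfs          # example filesystem, adjust as needed
   volumeBindingMode: WaitForFirstConsumer   # the binding mode that triggers this bug
   allowVolumeExpansion: true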

Steps to Reproduce:
1. Create a VM - VM is Running
2. Create a VMSnapshot - VMSnapshot is ReadyToUse
3. Create a VMRestore
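
(The YAMLs used for these steps are listed under "Additional info" below; a minimal command sequence, assuming the same file names:)

   $ oc apply -f vm.yaml          # step 1: VM should reach Running
   $ oc apply -f snap.yaml        # step 2: VMSnapshot should become ReadyToUse
   $ oc apply -f vmrestore.yaml   # step 3: create the VMRestore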

Actual results:
VMRestore is not Complete

   $ oc get vmrestore
   NAME            TARGETKIND       TARGETNAME    COMPLETE   RESTORETIME   ERROR
   restore-my-vm   VirtualMachine   vm-restored   false  
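
The underlying objects can be checked at this point with standard oc queries (the restore DV/PVC names depend on the restore, so they are not spelled out here); per the description above they show:

   $ oc get dv,pvc                # restore DV: WaitForFirstConsumer, restore PVC: Pending
   $ oc get vm vm-restored        # restored VM: Stopped, not Ready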

Expected results:
VMRestore is Complete (PVC Bound, DV Succeeded and garbage collected)

Workaround and ONE MORE ISSUE:
1. Start the restored VM
2. See the VM is Ready and Running, DV succeeded, PVC Bound
3. See the VMRestore is still not Complete:

   $ oc get vmrestore
   NAME            TARGETKIND       TARGETNAME    COMPLETE   RESTORETIME   ERROR
   restore-my-vm   VirtualMachine   vm-restored   false  

   $ oc describe vmrestore restore-my-vm | grep Events -A 10
   Events:
     Type     Reason                      Age                    From                Message
     ----     ------                      ----                   ---- 
   Warning  VirtualMachineRestoreError  4m4s (x23 over 4m21s)  restore-controller  VirtualMachineRestore encountered error invalid RunStrategy "Always"

4. See the restored VM runStrategy:
   $ oc get vm vm-restored -oyaml | grep running
      running: true
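
(Context, as an assumption based on general KubeVirt behavior rather than on logs from this BZ: spec.running: true is resolved to the run strategy "Always", and spec.running / spec.runStrategy are mutually exclusive fields, which is why the restore controller complains about RunStrategy "Always" even though the VM only sets "running":)

   # A VM spec carries one of these two mutually exclusive fields:
   spec:
     running: true          # what the restored VM has (step 4 above)
   # which the controller treats as equivalent to:
   spec:
     runStrategy: Always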


***
PLEASE NOTE that on a multi-node cluster with OCS (Immediate volumeBindingMode), the restored VM gets "running: false" even though the source VM had "running: true"; in that case we do not get the above error and the VMRestore becomes Complete:
   $ oc get vm vm-restored-ocs -oyaml | grep running
      running: false
***


5. Stop the restored VM
6. See the VMRestore is Complete:
   $ oc get vmrestore
   NAME            TARGETKIND       TARGETNAME    COMPLETE   RESTORETIME   ERROR
   restore-my-vm   VirtualMachine   vm-restored   true       1s            
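
(For completeness, steps 1 and 5 above can be done with the usual start/stop commands; a sketch, assuming virtctl is available on the client:)

   $ virtctl start vm-restored   # step 1: start the restored VM
   $ virtctl stop vm-restored    # step 5: stop it again; the VMRestore then completes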


Additional info:

VM yaml: 

$ cat vm.yaml 
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachine
metadata:
  name: vm-cirros-source
  labels:
    kubevirt.io/vm: vm-cirros-source
spec:
  dataVolumeTemplates:
  - metadata:
      name: cirros-dv-source
    spec:
      storage:
        resources:
          requests:
            storage: 1Gi
        storageClassName: odf-lvm-vg1
      source:
        http:
          url: <cirros-0.4.0-x86_64-disk.qcow2>
  running: true
  template:
    metadata:
      labels:
        kubevirt.io/vm: vm-cirros-source
    spec:
      domain:
        devices:
          disks:
          - disk:
              bus: virtio
            name: datavolumev
        machine:
          type: ""
        resources:
          requests:
            memory: 100M
      terminationGracePeriodSeconds: 0
      volumes:
      - dataVolume:
          name: cirros-dv-source
        name: datavolumev


VMSnapshot yaml:

$ cat snap.yaml 
apiVersion: snapshot.kubevirt.io/v1alpha1
kind: VirtualMachineSnapshot
metadata:
  name: my-vmsnapshot 
spec:
  source:
    apiGroup: kubevirt.io
    kind: VirtualMachine
    name: vm-cirros-source


VMRestore yaml:

$ cat vmrestore.yaml 
apiVersion: snapshot.kubevirt.io/v1alpha1
kind: VirtualMachineRestore
metadata:
  name: restore-my-vm
spec:
  target:
    apiGroup: kubevirt.io
    kind: VirtualMachine
    name: vm-restored
  virtualMachineSnapshotName: my-vmsnapshot

--- Additional comment from Jenia Peimer on 2023-02-19 13:25:23 UTC ---

Just to keep the info in this BZ: this bug was discussed at the KubeVirt SIG-Storage meeting, and the current approach to fix it is to mark the VMRestore Complete when the DV is in WaitForFirstConsumer and the PVC is Pending.
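
A hedged illustration of what that would look like for the objects in this report (the exact controller change is not shown here; this assumes .status.complete is the field behind the COMPLETE column):

   $ oc get vmrestore restore-my-vm -o jsonpath='{.status.complete}'   # expected: true after the fix
   $ oc get dv,pvc   # restore DV may still be WaitForFirstConsumer and PVC Pending until the VM starts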

Comment 1 Jenia Peimer 2023-03-21 15:14:41 UTC
Verified on an SNO cluster with TopoLVM and WFFC volumeBindingMode.

Comment 4 errata-xmlrpc 2023-05-18 02:57:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.13.0 Images security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:3205