Bug 2070366 - VM Snapshot Restore hangs indefinitely when backed by a snapshotclass
Summary: VM Snapshot Restore hangs indefinitely when backed by a snapshotclass
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Storage
Version: 4.10.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.10.5
Assignee: Bartosz Rybacki
QA Contact: Kevin Alon Goldblatt
URL:
Whiteboard:
Depends On:
Blocks: 2098225
 
Reported: 2022-03-30 23:05 UTC by Chandler Wilkerson
Modified: 2022-09-06 14:02 UTC
CC: 12 users

Fixed In Version: CNV v4.10.4-17
Doc Type: Known Issue
Doc Text:
Some persistent volume claim (PVC) annotations created by the Containerized Data Importer can cause the virtual machine snapshot restore operation to hang indefinitely. As a workaround, you can remove the annotations manually:
1. Obtain the xref:../virt/virtual_machines/virtual_disks/virt-managing-vm-snapshots.adoc#virtual-machine-snapshot-controller-and-custom-resource-definitions-crds[VirtualMachineSnapshotContent] custom resource (CR) name from the `status.virtualMachineSnapshotContentName` value in the VirtualMachineSnapshot CR.
2. Edit the VirtualMachineSnapshotContent CR by removing all lines that contain `k8s.io/cloneRequest`.
3. Optional: Delete the DataVolume and PersistentVolumeClaim objects in this namespace if their name begins with `restore-` and they are not referenced by virtual machines. This step is required if you did not specify a value for `spec.dataVolumeTemplates` in the VirtualMachine object.
4. Repeat the xref:../virt/virtual_machines/virtual_disks/virt-managing-vm-snapshots.adoc#virt-restoring-vm-from-snapshot-cli_virt-managing-vm-snapshots[restore operation] with the updated VirtualMachineSnapshot CR.
Clone Of:
: 2098225
Environment:
Last Closed: 2022-09-06 14:02:02 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github kubevirt containerized-data-importer pull 2205 0 None Merged Improve handling of prePopulated DV 2022-06-17 15:55:09 UTC
Github kubevirt containerized-data-importer pull 2227 0 None Merged Bugfix - Correctly handle populated PVC 2022-06-17 15:54:22 UTC
Github kubevirt containerized-data-importer pull 2241 0 None Merged Extract the clone logic from main reconcile loop 2022-06-17 15:55:26 UTC
Github kubevirt containerized-data-importer pull 2322 0 None Merged [release-v1.43] Improve handling of prePopulated DV 2022-07-13 10:49:06 UTC
Github kubevirt containerized-data-importer pull 2324 0 None Merged [release-v1.43] Extract the clone logic from main reconcile loop 2022-07-13 10:49:04 UTC
Github kubevirt containerized-data-importer pull 2325 0 None Merged [release-v1.43] Bugfix - Correctly handle populated PVC 2022-07-13 10:49:03 UTC
Github kubevirt kubevirt pull 8078 0 None Merged Must clear "cloneRequest" annotation on restore of PVC 2022-07-18 13:51:09 UTC
Github kubevirt kubevirt pull 8244 0 None Merged [release-v0.49] Must clear "cloneRequest" annotation on restore of PVC 2022-08-11 15:59:55 UTC
Red Hat Product Errata RHSA-2022:6351 0 None None None 2022-09-06 14:02:25 UTC

Description Chandler Wilkerson 2022-03-30 23:05:46 UTC
Description of problem:

New "restore" PVC appears to be waiting on a CDI upload server pod to finish, but work is never sent to that pod.

Version-Release number of selected component (if applicable):
4.10.0

How reproducible:
Always

Steps to Reproduce:
1. Take a snapshot of a (in my case Running) VM
2. Shut down VM
3. Use UI to Restore to snapshot

Actual results:
VM remains indefinitely in Pending state; no log activity in the CDI upload pod

Expected results:
VM is recreated with PVC referencing VolumeSnapshot, and back-end snapshot class handles restore; VM starts quickly.

Additional info:

Comment 4 Michael Henriksen 2022-04-01 00:07:14 UTC
This issue should only affect DataVolumes created via network clone operations.

This is a regression introduced in this PR: https://github.com/kubevirt/containerized-data-importer/pull/1922

Comment 7 Bartosz Rybacki 2022-06-30 15:06:45 UTC
To verify this, one needs a DV that was created by a network clone. Here are example preparation steps:


1. Create a DV by cloning from one storageClass to another storageClass (the target storage class should be backed by storage that supports snapshots)
2. Create a VM from this DV

Then repeat steps as reported by Chandler:

Steps to Reproduce:
1. Take a snapshot of a (in my case Running) VM
2. Shut down VM
3. Use UI to Restore to snapshot
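
For step 1, the snapshot can also be taken from the CLI instead of the UI; a minimal VirtualMachineSnapshot manifest, as a sketch (the snapshot name is hypothetical, the VM name matches the one used elsewhere in this bug):

apiVersion: snapshot.kubevirt.io/v1alpha1
kind: VirtualMachineSnapshot
metadata:
  name: vm-target-default-ceph-rbd1-snap    # hypothetical snapshot name
spec:
  source:
    apiGroup: kubevirt.io
    kind: VirtualMachine
    name: vm-target-default-ceph-rbd1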

Comment 8 Kevin Alon Goldblatt 2022-07-03 15:34:26 UTC
Tested with the following code:
-------------------------------------------
oc version
Client Version: 4.10.0-202206211856.p0.g45460a5.assembly.stream-45460a5
Server Version: 4.10.20
Kubernetes Version: v1.23.5+3afdacb
[cnv-qe-jenkins@localhost ~]$ oc get csv -n openshift-cnv
NAME                                       DISPLAY                    VERSION   REPLACES                                   PHASE
kubevirt-hyperconverged-operator.v4.10.3   OpenShift Virtualization   4.10.3    kubevirt-hyperconverged-operator.v4.10.2   Succeeded
[cnv-qe-jenkins@localhost ~]$ 



Tested with the following scenario:
-------------------------------------------
1. Created source dv on nfs sc
2. Cloned source dv to ceph-rbd target dv
3. Created a VM from the target dv
4. Started the VM via the UI
5. Created a snapshot of the running VM
6. Shut down the VM via the UI
7. Restored the snapshot via the UI
8. VM is created referencing the restored pvc
9. Started the VM, but the VM remains in 'Starting' status


Additional information:
-----------------------------------------
oc get dv 
NAME                       PHASE       PROGRESS   RESTARTS   AGE
source-default1-nfs        Succeeded   100.0%                3h11m
target-default-ceph-rbd1   Succeeded   100.0%                129m

oc get pvc
NAME                                                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
restore-9fe8e7c7-557e-45b8-a1f4-15405294aa07-root   Bound    pvc-cb00bf19-af52-412a-8c25-1d2f82cf0a64   6Gi        RWO            ocs-storagecluster-ceph-rbd   114m
source-default1-nfs                                 Bound    nfs-pv-05                                  5Gi        RWO,RWX        nfs                           3h11m
target-default-ceph-rbd1                            Bound    pvc-532bf771-0622-47f5-ab81-80ce37224d92   6Gi        RWO            ocs-storagecluster-ceph-rbd   129m

oc get vm
NAME                          AGE    STATUS     READY
vm-target-default-ceph-rbd1   130m   Starting   False               >>>>>>>

oc get volumesnapshots
NAME                                                          READYTOUSE   SOURCEPVC                  SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS                            SNAPSHOTCONTENT                                    CREATIONTIME   AGE
vmsnapshot-1a3acd71-dfcf-4ddf-88c7-c59c4dc6c585-volume-root   true         target-default-ceph-rbd1                           6Gi           ocs-storagecluster-rbdplugin-snapclass   snapcontent-005c15d3-aa3d-44c2-b321-4db03e2bf380   127m           127m


Restored VM - yaml:
----------------------
oc get vm -oyaml
apiVersion: v1
items:
- apiVersion: kubevirt.io/v1
  kind: VirtualMachine
  metadata:
    annotations:
      kubemacpool.io/transaction-timestamp: "2022-07-03T13:40:30.861826914Z"
      kubevirt.io/latest-observed-api-version: v1
      kubevirt.io/storage-observed-api-version: v1alpha3
      restore.kubevirt.io/lastRestoreUID: vm-target-default-ceph-rbd1-2022-7-3-0-restore-ugvktw-9fe8e7c7-557e-45b8-a1f4-15405294aa07
    creationTimestamp: "2022-07-03T13:13:44Z"
    generation: 5
    labels:
      kubevirt.io/vm: vm-my-data-volume
    name: vm-target-default-ceph-rbd1
    namespace: default
    resourceVersion: "5523189"
    uid: 5718c66c-9a9b-486a-a272-524650b0844d
  spec:
    running: true
    template:
      metadata:
        creationTimestamp: null
        labels:
          kubevirt.io/vm: vm-my-data-volume
      spec:
        domain:
          cpu:
            cores: 1
            model: host-model
            sockets: 1
            threads: 1
          devices:
            disks:
            - disk:
                bus: virtio
              name: root
            interfaces:
            - macAddress: 02:b7:63:00:00:00
              masquerade: {}
              name: default
          features:
            acpi:
              enabled: true
          firmware:
            uuid: 024157c1-af88-594a-a0d8-d178ee87e843
          machine:
            type: pc-q35-rhel8.4.0
          resources:
            requests:
              memory: 64M
        networks:
        - name: default
          pod: {}
        volumes:
        - name: root
          persistentVolumeClaim:
            claimName: restore-9fe8e7c7-557e-45b8-a1f4-15405294aa07-root
  status:
    conditions:
    - lastProbeTime: "2022-07-03T13:40:31Z"
      lastTransitionTime: "2022-07-03T13:40:31Z"
      message: Guest VM is not reported as running
      reason: GuestNotRunning
      status: "False"
      type: Ready
    created: true
    printableStatus: Starting
    volumeSnapshotStatuses:
    - enabled: true
      name: root
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""



Restored PVC
-------------------
oc get pvc restore-9fe8e7c7-557e-45b8-a1f4-15405294aa07-root -oyaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    cdi.kubevirt.io/storage.condition.running: "true"
    cdi.kubevirt.io/storage.condition.running.message: ""
    cdi.kubevirt.io/storage.condition.running.reason: Pod is running
    cdi.kubevirt.io/storage.pod.phase: Running
    cdi.kubevirt.io/storage.pod.ready: "true"
    cdi.kubevirt.io/storage.pod.restarts: "0"
    cdi.kubevirt.io/storage.uploadPodName: cdi-upload-restore-9fe8e7c7-557e-45b8-a1f4-15405294aa07-root
    cdi.kubevirt.io/uploadClientName: default/source-default1-nfs-default/restore-9fe8e7c7-557e-45b8-a1f4-15405294aa07-root
    k8s.io/CloneOf: "true"
    k8s.io/CloneRequest: default/source-default1-nfs
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    restore.kubevirt.io/name: vm-target-default-ceph-rbd1-2022-7-3-0-restore-ugvktw
    volume.beta.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com
    volume.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com
  creationTimestamp: "2022-07-03T13:27:17Z"
  finalizers:
  - kubernetes.io/pvc-protection
  labels:
    app: containerized-data-importer
    app.kubernetes.io/component: storage
    app.kubernetes.io/managed-by: cdi-controller
    app.kubernetes.io/part-of: hyperconverged-cluster
    app.kubernetes.io/version: 4.10.3
  name: restore-9fe8e7c7-557e-45b8-a1f4-15405294aa07-root
  namespace: default
  ownerReferences:
  - apiVersion: kubevirt.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: VirtualMachine
    name: vm-target-default-ceph-rbd1
    uid: 5718c66c-9a9b-486a-a272-524650b0844d
  resourceVersion: "5510789"
  uid: cb00bf19-af52-412a-8c25-1d2f82cf0a64
spec:
  accessModes:
  - ReadWriteOnce
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: vmsnapshot-1a3acd71-dfcf-4ddf-88c7-c59c4dc6c585-volume-root
  resources:
    requests:
      storage: 6Gi
  storageClassName: ocs-storagecluster-ceph-rbd
  volumeMode: Block
  volumeName: pvc-cb00bf19-af52-412a-8c25-1d2f82cf0a64
status:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 6Gi
  phase: Bound


oc get volumesnapshots vmsnapshot-1a3acd71-dfcf-4ddf-88c7-c59c4dc6c585-volume-root -oyaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  creationTimestamp: "2022-07-03T13:24:46Z"
  finalizers:
  - snapshot.storage.kubernetes.io/volumesnapshot-as-source-protection
  - snapshot.storage.kubernetes.io/volumesnapshot-bound-protection
  generation: 1
  name: vmsnapshot-1a3acd71-dfcf-4ddf-88c7-c59c4dc6c585-volume-root
  namespace: default
  ownerReferences:
  - apiVersion: snapshot.kubevirt.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: VirtualMachineSnapshotContent
    name: vmsnapshot-content-1a3acd71-dfcf-4ddf-88c7-c59c4dc6c585
    uid: 5c375ba4-5e9d-4ab1-8175-472fb3000c4d
  resourceVersion: "5508070"
  uid: 005c15d3-aa3d-44c2-b321-4db03e2bf380
spec:
  source:
    persistentVolumeClaimName: target-default-ceph-rbd1
  volumeSnapshotClassName: ocs-storagecluster-rbdplugin-snapclass
status:
  boundVolumeSnapshotContentName: snapcontent-005c15d3-aa3d-44c2-b321-4db03e2bf380
  creationTime: "2022-07-03T13:24:48Z"
  readyToUse: true
  restoreSize: 6Gi



Source dv
----------------
oc get dv source-default1-nfs -oyaml
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  annotations:
    cdi.kubevirt.io/storage.pod.retainAfterCompletion: "true"
  creationTimestamp: "2022-07-03T12:10:40Z"
  generation: 21
  name: source-default1-nfs
  namespace: default
  resourceVersion: "5437248"
  uid: 4c0ef836-f1c9-4675-9822-35bb3b86a829
spec:
  source:
    http:
      url: https://download.cirros-cloud.net/0.4.0/cirros-0.4.0-x86_64-disk.img
  storage:
    resources:
      requests:
        storage: 500Mi
    storageClassName: nfs
status:
  claimName: source-default1-nfs
  conditions:
  - lastHeartbeatTime: "2022-07-03T12:10:40Z"
    lastTransitionTime: "2022-07-03T12:10:40Z"
    message: PVC source-default1-nfs Bound
    reason: Bound
    status: "True"
    type: Bound
  - lastHeartbeatTime: "2022-07-03T12:11:12Z"
    lastTransitionTime: "2022-07-03T12:11:12Z"
    status: "True"
    type: Ready
  - lastHeartbeatTime: "2022-07-03T12:11:10Z"
    lastTransitionTime: "2022-07-03T12:11:10Z"
    message: Import Complete
    reason: Completed
    status: "False"
    type: Running
  phase: Succeeded
  progress: 100.0%


Target/cloned dv
--------------------
oc get dv target-default-ceph-rbd1 -oyaml
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  annotations:
    cdi.kubevirt.io/cloneType: network
    cdi.kubevirt.io/storage.clone.token: eyJhbGciOiJQUzI1NiJ9.eyJleHAiOjE2NTY4NTQyNTgsImlhdCI6MTY1Njg1Mzk1OCwiaXNzIjoiY2RpLWFwaXNlcnZlciIsIm5hbWUiOiJzb3VyY2UtZGVmYXVsdDEtbmZzIiwibmFtZXNwYWNlIjoiZGVmYXVsdCIsIm5iZiI6MTY1Njg1Mzk1OCwib3BlcnRhdGlvbiI6IkNsb25lIiwicGFyYW1zIjp7InRhcmdldE5hbWUiOiJ0YXJnZXQtZGVmYXVsdC1jZXBoLXJiZDEiLCJ0YXJnZXROYW1lc3BhY2UiOiJkZWZhdWx0In0sInJlc291cmNlIjp7Imdyb3VwIjoiIiwicmVzb3VyY2UiOiJwZXJzaXN0ZW50dm9sdW1lY2xhaW1zIiwidmVyc2lvbiI6InYxIn19.boK_heSwAfWYAhwHNQJTBOybuZXVlaq7cKYL87xI0NycZstTlfpzwVg8NZCAlF4TxU_LvLb-Rxd9nX9OB-g7a3YG_S4qF86RFEkXM-stFiCK-ApOnOeafiodcIt8C3B3nMXURpaudK9-2jP4YZ3KLWb8g6KSSxeyR5SnY8afCJzmQqkf8nbzaittrI_04oVk3UM7qGTLd9LEHCDjyq9jal0-fT-6uaV3Swwytxpgyb3JSYFCvSIZzxodEt2EFkNLZn1gFPwKIZHG4WQdy8u91Da38a_K_CsqaPz2Rx23ghaiARTADghwGnVa_VT4EJH1dX8oUy5TxF19chh2fMREVQ
  creationTimestamp: "2022-07-03T13:12:38Z"
  generation: 26
  name: target-default-ceph-rbd1
  namespace: default
  resourceVersion: "5496921"
  uid: ccfcd020-e9f7-4208-9500-73a47d7f9e2c
spec:
  source:
    pvc:
      name: source-default1-nfs
      namespace: default
  storage:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 6Gi
    storageClassName: ocs-storagecluster-ceph-rbd
status:
  claimName: target-default-ceph-rbd1
  conditions:
  - lastHeartbeatTime: "2022-07-03T13:12:42Z"
    lastTransitionTime: "2022-07-03T13:12:42Z"
    message: PVC target-default-ceph-rbd1 Bound
    reason: Bound
    status: "True"
    type: Bound
  - lastHeartbeatTime: "2022-07-03T13:13:17Z"
    lastTransitionTime: "2022-07-03T13:13:17Z"
    status: "True"
    type: Ready
  - lastHeartbeatTime: "2022-07-03T13:13:15Z"
    lastTransitionTime: "2022-07-03T13:13:15Z"
    message: Clone Complete
    reason: Completed
    status: "False"
    type: Running
  phase: Succeeded
  progress: 100.0%

Comment 9 Bartosz Rybacki 2022-07-05 15:45:54 UTC
I do not like your results. Looks suspicious. I need to confirm that the clone operation was not attempted by the CDI.

Comment 10 Bartosz Rybacki 2022-07-05 15:50:06 UTC
I mean *I need to confirm* that the restore was clean, and the clone operation was not run again. It would mean we fixed one problem and uncovered another.
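
One way to check this from the objects already captured in comment 8 (a hedged sketch; names are taken from that run): the k8s.io/CloneRequest annotation and the upload pod name are visible on the restored PVC, and the upload pod log shows whether a clone transfer was actually started.

oc get pvc restore-9fe8e7c7-557e-45b8-a1f4-15405294aa07-root -o yaml | grep -iE 'clone|upload'
oc logs cdi-upload-restore-9fe8e7c7-557e-45b8-a1f4-15405294aa07-root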

Comment 11 Kevin Alon Goldblatt 2022-07-06 11:05:11 UTC
Reproduced the original bug with the following additional information:
-----------------------------------------------------------------------------

Reopening the bug:

Upload pod:
--------------
oc describe pod cdi-upload-restore-b4101e84-b2dd-41a0-b439-9cf1222c401a-root
Name:         cdi-upload-restore-b4101e84-b2dd-41a0-b439-9cf1222c401a-root
Namespace:    default
Priority:     0
Node:         stg03-kevin-mqcc4-worker-0-zffw4/192.168.1.39
Start Time:   Wed, 06 Jul 2022 10:47:21 +0000
Labels:       app=containerized-data-importer
              app.kubernetes.io/component=storage
              app.kubernetes.io/managed-by=cdi-controller
              app.kubernetes.io/part-of=hyperconverged-cluster
              app.kubernetes.io/version=4.10.3
              cdi.kubevirt.io=cdi-upload-server
              cdi.kubevirt.io/uploadTarget=e4c71394-6671-4e64-88ee-fcafadec8cc1
              service=cdi-upload-restore-b4101e84-b2dd-41a0-b439-9cf1222c401a-root
Annotations:  cdi.kubevirt.io/storage.createdByUploadController: yes
              k8s.v1.cni.cncf.io/network-status:
                [{
                    "name": "openshift-sdn",
                    "interface": "eth0",
                    "ips": [
                        "10.128.2.143"
                    ],
                    "default": true,
                    "dns": {}
                }]
              k8s.v1.cni.cncf.io/networks-status:
                [{
                    "name": "openshift-sdn",
                    "interface": "eth0",
                    "ips": [
                        "10.128.2.143"
                    ],
                    "default": true,
                    "dns": {}
                }]
              sidecar.istio.io/inject: false
Status:       Running
IP:           10.128.2.143
IPs:
  IP:           10.128.2.143
Controlled By:  PersistentVolumeClaim/restore-b4101e84-b2dd-41a0-b439-9cf1222c401a-root
Containers:
  cdi-upload-server:
    Container ID:  cri-o://f4284a52256aef3cea8c3b813b88fdb461754fa17944c9f9b3a2d2627ccdcfce
    Image:         registry.redhat.io/container-native-virtualization/virt-cdi-uploadserver@sha256:f824df36b2e080af35a7d86adb0fb58289300d00f97688b245428eb9035fe08e
    Image ID:      registry.redhat.io/container-native-virtualization/virt-cdi-uploadserver@sha256:2e634b958b8d91c5d7281af707597fccead0e8a1373d2e0df005a59cbf4ad9cf
    Port:          <none>
    Host Port:     <none>
    Args:
      -v=1
    State:          Running
      Started:      Wed, 06 Jul 2022 10:47:34 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     750m
      memory:  600M
    Requests:
      cpu:      100m
      memory:   60M
    Readiness:  http-get http://:8080/healthz delay=2s timeout=1s period=5s #success=1 #failure=3
    Environment:
      TLS_KEY:              <set to the key 'tls.key' in secret 'cdi-upload-restore-b4101e84-b2dd-41a0-b439-9cf1222c401a-root'>  Optional: false
      TLS_CERT:             <set to the key 'tls.crt' in secret 'cdi-upload-restore-b4101e84-b2dd-41a0-b439-9cf1222c401a-root'>  Optional: false
      CLIENT_CERT:          -----BEGIN CERTIFICATE-----
                            MIIDaTCCAlGgAwIBAgIIUFtgr/UcO9cwDQYJKoZIhvcNAQELBQAwQjFAMD4GA1UE
                            Aww3b3BlbnNoaWZ0LWNudl9jZGktdXBsb2Fkc2VydmVyLWNsaWVudC1zaWduZXJA
                            MTY1NjUxNTc3MjAeFw0yMjA2MjkxNTE2MTJaFw0zMjA2MjYxNTE2MTNaMEIxQDA+
                            BgNVBAMMN29wZW5zaGlmdC1jbnZfY2RpLXVwbG9hZHNlcnZlci1jbGllbnQtc2ln
                            bmVyQDE2NTY1MTU3NzIwggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQDZ
                            iGnknyFuiKv9d1zkurmtBtkD0L3qtZnErsJk/YUyWkoV5qDCHqX6f/vMZY2r8mF2
                            fm/NFRJycraFay7iykifWrhm6iiduiDJ796wpISecdGQV9WYS9FrVjmBmzf49OBj
                            OfK96jEOK0G9tsm9utINQCQKsgmVveHsuZqzQbVdAnxOw6DJ3uAxJAGT+OxkDKy7
                            t+LUGCy5F2Wp6nUvERMpps4WNL4tfu2uVHJ3KIuSv2R10v/Tv5z+mqRxVFPc99LH
                            9BIxLH0I4rz1IMnkPWP1S/a9eb738RMfxHl2ggkv4aT9ZwOH/FYA1iESb7VOEJzO
                            2jwS9GdROec7hqRh2Na5AgMBAAGjYzBhMA4GA1UdDwEB/wQEAwICpDAPBgNVHRMB
                            Af8EBTADAQH/MB0GA1UdDgQWBBRBS9AN+heQC3bUGA6y/FhMFcrAsjAfBgNVHSME
                            GDAWgBRBS9AN+heQC3bUGA6y/FhMFcrAsjANBgkqhkiG9w0BAQsFAAOCAQEAGpoJ
                            MU5YviGC4CzBZ9c4lypPOGn3bjbiszE9CFXRFre7ohC44kr4hBrqAsg2d4gFjjBO
                            vFgPS8JHfnAt+Ue/f89moxnqc4jpUk5Db1EzTLo2mHA3E6/UAWPbqQwk8dwCdGTw
                            DAOOWTApUgJhQEqNd2Lewgr0hWTvmK9E+yrN640MRJCtI4JkBXUaZTBf3bBkJxS7
                            nkliFr2GYE8nMt2Yz+0YSNtUb07QA8PaqLGRMhvCdjfK29LoZc7fktlYQdC17/Is
                            apLeOfTd5+GawyOCS1V1jlqQCGf+xWpAKtAiemdovbCikMhxAHpviOCzNj2knkhK
                            N8ZJhUtHjQIHZP2P/w==
                            -----END CERTIFICATE-----
                            
      FILESYSTEM_OVERHEAD:  0
      UPLOAD_IMAGE_SIZE:    6Gi
      CLIENT_NAME:          default/source-default-nfs1-default/restore-b4101e84-b2dd-41a0-b439-9cf1222c401a-root
      PREALLOCATION:        false
      DESTINATION:          /dev/cdi-block-volume
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-zl7lv (ro)
    Devices:
      /dev/cdi-block-volume from cdi-data-vol
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  cdi-data-vol:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  restore-b4101e84-b2dd-41a0-b439-9cf1222c401a-root
    ReadOnly:   false
  kube-api-access-zl7lv:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason                  Age    From                     Message
  ----    ------                  ----   ----                     -------
  Normal  Scheduled               4m19s  default-scheduler        Successfully assigned default/cdi-upload-restore-b4101e84-b2dd-41a0-b439-9cf1222c401a-root to stg03-kevin-mqcc4-worker-0-zffw4
  Normal  SuccessfulAttachVolume  4m18s  attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-e4c71394-6671-4e64-88ee-fcafadec8cc1"
  Normal  SuccessfulMountVolume   4m9s   kubelet                  MapVolume.MapPodDevice succeeded for volume "pvc-e4c71394-6671-4e64-88ee-fcafadec8cc1" globalMapPath "/var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-e4c71394-6671-4e64-88ee-fcafadec8cc1/dev"
  Normal  SuccessfulMountVolume   4m9s   kubelet                  MapVolume.MapPodDevice succeeded for volume "pvc-e4c71394-6671-4e64-88ee-fcafadec8cc1" volumeMapPath "/var/lib/kubelet/pods/138a753e-8518-44b6-955d-1106d8cd427d/volumeDevices/kubernetes.io~csi"
  Normal  AddedInterface          4m6s   multus                   Add eth0 [10.128.2.143/23] from openshift-sdn
  Normal  Pulled                  4m6s   kubelet                  Container image "registry.redhat.io/container-native-virtualization/virt-cdi-uploadserver@sha256:f824df36b2e080af35a7d86adb0fb58289300d00f97688b245428eb9035fe08e" already present on machine
  Normal  Created                 4m6s   kubelet                  Created container cdi-upload-server
  Normal  Started                 4m5s   kubelet                  Started container cdi-upload-server



The vmrestore:
----------------
oc describe vm-target-default-ceph-rbd1-2022-7-6-0-restore-9sjknm
error: the server doesn't have a resource type "vm-target-default-ceph-rbd1-2022-7-6-0-restore-9sjknm"
[cnv-qe-jenkins@localhost ~]$ oc describe vmrestore vm-target-default-ceph-rbd1-2022-7-6-0-restore-9sjknm
Name:         vm-target-default-ceph-rbd1-2022-7-6-0-restore-9sjknm
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  snapshot.kubevirt.io/v1alpha1
Kind:         VirtualMachineRestore
Metadata:
  Creation Timestamp:  2022-07-06T10:47:18Z
  Generation:          5
  Managed Fields:
    API Version:  snapshot.kubevirt.io/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        .:
        f:target:
          .:
          f:apiGroup:
          f:kind:
          f:name:
        f:virtualMachineSnapshotName:
    Manager:      Mozilla
    Operation:    Update
    Time:         2022-07-06T10:47:18Z
    API Version:  snapshot.kubevirt.io/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:ownerReferences:
          .:
          k:{"uid":"b1346918-e2a2-4c4b-b784-8d1b27762ea9"}:
      f:status:
        .:
        f:complete:
        f:conditions:
        f:restoreTime:
        f:restores:
    Manager:    virt-controller
    Operation:  Update
    Time:       2022-07-06T10:47:21Z
  Owner References:
    API Version:           kubevirt.io/v1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  VirtualMachine
    Name:                  vm-target-default-ceph-rbd1
    UID:                   b1346918-e2a2-4c4b-b784-8d1b27762ea9
  Resource Version:        9548820
  UID:                     b4101e84-b2dd-41a0-b439-9cf1222c401a
Spec:
  Target:
    API Group:                    kubevirt.io
    Kind:                         VirtualMachine
    Name:                         vm-target-default-ceph-rbd1
  Virtual Machine Snapshot Name:  vm-target-default-ceph-rbd1-2022-7-6-0
Status:
  Complete:  true
  Conditions:
    Last Probe Time:       <nil>
    Last Transition Time:  2022-07-06T10:47:21Z
    Reason:                Operation complete
    Status:                False
    Type:                  Progressing
    Last Probe Time:       <nil>
    Last Transition Time:  2022-07-06T10:47:21Z
    Reason:                Operation complete
    Status:                True
    Type:                  Ready
  Restore Time:            2022-07-06T10:47:21Z
  Restores:
    Persistent Volume Claim:  restore-b4101e84-b2dd-41a0-b439-9cf1222c401a-root
    Volume Name:              root
    Volume Snapshot Name:     vmsnapshot-d994cb03-99ef-4ee9-85ce-3563775aab14-volume-root
Events:
  Type    Reason                         Age                From                Message
  ----    ------                         ----               ----                -------
  Normal  VirtualMachineRestoreComplete  10m (x3 over 10m)  restore-controller  Successfully completed VirtualMachineRestore vm-target-default-ceph-rbd1-2022-7-6-0-restore-9sjknm



The vm launcher pod:
--------------------------
oc describe pod virt-launcher-vm-target-default-ceph-rbd1-xqnxl
Name:           virt-launcher-vm-target-default-ceph-rbd1-xqnxl
Namespace:      default
Priority:       0
Node:           stg03-kevin-mqcc4-worker-0-x6lln/192.168.2.208
Start Time:     Wed, 06 Jul 2022 10:49:35 +0000
Labels:         kubevirt.io=virt-launcher
                kubevirt.io/created-by=43cab089-04ea-4de2-b2d8-6a7dbe8fd5ff
                kubevirt.io/vm=vm-my-data-volume
                vm.kubevirt.io/name=vm-target-default-ceph-rbd1
Annotations:    kubevirt.io/domain: vm-target-default-ceph-rbd1
                kubevirt.io/migrationTransportUnix: true
                post.hook.backup.velero.io/command:
                  ["/usr/bin/virt-freezer", "--unfreeze", "--name", "vm-target-default-ceph-rbd1", "--namespace", "default"]
                post.hook.backup.velero.io/container: compute
                pre.hook.backup.velero.io/command: ["/usr/bin/virt-freezer", "--freeze", "--name", "vm-target-default-ceph-rbd1", "--namespace", "default"]
                pre.hook.backup.velero.io/container: compute
                traffic.sidecar.istio.io/kubevirtInterfaces: k6t-eth0
Status:         Pending
IP:             
IPs:            <none>
Controlled By:  VirtualMachineInstance/vm-target-default-ceph-rbd1
Containers:
  compute:
    Container ID:  
    Image:         registry.redhat.io/container-native-virtualization/virt-launcher@sha256:5e44e85ff8da6edf26c4942ca61a83bda92896ebeddde8de528ac313dfdda557
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      /usr/bin/virt-launcher
      --qemu-timeout
      294s
      --name
      vm-target-default-ceph-rbd1
      --uid
      43cab089-04ea-4de2-b2d8-6a7dbe8fd5ff
      --namespace
      default
      --kubevirt-share-dir
      /var/run/kubevirt
      --ephemeral-disk-dir
      /var/run/kubevirt-ephemeral-disks
      --container-disk-dir
      /var/run/kubevirt/container-disks
      --grace-period-seconds
      45
      --hook-sidecars
      0
      --ovmf-path
      /usr/share/OVMF
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Limits:
      devices.kubevirt.io/kvm:        1
      devices.kubevirt.io/tun:        1
      devices.kubevirt.io/vhost-net:  1
    Requests:
      cpu:                            100m
      devices.kubevirt.io/kvm:        1
      devices.kubevirt.io/tun:        1
      devices.kubevirt.io/vhost-net:  1
      ephemeral-storage:              50M
      memory:                         434757920
    Environment:
      POD_NAME:  virt-launcher-vm-target-default-ceph-rbd1-xqnxl (v1:metadata.name)
    Mounts:
      /var/run/kubevirt from public (rw)
      /var/run/kubevirt-ephemeral-disks from ephemeral-disks (rw)
      /var/run/kubevirt-private from private (rw)
      /var/run/kubevirt/container-disks from container-disks (rw)
      /var/run/kubevirt/hotplug-disks from hotplug-disks (rw)
      /var/run/kubevirt/sockets from sockets (rw)
      /var/run/libvirt from libvirt-runtime (rw)
    Devices:
      /dev/root from root
Readiness Gates:
  Type                                   Status
  kubevirt.io/virtual-machine-unpaused   True 
Conditions:
  Type                                   Status
  Initialized                            True 
  Ready                                  False 
  ContainersReady                        False 
  PodScheduled                           True 
  kubevirt.io/virtual-machine-unpaused   True 
Volumes:
  private:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  public:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  sockets:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  root:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  restore-b4101e84-b2dd-41a0-b439-9cf1222c401a-root
    ReadOnly:   false
  virt-bin-share-dir:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  libvirt-runtime:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  ephemeral-disks:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  container-disks:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  hotplug-disks:
    Type:        EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:      
    SizeLimit:   <unset>
QoS Class:       Burstable
Node-Selectors:  kubevirt.io/schedulable=true
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason              Age    From                     Message
  ----     ------              ----   ----                     -------
  Normal   Scheduled           9m12s  default-scheduler        Successfully assigned default/virt-launcher-vm-target-default-ceph-rbd1-xqnxl to stg03-kevin-mqcc4-worker-0-x6lln
  Warning  FailedAttachVolume  9m13s  attachdetach-controller  Multi-Attach error for volume "pvc-e4c71394-6671-4e64-88ee-fcafadec8cc1" Volume is already used by pod(s) cdi-upload-restore-b4101e84-b2dd-41a0-b439-9cf1222c401a-root
  Warning  FailedMount         7m10s  kubelet                  Unable to attach or mount volumes: unmounted volumes=[root], unattached volumes=[libvirt-runtime root sockets private public ephemeral-disks container-disks hotplug-disks]: timed out waiting for the condition
  Warning  FailedMount         4m55s  kubelet                  Unable to attach or mount volumes: unmounted volumes=[root], unattached volumes=[ephemeral-disks root container-disks hotplug-disks libvirt-runtime sockets private public]: timed out waiting for the condition
  Warning  FailedMount         2m37s  kubelet                  Unable to attach or mount volumes: unmounted volumes=[root], unattached volumes=[container-disks hotplug-disks root libvirt-runtime sockets private public ephemeral-disks]: timed out waiting for the condition
  Warning  FailedMount         23s    kubelet                  Unable to attach or mount volumes: unmounted volumes=[root], unattached volumes=[public root ephemeral-disks container-disks hotplug-disks libvirt-runtime sockets private]: timed out waiting for the condition


source dv yaml:
---------------------
apiVersion: cdi.kubevirt.io/v1alpha1
kind: DataVolume
metadata:
  name: source-default1-nfs
  annotations:
    cdi.kubevirt.io/storage.pod.retainAfterCompletion: "true"
spec:
  source:
      http:
         url: "https://download.cirros-cloud.net/0.4.0/cirros-0.4.0-x86_64-disk.img"
  storage:
    storageClassName: nfs
    resources:
      requests:
        storage: 500Mi



target dv yaml:
------------------
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: target-default-ceph-rbd1
spec:
  source:
    pvc:
      namespace: default
      name: source-default-nfs1
  storage:
    storageClassName: ocs-storagecluster-ceph-rbd
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 6Gi


vm yaml:
-----------
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachine
metadata:
  labels:
    kubevirt.io/vm: vm-my-data-volume
  name: vm-target-default-ceph-rbd1
spec:
  running: false
  template:
    metadata:
      labels:
        kubevirt.io/vm: vm-my-data-volume
    spec:
      domain:
        devices:
          disks:
          - disk:
              bus: virtio
            name: root 
        resources:
          requests:
            memory: 256M
      volumes:
      - dataVolume:
          name: target-default-ceph-rbd1
        name: root

Comment 12 Bartosz Rybacki 2022-07-06 11:35:06 UTC
The first PR fixed the problem on the CDI side with handling the prePopulated volumes, but I did not see the second problem. The second problem is that the PVC from the snapshot restore does not have the prePopulated annotation.

Comment 13 Michael Henriksen 2022-07-06 16:31:46 UTC
Looks like the issue is that the restore controller is not stripping the "k8s.io/CloneRequest" annotation from the PVC that is created from the snapshot.  These are the annotations (prefixes) that are currently removed [1].

Also, the prePopulated annotation is not applicable in this situation since the VM was referring to a stand-alone DataVolume and not a dataVolumeTemplate.  When stand-alone DataVolumes are snapshotted, the DataVolume definition is not saved.  At restore time, only the pvc is created and the VolumeSource is converted from DataVolume to PersistentVolumeClaim [2]

As-is, this problem can lead to data loss if the VM has DataVolumeTemplates and is not snapshotted before a restore.


[1] https://github.com/kubevirt/kubevirt/blob/v0.53.2/pkg/virt-controller/watch/snapshot/restore.go#L73-L78
[2] https://github.com/kubevirt/kubevirt/blob/v0.53.2/pkg/virt-controller/watch/snapshot/restore.go#L473-L484
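
For reference, the offending annotation can be seen directly on the restored PVC captured in comment 8; a quick check, assuming that PVC name:

oc get pvc restore-9fe8e7c7-557e-45b8-a1f4-15405294aa07-root -o yaml | grep -i clonerequest
# expected output while the bug is present (from the dump above):
#   k8s.io/CloneRequest: default/source-default1-nfs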

Comment 14 Michael Henriksen 2022-07-13 15:57:41 UTC
Workaround for this issue if encountered:

1.  Get the name of the VirtualMachineSnapshotContent resource from `status.virtualMachineSnapshotContentName` of the VirtualMachineSnapshot
2.  Edit the VirtualMachineSnapshotContent resource, remove all lines containing `k8s.io/cloneRequest`
3.  Do another restore from the same VirtualMachineSnapshot
4.  (optional) Check namespace for any DataVolumes/PersistentVolumeClaims with the `restore-` prefix that are not referenced by any VMs.  They can be deleted.  This should only be required if not using `spec.dataVolumeTemplates` in the VM.
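
Steps 1 and 2 can be done from the CLI; a hedged sketch, with <vmsnapshot-name> as a placeholder for the VirtualMachineSnapshot used in the failed restore:

oc get virtualmachinesnapshot <vmsnapshot-name> -o jsonpath='{.status.virtualMachineSnapshotContentName}'
oc edit virtualmachinesnapshotcontent <content-name-from-previous-command>
# in the editor, delete every line containing k8s.io/cloneRequest, save,
# then create a new VirtualMachineRestore against the same VirtualMachineSnapshot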

Comment 15 Adam Litke 2022-07-13 16:02:57 UTC
In light of the given workaround I am removing blocker+ and pushing to the next z-stream.

Comment 16 Shikha Jhala 2022-07-27 20:04:03 UTC
Added known issue to the 4.11 release notes. @kgoldbla Please review: https://github.com/openshift/openshift-docs/pull/48328. Thank you.

Comment 18 Adam Litke 2022-08-01 18:49:06 UTC
Assigning back to you Bartosz.  Can you handle the backports please?

Comment 19 Bartosz Rybacki 2022-08-03 13:04:15 UTC
I need to backport to 4.10 - for kubevirt it means upstream 0.49 - in progress

Comment 20 Bartosz Rybacki 2022-08-11 16:02:15 UTC
Backport merged and the D/S build is ready for testing

Comment 23 Kevin Alon Goldblatt 2022-08-24 10:43:38 UTC
Verified with the following code:
----------------------------------------------
oc version
Client Version: 4.10.0-202207291637.p0.ge29d58e.assembly.stream-e29d58e
Server Version: 4.10.26
Kubernetes Version: v1.23.5+012e945

oc get csv -n openshift-cnv
NAME                                       DISPLAY                    VERSION   REPLACES                                   PHASE
kubevirt-hyperconverged-operator.v4.10.5   OpenShift Virtualization   4.10.5    kubevirt-hyperconverged-operator.v4.10.3   Succeeded
volsync-product.v0.4.1                     VolSync                    0.4.1     volsync-product.v0.4.0                     Succeeded

Verified with the following scenario:
---------------------------------------------
1. Created source dv on nfs sc
2. Cloned source dv to ceph-rbd target dv
3. Created a VM from the target dv
4. Started the VM via the UI
5. Created a snapshot of the running VM
6. Shut down the VM via the UI
7. Restored the snapshot via the UI
8. VM is created referencing the restored pvc
9. Started the VM successfully
10. Accessed the VM successfully

Comment 29 errata-xmlrpc 2022-09-06 14:02:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Virtualization 4.10.5 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6351

