Description of problem:
Attempted to restore an online snapshot of a Windows 2k19 server VM; the restore hung with a Pending PVC, apparently due to a mismatch between the old and new volume sizes (they must be identical).

Version-Release number of selected component (if applicable):
OCP 4.9.0
CNV 4.9.0

How reproducible:
Will reinstall to retest.

Steps to Reproduce:
1. Online snapshot Windows VM
2. Shut down VM
3. Restore from snapshot

Actual results:
ProvisioningFailed warning

Expected results:
Restored VM restarts from snapshot

Additional info:

oc describe pvc restore-552369e5-42cf-49d2-9d00-e35602a7cb17-rootdisk
Name:          restore-552369e5-42cf-49d2-9d00-e35602a7cb17-rootdisk
Namespace:     default
StorageClass:  cnv-integration-svm
Status:        Pending
Volume:
Labels:        app=containerized-data-importer
               app.kubernetes.io/component=storage
               app.kubernetes.io/managed-by=cdi-controller
               app.kubernetes.io/part-of=hyperconverged-cluster
               app.kubernetes.io/version=v4.9.0
               cdi-controller=cdi-tmp-fb49b48e-86aa-4905-96e0-9c759e411317
               cdi.kubevirt.io=cdi-smart-clone
Annotations:   k8s.io/CloneOf: true
               k8s.io/SmartCloneRequest: true
               restore.kubevirt.io/name: wintest-2021-11-8-02-restore-4hnh7u
               volume.beta.kubernetes.io/storage-provisioner: csi.trident.netapp.io
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode:    Filesystem
DataSource:
  APIGroup:  snapshot.storage.k8s.io
  Kind:      VolumeSnapshot
  Name:      vmsnapshot-9f5f7b71-d04d-4726-ba04-1ade131cd353-volume-rootdisk
Used By:     <none>
Events:
  Type     Reason                Age                   From                                                                                      Message
  ----     ------                ----                  ----                                                                                      -------
  Normal   Provisioning          6m28s (x13 over 11m)  csi.trident.netapp.io_trident-csi-5fddc99d78-qmjpr_d18d90d8-405e-43bd-800d-297afff97bd3  External provisioner is provisioning volume for claim "default/restore-552369e5-42cf-49d2-9d00-e35602a7cb17-rootdisk"
  Warning  ProvisioningFailed    6m28s (x13 over 11m)  csi.trident.netapp.io_trident-csi-5fddc99d78-qmjpr_d18d90d8-405e-43bd-800d-297afff97bd3  failed to provision volume with StorageClass "cnv-integration-svm": error getting handle for DataSource Type VolumeSnapshot by Name vmsnapshot-9f5f7b71-d04d-4726-ba04-1ade131cd353-volume-rootdisk: requested volume size 10383777792 is less than the size 12058169344 for the source snapshot vmsnapshot-9f5f7b71-d04d-4726-ba04-1ade131cd353-volume-rootdisk
  Normal   ExternalProvisioning  5m58s (x26 over 11m)  persistentvolume-controller                                                               waiting for a volume to be created, either by external provisioner "csi.trident.netapp.io" or manually created by system administrator

NAME                                                              SOURCEKIND      SOURCENAME  PHASE      READYTOUSE  CREATIONTIME  ERROR
virtualmachinesnapshot.snapshot.kubevirt.io/wintest-2021-11-8     VirtualMachine  wintest     Succeeded  true        58m
virtualmachinesnapshot.snapshot.kubevirt.io/wintest-2021-11-8-02  VirtualMachine  wintest     Succeeded  true        55m

NAME                                                                            TARGETKIND      TARGETNAME  COMPLETE  RESTORETIME  ERROR
virtualmachinerestore.snapshot.kubevirt.io/wintest-2021-11-8-02-restore-4hnh7u  VirtualMachine  wintest     false

oc describe vmrestore wintest-2021-11-8-02-restore-4hnh7u
Name:         wintest-2021-11-8-02-restore-4hnh7u
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  snapshot.kubevirt.io/v1alpha1
Kind:         VirtualMachineRestore
Metadata:
  Creation Timestamp:  2021-11-08T23:17:16Z
  Generation:          3
  Managed Fields:
    API Version:  snapshot.kubevirt.io/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        .:
        f:target:
          .:
          f:apiGroup:
          f:kind:
          f:name:
        f:virtualMachineSnapshotName:
    Manager:      Mozilla
    Operation:    Update
    Time:         2021-11-08T23:17:16Z
    API Version:  snapshot.kubevirt.io/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:ownerReferences:
          .:
          k:{"uid":"89daea3d-8712-4baf-b565-47b1ece4ff23"}:
      f:status:
        .:
        f:complete:
        f:conditions:
        f:restores:
    Manager:      virt-controller
    Operation:    Update
    Time:         2021-11-08T23:17:16Z
  Owner References:
    API Version:           kubevirt.io/v1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  VirtualMachine
    Name:                  wintest
    UID:                   89daea3d-8712-4baf-b565-47b1ece4ff23
  Resource Version:        5624615
  UID:                     552369e5-42cf-49d2-9d00-e35602a7cb17
Spec:
  Target:
    API Group:                      kubevirt.io
    Kind:                           VirtualMachine
    Name:                           wintest
  Virtual Machine Snapshot Name:    wintest-2021-11-8-02
Status:
  Complete:  false
  Conditions:
    Last Probe Time:       <nil>
    Last Transition Time:  2021-11-08T23:17:16Z
    Reason:                Creating new PVCs
    Status:                True
    Type:                  Progressing
    Last Probe Time:       <nil>
    Last Transition Time:  2021-11-08T23:17:16Z
    Reason:                Waiting for new PVCs
    Status:                False
    Type:                  Ready
  Restores:
    Persistent Volume Claim:  restore-552369e5-42cf-49d2-9d00-e35602a7cb17-rootdisk
    Volume Name:              rootdisk
    Volume Snapshot Name:     vmsnapshot-9f5f7b71-d04d-4726-ba04-1ade131cd353-volume-rootdisk
Events:  <none>
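The sizes in the ProvisioningFailed event already pin down the failure; a pure-arithmetic shell check (no cluster access needed, values copied from the event above) shows the shortfall:

```shell
# Sizes in bytes, copied from the ProvisioningFailed event above
requested=10383777792   # restore PVC requested volume size
snapshot=12058169344    # source snapshot size
if [ "$requested" -lt "$snapshot" ]; then
  echo "restore PVC is $((snapshot - requested)) bytes too small"
fi
```

The provisioner rejects the claim because the requested size is more than 1.6G below what the snapshot's source volume held.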
Shelly, please take a look.
Additional debugging: I created a new VM, win-resize-test, with a 40Gi root disk and used virtctl guestfs to resize the OS into the larger PVC, reserving about 5.5% for overhead as CDI does. I was able to snapshot and restore this VM without issue.

Additionally, to rule out an environment problem, I created a RHEL8 VM using default settings. It performed a snapshot and restore without issue.
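The resize target used for win-resize-test can be derived with a little shell arithmetic; this is a sketch assuming the ~5.5% overhead reservation mentioned above:

```shell
# 40Gi PVC, reserving ~5.5% for filesystem overhead as CDI does
pvc_bytes=$((40 * 1024 * 1024 * 1024))   # 40Gi = 42949672960 bytes
usable=$((pvc_bytes * 945 / 1000))       # keep 94.5% of the PVC for the guest
echo "resize guest OS to at most $usable bytes"
```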
It is possible the size mismatch is coming from a bug in the UI. I recreated the base win2k19 image with a 22Gi DV.

$ oc -n openshift-virtualization-os-images get dv win2k19 -o yaml | grep storage
    cdi.kubevirt.io/storage.bind.immediate.requested: "true"
    cdi.kubevirt.io/storage.clone.token: eyJhbGciOiJQUzI1NiIsImtpZCI6IiJ9.eyJleHAiOjE2MzcxNjg2MTUsImlhdCI6MTYzNzE2ODMxNSwiaXNzIjoiY2RpLWFwaXNlcnZlciIsIm5hbWUiOiJ3aW5kb3dzLWluc3RhbGwtcm9vdGRpc2siLCJuYW1lc3BhY2UiOiJrdWJldmlydC1naXRvcHMiLCJuYmYiOjE2MzcxNjgzMTUsIm9wZXJ0YXRpb24iOiJDbG9uZSIsInBhcmFtcyI6eyJ0YXJnZXROYW1lIjoid2luMmsxOSIsInRhcmdldE5hbWVzcGFjZSI6Im9wZW5zaGlmdC12aXJ0dWFsaXphdGlvbi1vcy1pbWFnZXMifSwicmVzb3VyY2UiOnsiZ3JvdXAiOiIiLCJyZXNvdXJjZSI6InBlcnNpc3RlbnR2b2x1bWVjbGFpbXMiLCJ2ZXJzaW9uIjoidjEifX0.ucsCMmluPrSXQaOfhY6e_CLC8b0d56zgoXop82FnDU7mcVvGUj0cdT0asxDZ5I0_2nUS5DrqviDR2BYU79yhkwAmDLbY5NroumV9CufIqBjnjeQpVX50Pzh0dB-0byNTxZR8HEqdBCGq8QYJgNU9C_Cva0OpDBQmFzPqJotv9oVlUHnM-gQi__t59KpJIJrhArAO95KnsNHVKN2jEvzGzQT0YAsz67cvXxo-xzCZN0Md-rfofM-TRyNmBAmfO3ugjUQAP09APXoTj6k814TDD46Ry_I0Br-5QT2Isv0TLhevr_tvsM9HhQL3IwtVbuYpbcJ2_Jzm7aiCIbwZzjWsAw
      {"apiVersion":"cdi.kubevirt.io/v1beta1","kind":"DataVolume","metadata":{"annotations":{"cdi.kubevirt.io/storage.bind.immediate.requested":"true","kubevirt.ui/provider":"Microsoft"},"name":"win2k19","namespace":"openshift-virtualization-os-images"},"spec":{"pvc":{"accessModes":["ReadWriteOnce"],"resources":{"requests":{"storage":"22Gi"}}},"source":{"pvc":{"name":"windows-install-rootdisk","namespace":"kubevirt-gitops"}}}}
          storage: 22Gi

and its pvc:

$ oc -n openshift-virtualization-os-images get pvc win2k19
NAME      STATUS  VOLUME                                    CAPACITY  ACCESS MODES  STORAGECLASS         AGE
win2k19   Bound   pvc-353b979e-a9bf-4fb4-8518-e379dc0affc4  22Gi      RWO           cnv-integration-svm  5h29m

I then create a VM using the console wizard, and it says it is creating a VM with 9G storage. Looking at the resultant DV:

$ oc get dv win2k19-magnificent-manatee -o yaml | grep storage
    cdi.kubevirt.io/storage.clone.token: eyJhbGciOiJQUzI1NiIsImtpZCI6IiJ9.eyJleHAiOjE2MzcxODgyMjgsImlhdCI6MTYzNzE4NzkyOCwiaXNzIjoiY2RpLWFwaXNlcnZlciIsIm5hbWUiOiJ3aW4yazE5IiwibmFtZXNwYWNlIjoib3BlbnNoaWZ0LXZpcnR1YWxpemF0aW9uLW9zLWltYWdlcyIsIm5iZiI6MTYzNzE4NzkyOCwib3BlcnRhdGlvbiI6IkNsb25lIiwicGFyYW1zIjp7InRhcmdldE5hbWUiOiJ3aW4yazE5LW1hZ25pZmljZW50LW1hbmF0ZWUiLCJ0YXJnZXROYW1lc3BhY2UiOiJkZWZhdWx0In0sInJlc291cmNlIjp7Imdyb3VwIjoiIiwicmVzb3VyY2UiOiJwZXJzaXN0ZW50dm9sdW1lY2xhaW1zIiwidmVyc2lvbiI6InYxIn19.q0Dpvygf9kAIxC4Gfh8s0KNxKfR0p_YkR9S4eCT2D4HToCHVcRo07R23OMXHb-e7SdU9O9vUjSsQ5kXJ1jmSIyBjHroExvL6FYU3wU0GFsZtmkSM3bLLgTO4x6BR6ZkHqJQ34m5MOUxdTSJ0ogyB2gQ_gn0JGp-bnVzCRhsRVWw5pnv3t8jm1CVOtDtm2QZxgvpafXrdPoTYAFhHjmlh81fs0EnP5wUpR_Nu1FMYy0VOq4Y2kH0a5fGB5WUGUGjU-PeQ1KCIc6_OWFlIplmVXSq_i-yv8nhftdRokfh51NzPO-JXDqL2FgIBK13EgcQGyziVxbbpgDLoa4kIQnS9iw
          storage: 10186796Ki
    storageClassName: cnv-integration-svm

The storage is way low, however:

$ oc get pvc win2k19-magnificent-manatee
NAME                          STATUS  VOLUME                                    CAPACITY  ACCESS MODES  STORAGECLASS         AGE
win2k19-magnificent-manatee   Bound   pvc-b422dcd1-87d9-4d38-b0f0-6276b14a288b  22Gi      RWO           cnv-integration-svm  7m51s

Snapshot:

$ oc get vmsnapshot,volumesnapshot,pvc | grep manatee
virtualmachinesnapshot.snapshot.kubevirt.io/win2k19-magnificent-manatee-2021-11-17   VirtualMachine   win2k19-magnificent-manatee   Succeeded   true   4m32s
volumesnapshot.snapshot.storage.k8s.io/vmsnapshot-b73f4fd6-e71e-49bf-be8c-5d0fef96e960-volume-win2k19-magnificent-manatee   true   win2k19-magnificent-manatee   10100400Ki   csi-snapclass   snapcontent-6025e952-779e-44b8-87f3-5037de4ca7b1   4m33s   4m33s
persistentvolumeclaim/win2k19-magnificent-manatee   Bound   pvc-b422dcd1-87d9-4d38-b0f0-6276b14a288b   22Gi   RWO   cnv-integration-svm   17m

It then creates a restore of insufficient size:

$ oc get pvc restore-21fef1e1-62b9-4caa-bfed-e38a7538d99d-win2k19-magnificent-manatee -o yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    k8s.io/CloneOf: "true"
    k8s.io/SmartCloneRequest: "true"
    restore.kubevirt.io/name: win2k19-magnificent-manatee-2021-11-17-restore-fmtfni
    volume.beta.kubernetes.io/storage-provisioner: csi.trident.netapp.io
  creationTimestamp: "2021-11-17T22:43:34Z"
  finalizers:
  - kubernetes.io/pvc-protection
  labels:
    app: containerized-data-importer
    app.kubernetes.io/component: storage
    app.kubernetes.io/managed-by: cdi-controller
    app.kubernetes.io/part-of: hyperconverged-cluster
    app.kubernetes.io/version: v4.9.0
    cdi-controller: cdi-tmp-f8dfe527-ade8-48a5-ad20-9bf514829cac
    cdi.kubevirt.io: cdi-smart-clone
  name: restore-21fef1e1-62b9-4caa-bfed-e38a7538d99d-win2k19-magnificent-manatee
  namespace: default
  ownerReferences:
  - apiVersion: kubevirt.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: VirtualMachine
    name: win2k19-magnificent-manatee
    uid: 2ae73de8-d362-49d1-aad3-86f21cb8a08b
  resourceVersion: "14966926"
  uid: 1e43066c-7f6a-4505-b914-3d53b547b3cd
spec:
  accessModes:
  - ReadWriteOnce
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: vmsnapshot-b73f4fd6-e71e-49bf-be8c-5d0fef96e960-volume-win2k19-magnificent-manatee
  resources:
    requests:
      storage: 10080872Ki
  storageClassName: cnv-integration-svm
  volumeMode: Filesystem
status:
  phase: Pending
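Lining the figures above up (a pure-arithmetic sketch; all values copied from the objects shown, no cluster access needed):

```shell
# All values in Ki, taken from the objects above
restore_request=10080872                 # restore PVC spec.resources.requests.storage
snap_restore_size=10100400               # VolumeSnapshot restoreSize
source_capacity=$((22 * 1024 * 1024))    # source PVC capacity, 22Gi = 23068672Ki

echo "request short of snapshot restoreSize by $((snap_restore_size - restore_request))Ki"
echo "request short of source capacity by $((source_capacity - restore_request))Ki"
```

The restore request falls below the snapshot's restoreSize (which alone is enough for Trident to refuse provisioning) and is less than half of the 22Gi the source PVC actually holds.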
Shelly, are there any updates on this bug?
We are unable to reproduce this bug and will close it. If it continues to be an issue, please reopen it.
This is still an issue; if it's a matter of access to the cluster, I can provide that. (Apologies if I missed this earlier and/or had a broken cluster at the time...)
Here is another example. My Windows install job creates a VM with the following dataVolumeTemplate:

dataVolumeTemplates:
- metadata:
    name: windows-install-rootdisk
  spec:
    pvc:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 22Gi
    source:
      blank: {}

The installer ends up writing around 10Gi to the disk, and I use the following DV to clone it into openshift-virtualization-os-images:

apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: win2k19
  namespace: openshift-virtualization-os-images
  annotations:
    cdi.kubevirt.io/storage.bind.immediate.requested: "true"
    kubevirt.ui/provider: Microsoft
spec:
  source:
    pvc:
      namespace: kubevirt-gitops
      name: windows-install-rootdisk
  storage:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 22Gi

This ends up creating the following PVC in openshift-virtualization-os-images:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    cdi.kubevirt.io/ownedByDataVolume: openshift-virtualization-os-images/win2k19
    cdi.kubevirt.io/readyForTransfer: "true"
    cdi.kubevirt.io/smartCloneSnapshot: kubevirt-gitops/cdi-tmp-96428e41-993d-4c45-9542-db80a83388ca
    cdi.kubevirt.io/storage.condition.running: "False"
    cdi.kubevirt.io/storage.condition.running.message: Clone Complete
    cdi.kubevirt.io/storage.condition.running.reason: Completed
    cdi.kubevirt.io/storage.populatedFor: win2k19
    k8s.io/CloneOf: "true"
    k8s.io/SmartCloneRequest: "true"
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-provisioner: csi.trident.netapp.io
    volume.kubernetes.io/storage-provisioner: csi.trident.netapp.io
  creationTimestamp: "2022-03-30T18:47:20Z"
  finalizers:
  - kubernetes.io/pvc-protection
  labels:
    alerts.k8s.io/KubePersistentVolumeFillingUp: disabled
    app: containerized-data-importer
    app.kubernetes.io/component: storage
    app.kubernetes.io/managed-by: cdi-controller
    app.kubernetes.io/part-of: hyperconverged-cluster
    app.kubernetes.io/version: 4.10.0
    cdi-controller: cdi-tmp-96428e41-993d-4c45-9542-db80a83388ca
    cdi.kubevirt.io: cdi-smart-clone
  name: win2k19
  namespace: openshift-virtualization-os-images
  ownerReferences:
  - apiVersion: cdi.kubevirt.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: DataVolume
    name: win2k19
    uid: 96428e41-993d-4c45-9542-db80a83388ca
  resourceVersion: "17272464"
  uid: b08eea7e-ef6f-411f-a20f-ceb81735eaea
spec:
  accessModes:
  - ReadWriteOnce
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: cdi-tmp-96428e41-993d-4c45-9542-db80a83388ca
  resources:
    requests:
      storage: 10377400Ki
  storageClassName: cnv-integration-svm
  volumeMode: Filesystem
  volumeName: pvc-1784ae57-2465-42b4-bcb9-d710f83271c2
status:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 22Gi
  phase: Bound

Note that under spec.resources.requests.storage, something has reduced the size to 10377400Ki, which is 9.89 GiB.
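As a quick sanity check on the 9.89 GiB figure, converting the reduced request from Ki with integer shell arithmetic (values fixed to the ones above):

```shell
# 10377400Ki -> GiB, truncated to two decimal places
centi_gib=$((10377400 * 100 / 1048576))   # Ki / 1024 / 1024, scaled x100
echo "$((centi_gib / 100)).$((centi_gib % 100)) GiB"
```

This prints "9.89 GiB", matching the figure quoted above (and roughly 45% of the 22Gi the PVC actually reports in status.capacity.storage).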
Chandler,
Shelly, I thought that this might be a duplicate of bug 2064936, but that seems impossible: this bug is related to VMSnapshotRestore, which is a kubevirt feature, while 2064936 deals with filesystem overhead calculations made by CDI. Please work with Chandler to further diagnose.
Regarding https://bugzilla.redhat.com/show_bug.cgi?id=2021354#c11:

I am not sure that the PVC is undersized. The PVC size in the status (status.capacity.storage) is (correctly) reported as 22Gi.

Regarding the value in `spec.resources.requests.storage`: when doing a "smart clone", that value is initially set from the `status.restoreSize` of the snapshot here [1]. Since it was stated that approximately 10G was written to the blank disk, this initial value could make sense. The smart clone process then extends the PVC to the requested size by updating the PVC spec.resources.requests.storage. It is strange that the PVC does not reflect this update, but based on the current value of status.capacity.storage it appears to have been resized correctly.

[1] https://github.com/kubevirt/containerized-data-importer/blob/v1.48.1/pkg/controller/smart-clone-controller.go#L378
After digging into this for a bit, I can see how the DataVolume in [1] would be problematic if it was part of a VMSnapshot + VMRestore. The main issue is that the requested PVC size (10377400Ki) is smaller than the actual PVC size (22Gi), and the snapshot/restore controllers do not handle this totally valid situation correctly. The restore controller will create a 10377400Ki PVC, which is obviously problematic because the snapshot may contain up to 22G.

There are a couple of flawed assumptions in the current snapshot/restore logic:

1. The "status.restoreSize" of a VolumeSnapshot equals "spec.resources.requests.storage" of the source PVC.
2. The source PVC "spec.resources.requests.storage" equals "status.capacity.storage".

With some provisioners (ceph rbd), the above is true. But clearly that is not always the case, as here with NetApp Trident.

I believe this issue can be addressed as follows:

1. VirtualMachineSnapshots should include "status.capacity.storage" for each PVC. Not necessarily the entire PVC status, but at least that part.
2. The VM restore controller has to restore PVCs more intelligently:
   A. If the storage class supports expansion:
      i. Create the target PVC with an initial size equal to the VolumeSnapshot "status.restoreSize".
      ii. Expand the PVC to the source PVC "status.capacity.storage" if necessary.
   B. If expansion is not supported:
      i. Create the target PVC with an initial size equal to the source PVC "status.capacity.storage".
      ii. Hope for the best (works fine with Trident).

The remaining question is how to handle restoring from old VM snapshots. There are a couple of options:

1. A validating webhook can check, for each volume in the VMSnapshot, whether the VolumeSnapshot "status.restoreSize" > PVC "spec.resources.requests.storage", and reject if so.
2. Instead of rejecting, create the new target PVC some X% bigger than the VolumeSnapshot "status.restoreSize".

[1] https://bugzilla.redhat.com/show_bug.cgi?id=2021354#c11
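The sizing decision proposed in point 2 above can be sketched as a small shell helper. This is illustrative only: initial_restore_size is a hypothetical name, and the real logic would live in the restore controller, not a script.

```shell
# Hypothetical helper: pick the initial size for a restored PVC.
#   $1 restore_size    - VolumeSnapshot status.restoreSize
#   $2 source_capacity - source PVC status.capacity.storage
#   $3 allow_expansion - whether the StorageClass supports volume expansion
initial_restore_size() {
  local restore_size=$1 source_capacity=$2 allow_expansion=$3
  if [ "$allow_expansion" = "true" ]; then
    # 2.A: start at restoreSize; the controller would then expand the
    # PVC up to source_capacity if necessary
    echo "$restore_size"
  else
    # 2.B: no expansion support, so create directly at the source capacity
    echo "$source_capacity"
  fi
}

# e.g. the PVC from this bug (sizes in Ki) on a non-expandable class:
initial_restore_size 10377400 23068672 false
```

On a non-expandable class this yields 23068672 (the full 22Gi), avoiding the undersized restore PVC seen above.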
One more important issue to note: smart clone does not update "spec.resources.requests.storage" if "status.capacity.storage" is >= the desired target size. This is how we ended up with the PVC in [1]. We should fix that. The code is here:

https://github.com/kubevirt/containerized-data-importer/blob/main/pkg/controller/datavolume-controller.go#L1337-L1348

Although this bug would be fixed with only the above change, it has exposed some flaws in the snapshot/restore logic, specifically in handling PVCs with more data than "spec.resources.requests.storage".

[1] https://bugzilla.redhat.com/show_bug.cgi?id=2021354#c11
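A minimal sketch of the corrected smart-clone condition, assuming the fix is simply to compare the PVC's own spec request against the desired target size rather than short-circuiting on status.capacity.storage (should_update_request is a hypothetical name; the real change belongs in datavolume-controller.go):

```shell
# Hypothetical: decide whether spec.resources.requests.storage needs patching.
# The flawed logic skipped the patch whenever status.capacity.storage already
# covered the desired size; the spec request itself should be what is compared.
should_update_request() {
  local current_request=$1 desired=$2   # both in Ki
  if [ "$current_request" -lt "$desired" ]; then echo yes; else echo no; fi
}

should_update_request 10377400 23068672   # the PVC in [1]: request must be raised
```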
As Michael mentioned, I'm working on the fix in smart clone. The flaws in the snapshot/restore process will be considered and handled.
Adding here a link to the bug we opened as a result of this bug: https://bugzilla.redhat.com/show_bug.cgi?id=2086825
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Virtualization 4.9.5 Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2022:5389