Description of problem:
When running a VM import from RHV to CNV/Ceph-RBD/Block of a RHEL7 VM with a 100GB disk, the VM import failed very quickly with:

Import error (RHV)
v2vmigrationvm0 could not be imported.
DataVolumeCreationFailed: Error while importing disk image: v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c. pod CrashLoopBackoff restart exceeded

The events log showed:

$ oc get events -n openshift-cnv
6m44s  Normal   Created                pod/importer-v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c            Created container importer
6m44s  Normal   Started                pod/importer-v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c            Started container importer
107s   Normal   ReconcileHCO           clusterserviceversion/kubevirt-hyperconverged-operator.v2.5.1                HCO Reconcile completed successfully
107s   Normal   ReconcileHCO           hyperconverged/kubevirt-hyperconverged                                       HCO Reconcile completed successfully
31m    Normal   Pending                datavolume/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c              PVC v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c Pending
31m    Normal   ExternalProvisioning   persistentvolumeclaim/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c   waiting for a volume to be created, either by external provisioner "openshift-storage.rbd.csi.ceph.com" or manually created by system administrator
31m    Normal   Provisioning           persistentvolumeclaim/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c   External provisioner is provisioning volume for claim "openshift-cnv/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c"
31m    Normal   ProvisioningSucceeded  persistentvolumeclaim/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c   Successfully provisioned volume pvc-950b3ca9-4674-48ba-9350-059e7820f954
31m    Normal   ImportScheduled        datavolume/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c              Import into v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c scheduled
31m    Normal   Bound                  datavolume/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c              PVC v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c Bound
31m    Normal   ImportInProgress       datavolume/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c              Import into v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c in progress
29m    Warning  Error                  datavolume/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c              Unable to process data: read tcp 10.128.3.37:47062->10.1.40.88:54322: read: connection reset by peer
29m    Warning  ErrImportFailed        persistentvolumeclaim/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c   Unable to process data: read tcp 10.128.3.37:47062->10.1.40.88:54322: read: connection reset by peer
28m    Warning  Error                  datavolume/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c              Unable to connect to imageio data source: Fault reason is "Operation Failed". Fault detail is "[Cannot transfer Virtual Disk: The following disks are locked: v2v_migration_vm_0-000. Please try again in a few minutes.]". HTTP response code is "409". HTTP response message is "409 Conflict".
25m    Warning  ErrImportFailed        persistentvolumeclaim/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c   Unable to connect to imageio data source: Fault reason is "Operation Failed". Fault detail is "[Cannot transfer Virtual Disk: The following disks are locked: v2v_migration_vm_0-000. Please try again in a few minutes.]". HTTP response code is "409". HTTP response message is "409 Conflict".
29m    Warning  CrashLoopBackOff       datavolume/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c              back-off 10s restarting failed container=importer pod=importer-v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c_openshift-cnv(35d4bf14-d5b0-453d-a25c-ca50dc45ad89)
28m    Warning  CrashLoopBackOff       datavolume/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c              back-off 20s restarting failed container=importer pod=importer-v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c_openshift-cnv(35d4bf14-d5b0-453d-a25c-ca50dc45ad89)
28m    Warning  CrashLoopBackOff       datavolume/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c              back-off 40s restarting failed container=importer pod=importer-v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c_openshift-cnv(35d4bf14-d5b0-453d-a25c-ca50dc45ad89)
25m    Warning  Error                  datavolume/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c              Unable to process data: read tcp 10.128.3.37:51784->10.1.40.88:54322: read: connection reset by peer
25m    Warning  ErrImportFailed        persistentvolumeclaim/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c   Unable to process data: read tcp 10.128.3.37:51784->10.1.40.88:54322: read: connection reset by peer
25m    Warning  CrashLoopBackOff       datavolume/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c              back-off 1m20s restarting failed container=importer pod=importer-v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c_openshift-cnv(35d4bf14-d5b0-453d-a25c-ca50dc45ad89)

Version-Release number of selected component (if applicable):
CNV-2.5.1

Additional info:
This bug was found while trying to reproduce bug 1893526.
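For reference, the same events can be narrowed down to just this DataVolume/PVC with a field selector (a sketch reusing the object name above; nothing here is specific to this bug):

$ oc get events -n openshift-cnv --field-selector involvedObject.name=v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c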
@slucidi, could you please check whether CDI is reporting this error in its status? IIUC, whatever is reported in the status will bubble up in VMIO, but if it's only in the events it won't, right? If so, either CDI should report the error in the status, or VMIO should check the events. Leaving the BZ in NEW state, as we need to investigate more to know which component is "faulty".
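A quick way to see what CDI actually exposes in the DataVolume status (a sketch, assuming the DataVolume name from the events above; the exact conditions layout depends on the CDI version in use):

$ oc get dv v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c -n openshift-cnv -o jsonpath='{.status.phase}{"\n"}{.status.conditions}{"\n"}'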
It appears that the importer pod error is recorded in the termination log for the container, and in the event log. It looks like VMIO will retrieve the termination message and re-emit it, and retry until it hits the crash loop backoff limit. That means that the VirtualMachineImport should have the termination errors in its event log, but the status once the import fails completely will be "pod CrashLoopBackoff restart exceeded". Ilanit, do you have a reproducer environment, or can you check the VirtualMachineImport event log to see if the messages appear there?
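For example, the termination message can be read straight from the pod status (a sketch using the importer pod name from the events above; the pod name will differ per run):

$ oc get pod importer-v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c -n openshift-cnv -o jsonpath='{.status.containerStatuses[0].lastState.terminated.message}'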
Tested on OCP-4.7/CNV-2.6.0.
VM import of a VM with a 100GB disk, when there is only 65GB of Ceph storage on the OCP side. After a couple of hours the VM import in the UI remains at 46%.

$ oc describe vmimports/vm-import-v2vmigrationvm0-lvjt5

shows these events:

Events:
  Type     Reason                Age                    From                             Message
  ----     ------                ----                   ----                             -------
  Normal   ImportScheduled       30m                    virtualmachineimport-controller  Import of Virtual Machine default/v2vmigrationvm0 started
  Normal   ImportInProgress      30m                    virtualmachineimport-controller  Import of Virtual Machine default/v2vmigrationvm0 disk v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c in progress
  Warning  EventPVCImportFailed  31s (x385 over 5m31s)  virtualmachineimport-controller  Unable to process data: unexpected EOF

I can provide the details of this "mgn04" cluster offline, if needed.
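If it helps, the importer pod behind this DataVolume can be inspected directly (pod name taken from this run; it will differ elsewhere):

$ oc get pods -n openshift-cnv | grep importer
$ oc logs -n openshift-cnv importer-v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c --previous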
https://github.com/kubevirt/vm-import-operator/pull/461
The fix should be part of hco-bundle-registry build v2.6.2-4 / iib:66925.
Tested on hco-v2.6.2-23 iib:68580.

It is not possible to verify this bug, since the importer pod no longer fails but keeps retrying forever; the CDI importer behavior on this version is different.

VM import from RHV to Ceph-RBD/Block: the OCP Ceph size is 70GB, and the imported VM requires 106GB. The importer log endlessly shows progress:

"I0422 08:40:02.495231 1 prometheus.go:69] 100.00"

When this bug was reported, the import used to fail with a crash loop backoff after a few minutes, but now it continues forever. This is a problem because it is not surfaced to the user that there is not enough space to complete the import.

@Maya, can you please confirm that this is indeed the expected behavior?
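For the record, this was observed by tailing the importer pod log, roughly as follows (the pod name is from this environment and is an assumption here):

$ oc logs -f -n openshift-cnv importer-v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c
...
I0422 08:40:02.495231 1 prometheus.go:69] 100.00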
Adding that when cancelling the VM import, the PVC and the importer pod remain in Terminating status, and the PV remains occupied. But this is not new, and I think there is an OCS bug for it.
Based on the test result detailed in comment #7, this bug cannot be verified. It also cannot be fixed from the VM import side, since on CNV-2.6.2 the importer pod no longer fails when the Ceph storage gets full.

@Fabien, @Sam, based on the above, would it be OK to move this bug to WONTFIX? And for 2.6.2, would this documentation for VM import from RHV be OK?: Make sure there is enough space for the VM import. If the VM import remains at 75% with no progress for a long time, check the importer log; if it repeatedly shows progress 100, then the Ceph storage needs to be expanded.

Regarding there being no option to release the Ceph storage even though the VM import is cancelled, we already have Bug 1893528 - [v2v][VM import] Not possible to release target Ceph-RBD/Block space after "partial" disk copy, which was closed as a duplicate of the OCS bug: Bug 1897351 - [Tracking bug 1910288] Capacity limit usability concerns.
I'm fine with closing this BZ as WONTFIX and only updating the docs.
Based on comment #10, closing this bug as WONTFIX. Cloning it to a doc bug to document comment #9.