Bug 1910019

Summary:	[v2v][VM import from RHV] Communication issue is not reflected in the VM import failure in CNV UI
Product:	Container Native Virtualization (CNV)	Reporter:	Ilanit Stein <istein>
Component:	V2V	Assignee:	Sam Lucidi <slucidi>
Status:	CLOSED WONTFIX	QA Contact:	Daniel Gur <dagur>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	2.5.1	CC:	cnv-qe-bugs, fdupont, mrashish, slucidi
Target Milestone:	---	Flags:	istein: needinfo+ istein: needinfo-
Target Release:	2.6.2
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	1954008 (view as bug list)		Environment:
Last Closed:	2021-04-27 12:24:32 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1954008

Description Ilanit Stein 2020-12-22 10:42:12 UTC

Description of problem:
When running a VM import from RHV to CNV (Ceph-RBD/Block storage)
of a RHEL7 VM with a 100GB disk, the import failed very quickly with:

Import error (RHV)
v2vmigrationvm0 could not be imported.
DataVolumeCreationFailed: Error while importing disk image: v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c. pod CrashLoopBackoff restart exceeded

The event log showed:
 $ oc get events -n openshift-cnv

6m44s       Normal    Created                  pod/importer-v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c            Created container importer
6m44s       Normal    Started                  pod/importer-v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c            Started container importer
107s        Normal    ReconcileHCO             clusterserviceversion/kubevirt-hyperconverged-operator.v2.5.1                HCO Reconcile completed successfully
107s        Normal    ReconcileHCO             hyperconverged/kubevirt-hyperconverged                                       HCO Reconcile completed successfully
31m         Normal    Pending                  datavolume/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c              PVC v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c Pending
31m         Normal    ExternalProvisioning     persistentvolumeclaim/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c   waiting for a volume to be created, either by external provisioner "openshift-storage.rbd.csi.ceph.com" or manually created by system administrator
31m         Normal    Provisioning             persistentvolumeclaim/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c   External provisioner is provisioning volume for claim "openshift-cnv/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c"
31m         Normal    ProvisioningSucceeded    persistentvolumeclaim/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c   Successfully provisioned volume pvc-950b3ca9-4674-48ba-9350-059e7820f954
31m         Normal    ImportScheduled          datavolume/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c              Import into v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c scheduled
31m         Normal    Bound                    datavolume/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c              PVC v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c Bound
31m         Normal    ImportInProgress         datavolume/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c              Import into v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c in progress
29m         Warning   Error                    datavolume/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c              Unable to process data: read tcp 10.128.3.37:47062->10.1.40.88:54322: read: connection reset by peer
29m         Warning   ErrImportFailed          persistentvolumeclaim/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c   Unable to process data: read tcp 10.128.3.37:47062->10.1.40.88:54322: read: connection reset by peer
28m         Warning   Error                    datavolume/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c              Unable to connect to imageio data source: Fault reason is "Operation Failed". Fault detail is "[Cannot transfer Virtual Disk: The following disks are locked: v2v_migration_vm_0-000. Please try again in a few minutes.]". HTTP response code is "409". HTTP response message is "409 Conflict".
25m         Warning   ErrImportFailed          persistentvolumeclaim/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c   Unable to connect to imageio data source: Fault reason is "Operation Failed". Fault detail is "[Cannot transfer Virtual Disk: The following disks are locked: v2v_migration_vm_0-000. Please try again in a few minutes.]". HTTP response code is "409". HTTP response message is "409 Conflict".
29m         Warning   CrashLoopBackOff         datavolume/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c              back-off 10s restarting failed container=importer pod=importer-v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c_openshift-cnv(35d4bf14-d5b0-453d-a25c-ca50dc45ad89)
28m         Warning   CrashLoopBackOff         datavolume/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c              back-off 20s restarting failed container=importer pod=importer-v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c_openshift-cnv(35d4bf14-d5b0-453d-a25c-ca50dc45ad89)
28m         Warning   CrashLoopBackOff         datavolume/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c              back-off 40s restarting failed container=importer pod=importer-v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c_openshift-cnv(35d4bf14-d5b0-453d-a25c-ca50dc45ad89)
25m         Warning   Error                    datavolume/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c              Unable to process data: read tcp 10.128.3.37:51784->10.1.40.88:54322: read: connection reset by peer
25m         Warning   ErrImportFailed          persistentvolumeclaim/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c   Unable to process data: read tcp 10.128.3.37:51784->10.1.40.88:54322: read: connection reset by peer
25m         Warning   CrashLoopBackOff         datavolume/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c              back-off 1m20s restarting failed container=importer pod=importer-v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c_openshift-cnv(35d4bf14-d5b0-453d-a25c-ca50dc45ad89)
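
The back-off values in the CrashLoopBackOff events above (10s, 20s, 40s, 1m20s) follow a doubling pattern. A minimal sketch of that pattern, assuming the usual kubelet behavior of a 10-second initial delay that doubles on each restart up to a 5-minute cap (the function name and defaults are illustrative, not taken from this report):

```python
def crashloop_backoff_delays(restarts, initial=10, cap=300):
    """Return the delay in seconds before each of the first `restarts` restarts.

    Assumption: delay starts at `initial`, doubles each time, and is capped at `cap`.
    """
    delays = []
    delay = initial
    for _ in range(restarts):
        delays.append(min(delay, cap))
        delay *= 2
    return delays


# The first four delays match the events above: 10s, 20s, 40s, 1m20s.
print(crashloop_backoff_delays(4))
```

Under these assumptions the delays grow quickly, so a restart limit is reached within a few minutes, which is consistent with the import failing "very quickly".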

Version-Release number of selected component (if applicable):
CNV-2.5.1

Additional info:
This bug was found while trying to reproduce bug 1893526.

Comment 1 Fabien Dupont 2020-12-28 10:31:46 UTC

@slucidi, could you please check whether CDI is reporting this error in its status?
IIUC, whatever is reported in the status will bubble up in VMIO, but if it's only in the events it won't, right? If so, either CDI should report the error in the status, or VMIO should check the events.

Leaving the BZ in NEW state, as we need to investigate more to know which component is "faulty".

Comment 2 Sam Lucidi 2021-01-11 21:45:57 UTC

It appears that the importer pod error is recorded in the termination log for the container and in the event log. It looks like VMIO will retrieve the termination message and re-emit it, and will retry until it hits the crash loop back-off limit. That means the VirtualMachineImport should have the termination errors in its event log, but once the import fails completely, the status will only be "pod CrashLoopBackoff restart exceeded".

Ilanit, do you have a reproducer environment, or can you check the VirtualMachineImport event log to see if the messages appear there?
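
To illustrate the gap between the final status and the event log, here is a minimal sketch that pulls the underlying warnings out of an event list and skips the CrashLoopBackOff noise. It assumes input shaped like the output of `oc get events -n openshift-cnv -o json`; the function name, the sample object name "dv0", and the abbreviated messages are hypothetical, and this is not how VMIO itself processes events.

```python
import json


def root_cause_warnings(events_json, noise_reasons=("CrashLoopBackOff",)):
    """Return unique (kind/name, reason, message) tuples for Warning events,
    ignoring reasons listed in `noise_reasons`."""
    items = json.loads(events_json).get("items", [])
    seen = set()
    result = []
    for ev in items:
        if ev.get("type") != "Warning" or ev.get("reason") in noise_reasons:
            continue
        obj = ev.get("involvedObject", {})
        key = (
            f'{obj.get("kind", "")}/{obj.get("name", "")}',
            ev.get("reason", ""),
            ev.get("message", ""),
        )
        if key not in seen:  # repeated events are reported once
            seen.add(key)
            result.append(key)
    return result


# Hypothetical, abbreviated sample in the shape of an event list.
sample = json.dumps({"items": [
    {"type": "Normal", "reason": "ImportInProgress", "message": "in progress",
     "involvedObject": {"kind": "DataVolume", "name": "dv0"}},
    {"type": "Warning", "reason": "Error",
     "message": "Unable to process data: read: connection reset by peer",
     "involvedObject": {"kind": "DataVolume", "name": "dv0"}},
    {"type": "Warning", "reason": "CrashLoopBackOff",
     "message": "back-off 10s restarting failed container=importer",
     "involvedObject": {"kind": "DataVolume", "name": "dv0"}},
    {"type": "Warning", "reason": "Error",
     "message": "Unable to process data: read: connection reset by peer",
     "involvedObject": {"kind": "DataVolume", "name": "dv0"}},
]})

for obj, reason, message in root_cause_warnings(sample):
    print(obj, reason, message)
```

A filter like this shows the communication error that the final "restart exceeded" status hides.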

Comment 3 Ilanit Stein 2021-01-18 19:21:02 UTC

Tested on OCP-4.7/CNV-2.6.0.
VM import of a VM with a 100GB disk when there is only 65GB available on the Ceph storage on the OCP side.
After a couple of hours, the VM import in the UI remained at 46%.
$ oc describe vmimports/vm-import-v2vmigrationvm0-lvjt5 

Shows these events:
Events:
  Type     Reason                Age                    From                             Message
  ----     ------                ----                   ----                             -------
  Normal   ImportScheduled       30m                    virtualmachineimport-controller  Import of Virtual Machine default/v2vmigrationvm0 started
  Normal   ImportInProgress      30m                    virtualmachineimport-controller  Import of Virtual Machine default/v2vmigrationvm0 disk v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c in progress
  Warning  EventPVCImportFailed  31s (x385 over 5m31s)  virtualmachineimport-controller  Unable to process data: unexpected EOF

I can provide the details of this "mgn04" cluster offline, if needed.

Comment 4 Sam Lucidi 2021-01-19 13:34:07 UTC

https://github.com/kubevirt/vm-import-operator/pull/461

Comment 6 Fabien Dupont 2021-04-20 13:56:05 UTC

The fix should be part of hco-bundle-registry build v2.6.2-4 / iib:66925.

Comment 7 Ilanit Stein 2021-04-22 09:04:40 UTC

Tested on hco-v2.6.2-23 iib:68580

It is not possible to verify the bug, since the importer pod no longer fails but keeps retrying forever.

The CDI importer behavior in this version is different.
VM import from RHV to Ceph-RBD/Block.
The OCP Ceph size is 70GB, and the imported VM requires 106GB.
The importer log endlessly shows progress:
"I0422 08:40:02.495231       1 prometheus.go:69] 100.00"

When the bug was reported, the import used to fail with a crash loop back-off after a few minutes,
but now it continues forever.

This is a problem because the user gets no indication that there is not enough space to do the import.

@Maya,
Can you please confirm that this is indeed the expected behavior?

Comment 8 Ilanit Stein 2021-04-22 09:25:43 UTC

Adding that when cancelling the VM import, the PVC and the importer pod remain in Terminating status, and the PV remains occupied.
But this is not new, and I think there is an OCS bug for it.

Comment 9 Ilanit Stein 2021-04-25 14:27:51 UTC

Based on the test result detailed in comment #7, this bug cannot be verified.
It also cannot be fixed from the VM import side, since on CNV-2.6.2 the importer pod no longer fails when the Ceph storage gets full.

@Fabien, @Sam,

Based on the above, would it be OK to move this bug to WONTFIX?
And for 2.6.2, would the following documentation for VM import from RHV be OK?
Make sure there is enough space for the VM import.
If the VM import remains at 75% with no progress for a long time, check the importer log; if it repeatedly shows progress 100,
then the Ceph storage needs to be expanded.
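
The log check proposed here could be sketched roughly as follows. This assumes the progress lines have the form quoted in comment #7 (`... prometheus.go:69] 100.00`) and that the log was fetched separately, for example with `oc logs` on the importer pod. The function names and the `min_repeats` threshold are hypothetical choices, not values from CDI.

```python
import re

# Matches importer progress lines such as:
#   I0422 08:40:02.495231       1 prometheus.go:69] 100.00
PROGRESS_RE = re.compile(r"prometheus\.go:\d+\]\s+(\d+(?:\.\d+)?)\s*$")


def progress_values(log_lines):
    """Extract the numeric progress value from each matching log line."""
    values = []
    for line in log_lines:
        match = PROGRESS_RE.search(line)
        if match:
            values.append(float(match.group(1)))
    return values


def looks_stalled(log_lines, threshold=100.0, min_repeats=10):
    """Return True if the last `min_repeats` progress lines all report
    at least `threshold`, i.e. the importer keeps logging 100 without finishing."""
    values = progress_values(log_lines)
    if len(values) < min_repeats:
        return False
    return all(v >= threshold for v in values[-min_repeats:])


stuck_log = ["I0422 08:40:02.495231       1 prometheus.go:69] 100.00"] * 12
print(looks_stalled(stuck_log))
```

A positive result only suggests that the target storage may be full; it does not confirm the cause, so the Ceph capacity should still be checked directly.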

Regarding the lack of an option to release the Ceph storage even after the VM import is cancelled, we already have this bug:
Bug 1893528 - [v2v][VM import] Not possible to release target Ceph-RBD/Block space after "partial" disk copy.
which was closed as a duplicate of this OCS bug:
Bug 1897351 - [Tracking bug 1910288] Capacity limit usability concerns

Comment 10 Fabien Dupont 2021-04-26 14:04:26 UTC

I'm fine with closing this BZ as WONTFIX and only updating the docs.

Comment 11 Ilanit Stein 2021-04-27 12:24:32 UTC

Based on comment #10, closing this bug as WONTFIX.
Cloning it to a doc bug to document comment #9.