Bug 1910019 - [v2v][VM import from RHV] Communication issue is not reflected in the VM import failure in CNV UI
Summary: [v2v][VM import from RHV] Communication issue is not reflected in the VM import failure in CNV UI
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: V2V
Version: 2.5.1
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 2.6.2
Assignee: Sam Lucidi
QA Contact: Daniel Gur
URL:
Whiteboard:
Depends On:
Blocks: 1954008
 
Reported: 2020-12-22 10:42 UTC by Ilanit Stein
Modified: 2021-04-27 12:29 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Cloned to: 1954008
Environment:
Last Closed: 2021-04-27 12:24:32 UTC
Target Upstream Version:
Embargoed:
istein: needinfo+
istein: needinfo-



Description Ilanit Stein 2020-12-22 10:42:12 UTC
Description of problem:
When importing a RHEL7 VM with a 100GB disk from RHV to CNV (Ceph-RBD/Block storage),
the VM import failed very quickly with:

Import error (RHV)
v2vmigrationvm0 could not be imported.
DataVolumeCreationFailed: Error while importing disk image: v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c. pod CrashLoopBackoff restart exceeded
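For reference, the same failure can also be inspected from the CLI; a minimal sketch (assuming the auto-generated VirtualMachineImport resource name, shown here as <vmimport-name>):

$ oc get vmimports -A
$ oc describe vmimports/<vmimport-name>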

The events log showed:
 $ oc get events -n openshift-cnv

6m44s       Normal    Created                  pod/importer-v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c            Created container importer
6m44s       Normal    Started                  pod/importer-v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c            Started container importer
107s        Normal    ReconcileHCO             clusterserviceversion/kubevirt-hyperconverged-operator.v2.5.1                HCO Reconcile completed successfully
107s        Normal    ReconcileHCO             hyperconverged/kubevirt-hyperconverged                                       HCO Reconcile completed successfully
31m         Normal    Pending                  datavolume/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c              PVC v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c Pending
31m         Normal    ExternalProvisioning     persistentvolumeclaim/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c   waiting for a volume to be created, either by external provisioner "openshift-storage.rbd.csi.ceph.com" or manually created by system administrator
31m         Normal    Provisioning             persistentvolumeclaim/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c   External provisioner is provisioning volume for claim "openshift-cnv/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c"
31m         Normal    ProvisioningSucceeded    persistentvolumeclaim/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c   Successfully provisioned volume pvc-950b3ca9-4674-48ba-9350-059e7820f954
31m         Normal    ImportScheduled          datavolume/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c              Import into v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c scheduled
31m         Normal    Bound                    datavolume/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c              PVC v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c Bound
31m         Normal    ImportInProgress         datavolume/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c              Import into v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c in progress
29m         Warning   Error                    datavolume/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c              Unable to process data: read tcp 10.128.3.37:47062->10.1.40.88:54322: read: connection reset by peer
29m         Warning   ErrImportFailed          persistentvolumeclaim/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c   Unable to process data: read tcp 10.128.3.37:47062->10.1.40.88:54322: read: connection reset by peer
28m         Warning   Error                    datavolume/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c              Unable to connect to imageio data source: Fault reason is "Operation Failed". Fault detail is "[Cannot transfer Virtual Disk: The following disks are locked: v2v_migration_vm_0-000. Please try again in a few minutes.]". HTTP response code is "409". HTTP response message is "409 Conflict".
25m         Warning   ErrImportFailed          persistentvolumeclaim/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c   Unable to connect to imageio data source: Fault reason is "Operation Failed". Fault detail is "[Cannot transfer Virtual Disk: The following disks are locked: v2v_migration_vm_0-000. Please try again in a few minutes.]". HTTP response code is "409". HTTP response message is "409 Conflict".
29m         Warning   CrashLoopBackOff         datavolume/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c              back-off 10s restarting failed container=importer pod=importer-v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c_openshift-cnv(35d4bf14-d5b0-453d-a25c-ca50dc45ad89)
28m         Warning   CrashLoopBackOff         datavolume/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c              back-off 20s restarting failed container=importer pod=importer-v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c_openshift-cnv(35d4bf14-d5b0-453d-a25c-ca50dc45ad89)
28m         Warning   CrashLoopBackOff         datavolume/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c              back-off 40s restarting failed container=importer pod=importer-v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c_openshift-cnv(35d4bf14-d5b0-453d-a25c-ca50dc45ad89)
25m         Warning   Error                    datavolume/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c              Unable to process data: read tcp 10.128.3.37:51784->10.1.40.88:54322: read: connection reset by peer
25m         Warning   ErrImportFailed          persistentvolumeclaim/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c   Unable to process data: read tcp 10.128.3.37:51784->10.1.40.88:54322: read: connection reset by peer
25m         Warning   CrashLoopBackOff         datavolume/v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c              back-off 1m20s restarting failed container=importer pod=importer-v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c_openshift-cnv(35d4bf14-d5b0-453d-a25c-ca50dc45ad89)
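A possible way to dig further into the crash-looping importer pod behind these events (a sketch; the pod name is taken from the events above, and --previous shows the log of the last failed container run):

$ oc get pods -n openshift-cnv | grep importer
$ oc logs importer-v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c -n openshift-cnv --previous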

Version-Release number of selected component (if applicable):
CNV-2.5.1

Additional info:
This bug was found while trying to reproduce bug 1893526.

Comment 1 Fabien Dupont 2020-12-28 10:31:46 UTC
@slucidi, could you please check whether CDI is reporting this error in its status?
IIUC, whatever is reported in the status will bubble up in VMIO, but if it's only in the events it won't, right? If so, either CDI should report the error in the status, or VMIO should check the events.
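For comparison, a rough way to look at both places (a sketch, using the DataVolume name from the description above):

$ oc get dv v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c -n openshift-cnv -o jsonpath='{.status.conditions[*].message}'
$ oc get events -n openshift-cnv --field-selector involvedObject.name=v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c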

Leaving the BZ in NEW state, as we need to investigate further to determine which component is "faulty".

Comment 2 Sam Lucidi 2021-01-11 21:45:57 UTC
It appears that the importer pod error is recorded in the container's termination log and in the event log. It looks like VMIO retrieves the termination message, re-emits it, and retries until it hits the crash loop backoff limit. That means the VirtualMachineImport should have the termination errors in its event log, but once the import fails completely the status will be "pod CrashLoopBackoff restart exceeded".
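A sketch of where to look for that termination message and for the re-emitted events (the pod name is the one from this report; <vmimport-name> is a placeholder):

$ oc get pod importer-v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c -n openshift-cnv -o jsonpath='{.status.containerStatuses[0].lastState.terminated.message}'
$ oc describe vmimports/<vmimport-name>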

Ilanit, do you have a reproducer environment, or can you check the VirtualMachineImport event log to see if the messages appear there?

Comment 3 Ilanit Stein 2021-01-18 19:21:02 UTC
Tested on OCP-4.7/CNV-2.6.0.
VM import of a VM with a 100GB disk, when there is only 65GB of Ceph storage on the OCP side.
After a couple of hours, the VM import in the UI remained at 46%.
$ oc describe vmimports/vm-import-v2vmigrationvm0-lvjt5 

It shows these events:
Events:
  Type     Reason                Age                    From                             Message
  ----     ------                ----                   ----                             -------
  Normal   ImportScheduled       30m                    virtualmachineimport-controller  Import of Virtual Machine default/v2vmigrationvm0 started
  Normal   ImportInProgress      30m                    virtualmachineimport-controller  Import of Virtual Machine default/v2vmigrationvm0 disk v2vmigrationvm0-03072434-e45b-430c-8860-ff50b0c71a2c in progress
  Warning  EventPVCImportFailed  31s (x385 over 5m31s)  virtualmachineimport-controller  Unable to process data: unexpected EOF

I can provide the details of this "mgn04" cluster offline, if needed.

Comment 6 Fabien Dupont 2021-04-20 13:56:05 UTC
The fix should be part of hco-bundle-registry build v2.6.2-4 / iib:66925.

Comment 7 Ilanit Stein 2021-04-22 09:04:40 UTC
Tested on hco-v2.6.2-23 iib:68580

It is not possible to verify the bug, since the importer pod no longer fails but keeps retrying forever:

The CDI importer behavior in this version is different.
The scenario is a VM import from RHV to Ceph-RBD/Block, where the Ceph storage on the OCP side is 70GB and the imported VM requires 106GB.
The importer log endlessly shows progress:
"I0422 08:40:02.495231       1 prometheus.go:69] 100.00"

When the bug was reported, the import used to fail with a crash loop backoff after a few minutes, but now it continues forever.

This is a problem because it is not surfaced to the user that there is not enough space to complete the import.

@Maya,
Can you please confirm that this is indeed the expected behavior?

Comment 8 Ilanit Stein 2021-04-22 09:25:43 UTC
Adding that when cancelling the VM import, the PVC and the importer pod remain in Terminating status, and the PV remains occupied.
But this is not new, and I think there is an OCS bug for it.
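A quick way to see the leftover objects after cancelling (a sketch):

$ oc get pvc -n openshift-cnv
$ oc get pods -n openshift-cnv | grep importer
$ oc get pv | grep openshift-cnv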

Comment 9 Ilanit Stein 2021-04-25 14:27:51 UTC
Based on the test result detailed in comment #7, this bug cannot be verified.
It also cannot be fixed from the VM import side, since on CNV-2.6.2 the importer pod no longer fails when Ceph gets full.

@Fabien, @Sam,

Based on the above, would it be OK to move this bug to WONTFIX?
And for 2.6.2, would this documentation for VM import from RHV be OK?:
Make sure there is enough space for the VM import. If the VM import remains at 75% with no progress for a long time, check the importer log; if it repeatedly shows progress 100, then the Ceph storage needs to be expanded.
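As a possible addition to that doc text, a rough capacity check before or during the import (a sketch, assuming OCS with the rook-ceph toolbox enabled in openshift-storage):

$ oc -n openshift-storage rsh deploy/rook-ceph-tools ceph df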

Regarding there being no option to release the Ceph storage even though the VM import is cancelled, we already have this bug:
Bug 1893528 - [v2v][VM import] Not possible to release target Ceph-RBD/Block space after "partial" disk copy.
which was closed as a duplicate of this OCS bug:
Bug 1897351 - [Tracking bug 1910288] Capacity limit usability concerns

Comment 10 Fabien Dupont 2021-04-26 14:04:26 UTC
I'm fine with closing this BZ as WONTFIX and only updating the docs.

Comment 11 Ilanit Stein 2021-04-27 12:24:32 UTC
Based on comment #10, closing this bug as WONTFIX.
Cloning it to a doc bug to document comment #9.

