Bug 1856111 - [v2v][RHV to CNV VM import] Add debug to ovirt client
Summary: [v2v][RHV to CNV VM import] Add debug to ovirt client
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: V2V
Version: 2.4.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: medium
Target Milestone: ---
Target Release: 2.5.0
Assignee: Ondra Machacek
QA Contact: Ilanit Stein
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-07-12 20:04 UTC by Ilanit Stein
Modified: 2020-12-07 09:52 UTC
CC List: 5 users

Fixed In Version: 2.5.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-12-07 09:52:16 UTC
Target Upstream Version:
Embargoed:


Attachments
vm-import-controller.log (767.34 KB, text/plain), 2020-07-12 20:04 UTC, Ilanit Stein


Links
GitHub kubevirt/vm-import-operator pull 331 (closed): Add debug to ovirt client. Last updated: 2020-11-05 02:00:24 UTC

Description Ilanit Stein 2020-07-12 20:04:09 UTC
Description of problem:
A Fedora 32 VM with OS type RHEL8 was imported from RHV, targeting the "standard" storage class.
While that VM's disk was being copied, a second import of the same VM was started, targeting the NFS storage class.

Result:
The first VM import was displayed in the UI as if it were stuck at 85%.
The second VM import failed:
At first it failed because the disk was locked, and then, after a few minutes, it reported this other failure:
"
The virtual machine could not be imported.
DataVolumeCreationFailed: Data volume default/fedora32-nfs-b870c429-11e0-4630-b3df-21da551a48c0 creation failed: Internal error occurred: failed calling webhook "datavolume-mutate.cdi.kubevirt.io": Post https://cdi-api.openshift-cnv.svc:443/datavolume-mutate?timeout=30s: no endpoints available for service "cdi-api"
"

Piotr Kliczewski:

I checked your environment and found one importer pod (importer-fedora32-1-b870c429-11e0-4630-b3df-21da551a48c0) in Terminating status.

I see a message there: 
        message: 'Unable to connect to imageio data source: Fault reason is "Operation
          Failed". Fault detail is "[Cannot transfer Virtual Disk: The following disks
          are locked: GlanceDisk-f6c31e5. Please try again in a few minutes.]". HTTP
          response code is "409". HTTP response message is "409 Conflict".'

I suspect it failed due to the same disk being imported already.

The DV (fedora32-1-b870c429-11e0-4630-b3df-21da551a48c0) is still in:

  phase: ImportInProgress
  progress: 100.00%

Corresponding VMImport contains a message:
  kind: VirtualMachineImport
  metadata:
    annotations:
      vmimport.v2v.kubevirt.io/progress: "85"
      vmimport.v2v.kubevirt.io/source-vm-initial-state: down
    name: vm-import-fedora32-1-xfhww
....
  dataVolumes:
  - name: fedora32-1-b870c429-11e0-4630-b3df-21da551a48c0


An interesting point is that the second import failed to create the DV.

  kind: VirtualMachineImport
  name: vm-import-fedora32-nfs-p22xd
...
      message: 'Data volume default/fedora32-nfs-b870c429-11e0-4630-b3df-21da551a48c0
        creation failed: Internal error occurred: failed calling webhook "datavolume-mutate.cdi.kubevirt.io":
        Post https://cdi-api.openshift-cnv.svc:443/datavolume-mutate?timeout=30s:
        no endpoints available for service "cdi-api"'
      reason: DataVolumeCreationFailed
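
For reference, here is a minimal Go sketch (assuming client-go, a kubeconfig at the default location, and the namespace and service name taken from the error above) that lists the cdi-api Service's Endpoints to confirm whether any CDI API pods are ready behind it:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the kubeconfig from the default location (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// "no endpoints available for service cdi-api" means this Endpoints object
	// had no ready addresses; listing them confirms whether cdi-api pods are up.
	ep, err := client.CoreV1().Endpoints("openshift-cnv").Get(context.TODO(), "cdi-api", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	for _, s := range ep.Subsets {
		fmt.Printf("cdi-api endpoints: %d ready, %d not ready\n", len(s.Addresses), len(s.NotReadyAddresses))
	}
}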


Further steps on this same environment:
A RHEL-7 VM was imported from RHV to CNV using the default storage class, which is standard (default) - kubernetes.io/cinder.
The VM import seemed stuck in the UI at 10% progress.
The UI gave NO indication of why it was not progressing.

There were errors in the vm-import-controller log like: "ovirt client panicked: runtime error: invalid memory address or nil pointer dereference" 
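
For illustration only (this is not the change from the linked PR), here is a minimal Go sketch of a debug wrapper that would turn such a panic into a logged stack trace plus an error, so the controller log carries more context; recovering at the call boundary keeps the reconcile loop alive while still surfacing the failure:

package main

import (
	"fmt"
	"log"
	"runtime/debug"
)

// withOvirtDebug recovers a panic raised inside an ovirt client call and logs
// the stack trace, so a nil-pointer panic like the one above leaves enough
// context in the controller log to debug. Illustrative sketch only.
func withOvirtDebug(op string, fn func() error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			log.Printf("ovirt client panicked during %s: %v\n%s", op, r, debug.Stack())
			err = fmt.Errorf("ovirt client panicked during %s: %v", op, r)
		}
	}()
	return fn()
}

func main() {
	// Simulated client call that dereferences a nil pointer.
	err := withOvirtDebug("list VM disks", func() error {
		var disks *[]string
		fmt.Println(len(*disks)) // panics: invalid memory address or nil pointer dereference
		return nil
	})
	log.Printf("import step finished with: %v", err)
}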

I then tried to import another RHEL-7 VM to NFS storage. That VM import finished successfully.
I then deleted this VM and tried to import it again to NFS storage. That failed with an import error related to ovirt.
I think the error shown in the UI was something like: "invalid memory address or nil pointer dereference".

I removed this VM import resource and tried again to run a VM import from RHV, but that gets stuck at the
"Checking RHV API credentials" stage forever.
That is, it is no longer possible to do a VM import from RHV.

Checking VM import from the VMware provider, which worked before the above steps:
For both existing and new VMware providers,
it is "stuck" at the "Checking vCenter credentials" stage.

Version-Release number of selected component (if applicable):
CNV-2.4 from July 9 2020.


Additional info:
standard (default) kubernetes.io/cinder:
This comes by default with OCP 4.4+.
It should provision volumes on top of OpenStack Platform (RHOSP) Cinder.
In our case it is not functional because it requires OpenStack configuration, which we do not do at the moment.

Comment 1 Ilanit Stein 2020-07-12 20:04:56 UTC
Created attachment 1700746 [details]
vm-import-controller.log

Comment 2 Piotr Kliczewski 2020-07-13 08:01:26 UTC
Ilanit, there is not enough information in the provided log to understand why the client panicked. Please use quay.io/pkliczewski/vm-import-controller:latest, which has an updated debug statement, to reproduce the issue.

Comment 3 Piotr Kliczewski 2020-07-15 11:05:07 UTC
@Ilanit, recently we have seen the DataVolumeCreationFailed failure occur more often. It does not seem to be related to cinder.

Comment 4 Piotr Kliczewski 2020-07-16 16:17:29 UTC
Based on the known issue with naming length limits, this seems to be related to BZ #1857165.

Here is the reason: "error occurred: failed calling webhook "datavolume-mutate.cdi.kubevirt.io"

Comment 5 Ilanit Stein 2020-07-16 21:03:36 UTC
@Piotr,
I had to redeploy, as this PSI environment became unusable and inaccessible.

I will try to reproduce the issues mentioned in the bug description.

Testing "standard" storage VM import solely show that is not actually reaching the copy disk stage.
Importer pod is pending PVC bound forever.

Comment 6 Piotr Kliczewski 2020-07-17 10:25:36 UTC
I understand that the flow is different, but the issue should be fixed by the behaviour described in BZ #1857784.

Comment 7 Ondra Machacek 2020-07-22 16:14:27 UTC
As part of this bug we have improved the error message in the log, so it is clear to the user that the RHV instance is down and we cannot connect.
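
A minimal Go sketch of the kind of pre-flight check and clearer error described above, using the go-ovirt SDK; the URL, credentials, and the Test() connectivity helper are assumptions for illustration, not the actual code from the PR:

package main

import (
	"fmt"
	"log"

	ovirtsdk "github.com/ovirt/go-ovirt"
)

// checkEngine builds an oVirt API connection and turns a low-level failure
// into an explicit "engine is down or unreachable" message. The Test() call
// and all values used below are assumptions for illustration only.
func checkEngine(apiURL, user, password string) error {
	conn, err := ovirtsdk.NewConnectionBuilder().
		URL(apiURL).
		Username(user).
		Password(password).
		Insecure(true). // lab setup only; use a CA bundle in real deployments
		Build()
	if err != nil {
		return fmt.Errorf("cannot create connection to RHV engine %q: %w", apiURL, err)
	}
	defer conn.Close()

	if err := conn.Test(); err != nil {
		return fmt.Errorf("RHV engine %q is down or unreachable, or the credentials are invalid: %w", apiURL, err)
	}
	return nil
}

func main() {
	// Placeholder URL and credentials for illustration only.
	if err := checkEngine("https://engine.example.com/ovirt-engine/api", "admin@internal", "secret"); err != nil {
		log.Printf("vm-import pre-flight check failed: %v", err)
	}
}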

Other issues in this bug are related to bug 1857784 and they should be verified there.

Comment 8 Nelly Credi 2020-08-10 08:08:11 UTC
Please add fixed in version

Comment 9 Ilanit Stein 2020-10-01 14:48:32 UTC
Tried to verify on CNV-2.5 from osbs on Sep 30, 2020.

Ovirt panic was not reproduced.

Moving to verified on the basis that the debug messages were added; if this ovirt panic reproduces,
hopefully we will have more debug information.

