1567801 – vGPU: running VM with mdev_type hook switched to pause mode after host upgrade and cannot be run anymore.

Bug 1567801 - vGPU: running VM with mdev_type hook switched to pause mode after host upgrade and cannot be run anymore.

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	ovirt-engine
Classification:	oVirt
Component:	BLL.Virt
Sub Component:
Version:	4.2.2.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	ovirt-4.2.3
Target Release:	---
Assignee:	Martin Polednik
QA Contact:	Nisim Simsolo
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2018-04-16 08:53 UTC by Nisim Simsolo
Modified:	2018-05-10 06:28 UTC
CC List:	3 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2018-05-10 06:28:48 UTC
oVirt Team:	Virt
Embargoed:
Dependent Products:
Flags:	rule-engine: ovirt-4.2+ rule-engine: blocker+

Attachments
engine.log (650.37 KB, text/plain) 2018-04-16 08:55 UTC, Nisim Simsolo	no flags
vdsm.log (4.45 MB, text/plain) 2018-04-16 08:56 UTC, Nisim Simsolo	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
oVirt gerrit	90454	0	master	MERGED	devices: ignore hostdevs we can't handle in recovery	2020-12-18 07:31:52 UTC
oVirt gerrit	90456	0	ovirt-4.2	MERGED	devices: ignore hostdevs we can't handle in recovery	2020-12-18 07:32:22 UTC

Description Nisim Simsolo 2018-04-16 08:53:50 UTC

Description of problem:
Several issues were observed on running VMs with the mdev_type hook after a host upgrade:
1. The VMs' status switched to paused.
2. The VM failed to resume, with the following engine.log error:

2018-04-16 11:32:36,935+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ResumeBrokerVDSCommand] (EE-ManagedThreadFactory-engine-Thread-24129) [353f0f5b-dd2f-4e7f-ad80-8d4b73dab766] Failed in 'ResumeBrokerVDS' method

3. After being powered off, the VM failed to run, with the following engine.log error:

2018-04-16 11:35:44,873+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-7) [] EVENT_ID: VM_DOWN_ERROR(119), VM vGPU_RHEL7_03 is down with error. Exit message: Wake up from hibernation failed:Virtual machine already exists.

4. The VM is still running on the host although it is shown as powered off in Webadmin.
5. Nvidia instances are still attached to the GPU and are not cleared after powering off the VM.

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.3-0.1.el7
libvirt-client-3.9.0-14.el7_5.2.x86_64
sanlock-3.6.0-1.el7.x86_64
vdsm-4.20.25-1.el7ev.x86_64
qemu-kvm-rhev-2.10.0-21.el7_5.2.x86_64
NVIDIA-Linux-x86_64-390.21-vgpu-kvm
kernel-3.10.0-851.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Run a VM with the mdev_type hook.
2. Restart the vdsm service.

Actual results:
1. The VM status changed to paused.
2. The VM cannot be resumed.
3. The VM failed to run after being powered off.
4. The VM remains in running state on the host.
5. The Nvidia instance is still attached to the GPU.

Expected results:
The VM should continue to run; if it does switch to paused state, it should be possible to resume it.

Additional info:
vdsm.log and engine.log are attached.

Comment 1 Nisim Simsolo 2018-04-16 08:55:54 UTC

Created attachment 1422389
engine.log

Comment 2 Nisim Simsolo 2018-04-16 08:56:23 UTC

Created attachment 1422390
vdsm.log

Comment 3 Michal Skrivanek 2018-04-17 05:22:12 UTC

(In reply to Nisim Simsolo from comment #0)
> Description of problem:
> Few issues observed on running VMs with mdev_type hook after host upgrade:

What upgrade?

> 1. VMs status switched to pause mode.
> 2. VM failed to resume with the next engine.log

What resume? How is that related to upgrade or reproduction steps below?

> 
> 2018-04-16 11:32:36,935+03 ERROR
> [org.ovirt.engine.core.vdsbroker.vdsbroker.ResumeBrokerVDSCommand]
> (EE-ManagedThreadFactory-engine-Thread-24129)
> [353f0f5b-dd2f-4e7f-ad80-8d4b73dab766] Failed in 'ResumeBrokerVDS' method
> 
> 3. VM failed to run after powering it off with the next engine.log 

Why is the resume here again, was it powered off or suspended?

> 
> 2018-04-16 11:35:44,873+03 ERROR
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (ForkJoinPool-1-worker-7) [] EVENT_ID: VM_DOWN_ERROR(119), VM vGPU_RHEL7_03
> is down with error. Exit message: Wake up from hibernation failed:Virtual
> machine already exists.
> 
> 4. VM state is running in host although it's powered off in Webadmin.

Checked how exactly?

> 5. Nvidia instances are still connected to GPU and not cleared after
> powering off VM.

Checked how exactly?

> Version-Release number of selected component (if applicable):
> ovirt-engine-4.2.3-0.1.el7
> libvirt-client-3.9.0-14.el7_5.2.x86_64
> sanlock-3.6.0-1.el7.x86_64
> vdsm-4.20.25-1.el7ev.x86_64
> qemu-kvm-rhev-2.10.0-21.el7_5.2.x86_64
> NVIDIA-Linux-x86_64-390.21-vgpu-kvm
> kernel-3.10.0-851.el7.x86_64
> 
> How reproducible:
> 100%
> 
> Steps to Reproduce:
> 1. Run VM with mdev_type hook
> 2. Restart vdsm service

Steps do not match what you say above
 
> Actual results:
> 1. VM status changed to pause mode
> 2. VM cannot be resume
> 3. VM failed to run after powering it off
> 4. VM remains in running state in host
> 5. Nvidia instance is still connected to GPU 
> 
> Expected results:
> VM should continue to run and if it switched to pause mode it should be able
> to resume.
> 
> Additional info:
> vdsm.log and engine.log attached

Comment 4 Nisim Simsolo 2018-04-17 08:10:29 UTC

(In reply to Michal Skrivanek from comment #3)
> (In reply to Nisim Simsolo from comment #0)
> > Description of problem:
> > Few issues observed on running VMs with mdev_type hook after host upgrade:
> 
> What upgrade?

I meant a host update from rhv-4.2.2-10 to rhv-4.2.3-1 (VDSM updated from vdsm-4.20.23-1.el7ev.x86_64 to vdsm-4.20.25-1.el7ev.x86_64).
> 
> > 1. VMs status switched to pause mode.
> > 2. VM failed to resume with the next engine.log
> 
> What resume? How is that related to upgrade or reproduction steps below?

The VM state switched from up to paused after the update, and the VM then failed to resume.
It's related because the VM cannot be run anymore.

> 
> > 
> > 2018-04-16 11:32:36,935+03 ERROR
> > [org.ovirt.engine.core.vdsbroker.vdsbroker.ResumeBrokerVDSCommand]
> > (EE-ManagedThreadFactory-engine-Thread-24129)
> > [353f0f5b-dd2f-4e7f-ad80-8d4b73dab766] Failed in 'ResumeBrokerVDS' method
> > 
> > 3. VM failed to run after powering it off with the next engine.log 
> 
> Why is the resume here again, was it powered off or suspended?

I powered it off because the resume failed. Running it from the powered-off state also failed.
> 
> > 
> > 2018-04-16 11:35:44,873+03 ERROR
> > [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> > (ForkJoinPool-1-worker-7) [] EVENT_ID: VM_DOWN_ERROR(119), VM vGPU_RHEL7_03
> > is down with error. Exit message: Wake up from hibernation failed:Virtual
> > machine already exists.
> > 
> > 4. VM state is running in host although it's powered off in Webadmin.
> 
> Checked how exactly?

# virsh -r list
# ps -aux | grep qemu
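
The comparison between the host and Webadmin can also be scripted. Below is a minimal Python sketch, assuming the usual tabular layout of `virsh list` output (an "Id / Name / State" header, a dashed separator, then one row per domain). The function names `parse_virsh_list` and `stale_domains` are hypothetical helpers, not part of VDSM or libvirt, and the list of VMs the engine reports as down has to be supplied by the caller.

```python
def parse_virsh_list(output):
    """Parse the text printed by `virsh list` into {domain_name: state}.

    Assumes the usual layout: a header row starting with "Id", a dashed
    separator line, then rows of "<id> <name> <state>".
    """
    domains = {}
    for line in output.splitlines():
        fields = line.split(None, 2)
        # Skip blank lines, the dashed separator and the header row.
        if len(fields) < 3 or fields[0] == "Id":
            continue
        _dom_id, name, state = fields
        domains[name] = state.strip()
    return domains


def stale_domains(engine_down_vms, virsh_output):
    """Return VMs the engine considers down that libvirt still reports as running."""
    states = parse_virsh_list(virsh_output)
    return [vm for vm in engine_down_vms if states.get(vm) == "running"]
```

The text passed in would be the output of the `virsh -r list` command above; any VM name returned by `stale_domains` corresponds to issue 4 in the description.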

> 
> > 5. Nvidia instances are still connected to GPU and not cleared after
> > powering off VM.
> 
> Checked how exactly?

# nvidia-smi

> 
> > Version-Release number of selected component (if applicable):
> > ovirt-engine-4.2.3-0.1.el7
> > libvirt-client-3.9.0-14.el7_5.2.x86_64
> > sanlock-3.6.0-1.el7.x86_64
> > vdsm-4.20.25-1.el7ev.x86_64
> > qemu-kvm-rhev-2.10.0-21.el7_5.2.x86_64
> > NVIDIA-Linux-x86_64-390.21-vgpu-kvm
> > kernel-3.10.0-851.el7.x86_64
> > 
> > How reproducible:
> > 100%
> > 
> > Steps to Reproduce:
> > 1. Run VM with mdev_type hook
> > 2. Restart vdsm service
> 
> Steps do not match what you say above

The same behavior is observed in both cases: host update or restarting the vdsm service.
>  
> > Actual results:
> > 1. VM status changed to pause mode
> > 2. VM cannot be resume
> > 3. VM failed to run after powering it off
> > 4. VM remains in running state in host
> > 5. Nvidia instance is still connected to GPU 
> > 
> > Expected results:
> > VM should continue to run and if it switched to pause mode it should be able
> > to resume.
> > 
> > Additional info:
> > vdsm.log and engine.log attached

Comment 5 Martin Polednik 2018-04-18 13:54:22 UTC

I can confirm this happens. More specifically, it is caused by VDSM trying to re-create a hostdev object from a device it does not know about (the mdev is known only to the hook, not to VDSM core).

The traceback can be seen in the VDSM log:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 872, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2777, in _run
    self._devices = self._make_devices()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2627, in _make_devices
    return self._make_devices_from_xml(disk_objs)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2715, in _make_devices_from_xml
    self.id, self.domain, self._md_desc, self.log
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vmdevices/common.py", line 219, in dev_map_from_domain_xml
    dev_obj = dev_class.from_xml_tree(log, dev_elem, dev_meta)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vmdevices/hostdevice.py", line 381, in from_xml_tree
    dev_name = _get_device_name(dev, dev_type)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vmdevices/hostdevice.py", line 407, in _get_device_name
    return device_name_from_address(dev_type, src_addr)
  File "/usr/lib/python2.7/site-packages/vdsm/common/hostdev.py", line 552, in device_name_from_address
    _format_address(address_type, device_address)
KeyError: 'mdev_uuid294fb789-98c7-3d30-8c92-4047252328a4'
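
The failure mode in this traceback, and the approach named in the linked gerrit patches ("devices: ignore hostdevs we can't handle in recovery"), can be illustrated with a minimal Python sketch. This is not the actual VDSM patch: `KNOWN_DEVICES`, `recover_host_devices` and the `skip_unknown` flag are hypothetical, and `device_name_from_address` here is a simplified stand-in that only reuses the name from the traceback.

```python
import logging

log = logging.getLogger("hostdev_recovery_sketch")

# Hypothetical stand-in for the device lookup table kept by VDSM core.
# The mdev created by the hook is deliberately absent, as in the bug.
KNOWN_DEVICES = {
    "pci_0000:05:00.0": "pci_0000_05_00_0",
}


def device_name_from_address(address_type, device_address):
    """Look up a device name; raises KeyError for devices core does not know."""
    return KNOWN_DEVICES[address_type + device_address]


def recover_host_devices(addresses, skip_unknown=True):
    """Rebuild device names from (address_type, device_address) pairs.

    With skip_unknown=False an unknown device aborts recovery, which mirrors
    the traceback. With skip_unknown=True it is logged and ignored, so the
    remaining devices (and the VM) can still be recovered.
    """
    recovered = []
    for address_type, device_address in addresses:
        try:
            recovered.append(
                device_name_from_address(address_type, device_address))
        except KeyError:
            if not skip_unknown:
                raise
            log.warning("ignoring unknown host device %s%s",
                        address_type, device_address)
    return recovered
```

With an input containing one known PCI device and the hook-created mdev, the tolerant mode returns only the PCI device name, while the strict mode raises the same `KeyError` key as in the log.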

Comment 6 Nisim Simsolo 2018-04-29 10:51:33 UTC

Verified:
rhvm-4.2.3.3-0.1
vdsm-4.20.27-1.el7ev.x86_64
libvirt-client-3.9.0-14.el7_5.2.x86_64
qemu-kvm-rhev-2.10.0-21.el7_5.2.x86_64
kernel-3.10.0-799.el7.x86_64
NVIDIA-Linux-x86_64-390.21 
GPU type: Tesla M60

Verification scenario:
1. Run some VMs with the mdev_type hook. Open a console on these VMs.
2. Restart the vdsm service.
Verify the VMs continue to run, the consoles remain open and the VMs in Webadmin remain in "up" state.
3. Reboot the host.
Webadmin: verify the VMs' state switched to "down".
4. After the host has rebooted, verify all VMs are still down. Run the VMs.
Verify the VMs are running properly with the mdev_type hook.

Comment 7 Sandro Bonazzola 2018-05-10 06:28:48 UTC

This bugzilla is included in the oVirt 4.2.3 release, published on May 4th 2018.

Since the problem described in this bug report should be resolved in the oVirt 4.2.3 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.
