1272742 – VFIO: VM with attached GPU is powered off when trying to hotplug increase memory of VM.

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	ovirt-engine
Classification:	oVirt
Component:	BLL.Virt
Sub Component:
Version:	3.6.0.1
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	urgent
Target Milestone:	ovirt-3.6.2
Target Release:	3.6.2
Assignee:	Martin Polednik
QA Contact:	Nisim Simsolo
Docs Contact:
URL:
Whiteboard:
Depends On:	1273491 1284775 1305498
Blocks:

Reported:	2015-10-18 11:44 UTC by Nisim Simsolo
Modified:	2016-08-01 14:02 UTC
CC List:	11 users
Fixed In Version:
Clone Of:
Clones:	1273491
Environment:
Last Closed:	2016-02-18 11:06:54 UTC
oVirt Team:	Virt
Embargoed:
Dependent Products:
Flags:	rule-engine: ovirt-3.6.z+ rule-engine: blocker+ mgoldboi: planning_ack+ michal.skrivanek: devel_ack+ mavital: testing_ack+

Attachments
engine.log (6.76 MB, text/plain) 2015-10-18 11:45 UTC, Nisim Simsolo
vdsm.log.1.xz (655.73 KB, application/x-xz) 2015-10-18 11:47 UTC, Nisim Simsolo
vdsm.log (44.47 KB, text/plain) 2015-10-18 11:47 UTC, Nisim Simsolo
qemu log (55.55 KB, text/plain) 2015-10-19 14:53 UTC, Nisim Simsolo
libvirt.xml of win2012_intel VM (3.72 KB, text/plain) 2015-10-19 14:54 UTC, Nisim Simsolo
dmesg before memory hotplug (104.61 KB, text/plain) 2015-10-20 11:39 UTC, Nisim Simsolo
dmesg after memory hotplug (106.49 KB, text/plain) 2015-10-20 11:40 UTC, Nisim Simsolo

Description Nisim Simsolo 2015-10-18 11:44:34 UTC

Description of problem:
Notes:
- This bug occurs only with GPU passthrough (it does not occur with other host devices attached to the VM).
- It occurs only with memory hotplug (it does not occur when hot-plugging additional CPUs into the VM).
- It occurs on both Linux and Windows VMs.

When trying to increase the memory of a running VM, the VM is powered off instead of the memory being increased through the hotplug mechanism.

Version-Release number of selected component (if applicable):
rhevm-3.6.0.1-0.1.el6
sanlock-3.2.4-1.el7.x86_64
vdsm-4.17.9-1.el7ev.noarch
qemu-kvm-rhev-2.3.0-31.el7.x86_64
libvirt-client-1.2.17-5.el7.x86_64

How reproducible:
Consistently

Steps to Reproduce:
1. Run VM with GPU attached.
2. From virtual machines tab > Edit VM > system > increase memory size.
3. Click OK (without checking the "apply later" checkbox).

Actual results:
VM powered off (Connection reset by peer)

Expected results:
VM memory should be increased via hotplug without powering off the VM.

Additional info:
- vmId (win7_intel): cfcdab4-415e-40d7-a19c-bdbc80a9172d
- Issue occurred at:
2015-10-18 13:54:06,596 ERROR [org.ovirt.engine.core.vdsbroker.SetAmountOfMemoryVDSCommand] (ajp-/127.0.0.1:8702-11) [515aecb4] Failed in 'SetAmountOfMemoryVDS' method
2015-10-18 13:54:06,605 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ajp-/127.0.0.1:8702-11) [515aecb4] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM intel-vfio.tlv.redhat.com command failed: Unable to read from monitor: Connection reset by peer

engine.log and vdsm.log attached.

Comment 1 Nisim Simsolo 2015-10-18 11:45:30 UTC

Created attachment 1084120 [details]
engine.log

Comment 2 Nisim Simsolo 2015-10-18 11:47:26 UTC

Created attachment 1084132 [details]
vdsm.log.1.xz

Comment 3 Nisim Simsolo 2015-10-18 11:47:45 UTC

Created attachment 1084135 [details]
vdsm.log

Comment 4 Red Hat Bugzilla Rules Engine 2015-10-19 10:51:32 UTC

Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.

Comment 5 Martin Polednik 2015-10-19 12:19:00 UTC

There don't seem to be any pointers in the vdsm log to what might have happened (apart from the hotplug being reported as successful). Can you also provide us with the qemu log? Can you verify that the VM is dead by executing 'virsh -r list' on the hypervisor and looking up the VM's name and status?

Comment 6 Nisim Simsolo 2015-10-19 14:52:48 UTC

- 'virsh -r list' shows that the VM is no longer running after the issue is reproduced.
- The qemu log shows a memory allocation failure:
2015-10-19T14:48:43.304948Z vfio_dma_map(0x7f7c0c22b7a0, 0x140000000, 0x40000000, 0x7f7a87800000) = -12 (Cannot allocate memory)
qemu: hardware error: vfio: DMA mapping failed, unable to continue
- qemu log and libvirt XML attached.
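
As a reading aid for the qemu log line above, here is a minimal Python sketch that decodes its hex arguments. It assumes the four values printed by vfio_dma_map are (container, iova, size, vaddr) in that order, which this report does not state; the helper name parse_vfio_dma_map is made up for illustration.

```python
import errno
import re

GIB = 1 << 30

# Log line copied verbatim from the qemu log quoted in this comment.
LINE = (
    "2015-10-19T14:48:43.304948Z vfio_dma_map(0x7f7c0c22b7a0, 0x140000000, "
    "0x40000000, 0x7f7a87800000) = -12 (Cannot allocate memory)"
)

# Assumption: the four arguments are container, iova, size, vaddr.
PATTERN = re.compile(
    r"vfio_dma_map\((0x[0-9a-f]+), (0x[0-9a-f]+), (0x[0-9a-f]+), (0x[0-9a-f]+)\)"
    r" = (-?\d+)"
)


def parse_vfio_dma_map(line):
    """Hypothetical helper: return (iova, size, errno_number) for a failure line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("not a vfio_dma_map line")
    iova = int(match.group(2), 16)
    size = int(match.group(3), 16)
    # The log prints the negated errno, so flip the sign back.
    return iova, size, -int(match.group(5))


iova, size, err = parse_vfio_dma_map(LINE)
print(f"iova = {iova / GIB:g} GiB, size = {size / GIB:g} GiB, "
      f"errno = {errno.errorcode.get(err, err)}")
```

Under that assumption, the failing call is a 1 GiB mapping at guest address 5 GiB, and -12 is the negated ENOMEM, matching the "Cannot allocate memory" text in the log.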

Comment 7 Nisim Simsolo 2015-10-19 14:53:43 UTC

Created attachment 1084442 [details]
qemu log

Comment 8 Nisim Simsolo 2015-10-19 14:54:35 UTC

Created attachment 1084443 [details]
libvirt.xml of win2012_intel VM

Comment 9 Martin Polednik 2015-10-19 15:01:07 UTC

Alex, any idea what could be the cause? This happened both with maxMemory 4 TiB (our default) and 256 GiB. The VM works fine (with drivers blacklisted correctly as far as we know) until the hotplug is triggered.

Comment 10 Alex Williamson 2015-10-19 23:38:37 UTC

Please report dmesg for the host after this occurs.

Comment 11 Nisim Simsolo 2015-10-20 11:38:24 UTC

Attaching host dmesg from before and after reproducing the issue.

Comment 12 Nisim Simsolo 2015-10-20 11:39:06 UTC

Created attachment 1084695 [details]
dmesg before memory hotplug

Comment 13 Nisim Simsolo 2015-10-20 11:40:37 UTC

Created attachment 1084696 [details]
dmesg after memory hotplug

Comment 14 Alex Williamson 2015-10-20 12:59:16 UTC

The process locked memory rlimit is set to 5G, which I believe is what libvirt uses for a 4G VM, the initial memory size of the VM.  Therefore, the qemu-kvm process is not going to be able to lock more pages unless someone bumps the limit further.  The evidence is in dmesg:

[  599.043115] vfio_pin_pages: RLIMIT_MEMLOCK (5368709120) exceeded
[  599.043119] vfio_pin_pages: RLIMIT_MEMLOCK (5368709120) exceeded

This results in the -ENOMEM failure in vfio_dma_map.  On the vfio side, there is no reason this would be GPU specific; it should happen for any *assigned* device (I emphasize assigned because RHEV treats things like USB passthrough the same as PCI device assignment).  Has anyone ever tested whether libvirt increases the process locked memory limits when there's an assigned device and memory is hot-added?
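
To make the arithmetic behind the rlimit explanation concrete, here is a small Python sketch. The helper names memlock_headroom and memlock_limit are hypothetical; the 4 GiB figure is only the initial VM memory size mentioned in this comment, not a measured amount of pinned memory. resource.prlimit is Linux-only.

```python
import resource

GIB = 1 << 30

# Limit reported by vfio_pin_pages in the dmesg excerpt above.
LIMIT_FROM_DMESG = 5368709120


def memlock_headroom(limit_bytes, pinned_bytes):
    """Hypothetical helper: bytes that can still be pinned before hitting the limit."""
    if limit_bytes == resource.RLIM_INFINITY:
        return float("inf")
    return max(limit_bytes - pinned_bytes, 0)


def memlock_limit(pid):
    """Return the (soft, hard) RLIMIT_MEMLOCK of a running process (Linux only)."""
    return resource.prlimit(pid, resource.RLIMIT_MEMLOCK)


print(LIMIT_FROM_DMESG / GIB)                             # 5.0
print(memlock_headroom(LIMIT_FROM_DMESG, 4 * GIB) / GIB)  # 1.0
```

With only the 4 GiB of initial guest RAM pinned, exactly 1 GiB of headroom would remain, so any further pinned pages on top of a hot-added 1 GiB would exceed the limit. The report does not say how much memory was actually pinned. On the host, calling memlock_limit with the PID of the qemu-kvm process would show the limit currently in effect.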

Comment 15 Laine Stump 2015-10-20 15:03:07 UTC

(In reply to Alex Williamson from comment #14)
> Has anyone ever tested
> whether libvirt increases the process locked memory limits when there's an
> assigned device and memory is hot-added?

I haven't tested it, but I can see from examining the code that the only place libvirt sets the max locked memory limit is when an assigned PCI device is hotplugged.

I think this bug should be re-assigned to libvirt, but I'm not sure which product's libvirt component to assign it to.

Comment 16 Yaniv Lavi 2015-10-29 12:45:05 UTC

In oVirt, testing is done on a single release by default. Therefore I'm removing the 4.0 flag. If you think this bug must be tested in 4.0 as well, please re-add the flag. Please note we might not have testing resources to handle the 4.0 clone.

Comment 17 Michal Skrivanek 2015-11-05 18:03:41 UTC

Pending resolution of bug 1273491. We expect that after 7.2 GA, so we will want to bump the required libvirt version in the vdsm spec.

Comment 18 Martin Polednik 2015-11-11 11:11:44 UTC

The workaround for the time being is to disable memory hotplug in engine-config.

Comment 21 Michal Skrivanek 2015-12-09 12:49:10 UTC

moving to ON_QA based on https://bugzilla.redhat.com/show_bug.cgi?id=1280420#c9

Comment 22 Red Hat Bugzilla Rules Engine 2015-12-09 12:59:18 UTC

Bug tickets that are moved to testing must have the target release set so that the tester knows what to test. Please set the correct target release before moving to ON_QA.

Comment 23 Nisim Simsolo 2015-12-14 14:09:48 UTC

Verified: 
rhevm-3.6.1.2-0.1 
libvirt-client-1.2.17-13.el7_2.2.x86_64
qemu-kvm-rhev-2.3.0-31.el7_2.4.x86_64
vdsm-4.17.13-1.el7ev.noarch

Scenario:
(verified using RHEL and Windows 8 VMs on AMD- and Intel-based hosts)
1. Run the VM.
2. Increase memory via hotplug.
3. Verify that the VM continues to run and that its memory has increased properly.
