Bug 1272742 - VFIO: VM with attached GPU is powered off when trying to hotplug increase memory of VM.
Status: CLOSED CURRENTRELEASE
Product: ovirt-engine
Classification: oVirt
Component: BLL.Virt
Version: 3.6.0.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ovirt-3.6.2
Target Release: 3.6.2
Assigned To: Martin Polednik
QA Contact: Nisim Simsolo
Depends On: 1284775 1273491 1305498
Blocks:
Reported: 2015-10-18 07:44 EDT by Nisim Simsolo
Modified: 2016-08-01 10:02 EDT

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
When using Virtual Function I/O (VFIO) passthrough devices, the memory lock limit failed to be modified during a memory hot-plug operation. As a consequence, the guest virtual machine terminated unexpectedly. Now, the memory lock limit modification is performed before the memory hot-plug, and the described crash no longer occurs.
Story Points: ---
Clone Of:
Clones: 1273491
Environment:
Last Closed: 2016-02-18 06:06:54 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
rule-engine: ovirt-3.6.z+
rule-engine: blocker+
mgoldboi: planning_ack+
michal.skrivanek: devel_ack+
mavital: testing_ack+


Attachments
engine.log (6.76 MB, text/plain)
2015-10-18 07:45 EDT, Nisim Simsolo
vdsm.log.1.xz (655.73 KB, application/x-xz)
2015-10-18 07:47 EDT, Nisim Simsolo
vdsm.log (44.47 KB, text/plain)
2015-10-18 07:47 EDT, Nisim Simsolo
qemu log (55.55 KB, text/plain)
2015-10-19 10:53 EDT, Nisim Simsolo
libvirt.xml of win2012_intel VM (3.72 KB, text/plain)
2015-10-19 10:54 EDT, Nisim Simsolo
dmesg before memory hotplug (104.61 KB, text/plain)
2015-10-20 07:39 EDT, Nisim Simsolo
dmesg after memory hotplug (106.49 KB, text/plain)
2015-10-20 07:40 EDT, Nisim Simsolo

Description Nisim Simsolo 2015-10-18 07:44:34 EDT
Description of problem:
Notes:
- This bug is relevant only to GPU passthrough (it does not occur with other host devices attached to the VM).
- The bug is relevant only to memory hotplug (it does not occur when hotplugging additional VM CPUs).
- The bug occurs on both Linux and Windows VMs.

When trying to increase the memory of a running VM, the VM is powered off instead of having its memory increased through the hotplug mechanism.

Version-Release number of selected component (if applicable):
rhevm-3.6.0.1-0.1.el6
sanlock-3.2.4-1.el7.x86_64
vdsm-4.17.9-1.el7ev.noarch
qemu-kvm-rhev-2.3.0-31.el7.x86_64
libvirt-client-1.2.17-5.el7.x86_64

How reproducible:
Consistently

Steps to Reproduce:
1. Run VM with GPU attached.
2. From virtual machines tab > Edit VM > system > increase memory size.
3. Click OK (without checking the "apply later" checkbox).

Actual results:
VM powered off (Connection reset by peer)

Expected results:
VM memory should be increased via hotplug without powering off the VM.

Additional info:
- vmId (win7_intel): cfcdab4-415e-40d7-a19c-bdbc80a9172d'
- Issue occurred at: 2015-10-18 13:54:06,596 ERROR [org.ovirt.engine.core.vdsbroker.SetAmountOfMemoryVDSCommand] (ajp-/127.0.0.1:8702-11) [515aecb4] Failed in 'SetAmountOfMemoryVDS' method
2015-10-18 13:54:06,605 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ajp-/127.0.0.1:8702-11) [515aecb4] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message:
VDSM intel-vfio.tlv.redhat.com command failed: Unable to read from monitor: 
Connection reset by peer

engine.log and vdsm.log attached.
Comment 1 Nisim Simsolo 2015-10-18 07:45 EDT
Created attachment 1084120 [details]
engine.log
Comment 2 Nisim Simsolo 2015-10-18 07:47 EDT
Created attachment 1084132 [details]
vdsm.log.1.xz
Comment 3 Nisim Simsolo 2015-10-18 07:47 EDT
Created attachment 1084135 [details]
vdsm.log
Comment 4 Red Hat Bugzilla Rules Engine 2015-10-19 06:51:32 EDT
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.
Comment 5 Martin Polednik 2015-10-19 08:19:00 EDT
There don't seem to be any pointers in the vdsm log to what might have happened (apart from the hotplug being successful). Can you also provide us with the qemu log? Can you verify that the VM is dead by executing 'virsh -r list' and looking up the VM's name and status on the hypervisor?
Comment 6 Nisim Simsolo 2015-10-19 10:52:48 EDT
- 'virsh -r list' shows that the VM is not running after the issue is reproduced.
- Observing qemu log shows that there is an issue with memory allocation: 
2015-10-19T14:48:43.304948Z vfio_dma_map(0x7f7c0c22b7a0, 0x140000000, 0x40000000, 0x7f7a87800000) = -12 (Cannot allocate memory)
qemu: hardware error: vfio: DMA mapping failed, unable to continue
- qemu log and libvirt XML attached.
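
For reference, the failing call in the qemu log line above can be decoded with a short script. This is only a sketch: it assumes the four values printed by vfio_dma_map are, in order, the container pointer, the guest I/O address (IOVA), the mapping size, and the host virtual address. The helper name decode_vfio_dma_map is made up for illustration.

```python
import errno
import re

GIB = 1024 ** 3

# The line reported in the qemu log in this comment.
LOG_LINE = (
    "2015-10-19T14:48:43.304948Z vfio_dma_map(0x7f7c0c22b7a0, 0x140000000, "
    "0x40000000, 0x7f7a87800000) = -12 (Cannot allocate memory)"
)

PATTERN = re.compile(
    r"vfio_dma_map\((0x[0-9a-fA-F]+), (0x[0-9a-fA-F]+), "
    r"(0x[0-9a-fA-F]+), (0x[0-9a-fA-F]+)\) = (-?\d+)"
)


def decode_vfio_dma_map(line):
    """Split a vfio_dma_map error line into numeric fields (hypothetical helper).

    Assumption: the argument order is container, iova, size, vaddr.
    """
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("not a vfio_dma_map line")
    container, iova, size, vaddr = (int(g, 16) for g in match.groups()[:4])
    return {
        "container": container,
        "iova": iova,
        "size": size,
        "vaddr": vaddr,
        "ret": int(match.group(5)),
    }


info = decode_vfio_dma_map(LOG_LINE)
print("iova: %.2f GiB" % (info["iova"] / GIB))
print("size: %.2f GiB" % (info["size"] / GIB))
print("error: %s" % errno.errorcode.get(-info["ret"], "unknown"))
```

Under that assumption, the rejected mapping is a 1 GiB region at guest address 5 GiB, and the return value -12 corresponds to ENOMEM.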
Comment 7 Nisim Simsolo 2015-10-19 10:53 EDT
Created attachment 1084442 [details]
qemu log
Comment 8 Nisim Simsolo 2015-10-19 10:54 EDT
Created attachment 1084443 [details]
libvirt.xml of win2012_intel VM
Comment 9 Martin Polednik 2015-10-19 11:01:07 EDT
Alex, any idea what could be the cause? This happened both with maxMemory 4 TiB (our default) and 256 GiB. The VM works fine (with drivers blacklisted correctly as far as we know) until the hotplug is triggered.
Comment 10 Alex Williamson 2015-10-19 19:38:37 EDT
Please report dmesg for the host after this occurs
Comment 11 Nisim Simsolo 2015-10-20 07:38:24 EDT
Attaching host dmesg from before and after the issue was reproduced.
Comment 12 Nisim Simsolo 2015-10-20 07:39 EDT
Created attachment 1084695 [details]
dmesg before memory hotplug
Comment 13 Nisim Simsolo 2015-10-20 07:40 EDT
Created attachment 1084696 [details]
dmesg after memory hotplug
Comment 14 Alex Williamson 2015-10-20 08:59:16 EDT
The process locked memory rlimit is set to 5G, which I believe is what libvirt uses for a 4G VM, the initial memory size of the VM.  Therefore, the qemu-kvm process is not going to be able to lock more pages unless someone bumps the limit further.  The evidence is in dmesg:

[  599.043115] vfio_pin_pages: RLIMIT_MEMLOCK (5368709120) exceeded
[  599.043119] vfio_pin_pages: RLIMIT_MEMLOCK (5368709120) exceeded

This results in the -ENOMEM failure in vfio_dma_map.  On the vfio side, there is no reason this would be GPU specific; it should happen for any *assigned* device (I emphasize assigned because RHEV treats things like USB passthrough the same as PCI device assignment).  Has anyone ever tested whether libvirt increases the process locked memory limits when there's an assigned device and memory is hot-added?
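
One way to confirm the limit described in this comment on a live host is to read the "Max locked memory" row of /proc/<pid>/limits for the qemu-kvm process. The following is a minimal sketch, assuming a Linux /proc filesystem. The function names are hypothetical, and the sample text is illustrative only: it is built from the 5368709120 value in the dmesg lines above, not copied from the affected host.

```python
GIB = 1024 ** 3


def max_locked_memory(limits_text):
    """Return (soft, hard) 'Max locked memory' limits in bytes.

    A value of None means the limit is reported as 'unlimited'.
    """
    label = "Max locked memory"

    def to_bytes(value):
        return None if value == "unlimited" else int(value)

    for line in limits_text.splitlines():
        if line.startswith(label):
            fields = line[len(label):].split()
            return to_bytes(fields[0]), to_bytes(fields[1])
    raise ValueError("no 'Max locked memory' line found")


def read_limits(pid):
    """Read the raw limits table of a process (Linux only)."""
    with open("/proc/%d/limits" % pid) as handle:
        return handle.read()


# Illustrative sample; on a host one would pass read_limits(<qemu-kvm pid>).
SAMPLE = (
    "Limit                     Soft Limit           Hard Limit           Units\n"
    "Max locked memory         5368709120           5368709120           bytes\n"
)

soft, hard = max_locked_memory(SAMPLE)
print("soft limit: %.1f GiB" % (soft / GIB))
```

With the sample value, the soft limit works out to 5.0 GiB, which matches the 5G figure quoted in this comment.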
Comment 15 Laine Stump 2015-10-20 11:03:07 EDT
(In reply to Alex Williamson from comment #14)
> Has anyone ever tested
> whether libvirt increases the process locked memory limits when there's an
> assigned device and memory is hot-added?

I haven't tested it, but I can see from examining the code that the only place libvirt sets the max locked memory limit is when an assigned PCI device is hotplugged.

I think this bug should be re-assigned to libvirt, but I'm not sure which product's libvirt component to assign it to.
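
The ordering that the Doc Text of this bug describes (adjust the memory lock limit first, then hot-plug the memory) can be sketched as follows. This is not libvirt's code. It assumes the limit is the guest memory size plus 1 GiB, which is only inferred from the 5G-for-a-4G-VM observation in comment 14. All function names are hypothetical, and do_hotplug stands in for whatever actually performs the hot-add.

```python
import resource

GIB = 1024 ** 3


def required_memlock_bytes(guest_memory_bytes, overhead_bytes=GIB):
    """Assumed sizing rule: guest memory plus 1 GiB of headroom."""
    return guest_memory_bytes + overhead_bytes


def set_memlock(pid, limit_bytes):
    """Set soft and hard RLIMIT_MEMLOCK of a process.

    resource.prlimit is Linux-only, and raising the limit of another
    process normally requires elevated privileges.
    """
    resource.prlimit(pid, resource.RLIMIT_MEMLOCK, (limit_bytes, limit_bytes))


def hotplug_with_memlock(pid, new_total_bytes, do_hotplug, set_limit=set_memlock):
    """Raise the lock limit for the new memory size, then run the hot-plug."""
    set_limit(pid, required_memlock_bytes(new_total_bytes))
    return do_hotplug()


# A 4 GiB guest under the assumed rule:
print(required_memlock_bytes(4 * GIB))  # 5368709120
```

The set_limit parameter exists only so the ordering can be exercised without privileges, for example by passing a function that records its arguments.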
Comment 16 Yaniv Lavi (Dary) 2015-10-29 08:45:05 EDT
In oVirt testing is done on single release by default. Therefore I'm removing the 4.0 flag. If you think this bug must be tested in 4.0 as well, please re-add the flag. Please note we might not have testing resources to handle the 4.0 clone.
Comment 17 Michal Skrivanek 2015-11-05 13:03:41 EST
Pending resolution of bug 1273491. That is expected after 7.2 GA, so we will want to bump the libvirt version in the vdsm spec.
Comment 18 Martin Polednik 2015-11-11 06:11:44 EST
The workaround for the time being is to disable memory hotplug in engine-config.
Comment 21 Michal Skrivanek 2015-12-09 07:49:10 EST
moving to ON_QA based on https://bugzilla.redhat.com/show_bug.cgi?id=1280420#c9
Comment 22 Red Hat Bugzilla Rules Engine 2015-12-09 07:59:18 EST
Bug tickets that are moved to testing must have target release set to make sure tester knows what to test. Please set the correct target release before moving to ON_QA.
Comment 23 Nisim Simsolo 2015-12-14 09:09:48 EST
Verified: 
rhevm-3.6.1.2-0.1 
libvirt-client-1.2.17-13.el7_2.2.x86_64
qemu-kvm-rhev-2.3.0-31.el7_2.4.x86_64
vdsm-4.17.13-1.el7ev.noarch

Scenario:
(verified using RHEL and Windows 8 VMs on AMD and Intel based hosts)
1. Run VM
2. Hotplug increase memory
3. Verify VM continues to run and memory increased properly.
