Bug 1272742

Summary: VFIO: VM with attached GPU is powered off when trying to increase VM memory via hotplug.
Product: [oVirt] ovirt-engine
Reporter: Nisim Simsolo <nsimsolo>
Component: BLL.Virt
Assignee: Martin Polednik <mpoledni>
Status: CLOSED CURRENTRELEASE
QA Contact: Nisim Simsolo <nsimsolo>
Severity: urgent
Docs Contact:
Priority: high
Version: 3.6.0.1
CC: alex.williamson, bugs, gklein, hannsj_uhl, jherrman, laine, mavital, mgoldboi, michal.skrivanek, mpoledni, nsimsolo
Target Milestone: ovirt-3.6.2
Flags: rule-engine: ovirt-3.6.z+
       rule-engine: blocker+
       mgoldboi: planning_ack+
       michal.skrivanek: devel_ack+
       mavital: testing_ack+
Target Release: 3.6.2
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
When using Virtual Function I/O (VFIO) passthrough devices, the memory lock limit failed to be modified during a memory hot-plug operation. As a consequence, the guest virtual machine terminated unexpectedly. Now, the memory lock limit modification is performed before the memory hot-plug, and the described crash no longer occurs.
Story Points: ---
Clone Of:
: 1273491 (view as bug list)
Environment:
Last Closed: 2016-02-18 11:06:54 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1273491, 1284775, 1305498
Bug Blocks:
Attachments:
engine.log
vdsm.log.1.xz
vdsm.log
qemu log
libvirt.xml of win2012_intel VM
dmesg before memory hotplug
dmesg after memory hotplug

Description Nisim Simsolo 2015-10-18 11:44:34 UTC
Description of problem:
Notes:
- This bug is relevant only for GPU passthrough (it does not occur with other host devices attached to the VM).
- It is relevant only for memory hotplug (it does not occur when hot-plugging additional VM CPUs).
- It occurs on both Linux and Windows VMs.

When trying to increase the memory of a running VM, the VM is powered off instead of having its memory increased via the hotplug mechanism.

Version-Release number of selected component (if applicable):
rhevm-3.6.0.1-0.1.el6
sanlock-3.2.4-1.el7.x86_64
vdsm-4.17.9-1.el7ev.noarch
qemu-kvm-rhev-2.3.0-31.el7.x86_64
libvirt-client-1.2.17-5.el7.x86_64

How reproducible:
Consistently

Steps to Reproduce:
1. Run a VM with a GPU attached.
2. From the Virtual Machines tab > Edit VM > System > increase the memory size (a libvirt-level sketch of this operation follows below).
3. Click OK (without checking the "apply later" checkbox).
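
For reference, a rough libvirt-level approximation of the hotplug from step 2, useful when reproducing outside the engine UI. This is only a sketch: the domain name 'win7_intel' and the 1 GiB DIMM size are illustrative, and oVirt normally drives this through VDSM rather than virsh.

# Sketch: hot-plug a 1 GiB DIMM into a running domain directly via libvirt.
cat > /tmp/dimm.xml <<'EOF'
<memory model='dimm'>
  <target>
    <size unit='MiB'>1024</size>
    <node>0</node>
  </target>
</memory>
EOF
virsh attach-device win7_intel /tmp/dimm.xml --live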

Actual results:
The VM is powered off (Connection reset by peer).

Expected results:
VM memory should be increased via hotplug without powering off the VM.

Additional info:
- vmId (win7_intel): cfcdab4-415e-40d7-a19c-bdbc80a9172d
- Issue occurred at: 2015-10-18 13:54:06,596 ERROR [org.ovirt.engine.core.vdsbroker.SetAmountOfMemoryVDSCommand] (ajp-/127.0.0.1:8702-11) [515aecb4] Failed in 'SetAmountOfMemoryVDS' method
2015-10-18 13:54:06,605 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ajp-/127.0.0.1:8702-11) [515aecb4] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message:
VDSM intel-vfio.tlv.redhat.com command failed: Unable to read from monitor: 
Connection reset by peer

engine.log and vdsm.log attached.

Comment 1 Nisim Simsolo 2015-10-18 11:45:30 UTC
Created attachment 1084120 [details]
engine.log

Comment 2 Nisim Simsolo 2015-10-18 11:47:26 UTC
Created attachment 1084132 [details]
vdsm.log.1.xz

Comment 3 Nisim Simsolo 2015-10-18 11:47:45 UTC
Created attachment 1084135 [details]
vdsm.log

Comment 4 Red Hat Bugzilla Rules Engine 2015-10-19 10:51:32 UTC
Target release should be set once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.

Comment 5 Martin Polednik 2015-10-19 12:19:00 UTC
There don't seem to be any pointers in the vdsm log to what might have happened (apart from the hotplug being reported as successful). Can you also provide us with the qemu log? Can you verify that the VM is dead by executing 'virsh -r list' and looking up the VM's name and status on the hypervisor?
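
For completeness, a minimal sketch of that check on the hypervisor; the '-r' flag opens a read-only libvirt connection, and the VM name used in the grep is only illustrative:

# list all domains, including shut-off ones, over a read-only connection
virsh -r list --all
# optionally narrow the output to the VM in question (name is illustrative)
virsh -r list --all | grep -i win7_intel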

Comment 6 Nisim Simsolo 2015-10-19 14:52:48 UTC
- 'virsh -r list' shows that the VM is not running after the issue is reproduced.
- The qemu log shows that there is an issue with memory allocation:
2015-10-19T14:48:43.304948Z vfio_dma_map(0x7f7c0c22b7a0, 0x140000000, 0x40000000, 0x7f7a87800000) = -12 (Cannot allocate memory)
qemu: hardware error: vfio: DMA mapping failed, unable to continue
- qemu log and libvirt XML attached.

Comment 7 Nisim Simsolo 2015-10-19 14:53:43 UTC
Created attachment 1084442 [details]
qemu log

Comment 8 Nisim Simsolo 2015-10-19 14:54:35 UTC
Created attachment 1084443 [details]
libvirt.xml of win2012_intel VM

Comment 9 Martin Polednik 2015-10-19 15:01:07 UTC
Alex, any idea what could be the cause? This happened both with maxMemory 4 TiB (our default) and 256 GiB. The VM works fine (with drivers blacklisted correctly as far as we know) until the hotplug is triggered.

Comment 10 Alex Williamson 2015-10-19 23:38:37 UTC
Please report dmesg for the host after this occurs
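
A minimal way to capture that on the host might be the following; the output path and the vfio filter are illustrative, the filter only highlights the relevant lines:

# dump the full kernel ring buffer to a file that can be attached to the bug
dmesg > /tmp/dmesg-after-hotplug.txt
# quick look at vfio-related messages only
dmesg | grep -i vfio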

Comment 11 Nisim Simsolo 2015-10-20 11:38:24 UTC
Attaching host dmesg from before and after the issue was reproduced.

Comment 12 Nisim Simsolo 2015-10-20 11:39:06 UTC
Created attachment 1084695 [details]
dmesg before memory hotplug

Comment 13 Nisim Simsolo 2015-10-20 11:40:37 UTC
Created attachment 1084696 [details]
dmesg after memory hotplug

Comment 14 Alex Williamson 2015-10-20 12:59:16 UTC
The process locked-memory rlimit is set to 5G, which I believe is what libvirt uses for a 4G VM (the initial memory size of the VM). Therefore, the qemu-kvm process is not going to be able to lock more pages unless someone bumps the limit further. The evidence is in dmesg:

[  599.043115] vfio_pin_pages: RLIMIT_MEMLOCK (5368709120) exceeded
[  599.043119] vfio_pin_pages: RLIMIT_MEMLOCK (5368709120) exceeded

This results in the -ENOMEM failure in vfio_dma_map. On the vfio side, there is no reason this would be GPU specific; it should happen for any *assigned* device (I emphasize assigned because RHEV treats things like USB passthrough the same as PCI device assignment). Has anyone ever tested whether libvirt increases the process locked memory limits when there's an assigned device and memory is hot-added?
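
As a hedged illustration of how to confirm this on the host, the locked-memory limit of the running qemu-kvm process can be read from /proc; the process selection below is only a sketch and assumes a single running VM on the host:

# find the qemu-kvm process (assumes a single running VM on this host)
QEMU_PID=$(pgrep -f qemu-kvm | head -n 1)
# "Max locked memory" should show 5368709120 bytes (5 GiB) in this case
grep "Max locked memory" /proc/${QEMU_PID}/limits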

Comment 15 Laine Stump 2015-10-20 15:03:07 UTC
(In reply to Alex Williamson from comment #14)
> Has anyone ever tested
> whether libvirt increases the process locked memory limits when there's an
> assigned device and memory is hot-added?

I haven't tested it, but I can see from examining the code that the only place libvirt sets the max locked memory limit is when an assigned PCI device is hotplugged.

I think this bug should be re-assigned to libvirt, but I'm not sure which product's libvirt component to assign it to.
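
As an illustrative mitigation sketch only (untested here, and whether either step fully avoids the crash on this code base is an assumption): the memlock limit of an already-running qemu-kvm process could in principle be raised by hand with prlimit, or a higher hard memory limit could be set on the domain so that libvirt sizes the lock limit from it. The domain name and the sizes below are illustrative.

# raise the memlock limit of the running qemu process (soft:hard, in bytes; 9 GiB here)
QEMU_PID=$(pgrep -f qemu-kvm | head -n 1)
prlimit --pid ${QEMU_PID} --memlock=9663676416:9663676416
# or set a higher hard memory limit on the domain (value in KiB; 9 GiB here)
virsh memtune win7_intel --hard-limit 9437184 --live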

Comment 16 Yaniv Lavi 2015-10-29 12:45:05 UTC
In oVirt, testing is done on a single release by default; therefore I'm removing the 4.0 flag. If you think this bug must be tested in 4.0 as well, please re-add the flag. Please note that we might not have the testing resources to handle the 4.0 clone.

Comment 17 Michal Skrivanek 2015-11-05 18:03:41 UTC
Pending resolution of bug 1273491. We expect that to come after the 7.2 GA, hence we want to bump the required libvirt version in the vdsm spec.

Comment 18 Martin Polednik 2015-11-11 11:11:44 UTC
The workaround for the time being is to disable memory hotplug in engine-config.
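
A sketch of what that might look like on the engine host; the exact configuration key name is an assumption here and should be looked up with engine-config first:

# list engine options to locate the memory-hotplug switch (key name not confirmed)
engine-config --list | grep -i mem
# then disable it for the relevant cluster level and restart the engine, e.g.
# (HotPlugMemorySupported is an assumed/illustrative key name):
# engine-config -s HotPlugMemorySupported=false --cver=3.6
# systemctl restart ovirt-engine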

Comment 21 Michal Skrivanek 2015-12-09 12:49:10 UTC
Moving to ON_QA based on https://bugzilla.redhat.com/show_bug.cgi?id=1280420#c9

Comment 22 Red Hat Bugzilla Rules Engine 2015-12-09 12:59:18 UTC
Bug tickets that are moved to testing must have the target release set, to make sure the tester knows what to test. Please set the correct target release before moving to ON_QA.

Comment 23 Nisim Simsolo 2015-12-14 14:09:48 UTC
Verified: 
rhevm-3.6.1.2-0.1 
libvirt-client-1.2.17-13.el7_2.2.x86_64
qemu-kvm-rhev-2.3.0-31.el7_2.4.x86_64
vdsm-4.17.13-1.el7ev.noarch

Scenario:
(Verified using RHEL and Windows 8 VMs on AMD- and Intel-based hosts.)
1. Run the VM.
2. Hotplug-increase the memory.
3. Verify that the VM continues to run and that the memory was increased properly.
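
For reference, a minimal host-side spot check matching this scenario might be the following; the domain name is illustrative, and the expected limit value depends on the configured memory:

# confirm the domain is still running and that its memory was increased
virsh -r dominfo win7_intel
# confirm the locked-memory limit was bumped ahead of the hotplug
QEMU_PID=$(pgrep -f qemu-kvm | head -n 1)
grep "Max locked memory" /proc/${QEMU_PID}/limits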