Red Hat Bugzilla – Bug 1272742
VFIO: VM with an attached GPU is powered off when trying to hotplug additional memory.
Last modified: 2016-08-01 10:02:55 EDT
Description of problem:
Notes:
- This bug is relevant only for GPU passthrough (it does not occur with other host devices attached to the VM).
- The bug is relevant only for memory hotplug (it does not occur when hot-adding CPUs).
- The bug occurs on both Linux and Windows VMs.
When trying to increase a running VM's memory, the VM is powered off instead of having its memory increased via the hotplug mechanism.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Run VM with GPU attached.
2. From virtual machines tab > Edit VM > system > increase memory size.
3. Click OK (without checking the "apply later" checkbox).
Actual results:
The VM is powered off (Connection reset by peer).
Expected results:
The VM's memory should be increased via hotplug without powering off the VM.
- vmId (win7_intel): cfcdab4-415e-40d7-a19c-bdbc80a9172d
- Issue occurred at: 2015-10-18 13:54:06,596 ERROR [org.ovirt.engine.core.vdsbroker.SetAmountOfMemoryVDSCommand] (ajp-/127.0.0.1:8702-11) [515aecb4] Failed in 'SetAmountOfMemoryVDS' method
2015-10-18 13:54:06,605 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ajp-/127.0.0.1:8702-11) [515aecb4] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message:
VDSM intel-vfio.tlv.redhat.com command failed: Unable to read from monitor:
Connection reset by peer
engine.log and vdsm.log attached.
Created attachment 1084120 [details]
Created attachment 1084132 [details]
Created attachment 1084135 [details]
Target release should be set once a package build is known to fix an issue. Since this bug is not in MODIFIED status, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.
There don't seem to be any pointers to what might have happened in the vdsm log (apart from the hotplug being reported as successful). Can you also provide us with the qemu log? Can you verify that the VM is dead by executing 'virsh -r list' and looking up the VM's name and status on the hypervisor?
- 'virsh -r list' shows that the VM is not running after the issue is reproduced.
- The qemu log shows an issue with memory allocation:
2015-10-19T14:48:43.304948Z vfio_dma_map(0x7f7c0c22b7a0, 0x140000000, 0x40000000, 0x7f7a87800000) = -12 (Cannot allocate memory)
qemu: hardware error: vfio: DMA mapping failed, unable to continue
- qemu log and libvirt XML attached.
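As a decoding aid (not part of the original report): the -12 returned by vfio_dma_map in the log above is a negated errno value, and a quick Python check confirms it maps to ENOMEM, matching the "Cannot allocate memory" text in the qemu log.

```python
import errno
import os

# vfio_dma_map() returned -12; kernel-style APIs return negated errno codes.
code = 12
print(errno.errorcode[code])  # 'ENOMEM'
print(os.strerror(code))      # 'Cannot allocate memory', as in the qemu log
```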
Created attachment 1084442 [details]
Created attachment 1084443 [details]
libvirt.xml of win2012_intel VM
Alex, any idea what could be the cause? This happened both with maxMemory 4 TiB (our default) and 256 GiB. The VM works fine (with drivers blacklisted correctly as far as we know) until the hotplug is triggered.
Please report dmesg for the host after this occurs.
Attaching host dmesg from before the issue and after the issue was reproduced.
Created attachment 1084695 [details]
dmesg before memory hotplug
Created attachment 1084696 [details]
dmesg after memory hotplug
The process locked-memory rlimit is set to 5G, which I believe is what libvirt uses for a 4G VM, the initial memory size of the VM. Therefore, the qemu-kvm process is not going to be able to lock more pages unless someone bumps the limit further. The evidence is in dmesg:
[ 599.043115] vfio_pin_pages: RLIMIT_MEMLOCK (5368709120) exceeded
[ 599.043119] vfio_pin_pages: RLIMIT_MEMLOCK (5368709120) exceeded
This results in the -ENOMEM failure in vfio_dma_map. On the vfio side, there is no reason this would be GPU specific; it should happen for any *assigned* device (I emphasize assigned because RHEV treats things like USB passthrough the same as PCI device assignment). Has anyone ever tested whether libvirt increases the process locked-memory limit when there's an assigned device and memory is hot-added?
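The diagnosis above can be checked directly on the hypervisor. A minimal Python sketch (an illustration, not from the bug report; for a live VM, pass the qemu-kvm PID rather than the default "self") that reads the 'Max locked memory' rlimit which vfio_pin_pages enforces:

```python
def memlock_limit(pid="self"):
    """Return (soft, hard) 'Max locked memory' limits, in bytes, from /proc."""
    with open(f"/proc/{pid}/limits") as f:
        for line in f:
            if line.startswith("Max locked memory"):
                # Line layout: Max locked memory  <soft>  <hard>  bytes
                fields = line.split()
                to_bytes = lambda v: float("inf") if v == "unlimited" else int(v)
                return to_bytes(fields[3]), to_bytes(fields[4])
    raise RuntimeError("'Max locked memory' line not found")

# On a host showing this bug, the qemu-kvm process would report 5368709120.
soft, hard = memlock_limit()
print(soft, hard)
```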
(In reply to Alex Williamson from comment #14)
> Has anyone ever tested
> whether libvirt increases the process locked memory limits when there's an
> assigned device and memory is hot-added?
I haven't tested it, but I can see from examining the code that the only place libvirt sets the max locked memory limit is when an assigned PCI device is hotplugged.
I think this bug should be re-assigned to libvirt, but I'm not sure which product's libvirt component to assign it to.
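The reading of the code above suggests the limit is computed once, at PCI device hotplug time, and never revisited on memory hotplug. As a rough illustration, the sizing rule here is inferred from the 5 GiB limit reported for a 4 GiB VM in the dmesg lines above; it is an assumption, not taken from libvirt's source.

```python
GIB = 1024 ** 3

def assumed_memlock_limit(vm_memory_bytes):
    # Inferred rule: current VM memory plus 1 GiB of headroom for VFIO DMA
    # mappings. This matches the 5368709120-byte limit seen in dmesg for a
    # 4 GiB VM, but it is an assumption about libvirt's sizing, not its code.
    return vm_memory_bytes + 1 * GIB

print(assumed_memlock_limit(4 * GIB))  # 5368709120, as in the dmesg lines
```

Under this assumption, hotplugging the VM to, say, 6 GiB would require a 7 GiB limit; since the limit is only recomputed when a PCI device is hotplugged, the stale 5 GiB limit makes vfio_pin_pages fail.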
In oVirt, testing is done on a single release by default. Therefore I'm removing the 4.0 flag. If you think this bug must be tested in 4.0 as well, please re-add the flag. Please note that we might not have the testing resources to handle the 4.0 clone.
Pending resolution of bug 1273491. We expect that post 7.2 GA, hence we want to bump the required libvirt version in the vdsm spec.
A workaround for the time being is disabling memory hotplug in engine-config.
Moving to ON_QA based on https://bugzilla.redhat.com/show_bug.cgi?id=1280420#c9
Bug tickets that are moved to testing must have the target release set, to make sure the tester knows what to test. Please set the correct target release before moving to ON_QA.
(Verified using RHEL and Windows 8 VMs on AMD- and Intel-based hosts.)
1. Run a VM.
2. Hotplug additional memory.
3. Verify that the VM continues to run and that its memory increased properly.