Bug 1273480
Summary: ppc64le: VFIO doesn't work for small guests (1 GiB)
Product: Red Hat Enterprise Linux 7
Version: 7.2
Component: libvirt
Hardware: ppc64le
OS: Linux
Reporter: Martin Polednik <mpoledni>
Assignee: Andrea Bolognani <abologna>
QA Contact: Virtualization Bugs <virt-bugs>
Status: CLOSED ERRATA
Severity: urgent
Priority: high
Keywords: ZStream
Target Milestone: rc
Target Release: 7.3
Fixed In Version: libvirt-1.3.1-1.el7
Doc Type: Bug Fix
Doc Text: Prior to this update, guest virtual machines with 1 GiB or less allocated memory in some cases failed to start on IBM Power systems. With this update, the memory lock limit is calculated using architecture-specific logic, which allows the affected guests to start as expected.
Clones: 1283924 (view as bug list)
Bug Blocks: 1154205, 1277183, 1277184, 1277186, 1277189, 1283924, 1308609
CC: abologna, dgibson, dyuan, dzheng, gklein, gsun, hannsj_uhl, jdenemar, jherrman, jsuchane, juzhang, lvivier, mavital, michen, mpoledni, mzhan, pkrempa, qzhang, rbalakri, tlavigne, virt-maint
Last Closed: 2016-11-03 18:26:22 UTC
Type: Bug

Attachments:
Created attachment 1084771 [details]
domain XML
Created attachment 1084772 [details]
qemu log
This can be reproduced on ppc64le, but not on x86_64.

kernel-3.10.0-324.el7.ppc64le
qemu-kvm-rhev-2.3.0-31.el7.ppc64le
libvirt-daemon-1.2.17-13.el7.ppc64le

Also tried with <hostdev mode='subsystem' type='pci' managed='yes'> and without <numatune>; the same problem happens.

Only just saw this. This looks much more likely to be a QEMU or KVM bug than libvirt, reassigning. Laurent, can you investigate this one? I'm a bit baffled by the error messages: those suggest the vfio calls to the host kernel are returning ENOMEM, which I wouldn't expect unless the host was in dire straits.

I'm not able to reproduce it with a USB PCI card. What kind of card are you trying to assign? Do you know why the same card is assigned twice: once as function 0, once as function 1?

According to the error message:

    "vfio: failed to enable container: Cannot allocate memory"

the failing function in QEMU is:

    736 ret = ioctl(fd, VFIO_IOMMU_ENABLE);
    737 if (ret) {
    738     error_report("vfio: failed to enable container: %m");
    739     ret = -errno;
    740     goto free_container_exit;
    741 }

and errno is ENOMEM. In the kernel, ENOMEM is returned by:

    tce_iommu_ioctl()
    ...
    936 case VFIO_IOMMU_ENABLE:
    ...
    941     ret = tce_iommu_enable(container);
    ...
    943     return ret;

    tce_iommu_enable()
    ...
    280 locked = table_group->tce32_size >> PAGE_SHIFT;
    281 ret = try_increment_locked_vm(locked);
    ...

    try_increment_locked_vm()
    ...
    45 locked = current->mm->locked_vm + npages;
    46 lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
    47 if (locked > lock_limit && !capable(CAP_IPC_LOCK))
    48     ret = -ENOMEM;
    ...

So we get this error because too much memory is locked. What is the size of the memory regions on the PCI card?
(If it is a video card, the limit can be reached quickly.) For instance, on my system the limits of libvirtd are:

$ cat /proc/93693/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             976978               976978               processes
Max open files            1024                 4096                 files
Max locked memory         3221225472           3221225472           bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       976978               976978               signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us

So, if your card exposes more than 3 GB, this could explain the problem. Andrea, is it possible to change the memlock limit in libvirt?

(In reply to Laurent Vivier from comment #7)
> Andrea, is it possible to change the memlock limit in libvirt ?

You mean the memlock limit for the QEMU process spawned by libvirt, right? That limit should be set to either <memtune><hard_limit>[1], if defined, or <maxMemory>[2] + 1GB. In Martin's case, that would be 11GB. Is this consistent with what you're seeing? If not, can you please share your guest configuration?

[1] https://libvirt.org/formatdomain.html#elementsMemoryTuning
[2] https://libvirt.org/formatdomain.html#elementsMemoryAllocation

Also, the test was done on a video card (actually an unsupported ATI, just to see where this goes). Using this XML, it happens even on a NIC. More testing will happen as soon as we are able to plug in an nvidia (Quadro 2000 or close) card.

video

0004:01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Cedar GL [FirePro 2270] [1002:68f2]
        Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Device [1002:0126]
        Kernel driver in use: pci-stub
0004:01:00.1 Audio device [0403]: Advanced Micro Devices, Inc.
[AMD/ATI] Cedar HDMI Audio [Radeon HD 5400/6300 Series] [1002:aa68]
        Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Cedar HDMI Audio [Radeon HD 5400/6300 Series] [1002:aa68]
        Kernel driver in use: pci-stub

nic

0002:01:00.0 Ethernet controller [0200]: Broadcom Corporation NetXtreme II BCM57800 1/10 Gigabit Ethernet [14e4:168a] (rev 10)
        Subsystem: IBM Device [1014:0493]
        Kernel driver in use: bnx2x
0002:01:00.1 Ethernet controller [0200]: Broadcom Corporation NetXtreme II BCM57800 1/10 Gigabit Ethernet [14e4:168a] (rev 10)
        Subsystem: IBM Device [1014:0493]
        Kernel driver in use: bnx2x
0002:01:00.2 Ethernet controller [0200]: Broadcom Corporation NetXtreme II BCM57800 1/10 Gigabit Ethernet [14e4:168a] (rev 10)
        Subsystem: IBM Device [1014:0494]
        Kernel driver in use: bnx2x
0002:01:00.3 Ethernet controller [0200]: Broadcom Corporation NetXtreme II BCM57800 1/10 Gigabit Ethernet [14e4:168a] (rev 10)
        Subsystem: IBM Device [1014:0494]
        Kernel driver in use: bnx2x

Interesting note: when I stop my guest, the host crashes when the card goes back to the host. This seems related to BZ1270717.

I've been able to reproduce it with libvirt, but the same QEMU command line works fine if started manually. I cannot check the limits because qemu exits immediately.

(In reply to Andrea Bolognani from comment #8)
> (In reply to Laurent Vivier from comment #7)
> > Andrea, is it possible to change the memlock limit in libvirt ?
>
> You mean the memlock limit for the QEMU process spawned by
> libvirt, right?

Yes.

> That limit should be set to either <memtune><hard_limit>[1],
> if defined, or <maxMemory>[2] + 1GB. In Martin's case, that
> would be 11GB.

This could be the reason for the problem: if I use <memtune> to set the hard limit, the guest boots fine.

<memtune>
  <hard_limit unit='KiB'>10485760</hard_limit>
  <soft_limit unit='KiB'>10485760</soft_limit>
</memtune>

cat /proc/9427/limits
Limit                     Soft Limit           Hard Limit           Units
...
Max locked memory         10737418240          10737418240          bytes
Max address space         unlimited            unlimited            bytes
...

So, I'll give it back to you...

Created attachment 1087642 [details]
My guest definition on ibm-p8-virt-01
This is the definition of my guest on ibm-p8-virt-01.
This is the working one; remove the <memtune> tags to get the faulty one.
Update: I'm unable to reproduce the issue with libvirt master, which means the bug has already been fixed upstream. Now to figure out what needs to be backported :)

Never mind, my test build and the RPMs where the bug can be reproduced seem to differ in configuration, so it might still need to be fixed upstream.

I incorrectly stated in Comment #8 that the limit is <maxMemory> + 1GB, while it is in fact <memory> + 1GB. Setting <memtune><hard_limit> still overrides this.

Reproduced with:

kernel-3.10.0-327.el7.ppc64le
qemu-kvm-rhev-2.3.0-31.el7.ppc64le
libvirt-client-1.2.17-13.el7.ppc64le

# virsh nodedev-dumpxml pci_0003_09_00_0
...
  <iommuGroup number='1'>
    <address domain='0x0003' bus='0x09' slot='0x00' function='0x0'/>
    <address domain='0x0003' bus='0x09' slot='0x00' function='0x1'/>
    <address domain='0x0003' bus='0x09' slot='0x00' function='0x2'/>
    <address domain='0x0003' bus='0x09' slot='0x00' function='0x3'/>
  </iommuGroup>

# virsh dumpxml guest
  <maxMemory slots='2' unit='KiB'>10485760</maxMemory>
  <cpu mode='custom' match='exact'>
    <model fallback='allow'>POWER8</model>
    <numa>
      <cell id='0' cpus='0' memory='1048576' unit='KiB'/>
    </numa>
  </cpu>
  <devices>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0003' bus='0x09' slot='0x00' function='0x0'/>
      </source>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0003' bus='0x09' slot='0x00' function='0x1'/>
      </source>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0003' bus='0x09' slot='0x00' function='0x2'/>
      </source>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0003' bus='0x09' slot='0x00' function='0x3'/>
      </source>
    </hostdev>
    ...
  </devices>

# virsh start guest
<Error messages are the same as those reported>

Tried with <memtune> and the guest can start correctly.
<memtune>
  <hard_limit unit='KiB'>10485760</hard_limit>
  <soft_limit unit='KiB'>10485760</soft_limit>
</memtune>

Created attachment 1094841 [details]
simple domain XML that can be used to reproduce the issue
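For reference, a minimal domain along these lines can be sketched from the fragments quoted elsewhere in this report. This is an illustrative reconstruction, not the literal attachment: the guest name is invented, and the device address is the one used in the later comments.

```xml
<domain type='kvm'>
  <name>vfio-small-guest</name>
  <!-- 1 GiB guest: small enough to trigger the bug -->
  <memory unit='KiB'>1048576</memory>
  <os>
    <type arch='ppc64le' machine='pseries'>hvm</type>
  </os>
  <devices>
    <!-- assigned PCI device, as in the reproduction steps -->
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0003' bus='0x09' slot='0x00' function='0x0'/>
      </source>
    </hostdev>
  </devices>
</domain>
```

Removing the <hostdev> element, or adding a large enough <memtune><hard_limit>, makes such a guest start on the affected builds.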
Updated the Summary to reflect the fact that the issue occurs
regardless of whether maxMemory has been set, but only for
small guests, e.g. 1 GiB.
I'm attaching the domain XML as per David's request. You can
get the guest to boot by removing the <hostdev> elements; if
you try hotplugging the PCI devices later on you will get
the same error about not being able to allocate memory.
A fix is being tested.
Fix posted upstream:
https://www.redhat.com/archives/libvir-list/2015-November/msg00629.html

Waiting for David's comment on the list to get the last ACK required to push it.

The fix has been pushed upstream.

commit d269ef165c178ad62b48e5179fc4f3b4fa5e590b
Author: Andrea Bolognani <abologna>
Date:   Fri Nov 13 10:37:12 2015 +0100

    qemu: Add ppc64-specific math to qemuDomainGetMlockLimitBytes()

    The amount of memory a ppc64 domain might need to lock is different
    than that of an equally-sized x86 domain, so we need to check the
    domain's architecture and act accordingly.

    Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1273480

v1.2.21-108-gd269ef1

Packages:
kernel-3.10.0-362.el7.ppc64le
qemu-kvm-rhev-2.5.0-2.el7.ppc64le
libvirt-1.3.2-1.el7.ppc64le

Guest kernel: 3.10.0-327.3.1.el7.ppc64le

Steps:

Case 1: Start a guest with 1G memory + hostdev

1. Check the PCI device on the host.

# lspci
...
0003:09:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
0003:09:00.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
0003:09:00.2 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
0003:09:00.3 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)

# virsh nodedev-dumpxml pci_0003_09_00_0
<device>
  <name>pci_0003_09_00_0</name>
  ...
  <iommuGroup number='1'>
    <address domain='0x0003' bus='0x09' slot='0x00' function='0x0'/>
    <address domain='0x0003' bus='0x09' slot='0x00' function='0x1'/>
    <address domain='0x0003' bus='0x09' slot='0x00' function='0x2'/>
    <address domain='0x0003' bus='0x09' slot='0x00' function='0x3'/>
  </iommuGroup>
  ...
</device>

2. Guest XML:

  <maxMemory slots='16' unit='KiB'>10485760</maxMemory>
  <memory unit='KiB'>1048576</memory>
  <cpu mode='custom' match='exact'>
    <model fallback='allow'>POWER8</model>
    <numa>
      <cell id='0' cpus='0' memory='1048576' unit='KiB'/>
    </numa>
  </cpu>
  ...
  <devices>
  ...
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0003' bus='0x09' slot='0x00' function='0x0'/>
      </source>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0003' bus='0x09' slot='0x00' function='0x1'/>
      </source>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0003' bus='0x09' slot='0x00' function='0x2'/>
      </source>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0003' bus='0x09' slot='0x00' function='0x3'/>
      </source>
    </hostdev>
    ...
  </devices>

3. The guest can start successfully.

4. Within the guest, check PCI devices.

[root@localhost ~]# lspci
...
00:07.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
00:08.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
00:09.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
00:0a.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)

5. On the host, check the dumpxml of the guest and see that the addresses of the PCI devices are correct.
  <hostdev mode='subsystem' type='pci' managed='yes'>
    <driver name='vfio'/>
    <source>
      <address domain='0x0003' bus='0x09' slot='0x00' function='0x0'/>
    </source>
    <alias name='hostdev0'/>
    <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
  </hostdev>
  <hostdev mode='subsystem' type='pci' managed='yes'>
    <driver name='vfio'/>
    <source>
      <address domain='0x0003' bus='0x09' slot='0x00' function='0x1'/>
    </source>
    <alias name='hostdev1'/>
    <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
  </hostdev>
  <hostdev mode='subsystem' type='pci' managed='yes'>
    <driver name='vfio'/>
    <source>
      <address domain='0x0003' bus='0x09' slot='0x00' function='0x2'/>
    </source>
    <alias name='hostdev2'/>
    <address type='pci' domain='0x0000' bus='0x00' slot='0x09' function='0x0'/>
  </hostdev>
  <hostdev mode='subsystem' type='pci' managed='yes'>
    <driver name='vfio'/>
    <source>
      <address domain='0x0003' bus='0x09' slot='0x00' function='0x3'/>
    </source>
    <alias name='hostdev3'/>
    <address type='pci' domain='0x0000' bus='0x00' slot='0x0a' function='0x0'/>
  </hostdev>

Case 2: Start a guest with less than 1G memory and no hostdev, then hotplug a PCI device.

1. Guest XML:

  <memory unit='KiB'>800576</memory>
  <cpu mode='custom' match='exact'>
    <model fallback='allow'>POWER8</model>
    <numa>
      <cell id='0' cpus='0' memory='800576' unit='KiB'/>
    </numa>
  </cpu>

2. Start the guest, and check that there is no additional PCI device within the guest.

3. Prepare an XML snippet:

  <hostdev mode='subsystem' type='pci' managed='yes'>
    <source>
      <address domain='0x0003' bus='0x09' slot='0x00' function='0x0'/>
    </source>
  </hostdev>

4. Detach the other PCI devices within the same IOMMU group, i.e. pci_0003_09_00_1, pci_0003_09_00_2, and pci_0003_09_00_3:

# virsh nodedev-detach pci_0003_09_00_1
Device pci_0003_09_00_1 detached

# virsh nodedev-reset pci_0003_09_00_1
Device pci_0003_09_00_1 reset

... Same operations for the other PCI devices.

5.
Attach device pci_0003_09_00_0 to the guest:

# virsh attach-device avocado-vt-vm1 attach.xml

6. Check that the device is added within the guest.

[root@localhost ~]# lspci
...
00:08.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)

7. Double-check the dumpxml of the guest on the host to see that the address of this PCI device is correct.

# virsh dumpxml guest|grep hostdev -a4
  <hostdev mode='subsystem' type='pci' managed='yes'>
    <driver name='vfio'/>
    <source>
      <address domain='0x0003' bus='0x09' slot='0x00' function='0x0'/>
    </source>
    <alias name='hostdev0'/>
    <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
  </hostdev>

Both tests passed, so marking this as Verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2577.html
Created attachment 1084770 [details]
dmesg output

Description of problem:
The VM fails to start when it is started with memory hotplug enabled (maxMemory element set) and at the same time has a PCI device assigned.

Version-Release number of selected component (if applicable):
kernel-3.10.0-313.el7.ppc64le
qemu-kvm-rhev-2.3.0-31.el7.ppc64le
libvirt-daemon-1.2.17-13.el7.ppc64le

How reproducible:
Always

Steps to Reproduce:
1. Define a domain with maxMemory set and a PCI device assigned
2. Start the domain

Actual results:
2015-10-20T11:38:25.691928Z qemu-kvm: -device vfio-pci,host=0004:01:00.0,id=hostdev0,bus=pci.0,addr=0x3: vfio: failed to enable container: Cannot allocate memory
2015-10-20T11:38:25.692036Z qemu-kvm: -device vfio-pci,host=0004:01:00.0,id=hostdev0,bus=pci.0,addr=0x3: vfio: failed to setup container for group 4
2015-10-20T11:38:25.717706Z qemu-kvm: -device vfio-pci,host=0004:01:00.0,id=hostdev0,bus=pci.0,addr=0x3: vfio: failed to get group 4
2015-10-20T11:38:25.717790Z qemu-kvm: -device vfio-pci,host=0004:01:00.0,id=hostdev0,bus=pci.0,addr=0x3: Device initialization failed
2015-10-20T11:38:25.717812Z qemu-kvm: -device vfio-pci,host=0004:01:00.0,id=hostdev0,bus=pci.0,addr=0x3: Device 'vfio-pci' could not be initialized

This causes the domain not to start.

Expected results:
The domain starts.

Additional info: