Bug 1273480 - ppc64le: VFIO doesn't work for small guests (1 GiB)
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: libvirt
Version: 7.2
Hardware: ppc64le Linux
Priority: high
Severity: urgent
Target Milestone: rc
Target Release: 7.3
Assigned To: Andrea Bolognani
QA Contact: Virtualization Bugs
Keywords: ZStream
Depends On:
Blocks: 1154205 1277183 1277184 1277186 RHEV3.6PPC_PCI_Passthrough 1283924 RHEV4.0PPC
 
Reported: 2015-10-20 10:04 EDT by Martin Polednik
Modified: 2016-11-03 14:26 EDT

See Also:
Fixed In Version: libvirt-1.3.1-1.el7
Doc Type: Bug Fix
Doc Text:
Prior to this update, guest virtual machines with 1 GiB or less allocated memory in some cases failed to start on IBM Power systems. With this update, the memory lock limit is calculated using architecture-specific logic, which allows the affected guests to start as expected.
Clones: 1283924
Last Closed: 2016-11-03 14:26:22 EDT
Type: Bug


Attachments
dmesg output (183.31 KB, text/plain), 2015-10-20 10:04 EDT, Martin Polednik
domain XML (2.46 KB, text/plain), 2015-10-20 10:05 EDT, Martin Polednik
qemu log (2.31 KB, text/plain), 2015-10-20 10:06 EDT, Martin Polednik
My guest definition on ibm-p8-virt-01 (2.67 KB, text/plain), 2015-10-29 13:39 EDT, Laurent Vivier
simple domain XML that can be used to reproduce the issue (2.28 KB, application/xml), 2015-11-16 05:18 EST, Andrea Bolognani


External Trackers
Tracker: Red Hat Product Errata
ID: RHSA-2016:2577
Priority: normal
Status: SHIPPED_LIVE
Summary: Moderate: libvirt security, bug fix, and enhancement update
Last Updated: 2016-11-03 08:07:06 EDT

Description Martin Polednik 2015-10-20 10:04:56 EDT
Created attachment 1084770 [details]
dmesg output

Description of problem:
A VM fails to start when memory hotplug is enabled (the maxMemory element is set) and a PCI device is assigned at the same time.

Version-Release number of selected component (if applicable):
kernel-3.10.0-313.el7.ppc64le
qemu-kvm-rhev-2.3.0-31.el7.ppc64le
libvirt-daemon-1.2.17-13.el7.ppc64le

How reproducible:
Always

Steps to Reproduce:
1. Define a domain with maxMemory set and a PCI device assigned
2. Start the domain

Actual results:
2015-10-20T11:38:25.691928Z qemu-kvm: -device vfio-pci,host=0004:01:00.0,id=hostdev0,bus=pci.0,addr=0x3: vfio: failed to enable container: Cannot allocate memory
2015-10-20T11:38:25.692036Z qemu-kvm: -device vfio-pci,host=0004:01:00.0,id=hostdev0,bus=pci.0,addr=0x3: vfio: failed to setup container for group 4
2015-10-20T11:38:25.717706Z qemu-kvm: -device vfio-pci,host=0004:01:00.0,id=hostdev0,bus=pci.0,addr=0x3: vfio: failed to get group 4
2015-10-20T11:38:25.717790Z qemu-kvm: -device vfio-pci,host=0004:01:00.0,id=hostdev0,bus=pci.0,addr=0x3: Device initialization failed
2015-10-20T11:38:25.717812Z qemu-kvm: -device vfio-pci,host=0004:01:00.0,id=hostdev0,bus=pci.0,addr=0x3: Device 'vfio-pci' could not be initialized
As a result, the domain does not start.

Expected results:
Domain is started.

Additional info:
Comment 1 Martin Polednik 2015-10-20 10:05 EDT
Created attachment 1084771 [details]
domain XML
Comment 2 Martin Polednik 2015-10-20 10:06 EDT
Created attachment 1084772 [details]
qemu log
Comment 4 Dan Zheng 2015-10-21 04:09:42 EDT
This can be reproduced on ppc64le, but not on x86_64.
kernel-3.10.0-324.el7.ppc64le
qemu-kvm-rhev-2.3.0-31.el7.ppc64le
libvirt-daemon-1.2.17-13.el7.ppc64le


Also tried with <hostdev mode='subsystem' type='pci' managed='yes'> and without <numatune>, and the same problem occurs.
Comment 5 David Gibson 2015-10-28 20:21:15 EDT
Only just saw this.

This looks much more likely to be a qemu or KVM bug than libvirt, reassigning.

Laurent, can you investigate this one?

I'm a bit baffled by the error messages - those suggest the vfio calls to the host kernel are returning ENOMEM, which I wouldn't expect unless the host was in dire straits.
Comment 6 Laurent Vivier 2015-10-29 07:49:56 EDT
I'm not able to reproduce it with a USB PCI card.

What kind of card are you trying to assign?

Do you know why the same card is assigned twice: once by function 0, once by function 1?
Comment 7 Laurent Vivier 2015-10-29 09:38:54 EDT
According to the error message:

"vfio: failed to enable container: Cannot allocate memory"

the failing call in QEMU is:

        ret = ioctl(fd, VFIO_IOMMU_ENABLE);
        if (ret) {
            error_report("vfio: failed to enable container: %m");
            ret = -errno;
            goto free_container_exit;
        }

and errno is ENOMEM.

In the kernel, ENOMEM is returned by:

tce_iommu_ioctl()
...
        case VFIO_IOMMU_ENABLE:
...
                ret = tce_iommu_enable(container);
...
                return ret;

tce_iommu_enable()
...
        locked = table_group->tce32_size >> PAGE_SHIFT;
        ret = try_increment_locked_vm(locked);
...

try_increment_locked_vm()
...
        locked = current->mm->locked_vm + npages;
        lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
        if (locked > lock_limit && !capable(CAP_IPC_LOCK))
                ret = -ENOMEM;
...

So we get this error because the request would lock more memory than RLIMIT_MEMLOCK allows.

What is the size of the memory regions on the PCI card?
(If it is a video card, the limit can be reached quickly.)
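
The check in try_increment_locked_vm() quoted above can be modelled in a few lines. This is only a rough sketch of the comparison, not the kernel code: the 64 KiB page size (PAGE_SHIFT = 16) is an assumption about the host, and the function name, parameters and return value are chosen for illustration.

```python
PAGE_SHIFT = 16  # assumption: 64 KiB pages on the ppc64le host


def try_increment_locked_vm(locked_vm_pages, npages, memlock_limit_bytes,
                            cap_ipc_lock=False, page_shift=PAGE_SHIFT):
    """Model of the kernel check: return the new locked page count,
    or raise MemoryError where the kernel would return -ENOMEM."""
    locked = locked_vm_pages + npages
    lock_limit = memlock_limit_bytes >> page_shift
    # The limit is only enforced for processes without CAP_IPC_LOCK.
    if locked > lock_limit and not cap_ipc_lock:
        raise MemoryError("ENOMEM: request exceeds RLIMIT_MEMLOCK")
    return locked
```

With a 3 GiB limit, a request for 1 GiB worth of pages succeeds, while any request that pushes the total past the limit fails unless the process has CAP_IPC_LOCK.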

For instance, on my system, the limits with libvirtd are:
$ cat /proc/93693/limits
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             976978               976978               processes 
Max open files            1024                 4096                 files     
Max locked memory         3221225472           3221225472           bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       976978               976978               signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us        

So, if your card exposes more than 3 GB, this could explain the problem.
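
For reference, the "Max locked memory" row can be extracted from a limits listing like the one above with a small helper. This is a minimal sketch; the function name parse_memlock is made up for illustration, and it assumes the column layout shown in this comment.

```python
def parse_memlock(limits_text):
    """Return (soft, hard) locked-memory limits in bytes from the text of
    /proc/<pid>/limits; a value of None means 'unlimited'."""
    for line in limits_text.splitlines():
        if line.startswith("Max locked memory"):
            fields = line.split()
            # fields: ['Max', 'locked', 'memory', soft, hard, 'bytes']
            soft, hard = fields[3], fields[4]
            return tuple(None if v == "unlimited" else int(v)
                         for v in (soft, hard))
    raise ValueError("no 'Max locked memory' row found")
```

It could be used as parse_memlock(open("/proc/93693/limits").read()), where the PID is that of the process being inspected. The value shown above, 3221225472 bytes, is exactly 3 GiB.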

Andrea, is it possible to change the memlock limit in libvirt?
Comment 8 Andrea Bolognani 2015-10-29 10:21:53 EDT
(In reply to Laurent Vivier from comment #7)
> Andrea, is it possible to change the memlock limit in libvirt ?

You mean the memlock limit for the QEMU process spawned by
libvirt, right?

That limit should be set to either <memtune><hard_limit>[1],
if defined, or <maxMemory>[2] + 1GB. In Martin's case, that
would be 11GB.

Is this consistent with what you're seeing? If not, can you
please share your guest configuration?


[1] https://libvirt.org/formatdomain.html#elementsMemoryTuning
[2] https://libvirt.org/formatdomain.html#elementsMemoryAllocation
Comment 9 Martin Polednik 2015-10-29 11:01:12 EDT
Also, the test was done on a video card (actually an unsupported ATI card, just to see where this goes). Using this XML, it happens even on a NIC. More testing will happen as soon as we are able to plug in an NVIDIA (Quadro 2000 or similar) card.

video
0004:01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Cedar GL [FirePro 2270] [1002:68f2]
	Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Device [1002:0126]
	Kernel driver in use: pci-stub
0004:01:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Cedar HDMI Audio [Radeon HD 5400/6300 Series] [1002:aa68]
	Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Cedar HDMI Audio [Radeon HD 5400/6300 Series] [1002:aa68]
	Kernel driver in use: pci-stub

nic
0002:01:00.0 Ethernet controller [0200]: Broadcom Corporation NetXtreme II BCM57800 1/10 Gigabit Ethernet [14e4:168a] (rev 10)
	Subsystem: IBM Device [1014:0493]
	Kernel driver in use: bnx2x
0002:01:00.1 Ethernet controller [0200]: Broadcom Corporation NetXtreme II BCM57800 1/10 Gigabit Ethernet [14e4:168a] (rev 10)
	Subsystem: IBM Device [1014:0493]
	Kernel driver in use: bnx2x
0002:01:00.2 Ethernet controller [0200]: Broadcom Corporation NetXtreme II BCM57800 1/10 Gigabit Ethernet [14e4:168a] (rev 10)
	Subsystem: IBM Device [1014:0494]
	Kernel driver in use: bnx2x
0002:01:00.3 Ethernet controller [0200]: Broadcom Corporation NetXtreme II BCM57800 1/10 Gigabit Ethernet [14e4:168a] (rev 10)
	Subsystem: IBM Device [1014:0494]
	Kernel driver in use: bnx2x
Comment 10 Laurent Vivier 2015-10-29 11:23:44 EDT
Interesting note: when I stop my guest, the host crashes when the card goes back to the host. This seems related to BZ1270717.
Comment 11 Laurent Vivier 2015-10-29 13:36:37 EDT
I've been able to reproduce it with libvirt, but the same QEMU command line works fine if started manually. I cannot check the limits because qemu exits immediately.

(In reply to Andrea Bolognani from comment #8)
> (In reply to Laurent Vivier from comment #7)
> > Andrea, is it possible to change the memlock limit in libvirt ?
> 
> You mean the memlock limit for the QEMU process spawned by
> libvirt, right?

Yes,

> That limit should be set to either <memtune><hard_limit>[1],
> if defined, or <maxMemory>[2] + 1GB. In Martin's case, that
> would be 11GB.

This could be the reason for the problem: if I use <memtune> to set the hard limit, the guest boots fine.

  <memtune>
    <hard_limit unit='KiB'>10485760</hard_limit>
    <soft_limit unit='KiB'>10485760</soft_limit>
  </memtune>

$ cat /proc/9427/limits
Limit                     Soft Limit           Hard Limit           Units     
....
Max locked memory         10737418240          10737418240          bytes     
Max address space         unlimited            unlimited            bytes     
...

So, I'll give it back to you...
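
As an illustration of the workaround above, the <memtune> fragment can be generated for an arbitrary limit. This is a hedged sketch: memtune_element is a hypothetical helper, not part of libvirt, and it simply sets both limits to the same value as in the snippet quoted in this comment.

```python
import xml.etree.ElementTree as ET

KIB_PER_GIB = 1024 * 1024


def memtune_element(limit_gib):
    """Build a <memtune> element whose hard and soft limits are both
    limit_gib GiB, expressed in KiB as in the snippet above."""
    limit_kib = limit_gib * KIB_PER_GIB
    memtune = ET.Element("memtune")
    for tag in ("hard_limit", "soft_limit"):
        child = ET.SubElement(memtune, tag, unit="KiB")
        child.text = str(limit_kib)
    return memtune


# Serialise a 10 GiB limit; the result can be pasted into the domain XML.
xml_text = ET.tostring(memtune_element(10), encoding="unicode")
```

For 10 GiB this yields 10485760 KiB, which corresponds to the 10737418240 bytes reported as "Max locked memory" above.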
Comment 12 Laurent Vivier 2015-10-29 13:39 EDT
Created attachment 1087642 [details]
My guest definition on ibm-p8-virt-01

This is the definition of my guest on ibm-p8-virt-01.

This is the working one; remove the <memtune> element to get the faulty one.
Comment 13 Andrea Bolognani 2015-11-03 10:46:43 EST
Update: I'm unable to reproduce the issue with libvirt
master, which means the bug has already been fixed
upstream.

Now to figure out what needs to be backported :)
Comment 14 Andrea Bolognani 2015-11-03 12:02:20 EST
Never mind, my test build and the RPMs where the bug can be
reproduced seem to differ in configuration, so it might still
need to be fixed upstream.
Comment 15 Andrea Bolognani 2015-11-05 10:56:46 EST
I've incorrectly stated in Comment #8 that the limit is
<maxMemory> + 1GB, while it's in fact <memory> + 1GB.

Setting <memtune><hard_limit> still overrides this.
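
The rule as described in this thread can be written down as a short sketch. It reflects only the description in the comments (hard limit if set, otherwise guest memory plus 1 GiB); it is not libvirt's actual code, and the function name is hypothetical.

```python
KIB_PER_GIB = 1024 * 1024


def default_mlock_limit_kib(memory_kib, hard_limit_kib=None):
    """Memory lock limit in KiB as described in this thread:
    <memtune><hard_limit> if set, otherwise <memory> + 1 GiB."""
    if hard_limit_kib is not None:
        return hard_limit_kib
    return memory_kib + KIB_PER_GIB
```

Under this rule, a 1 GiB guest (1048576 KiB) without <memtune> gets a 2 GiB limit, whereas setting a hard limit of 10485760 KiB raises it to 10 GiB.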
Comment 16 Dan Zheng 2015-11-09 00:00:44 EST
Reproduced with:

kernel-3.10.0-327.el7.ppc64le
qemu-kvm-rhev-2.3.0-31.el7.ppc64le
libvirt-client-1.2.17-13.el7.ppc64le


# virsh nodedev-dumpxml pci_0003_09_00_0
...
 <iommuGroup number='1'>
      <address domain='0x0003' bus='0x09' slot='0x00' function='0x0'/>
      <address domain='0x0003' bus='0x09' slot='0x00' function='0x1'/>
      <address domain='0x0003' bus='0x09' slot='0x00' function='0x2'/>
      <address domain='0x0003' bus='0x09' slot='0x00' function='0x3'/>
 </iommuGroup>

# virsh dumpxml guest

<maxMemory slots='2' unit='KiB'>10485760</maxMemory>
<cpu mode='custom' match='exact'>
    <model fallback='allow'>POWER8</model>
    <numa>
      <cell id='0' cpus='0' memory='1048576' unit='KiB'/>
    </numa>
</cpu>
<devices>
     <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0003' bus='0x09' slot='0x00' function='0x0'/>
      </source>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0003' bus='0x09' slot='0x00' function='0x1'/>
      </source>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0003' bus='0x09' slot='0x00' function='0x2'/>
      </source>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0003' bus='0x09' slot='0x00' function='0x3'/>
      </source>
    </hostdev>

...
</devices>

# virsh start guest
<Error messages are the same as those reported>

With <memtune> set, the guest starts correctly:
  <memtune>
    <hard_limit unit='KiB'>10485760</hard_limit>
    <soft_limit unit='KiB'>10485760</soft_limit>
  </memtune>
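
The IOMMU group membership shown in the nodedev-dumpxml output above can be read programmatically. The following is a minimal sketch using only the Python standard library; iommu_group_addresses and SAMPLE are illustrative names, and the input is assumed to look like the fragment quoted in this comment.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<iommuGroup number='1'>
  <address domain='0x0003' bus='0x09' slot='0x00' function='0x0'/>
  <address domain='0x0003' bus='0x09' slot='0x00' function='0x1'/>
  <address domain='0x0003' bus='0x09' slot='0x00' function='0x2'/>
  <address domain='0x0003' bus='0x09' slot='0x00' function='0x3'/>
</iommuGroup>
"""


def iommu_group_addresses(nodedev_xml):
    """Return (group_number, [(domain, bus, slot, function), ...]) from
    nodedev XML, or (None, []) if there is no <iommuGroup> element."""
    root = ET.fromstring(nodedev_xml)
    # iter() also matches the root element itself.
    group = next(root.iter("iommuGroup"), None)
    if group is None:
        return None, []
    keys = ("domain", "bus", "slot", "function")
    addrs = [tuple(int(a.get(k), 16) for k in keys)
             for a in group.findall("address")]
    return int(group.get("number")), addrs
```

Calling iommu_group_addresses(SAMPLE) lists all four functions of the card, which is why all of them have to be handled together when one is assigned to a guest.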
Comment 17 Andrea Bolognani 2015-11-16 05:18 EST
Created attachment 1094841 [details]
simple domain XML that can be used to reproduce the issue

Updated the Summary to reflect the fact that the issue occurs
regardless of whether maxMemory has been set, but only for
small guests, e.g. 1 GiB.

I'm attaching the domain XML as per David's request. You can
get the guest to boot by removing the <hostdev> elements; if
you try hotplugging the PCI devices later on you will get
the same error about not being able to allocate memory.

A fix is being tested.
Comment 18 Andrea Bolognani 2015-11-19 09:17:09 EST
Fix posted upstream:

  https://www.redhat.com/archives/libvir-list/2015-November/msg00629.html

Waiting for David's comment on the list to get the last
ACK required to push it.
Comment 19 Andrea Bolognani 2015-11-20 04:55:50 EST
The fix has been pushed upstream.

commit d269ef165c178ad62b48e5179fc4f3b4fa5e590b
Author: Andrea Bolognani <abologna@redhat.com>
Date:   Fri Nov 13 10:37:12 2015 +0100

    qemu: Add ppc64-specific math to qemuDomainGetMlockLimitBytes()
    
    The amount of memory a ppc64 domain might need to lock is different
    than that of a equally-sized x86 domain, so we need to check the
    domain's architecture and act accordingly.
    
    Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1273480

v1.2.21-108-gd269ef1
Comment 22 Dan Zheng 2016-03-10 22:07:41 EST
Packages:
kernel-3.10.0-362.el7.ppc64le
qemu-kvm-rhev-2.5.0-2.el7.ppc64le
libvirt-1.3.2-1.el7.ppc64le

Guest kernel: 3.10.0-327.3.1.el7.ppc64le

Steps:
Case1: Start a guest with 1G memory + Hostdev

1. Check PCI device on host
# lspci
...
0003:09:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
0003:09:00.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
0003:09:00.2 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
0003:09:00.3 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)

# virsh nodedev-dumpxml pci_0003_09_00_0
<device>
  <name>pci_0003_09_00_0</name>
 ...
    <iommuGroup number='1'>
      <address domain='0x0003' bus='0x09' slot='0x00' function='0x0'/>
      <address domain='0x0003' bus='0x09' slot='0x00' function='0x1'/>
      <address domain='0x0003' bus='0x09' slot='0x00' function='0x2'/>
      <address domain='0x0003' bus='0x09' slot='0x00' function='0x3'/>
    </iommuGroup>
...
</device>
2. Guest XML.

<maxMemory slots='16' unit='KiB'>10485760</maxMemory>
<memory unit='KiB'>1048576</memory>
<cpu mode='custom' match='exact'>
  <model fallback='allow'>POWER8</model>
  <numa>
    <cell id='0' cpus='0' memory='1048576' unit='KiB'/>
  </numa>
</cpu>


...
<devices>
...
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0003' bus='0x09' slot='0x00' function='0x0'/>
      </source>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0003' bus='0x09' slot='0x00' function='0x1'/>
      </source>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0003' bus='0x09' slot='0x00' function='0x2'/>
      </source>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0003' bus='0x09' slot='0x00' function='0x3'/>
      </source>
    </hostdev>
...
 </devices>

3. The guest starts successfully.
4. Within the guest, check the PCI devices.
[root@localhost ~]# lspci
...
00:07.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
00:08.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
00:09.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
00:0a.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)

5. On the host, check the dumpxml output of the guest and confirm that the PCI addresses are correct.
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0003' bus='0x09' slot='0x00' function='0x0'/>
      </source>
      <alias name='hostdev0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0003' bus='0x09' slot='0x00' function='0x1'/>
      </source>
      <alias name='hostdev1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0003' bus='0x09' slot='0x00' function='0x2'/>
      </source>
      <alias name='hostdev2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x09' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0003' bus='0x09' slot='0x00' function='0x3'/>
      </source>
      <alias name='hostdev3'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x0a' function='0x0'/>
    </hostdev>


Case2: Start a guest with less than 1G memory without Hostdev, then hotplug a PCI device.

1. Guest XML
<memory unit='KiB'>800576</memory>
<cpu mode='custom' match='exact'>
  <model fallback='allow'>POWER8</model>
  <numa>
    <cell id='0' cpus='0' memory='800576' unit='KiB'/>
  </numa>
</cpu>

2. Start the guest and confirm that no additional PCI device is present within the guest.
3. Prepare an XML file (attach.xml, used in step 5).
  <hostdev mode='subsystem' type='pci' managed='yes'>
    <source>
      <address domain='0x0003' bus='0x09' slot='0x00' function='0x0'/>
    </source>
  </hostdev>
4. Detach the other PCI devices in the same IOMMU group, such as pci_0003_09_00_1, pci_0003_09_00_2, and pci_0003_09_00_3.

# virsh nodedev-detach pci_0003_09_00_1
Device pci_0003_09_00_1 detached
# virsh nodedev-reset pci_0003_09_00_1
Device pci_0003_09_00_1 reset
...
Repeat the same operations for the other PCI devices.
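
The repeated detach and reset commands can be generated rather than typed by hand. This is only a sketch: the device naming pattern is inferred from the names shown in this comment (pci_0003_09_00_1 and so on), the helper names are hypothetical, and the code only builds the command lines without running virsh.

```python
def nodedev_name(domain, bus, slot, function):
    """Build a node device name in the pattern seen above,
    e.g. pci_0003_09_00_1 (pattern inferred from this comment)."""
    return f"pci_{domain:04x}_{bus:02x}_{slot:02x}_{function:x}"


def sibling_detach_commands(group_addresses, keep):
    """Return virsh argument lists that detach and reset every device in
    the IOMMU group except the one that will be attached to the guest."""
    commands = []
    for addr in group_addresses:
        if addr == keep:
            continue
        name = nodedev_name(*addr)
        commands.append(["virsh", "nodedev-detach", name])
        commands.append(["virsh", "nodedev-reset", name])
    return commands


group = [(3, 9, 0, fn) for fn in range(4)]
for cmd in sibling_detach_commands(group, keep=(3, 9, 0, 0)):
    print(" ".join(cmd))
```

Each returned list could be passed to subprocess.run() on the host if the commands should be executed instead of printed.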

5. Attach device pci_0003_09_00_0 to the guest
# virsh attach-device avocado-vt-vm1 attach.xml
6. Check the device is added within the guest.
[root@localhost ~]# lspci
...
00:08.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
7. Double-check the dumpxml output of the guest on the host to confirm that the address of this PCI device is correct.
# virsh dumpxml guest|grep hostdev -a4

    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0003' bus='0x09' slot='0x00' function='0x0'/>
      </source>
      <alias name='hostdev0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
    </hostdev>


Both tests passed.
Marking this bug as Verified.
Comment 24 errata-xmlrpc 2016-11-03 14:26:22 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2577.html
