Bug 1427005 - [RFE] libvirt support of VT-d protected device assignment
Summary: [RFE] libvirt support of VT-d protected device assignment
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: libvirt
Version: 7.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: rc
Target Release: 7.4
Assignee: Ján Tomko
QA Contact: Jingjing Shao
URL:
Whiteboard:
Depends On: 1335808
Blocks:
Reported: 2017-02-27 04:15 UTC by Peter Xu
Modified: 2017-08-02 00:03 UTC (History)

Fixed In Version: libvirt-3.2.0-6.el7
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-08-01 17:24:15 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2017:1846 normal SHIPPED_LIVE libvirt bug fix and enhancement update 2017-08-01 18:02:50 UTC

Description Peter Xu 2017-02-27 04:15:20 UTC
Description of problem:

Upstream QEMU is going to support VT-d protected device assignment. Latest patchset:

  https://lists.gnu.org/archive/html/qemu-devel/2017-02/msg01331.html

We expect it to be in QEMU 2.9, but that is not guaranteed. Either way, we will need the corresponding support in libvirt soon.

For general emulated devices like e1000, we only need this line to enable VT-d protection:

  -device intel-iommu[,intremap=on]

Here "intremap=on" will also enable interrupt remapping, which is optional.

When the guest has assigned devices, the VT-d parameters differ slightly in two respects:

1. "caching-mode=on" is required. This enables VT-d Caching Mode. The parameter will then look like:

  -device intel-iommu,caching-mode=on[,intremap=on]

2. We need to make sure the intel-iommu device is created before any assigned devices. More generally, it would be best to create the intel-iommu device before all other devices.

  For example, this will work:

  $qemu -M q35,accel=kvm,kernel-irqchip=split -m 2G \
        -device intel-iommu,caching-mode=on,intremap=on \
        -device vfio-pci,host=02:00.0

  While this will NOT work:

  $qemu -M q35,accel=kvm,kernel-irqchip=split -m 2G \
        -device vfio-pci,host=02:00.0 \
        -device intel-iommu,caching-mode=on,intremap=on

  (We might provide something better in QEMU 2.10+ that removes the 2nd requirement, but for now it is required.)
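
The two requirements above can be sketched in a few lines of Python. This is only an illustration of the ordering rule, not libvirt or QEMU code; the helper name build_device_args and its parameters are made up for this example.

```python
from typing import List, Optional


def build_device_args(assigned_hosts: List[str],
                      intremap: Optional[bool] = None) -> List[str]:
    """Return -device arguments with intel-iommu placed first.

    assigned_hosts: host PCI addresses to pass through via vfio-pci.
    intremap: True/False adds intremap=on/off; None omits the option.
    """
    iommu = "intel-iommu"
    # Requirement 1: caching mode is needed as soon as a device is assigned.
    if assigned_hosts:
        iommu += ",caching-mode=on"
    if intremap is not None:
        iommu += ",intremap=" + ("on" if intremap else "off")

    # Requirement 2: the IOMMU goes before every assigned device.
    args = ["-device", iommu]
    for host in assigned_hosts:
        args += ["-device", f"vfio-pci,host={host}"]
    return args


print(" ".join(build_device_args(["02:00.0"], intremap=True)))
```

With one assigned device and interrupt remapping enabled, this yields the same -device ordering as the working example above.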

Comment 3 Ján Tomko 2017-03-23 15:29:50 UTC
Upstream patches:
https://www.redhat.com/archives/libvir-list/2017-March/msg01072.html

Any feedback on:
* how to properly probe if QEMU supports kernel-irqchip=split
* the documentation patches
is especially welcome.

Comment 10 Ján Tomko 2017-05-15 15:06:38 UTC
Pushed upstream as:
commit 8023b21a95f271e51810de7f1362e609eaadc1e4
Author:     Ján Tomko <jtomko@redhat.com>
CommitDate: 2017-05-15 15:41:17 +0200

    conf: add <ioapic driver> to <features>
    
    Add a new <ioapic> element with a driver attribute.
    
    Possible values are qemu and kvm. With 'qemu', the I/O
    APIC can be put in the userspace even for KVM domains.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1427005

commit 6b5c6314b2f7a3b54c94a591e6b0dcd13ef1c6ce
Author:     Ján Tomko <jtomko@redhat.com>
CommitDate: 2017-05-15 15:44:11 +0200

    qemu: format kernel_irqchip on the command line
    
    Add kernel_irqchip=split/on to the QEMU command line
    and a capability that looks for it in query-command-line-options
    output. For the 'split' option, use a version check
    since it cannot be reasonably probed.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1427005

commit 2020e2c6f2656ca1aa9032859ccde76185c37c39
Author:     Ján Tomko <jtomko@redhat.com>
CommitDate: 2017-05-15 15:44:11 +0200

    conf: add <driver intremap> to <iommu>
    
    Add a new attribute to control interrupt remapping.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1427005

commit 04028a9db9f2657e8d57d1e4705073c908aa248c
Author:     Ján Tomko <jtomko@redhat.com>
CommitDate: 2017-05-15 15:44:11 +0200

    qemu: format intel-iommu,intremap on the command line
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1427005

commit d12781b47eb0c9f3a498d88b632c327aa08aaf8a
Author:     Ján Tomko <jtomko@redhat.com>
CommitDate: 2017-05-15 15:44:11 +0200

    conf: add caching_mode attribute to iommu device
    
    Add a new attribute to control the caching mode.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1427005

commit a56914486ca67f921ee6e3ce26b5787fccb47155
Author:     Ján Tomko <jtomko@redhat.com>
CommitDate: 2017-05-15 15:44:11 +0200

    qemu: format caching-mode on iommu command line
    
    Format the caching-mode option for the intel-iommu device,
    based on its <driver caching> attribute value.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1427005

commit 3a276c6524026b661ed7bee4539fc5387b963611
Author:     Ján Tomko <jtomko@redhat.com>
CommitDate: 2017-05-15 15:44:12 +0200

    conf: split out virDomainIOMMUDefCheckABIStability

commit 935d927aa881753fff30f6236eedcf9680bca638
Author:     Ján Tomko <jtomko@redhat.com>
CommitDate: 2017-05-15 15:44:12 +0200

    conf: add ABI stability checks for IOMMU options
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1427005

git describe: v3.3.0-47-g935d927
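
As a rough picture of how the new XML maps onto the QEMU command line, here is a minimal Python sketch. It is not the libvirt implementation. It assumes that <ioapic driver='qemu'/> corresponds to kernel_irqchip=split and driver='kvm' to kernel_irqchip=on, and the function name qemu_options and the returned dictionary are illustrative only.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from <ioapic driver> to the kernel_irqchip machine option.
IRQCHIP_FOR_IOAPIC = {"qemu": "split", "kvm": "on"}


def qemu_options(domain_xml: str) -> dict:
    """Derive the irqchip mode and intel-iommu device string from domain XML."""
    root = ET.fromstring(domain_xml)
    result = {}

    ioapic = root.find("./features/ioapic")
    if ioapic is not None:
        result["kernel_irqchip"] = IRQCHIP_FOR_IOAPIC[ioapic.get("driver")]

    iommu = root.find("./devices/iommu")
    if iommu is not None and iommu.get("model") == "intel":
        device = "intel-iommu"
        driver = iommu.find("driver")
        if driver is not None:
            # Only options that are present in the XML are formatted.
            if driver.get("intremap"):
                device += ",intremap=" + driver.get("intremap")
            if driver.get("caching_mode"):
                device += ",caching-mode=" + driver.get("caching_mode")
        result["device"] = device
    return result
```

Note that the XML attribute is spelled caching_mode while the QEMU option is caching-mode.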

Comment 14 Jingjing Shao 2017-05-26 09:27:06 UTC
Hi Jan,

I tried to verify this bug but found the three issues below. Could you check them and give some feedback? Thank you in advance.


(1) If I add the iommu device without caching_mode='on', the host crashes when I attach a VF to the guest.

(2) The value intremap='split' is not allowed in the guest XML, but the QEMU error message mentions 'split'.

(3) In the guest, the devices share the same IOMMU group, and the device cannot be attached to the nested guest.


Details:

(1) Add the following XML to the guest, without caching_mode='on':
<iommu model='intel'/>

or 
<ioapic driver='qemu'/>
....
<iommu model='intel'>
 <driver intremap='on'/> ===> or  <driver intremap='off'/>
</iommu>


# virsh list  
 Id    Name                           State
----------------------------------------------------
 16    q35-js                         running

# cat vf.xml 
<interface type='hostdev' managed='yes'>
<mac address='02:24:6b:89:bc:e9'/>
<source>
<address type='pci' domain='0x0000' bus='0x86' slot='0x10' function='0x1'/>
</source>
</interface>

# virsh attach-device  q35-js vf.xml

The host crashes.



(2) Add the XML below without adding <ioapic driver='qemu'/>:
    <iommu model='intel'>
      <driver intremap='on'/>
    </iommu>

Starting the guest fails with this error:

# virsh start q35-js
error: Failed to start domain q35-js
error: internal error: qemu unexpectedly closed the monitor: 2017-05-26T06:23:33.798217Z qemu-kvm: -chardev pty,id=charserial0: char device redirected to /dev/pts/1 (label charserial0)
2017-05-26T06:23:33.857340Z qemu-kvm: -device intel-iommu,intremap=on: Intel Interrupt Remapping cannot work with kernel-irqchip=on, please use 'split|off'.

But if I change the attribute to intremap='split':
 <iommu model='intel'>
    <driver intremap='split'/>
 </iommu>

I also get an error:

# virsh edit q35-js
error: XML document failed to validate against schema: Unable to validate doc against /usr/share/libvirt/schemas/domain.rng
Extra element devices in interleave
Element domain failed to validate content
Failed. Try again? [y,n,i,f,?]: 
error: XML error: unknown intremap value: split


(3) Add the following XML to the guest:
 <ioapic driver='qemu'/>
...
 <iommu model='intel'>
  <driver intremap='on' caching_mode='on'/>
 </iommu>

# virsh start q35-js
Domain q35-js started

# ps -ef | grep iommu
.... -device intel-iommu,intremap=on,caching-mode=on ...

Log in to the guest:
# cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-663.el7.x86_64 root=/dev/mapper/rhel-root ro console=tty0 console=ttyS0,115200 reboot=pci biosdevname=0 crashkernel=auto rd.lvm.lv=rhel/root rd.lvm.lv=rhel/swap rhgb quiet intel_iommu=on


# lspci | grep Eth
00:03.0 Ethernet controller: Red Hat, Inc Virtio network device
03:01.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8100/8101L/8139 PCI Fast Ethernet Adapter (rev 20)

# virsh nodedev-list --tree
 +- pci_0000_00_1e_0
  |   |
  |   +- pci_0000_02_00_0
  |       |
  |       +- pci_0000_03_01_0
  |           |
  |           +- net_enp3s1_52_54_00_1c_10_ac


# virsh nodedev-dumpxml pci_0000_03_01_0
<device>
  <name>pci_0000_03_01_0</name>
  <path>/sys/devices/pci0000:00/0000:00:1e.0/0000:02:00.0/0000:03:01.0</path>
  <parent>pci_0000_02_00_0</parent>
  <driver>
    <name>8139cp</name>
  </driver>
  <capability type='pci'>
    <domain>0</domain>
    <bus>3</bus>
    <slot>1</slot>
    <function>0</function>
    <product id='0x8139'>RTL-8100/8101L/8139 PCI Fast Ethernet Adapter</product>
    <vendor id='0x10ec'>Realtek Semiconductor Co., Ltd.</vendor>
    <iommuGroup number='9'>
      <address domain='0x0000' bus='0x00' slot='0x1e' function='0x0'/>
      <address domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
      <address domain='0x0000' bus='0x03' slot='0x01' function='0x0'/>
    </iommuGroup>
  </capability>
</device>

# cat pf.xml
 <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x03' slot='0x01' function='0x0'/>
      </source>
   </hostdev>

# virsh attach-device rhel7  pf.xml
error: Failed to attach device from pf.xml
error: internal error: unable to execute QEMU command 'device_add': vfio error: 0000:03:01.0: failed to setup container for group 9: failed to set iommu for container: Device or resource busy

Comment 15 Ján Tomko 2017-05-31 14:24:44 UTC
1) not a libvirt bug
2) libvirt should catch that and report a nicer error. Please file a new bug.
3) AFAIK that does not work on bare metal either. If the devices are in the same IOMMU group, they all need to be detached from the host before assigning one of them to the guest.
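
For point 2, a check run before the guest is started could report the problem more clearly than the QEMU monitor error does. The following Python sketch only illustrates the idea and is not how libvirt implements it; it assumes a KVM domain, and the name check_intremap and the message wording are hypothetical.

```python
import xml.etree.ElementTree as ET

# Values libvirt accepted for intremap in this report ('split' was rejected).
VALID_INTREMAP = {"on", "off"}


def check_intremap(domain_xml: str) -> list:
    """Return a list of human-readable problems (empty if none found)."""
    root = ET.fromstring(domain_xml)
    problems = []

    driver = root.find("./devices/iommu/driver")
    if driver is None or driver.get("intremap") is None:
        return problems

    intremap = driver.get("intremap")
    if intremap not in VALID_INTREMAP:
        problems.append(f"unknown intremap value: {intremap}")
    elif intremap == "on":
        # QEMU refuses intremap=on together with kernel-irqchip=on, so the
        # I/O APIC has to be moved to userspace with <ioapic driver='qemu'/>.
        ioapic = root.find("./features/ioapic")
        if ioapic is None or ioapic.get("driver") != "qemu":
            problems.append(
                "intremap='on' requires <ioapic driver='qemu'/> in <features>")
    return problems
```

An empty list means the interrupt remapping settings look consistent; anything else can be shown to the user before QEMU is launched.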

Comment 16 Jingjing Shao 2017-06-01 05:30:08 UTC
(In reply to Ján Tomko from comment #15)
> 1) not a libvirt bug

I think it really is a bug, because it causes the host to crash. Is there another bug tracking this issue, or do I need to file a new bug against another component?


> 2) libvirt should catch that and report a nicer error. Please file a new bug.
OK, I filed a new bug for this:
https://bugzilla.redhat.com/show_bug.cgi?id=1457610


> 3) AFAIK that does not work on bare metal either. If the devices are in the
> same IOMMU group, they all need to be detached from the host before
> assigning one of them to the guest.

I tried this but still get an error. The details are below.

libvirt-3.2.0-7.el7.x86_64
# virsh nodedev-list --tree
 +- pci_0000_b4_00_0
      |
      +- pci_0000_b5_00_0
          |
          +- net_eth0_52_54_00_ee_67_31


# virsh nodedev-dumpxml pci_0000_b5_00_0
<device>
  <name>pci_0000_b5_00_0</name>
  <path>/sys/devices/pci0000:b4/0000:b4:00.0/0000:b5:00.0</path>
  <parent>pci_0000_b4_00_0</parent>
  <driver>
    <name>virtio-pci</name>
  </driver>
  <capability type='pci'>
    <domain>0</domain>
    <bus>181</bus>
    <slot>0</slot>
    <function>0</function>
    <product id='0x1041'>Virtio network device</product>
    <vendor id='0x1af4'>Red Hat, Inc</vendor>
    <iommuGroup number='9'>
      <address domain='0x0000' bus='0xb4' slot='0x00' function='0x0'/>
      <address domain='0x0000' bus='0xb5' slot='0x00' function='0x0'/>
    </iommuGroup>
    <numa node='0'/>
    <pci-express>
      <link validity='cap' port='0' speed='2.5' width='1'/>
      <link validity='sta' speed='2.5' width='1'/>
    </pci-express>
  </capability>
</device>

# virsh nodedev-dumpxml pci_0000_b4_00_0  <====it is pcieport
<device>
  <name>pci_0000_b4_00_0</name>
  <path>/sys/devices/pci0000:b4/0000:b4:00.0</path>
  <parent>computer</parent>
  <driver>
    <name>pcieport</name>
  </driver>
  <capability type='pci'>
    <domain>0</domain>
    <bus>180</bus>
    <slot>0</slot>
    <function>0</function>
    <product id='0x3420'>7500/5520/5500/X58 I/O Hub PCI Express Root Port 0</product>
    <vendor id='0x8086'>Intel Corporation</vendor>
    <capability type='pci-bridge'/>
    <iommuGroup number='9'>
      <address domain='0x0000' bus='0xb4' slot='0x00' function='0x0'/>
      <address domain='0x0000' bus='0xb5' slot='0x00' function='0x0'/>
    </iommuGroup>
    <numa node='0'/>
    <pci-express>
      <link validity='cap' port='0' speed='2.5' width='1'/>
      <link validity='sta' speed='2.5' width='1'/>
    </pci-express>
  </capability>
</device>

# virsh nodedev-detach  pci_0000_b5_00_0
Device pci_0000_b5_00_0 detached


# virsh nodedev-dumpxml pci_0000_b5_00_0  |grep driver -A2   
  <driver>
    <name>vfio-pci</name>
  </driver>

# virsh nodedev-detach pci_0000_b4_00_0
Device pci_0000_b4_00_0 detached

# virsh nodedev-dumpxml pci_0000_b4_00_0 | grep driver -A 
#

# cat pf.xml 
 <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0xb5' slot='0x00' function='0x0'/>
      </source>
   </hostdev>

# virsh attach-device rhel7 pf.xml
[ 5591.704766] Out of memory: Kill process 4947 (qemu-kvm) score 474 or sacrifice child
[ 5591.706008] Killed process 4947 (qemu-kvm) total-vm:2026324kB, anon-rss:184776kB, file-rss:0kB, shmem-rss:0kB
error: Failed to attach device from pf.xml
error: internal error: child reported: Kernel does not provide mount namespace: No such file or directory

# virsh domstate rhel7  --reason
shut off (crashed)

Comment 17 Jingjing Shao 2017-06-07 09:38:59 UTC
(In reply to Jingjing Shao from comment #16)
> (In reply to Ján Tomko from comment #15)
> > 1) not a libvirt bug
> 
> I think it is really a bug for it causes the host crash. So is there another
> bug to track this issue? or need to file a new bug for other component ?

I found two bugs that track this issue; they may have the same cause.
https://bugzilla.redhat.com/show_bug.cgi?id=1441605
https://bugzilla.redhat.com/show_bug.cgi?id=1450309#c3

> 
> > 3) AFAIK that does not work on bare metal either. If the devices are in the
> > same IOMMU group, they all need to be detached from the host before
> > assigning one of them to the guest.
> 
> I try this but still get error. the details are as below.

Please help to check the third issue and give me some feedback, thank you in advance.

Comment 18 Ján Tomko 2017-06-13 14:34:22 UTC
(In reply to Jingjing Shao from comment #16)
> # virsh attach-device rhel7 pf.xml
> [ 5591.704766] Out of memory: Kill process 4947 (qemu-kvm) score 474 or
> sacrifice child
> [ 5591.706008] Killed process 4947 (qemu-kvm) total-vm:2026324kB,
> anon-rss:184776kB, file-rss:0kB, shmem-rss:0kB

Seems like qemu cannot allocate enough memory. With the iommu device, more locked memory might be required. Did you try setting the hard_limit to something huge?

<memtune>
  <hard_limit unit='G'>100</hard_limit>
</memtune>

This should also increase the amount of memory QEMU can lock.

> error: Failed to attach device from pf.xml
> error: internal error: child reported: Kernel does not provide mount
> namespace: No such file or directory
> 
> # virsh domstate rhel7  --reason
> shut off (crashed)

I am not sure if this error is relevant after the OOM error. But if it's relevant, see if you can reproduce the same error with namespaces = [] in qemu.conf.
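
The <memtune> snippet above can also be added to an existing domain XML programmatically. A minimal sketch using only Python's standard library; set_hard_limit is a hypothetical helper, and the resulting XML would still have to be handed back to libvirt (for example with virsh define).

```python
import xml.etree.ElementTree as ET


def set_hard_limit(domain_xml: str, value: int, unit: str = "G") -> str:
    """Return domain XML with <memtune><hard_limit> set to the given value."""
    root = ET.fromstring(domain_xml)

    # Reuse existing elements so that repeated calls do not duplicate them.
    memtune = root.find("memtune")
    if memtune is None:
        memtune = ET.SubElement(root, "memtune")
    hard_limit = memtune.find("hard_limit")
    if hard_limit is None:
        hard_limit = ET.SubElement(memtune, "hard_limit")

    hard_limit.set("unit", unit)
    hard_limit.text = str(value)
    return ET.tostring(root, encoding="unicode")


print(set_hard_limit("<domain type='kvm'><name>q35-js</name></domain>", 100))
```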

Comment 19 Jingjing Shao 2017-06-14 15:01:49 UTC
(In reply to Ján Tomko from comment #18)
> (In reply to Jingjing Shao from comment #16)
> > # virsh attach-device rhel7 pf.xml
> > [ 5591.704766] Out of memory: Kill process 4947 (qemu-kvm) score 474 or
> > sacrifice child
> > [ 5591.706008] Killed process 4947 (qemu-kvm) total-vm:2026324kB,
> > anon-rss:184776kB, file-rss:0kB, shmem-rss:0kB
> 
> Seems like qemu cannot allocate enough memory. With the iommu device, more
> locked memory might be required. Did you try setting the hard_limit to
> something huge?

No, I did not set hard_limit. The memory part of the guest configuration is below:

# virsh dumpxml q35-js
<domain type='kvm' id='62'>
  <name>q35-js</name>
  <uuid>34cc0dae-8998-480c-b2db-171ce1e7461a</uuid>
  <memory unit='KiB'>1048576</memory>
  <currentMemory unit='KiB'>1048576</currentMemory>
  <vcpu placement='static'>1</vcpu>


> 
> <memtune>
>   <hard_limit unit='G'>100</hard_limit>
> </memtune>
> 
> This should also increase the amount of memory QEMU can lock.
> 
> > error: Failed to attach device from pf.xml
> > error: internal error: child reported: Kernel does not provide mount
> > namespace: No such file or directory
> > 
> > # virsh domstate rhel7  --reason
> > shut off (crashed)
> 
> I am not sure if this error is relevant after the OOM error. But if it's
> relevant, see if you can reproduce the same error with namespaces = [] in
> qemu.conf.

I added namespaces = [] to qemu.conf and can no longer reproduce this error,
but I got another error:

# virsh attach-device rhel7  pf.xml
[  990.408929] Out of memory: Kill process 2975 (qemu-kvm) score 557 or sacrifice child
[  990.410051] Killed process 2975 (qemu-kvm) total-vm:2054172kB, anon-rss:438364kB, file-rss:12kB, shmem-rss:0kB
error: Failed to attach device from pf.xml
error: Unable to write to '/sys/fs/cgroup/devices/machine.slice/machine-qemu\x2d1\x2drhel7.scope/devices.deny': No such file or directory


# virsh domstate rhel7 --reason
shut off (crashed)


Can you help check this issue?

Comment 21 Laine Stump 2017-06-14 16:42:39 UTC
About case (3) in Comment 14 - the problem isn't because the other devices are in the same iommu group as the device you're trying to assign - those other two devices are PCI controllers (a dmi-to-pci-bridge and a pci-bridge), and it's normal for the parent PCI controllers of a device to be in the same IOMMU group. That isn't a problem because they are (or at least *should be*) excepted from the "all devices in the group must be unbound from their host drivers" rule by vfio. So something else is causing this error:

  error: internal error: unable to execute QEMU command 'device_add': vfio error:
  0000:03:01.0: failed to setup container for group 9: failed to set iommu for
  container: Device or resource busy
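
To see which devices share an IOMMU group with the one being assigned, the <iommuGroup> element of the virsh nodedev-dumpxml output can be read with a short script. A minimal Python sketch; the function name iommu_group is made up for this example, and it only lists the group members without deciding which of them are bridges.

```python
import xml.etree.ElementTree as ET


def iommu_group(nodedev_xml: str):
    """Return (group_number, [PCI addresses]) from nodedev-dumpxml output."""
    root = ET.fromstring(nodedev_xml)
    group = root.find("./capability/iommuGroup")
    if group is None:
        return None, []

    members = []
    for addr in group.findall("address"):
        # Attributes look like domain='0x0000' bus='0x03' slot='0x01'.
        members.append("{:04x}:{:02x}:{:02x}.{:x}".format(
            int(addr.get("domain"), 16),
            int(addr.get("bus"), 16),
            int(addr.get("slot"), 16),
            int(addr.get("function"), 16),
        ))
    return int(group.get("number")), members
```

For the device in comment 14 this would list 0000:00:1e.0, 0000:02:00.0 and 0000:03:01.0 as members of group 9.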

Comment 22 Ján Tomko 2017-06-15 12:03:33 UTC
(In reply to Jingjing Shao from comment #19)
> I add the namespaces = [] in the qemu.conf and can not reproduce this error
> but I got another error as below 
> 
> # virsh attach-device rhel7  pf.xml
> [  990.408929] Out of memory: Kill process 2975 (qemu-kvm) score 557 or
> sacrifice child
> [  990.410051] Killed process 2975 (qemu-kvm) total-vm:2054172kB,
> anon-rss:438364kB, file-rss:12kB, shmem-rss:0kB

The qemu process was killed by the OOM killer here. Is there enough free memory on the host? Was this with or without the hard_limit set?

> error: Failed to attach device from pf.xml
> error: Unable to write to
> '/sys/fs/cgroup/devices/machine.slice/machine-qemu\x2d1\x2drhel7.scope/
> devices.deny': No such file or directory

Comment 23 Jingjing Shao 2017-06-16 09:19:21 UTC
Hi Ján,

Thanks for your patient reply. It really was caused by the memory problem.
I tried a test with an l1 guest with 5G of memory and an l2 (nested) guest with 1G of memory, and got the expected result.

But if the memory of the two guests is not suitable, I still hit the error above.

So do we have some documentation for the memory configuration?

i.e.
1. What is the minimum memory of the l1 guest?
2. If the l1 guest has the minimum memory, what is the maximum memory of the nested guest?

I tested the four scenarios below. Can you check whether they are enough to verify this bug?

1. Host ->virtual network-> l1 guest ->virtual network-> l2 guest 
2. Host ->virtual network-> l1 guest -> pci assignment -> l2 guest 
3. Host ->vf pci assignment-> l1 guest ->virtual network-> l2 guest 
4. Host ->vf pci assignment-> l1 guest ->pci assignment-> l2 guest 

And a stretch test:
Host ->vf pci assignment with numa node-> l1 guest ->pci assignment -> l2 guest 

I will provide the detailed steps below, one by one.

Comment 24 Jingjing Shao 2017-06-16 09:23:48 UTC
Preparation :
1. Add the info to l1 guest q35-js
...
 <features>
    <ioapic driver='qemu'/>
  </features>
...
    <iommu model='intel'>
      <driver intremap='on' caching_mode='on'/>
    </iommu>
...

2. start the guest 
# virsh start q35-js
Domain q35-js started

3. Check the qemu command line
 -device intel-iommu,intremap=on,caching-mode=on
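
Step 3 can be checked by script instead of by eye. A minimal Python sketch, assuming the QEMU command line is available as a single string (for example copied from ps output); iommu_device_options is a hypothetical helper name.

```python
import shlex


def iommu_device_options(cmdline: str):
    """Return the intel-iommu options as a dict, or None if it is absent."""
    args = shlex.split(cmdline)
    # Look at each "-device <value>" pair on the command line.
    for flag, value in zip(args, args[1:]):
        if flag != "-device":
            continue
        name, _, rest = value.partition(",")
        if name == "intel-iommu":
            return dict(opt.split("=", 1)
                        for opt in rest.split(",") if "=" in opt)
    return None


opts = iommu_device_options(
    "qemu-kvm -machine q35 -device intel-iommu,intremap=on,caching-mode=on")
print(opts)
```

A result of None means the guest was started without an intel-iommu device.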

Comment 25 Jingjing Shao 2017-06-16 09:38:50 UTC
scenario 1 : Host ->virtual network-> l1 guest ->virtual network-> l2 guest 

1. Prepare a l1 guest with xml as below
  <interface type='network'>
      <mac address='52:54:00:ee:67:31'/>
      <source network='default' bridge='vir'/>
      <target dev='vnet0'/>
      <model type='virtio'/>
      <alias name='net1'/>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </interface>


2. check the device in the l1 guest
# virsh nodedev-list --tree
  +- pci_0000_00_1b_0
  |   |
  |   +- net_eth0_52_54_00_98_14_7e
  | 

    
# virsh nodedev-dumpxml pci_0000_00_1b_0
<device>
  <name>pci_0000_00_1b_0</name>
  <path>/sys/devices/pci0000:00/0000:00:1b.0</path>
  <parent>computer</parent>
  <driver>
    <name>virtio-pci</name>
  </driver>
  <capability type='pci'>
    <domain>0</domain>
    <bus>0</bus>
    <slot>27</slot>
    <function>0</function>
    <product id='0x1000'>Virtio network device</product>
    <vendor id='0x1af4'>Red Hat, Inc</vendor>
    <iommuGroup number='11'>
      <address domain='0x0000' bus='0x00' slot='0x1b' function='0x0'/>
    </iommuGroup>
  </capability>
</device>

3. Attach the device with macvtap and add the xml as below to l2 guest rhel7
 <interface type='direct'>
      <mac address='52:54:00:b1:9c:b0'/>
      <source dev='eth0' mode='bridge'/>
      <model type='rtl8139'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </interface>

4. start the l2 guest rhel7 and check the device in the l2 guest

# virsh nodedev-dumpxml pci_0000_00_02_0
<device>
  <name>pci_0000_00_02_0</name>
  <path>/sys/devices/pci0000:00/0000:00:02.0</path>
  <parent>computer</parent>
  <driver>
    <name>8139cp</name>
  </driver>
  <capability type='pci'>
    <domain>0</domain>
    <bus>0</bus>
    <slot>2</slot>
    <function>0</function>
    <product id='0x8139'>RTL-8100/8101L/8139 PCI Fast Ethernet Adapter</product>
    <vendor id='0x10ec'>Realtek Semiconductor Co., Ltd.</vendor>
  </capability>
</device>

Comment 26 Jingjing Shao 2017-06-17 03:50:14 UTC
scenario 2 : Host ->virtual network-> l1 guest -> pci assignment -> l2 guest 

Repeat steps 1-2 of scenario 1
...

3. Start the guest and attach the device to guest with the vfio pci assignment 

# cat pf3.xml
 <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x00' slot='0x1b' function='0x0'/>
      </source>
   </hostdev>


# virsh attach-device rhel7 pf3.xml
Device attached successfully



# virsh dumpxml rhel7 | grep interface
...
  <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x00' slot='0x1b' function='0x0'/>
      </source>
      <alias name='hostdev0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </hostdev>
...

Check the xml of device
# virsh nodedev-dumpxml pci_0000_00_1b_0
<device>
  <name>pci_0000_00_1b_0</name>
  <path>/sys/devices/pci0000:00/0000:00:1b.0</path>
  <parent>computer</parent>
  <driver>
    <name>vfio-pci</name>
  </driver>
  <capability type='pci'>
    <domain>0</domain>
    <bus>0</bus>
    <slot>27</slot>
    <function>0</function>
    <product id='0x1000'>Virtio network device</product>
    <vendor id='0x1af4'>Red Hat, Inc</vendor>
    <iommuGroup number='11'>
      <address domain='0x0000' bus='0x00' slot='0x1b' function='0x0'/>
    </iommuGroup>
  </capability>
</device>


4. Log in to the l2 guest rhel7 and check the device

# lspci
00:02.0 Ethernet controller: Red Hat, Inc Virtio network device (rev 01)

# virsh nodedev-dumpxml pci_0000_00_02_0
<device>
  <name>pci_0000_00_02_0</name>
  <path>/sys/devices/pci0000:00/0000:00:02.0</path>
  <parent>computer</parent>
  <driver>
    <name>virtio-pci</name>
  </driver>
  <capability type='pci'>
    <domain>0</domain>
    <bus>0</bus>
    <slot>2</slot>
    <function>0</function>
    <product id='0x1000'>Virtio network device</product>
    <vendor id='0x1af4'>Red Hat, Inc</vendor>
  </capability>
</device>

Comment 27 Jingjing Shao 2017-06-17 04:15:17 UTC
scenario 3 : Host ->vf pci assignment-> l1 guest ->virtual network-> l2 guest (with numa node)

1. prepare l1 guest with numa configuration, attach vf to the guest, check the dumpxml of device 

# virsh dumpxml q35-js 

   <numa>
      <cell id='0' cpus='0' memory='5242880' unit='KiB'/>
    </numa>
....
    <interface type='hostdev' managed='yes'>
      <mac address='02:24:6b:89:bc:e9'/>
      <driver name='vfio'/>
      <source>
        <address type='pci' domain='0x0000' bus='0x86' slot='0x10' function='0x3'/>
      </source>
      <alias name='hostdev0'/>
      <address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
    </interface>
....



#lspci

  +- pci_0000_b4_01_0
      |
      +- pci_0000_b6_00_0

# virsh nodedev-dumpxml pci_0000_b6_00_0
<device>
  <name>pci_0000_b6_00_0</name>
  <path>/sys/devices/pci0000:b4/0000:b4:01.0/0000:b6:00.0</path>
  <parent>pci_0000_b4_01_0</parent>
  <driver>
    <name>vfio-pci</name>
  </driver>
  <capability type='pci'>
    <domain>0</domain>
    <bus>182</bus>
    <slot>0</slot>
    <function>0</function>
    <product id='0x10ed'>82599 Ethernet Controller Virtual Function</product>
    <vendor id='0x8086'>Intel Corporation</vendor>
    <iommuGroup number='15'>
      <address domain='0x0000' bus='0xb4' slot='0x01' function='0x0'/>
      <address domain='0x0000' bus='0xb6' slot='0x00' function='0x0'/>
    </iommuGroup>
    <numa node='0'/>
    <pci-express>
      <link validity='cap' port='0' width='0'/>
      <link validity='sta' width='0'/>
    </pci-express>
  </capability>
</device>

Check the numa_node
# cat /sys/devices/pci0000:b4/0000:b4:01.0/0000:b6:00.0/numa_node
0
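
The same numa_node lookup can be scripted. A minimal Python sketch, assuming the usual Linux sysfs layout under /sys/bus/pci/devices; the function name pci_numa_node and the sysfs_root parameter are illustrative only.

```python
from pathlib import Path


def pci_numa_node(address: str, sysfs_root: str = "/sys") -> int:
    """Return the NUMA node of a PCI device such as '0000:b6:00.0'.

    Linux reports -1 when the device has no NUMA affinity.
    """
    path = Path(sysfs_root) / "bus" / "pci" / "devices" / address / "numa_node"
    return int(path.read_text().strip())
```

Calling pci_numa_node("0000:b6:00.0") inside the l1 guest reads the same value that the cat command above prints.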

2. Create a virtual network whose source is this device to l2 guest rhel7
 <interface type='direct'>
      <mac address='52:54:00:ef:ab:ac'/>
      <source dev='enp182s0' mode='bridge'/>
      <target dev='macvtap0'/>
      <model type='virtio'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </interface>


Log in to the l2 guest to check this device

# lspci
00:02.0 Ethernet controller: Red Hat, Inc Virtio network device

Comment 28 Jingjing Shao 2017-06-17 04:18:12 UTC
scenario 4: Host ->vf pci assignment-> l1 guest ->pci assignment-> l2 guest (With numa node)

Repeat step 1 of scenario 3


2. Attach the VF device to l2 guest rhel7
# cat pf.xml 
 <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0xb6' slot='0x00' function='0x0'/>
      </source>
   </hostdev>


# virsh attach-device rhel7  pf.xml
Device attached successfully


# virsh dumpxml rhel7 
...
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0xb6' slot='0x00' function='0x0'/>
      </source>
      <alias name='hostdev0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </hostdev>
...

3. Check the device info in guest2
# lspci
00:03.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)

# virsh nodedev-dumpxml pci_0000_00_03_0
<device>
  <name>pci_0000_00_03_0</name>
  <path>/sys/devices/pci0000:00/0000:00:03.0</path>
  <parent>computer</parent>
  <driver>
    <name>ixgbevf</name>
  </driver>
  <capability type='pci'>
    <domain>0</domain>
    <bus>0</bus>
    <slot>3</slot>
    <function>0</function>
    <product id='0x10ed'>82599 Ethernet Controller Virtual Function</product>
    <vendor id='0x8086'>Intel Corporation</vendor>
    <pci-express>
      <link validity='cap' port='0' width='0'/>
      <link validity='sta' width='0'/>
    </pci-express>
  </capability>
</device>
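
The pf.xml files used in these scenarios differ only in the PCI address, so they can be generated. A minimal Python sketch; hostdev_xml is a hypothetical helper, and it emits the same <hostdev> structure as the files shown above (on one line, without indentation).

```python
import re
import xml.etree.ElementTree as ET

# Accepts "b6:00.0" or "0000:b6:00.0" (domain defaults to 0000).
_PCI_RE = re.compile(
    r"^(?:([0-9a-fA-F]{4}):)?([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])$")


def hostdev_xml(pci_address: str) -> str:
    """Build a managed vfio <hostdev> element for the given PCI address."""
    match = _PCI_RE.match(pci_address)
    if match is None:
        raise ValueError(f"not a PCI address: {pci_address!r}")
    domain, bus, slot, function = match.groups()

    hostdev = ET.Element("hostdev", mode="subsystem", type="pci", managed="yes")
    ET.SubElement(hostdev, "driver", name="vfio")
    source = ET.SubElement(hostdev, "source")
    ET.SubElement(source, "address",
                  domain="0x" + (domain or "0000").lower(),
                  bus="0x" + bus.lower(),
                  slot="0x" + slot.lower(),
                  function="0x" + function)
    return ET.tostring(hostdev, encoding="unicode")


print(hostdev_xml("b6:00.0"))
```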

Comment 29 Ján Tomko 2017-06-20 15:07:57 UTC
(In reply to Jingjing Shao from comment #23)
> Hi Ján,
> 
> Thanks your patient reply . It really caused by the memory problem.
> I try the test which include a l1 guest with 5G memory and a l2 guest
> (nested)with 1G memory and get the result as expected.
> 
> But if the memory of two guests is not suitable, I still meet the error info
> as above. 
> 
> So does we have some doc for the memory configuration ?  
> 

Yes. We have documented this to be an undecidable problem:
http://libvirt.org/formatdomain.html#elementsMemoryTuning
Since commit 7e66766 using <memoryBacking><locked> implies
no limit on locked memory. (setting <hard_limit> also works, because it also influences the locked memory limit)

> i.e. 
> 1. what is minimum memory of l1 guest?   
> 2. If the l1 guest is with minimum memory, what is maximum memory of nested
> guest?

So the unhelpful but honest answer to these questions is:
the minimum/maximum that successfully works for your use-case +/- some memory to reduce the chance of failing later

> 
> And  I test with four scenarios as below and can you  help to check if they
> are enough to verify this bug?

Yes, these look sufficient to me.

> 
> 1. Host ->virtual network-> l1 guest ->virtual network-> l2 guest 
> 2. Host ->virtual network-> l1 guest -> pci assignment -> l2 guest 
> 3. Host ->vf pci assignment-> l1 guest ->virtual network-> l2 guest 
> 4. Host ->vf pci assignment-> l1 guest ->pci assignment-> l2 guest 
> 
> And stretch test
> Host ->vf pci assignment with numa node-> l1 guest ->pci assignment -> l2
> guest

Comment 30 Jingjing Shao 2017-06-22 05:43:56 UTC
According to comment 29, changing the status to verified.

Comment 31 errata-xmlrpc 2017-08-01 17:24:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1846
