Add support for device assignment from guest user-space, by enabling vIOMMU for VFIO devices. This will enable IOMMU-protected device assignment from guest userspace and in nested VMs. The patches for this are being discussed upstream: https://lists.nongnu.org/archive/html/qemu-devel/2016-04/msg01665.html
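For orientation, a minimal sketch of the command-line shape this feature targets, based on the options used in the verification commands later in this bug: the machine needs kernel-irqchip=split so interrupt remapping can be enabled, and the vIOMMU needs caching-mode so QEMU can shadow guest IOMMU mappings into VFIO. The PCI address is only an example, and this is not a complete command line:

# minimal sketch, not a complete command line
/usr/libexec/qemu-kvm -M q35,kernel-irqchip=split \
  -device intel-iommu,intremap=true,caching-mode=true \
  -device vfio-pci,host=0000:81:00.0 \
  ...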
*** Bug 1330118 has been marked as a duplicate of this bug. ***
http://post-office.corp.redhat.com/archives/rhvirt-patches/2017-April/msg00365.html
Fix included in qemu-kvm-rhev-2.9.0-2.el7
Hi Peter,

QE is verifying this bug. However, I'm not quite sure about the correct command line. Is the command line below right? Are any options missing or used incorrectly? Could you provide some suggestions? Thanks.

/usr/libexec/qemu-kvm -name rhel7.4 -M q35,kernel-irqchip=split \
  -device intel-iommu,device-iotlb=on,intremap,caching-mode=true \
  -cpu Haswell-noTSX -m 8G -numa node \
  -smp 4,sockets=1,cores=4,threads=1 \
  -device vfio-pci,host=0000:81:00.0 \
  -device vfio-pci,host=0000:81:00.1 \
  -drive file=/home/images_nfv-virt-rt-kvm/rhel7.4_rt.qcow2,format=qcow2,if=none,id=drive-virtio-blk0,werror=stop,rerror=stop \
  -device virtio-blk-pci,drive=drive-virtio-blk0,id=virtio-blk0 \
  -vnc :2 \
  -monitor stdio

Best Regards,
Pei
QE filed a new bug: Bug 1448813 - qemu crash when shutdown guest with '-device intel-iommu' and '-device vfio-pci'
==Summary==
Performance of passthrough with iommu looks good. Throughput results of "passthrough with iommu", "bare-metal" and "passthrough without iommu" are almost the same, very close to the 10G line rate.

Note:
- QE ran each type of testing 5 times.
- Default parameters when testing throughput:
  Traffic Generator: MoonGen
  Acceptable Loss: 0.002%
  Frame Size: 64 Byte
  Unidirectional: Yes
  Search run time: 60s
  Validation run time: 30s
  Virtio features: default
  CPU: Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz
  NIC: 10-Gigabit X540-AT2

==Results==
- passthrough with iommu:
No  Throughput(Mpps)  packets_loss_rate
1   14.695494         0.000000%
2   14.642680         0.000000%
3   14.697329         0.000000%
4   14.642594         0.000000%
5   14.696112         0.000000%

- bare-metal:
No  Throughput(Mpps)  packets_loss_rate
1   14.696393         0.000000%
2   14.695025         0.000000%
3   14.696213         0.000000%
4   14.641595         0.000000%
5   14.694909         0.000000%

- passthrough without iommu:
No  Throughput(Mpps)  packets_loss_rate
1   14.696650         0.000000%
2   14.641099         0.000000%
3   14.696032         0.000000%
4   14.695486         0.000000%
5   14.696716         0.000000%

==Steps of "passthrough with iommu"==
Versions:
3.10.0-664.rt56.583.el7.x86_64
qemu-kvm-rhev-2.9.0-3.el7.x86_64

Steps:
1. In host, add "iommu=pt intel_iommu=on" and "default_hugepagesz=1G" to the kernel line, refer to [1].
2. In host, bind 2 network devices to the vfio driver, refer to [2].
3. In host, reserve hugepages from the NUMA node where the assigned network devices are located, refer to [3]. Note: the cores and memory used by the guest should be in the same NUMA node as the host network devices.
4. Boot the guest with iommu, the assigned network devices and hugepages, refer to [4].
5. Pin the 4 vCPUs to individual cores. The cores bound to vCPUs 1~3 should be in the same NUMA node as the network devices. In this case, we bind vCPU1 to core 1, vCPU2 to core 3 and vCPU3 to core 5, refer to [5].
6. In guest, add "intel_iommu=on" and "default_hugepagesz=1G" to the kernel line, refer to [6].
7. In guest, load vfio, refer to [7].
8. In guest, reserve hugepages, refer to [8].
9. In guest, bind the 2 assigned network devices to the vfio driver and start dpdk's testpmd, refer to [9].
10. In another host, start MoonGen, refer to [10].
Reference

[1] kernel command line of host
# cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-663.el7.x86_64 root=/dev/mapper/rhel_dell--per730--27-root ro crashkernel=auto rd.lvm.lv=rhel_dell-per730-27/root rd.lvm.lv=rhel_dell-per730-27/swap console=ttyS0,115200n81 default_hugepagesz=1G iommu=pt intel_iommu=on isolcpus=1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,30,28,26,24,22,20,18,16 nohz=on nohz_full=1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,30,28,26,24,22,20,18,16 rcu_nocbs=1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,30,28,26,24,22,20,18,16 intel_pstate=disable nosoftlockup

[2] NICs to be assigned
# ls /sys/bus/pci/drivers/vfio-pci/
0000:04:00.0  0000:04:00.1  bind  module  new_id  remove_id  uevent  unbind

[3] reserve hugepages from NUMA node 1
echo 10 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages

[4] qemu command line
# /usr/libexec/qemu-kvm -name rhel7.4 -M q35,kernel-irqchip=split \
  -cpu host -m 8G \
  -device intel-iommu,intremap=true,caching-mode=true \
  -object memory-backend-file,id=mem,size=8G,mem-path=/dev/hugepages,share=on \
  -numa node,memdev=mem -mem-prealloc \
  -smp 4,sockets=1,cores=4,threads=1 \
  -device pcie-root-port,id=root.1,slot=1 \
  -device pcie-root-port,id=root.2,slot=2 \
  -device pcie-root-port,id=root.3,slot=3 \
  -device pcie-root-port,id=root.4,slot=4 \
  -device vfio-pci,host=0000:81:00.0,bus=root.1 \
  -device vfio-pci,host=0000:81:00.1,bus=root.2 \
  -netdev tap,id=hostnet0,vhost=on \
  -device virtio-net-pci,netdev=hostnet0,id=net0,bus=root.3,mac=88:66:da:5f:dd:01 \
  -drive file=/home/images_nfv-virt-rt-kvm/rhel7.4.qcow2,format=qcow2,if=none,id=drive-virtio-blk0,werror=stop,rerror=stop \
  -device virtio-blk-pci,drive=drive-virtio-blk0,id=virtio-blk0,bus=root.4 \
  -vnc :2 \
  -monitor stdio

Only memory of node1 should be used:
# numastat -c qemu-kvm
Per-node process memory usage (in MBs) for PID 5581 (qemu-kvm)
         Node 0 Node 1 Total
         ------ ------ -----
Huge          0   8192  8192
Heap         16      4    20
Stack         0      0     0
Private      33      0    33
-------  ------ ------ -----
Total        50   8196  8246

[5] pin vCPUs
(qemu) info cpus
* CPU #0: pc=0xffffffff864a7286 (halted) thread_id=4993
  CPU #1: pc=0xffffffff864a7286 (halted) thread_id=5006
  CPU #2: pc=0xffffffff864a7286 (halted) thread_id=5007
  CPU #3: pc=0xffffffff864a7286 (halted) thread_id=5008
# taskset -cp 30 4993
# taskset -cp 1 5006
# taskset -cp 3 5007
# taskset -cp 5 5008

[6] kernel command line of guest
# cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-663.el7.x86_64 root=/dev/mapper/rhel_bootp--73--75--189-root ro console=tty0 console=ttyS0,115200n8 biosdevname=0 crashkernel=auto rd.lvm.lv=rhel_bootp-73-75-189/root rd.lvm.lv=rhel_bootp-73-75-189/swap rhgb quiet LANG=en_US.UTF-8 intel_iommu=on default_hugepagesz=1G

[7] load vfio in guest
# modprobe vfio
# modprobe vfio-pci

[8] reserve hugepages in guest
# echo 3 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages

[9] start testpmd
# dpdk-devbind --bind=vfio-pci 0000:01:00.0
# dpdk-devbind --bind=vfio-pci 0000:02:00.0
# dpdk-devbind --status
Network devices using DPDK-compatible driver
============================================
0000:01:00.0 'Ethernet Controller 10-Gigabit X540-AT2' drv=vfio-pci unused=ixgbe
0000:02:00.0 'Ethernet Controller 10-Gigabit X540-AT2' drv=vfio-pci unused=ixgbe

# /usr/bin/testpmd \
  -l 1,2,3 \
  -n 4 \
  -d /usr/lib64/librte_pmd_ixgbe.so \
  -w 0000:01:00.0 -w 0000:02:00.0 \
  -- \
  --nb-cores=2 \
  --disable-hw-vlan \
  -i \
  --disable-rss \
  --rxq=1 --txq=1

[10] start MoonGen with https://github.com/atheurer/MoonGen/blob/opnfv-dev/examples/opnfv-vsperf.lua
# ./build/MoonGen examples/opnfv-vsperf.lua
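(Not part of the original steps, just a suggested sanity check.) Before binding the NICs to vfio-pci inside the guest, it can be worth confirming that the guest actually sees the vIOMMU and that the assigned devices were placed in IOMMU groups; these are standard kernel interfaces:

# in the guest
dmesg | grep -i -e DMAR -e IOMMU        # DMAR table / IOMMU initialization messages
find /sys/kernel/iommu_groups/ -type l  # one link per device placed in an IOMMU group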
Hi Peter,

Can this bug be verified based on Comment 11? If not, please provide the missing scenarios. Thanks.

Best Regards,
Pei
Pei, Can you also try it with device assignment from DPDK in the guest?

Maxime, Can you comment about the config for that?
(In reply to Amnon Ilan from comment #13)
> Pei, Can you also try it with device assignment from DPDK in the guest?

Hi Amnon,

It seems that with device assignment from the guest, the nested guest can work, but the assigned NICs cannot receive data. I'm not quite sure whether the steps below are correct.

Steps:
1. In host, set kvm_intel with nesting enabled
# modprobe -r kvm_intel
# modprobe kvm_intel nested=1 enable_shadow_vmcs=1 ept=1 enable_apicv=1

2. In host, boot L1 guest with 6 cores, 12G hugepages from NUMA node1, intel-iommu and the assigned NICs. Note: host hugepage size is 1G.
# /usr/libexec/qemu-kvm -name rhel7.4 -M q35,kernel-irqchip=split \
  -cpu host -m 12G \
  -device intel-iommu,intremap=true,caching-mode=true \
  -object memory-backend-file,id=mem,size=12G,mem-path=/dev/hugepages,share=on \
  -numa node,memdev=mem -mem-prealloc \
  -smp 6,sockets=1,cores=6,threads=1 \
  -device pcie-root-port,id=root.1,slot=1 \
  -device pcie-root-port,id=root.2,slot=2 \
  -device pcie-root-port,id=root.3,slot=3 \
  -device pcie-root-port,id=root.4,slot=4 \
  -device vfio-pci,host=0000:81:00.0,bus=root.1 \
  -device vfio-pci,host=0000:81:00.1,bus=root.2 \
  -netdev tap,id=hostnet0,vhost=on \
  -device virtio-net-pci,netdev=hostnet0,id=net0,bus=root.3,mac=88:66:da:5f:dd:11 \
  -drive file=/home/images_nfv-virt-rt-kvm/rhel7.4_L1.qcow2,format=qcow2,if=none,id=drive-virtio-blk0,werror=stop,rerror=stop \
  -device virtio-blk-pci,drive=drive-virtio-blk0,id=virtio-blk0,bus=root.4 \
  -vnc :2 \
  -monitor stdio

3. In host, pin the 6 vCPUs to individual cores on NUMA node1
(qemu) info cpus
* CPU #0: pc=0x000000007ffb516f thread_id=6795
  CPU #1: pc=0x00000000000fd45c (halted) thread_id=6796
  CPU #2: pc=0x00000000000fd45c (halted) thread_id=6797
  CPU #3: pc=0x00000000000fd45c (halted) thread_id=6798
  CPU #4: pc=0x00000000000fd45c (halted) thread_id=6799
  CPU #5: pc=0x00000000000fd45c (halted) thread_id=6800
# taskset -cp 1 6795
# taskset -cp 3 6796
# taskset -cp 5 6797
# taskset -cp 7 6798
# taskset -cp 9 6799
# taskset -cp 11 6800

4. In L1 guest, bind NICs to vfio
# modprobe vfio
# modprobe vfio-pci
# dpdk-devbind --bind=vfio-pci 0000:01:00.0
# dpdk-devbind --bind=vfio-pci 0000:02:00.0
# dpdk-devbind --status
Network devices using DPDK-compatible driver
============================================
0000:01:00.0 'Ethernet Controller 10-Gigabit X540-AT2' drv=vfio-pci unused=ixgbe
0000:02:00.0 'Ethernet Controller 10-Gigabit X540-AT2' drv=vfio-pci unused=ixgbe

5. In L1 guest, boot L2 guest with 4 cores, 6G memory and the assigned network devices. Note: L1 guest hugepage size is 1G.
# /usr/libexec/qemu-kvm -name rhel7.4 -M q35,kernel-irqchip=split \
  -cpu host -m 6G \
  -device intel-iommu,intremap=true,caching-mode=true \
  -object memory-backend-file,id=mem,size=6G,mem-path=/dev/hugepages,share=on \
  -numa node,memdev=mem -mem-prealloc \
  -smp 4,sockets=1,cores=4,threads=1 \
  -device pcie-root-port,id=root.1,slot=1 \
  -device pcie-root-port,id=root.2,slot=2 \
  -device pcie-root-port,id=root.3,slot=3 \
  -device pcie-root-port,id=root.4,slot=4 \
  -device vfio-pci,host=0000:01:00.0,bus=root.1 \
  -device vfio-pci,host=0000:02:00.0,bus=root.2 \
  -netdev tap,id=hostnet0,vhost=on \
  -device virtio-net-pci,netdev=hostnet0,id=net0,bus=root.3,mac=88:66:da:5f:dd:01 \
  -drive file=/home/rhel7.4_L2.qcow2,format=qcow2,if=none,id=drive-virtio-blk0,werror=stop,rerror=stop \
  -device virtio-blk-pci,drive=drive-virtio-blk0,id=virtio-blk0,bus=root.4 \
  -vnc :2 \
  -monitor stdio
6. In L2 guest, bind NICs to the vfio driver
# modprobe vfio
# modprobe vfio-pci
# dpdk-devbind --bind=vfio-pci 0000:01:00.0
# dpdk-devbind --bind=vfio-pci 0000:02:00.0
# dpdk-devbind --status
Network devices using DPDK-compatible driver
============================================
0000:01:00.0 'Ethernet Controller 10-Gigabit X540-AT2' drv=vfio-pci unused=ixgbe
0000:02:00.0 'Ethernet Controller 10-Gigabit X540-AT2' drv=vfio-pci unused=ixgbe

7. In L2 guest, start testpmd. Note: L2 guest hugepage size is 2M, because 1G hugepages require the pdpe1gb CPU flag, which is not present in the L2 guest.
# /usr/bin/testpmd \
  -l 1,2,3 \
  -n 4 \
  -d /usr/lib64/librte_pmd_ixgbe.so \
  -w 0000:01:00.0 -w 0000:02:00.0 \
  -- \
  --nb-cores=2 \
  --disable-hw-vlan \
  -i \
  --disable-rss \
  --rxq=1 --txq=1

8. Start MoonGen in another host.
# ./build/MoonGen examples/l2-load-latency.lua 0 1 64

9. In L2 guest, testpmd fails to receive packets.
testpmd> quit
Telling cores to stop...
Waiting for lcores to finish...

  ---------------------- Forward statistics for port 0  ----------------------
  RX-packets: 9          RX-dropped: 1205687       RX-total: 1205696
  TX-packets: 0          TX-dropped: 0             TX-total: 0
  ----------------------------------------------------------------------------

  ---------------------- Forward statistics for port 1  ----------------------
  RX-packets: 9          RX-dropped: 1205379       RX-total: 1205388
  TX-packets: 0          TX-dropped: 0             TX-total: 0
  ----------------------------------------------------------------------------

  +++++++++++++++ Accumulated forward statistics for all ports +++++++++++++++
  RX-packets: 18         RX-dropped: 2411066       RX-total: 2411084
  TX-packets: 0          TX-dropped: 0             TX-total: 0
  ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Done.
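A small extra check for the nested setup (a suggestion only, not something the steps above rely on): after reloading kvm_intel in step 1, nesting can be confirmed on the host, and VMX visibility can be confirmed inside the L1 guest, using standard interfaces:

# on the host: should report Y (or 1, depending on kernel version)
cat /sys/module/kvm_intel/parameters/nested
# in the L1 guest: the vmx flag must be exposed for L2 to start
grep -c vmx /proc/cpuinfo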
Hi Pei,

I don't see anything wrong in the above steps,
but nested KVM is new to me.

In L2 guest, have you tried to use the Kernel
driver for the network cards instead of binding them to DPDK?

Regards,
Maxime
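For reference, one way to run that test could be to rebind the two NICs in the L2 guest from vfio-pci back to the ixgbe kernel driver, for example with the same dpdk-devbind tool already used in the steps above (PCI addresses as seen in the L2 guest; this is a suggestion, not part of the original steps):

# in the L2 guest
dpdk-devbind --bind=ixgbe 0000:01:00.0
dpdk-devbind --bind=ixgbe 0000:02:00.0
dpdk-devbind --status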
(In reply to Amnon Ilan from comment #13)
> Pei, Can you also try it with device assignment from DPDK in the guest?
> Maxime, Can you comment about the config for that?

Hi Amnon,

IIUC Pei was already testing with DPDK and MoonGen in comment 11, or did you really mean "nested device assignment" here? Thanks.
(In reply to Maxime Coquelin from comment #15)
> Hi Pei,
>
> I don't see anything wrong in the above steps,
> but nested KVM is new to me.
>
> In L2 guest, have you tried to use the Kernel
> driver for the network cards instead of binding them to DPDK?

Agree. We can try the kernel driver first. (Btw, does anyone know why we see a lot of RX-dropped there but almost no RX-packets?)

Also, Pei, can we try to boot L2 guest without vIOMMU for now (but of course we need vIOMMU in the L1 guest)?

Peter
(In reply to Maxime Coquelin from comment #15)
> Hi Pei,
>
> I don't see anything wrong in the above steps,
> but nested KVM is new to me.
>
> In L2 guest, have you tried to use the Kernel
> driver for the network cards instead of binding them to DPDK?

Hi Maxime, with the kernel driver, the assigned network devices in the L2 guest cannot receive ping packets.

(In reply to Peter Xu from comment #17)
>
> Also, Pei, can we try to boot L2 guest without vIOMMU for now (but of course
> we need vIOMMU in the L1 guest)?

Peter, L1 guest with vIOMMU and L2 guest without vIOMMU (commands refer to [1]):

(1) With the kernel driver, the assigned network devices in the L2 guest are able to receive packets; ping works well.

(2) With the assigned network devices bound to the no-iommu vfio driver in the L2 guest (refer to [2]), dpdk's testpmd is also able to receive packets; it works well. Below are the throughput results:

(Same default throughput testing parameters as in Comment 11)
No  Throughput(Mpps)  packets_loss_rate
1   10.838200         0.001992%
2   11.029459         0.001861%

[1] qemu command line of L2 guest without vIOMMU:
# /usr/libexec/qemu-kvm -name rhel7.4 -M q35 \
  -cpu host -m 6G \
  -object memory-backend-file,id=mem,size=6G,mem-path=/dev/hugepages,share=on \
  -numa node,memdev=mem -mem-prealloc \
  -smp 4,sockets=1,cores=4,threads=1 \
  -device pcie-root-port,id=root.1,slot=1 \
  -device pcie-root-port,id=root.2,slot=2 \
  -device pcie-root-port,id=root.3,slot=3 \
  -device pcie-root-port,id=root.4,slot=4 \
  -device vfio-pci,host=0000:01:00.0,bus=root.1 \
  -device vfio-pci,host=0000:02:00.0,bus=root.2 \
  -netdev tap,id=hostnet0,vhost=on \
  -device virtio-net-pci,netdev=hostnet0,id=net0,bus=root.3,mac=88:66:da:5f:dd:01 \
  -drive file=/home/rhel7.4_L2.qcow2,format=qcow2,if=none,id=drive-virtio-blk0,werror=stop,rerror=stop \
  -device virtio-blk-pci,drive=drive-virtio-blk0,id=virtio-blk0,bus=root.4 \
  -vnc :2 \
  -monitor stdio

[2] load no-iommu vfio in L2 guest
# modprobe vfio enable_unsafe_noiommu_mode=Y
# modprobe vfio-pci
# cat /sys/module/vfio/parameters/enable_unsafe_noiommu_mode
Y
> (In reply to Peter Xu from comment #17)
> >
> > Also, Pei, can we try to boot L2 guest without vIOMMU for now (but of course
> > we need vIOMMU in the L1 guest)?
>
> Peter, L1 guest with vIOMMU and L2 guest without vIOMMU (commands refer to [1]):
>
> (1) With the kernel driver, the assigned network devices in the L2 guest are
> able to receive packets; ping works well.
>
> (2) With the assigned network devices bound to the no-iommu vfio driver in the
> L2 guest (refer to [2]), dpdk's testpmd is also able to receive packets; it
> works well. Below are the throughput results:
>
> (Same default throughput testing parameters as in Comment 11)
> No  Throughput(Mpps)  packets_loss_rate
> 1   10.838200         0.001992%
> 2   11.029459         0.001861%

Thanks Pei for your testing. So it looks like nested device assignment works, but nested vIOMMU is not ready yet. It may need some debugging to find out why.

But before that: do we really have requirement for nested vIOMMU? IMHO nested device assignment should be the goal, but not nested vIOMMU for now, right?

Amnon/others?

Peter
Hi Peter,

(In reply to Peter Xu from comment #19)
> But before that: do we really have requirement for nested vIOMMU? IMHO
> nested device assignment should be the goal, but not nested vIOMMU for now,
> right?
>
> Amnon/others?

Amnon is off today, but we discussed this yesterday, and my understanding is that nesting is not the priority, so a 7.5 BZ could be created for this failing case.

Maxime
(In reply to Maxime Coquelin from comment #20)
> Hi Peter,
>
> (In reply to Peter Xu from comment #19)
> > But before that: do we really have requirement for nested vIOMMU? IMHO
> > nested device assignment should be the goal, but not nested vIOMMU for now,
> > right?
> >
> > Amnon/others?
>
> Amnon is off today, but we discussed this yesterday, and my understanding is
> that nesting is not the priority, so a 7.5 BZ could be created for this
> failing case.

Right, please open a 7.5 BZ for the nesting issues.

BTW, do we want to test VFIO with vIOMMU for other devices? (not only network devices)

Thanks,
Amnon
(In reply to Amnon Ilan from comment #21)
>
> Right, please open a 7.5 BZ for the nesting issues.

OK. QE has filed a new bug [1] to track this issue.

[1] Bug 1450712 - Booting nested guest with vIOMMU, the assigned network devices can not receive packets

Best Regards,
Pei
(In reply to Amnon Ilan from comment #21)
>
> BTW, do we want to test VFIO with vIOMMU for other devices? (not
> only network devices)

Currently for device assignment, besides network devices, we also have colleagues testing GPUs, but that device does not seem to need vIOMMU in the guest. So from QE's perspective, it seems no other devices need to be tested.

Hi Peter, Maxime, could you please share your opinions on this question? Thanks.
(In reply to Pei Zhang from comment #23)
> (In reply to Amnon Ilan from comment #21)
> >
> > BTW, do we want to test VFIO with vIOMMU for other devices? (not
> > only network devices)
>
> Currently for device assignment, besides network devices, we also have
> colleagues testing GPUs, but that device does not seem to need vIOMMU in the
> guest. So from QE's perspective, it seems no other devices need to be tested.
>
> Hi Peter, Maxime, could you please share your opinions on this question?
> Thanks.

IMHO it's okay for now. After all we can never cover all the cases/cards,
while we can still open new bz for specific issues when needed.

I would also like to see how Maxime/Amnon/others see this though.
(In reply to Peter Xu from comment #24)
> (In reply to Pei Zhang from comment #23)
> > IMHO it's okay for now. After all we can never cover all the cases/cards,
> > while we can still open new bz for specific issues when needed.
> >
> > I would also like to see how Maxime/Amnon/others see this though.

I don't have other use-cases in mind for now.
(In reply to Maxime Coquelin from comment #25)
> (In reply to Peter Xu from comment #24)
> > (In reply to Pei Zhang from comment #23)
> >
> > IMHO it's okay for now. After all we can never cover all the cases/cards,
> > while we can still open new bz for specific issues when needed.
> >
> > I would also like to see how Maxime/Amnon/others see this though.
>
> I don't have other use-cases in mind for now.

Me neither, thanks
Setting this bug to 'VERIFIED' based on Comment 11, Comment 24, Comment 25 and Comment 26. Please correct me if you have any concerns. Thanks.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:2392