Bug 1498817
| Summary: | Vhost IOMMU support regression since qemu-kvm-rhev-2.9.0-16.el7_4.5 | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Maxime Coquelin <maxime.coquelin> |
| Component: | qemu-kvm-rhev | Assignee: | Maxime Coquelin <maxime.coquelin> |
| Status: | CLOSED ERRATA | QA Contact: | Pei Zhang <pezhang> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 7.4 | CC: | ailan, chayang, jasowang, mst, mtessun, pbonzini, peterx, pezhang, sgordon, virt-maint, wexu |
| Target Milestone: | rc | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | qemu-kvm-rhev-2.10.0-4.el7 | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2018-04-11 00:38:42 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
A patch series fixing this issue had already been posted upstream by Peter Xu, but it has not been applied:
<1496404254-17429-1-git-send-email-peterx>
https://lists.gnu.org/archive/html/qemu-devel/2017-06/msg00571.html
I rebased and re-posted the first two patches of the series:
Message-Id: <20171005171309.1250-1-maxime.coquelin>
https://lists.gnu.org/archive/html/qemu-devel/2017-10/msg01139.html
The following patches were posted upstream:
commit 96fe9842106b9bdcb2d2d4493b062e65ed6db6d6
Author: Maxime Coquelin <maxime.coquelin>
Date: Tue Oct 10 10:20:15 2017 +0200
memory: fix off-by-one error in memory_region_notify_one()
This patch fixes an off-by-one error that could lead to the
notifyee to receive notifications for ranges it is not
registered to.
The bug has been spotted by code review.
Fixes: bd2bfa4c52e5 ("memory: introduce memory_region_notify_one()")
Cc: qemu-stable
Cc: Peter Xu <peterx>
Signed-off-by: Maxime Coquelin <maxime.coquelin>
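The class of off-by-one error described in this commit message can be illustrated with a small range-overlap check. This is only a sketch, not QEMU's code: the `Notifier` and `Entry` classes and both `overlaps_*` functions are hypothetical names, and the sketch assumes that both ranges are inclusive and that `addr_mask` is the entry size minus one.

```python
from dataclasses import dataclass


@dataclass
class Notifier:
    start: int  # first address the notifier is registered for (inclusive)
    end: int    # last address the notifier is registered for (inclusive)


@dataclass
class Entry:
    iova: int       # first address covered by the entry
    addr_mask: int  # entry size minus one, so iova + addr_mask is the last address


def overlaps_buggy(n: Notifier, e: Entry) -> bool:
    # Off by one: treats the address just past the entry as part of it.
    return not (n.start > e.iova + e.addr_mask + 1 or n.end < e.iova)


def overlaps_fixed(n: Notifier, e: Entry) -> bool:
    # Compares against the last address the entry actually covers.
    return not (n.start > e.iova + e.addr_mask or n.end < e.iova)


# The entry covers 0x1000-0x1fff; the notifier starts at the very next address.
notifier = Notifier(start=0x2000, end=0x2fff)
entry = Entry(iova=0x1000, addr_mask=0xfff)

print(overlaps_buggy(notifier, entry))  # True: notified for a range it is not registered to
print(overlaps_fixed(notifier, entry))  # False
```

With the extra `+ 1`, a notifier registered for the range immediately following the entry is still notified, which matches the symptom the commit message describes.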
commit 3a90d32d26caf499787e2b33a92c96d8bb903c6f
Author: Peter Xu <peterx>
Date: Thu Oct 5 16:30:34 2017 +0200
exec: simplify address_space_get_iotlb_entry
This patch let address_space_get_iotlb_entry() to use the newly
introduced page_mask parameter in flatview_do_translate(). Then we
will be sure the IOTLB can be aligned to page mask, also we should
nicely support huge pages now when introducing a764040.
Fixes: a764040 ("exec: abstract address_space_do_translate()")
Signed-off-by: Peter Xu <peterx>
Signed-off-by: Maxime Coquelin <maxime.coquelin>
Acked-by: Michael S. Tsirkin <mst>
commit 23ba2a608f236564ac37705c97c5e7b916bd7849
Author: Peter Xu <peterx>
Date: Thu Oct 5 15:35:26 2017 +0200
exec: add page_mask for flatview_do_translate
The function is originally used for flatview_space_translate() and what
we care about most is (xlat, plen) range. However for iotlb requests, we
don't really care about "plen", but the size of the page that "xlat" is
located on. While, plen cannot really contain this information.
A simple example to show why "plen" is not good for IOTLB translations:
E.g., for huge pages, it is possible that guest mapped 1G huge page on
device side that used this GPA range:
0x100000000 - 0x13fffffff
Then let's say we want to translate one IOVA that finally mapped to GPA
0x13ffffe00 (which is located on this 1G huge page). Then here we'll
get:
(xlat, plen) = (0x13ffffe00, 0x200)
So the IOTLB would be only covering a very small range since from
"plen" (which is 0x200 bytes) we cannot tell the size of the page.
Actually we can really know that this is a huge page - we just throw the
information away in flatview_do_translate().
This patch introduced "page_mask" optional parameter to capture that
page mask info. Also, I made "plen" an optional parameter as well, with
some comments for the whole function.
No functional change yet.
Signed-off-by: Peter Xu <peterx>
Signed-off-by: Maxime Coquelin <maxime.coquelin>
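The arithmetic behind the 1G huge page example in this commit message can be checked with a short sketch. It is an illustration only: `entry_from_plen` and `entry_from_page_mask` are hypothetical helpers, and the mask convention (offset bits set, that is page size minus one) is an assumption made for this sketch rather than a statement about the QEMU implementation.

```python
PAGE_SIZE_1G = 1 << 30

# Assumed convention for this sketch: the mask has the in-page offset bits set.
page_mask = PAGE_SIZE_1G - 1

page_start = 0x100000000  # start of the 1G huge page (GPA range from the example)
gpa = 0x13ffffe00         # translated address located near the end of that page

# Bytes left between the translated address and the end of the mapping.
plen = page_start + PAGE_SIZE_1G - gpa


def entry_from_plen(xlat, plen):
    """Entry built from (xlat, plen) only: covers just the remaining bytes."""
    return (xlat, plen)


def entry_from_page_mask(xlat, page_mask):
    """Entry aligned on the page that contains xlat, sized to the whole page."""
    return (xlat & ~page_mask, page_mask + 1)


base, size = entry_from_plen(gpa, plen)
print(hex(base), hex(size))  # 0x13ffffe00 0x200

base, size = entry_from_page_mask(gpa, page_mask)
print(hex(base), hex(size))  # 0x100000000 0x40000000
```

The first entry covers only 0x200 bytes, while the second covers the whole huge page, which is the information that "plen" alone cannot carry.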
Patches merged upstream and RHEL 7.5 backport posted.
Brew build: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=14314471
Fix included in qemu-kvm-rhev-2.10.0-4.el7
This bug has been fixed. PVP testing with vIOMMU was done: host, qemu and guest all work well, and the throughput testing results look good.
==Verification==
Versions:
3.10.0-824.el7.x86_64
qemu-kvm-rhev-2.10.0-13.el7.x86_64
dpdk-17.11-4.el7.x86_64
Steps:
1. Boot testpmd in host with 2 "net_vhost,..,iommu-support=1"
testpmd \
-l 1,3,5,7,9 --socket-mem=1024,1024 -n 4 \
-d /usr/lib64/librte_pmd_vhost.so \
--vdev 'net_vhost0,iface=/tmp/vhost-user1,iommu-support=1' \
--vdev 'net_vhost1,iface=/tmp/vhost-user2,iommu-support=1' -- \
--portmask=f --disable-hw-vlan -i --rxq=1 --txq=1 \
--nb-cores=4 --forward-mode=io
testpmd> set portlist 0,2,1,3
testpmd> start
2. Boot VM with vIOMMU
/usr/libexec/qemu-kvm -name rhel7.5_nonrt \
-M q35,kernel-irqchip=split \
-cpu host -m 8G \
-device intel-iommu,intremap=true,caching-mode=true \
-object memory-backend-file,id=mem,size=8G,mem-path=/dev/hugepages,share=on \
-numa node,memdev=mem -mem-prealloc \
-smp 6,sockets=1,cores=6,threads=1 \
-device pcie-root-port,id=root.1,chassis=1 \
-device pcie-root-port,id=root.2,chassis=2 \
-device pcie-root-port,id=root.3,chassis=3 \
-drive file=/home/images_nfv-virt-rt-kvm/rhel7.5_nonrt.qcow2,format=qcow2,if=none,id=drive-virtio-blk0,werror=stop,rerror=stop \
-device virtio-blk-pci,drive=drive-virtio-blk0,id=virtio-blk0,bus=root.1 \
-chardev socket,id=charnet1,path=/tmp/vhost-user1 \
-netdev vhost-user,chardev=charnet1,id=hostnet1 \
-device virtio-net-pci,netdev=hostnet1,id=net1,mac=18:66:da:5f:dd:02,iommu_platform=on,ats=on,bus=root.2 \
-chardev socket,id=charnet2,path=/tmp/vhost-user2 \
-netdev vhost-user,chardev=charnet2,id=hostnet2 \
-device virtio-net-pci,netdev=hostnet2,id=net2,mac=18:66:da:5f:dd:03,iommu_platform=on,ats=on,bus=root.3 \
-vnc :2 \
-monitor stdio
3. In guest, load vfio and start testpmd
# modprobe vfio
# modprobe vfio-pci
# /usr/bin/testpmd \
-l 1,2,3 \
-n 4 \
-d /usr/lib64/librte_pmd_virtio.so.1 \
-w 0000:02:00.0 -w 0000:03:00.0 \
-- \
--nb-cores=2 \
--disable-hw-vlan \
-i \
--disable-rss \
--rxq=1 --txq=1
4. On another host, start the TRex server and generate packets to the guest to get the throughput value.
DIRECTORY=~/src/lua-trafficgen
cd $DIRECTORY
./binary-search.py \
--traffic-generator=trex-txrx \
--search-runtime=30 \
--validation-runtime=60 \
--rate-unit=mpps \
--rate=0 \
--run-bidirec=1 \
--run-revunidirec=0 \
--frame-size=64 \
--num-flows=1024 \
--one-shot=0 \
--max-loss-pct=0.002
Throughput: 18.6Mpps
So this bug has been fixed. Moving the status of this bug to "VERIFIED".
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2018:1104
Description of problem:
A regression is introduced by patch d5ba92b69 ("exec: abstract address_space_do_translate()"), which fixes a bug when IOMMU support is enabled in QEMU's vhost command line but not in the kernel command line (see Bz1482856).
The patch changes the way IOTLB updates sent to the backends are generated.
Prior to this patch, IOTLB entries sent to the backend were aligned on guest page boundaries (both addresses and size). For example, with the guest using 2MB pages:
* Backend sends IOTLB miss request for iova = 0x112378fb4
* QEMU replies with an IOTLB update with iova = 0x112200000, size = 0x200000
* Backend inserts the above entry in its cache and computes the translation
In this case, if the backend later needs to translate 0x112378004, it results in a cache hit and there is no need to send another IOTLB miss.
With this patch, the address of the IOTLB entry is the address requested via the IOTLB miss, and the size is computed to cover the remainder of the guest page. The same example gives:
* Backend sends IOTLB miss request for iova = 0x112378fb4
* QEMU replies with an IOTLB update with iova = 0x112378fb4, size = 0x8704c
* Backend inserts the above entry in its cache and computes the translation
In this case, if the backend later needs to translate 0x112378004, it results in another cache miss:
* Backend sends IOTLB miss request for iova = 0x112378004
* QEMU replies with an IOTLB update with iova = 0x112378004, size = 0x87FFC
* Backend inserts the above entry in its cache and computes the translation
This results in many more IOTLB misses and, more importantly, it pollutes the device IOTLB cache by multiplying the number of entries, which moreover overlap.
Note that the current kernel and user backend implementations do not merge contiguous and overlapping IOTLB entries at device IOTLB cache insertion.
Version-Release number of selected component (if applicable):
qemu-kvm-rhev-2.9.0-16.el7_4.5
The problem is also seen upstream.
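The difference between the two behaviours in the 2MB-page example of the description can be replayed with a small simulation. This is a sketch under stated assumptions, not backend code: `aligned_update`, `unaligned_update` and `count_misses` are hypothetical names, and the cache is modelled as a plain list that never merges entries, as the description says of the current backends.

```python
PAGE_SIZE = 0x200000  # 2MB guest pages, as in the example in the description


def aligned_update(iova):
    """Behaviour before the regression: entry aligned on the guest page."""
    return (iova & ~(PAGE_SIZE - 1), PAGE_SIZE)


def unaligned_update(iova):
    """Regressed behaviour: entry starts at the requested address and
    covers the remainder of the guest page."""
    page_end = (iova & ~(PAGE_SIZE - 1)) + PAGE_SIZE
    return (iova, page_end - iova)


def count_misses(make_update, requests):
    """Replay translation requests against a cache that never merges entries."""
    cache = []
    misses = 0
    for iova in requests:
        hit = any(start <= iova < start + size for start, size in cache)
        if not hit:
            misses += 1
            cache.append(make_update(iova))
    return misses, cache


requests = [0x112378fb4, 0x112378004]

aligned_misses, aligned_cache = count_misses(aligned_update, requests)
unaligned_misses, unaligned_cache = count_misses(unaligned_update, requests)

print(aligned_misses, len(aligned_cache))      # 1 1
print(unaligned_misses, len(unaligned_cache))  # 2 2
```

With page-aligned updates, the second request hits the entry inserted by the first. With the regressed behaviour, each request adds a new entry, and the two entries overlap.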
Patch introducing regression:
commit d5ba92b697f81189c20aa672581ca4aadf3b8302
Author: Peter Xu <peterx>
Date: Mon Aug 21 08:52:14 2017 +0200
exec: abstract address_space_do_translate()
RH-Author: Peter Xu <peterx>
Message-id: <1503305534-8404-2-git-send-email-peterx>
Patchwork-id: 76027
O-Subject: [RHEL-7.4.z qemu-kvm-rhev PATCH 1/1] exec: abstract address_space_do_translate()
Bugzilla: 1482856
RH-Acked-by: Xiao Wang <jasowang>
RH-Acked-by: Laurent Vivier <lvivier>
RH-Acked-by: Paolo Bonzini <pbonzini>
RH-Acked-by: Michael S. Tsirkin <mst>
This function is an abstraction helper for address_space_translate() and
address_space_get_iotlb_entry(). It does the lookup of address into
memory region section, then does proper IOMMU translation if necessary.
Refactor the two existing functions to use it.
This fixes vhost when IOMMU is disabled by guest.
Tested-by: Maxime Coquelin <maxime.coquelin>
Signed-off-by: Peter Xu <peterx>
Reviewed-by: Michael S. Tsirkin <mst>
Signed-off-by: Michael S. Tsirkin <mst>
(cherry picked from commit a764040cc831cfe5b8bf1c80e8341b9bf2de3ce8)
Signed-off-by: Peter Xu <peterx>
Signed-off-by: Miroslav Rezanina <mrezanin>
How reproducible:
100%
Steps to Reproduce:
See reproduction steps with Kernel backend provided by Pei Zhang:
https://bugzilla.redhat.com/show_bug.cgi?id=1480446#c11
Also reproducible with Vhost-user backend, but IOMMU support is not in DPDK upstream yet.
Additional info:
Reverting the patch solves this issue.