Bug 1498817

Summary: Vhost IOMMU support regression since qemu-kvm-rhev-2.9.0-16.el7_4.5
Product: Red Hat Enterprise Linux 7 Reporter: Maxime Coquelin <maxime.coquelin>
Component: qemu-kvm-rhev Assignee: Maxime Coquelin <maxime.coquelin>
Status: CLOSED ERRATA QA Contact: Pei Zhang <pezhang>
Severity: high Docs Contact:
Priority: high    
Version: 7.4 CC: ailan, chayang, jasowang, mst, mtessun, pbonzini, peterx, pezhang, sgordon, virt-maint, wexu
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: qemu-kvm-rhev-2.10.0-4.el7 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-04-11 00:38:42 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Maxime Coquelin 2017-10-05 10:31:58 UTC
Description of problem:

A regression was introduced by patch d5ba92b69 ("exec: abstract address_space_do_translate()"), which fixes a bug seen when IOMMU support is enabled on QEMU's vhost command line but not on the kernel command line (see Bz1482856).

The patch changes the way IOTLB updates sent to the backends are generated.

Prior to this patch, IOTLB entries sent to the backend were aligned on guest page boundaries (both address and size). For example, with the guest using 2MB pages:
 * Backend sends IOTLB miss request for iova = 0x112378fb4
 * QEMU replies with an IOTLB update with iova = 0x112200000, size = 0x200000
 * Backend inserts the above entry in its cache and computes the translation
In this case, if the backend later needs to translate 0x112378004, the lookup results in a cache hit and there is no need to send another IOTLB miss.
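
The aligned reply described above can be sketched as follows. This is only an illustration, not QEMU code: the names `aligned_iotlb_update`, `covers` and `PAGE_SIZE` are hypothetical, and a 2MB guest page is assumed as in the example.

```python
PAGE_SIZE = 0x200000  # 2MB guest page, as in the example (assumption)

def aligned_iotlb_update(miss_iova, page_size=PAGE_SIZE):
    """Return (iova, size) of an update aligned on the guest page."""
    base = miss_iova & ~(page_size - 1)
    return base, page_size

def covers(entry, iova):
    """True if iova falls inside the (start, size) entry."""
    start, size = entry
    return start <= iova < start + size

entry = aligned_iotlb_update(0x112378fb4)
print(hex(entry[0]), hex(entry[1]))   # 0x112200000 0x200000
print(covers(entry, 0x112378004))     # True: second lookup is a cache hit
```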

With this patch, the address of the IOTLB entry is the address requested via the IOTLB miss, and the size is computed to cover the remainder of the guest page.
The same example gives:
 * Backend sends IOTLB miss request for iova = 0x112378fb4
 * QEMU replies with an IOTLB update with iova = 0x112378fb4, size = 0x8704c
 * Backend inserts the above entry in its cache and computes the translation
In this case, if the backend later needs to translate 0x112378004, the lookup results in another cache miss:
 * Backend sends IOTLB miss request for iova = 0x112378004
 * QEMU replies with an IOTLB update with iova = 0x112378004, size = 0x87FFC
 * Backend inserts the above entry in its cache and computes the translation
This results in many more IOTLB misses and, more importantly, pollutes the device IOTLB cache by multiplying the number of entries, which moreover overlap.
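
The difference between the two behaviours can be reproduced with a toy model. The sketch below is an assumption-laden illustration: the function names are hypothetical, the cache is a plain list that never merges entries, and the guest page size is fixed at 2MB as in the example. It is not how QEMU or any backend is implemented.

```python
PAGE_SIZE = 0x200000  # 2MB guest page (assumption, as in the example)

def aligned_update(iova):
    """Pre-patch reply: the whole guest page containing iova."""
    return iova & ~(PAGE_SIZE - 1), PAGE_SIZE

def unaligned_update(iova):
    """Post-patch reply: from iova to the end of its guest page."""
    page_end = (iova & ~(PAGE_SIZE - 1)) + PAGE_SIZE
    return iova, page_end - iova

def simulate(update, lookups):
    """Toy device IOTLB without merging; returns (misses, cache)."""
    cache, misses = [], 0
    for iova in lookups:
        if not any(start <= iova < start + size for start, size in cache):
            misses += 1
            cache.append(update(iova))
    return misses, cache

lookups = [0x112378fb4, 0x112378004]
for name, update in (("aligned", aligned_update),
                     ("unaligned", unaligned_update)):
    misses, cache = simulate(update, lookups)
    print(name, misses, [(hex(s), hex(n)) for s, n in cache])
```

With the two lookups from the example, the aligned policy ends with one miss and one entry, while the unaligned policy ends with two misses and two overlapping entries.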

Note that the current kernel and user backend implementations do not merge contiguous or overlapping IOTLB entries when inserting them into the device IOTLB cache.

Version-Release number of selected component (if applicable):

qemu-kvm-rhev-2.9.0-16.el7_4.5
The problem is also seen upstream.

Patch introducing regression:

commit d5ba92b697f81189c20aa672581ca4aadf3b8302
Author: Peter Xu <peterx>
Date:   Mon Aug 21 08:52:14 2017 +0200

    exec: abstract address_space_do_translate()
    
    RH-Author: Peter Xu <peterx>
    Message-id: <1503305534-8404-2-git-send-email-peterx>
    Patchwork-id: 76027
    O-Subject: [RHEL-7.4.z qemu-kvm-rhev PATCH 1/1] exec: abstract address_space_do_translate()
    Bugzilla: 1482856
    RH-Acked-by: Xiao Wang <jasowang>
    RH-Acked-by: Laurent Vivier <lvivier>
    RH-Acked-by: Paolo Bonzini <pbonzini>
    RH-Acked-by: Michael S. Tsirkin <mst>
    
    This function is an abstraction helper for address_space_translate() and
    address_space_get_iotlb_entry(). It does the lookup of address into
    memory region section, then does proper IOMMU translation if necessary.
    Refactor the two existing functions to use it.
    
    This fixes vhost when IOMMU is disabled by guest.
    
    Tested-by: Maxime Coquelin <maxime.coquelin>
    Signed-off-by: Peter Xu <peterx>
    Reviewed-by: Michael S. Tsirkin <mst>
    Signed-off-by: Michael S. Tsirkin <mst>
    (cherry picked from commit a764040cc831cfe5b8bf1c80e8341b9bf2de3ce8)
    Signed-off-by: Peter Xu <peterx>
    Signed-off-by: Miroslav Rezanina <mrezanin>


How reproducible:
100%

Steps to Reproduce:

See reproduction steps with Kernel backend provided by Pei Zhang:
https://bugzilla.redhat.com/show_bug.cgi?id=1480446#c11

Also reproducible with the vhost-user backend, but IOMMU support is not in upstream DPDK yet.

Additional info:

Reverting the patch solves this issue.

Comment 2 Maxime Coquelin 2017-10-05 17:26:26 UTC
A patch series fixing this issue had already been posted upstream by Peter Xu,
but has not been applied:
<1496404254-17429-1-git-send-email-peterx>
https://lists.gnu.org/archive/html/qemu-devel/2017-06/msg00571.html

I rebased and re-posted the first two patches of the series:
Message-Id: <20171005171309.1250-1-maxime.coquelin>
https://lists.gnu.org/archive/html/qemu-devel/2017-10/msg01139.html

Comment 5 Maxime Coquelin 2017-10-12 16:28:32 UTC
Following patches posted upstream:
commit 96fe9842106b9bdcb2d2d4493b062e65ed6db6d6
Author: Maxime Coquelin <maxime.coquelin>
Date:   Tue Oct 10 10:20:15 2017 +0200

    memory: fix off-by-one error in memory_region_notify_one()
    
    This patch fixes an off-by-one error that could lead to the
    notifyee to receive notifications for ranges it is not
    registered to.
    
    The bug has been spotted by code review.
    
    Fixes: bd2bfa4c52e5 ("memory: introduce memory_region_notify_one()")
    Cc: qemu-stable
    Cc: Peter Xu <peterx>
    Signed-off-by: Maxime Coquelin <maxime.coquelin>

commit 3a90d32d26caf499787e2b33a92c96d8bb903c6f
Author: Peter Xu <peterx>
Date:   Thu Oct 5 16:30:34 2017 +0200

    exec: simplify address_space_get_iotlb_entry
    
    This patch let address_space_get_iotlb_entry() to use the newly
    introduced page_mask parameter in flatview_do_translate(). Then we
    will be sure the IOTLB can be aligned to page mask, also we should
    nicely support huge pages now when introducing a764040.
    
    Fixes: a764040 ("exec: abstract address_space_do_translate()")
    Signed-off-by: Peter Xu <peterx>
    Signed-off-by: Maxime Coquelin <maxime.coquelin>
    Acked-by: Michael S. Tsirkin <mst>

commit 23ba2a608f236564ac37705c97c5e7b916bd7849
Author: Peter Xu <peterx>
Date:   Thu Oct 5 15:35:26 2017 +0200

    exec: add page_mask for flatview_do_translate
    
    The function is originally used for flatview_space_translate() and what
    we care about most is (xlat, plen) range. However for iotlb requests, we
    don't really care about "plen", but the size of the page that "xlat" is
    located on. While, plen cannot really contain this information.
    
    A simple example to show why "plen" is not good for IOTLB translations:
    
    E.g., for huge pages, it is possible that guest mapped 1G huge page on
    device side that used this GPA range:
    
      0x100000000 - 0x13fffffff
    
    Then let's say we want to translate one IOVA that finally mapped to GPA
    0x13ffffe00 (which is located on this 1G huge page). Then here we'll
    get:
    
      (xlat, plen) = (0x13ffffe00, 0x200)
    
    So the IOTLB would be only covering a very small range since from
    "plen" (which is 0x200 bytes) we cannot tell the size of the page.
    
    Actually we can really know that this is a huge page - we just throw the
    information away in flatview_do_translate().
    
    This patch introduced "page_mask" optional parameter to capture that
    page mask info. Also, I made "plen" an optional parameter as well, with
    some comments for the whole function.
    
    No functional change yet.
    
    Signed-off-by: Peter Xu <peterx>
    Signed-off-by: Maxime Coquelin <maxime.coquelin>
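
The 1G huge page example in the last commit message can be checked with a short sketch. The helper name `iotlb_range_from_mask` is hypothetical, and the mask convention used here (low bits set, i.e. page size minus one) is an assumption made for the illustration rather than a statement about the QEMU code.

```python
GIB = 1 << 30

def iotlb_range_from_mask(xlat, page_mask):
    """Return (base, size) of the page containing xlat.

    page_mask has the low bits set (page size - 1); this convention
    is an assumption made for the illustration.
    """
    return xlat & ~page_mask, page_mask + 1

# GPA from the commit message example, 0x200 bytes before the page end
xlat, plen = 0x13ffffe00, 0x200
base, size = iotlb_range_from_mask(xlat, GIB - 1)

print(hex(base), hex(size))   # 0x100000000 0x40000000
print(hex(xlat + plen))       # 0x140000000, the end of the huge page
```

Using "plen" alone would describe a 0x200-byte range, whereas the page mask recovers the full 1G page that contains the address.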

Comment 7 Maxime Coquelin 2017-10-20 14:21:19 UTC
Patches merged upstream and RHEL 7.5 backport posted.
Brew build: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=14314471

Comment 8 Miroslav Rezanina 2017-11-02 10:09:52 UTC
Fix included in qemu-kvm-rhev-2.10.0-4.el7

Comment 13 Pei Zhang 2017-12-26 06:22:39 UTC
This bug has been fixed. PVP testing was done with vIOMMU: host, QEMU and guest all work well, and the throughput testing results look good.

==Verification==
Versions:
3.10.0-824.el7.x86_64
qemu-kvm-rhev-2.10.0-13.el7.x86_64
dpdk-17.11-4.el7.x86_64


Steps:
1. Boot testpmd on the host with 2 "net_vhost,..,iommu-support=1" vdevs
testpmd \
-l 1,3,5,7,9 --socket-mem=1024,1024 -n 4 \
-d /usr/lib64/librte_pmd_vhost.so \
--vdev 'net_vhost0,iface=/tmp/vhost-user1,iommu-support=1' \
--vdev 'net_vhost1,iface=/tmp/vhost-user2,iommu-support=1' -- \
--portmask=f --disable-hw-vlan -i --rxq=1 --txq=1 \
--nb-cores=4 --forward-mode=io

testpmd> set portlist 0,2,1,3
testpmd> start 

2. Boot VM with vIOMMU
/usr/libexec/qemu-kvm -name rhel7.5_nonrt \
-M q35,kernel-irqchip=split \
-cpu host -m 8G \
-device intel-iommu,intremap=true,caching-mode=true \
-object memory-backend-file,id=mem,size=8G,mem-path=/dev/hugepages,share=on \
-numa node,memdev=mem -mem-prealloc \
-smp 6,sockets=1,cores=6,threads=1 \
-device pcie-root-port,id=root.1,chassis=1 \
-device pcie-root-port,id=root.2,chassis=2 \
-device pcie-root-port,id=root.3,chassis=3 \
-drive file=/home/images_nfv-virt-rt-kvm/rhel7.5_nonrt.qcow2,format=qcow2,if=none,id=drive-virtio-blk0,werror=stop,rerror=stop \
-device virtio-blk-pci,drive=drive-virtio-blk0,id=virtio-blk0,bus=root.1 \
-chardev socket,id=charnet1,path=/tmp/vhost-user1 \
-netdev vhost-user,chardev=charnet1,id=hostnet1 \
-device virtio-net-pci,netdev=hostnet1,id=net1,mac=18:66:da:5f:dd:02,iommu_platform=on,ats=on,bus=root.2 \
-chardev socket,id=charnet2,path=/tmp/vhost-user2 \
-netdev vhost-user,chardev=charnet2,id=hostnet2 \
-device virtio-net-pci,netdev=hostnet2,id=net2,mac=18:66:da:5f:dd:03,iommu_platform=on,ats=on,bus=root.3 \
-vnc :2 \
-monitor stdio

3. In guest, load vfio and start testpmd
# modprobe vfio
# modprobe vfio-pci

# /usr/bin/testpmd \
-l 1,2,3 \
-n 4 \
-d /usr/lib64/librte_pmd_virtio.so.1 \
-w 0000:02:00.0 -w 0000:03:00.0 \
-- \
--nb-cores=2 \
--disable-hw-vlan \
-i \
--disable-rss \
--rxq=1 --txq=1

4. On another host, start the TRex server and generate packets to the guest to measure throughput.

DIRECTORY=~/src/lua-trafficgen
cd $DIRECTORY
./binary-search.py \
        --traffic-generator=trex-txrx \
        --search-runtime=30 \
        --validation-runtime=60 \
        --rate-unit=mpps \
        --rate=0 \
        --run-bidirec=1 \
        --run-revunidirec=0 \
        --frame-size=64 \
        --num-flows=1024 \
        --one-shot=0 \
        --max-loss-pct=0.002


Throughput: 18.6Mpps


So this bug has been fixed. Moving the status of this bug to "VERIFIED".

Comment 15 errata-xmlrpc 2018-04-11 00:38:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:1104