Bug 1807129 - [RHEL8] Both qemu and guest hang after migrating guest in which vhost-user NIC is using virtio-pci [ovs2.12]
Summary: [RHEL8] Both qemu and guest hang after migrating guest in which vhost-user NIC is using virtio-pci [ovs2.12]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: openvswitch2.12
Version: FDP 20.A
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Maxime Coquelin
QA Contact: Pei Zhang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-02-25 16:36 UTC by Timothy Redaelli
Modified: 2020-03-10 09:35 UTC
CC List: 10 users

Fixed In Version: 2.12.0-22
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1799017
Environment:
Last Closed: 2020-03-10 09:35:46 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:0746 0 None None None 2020-03-10 09:35:50 UTC

Description Timothy Redaelli 2020-02-25 16:36:13 UTC
+++ This bug was initially created as a clone of Bug #1799017 +++

+++ This bug was initially created as a clone of Bug #1798996 +++

Description of problem:
Boot a guest over OVS with vhost-user ports. In the guest, keep the vhost-user NIC on the virtio-pci driver. Then migrate the guest from the source to the destination host. Both qemu and the guest hang on the source host.

Version-Release number of selected component (if applicable):
4.18.0-176.el8.x86_64
qemu-kvm-4.2.0-8.module+el8.2.0+5607+dc756904.x86_64
openvswitch2.12-2.12.0-21.el8fdp.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Boot OVS with 1 vhost-user NIC on both the source and destination hosts. Refer to [1]

2. Boot qemu with vhost-user. Refer to [2]

3. Check the vhost-user NIC's driver in the guest and keep its default, virtio-pci (see the check snippet after these steps).

4. Migrate the guest from the source to the destination host. Both the source qemu and the guest hang.

(qemu) migrate -d tcp:10.73.72.196:5555
(qemu) 
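
A quick way to do the step-3 check in the guest (editor's sketch; 0000:03:00.0 is an assumed guest PCI address for the virtio-net device, adjust to your guest):

# list NICs together with the kernel driver in use
lspci -nnk | grep -A3 -i ethernet
# or query the driver symlink of the device directly; it should point at virtio-pci
readlink /sys/bus/pci/devices/0000:03:00.0/driver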

Actual results:
Both qemu and the guest hang during migration.

Expected results:
Both qemu and the guest should keep working and migrate successfully.

Additional info:
1. This is a regression; openvswitch2.12-2.12.0-12.el8fdp.x86_64 works well.

2. If the vhost-user NIC is rebound from virtio-pci to vfio-pci in the guest, the issue goes away (see the rebind sketch after this list).

3. openvswitch2.13-2.13.0-0.20200121git2a4f006.el8fdp.x86_64 works well.

4. This was tested with FDP 20.B. As there is no "FDP 20.B" version in Bugzilla yet, I chose 20.A. Note that the 20.A version itself works well.

5. Though qemu hangs, I don't think this is a qemu issue, as the version combinations (1) and (2) below work well:

(1) qemu-kvm-4.2.0-8.module+el8.2.0+5607+dc756904.x86_64 & openvswitch2.13-2.13.0-0.20200121git2a4f006.el8fdp.x86_64  works well

(2) qemu-kvm-4.2.0-8.module+el8.2.0+5607+dc756904.x86_64 & openvswitch2.12-2.12.0-12.el8fdp.x86_64                    works well

(3) qemu-kvm-4.2.0-8.module+el8.2.0+5607+dc756904.x86_64 & openvswitch2.12-2.12.0-21.el8fdp.x86_64 (bug version)      fails
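
Rebind sketch for item 2 (assumption, not from the original report; 0000:03:00.0 is a placeholder PCI address, and driverctl or dpdk-devbind.py must be available in the guest):

# load vfio-pci and move the vhost-user NIC away from virtio-pci
modprobe vfio-pci
driverctl set-override 0000:03:00.0 vfio-pci
# alternative: dpdk-devbind.py -b vfio-pci 0000:03:00.0
# revert to the default virtio-pci binding
driverctl unset-override 0000:03:00.0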

Reference:
[1]
#!/bin/bash

set -e

echo "killing old ovs process"
pkill -f ovs-vswitchd || true
sleep 5
pkill -f ovsdb-server || true

echo "probing ovs kernel module"
modprobe -r openvswitch || true
modprobe openvswitch

echo "clean env"
DB_FILE=/etc/openvswitch/conf.db
rm -rf /var/run/openvswitch
mkdir /var/run/openvswitch
rm -f $DB_FILE

echo "init ovs db and boot db server"
export DB_SOCK=/var/run/openvswitch/db.sock
ovsdb-tool create /etc/openvswitch/conf.db /usr/share/openvswitch/vswitch.ovsschema
ovsdb-server --remote=punix:$DB_SOCK --remote=db:Open_vSwitch,Open_vSwitch,manager_options --pidfile --detach --log-file
ovs-vsctl --no-wait init

echo "start ovs vswitch daemon"
ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-mem="1024,1024"
ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-lcore-mask="0x1"
ovs-vsctl --no-wait set Open_vSwitch . other_config:vhost-iommu-support=true
ovs-vswitchd unix:$DB_SOCK --pidfile --detach --log-file=/var/log/openvswitch/ovs-vswitchd.log

echo "creating bridge and ports"

ovs-vsctl --if-exists del-br ovsbr0
ovs-vsctl add-br ovsbr0 -- set bridge ovsbr0 datapath_type=netdev
ovs-vsctl add-port ovsbr0 dpdk0 -- set Interface dpdk0 type=dpdk options:dpdk-devargs=0000:5e:00.0 
ovs-vsctl add-port ovsbr0 vhost-user0 -- set Interface vhost-user0 type=dpdkvhostuserclient options:vhost-server-path=/tmp/vhostuser0.sock
ovs-ofctl del-flows ovsbr0
ovs-ofctl add-flow ovsbr0 "in_port=1,idle_timeout=0 actions=output:2"
ovs-ofctl add-flow ovsbr0 "in_port=2,idle_timeout=0 actions=output:1"

ovs-vsctl set Open_vSwitch . other_config={}
ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x1
ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x14
ovs-vsctl set Interface dpdk0 options:n_rxq=1


echo "all done"

[2]
/usr/libexec/qemu-kvm \
-name guest=rhel8.2 \
-machine pc-q35-rhel8.2.0,kernel_irqchip=split \
-cpu host \
-m 8192 \
-overcommit mem-lock=on \
-smp 6,sockets=6,cores=1,threads=1 \
-object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages/libvirt/qemu/1-rhel8.2,share=yes,size=8589934592,host-nodes=0,policy=bind \
-numa node,nodeid=0,cpus=0-5,memdev=ram-node0 \
-device intel-iommu,intremap=on,caching-mode=on,device-iotlb=on \
-device pcie-root-port,port=0x10,chassis=1,id=pci.1,bus=pcie.0,multifunction=on,addr=0x2 \
-device pcie-root-port,port=0x11,chassis=2,id=pci.2,bus=pcie.0,addr=0x2.0x1 \
-device pcie-root-port,port=0x12,chassis=3,id=pci.3,bus=pcie.0,addr=0x2.0x2 \
-blockdev driver=file,cache.direct=off,cache.no-flush=on,filename=/mnt/nfv/rhel8.2.qcow2,node-name=my_file \
-blockdev driver=qcow2,node-name=my,file=my_file \
-device virtio-blk-pci,scsi=off,iommu_platform=on,ats=on,bus=pci.2,addr=0x0,drive=my,id=virtio-disk0,bootindex=1,write-cache=on \
-chardev socket,id=charnet1,path=/tmp/vhostuser0.sock,server \
-netdev vhost-user,chardev=charnet1,id=hostnet1 \
-device virtio-net-pci,rx_queue_size=1024,netdev=hostnet1,id=net1,mac=18:66:da:5f:dd:02,bus=pci.3,addr=0x0,iommu_platform=on,ats=on \
-monitor stdio \
-vnc 0:1
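
On the destination host (assumed setup, not shown in the original report), the same command line is started with an extra -incoming option so qemu waits for the migration stream before "migrate -d" is issued on the source monitor:

/usr/libexec/qemu-kvm \
    <same options as in [2]> \
    -incoming tcp:0:5555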

--- Additional comment from Maxime Coquelin on 2020-02-24 16:30:37 CET ---

Fix posted upstream and merged in master:

commit 4f37df14c405b754b5e971c75f4f67f4bb5bfdde
Author: Adrian Moreno <amorenoz>
Date:   Thu Feb 13 11:04:58 2020 +0100

    vhost: protect log address translation in IOTLB update

    Currently, the log address translation only  happens in the vhost-user's
    translate_ring_addresses(). However, the IOTLB update handler is not
    checking if it was mapped to re-trigger that translation.

    Since the log address mapping could fail, check it on iotlb updates.
    Also, check it on vring_translate() so we do not dirty pages if the
    logging address is not yet ready.

    Additionally, properly protect the accesses to the iotlb structures.

    Fixes: fbda9f145927 ("vhost: translate incoming log address to GPA")
    Cc: stable

    Signed-off-by: Adrian Moreno <amorenoz>
    Reviewed-by: Maxime Coquelin <maxime.coquelin>

--- Additional comment from Maxime Coquelin on 2020-02-25 14:43:52 CET ---

Backported two patches:
 - "vhost: fix vring memory partially mapped"
 - "vhost: protect log address translation in IOTLB update"

Comment 3 Pei Zhang 2020-02-28 15:25:34 UTC
Verified with openvswitch2.12-2.12.0-22.el8fdp.x86_64:

All migration test cases PASS, and all OVS-related cases from Virt PASS as well.

Testcase: live_migration_nonrt_server_2Q_1G_iommu_ovs
=======================Stream Rate: 1Mpps=========================
No Stream_Rate Downtime Totaltime Ping_Loss moongen_Loss
0 1Mpps 224 18018 0 513347
1 1Mpps 220 17735 0 506617
2 1Mpps 220 16936 0 507326
3 1Mpps 197 16872 0 460713
Max 1Mpps 224 18018 0 513347
Min 1Mpps 197 16872 0 460713
Mean 1Mpps 215 17390 0 497000
Median 1Mpps 220 17335 0 506971
Stdev 0 12.29 573.83 0.0 24379.53

Testcase: live_migration_nonrt_server_1Q_2M_iommu_ovs
=======================Stream Rate: 1Mpps=========================
No Stream_Rate Downtime Totaltime Ping_Loss moongen_Loss
0 1Mpps 178 15316 0 428264
1 1Mpps 160 14334 0 386516
2 1Mpps 196 14527 0 461821
3 1Mpps 152 14567 0 372551
Max 1Mpps 196 15316 0 461821
Min 1Mpps 152 14334 0 372551
Mean 1Mpps 171 14686 0 412288
Median 1Mpps 169 14547 0 407390
Stdev 0 19.62 432.14 0.0 40628.3

Testcase: live_migration_nonrt_server_1Q_1G_iommu_ovs
=======================Stream Rate: 1Mpps=========================
No Stream_Rate Downtime Totaltime Ping_Loss moongen_Loss
0 1Mpps 180 17023 0 433150
1 1Mpps 180 16282 0 426178
2 1Mpps 183 16382 0 426972
3 1Mpps 162 17468 0 386119
Max 1Mpps 183 17468 0 433150
Min 1Mpps 162 16282 0 386119
Mean 1Mpps 176 16788 0 418104
Median 1Mpps 180 16702 0 426575
Stdev 0 9.59 559.31 0.0 21550.35

Testcase: nfv_acceptance_nonrt_server_2Q_1G_iommu
Packets_loss Frame_Size Run_No Throughput Avg_Throughput
0 64 0 21.307297 21.307297

Testcase: vhostuser_hotplug_nonrt_server_iommu
Packets_loss Frame_Size Run_No Throughput Avg_Throughput
0 64 0 21.307297 21.307297
0 64 0 21.307303 21.307303

Testcase: vhostuser_reconnect_nonrt_iommu_qemu
Packets_loss Frame_Size Run_No Throughput Avg_Throughput
0 64 0 21.307242 21.307242
0 64 0 21.127245 21.127245
0 64 0 21.127239 21.127239

Versions:
4.18.0-184.el8.x86_64
tuned-2.13.0-5.el8.noarch
dpdk-19.11-4.el8.x86_64
openvswitch2.12-2.12.0-22.el8fdp.x86_64
python3-libvirt-6.0.0-1.module+el8.2.0+5453+31b2b136.x86_64
qemu-kvm-4.2.0-12.module+el8.2.0+5858+afd073bc.x86_64

So this bug has been fixed. Moving to 'VERIFIED'.

Comment 6 errata-xmlrpc 2020-03-10 09:35:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0746

