Bug 1788415
| Summary: | packed=on: boot qemu with vhost-user and vIOMMU over OpenvSwitch, starting testpmd in guest will cause both qemu and ovs crash | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux Advanced Virtualization | Reporter: | Pei Zhang <pezhang> |
| Component: | qemu-kvm | Assignee: | lulu <lulu> |
| qemu-kvm sub component: | Networking | QA Contact: | Pei Zhang <pezhang> |
| Status: | CLOSED CURRENTRELEASE | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | aadam, amorenoz, chayang, eperezma, jinzhao, juzhang, virt-maint |
| Version: | 8.2 | | |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1793064 1793068 (view as bug list) | Environment: | |
| Last Closed: | 2020-09-21 06:58:48 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1793064, 1793068, 1812740, 1812741 | | |
Description
Pei Zhang
2020-01-07 06:27:13 UTC
Update OpenvSwitch version: openvswitch2.11-2.11.0-35.el8fdp.x86_64

If OpenvSwitch is replaced with dpdk's testpmd as the vhost-user client, qemu works well.

dpdk version: dpdk-19.11-1.el8.x86_64

Steps:

1. Replace Step 1 of the Description with booting dpdk's testpmd; qemu keeps working well and testpmd receives packets without problems.

    /usr/bin/testpmd \
        -l 2,4,6,8,10,12,14,16,18 \
        --socket-mem 1024,1024 \
        -n 4 \
        -d /usr/lib64/librte_pmd_vhost.so \
        --vdev 'net_vhost0,iface=/tmp/vhostuser0.sock,queues=2,client=1,iommu-support=1' \
        --vdev 'net_vhost1,iface=/tmp/vhostuser1.sock,queues=2,client=1,iommu-support=1' \
        --iova-mode pa \
        -- \
        --portmask=f \
        -i \
        --rxd=512 --txd=512 \
        --rxq=2 --txq=2 \
        --nb-cores=8 \
        --forward-mode=io

    testpmd> set portlist 0,2,1,3
    testpmd> start
    testpmd> show port stats all

    ######################## NIC statistics for port 0 ########################
    RX-packets: 88822640   RX-missed: 0   RX-bytes: 5329364952
    RX-errors: 0
    RX-nombuf: 0
    TX-packets: 75692878   TX-errors: 0   TX-bytes: 4541579232

    Throughput (since last show)
    Rx-pps: 147041   Rx-bps: 70580080
    Tx-pps: 125307   Tx-bps: 60147664
    ############################################################################

    ######################## NIC statistics for port 1 ########################
    RX-packets: 75693848   RX-missed: 0   RX-bytes: 4541637432
    RX-errors: 0
    RX-nombuf: 0
    TX-packets: 88821732   TX-errors: 0   TX-bytes: 5329310472

    Throughput (since last show)
    Rx-pps: 125307   Rx-bps: 60147664
    Tx-pps: 147041   Tx-bps: 70580080
    ############################################################################

There might be multiple issues going on here, so let's try to split them up:

Looking at the code, qemu's implementation of vhost-user + multiqueue + iommu is likely to be utterly broken. It creates a slave channel per queue pair. When the second slave channel is created, the first one is closed by the vhost-user backend (which explains the "Failed to read from slave" errors). And when the first queue is started, SET_VRING_ADDR on queue pair 1 will generate an IOTLB miss on the slave channel bound to queue pair 2. That is most likely the cause of qemu's segfault. I'll work upstream to fix that.

If that's true:
- It should be as reproducible with testpmd as it is with OvS. Pei, can you double-check that "queues=2" is present in qemu's command line for the testpmd case?
- It should be as reproducible with or without "packed=on". Pei, can you please confirm this?

Now, that does not explain OvS's crash. Can you please attach some logs so we can figure out what's going on there? Thanks
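A minimal sketch of how the checks requested above could be done on the host, assuming the guest runs as a qemu-kvm process, OvS logs to the usual /var/log/openvswitch location, and systemd-coredump is in use; unit names and paths may differ per setup:

    # Confirm "queues=2" is present on the vhost-user netdevs of the running guest
    ps -ef | grep '[q]emu-kvm' | tr ',' '\n' | grep -E 'vhost-user|queues='

    # Collect OvS userspace logs from around the crash
    tail -n 200 /var/log/openvswitch/ovs-vswitchd.log
    journalctl -u ovs-vswitchd --since "10 min ago"

    # If ovs-vswitchd dumped core, grab the backtrace for the report
    coredumpctl list ovs-vswitchd
    coredumpctl info ovs-vswitchd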
Hi Adrián,

I've been trying to reproduce this issue many times (at first I could not reproduce it with libvirt, but it was 100% reproduced). I finally found a way to reproduce it, with either qemu or libvirt. Here is the update:

1. vIOMMU + packed=on + memory host-nodes=1 are the 3 key points needed to reproduce the issue 100%. Without any one of them, the issue cannot be reproduced.

    -chardev socket,id=charnet1,path=/tmp/vhostuser0.sock,server \
    -netdev vhost-user,chardev=charnet1,queues=2,id=hostnet1 \
    -device virtio-net-pci,mq=on,vectors=6,rx_queue_size=1024,netdev=hostnet1,id=net1,mac=88:66:da:5f:dd:02,bus=pci.6,addr=0x0,iommu_platform=on,ats=on,packed=on \
    -chardev socket,id=charnet2,path=/tmp/vhostuser1.sock,server \
    -netdev vhost-user,chardev=charnet2,queues=2,id=hostnet2 \
    -device virtio-net-pci,mq=on,vectors=6,rx_queue_size=1024,netdev=hostnet2,id=net2,mac=88:66:da:5f:dd:03,bus=pci.7,addr=0x0,iommu_platform=on,ats=on,packed=on \
    -object memory-backend-file,id=mem,size=8G,mem-path=/dev/hugepages,share=on,host-nodes=1,policy=bind \
    -numa node,memdev=mem -mem-prealloc \

2. When I reported this bz, I did not explicitly set memory host-nodes=1, but in my setup it defaults to host-nodes=1. (When we set memory host-nodes=0, everything works well.)

Sorry for the late response. I wanted to provide solid test results to avoid possible confusion, and that took a while.
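A minimal sketch of how the host-nodes binding could be double-checked on the host; it assumes a single running qemu-kvm process and that numastat (from the numactl package) is installed, and the hugepage-size glob should be narrowed to the size actually in use:

    # Per-NUMA-node hugepage pools (compare before and after starting the guest)
    grep . /sys/devices/system/node/node*/hugepages/hugepages-*/free_hugepages

    # Per-node memory usage of the running qemu-kvm process
    numastat -p "$(pgrep -f qemu-kvm)"

    # NUMA policy and placement of the hugepage-backed guest RAM mapping
    grep hugepages "/proc/$(pgrep -f qemu-kvm)/numa_maps"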
(In reply to Pei Zhang from comment #8)
> Hi Adrián,
>
> I've been trying to reproduce this issue many times (at first I could not
> reproduce it with libvirt, but it was 100% reproduced).

Fix typo: it was 100% reproduced with qemu.

(In reply to Adrián Moreno from comment #7)
> There might be multiple issues going on here, so let's try to split them up:
>
> Looking at the code, qemu's implementation of vhost-user + multiqueue +
> iommu is likely to be utterly broken. It creates a slave channel per
> queue pair. When the second slave channel is created, the first one is
> closed by the vhost-user backend (which explains the "Failed to read from
> slave" errors). And when the first queue is started, SET_VRING_ADDR on
> queue pair 1 will generate an IOTLB miss on the slave channel bound to
> queue pair 2. That is most likely the cause of qemu's segfault. I'll work
> upstream to fix that.

Hi Adrián,

I've filed a bz to track the multiqueue issue with dpdk 19.11. That one is not related to packed=on. I've already cc'd you:

Bug 1793327 - "qemu-kvm: Failed to read from slave." shows when boot qemu vhost-user 2 queues over dpdk 19.11

Best regards,
Pei

QEMU has recently been split into sub-components, and as a one-time operation to avoid breaking tools, we are setting the QEMU sub-component of this BZ to "General". Please review and change the sub-component if necessary the next time you review this BZ. Thanks.

Update: With the ovs patch fix for Bug 1812620, both the ovs and qemu crash issues are gone. Both ovs and qemu keep working well, and the throughput result looks good.

Testcase: nfv_acceptance_nonrt_server_2Q_1G_iommu_packed

    Packets_loss  Frame_Size  Run_No  Throughput  Avg_Throughput
    0             64          0       20.950139   20.950139

Versions:

    4.18.0-187.el8.x86_64
    qemu-kvm-4.2.0-13.module+el8.2.0+5898+fb4bceae.x86_64
    tuned-2.13.0-5.el8.noarch
    python3-libvirt-6.0.0-1.module+el8.2.0+5453+31b2b136.x86_64
    openvswitch2.13-2.13.0-6.el8fdp.x86_64
    dpdk-19.11-4.el8.x86_64

More info: The ovs patch fixes both the qemu and ovs crash issues. With the latest fdp ovs2.11, ovs2.12 and ovs2.13, both qemu and ovs work well. More tested version combinations:

ovs2.11:
- openvswitch2.11-2.11.0-35.el8fdp.x86_64, qemu-kvm-4.2.0-13.module+el8.2.0+5898+fb4bceae.x86_64: both qemu and ovs crash.
- openvswitch2.11-2.11.0-50.el8fdp.x86_64, qemu-kvm-4.2.0-13.module+el8.2.0+5898+fb4bceae.x86_64: both qemu and ovs work well.

ovs2.12:
- openvswitch2.12-2.12.0-12.el8fdp.x86_64, qemu-kvm-4.2.0-13.module+el8.2.0+5898+fb4bceae.x86_64: both qemu and ovs crash.
- openvswitch2.12-2.12.0-23.el8fdp.x86_64, qemu-kvm-4.2.0-13.module+el8.2.0+5898+fb4bceae.x86_64: both qemu and ovs work well.

ovs2.13:
- openvswitch2.13-2.13.0-6.el8fdp.x86_64, qemu-kvm-4.2.0-13.module+el8.2.0+5898+fb4bceae.x86_64: both qemu and ovs work well.

Hi Cindy,

From the QE functional and performance testing perspective, the qemu crash issue is gone with the ovs fix, and this bug can no longer be reproduced. However, I'm not sure whether there were defects in the qemu code that are merely masked by the ovs fix.

After discussing with Pei, we plan to move this to AV 8.4.

Hi Pei, I have checked the log and the fix. I think the fix in dpdk has already fixed this crash, so we don't need a fix in qemu. Maybe we can close this bug?

(In reply to lulu from comment #23)
> Hi Pei, I have checked the log and the fix. I think the fix in dpdk has
> already fixed this crash, so we don't need a fix in qemu. Maybe we can
> close this bug?

Hi Cindy,

Thanks for the explanation. I agree we can close it.

Best regards,
Pei
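A minimal sketch of how to check whether a host already carries the fixed builds from the matrix above; the package names assume the RHEL 8 fast-datapath (openvswitch2.1x) and virt:av (qemu-kvm) module streams:

    # Installed OvS and qemu-kvm builds (only one openvswitch2.1x stream is usually present)
    rpm -q openvswitch2.11 openvswitch2.12 openvswitch2.13 qemu-kvm

    # Versions reported by the binaries themselves
    ovs-vswitchd --version
    /usr/libexec/qemu-kvm --version

    # Per the matrix above: openvswitch2.11 >= 2.11.0-50.el8fdp,
    # openvswitch2.12 >= 2.12.0-23.el8fdp and openvswitch2.13 >= 2.13.0-6.el8fdp
    # no longer crash together with qemu-kvm-4.2.0-13.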