Bug 1525446

Summary: Host dpdk's testpmd "Segmentation fault" when migrating VM with vhost-user and packets flow
Product: Red Hat Enterprise Linux 7 Reporter: Pei Zhang <pezhang>
Component: dpdk Assignee: Aaron Conole <aconole>
Status: CLOSED ERRATA QA Contact: Pei Zhang <pezhang>
Severity: high Docs Contact:
Priority: medium    
Version: 7.5 CC: atragler, chayang, fbaudin, jhsiao, juzhang, maxime.coquelin, michen, pezhang, vkaplans
Target Milestone: rc Keywords: Extras, Regression
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: dpdk-17.11-6.el7.x86_64 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-04-10 23:59:23 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
XML of VM none

Description Pei Zhang 2017-12-13 10:54:28 UTC
Created attachment 1367272
XML of VM

Description of problem:
Boot dpdk's testpmd with vhost-user on the host, then boot a VM that uses the same vhost-user sockets. Generate packets from another host to this VM and live-migrate the VM while the traffic is flowing.

testpmd crashes with "Segmentation fault" after the migration finishes.


Version-Release number of selected component (if applicable):
qemu-kvm-rhev-2.10.0-12.el7.x86_64
3.10.0-820.el7.x86_64
dpdk-17.11-3.el7.x86_64


How reproducible:
100%

Steps to Reproduce:
1. Boot testpmd on the source and destination hosts
# /usr/bin/testpmd -l 2,4,6,8,10,12,14 \
--socket-mem 1024,1024 \
-n 4  \
--vdev 'net_vhost0,iface=/tmp/vhost-user1,client=1'  \
--vdev 'net_vhost1,iface=/tmp/vhost-user2,client=1'  \
--vdev 'net_vhost2,iface=/tmp/vhost-user3,client=1'  \
-- \
--portmask=3F \
--disable-hw-vlan \
-i \
--rxq=1 --txq=1 \
--nb-cores=6 \
--forward-mode=io

2. Boot the VM
See the VM XML attached to this bug.


3. Start testpmd in the guest

modprobe vfio enable_unsafe_noiommu_mode=Y
modprobe vfio-pci

# /usr/bin/testpmd \
-l 1,2,3 \
-n 4 \
-d /usr/lib64/librte_pmd_virtio.so.1 \
-w 0000:00:03.0 -w 0000:00:06.0 \
-- \
--nb-cores=2 \
--disable-hw-vlan \
-i \
--disable-rss \
--rxq=1 --txq=1

4. Generate packets from another host
# ./build/MoonGen examples/l2-load-latency.lua 0 1 64


5. Migrate the VM from the source host to the destination host
# virsh migrate --verbose --persistent --live rhel7.5_nonrt qemu+ssh://192.168.1.2/system

6. After the migration finishes, dpdk's testpmd quits with "Segmentation fault". The error is also logged in dmesg.

# dmesg
[16105.282031] testpmd[24507]: segfault at 24 ip 00007fef3bcf42d7 sp 00007fef333f6c40 error 4 in librte_vhost.so.4[7fef3bcef000+10000]
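
The dmesg line above is enough to locate the faulting instruction inside the library: subtracting the mapping base (the first number in the brackets) from the instruction pointer gives the offset within librte_vhost.so.4. A minimal sketch of that arithmetic follows. The helper name fault_offset is illustrative only, and the regular expression assumes the single-line record format shown above.

```python
import re

# Matches the kernel's user-space segfault record as it appears in this report.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>[0-9a-f]+) in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)

def fault_offset(line):
    """Return (object name, offset of the faulting ip inside its mapping)."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("not a single-line segfault record")
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    size = int(m.group("size"), 16)
    offset = ip - base
    # Sanity check: the ip must fall inside the reported mapping.
    if not 0 <= offset < size:
        raise ValueError("ip is outside the reported mapping")
    return m.group("obj"), offset

line = ("[16105.282031] testpmd[24507]: segfault at 24 ip 00007fef3bcf42d7 "
        "sp 00007fef333f6c40 error 4 in librte_vhost.so.4[7fef3bcef000+10000]")
obj, offset = fault_offset(line)
print(obj, hex(offset))  # librte_vhost.so.4 0x52d7
```

With the matching debuginfo installed, an offset like this can usually be handed to a symbolizer such as addr2line, provided the reported mapping starts at file offset 0 of the library. That step is not shown here because it depends on packages that are not part of this report.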


Actual results:
dpdk's testpmd quits unexpectedly with a segmentation fault.


Expected results:
dpdk's testpmd should keep running during and after the migration.


Additional info:

Comment 2 Pei Zhang 2017-12-14 09:35:24 UTC
Additional info:

1. This is a regression.
dpdk-16.11.2-6.el7.x86_64 works well.


2. More details about the error.
(1) After migrating from the source to the destination host, dpdk's testpmd on the source host hits the "Segmentation fault".

(2) Another error seen in dmesg:
[15979.767740] lcore-slave-4[27362]: segfault at 2 ip 00007fdb728fbab1 sp 00007fdb6d7fe950 error 4
[15979.767813] lcore-slave-6[27363]: segfault at 2 ip 00007fdb728fd93a sp 00007fdb6cffe900 error 4
[15979.767816]  in librte_vhost.so.4[7fdb728f2000+10000]
[15979.790956]  in librte_vhost.so.4[7fdb728f2000+10000]
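
The "segfault at 2 ... error 4" fields in these records can be decoded with the x86 page-fault error-code bits. The sketch below does that for one of the lines above. The function name describe_fault and the 4096-byte "near NULL" threshold are assumptions made for illustration. A user-mode read of a non-present page at such a low address is consistent with a structure member being read through a NULL pointer, although the log alone does not prove it.

```python
import re

# x86 page-fault error-code bits, printed in hex after "error".
PF_PROT = 0x1    # 0: page not present, 1: protection violation
PF_WRITE = 0x2   # 0: read access, 1: write access
PF_USER = 0x4    # 0: kernel mode, 1: user mode
PF_INSTR = 0x10  # 1: fault on instruction fetch

def describe_fault(line):
    """Decode the fault address and error code from a segfault record."""
    m = re.search(
        r"segfault at ([0-9a-f]+) ip [0-9a-f]+ sp [0-9a-f]+ error ([0-9a-f]+)",
        line,
    )
    if m is None:
        raise ValueError("no segfault record found")
    addr = int(m.group(1), 16)
    code = int(m.group(2), 16)
    if code & PF_INSTR:
        access = "instruction fetch"
    elif code & PF_WRITE:
        access = "write"
    else:
        access = "read"
    return {
        "address": addr,
        "access": access,
        "mode": "user" if code & PF_USER else "kernel",
        "cause": "protection violation" if code & PF_PROT else "page not present",
        # Assumption: anything inside the first 4 KiB page counts as near NULL.
        "near_null": addr < 4096,
    }

record = ("[15979.767740] lcore-slave-4[27362]: segfault at 2 "
          "ip 00007fdb728fbab1 sp 00007fdb6d7fe950 error 4")
info = describe_fault(record)
print(info)
```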

Comment 6 Maxime Coquelin 2018-01-29 14:09:12 UTC
Looking at the logs, it crashed within the same second that SET_VRING_ADDR was
being handled while the device was running.

I already reproduced such a crash with DPDK v17.11 and posted a fix for this
specific one [0]. However, that patch has been discarded because Victor has
fixed async virtio_net struct changes more generally by introducing a new lock.

Victor's patch has been accepted upstream and queued for the v17.11 LTS release:
https://dpdk.org/dev/patchwork/patch/33921/
This patch is also needed for Bz1450680.

Adding Victor in cc:.

Regards,
Maxime
[0]: http://dpdk.org/dev/patchwork/patch/31659/
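
The race described in Comment 6 (a control message such as SET_VRING_ADDR changing ring state while datapath threads are still using it) and the general idea of guarding it with a lock can be sketched as follows. This is a conceptual illustration only, not the upstream patch: the class Virtqueue, its access_lock field, and the set_vring_addr and burst methods are hypothetical names, and Python threads stand in for the vhost-user message handler and the forwarding lcores.

```python
import threading

class Virtqueue:
    """Hypothetical stand-in for a vhost virtqueue, not the DPDK structure."""

    def __init__(self):
        self.access_lock = threading.Lock()
        self.ring = [0] * 8  # None while the ring addresses are invalid

    def set_vring_addr(self, new_ring):
        # Control path: replace the ring only while no datapath thread uses it.
        with self.access_lock:
            self.ring = None      # window in which the old mapping is gone
            self.ring = new_ring  # new mapping becomes visible

    def burst(self):
        # Datapath: hold the lock for the whole burst so the ring cannot
        # change underneath it.
        with self.access_lock:
            if self.ring is None:
                return 0
            return len(self.ring)

vq = Virtqueue()
results = []

def datapath():
    for _ in range(10000):
        results.append(vq.burst())

def control():
    for n in range(1000):
        vq.set_vring_addr([0] * (8 if n % 2 else 16))

threads = [threading.Thread(target=datapath), threading.Thread(target=control)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# With the lock held on both paths, the datapath never observes the
# half-updated state (ring is None) and only sees complete rings.
print(set(results) <= {8, 16})  # True
```

Without the lock, the datapath could read the ring between the two assignments in set_vring_addr, which is the Python analogue of dereferencing a stale or NULL ring pointer.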

Comment 8 Pei Zhang 2018-02-07 05:16:14 UTC
==Verification==

Versions:
3.10.0-843.el7.x86_64
qemu-kvm-rhev-2.10.0-19.el7.x86_64
libvirt-3.9.0-11.el7.x86_64
dpdk-17.11-7.el7.x86_64

Steps:
Followed the steps in the Description.

All 10 migration runs completed successfully; both dpdk and the guest kept working and no errors were seen.

So this bug has been fixed. Moving status to 'VERIFIED'.

Comment 11 errata-xmlrpc 2018-04-10 23:59:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1065