Bug 1880299 - vhost-user mq connection fails to restart after kill host testpmd which acts as vhost-user client
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: qemu-kvm
Version: 8.3
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: 8.3
Assignee: Eugenio Pérez Martín
QA Contact: Pei Zhang
Reported: 2020-09-18 08:59 UTC by Pei Zhang
Modified: 2021-05-25 06:44 UTC
Fixed In Version: qemu-kvm-5.2.0-8.module+el8.4.0+10093+e085f1eb
Clones: 1930549
Last Closed: 2021-05-25 06:43:36 UTC
Type: Bug
amorenoz: needinfo-


Description Pei Zhang 2020-09-18 08:59:48 UTC
Description of problem:
Boot a VM with a 4-queue vhost-user interface, then start testpmd as the vhost-user client. After testpmd is killed and started again, the vhost-user connection fails to restart, and both qemu and testpmd crash.

Version-Release number of selected component (if applicable):
4.18.0-237.el8.x86_64
qemu-kvm-5.1.0-7.module+el8.3.0+8099+dba2fe3e.x86_64
dpdk-19.11.3-1.el8.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Boot VM with vhost-user 4 queues 

    <interface type="vhostuser">
      <mac address="88:66:da:5f:dd:02" />
      <source mode="server" path="/tmp/vhost-user1.sock" type="unix" />
      <model type="virtio" />
      <driver ats="on" iommu="on" name="vhost" queues="4" rx_queue_size="1024" />
      <address bus="0x6" domain="0x0000" function="0x0" slot="0x00" type="pci" />
    </interface>



2. Boot testpmd as vhost-user client

# cat testpmd_4q.sh 
testpmd \
-l 2,4,6,8,10,12,14,16,18 \
--socket-mem 1024,1024 \
-n 4  \
--vdev 'net_vhost0,iface=/tmp/vhost-user1.sock,queues=4,client=1,iommu-support=1' \
-d /usr/lib64/librte_pmd_vhost.so \
-- \
--portmask=f \
-i \
--rxd=512 --txd=512 \
--rxq=4 --txq=4 \
--nb-cores=8 \
--forward-mode=txonly

# sh testpmd_4q.sh 

3. Kill testpmd

# pkill testpmd


4. Start testpmd again

# sh testpmd_4q.sh

5. vhost-user connection fails to restart. Both qemu and testpmd crash.

testpmd_4q.sh: line 13:  5227 Segmentation fault      (core dumped) testpmd -l 2,4,6,8,10,12,14,16,18 --socket-mem 1024,1024 -n 4 --vdev 'net_vhost0,iface=/tmp/vhost-user1.sock,queues=4,client=1,iommu-support=1' -d /usr/lib64/librte_pmd_vhost.so -- --portmask=f -i --rxd=512 --txd=512 --rxq=4 --txq=4 --nb-cores=8 --forward-mode=txonly

# dmesg
[ 3782.630934] vhost-events[5239]: segfault at 2d8 ip 00007f5dc8e171cb sp 00007f5db97f9870 error 4 in librte_vhost.so.20.0[7f5dc8e10000+4e000]
[ 3782.643491] Code: 89 40 08 e9 3f ff ff ff e8 22 bc ff ff 66 90 f3 0f 1e fa 41 57 41 56 41 55 41 54 49 89 cc 55 48 89 f5 53 48 89 fb 48 83 ec 48 <4c> 8b af d8 02 00 00 48 89 14 24 44 89 44 24 10 64 48 8b 04 25 28
[ 3784.517001] qemu-kvm[5097]: segfault at 1b0 ip 000055b8b3406540 sp 00007ffded2577f0 error 4 in qemu-kvm[55b8b304d000+a66000]
[ 3784.528225] Code: 75 10 48 8b 05 71 80 ae 00 89 c0 64 48 89 03 0f ae f0 90 48 8b 45 00 31 c9 85 d2 48 89 e7 0f 95 c1 41 b8 01 00 00 00 4c 89 e2 <48> 8b b0 b0 01 00 00 e8 64 d2 f5 ff 48 83 3c 24 00 0f 84 f9 00 00
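
For anyone triaging the two segfault lines above, here is a minimal Python sketch that extracts the instruction offset inside each mapping and decodes the page-fault error code. The helper name `parse_segfaults` is made up for illustration, and the regular expression only assumes the x86 `segfault at ... ip ... sp ... error ... in obj[base+size]` layout seen in this dmesg output.

```python
import re

# The two segfault lines from the dmesg output in the Description.
DMESG = """\
[ 3782.630934] vhost-events[5239]: segfault at 2d8 ip 00007f5dc8e171cb sp 00007f5db97f9870 error 4 in librte_vhost.so.20.0[7f5dc8e10000+4e000]
[ 3784.517001] qemu-kvm[5097]: segfault at 1b0 ip 000055b8b3406540 sp 00007ffded2577f0 error 4 in qemu-kvm[55b8b304d000+a66000]
"""

SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp [0-9a-f]+ error (?P<error>\d+) "
    r"in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfaults(text):
    """Return one dict per segfault line found in the text (hypothetical helper)."""
    faults = []
    for m in SEGFAULT_RE.finditer(text):
        error = int(m["error"])
        faults.append({
            "comm": m["comm"],
            "object": m["obj"],
            "fault_addr": int(m["addr"], 16),
            # Offset of the faulting instruction from the start of the mapping
            # that the kernel prints in brackets.
            "ip_offset": int(m["ip"], 16) - int(m["base"], 16),
            # x86 page-fault error code bits: 1 = protection fault (otherwise
            # page not present), 2 = write (otherwise read), 4 = user mode.
            "not_present": not error & 1,
            "write": bool(error & 2),
            "user_mode": bool(error & 4),
        })
    return faults


for fault in parse_segfaults(DMESG):
    print(f'{fault["comm"]}: {fault["object"]}+{fault["ip_offset"]:#x}, '
          f'fault address {fault["fault_addr"]:#x}')
```

Both faults decode as user-mode reads of a non-present page at a small address (0x2d8 and 0x1b0), which would be consistent with a field being read through a NULL pointer, although the log alone does not prove that. The offsets are relative to the mapping the kernel printed, which is not necessarily the ELF load base, so resolving them to a symbol still needs the matching debuginfo.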


Actual results:
vhost-user connection fails to restart. Both qemu and testpmd crash.

Expected results:
The vhost-user connection should be re-established, and both qemu and testpmd should keep working.

Additional info:
1. This issue cannot be reproduced with 2 vhost-user queues. Testing with 4 vhost-user queues is needed to trigger it.

Comment 12 Pei Zhang 2021-02-01 11:56:30 UTC
Hello Adrian,

I cannot trigger this issue with the latest rhel8.4-av; both dpdk and qemu keep working well. Could you confirm at the code level whether this issue has been fixed? Thanks a lot.

Versions:
4.18.0-278.rt7.43.el8.dt4.x86_64
qemu-kvm-5.2.0-4.module+el8.4.0+9676+589043b9.x86_64
tuned-2.15.0-1.el8.noarch
libvirt-7.0.0-3.module+el8.4.0+9709+a99efd61.x86_64
python3-libvirt-6.10.0-1.module+el8.4.0+8948+a39b3f3a.x86_64
openvswitch2.13-2.13.0-86.el8fdp.x86_64
dpdk-20.11-1.el8.x86_64

Best regards,

Pei

Comment 13 Adrián Moreno 2021-02-03 08:50:28 UTC
Hi Pei,

AFAICS the bug is still there in the latest qemu.

The thing is, the bug was triggered *after* DPDK crashed. The DPDK crash was fixed by Maxime in 20.11, so that's why it's no longer reproducible with the new DPDK version.

Can you confirm this by testing qemu-5.2.0 and dpdk 19.11?

Comment 14 Eugenio Pérez Martín 2021-02-03 16:58:24 UTC
Hi!

From the description of #c4, I think that the pending issue in qemu is
the same as https://bugzilla.redhat.com/show_bug.cgi?id=1852906, which
I was able to reproduce with both packed and split vq.

I posted a patch upstream for it:
https://patchew.org/QEMU/20210129090728.831208-1-eperezma@redhat.com/ .

I always had success reproducing the issue with testpmd txonly forwarding
mode, though.

Comment 15 Adrián Moreno 2021-02-03 17:03:26 UTC
(In reply to Eugenio Pérez Martín from comment #14)
> Hi!
> 
> From the description of #c4, I think that the pending issue in qemu is
> the same as https://bugzilla.redhat.com/show_bug.cgi?id=1852906, which
> I was able to reproduce with both packed and split vq.
> 
> I posted a patch upstream for it:
> https://patchew.org/QEMU/20210129090728.831208-1-eperezma@redhat.com/ .
> 
Yes, that's the one! Thanks Eugenio. I think you fixed two BZs with 5 lines of code. :)

> I always had success reproducing the issue with testpmd txonly forwarding
> mode, though.

How? Killing the host's DPDK while it was transmitting? What version of DPDK?

Comment 16 Eugenio Pérez Martín 2021-02-03 19:06:15 UTC
(In reply to Adrián Moreno from comment #15)
> How? Killing the host's DPDK while it was transmitting? What version of DPDK?

By restarting the guest. If I restart testpmd [1] with packed virtqueues, testpmd is
not able to recover the packed ring, but it works with split virtqueues.

It should work with any version of testpmd, since qemu did stop vhost device
queue before notifying testpmd.

[1] OVS, actually, but I think it will be the same result.

Comment 17 Pei Zhang 2021-02-04 11:38:58 UTC
(In reply to Adrián Moreno from comment #13)

Hi Adrian,

After several tries, I can confirm that this issue can only be reproduced with qemu-5.1.0 + dpdk 19.11. The other combinations work well, with no crash.

qemu-kvm version                                                dpdk version               Result
qemu-kvm-5.1.0-7.module+el8.3.0+8099+dba2fe3e.x86_64            dpdk-19.11.3-1.el8.x86_64  Both testpmd and qemu crash
qemu-kvm-5.1.0-7.module+el8.3.0+8099+dba2fe3e.x86_64            dpdk-20.11-1.el8.x86_64    Works
qemu-kvm-5.2.0-5.scrmod+el8.4.0+9783+7f5b6b81.wrb210203.x86_64  dpdk-19.11.3-1.el8.x86_64  Works


Best regards,

Pei

Comment 18 Pei Zhang 2021-02-04 11:46:04 UTC
(In reply to Eugenio Pérez Martín from comment #14)

Hello Eugenio,

Thank you for sending the patch. 

As in Comment 17, I can no longer reproduce the original issue (see Description) with dpdk 19.11 + qemu 5.2 or dpdk 20.11 + qemu 5.2. So the issue you hit might be a different one. Could you share your reproduction steps? The more detail the better. If it's a new testing scenario, I can add it to future testing. Thanks a lot.

Best regards,

Pei

Comment 19 Eugenio Pérez Martín 2021-02-04 11:54:21 UTC
(In reply to Pei Zhang from comment #18)

Hi Pei.

I actually reboot the guest, not testpmd, as described in
https://bugzilla.redhat.com/show_bug.cgi?id=1852906 . It does not reproduce
100% of the time (an iotlb request needs to be in flight), but I'm pretty sure
it will reproduce at least once every 2-3 tries.

Please let me know if you need more information.

Comment 20 Adrián Moreno 2021-02-04 12:06:12 UTC
There were really 3 bugs here:
1 - qemu not saving per-vq features. Solved in qemu-5.2.0 by:
https://patchwork.kernel.org/project/qemu-devel/patch/46CBC206-E0CA-4249-81CD-10F75DA30441@tencent.com/

2 - dpdk crashing when holes were created in the virtqueue struct (triggered by 1). Solved in dpdk 20.11 by:
https://patches.dpdk.org/patch/81398/

3 - qemu crashing when testpmd suddenly crashes and there are in-flight iotlb messages. Solved by Eugenio's proposed patch:
https://patchew.org/QEMU/20210129090728.831208-1-eperezma@redhat.com/

So, I believe that Eugenio's patch on top of qemu-5.1.0 + dpdk 19.11 will also "work" (testpmd will crash but qemu will not).
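
To make the interaction between the three bugs easier to follow, here is a small Python model of the breakdown above. It is a deliberately simplified sketch and not qemu or dpdk logic: the `Setup` fields and the `predict` function are invented for illustration, and it assumes that a testpmd crash during reconnect leaves iotlb messages in flight, which is what the Description's double crash suggests but this thread does not prove.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Setup:
    qemu_saves_vq_features: bool  # bug 1 fixed (qemu-5.2.0 according to this comment)
    dpdk_handles_vq_holes: bool   # bug 2 fixed (dpdk 20.11 according to this comment)
    qemu_has_iotlb_patch: bool    # bug 3 fixed (Eugenio's proposed patch)


def predict(setup, reboot_under_traffic=False):
    """Return (testpmd_crashes, qemu_crashes) for the kill/restart scenario.

    reboot_under_traffic models the extra reproducer from the thread: the
    guest is rebooted while packets flow, so iotlb requests are in flight.
    """
    # Bug 2 is only triggered by bug 1, so testpmd needs both to be unfixed.
    testpmd_crashes = (not setup.qemu_saves_vq_features
                       and not setup.dpdk_handles_vq_holes)
    # Assumption: a crashing testpmd leaves iotlb messages in flight.
    iotlb_in_flight = testpmd_crashes or reboot_under_traffic
    qemu_crashes = iotlb_in_flight and not setup.qemu_has_iotlb_patch
    return testpmd_crashes, qemu_crashes


# qemu-5.1.0 + dpdk 19.11, as in the Description: both crash.
print(predict(Setup(False, False, False)))
# Same versions plus Eugenio's patch: testpmd crashes, qemu survives.
print(predict(Setup(False, False, True)))
```

Under these assumptions the model agrees with the combinations reported in this thread, including the case where qemu 5.2 with dpdk 20.11 only crashes once the guest is rebooted under traffic.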

Comment 21 Pei Zhang 2021-02-19 03:26:00 UTC
(In reply to Eugenio Pérez Martín from comment #19)

Hi Eugenio,

Thanks for the info. I reproduced this issue :)

Two extra steps, in addition to the steps in the Description, are now needed to reproduce this issue:
1. Start packet flow in the VM, so that there are in-flight iotlb requests
2. Reboot the VM


Best regards,

Pei

Comment 22 Pei Zhang 2021-02-19 03:30:45 UTC
(In reply to Adrián Moreno from comment #20)

Hello Adrian,

Thanks for the summary about these issues and related patches. It's very helpful for me to understand them.

With qemu 5.2 + dpdk 20.11, qemu can still crash when there are in-flight iotlb messages.

Versions I tested:
qemu-kvm-5.2.0-7.module+el8.4.0+9943+d64b3717.x86_64
dpdk-20.11-1.el8.x86_64


Best regards,

Pei

Comment 23 Pei Zhang 2021-02-19 06:47:58 UTC
Hi Eugenio,

Just to let you know, RHEL9 also hits this issue and requires this patch. I've filed Bug 1930549 to track it. Thanks.

Best regards,

Pei

Comment 29 Pei Zhang 2021-02-20 06:34:03 UTC
Verification:


Versions:

4.18.0-289.el8.x86_64
qemu-kvm-5.2.0-8.module+el8.4.0+10093+e085f1eb.x86_64
dpdk-20.11-1.el8.x86_64


Steps:

1. In host, boot VM with vhost-user 4 queues 

2. In host, boot testpmd with vhost-user 4 queues client

3. In another host, start MoonGen to send packets to VM

4. In the VM, check the vhost-user network status: it receives packets and the RX packet count increases.

5. In host, Kill testpmd

6. Start testpmd again

7. Reboot the VM. The guest keeps working well and the vhost-user network recovers and resumes receiving packets.

8. Repeat steps 5-7 several times; no errors are shown.
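
Steps 5-7 lend themselves to a small loop. The Python sketch below only shows the shape of such a loop with a dry-run runner. It assumes a libvirt guest named `vm1` (a hypothetical name), reuses the `testpmd_4q.sh` script from the Description, and uses `virsh reboot` as one possible way to reboot the VM; the thread does not say how the reboot was actually done.

```python
import shlex

# One verification cycle. The domain name "vm1" is an assumption.
CYCLE_COMMANDS = [
    "pkill testpmd",     # step 5: kill testpmd on the host
    "sh testpmd_4q.sh",  # step 6: start testpmd again
    "virsh reboot vm1",  # step 7: reboot the VM
]


def run_cycles(count, run):
    """Call run(argv) for each command of each cycle.

    Returns None if every call returned 0, otherwise (cycle, command)
    for the first command that failed.
    """
    for cycle in range(1, count + 1):
        for command in CYCLE_COMMANDS:
            if run(shlex.split(command)) != 0:
                return cycle, command
    return None


def dry_run(argv):
    """Print the command instead of executing it."""
    print("would run:", " ".join(argv))
    return 0


failure = run_cycles(3, dry_run)
print("first failure:", failure)
```

A real runner would have to start testpmd in the background, since the script launches it in interactive mode (`-i`) and does not return, and it would also need to wait for the guest to come back and check the traffic counters before the next cycle.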


So this bug has been fixed. Moving to 'VERIFIED'.

Comment 31 errata-xmlrpc 2021-05-25 06:43:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (virt:av bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2098

