Description of problem:
Boot a VM with a 4-queue vhost-user interface, then boot testpmd as the vhost-user client. Kill testpmd and start it again: the vhost-user connection fails to restart, and both qemu and testpmd crash.

Version-Release number of selected component (if applicable):
4.18.0-237.el8.x86_64
qemu-kvm-5.1.0-7.module+el8.3.0+8099+dba2fe3e.x86_64
dpdk-19.11.3-1.el8.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Boot the VM with a 4-queue vhost-user interface

<interface type="vhostuser">
  <mac address="88:66:da:5f:dd:02" />
  <source mode="server" path="/tmp/vhost-user1.sock" type="unix" />
  <model type="virtio" />
  <driver ats="on" iommu="on" name="vhost" queues="4" rx_queue_size="1024" />
  <address bus="0x6" domain="0x0000" function="0x0" slot="0x00" type="pci" />
</interface>

2. Boot testpmd as the vhost-user client

# cat testpmd_4q.sh
testpmd \
    -l 2,4,6,8,10,12,14,16,18 \
    --socket-mem 1024,1024 \
    -n 4 \
    --vdev 'net_vhost0,iface=/tmp/vhost-user1.sock,queues=4,client=1,iommu-support=1' \
    -d /usr/lib64/librte_pmd_vhost.so \
    -- \
    --portmask=f \
    -i \
    --rxd=512 --txd=512 \
    --rxq=4 --txq=4 \
    --nb-cores=8 \
    --forward-mode=txonly

# sh testpmd_4q.sh

3. Kill testpmd

# pkill testpmd

4. Start testpmd again

# sh testpmd_4q.sh

5. The vhost-user connection fails to restart. Both qemu and testpmd crash.

testpmd_4q.sh: line 13:  5227 Segmentation fault      (core dumped) testpmd -l 2,4,6,8,10,12,14,16,18 --socket-mem 1024,1024 -n 4 --vdev 'net_vhost0,iface=/tmp/vhost-user1.sock,queues=4,client=1,iommu-support=1' -d /usr/lib64/librte_pmd_vhost.so -- --portmask=f -i --rxd=512 --txd=512 --rxq=4 --txq=4 --nb-cores=8 --forward-mode=txonly

# dmesg
[ 3782.630934] vhost-events[5239]: segfault at 2d8 ip 00007f5dc8e171cb sp 00007f5db97f9870 error 4 in librte_vhost.so.20.0[7f5dc8e10000+4e000]
[ 3782.643491] Code: 89 40 08 e9 3f ff ff ff e8 22 bc ff ff 66 90 f3 0f 1e fa 41 57 41 56 41 55 41 54 49 89 cc 55 48 89 f5 53 48 89 fb 48 83 ec 48 <4c> 8b af d8 02 00 00 48 89 14 24 44 89 44 24 10 64 48 8b 04 25 28
[ 3784.517001] qemu-kvm[5097]: segfault at 1b0 ip 000055b8b3406540 sp 00007ffded2577f0 error 4 in qemu-kvm[55b8b304d000+a66000]
[ 3784.528225] Code: 75 10 48 8b 05 71 80 ae 00 89 c0 64 48 89 03 0f ae f0 90 48 8b 45 00 31 c9 85 d2 48 89 e7 0f 95 c1 41 b8 01 00 00 00 4c 89 e2 <48> 8b b0 b0 01 00 00 e8 64 d2 f5 ff 48 83 3c 24 00 0f 84 f9 00 00

Actual results:
The vhost-user connection fails to restart. Both qemu and testpmd crash.

Expected results:
The vhost-user connection should restart cleanly, and both qemu and testpmd should keep working.

Additional info:
1. This issue cannot be reproduced with 2 vhost-user queues; 4 queues are needed to trigger it.
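For reference, the <interface type="vhostuser"> definition in step 1 corresponds roughly to a qemu command line like the sketch below. This is a hand-written illustration, not the command line libvirt actually generated for this domain; the chardev/netdev ids are made up and exact option spellings can vary between qemu versions (vectors=10 follows the usual 2*queues+2 rule for 4 queue pairs):

-chardev socket,id=charnet1,path=/tmp/vhost-user1.sock,server,nowait \
-netdev vhost-user,id=hostnet1,chardev=charnet1,queues=4 \
-device virtio-net-pci,netdev=hostnet1,mac=88:66:da:5f:dd:02,mq=on,vectors=10,iommu_platform=on,ats=on,rx_queue_size=1024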
Hello Adrian,

I cannot trigger this issue with the latest rhel8.4-av; both dpdk and qemu keep working well. Could you confirm at the code level whether this issue has been fixed? Thanks a lot.

Versions:
4.18.0-278.rt7.43.el8.dt4.x86_64
qemu-kvm-5.2.0-4.module+el8.4.0+9676+589043b9.x86_64
tuned-2.15.0-1.el8.noarch
libvirt-7.0.0-3.module+el8.4.0+9709+a99efd61.x86_64
python3-libvirt-6.10.0-1.module+el8.4.0+8948+a39b3f3a.x86_64
openvswitch2.13-2.13.0-86.el8fdp.x86_64
dpdk-20.11-1.el8.x86_64

Best regards,

Pei
Hi Pei,

AFAICS the bug is still there in the latest qemu.

The thing is, the bug was triggered *after* DPDK crashed. The DPDK crash was fixed by Maxime in 20.11, so that's why it's no longer reproducible with the new DPDK version.

Can you confirm this by testing qemu-5.2.0 and dpdk 19.11?
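(Editorial aside: one possible way to run this check is to record and pin the exact builds before each combination. The NVRs below are the ones from this report; the dnf invocation is only an illustration and assumes the corresponding build is available in an enabled repository, with modular qemu-kvm possibly needing 'dnf module' handling on RHEL 8:)

# record exactly which builds are installed before each run
rpm -q qemu-kvm dpdk kernel

# pin dpdk to the wanted build (example invocation)
dnf install dpdk-19.11.3-1.el8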
Hi!

From the description of #c4, I think that the pending issue in qemu is the same as https://bugzilla.redhat.com/show_bug.cgi?id=1852906, which I was able to reproduce with both packed and split vq.

I posted a patch upstream for it: https://patchew.org/QEMU/20210129090728.831208-1-eperezma@redhat.com/ .

I always had success reproducing the issue with testpmd txonly forwarding mode, though.
(In reply to Eugenio Pérez Martín from comment #14)
> Hi!
>
> From the description of #c4, I think that the pending issue in qemu is
> the same as https://bugzilla.redhat.com/show_bug.cgi?id=1852906, which
> I was able to reproduce with both packed and split vq.
>
> I posted a patch upstream for it:
> https://patchew.org/QEMU/20210129090728.831208-1-eperezma@redhat.com/ .
>
Yes, that's the one! Thanks Eugenio. I think you fixed two BZs with 5 lines of code. :)

> I always had success reproducing the issue with testpmd txonly forwarding
> mode, though.

How? Killing the host's DPDK while it was transmitting? What version of DPDK?
(In reply to Adrián Moreno from comment #15)
> (In reply to Eugenio Pérez Martín from comment #14)
> > Hi!
> >
> > From the description of #c4, I think that the pending issue in qemu is
> > the same as https://bugzilla.redhat.com/show_bug.cgi?id=1852906, which
> > I was able to reproduce with both packed and split vq.
> >
> > I posted a patch upstream for it:
> > https://patchew.org/QEMU/20210129090728.831208-1-eperezma@redhat.com/ .
> >
> Yes, that's the one! Thanks Eugenio. I think you fixed two BZs with 5 lines
> of code. :)
>
> > I always had success reproducing the issue with testpmd txonly forwarding
> > mode, though.
>
> How? Killing the host's DPDK while it was transmitting? What version of DPDK?

Restarting the guest. If I restart testpmd [1] with packed rings, what I get is that testpmd is not able to recover the packed ring, but it works with split. It should work with any version of testpmd, since qemu stops the vhost device queues before notifying testpmd.

[1] OVS, actually, but I think the result will be the same.
(In reply to Adrián Moreno from comment #13)
> Hi Pei,
>
> AFAICS the bug is still there in the latest qemu.
>
> The thing is, the bug was triggered *after* DPDK crashed. The DPDK crash was
> fixed by Maxime in 20.11, so that's why it's no longer reproducible with the
> new DPDK version.
>
> Can you confirm this by testing qemu-5.2.0 and dpdk 19.11?

Hi Adrian,

After several tries, I can confirm this issue can only be reproduced with qemu 5.1.0 + dpdk 19.11. The other combinations work well, with no crash.

qemu-kvm-5.1.0-7.module+el8.3.0+8099+dba2fe3e.x86_64 & dpdk-19.11.3-1.el8.x86_64
  -> Both testpmd and qemu crash

qemu-kvm-5.1.0-7.module+el8.3.0+8099+dba2fe3e.x86_64 & dpdk-20.11-1.el8.x86_64
  -> Works

qemu-kvm-5.2.0-5.scrmod+el8.4.0+9783+7f5b6b81.wrb210203.x86_64 & dpdk-19.11.3-1.el8.x86_64
  -> Works

Best regards,

Pei
(In reply to Eugenio Pérez Martín from comment #14)
> Hi!
>
> From the description of #c4, I think that the pending issue in qemu is
> the same as https://bugzilla.redhat.com/show_bug.cgi?id=1852906, which
> I was able to reproduce with both packed and split vq.
>
> I posted a patch upstream for it:
> https://patchew.org/QEMU/20210129090728.831208-1-eperezma@redhat.com/ .
>
> I always had success reproducing the issue with testpmd txonly forwarding
> mode, though.

Hello Eugenio,

Thank you for sending the patch.

As noted in Comment 17, I can no longer reproduce the original issue (see Description) with dpdk 19.11 + qemu 5.2 or with dpdk 20.11 + qemu 5.2. So the issue you hit might be a different one; could you share your reproduction steps? The more detail, the better. If it's a new testing scenario, I can add it to future testing. Thanks a lot.

Best regards,

Pei
(In reply to Pei Zhang from comment #18)
> (In reply to Eugenio Pérez Martín from comment #14)
> > Hi!
> >
> > From the description of #c4, I think that the pending issue in qemu is
> > the same as https://bugzilla.redhat.com/show_bug.cgi?id=1852906, which
> > I was able to reproduce with both packed and split vq.
> >
> > I posted a patch upstream for it:
> > https://patchew.org/QEMU/20210129090728.831208-1-eperezma@redhat.com/ .
> >
> > I always had success reproducing the issue with testpmd txonly forwarding
> > mode, though.
>
> Hello Eugenio,
>
> Thank you for sending the patch.
>
> As noted in Comment 17, I can no longer reproduce the original issue (see
> Description) with dpdk 19.11 + qemu 5.2 or with dpdk 20.11 + qemu 5.2. So
> the issue you hit might be a different one; could you share your
> reproduction steps? The more detail, the better. If it's a new testing
> scenario, I can add it to future testing. Thanks a lot.
>
> Best regards,
>
> Pei

Hi Pei.

I actually reboot the guest, not testpmd, as described by https://bugzilla.redhat.com/show_bug.cgi?id=1852906 . It does not reproduce 100% of the time (it needs an iotlb request in flight), but I'm pretty sure it will reproduce at least once every 2-3 tries.

Please let me know if you need more information.
There were really 3 bugs here:

1 - qemu not saving per-vq features. Solved in qemu-5.2.0 by:
https://patchwork.kernel.org/project/qemu-devel/patch/46CBC206-E0CA-4249-81CD-10F75DA30441@tencent.com/

2 - dpdk crashing when holes in the virtqueue struct were created (triggered by 1). Solved in dpdk 20.11 by:
https://patches.dpdk.org/patch/81398/

3 - qemu crashing when testpmd suddenly crashes while there are in-flight iotlb messages. Solved by Eugenio's proposed patch:
https://patchew.org/QEMU/20210129090728.831208-1-eperezma@redhat.com/

So, I believe that Eugenio's patch on top of qemu-5.1.0 + dpdk 19.11 will also "work" (testpmd will crash but qemu will not).
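(A hedged aside for anyone landing on this summary later: a quick, imperfect way to sanity-check whether an installed build already carries one of these fixes is to grep its RPM changelog, since downstream builds usually reference the relevant bug or subsystem there. The grep patterns below are guesses, not the exact changelog wording:)

rpm -q --changelog qemu-kvm | grep -iE 'vhost|iotlb' | head
rpm -q --changelog dpdk | grep -iE 'vhost|virtqueue' | head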
(In reply to Eugenio Pérez Martín from comment #19)
> (In reply to Pei Zhang from comment #18)
> > (In reply to Eugenio Pérez Martín from comment #14)
> > > Hi!
> > >
> > > From the description of #c4, I think that the pending issue in qemu is
> > > the same as https://bugzilla.redhat.com/show_bug.cgi?id=1852906, which
> > > I was able to reproduce with both packed and split vq.
> > >
> > > I posted a patch upstream for it:
> > > https://patchew.org/QEMU/20210129090728.831208-1-eperezma@redhat.com/ .
> > >
> > > I always had success reproducing the issue with testpmd txonly forwarding
> > > mode, though.
> >
> > Hello Eugenio,
> >
> > Thank you for sending the patch.
> >
> > As noted in Comment 17, I can no longer reproduce the original issue (see
> > Description) with dpdk 19.11 + qemu 5.2 or with dpdk 20.11 + qemu 5.2. So
> > the issue you hit might be a different one; could you share your
> > reproduction steps? The more detail, the better. If it's a new testing
> > scenario, I can add it to future testing. Thanks a lot.
> >
> > Best regards,
> >
> > Pei
>
> Hi Pei.
>
> I actually reboot the guest, not testpmd, as described by
> https://bugzilla.redhat.com/show_bug.cgi?id=1852906 . It does not reproduce
> 100% of the time (it needs an iotlb request in flight), but I'm pretty sure
> it will reproduce at least once every 2-3 tries.
>
> Please let me know if you need more information.

Hi Eugenio,

Thanks for the info. I reproduced this issue :)

There are 2 extra steps, on top of the steps in the Description, needed to reproduce this issue now:

1. Add packet flow to the VM, so that there will be an iotlb request in flight
2. Reboot the VM

Best regards,

Pei
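For future regression testing, a minimal host-side sketch of these two extra steps could look like the loop below (the domain name, sleep time, and iteration count are made up; traffic from the external generator must already be flowing so that an iotlb request can be in flight when the reboot happens):

# with testpmd running on the host and traffic flowing towards the guest:
for i in $(seq 1 5); do
    virsh reboot rhel84-vm      # domain name is only an example
    sleep 90                    # give the guest time to come back up
    virsh domstate rhel84-vm    # qemu should still be running, not crashed
done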
(In reply to Adrián Moreno from comment #20)
> There were really 3 bugs here:
>
> 1 - qemu not saving per-vq features. Solved in qemu-5.2.0 by:
> https://patchwork.kernel.org/project/qemu-devel/patch/46CBC206-E0CA-4249-81CD-10F75DA30441@tencent.com/
>
> 2 - dpdk crashing when holes in the virtqueue struct were created (triggered
> by 1). Solved in dpdk 20.11 by:
> https://patches.dpdk.org/patch/81398/
>
> 3 - qemu crashing when testpmd suddenly crashes while there are in-flight
> iotlb messages. Solved by Eugenio's proposed patch:
> https://patchew.org/QEMU/20210129090728.831208-1-eperezma@redhat.com/
>
> So, I believe that Eugenio's patch on top of qemu-5.1.0 + dpdk 19.11 will
> also "work" (testpmd will crash but qemu will not).

Hello Adrian,

Thanks for the summary of these issues and the related patches. It's very helpful for understanding them.

With qemu 5.2 + dpdk 20.11, qemu can still crash once there are in-flight iotlb messages. Versions I tested:

qemu-kvm-5.2.0-7.module+el8.4.0+9943+d64b3717.x86_64
dpdk-20.11-1.el8.x86_64

Best regards,

Pei
Hi Eugenio,

Just to let you know, RHEL9 also hits this issue and requires this patch as the fix. I've filed Bug 1930549 to track it. Thanks.

Best regards,

Pei
Verification:

Versions:
4.18.0-289.el8.x86_64
qemu-kvm-5.2.0-8.module+el8.4.0+10093+e085f1eb.x86_64
dpdk-20.11-1.el8.x86_64

Steps:
1. In the host, boot the VM with 4 vhost-user queues
2. In the host, boot testpmd as a vhost-user client with 4 queues
3. In another host, start MoonGen to send packets to the VM
4. In the VM, check the vhost-user network status: it receives packets well and the RX packet counter keeps increasing.
5. In the host, kill testpmd
6. Start testpmd again
7. Reboot the VM. The guest keeps working well and the vhost-user network recovers, receiving packets again.
8. Repeat steps 5~7 several times; no errors show up.

So this bug has been fixed. Moving to 'VERIFIED'.
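One possible way to do the guest-side check in steps 4 and 7 (the interface name is only an example; any RX counter source works):

# in the guest, sample the RX counters twice and compare
ip -s link show eth1
sleep 5
ip -s link show eth1    # RX packets/bytes should keep increasing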
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (virt:av bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:2098