Bug 1485867
| Summary: | No recovery after vhost-user process restart | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | tiama |
| Component: | qemu-kvm-rhev | Assignee: | Jens Freimann <jfreiman> |
| Status: | CLOSED WONTFIX | QA Contact: | Pei Zhang <pezhang> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 7.5 | CC: | ailan, chayang, drjones, juzhang, knoel, marcandre.lureau, michen, pezhang, virt-maint, xiywang |
| Target Milestone: | rc | Keywords: | Regression |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-10-06 08:26:49 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
tiama
2017-08-28 10:21:46 UTC
qemu-2.10.0-rc4 [1] fail

Sorry for comment 1:
qemu-kvm-rhev-2.9.0-1.el7.x86_64 fail
qemu-kvm-rhev-2.9.0-4.el7.x86_64 fail
qemu-kvm-rhev-2.9.0-6.el7.x86_64 fail
qemu-kvm-rhev-2.9.0-7.el7.x86_64 fail

[1] http://download.qemu-project.org/qemu-2.10.0-rc4.tar.xz

This is only a vhost-user-bridge regression from 2.8, when libvhost-user was introduced. Since it's only a manual test, I don't think we need to backport it. You can either use vhost-user-bridge from older qemu releases, or use the one from qemu upstream once this fix is applied: "[PATCH 0/2] vhost-user-bridge reconnect regression" (the regression was actually introduced in 2.9).

Hi Marc-Andre,

I agree vhost-user-bridge is a manual test. If I understand correctly, we can probably test the reconnect issue over OVS with DPDK.

Versions:
openvswitch-2.8.0-0.1.20170810git3631ed2.el7fdb.x86_64
qemu-kvm-rhev-2.9.0-16.el7.x86_64
3.10.0-693.el7.x86_64
dpdk-17.05-3.el7fdb.x86_64

Steps:
1. Boot OVS as vhost-user client.
2. Boot the guest as vhost-user server.
3. In the guest, start testpmd using the vhost-user vNICs.
4. On another host, start MoonGen as the packet generator.
5. Check packets received/sent in the guest. The network works well.
6. Restart OVS to emulate a vhost-user client reconnect.
7. Check packets received/sent in the guest. There are packets, so the network can recover.
8. Repeat steps 6-7; the network always recovers.

Based on this scenario, the network can recover.

Another scenario:
1. Boot OVS as vhost-user client.
2. Boot the guest as vhost-user server.
3. In the guest, set an IP address on the vhost-user vNICs.
4. From another host, ping this IP address. Ping receives replies; it works.
5. Restart OVS to emulate a vhost-user client reconnect.
6. Check packets received/sent in the guest. Ping no longer receives replies, so the network with an IP address does not recover.

Based on this scenario, the network cannot recover. This is the same issue as in the Description of this bug. So it seems that from the vhost-user perspective, after the client reconnects, the vhost-user server can recover.
However, the network with an IP address cannot recover until the guest is rebooted.

(In reply to Pei Zhang from comment #7)
> Another scenario:
> 1. Boot ovs as vhost-user client

To isolate OVS issues from this, can you try the same but with the PVP setup? (i.e. with testpmd on the host)
http://dpdk.org/doc/guides/howto/pvp_reference_benchmark.html?highlight=pvp

(In reply to Amnon Ilan from comment #8)
> To isolate OVS issues from this, can you try the same but with the
> PVP setup? (i.e. with testpmd on the host)
> http://dpdk.org/doc/guides/howto/pvp_reference_benchmark.html?highlight=pvp

PVP works well. This IP issue should be caused by DPDK.

Versions:
qemu-kvm-rhev-2.9.0-16.el7.x86_64
libvirt-3.7.0-2.el7.x86_64
dpdk-16.11-4.el7fdp.x86_64

Steps:
1. Boot the VM as vhost-user server.
2. Boot testpmd as vhost-user client:
# testpmd -l 19,17,15 --socket-mem=1024,1024 -n 4 \
  --vdev 'net_vhost0,iface=/tmp/vhost-user1,client=1' -- \
  --portmask=3 --disable-hw-vlan -i --rxq=1 --txq=1 \
  --nb-cores=2 --forward-mode=io
testpmd> set portlist 0,1
testpmd> start
3. Set an IP address in the guest and start ping. Works well.
4. Quit testpmd on the host (testpmd> quit), then restart the vhost-user client (repeat step 2).
5. Check ping in the guest: it works, so the network can recover.

With the same steps on dpdk-17.05-3.el7fdb.x86_64, the IP network cannot recover.

Another update: for the scenario in Comment 7, when testing with an older openvswitch version, the IP network can recover.
(1) work (IP network can recover): openvswitch-2.6.1-20.git20161206.el7fdp.x86_64 with dpdk-16.11-4.el7fdp.x86_64
(2) work: openvswitch-2.7.2-7.git20170719.el7fdp.x86_64 (linked with dpdk-16.11.2.tar.xz)
(3) fail (IP network cannot recover): openvswitch-2.8.0-0.1.20170810git3631ed2.el7fdb.x86_64 (linked with dpdk-17.05.1.tar.xz)

Note: from OVS 2.7, OVS is statically linked with DPDK 16.11.1, so DPDK no longer needs to be installed on the host.
So from either the PVP or the openvswitch perspective, the network can recover with dpdk-16.11, but cannot recover with dpdk-17.05.

1. In the Description, we are using the old vhost-user-bridge (which comes from qemu-kvm-rhev-2.6.0-27.el7.src.rpm). With this old version of the tool, vhost-user reconnect with qemu-kvm-rhev-2.9.0-16.el7.x86_64 cannot recover.

2. We compiled the new vhost-user-bridge (which comes from qemu-kvm-rhev-2.9.0-16.el7.src.rpm). Here is the problem: when the vhost-user-bridge is restarted, it panics:

# ./vhost-user-bridge -c
ud socket: /tmp/vubr.sock (client)
local:     127.0.0.1:4444
remote:    127.0.0.1:5555
Added sock 3 for watching. max_sock: 3
Added sock 4 for watching. max_sock: 4
Waiting for data from udp backend on 127.0.0.1:4444...
Added sock 5 for watching. max_sock: 5
Added sock 5 for watching. max_sock: 5
*** IN UDP RECEIVE CALLBACK ***
    hdrlen = 12
PANIC: Guest moved used index from 0 to 643
Sock 3 removed from dispatcher watch.
Got UDP packet, but no available descriptors on RX virtq.

So we cannot test the reconnect issue with the latest vhost-user-bridge. This tool has a bug.

3. As Jens said in the mail, besides the reconnect issue, dpdk's testpmd still hits a segmentation fault (I hit the same issue too). So QE reported the two bugs below:
[1] Bug 1491898 - In PVP testing, dpdk's testpmd will "Segmentation fault" after booting VM
[2] Bug 1491909 - IP network can not recover after vhost-user reconnect from OVS side

4. A summary:
(1) From the PVP and openvswitch layers, vhost-user reconnect works well with qemu-kvm-rhev-2.9.0-16.el7.x86_64. (The latest versions of dpdk and openvswitch hit regression issues, but these are not qemu problems.)
(2) vhost-user-bridge has a bug, so we cannot test the reconnect issue with this tool.

Thanks, Jens.
Best Regards,
Pei

The way we compiled the vhost-user-bridge tool:
1. Download qemu-kvm-rhev-2.9.0-16.el7.src.rpm
2. Compile vhost-user-bridge:
# rpm2cpio qemu-kvm-rhev-2.9.0-16.el7.src.rpm | cpio -div
# tar -xvf qemu-2.9.0.tar.xz
# cd qemu-2.9.0
# mkdir build
# cd build/
# ../configure
# make tests/vhost-user-bridge

(In reply to Pei Zhang from comment #11)
> (2) vhost-user-bridge has a bug. So we can not test reconnect issue with this
> tool.

This should be fixed with the patches Marc-Andre mentioned in comment #4. Just to be sure: have you tried with a vhost-user-bridge binary from upstream? Or an older one from v2.8.0?

(In reply to Jens Freimann from comment #13)
> This should be fixed with the patches Marc-Andre mentioned in comment #4.

With this fix, the network still cannot recover most of the time.

> Just to be sure: Have you tried with a vhost-user-bridge binary from
> upstream? Or an older one from v2.8.0?

No, I tested vhost-user-bridge from downstream, compiling the tool as in Comment 12. As the qemu-kvm-rhev-2.8.0-x.el7 builds have all been deleted from brewweb, it seems we cannot download and test qemu-kvm-rhev-2.8 now.

Best Regards,
Pei

With upstream vhost-user-bridge from current upstream master (4f2058ded4feb2fa815b33b57b305c81d5016307) I see this when I start vhost-user-bridge to reconnect:

[  210.299955] virtio_net virtio1: output.0:id 0 is not a head!
Reconnect works with v2.8.0, so I ran a bisect and found:

Bisecting: 0 revisions left to test after this (roughly 0 steps)
[e10e798c85c2331dab338b6a01835ebde81136e5] tests/vhost-user-bridge: use contrib/libvhost-user

Qemu command line:
/usr/local/bin/qemu-system-x86_64 --enable-kvm \
 -drive id=drive_image1,if=none,snapshot=off,cache=none,format=qcow2,file=/root/jens/rhel74_1.qcow2 \
 -device virtio-blk-pci,id=image1,drive=drive_image1,bootindex=0,bus=pci.0,addr=0x3 \
 -device virtio-net-pci,netdev=mynet1,mac=54:52:00:1a:2c:01 \
 -chardev socket,id=char0,path=/tmp/vubr.sock,server \
 -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
 -m 4096 -smp 8 \
 -chardev socket,path=/tmp/port0,server,nowait,id=port0-char \
 -device virtio-serial \
 -device virtserialport,id=port1,name=org.fedoraproject.port.0,chardev=port0-char \
 -nographic -display none -serial mon:stdio \
 -object memory-backend-file,id=mem,size=4096M,mem-path=/dev/hugepages,share=on \
 -numa node,memdev=mem -mem-prealloc

vhost-user-bridge: tests/vhost-user-bridge -c

Procedure: in the guest I set up eth0, ran dhclient and pinged, then killed vhost-user-bridge and tried to ping again.

(In reply to Jens Freimann from comment #15)
> With upstream vhost-user-bridge from current upstream master
> (4f2058ded4feb2fa815b33b57b305c81d5016307) I see this when I start
> vhost-user-bridge to reconnect:
>
> [  210.299955] virtio_net virtio1: output.0:id 0 is not a head!
We would need to debug the guest driver to understand that error.

> Reconnect works with v2.8.0, so I ran bisect and found:
> Bisecting: 0 revisions left to test after this (roughly 0 steps)
> [e10e798c85c2331dab338b6a01835ebde81136e5] tests/vhost-user-bridge: use
> contrib/libvhost-user

That's what I found in https://bugzilla.redhat.com/show_bug.cgi?id=1485867#c4

> Qemu command line:
[...]
> Procedure:
> In guest I set up eth0, ran dhclient and pinged, then killed
> vhost-user-bridge and tried to ping again.

Last time I checked, it was fixed with 672339f7eff5e9226f302037290e84e783d2b5cd, but your testing of upstream includes this fix already. What is the guest kernel version? Are you going to investigate further?

Btw, vhost-user-bridge is a manual test, so a temporary break is to be expected... I don't think we need to backport the fix in RHEL, and this bug should probably be handled upstream only, no?
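Both failure signatures in this thread ("PANIC: Guest moved used index from 0 to 643" from the bridge, and "output.0: id 0 is not a head!" from the guest driver) are ring-state sanity checks firing because one side restarted with fresh virtqueue state while the other kept its old indices. The sketch below is a simplified, hypothetical recreation of that style of check, not libvhost-user's actual code; the function name and the in-flight bound are illustrative only:

```python
RING_MASK = 0xFFFF  # virtio ring indices are free-running 16-bit counters

def check_index_advance(prev_idx, new_idx, max_in_flight):
    """Sanity-check how far the peer's ring index moved since we last
    looked.  The index may advance by at most the number of descriptors
    currently in flight; a larger jump means the two sides disagree
    about ring state, e.g. after an unsynchronized backend restart."""
    advanced = (new_idx - prev_idx) & RING_MASK  # modulo-2^16 distance
    if advanced > max_in_flight:
        raise RuntimeError(
            f"Guest moved used index from {prev_idx} to {new_idx}")
    return new_idx

# Normal operation: the index advances by a few in-flight descriptors.
check_index_advance(10, 13, max_in_flight=8)

# A restarted backend starts again from index 0 while the guest is
# already at 643 -- calling check_index_advance(0, 643, 8) would raise.
```

Under this model, fixing reconnect means either resynchronizing ring state on the new connection or resetting the virtqueues on both sides, which is why a stale index survives until the guest reboots.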
(In reply to Marc-Andre Lureau from comment #16)
> (In reply to Jens Freimann from comment #15)
> > With upstream vhost-user-bridge from current upstream master
> > (4f2058ded4feb2fa815b33b57b305c81d5016307) I see this when I start
> > vhost-user-bridge to reconnect:
> >
> > [  210.299955] virtio_net virtio1: output.0:id 0 is not a head!
>
> We would need to debug guest driver to understand that error
>
> That's what I found in
> https://bugzilla.redhat.com/show_bug.cgi?id=1485867#c4

I wanted to test with 672339f7eff5e9226f302037290e84e783d2b5cd included, so I started with upstream master.

[...]

> Last time I checked, it was fixed with
> 672339f7eff5e9226f302037290e84e783d2b5cd. But your testing of upstream
> includes this fix already.
> What is the guest kernel version? Are you going to investigate further?

The guest is RHEL 7.4 with kernel 3.10.0-679.el7.x86_64. Yes, I will investigate further.

> Btw, vhost-user-bridge is a manual test, so a temporary break is to be
> expected...
> I don't think we need to backport the fix in RHEL, and this bug should
> probably be handled upstream only, no?

Yes, I agree. We can fix this upstream, and manual testing could be done with v2.8 in the meantime.

Pei, do you agree?

(In reply to Jens Freimann from comment #17)
> (In reply to Marc-Andre Lureau from comment #16)
[...]
> Yes, I agree. We can fix this upstream and manual testing could be done with
> v2.8 in the meantime.
>
> Pei do you agree?
Hi Jens, Marc-Andre,

From the QE perspective, we have two concerns:

1. Users can get the vhost-user-bridge tool, even though we are not quite sure whether they use it. As in Comment 12, this tool can be compiled from qemu-kvm-rhev-2.9.0-16.el7.src.rpm, which customers or partners can access.

2. If we don't backport the fix in RHEL, does this mean the vhost-user-bridge tool will not be supported? That is, should QE remove the vhost-user-bridge related test case and cover the reconnect issue by testing at the PVP or OpenvSwitch layer?

Thanks,
Pei

(In reply to Pei Zhang from comment #18)
[...]
> 2. If we don't backport the fix in RHEL, does this mean vhost-user-bridge
> tool will not supported?
>
> I mean should QE remove vhost-user-bridge related test case and we cover the
> reconnect issue by testing PVP or OpenvSwitch layer?

It is a tool used for QEMU (unit) tests. I think testing the vhost-user reconnect feature with supported components might be better.

(In reply to Jens Freimann from comment #19)
[...]
> It is a tool used for QEMU (unit) tests. I think testing the vhost-user
> reconnect feature with supported components might be better.

OK, we will test vhost-user reconnect by PVP and Openvswitch. Thanks.

Best Regards,
Pei

As discussed, I will look into fixing this upstream, and QE will test vhost-user not with vhost-user-bridge but instead with a PVP setup and OVS. No need to fix this in RHEL.
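The recovery checks used throughout this bug (ping or packet counters after restarting the vhost-user client) lend themselves to a small poll loop, so a PVP/OVS test run doesn't have to eyeball the output. This is a hedged sketch for such a harness; the probe command and timeouts are placeholders for whatever the test bed actually uses:

```python
import subprocess
import time

def wait_for_recovery(probe_cmd, attempts=10, delay=1.0):
    """Re-run `probe_cmd` (e.g. ['ping', '-c', '1', '-W', '1', guest_ip])
    until it succeeds, giving the datapath time to reconnect.  Returns
    the 1-based attempt number on success, or None if it never recovered."""
    for i in range(1, attempts + 1):
        ok = subprocess.run(probe_cmd, stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL).returncode == 0
        if ok:
            return i
        time.sleep(delay)
    return None
```

A test would restart OVS (or testpmd) and then assert `wait_for_recovery(...)` is not None; in the failing dpdk-17.05 configurations above, the IP probe would keep failing and the function would return None.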