Bug 1355902
| Summary: | vhost-user reconnect misc fixes and improvements | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Ademar Reis <areis> |
| Component: | qemu-kvm-rhev | Assignee: | Marc-Andre Lureau <marcandre.lureau> |
| Status: | CLOSED ERRATA | QA Contact: | Pei Zhang <pezhang> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 7.2 | CC: | ailan, chayang, juzhang, lmiksik, marcandre.lureau, pezhang, virt-maint, xfu, xiywang |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | qemu-kvm-rhev-2.6.0-23.el7 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-11-07 21:23:52 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1370005 | | |
| Bug Blocks: | | | |
Description
Ademar Reis
2016-07-12 21:58:09 UTC
Backport posted on the rhvirt-patches list.

Please grant the exception: we had to rush bug 1322087 to please some customer, but qemu is currently fairly easy to assert/crash when vhost-user is disconnected. This series prevents most of the known crashes.

Fix included in qemu-kvm-rhev-2.6.0-23.el7

Hi Marc-Andre,
QE want to verify this bug, but there are many patches (about 28), so could you please provide some testing scenarios to cover them? Detailed testing steps are also welcome.
Thank you,
Pei

Comment 7
Marc-Andre Lureau

In general, the patch set fixes many code paths that can be triggered at run time when the backend is disconnected. However, there is no easy way to test those (it would require, for example, breaking the guest kernel and disconnecting the backend at a particular time). There are a few scenarios left broken, such as migrating with a disconnected backend.

One that is easily observable, "Do not crash on unbind virtio-pci when backend is disconnected":
1. qemu-system-x86_64 -enable-kvm -cpu SandyBridge -m 1024 -object memory-backend-file,id=mem,size=1024M,mem-path=/tmp,share=on -numa node,memdev=mem -mem-prealloc -chardev socket,id=char0,path=/tmp/vubr.sock,server -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce -device virtio-net-pci,netdev=mynet1 rhel.qcow2
2. connect with vubr: tests/vhost-user-bridge -c
3. kill vubr
4. in guest: # echo -n '0000:00:01.0' > /sys/bus/pci/drivers/virtio-pci/unbind
5. rebind in guest: # echo -n '0000:00:01.0' > /sys/bus/pci/drivers/virtio-pci/bind

There is also a fix to avoid qemu starting with an uninitialized backend, "Wait for backend init to complete" (a scripted sketch of this check follows the list):
1. qemu-system-x86_64 -enable-kvm -cpu SandyBridge -m 1024 -object memory-backend-file,id=mem,size=1024M,mem-path=/tmp,share=on -numa node,memdev=mem -mem-prealloc -chardev socket,id=char0,path=/tmp/vubr.sock,server -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce -device virtio-net-pci,netdev=mynet1 rhel.qcow2
2. connect with nc: nc -U /tmp/vubr.sock
3. kill nc (optional: repeat 2-3)
4. reconnect with vubr: tests/vhost-user-bridge -c
5. only after backend init does qemu start
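A rough scripted version of the "Wait for backend init to complete" check, assuming the image path (rhel.qcow2), socket path and tests/vhost-user-bridge location used in the commands above; the sleeps are arbitrary and the guest boot is observed on the VNC/serial console:

```bash
#!/bin/bash
# Sketch of the "Wait for backend init to complete" check: qemu serves the
# vhost-user socket, a dummy client (nc) connects and is killed, and the
# guest should only boot once a real backend (vhost-user-bridge) finishes init.
set -x

SOCK=/tmp/vubr.sock            # socket path from the command above
IMG=rhel.qcow2                 # guest image path (assumption)
VUBR=tests/vhost-user-bridge   # built in the qemu source tree (assumption)

qemu-system-x86_64 -enable-kvm -cpu SandyBridge -m 1024 \
    -object memory-backend-file,id=mem,size=1024M,mem-path=/tmp,share=on \
    -numa node,memdev=mem -mem-prealloc \
    -chardev socket,id=char0,path=$SOCK,server \
    -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
    -device virtio-net-pci,netdev=mynet1 $IMG &
QEMU_PID=$!

sleep 2
nc -U $SOCK &                  # backend connects but never initializes
NC_PID=$!
sleep 2
kill $NC_PID                   # backend goes away; qemu must neither crash nor start the guest

kill -0 $QEMU_PID && echo "qemu still alive after the dummy backend was killed"

$VUBR -c &                     # real backend; the guest should boot only after init completes
wait $QEMU_PID
```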
Comment 8
Pei Zhang

Hi Marc-Andre,

Thank you for your detailed suggestions. QE tested the 3 scenarios below according to your Comment 7, and there are a few questions; could you please answer them? Thanks.

Q1: In the 'Scenario1: Migration part' testing, with the fixed version the 'Link detected' state of the network becomes 'yes', while with the unfixed version it is 'no'. Is this the right check point? If not, could you provide the right steps and check points if necessary? (With both the unfixed and fixed versions, qemu and the guest work well and no crash occurs; the only difference is the network status.)

Q2: In the 'Scenario2: bind/unbind virtio-pci' testing, could you check these steps? I cannot reproduce any crash: with both the unfixed and fixed versions, no error shows up during testing.

Q3: As there is no error with the fixed (higher) version in these 3 scenarios, can QE verify this bug?

Versions used for reproduction:
Host: 3.10.0-510.el7.x86_64
      qemu-kvm-rhev-2.6.0-18.el7.x86_64
Guest: 3.10.0-510.el7.x86_64

Versions used for verification:
Host: 3.10.0-510.el7.x86_64
      qemu-kvm-rhev-2.6.0-27.el7.x86_64
Guest: 3.10.0-510.el7.x86_64

==Scenario1: Migration part==

Reproduce:
Steps:
1. Boot guest on the src host
# /usr/libexec/qemu-kvm -cpu SandyBridge -m 4096 -smp 4 \
-object memory-backend-file,id=mem0,size=4096M,mem-path=/dev/hugepages,share=on \
-numa node,nodeid=0,memdev=mem0 \
-mem-prealloc \
/mnt/rhel7.3.qcow2_snapshot5 \
-chardev socket,id=char0,path=/tmp/vubr.sock,server \
-device virtio-net-pci,netdev=mynet1,mac=54:52:00:1a:2c:01 \
-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
-vga std -vnc :10 \
-monitor stdio \
-serial unix:/tmp/monitor,server,nowait \

2. Boot guest on the dst host with '-incoming'
<same as step 1> -incoming tcp:0:5555

3. Start vubr on the src and dst hosts
# ./vhost-user-bridge -c

4. Kill vubr on the dst host

5. Do the migration
(qemu) migrate -d tcp:10.73.72.154:5555

6. After migration finishes, check the status in the guest.
(1) The status of the network device is down:
# ethtool eth0
Settings for eth0:
        Link detected: no
(2) The guest works well.

Verification:
1~5 same as above.
6. After migration finishes, check the status in the guest.
(1) The status of the network device stays up:
# ethtool eth0
Settings for eth0:
        Link detected: yes
(2) The guest works well.

==Scenario2: bind/unbind virtio-pci==

Steps:
1. Boot guest with vhost-user as server
# /usr/libexec/qemu-kvm -cpu SandyBridge -m 4096 -smp 4 \
-object memory-backend-file,id=mem0,size=4096M,mem-path=/dev/hugepages,share=on \
-numa node,nodeid=0,memdev=mem0 \
-mem-prealloc \
/mnt/rhel7.3.qcow2_snapshot5 \
-chardev socket,id=char0,path=/tmp/vubr.sock,server \
-device virtio-net-pci,netdev=mynet1,mac=54:52:00:1a:2c:01 \
-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
-vga std -vnc :10 \
-monitor stdio \
-serial unix:/tmp/monitor,server,nowait \

2. Connect with vubr
# ./vhost-user-bridge -c

3. Kill vubr

4. In the guest, unbind/bind the network device
# lspci | grep Eth
00:03.0 Ethernet controller: Red Hat, Inc Virtio network device
# echo -n '0000:00:03.0' > /sys/bus/pci/drivers/virtio-pci/unbind
# echo -n '0000:00:03.0' > /sys/bus/pci/drivers/virtio-pci/bind

5. Check qemu and guest status; both work well.

==Scenario3: Qemu core dump part==

Reproduce:
Steps:
1. Boot guest with vhost-user as server
/usr/libexec/qemu-kvm -m 4096 -smp 4 \
-object memory-backend-file,id=mem0,size=4096M,mem-path=/dev/hugepages,share=on \
-numa node,nodeid=0,memdev=mem0 \
-mem-prealloc \
/home/pezhang/rhel7.3.qcow2_snapshot5 \
-chardev socket,id=char0,path=/tmp/vubr.sock,server \
-device virtio-net-pci,netdev=mynet1,mac=54:52:00:1a:2c:01 \
-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
-vga std -vnc :10 \
-monitor stdio \
-serial unix:/tmp/monitor,server,nowait \

2. Connect with nc
# nc -U /tmp/vubr.sock

3. Kill nc; qemu core dumps:
qemu-kvm: -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce: Failed to read msg header. Read 0 instead of 12. Original request 1.
qemu-kvm: -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce: Failed to read msg header. Read 0 instead of 12. Original request 15.
qemu-kvm: -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce: Failed to read msg header. Read 0 instead of 12. Original request 1.
Segmentation fault (core dumped)

So the qemu core dump has been reproduced.

Verification:
1. Boot guest with vhost-user as server
2. Connect with nc
3. Kill nc; qemu still works well.
4. Repeat 2~3.
5. Reconnect with vubr
# ./vhost-user-bridge -c
6. Qemu starts and the guest works well.

So the qemu core dump has been fixed.

Best Regards,
-Pei
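The connect/kill cycle in Scenario 3 above lends itself to a small driver script, which makes it easy to repeat the disconnect a few times and confirm that the fixed qemu survives each one. This is a sketch only, assuming the Scenario 3 qemu command line is already running and the socket is /tmp/vubr.sock; the iteration count and sleeps are arbitrary, and the pgrep pattern is just one way to find the qemu process:

```bash
#!/bin/bash
# Repeat the "connect with nc, then kill it" cycle from Scenario 3 and
# check after every iteration that the qemu process is still alive.
SOCK=/tmp/vubr.sock
QEMU_PID=$(pgrep -f "chardev=char0,vhostforce")   # assumes the Scenario 3 qemu is already running

for i in $(seq 1 5); do
    nc -U "$SOCK" &          # uninitialized backend connects
    NC_PID=$!
    sleep 2
    kill "$NC_PID"           # backend goes away again
    sleep 1
    if ! kill -0 "$QEMU_PID" 2>/dev/null; then
        echo "iteration $i: qemu died (unfixed behaviour)"
        exit 1
    fi
    echo "iteration $i: qemu still alive"
done

# Finally attach a real backend; with the fix, qemu proceeds with the guest boot.
./vhost-user-bridge -c
```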
Comment 9
Marc-Andre Lureau

(In reply to Pei Zhang from comment #8)
> Hi Marc-Andre,
>
> Thank you for your detailed suggestions. QE tested the 3 scenarios below
> according to your Comment 7, and there are a few questions; could you please
> answer them? Thanks.
>
> Q1: In the 'Scenario1: Migration part' testing, with the fixed version the
> 'Link detected' state of the network becomes 'yes', while with the unfixed
> version it is 'no'. Is this the right check point? If not, could you provide
> the right steps and check points if necessary? (With both the unfixed and
> fixed versions, qemu and the guest work well and no crash occurs; the only
> difference is the network status.)

after migration, since vubr is connected, link detected must be "yes".

Note that migration with a disconnected backend isn't really supported at this point, as I explained in comment #7.

> Q2: In the 'Scenario2: bind/unbind virtio-pci' testing, could you check these
> steps? I cannot reproduce any crash: with both the unfixed and fixed
> versions, no error shows up during testing.

You are right, that case was actually "fixed" by commit 52fbd7024284fcb52ac6d9e8634d23a25badc2d7 (vhost-net: do not crash if backend is not present).

However, there are many cases where qemu asserts that get_vhost_net() returns non-null, and without the patch "vhost-user: keep vhost_net after a disconnection" (in this series) it will crash. I would need to find a way to reproduce it; unfortunately, this may involve custom driver changes or a custom dpdk. It seems fairly painful to give you a reproducer. The point is, those fixes should have been in the initial reconnect series, but we rushed to have the basics "working" for the customer, so it is not a series about fixing known bugs, but rather a series to complete the initial reconnect support.

> Q3: As there is no error with the fixed (higher) version in these 3
> scenarios, can QE verify this bug?

thanks
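Since a targeted reproducer for the get_vhost_net() crashes is hard to provide, the closest scriptable approximation is to keep killing and restarting the backend while the guest driver is unbound and rebound, so the disconnect paths that "vhost-user: keep vhost_net after a disconnection" hardens are exercised repeatedly. A rough host-side sketch, assuming ssh access to the guest over a separate NIC and the device/socket names used in the scenarios above:

```bash
#!/bin/bash
# Stress the disconnect window: repeatedly kill/restart the vhost-user
# backend on the host while the guest unbinds/rebinds virtio-pci.
GUEST=192.168.122.100          # assumption: guest IP reachable over a separate NIC
DEV=0000:00:03.0               # virtio-net device address inside the guest

for i in $(seq 1 20); do
    ./vhost-user-bridge -c -u /tmp/vubr.sock &
    VUBR_PID=$!
    sleep 1
    ssh root@"$GUEST" "echo -n $DEV > /sys/bus/pci/drivers/virtio-pci/unbind"
    kill "$VUBR_PID"           # disconnect while the device is being torn down
    ssh root@"$GUEST" "echo -n $DEV > /sys/bus/pci/drivers/virtio-pci/bind"
    pgrep -f qemu-kvm >/dev/null || { echo "qemu crashed at iteration $i"; exit 1; }
done
echo "no crash after 20 iterations"
```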
Comment 10
Pei Zhang

(In reply to Marc-Andre Lureau from comment #9)
[...]
> after migration, since vubr is connected, link detected must be "yes".
>
> Note that migration with a disconnected backend isn't really supported at
> this point, as I explained in comment #7.

OK.

> > Q2: In the 'Scenario2: bind/unbind virtio-pci' testing, could you check
> > these steps? I cannot reproduce any crash: with both the unfixed and fixed
> > versions, no error shows up during testing.
>
> You are right, that case was actually "fixed" by commit
> 52fbd7024284fcb52ac6d9e8634d23a25badc2d7 (vhost-net: do not crash if backend
> is not present).
>
> However, there are many cases where qemu asserts that get_vhost_net() returns
> non-null, and without the patch "vhost-user: keep vhost_net after a
> disconnection" (in this series) it will crash. I would need to find a way
> to reproduce it; unfortunately, this may involve custom driver changes or a
> custom dpdk. It seems fairly painful to give you a reproducer. The point
> is, those fixes should have been in the initial reconnect series, but we
> rushed to have the basics "working" for the customer, so it is not a series
> about fixing known bugs, but rather a series to complete the initial
> reconnect support.

So this part cannot be reproduced. (Please correct me if I didn't understand correctly.) In order to verify this bug, QE continued testing more vhost-user disconnect/reconnect scenarios; they all work well and did not cause any regressions. Summary:
1. dpdk's testpmd with disconnect/connect of vubr: works well.
2. unbind/bind of virtio-pci with disconnect/connect of vubr: works well.
3. network status after disconnect/connect of vubr: works well.

Hi Marc-Andre,
Based on all these tests, can QE set this bug to 'VERIFIED'? Thanks.

==disconnect/reconnect testing (continued)==

1. Boot guest with 2 vhost-user servers
/usr/libexec/qemu-kvm -cpu SandyBridge -m 4096 -smp 4 \
-object memory-backend-file,id=mem0,size=4096M,mem-path=/dev/hugepages,share=on \
-numa node,nodeid=0,memdev=mem0 \
-mem-prealloc \
/mnt/rhel7.3.qcow2_snapshot5 \
-chardev socket,id=char0,path=/tmp/vubr0.sock,server \
-device virtio-net-pci,netdev=mynet0,mac=54:52:00:1a:2c:01 \
-netdev type=vhost-user,id=mynet0,chardev=char0,vhostforce \
-chardev socket,id=char1,path=/tmp/vubr1.sock,server \
-device virtio-net-pci,netdev=mynet1,mac=54:52:00:1a:2c:02 \
-netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce \
-vga std -vnc :10 \
-monitor stdio \
-serial unix:/tmp/monitor,server,nowait \

2. Connect with vubr
./vhost-user-bridge -c -u /tmp/vubr0.sock -l 127.0.0.1:4444 -r 127.0.0.1:5555
./vhost-user-bridge -c -u /tmp/vubr1.sock -l 127.0.0.1:6666 -r 127.0.0.1:7777

3. In the guest, bind the NICs to vfio
(1) load vfio
modprobe -r vfio
modprobe -r vfio_iommu_type1
modprobe vfio enable_unsafe_noiommu_mode=Y
modprobe vfio-pci
cat /sys/module/vfio/parameters/enable_unsafe_noiommu_mode
(2) bind to vfio
# lspci -n -s 0000:00:03.0
00:03.0 0200: 1af4:1000
# echo 0000:00:03.0 > /sys/bus/pci/devices/0000\:00\:03.0/driver/unbind
# echo 0000:00:04.0 > /sys/bus/pci/devices/0000\:00\:04.0/driver/unbind
# echo "1af4 1000" > /sys/bus/pci/drivers/vfio-pci/new_id
# echo "1af4 1000" > /sys/bus/pci/drivers/vfio-pci/remove_id
# ls /sys/bus/pci/drivers/vfio-pci/
0000:00:03.0  0000:00:04.0  bind  module  new_id  remove_id  uevent  unbind

4. Start dpdk's testpmd; it works well.

5. Kill all vubr instances.

6. Start dpdk's testpmd again. No errors in qemu or the guest.
# cat testpmd-1q.sh
queues=1
cores=2
/root/dpdk-16.07/x86_64-native-linuxapp-gcc/build/app/test-pmd/testpmd -l 0,1,2 -n 1 -d /root/dpdk-16.07/x86_64-native-linuxapp-gcc/lib/librte_pmd_virtio.so \
-w 00:03.0 -w 00:04.0 \
-- \
--disable-hw-vlan -i \
--crc-strip \
--nb-cores=${cores} \
--disable-rss \
--rxq=${queues} --txq=${queues} \
--auto-start \
--rxd=256 --txd=256 \

7. Repeat connect/kill of vubr several times; qemu and the guest keep working well.

8. Connect vubr again and start slirp in the background
# /usr/libexec/qemu-kvm \
-net none \
-net socket,vlan=0,udp=localhost:4444,localaddr=localhost:5555 \
-net user,vlan=0

9. Return the NICs from vfio to virtio-pci (one possible rebind sequence is sketched after this list). The guest network works well; wget works.
# dhclient eth0
# ifconfig eth0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.0.2.15  netmask 255.255.255.0  broadcast 10.0.2.255
        inet6 fe80::5652:ff:fe1a:2c01  prefixlen 64  scopeid 0x20<link>
        inet6 fec0::5652:ff:fe1a:2c01  prefixlen 64  scopeid 0x40<site>
        ether 54:52:00:1a:2c:01  txqueuelen 1000  (Ethernet)
        RX packets 20  bytes 6580 (6.4 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 81  bytes 10248 (10.0 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
# wget http://download.eng.bos.redhat.com/brewroot/packages/qemu-kvm-rhev/2.6.0/27.el7/src/qemu-kvm-rhev-2.6.0-27.el7.src.rpm
[...]
Saving to: ‘qemu-kvm-rhev-2.6.0-27.el7.src.rpm’
100%[======================================>] 27,253,220  163KB/s   in 3m 5s
2016-09-26 11:20:09 (144 KB/s) - ‘qemu-kvm-rhev-2.6.0-27.el7.src.rpm’ saved [27253220/27253220]

10. Repeat unbind/bind of the virtio-pci driver several times, while also connecting/killing vubr several times. Both qemu and the guest keep working well.
# echo -n '0000:00:03.0' > /sys/bus/pci/drivers/virtio-pci/unbind
# echo -n '0000:00:03.0' > /sys/bus/pci/drivers/virtio-pci/bind
# echo -n '0000:00:04.0' > /sys/bus/pci/drivers/virtio-pci/unbind
# echo -n '0000:00:04.0' > /sys/bus/pci/drivers/virtio-pci/bind
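Step 9 above only names the rebind; one way it is typically done, as a sketch assuming the same 0000:00:03.0 and 0000:00:04.0 addresses and that the vfio-pci id entry was already removed in step 3:

```bash
#!/bin/bash
# Inside the guest: detach both NICs from vfio-pci and rebind them to virtio-pci.
for dev in 0000:00:03.0 0000:00:04.0; do
    # release the device from vfio-pci
    echo -n "$dev" > /sys/bus/pci/drivers/vfio-pci/unbind
    # hand it back to the native virtio-pci driver
    echo -n "$dev" > /sys/bus/pci/drivers/virtio-pci/bind
done
# bring networking back up on the first NIC
dhclient eth0
```

No new_id write is needed for the rebind, since virtio-pci already claims the 1af4:1000 ID natively once the device is free.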
Comment 11
Marc-Andre Lureau

(In reply to Pei Zhang from comment #10)
[...]
> So this part cannot be reproduced. (Please correct me if I didn't understand
> correctly.) In order to verify this bug, QE continued testing more vhost-user
> disconnect/reconnect scenarios; they all work well and did not cause any
> regressions. Summary:
> 1. dpdk's testpmd with disconnect/connect of vubr: works well.
> 2. unbind/bind of virtio-pci with disconnect/connect of vubr: works well.
> 3. network status after disconnect/connect of vubr: works well.
>
> Hi Marc-Andre,
> Based on all these tests, can QE set this bug to 'VERIFIED'? Thanks.

yes: test looks fine, no regression + "Wait for backend init to complete" verified.

Thanks Marc-Andre. Setting this bug to 'VERIFIED' per Comment 8, Comment 9, Comment 10 and Comment 11.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2673.html