Bug 1528229
Summary: Live migration fails when testing VM with openvswitch multiple pmds and vhost-user single queue
Product: Red Hat Enterprise Linux 7
Component: openvswitch
Version: 7.5
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Pei Zhang <pezhang>
Assignee: Aaron Conole <aconole>
QA Contact: Pei Zhang <pezhang>
CC: aconole, atragler, chayang, fleitner, juzhang, michen, ovs-qe, pezhang, pvauter, qding, sukulkar
Target Milestone: rc
Keywords: Regression
Hardware: Unspecified
OS: Unspecified
Fixed In Version: 2.9.0-0.4.20171212git6625e43
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2018-03-19 10:22:13 UTC
Bug Blocks: 1475436
Can you make sure the abrt coredump hook is installed and collect a core dump of ovs-vswitchd that crashes?

(In reply to Aaron Conole from comment #3)
> Can you make sure the abrt coredump hook is installed and collect a core
> dump of ovs-vswitchd that crashes?

Hi Aaron,

Please check [1].

[1] http://fileshare.englab.nay.redhat.com/pub/section2/coredump/var/crash/pezhang/bug1528229/

Best Regards,
Pei

According to the info in that crash you used .2? Can you confirm the rpm versions used to generate that crashdump? I can't seem to get a backtrace out of the crashdump. I get unknown symbol information.

Also, can you use .1 to generate the crashdump? I don't have access to a rhel7.5 system at the moment.

(In reply to Aaron Conole from comment #5)
> According to the info in that crash you used .2? Can you confirm the rpm
> versions used to generate that crashdump? I can't seem to get a backtrace
> out of the crashdump. I get unknown symbol information.

Yes, I was testing with .2.

> Also, can you use .1 to generate the crashdump? I don't have access to a
> rhel7.5 system at the moment.

This is the .1 crashdump:

# abrt-cli list
id 6ddeb16173cc4a6241c021b4958c218da37af81a
reason:         ovs-vswitchd killed by SIGSEGV
time:           Wed 10 Jan 2018 06:01:19 AM EST
cmdline:        ovs-vswitchd unix:/var/run/openvswitch/db.sock --pidfile --detach --log-file=/var/log/openvswitch/ovs-vswitchd.log
package:        openvswitch-2.9.0-0.1.20171212git6625e43.el7fdb
uid:            0 (root)
Directory:      /var/spool/abrt/ccpp-2018-01-10-06:01:19-38753
Run 'abrt-cli report /var/spool/abrt/ccpp-2018-01-10-06:01:19-38753' for creating a case in Red Hat Customer Portal

For the core dump file, please refer to:
http://fileshare.englab.nay.redhat.com/pub/section2/coredump/var/crash/pezhang/bug1528229/Jan10/

Best Regards,
Pei

Created attachment 1387907 [details]
rpm for ovs with backported patch for testing

Attached a possible fixed RPM. Please test this out.
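Getting a usable backtrace from the crashdump proved difficult above due to missing symbol information. Even without debuginfo, the dmesg segfault line quoted in the bug description gives the faulting instruction pointer and the ovs-vswitchd mapping base, so the offset inside the binary can be computed; a minimal sketch (values taken from the dmesg output in this report):

```shell
# Compute the crash offset inside ovs-vswitchd from the dmesg segfault line:
#   pmd90[31429]: segfault at 2 ip 000056488562d8e1 ... in ovs-vswitchd[564885465000+478000]
ip=0x56488562d8e1      # faulting instruction pointer (pmd90 thread)
base=0x564885465000    # ovs-vswitchd mapping base from dmesg
printf 'crash offset in ovs-vswitchd: 0x%x\n' $(( ip - base ))
# -> crash offset in ovs-vswitchd: 0x1c88e1
```

The resulting offset could then be resolved to a source line with addr2line against a binary built with matching debuginfo, e.g. `addr2line -e /usr/sbin/ovs-vswitchd -f 0x1c88e1` (the binary path is the conventional one, not stated in the report).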
(In reply to Aaron Conole from comment #7)
> Created attachment 1387907 [details]
> rpm for ovs with backported patch for testing
>
> Attached a possible fixed RPM. Please test this out.

Hi Aaron,

This issue is gone with this build. All 15 migration runs work as expected, with no errors. (There is still a high MoonGen packet loss issue, however I think that is Bug 1512463.)

Versions:
kernel-3.10.0-841.el7.x86_64
qemu-kvm-rhev-2.10.0-18.el7.x86_64
openvswitch-2.9.0-0.4.20171212git6625e43.bz1528229.el7fdb.x86_64

Thanks,
Pei

Thanks, Pei. This is resolved via backport of 7320ecf6898f559cd129f2a8bcbce71cbb25075e

==Verification==

Versions:
3.10.0-841.el7.x86_64
qemu-kvm-rhev-2.10.0-18.el7.x86_64
libvirt-3.9.0-9.el7.x86_64
tuned-2.9.0-1.el7.noarch
openvswitch-2.9.0-0.4.20171212git6625e43.el7fdb.x86_64
dpdk-17.11-7.el7.x86_64

Steps: Same as Description.

All 20 migration runs work well, so this bug has been fixed. Moving status to 'VERIFIED'.

Update:

Versions:
kernel-3.10.0-855.el7.x86_64
qemu-kvm-rhev-2.10.0-21.el7.x86_64
libvirt-3.9.0-13.el7.x86_64
tuned-2.9.0-1.el7.noarch
dpdk-17.11-7.el7.x86_64
openvswitch-2.9.0-3.el7fdp.x86_64

Steps: 200 migration runs work well.

Beaker job: https://beaker.engineering.redhat.com/recipes/4852564#tasks

Note: There is still a packet loss issue, however it is Bug 1512463 - Guest network can not recover immediately after ping-pong live migration over ovs-dpdk.

==Verification==

Versions:
kernel-3.10.0-855.el7.x86_64
qemu-kvm-rhev-2.10.0-21.el7.x86_64
libvirt-3.9.0-13.el7.x86_64
dpdk-17.11-7.el7.x86_64
openvswitch-2.9.0-1.el7fdb.x86_64
microcode-20180108.tgz

Steps: All 20 migration runs work well.
VM acts as client:

===========Stream Rate: 1Mpps===========
No  Stream_Rate  Downtime  Totaltime  Ping_Loss  trex_Loss
0   1Mpps        132       17095      17         10550410.0
1   1Mpps        133       17193      17         10664811.0
2   1Mpps        122       16439      16         7852092.0
3   1Mpps        138       15410      17         7702828.0
4   1Mpps        130       17172      16         14683583.0
5   1Mpps        133       17279      17         14498931.0
6   1Mpps        138       17068      18         3734361.0
7   1Mpps        125       18668      16         10902951.0
8   1Mpps        125       16105      17         11887164.0
9   1Mpps        129       17175      16         14594933.0

VM acts as server:

===========Stream Rate: 1Mpps===========
No  Stream_Rate  Downtime  Totaltime  Ping_Loss  trex_Loss
0   1Mpps        146       13878      15         5388814.0
1   1Mpps        141       14826      115        12939958.0
2   1Mpps        128       14052      13         10341139.0
3   1Mpps        147       13794      14         11001215.0
4   1Mpps        134       14228      13         11302087.0
5   1Mpps        138       15280      13         13966103.0
6   1Mpps        134       15004      112        9595515.0
7   1Mpps        137       13509      14         9695822.0
8   1Mpps        128       14985      116        7379643.0
9   1Mpps        130       13758      22         6602482.0

So this bug has been fixed very well in openvswitch-2.9.0-1.el7fdb.x86_64. Move status of this bug to 'VERIFIED'.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0550
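As a quick sanity check over the verification runs above, the Downtime column (column 3 of the "VM acts as client" table; units are not stated in the report, likely milliseconds) can be averaged with a one-line awk filter:

```shell
# Average the Downtime column (field 3) across the ten 1Mpps client-side runs;
# the rows are copied verbatim from the verification table above.
awk '{ sum += $3; n++ } END { printf "average downtime: %.1f\n", sum / n }' <<'EOF'
0 1Mpps 132 17095 17 10550410.0
1 1Mpps 133 17193 17 10664811.0
2 1Mpps 122 16439 16 7852092.0
3 1Mpps 138 15410 17 7702828.0
4 1Mpps 130 17172 16 14683583.0
5 1Mpps 133 17279 17 14498931.0
6 1Mpps 138 17068 18 3734361.0
7 1Mpps 125 18668 16 10902951.0
8 1Mpps 125 16105 17 11887164.0
9 1Mpps 129 17175 16 14594933.0
EOF
# -> average downtime: 130.5
```

The same filter with `$4`, `$5`, or `$6` would average Totaltime, Ping_Loss, or trex_Loss instead.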
Created attachment 1370803 [details]
XML of VM

Description of problem:
This is doing NFV live migration with openvswitch and vhost-user single queue. Migration always fails; we hit the two kinds of issues below:
(1) After migrating from src to des, the guest network can not recover at all.
(2) Sometimes the first migration from src to des fails with a qemu core dump.

Version-Release number of selected component (if applicable):
3.10.0-823.el7.x86_64
qemu-kvm-rhev-2.10.0-12.el7.x86_64
libvirt-3.9.0-6.el7.x86_64
tuned-2.9.0-1.el7.noarch
openvswitch-2.9.0-0.1.20171212git6625e43.el7fdb.x86_64
dpdk-17.11-4.el7.x86_64

How reproducible:
2/5

Steps to Reproduce:
1. Boot ovs on src and des host, see [1].
2. Set multiple pmds for ovs; set 6 CPUs in total as there are 6 ports.
ovs-vsctl set Open_vSwitch . other_config={}
ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x1
ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x1554
ovs-vsctl set Interface dpdk0 options:n_rxq=1
ovs-vsctl set Interface dpdk1 options:n_rxq=1
ovs-vsctl set Interface dpdk2 options:n_rxq=1
3. Boot VM, see attachment.
4. In VM, start testpmd.
5. On another host, generate packets to the VM.
./build/MoonGen examples/l2-load-latency.lua 0 1 5000
6. Do migration from src to des.
# /bin/virsh migrate --verbose --persistent --live rhel7.5_nonrt qemu+ssh://192.168.1.2/system
7. Migration fails with the 2 issues mentioned above.

Actual results:
(1) After migrating from src to des, the guest network can not recover at all. The related log looks like below:

====cat /var/log/libvirt/qemu/rhel7.5_nonrt.log====
...
2017-12-21 10:17:34.912+0000: initiating migration
2017-12-21T10:17:35.268775Z qemu-kvm: Failed to read msg header. Read 0 instead of 12. Original request 6.
2017-12-21T10:17:35.268820Z qemu-kvm: vhost_set_log_base failed: Success (0)
2017-12-21T10:17:35.268827Z qemu-kvm: Failed to set msg fds.
2017-12-21T10:17:35.268833Z qemu-kvm: vhost_set_features failed: Success (0)
2017-12-21T10:17:35.268838Z qemu-kvm: Failed to set msg fds.
2017-12-21T10:17:35.268843Z qemu-kvm: vhost_set_vring_addr failed: Success (0)
2017-12-21T10:17:35.268849Z qemu-kvm: Failed to set msg fds.
2017-12-21T10:17:35.268855Z qemu-kvm: vhost_set_vring_addr failed: Success (0)
2017-12-21T10:17:35.312092Z qemu-kvm: Failed to set msg fds.
2017-12-21T10:17:35.312117Z qemu-kvm: vhost VQ 0 ring restore failed: -1: Success (0)
2017-12-21T10:17:35.312133Z qemu-kvm: Failed to set msg fds.
2017-12-21T10:17:35.312138Z qemu-kvm: vhost VQ 1 ring restore failed: -1: Success (0)
2017-12-21T10:17:35.312554Z qemu-kvm: Failed to read from slave.
2017-12-21T10:17:35.312571Z qemu-kvm: Failed to read from slave.
2017-12-21T10:17:35.312618Z qemu-kvm: Failed to set msg fds.
2017-12-21T10:17:35.312626Z qemu-kvm: vhost VQ 0 ring restore failed: -1: Resource temporarily unavailable (11)
2017-12-21T10:17:35.312633Z qemu-kvm: Failed to set msg fds.
2017-12-21T10:17:35.312639Z qemu-kvm: vhost VQ 1 ring restore failed: -1: Resource temporarily unavailable (11)
2017-12-21T10:17:35.312984Z qemu-kvm: Failed to set msg fds.
2017-12-21T10:17:35.312994Z qemu-kvm: vhost VQ 0 ring restore failed: -1: Resource temporarily unavailable (11)
2017-12-21T10:17:35.313001Z qemu-kvm: Failed to set msg fds.
2017-12-21T10:17:35.313006Z qemu-kvm: vhost VQ 1 ring restore failed: -1: Resource temporarily unavailable (11)
2017-12-21 10:17:44.416+0000: shutting down, reason=migrated
2017-12-21T10:17:44.417363Z qemu-kvm: terminating on signal 15 from pid 1338 (/usr/sbin/libvirtd)

====check host # dmesg====
...
[110599.158973] pmd90[31429]: segfault at 2 ip 000056488562d8e1 sp 00007f90cdff8780 error 4
[110599.159011] pmd91[31428]: segfault at 2 ip 000056488562f76a sp 00007f90ce7fb580 error 4
[110599.159014] in ovs-vswitchd[564885465000+478000]
[110599.180449] in ovs-vswitchd[564885465000+478000]

(2) Sometimes the first migration from src to des fails with a qemu core dump:

2017-12-20 05:54:30.196+0000: initiating migration
2017-12-20T05:54:30.552693Z qemu-kvm: Failed to read msg header. Read -1 instead of 12. Original request 6.
2017-12-20T05:54:30.552746Z qemu-kvm: vhost_set_log_base failed: Input/output error (5)
2017-12-20T05:54:30.552780Z qemu-kvm: Failed to set msg fds.
2017-12-20T05:54:30.552787Z qemu-kvm: vhost_set_vring_addr failed: Invalid argument (22)
2017-12-20T05:54:30.552793Z qemu-kvm: Failed to set msg fds.
2017-12-20T05:54:30.552799Z qemu-kvm: vhost_set_vring_addr failed: Invalid argument (22)
2017-12-20T05:54:30.552804Z qemu-kvm: Failed to set msg fds.
2017-12-20T05:54:30.552810Z qemu-kvm: vhost_set_features failed: Invalid argument (22)
2017-12-20 05:54:30.769+0000: shutting down, reason=crashed

Expected results:
Migration should work well.

Additional info:
1. Without multiple pmds, we don't hit this issue: doing 10 migration runs without step 2, all migration runs work well.
2. This is a regression bug:
openvswitch-2.8.0-4.el7fdb.x86_64.rpm works well.

Reference:

[1]
# ovs-vsctl show
b9357f5b-bb2c-429e-8e6d-b171d7242b7c
    Bridge "ovsbr1"
        Port "ovsbr1"
            Interface "ovsbr1"
                type: internal
        Port "dpdk2"
            Interface "dpdk2"
                type: dpdk
                options: {dpdk-devargs="0000:06:00.0", n_rxq="1", n_txq="1"}
        Port "vhost-user2"
            Interface "vhost-user2"
                type: dpdkvhostuser
    Bridge "ovsbr0"
        Port "vhost-user1"
            Interface "vhost-user1"
                type: dpdkvhostuser
        Port "dpdk1"
            Interface "dpdk1"
                type: dpdk
                options: {dpdk-devargs="0000:04:00.1", n_rxq="1", n_txq="1"}
        Port "vhost-user0"
            Interface "vhost-user0"
                type: dpdkvhostuser
        Port "ovsbr0"
            Interface "ovsbr0"
                type: internal
        Port "dpdk0"
            Interface "dpdk0"
                type: dpdk
                options: {dpdk-devargs="0000:04:00.0", n_rxq="1", n_txq="1"}
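Step 2 of the reproduction sets `other_config:pmd-cpu-mask=0x1554`. Each set bit in this hex mask pins one OVS PMD thread to the corresponding CPU core, so decoding the mask is a quick way to confirm the "6 CPUs for 6 ports" claim in the steps:

```shell
# Decode the pmd-cpu-mask from step 2 into the list of CPU cores
# that ovs-vswitchd will run PMD threads on.
mask=$(( 0x1554 ))
cores=""
for cpu in $(seq 0 15); do
    if [ $(( (mask >> cpu) & 1 )) -eq 1 ]; then
        cores="$cores $cpu"
    fi
done
echo "PMD cores:$cores"
# -> PMD cores: 2 4 6 8 10 12
```

The mask decodes to six cores, matching the six ports in the setup; `dpdk-lcore-mask=0x1` in the same step likewise pins the non-PMD DPDK lcore threads to CPU 0.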