Bug 1528229 - Live migration fails when testing VM with openvswitch multiple pmds and vhost-user single queue
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: openvswitch
Version: 7.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Assignee: Aaron Conole
QA Contact: Pei Zhang
URL:
Whiteboard:
Depends On:
Blocks: 1475436
 
Reported: 2017-12-21 10:42 UTC by Pei Zhang
Modified: 2018-04-12 12:12 UTC
CC: 11 users

Fixed In Version: 2.9.0-0.4.20171212git6625e43
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-03-19 10:22:13 UTC
Target Upstream Version:
Embargoed:


Attachments
XML of VM (3.31 KB, text/plain), 2017-12-21 10:42 UTC, Pei Zhang
rpm for ovs with backported patch for testing (6.14 MB, application/x-rpm), 2018-01-29 16:28 UTC, Aaron Conole


Links
Red Hat Product Errata RHBA-2018:0550, last updated 2018-03-19 10:24:00 UTC

Description Pei Zhang 2017-12-21 10:42:13 UTC
Created attachment 1370803 [details]
XML of VM

Description of problem:
This is NFV live migration testing with openvswitch and vhost-user single queue. Migration always fails; we hit the two kinds of issues below:

(1) After migrating from src to dest, the guest network cannot recover at all.

(2) Sometimes the first migration from src to dest fails with a qemu core dump.

Version-Release number of selected component (if applicable):
3.10.0-823.el7.x86_64
qemu-kvm-rhev-2.10.0-12.el7.x86_64
libvirt-3.9.0-6.el7.x86_64
tuned-2.9.0-1.el7.noarch
openvswitch-2.9.0-0.1.20171212git6625e43.el7fdb.x86_64
dpdk-17.11-4.el7.x86_64

How reproducible:
2/5

Steps to Reproduce:
1. Boot ovs on the src and dest hosts, see [1].

2. Configure multiple PMDs for ovs, assigning 6 CPUs in total since there are 6 ports.

ovs-vsctl set Open_vSwitch . other_config={}
ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x1
ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x1554
ovs-vsctl set Interface dpdk0 options:n_rxq=1
ovs-vsctl set Interface dpdk1 options:n_rxq=1
ovs-vsctl set Interface dpdk2 options:n_rxq=1
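
For reference (not part of the original report), pmd-cpu-mask is a bitmask of host CPU IDs; 0x1554 selects CPUs 2, 4, 6, 8, 10, and 12, giving the six PMD threads. A minimal sketch of how such a mask can be built:

```shell
# Sketch: build an OVS pmd-cpu-mask from a list of CPU IDs.
# 0x1554 corresponds to CPUs 2 4 6 8 10 12.
mask=0
for cpu in 2 4 6 8 10 12; do
    mask=$(( mask | (1 << cpu) ))
done
printf '0x%x\n' "$mask"    # prints 0x1554
```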

3. Boot VM, see attachment.

4. In VM, start testpmd.

5. From another host, generate packets to the VM.
./build/MoonGen examples/l2-load-latency.lua 0 1 5000


6. Do the migration from src to dest
# /bin/virsh migrate --verbose --persistent --live rhel7.5_nonrt qemu+ssh://192.168.1.2/system

7. Migration fails with one of the two issues mentioned above.


Actual results:
(1) After migrating from src to dest, the guest network cannot recover at all. The related log looks like this:

====cat /var/log/libvirt/qemu/rhel7.5_nonrt.log====
...
2017-12-21 10:17:34.912+0000: initiating migration
2017-12-21T10:17:35.268775Z qemu-kvm: Failed to read msg header. Read 0 instead of 12. Original request 6.
2017-12-21T10:17:35.268820Z qemu-kvm: vhost_set_log_base failed: Success (0)
2017-12-21T10:17:35.268827Z qemu-kvm: Failed to set msg fds.
2017-12-21T10:17:35.268833Z qemu-kvm: vhost_set_features failed: Success (0)
2017-12-21T10:17:35.268838Z qemu-kvm: Failed to set msg fds.
2017-12-21T10:17:35.268843Z qemu-kvm: vhost_set_vring_addr failed: Success (0)
2017-12-21T10:17:35.268849Z qemu-kvm: Failed to set msg fds.
2017-12-21T10:17:35.268855Z qemu-kvm: vhost_set_vring_addr failed: Success (0)
2017-12-21T10:17:35.312092Z qemu-kvm: Failed to set msg fds.
2017-12-21T10:17:35.312117Z qemu-kvm: vhost VQ 0 ring restore failed: -1: Success (0)
2017-12-21T10:17:35.312133Z qemu-kvm: Failed to set msg fds.
2017-12-21T10:17:35.312138Z qemu-kvm: vhost VQ 1 ring restore failed: -1: Success (0)
2017-12-21T10:17:35.312554Z qemu-kvm: Failed to read from slave.
2017-12-21T10:17:35.312571Z qemu-kvm: Failed to read from slave.
2017-12-21T10:17:35.312618Z qemu-kvm: Failed to set msg fds.
2017-12-21T10:17:35.312626Z qemu-kvm: vhost VQ 0 ring restore failed: -1: Resource temporarily unavailable (11)
2017-12-21T10:17:35.312633Z qemu-kvm: Failed to set msg fds.
2017-12-21T10:17:35.312639Z qemu-kvm: vhost VQ 1 ring restore failed: -1: Resource temporarily unavailable (11)
2017-12-21T10:17:35.312984Z qemu-kvm: Failed to set msg fds.
2017-12-21T10:17:35.312994Z qemu-kvm: vhost VQ 0 ring restore failed: -1: Resource temporarily unavailable (11)
2017-12-21T10:17:35.313001Z qemu-kvm: Failed to set msg fds.
2017-12-21T10:17:35.313006Z qemu-kvm: vhost VQ 1 ring restore failed: -1: Resource temporarily unavailable (11)
2017-12-21 10:17:44.416+0000: shutting down, reason=migrated
2017-12-21T10:17:44.417363Z qemu-kvm: terminating on signal 15 from pid 1338 (/usr/sbin/libvirtd)

====check host # dmesg====
...
[110599.158973] pmd90[31429]: segfault at 2 ip 000056488562d8e1 sp 00007f90cdff8780 error 4
[110599.159011] pmd91[31428]: segfault at 2 ip 000056488562f76a sp 00007f90ce7fb580 error 4
[110599.159014]  in ovs-vswitchd[564885465000+478000]

[110599.180449]  in ovs-vswitchd[564885465000+478000]


(2) Sometimes the first migration from src to dest fails with a qemu core dump.

2017-12-20 05:54:30.196+0000: initiating migration
2017-12-20T05:54:30.552693Z qemu-kvm: Failed to read msg header. Read -1 instead of 12. Original request 6.
2017-12-20T05:54:30.552746Z qemu-kvm: vhost_set_log_base failed: Input/output error (5)
2017-12-20T05:54:30.552780Z qemu-kvm: Failed to set msg fds.
2017-12-20T05:54:30.552787Z qemu-kvm: vhost_set_vring_addr failed: Invalid argument (22)
2017-12-20T05:54:30.552793Z qemu-kvm: Failed to set msg fds.
2017-12-20T05:54:30.552799Z qemu-kvm: vhost_set_vring_addr failed: Invalid argument (22)
2017-12-20T05:54:30.552804Z qemu-kvm: Failed to set msg fds.
2017-12-20T05:54:30.552810Z qemu-kvm: vhost_set_features failed: Invalid argument (22)
2017-12-20 05:54:30.769+0000: shutting down, reason=crashed


Expected results:
Migration should work well.


Additional info:
1. Without multiple PMDs, this issue does not occur: 10 migration runs without step 2 all work well.

2. This is a regression bug:
openvswitch-2.8.0-4.el7fdb.x86_64.rpm    works well

Reference:
[1]
# ovs-vsctl show
b9357f5b-bb2c-429e-8e6d-b171d7242b7c
    Bridge "ovsbr1"
        Port "ovsbr1"
            Interface "ovsbr1"
                type: internal
        Port "dpdk2"
            Interface "dpdk2"
                type: dpdk
                options: {dpdk-devargs="0000:06:00.0", n_rxq="1", n_txq="1"}
        Port "vhost-user2"
            Interface "vhost-user2"
                type: dpdkvhostuser
    Bridge "ovsbr0"
        Port "vhost-user1"
            Interface "vhost-user1"
                type: dpdkvhostuser
        Port "dpdk1"
            Interface "dpdk1"
                type: dpdk
                options: {dpdk-devargs="0000:04:00.1", n_rxq="1", n_txq="1"}
        Port "vhost-user0"
            Interface "vhost-user0"
                type: dpdkvhostuser
        Port "ovsbr0"
            Interface "ovsbr0"
                type: internal
        Port "dpdk0"
            Interface "dpdk0"
                type: dpdk
                options: {dpdk-devargs="0000:04:00.0", n_rxq="1", n_txq="1"}
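
For context, a topology like the one shown in [1] would typically be built with commands along these lines. This is a sketch, not taken from the report; it assumes a userspace (netdev) datapath, and shows only one of the two bridges, using the PCI address from the output above:

```shell
# Sketch only: create ovsbr0 with a userspace datapath,
# one dpdk port and one vhost-user port.
ovs-vsctl add-br ovsbr0 -- set bridge ovsbr0 datapath_type=netdev
ovs-vsctl add-port ovsbr0 dpdk0 -- \
    set Interface dpdk0 type=dpdk options:dpdk-devargs=0000:04:00.0
ovs-vsctl add-port ovsbr0 vhost-user0 -- \
    set Interface vhost-user0 type=dpdkvhostuser
```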

Comment 3 Aaron Conole 2018-01-02 22:27:20 UTC
Can you make sure the abrt coredump hook is installed and collect a core dump of ovs-vswitchd that crashes?

Comment 4 Pei Zhang 2018-01-03 09:22:31 UTC
(In reply to Aaron Conole from comment #3)
> Can you make sure the abrt coredump hook is installed and collect a core
> dump of ovs-vswitchd that crashes?

Hi Aaron,

Please check [1].

[1]http://fileshare.englab.nay.redhat.com/pub/section2/coredump/var/crash/pezhang/bug1528229/


Best Regards,
Pei

Comment 5 Aaron Conole 2018-01-08 19:05:09 UTC
According to the info in that crash you used .2?  Can you confirm the rpm versions used to generate that crashdump?  I can't seem to get a backtrace out of the crashdump.  I get unknown symbol information.

Also, can you use .1 to generate the crashdump?  I don't have access to a rhel7.5 system at the moment.

Comment 6 Pei Zhang 2018-01-10 11:04:57 UTC
(In reply to Aaron Conole from comment #5)
> According to the info in that crash you used .2?  Can you confirm the rpm
> versions used to generate that crashdump?  I can't seem to get a backtrace
> out of the crashdump.  I get unknown symbol information.

Yes, I was testing with .2.

> Also, can you use .1 to generate the crashdump?  I don't have access to a
> rhel7.5 system at the moment.

This is .1 crashdump.

# abrt-cli list
id 6ddeb16173cc4a6241c021b4958c218da37af81a
reason:         ovs-vswitchd killed by SIGSEGV
time:           Wed 10 Jan 2018 06:01:19 AM EST
cmdline:        ovs-vswitchd unix:/var/run/openvswitch/db.sock --pidfile --detach --log-file=/var/log/openvswitch/ovs-vswitchd.log
package:        openvswitch-2.9.0-0.1.20171212git6625e43.el7fdb
uid:            0 (root)
Directory:      /var/spool/abrt/ccpp-2018-01-10-06:01:19-38753
Run 'abrt-cli report /var/spool/abrt/ccpp-2018-01-10-06:01:19-38753' for creating a case in Red Hat Customer Portal

For the core dump file, please refer to:
http://fileshare.englab.nay.redhat.com/pub/section2/coredump/var/crash/pezhang/bug1528229/Jan10/


Best Regards,
Pei

Comment 7 Aaron Conole 2018-01-29 16:28:45 UTC
Created attachment 1387907 [details]
rpm for ovs with backported patch for testing

Attached a possible fixed RPM.  Please test this out.

Comment 8 Pei Zhang 2018-01-31 07:05:08 UTC
(In reply to Aaron Conole from comment #7)
> Created attachment 1387907 [details]
> rpm for ovs with backported patch for testing
> 
> Attached a possible fixed RPM.  Please test this out.

Hi Aaron, 

This issue is gone with this build. All 15 migration runs work as expected, with no errors. (There is still a high MoonGen packet loss issue, but I think that is Bug 1512463.)

Versions:
kernel-3.10.0-841.el7.x86_64
qemu-kvm-rhev-2.10.0-18.el7.x86_64
openvswitch-2.9.0-0.4.20171212git6625e43.bz1528229.el7fdb.x86_64


Thanks,
Pei

Comment 9 Aaron Conole 2018-01-31 13:42:23 UTC
Thanks, Pei.

This is resolved via backport of 7320ecf6898f559cd129f2a8bcbce71cbb25075e

Comment 11 Pei Zhang 2018-02-02 09:19:57 UTC
This bug has been fixed.

==Verification==

Versions:
3.10.0-841.el7.x86_64
qemu-kvm-rhev-2.10.0-18.el7.x86_64
libvirt-3.9.0-9.el7.x86_64
tuned-2.9.0-1.el7.noarch
openvswitch-2.9.0-0.4.20171212git6625e43.el7fdb.x86_64
dpdk-17.11-7.el7.x86_64

Steps:
Same as Description. All 20 migration runs work well.


So this bug has been fixed. Moving status to 'VERIFIED'.

Comment 15 Pei Zhang 2018-02-26 06:21:29 UTC
Update:

Versions:
kernel-3.10.0-855.el7.x86_64
qemu-kvm-rhev-2.10.0-21.el7.x86_64
libvirt-3.9.0-13.el7.x86_64
tuned-2.9.0-1.el7.noarch
dpdk-17.11-7.el7.x86_64
openvswitch-2.9.0-3.el7fdp.x86_64

Steps:
200 migration runs work well. 

Beaker job:
https://beaker.engineering.redhat.com/recipes/4852564#tasks

Note: There is still a packet loss issue; however, it is Bug 1512463 - Guest network can not recover immediately after ping-pong live migration over ovs-dpdk

Comment 17 Pei Zhang 2018-03-06 08:27:30 UTC
==Verification==

Versions:
kernel-3.10.0-855.el7.x86_64
qemu-kvm-rhev-2.10.0-21.el7.x86_64
libvirt-3.9.0-13.el7.x86_64
dpdk-17.11-7.el7.x86_64
openvswitch-2.9.0-1.el7fdb.x86_64
microcode-20180108.tgz

Steps:

All 20 migration runs work well. 


VM acts as client:
===========Stream Rate: 1Mpps===========
No Stream_Rate Downtime Totaltime Ping_Loss trex_Loss
 0       1Mpps      132     17095        17   10550410.0
 1       1Mpps      133     17193        17   10664811.0
 2       1Mpps      122     16439        16    7852092.0
 3       1Mpps      138     15410        17    7702828.0
 4       1Mpps      130     17172        16   14683583.0
 5       1Mpps      133     17279        17   14498931.0
 6       1Mpps      138     17068        18    3734361.0
 7       1Mpps      125     18668        16   10902951.0
 8       1Mpps      125     16105        17   11887164.0
 9       1Mpps      129     17175        16   14594933.0

VM acts as Server:
===========Stream Rate: 1Mpps===========
No Stream_Rate Downtime Totaltime Ping_Loss trex_Loss
 0       1Mpps      146     13878        15    5388814.0
 1       1Mpps      141     14826       115   12939958.0
 2       1Mpps      128     14052        13   10341139.0
 3       1Mpps      147     13794        14   11001215.0
 4       1Mpps      134     14228        13   11302087.0
 5       1Mpps      138     15280        13   13966103.0
 6       1Mpps      134     15004       112    9595515.0
 7       1Mpps      137     13509        14    9695822.0
 8       1Mpps      128     14985       116    7379643.0
 9       1Mpps      130     13758        22    6602482.0


So this bug has been fixed in openvswitch-2.9.0-1.el7fdb.x86_64. Moving status of this bug to 'VERIFIED'.

Comment 20 errata-xmlrpc 2018-03-19 10:22:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0550

