1727052 – Src qemu crashed if do parallel migration with parallel connection=2 after a failed migration with parallel connection=-1


Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Enterprise Linux Advanced Virtualization
Classification:	Red Hat
Component:	qemu-kvm
Sub Component:
Version:	8.1
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	unspecified
Target Milestone:	rc
Target Release:	---
Assignee:	Juan Quintela
QA Contact:	Li Xiaohui
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-07-04 10:21 UTC by Fangge Jin
Modified:	2021-11-03 13:16 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-03-15 07:37:23 UTC
Type:	Bug
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
libvirtd and qemu log (223.08 KB, application/x-bzip) 2019-07-04 10:21 UTC, Fangge Jin
qemu backtrace (9.19 KB, text/plain) 2019-07-04 10:24 UTC, Fangge Jin

Description Fangge Jin 2019-07-04 10:21:39 UTC

Created attachment 1587335
libvirtd and qemu log

Description of problem:
Src qemu crashes when a parallel migration with parallel connections=2 is run after a failed migration with parallel connections=-1.

Version-Release number of selected component (if applicable):
libvirt-5.4.0-2.module+el8.1.0+3523+b348b848.x86_64
qemu-kvm-4.0.0-4.module+el8.1.0+3523+b348b848.x86_64
kernel-4.18.0-107.el8.x86_64


How reproducible:
100%

Steps to Reproduce:
1. Start a VM.

2. Do a parallel migration with connection number=-1 (although I don't understand what connection number=-1 actually means). It fails:
# virsh migrate nfs qemu+ssh://intel-5130-16-1.englab.nay.redhat.com/system --live --p2p  --parallel --parallel-connections -1
error: operation failed: migration out job: Unable to write to socket: Connection reset by peer

3. After step 2, do a parallel migration with connection number=1. Src qemu crashes:
# virsh migrate nfs qemu+ssh://intel-5130-16-1.englab.nay.redhat.com/system --live --p2p  --parallel --parallel-connections 1
error: operation failed: domain is not running
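
The two invocations in steps 2 and 3 differ only in the connection count. A minimal Python sketch that builds both command lines is shown below. The helper name `build_migrate_cmd` is hypothetical, and the flags are copied from the commands above. It only constructs the argument lists and does not run virsh.

```python
def build_migrate_cmd(domain, dest_uri, connections):
    """Build the virsh argv from the report for a given connection count.

    Hypothetical helper: it only assembles the argument list and never
    executes anything.
    """
    return [
        "virsh", "migrate", domain, dest_uri,
        "--live", "--p2p", "--parallel",
        "--parallel-connections", str(connections),
    ]


# Destination URI and domain name as given in the report.
DEST = "qemu+ssh://intel-5130-16-1.englab.nay.redhat.com/system"

step2 = build_migrate_cmd("nfs", DEST, -1)  # reported to fail
step3 = build_migrate_cmd("nfs", DEST, 1)   # reported to crash src qemu

print(" ".join(step2))
print(" ".join(step3))
```

Passing each list to something like `subprocess.run` on a test host would replay the sequence, assuming the same domain and destination are available.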


Actual results:
In step 3, src qemu crashed.

Expected results:
In step 3, migration should succeed.

Additional info:

Comment 1 Fangge Jin 2019-07-04 10:24:55 UTC

Created attachment 1587348
qemu backtrace

Comment 3 Juan Quintela 2019-07-29 11:07:51 UTC

Hi

parallel_connections needs to be >= 1.

Improving the error message.
Migration after the failure should work, though. Looking into that.

Comment 4 Li Xiaohui 2019-08-05 04:55:20 UTC

Hi all,
I hit a problem when setting multifd-channels to -1: checking the multifd-channels value afterwards shows 255. Please fix this together with this bug, thanks.
(qemu) migrate_set_parameter multifd-channels -1
(qemu) info migrate_parameters 
announce-initial: 50 ms
announce-max: 550 ms
announce-rounds: 5
announce-step: 100 ms
compress-level: 1
compress-threads: 8
compress-wait-thread: on
decompress-threads: 2
cpu-throttle-initial: 20
cpu-throttle-increment: 10
max-cpu-throttle: 99
tls-creds: ''
tls-hostname: ''
max-bandwidth: 33554432 bytes/second
downtime-limit: 300 milliseconds
x-checkpoint-delay: 20000
block-incremental: off
multifd-channels: 255
xbzrle-cache-size: 67108864
max-postcopy-bandwidth: 0
 tls-authz: '(null)'
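
The 255 is consistent with -1 being stored in an unsigned 8-bit field, where it wraps to 2^8 - 1. That storage width is an assumption here, not something the output above states. A small Python sketch of the wraparound:

```python
import ctypes


def as_uint8(value):
    """Reduce an integer modulo 256, as an unsigned 8-bit field would."""
    return value & 0xFF


# Both forms show the same wraparound for -1.
print(as_uint8(-1))
print(ctypes.c_uint8(-1).value)
```

Under that assumption, any negative input silently becomes a large positive channel count instead of being rejected.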

Comment 5 Juan Quintela 2019-11-19 14:38:25 UTC

This is not urgent, we will improve the error handling upstream.  -1 is not a valid value, and it should be detected as that.
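
One way to detect the invalid value is to range-check the requested count before it is narrowed to the stored type. The sketch below is illustrative Python, not QEMU code. `validate_channels` is a hypothetical name, the lower bound of 1 comes from comment 3, and the upper bound of 255 is an assumption based on the value shown in comment 4.

```python
MIN_CHANNELS = 1    # lower bound stated in comment 3
MAX_CHANNELS = 255  # assumed upper bound, based on the value seen in comment 4


def validate_channels(requested):
    """Return the channel count unchanged, or raise if it is out of range.

    Hypothetical helper: the check runs on the original signed value,
    before any conversion to a narrower unsigned type.
    """
    if isinstance(requested, bool) or not isinstance(requested, int):
        raise TypeError("channel count must be an integer")
    if not MIN_CHANNELS <= requested <= MAX_CHANNELS:
        raise ValueError(
            f"channel count {requested} is out of range "
            f"[{MIN_CHANNELS}, {MAX_CHANNELS}]"
        )
    return requested
```

With a check of this kind, a request for -1 fails with a clear error instead of being accepted as 255.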

Comment 6 Ademar Reis 2020-02-05 23:00:11 UTC

QEMU has been recently split into sub-components and as a one-time operation to avoid breakage of tools, we are setting the QEMU sub-component of this BZ to "General". Please review and change the sub-component if necessary the next time you review this BZ. Thanks

Comment 7 Thomas Huth 2020-09-26 06:51:27 UTC

(In reply to Juan Quintela from comment #5)
> This is not urgent, we will improve the error handling upstream.  -1 is not
> a valid value, and it should be detected as that.

Has anybody already improved the error handling upstream?

Comment 9 Juan Quintela 2021-01-04 20:01:59 UTC

Hi

Not yet, will try to take a look as soon as I have time (not really soon).

Comment 11 RHEL Program Management 2021-03-15 07:37:23 UTC

After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

Comment 12 Li Xiaohui 2021-04-15 14:45:58 UTC

Tested this bz again via libvirt, following Comment 0, on rhelav 8.4.0 (kernel-4.18.0-304.el8.x86_64 & qemu-kvm-5.2.0-14.module+el8.4.0+10425+ad586fa5.x86_64 & libvirt-client-7.0.0-10.module+el8.4.0+10417+37f6984d.x86_64). Migration succeeds and the VM works well, so closing this bz as CurrentRelease.

Also marking qe_test_coverage-, as this bz is a negative test.


BTW, multifd-channels can still be set to -1 on both the qemu and libvirt side, and the value is actually 255 after setting it to -1.
Juan, do you plan to add a proper warning to prevent this wrong setting, or keep the current behavior?

Comment 13 Juan Quintela 2021-11-03 13:16:35 UTC

Hi xiaohui

We will give an error when channels are set to -1.

Just closing the needinfo.
