Bug 1727052

Summary: Source QEMU crashes when a parallel migration with parallel connections=2 is run after a failed migration with parallel connections=-1
Product: Red Hat Enterprise Linux Advanced Virtualization
Reporter: Fangge Jin <fjin>
Component: qemu-kvm
Assignee: Juan Quintela <quintela>
qemu-kvm sub component: Live Migration
QA Contact: Li Xiaohui <xiaohli>
Status: CLOSED CURRENTRELEASE
Docs Contact:
Severity: unspecified
Priority: medium
CC: aadam, chayang, jinzhao, juzhang, knoel, quintela, thuth, virt-maint
Version: 8.1
Keywords: Triaged
Target Milestone: rc
Flags: pm-rhel: mirror+
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-03-15 07:37:23 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description             Flags
libvirtd and qemu log   none
qemu backtrace          none

Description Fangge Jin 2019-07-04 10:21:39 UTC
Created attachment 1587335
libvirtd and qemu log

Description of problem:
The source QEMU crashes when a parallel migration with parallel connections=2 is run after a failed migration with parallel connections=-1.

Version-Release number of selected component (if applicable):
libvirt-5.4.0-2.module+el8.1.0+3523+b348b848.x86_64
qemu-kvm-4.0.0-4.module+el8.1.0+3523+b348b848.x86_64
kernel-4.18.0-107.el8.x86_64


How reproducible:
100%

Steps to Reproduce:
1. Start a VM.

2. Run a parallel migration with --parallel-connections -1 (I do not understand what a connection number of -1 actually means); it fails:
# virsh migrate nfs qemu+ssh://intel-5130-16-1.englab.nay.redhat.com/system --live --p2p  --parallel --parallel-connections -1
error: operation failed: migration out job: Unable to write to socket: Connection reset by peer

3. After step 2, run a parallel migration with --parallel-connections 1; the source QEMU crashes:
# virsh migrate nfs qemu+ssh://intel-5130-16-1.englab.nay.redhat.com/system --live --p2p  --parallel --parallel-connections 1
error: operation failed: domain is not running


Actual results:
In step 3, the source QEMU crashes.

Expected results:
In step 3, the migration should succeed.

Additional info:

Comment 1 Fangge Jin 2019-07-04 10:24:55 UTC
Created attachment 1587348
qemu backtrace

Comment 3 Juan Quintela 2019-07-29 11:07:51 UTC
Hi

parallel_connections needs to be >= 1.

Improving the error message.
Migration after the failure should work, though. Looking into that.
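
The ">= 1" rule above can be illustrated with a minimal sketch of an up-front check. This is not libvirt or QEMU code: the function name parse_channel_count and the upper bound of 255 are assumptions made for the example (255 is simply the largest value an unsigned 8-bit field can hold).

```python
# Hypothetical validation helper; names and the upper bound are assumptions,
# not taken from libvirt or QEMU.
MIN_CHANNELS = 1
MAX_CHANNELS = 255  # assumed limit: largest unsigned 8-bit value


def parse_channel_count(text: str) -> int:
    """Parse a user-supplied connection count and reject out-of-range values."""
    count = int(text)  # raises ValueError for non-numeric input
    if not MIN_CHANNELS <= count <= MAX_CHANNELS:
        raise ValueError(
            f"parallel connections must be between {MIN_CHANNELS} "
            f"and {MAX_CHANNELS}, got {count}"
        )
    return count


if __name__ == "__main__":
    print(parse_channel_count("2"))
    try:
        parse_channel_count("-1")
    except ValueError as err:
        print(f"rejected: {err}")
```

Checking the range before the value is narrowed to a smaller integer type means an invalid request fails with a clear message instead of starting a migration.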

Comment 4 Li Xiaohui 2019-08-05 04:55:20 UTC
Hi all,
I hit a problem when setting multifd-channels to -1: checking the multifd-channels value afterwards shows 255. Please fix this together with the crash, thanks.
(qemu) migrate_set_parameter multifd-channels -1
(qemu) info migrate_parameters 
announce-initial: 50 ms
announce-max: 550 ms
announce-rounds: 5
announce-step: 100 ms
compress-level: 1
compress-threads: 8
compress-wait-thread: on
decompress-threads: 2
cpu-throttle-initial: 20
cpu-throttle-increment: 10
max-cpu-throttle: 99
tls-creds: ''
tls-hostname: ''
max-bandwidth: 33554432 bytes/second
downtime-limit: 300 milliseconds
x-checkpoint-delay: 20000
block-incremental: off
multifd-channels: 255
xbzrle-cache-size: 67108864
max-postcopy-bandwidth: 0
 tls-authz: '(null)'
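
The value 255 shown above is what one would expect if -1 is stored in an unsigned 8-bit field without a range check. That storage width is an assumption inferred from the observed output, not something stated in this report. A minimal sketch of the wraparound, using a hypothetical helper to_uint8:

```python
# Illustration only: to_uint8 is a hypothetical helper, and the 8-bit width
# is an assumption inferred from the 255 seen in the monitor output.
def to_uint8(value: int) -> int:
    """Keep only the low 8 bits, as an unsigned 8-bit field would."""
    return value & 0xFF


if __name__ == "__main__":
    for requested in (-1, 1, 2):
        print(f"requested {requested:>2} -> stored {to_uint8(requested)}")
```

Under that assumption, a request for -1 channels would silently become a request for 255 channels rather than being rejected.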

Comment 5 Juan Quintela 2019-11-19 14:38:25 UTC
This is not urgent, we will improve the error handling upstream.  -1 is not a valid value, and it should be detected as that.

Comment 6 Ademar Reis 2020-02-05 23:00:11 UTC
QEMU has been recently split into sub-components and as a one-time operation to avoid breakage of tools, we are setting the QEMU sub-component of this BZ to "General". Please review and change the sub-component if necessary the next time you review this BZ. Thanks

Comment 7 Thomas Huth 2020-09-26 06:51:27 UTC
(In reply to Juan Quintela from comment #5)
> This is not urgent, we will improve the error handling upstream.  -1 is not
> a valid value, and it should be detected as that.

Has anybody already improved the error handling upstream?

Comment 9 Juan Quintela 2021-01-04 20:01:59 UTC
Hi

Not yet, will try to take a look as soon as I have time (not really soon).

Comment 11 RHEL Program Management 2021-03-15 07:37:23 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

Comment 12 Li Xiaohui 2021-04-15 14:45:58 UTC
Retested this BZ via libvirt, following the steps in Comment 0, on RHEL-AV 8.4.0 (kernel-4.18.0-304.el8.x86_64, qemu-kvm-5.2.0-14.module+el8.4.0+10425+ad586fa5.x86_64, libvirt-client-7.0.0-10.module+el8.4.0+10417+37f6984d.x86_64). The migration succeeds and the VM works well, so closing this BZ as CurrentRelease.

Also marking qe_test_coverage-, as this BZ covers a negative test.


BTW, multifd-channels can still be set to -1 on both the QEMU and libvirt sides, and the value is actually 255 after setting it to -1.
Juan, do you plan to add a clear warning to prevent this invalid setting, or will the current behavior stay?

Comment 13 Juan Quintela 2021-11-03 13:16:35 UTC
Hi Xiaohui

We will report an error when channels are set to -1.

Just closing the needinfo.