Bug 1726898
| Summary: | Parallel migration fails with error "Unable to write to socket: Connection reset by peer" now and then | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux Advanced Virtualization | Reporter: | Fangge Jin <fjin> |
| Component: | qemu-kvm | Assignee: | Juan Quintela <quintela> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Li Xiaohui <xiaohli> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 8.1 | CC: | aadam, chayang, ddepaula, hhuang, jinzhao, juzhang, knoel, mtessun, quintela, qzhang, virt-maint, xianwang, xiaohli, yafu |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | qemu-kvm-4.1.0-7.module+el8.1.0+4177+896cb282 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-12-20 08:28:20 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Comment 2
Juan Quintela
2019-07-29 11:10:24 UTC
Dest host: Processors 4, Cores 4, Sockets 2. Src host: Processors 16, Cores 8, Sockets 1.

Hi all, I can reproduce this issue on the host (kernel-4.18.0-125.el8.x86_64 & qemu-4.1.0-rc3).

How reproducible: 1/2

Steps to Reproduce:
1. On the src & dst hosts, set multifd on and set multifd-channels to 5.
2. Do migration from the src to the dst host. On src qemu, get the error:

```
(qemu) migrate -d tcp:192.168.11.22:4444
(qemu) qemu-kvm: multifd_send_pages: channel 2 has already quit!
qemu-kvm: multifd_send_pages: channel 2 has already quit!
qemu-kvm: multifd_send_sync_main: multifd_send_pages fail
qemu-kvm: Unable to write to socket: Connection reset by peer
(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
clear-bitmap-shift: 18
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: off x-colo: off release-ram: off block: off return-path: off pause-before-switchover: off multifd: on dirty-bitmaps: off postcopy-blocktime: off late-block-activate: off x-ignore-shared: off
Migration status: failed (Unable to write to socket: Connection reset by peer)
total time: 0 milliseconds
```

3. When doing the migration again, src & dst qemu core dump (after waiting for some minutes).

This problem seems to be the same as the following bz. Juan, could you confirm this?
Bug 1727052 - Src qemu crashed if doing parallel migration with parallel connection=2 after a failed migration with parallel connection=-1

```
(qemu) migrate -d tcp:192.168.11.22:4444
(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
clear-bitmap-shift: 18
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: off x-colo: off release-ram: off block: off return-path: off pause-before-switchover: off multifd: on dirty-bitmaps: off postcopy-blocktime: off late-block-activate: off x-ignore-shared: off
Migration status: setup
total time: 0 milliseconds
(qemu) ./start.sh: line 22: 25102 Segmentation fault (core dumped) /usr/libexec/qemu-kvm -enable-kvm -nodefaults -machine q35 -m 8G -smp 8 -cpu Haswell-noTSX-IBRS -name debug-threads=on -device pcie-root-port,id=pcie.0-root-port-2,slot=2,chassis=2,addr=0x2,bus=pcie.0 -device pcie-root-port,id=pcie.0-root-port-3,slot=3,chassis=3,addr=0x3,bus=pcie.0 -device pcie-root-port,id=pcie.0-root-port-4,slot=4,chassis=4,addr=0x4,bus=pcie.0 -device pcie-root-port,id=pcie.0-root-port-5,slot=5,chassis=5,addr=0x5,bus=pcie.0 -device virtio-scsi-pci,id=scsi0,bus=pcie.0-root-port-2 -drive file=/mnt/nfs/rhel810-scsi-0729-2.qcow2,format=qcow2,if=none,id=drive-scsi0-0-0-0,media=disk,cache=none,werror=stop,rerror=stop -device scsi-hd,bus=scsi0.0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0 -netdev tap,id=hostnet0,vhost=on,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown,queues=4 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=18:66:da:5e:c2:3c,bus=pcie.0-root-port-3,vectors=10,mq=on -qmp tcp:0:3333,server,nowait -vnc :1 -device VGA -monitor stdio
[root@dell-per430-10 qemu-sh]#
```

Trying to fix this one now that it is easier for you to reproduce it.

Hi, posted a tentative fix upstream.
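The segfault on the second "migrate" attempt is consistent with per-migration multifd state from the failed run being torn down but still referenced on retry. This is only one plausible mechanism; as a toy Python sketch (all names invented for illustration, not QEMU source), the pattern looks like this:

```python
# Toy sketch of a suspected failure mode (an assumption for illustration,
# not QEMU code): multifd send state is freed on the error path of a
# failed migration, then dereferenced by the next "migrate" command.

class MultifdSendState:
    """Stands in for the per-migration multifd send-side state."""
    def __init__(self, channels: int):
        self.channels = channels

_state = None  # global migration state, as in the real daemon

def start_migration(channels: int, will_fail: bool) -> None:
    global _state
    _state = MultifdSendState(channels)
    if will_fail:
        _state = None  # error path tears the state down

def multifd_send_pages() -> int:
    # A second attempt that skips re-initialisation dereferences the
    # torn-down state; in C this is the reported segmentation fault.
    return _state.channels

start_migration(5, will_fail=True)
try:
    multifd_send_pages()
except AttributeError:
    print("stale state: would segfault in C")
```

The Python version raises an exception where the C code crashes outright, which is why the sketch catches `AttributeError` at the point the real qemu dumps core.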
Waiting for the pull. Posted upstream in: https://lists.nongnu.org/archive/html/qemu-devel/2019-08/msg03930.html

Pull request sent upstream. All acks in. Brew id: 23338180. QA_ACK, please.

Hi Juan, I can reproduce this bz on kernel-4.18.0-144.el8.x86_64 & qemu-kvm-4.1.0-10.module+el8.1.0+4234+33aa4f57.x86_64: doing ping-pong multifd migration between the src & dst hosts with multifd on, multifd-channels set to 5 and speed set to 1G, roughly 10 iterations will hit the error.

1. On the dst host:

```
(qemu) migrate -d tcp:192.168.11.21:5555
(qemu) qemu-kvm: multifd_send_pages: channel 2 has already quit!
qemu-kvm: multifd_send_pages: channel 2 has already quit!
qemu-kvm: multifd_send_sync_main: multifd_send_pages fail
qemu-kvm: Unable to write to socket: Connection reset by peer
```

Check migration status, capability and parameters:

```
(qemu) info status
VM status: running
(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
clear-bitmap-shift: 18
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: off x-colo: off release-ram: off return-path: off pause-before-switchover: off multifd: on dirty-bitmaps: off postcopy-blocktime: off late-block-activate: off x-ignore-shared: off
Migration status: failed (Unable to write to socket: Connection reset by peer)
total time: 0 milliseconds
(qemu) info migrate_capabilities
multifd: on
...
(qemu) info migrate_parameters
...
multifd-channels: 5
```

2. On the src host, check migration status, capability and parameters:

```
(qemu) info status
VM status: paused (inmigrate)
(qemu) info migr
migrate migrate_cache_size migrate_capabilities migrate_parameters
(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
clear-bitmap-shift: 18
socket address: [ tcp:0.0.0.0:4444 ]
(qemu) info migrate_capabilities
multifd: on
...
(qemu) info migrate_parameters
...
multifd-channels: 5
```

Hi,

To be sure that we are on the same page:

- You say that the error is on the dst host, but the dst host never sends pages. I am guessing you mean it is the dst host of the previous migration, not the "dst" of the current migration.
- Does it only happen when you do ping-pong migration?
- How much time passes between finishing one migration and starting the following one?
- How idle/loaded is the guest?

I am on vacation until Monday; I will try to reproduce it locally.

Later, Juan.

(In reply to Juan Quintela from comment #15)
> - it only happens when you do ping pong migration?

No. When I repeat the multifd migration from the src to the dst host, just 8 times will reproduce this issue, so it's nothing to do with ping-pong migration.

> - how idle/loaded are the guest?

No stress in the guest; it is just kept running.

As soon as I put gdb onto it, the bug disappears. Trying two things right now:
- Using only traces (no gdb), but then no backtraces.
- Lukas found a way to reproduce it with autotest (25 ping-pong migrations make it work for him). Trying autotest to see if I can get anything from there.

Talked with xiaohli on irc; will post an update when I finish today.

Hi Juan, Hai,

Since this bz doesn't reproduce 100% and is not easy for you to debug, and the errata deadline is coming on Oct. 1, QE advise moving this ON_QA bz to re-fix: change the ITR (Internal Target Release) flag from 8.1.0 to 8.2.0, so we'll have more time to fix it in rhel8.2. Please do this at once; we have only one day left to drop this bz from the errata.
I think I found the cause of why I was unable to reproduce. Could you retest adding to the command line:

--global migration.multifd-channels=10

or whatever the maximum number of channels you are going to use is. Options are:
- Raise the default number of channels and hope that it is big enough (10 or so would work).
- Always use "-incoming defer", change all the values, and then start the proper migration.

The current code calls listen() before we start the command prompt; when we arrive at the command prompt, it is too late for this. I am also asking the libvirt people what they prefer.

(In reply to Juan Quintela from comment #24)
> Could you retest adding to the command line:
> --global migration.multifd-channels=10

ok, I will try later

(In reply to Juan Quintela from comment #24)
> Could you retest adding to the command line:
> --global migration.multifd-channels=10

When I start the src & dst guests with "-global migration.multifd-channels=5", this bz reproduces after the second try. Test steps:
1. Start the src guest with "-global migration.multifd-channels=5";
2. Start the dst guest with "-global migration.multifd-channels=5" & "-incoming tcp:0:4444";
3. Enable the multifd capability on both src & dst qemu;
4. Set speed & downtime on src qemu;
5. Migrate the guest from the src to the dst host with this command in src qemu:

```
(qemu) migrate -d tcp:10.73.73.87:4444
```

> Or whatever is the maximum number of channels that you are going to use.
> Options are:
> - upgrade the default number of channels and hope that it is big enough
> - use always --incoming defer, change all the values, and then start the
> proper migration.

When I test with "-incoming defer" on the dst guest, I tried 15 times and didn't hit this issue. Test steps:
1. Start the src guest;
2. Start the dst guest with "-incoming defer";
3. Enable the multifd capability on both src & dst qemu;
4. Set multifd-channels to 5 on both src & dst qemu;
5. Set speed & downtime on src qemu;
6. Set the incoming port on dst qemu:

```
(qemu) migrate_incoming tcp:10.73.73.87:4444
```

7. Migrate the guest from the src to the dst host with this command in src qemu:

```
(qemu) migrate -d tcp:10.73.73.87:4444
```

Hi,

"-incoming defer" is the supported way; the other is just for testing. Still wondering why it is not working for you.

Thanks, Juan.

Hi,

I think we can close this issue:
1st - the supported way of doing migration is with "-incoming defer";
2nd - I found the problem with the second way (the unofficial one): I should have told you that you also needed:

-global migration.multifd-channels=10 -global migration.x-multifd=on

Both need to be used together. Xiaohli, if you can't reproduce, please close it.

Thanks, Juan.
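Juan's diagnosis can be restated as an ordering problem: with "-incoming tcp:...", listen() runs before the monitor prompt exists, so the accept backlog is sized from whatever multifd-channels is at that moment, and raising the parameter later cannot grow an already-listening socket's backlog; the surplus channel connections then get reset. A toy Python model of that constraint (invented names; the default of 2 channels is an assumption that matches qemu 4.1 but should be verified):

```python
# Toy model of the ordering bug (not QEMU code; names are made up).
# A socket's listen() backlog is fixed when listening starts, so raising
# multifd-channels afterwards cannot grow it.

DEFAULT_CHANNELS = 2  # assumed qemu default for multifd-channels

def all_channels_fit(backlog_at_listen: int, channels_used: int) -> bool:
    """True when every multifd connection fits in the accept backlog."""
    return channels_used <= backlog_at_listen

# "-incoming tcp:0:4444": listen() happens with the default before the
# monitor prompt is available, so multifd-channels=5 set later is too late.
assert not all_channels_fit(DEFAULT_CHANNELS, 5)

# Fix 1: "-global migration.multifd-channels=N" raises it before listen().
assert all_channels_fit(5, 5)

# Fix 2: "-incoming defer" postpones listen() until after the parameters
# are set via the monitor, so the backlog always matches (or exceeds)
# the channel count actually used.
assert all_channels_fit(10, 5)
```

This is why "-incoming defer" passed 15 tries while "-incoming tcp:..." without the "-global" option kept failing intermittently: the defer path only starts listening after the channel count is final.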
Hi Juan, I have tested the following situations in the multifd test. Juan, please fix situations 2, 4 & 5.

Operation steps:
1) Enable multifd capabilities on src & dst hosts
2) Set the same multifd-channels on src & dst hosts
3) Set migrate_incoming in dst hmp
4) Migrate the guest from src to dst

1. Test with steps 1) -> 2) -> 3) -> 4): Pass.

2. Test with steps 1) -> 2) -> 3) -> 2) -> 4): qemu on the src host hangs with the error:

```
(qemu) qemu-kvm: multifd_send_pages: channel 5 has already quit!
qemu-kvm: multifd_send_pages: channel 5 has already quit!
qemu-kvm: multifd_send_sync_main: multifd_send_pages fail
qemu-kvm: failed to save SaveStateEntry with id(name): 2(ram)
```

I know it's a negative test and the migration should fail, but a qemu hang isn't what we expect. As discussed with you, you will update upstream to print one error and give up. Thanks.

3. Test with steps 1) -> 2) -> 3) -> 2) -> 3) -> 4): Pass.

4. Test with steps 1) -> 2) but with different multifd-channels on the src & dst hosts -> 3) -> 4): migration status is active, but in fact the migration doesn't start (only "total time" is increasing in the following data; the other data doesn't change). This is a negative test, but we expect it should print an error and give up, too.

```
(qemu) info migrate
...
Migration status: active
total time: 81408 milliseconds
expected downtime: 300 milliseconds
setup: 18 milliseconds
transferred ram: 1030 kbytes
throughput: 0.00 mbps
remaining ram: 4209376 kbytes
total ram: 4211528 kbytes
duplicate: 154 pages
skipped: 0 pages
normal: 383 pages
normal bytes: 1532 kbytes
dirty sync count: 1
page size: 4 kbytes
multifd bytes: 1029 kbytes
pages-per-second: 0
```

5. Test with steps 1) but with multifd enabled on one side and disabled on the other -> 3) -> 4): same result as test 4, so please handle these issues together.

Of course, I will update our test plan about multifd soon according to Juan's test guidance from Comment 28. Thanks.

Hi Xiaohui,

Answering comment 29: you can't do 2) after 3). Once the migration has started, there is no way to change the number of channels. I will change upstream to detect that we are inside a migration and give an error to the user.

Thanks, Juan.

(In reply to Juan Quintela from comment #31)
> Answering comment 29: you can't do 2) after 3). Once the migration has
> started, there is no way to change the number of channels.

Juan, I mean test 2) -> 3) -> 2), not 2) -> 3) -> 4) -> 2). From the qemu side, I think one could change multifd-channels after setting migrate_incoming but before doing the migration. I don't know why this isn't allowed; does libvirt not support it?

> I will change upstream to detect that we are inside a migration and give one
> error to the user.

Hi,

I agree that this bugzilla can be cancelled. That doesn't work: you can't change the channels once the migration has started. Libvirt doesn't do that, so we are safe there. There will be upstream patches that give an error in the next release.

Later, Juan.

Hi Ariel and Juan,

Confirmed with libvirt mate yafu: libvirt doesn't support "-incoming defer" now. So shall libvirt file a new bz for supporting this function? Thanks.

Forget Comment 37, because I found libvirt uses "-incoming defer" in Comment 0, and their operations are right.
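Tests 4 and 5 in comment 29 stall in the same way: the destination keeps waiting for channel connections that never arrive, so status stays "active" with no progress. A toy Python model of that mechanism (invented names; an assumption about why the hang happens, not QEMU internals):

```python
# Toy model of the silent hang in tests 4 & 5 (not QEMU internals;
# names invented for illustration). The destination only starts pulling
# RAM once it has accepted its configured number of multifd channels;
# a mismatch means that count is never reached, so "info migrate" on
# the source reports "active" while only "total time" keeps growing.

def destination_starts(dst_channels_expected: int,
                       src_channels_opened: int) -> bool:
    """True once every channel connection the dst expects has arrived."""
    return src_channels_opened >= dst_channels_expected

assert destination_starts(5, 5)      # matched channel counts: proceeds
assert not destination_starts(8, 5)  # test 4: mismatched counts: stuck
assert not destination_starts(5, 0)  # test 5: multifd off on one side
                                     # behaves like 0 channels arriving
```

This is also why the requested fix is an error message rather than a retry: neither side can detect the other's configuration from a wait that simply never completes, so qemu has to validate the handshake and give up explicitly.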