Bug 2137740
Summary: | Multifd migration fails under a weak network/socket ordering race
---|---
Product: | Red Hat Enterprise Linux 8
Component: | qemu-kvm
qemu-kvm sub component: | Live Migration
Status: | CLOSED ERRATA
Severity: | high
Priority: | high
Version: | 8.8
Keywords: | Triaged
Target Milestone: | rc
Target Release: | ---
Hardware: | Unspecified
OS: | Unspecified
Reporter: | Li Xiaohui <xiaohli>
Assignee: | Peter Xu <peterx>
QA Contact: | Li Xiaohui <xiaohli>
CC: | berrange, chayang, coli, fjin, iholder, jinzhao, juzhang, leobras, mdean, peterx, quintela, sgott, virt-maint
Fixed In Version: | qemu-kvm-6.2.0-31.module+el8.8.0+18188+901de023
Doc Type: | If docs needed, set a value
Clones: | 2169732 (view as bug list)
Bug Blocks: | 2169732
Last Closed: | 2023-05-16 08:16:35 UTC
Description
Li Xiaohui
2022-10-26 03:34:04 UTC
Tried to reproduce this bug on qemu-kvm-6.2.0-11.module+el8.6.0+16538+01ea313d.6.x86_64:

1. When the guest is running on the source host, enable the multifd capability on source and destination;
2. Before migration, create network packet loss on the source host:
   # tc qdisc add dev switch root netem loss 40%
3. Then start migrating the guest from the source to the destination host.

After step 3, we get errors on the source and destination hmp like below:

(1) src hmp:
(qemu) info
2022-10-26T11:34:41.372470Z qemu-kvm: multifd_send_pages: channel 0 has already quit!

(2) dst hmp:
(qemu) 2022-10-26T11:34:41.362459Z qemu-kvm: failed to receive packet via multifd channel 1: multifd: received packet magic 5145564d expected 11223344

The source qmp and hmp also hang when executing the two qmp commands below, and we can't log in to the guest through the console:

{"execute":"qmp_capabilities", "arguments":{"enable":["oob"]}}
{"return": {}}
{"execute":"query-migrate"}
{"execute": "query-status"}

Regarding the above test result, do I reproduce the bug mentioned by Itamar?

Ok, from previous debugging experience, it looks like the dst qemu is reading some packet expecting it to be the header, but due to packet loss it is not, and that breaks the migration (as expected?). If there is some packet loss, not all data is getting into the destination.

I mean, of course some packet can be lost in TCP transmission, but it should be re-sent by the TCP stack.

Please help me understand:
- What is getting tested here, exactly?
- Is migration code supposed to re-send data if there is any packet lost?
- What is the command used to start migration?

(In reply to Leonardo Bras from comment #2)
> Ok, by previous experience debugging, it looks like the dst qemu is reading
> some packet expecting it to be the header, but due to packet loss, it is
> not, and it breaks the migration (as expected?). If there is some packet
> loss, not all data is getting into the destination.
>
> I mean, of course some packet can be lost in TCP transmission, but it should
> be re-sent by the TCP stack.
>
> Please help me understand:
> - What is getting tested here, exactly?
> - Is migration code supposed to re-send data if there is any packet lost?

This is an ordering race rather than an actual loss of data. There are multiple socket connections involved: the 'main' socket and then multiple sockets for multifd. The existing code makes the incorrect assumption that the 'main' socket will connect first (and send its 5145564d header), followed by the 'multifd' sockets (with their 11223344 header). The artificial packet loss delays the opening of the main socket, so a multifd socket connects first and the main connection comes along afterwards. The error above is the multifd code expecting to receive a multifd header but actually receiving the main socket header. Peter has been looking at making that more robust.

> - What is the command used to start migration?

QE bot (pre verify): Set 'Verified:Tested,SanityOnly' as gating/tier1 test pass.

Verify this bug on kernel-4.18.0-472.el8.x86_64 && qemu-kvm-6.2.0-31.module+el8.8.0+18188+901de023.x86_64.

Background:
1. When the guest is running on the source host, enable the multifd capability on source and destination;
2. Before migration, create network packet loss on the source host:
   # tc qdisc add dev switch root netem loss 40%
3. Then start migrating the guest from the source to the destination host.

After step 3, migration is active, but the progress of migration is very slow and migration can't converge.
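(The report doesn't show the exact commands behind steps 1 and 3. As a rough reference, a typical QMP sequence for enabling multifd and starting such a migration would look like the lines below; the listen address, port 4444, destination address 192.0.2.10 and the channel count of 4 are placeholders, not values taken from this bug.)

On the destination (started with -incoming defer):
{"execute": "migrate-set-capabilities", "arguments": {"capabilities": [{"capability": "multifd", "state": true}]}}
{"execute": "migrate-set-parameters", "arguments": {"multifd-channels": 4}}
{"execute": "migrate-incoming", "arguments": {"uri": "tcp:0.0.0.0:4444"}}

On the source:
{"execute": "migrate-set-capabilities", "arguments": {"capabilities": [{"capability": "multifd", "state": true}]}}
{"execute": "migrate-set-parameters", "arguments": {"multifd-channels": 4}}
{"execute": "migrate", "arguments": {"uri": "tcp:192.0.2.10:4444"}}

As in step 1, the multifd capability (and a matching channel count) needs to be set on both sides before the migration connections are made.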
But anyway, the guest still works well, there are no errors from qemu, and qemu doesn't hang. Then test the scenarios below:

1) Cancel the migration while the multifd migration can't converge due to the network packet loss.
Result: migration is cancelled successfully and the VM works well on the src host. Below is some info after cancelling the migration.

src hmp:
(qemu) migrate_cancel
(qemu) 2023-02-23T06:37:36.993227Z qemu-kvm: multifd_send_pages: channel 1 has already quit!
2023-02-23T06:37:36.993316Z qemu-kvm: multifd_send_sync_main: multifd_send_pages fail
2023-02-23T06:37:36.993326Z qemu-kvm: failed to save SaveStateEntry with id(name): 1(ram): -1
2023-02-23T06:37:36.995036Z qemu-kvm: Unable to write to socket: Broken pipe

dst hmp:
(qemu) 2023-02-23T06:37:36.757048Z qemu-kvm: check_section_footer: Read section footer failed: -5
2023-02-23T06:37:36.758076Z qemu-kvm: load of migration failed: Invalid argument

2) Recover from the network packet loss (delete the netem rule), then let the multifd migration continue:
# tc qdisc delete dev switch root netem loss 40%
Result: migration succeeds and the VM works well after migration.

Hi Peter, I tested this bug in Comment 8, do you think the above results are expected with our fix?

(In reply to Li Xiaohui from comment #9)
> Hi Peter, I tested this bug in Comment 8, do you think the above results are
> expected as our fix?

Yes, I think so. Thanks!

Thank you. So mark this bug verified per Comment 8. Will add one case in Polarion later.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: virt:rhel and virt-devel:rhel security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:2757
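A side note on the original failure message ("received packet magic 5145564d expected 11223344"): as explained above, 11223344 is the magic the multifd code expects at the start of every multifd packet, while 5145564d is the header of the main migration stream. Its four bytes are simply the ASCII string "QEVM", which can be checked with a quick one-liner (bash):

# printf '\x51\x45\x56\x4d'; echo
QEVM

So the multifd channel on the destination was handed the beginning of the main migration stream, which is exactly the connection-ordering race this bug tracks.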