Bug 2018404
| Summary: | migration/qmp: no error prompt when destination address is invalid under postcopy-paused | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Li Xiaohui <xiaohli> |
| Component: | qemu-kvm | Assignee: | Peter Xu <peterx> |
| qemu-kvm sub component: | Live Migration | QA Contact: | Li Xiaohui <xiaohli> |
| Status: | CLOSED MIGRATED | Docs Contact: | |
| Severity: | low | ||
| Priority: | low | CC: | chayang, leobras, lersek, mdean, peterx, virt-maint |
| Version: | 9.0 | Keywords: | MigratedToJIRA, Reopened, Triaged |
| Target Milestone: | rc | Flags: | pm-rhel:
mirror+
|
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | All | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2023-09-22 17:32:11 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Correct the actual result after step 8:
{"timestamp": {"seconds": 1635491378, "microseconds": 422929}, "event": "MIGRATION", "data": {"status": "postcopy-paused"}}
{"execute":"query-migrate"}
{"return": {"status": "postcopy-paused", "socket-address": [{"port": "1234", "ipv6": true, "host": "::", "type": "inet"}]}}
{"exec-oob":"migrate-recover", "arguments":{"uri":"tcp:192.168.130.222:1235"}}
{"timestamp": {"seconds": 1635491420, "microseconds": 871656}, "event": "MIGRATION", "data": {"status": "setup"}}
{"error": {"class": "GenericError", "desc": "Failed to bind socket: Cannot assign requested address"}}
Xiaohui, As I mentioned, this issue has nothing to do with postcopy or postcopy recovery. We have this problem for years. It happens as long as e.g. you try to migrate to an unknown address. Please try it out. I suggest the subject changed to "migration/qmp: no error prompt when destination address is invalid". Peter (In reply to Peter Xu from comment #2) > Xiaohui, > > As I mentioned, this issue has nothing to do with postcopy or postcopy > recovery. We have this problem for years. It happens as long as e.g. you > try to migrate to an unknown address. Got it. > > Please try it out. I suggest the subject changed to "migration/qmp: no > error prompt when destination address is invalid". > Yeh, I tried and it turned out just as you said. Updated the summary. > Peter Dropping the "under postcopy-paused" qualification from the title, per Peter's comment 2 / comment 3. What is the simplest way to reproduce the situation described by Peter in comment 2? Note: that comment is from October 2021, so it's old. Here's what I can reproduce (I just grabbed my locally built, upstream QEMU, version 7.2.1): ** Terminal#1, standing in "/tmp": $ qemu-system-x86_64 -monitor stdio -qmp unix:./qmp-sock,server ** Terminal#2, standing in the "python" subdir of the QEMU source tree: $ PYTHONPATH=$PWD python qemu/qmp/qmp_shell.py -v -p /tmp/qmp-sock (QEMU) migrate uri=tcp:192.168.130.123:1234 { "arguments": { "uri": "tcp:192.168.130.123:1234" }, "execute": "migrate" } { "return": {} } (QEMU) query-migrate { "arguments": {}, "execute": "query-migrate" } { "return": { "status": "setup" } } [... a long time passes, "query-migrate" keeps reporting "setup" status; but then ...] (QEMU) query-migrate { "arguments": {}, "execute": "query-migrate" } { "return": { "error-desc": "Failed to connect to '192.168.130.123:1234': Connection timed out", "status": "failed" } } So that indicates *one of two* things: - The issue has been fixed between the reported 6.1 and the 7.2.1 version that I'm trying, for *both* postcopy and precopy. (Peter says in comment#2 that precopy and postcopy behave identically in this regard, so precopy working fine implies postcopy works fine now, too.) - Or else, Peter was wrong in comment 2, precopy has always worked fine in this regard (i.e., that's what my current ad-hoc test shows), but postcopy is different, and may still be broken (I've not tested it). Peter, any comments? Looking at the QEMU source / git history, this problem *precisely* has been fixed by Daniel, in commit d59ce6f34434 ("migration: add reporting of errors for outgoing migration", 2016-05-26).
However, that commit is part of v2.7.0, so it must have been present in the reported version (which is 6.1 based).
Is this perhaps a regression from the following commit (note: it appeared in 6.1):
- ca30f24d12c9 ("migration: Don't do migrate cleanup if during postcopy resume", 2021-07-13)
I think I've got it. I think Peter's comment#2 is wrong, and that the problem still affects us, but only with postcopy. The problem should be relevant for upstream QEMU, too (@ 2a6ae6915454). Here's what happens (note this is only by code reading, not by testing / debugging, so in a way this is just speculation): - On the source (outgoing) side, we are in MIGRATION_STATUS_POSTCOPY_PAUSED (postcopy migration broke and got paused). - Over QMP, a migrate command comes in (on the source side) with resume=true. - Then: qmp_migrate() [migration/migration.c] migrate_prepare() [migration/migration.c] - passes fine as s->state is POSTCOPY_PAUSED - *importantly*, the migrate_init() call is *not* reached - migrate_init() would be the function to atomically flip the status from NONE to SETUP socket_start_outgoing_migration() [migration/socket.c] socket_start_outgoing_migration_internal() [migration/socket.c] - starts the network connection in a separate thread / QIOTask - the task callback ("QIOTaskFunc") is socket_outgoing_migration() socket_outgoing_migration() [migration/socket.c] qio_task_propagate_error() [io/task.c] - we jump to the "out" label, with "err" populated, as the connection has failed migration_channel_connect() [migration/channel.c] migrate_fd_connect() [migration/migration.c] migrate_fd_error() [migration/migration.c] migrate_set_state(&s->state, MIGRATION_STATUS_SETUP, MIGRATION_STATUS_FAILED ); <-------- BUG migrate_set_error() So here's the problem: because we never *enter* the SETUP state when we resume from POSTCOPY_PAUSED, our attempt to set FAILED state in response to the background socket connection error has no effect. Namely, we call the helper function migrate_fd_error(), for *BOTH* (a) changing the state to FAILED *and* (b) recording the error details, but internally to migrate_fd_error(), step (a) is *conditional* on the current state being SETUP -- and that's not the case when we're resuming from POSTCOPY_PAUSED! migrate_set_state() does nothing if the current state is not "old_state". Later on, the user or the mgmt app issues the "query-migrate" QMP command, and this happens: qmp_query_migrate() [migration/migration.c] fill_source_migration_info() [migration/migration.c] There, fill_source_migration_info reaches the POSTCOPY_PAUSED label, and therefore calls populate_time_info(), populate_ram_info(), populate_disk_info(), populate_vfio_info() -- compare comment#0! --, and does not reach the FAILED state where, we would copy-out the error object that we -- otherwise successfully -- stashed in migrate_fd_error() / migrate_set_error(). In retrospect, this is confirmed by the following snippet from comment#0, under "Actual results" (i.e., the bad results): "status": "postcopy-paused" In other words, we're stuck in POSTCOPY_PAUSED, because the migrate_fd_error() function can take us to FAILED state from *only* the SETUP state, not from POSTCOPY_PAUSED. The bug was introduced in commit d3e35b8f6248 ("migration: rebuild channel on source", 2018-05-15) -- first, that was when migrate_prepare() was introduced, for skipping the SETUP status; second, the migrate_fd_connect() function was only partially extended to recognize the POSTCOPY_PAUSED state. Namely, migrate_fd_connect() would continue calling migrate_fd_error() unmodified, even though migrate_fd_error() would not move to FAILED state from the POSTCOPY_PAUSED state. Proposed solution: extend migrate_fd_error() with a "bool resume" parameter. If the new param is false, don't change anything; otherwise, the "old_state" argument for migrate_set_state() must be POSTCOPY_PAUSED (not SETUP). This affects the following code locations: - clearly the migrate_fd_error() function declaration and definition - the "migrate_fd_error" trace event too - the call site we have investigated here, in migrate_fd_connect() - interestingly, we have another call site: in qmp_migrate(), when "local_err" is populated. That is, in case we have an immediate, synchronous error, for example because we cannot parse the remote URL in socket_start_outgoing_migration(). - interestingly, we have *YET ANOTHER* call site: also in qmp_migrate() -- the last branch (unrecognized migration protocol) -- has an *open-coded* migrate_set_state()call , which expects old_state=SETUP. That cannot be true *if* we cleared migrate_prepare() higher up in the function due to resume being "true"! IOW, for the last two entries, I'm predicting that the same error as reported in this BZ is triggerable synchronously too, from the postcopy-paused state, on the source side. Just try to resume in such a way that fails the QMP command synchronously! WARNING: I think the issues here have a larger scope / impact than just incorrect error reporting. There's a bunch of functionality in "migration/migration.c" that depends on MIGRATION_STATUS_SETUP. For example: - the qemu_savevm_wait_unplug() call in migration_thread(); not sure if that is exclusive with resuming - the same, but in bg_migration_thread(); not sure if that is exclusive with resuming - the SETUP state is exposed publicly via migration_in_setup(); and virtio-net, host-vdpa and spice-core call that function. I'm unsure if those call sites should be aware of recovery, too. Hello, Laszlo, Firstly, a huge thanks for helping in any of migration bugs! (especially postcopy ones :) Let me comment inline. (In reply to Laszlo Ersek from comment #11) > I think I've got it. I think Peter's comment#2 is wrong, and that the > problem still affects us, but only with postcopy. The problem should be > relevant for upstream QEMU, too (@ 2a6ae6915454). I had a guess that maybe upstream qemu changed the behavior because iirc when I said that I should have tried something, and according to Xiaohui's comment it seems we did get the same result at that time, I'll need to double check, though. > > Here's what happens (note this is only by code reading, not by testing / > debugging, so in a way this is just speculation): > > - On the source (outgoing) side, we are in > MIGRATION_STATUS_POSTCOPY_PAUSED (postcopy migration broke and got > paused). > > - Over QMP, a migrate command comes in (on the source side) with > resume=true. > > - Then: > > qmp_migrate() [migration/migration.c] > migrate_prepare() [migration/migration.c] > - passes fine as s->state is POSTCOPY_PAUSED > - *importantly*, the migrate_init() call is > *not* reached Yes, PAUSED need to maintain most of the fields. > - migrate_init() would be the function to > atomically flip the status from NONE to > SETUP > socket_start_outgoing_migration() [migration/socket.c] > socket_start_outgoing_migration_internal() [migration/socket.c] > - starts the network connection in a > separate thread / QIOTask > - the task callback ("QIOTaskFunc") is > socket_outgoing_migration() > > socket_outgoing_migration() [migration/socket.c] > qio_task_propagate_error() [io/task.c] > - we jump to the "out" label, with "err" > populated, as the connection has failed > migration_channel_connect() [migration/channel.c] > migrate_fd_connect() [migration/migration.c] > migrate_fd_error() [migration/migration.c] > migrate_set_state(&s->state, > MIGRATION_STATUS_SETUP, > MIGRATION_STATUS_FAILED > ); <-------- BUG This may or may not be a "bug", at least keeping postcopy PAUSED state when a recovery is failed is by design. Here the trick is if postcopy is paused we should hopefully never fail the migration, because failing the migration means split brain (half data on src, half on dst), then the guest will 100% crash. If we keep it PAUSED, as long as it can be recovered again at some point, guest can still receive all the data and keep running (even if during PAUSED it may fault on some page and the vcpu may halt). > migrate_set_error() > > So here's the problem: because we never *enter* the SETUP state when we > resume from POSTCOPY_PAUSED, our attempt to set FAILED state in response > to the background socket connection error has no effect. Namely, we call > the helper function migrate_fd_error(), for *BOTH* (a) changing the > state to FAILED *and* (b) recording the error details, but internally to > migrate_fd_error(), step (a) is *conditional* on the current state being > SETUP -- and that's not the case when we're resuming from > POSTCOPY_PAUSED! migrate_set_state() does nothing if the current state > is not "old_state". > > Later on, the user or the mgmt app issues the "query-migrate" QMP > command, and this happens: > > qmp_query_migrate() [migration/migration.c] > fill_source_migration_info() [migration/migration.c] > > There, fill_source_migration_info reaches the POSTCOPY_PAUSED label, and > therefore calls populate_time_info(), populate_ram_info(), > populate_disk_info(), populate_vfio_info() -- compare comment#0! --, and > does not reach the FAILED state where, we would copy-out the error > object that we -- otherwise successfully -- stashed in > migrate_fd_error() / migrate_set_error(). Exactly. I posted a patch just a few days ago just for this (not for this bug, but for a small rework to cleanup return path thread error handling. Please see: https://lore.kernel.org/all/20230628215002.73546-6-peterx@redhat.com/ I've got four patches; now I need to test them. Ouch, just seeing comment#12... The patches I ended up with meanwhile take a different approach from <https://lore.kernel.org/all/20230628215002.73546-6-peterx@redhat.com/>. My patches make the argument that we should transition from POSTCOPY_PAUSED to FAILED. Peter's patch seems to stick with POSTCOPY_PAUSED, but print the stashed socket error nonetheless. This is aligned with the statement "keeping postcopy PAUSED state when a recovery is failed is by design" from comment#12. What's confusing to me there is that we intentionally exploit migrate_fd_error() not changing the state (because we're issuing those transitions from a state different from SETUP), but the code makes that behavior *look* like it's accidental, rather than by-design! So I think that the migrate_fd_error() calls, and the migrate_set_state() call in qmp_migrate(), would deserve some comments ("these are no-ops if we're resuming"). Either way, some of my cleanup patches might be useful on top of Peter's: - docs/devel/migration.rst: networking error is not fatal for postcopy - migration: factor out "resume_requested" in qmp_migrate() - migration: unexport migrate_fd_error() Finally a request... whenever submitting an upstream patch set that's related to an RHBZ, even remotely so, please always mark that fact on the RHBZ, in a new comment. It feels quite a bit like my analysis here was wasted -- it could have been avoided if this BZ had been marked "in progress" (with the link to the patch series) on June 28th... Also (again) I think bug 2111332 is related / dependent, so re-checking that one after Peter's patches might be useful. (In reply to Laszlo Ersek from comment #14) > Ouch, just seeing comment#12... I tried to comment as fast as possible, it's just that everything happens faster than I thought (all in a few hours :). > > The patches I ended up with meanwhile take a different approach from > <https://lore.kernel.org/all/20230628215002.73546-6-peterx@redhat.com/>. > > My patches make the argument that we should transition from POSTCOPY_PAUSED > to FAILED. > > Peter's patch seems to stick with POSTCOPY_PAUSED, but print the stashed > socket error nonetheless. This is aligned with the statement "keeping > postcopy PAUSED state when a recovery is failed is by design" from > comment#12. > > What's confusing to me there is that we intentionally exploit > migrate_fd_error() not changing the state (because we're issuing those > transitions from a state different from SETUP), but the code makes that > behavior *look* like it's accidental, rather than by-design! So I think that > the migrate_fd_error() calls, and the migrate_set_state() call in > qmp_migrate(), would deserve some comments ("these are no-ops if we're > resuming"). I agree. I should have given wrong impression that the code was by design - it was not! It's about "keeping PAUSED state" should be by-design only. > > Either way, some of my cleanup patches might be useful on top of Peter's: > > - docs/devel/migration.rst: networking error is not fatal for postcopy > - migration: factor out "resume_requested" in qmp_migrate() > - migration: unexport migrate_fd_error() > > Finally a request... whenever submitting an upstream patch set that's > related to an RHBZ, even remotely so, please always mark that fact on the > RHBZ, in a new comment. That patch was _not_ for this bug. I'll need to check whether that applies. Note that IIUC this bug was trying to track incoming side of error reporting, not src side. > It feels quite a bit like my analysis here was > wasted -- it could have been avoided if this BZ had been marked "in > progress" (with the link to the patch series) on June 28th... > > Also (again) I think bug 2111332 is related / dependent, so re-checking that > one after Peter's patches might be useful. Assigning this to me is definitely fine; but again just to mention that patch is not trying to fix this one at all. Thanks. (In reply to Peter Xu from comment #15) > That patch was _not_ for this bug. I'll need to check whether that applies. > Note that IIUC this bug was trying to track incoming side of error > reporting, not src side. Hold on - maybe I was wrong, sorry. After I re-read it it seems to be for the src. Anyway, I'll double check with everything and when I repost I'll attach BZ: link if that applies. As Laszlo rightfully pointed out, I think this patch should fix it: https://lore.kernel.org/all/20230705163502.331007-2-peterx@redhat.com/ I'll take care of it when it lands upstream. Laszlo, please feel free to still post whatever patch you got that you think makes sense even without fixing this bug. Thanks. [qemu-devel] [PATCH 0/2] migration: trivialities Message-Id: <20230706102937.82490-1-lersek> After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened. Reopen this bug as Peter has sent out a patch to fix it according to Comment 17 Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug. This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there. Due to differences in account names between systems, some fields were not replicated. Be sure to add yourself to Jira issue's "Watchers" field to continue receiving updates and add others to the "Need Info From" field to continue requesting information. To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of 2 footprints next to it, and begin with "RHEL-" followed by an integer. You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like: "Bugzilla Bug" = 1234567 In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues. You can also visit https://access.redhat.com/articles/7032570 for general account information. |
Description of problem: When postcopy is paused, fix the network issue, then try to resume postcopy with invalid destination address but get no error prompt in source qmp like below migration failure of normal case without postcopy: **************************** {"execute":"query-migrate"} {"return": {"status": "failed", "error-desc": "Failed to connect to '192.168.130.123:1234': No route to host"}} Version-Release number of selected component (if applicable): hosts info: kernel-5.14.0-1.7.1.el9.x86_64 & qemu-kvm-6.1.0-6.el9.x86_64 How reproducible: 100% Steps to Reproduce: 1.Boot a vm on src host; 2.Boot a vm on dst host with same qemu cmds as src host but appending "-incoming defer"; 3.Load some stress in vm with stressapptest tool; 4.Enable postcopy&events capabilities on src&dst host, and set postcopy speed to 5MB on src host; 5.Set migration incoming on dst host, and start migration from src to dst host; 6.During migration is active, switch to postcopy mode; 7.During postcopy is active, down migration network on dst host; 8.After postcopy is paused, fix network issue to ensure migration recovery. 9.After network recovery, recover migration on dst host: {"exec-oob":"migrate-recover", "arguments":{"uri":"tcp:$dst_host_ip:1235"}} 10.Then resume migration on src host with invalid dst ip: {"execute":"migrate", "arguments":{"uri":"tcp:$wrong_dst_host_ip:1235", "resume":true}} Actual results: After step 10, qmp won't give errors via 'query-migrate': {"execute":"migrate", "arguments":{"uri":"tcp:$wrong_dst_ip:1235", "resume":true}} {"return": {}} {"execute":"query-migrate"} {"return": {"expected-downtime": 300, "status": "postcopy-paused", "setup-time": 7, "total-time": 273813, "ram": {"total": 4312604672, "postcopy-requests": 2804, "dirty-sync-count": 3, "multifd-bytes": 0, "pages-per-second": 1280, "page-size": 4096, "remaining": 1680412672, "mbps": 42.026400000000002, "transferred": 5366104695, "duplicate": 253336, "dirty-pages-rate": 33839, "skipped": 0, "normal-bytes": 5353353216, "normal": 1306971}}} Expected results: After step 10, qmp cmd: 'query-migrate' gets similar error like migration failure under migration without postcopy enabled: {"execute": "migrate","arguments":{"uri": "tcp:192.168.130.123:1234"}} {"timestamp": {"seconds": 1635503696, "microseconds": 53372}, "event": "MIGRATION", "data": {"status": "setup"}} {"return": {}} {"timestamp": {"seconds": 1635503699, "microseconds": 140495}, "event": "MIGRATION", "data": {"status": "failed"}} {"execute":"query-migrate"} {"return": {"status": "failed", "error-desc": "Failed to connect to '192.168.130.123:1234': No route to host"}} Additional info: