Bug 2018404

Summary:	migration/qmp: no error prompt when destination address is invalid under postcopy-paused
Product:	Red Hat Enterprise Linux 9	Reporter:	Li Xiaohui <xiaohli>
Component:	qemu-kvm	Assignee:	Peter Xu <peterx>
qemu-kvm sub component:	Live Migration	QA Contact:	Li Xiaohui <xiaohli>
Status:	CLOSED MIGRATED	Docs Contact:
Severity:	low
Priority:	low	CC:	chayang, leobras, lersek, mdean, peterx, virt-maint
Version:	9.0	Keywords:	MigratedToJIRA, Reopened, Triaged
Target Milestone:	rc	Flags:	pm-rhel: mirror+
Target Release:	---
Hardware:	Unspecified
OS:	All
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2023-09-22 17:32:11 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Li Xiaohui 2021-10-29 07:36:48 UTC

Description of problem:
When postcopy is paused, fix the network issue, then try to resume postcopy with invalid destination address but get no error prompt in source qmp like below migration failure of normal case without postcopy:
****************************
{"execute":"query-migrate"}
{"return": {"status": "failed", "error-desc": "Failed to connect to '192.168.130.123:1234': No route to host"}}


Version-Release number of selected component (if applicable):
hosts info:
kernel-5.14.0-1.7.1.el9.x86_64 & qemu-kvm-6.1.0-6.el9.x86_64


How reproducible:
100%


Steps to Reproduce:
1.Boot a vm on src host;
2.Boot a vm on dst host with same qemu cmds as src host but appending "-incoming defer";
3.Load some stress in vm with stressapptest tool;
4.Enable postcopy&events capabilities on src&dst host, and set postcopy speed to 5MB on src host;
5.Set migration incoming on dst host, and start migration from src to dst host;
6.During migration is active, switch to postcopy mode;
7.During postcopy is active, down migration network on dst host;
8.After postcopy is paused, fix network issue to ensure migration recovery.
9.After network recovery, recover migration on dst host:
{"exec-oob":"migrate-recover", "arguments":{"uri":"tcp:$dst_host_ip:1235"}}
10.Then resume migration on src host with invalid dst ip:
{"execute":"migrate", "arguments":{"uri":"tcp:$wrong_dst_host_ip:1235", "resume":true}}


Actual results:
After step 10, qmp won't give errors via 'query-migrate':
{"execute":"migrate", "arguments":{"uri":"tcp:$wrong_dst_ip:1235", "resume":true}}
{"return": {}}
{"execute":"query-migrate"}
{"return": {"expected-downtime": 300, "status": "postcopy-paused", "setup-time": 7, "total-time": 273813, "ram": {"total": 4312604672, "postcopy-requests": 2804, "dirty-sync-count": 3, "multifd-bytes": 0, "pages-per-second": 1280, "page-size": 4096, "remaining": 1680412672, "mbps": 42.026400000000002, "transferred": 5366104695, "duplicate": 253336, "dirty-pages-rate": 33839, "skipped": 0, "normal-bytes": 5353353216, "normal": 1306971}}}


Expected results:
After step 10, qmp cmd: 'query-migrate' gets similar error like migration failure under migration without postcopy enabled:
{"execute": "migrate","arguments":{"uri": "tcp:192.168.130.123:1234"}}
{"timestamp": {"seconds": 1635503696, "microseconds": 53372}, "event": "MIGRATION", "data": {"status": "setup"}}
{"return": {}}
{"timestamp": {"seconds": 1635503699, "microseconds": 140495}, "event": "MIGRATION", "data": {"status": "failed"}}
{"execute":"query-migrate"}
{"return": {"status": "failed", "error-desc": "Failed to connect to '192.168.130.123:1234': No route to host"}}


Additional info:

Comment 1 Li Xiaohui 2021-10-29 07:40:39 UTC

Correct the actual result after step 8:
{"timestamp": {"seconds": 1635491378, "microseconds": 422929}, "event": "MIGRATION", "data": {"status": "postcopy-paused"}}
{"execute":"query-migrate"}
{"return": {"status": "postcopy-paused", "socket-address": [{"port": "1234", "ipv6": true, "host": "::", "type": "inet"}]}}
{"exec-oob":"migrate-recover", "arguments":{"uri":"tcp:192.168.130.222:1235"}}
{"timestamp": {"seconds": 1635491420, "microseconds": 871656}, "event": "MIGRATION", "data": {"status": "setup"}}
{"error": {"class": "GenericError", "desc": "Failed to bind socket: Cannot assign requested address"}}

Comment 2 Peter Xu 2021-10-29 08:14:05 UTC

Xiaohui,

As I mentioned, this issue has nothing to do with postcopy or postcopy recovery.  We have this problem for years.  It happens as long as e.g. you try to migrate to an unknown address.

Please try it out.  I suggest the subject changed to "migration/qmp: no error prompt when destination address is invalid".

Peter

Comment 3 Li Xiaohui 2021-10-29 10:13:04 UTC

(In reply to Peter Xu from comment #2)
> Xiaohui,
> 
> As I mentioned, this issue has nothing to do with postcopy or postcopy
> recovery.  We have this problem for years.  It happens as long as e.g. you
> try to migrate to an unknown address.

Got it.

> 
> Please try it out.  I suggest the subject changed to "migration/qmp: no
> error prompt when destination address is invalid".
> 

Yeh, I tried and it turned out just as you said. Updated the summary.

> Peter

Comment 4 Li Xiaohui 2021-10-29 12:11:38 UTC

Update Comment 0, let's track bug according to Comment 0

Comment 7 Laszlo Ersek 2023-07-05 08:48:55 UTC

Dropping the "under postcopy-paused" qualification from the title, per Peter's comment 2 / comment 3.

Comment 8 Laszlo Ersek 2023-07-05 10:16:40 UTC

What is the simplest way to reproduce the situation described by Peter
in comment 2?

Note: that comment is from October 2021, so it's old.

Here's what I can reproduce (I just grabbed my locally built, upstream
QEMU, version 7.2.1):

** Terminal#1, standing in "/tmp":

  $ qemu-system-x86_64 -monitor stdio -qmp unix:./qmp-sock,server

** Terminal#2, standing in the "python" subdir of the QEMU source tree:

  $ PYTHONPATH=$PWD python qemu/qmp/qmp_shell.py -v -p /tmp/qmp-sock

  (QEMU) migrate uri=tcp:192.168.130.123:1234
  {
      "arguments": {
          "uri": "tcp:192.168.130.123:1234"
      },
      "execute": "migrate"
  }
  {
      "return": {}
  }

  (QEMU) query-migrate
  {
      "arguments": {},
      "execute": "query-migrate"
  }
  {
      "return": {
          "status": "setup"
      }
  }

  [... a long time passes, "query-migrate" keeps reporting "setup"
   status; but then ...]

  (QEMU) query-migrate
  {
      "arguments": {},
      "execute": "query-migrate"
  }
  {
      "return": {
          "error-desc":
            "Failed to connect to '192.168.130.123:1234': Connection timed out",
          "status": "failed"
      }
  }

So that indicates *one of two* things:

- The issue has been fixed between the reported 6.1 and the 7.2.1
  version that I'm trying, for *both* postcopy and precopy. (Peter says
  in comment#2 that precopy and postcopy behave identically in this
  regard, so precopy working fine implies postcopy works fine now, too.)

- Or else, Peter was wrong in comment 2, precopy has always worked fine
  in this regard (i.e., that's what my current ad-hoc test shows), but
  postcopy is different, and may still be broken (I've not tested it).

Peter, any comments?

Comment 9 Laszlo Ersek 2023-07-05 10:22:07 UTC

Looking at the QEMU source / git history, this problem *precisely* has been fixed by Daniel, in commit d59ce6f34434 ("migration: add reporting of errors for outgoing migration", 2016-05-26).

However, that commit is part of v2.7.0, so it must have been present in the reported version (which is 6.1 based).

Comment 10 Laszlo Ersek 2023-07-05 10:56:02 UTC

Is this perhaps a regression from the following commit (note: it appeared in 6.1):
- ca30f24d12c9 ("migration: Don't do migrate cleanup if during postcopy resume", 2021-07-13)

Comment 11 Laszlo Ersek 2023-07-05 12:03:55 UTC

I think I've got it. I think Peter's comment#2 is wrong, and that the
problem still affects us, but only with postcopy. The problem should be
relevant for upstream QEMU, too (@ 2a6ae6915454).

Here's what happens (note this is only by code reading, not by testing /
debugging, so in a way this is just speculation):

- On the source (outgoing) side, we are in
  MIGRATION_STATUS_POSTCOPY_PAUSED (postcopy migration broke and got
  paused).

- Over QMP, a migrate command comes in (on the source side) with
  resume=true.

- Then:

qmp_migrate()                                     [migration/migration.c]
  migrate_prepare()                               [migration/migration.c]
    - passes fine as s->state is POSTCOPY_PAUSED
    - *importantly*, the migrate_init() call is
      *not* reached
    - migrate_init() would be the function to
      atomically flip the status from NONE to
      SETUP
  socket_start_outgoing_migration()               [migration/socket.c]
    socket_start_outgoing_migration_internal()    [migration/socket.c]
      - starts the network connection in a
        separate thread / QIOTask
      - the task callback ("QIOTaskFunc") is
        socket_outgoing_migration()

socket_outgoing_migration()                       [migration/socket.c]
  qio_task_propagate_error()                      [io/task.c]
  - we jump to the "out" label, with "err"
    populated, as the connection has failed
  migration_channel_connect()                     [migration/channel.c]
    migrate_fd_connect()                          [migration/migration.c]
      migrate_fd_error()                          [migration/migration.c]
        migrate_set_state(&s->state,
                          MIGRATION_STATUS_SETUP,
                          MIGRATION_STATUS_FAILED
                         );  <-------- BUG
        migrate_set_error()

So here's the problem: because we never *enter* the SETUP state when we
resume from POSTCOPY_PAUSED, our attempt to set FAILED state in response
to the background socket connection error has no effect. Namely, we call
the helper function migrate_fd_error(), for *BOTH* (a) changing the
state to FAILED *and* (b) recording the error details, but internally to
migrate_fd_error(), step (a) is *conditional* on the current state being
SETUP -- and that's not the case when we're resuming from
POSTCOPY_PAUSED! migrate_set_state() does nothing if the current state
is not "old_state".

Later on, the user or the mgmt app issues the "query-migrate" QMP
command, and this happens:

  qmp_query_migrate()                             [migration/migration.c]
    fill_source_migration_info()                  [migration/migration.c]

There, fill_source_migration_info reaches the POSTCOPY_PAUSED label, and
therefore calls populate_time_info(), populate_ram_info(),
populate_disk_info(), populate_vfio_info() -- compare comment#0! --, and
does not reach the FAILED state where, we would copy-out the error
object that we -- otherwise successfully -- stashed in
migrate_fd_error() / migrate_set_error().

In retrospect, this is confirmed by the following snippet from
comment#0, under "Actual results" (i.e., the bad results):

  "status": "postcopy-paused"

In other words, we're stuck in POSTCOPY_PAUSED, because the
migrate_fd_error() function can take us to FAILED state from *only* the
SETUP state, not from POSTCOPY_PAUSED.

The bug was introduced in commit d3e35b8f6248 ("migration: rebuild
channel on source", 2018-05-15) -- first, that was when
migrate_prepare() was introduced, for skipping the SETUP status; second,
the migrate_fd_connect() function was only partially extended to
recognize the POSTCOPY_PAUSED state. Namely, migrate_fd_connect() would
continue calling migrate_fd_error() unmodified, even though
migrate_fd_error() would not move to FAILED state from the
POSTCOPY_PAUSED state.

Proposed solution: extend migrate_fd_error() with a "bool resume"
parameter. If the new param is false, don't change anything; otherwise,
the "old_state" argument for migrate_set_state() must be POSTCOPY_PAUSED
(not SETUP). This affects the following code locations:

- clearly the migrate_fd_error() function declaration and definition

- the "migrate_fd_error" trace event too

- the call site we have investigated here, in migrate_fd_connect()

- interestingly, we have another call site: in qmp_migrate(), when
  "local_err" is populated. That is, in case we have an immediate,
  synchronous error, for example because we cannot parse the remote URL
  in socket_start_outgoing_migration().

- interestingly, we have *YET ANOTHER* call site: also in qmp_migrate()
  -- the last branch (unrecognized migration protocol) -- has an
  *open-coded* migrate_set_state()call , which expects old_state=SETUP.
  That cannot be true *if* we cleared migrate_prepare() higher up in the
  function due to resume being "true"!

IOW, for the last two entries, I'm predicting that the same error as
reported in this BZ is triggerable synchronously too, from the
postcopy-paused state, on the source side. Just try to resume in such a
way that fails the QMP command synchronously!

WARNING: I think the issues here have a larger scope / impact than just
incorrect error reporting. There's a bunch of functionality in
"migration/migration.c" that depends on MIGRATION_STATUS_SETUP. For example:

- the qemu_savevm_wait_unplug() call in migration_thread(); not sure if
  that is exclusive with resuming

- the same, but in bg_migration_thread(); not sure if that is exclusive
  with resuming

- the SETUP state is exposed publicly via migration_in_setup(); and
  virtio-net, host-vdpa and spice-core call that function. I'm unsure if
  those call sites should be aware of recovery, too.

Comment 12 Peter Xu 2023-07-05 13:34:34 UTC

Hello, Laszlo,

Firstly, a huge thanks for helping in any of migration bugs! (especially postcopy ones :)

Let me comment inline.

(In reply to Laszlo Ersek from comment #11)
> I think I've got it. I think Peter's comment#2 is wrong, and that the
> problem still affects us, but only with postcopy. The problem should be
> relevant for upstream QEMU, too (@ 2a6ae6915454).

I had a guess that maybe upstream qemu changed the behavior because iirc when I said that I should have tried something, and according to Xiaohui's comment it seems we did get the same result at that time, I'll need to double check, though.

> 
> Here's what happens (note this is only by code reading, not by testing /
> debugging, so in a way this is just speculation):
> 
> - On the source (outgoing) side, we are in
>   MIGRATION_STATUS_POSTCOPY_PAUSED (postcopy migration broke and got
>   paused).
> 
> - Over QMP, a migrate command comes in (on the source side) with
>   resume=true.
> 
> - Then:
> 
> qmp_migrate()                                     [migration/migration.c]
>   migrate_prepare()                               [migration/migration.c]
>     - passes fine as s->state is POSTCOPY_PAUSED
>     - *importantly*, the migrate_init() call is
>       *not* reached

Yes, PAUSED need to maintain most of the fields.

>     - migrate_init() would be the function to
>       atomically flip the status from NONE to
>       SETUP
>   socket_start_outgoing_migration()               [migration/socket.c]
>     socket_start_outgoing_migration_internal()    [migration/socket.c]
>       - starts the network connection in a
>         separate thread / QIOTask
>       - the task callback ("QIOTaskFunc") is
>         socket_outgoing_migration()
> 
> socket_outgoing_migration()                       [migration/socket.c]
>   qio_task_propagate_error()                      [io/task.c]
>   - we jump to the "out" label, with "err"
>     populated, as the connection has failed
>   migration_channel_connect()                     [migration/channel.c]
>     migrate_fd_connect()                          [migration/migration.c]
>       migrate_fd_error()                          [migration/migration.c]
>         migrate_set_state(&s->state,
>                           MIGRATION_STATUS_SETUP,
>                           MIGRATION_STATUS_FAILED
>                          );  <-------- BUG

This may or may not be a "bug", at least keeping postcopy PAUSED state when a recovery is failed is by design.  Here the trick is if postcopy is paused we should hopefully never fail the migration, because failing the migration means split brain (half data on src, half on dst), then the guest will 100% crash.

If we keep it PAUSED, as long as it can be recovered again at some point, guest can still receive all the data and keep running (even if during PAUSED it may fault on some page and the vcpu may halt).

>         migrate_set_error()
> 
> So here's the problem: because we never *enter* the SETUP state when we
> resume from POSTCOPY_PAUSED, our attempt to set FAILED state in response
> to the background socket connection error has no effect. Namely, we call
> the helper function migrate_fd_error(), for *BOTH* (a) changing the
> state to FAILED *and* (b) recording the error details, but internally to
> migrate_fd_error(), step (a) is *conditional* on the current state being
> SETUP -- and that's not the case when we're resuming from
> POSTCOPY_PAUSED! migrate_set_state() does nothing if the current state
> is not "old_state".
> 
> Later on, the user or the mgmt app issues the "query-migrate" QMP
> command, and this happens:
> 
>   qmp_query_migrate()                             [migration/migration.c]
>     fill_source_migration_info()                  [migration/migration.c]
> 
> There, fill_source_migration_info reaches the POSTCOPY_PAUSED label, and
> therefore calls populate_time_info(), populate_ram_info(),
> populate_disk_info(), populate_vfio_info() -- compare comment#0! --, and
> does not reach the FAILED state where, we would copy-out the error
> object that we -- otherwise successfully -- stashed in
> migrate_fd_error() / migrate_set_error().

Exactly.  I posted a patch just a few days ago just for this (not for this bug, but for a small rework to cleanup return path thread error handling.  Please see:

https://lore.kernel.org/all/20230628215002.73546-6-peterx@redhat.com/

Comment 13 Laszlo Ersek 2023-07-05 13:42:48 UTC

I've got four patches; now I need to test them.

Comment 14 Laszlo Ersek 2023-07-05 13:58:15 UTC

Ouch, just seeing comment#12...

The patches I ended up with meanwhile take a different approach from <https://lore.kernel.org/all/20230628215002.73546-6-peterx@redhat.com/>.

My patches make the argument that we should transition from POSTCOPY_PAUSED to FAILED.

Peter's patch seems to stick with POSTCOPY_PAUSED, but print the stashed socket error nonetheless. This is aligned with the statement "keeping postcopy PAUSED state when a recovery is failed is by design" from comment#12.

What's confusing to me there is that we intentionally exploit migrate_fd_error() not changing the state (because we're issuing those transitions from a state different from SETUP), but the code makes that behavior *look* like it's accidental, rather than by-design! So I think that the migrate_fd_error() calls, and the migrate_set_state() call in qmp_migrate(), would deserve some comments ("these are no-ops if we're resuming").

Either way, some of my cleanup patches might be useful on top of Peter's:

- docs/devel/migration.rst: networking error is not fatal for postcopy
- migration: factor out "resume_requested" in qmp_migrate()
- migration: unexport migrate_fd_error()

Finally a request... whenever submitting an upstream patch set that's related to an RHBZ, even remotely so, please always mark that fact on the RHBZ, in a new comment. It feels quite a bit like my analysis here was wasted -- it could have been avoided if this BZ had been marked "in progress" (with the link to the patch series) on June 28th...

Also (again) I think bug 2111332 is related / dependent, so re-checking that one after Peter's patches might be useful.

Comment 15 Peter Xu 2023-07-05 14:28:30 UTC

(In reply to Laszlo Ersek from comment #14)
> Ouch, just seeing comment#12...

I tried to comment as fast as possible, it's just that everything happens faster than I thought (all in a few hours :).

> 
> The patches I ended up with meanwhile take a different approach from
> <https://lore.kernel.org/all/20230628215002.73546-6-peterx@redhat.com/>.
> 
> My patches make the argument that we should transition from POSTCOPY_PAUSED
> to FAILED.
> 
> Peter's patch seems to stick with POSTCOPY_PAUSED, but print the stashed
> socket error nonetheless. This is aligned with the statement "keeping
> postcopy PAUSED state when a recovery is failed is by design" from
> comment#12.
> 
> What's confusing to me there is that we intentionally exploit
> migrate_fd_error() not changing the state (because we're issuing those
> transitions from a state different from SETUP), but the code makes that
> behavior *look* like it's accidental, rather than by-design! So I think that
> the migrate_fd_error() calls, and the migrate_set_state() call in
> qmp_migrate(), would deserve some comments ("these are no-ops if we're
> resuming").

I agree.  I should have given wrong impression that the code was by design - it was not!  It's about "keeping PAUSED state" should be by-design only.

> 
> Either way, some of my cleanup patches might be useful on top of Peter's:
> 
> - docs/devel/migration.rst: networking error is not fatal for postcopy
> - migration: factor out "resume_requested" in qmp_migrate()
> - migration: unexport migrate_fd_error()
> 
> Finally a request... whenever submitting an upstream patch set that's
> related to an RHBZ, even remotely so, please always mark that fact on the
> RHBZ, in a new comment.

That patch was _not_ for this bug.  I'll need to check whether that applies.  Note that IIUC this bug was trying to track incoming side of error reporting, not src side.
 
> It feels quite a bit like my analysis here was
> wasted -- it could have been avoided if this BZ had been marked "in
> progress" (with the link to the patch series) on June 28th...
> 
> Also (again) I think bug 2111332 is related / dependent, so re-checking that
> one after Peter's patches might be useful.

Assigning this to me is definitely fine; but again just to mention that patch is not trying to fix this one at all.

Thanks.

Comment 16 Peter Xu 2023-07-05 14:33:11 UTC

(In reply to Peter Xu from comment #15)
> That patch was _not_ for this bug.  I'll need to check whether that applies.
> Note that IIUC this bug was trying to track incoming side of error
> reporting, not src side.

Hold on - maybe I was wrong, sorry.  After I re-read it it seems to be for the src.

Anyway, I'll double check with everything and when I repost I'll attach BZ: link if that applies.

Comment 17 Peter Xu 2023-07-05 17:02:58 UTC

As Laszlo rightfully pointed out, I think this patch should fix it:

https://lore.kernel.org/all/20230705163502.331007-2-peterx@redhat.com/

I'll take care of it when it lands upstream.  Laszlo, please feel free to still post whatever patch you got that you think makes sense even without fixing this bug.  Thanks.

Comment 18 Laszlo Ersek 2023-07-06 10:30:07 UTC

[qemu-devel] [PATCH 0/2] migration: trivialities
Message-Id: <20230706102937.82490-1-lersek>

Comment 19 RHEL Program Management 2023-08-14 07:28:33 UTC

After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

Comment 20 Li Xiaohui 2023-08-17 05:30:21 UTC

Reopen this bug as Peter has sent out a patch to fix it according to Comment 17

Comment 22 RHEL Program Management 2023-09-22 17:31:13 UTC

Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.

Comment 23 RHEL Program Management 2023-09-22 17:32:11 UTC

This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there.

Due to differences in account names between systems, some fields were not replicated.  Be sure to add yourself to Jira issue's "Watchers" field to continue receiving updates and add others to the "Need Info From" field to continue requesting information.

To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of 2 footprints next to it, and begin with "RHEL-" followed by an integer.  You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like:

"Bugzilla Bug" = 1234567

In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues. You can also visit https://access.redhat.com/articles/7032570 for general account information.