Bug 2114866
| Summary: | Migration parameters are not reset successfully when killing dest virtproxyd during p2p migration | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Fangge Jin <fjin> |
| Component: | libvirt | Assignee: | Jiri Denemark <jdenemar> |
| libvirt sub component: | General | QA Contact: | Fangge Jin <fjin> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | unspecified | | |
| Priority: | unspecified | CC: | jdenemar, lmen, virt-maint, xuzhang, yafu |
| Version: | 9.1 | Keywords: | AutomationTriaged, Triaged |
| Target Milestone: | rc | Flags: | pm-rhel: mirror+ |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | libvirt-8.9.0-1.el9 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-05-09 07:27:05 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | 8.9.0 |
| Embargoed: | | | |
The root cause is the asynchronous nature of migrate_cancel. Contrary to what libvirt expects, migrate_cancel switches the migration state to "cancelling" and returns before the migration is actually cancelled. Usually it is cancelled so quickly that nobody notices, but in this specific case (it is not clear how this relates to virtproxyd; it might be just luck) the migration gets cancelled a bit later, long enough for libvirt to try to reset migration capabilities and parameters back to their original values before the migration is cancelled. But since capabilities may only be set when no migration is running, QEMU reports a "There's a migration process in progress" error, which causes qemuMigrationParamsReset to return early, before even trying to reset the migration parameters. We will need a synchronous wrapper around migrate_cancel, as the rest of libvirt relies on such behavior.

Patch sent upstream for review: https://listman.redhat.com/archives/libvir-list/2022-September/234101.html

Fixed upstream by
commit 0ff8c175f743524d83a3939b2201f3bc1245d6e6
Refs: v8.7.0-55-g0ff8c175f7
Author: Jiri Denemark <jdenemar>
AuthorDate: Tue Aug 30 11:18:06 2022 +0200
Commit: Jiri Denemark <jdenemar>
CommitDate: Tue Sep 6 18:28:10 2022 +0200
qemu: Rename qemuMigrationSrcCancel
Let's call this qemuMigrationSrcCancelUnattended as the function is
supposed to be used when no other thread is watching the migration.
Signed-off-by: Jiri Denemark <jdenemar>
Reviewed-by: Peter Krempa <pkrempa>
commit 4e55fe21b54dbc79413626f50908fad5e72a0494
Refs: v8.7.0-56-g4e55fe21b5
Author: Jiri Denemark <jdenemar>
AuthorDate: Tue Aug 30 12:29:02 2022 +0200
Commit: Jiri Denemark <jdenemar>
CommitDate: Tue Sep 6 18:28:10 2022 +0200
qemu: Create wrapper for qemuMonitorMigrateCancel
We will need a little bit more code around qemuMonitorMigrateCancel to
make sure it works as expected. The new qemuMigrationSrcCancel helper
will avoid repeating the code in several places.
Signed-off-by: Jiri Denemark <jdenemar>
Reviewed-by: Peter Krempa <pkrempa>
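For illustration, this is roughly the shape of that refactor; the function bodies below are hypothetical stand-ins, not the actual libvirt code:

```c
/* Rough shape of the wrapper idea from the commit above: centralize
 * the enter-monitor / send-command / exit-monitor sequence in a single
 * helper so later changes (like the synchronous wait) happen in one
 * place. All names here are hypothetical stand-ins. */
#include <stdio.h>

static int
monitorMigrateCancel(void)
{
    puts("QMP: migrate_cancel");     /* the raw monitor command */
    return 0;
}

/* The single helper every caller goes through; extra logic added
 * later (waiting for the "cancelled" state) lands only here. */
static int
migrationSrcCancel(void)
{
    puts("enter monitor");
    int rc = monitorMigrateCancel();
    puts("exit monitor");
    return rc;
}

int main(void)
{
    /* Several code paths (user abort, error cleanup, reconnect) can
     * now share the same helper instead of repeating the sequence. */
    return migrationSrcCancel();
}
```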
commit 2d7b22b561f9e2dd1b86fb2c5b0d904823a189d2
Refs: v8.7.0-57-g2d7b22b561
Author: Jiri Denemark <jdenemar>
AuthorDate: Tue Sep 6 18:21:31 2022 +0200
Commit: Jiri Denemark <jdenemar>
CommitDate: Tue Sep 6 18:28:10 2022 +0200
qemu: Make qemuMigrationSrcCancel optionally synchronous
We have always considered the "migrate_cancel" QMP command to return after
successfully cancelling the migration. But this is no longer true (to be
honest I'm not sure it ever was) as it just changes the migration state
to "cancelling". In most cases the migration is canceled pretty quickly
and we don't really notice anything, but sometimes it takes so long we
even get to clearing migration capabilities before the migration is
actually canceled, which fails as capabilities can only be changed when
no migration is running. So to avoid this issue, we can wait for the
migration to be really canceled after sending migrate_cancel. The only
place where we don't need synchronous behavior is when we're cancelling
migration on user's request while it is actively watched by another
thread.
https://bugzilla.redhat.com/show_bug.cgi?id=2114866
Signed-off-by: Jiri Denemark <jdenemar>
Reviewed-by: Peter Krempa <pkrempa>
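A minimal standalone sketch of the synchronous-cancel idea described in this commit; all types and helpers are hypothetical stand-ins for the real qemuMigrationSrcCancel machinery:

```c
/* Sketch only: migrate_cancel merely switches the state to "cancelling"
 * and returns, so a caller that must reset capabilities afterwards has
 * to keep waiting until QEMU reports the migration as cancelled. */
#include <stdbool.h>
#include <stdio.h>

typedef enum {
    MIG_ACTIVE,
    MIG_CANCELLING,
    MIG_CANCELLED,
} MigStatus;

/* Pretend QEMU: the first few polls still report "cancelling". */
static MigStatus
queryMigrationStatus(void)
{
    static int polls;
    return ++polls < 3 ? MIG_CANCELLING : MIG_CANCELLED;
}

/* Stand-in for sending the migrate_cancel QMP command, which only
 * requests cancellation and returns immediately. */
static void
sendMigrateCancel(void)
{
    puts("migrate_cancel sent; state is now 'cancelling'");
}

/* If wait is true, do not return until QEMU reports the migration as
 * really cancelled; only then is it safe to reset capabilities. */
static void
migrationCancel(bool wait)
{
    sendMigrateCancel();
    while (wait && queryMigrationStatus() != MIG_CANCELLED)
        puts("still cancelling, waiting...");
    if (wait)
        puts("migration cancelled; safe to reset capabilities/parameters");
}

int main(void)
{
    migrationCancel(true);
    return 0;
}
```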
Pre-verify failed on libvirt-8.8.0-1.el9.x86_64
Steps:
1. Start guest
2. Do migration with --parallel-connections, kill target virtproxyd before migration completes:
# virsh migrate vm1 qemu+tcp://{target}/system --live --p2p --postcopy --parallel --compressed --comp-methods xbzrle --bandwidth 4 --parallel-connections 4
error: internal error: client socket is closed
3. Query migrate parameters:
# virsh qemu-monitor-command vm1 '{"execute":"query-migrate-parameters"}'
{"return":{"cpu-throttle-tailslow":false,"xbzrle-cache-size":67108864,"cpu-throttle-initial":20,"announce-max":550,"decompress-threads":2,"compress-threads":8,"compress-level":1,"multifd-channels":4,"multifd-zstd-level":1,"announce-initial":50,"block-incremental":false,"compress-wait-thread":true,"downtime-limit":300,"tls-authz":"","multifd-compression":"none","announce-rounds":5,"announce-step":100,"tls-creds":"","multifd-zlib-level":1,"max-cpu-throttle":99,"max-postcopy-bandwidth":0,"tls-hostname":"","throttle-trigger-threshold":50,"max-bandwidth":4194304,"x-checkpoint-delay":20000,"cpu-throttle-increment":10},"id":"libvirt-430"}
Virtqemud log:
2022-10-11 08:05:51.703+0000: 51203: debug : qemuMonitorJSONIOProcessLine:189 : Line [{"timestamp": {"seconds": 1665475551, "microseconds": 703727}, "event": "MIGRATION", "data": {"status": "setup"}}]
2022-10-11 08:05:51.703+0000: 51203: info : qemuMonitorJSONIOProcessLine:203 : QEMU_MONITOR_RECV_EVENT: mon=0x7f6758050010 event={"timestamp": {"seconds": 1665475551, "microseconds": 703727}, "event": "MIGRATION", "data": {"status": "setup"}}
2022-10-11 08:05:51.703+0000: 51203: debug : qemuMonitorJSONIOProcessLine:189 : Line [{"return": {}, "id": "libvirt-418"}]
2022-10-11 08:05:51.703+0000: 51203: info : qemuMonitorJSONIOProcessLine:208 : QEMU_MONITOR_RECV_REPLY: mon=0x7f6758050010 reply={"return": {}, "id": "libvirt-418"}
2022-10-11 08:05:52.178+0000: 51203: debug : qemuMonitorJSONIOProcessLine:189 : Line [{"timestamp": {"seconds": 1665475552, "microseconds": 178048}, "event": "MIGRATION_PASS", "data": {"pass": 1}}]
2022-10-11 08:05:52.178+0000: 51203: info : qemuMonitorJSONIOProcessLine:203 : QEMU_MONITOR_RECV_EVENT: mon=0x7f6758050010 event={"timestamp": {"seconds": 1665475552, "microseconds": 178048}, "event": "MIGRATION_PASS", "data": {"pass": 1}}
2022-10-11 08:05:52.330+0000: 51203: debug : qemuMonitorJSONIOProcessLine:189 : Line [{"timestamp": {"seconds": 1665475552, "microseconds": 330897}, "event": "MIGRATION", "data": {"status": "active"}}]
2022-10-11 08:05:52.330+0000: 51203: info : qemuMonitorJSONIOProcessLine:203 : QEMU_MONITOR_RECV_EVENT: mon=0x7f6758050010 event={"timestamp": {"seconds": 1665475552, "microseconds": 330897}, "event": "MIGRATION", "data": {"status": "active"}}
2022-10-11 08:06:00.861+0000: 51019: info : qemuMonitorSend:859 : QEMU_MONITOR_SEND_MSG: mon=0x7f6758050010 msg={"execute":"migrate_cancel","id":"libvirt-419"}
2022-10-11 08:06:00.862+0000: 51203: debug : qemuMonitorJSONIOProcessLine:189 : Line [{"timestamp": {"seconds": 1665475560, "microseconds": 862206}, "event": "MIGRATION", "data": {"status": "cancelling"}}]
2022-10-11 08:06:00.862+0000: 51203: info : qemuMonitorJSONIOProcessLine:203 : QEMU_MONITOR_RECV_EVENT: mon=0x7f6758050010 event={"timestamp": {"seconds": 1665475560, "microseconds": 862206}, "event": "MIGRATION", "data": {"status": "cancelling"}}
2022-10-11 08:06:00.862+0000: 51203: debug : qemuMonitorJSONIOProcessLine:189 : Line [{"return": {}, "id": "libvirt-419"}]
2022-10-11 08:06:00.862+0000: 51203: info : qemuMonitorJSONIOProcessLine:208 : QEMU_MONITOR_RECV_REPLY: mon=0x7f6758050010 reply={"return": {}, "id": "libvirt-419"}
2022-10-11 08:06:00.862+0000: 51019: info : qemuMonitorSend:859 : QEMU_MONITOR_SEND_MSG: mon=0x7f6758050010 msg={"execute":"migrate-set-capabilities","arguments":{"capabilities":[{"capability":"xbzrle","state":false},{"capability":"auto-converge","state":false},{"capability":"rdma-pin-all","state":false},{"capability":"postcopy-ram","state":false},{"capability":"compress","state":false},{"capability":"pause-before-switchover","state":false},{"capability":"late-block-activate","state":false},{"capability":"multifd","state":false},{"capability":"dirty-bitmaps","state":false},{"capability":"return-path","state":false},{"capability":"zero-copy-send","state":false}]},"id":"libvirt-420"}
2022-10-11 08:06:00.865+0000: 51203: debug : qemuMonitorJSONIOProcessLine:189 : Line [{"id": "libvirt-420", "error": {"class": "GenericError", "desc": "There's a migration process in progress"}}]
2022-10-11 08:06:00.865+0000: 51203: info : qemuMonitorJSONIOProcessLine:208 : QEMU_MONITOR_RECV_REPLY: mon=0x7f6758050010 reply={"id": "libvirt-420", "error": {"class": "GenericError", "desc": "There's a migration process in progress"}}
2022-10-11 08:06:00.936+0000: 51203: debug : qemuMonitorJSONIOProcessLine:189 : Line [{"timestamp": {"seconds": 1665475560, "microseconds": 936144}, "event": "MIGRATION", "data": {"status": "cancelled"}}]
2022-10-11 08:06:00.936+0000: 51203: info : qemuMonitorJSONIOProcessLine:203 : QEMU_MONITOR_RECV_EVENT: mon=0x7f6758050010 event={"timestamp": {"seconds": 1665475560, "microseconds": 936144}, "event": "MIGRATION", "data": {"status": "cancelled"}}
2022-10-11 08:06:29.563+0000: 51017: info : qemuMonitorSend:859 : QEMU_MONITOR_SEND_MSG: mon=0x7f6758050010 msg={"execute":"query-migrate-parameters","id":"libvirt-421"}
2022-10-11 08:06:29.563+0000: 51203: debug : qemuMonitorJSONIOProcessLine:189 : Line [{"return": {"cpu-throttle-tailslow": false, "xbzrle-cache-size": 67108864, "cpu-throttle-initial": 20, "announce-max": 550, "decompress-threads": 2, "compress-threads": 8, "compress-level": 1, "multifd-channels": 4, "multifd-zstd-level": 1, "announce-initial": 50, "block-incremental": false, "compress-wait-thread": true, "downtime-limit": 300, "tls-authz": "", "multifd-compression": "none", "announce-rounds": 5, "announce-step": 100, "tls-creds": "", "multifd-zlib-level": 1, "max-cpu-throttle": 99, "max-postcopy-bandwidth": 0, "tls-hostname": "", "throttle-trigger-threshold": 50, "max-bandwidth": 4194304, "x-checkpoint-delay": 20000, "cpu-throttle-increment": 10}, "id": "libvirt-421"}]
2022-10-11 08:06:29.563+0000: 51203: info : qemuMonitorJSONIOProcessLine:208 : QEMU_MONITOR_RECV_REPLY: mon=0x7f6758050010 reply={"return": {"cpu-throttle-tailslow": false, "xbzrle-cache-size": 67108864, "cpu-throttle-initial": 20, "announce-max": 550, "decompress-threads": 2, "compress-threads": 8, "compress-level": 1, "multifd-channels": 4, "multifd-zstd-level": 1, "announce-initial": 50, "block-incremental": false, "compress-wait-thread": true, "downtime-limit": 300, "tls-authz": "", "multifd-compression": "none", "announce-rounds": 5, "announce-step": 100, "tls-creds": "", "multifd-zlib-level": 1, "max-cpu-throttle": 99, "max-postcopy-bandwidth": 0, "tls-hostname": "", "throttle-trigger-threshold": 50, "max-bandwidth": 4194304, "x-checkpoint-delay": 20000, "cpu-throttle-increment": 10}, "id": "libvirt-421"}
Tried an additional scenario on libvirt-8.8.0-1.el9.x86_64.
Steps:
1. Start guest
2. Do migration with --parallel-connections, kill source virtqemud before migration completes:
# virsh migrate vm1 qemu+tcp://{target}/system --live --p2p --postcopy --parallel --compressed --comp-methods xbzrle --bandwidth 4 --parallel-connections 4
error: Disconnected from qemu:///system due to end of file
error: End of file while reading data: Input/output error
Result:
1. There is a virtqemud coredump:
Program terminated with signal SIGSEGV, Segmentation fault.
#0 qemuMigrationUpdateJobType (jobData=jobData@entry=0x0) at ../src/qemu/qemu_migration.c:1732
1732 switch ((qemuMonitorMigrationStatus) priv->stats.mig.status) {
[Current thread is 1 (Thread 0x7f1ebbfff640 (LWP 55147))]
(gdb) bt
#0 qemuMigrationUpdateJobType (jobData=jobData@entry=0x0) at ../src/qemu/qemu_migration.c:1732
#1 0x00007f1f0829338f in qemuMigrationSrcIsCanceled (vm=0x7f1ebc094840, vm=0x7f1ebc094840) at ../src/qemu/qemu_migration.c:4601
#2 qemuMigrationSrcCancel (vm=0x7f1ebc094840, asyncJob=<optimized out>, wait=<optimized out>) at ../src/qemu/qemu_migration.c:4652
#3 0x00007f1f0829d780 in qemuMigrationSrcCancelUnattended (vm=0x7f1ebc094840) at ../src/qemu/qemu_migration.c:6985
#4 0x00007f1f082d8aa0 in qemuProcessRecoverMigrationOut (stopFlags=<synthetic pointer>, reason=1, state=VIR_DOMAIN_RUNNING,
migStatus=VIR_DOMAIN_JOB_STATUS_FAILED, job=0x7f1ebbffe940, vm=0x7f1ebc094840, driver=0x7f1ebc0236f0) at ../src/qemu/qemu_process.c:3535
#5 qemuProcessRecoverMigration (stopFlags=<synthetic pointer>, job=0x7f1ebbffe940, vm=0x7f1ebc094840, driver=0x7f1ebc0236f0)
at ../src/qemu/qemu_process.c:3620
#6 qemuProcessRecoverJob (stopFlags=<synthetic pointer>, job=0x7f1ebbffe940, vm=0x7f1ebc094840, driver=0x7f1ebc0236f0)
at ../src/qemu/qemu_process.c:3687
#7 qemuProcessReconnect (opaque=<optimized out>) at ../src/qemu/qemu_process.c:8842
#8 0x00007f1f193c6fc9 in virThreadHelper (data=<optimized out>) at ../src/util/virthread.c:256
#9 0x00007f1f18d66802 in start_thread () from /lib64/libc.so.6
#10 0x00007f1f18d06450 in clone3 () from /lib64/libc.so.6
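For clarity, a toy reproduction of the crash pattern visible in frame #0 (jobData arrives as NULL and the first dereference faults); the names are hypothetical, not the libvirt source:

```c
/* Toy reproduction of the crash pattern from the backtrace: a helper
 * unconditionally dereferences job data that is NULL during reconnect.
 * Hypothetical names; this is not the libvirt code. */
#include <stddef.h>
#include <stdio.h>

typedef struct { int status; } JobData;

static void
updateJobType(JobData *jobData)
{
    /* Analogue of the switch in frame #0: crashes if jobData is NULL,
     * as it is when the daemon reconnects to a domain whose migration
     * job has not been restored yet. */
    printf("status: %d\n", jobData->status);
}

int main(void)
{
    JobData *current = NULL;   /* priv->job.current after restart */
    if (current != NULL)       /* the missing guard; the upstream fix
                                  instead restores the job first */
        updateJobType(current);
    else
        puts("no job data yet; restore the migration job before waiting");
    return 0;
}
```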
The attached logs start more than an hour after the log from comment #4 and everything worked just fine there. Could you please attach debug logs covering the incorrect behavior?

Fixes for both issues (comment #4 and comment #6) sent upstream for review: https://bugzilla.redhat.com/show_bug.cgi?id=2114866

This is now fixed upstream by
commit 4dd86f334bc3c9679865176f588597cf99375330
Refs: v8.8.0-175-g4dd86f334b
Author: Jiri Denemark <jdenemar>
AuthorDate: Tue Oct 11 14:00:13 2022 +0200
Commit: Jiri Denemark <jdenemar>
CommitDate: Mon Oct 24 15:28:47 2022 +0200
qemu_migration: Properly wait for migration to be canceled
In my commit v8.7.0-57-g2d7b22b561 I attempted to make
qemuMigrationSrcCancel synchronous, but failed. When we are canceling
migration after some kind of error which is detected in
qemuMigrationSrcWaitForCompletion, jobData->status will be set to
VIR_DOMAIN_JOB_STATUS_FAILED regardless of the QEMU state. So instead of
relying on the translated jobData->status in qemuMigrationSrcIsCanceled
we need to check the migration status we get from the QEMU MIGRATION event.
Signed-off-by: Jiri Denemark <jdenemar>
Reviewed-by: Ján Tomko <jtomko>
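A small sketch of the distinction this commit draws, assuming simplified enums in place of libvirt's job and QEMU migration status types:

```c
/* Sketch: decide "is the migration cancelled?" from the raw status in
 * QEMU's MIGRATION event instead of the translated job status, which
 * reads FAILED after any error. Hypothetical names throughout. */
#include <stdbool.h>
#include <stdio.h>

typedef enum { JOB_ACTIVE, JOB_FAILED, JOB_CANCELED } JobStatus;
typedef enum { QEMU_MIG_ACTIVE, QEMU_MIG_CANCELLING,
               QEMU_MIG_CANCELLED } QemuMigStatus;

/* Broken check: after an error in the wait loop the translated job
 * status is FAILED regardless of what QEMU is doing, so this never
 * sees CANCELED and the synchronous wait never finishes. */
static bool
isCanceledViaJobStatus(JobStatus s)
{
    return s == JOB_CANCELED;
}

/* Fixed check: use the status delivered by the QEMU MIGRATION event,
 * which does reach "cancelled" once migrate_cancel completes. */
static bool
isCanceledViaQemuEvent(QemuMigStatus s)
{
    return s == QEMU_MIG_CANCELLED;
}

int main(void)
{
    /* Scenario from the commit message: migration failed, then was
     * cancelled; the job says FAILED while QEMU says cancelled. */
    JobStatus job = JOB_FAILED;
    QemuMigStatus qemu = QEMU_MIG_CANCELLED;

    printf("job-status check: %s\n",
           isCanceledViaJobStatus(job) ? "cancelled" : "not cancelled");
    printf("qemu-event check: %s\n",
           isCanceledViaQemuEvent(qemu) ? "cancelled" : "not cancelled");
    return 0;
}
```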
commit 1a570f97125dbf4897e48c833b8fcec0ddbc13f1
Refs: v8.8.0-176-g1a570f9712
Author: Jiri Denemark <jdenemar>
AuthorDate: Wed Oct 12 17:45:38 2022 +0200
Commit: Jiri Denemark <jdenemar>
CommitDate: Mon Oct 24 15:28:47 2022 +0200
qemu: Do not crash when canceling migration on reconnect
When libvirtd is restarted during an active outgoing migration (or
snapshot, save, or dump which are internally implemented as migration)
it wants to cancel the migration. But by a mistake in commit
v8.7.0-57-g2d7b22b561 the qemuMigrationSrcCancel function is called with
wait == true, which leads to an instant crash by dereferencing a NULL
pointer stored in priv->job.current.
When canceling migration to file (snapshot, save, dump), we don't need
to wait until it is really canceled as no migration capabilities or
parameters need to be restored.
On the other hand we need to wait when canceling outgoing migration and
since we don't have virDomainJobData at this point, we have to
temporarily restore the migration job to make sure we can process
MIGRATION events from QEMU.
Signed-off-by: Jiri Denemark <jdenemar>
Reviewed-by: Ján Tomko <jtomko>
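A rough sketch of the reconnect-time decision described above, with hypothetical names standing in for the real job-recovery code:

```c
/* Sketch: only outgoing migration needs the synchronous wait (so
 * capabilities and parameters can be reset afterwards); cancelling a
 * migration to file (snapshot/save/dump) can stay asynchronous.
 * Hypothetical names, not the libvirt source. */
#include <stdbool.h>
#include <stdio.h>

typedef enum { ASYNC_MIGRATION_OUT, ASYNC_SNAPSHOT,
               ASYNC_SAVE, ASYNC_DUMP } AsyncJobKind;

static void
cancelMigration(bool wait)
{
    printf("migrate_cancel sent, wait=%s\n", wait ? "true" : "false");
    /* For wait=true the real fix also restores the migration job
     * temporarily so MIGRATION events from QEMU can be processed. */
}

static void
recoverJob(AsyncJobKind kind)
{
    /* Wait only for a real outgoing migration. */
    cancelMigration(kind == ASYNC_MIGRATION_OUT);
}

int main(void)
{
    recoverJob(ASYNC_MIGRATION_OUT); /* wait=true  */
    recoverJob(ASYNC_SAVE);          /* wait=false */
    return 0;
}
```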
Tested with libvirt-8.9.0-2.el9.x86_64 and qemu-kvm-7.1.0-4.el9.x86_64.
Covered scenarios: non-p2p, p2p kill src virtqemud, kill target virtqemud, kill target virtproxyd
Result: PASS

Verified with libvirt-8.9.0-2.el9.x86_64.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (libvirt bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:2171
Description of problem:
Do p2p migration with some migration parameters (e.g. --parallel-connections) and kill dest virtproxyd during migration. Then check whether the migration parameters are reset.

Version-Release number of selected component (if applicable):
libvirt-8.5.0-4.el9.x86_64
qemu-kvm-7.0.0-9.el9.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Start vm
2. Do p2p migration with some migration parameters
# virsh migrate vm1 qemu+tcp://******/system --live --postcopy --parallel --compressed --comp-methods xbzrle --bandwidth 4 --parallel-connections 4 [--p2p]
3. Kill dest virtproxyd before migration completes.
4. Check migration parameters; the parameters are not restored:
# virsh qemu-monitor-command vm1 '{"execute":"query-migrate-parameters"}'
{"return":{"cpu-throttle-tailslow":false,"xbzrle-cache-size":67108864,"cpu-throttle-initial":20,"announce-max":550,"decompress-threads":2,"compress-threads":8,"compress-level":1,"multifd-channels":4,"multifd-zstd-level":1,"announce-initial":50,"block-incremental":false,"compress-wait-thread":true,"downtime-limit":300,"tls-authz":"","multifd-compression":"none","announce-rounds":5,"announce-step":100,"tls-creds":"","multifd-zlib-level":1,"max-cpu-throttle":99,"max-postcopy-bandwidth":0,"tls-hostname":"","throttle-trigger-threshold":50,"max-bandwidth":4194304,"x-checkpoint-delay":20000,"cpu-throttle-increment":10},"id":"libvirt-14"}
5. Check virtqemud log:
2022-08-03 08:46:44.820+0000: 161370: info : qemuMonitorIOWrite:382 : QEMU_MONITOR_IO_WRITE: mon=0x7ff75c087460 buf={"execute":"migrate_cancel","id":"libvirt-439"}
2022-08-03 08:46:44.820+0000: 161370: info : qemuMonitorJSONIOProcessLine:209 : QEMU_MONITOR_RECV_EVENT: mon=0x7ff75c087460 event={"timestamp": {"seconds": 1659516404, "microseconds": 820512}, "event": "MIGRATION", "data": {"status": "cancelling"}}
2022-08-03 08:46:44.820+0000: 161370: info : qemuMonitorJSONIOProcessLine:214 : QEMU_MONITOR_RECV_REPLY: mon=0x7ff75c087460 reply={"return": {}, "id": "libvirt-439"}
2022-08-03 08:46:44.821+0000: 161370: info : qemuMonitorIOWrite:382 : QEMU_MONITOR_IO_WRITE: mon=0x7ff75c087460 buf={"execute":"migrate-set-capabilities","arguments":{"capabilities":[{"capability":"xbzrle","state":false},{"capability":"auto-converge","state":false},{"capability":"rdma-pin-all","state":false},{"capability":"postcopy-ram","state":false},{"capability":"compress","state":false},{"capability":"pause-before-switchover","state":false},{"capability":"late-block-activate","state":false},{"capability":"multifd","state":false},{"capability":"dirty-bitmaps","state":false},{"capability":"return-path","state":false},{"capability":"zero-copy-send","state":false}]},"id":"libvirt-440"}
2022-08-03 08:46:44.823+0000: 161370: debug : qemuMonitorJSONIOProcessLine:195 : Line [{"id": "libvirt-440", "error": {"class": "GenericError", "desc": "There's a migration process in progress"}}]

Actual results:
As steps

Expected results:
Migration parameters can be reset

Additional info: