Bug 2114866
Summary: | Migration parameters are not reset successfully when killing dest virtproxyd during p2p migration | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 9 | Reporter: | Fangge Jin <fjin> |
Component: | libvirt | Assignee: | Jiri Denemark <jdenemar> |
libvirt sub component: | General | QA Contact: | Fangge Jin <fjin> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | unspecified | ||
Priority: | unspecified | CC: | jdenemar, lmen, virt-maint, xuzhang, yafu |
Version: | 9.1 | Keywords: | AutomationTriaged, Triaged |
Target Milestone: | rc | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | libvirt-8.9.0-1.el9 | Doc Type: | If docs needed, set a value |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2023-05-09 07:27:05 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | 8.9.0 |
Embargoed: |
Description
Fangge Jin
2022-08-03 12:14:43 UTC
The root cause is the asynchronous nature of migrate_cancel. Contrary to what libvirt expects, migrate_cancel switches the migration state to "cancelling" and returns before the migration is actually cancelled. Usually it is cancelled so quickly that nobody notices, but in this specific case (not sure how it relates to virtproxyd, it might be just luck) the migration gets cancelled a bit later: long enough for us to try to reset migration capabilities and parameters back to their original values before the migration is cancelled. Since capabilities may only be set when there's no migration running, we get the "There's a migration process in progress" error from QEMU, which causes qemuMigrationParamsReset to return early, before even trying to reset migration parameters. We will need to make a synchronous wrapper around migrate_cancel as the rest of libvirt relies on such behavior.

Patch sent upstream for review: https://listman.redhat.com/archives/libvir-list/2022-September/234101.html

Fixed upstream by:

commit 0ff8c175f743524d83a3939b2201f3bc1245d6e6
Refs: v8.7.0-55-g0ff8c175f7
Author:     Jiri Denemark <jdenemar>
AuthorDate: Tue Aug 30 11:18:06 2022 +0200
Commit:     Jiri Denemark <jdenemar>
CommitDate: Tue Sep 6 18:28:10 2022 +0200

    qemu: Rename qemuMigrationSrcCancel

    Let's call this qemuMigrationSrcCancelUnattended as the function is
    supposed to be used when no other thread is watching the migration.

    Signed-off-by: Jiri Denemark <jdenemar>
    Reviewed-by: Peter Krempa <pkrempa>

commit 4e55fe21b54dbc79413626f50908fad5e72a0494
Refs: v8.7.0-56-g4e55fe21b5
Author:     Jiri Denemark <jdenemar>
AuthorDate: Tue Aug 30 12:29:02 2022 +0200
Commit:     Jiri Denemark <jdenemar>
CommitDate: Tue Sep 6 18:28:10 2022 +0200

    qemu: Create wrapper for qemuMonitorMigrateCancel

    We will need a little bit more code around qemuMonitorMigrateCancel to
    make sure it works as expected. The new qemuMigrationSrcCancel helper
    will avoid repeating the code in several places.
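The race described above can be sketched with a small, self-contained simulation. The `FakeMonitor` class and its method names are invented for illustration only; this is not libvirt or QEMU code, just a model of the asynchronous-cancel behavior and the synchronous wrapper around it:

```python
# Minimal simulation of the race: migrate_cancel only switches the state
# to "cancelling" and returns; resetting capabilities before the actual
# cancellation fails. All names here are hypothetical.

class FakeMonitor:
    """Mimics the relevant QEMU monitor behavior."""

    def __init__(self):
        self.status = "active"
        self._cancel_countdown = 0

    def migrate_cancel(self):
        # Like QMP migrate_cancel: switch to "cancelling" and return
        # immediately; the real cancellation happens later.
        self.status = "cancelling"
        self._cancel_countdown = 3

    def poll(self):
        # Each poll represents QEMU making progress toward cancellation.
        if self.status == "cancelling":
            self._cancel_countdown -= 1
            if self._cancel_countdown <= 0:
                self.status = "cancelled"
        return self.status

    def set_capabilities(self):
        # QEMU rejects capability changes while a migration is running.
        if self.status in ("active", "cancelling"):
            raise RuntimeError("There's a migration process in progress")


def cancel_sync(mon, max_polls=100):
    """Synchronous wrapper: send migrate_cancel, then wait for 'cancelled'."""
    mon.migrate_cancel()
    for _ in range(max_polls):
        if mon.poll() == "cancelled":
            return True
    return False


# Buggy sequence: reset capabilities right after the async cancel -> error.
mon = FakeMonitor()
mon.migrate_cancel()
try:
    mon.set_capabilities()
except RuntimeError as e:
    print(e)  # There's a migration process in progress

# Fixed sequence: wait for the cancellation to finish first.
mon = FakeMonitor()
assert cancel_sync(mon)
mon.set_capabilities()  # now succeeds
print("capabilities reset after cancel")
```

The sketch makes the timing dependency explicit: whether the buggy sequence fails in practice depends only on how long QEMU stays in "cancelling", which matches the observation that the bug reproduces only sometimes.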
    Signed-off-by: Jiri Denemark <jdenemar>
    Reviewed-by: Peter Krempa <pkrempa>

commit 2d7b22b561f9e2dd1b86fb2c5b0d904823a189d2
Refs: v8.7.0-57-g2d7b22b561
Author:     Jiri Denemark <jdenemar>
AuthorDate: Tue Sep 6 18:21:31 2022 +0200
Commit:     Jiri Denemark <jdenemar>
CommitDate: Tue Sep 6 18:28:10 2022 +0200

    qemu: Make qemuMigrationSrcCancel optionally synchronous

    We have always considered the "migrate_cancel" QMP command to return
    after successfully cancelling the migration. But this is no longer
    true (to be honest, I'm not sure it ever was) as it just changes the
    migration state to "cancelling". In most cases the migration is
    cancelled pretty quickly and we don't really notice anything, but
    sometimes it takes so long that we even get to clearing migration
    capabilities before the migration is actually cancelled, which fails
    as capabilities can only be changed when no migration is running. To
    avoid this issue, we can wait for the migration to be really cancelled
    after sending migrate_cancel. The only place where we don't need
    synchronous behavior is when we're cancelling the migration at the
    user's request while it is actively watched by another thread.

    https://bugzilla.redhat.com/show_bug.cgi?id=2114866

    Signed-off-by: Jiri Denemark <jdenemar>
    Reviewed-by: Peter Krempa <pkrempa>

Pre-verify failed on libvirt-8.8.0-1.el9.x86_64

Steps:
1. Start guest
2. Do migration with --parallel-connections, kill target virtproxyd before migration completes:

   # virsh migrate vm1 qemu+tcp://{target}/system --live --p2p --postcopy --parallel --compressed --comp-methods xbzrle --bandwidth 4 --parallel-connections 4
   error: internal error: client socket is closed

3.
Query migrate parameters:

# virsh qemu-monitor-command vm1 '{"execute":"query-migrate-parameters"}'
{"return":{"cpu-throttle-tailslow":false,"xbzrle-cache-size":67108864,"cpu-throttle-initial":20,"announce-max":550,"decompress-threads":2,"compress-threads":8,"compress-level":1,"multifd-channels":4,"multifd-zstd-level":1,"announce-initial":50,"block-incremental":false,"compress-wait-thread":true,"downtime-limit":300,"tls-authz":"","multifd-compression":"none","announce-rounds":5,"announce-step":100,"tls-creds":"","multifd-zlib-level":1,"max-cpu-throttle":99,"max-postcopy-bandwidth":0,"tls-hostname":"","throttle-trigger-threshold":50,"max-bandwidth":4194304,"x-checkpoint-delay":20000,"cpu-throttle-increment":10},"id":"libvirt-430"}

Virtqemud log:

2022-10-11 08:05:51.703+0000: 51203: debug : qemuMonitorJSONIOProcessLine:189 : Line [{"timestamp": {"seconds": 1665475551, "microseconds": 703727}, "event": "MIGRATION", "data": {"status": "setup"}}]
2022-10-11 08:05:51.703+0000: 51203: info : qemuMonitorJSONIOProcessLine:203 : QEMU_MONITOR_RECV_EVENT: mon=0x7f6758050010 event={"timestamp": {"seconds": 1665475551, "microseconds": 703727}, "event": "MIGRATION", "data": {"status": "setup"}}
2022-10-11 08:05:51.703+0000: 51203: debug : qemuMonitorJSONIOProcessLine:189 : Line [{"return": {}, "id": "libvirt-418"}]
2022-10-11 08:05:51.703+0000: 51203: info : qemuMonitorJSONIOProcessLine:208 : QEMU_MONITOR_RECV_REPLY: mon=0x7f6758050010 reply={"return": {}, "id": "libvirt-418"}
2022-10-11 08:05:52.178+0000: 51203: debug : qemuMonitorJSONIOProcessLine:189 : Line [{"timestamp": {"seconds": 1665475552, "microseconds": 178048}, "event": "MIGRATION_PASS", "data": {"pass": 1}}]
2022-10-11 08:05:52.178+0000: 51203: info : qemuMonitorJSONIOProcessLine:203 : QEMU_MONITOR_RECV_EVENT: mon=0x7f6758050010 event={"timestamp": {"seconds": 1665475552, "microseconds": 178048}, "event": "MIGRATION_PASS", "data": {"pass": 1}}
2022-10-11 08:05:52.330+0000: 51203: debug :
qemuMonitorJSONIOProcessLine:189 : Line [{"timestamp": {"seconds": 1665475552, "microseconds": 330897}, "event": "MIGRATION", "data": {"status": "active"}}]
2022-10-11 08:05:52.330+0000: 51203: info : qemuMonitorJSONIOProcessLine:203 : QEMU_MONITOR_RECV_EVENT: mon=0x7f6758050010 event={"timestamp": {"seconds": 1665475552, "microseconds": 330897}, "event": "MIGRATION", "data": {"status": "active"}}
2022-10-11 08:06:00.861+0000: 51019: info : qemuMonitorSend:859 : QEMU_MONITOR_SEND_MSG: mon=0x7f6758050010 msg={"execute":"migrate_cancel","id":"libvirt-419"}
2022-10-11 08:06:00.862+0000: 51203: debug : qemuMonitorJSONIOProcessLine:189 : Line [{"timestamp": {"seconds": 1665475560, "microseconds": 862206}, "event": "MIGRATION", "data": {"status": "cancelling"}}]
2022-10-11 08:06:00.862+0000: 51203: info : qemuMonitorJSONIOProcessLine:203 : QEMU_MONITOR_RECV_EVENT: mon=0x7f6758050010 event={"timestamp": {"seconds": 1665475560, "microseconds": 862206}, "event": "MIGRATION", "data": {"status": "cancelling"}}
2022-10-11 08:06:00.862+0000: 51203: debug : qemuMonitorJSONIOProcessLine:189 : Line [{"return": {}, "id": "libvirt-419"}]
2022-10-11 08:06:00.862+0000: 51203: info : qemuMonitorJSONIOProcessLine:208 : QEMU_MONITOR_RECV_REPLY: mon=0x7f6758050010 reply={"return": {}, "id": "libvirt-419"}
2022-10-11 08:06:00.862+0000: 51019: info : qemuMonitorSend:859 : QEMU_MONITOR_SEND_MSG: mon=0x7f6758050010 msg={"execute":"migrate-set-capabilities","arguments":{"capabilities":[{"capability":"xbzrle","state":false},{"capability":"auto-converge","state":false},{"capability":"rdma-pin-all","state":false},{"capability":"postcopy-ram","state":false},{"capability":"compress","state":false},{"capability":"pause-before-switchover","state":false},{"capability":"late-block-activate","state":false},{"capability":"multifd","state":false},{"capability":"dirty-bitmaps","state":false},{"capability":"return-path","state":false},{"capability":"zero-copy-send","state":false}]},"id":"libvirt-420"}
2022-10-11 08:06:00.865+0000: 51203: debug : qemuMonitorJSONIOProcessLine:189 : Line [{"id": "libvirt-420", "error": {"class": "GenericError", "desc": "There's a migration process in progress"}}]
2022-10-11 08:06:00.865+0000: 51203: info : qemuMonitorJSONIOProcessLine:208 : QEMU_MONITOR_RECV_REPLY: mon=0x7f6758050010 reply={"id": "libvirt-420", "error": {"class": "GenericError", "desc": "There's a migration process in progress"}}
2022-10-11 08:06:00.936+0000: 51203: debug : qemuMonitorJSONIOProcessLine:189 : Line [{"timestamp": {"seconds": 1665475560, "microseconds": 936144}, "event": "MIGRATION", "data": {"status": "cancelled"}}]
2022-10-11 08:06:00.936+0000: 51203: info : qemuMonitorJSONIOProcessLine:203 : QEMU_MONITOR_RECV_EVENT: mon=0x7f6758050010 event={"timestamp": {"seconds": 1665475560, "microseconds": 936144}, "event": "MIGRATION", "data": {"status": "cancelled"}}
2022-10-11 08:06:29.563+0000: 51017: info : qemuMonitorSend:859 : QEMU_MONITOR_SEND_MSG: mon=0x7f6758050010 msg={"execute":"query-migrate-parameters","id":"libvirt-421"}
2022-10-11 08:06:29.563+0000: 51203: debug : qemuMonitorJSONIOProcessLine:189 : Line [{"return": {"cpu-throttle-tailslow": false, "xbzrle-cache-size": 67108864, "cpu-throttle-initial": 20, "announce-max": 550, "decompress-threads": 2, "compress-threads": 8, "compress-level": 1, "multifd-channels": 4, "multifd-zstd-level": 1, "announce-initial": 50, "block-incremental": false, "compress-wait-thread": true, "downtime-limit": 300, "tls-authz": "", "multifd-compression": "none", "announce-rounds": 5, "announce-step": 100, "tls-creds": "", "multifd-zlib-level": 1, "max-cpu-throttle": 99, "max-postcopy-bandwidth": 0, "tls-hostname": "", "throttle-trigger-threshold": 50, "max-bandwidth": 4194304, "x-checkpoint-delay": 20000, "cpu-throttle-increment": 10}, "id": "libvirt-421"}]
2022-10-11 08:06:29.563+0000: 51203: info : qemuMonitorJSONIOProcessLine:208 : QEMU_MONITOR_RECV_REPLY: mon=0x7f6758050010 reply={"return":
{"cpu-throttle-tailslow": false, "xbzrle-cache-size": 67108864, "cpu-throttle-initial": 20, "announce-max": 550, "decompress-threads": 2, "compress-threads": 8, "compress-level": 1, "multifd-channels": 4, "multifd-zstd-level": 1, "announce-initial": 50, "block-incremental": false, "compress-wait-thread": true, "downtime-limit": 300, "tls-authz": "", "multifd-compression": "none", "announce-rounds": 5, "announce-step": 100, "tls-creds": "", "multifd-zlib-level": 1, "max-cpu-throttle": 99, "max-postcopy-bandwidth": 0, "tls-hostname": "", "throttle-trigger-threshold": 50, "max-bandwidth": 4194304, "x-checkpoint-delay": 20000, "cpu-throttle-increment": 10}, "id": "libvirt-421"}

Tried additional scenario on libvirt-8.8.0-1.el9.x86_64

Steps:
1. Start guest
2. Do migration with --parallel-connections, kill source virtqemud before migration completes:

   # virsh migrate vm1 qemu+tcp://{target}/system --live --p2p --postcopy --parallel --compressed --comp-methods xbzrle --bandwidth 4 --parallel-connections 4
   error: Disconnected from qemu:///system due to end of file
   error: End of file while reading data: Input/output error

Result:
1. There is a virtqemud coredump:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  qemuMigrationUpdateJobType (jobData=jobData@entry=0x0) at ../src/qemu/qemu_migration.c:1732
1732        switch ((qemuMonitorMigrationStatus) priv->stats.mig.status) {
[Current thread is 1 (Thread 0x7f1ebbfff640 (LWP 55147))]
(gdb) bt
#0  qemuMigrationUpdateJobType (jobData=jobData@entry=0x0) at ../src/qemu/qemu_migration.c:1732
#1  0x00007f1f0829338f in qemuMigrationSrcIsCanceled (vm=0x7f1ebc094840, vm=0x7f1ebc094840) at ../src/qemu/qemu_migration.c:4601
#2  qemuMigrationSrcCancel (vm=0x7f1ebc094840, asyncJob=<optimized out>, wait=<optimized out>) at ../src/qemu/qemu_migration.c:4652
#3  0x00007f1f0829d780 in qemuMigrationSrcCancelUnattended (vm=0x7f1ebc094840) at ../src/qemu/qemu_migration.c:6985
#4  0x00007f1f082d8aa0 in qemuProcessRecoverMigrationOut (stopFlags=<synthetic pointer>, reason=1, state=VIR_DOMAIN_RUNNING, migStatus=VIR_DOMAIN_JOB_STATUS_FAILED, job=0x7f1ebbffe940, vm=0x7f1ebc094840, driver=0x7f1ebc0236f0) at ../src/qemu/qemu_process.c:3535
#5  qemuProcessRecoverMigration (stopFlags=<synthetic pointer>, job=0x7f1ebbffe940, vm=0x7f1ebc094840, driver=0x7f1ebc0236f0) at ../src/qemu/qemu_process.c:3620
#6  qemuProcessRecoverJob (stopFlags=<synthetic pointer>, job=0x7f1ebbffe940, vm=0x7f1ebc094840, driver=0x7f1ebc0236f0) at ../src/qemu/qemu_process.c:3687
#7  qemuProcessReconnect (opaque=<optimized out>) at ../src/qemu/qemu_process.c:8842
#8  0x00007f1f193c6fc9 in virThreadHelper (data=<optimized out>) at ../src/util/virthread.c:256
#9  0x00007f1f18d66802 in start_thread () from /lib64/libc.so.6
#10 0x00007f1f18d06450 in clone3 () from /lib64/libc.so.6

The attached logs start more than an hour after the log from comment #4 and everything worked just fine there. Could you please attach debug logs covering the incorrect behavior?
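The backtrace shows qemuMigrationUpdateJobType being entered with jobData == NULL on the reconnect path. A minimal simulation of that failure mode and of a guard that avoids it (all class and function names here are invented; this is not libvirt code):

```python
# Simulation of the reconnect crash: after a daemon restart, the per-VM
# private data no longer holds a current job object, so code that
# dereferences it unconditionally crashes (SIGSEGV in C, AttributeError
# here). All names are hypothetical.

class JobData:
    def __init__(self, status):
        self.status = status

class Priv:
    def __init__(self, current=None):
        self.current = current  # None after virtqemud restart

def update_job_type_buggy(priv):
    # Mirrors the crashing code path: no NULL check before use.
    return priv.current.status

def update_job_type_guarded(priv):
    # Guarded variant: tolerate a missing job object.
    if priv.current is None:
        return None
    return priv.current.status

priv = Priv(current=None)
try:
    update_job_type_buggy(priv)
    crashed = False
except AttributeError:
    crashed = True
print("crashed:", crashed)

assert update_job_type_guarded(priv) is None
assert update_job_type_guarded(Priv(JobData("cancelled"))) == "cancelled"
```

Note that the actual upstream fix (commit 1a570f9712, quoted below in this report) goes further than a bare guard: it temporarily restores the migration job so MIGRATION events from QEMU can still be processed. The sketch only illustrates why the unconditional dereference crashes.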
Fixes for both issues (comment #4 and comment #6) sent upstream for review: https://bugzilla.redhat.com/show_bug.cgi?id=2114866

This is now fixed upstream by:

commit 4dd86f334bc3c9679865176f588597cf99375330
Refs: v8.8.0-175-g4dd86f334b
Author:     Jiri Denemark <jdenemar>
AuthorDate: Tue Oct 11 14:00:13 2022 +0200
Commit:     Jiri Denemark <jdenemar>
CommitDate: Mon Oct 24 15:28:47 2022 +0200

    qemu_migration: Properly wait for migration to be canceled

    In my commit v8.7.0-57-g2d7b22b561 I attempted to make
    qemuMigrationSrcCancel synchronous, but failed. When we are canceling
    migration after some kind of error which is detected in
    qemuMigrationSrcWaitForCompletion, jobData->status will be set to
    VIR_DOMAIN_JOB_STATUS_FAILED regardless of the QEMU state. So instead
    of relying on the translated jobData->status in
    qemuMigrationSrcIsCanceled, we need to check the migration status we
    get from QEMU's MIGRATION event.

    Signed-off-by: Jiri Denemark <jdenemar>
    Reviewed-by: Ján Tomko <jtomko>

commit 1a570f97125dbf4897e48c833b8fcec0ddbc13f1
Refs: v8.8.0-176-g1a570f9712
Author:     Jiri Denemark <jdenemar>
AuthorDate: Wed Oct 12 17:45:38 2022 +0200
Commit:     Jiri Denemark <jdenemar>
CommitDate: Mon Oct 24 15:28:47 2022 +0200

    qemu: Do not crash when canceling migration on reconnect

    When libvirtd is restarted during an active outgoing migration (or
    snapshot, save, or dump, which are internally implemented as
    migration), it wants to cancel the migration. But by a mistake in
    commit v8.7.0-57-g2d7b22b561 the qemuMigrationSrcCancel function is
    called with wait == true, which leads to an instant crash by
    dereferencing a NULL pointer stored in priv->job.current. When
    canceling migration to file (snapshot, save, dump), we don't need to
    wait until it is really canceled as no migration capabilities or
    parameters need to be restored.
    On the other hand, we need to wait when canceling an outgoing
    migration, and since we don't have virDomainJobData at this point, we
    have to temporarily restore the migration job to make sure we can
    process MIGRATION events from QEMU.

    Signed-off-by: Jiri Denemark <jdenemar>
    Reviewed-by: Ján Tomko <jtomko>

Tested with libvirt-8.9.0-2.el9.x86_64 and qemu-kvm-7.1.0-4.el9.x86_64
Covered scenarios: non-p2p, p2p kill src virtqemud, kill target virtqemud, kill target virtproxyd
Result: PASS

Verified with libvirt-8.9.0-2.el9.x86_64

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (libvirt bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:2171
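The wait/no-wait distinction drawn in commit 1a570f9712 can be summarized in a short sketch. The function and parameter names below are invented for illustration and are not libvirt's actual API; the logic models only the decision described in the commit message:

```python
# Hypothetical sketch of the cancel-on-reconnect decision: canceling a
# migration to file (snapshot/save/dump) needs no wait, while canceling
# an outgoing migration must wait for QEMU to report "cancelled" so that
# capabilities and parameters can be reset afterwards without hitting
# "There's a migration process in progress".

def cancel_on_reconnect(kind, send_cancel, poll_status):
    """Cancel a recovered migration job.

    kind:        "to-file" (snapshot/save/dump) or "outgoing"
    send_cancel: callable issuing the asynchronous migrate_cancel
    poll_status: callable returning QEMU's current migration status
    """
    send_cancel()
    if kind == "to-file":
        # Nothing to restore afterwards, so returning early is safe.
        return "not-waited"
    # Outgoing migration: wait until QEMU reports "cancelled".
    while poll_status() != "cancelled":
        pass
    return "waited"


# Tiny driver: QEMU reports "cancelling" twice before "cancelled".
statuses = iter(["cancelling", "cancelling", "cancelled"])
assert cancel_on_reconnect("outgoing", lambda: None, lambda: next(statuses)) == "waited"
assert cancel_on_reconnect("to-file", lambda: None, lambda: "cancelling") == "not-waited"
print("ok")
```

The split mirrors the verified scenarios above: only the outgoing-migration paths needed the synchronous wait, which is why the to-file paths could keep the cheaper asynchronous behavior.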