Bug 2044818
Summary: | Qemu Core Dumped when migrate -> migrate_cancel -> migrate again during guest is paused | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 9 | Reporter: | Tingting Mao <timao> |
Component: | qemu-kvm | Assignee: | Peter Xu <peterx> |
qemu-kvm sub component: | Live Migration | QA Contact: | Li Xiaohui <xiaohli> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | medium | ||
Priority: | medium | CC: | chayang, coli, dgilbert, jinzhao, juzhang, kkiwi, lcapitulino, leobras, mdean, pbonzini, peterx, quintela, virt-maint, xiaohli |
Version: | 9.0 | Keywords: | Regression, Triaged |
Target Milestone: | rc | ||
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | qemu-kvm-6.2.0-10.el9 | Doc Type: | If docs needed, set a value |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2022-05-17 12:25:28 UTC | Type: | Bug |
Description
Tingting Mao
2022-01-25 09:32:31 UTC
The complete core dump info: http://fileshare.englab.nay.redhat.com/pub/section2/kvm/timao/bugs/2044818/core.qemu-kvm.0.20be3b5b76ee4adf8b3b196cea053715.4674.1643101255000000.zst

We start the migration + internal snapshot test following the commit below:

*******************************************************************
commit 4c170330aae4a4ed75c3a8638b7d4c5d9f365244
Author: Peter Xu <peterx>
Date: Wed Sep 22 12:20:07 2021 -0400

    migration: Make migration blocker work for snapshots too

    save_snapshot() checks migration blocker, which looks sane. At the meantime we should also teach the blocker add helper to fail if during a snapshot, just like for migrations.

    Reviewed-by: Marc-André Lureau <marcandre.lureau>
    Signed-off-by: Peter Xu <peterx>
    Reviewed-by: Juan Quintela <quintela>
    Signed-off-by: Juan Quintela <quintela>

I'll lower the severity to medium considering this is an internal snapshot scenario, but this is also likely a regression (please add the keyword if confirmed). Do you have information about the last version where this worked? Also tagging @peterx in case he has any insights, but we might need a more formal bisect before we can attribute it to his patch.

It shouldn't be related to 4c170330aae4a4ed75c3a8638b7d4c5d9f365244. Btw, that commit fixes a race between dump-guest-mem and snapshot/migration; it's not related to a race between migration and snapshot, because I just don't understand how they can be triggered at all.

I cannot reproduce the crash on upstream qemu, and that matches my understanding because QMP/HMP share the same main thread on cmd execution, afaict. I don't understand how it even happened on rhel9.

Tingting, when the coredump triggered, could you share the stacks of all the threads? E.g., "thread apply all bt" in GDB with the core attached would work.

Meanwhile, can this be triggered with QMP only? Because AFAICT customers never use HMP, and it's not supported for RHEL either (it's only for debugging purposes).
If it must be reproduced with HMP+QMP then IMHO we can simply close this as WONTFIX. It won't hurt if you can still provide the full thread backtrace so we can consider fixing that upstream only, but it may not even be worth an explicit backport.

The reporter shared the host with me, and I found that it reproduces only with "-S". I can even consistently reproduce the crash upstream with the sequence below:

(qemu) migrate -d exec:cat>out
(qemu) migrate_cancel
(qemu) migrate -d exec:cat>out
(qemu) qemu-system-x86_64: ../softmmu/memory.c:2782: memory_global_dirty_log_start: Assertion `!(global_dirty_tracking & flags)' failed.
./bug.sh: line 42: 299010 Aborted sudo $bin -M q35,accel=kvm -smp 40 -m ${mem} -msg timestamp=on -S -name peter-vm,debug-threads=on -global migration.x-max-bandwidth=0 -qmp unix:/tmp/peter.qmp,server,nowait -nographic -nodefaults -monitor stdio -netdev user,id=net0,hostfwd=tcp::${port}-:22 -device virtio-net-pci,netdev=net0 $param $image

Hence not a race at all. Basically we called memory_global_dirty_log_start() twice without calling memory_global_dirty_log_do_stop(), due to commit 1931076077 ("migration: optimize the downtime", 2017-08-01). I'm now wondering how that commit helped migration downtime, because it delays qemu_savevm_state_cleanup(), which, iiuc, does not block the dest QEMU from running at all...

Then commit 63b41db4bc ("memory: make global_dirty_tracking a bitmask", 2021-11-01) added the assertion that checks for this, hence the crash. Before that we would just call the log_global_start hook twice.

Can we simply revert the trick in commit 1931076077?

Thanks Peter for debugging this issue. I could also reproduce with comment 0 and comment 7 on the latest qemu-kvm-6.2.0-5.el9.x86_64.

Adding the Regression keyword to this bug since I could reproduce the bug on qemu-kvm-6.2.0-1.el9.x86_64, but migration works well on qemu-kvm-6.1.0-8.el9.x86_64.
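The double start described above can be illustrated with a small Python model. This is only a sketch under stated assumptions, not QEMU code: `DirtyTracker`, `DIRTY_MIGRATION`, and `pending_stop` are hypothetical names, and the model assumes that the stop request is deferred while the guest is paused (consistent with the crash reproducing only with "-S"), so the tracking bit is still set when the second migrate starts.

```python
DIRTY_MIGRATION = 1 << 0  # hypothetical flag bit, for this sketch only


class DirtyTracker:
    """Toy model of a global dirty-tracking bitmask (not QEMU code)."""

    def __init__(self, vm_running):
        self.vm_running = vm_running
        self.tracking = 0       # stands in for global_dirty_tracking
        self.pending_stop = 0   # stop requests deferred while paused

    def start(self, flags):
        # Stands in for the failing check: a flag may not be started twice.
        if self.tracking & flags:
            raise RuntimeError("dirty log already started")
        self.tracking |= flags

    def stop(self, flags):
        # Assumption: while the guest is paused, the stop is only recorded.
        if not self.vm_running:
            self.pending_stop |= flags
            return
        self.tracking &= ~flags

    def resume(self):
        # Deferred stops take effect once the guest runs.
        self.vm_running = True
        self.tracking &= ~self.pending_stop
        self.pending_stop = 0


def migrate_cancel_migrate(vm_running):
    """Run migrate -> migrate_cancel -> migrate against the toy model."""
    tracker = DirtyTracker(vm_running)
    tracker.start(DIRTY_MIGRATION)      # first migrate
    tracker.stop(DIRTY_MIGRATION)       # migrate_cancel cleanup
    try:
        tracker.start(DIRTY_MIGRATION)  # second migrate
    except RuntimeError:
        return "double start"
    return "ok"
```

In this model the sequence only fails for a paused guest: `migrate_cancel_migrate(vm_running=False)` hits the double-start check, while `migrate_cancel_migrate(vm_running=True)` completes.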
Steps: do migration -> cancel migration while it is active -> restart migration (note: the guest is stopped while these steps are executed). The bug reproduces with both the exec and tcp protocols.

Posted a fix here:

https://lore.kernel.org/qemu-devel/20220207032622.19931-1-peterx@redhat.com

Xiaohui, could you try the build below to see whether it fixes the issue?

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=42836003

Since the patch also depends on commit 7b0538ed3a ("memory: Fix incorrect calls of log_global_start/stop"), the brew build has that too.

(In reply to Peter Xu from comment #10)
> Posted a fix here:
>
> https://lore.kernel.org/qemu-devel/20220207032622.19931-1-peterx@redhat.com
>
> Xiaohui, could you try below to see whether it fixes the issue?
>
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=42836003

The build fixed the bug. Thanks.

> Since the patch has a dependency of commit 7b0538ed3a ("memory: Fix
> incorrect calls of log_global_start/stop") too, the brew also has that.

Hi Peter,
We need exception+ or blocker+ for this bug if we want it to be fixed on rhel9 now.

QE bot (pre verify): Set 'Verified:Tested,SanityOnly' as gating/tier1 test pass.

Test passes on qemu-kvm-6.2.0-10.el9.x86_64; migration succeeds when:
do migration -> cancel migration while it is active -> restart migration (note: the guest is stopped while these steps are executed)

So marking this bug verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (new packages: qemu-kvm), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:2307
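For anyone who wants to try the migrate -> cancel -> migrate steps over QMP instead of HMP, a minimal Python sketch follows. It assumes a QEMU started with -S and a QMP unix socket such as the /tmp/peter.qmp one in the command line quoted earlier; `build_sequence` and `run_sequence` are hypothetical helper names, and the thread does not state whether a QMP-only run triggers the crash.

```python
import json
import socket


def build_sequence(uri):
    """QMP commands for migrate -> migrate_cancel -> migrate."""
    return [
        {"execute": "qmp_capabilities"},
        {"execute": "migrate", "arguments": {"uri": uri}},
        {"execute": "migrate_cancel"},
        {"execute": "migrate", "arguments": {"uri": uri}},
    ]


def run_sequence(sock_path, uri):
    """Send the sequence to a QMP unix socket and collect the replies."""
    replies = []
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(sock_path)
        stream = sock.makefile("rw", encoding="utf-8", newline="\n")
        json.loads(stream.readline())  # QMP greeting banner
        for cmd in build_sequence(uri):
            stream.write(json.dumps(cmd) + "\n")
            stream.flush()
            while True:
                line = stream.readline()
                if not line:
                    # Peer closed the socket, e.g. because QEMU exited.
                    raise ConnectionError("QMP connection closed")
                msg = json.loads(line)
                # Skip asynchronous events; keep the command reply.
                if "return" in msg or "error" in msg:
                    replies.append(msg)
                    break
    return replies
```

A run against the setup above would look like `run_sequence("/tmp/peter.qmp", "exec:cat>out")`. Whether the cancel lands while the migration is still active depends on timing, so this is a starting point rather than a guaranteed reproducer.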