Bug 2044818 - Qemu Core Dumped when migrate -> migrate_cancel -> migrate again during guest is paused
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: qemu-kvm
Version: 9.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Assignee: Peter Xu
QA Contact: Li Xiaohui
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-01-25 09:32 UTC by Tingting Mao
Modified: 2022-05-17 12:31 UTC (History)
CC: 14 users

Fixed In Version: qemu-kvm-6.2.0-10.el9
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-05-17 12:25:28 UTC
Type: Bug
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Gitlab redhat/centos-stream/src qemu-kvm merge_requests 77 0 None None None 2022-02-22 04:54:52 UTC
Red Hat Issue Tracker RHELPLAN-109501 0 None None None 2022-01-25 09:42:52 UTC
Red Hat Product Errata RHBA-2022:2307 0 None None None 2022-05-17 12:26:07 UTC

Description Tingting Mao 2022-01-25 09:32:31 UTC
Description of problem:
While reviewing commit 4c170330aae4a4ed75c3a8638b7d4c5d9f365244 and trying to test this scenario (migration + internal snapshot), QEMU core dumped.


Version-Release number of selected component (if applicable):
qemu-kvm-6.2.0-4.el9
kernel-5.14.0-39.el9.x86_64


How reproducible:
2/2


Steps to Reproduce:
1. Boot up src and dst guests on the same host
Src:
/usr/libexec/qemu-kvm \
    -S  \
    -name 'avocado-vt-vm1'  \
    -sandbox on  \
    -machine q35,memory-backend=mem-machine_mem \
    -device pcie-root-port,id=pcie-root-port-0,multifunction=on,bus=pcie.0,addr=0x1,chassis=1 \
    -device pcie-pci-bridge,id=pcie-pci-bridge-0,addr=0x0,bus=pcie-root-port-0  \
    -nodefaults \
    -device VGA,bus=pcie.0,addr=0x2 \
    -m 30720 \
    -object memory-backend-ram,size=30720M,id=mem-machine_mem  \
    -smp 20,maxcpus=20,cores=10,threads=1,dies=1,sockets=2  \
    -cpu 'Broadwell',+kvm_pv_unhalt \
    -chardev socket,id=qmp_id_catch_monitor,server=on,path=/tmp/avocado_chdzoghs/monitor-catch_monitor-20211222-043056-XUr3fyof,wait=off  \
    -mon chardev=qmp_id_catch_monitor,mode=control \
    -device pvpanic,ioport=0x505,id=idzArAFk \
    -chardev socket,id=chardev_serial0,server=on,path=/tmp/avocado_chdzoghs/serial-serial0-20211222-043056-XUr3fyof,wait=off \
    -device isa-serial,id=serial0,chardev=chardev_serial0  \
    -chardev socket,id=seabioslog_id_20211222-043056-XUr3fyof,path=/tmp/avocado_chdzoghs/seabios-20211222-043056-XUr3fyof,server=on,wait=off \
    -device isa-debugcon,chardev=seabioslog_id_20211222-043056-XUr3fyof,iobase=0x402 \
    -device pcie-root-port,id=pcie-root-port-1,port=0x1,addr=0x1.0x1,bus=pcie.0,chassis=2 \
    -device qemu-xhci,id=usb1,bus=pcie-root-port-1,addr=0x0 \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
    -device pcie-root-port,id=pcie-root-port-2,port=0x2,addr=0x1.0x2,bus=pcie.0,chassis=3 \
    -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pcie-root-port-2,addr=0x0 \
    -blockdev node-name=nvme_image1,driver=file,auto-read-only=on,discard=ignore,filename=RHEL-9.0-x86_64-latest.qcow2,cache.direct=off,cache.no-flush=off \
    -blockdev node-name=drive_image1,driver=qcow2,read-only=off,cache.direct=off,cache.no-flush=off,file=nvme_image1 \
    -device scsi-hd,id=image1,drive=drive_image1,write-cache=off \
    -device pcie-root-port,id=pcie-root-port-3,port=0x3,addr=0x1.0x3,bus=pcie.0,chassis=4 \
    -device virtio-net-pci,mac=9a:4a:05:2e:bd:42,id=idleb5Da,netdev=idwZT70w,bus=pcie-root-port-3,addr=0x0  \
    -netdev tap,id=idwZT70w,vhost=on \
    -vnc :0  \
    -rtc base=utc,clock=host,driftfix=slew  \
    -enable-kvm \
    -monitor stdio \
    -S \
    -chardev socket,server=on,path=/var/tmp/monitor-qmpmonitor1-20210721-024113-AsZ7KYro,id=qmp_id_qmpmonitor1,wait=off  \
    -mon chardev=qmp_id_qmpmonitor1,mode=control \

Dst:
/usr/libexec/qemu-kvm \
    -S  \
    -name 'avocado-vt-vm1'  \
    -sandbox on  \
    -machine q35,memory-backend=mem-machine_mem \
    -device pcie-root-port,id=pcie-root-port-0,multifunction=on,bus=pcie.0,addr=0x1,chassis=1 \
    -device pcie-pci-bridge,id=pcie-pci-bridge-0,addr=0x0,bus=pcie-root-port-0  \
    -nodefaults \
    -device VGA,bus=pcie.0,addr=0x2 \
    -m 30720 \
    -object memory-backend-ram,size=30720M,id=mem-machine_mem  \
    -smp 20,maxcpus=20,cores=10,threads=1,dies=1,sockets=2  \
    -cpu 'Broadwell',+kvm_pv_unhalt \
    -chardev socket,id=qmp_id_qmpmonitor1,server=on,path=/tmp/avocado_chdzoghs/monitor-qmpmonitor1-20211222-043056-XUr3fyof,wait=off  \
    -mon chardev=qmp_id_qmpmonitor1,mode=control \
    -chardev socket,id=qmp_id_catch_monitor,server=on,path=/tmp/avocado_chdzoghs/monitor-catch_monitor-20211222-043056-XUr3fyof,wait=off  \
    -mon chardev=qmp_id_catch_monitor,mode=control \
    -device pvpanic,ioport=0x505,id=idzArAFk \
    -chardev socket,id=chardev_serial0,server=on,path=/tmp/avocado_chdzoghs/serial-serial0-20211222-043056-XUr3fyof,wait=off \
    -device isa-serial,id=serial0,chardev=chardev_serial0  \
    -chardev socket,id=seabioslog_id_20211222-043056-XUr3fyof,path=/tmp/avocado_chdzoghs/seabios-20211222-043056-XUr3fyof,server=on,wait=off \
    -device isa-debugcon,chardev=seabioslog_id_20211222-043056-XUr3fyof,iobase=0x402 \
    -device pcie-root-port,id=pcie-root-port-1,port=0x1,addr=0x1.0x1,bus=pcie.0,chassis=2 \
    -device qemu-xhci,id=usb1,bus=pcie-root-port-1,addr=0x0 \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
    -device pcie-root-port,id=pcie-root-port-2,port=0x2,addr=0x1.0x2,bus=pcie.0,chassis=3 \
    -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pcie-root-port-2,addr=0x0 \
    -blockdev node-name=nvme_image1,driver=file,auto-read-only=on,discard=ignore,filename=RHEL-9.0-x86_64-latest.qcow2,cache.direct=off,cache.no-flush=off \
    -blockdev node-name=drive_image1,driver=qcow2,read-only=off,cache.direct=off,cache.no-flush=off,file=nvme_image1 \
    -device scsi-hd,id=image1,drive=drive_image1,write-cache=off \
    -device pcie-root-port,id=pcie-root-port-3,port=0x3,addr=0x1.0x3,bus=pcie.0,chassis=4 \
    -device virtio-net-pci,mac=9a:4a:05:2e:bd:42,id=idleb5Da,netdev=idwZT70w,bus=pcie-root-port-3,addr=0x0  \
    -netdev tap,id=idwZT70w,vhost=on \
    -vnc :1  \
    -rtc base=utc,clock=host,driftfix=slew  \
    -enable-kvm \
    -monitor stdio \
    -S -incoming defer

2. Set migration incoming on dst
(qemu) migrate_incoming tcp:[::]:5800

3. Create an internal snapshot on src via QMP
# nc -U monitor-qmpmonitor1-20210721-024113-AsZ7KYro
{"QMP": {"version": {"qemu": {"micro": 0, "minor": 2, "major": 6}, "package": "qemu-kvm-6.2.0-4.el9"}, "capabilities": ["oob"]}}
{"execute": "qmp_capabilities"}
{"return": {}}
{"execute":"human-monitor-command","arguments":{"command-line":"savevm sn2"}}
{"return": ""}
 
4. At the same time as step 3, start migration on src via HMP
(qemu) migrate -d tcp:localhost:5800
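Steps 3 and 4 can also be driven programmatically over the QMP socket. Below is a minimal sketch (Python; the helper names are illustrative and the socket path is the one from the src command line above, so adjust for your host):

```python
import json
import socket

def qmp_command(execute, arguments=None):
    """Build the JSON line for a single QMP command."""
    cmd = {"execute": execute}
    if arguments:
        cmd["arguments"] = arguments
    return json.dumps(cmd)

def hmp_via_qmp(command_line):
    """Wrap an HMP command (e.g. 'savevm sn2') in QMP's
    human-monitor-command, as done manually with nc in step 3."""
    return qmp_command("human-monitor-command",
                       {"command-line": command_line})

def qmp_session(path, *commands):
    """Connect to a QMP unix socket, negotiate capabilities,
    then send each command; returns the raw reply lines."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect(path)
    f = sock.makefile("rw")
    f.readline()  # consume the greeting banner
    replies = []
    for cmd in (qmp_command("qmp_capabilities"),) + commands:
        f.write(cmd + "\n")
        f.flush()
        replies.append(f.readline())
    return replies

# To replay steps 3-4 against the source VM (path is host-specific):
#   qmp_session("/var/tmp/monitor-qmpmonitor1-20210721-024113-AsZ7KYro",
#               hmp_via_qmp("savevm sn2"),
#               hmp_via_qmp("migrate -d tcp:localhost:5800"))
```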


Actual results:
Qemu core dumped after step4
(qemu) qemu-kvm: ../softmmu/memory.c:2782: void memory_global_dirty_log_start(unsigned int): Assertion `!(global_dirty_tracking & flags)' failed.
qemu.sh: line 39:  4674 Aborted                 (core dumped) /usr/libexec/qemu-kvm -S -name 'avocado-vt-vm1' -sandbox on -machine q35,memory-backend=mem-machine_mem -device pcie-root-port,id=pcie-root-port-0,multifunction=on,bus=pcie.0,addr=0x1,chassis=1 -device pcie-pci-bridge,id=pcie-pci-bridge-0,addr=0x0,bus=pcie-root-port-0 -nodefaults -device VGA,bus=pcie.0,addr=0x2 -m 30720 -object memory-backend-ram,size=30720M,id=mem-machine_mem -smp 20,maxcpus=20,cores=10,threads=1,dies=1,sockets=2 -cpu 'Broadwell',+kvm_pv_unhalt -chardev socket,id=qmp_id_catch_monitor,server=on,path=/tmp/avocado_chdzoghs/monitor-catch_monitor-20211222-043056-XUr3fyof,wait=off -mon chardev=qmp_id_catch_monitor,mode=control -device pvpanic,ioport=0x505,id=idzArAFk -chardev socket,id=chardev_serial0,server=on,path=/tmp/avocado_chdzoghs/serial-serial0-20211222-043056-XUr3fyof,wait=off -device isa-serial,id=serial0,chardev=chardev_serial0 -chardev socket,id=seabioslog_id_20211222-043056-XUr3fyof,path=/tmp/avocado_chdzoghs/seabios-20211222-043056-XUr3fyof,server=on,wait=off -device isa-debugcon,chardev=seabioslog_id_20211222-043056-XUr3fyof,iobase=0x402 -device pcie-root-port,id=pcie-root-port-1,port=0x1,addr=0x1.0x1,bus=pcie.0,chassis=2 -device qemu-xhci,id=usb1,bus=pcie-root-port-1,addr=0x0 -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 -device pcie-root-port,id=pcie-root-port-2,port=0x2,addr=0x1.0x2,bus=pcie.0,chassis=3 -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pcie-root-port-2,addr=0x0 -blockdev node-name=nvme_image1,driver=file,auto-read-only=on,discard=ignore,filename=$1,cache.direct=off,cache.no-flush=off -blockdev node-name=drive_image1,driver=qcow2,read-only=off,cache.direct=off,cache.no-flush=off,file=nvme_image1 -device scsi-hd,id=image1,drive=drive_image1,write-cache=off -device pcie-root-port,id=pcie-root-port-3,port=0x3,addr=0x1.0x3,bus=pcie.0,chassis=4 -device virtio-net-pci,mac=9a:4a:05:2e:bd:42,id=idleb5Da,netdev=idwZT70w,bus=pcie-root-port-3,addr=0x0 -netdev 
tap,id=idwZT70w,vhost=on -vnc :0 -rtc base=utc,clock=host,driftfix=slew -enable-kvm -monitor stdio -S -chardev socket,server=on,path=/var/tmp/monitor-qmpmonitor1-20210721-024113-AsZ7KYro,id=qmp_id_qmpmonitor1,wait=off -mon chardev=qmp_id_qmpmonitor1,mode=control


Expected results:
No core dump; instead the operation should fail gracefully with an informative error message.


Additional info:
(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1  0x00007f6915fe8863 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at pthread_kill.c:78
#2  0x00007f6915f9b676 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007f6915f857d3 in __GI_abort () at abort.c:79
#4  0x00007f6915f856fb in __assert_fail_base (fmt=<optimized out>, assertion=<optimized out>, file=<optimized out>, line=<optimized out>, function=<optimized out>) at assert.c:92
#5  0x00007f6915f94396 in __GI___assert_fail (assertion=0x560870953138 "!(global_dirty_tracking & flags)", file=0x560870952666 "../softmmu/memory.c", line=2782, 
    function=0x560870953107 "void memory_global_dirty_log_start(unsigned int)") at assert.c:101
#6  0x00005608705717f0 in memory_global_dirty_log_start (flags=1) at ../softmmu/memory.c:2782
#7  0x00005608705973ec in ram_save_setup (f=0x560871c163d0, opaque=0x560870f7d838 <ram_state>) at ../migration/ram.c:2865
#8  0x00005608703466b7 in qemu_savevm_state_setup (f=0x560871c163d0) at ../migration/savevm.c:1217
#9  0x000056087033b7b5 in migration_thread (opaque=0x560871be4c00) at ../migration/migration.c:3817
#10 0x00005608708cca7a in qemu_thread_start (args=0x560872099560) at ../util/qemu-thread-posix.c:556
#11 0x00007f6915fe6aaf in start_thread (arg=<optimized out>) at pthread_create.c:435
#12 0x00007f691606b740 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Comment 2 Li Xiaohui 2022-01-25 10:31:29 UTC
We started the migration + internal snapshot test based on the commit below:
*******************************************************************
commit 4c170330aae4a4ed75c3a8638b7d4c5d9f365244
Author: Peter Xu <peterx>
Date:   Wed Sep 22 12:20:07 2021 -0400

    migration: Make migration blocker work for snapshots too
    
    save_snapshot() checks migration blocker, which looks sane.  At the meantime we
    should also teach the blocker add helper to fail if during a snapshot, just
    like for migrations.
    
    Reviewed-by: Marc-André Lureau <marcandre.lureau>
    Signed-off-by: Peter Xu <peterx>
    Reviewed-by: Juan Quintela <quintela>
    Signed-off-by: Juan Quintela <quintela>

Comment 3 Klaus Heinrich Kiwi 2022-01-27 17:29:23 UTC
I'll lower the severity to medium considering this is an internal-snapshot scenario, but this is also likely a regression (please add the keyword if confirmed). Do you have information about the last version in which this worked?

Also tagging @peterx if he has any insights, but we might need a more formal bisect before we can attribute it to his patch.

Comment 5 Peter Xu 2022-01-28 07:14:22 UTC
It shouldn't be related to 4c170330aae4a4ed75c3a8638b7d4c5d9f365244. Btw, that commit fixes a race between dump-guest-mem and snapshot/migration; it's not related to a race between migration and snapshot, because I just don't understand how those could be triggered concurrently at all..

I cannot reproduce the crash on upstream QEMU, and that matches my understanding, because QMP/HMP share the same main thread for command execution, afaict. I don't understand how it even happened on RHEL 9.

Tingting, when the coredump triggers, could you share the stacks of all the threads? E.g., "thread apply all bt" in GDB with the core attached would work.

Meanwhile, can this be triggered with QMP only? Because AFAICT customers never use HMP, and it's not supported for RHEL either (it's only for debugging purposes). If it can only be reproduced with HMP+QMP, then IMHO we can simply close this as WONTFIX. It won't hurt if you can still provide the full thread backtrace so we can consider fixing it upstream only, but it may not even be worth an explicit backport.

Comment 7 Peter Xu 2022-01-28 11:41:45 UTC
The reporter shared the host with me, and I found that it reproduces only with "-S".

I can even consistently reproduce the crash upstream with the sequence below:

(qemu) migrate -d exec:cat>out
(qemu) migrate_cancel 
(qemu) migrate -d exec:cat>out
(qemu) qemu-system-x86_64: ../softmmu/memory.c:2782: memory_global_dirty_log_start: Assertion `!(global_dirty_tracking & flags)' failed.
./bug.sh: line 42: 299010 Aborted                 sudo $bin -M q35,accel=kvm -smp 40 -m ${mem} -msg timestamp=on -S -name peter-vm,debug-threads=on -global migration.x-max-bandwidth=0 -qmp unix:/tmp/peter.qmp,server,nowait -nographic -nodefaults -monitor stdio -netdev user,id=net0,hostfwd=tcp::${port}-:22 -device virtio-net-pci,netdev=net0 $param $image

Hence not a race at all.

Basically we called memory_global_dirty_log_start() twice without calling memory_global_dirty_log_do_stop() in between, due to commit 1931076077 ("migration: optimize the downtime", 2017-08-01). I'm now wondering how that commit helped migration downtime, because it delays qemu_savevm_state_cleanup(), which, iiuc, does not block dest QEMU from running at all...

Then commit 63b41db4bc ("memory: make global_dirty_tracking a bitmask", 2021-11-01) added the assertion to check for this, hence the crash. Before that commit we would just call the log_global_start hook twice.

Can we simply revert the trick in commit 1931076077?
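The failure mode described above can be modeled in a few lines. This is a toy Python sketch of the bitmask logic in softmmu/memory.c (the names mirror QEMU's, but this is not QEMU's actual C code):

```python
GLOBAL_DIRTY_MIGRATION = 1  # the flag ram_save_setup() starts with

class MemoryModel:
    """Toy model of the global_dirty_tracking bitmask introduced
    by commit 63b41db4bc."""
    def __init__(self):
        self.global_dirty_tracking = 0

    def dirty_log_start(self, flags):
        # The assertion added by 63b41db4bc: starting tracking for a
        # flag that is already set aborts QEMU.
        assert not (self.global_dirty_tracking & flags)
        self.global_dirty_tracking |= flags

    def dirty_log_stop(self, flags):
        self.global_dirty_tracking &= ~flags

mem = MemoryModel()

# First migration: ram_save_setup() starts dirty logging.
mem.dirty_log_start(GLOBAL_DIRTY_MIGRATION)

# migrate_cancel: because commit 1931076077 deferred
# qemu_savevm_state_cleanup(), dirty_log_stop() is never reached
# while the guest stays paused (-S), so the flag remains set.

# Second migration: ram_save_setup() starts again, tripping the
# assertion, which matches the abort in the report.
try:
    mem.dirty_log_start(GLOBAL_DIRTY_MIGRATION)
    crashed = False
except AssertionError:
    crashed = True
```

With a properly paired dirty_log_stop() between the two migrations, the second dirty_log_start() succeeds, which is the behavior the proposed fix restores.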

Comment 8 Li Xiaohui 2022-01-29 04:18:31 UTC
Thanks, Peter, for debugging this issue.

I could also reproduce with the steps from comment 0 and comment 7 on the latest qemu-kvm-6.2.0-5.el9.x86_64.

Comment 9 Li Xiaohui 2022-01-29 07:41:32 UTC
Added the Regression keyword to this bug, since I could reproduce the bug on qemu-kvm-6.2.0-1.el9.x86_64 while migration works well on qemu-kvm-6.1.0-8.el9.x86_64.


Migrate -> cancel the migration while active -> migrate again (note: the guest is stopped while executing these steps) reproduces the bug regardless of whether the exec or tcp protocol is used.

Comment 10 Peter Xu 2022-02-07 03:48:58 UTC
Posted a fix here:

https://lore.kernel.org/qemu-devel/20220207032622.19931-1-peterx@redhat.com

Xiaohui, could you try the build below to see whether it fixes the issue?

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=42836003

Since the patch also depends on commit 7b0538ed3a ("memory: Fix incorrect calls of log_global_start/stop"), the brew build includes that commit as well.

Comment 11 Li Xiaohui 2022-02-07 08:38:24 UTC
(In reply to Peter Xu from comment #10)
> Posted a fix here:
> 
> https://lore.kernel.org/qemu-devel/20220207032622.19931-1-peterx@redhat.com
> 
> Xiaohui, could you try below to see whether it fixes the issue?
> 
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=42836003

The build fixed bug. Thanks.

> 
> Since the patch has a dependency of commit 7b0538ed3a ("memory: Fix
> incorrect calls of log_global_start/stop") too, the brew also has that.

Comment 12 Li Xiaohui 2022-02-24 07:47:09 UTC
Hi Peter,
We need exception+ or blocker+ for this bug if we want it to be fixed in RHEL 9 now.

Comment 14 Yanan Fu 2022-02-25 03:54:37 UTC
QE bot(pre verify): Set 'Verified:Tested,SanityOnly' as gating/tier1 test pass.

Comment 17 Li Xiaohui 2022-02-25 09:53:47 UTC
Test passed on qemu-kvm-6.2.0-10.el9.x86_64; migration succeeds when doing:
migrate -> cancel the migration while active -> migrate again (note: the guest is stopped while executing these steps)

So marking this bug as verified.

Comment 19 errata-xmlrpc 2022-05-17 12:25:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (new packages: qemu-kvm), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:2307

