Bug 1251487
Summary: | qemu core dump when do drive mirror | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Xueqiang Wei <xuwei> |
Component: | qemu-kvm-rhev | Assignee: | Kevin Wolf <kwolf> |
Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 7.2 | CC: | famz, huding, jcody, juzhang, knoel, kwolf, lersek, mrezanin, pbonzini, pezhang, qiguo, qzhang, shuyu, stefanha, virt-maint, xfu, xuhan, xuwei |
Target Milestone: | rc | Keywords: | Regression, TestBlocker |
Target Release: | --- | ||
Hardware: | All | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | qemu-kvm-rhev-2.3.0-22.el7 | Doc Type: | Bug Fix |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2015-12-04 16:53:22 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1246125, 1258737 |
Description
Xueqiang Wei
2015-08-07 13:01:47 UTC
>
>
> Additional info:
>
> I downgrade to qemu-kvm-rhev-2.3.0-12.el7.x86_64 to try, no core dump and
> works well.
Adding regression keywords according this comment.
Best Regards,
Junyi
seams same bug as 1249452 Have hit the same issue on ppc64le The stack trace looks same: Program terminated with signal 6, Aborted. #0 0x00003fff9b91e578 in raise () from /lib64/power8/libc.so.6 #0 0x00003fff9b91e578 in raise () from /lib64/power8/libc.so.6 No symbol table info available. #1 0x00003fff9b9206fc in abort () from /lib64/power8/libc.so.6 No symbol table info available. #2 0x000000004aecfd38 in qemu_coroutine_enter (co=0x10003b9c900, opaque=0x0) at qemu-coroutine.c:111 self = <optimized out> ret = <optimized out> #3 0x000000004aed002c in qemu_co_queue_run_restart (co=0x10004a9e700) at qemu-coroutine-lock.c:59 next = <optimized out> #4 0x000000004aecfc00 in qemu_coroutine_enter (co=0x10004a9e700, opaque=<optimized out>) at qemu-coroutine.c:118 self = <optimized out> ret = COROUTINE_YIELD ... (In reply to Qian Guo from comment #3) > seams same bug as 1249452 Reassigned to Kevin, who's taking care of Bug 1249452, for initial investigation. qemu-kvm-rhev-2.3.0-13.el7 includes the patches for two BZs that might be related to this. Both were backported by Fam and reviewed by Jeff, Stefan, Paolo and Laszlo. (All CCed) Bug 1242316: "block/mirror: Add "unmap" to drive-mirror" Bug 1233826: "block/mirror: Sleep periodically during bitmap scanning" Before I spend time digging into this, do any of you remember a part of these patches that seems likely to cause a bug like this (coroutine reentered recursively)? (In reply to Kevin Wolf from comment #7) > Bug 1233826: "block/mirror: Sleep periodically during bitmap scanning" I was a reviewer on this one. (downstream commit 51b5e87448f2fc4cefcb4adb2ee12c31c37332ff). I think I did not dig into block_job_sleep_ns() added by that patch, but now that call looks like a possible culprit to me. I don't know much about QEMU's coroutines (although they have been explained to me earlier; sorry about that). The direct block_job_sleep_ns() -> qemu_coroutine_yield() call seems okay. The block_job_sleep_ns() -> co_aio_sleep_ns() -> qemu_coroutine_yield() call chain, *with* the "reenter me later" timer, seems risky. I do realize it's a generic coroutine facility. Still... Sorry I have no better idea. In any case this should be a textually simple patch, and should be easy to revert (without bisection). Thanks, Laszlo. Actually, looking closer at that patch, I think it's unlikely to be the culprit because we don't fail in the preparing loop, but only later. I have a suspicion now, and the code that I don't like is the following lines in mirror_iteration_done(): if (s->common.busy) { qemu_coroutine_enter(s->common.co, NULL); } The reason why I don't like it is that - unlike most places that reenter - it doesn't have a specific yield that it pairs with, but is more uncontrolled. What we really mean here is "reenter the coroutine if it's in one of the four explicit yields in mirror.c". This used to be equivalent with s->common.busy (though not really nice to rely on it) because neither mirror_run() nor mirror_iteration() call any function that could yield. However since upstream commit dcfb3beb (downstream ea7c8b77, "mirror: Do zero write on target if sectors not allocated") this doesn't hold true any more: bdrv_get_block_status_above() can yield. So what happens is that bdrv_get_block_status_above() wants to take a lock that is already held, so it enters itself into the queue of waiting coroutines and yields. Instead of being woken up by the unlock function, however, it gets woken up by mirror_iteration_done(), which is obviously wrong. In most cases the code actually happens to cope fairly well with such cases, but in this specific case, the unlock must already have scheduled the coroutine for wakeup when mirror_iteration_done() reentered it. And then the coroutine happened to process the scheduled restarts and tried to reenter itself recursively. tl;dr: The reenter in mirror_iteration_done() must be paired with specific yields instead of abusing s->common.busy. Can you please test whether this scratch build fixes the problem? http://brewweb.devel.redhat.com/brew/taskinfo?taskID=9678693 *** Bug 1249452 has been marked as a duplicate of this bug. *** (In reply to Kevin Wolf from comment #10) > Can you please test whether this scratch build fixes the problem? > > http://brewweb.devel.redhat.com/brew/taskinfo?taskID=9678693 I tried 5 times using autotest script and manually, not hit this problem on this scratch build. for details as below: # uname -r 3.10.0-302.el7.x86_64 # rpm -qa|grep qemu qemu-kvm-rhev-debuginfo-2.3.0-14.el7.bz1251487.x86_64 qemu-kvm-rhev-2.3.0-14.el7.bz1251487.x86_64 qemu-kvm-common-rhev-2.3.0-14.el7.bz1251487.x86_64 qemu-img-rhev-2.3.0-14.el7.bz1251487.x86_64 qemu-kvm-tools-rhev-2.3.0-14.el7.bz1251487.x86_64 ipxe-roms-qemu-20130517-7.gitc4bce43.el7.noarch qemu-guest-agent-2.3.0-2.el7.x86_64 autotest result: (Result file: /home/autotest-devel/client/results/default/status) virt.qemu.smp_4.4096m.repeat1.run_test.Host_RHEL.7.2.qcow2.virtio_scsi.up.rtl8139.Win2012.x86_64.r2.io-github-autotest-qemu.drive_mirror.simple_test.mirroring.block_job_complete.target_in_localfs GOOD 226 completed successfully virt.qemu.smp_4.4096m.repeat2.run_test.Host_RHEL.7.2.qcow2.virtio_scsi.up.rtl8139.Win2012.x86_64.r2.io-github-autotest-qemu.drive_mirror.simple_test.mirroring.block_job_complete.target_in_localfs GOOD 311 completed successfully virt.qemu.smp_4.4096m.repeat3.run_test.Host_RHEL.7.2.qcow2.virtio_scsi.up.rtl8139.Win2012.x86_64.r2.io-github-autotest-qemu.drive_mirror.simple_test.mirroring.block_job_complete.target_in_localfs GOOD 313 completed successfully virt.qemu.smp_4.4096m.repeat4.run_test.Host_RHEL.7.2.qcow2.virtio_scsi.up.rtl8139.Win2012.x86_64.r2.io-github-autotest-qemu.drive_mirror.simple_test.mirroring.block_job_complete.target_in_localfs GOOD 296 completed successfully virt.qemu.smp_4.4096m.repeat5.run_test.Host_RHEL.7.2.qcow2.virtio_scsi.up.rtl8139.Win2012.x86_64.r2.io-github-autotest-qemu.drive_mirror.simple_test.mirroring.block_job_complete.target_in_localfs GOOD 316 completed successfully ---- GOOD 1525 Fix posted upstream: https://lists.gnu.org/archive/html/qemu-devel/2015-08/msg01524.html Now I'm waiting for the patch to arrive in master before I can backport it. Block verification of bug: Bug 1233826 - issueing drive-mirror command causes monitor unresponsive Bug 1238585 - drive-mirror has spurious failures with low 'granularity' values Kevin, Since this is a blocker for QE, can you provide a scratch build and also post to rhvirt-patches? I think I see it's accepted upstream now. Thanks! The new scratch build is http://brewweb.devel.redhat.com/brew/taskinfo?taskID=9775419 Fix included in qemu-kvm-rhev-2.3.0-22.el7 *** Bug 1258737 has been marked as a duplicate of this bug. *** Summary:The verification results have proved that this bug has been fixed. Reproduced: Host versions: Kernel:3.10.0-314.el7.x86_64 qemu-kvm-rhev:qemu-kvm-rhev-2.3.0-14.el7.x86_64 Steps: 1.boot guest /usr/libexec/qemu-kvm -name rhel7.2 -machine pc-i440fx-rhel7.2.0,accel=kvm \ -cpu SandyBridge -m 2G,slots=256,maxmem=40G -numa node \ -smp 4,sockets=2,cores=2,threads=1 \ -uuid 82b1a01e-5f6c-4f5f-8d27-3855a74e6b6b \ -netdev tap,id=hostnet0 \ -device virtio-net-pci,netdev=hostnet0,id=net0,mac=12:54:00:5c:88:6d \ -device qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,vgamem_mb=16 \ -spice port=5900,addr=0.0.0.0,disable-ticketing,image-compression=off,seamless-migration=on \ -monitor stdio \ -serial unix:/tmp/monitor,server,nowait \ -qmp tcp:0:1235,server,nowait \ -drive file=/home/rhel7.2_scsi.qcow2,if=none,id=drive-scsi-disk0,format=qcow2,cache=none,werror=stop,rerror=stop \ -device virtio-scsi-pci,id=scsi0 \ -device scsi-hd,drive=drive-scsi-disk0,bus=scsi0.0,scsi-id=0,lun=0,id=scsi-disk0 \ 2. do block mirror {"execute": "drive-mirror", "arguments": {"device": "drive-scsi-disk0", "mode": "absolute-paths", "format": "qcow2", "target": "/home/mirror0.qco2", "sync": "full"}, "id": "se9JM3ou"} {"return": {}, "id": "se9JM3ou"} 3. qemu core dumped occurs (qemu) Formatting '/home/mirror0.qco2', fmt=qcow2 size=21474836480 encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16 Co-routine re-entered recursively Aborted (core dumped) Verified: Host versions: Kernel:3.10.0-314.el7.x86_64 qemu-kvm-rhev:qemu-kvm-rhev-2.3.0-22.el7.x86_64 Steps: 1.boot guest /usr/libexec/qemu-kvm -name rhel7.2 -machine pc-i440fx-rhel7.2.0,accel=kvm \ -cpu SandyBridge -m 2G,slots=256,maxmem=40G -numa node \ -smp 4,sockets=2,cores=2,threads=1 \ -uuid 82b1a01e-5f6c-4f5f-8d27-3855a74e6b6b \ -netdev tap,id=hostnet0 \ -device virtio-net-pci,netdev=hostnet0,id=net0,mac=12:54:00:5c:88:6d \ -device qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,vgamem_mb=16 \ -spice port=5900,addr=0.0.0.0,disable-ticketing,image-compression=off,seamless-migration=on \ -monitor stdio \ -serial unix:/tmp/monitor,server,nowait \ -qmp tcp:0:1235,server,nowait \ -drive file=/home/rhel7.2_scsi.qcow2,if=none,id=drive-scsi-disk0,format=qcow2,cache=none,werror=stop,rerror=stop \ -device virtio-scsi-pci,id=scsi0 \ -device scsi-hd,drive=drive-scsi-disk0,bus=scsi0.0,scsi-id=0,lun=0,id=scsi-disk0 \ 2. do block mirror {"execute": "drive-mirror", "arguments": {"device": "drive-scsi-disk0", "mode": "absolute-paths", "format": "qcow2", "target": "/home/mirror1.qco2", "sync": "full"}, "id": "se9JM3ou"} {"return": {}, "id": "se9JM3ou"} {"timestamp": {"seconds": 1441600632, "microseconds": 143896}, "event": "BLOCK_JOB_READY", "data": {"device": "drive-scsi-disk0", "len": 1116602368, "offset": 1116602368, "speed": 0, "type": "mirror"}} 3.query mirror job {"execute": "query-block-jobs", "id": "qAyA5fMW"} {"return": [{"io-status": "ok", "device": "drive-scsi-disk0", "busy": false, "len": 1116864512, "offset": 1116864512, "paused": false, "speed": 0, "ready": true, "type": "mirror"}], "id": "qAyA5fMW"} 4. Qemu and guest both work well (qemu) info block drive-scsi-disk0: /home/rhel7.2_scsi.qcow2 (qcow2) Cache mode: writeback, direct ide1-cd0: [not inserted] Removable device: not locked, tray closed floppy0: [not inserted] Removable device: not locked, tray closed sd0: [not inserted] Removable device: not locked, tray closed (qemu) info status VM status: running So this bug has been fixed. On PowerPC platform,do dd file in the guest and do block mirror at the same time,the file created successful and the block mirror finished successful,guest work well. Host: qemu-kvm-rhev-2.3.0-22.el7.ppc64le kernel-3.10.0-313.el7.ppc64le SLOF-20150313-3.gitc89b0df.el7.noarch 1.Boot the guest with cmd: # gdb /usr/libexec/qemu-kvm (gdb) r -name live-snapshot-test -machine pseries,accel=kvm,usb=off -m 8G -smp 4 -nodefaults -monitor stdio -rtc base=utc -drive file=RHEL-7.2-20150820.0-Server-ppc64-2.qcow2,format=qcow2,if=none,id=drive-virtio0,werror=stop,cache=none -device virtio-blk-pci,id=virtio0,drive=drive-virtio0 -device spapr-vscsi,id=scsi0,reg=0x1000 -drive file=RHEL-7.2-20150820.0-Server-ppc64-dvd1.iso,if=none,id=drive-scsi0,readonly=on,format=raw -device scsi-cd,bus=scsi0.0,channel=0,scsi-id=0,drive=drive-scsi0,id=scsi0-0 -netdev tap,id=hostnet0,script=/etc/qemu-ifup -device spapr-vlan,netdev=hostnet0,id=net0,mac=00:52:5f:5d:5c:5d -device usb-kbd,id=input0 -device usb-mouse,id=input1 -msg timestamp=on -usb -device usb-tablet,id=tablet1 -vga std -vnc 0.0.0.0:19 -uuid bf91fbbf-33a8-4bef-b69a-e591bb233795 -qmp tcp:0:4666,server,nowait 2.In the guest: dd if=/dev/zero of=file bs=1M count=10240 3.During the step2 {"execute":"drive-mirror","arguments":{"device":"drive-virtio0","target":"/root/test_home/shuyu/8-26-livesnapshot/sn1","format":"qcow2","mode":"absolute-paths","sync":"full","speed":1000000000, "on-source-error": "stop", "on-target-error": "stop"}} Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-2546.html |