Bug 1251487
| Summary: | qemu core dump when do drive mirror | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Xueqiang Wei <xuwei> |
| Component: | qemu-kvm-rhev | Assignee: | Kevin Wolf <kwolf> |
| Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 7.2 | CC: | famz, huding, jcody, juzhang, knoel, kwolf, lersek, mrezanin, pbonzini, pezhang, qiguo, qzhang, shuyu, stefanha, virt-maint, xfu, xuhan, xuwei |
| Target Milestone: | rc | Keywords: | Regression, TestBlocker |
| Target Release: | --- | ||
| Hardware: | All | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | qemu-kvm-rhev-2.3.0-22.el7 | Doc Type: | Bug Fix |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2015-12-04 16:53:22 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1246125, 1258737 | ||
>
>
> Additional info:
>
> I downgrade to qemu-kvm-rhev-2.3.0-12.el7.x86_64 to try, no core dump and
> works well.
Adding regression keywords according this comment.
Best Regards,
Junyi
seams same bug as 1249452 Have hit the same issue on ppc64le
The stack trace looks same:
Program terminated with signal 6, Aborted.
#0 0x00003fff9b91e578 in raise () from /lib64/power8/libc.so.6
#0 0x00003fff9b91e578 in raise () from /lib64/power8/libc.so.6
No symbol table info available.
#1 0x00003fff9b9206fc in abort () from /lib64/power8/libc.so.6
No symbol table info available.
#2 0x000000004aecfd38 in qemu_coroutine_enter (co=0x10003b9c900, opaque=0x0) at qemu-coroutine.c:111
self = <optimized out>
ret = <optimized out>
#3 0x000000004aed002c in qemu_co_queue_run_restart (co=0x10004a9e700) at qemu-coroutine-lock.c:59
next = <optimized out>
#4 0x000000004aecfc00 in qemu_coroutine_enter (co=0x10004a9e700, opaque=<optimized out>) at qemu-coroutine.c:118
self = <optimized out>
ret = COROUTINE_YIELD
...
(In reply to Qian Guo from comment #3) > seams same bug as 1249452 Reassigned to Kevin, who's taking care of Bug 1249452, for initial investigation. qemu-kvm-rhev-2.3.0-13.el7 includes the patches for two BZs that might be related to this. Both were backported by Fam and reviewed by Jeff, Stefan, Paolo and Laszlo. (All CCed) Bug 1242316: "block/mirror: Add "unmap" to drive-mirror" Bug 1233826: "block/mirror: Sleep periodically during bitmap scanning" Before I spend time digging into this, do any of you remember a part of these patches that seems likely to cause a bug like this (coroutine reentered recursively)? (In reply to Kevin Wolf from comment #7) > Bug 1233826: "block/mirror: Sleep periodically during bitmap scanning" I was a reviewer on this one. (downstream commit 51b5e87448f2fc4cefcb4adb2ee12c31c37332ff). I think I did not dig into block_job_sleep_ns() added by that patch, but now that call looks like a possible culprit to me. I don't know much about QEMU's coroutines (although they have been explained to me earlier; sorry about that). The direct block_job_sleep_ns() -> qemu_coroutine_yield() call seems okay. The block_job_sleep_ns() -> co_aio_sleep_ns() -> qemu_coroutine_yield() call chain, *with* the "reenter me later" timer, seems risky. I do realize it's a generic coroutine facility. Still... Sorry I have no better idea. In any case this should be a textually simple patch, and should be easy to revert (without bisection). Thanks, Laszlo. Actually, looking closer at that patch, I think it's unlikely to
be the culprit because we don't fail in the preparing loop, but only later.
I have a suspicion now, and the code that I don't like is the following lines
in mirror_iteration_done():
if (s->common.busy) {
qemu_coroutine_enter(s->common.co, NULL);
}
The reason why I don't like it is that - unlike most places that reenter - it
doesn't have a specific yield that it pairs with, but is more uncontrolled.
What we really mean here is "reenter the coroutine if it's in one of the four
explicit yields in mirror.c".
This used to be equivalent with s->common.busy (though not really nice to rely
on it) because neither mirror_run() nor mirror_iteration() call any function
that could yield. However since upstream commit dcfb3beb (downstream ea7c8b77,
"mirror: Do zero write on target if sectors not allocated") this doesn't hold
true any more: bdrv_get_block_status_above() can yield.
So what happens is that bdrv_get_block_status_above() wants to take a lock that
is already held, so it enters itself into the queue of waiting coroutines and
yields. Instead of being woken up by the unlock function, however, it gets
woken up by mirror_iteration_done(), which is obviously wrong.
In most cases the code actually happens to cope fairly well with such cases,
but in this specific case, the unlock must already have scheduled the coroutine
for wakeup when mirror_iteration_done() reentered it. And then the coroutine
happened to process the scheduled restarts and tried to reenter itself
recursively.
tl;dr: The reenter in mirror_iteration_done() must be paired with specific
yields instead of abusing s->common.busy.
Can you please test whether this scratch build fixes the problem? http://brewweb.devel.redhat.com/brew/taskinfo?taskID=9678693 *** Bug 1249452 has been marked as a duplicate of this bug. *** (In reply to Kevin Wolf from comment #10) > Can you please test whether this scratch build fixes the problem? > > http://brewweb.devel.redhat.com/brew/taskinfo?taskID=9678693 I tried 5 times using autotest script and manually, not hit this problem on this scratch build. for details as below: # uname -r 3.10.0-302.el7.x86_64 # rpm -qa|grep qemu qemu-kvm-rhev-debuginfo-2.3.0-14.el7.bz1251487.x86_64 qemu-kvm-rhev-2.3.0-14.el7.bz1251487.x86_64 qemu-kvm-common-rhev-2.3.0-14.el7.bz1251487.x86_64 qemu-img-rhev-2.3.0-14.el7.bz1251487.x86_64 qemu-kvm-tools-rhev-2.3.0-14.el7.bz1251487.x86_64 ipxe-roms-qemu-20130517-7.gitc4bce43.el7.noarch qemu-guest-agent-2.3.0-2.el7.x86_64 autotest result: (Result file: /home/autotest-devel/client/results/default/status) virt.qemu.smp_4.4096m.repeat1.run_test.Host_RHEL.7.2.qcow2.virtio_scsi.up.rtl8139.Win2012.x86_64.r2.io-github-autotest-qemu.drive_mirror.simple_test.mirroring.block_job_complete.target_in_localfs GOOD 226 completed successfully virt.qemu.smp_4.4096m.repeat2.run_test.Host_RHEL.7.2.qcow2.virtio_scsi.up.rtl8139.Win2012.x86_64.r2.io-github-autotest-qemu.drive_mirror.simple_test.mirroring.block_job_complete.target_in_localfs GOOD 311 completed successfully virt.qemu.smp_4.4096m.repeat3.run_test.Host_RHEL.7.2.qcow2.virtio_scsi.up.rtl8139.Win2012.x86_64.r2.io-github-autotest-qemu.drive_mirror.simple_test.mirroring.block_job_complete.target_in_localfs GOOD 313 completed successfully virt.qemu.smp_4.4096m.repeat4.run_test.Host_RHEL.7.2.qcow2.virtio_scsi.up.rtl8139.Win2012.x86_64.r2.io-github-autotest-qemu.drive_mirror.simple_test.mirroring.block_job_complete.target_in_localfs GOOD 296 completed successfully virt.qemu.smp_4.4096m.repeat5.run_test.Host_RHEL.7.2.qcow2.virtio_scsi.up.rtl8139.Win2012.x86_64.r2.io-github-autotest-qemu.drive_mirror.simple_test.mirroring.block_job_complete.target_in_localfs GOOD 316 completed successfully ---- GOOD 1525 Fix posted upstream: https://lists.gnu.org/archive/html/qemu-devel/2015-08/msg01524.html Now I'm waiting for the patch to arrive in master before I can backport it. Block verification of bug: Bug 1233826 - issueing drive-mirror command causes monitor unresponsive Bug 1238585 - drive-mirror has spurious failures with low 'granularity' values Kevin, Since this is a blocker for QE, can you provide a scratch build and also post to rhvirt-patches? I think I see it's accepted upstream now. Thanks! The new scratch build is http://brewweb.devel.redhat.com/brew/taskinfo?taskID=9775419 Fix included in qemu-kvm-rhev-2.3.0-22.el7 *** Bug 1258737 has been marked as a duplicate of this bug. *** Summary:The verification results have proved that this bug has been fixed.
Reproduced:
Host versions:
Kernel:3.10.0-314.el7.x86_64
qemu-kvm-rhev:qemu-kvm-rhev-2.3.0-14.el7.x86_64
Steps:
1.boot guest
/usr/libexec/qemu-kvm -name rhel7.2 -machine pc-i440fx-rhel7.2.0,accel=kvm \
-cpu SandyBridge -m 2G,slots=256,maxmem=40G -numa node \
-smp 4,sockets=2,cores=2,threads=1 \
-uuid 82b1a01e-5f6c-4f5f-8d27-3855a74e6b6b \
-netdev tap,id=hostnet0 \
-device virtio-net-pci,netdev=hostnet0,id=net0,mac=12:54:00:5c:88:6d \
-device qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,vgamem_mb=16 \
-spice port=5900,addr=0.0.0.0,disable-ticketing,image-compression=off,seamless-migration=on \
-monitor stdio \
-serial unix:/tmp/monitor,server,nowait \
-qmp tcp:0:1235,server,nowait \
-drive file=/home/rhel7.2_scsi.qcow2,if=none,id=drive-scsi-disk0,format=qcow2,cache=none,werror=stop,rerror=stop \
-device virtio-scsi-pci,id=scsi0 \
-device scsi-hd,drive=drive-scsi-disk0,bus=scsi0.0,scsi-id=0,lun=0,id=scsi-disk0 \
2. do block mirror
{"execute": "drive-mirror", "arguments": {"device": "drive-scsi-disk0", "mode": "absolute-paths", "format": "qcow2", "target": "/home/mirror0.qco2", "sync": "full"}, "id": "se9JM3ou"}
{"return": {}, "id": "se9JM3ou"}
3. qemu core dumped occurs
(qemu)
Formatting '/home/mirror0.qco2', fmt=qcow2 size=21474836480 encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16
Co-routine re-entered recursively
Aborted (core dumped)
Verified:
Host versions:
Kernel:3.10.0-314.el7.x86_64
qemu-kvm-rhev:qemu-kvm-rhev-2.3.0-22.el7.x86_64
Steps:
1.boot guest
/usr/libexec/qemu-kvm -name rhel7.2 -machine pc-i440fx-rhel7.2.0,accel=kvm \
-cpu SandyBridge -m 2G,slots=256,maxmem=40G -numa node \
-smp 4,sockets=2,cores=2,threads=1 \
-uuid 82b1a01e-5f6c-4f5f-8d27-3855a74e6b6b \
-netdev tap,id=hostnet0 \
-device virtio-net-pci,netdev=hostnet0,id=net0,mac=12:54:00:5c:88:6d \
-device qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,vgamem_mb=16 \
-spice port=5900,addr=0.0.0.0,disable-ticketing,image-compression=off,seamless-migration=on \
-monitor stdio \
-serial unix:/tmp/monitor,server,nowait \
-qmp tcp:0:1235,server,nowait \
-drive file=/home/rhel7.2_scsi.qcow2,if=none,id=drive-scsi-disk0,format=qcow2,cache=none,werror=stop,rerror=stop \
-device virtio-scsi-pci,id=scsi0 \
-device scsi-hd,drive=drive-scsi-disk0,bus=scsi0.0,scsi-id=0,lun=0,id=scsi-disk0 \
2. do block mirror
{"execute": "drive-mirror", "arguments": {"device": "drive-scsi-disk0", "mode": "absolute-paths", "format": "qcow2", "target": "/home/mirror1.qco2", "sync": "full"}, "id": "se9JM3ou"}
{"return": {}, "id": "se9JM3ou"}
{"timestamp": {"seconds": 1441600632, "microseconds": 143896}, "event": "BLOCK_JOB_READY", "data": {"device": "drive-scsi-disk0", "len": 1116602368, "offset": 1116602368, "speed": 0, "type": "mirror"}}
3.query mirror job
{"execute": "query-block-jobs", "id": "qAyA5fMW"}
{"return": [{"io-status": "ok", "device": "drive-scsi-disk0", "busy": false, "len": 1116864512, "offset": 1116864512, "paused": false, "speed": 0, "ready": true, "type": "mirror"}], "id": "qAyA5fMW"}
4. Qemu and guest both work well
(qemu) info block
drive-scsi-disk0: /home/rhel7.2_scsi.qcow2 (qcow2)
Cache mode: writeback, direct
ide1-cd0: [not inserted]
Removable device: not locked, tray closed
floppy0: [not inserted]
Removable device: not locked, tray closed
sd0: [not inserted]
Removable device: not locked, tray closed
(qemu) info status
VM status: running
So this bug has been fixed.
On PowerPC platform,do dd file in the guest and do block mirror at the same time,the file created successful and the block mirror finished successful,guest work well.
Host:
qemu-kvm-rhev-2.3.0-22.el7.ppc64le
kernel-3.10.0-313.el7.ppc64le
SLOF-20150313-3.gitc89b0df.el7.noarch
1.Boot the guest with cmd:
# gdb /usr/libexec/qemu-kvm
(gdb) r -name live-snapshot-test -machine pseries,accel=kvm,usb=off -m 8G -smp 4 -nodefaults -monitor stdio -rtc base=utc -drive file=RHEL-7.2-20150820.0-Server-ppc64-2.qcow2,format=qcow2,if=none,id=drive-virtio0,werror=stop,cache=none -device virtio-blk-pci,id=virtio0,drive=drive-virtio0 -device spapr-vscsi,id=scsi0,reg=0x1000 -drive file=RHEL-7.2-20150820.0-Server-ppc64-dvd1.iso,if=none,id=drive-scsi0,readonly=on,format=raw -device scsi-cd,bus=scsi0.0,channel=0,scsi-id=0,drive=drive-scsi0,id=scsi0-0 -netdev tap,id=hostnet0,script=/etc/qemu-ifup -device spapr-vlan,netdev=hostnet0,id=net0,mac=00:52:5f:5d:5c:5d -device usb-kbd,id=input0 -device usb-mouse,id=input1 -msg timestamp=on -usb -device usb-tablet,id=tablet1 -vga std -vnc 0.0.0.0:19 -uuid bf91fbbf-33a8-4bef-b69a-e591bb233795 -qmp tcp:0:4666,server,nowait
2.In the guest:
dd if=/dev/zero of=file bs=1M count=10240
3.During the step2
{"execute":"drive-mirror","arguments":{"device":"drive-virtio0","target":"/root/test_home/shuyu/8-26-livesnapshot/sn1","format":"qcow2","mode":"absolute-paths","sync":"full","speed":1000000000, "on-source-error": "stop", "on-target-error": "stop"}}
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-2546.html |
Description of problem: boot a guest win2012r2, and execute the below commands 1. {"execute": "drive-mirror", "arguments": {"device": "drive_image1", "mode": "absolute-paths", "format": "qcow2", "target": "/home/autotest-devel/client/tests/virt/shared/data/images/target1.qcow2", "sync": "full"}, "id": "se9JM3ou"} 2. {"execute": "query-block-jobs", "id": "qAyA5fMW"} qemu will core dump. I think it shoud be a regression bug. Version-Release number of selected component (if applicable): kernel-3.10.0-302.el7.x86_64 qemu-kvm-rhev-2.3.0-14.el7 How reproducible: I tried 10 times using autotest script and manually, happens every time. Steps to Reproduce: 1. boot a guest: /usr/libexec/qemu-kvm \ -S \ -name 'virt-tests-vm1' \ -sandbox off \ -machine pc \ -nodefaults \ -vga qxl \ -device intel-hda,bus=pci.0,addr=03 \ -device hda-duplex \ -chardev socket,id=qmp_id_qmp1,path=/tmp/monitor-qmp1-20150805-121007-yZR6ga5r,server,nowait \ -mon chardev=qmp_id_qmp1,mode=control \ -chardev socket,id=qmp_id_catch_monitor,path=/tmp/monitor-catch_monitor-20150805-121007-yZR6ga5r,server,nowait \ -mon chardev=qmp_id_catch_monitor,mode=control \ -chardev socket,id=serial_id_serial0,path=/tmp/serial-serial0-20150805-121007-yZR6ga5r,server,nowait \ -device isa-serial,chardev=serial_id_serial0 \ -device virtio-serial-pci,id=virtio_serial_pci0,bus=pci.0,addr=04 \ -chardev socket,id=devvs,path=/tmp/virtio_port-vs-20150805-121007-yZR6ga5r,server,nowait \ -device virtserialport,chardev=devvs,name=vs,id=vs,bus=virtio_serial_pci0.0 \ -chardev socket,id=seabioslog_id_20150805-121007-yZR6ga5r,path=/tmp/seabios-20150805-121007-yZR6ga5r,server,nowait \ -device isa-debugcon,chardev=seabioslog_id_20150805-121007-yZR6ga5r,iobase=0x402 \ -device nec-usb-xhci,id=usb1,bus=pci.0,addr=05 \ -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=06 \ -drive id=drive_image1,if=none,cache=none,snapshot=off,aio=native,format=qcow2,file=/home/autotest-devel/client/tests/virt/shared/data/images/win2012-64r2-virtio-scsi.qcow2 \ -device scsi-hd,id=image1,drive=drive_image1 \ -device rtl8139,mac=9a:6c:6d:6e:6f:70,id=id4irgrw,netdev=idfonZdP,bus=pci.0,addr=07 \ -netdev tap,id=idfonZdP,fd=23 \ -m 32768 \ -smp 16,maxcpus=16,cores=8,threads=1,sockets=2 \ -cpu 'Opteron_G4',+kvm_pv_unhalt,hv_spinlocks=0x1fff,hv_vapic,hv_time \ -drive id=drive_cd1,if=none,snapshot=off,aio=native,media=cdrom,file=/home/autotest-devel/client/tests/virt/shared/data/isos/windows/winutils.iso \ -device ide-cd,id=cd1,drive=drive_cd1,bus=ide.0,unit=0 \ -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \ -spice port=3000,password=123456,addr=0,tls-port=3200,x509-dir=/tmp/spice_x509d,tls-channel=main,tls-channel=inputs,image-compression=auto_glz,zlib-glz-wan-compression=auto,streaming-video=all,agent-mouse=on,playback-compression=on,ipv4 \ -rtc base=localtime,clock=host,driftfix=slew \ -boot order=cdn,once=c,menu=off,strict=off \ -qmp tcp:0:6667,server,nowait \ -enable-kvm 2. run the below commands: {"execute":"qmp_capabilities"} {"execute": "drive-mirror", "arguments": {"device": "drive_image1", "mode": "absolute-paths", "format": "qcow2", "target": "/home/autotest-devel/client/tests/virt/shared/data/images/target1.qcow2", "sync": "full"}, "id": "se9JM3ou"} 3. run the below commands: {"execute": "query-block-jobs", "id": "qAyA5fMW"} Actual results: after step 2: qemu output: Formatting '/home/autotest-devel/client/tests/virt/shared/data/images/target1.qcow2', fmt=qcow2 size=32212254720 encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16 after step 3: qemu output: Co-routine re-entered recursively qemu core dump: # gdb core (gdb) bt full #0 0x00007f4998d145d7 in raise () from /lib64/libc.so.6 No symbol table info available. #1 0x00007f4998d15cc8 in abort () from /lib64/libc.so.6 No symbol table info available. #2 0x00007f49a093b0bb in qemu_coroutine_enter (co=0x7f49a44b0480, opaque=0x0) at qemu-coroutine.c:111 self = <optimized out> ret = <optimized out> #3 0x00007f49a093b23a in qemu_co_queue_run_restart ( co=co@entry=0x7f49a3b14b40) at qemu-coroutine-lock.c:59 next = <optimized out> #4 0x00007f49a093afff in qemu_coroutine_enter (co=0x7f49a3b14b40, opaque=<optimized out>) at qemu-coroutine.c:118 self = <optimized out> ret = COROUTINE_YIELD #5 0x00007f49a092bb8c in bdrv_co_aio_rw_vector (bs=bs@entry=0x7f49a1f1c000, sector_num=sector_num@entry=25389824, qiov=qiov@entry=0x7f49a44a6928, nb_sectors=nb_sectors@entry=10880, flags=flags@entry=(unknown: 0), cb=cb@entry=0x7f49a096a240 <mirror_read_complete>, opaque=opaque@entry=0x7f49a44a6920, is_write=is_write@entry=false) at block.c:5025 co = <optimized out> acb = 0x7f49a37538e0 #6 0x00007f49a092e94c in bdrv_aio_readv (bs=bs@entry=0x7f49a1f1c000, sector_num=sector_num@entry=25389824, qiov=qiov@entry=0x7f49a44a6928, nb_sectors=nb_sectors@entry=10880, cb=cb@entry=0x7f49a096a240 <mirror_read_complete>, opaque=opaque@entry=0x7f49a44a6920) at block.c:4647 No locals. #7 0x00007f49a096af12 in mirror_iteration (s=0x7f49a3e4bc80) at block/mirror.c:306 source = 0x7f49a1f1c000 nb_chunks = <optimized out> end = <optimized out> hbitmap_next_sector = <optimized out> pnum = 2816 nb_sectors = 10880 sectors_per_chunk = <optimized out> sector_num = 25389824 next_chunk = <optimized out> next_sector = 25400704 op = 0x7f49a44a6920 ret = <optimized out> delay_ns = <optimized out> #8 mirror_run (opaque=0x7f49a3e4bc80) at block/mirror.c:518 delay_ns = 0 cnt = <optimized out> should_complete = <optimized out> s = 0x7f49a3e4bc80 data = <optimized out> bs = <optimized out> sector_num = <optimized out> end = <optimized out> sectors_per_chunk = <optimized out> length = <optimized out> last_pause_ns = <optimized out> bdi = {cluster_size = 0, vm_state_offset = 0, is_dirty = false, unallocated_blocks_are_zero = false, can_write_zeroes_with_unmap = false, needs_compressed_writes = false} backing_filename = "\000" ret = <optimized out> n = 128 __PRETTY_FUNCTION__ = "mirror_run" #9 0x00007f49a093b97a in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at coroutine-ucontext.c:80 self = 0x7f49a44b0480 co = 0x7f49a44b0480 #10 0x00007f4998d260f0 in ?? () from /lib64/libc.so.6 No symbol table info available. Expected results: no core dump, works well Additional info: I downgrade to qemu-kvm-rhev-2.3.0-12.el7.x86_64 to try, no core dump and works well.