Description of problem:
Boot a win2012r2 guest and execute the commands below:

1. {"execute": "drive-mirror", "arguments": {"device": "drive_image1", "mode": "absolute-paths", "format": "qcow2", "target": "/home/autotest-devel/client/tests/virt/shared/data/images/target1.qcow2", "sync": "full"}, "id": "se9JM3ou"}
2. {"execute": "query-block-jobs", "id": "qAyA5fMW"}

qemu will core dump. I think this is a regression.

Version-Release number of selected component (if applicable):
kernel-3.10.0-302.el7.x86_64
qemu-kvm-rhev-2.3.0-14.el7

How reproducible:
I tried 10 times, both with an autotest script and manually; it happens every time.

Steps to Reproduce:
1. Boot a guest:

/usr/libexec/qemu-kvm \
-S \
-name 'virt-tests-vm1' \
-sandbox off \
-machine pc \
-nodefaults \
-vga qxl \
-device intel-hda,bus=pci.0,addr=03 \
-device hda-duplex \
-chardev socket,id=qmp_id_qmp1,path=/tmp/monitor-qmp1-20150805-121007-yZR6ga5r,server,nowait \
-mon chardev=qmp_id_qmp1,mode=control \
-chardev socket,id=qmp_id_catch_monitor,path=/tmp/monitor-catch_monitor-20150805-121007-yZR6ga5r,server,nowait \
-mon chardev=qmp_id_catch_monitor,mode=control \
-chardev socket,id=serial_id_serial0,path=/tmp/serial-serial0-20150805-121007-yZR6ga5r,server,nowait \
-device isa-serial,chardev=serial_id_serial0 \
-device virtio-serial-pci,id=virtio_serial_pci0,bus=pci.0,addr=04 \
-chardev socket,id=devvs,path=/tmp/virtio_port-vs-20150805-121007-yZR6ga5r,server,nowait \
-device virtserialport,chardev=devvs,name=vs,id=vs,bus=virtio_serial_pci0.0 \
-chardev socket,id=seabioslog_id_20150805-121007-yZR6ga5r,path=/tmp/seabios-20150805-121007-yZR6ga5r,server,nowait \
-device isa-debugcon,chardev=seabioslog_id_20150805-121007-yZR6ga5r,iobase=0x402 \
-device nec-usb-xhci,id=usb1,bus=pci.0,addr=05 \
-device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=06 \
-drive id=drive_image1,if=none,cache=none,snapshot=off,aio=native,format=qcow2,file=/home/autotest-devel/client/tests/virt/shared/data/images/win2012-64r2-virtio-scsi.qcow2 \
-device scsi-hd,id=image1,drive=drive_image1 \
-device rtl8139,mac=9a:6c:6d:6e:6f:70,id=id4irgrw,netdev=idfonZdP,bus=pci.0,addr=07 \
-netdev tap,id=idfonZdP,fd=23 \
-m 32768 \
-smp 16,maxcpus=16,cores=8,threads=1,sockets=2 \
-cpu 'Opteron_G4',+kvm_pv_unhalt,hv_spinlocks=0x1fff,hv_vapic,hv_time \
-drive id=drive_cd1,if=none,snapshot=off,aio=native,media=cdrom,file=/home/autotest-devel/client/tests/virt/shared/data/isos/windows/winutils.iso \
-device ide-cd,id=cd1,drive=drive_cd1,bus=ide.0,unit=0 \
-device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
-spice port=3000,password=123456,addr=0,tls-port=3200,x509-dir=/tmp/spice_x509d,tls-channel=main,tls-channel=inputs,image-compression=auto_glz,zlib-glz-wan-compression=auto,streaming-video=all,agent-mouse=on,playback-compression=on,ipv4 \
-rtc base=localtime,clock=host,driftfix=slew \
-boot order=cdn,once=c,menu=off,strict=off \
-qmp tcp:0:6667,server,nowait \
-enable-kvm

2. Run the commands below:

{"execute":"qmp_capabilities"}
{"execute": "drive-mirror", "arguments": {"device": "drive_image1", "mode": "absolute-paths", "format": "qcow2", "target": "/home/autotest-devel/client/tests/virt/shared/data/images/target1.qcow2", "sync": "full"}, "id": "se9JM3ou"}
run the below commands: {"execute": "query-block-jobs", "id": "qAyA5fMW"} Actual results: after step 2: qemu output: Formatting '/home/autotest-devel/client/tests/virt/shared/data/images/target1.qcow2', fmt=qcow2 size=32212254720 encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16 after step 3: qemu output: Co-routine re-entered recursively qemu core dump: # gdb core (gdb) bt full #0 0x00007f4998d145d7 in raise () from /lib64/libc.so.6 No symbol table info available. #1 0x00007f4998d15cc8 in abort () from /lib64/libc.so.6 No symbol table info available. #2 0x00007f49a093b0bb in qemu_coroutine_enter (co=0x7f49a44b0480, opaque=0x0) at qemu-coroutine.c:111 self = <optimized out> ret = <optimized out> #3 0x00007f49a093b23a in qemu_co_queue_run_restart ( co=co@entry=0x7f49a3b14b40) at qemu-coroutine-lock.c:59 next = <optimized out> #4 0x00007f49a093afff in qemu_coroutine_enter (co=0x7f49a3b14b40, opaque=<optimized out>) at qemu-coroutine.c:118 self = <optimized out> ret = COROUTINE_YIELD #5 0x00007f49a092bb8c in bdrv_co_aio_rw_vector (bs=bs@entry=0x7f49a1f1c000, sector_num=sector_num@entry=25389824, qiov=qiov@entry=0x7f49a44a6928, nb_sectors=nb_sectors@entry=10880, flags=flags@entry=(unknown: 0), cb=cb@entry=0x7f49a096a240 <mirror_read_complete>, opaque=opaque@entry=0x7f49a44a6920, is_write=is_write@entry=false) at block.c:5025 co = <optimized out> acb = 0x7f49a37538e0 #6 0x00007f49a092e94c in bdrv_aio_readv (bs=bs@entry=0x7f49a1f1c000, sector_num=sector_num@entry=25389824, qiov=qiov@entry=0x7f49a44a6928, nb_sectors=nb_sectors@entry=10880, cb=cb@entry=0x7f49a096a240 <mirror_read_complete>, opaque=opaque@entry=0x7f49a44a6920) at block.c:4647 No locals. #7 0x00007f49a096af12 in mirror_iteration (s=0x7f49a3e4bc80) at block/mirror.c:306 source = 0x7f49a1f1c000 nb_chunks = <optimized out> end = <optimized out> hbitmap_next_sector = <optimized out> pnum = 2816 nb_sectors = 10880 sectors_per_chunk = <optimized out> sector_num = 25389824 next_chunk = <optimized out> next_sector = 25400704 op = 0x7f49a44a6920 ret = <optimized out> delay_ns = <optimized out> #8 mirror_run (opaque=0x7f49a3e4bc80) at block/mirror.c:518 delay_ns = 0 cnt = <optimized out> should_complete = <optimized out> s = 0x7f49a3e4bc80 data = <optimized out> bs = <optimized out> sector_num = <optimized out> end = <optimized out> sectors_per_chunk = <optimized out> length = <optimized out> last_pause_ns = <optimized out> bdi = {cluster_size = 0, vm_state_offset = 0, is_dirty = false, unallocated_blocks_are_zero = false, can_write_zeroes_with_unmap = false, needs_compressed_writes = false} backing_filename = "\000" ret = <optimized out> n = 128 __PRETTY_FUNCTION__ = "mirror_run" #9 0x00007f49a093b97a in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at coroutine-ucontext.c:80 self = 0x7f49a44b0480 co = 0x7f49a44b0480 #10 0x00007f4998d260f0 in ?? () from /lib64/libc.so.6 No symbol table info available. Expected results: no core dump, works well Additional info: I downgrade to qemu-kvm-rhev-2.3.0-12.el7.x86_64 to try, no core dump and works well.
> > Additional info:
> > I downgraded to qemu-kvm-rhev-2.3.0-12.el7.x86_64 to try; no core dump, and it works well.

Adding the Regression keyword according to this comment.

Best Regards,
Junyi
Seems like the same bug as bug 1249452.
Have hit the same issue on ppc64le. The stack trace looks the same:

Program terminated with signal 6, Aborted.
#0  0x00003fff9b91e578 in raise () from /lib64/power8/libc.so.6
#0  0x00003fff9b91e578 in raise () from /lib64/power8/libc.so.6
No symbol table info available.
#1  0x00003fff9b9206fc in abort () from /lib64/power8/libc.so.6
No symbol table info available.
#2  0x000000004aecfd38 in qemu_coroutine_enter (co=0x10003b9c900, opaque=0x0) at qemu-coroutine.c:111
        self = <optimized out>
        ret = <optimized out>
#3  0x000000004aed002c in qemu_co_queue_run_restart (co=0x10004a9e700) at qemu-coroutine-lock.c:59
        next = <optimized out>
#4  0x000000004aecfc00 in qemu_coroutine_enter (co=0x10004a9e700, opaque=<optimized out>) at qemu-coroutine.c:118
        self = <optimized out>
        ret = COROUTINE_YIELD
...
(In reply to Qian Guo from comment #3)
> Seems like the same bug as bug 1249452.

Reassigned to Kevin, who's taking care of Bug 1249452, for initial investigation.
qemu-kvm-rhev-2.3.0-13.el7 includes the patches for two BZs that might be related to this. Both were backported by Fam and reviewed by Jeff, Stefan, Paolo and Laszlo. (All CCed)

Bug 1242316: "block/mirror: Add "unmap" to drive-mirror"
Bug 1233826: "block/mirror: Sleep periodically during bitmap scanning"

Before I spend time digging into this, do any of you remember a part of these patches that seems likely to cause a bug like this (coroutine reentered recursively)?
(In reply to Kevin Wolf from comment #7)
> Bug 1233826: "block/mirror: Sleep periodically during bitmap scanning"

I was a reviewer on this one (downstream commit 51b5e87448f2fc4cefcb4adb2ee12c31c37332ff). I think I did not dig into block_job_sleep_ns() added by that patch, but now that call looks like a possible culprit to me.

I don't know much about QEMU's coroutines (although they have been explained to me earlier; sorry about that). The direct block_job_sleep_ns() -> qemu_coroutine_yield() call seems okay. The block_job_sleep_ns() -> co_aio_sleep_ns() -> qemu_coroutine_yield() call chain, *with* the "reenter me later" timer, seems risky. I do realize it's a generic coroutine facility. Still... Sorry I have no better idea.

In any case this should be a textually simple patch, and should be easy to revert (without bisection).
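To make the two call paths above concrete, here is a rough sketch of what the 2.3-era block_job_sleep_ns() looks like (paraphrased from memory of blockjob.c; treat the details as an approximation, not the exact source):

void block_job_sleep_ns(BlockJob *job, QEMUClockType type, int64_t ns)
{
    /* The job must not already be sleeping. */
    assert(job->busy);

    /* Check cancellation before going to sleep. */
    if (block_job_is_cancelled(job)) {
        return;
    }

    job->busy = false;
    if (block_job_is_paused(job)) {
        /* Path 1: plain yield; reentered explicitly when the job resumes. */
        qemu_coroutine_yield();
    } else {
        /* Path 2: yield with a timer armed to reenter the coroutine later
         * ("reenter me later"), via co_aio_sleep_ns(). */
        co_aio_sleep_ns(bdrv_get_aio_context(job->bs), type, ns);
    }
    job->busy = true;
}

Note that while the coroutine is parked in either branch, job->busy is false, which matters for the s->common.busy check discussed in the next comment.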
Thanks, Laszlo. Actually, looking closer at that patch, I think it's unlikely to be the culprit because we don't fail in the preparing loop, but only later.

I have a suspicion now, and the code that I don't like is the following lines in mirror_iteration_done():

    if (s->common.busy) {
        qemu_coroutine_enter(s->common.co, NULL);
    }

The reason why I don't like it is that - unlike most places that reenter - it doesn't have a specific yield that it pairs with, but is more uncontrolled. What we really mean here is "reenter the coroutine if it's in one of the four explicit yields in mirror.c". This used to be equivalent with s->common.busy (though not really nice to rely on it) because neither mirror_run() nor mirror_iteration() call any function that could yield.

However, since upstream commit dcfb3beb (downstream ea7c8b77, "mirror: Do zero write on target if sectors not allocated") this doesn't hold true any more: bdrv_get_block_status_above() can yield.

So what happens is that bdrv_get_block_status_above() wants to take a lock that is already held, so it enters itself into the queue of waiting coroutines and yields. Instead of being woken up by the unlock function, however, it gets woken up by mirror_iteration_done(), which is obviously wrong. In most cases the code actually happens to cope fairly well with such cases, but in this specific case, the unlock must already have scheduled the coroutine for wakeup when mirror_iteration_done() reentered it. And then the coroutine happened to process the scheduled restarts and tried to reenter itself recursively.

tl;dr: The reenter in mirror_iteration_done() must be paired with specific yields instead of abusing s->common.busy.
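A minimal sketch of the pattern described in the tl;dr - gate the reenter on an explicit flag that is set only around the yields in mirror.c that wait for AIO completion, instead of on s->common.busy. The field name waiting_for_io and the helper below are illustrative; the fix posted upstream (see the later comment) is authoritative:

typedef struct MirrorBlockJob {
    BlockJob common;
    /* ... existing fields ... */
    bool waiting_for_io;    /* true only while parked at one of the
                             * explicit yields below */
} MirrorBlockJob;

/* Used at each yield point in mirror.c that waits for an AIO callback. */
static void mirror_wait_for_io(MirrorBlockJob *s)
{
    s->waiting_for_io = true;
    qemu_coroutine_yield();
    s->waiting_for_io = false;
}

static void mirror_iteration_done(MirrorOp *op, int ret)
{
    MirrorBlockJob *s = op->s;

    /* ... completion bookkeeping ... */

    /* Only reenter if the coroutine is parked at one of our own yields;
     * never wake it out of an unrelated yield such as waiting for the
     * lock inside bdrv_get_block_status_above(). */
    if (s->waiting_for_io) {
        qemu_coroutine_enter(s->common.co, NULL);
    }
}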
Can you please test whether this scratch build fixes the problem?

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=9678693
*** Bug 1249452 has been marked as a duplicate of this bug. ***
(In reply to Kevin Wolf from comment #10)
> Can you please test whether this scratch build fixes the problem?
>
> http://brewweb.devel.redhat.com/brew/taskinfo?taskID=9678693

I tried 5 times, both with the autotest script and manually, and did not hit this problem on this scratch build. Details below:

# uname -r
3.10.0-302.el7.x86_64

# rpm -qa|grep qemu
qemu-kvm-rhev-debuginfo-2.3.0-14.el7.bz1251487.x86_64
qemu-kvm-rhev-2.3.0-14.el7.bz1251487.x86_64
qemu-kvm-common-rhev-2.3.0-14.el7.bz1251487.x86_64
qemu-img-rhev-2.3.0-14.el7.bz1251487.x86_64
qemu-kvm-tools-rhev-2.3.0-14.el7.bz1251487.x86_64
ipxe-roms-qemu-20130517-7.gitc4bce43.el7.noarch
qemu-guest-agent-2.3.0-2.el7.x86_64

autotest result (Result file: /home/autotest-devel/client/results/default/status):

virt.qemu.smp_4.4096m.repeat1.run_test.Host_RHEL.7.2.qcow2.virtio_scsi.up.rtl8139.Win2012.x86_64.r2.io-github-autotest-qemu.drive_mirror.simple_test.mirroring.block_job_complete.target_in_localfs  GOOD  226  completed successfully
virt.qemu.smp_4.4096m.repeat2.run_test.Host_RHEL.7.2.qcow2.virtio_scsi.up.rtl8139.Win2012.x86_64.r2.io-github-autotest-qemu.drive_mirror.simple_test.mirroring.block_job_complete.target_in_localfs  GOOD  311  completed successfully
virt.qemu.smp_4.4096m.repeat3.run_test.Host_RHEL.7.2.qcow2.virtio_scsi.up.rtl8139.Win2012.x86_64.r2.io-github-autotest-qemu.drive_mirror.simple_test.mirroring.block_job_complete.target_in_localfs  GOOD  313  completed successfully
virt.qemu.smp_4.4096m.repeat4.run_test.Host_RHEL.7.2.qcow2.virtio_scsi.up.rtl8139.Win2012.x86_64.r2.io-github-autotest-qemu.drive_mirror.simple_test.mirroring.block_job_complete.target_in_localfs  GOOD  296  completed successfully
virt.qemu.smp_4.4096m.repeat5.run_test.Host_RHEL.7.2.qcow2.virtio_scsi.up.rtl8139.Win2012.x86_64.r2.io-github-autotest-qemu.drive_mirror.simple_test.mirroring.block_job_complete.target_in_localfs  GOOD  316  completed successfully
----  GOOD  1525
Fix posted upstream: https://lists.gnu.org/archive/html/qemu-devel/2015-08/msg01524.html Now I'm waiting for the patch to arrive in master before I can backport it.
This bug blocks verification of:

Bug 1233826 - issuing drive-mirror command causes monitor unresponsive
Bug 1238585 - drive-mirror has spurious failures with low 'granularity' values
Kevin, Since this is a blocker for QE, can you provide a scratch build and also post to rhvirt-patches? I think I see it's accepted upstream now. Thanks!
The new scratch build is http://brewweb.devel.redhat.com/brew/taskinfo?taskID=9775419
Fix included in qemu-kvm-rhev-2.3.0-22.el7
*** Bug 1258737 has been marked as a duplicate of this bug. ***
Summary: The verification results show that this bug has been fixed.

Reproduced:

Host versions:
Kernel: 3.10.0-314.el7.x86_64
qemu-kvm-rhev: qemu-kvm-rhev-2.3.0-14.el7.x86_64

Steps:
1. Boot guest:

/usr/libexec/qemu-kvm -name rhel7.2 -machine pc-i440fx-rhel7.2.0,accel=kvm \
-cpu SandyBridge -m 2G,slots=256,maxmem=40G -numa node \
-smp 4,sockets=2,cores=2,threads=1 \
-uuid 82b1a01e-5f6c-4f5f-8d27-3855a74e6b6b \
-netdev tap,id=hostnet0 \
-device virtio-net-pci,netdev=hostnet0,id=net0,mac=12:54:00:5c:88:6d \
-device qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,vgamem_mb=16 \
-spice port=5900,addr=0.0.0.0,disable-ticketing,image-compression=off,seamless-migration=on \
-monitor stdio \
-serial unix:/tmp/monitor,server,nowait \
-qmp tcp:0:1235,server,nowait \
-drive file=/home/rhel7.2_scsi.qcow2,if=none,id=drive-scsi-disk0,format=qcow2,cache=none,werror=stop,rerror=stop \
-device virtio-scsi-pci,id=scsi0 \
-device scsi-hd,drive=drive-scsi-disk0,bus=scsi0.0,scsi-id=0,lun=0,id=scsi-disk0

2. Do block mirror:

{"execute": "drive-mirror", "arguments": {"device": "drive-scsi-disk0", "mode": "absolute-paths", "format": "qcow2", "target": "/home/mirror0.qco2", "sync": "full"}, "id": "se9JM3ou"}
{"return": {}, "id": "se9JM3ou"}

3. qemu core dump occurs:

(qemu) Formatting '/home/mirror0.qco2', fmt=qcow2 size=21474836480 encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16
Co-routine re-entered recursively
Aborted (core dumped)

Verified:

Host versions:
Kernel: 3.10.0-314.el7.x86_64
qemu-kvm-rhev: qemu-kvm-rhev-2.3.0-22.el7.x86_64

Steps:
1. Boot guest:

/usr/libexec/qemu-kvm -name rhel7.2 -machine pc-i440fx-rhel7.2.0,accel=kvm \
-cpu SandyBridge -m 2G,slots=256,maxmem=40G -numa node \
-smp 4,sockets=2,cores=2,threads=1 \
-uuid 82b1a01e-5f6c-4f5f-8d27-3855a74e6b6b \
-netdev tap,id=hostnet0 \
-device virtio-net-pci,netdev=hostnet0,id=net0,mac=12:54:00:5c:88:6d \
-device qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,vgamem_mb=16 \
-spice port=5900,addr=0.0.0.0,disable-ticketing,image-compression=off,seamless-migration=on \
-monitor stdio \
-serial unix:/tmp/monitor,server,nowait \
-qmp tcp:0:1235,server,nowait \
-drive file=/home/rhel7.2_scsi.qcow2,if=none,id=drive-scsi-disk0,format=qcow2,cache=none,werror=stop,rerror=stop \
-device virtio-scsi-pci,id=scsi0 \
-device scsi-hd,drive=drive-scsi-disk0,bus=scsi0.0,scsi-id=0,lun=0,id=scsi-disk0

2. Do block mirror:

{"execute": "drive-mirror", "arguments": {"device": "drive-scsi-disk0", "mode": "absolute-paths", "format": "qcow2", "target": "/home/mirror1.qco2", "sync": "full"}, "id": "se9JM3ou"}
{"return": {}, "id": "se9JM3ou"}
{"timestamp": {"seconds": 1441600632, "microseconds": 143896}, "event": "BLOCK_JOB_READY", "data": {"device": "drive-scsi-disk0", "len": 1116602368, "offset": 1116602368, "speed": 0, "type": "mirror"}}

3. Query the mirror job:

{"execute": "query-block-jobs", "id": "qAyA5fMW"}
{"return": [{"io-status": "ok", "device": "drive-scsi-disk0", "busy": false, "len": 1116864512, "offset": 1116864512, "paused": false, "speed": 0, "ready": true, "type": "mirror"}], "id": "qAyA5fMW"}

4. qemu and the guest both work well:

(qemu) info block
drive-scsi-disk0: /home/rhel7.2_scsi.qcow2 (qcow2)
    Cache mode: writeback, direct
ide1-cd0: [not inserted]
    Removable device: not locked, tray closed
floppy0: [not inserted]
    Removable device: not locked, tray closed
sd0: [not inserted]
    Removable device: not locked, tray closed

(qemu) info status
VM status: running

So this bug has been fixed.
On the PowerPC platform, dd a file in the guest and do block mirror at the same time; the file is created successfully, the block mirror finishes successfully, and the guest works well.

Host:
qemu-kvm-rhev-2.3.0-22.el7.ppc64le
kernel-3.10.0-313.el7.ppc64le
SLOF-20150313-3.gitc89b0df.el7.noarch

1. Boot the guest with cmd:

# gdb /usr/libexec/qemu-kvm
(gdb) r -name live-snapshot-test -machine pseries,accel=kvm,usb=off -m 8G -smp 4 -nodefaults -monitor stdio -rtc base=utc -drive file=RHEL-7.2-20150820.0-Server-ppc64-2.qcow2,format=qcow2,if=none,id=drive-virtio0,werror=stop,cache=none -device virtio-blk-pci,id=virtio0,drive=drive-virtio0 -device spapr-vscsi,id=scsi0,reg=0x1000 -drive file=RHEL-7.2-20150820.0-Server-ppc64-dvd1.iso,if=none,id=drive-scsi0,readonly=on,format=raw -device scsi-cd,bus=scsi0.0,channel=0,scsi-id=0,drive=drive-scsi0,id=scsi0-0 -netdev tap,id=hostnet0,script=/etc/qemu-ifup -device spapr-vlan,netdev=hostnet0,id=net0,mac=00:52:5f:5d:5c:5d -device usb-kbd,id=input0 -device usb-mouse,id=input1 -msg timestamp=on -usb -device usb-tablet,id=tablet1 -vga std -vnc 0.0.0.0:19 -uuid bf91fbbf-33a8-4bef-b69a-e591bb233795 -qmp tcp:0:4666,server,nowait

2. In the guest:

dd if=/dev/zero of=file bs=1M count=10240

3. During step 2:

{"execute":"drive-mirror","arguments":{"device":"drive-virtio0","target":"/root/test_home/shuyu/8-26-livesnapshot/sn1","format":"qcow2","mode":"absolute-paths","sync":"full","speed":1000000000, "on-source-error": "stop", "on-target-error": "stop"}}
According to comment 19 and comment 20, setting this issue to verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-2546.html