Bug 1453169
Summary: | qemu aborts if quit during live commit process
---|---
Product: | Red Hat Enterprise Linux 7
Component: | qemu-kvm-rhev
Version: | 7.4
Hardware: | All
OS: | Linux
Status: | CLOSED ERRATA
Severity: | high
Priority: | urgent
Keywords: | Regression
Reporter: | Qianqian Zhu <qizhu>
Assignee: | Kevin Wolf <kwolf>
QA Contact: | Qianqian Zhu <qizhu>
CC: | chayang, dgibson, juzhang, knoel, kwolf, lmiksik, michen, qizhu, qzhang, thuth, virt-maint
Target Milestone: | rc
Fixed In Version: | qemu-kvm-rhev-2.9.0-10.el7
Type: | Bug
Last Closed: | 2017-08-02 04:41:00 UTC
Description
Qianqian Zhu, 2017-05-22 09:57:27 UTC
spapr-vscsi is specific to Power. So what the message tells us is that op blockers prevented some operation from using a block node that is going to be removed by the commit block job, because that node no longer contains consistent data once the commit job has copied the first sectors from sn3 into sn1. The curious thing is that apparently the 'quit' command made something want to access that block node, and apparently that thing didn't expect the access to fail, or we wouldn't get an abort(). It's unclear to me yet what this could be. (A toy model of this permission check is sketched below.)

(In reply to Qianqian Zhu from comment #0)
> Program received signal SIGABRT, Aborted.
> 0x00003fffb6f2eff0 in raise () from /lib64/libc.so.6

Can you please get a full backtrace so we can find out which operation it was that failed?

Hi Kevin,
Not sure if it helps, but below is what I get after (gdb) thread apply all bt full:

(qemu) main-loop: WARNING: I/O thread spun for 1000 iterations
(qemu) quit
[New Thread 0x3fffb47eea80 (LWP 186434)]
Unexpected error in bdrv_check_update_perm() at block.c:1648:
qemu-kvm: Conflicts with use by commit job 'drive_image1' as 'intermediate node', which does not allow 'consistent read' on #block365

Program received signal SIGABRT, Aborted.
0x00003fffb6f2eff0 in __GI_raise (sig=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56        return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);

qemu-kvm-rhev-2.6.0-27.el7.ppc64le does not have this issue, so this is a regression.

Kevin,
So in this case it looks like this is triggered by the spapr-vscsi block device (which is POWER specific). It's not clear to me why the details of the block device would have any bearing on the guts of the block layer's live commit code. Any ideas?

No, it doesn't make any sense to me either. The one thing that should actually make a difference is scsi-hd, and that's the same on both platforms. This is also why I asked for a full backtrace, but we don't seem to be getting one from the reporter. Can you try and reproduce it yourself?

Qianqian Zhu, to make sure that we've got all the important information in this ticket, could you please describe how you created the initial versions of the sn1, sn3 and sn4 files? I've tried to reproduce the problem, but so far I've failed. I either get a BLOCK_JOB_COMPLETED event immediately, since there is nothing to commit between sn3 and sn1 (all changes are in sn4), or, if I start QEMU slightly differently (so that the block job runs longer), I see a BLOCK_JOB_CANCELLED right after the SHUTDOWN event. In either case, there is no abort. So you really have to provide more information on how exactly you set up your sn* files. Also, could it be that this is related to NFS? Could you please try whether you can reproduce the issue with local files only, too?

I have no Power host on hand currently; I have tried to reserve one from beaker, and I will test with local files once I get it.
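The op-blocker conflict Kevin describes can be illustrated with a small standalone program. This is a toy model, not QEMU code: the struct, function name and permission constants below are invented for illustration; only the shape of the error message and the idea that each parent of a node declares which permissions it is willing to share are taken from this report.

```c
/* Toy model of the block-layer permission check (all names invented). */
#include <stdio.h>
#include <stdlib.h>

#define PERM_CONSISTENT_READ  0x1
#define PERM_WRITE            0x2

struct parent_ref {
    const char *user;        /* who holds the reference to the node */
    const char *role;        /* how it uses the node */
    unsigned shared_perm;    /* permissions it allows other users */
};

/* Abort if any existing parent refuses a permission the new user needs,
 * mirroring bdrv_check_update_perm() being called with &error_abort. */
static void check_update_perm(struct parent_ref *parents, int n,
                              unsigned new_used_perm, const char *node)
{
    for (int i = 0; i < n; i++) {
        if (new_used_perm & ~parents[i].shared_perm) {
            fprintf(stderr, "Conflicts with use by %s as '%s', "
                    "which does not allow 'consistent read' on %s\n",
                    parents[i].user, parents[i].role, node);
            abort();
        }
    }
}

int main(void)
{
    /* The commit job keeps a reference to an intermediate node and
     * refuses to share consistent-read on it, since the node's data is
     * stale once the job starts copying it down. */
    struct parent_ref parents[] = {
        { "commit job 'drive_image1'", "intermediate node",
          PERM_WRITE /* shares write, but not PERM_CONSISTENT_READ */ },
    };
    /* On 'quit', something wants consistent-read on that node -> abort. */
    check_update_perm(parents, 1, PERM_CONSISTENT_READ, "#block365");
    return 0;
}
```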
All the snapshots were created via live snapshot:

{ "execute": "blockdev-snapshot-sync", "arguments": { "device": "drive_image1", "snapshot-file": "/mnt/nfs/sn1", "format": "qcow2", "mode": "absolute-paths" } }
{ "execute": "blockdev-snapshot-sync", "arguments": { "device": "drive_image1", "snapshot-file": "/mnt/nfs/sn2", "format": "qcow2", "mode": "absolute-paths" } }
{ "execute": "blockdev-snapshot-sync", "arguments": { "device": "drive_image1", "snapshot-file": "/mnt/nfs/sn3", "format": "qcow2", "mode": "absolute-paths" } }

After snapshot sn3, I dd'd a 1G file inside the guest, then took live snapshot sn4:

{ "execute": "blockdev-snapshot-sync", "arguments": { "device": "drive_image1", "snapshot-file": "/mnt/nfs/sn4", "format": "qcow2", "mode": "absolute-paths" } }

Tested with local files; 100% reproducible with the steps below (a toy model of the resulting backing chain follows step 6):

1. Launch guest with base file:
/usr/libexec/qemu-kvm -name 'avocado-vt-vm1' -machine pseries -vga std -device spapr-vscsi,id=spapr_vscsi0 -drive id=drive_image1,if=none,snapshot=off,aio=native,cache=none,format=qcow2,file=/home/staf-kvm-devel/vt_test_images/rhel74-ppc64le-virtio.qcow2 -device scsi-hd,id=image1,bus=spapr_vscsi0.0,drive=drive_image1 -device virtio-net-pci,mac=9a:a9:aa:ab:ac:ad,id=idIVGRLp,vectors=4,netdev=idOmGKIu,bus=pci.0,addr=0x4 -netdev tap,id=idOmGKIu,vhost=on -m 4096 -smp 2,cores=2,threads=1,sockets=1 -vnc :3 -monitor stdio -qmp tcp::5555,server,nowait

2. Live snapshot:
{ "execute": "blockdev-snapshot-sync", "arguments": { "device": "drive_image1", "snapshot-file": "/home/sn1", "format": "qcow2", "mode": "absolute-paths" } }
{"return": {}}
{ "execute": "blockdev-snapshot-sync", "arguments": { "device": "drive_image1", "snapshot-file": "/home/sn2", "format": "qcow2", "mode": "absolute-paths" } }
{"return": {}}
{ "execute": "blockdev-snapshot-sync", "arguments": { "device": "drive_image1", "snapshot-file": "/home/sn3", "format": "qcow2", "mode": "absolute-paths" } }
{"return": {}}

3. dd a file inside the guest:
dd if=/dev/urandom of=/home/sn3 bs=1M count=1000 status=progress

4. Live snapshot:
{ "execute": "blockdev-snapshot-sync", "arguments": { "device": "drive_image1", "snapshot-file": "/home/sn4", "format": "qcow2", "mode": "absolute-paths" } }

5. Live commit:
{ "execute": "block-commit", "arguments": { "device": "drive_image1", "base": "/home/sn1", "top": "/home/sn3", "backing-file": "/home/sn1" } }

6. Quit qemu:
(qemu) quit
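The chain built by steps 2-5 is base <- sn1 <- sn2 <- sn3 <- sn4 (each image's backing file on its left). The toy program below (invented names and types, not QEMU code) models why committing sn3 down into sn1 leaves the intermediate images sn2 and sn3 without a consistent point-in-time view, which is exactly why the commit job forbids 'consistent read' on them:

```c
/* Toy model of an intermediate block-commit on the chain
 * base <- sn1 <- sn2 <- sn3 <- sn4 (all names invented). */
#include <stdio.h>

#define LEN 8

struct image {
    const char *name;
    char data[LEN];          /* '.' = unallocated: read from backing file */
    struct image *backing;
};

/* Resolve one byte through the backing chain, as qcow2 reads do. */
static char read_byte(struct image *img, int i)
{
    if (img->data[i] != '.')
        return img->data[i];
    return img->backing ? read_byte(img->backing, i) : '.';
}

/* block-commit with top=sn3, base=sn1: copy everything visible at top
 * down into base; sn2 and sn3 are then dropped from the chain. */
static void commit(struct image *top, struct image *base)
{
    for (int i = 0; i < LEN; i++) {
        char c = read_byte(top, i);
        if (c != '.')
            base->data[i] = c;
    }
}

int main(void)
{
    struct image base = { "base", "AAAAAAAA", NULL };
    struct image sn1  = { "sn1",  "........", &base };
    struct image sn2  = { "sn2",  "..BB....", &sn1 };
    struct image sn3  = { "sn3",  "....CC..", &sn2 };
    struct image sn4  = { "sn4",  "......DD", &sn3 };

    /* sn2's point-in-time view before the commit: AABBAAAA */
    for (int i = 0; i < LEN; i++) putchar(read_byte(&sn2, i));
    putchar('\n');

    commit(&sn3, &sn1);      /* step 5: commit sn3 down into sn1 */
    sn4.backing = &sn1;      /* rewire sn4's backing link to sn1 */

    /* The active layer sn4 still reads correctly: AABBCCDD */
    for (int i = 0; i < LEN; i++) putchar(read_byte(&sn4, i));
    putchar('\n');

    /* ...but sn2 no longer shows its original state (now AABBCCAA):
     * newer data was copied into its backing file sn1, so it cannot
     * offer a 'consistent read' any more -- hence the op blocker. */
    for (int i = 0; i < LEN; i++) putchar(read_byte(&sn2, i));
    putchar('\n');
    return 0;
}
```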
Qianqian,
From the initial message I see that this problem doesn't occur on x86. However, it's not clear if it's been tested with the virtio-scsi or virtio-blk devices on POWER. Can you confirm whether or not this occurs on POWER with those devices? It would be useful to know whether the specific block device is related to the problem or not.

David,
Both virtio-scsi and virtio-blk on Power hit this issue, so I have modified the title.

Qianqian,
Thanks for the information. So this looks like a Power-specific host-side problem with the block code, rather than something related to the specific block device. That makes a bit more sense, although it's still unclear why Power would make a difference here.

OK, thanks to the detailed description from comment 13, I've finally been able to reproduce the problem (with qemu-kvm-rhev-2.9.0-8.el7). Here's the backtrace:

#0  0x00003fffb710eff0 in raise () from /lib64/libc.so.6
#1  0x00003fffb711136c in abort () from /lib64/libc.so.6
#2  0x0000000045771e94 in error_handle_fatal (errp=<optimized out>, err=0x466046e0) at util/error.c:38
#3  0x0000000045771fb8 in error_setv (errp=0x45debbb8 <error_abort>, src=0x45803f68 "block.c", line=<optimized out>, func=0x45803a40 <__func__.32017> "bdrv_check_update_perm", err_class=<optimized out>, fmt=0x458040e8 "Conflicts with use by %s as '%s', which does not allow '%s' on %s", ap=<optimized out>, suffix=0x0) at util/error.c:71
#4  0x00000000457720b8 in error_setg_internal (errp=<optimized out>, src=<optimized out>, line=<optimized out>, func=<optimized out>, fmt=<optimized out>) at util/error.c:95
#5  0x000000004566b05c in bdrv_check_update_perm (bs=0x46913800, new_used_perm=1, new_shared_perm=21, ignore_children=0x0, errp=0x45debbb8 <error_abort>) at block.c:1657
#6  0x000000004566d2dc in bdrv_root_attach_child (child_bs=0x46913800, child_name=0x45804890 "backing", child_role=0x45947820 <child_backing>, perm=1, shared_perm=21, opaque=0x4691a000, errp=0x45debbb8 <error_abort>) at block.c:1861
#7  0x000000004566d48c in bdrv_attach_child (parent_bs=0x4691a000, child_bs=0x46913800, child_name=0x45804890 "backing", child_role=0x45947820 <child_backing>, errp=0x45debbb8 <error_abort>) at block.c:1899
#8  0x0000000045671e20 in bdrv_set_backing_hd (bs=0x4691a000, backing_hd=0x46913800, errp=0x45debbb8 <error_abort>) at block.c:1996
#9  0x00000000456c5910 in commit_complete (job=0x467a3720, opaque=0x47821f38) at block/commit.c:125
#10 0x00000000456750dc in block_job_defer_to_main_loop_bh (opaque=0x470d5be0) at blockjob.c:794
#11 0x00000000457658d8 in aio_bh_call (bh=0x46607260) at util/async.c:90
#12 aio_bh_poll (ctx=0x46701900) at util/async.c:118
#13 0x000000004576a534 in aio_poll (ctx=0x46701900, blocking=<optimized out>) at util/aio-posix.c:682
#14 0x00000000456c6e44 in bdrv_drain_recurse (bs=0x4691a000) at block/io.c:164
#15 0x00000000456c7968 in bdrv_drained_begin (bs=0x4691a000) at block/io.c:248
#16 0x00000000456c7cc0 in bdrv_drain (bs=0x4691a000) at block/io.c:282
#17 0x00000000456b76d0 in blk_drain (blk=<optimized out>) at block/block-backend.c:1383
#18 0x00000000456752cc in block_job_drain (job=0x467a3720) at blockjob.c:126
#19 0x00000000456757b8 in block_job_finish_sync (job=0x467a3720, finish=0x45677150 <block_job_cancel_err>, errp=0x0) at blockjob.c:584
#20 0x00000000456764b8 in block_job_cancel_sync (job=0x467a3720) at blockjob.c:604
#21 block_job_cancel_sync_all () at blockjob.c:615
#22 0x000000004566d7d8 in bdrv_close_all () at block.c:3002
#23 0x000000004535a180 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4726
Using the instructions from comment 13 and virtio-scsi instead of spapr-vscsi, I've been able to reproduce this issue on x86, too, so this is not specific to ppc64:

$ rpm -qa | grep qemu-kvm-rhev
qemu-kvm-rhev-2.9.0-5.el7.x86_64
$ sudo /usr/libexec/qemu-kvm -vga std -device virtio-scsi,id=vscsi0 -drive id=drive_image1,if=none,snapshot=off,aio=native,cache=none,format=qcow2,file=/home/thuth/tmp/images/rhel74-server.qcow2 -device scsi-hd,id=image1,bus=vscsi0.0,drive=drive_image1 -m 2G -smp 2,cores=2,threads=1,sockets=1 -vnc :10 -monitor stdio -qmp tcp::5555,server,nowait
QEMU 2.9.0 monitor - type 'help' for more information
(qemu) Formatting '/home/sn1', fmt=qcow2 size=19327352832 backing_file=/home/thuth/tmp/images/rhel74-server.qcow2 backing_fmt=qcow2 encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16
Formatting '/home/sn2', fmt=qcow2 size=19327352832 backing_file=/home/sn1 backing_fmt=qcow2 encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16
Formatting '/home/sn3', fmt=qcow2 size=19327352832 backing_file=/home/sn2 backing_fmt=qcow2 encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16
Formatting '/home/sn4', fmt=qcow2 size=19327352832 backing_file=/home/sn3 backing_fmt=qcow2 encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16
(qemu) quit
main-loop: WARNING: I/O thread spun for 1000 iterations
Unexpected error in bdrv_check_update_perm() at block.c:1648:
qemu-kvm: Conflicts with use by commit job 'drive_image1' as 'intermediate node', which does not allow 'consistent read' on #block2246

... so I'm assigning this to an expert from the block layer instead.

Ok, thanks. With this setup I can reproduce the problem, too. Adding block_job_remove_all_bdrv() before block_job_completed() should fix the problem. I'll work on a patch and qemu-iotests case.
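In toy terms (continuing the permission-model sketch earlier in this thread; invented names, not the actual RHEL patch), the reordering Kevin describes means the job drops its references to the blocked nodes before the backing-chain update runs its permission check:

```c
/* Toy model of the fix ordering (invented names, not QEMU code): once
 * the job forgets its references to the intermediate nodes, the later
 * permission check finds no parent that refuses 'consistent read'. */
#include <stdio.h>

#define PERM_CONSISTENT_READ 0x1

struct parent_ref { const char *user; unsigned shared_perm; };

/* The commit job's remaining reference; it shares nothing on the node. */
static struct parent_ref parents[] = {
    { "commit job 'drive_image1'", 0 },
};
static int nparents = 1;

/* Stand-in for block_job_remove_all_bdrv(): drop the job's references. */
static void job_remove_all_refs(void) { nparents = 0; }

/* Stand-in for bdrv_check_update_perm(): 0 = ok, -1 = would abort. */
static int check_perm(unsigned needed)
{
    for (int i = 0; i < nparents; i++) {
        if (needed & ~parents[i].shared_perm)
            return -1;
    }
    return 0;
}

int main(void)
{
    /* Broken ordering: the graph update runs while the job still holds
     * its references -> conflict -> abort() in QEMU. */
    printf("before fix: %s\n",
           check_perm(PERM_CONSISTENT_READ) ? "conflict (abort)" : "ok");

    /* Fixed ordering: remove the job's references first, i.e.
     * "block_job_remove_all_bdrv() before block_job_completed()". */
    job_remove_all_refs();
    printf("after fix:  %s\n",
           check_perm(PERM_CONSISTENT_READ) ? "conflict (abort)" : "ok");
    return 0;
}
```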
Fix included in qemu-kvm-rhev-2.9.0-10.el7

Reproduced on qemu-kvm-rhev-2.9.0-5.el7.x86_64:

Steps:
1. Launch guest:
/usr/libexec/qemu-kvm -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=03 -drive id=drive_image1,if=none,snapshot=off,aio=native,cache=none,format=qcow2,file=/home/kvm_autotest_root/images/rhel74-64-virtio-scsi.qcow2 -device scsi-hd,id=image1,drive=drive_image1,bus=virtio_scsi_pci0.0,scsi-id=0,lun=0 -device virtio-net-pci,mac=9a:4d:4e:4f:50:51,id=idpH2Vot,vectors=4,netdev=idP4EUQG,bus=pci.0,addr=04 -netdev tap,id=idP4EUQG,vhost=on -m 1024 -cpu 'SandyBridge' -vnc :3 -monitor stdio -qmp tcp:0:5555,server,nowait
2. Live snapshot sn1 -> sn4, and dd files inside the guest on sn3.
3. Live commit:
{ "execute": "block-commit", "arguments": { "device": "drive_image1", "base": "/home/sn1", "top": "/home/sn3", "backing-file": "/home/sn1" } }
4. Quit qemu:
(qemu) quit

Result: QEMU aborts.

Verified with both qemu-kvm-rhev-2.9.0-10.el7.x86_64 and qemu-kvm-rhev-2.9.0-10.el7.ppc64le, same steps as above.

Result: QEMU quits normally; the block job is cancelled.

(In reply to Qianqian Zhu from comment #22) Setting VERIFIED accordingly.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:2392