Bug 1482478
Summary: | Fail to quit source qemu when doing live migration after mirroring guest to NBD server | |||
---|---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Longxiang Lyu <lolyu> | |
Component: | qemu-kvm-rhev | Assignee: | Eric Blake <eblake> | |
Status: | CLOSED ERRATA | QA Contact: | Longxiang Lyu <lolyu> | |
Severity: | high | Docs Contact: | ||
Priority: | high | |||
Version: | 7.4 | CC: | aliang, areis, chayang, coldford, coli, dgilbert, dvd, eblake, famz, jsuchane, juzhang, knoel, lolyu, marjones, michen, mrezanin, mtessun, qzhang, stefanha, toneata, virt-maint | |
Target Milestone: | rc | Keywords: | Regression | |
Target Release: | --- | |||
Hardware: | x86_64 | |||
OS: | Linux | |||
Whiteboard: | ||||
Fixed In Version: | qemu-kvm-rhev-2.10.0-2.el7 | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
Cloned to: | 1495472 1495474 (view as bug list) | Environment: ||
Last Closed: | 2018-04-11 00:33:01 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1495472, 1495474 |
Description
Longxiang Lyu
2017-08-17 11:30:54 UTC
On the "Additional info" part: this fails to reproduce on 7.3.z; the source qemu quits successfully there. The version info should be:

```
kernel-3.10.0-514.el7.x86_64
qemu-kvm-rhev-2.6.0-28.el7_3.14
```

With qemu-kvm-rhev-2.9.0-16.el7_4.3.x86_64, the source qemu cannot quit. Destination qemu output during migration:

```
(qemu) red_dispatcher_loadvm_commands: qemu-kvm: Conflicts with use by a block device as 'root', which uses 'write' on #block126
```

Destination qemu output after invoking nbd-server-stop:

```
(qemu) nbd/server.c:nbd_receive_request():L706: read failed
```

Hi Fam, this seems similar to issues that have been reported (and fixed?) upstream recently. What do you think?

Yes, it's worth testing the upstream patches (in 2.10-rc3). They fix two issues with op blockers, though I'm not sure about the src hang. Is there a backtrace when the src hangs?

1. Step 2 should be:

```
# qemu-img create -f qcow2 mirror1.qcow2 20G
```

2. gdb bt output of the hung qemu process:

```
#0  0x00007fccc3149aff in ppoll () from /lib64/libc.so.6
#1  0x000055be482a0c9b in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
#2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=-1) at util/qemu-timer.c:322
#3  0x000055be482a2955 in aio_poll (ctx=ctx@entry=0x55be4a407700, blocking=<optimized out>) at util/aio-posix.c:622
#4  0x000055be48231184 in bdrv_flush (bs=bs@entry=0x55be4a4bc800) at block/io.c:2418
#5  0x000055be481ed52b in bdrv_close (bs=0x55be4a4bc800) at block.c:2949
#6  bdrv_delete (bs=0x55be4a4bc800) at block.c:3139
#7  bdrv_unref (bs=0x55be4a4bc800) at block.c:4116
#8  0x000055be48225314 in blk_remove_bs (blk=blk@entry=0x55be4a42a000) at block/block-backend.c:552
#9  0x000055be4822535b in blk_remove_all_bs () at block/block-backend.c:306
#10 0x000055be481ea648 in bdrv_close_all () at block.c:3009
#11 0x000055be47f9306b in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4737
```
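The numbered reproduction steps referenced here ("step 2" above, "step 7" below) are not quoted in this extract. The mirror-to-NBD live-migration flow they describe roughly follows this QMP sequence — a sketch only: DEST_HOST, the ports, and the event handling are assumptions, while the device id img0 matches the -drive id in the qemu command line quoted further down.

```
# On the destination (started with -incoming tcp:0:4444), export the target disk:
{ "execute": "nbd-server-start",
  "arguments": { "addr": { "type": "inet",
                           "data": { "host": "0.0.0.0", "port": "10809" } } } }
{ "execute": "nbd-server-add", "arguments": { "device": "img0", "writable": true } }

# On the source, mirror the guest disk into the NBD export:
{ "execute": "drive-mirror",
  "arguments": { "device": "img0",
                 "target": "nbd://DEST_HOST:10809/img0",
                 "sync": "full", "format": "raw", "mode": "existing" } }

# After the BLOCK_JOB_READY event, migrate the VM state:
{ "execute": "migrate", "arguments": { "uri": "tcp:DEST_HOST:4444" } }

# Once migration completes, stop the NBD server on the destination...
{ "execute": "nbd-server-stop" }
# ...and quit the source qemu -- the step that hangs in the affected builds.
```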
The bug also exists in qemu-2.10.0-rc3: the source qemu fails to quit and hangs. gdb bt output for the source qemu process of qemu-2.10.0-rc3:

```
# gdb -batch -ex bt -p 3005
[New LWP 3045]
[New LWP 3023]
[New LWP 3022]
[New LWP 3021]
[New LWP 3019]
[New LWP 3006]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007fded767aaff in ppoll () from /lib64/libc.so.6
#0  0x00007fded767aaff in ppoll () at /lib64/libc.so.6
#1  0x000055c65a2e5adb in qemu_poll_ns (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
#2  0x000055c65a2e5adb in qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=-1) at util/qemu-timer.c:322
#3  0x000055c65a2e779a in aio_poll (ctx=ctx@entry=0x55c65baeafb0, blocking=<optimized out>) at util/aio-posix.c:622
#4  0x000055c65a278314 in bdrv_flush (bs=bs@entry=0x55c65d7b5c30) at block/io.c:2288
#5  0x000055c65a22cacb in bdrv_unref (bs=0x55c65d7b5c30) at block.c:3061
#6  0x000055c65a22cacb in bdrv_unref (bs=0x55c65d7b5c30) at block.c:3251
#7  0x000055c65a22cacb in bdrv_unref (bs=0x55c65d7b5c30) at block.c:4237
#8  0x000055c65a26ba3a in blk_remove_bs (blk=blk@entry=0x55c65bb14fc0) at block/block-backend.c:647
#9  0x000055c65a26ba8b in blk_remove_all_bs () at block/block-backend.c:401
#10 0x000055c65a22a138 in bdrv_close_all () at block.c:3121
#11 0x000055c659f5b8ea in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4796
```

I have sent a fix upstream. Please test it: https://github.com/stefanha/qemu/tree/nbd-fix-eof-hang Thanks!

If I'm correct, this worked in 2.8 and regressed in 2.9 and 2.10 (unless we do an unlikely -rc4), but the fix should be one of the early commits into 2.11 as well as an easy backport for downstream.
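Testing a fix branch like the one above means building qemu from source. A minimal sketch, assuming a standard build environment (the branch name comes from Stefan's URL; the configure flags are assumptions):

```
git clone https://github.com/stefanha/qemu.git
cd qemu
git checkout nbd-fix-eof-hang          # the fix branch named above
./configure --target-list=x86_64-softmmu
make -j"$(nproc)"
# Rerun the reproducer with ./x86_64-softmmu/qemu-system-x86_64,
# as in the command line below.
```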
The bug still exists with the upstream fix branch (https://github.com/stefanha/qemu/tree/nbd-fix-eof-hang):

```
# ps aux | grep qemu
root   9801  3.6  6.7 6607480 817448 pts/2 Sl+ 18:44 0:35 /home/test/iscsi/qemu-fix/qemu/x86_64-softmmu/qemu-system-x86_64 -name guest=test-virt -machine pc-i440fx-2.9,accel=kvm,usb=off,vmport=off,dump-guest-core=off -cpu SandyBridge -m 2G -smp 4,sockets=4,cores=1,threads=1 -boot strict=on -drive file=/home/test/nbd01/test.qcow2,if=none,cache=none,snapshot=off,format=qcow2,id=img0 -device virtio-blk-pci,bus=pci.0,addr=0x7,drive=img0,id=virtio-disk0,bootindex=1 -netdev tap,id=hostnet0,vhost=on -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:12:b3:20:61,bus=pci.0,addr=0x3 -device VGA -usbdevice tablet -vnc :2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8 -monitor stdio -qmp tcp:0:5555,server,nowait
root  10690  0.0  0.0  112664    972 pts/1 R+  19:00 0:00 grep --color=auto qemu

# gdb -batch -ex bt -p 9801
[New LWP 9819]
[New LWP 9817]
[New LWP 9816]
[New LWP 9815]
[New LWP 9814]
[New LWP 9802]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007f66e9dc6aff in ppoll () from /lib64/libc.so.6
#0  0x00007f66e9dc6aff in ppoll () at /lib64/libc.so.6
#1  0x0000559ae2ad98ab in qemu_poll_ns (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
#2  0x0000559ae2ad98ab in qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=-1) at util/qemu-timer.c:322
#3  0x0000559ae2adb56a in aio_poll (ctx=ctx@entry=0x559ae3fddbb0, blocking=<optimized out>) at util/aio-posix.c:622
#4  0x0000559ae2a734a4 in bdrv_flush (bs=bs@entry=0x559ae3fe17d0) at block/io.c:2418
#5  0x0000559ae2a2de0b in bdrv_unref (bs=0x559ae3fe17d0) at block.c:2949
#6  0x0000559ae2a2de0b in bdrv_unref (bs=0x559ae3fe17d0) at block.c:3139
#7  0x0000559ae2a2de0b in bdrv_unref (bs=0x559ae3fe17d0) at block.c:4121
#8  0x0000559ae2a679d4 in blk_remove_bs (blk=blk@entry=0x559ae4007710) at block/block-backend.c:552
#9  0x0000559ae2a67a1b in blk_remove_all_bs () at block/block-backend.c:306
#10 0x0000559ae2a2af78 in bdrv_close_all () at block.c:3009
#11 0x0000559ae276ec95 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4721
```

Does this work now with current qemu (1eed3399)? Eric's big batch of NBD fixes went in yesterday. I was testing with --copy-storage-all yesterday and it seemed happy.

The bug still exists for qemu-kvm-rhev-2.9.0-16.el7_4.5.

The bug is fixed in v2.10.0-rc4. I see the exported NBD disk is not locked after migration while the dest VM is in running status; should it be locked?

This works as 2.8 does, which doesn't have this bug. I'm not sure about your c#18 locking question; let's ask Eric.

(In reply to Dr. David Alan Gilbert from comment #20)
> I'm not sure about your c#18 locking question, lets ask Eric.

How are you observing whether the nbd disk is locked? But that sounds like a side question, unrelated to the hang that was fixed upstream for 2.10. Moving to POST, since RHEL 7.5 will pick up the 2.10 fix by rebase.

(In reply to Eric Blake from comment #21)
> How are you observing whether the nbd disk is locked? But that sounds like a
> side question, unrelated to the hang that was fixed upstream for 2.10.

In the fix (v2.10.0-rc4), the dest VM could be continued without executing step 7 (nbd-server-stop). It seems qemu has no write protection for this image, so the dest could directly do I/O to it.

Hi Eric,
Do you have a list of all the commits needed for this?

(In reply to Dr. David Alan Gilbert from comment #26)
> Hi Eric,
> Do you have a list of all the commits needed for this?

Probably easier if I just prepare the backport. I'll reassign the z-stream bug to myself...

Moving back to ASSIGNED; we have some NBD patches that are being included in the 7.4.z build but were not part of 2.10.

*** Bug 1493890 has been marked as a duplicate of this bug. ***

*** Bug 1493901 has been marked as a duplicate of this bug. ***

Fix included in qemu-kvm-rhev-2.10.0-2.el7.

Bug is verified as passing for qemu-kvm-rhev-2.10.0-2.el7. Steps are the same as in the bug description.

Why is rhel-7.4.z+ still set? Does Bug 1495474 already have the fix for 7.4.z? Thanks.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:1104
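On comment 18's side question (revisited in comments 20 and 21, whether the exported NBD disk should be locked): qemu 2.10 introduced image-file locking, and one way to observe it is to probe the image from a second process while the destination qemu holds it open read-write. A sketch, assuming a 2.10+ build with file locking enabled; the image path is illustrative, reusing the mirror1.qcow2 name from the steps above:

```
# While the destination qemu has the image open read-write:
qemu-img info /home/test/nbd01/mirror1.qcow2
# On a locked image this fails with an error along the lines of:
#   qemu-img: Could not open '.../mirror1.qcow2': Failed to get shared "write" lock
#   Is another process using the image?
# If it succeeds without complaint, no conflicting lock is held on the image.
```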