RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.
Bug 1482478 - Fail to quit source qemu when do live migration after mirroring guest to NBD server
Summary: Fail to quit source qemu when do live migration after mirroring guest to NBD ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: qemu-kvm-rhev
Version: 7.4
Hardware: x86_64
OS: Linux
high
high
Target Milestone: rc
: ---
Assignee: Eric Blake
QA Contact: Longxiang Lyu
URL:
Whiteboard:
: 1493890 1493901 (view as bug list)
Depends On:
Blocks: 1495472 1495474
TreeView+ depends on / blocked
 
Reported: 2017-08-17 11:30 UTC by Longxiang Lyu
Modified: 2022-03-13 14:23 UTC (History)
21 users (show)

Fixed In Version: qemu-kvm-rhev-2.10.0-2.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1495472 1495474 (view as bug list)
Environment:
Last Closed: 2018-04-11 00:33:01 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2018:1104 0 normal SHIPPED_LIVE Important: qemu-kvm-rhev security, bug fix, and enhancement update 2018-04-10 22:54:38 UTC

Description Longxiang Lyu 2017-08-17 11:30:54 UTC
Description of problem:
Fail to quit source qemu when do live migration after mirroring guest to NBD server.

Version-Release number of selected component (if applicable):
kernel-3.10.0-702.el7.x86_64
qemu-kvm-rhev-2.9.0-16.el7_4.4.x86_64
guest: rhel7.4
backend: nbd

How reproducible:
100%

Steps to reproduce:
1. Boot guest on src host
#!/bin/bash
/usr/libexec/qemu-kvm \
-name guest=test-virt \
-machine pc-i440fx-rhel7.4.0,accel=kvm,usb=off,vmport=off,dump-guest-core=off \
-cpu SandyBridge \
-m 2G \
-smp 4,sockets=4,cores=1,threads=1 \
-boot strict=on \
-drive file=/home/test/nbd01/test.qcow2,if=none,cache=none,snapshot=off,aio=native,format=qcow2,id=img0 \
-device virtio-blk-pci,bus=pci.0,addr=0x7,drive=img0,id=virtio-disk0,bootindex=1 \
-netdev tap,id=hostnet0,vhost=on \
-device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:12:b3:20:61,bus=pci.0,addr=0x3 \
-device qxl-vga \
-usbdevice tablet \
-vnc :2 \
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8 \
-monitor stdio \
-qmp tcp:0:5555,server,nowait \

2. Create empty images on dst disk
#qemu-img create -f qcow2 rhel74-virtio.qcow2 20G

3. Start the dest qemu and setup as NBD server
#!/bin/bash
/usr/libexec/qemu-kvm \
-name guest=test-virt \
-machine pc-i440fx-rhel7.4.0,accel=kvm,usb=off,vmport=off,dump-guest-core=off \
-cpu SandyBridge \
-m 2G \
-smp 4,sockets=4,cores=1,threads=1 \
-boot strict=on \
-drive file=/home/test/nbd01/mirror1.qcow2,if=none,cache=none,snapshot=off,aio=native,format=qcow2,id=img0 \
-device virtio-blk-pci,bus=pci.0,addr=0x7,drive=img0,id=virtio-disk0,bootindex=1 \
-netdev tap,id=hostnet0,vhost=on \
-device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:12:b3:20:61,bus=pci.0,addr=0x3 \
-device qxl-vga \
-usbdevice tablet \
-vnc :1 \
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8 \
-monitor stdio \
-qmp tcp:0:4444,server,nowait \
-incoming tcp:0:6666

{ "execute": "qmp_capabilities" }
{ "execute": "nbd-server-start", "arguments": { "addr": { "type": "inet", "data": { "host": "127.0.0.1", "port": "9999" } } } }
{ "execute": "nbd-server-add", "arguments": { "device": "img0", "writable": true } }


4. Invoke drive-mirror on the src
{ "execute": "qmp_capabilities" }
{ "execute": "drive-mirror", "arguments": { "device": "img0", "target": "nbd://127.0.0.1:9999/img0", "sync": "full", "format": "raw", "mode": "existing" } }

5. Reopen src to the new mirrored image when block mirror is ready.
{"execute": "block-job-complete", "arguments": { "device": "img0"} }

6. Invoke the migration
(qemu) migrate -d tcp:0:6666

7. Invoke nbd-server-stop command when migration completes
{ "execute" : "nbd-server-stop", "arguments" : {} }

8. Quit qemu of src
(qemu) quit





Actual results:
dest qemu monitor:
(qemu) red_dispatcher_loadvm_commands: 
qemu-kvm: Conflicts with use by a block device as 'root', which uses 'write' on #block121

On src fail to quit qemu, hang.


Expected results:
src qemu quits successfully.

Additional info:
Fail to reproduce for 7.3-Z, src qemu quits successfully.
Kernel: 3.10.0-702.el7.x86_64
Qemu: 2.9.0-16.el7_4.4

Comment 2 Longxiang Lyu 2017-08-18 01:50:45 UTC
On the Additional info part:
Fail to reproduce for 7.3.z, src qemu quits successfully.
The version info should be:
kernel-3.10.0-514.el7.x86_64
qemu-kvm-rhev-2.6.0-28.el7_3.14

Comment 4 Longxiang Lyu 2017-08-18 02:59:58 UTC
for qemu-kvm-rhev-2.9.0-16.el7_4.3.x86_64, src qemu can't quit.
dest qemu output during migration:
(qemu) red_dispatcher_loadvm_commands: 
qemu-kvm: Conflicts with use by a block device as 'root', which uses 'write' on #block126

dest qemu output after invocation of nbd-server-stop:
(qemu) nbd/server.c:nbd_receive_request():L706: read failed

Comment 5 Stefan Hajnoczi 2017-08-18 14:59:13 UTC
Hi Fam, this seems similar to issues that have been reported (and fixed?) upstream recently.  What do you think?

Comment 6 Fam Zheng 2017-08-20 02:51:38 UTC
Yes, it's worth to test the upstream patch (in 2.10-rc3). They fix two issues about op blockers, though I'm not sure about the src hang. Is there a backtrace when src hangs?

Comment 7 Longxiang Lyu 2017-08-20 14:27:49 UTC
1.
step 2 should be:
#qemu-img create -f qcow2 mirror1.qcow2 20G

2.
gdb bt output of the hang qemu process:
#0  0x00007fccc3149aff in ppoll () from /lib64/libc.so.6
#1  0x000055be482a0c9b in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>)
    at /usr/include/bits/poll2.h:77
#2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=-1)
    at util/qemu-timer.c:322
#3  0x000055be482a2955 in aio_poll (ctx=ctx@entry=0x55be4a407700, blocking=<optimized out>)
    at util/aio-posix.c:622
#4  0x000055be48231184 in bdrv_flush (bs=bs@entry=0x55be4a4bc800) at block/io.c:2418
#5  0x000055be481ed52b in bdrv_close (bs=0x55be4a4bc800) at block.c:2949
#6  bdrv_delete (bs=0x55be4a4bc800) at block.c:3139
#7  bdrv_unref (bs=0x55be4a4bc800) at block.c:4116
#8  0x000055be48225314 in blk_remove_bs (blk=blk@entry=0x55be4a42a000) at block/block-backend.c:552
#9  0x000055be4822535b in blk_remove_all_bs () at block/block-backend.c:306
#10 0x000055be481ea648 in bdrv_close_all () at block.c:3009
#11 0x000055be47f9306b in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>)
    at vl.c:4737

Comment 9 Longxiang Lyu 2017-08-21 06:20:37 UTC
bug exists in qemu-2.10.0-rc3
on src fail to quit qemu, hangs.

Comment 10 Longxiang Lyu 2017-08-21 06:32:19 UTC
gdb bt output for the src qemu-process of qemu-2.10.0-rc3:

# gdb -batch -ex bt -p 3005
[New LWP 3045]
[New LWP 3023]
[New LWP 3022]
[New LWP 3021]
[New LWP 3019]
[New LWP 3006]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007fded767aaff in ppoll () from /lib64/libc.so.6
warning: File "/home/test/nbd01/qemu-2.10/qemu-2.10.0-rc3/.gdbinit" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load:/usr/bin/mono-gdb.py".
To enable execution of this file add
	add-auto-load-safe-path /home/test/nbd01/qemu-2.10/qemu-2.10.0-rc3/.gdbinit
line to your configuration file "/root/.gdbinit".
To completely disable this security protection add
	set auto-load safe-path /
line to your configuration file "/root/.gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
	info "(gdb)Auto-loading safe path"
#0  0x00007fded767aaff in ppoll () at /lib64/libc.so.6
#1  0x000055c65a2e5adb in qemu_poll_ns (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
#2  0x000055c65a2e5adb in qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=-1) at util/qemu-timer.c:322
#3  0x000055c65a2e779a in aio_poll (ctx=ctx@entry=0x55c65baeafb0, blocking=<optimized out>) at util/aio-posix.c:622
#4  0x000055c65a278314 in bdrv_flush (bs=bs@entry=0x55c65d7b5c30) at block/io.c:2288
#5  0x000055c65a22cacb in bdrv_unref (bs=0x55c65d7b5c30) at block.c:3061
#6  0x000055c65a22cacb in bdrv_unref (bs=0x55c65d7b5c30) at block.c:3251
#7  0x000055c65a22cacb in bdrv_unref (bs=0x55c65d7b5c30) at block.c:4237
#8  0x000055c65a26ba3a in blk_remove_bs (blk=blk@entry=0x55c65bb14fc0) at block/block-backend.c:647
#9  0x000055c65a26ba8b in blk_remove_all_bs () at block/block-backend.c:401
#10 0x000055c65a22a138 in bdrv_close_all () at block.c:3121
#11 0x000055c659f5b8ea in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4796

Comment 11 Stefan Hajnoczi 2017-08-21 16:17:42 UTC
I have sent a fix upstream.  Please test it:
https://github.com/stefanha/qemu/tree/nbd-fix-eof-hang

Thanks!

Comment 12 Eric Blake 2017-08-21 18:54:11 UTC
If I'm correct, this worked in 2.8, regressed in 2.9 and 2.10 (unless we do an unlikely -rc4), but should be one of the early commits into 2.11 as well as an easy backport for downstream

Comment 13 Longxiang Lyu 2017-08-22 11:11:16 UTC
bug still exists for the fix upstream:
https://github.com/stefanha/qemu/tree/nbd-fix-eof-hang

# ps aux | grep qemu
root      9801  3.6  6.7 6607480 817448 pts/2  Sl+  18:44   0:35 /home/test/iscsi/qemu-fix/qemu/x86_64-softmmu/qemu-system-x86_64 -name guest=test-virt -machine pc-i440fx-2.9,accel=kvm,usb=off,vmport=off,dump-guest-core=off -cpu SandyBridge -m 2G -smp 4,sockets=4,cores=1,threads=1 -boot strict=on -drive file=/home/test/nbd01/test.qcow2,if=none,cache=none,snapshot=off,format=qcow2,id=img0 -device virtio-blk-pci,bus=pci.0,addr=0x7,drive=img0,id=virtio-disk0,bootindex=1 -netdev tap,id=hostnet0,vhost=on -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:12:b3:20:61,bus=pci.0,addr=0x3 -device VGA -usbdevice tablet -vnc :2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8 -monitor stdio -qmp tcp:0:5555,server,nowait
root     10690  0.0  0.0 112664   972 pts/1    R+   19:00   0:00 grep --color=auto qemu

# gdb -batch -ex bt -p 9801
[New LWP 9819]
[New LWP 9817]
[New LWP 9816]
[New LWP 9815]
[New LWP 9814]
[New LWP 9802]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007f66e9dc6aff in ppoll () from /lib64/libc.so.6
#0  0x00007f66e9dc6aff in ppoll () at /lib64/libc.so.6
#1  0x0000559ae2ad98ab in qemu_poll_ns (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
#2  0x0000559ae2ad98ab in qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=-1) at util/qemu-timer.c:322
#3  0x0000559ae2adb56a in aio_poll (ctx=ctx@entry=0x559ae3fddbb0, blocking=<optimized out>) at util/aio-posix.c:622
#4  0x0000559ae2a734a4 in bdrv_flush (bs=bs@entry=0x559ae3fe17d0) at block/io.c:2418
#5  0x0000559ae2a2de0b in bdrv_unref (bs=0x559ae3fe17d0) at block.c:2949
#6  0x0000559ae2a2de0b in bdrv_unref (bs=0x559ae3fe17d0) at block.c:3139
#7  0x0000559ae2a2de0b in bdrv_unref (bs=0x559ae3fe17d0) at block.c:4121
#8  0x0000559ae2a679d4 in blk_remove_bs (blk=blk@entry=0x559ae4007710) at block/block-backend.c:552
#9  0x0000559ae2a67a1b in blk_remove_all_bs () at block/block-backend.c:306
#10 0x0000559ae2a2af78 in bdrv_close_all () at block.c:3009
#11 0x0000559ae276ec95 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4721

Comment 14 Dr. David Alan Gilbert 2017-08-24 13:19:38 UTC
Does this work now with current qemu (1eed3399) - Eric's big batch of nbd fixes went in yesterday.
I was testing with --copy-storage-all yesterday and it seemed happy.

Comment 15 Longxiang Lyu 2017-08-26 06:25:57 UTC
bug exists for qemu-kvm-rhev-2.9.0-16.el7_4.5

Comment 17 Longxiang Lyu 2017-08-29 09:19:43 UTC
bug is fixed in v2.10.0-rc4.

Comment 18 Longxiang Lyu 2017-08-29 09:20:37 UTC
I see the exported nbd disk is not locked after migration and dest vm is in running status, should it be locked?

Comment 19 Longxiang Lyu 2017-08-29 09:31:23 UTC
This works as 2.8 does, which doesn't have this bug.

Comment 20 Dr. David Alan Gilbert 2017-09-05 17:06:50 UTC
I'm not sure about your c#18 locking question, lets ask Eric.

Comment 21 Eric Blake 2017-09-20 18:28:45 UTC
(In reply to Dr. David Alan Gilbert from comment #20)
> I'm not sure about your c#18 locking question, lets ask Eric.

How are you observing whether the nbd disk is locked? But that sounds like a side question, unrelated to the hang that was fixed upstream for 2.10.

Comment 22 Eric Blake 2017-09-20 18:30:19 UTC
Moving to POST, since RHEL 7.5 will pick up the 2.10 fix by rebase

Comment 25 Longxiang Lyu 2017-09-26 01:54:47 UTC
(In reply to Eric Blake from comment #21)
> (In reply to Dr. David Alan Gilbert from comment #20)
> > I'm not sure about your c#18 locking question, lets ask Eric.
> 
> How are you observing whether the nbd disk is locked? But that sounds like a
> side question, unrelated to the hang that was fixed upstream for 2.10.

In the fix: v2.10.0-rc4, dest VM could be continued without executing step 7: nbd-server-stop. It seems qemu has no write protection for this image so that the dest could directly do I/O to it.

Comment 26 Dr. David Alan Gilbert 2017-09-26 11:35:59 UTC
Hi Eric,
  Do you have a list of all the commits needed for this?

Comment 27 Eric Blake 2017-09-26 12:58:40 UTC
(In reply to Dr. David Alan Gilbert from comment #26)
> Hi Eric,
>   Do you have a list of all the commits needed for this?

Probably easier if I just prepare the backport. I'll reassign the z-stream bug to myself...

Comment 29 Eric Blake 2017-10-06 18:00:22 UTC
Moving back to assigned; we have some NBD patches that are being included in the 7.4.z build but were not part of 2.10.

Comment 30 Eric Blake 2017-10-06 18:03:36 UTC
*** Bug 1493890 has been marked as a duplicate of this bug. ***

Comment 31 Eric Blake 2017-10-06 18:03:58 UTC
*** Bug 1493901 has been marked as a duplicate of this bug. ***

Comment 33 Miroslav Rezanina 2017-10-13 10:25:29 UTC
Fix included in qemu-kvm-rhev-2.10.0-2.el7

Comment 34 Longxiang Lyu 2017-10-15 11:16:46 UTC
Bug is verified pass for qemu-kvm-rhev-2.10.0-2.el7.

Steps are the same as bug description.

Comment 35 Karen Noel 2017-11-13 22:32:00 UTC
Why is rhel-7.4.z+ still set? Does Bug 1495474 already have the fix for 7.4.z? Thanks.

Comment 51 errata-xmlrpc 2018-04-11 00:33:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:1104


Note You need to log in before you can comment on or make changes to this bug.