Bug 1503437

Summary: Source QEMU hangs and migration fails when doing storage VM migration with data-plane.
Product: Red Hat Enterprise Linux 7
Component: qemu-kvm-rhev
Version: 7.5
Hardware: x86_64
OS: Linux
Status: CLOSED DUPLICATE
Severity: medium
Priority: medium
Target Milestone: rc
Reporter: Longxiang Lyu <lolyu>
Assignee: Kevin Wolf <kwolf>
QA Contact: aihua liang <aliang>
CC: aliang, chayang, coli, juzhang, knoel, lolyu, michen, ngu, qizhu, qzhang, stefanha, virt-maint
Type: Bug
Last Closed: 2018-12-12 15:46:43 UTC
Bug Depends On: 1602264, 1634219
Bug Blocks: 1649160
Attachments:
gdb debug info when src guest in mounted glusterfs-07182018

Description Longxiang Lyu 2017-10-18 06:32:22 UTC
Description of problem:
The source QEMU hangs and migration fails when doing storage VM migration with data-plane.

Version-Release number of selected component (if applicable):
kernel-3.10.0-702.el7.x86_64
qemu-kvm-rhev-2.9.0-16.el7_4.4.x86_64
guest: rhel7.4
backend: nbd

How reproducible:
100%

Steps to Reproduce:
1. Boot up the destination guest with an empty image, using data-plane.
# qemu-img create -f qcow2 mirror.qcow2 20G

#!/bin/bash
/usr/libexec/qemu-kvm \
-name guest=test-virt1 \
-machine pc-i440fx-rhel7.4.0,accel=kvm,usb=off,vmport=off,dump-guest-core=off \
-cpu SandyBridge \
-m 4G \
-smp 4,sockets=4,cores=1,threads=1 \
-boot strict=on \
-object iothread,id=iothread0 \
-device virtio-scsi-pci,bus=pci.0,addr=0x5,iothread=iothread0,id=scsi0 \
-drive file=mirror.qcow2,format=qcow2,snapshot=off,cache=none,if=none,aio=native,id=img0 \
-device scsi-hd,bus=scsi0.0,drive=img0,scsi-id=0,lun=0,id=scsi-disk0,bootindex=0 \
-netdev tap,id=hostnet0,vhost=on \
-device virtio-net-pci,netdev=hostnet0,id=net0,mac=51:54:12:b3:20:61,bus=pci.0,addr=0x3 \
-device qxl-vga \
-vnc :2 \
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8 \
-monitor stdio \
-qmp tcp:0:5555,server,nowait \
-usbdevice tablet \
-incoming tcp:0:6666


2. Export the destination img0 via the built-in NBD server.
# telnet 127.0.0.1 5555
{ "execute": "qmp_capabilities" }
{ "execute": "nbd-server-start", "arguments": { "addr": { "type": "inet","data": { "host":"10.66.11.1", "port": "9000" } } } }
{"execute":"nbd-server-add","arguments":{"device":"img0", "writable": true}}   

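Before mirroring, the export can be verified from the source host (a sanity check, not part of the original reproduction steps; assumes port 9000 on the destination is reachable):

# qemu-img info nbd://10.66.11.1:9000/img0
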
3. Boot up the source guest with data-plane.
#!/bin/bash
/usr/libexec/qemu-kvm \
-name guest=test-virt0 \
-machine pc-i440fx-rhel7.4.0,accel=kvm,usb=off,vmport=off,dump-guest-core=off \
-cpu SandyBridge \
-m 4G \
-smp 4,sockets=4,cores=1,threads=1 \
-boot strict=on \
-object iothread,id=iothread0 \
-device virtio-scsi-pci,bus=pci.0,addr=0x5,iothread=iothread0,id=scsi0 \
-drive file=rhel74-64-virtio.qcow2,format=qcow2,snapshot=off,cache=none,if=none,aio=native,id=img0 \
-device scsi-hd,bus=scsi0.0,drive=img0,scsi-id=0,lun=0,id=scsi-disk0,bootindex=0 \
-netdev tap,id=hostnet0,vhost=on \
-device virtio-net-pci,netdev=hostnet0,id=net0,mac=51:54:12:b3:20:61,bus=pci.0,addr=0x3 \
-device qxl-vga \
-vnc :1 \
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8 \
-monitor stdio \
-qmp tcp:0:4444,server,nowait \
-usbdevice tablet

4. Invoke drive-mirror on the source.
# telnet 127.0.0.1 4444
{ "execute": "qmp_capabilities" }
{ "execute": "drive-mirror", "arguments": { "device": "img0", "target": "nbd://10.66.11.1:9000/img0", "sync": "full", "format": "raw", "mode": "existing" } }

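The transition to the ready state is signalled by the BLOCK_JOB_READY event; it can also be confirmed by polling the job list (an illustrative check, not part of the original report):

{ "execute": "query-block-jobs" }

The job is ready once the entry for img0 reports "ready": true (equivalently, offset has caught up with len).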

5. After drive-mirror reaches the ready state, migrate to the destination.
{"execute": "migrate","arguments":{"uri": "tcp:10.66.11.1:6666"}}

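Migration progress on the source can be watched with the standard status query (added for completeness, not part of the original report):

{ "execute": "query-migrate" }
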
Actual results:
src:
QMP output:
{"timestamp": {"seconds": 1508307934, "microseconds": 665460}, "event": "STOP"}
The source QEMU then hangs.

dest:
(qemu) red_dispatcher_loadvm_commands: 
(qemu) info status
VM status: paused (inmigrate)


Expected results:
Migration should succeed, and the guest should continue running on the destination.

Additional info:
gdb backtrace of the source QEMU process:
# gdb -batch -ex bt -p 19906
[New LWP 20419]
[New LWP 19927]
[New LWP 19926]
[New LWP 19924]
[New LWP 19923]
[New LWP 19922]
[New LWP 19921]
[New LWP 19908]
[New LWP 19907]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007f435dd7442d in __lll_lock_wait () from /lib64/libpthread.so.0
#0  0x00007f435dd7442d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f435dd6fdcb in _L_lock_812 () from /lib64/libpthread.so.0
#2  0x00007f435dd6fc98 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00005612f6d5796f in qemu_mutex_lock (mutex=mutex@entry=0x5612f7318400 <qemu_global_mutex>) at util/qemu-thread-posix.c:65
#4  0x00005612f6a7501c in qemu_mutex_lock_iothread () at /usr/src/debug/qemu-2.10.0/cpus.c:1581
#5  0x00005612f6d54f4f in os_host_main_loop_wait (timeout=53154444) at util/main-loop.c:258
#6  main_loop_wait (nonblocking=nonblocking@entry=0) at util/main-loop.c:515
#7  0x00005612f6a3b1ca in main_loop () at vl.c:1917
#8  main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4804
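
The backtrace above covers only the main thread. To see which thread holds qemu_global_mutex, per-thread backtraces could be collected as well (a suggestion, not part of the original report):

# gdb -batch -ex 'thread apply all bt' -p 19906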

Comment 2 Longxiang Lyu 2017-10-24 01:55:57 UTC
Reproducible with qemu-kvm-rhev-2.9.0-16.el7_4.10.x86_64.

Comment 3 Longxiang Lyu 2017-10-24 05:04:00 UTC
Reproducible on 7.4.z:
kernel-3.10.0-693.5.1.el7.x86_64
qemu-kvm-rhev-2.9.0-16.el7_4.9.x86_64

Comment 4 Longxiang Lyu 2017-11-29 06:30:37 UTC
If the source is reopened onto the mirrored destination image after the block mirror reaches the ready state, the problem does not occur.

From the source, issue:
{"execute": "block-job-complete", "arguments": { "device": "img0"} }

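A minimal QMP sequence for this workaround, with the ordering inferred from the comment (wait for the BLOCK_JOB_READY event before completing, and for BLOCK_JOB_COMPLETED before migrating; these ordering details are assumptions beyond what the comment states):

{ "execute": "block-job-complete", "arguments": { "device": "img0" } }
(wait for the BLOCK_JOB_COMPLETED event)
{ "execute": "migrate", "arguments": { "uri": "tcp:10.66.11.1:6666" } }
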
Comment 6 Longxiang Lyu 2018-07-18 05:17:07 UTC
Tested with:
qemu-kvm-rhev-2.12.0-7.el7.x86_64
kernel-3.10.0-918.el7.x86_64
This problem appears to be fixed, but the source QEMU process fails to quit and hangs, the same as described in Bug 1482478. I may open a new bug to track this.

Comment 7 Gu Nini 2018-07-18 06:40:28 UTC
(In reply to Longxiang Lyu from comment #6)
> Tested with:
> qemu-kvm-rhev-2.12.0-7.el7.x86_64
> kernel-3.10.0-918.el7.x86_64
> This problem appears to be fixed, but the source QEMU process fails to
> quit and hangs, the same as described in Bug 1482478. I may open a new
> bug to track this.

New bz1602264 reported.

Comment 8 Gu Nini 2018-07-18 09:11:10 UTC
Created attachment 1459676 [details]
gdb debug info when src guest in mounted glusterfs-07182018

(In reply to Longxiang Lyu from comment #6)
> Tested with:
> qemu-kvm-rhev-2.12.0-7.el7.x86_64
> kernel-3.10.0-918.el7.x86_64
> This problem appears to be fixed, but the source QEMU process fails to
> quit and hangs, the same as described in Bug 1482478. I may open a new
> bug to track this.

I ran the same test on the same qemu/kernel versions in the following scenarios:

1) The src guest is on a mounted glusterfs directory while the dst one is on a local filesystem: the src guest core dumped during the drive-mirror process; see the attached gdb log for details.

# ./vm22.sh rhel76-64-virtio-scsi.qcow2
QEMU 2.12.0 monitor - type 'help' for more information
(qemu) 
(qemu) qemu-kvm: util/aio-posix.c:527: run_poll_handlers: Assertion `ctx->poll_disable_cnt == 0' failed.
./vm22.sh: line 28: 23378 Aborted                 (core dumped) /usr/libexec/qemu-kvm -name 'avocado-vt-vm1' ......

2) The dst guest is on a mounted glusterfs directory while the src one is on a local filesystem: the src guest hangs after 'migrate' finishes, when I issue the 'migrate-continue' command after 'block-job-cancel'; meanwhile, the dst guest remains in 'inmigrate' status.

Please note I used the following test case:
https://polarion.engineering.redhat.com/polarion/#/project/RedHatEnterpriseLinux7/workitem?id=RHEL7-62535

Comment 11 Kevin Wolf 2018-12-12 15:46:43 UTC
Given the backtrace in comment 10, it looks like the remaining problem is the one tracked in bug 1634219.

*** This bug has been marked as a duplicate of bug 1634219 ***