Bug 1506151

Summary: [data-plane] Quitting qemu in destination side encounters "core dumped" when doing live migration
Product: Red Hat Enterprise Linux 7
Component: qemu-kvm-rhev
Version: 7.5
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: high
Target Milestone: rc
Target Release: ---
Fixed In Version: qemu-kvm-rhev-2.10.0-10.el7
Last Closed: 2018-04-11 00:44:15 UTC
Type: Bug
Reporter: yilzhang
Assignee: jason wang <jasowang>
QA Contact: xianwang <xianwang>
CC: ailan, aliang, chayang, coli, dgilbert, juzhang, knoel, lmiksik, michen, pbonzini, qzhang, stefanha, virt-maint, xianwang, yilzhang

Attachments: GDB-bt__for-d)__inComment9

Description yilzhang 2017-10-25 09:17:13 UTC
Description of problem:
Boot up one guest with a system disk and one data disk both bound to the same iothread, then do live migration;
The migration cannot finish within five minutes; quitting qemu on the destination side then makes qemu-kvm abort abnormally.


Version-Release number of selected component (if applicable):
Host kernel:   3.10.0-747.el7.ppc64le
qemu-kvm-rhev: qemu-kvm-rhev-2.10.0-3.el7
SLOF:          SLOF-20170724-2.git89f519f.el7.noarch
Guest kernel:  3.10.0-747.el7.ppc64le

How reproducible: 6/6


Steps to Reproduce:
1. Boot one guest on the source host with data-plane, whose system disk and data disk are bound to the same iothread
/usr/libexec/qemu-kvm \
 -smp 8,sockets=2,cores=4,threads=1 -m 8192 \
 -serial unix:/tmp/dp-serial.log,server,nowait \
 -nodefaults \
 -rtc base=localtime,clock=host \
 -boot menu=on \
 -monitor stdio \
-monitor unix:/tmp/monitor1,server,nowait \
 -qmp tcp:0:777,server,nowait \
 -device pci-bridge,id=bridge1,chassis_nr=1,bus=pci.0 \
\
-object iothread,id=iothread0 \
 -device virtio-scsi-pci,bus=bridge1,addr=0x1f,id=scsi0,iothread=iothread0 \
-drive file=rhel.qcow2,media=disk,if=none,cache=none,id=drive_sysdisk,aio=native,format=qcow2,werror=stop,rerror=stop \
-device scsi-hd,drive=drive_sysdisk,bus=scsi0.0,id=sysdisk,bootindex=0 \
\
-drive file=/home/yilzhang/dataplane/DISK-image-for-migration.raw,if=none,cache=none,id=drive_ddisk_2,aio=native,format=raw,werror=stop,rerror=stop \
-device scsi-hd,drive=drive_ddisk_2,bus=scsi0.0,id=ddisk_2 \
\
 -netdev tap,id=net0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown,vhost=on \
 -device virtio-net-pci,netdev=net0,id=nic0,mac=52:54:00:c3:e7:8a,bus=bridge1,addr=0x1e \
2. Boot the guest on the destination host with the incoming option
-incoming tcp:0:1234

3. After the guest is up, migrate it to the destination
(qemu) migrate tcp:10.16.69.89:1234
4. After five minutes, the migration is still hung on the source side (not finished yet)
5. Quit the qemu-kvm process on the destination side
(qemu) info status
 VM status: paused (inmigrate)
(qemu) q



Actual results:
Migration cannot complete, and quitting qemu on the destination side encounters "core dumped":
[Destination]# sh des-Cannot_migrate_9325.sh
QEMU 2.10.0 monitor - type 'help' for more information
(qemu) VNC server running on ::1:5900
(qemu)
(qemu) info status
VM status: paused (inmigrate)
(qemu) q
qemu-kvm: /builddir/build/BUILD/qemu-2.10.0/hw/virtio/virtio.c:212: vring_get_region_caches: Assertion `caches != ((void *)0)' failed.
des-Cannot_migrate_9325.sh: line 24: 58406 Aborted                 (core dumped) /usr/libexec/qemu-kvm -smp 8,sockets=2,cores=4,threads=1 -m 8192 -serial unix:/tmp/dp-serial.log,server,nowait -nodefaults -rtc base=localtime,clock=host -boot menu=on -monitor stdio -monitor unix:/tmp/monitor1,server,nowait -qmp tcp:0:777,server,nowait -device pci-bridge,id=bridge1,chassis_nr=1,bus=pci.0 -object iothread,id=iothread0 -device virtio-scsi-pci,bus=bridge1,addr=0x1f,id=scsi0,iothread=iothread0 -drive file=rhel.qcow2,media=disk,if=none,cache=none,id=drive_sysdisk,aio=native,format=qcow2,werror=stop,rerror=stop -device scsi-hd,drive=drive_sysdisk,bus=scsi0.0,id=sysdisk,bootindex=0 -drive file=/home/yilzhang/dataplane/DISK-image-for-migration.raw,if=none,cache=none,id=drive_ddisk_2,aio=native,format=raw,werror=stop,rerror=stop -device scsi-hd,drive=drive_ddisk_2,bus=scsi0.0,id=ddisk_2 -netdev tap,id=net0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown,vhost=on -device virtio-net-pci,netdev=net0,id=nic0,mac=52:54:00:c3:e7:8a,bus=bridge1,addr=0x1e -incoming tcp:0:1234


Expected results:
Migration succeeds within five minutes, and qemu-kvm does not abort with a core dump


Additional info:
[New LWP 16115]
[New LWP 16114]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/libexec/qemu-kvm -smp 8,sockets=2,cores=4,threads=1 -m 8192 -serial unix:/'.
Program terminated with signal 6, Aborted.
#0  0x00003fff7f36fa70 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.ppc64le cyrus-sasl-gssapi-2.1.26-22.el7.ppc64le cyrus-sasl-lib-2.1.26-22.el7.ppc64le cyrus-sasl-md5-2.1.26-22.el7.ppc64le elfutils-libelf-0.170-1.el7.ppc64le elfutils-libs-0.170-1.el7.ppc64le glib2-2.54.1-1.el7.ppc64le glibc-2.17-214.el7.ppc64le gmp-6.0.0-15.el7.ppc64le gnutls-3.3.26-9.el7.ppc64le gperftools-libs-2.6.1-1.el7.ppc64le keyutils-libs-1.5.8-3.el7.ppc64le krb5-libs-1.15.1-15.el7.ppc64le libaio-0.3.109-13.el7.ppc64le libattr-2.4.46-13.el7.ppc64le libcap-2.22-9.el7.ppc64le libcom_err-1.42.9-10.el7.ppc64le libcurl-7.29.0-45.el7.ppc64le libdb-5.3.21-20.el7.ppc64le libfdt-1.4.3-1.el7.ppc64le libffi-3.0.13-18.el7.ppc64le libgcc-4.8.5-22.el7.ppc64le libgcrypt-1.5.3-14.el7.ppc64le libgpg-error-1.12-3.el7.ppc64le libibverbs-15-1.el7.ppc64le libidn-1.28-4.el7.ppc64le libiscsi-1.9.0-7.el7.ppc64le libnl3-3.2.28-4.el7.ppc64le libpng-1.5.13-7.el7_2.ppc64le librdmacm-15-1.el7.ppc64le libseccomp-2.3.1-3.el7.ppc64le libselinux-2.5-12.el7.ppc64le libssh2-1.4.3-10.el7_2.1.ppc64le libstdc++-4.8.5-22.el7.ppc64le libtasn1-4.10-1.el7.ppc64le libusbx-1.0.21-1.el7.ppc64le lzo-2.06-8.el7.ppc64le nettle-2.7.1-8.el7.ppc64le nspr-4.17.0-1.el7.ppc64le nss-3.33.0-2.el7.ppc64le nss-softokn-freebl-3.33.0-1.el7.ppc64le nss-util-3.33.0-1.el7.ppc64le numactl-libs-2.0.9-7.el7.ppc64le openldap-2.4.44-5.el7.ppc64le openssl-libs-1.0.2k-8.el7.ppc64le p11-kit-0.23.5-3.el7.ppc64le pcre-8.32-17.el7.ppc64le pixman-0.34.0-1.el7.ppc64le snappy-1.1.0-3.el7.ppc64le systemd-libs-219-45.el7.ppc64le xz-libs-5.2.2-1.el7.ppc64le zlib-1.2.7-17.el7.ppc64le
(gdb) bt
#0  0x00003fff7f36fa70 in raise () from /lib64/libc.so.6
#1  0x00003fff7f371dec in abort () from /lib64/libc.so.6
#2  0x00003fff7f365554 in __assert_fail_base () from /lib64/libc.so.6
#3  0x00003fff7f365644 in __assert_fail () from /lib64/libc.so.6
#4  0x000000009c2e29c0 in vring_get_region_caches (vq=<optimized out>) at /usr/src/debug/qemu-2.10.0/hw/virtio/virtio.c:212
#5  0x000000009c2e3454 in vring_avail_idx (vq=0x1001b4e0080) at /usr/src/debug/qemu-2.10.0/hw/virtio/virtio.c:226
#6  virtio_queue_set_notification (vq=0x1001b4e0080, enable=<optimized out>) at /usr/src/debug/qemu-2.10.0/hw/virtio/virtio.c:325
#7  0x000000009c2ccb64 in virtio_net_set_status (vdev=0x1001b434420, status=<optimized out>) at /usr/src/debug/qemu-2.10.0/hw/net/virtio-net.c:295
#8  0x000000009c4d4518 in qemu_del_net_client (nc=0x1001a690000) at net/net.c:391
#9  0x000000009c4d5fc4 in net_cleanup () at net/net.c:1468
#10 0x000000009c21810c in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4813
(gdb) bt full
#0  0x00003fff7f36fa70 in raise () from /lib64/libc.so.6
No symbol table info available.
#1  0x00003fff7f371dec in abort () from /lib64/libc.so.6
No symbol table info available.
#2  0x00003fff7f365554 in __assert_fail_base () from /lib64/libc.so.6
No symbol table info available.
#3  0x00003fff7f365644 in __assert_fail () from /lib64/libc.so.6
No symbol table info available.
#4  0x000000009c2e29c0 in vring_get_region_caches (vq=<optimized out>) at /usr/src/debug/qemu-2.10.0/hw/virtio/virtio.c:212
        caches = <optimized out>
#5  0x000000009c2e3454 in vring_avail_idx (vq=0x1001b4e0080) at /usr/src/debug/qemu-2.10.0/hw/virtio/virtio.c:226
        caches = <optimized out>
        pa = <optimized out>
#6  virtio_queue_set_notification (vq=0x1001b4e0080, enable=<optimized out>) at /usr/src/debug/qemu-2.10.0/hw/virtio/virtio.c:325
No locals.
#7  0x000000009c2ccb64 in virtio_net_set_status (vdev=0x1001b434420, status=<optimized out>) at /usr/src/debug/qemu-2.10.0/hw/net/virtio-net.c:295
        ncs = 0x1001b500020
        queue_started = false
        n = 0x1001b434420
        __func__ = "virtio_net_set_status"
        q = 0x1001b4528c0
        i = 0
        queue_status = 15 '\017'
#8  0x000000009c4d4518 in qemu_del_net_client (nc=0x1001a690000) at net/net.c:391
        nic = <optimized out>
        ncs = {0x1001a690000, 0x3fffc98e7540, 0x20, 0x4, 0x3fffc98e7608, 0x3fffc98e7610, 0x0, 0x3fffc98e7540, 0x9c67f468, 0x9c67f46d, 0x3fffc98e7540, 0x3fffc98e7590, 0x4, 
          0x3fff7f388560 <vfprintf@@GLIBC_2.17+336>, 0x3fff7f517400, 0x3fffc98e75e0, 0x0, 0x3fffc98e7c90, 0x0, 0x1, 0x3fff7f50ce58 <_IO_str_jumps>, 0x3fffc98e7600, 0x5d84293e2ad6e200, 0x3fffc98e75e0, 
          0x3fff7f517400, 0x3fffc98e7600, 0x3fff7f517400, 0x3fffc98e7610, 0x3fffc98e7cc0, 0x1001b454bc0, 0x36, 0x3fffc98e7630, 0x9c67e358, 0x3fff7f3bed8c <__GI__IO_default_xsputn+364>, 0x3fffc98e7600, 
          0x3fffc98e7c70, 0x0, 0x3fffc98e7630, 0x9c6d07f1, 0x9c6d07f3, 0x3fffc98e7630, 0x3fffc98e7ca0, 0x0, 0x3fff7f388560 <vfprintf@@GLIBC_2.17+336>, 0x3fff7f517400, 0xe6e8, 0x0, 0x3fffc98e7d80, 0x0, 0x1, 
          0x3fff7f50ce58 <_IO_str_jumps>, 0x3fffc98e76f0, 0x265c8, 0x3fff7f3ac2cc <__GI__IO_padn+364>, 0x3fff7f517400, 0x3fffc98e76f0, 0x3fffc98e8280, 0x3fffc98e7700, 0x6b, 0x18, 0x0, 0x3fffc98e76f0, 
          0x9c67e358, 0x9c67e36f, 0x3fffc98e76f0, 0x3fffc98e7d60, 0x3fffc98e7700, 0x3fff7f388560 <vfprintf@@GLIBC_2.17+336>, 0x3fff7f517400, 0x3fff7f388560 <vfprintf@@GLIBC_2.17+336>, 0x3fff7f517400, 
          0x3fff803183e8 <_dl_map_object_from_fd+2856>, 0x3fffc98e82e0, 0x3fff8035fb48, 0x3fffc98e8280, 0x3fffc98e7780, 0x3fff7eb404d0, 0x3fffc98e7790, 0x3fff7f517400, 0x3fffc98e77a0, 0x3fffc98e7e60, 
          0x3fffc98e8068, 0x28, 0x3fffc98e77d0, 0x17, 0x3fff7f3bed8c <__GI__IO_default_xsputn+364>, 0x0, 0x3fff7f3bed8c <__GI__IO_default_xsputn+364>, 0x0, 0x3fffc98e77d0, 0x9c6c9b98, 0x9c6c9b9d, 
          0x3fffc98e77d0, 0x3fffc98e7e40, 0x3fffc98e77e0, 0x3fff7f388560 <vfprintf@@GLIBC_2.17+336>, 0x3fff7f517400, 0x3fff7f388560 <vfprintf@@GLIBC_2.17+336>, 0x3fff7f517400, 0x3fffc98e7cb0, 0x3fffc98e7cc0, 
          0x3fffc98e7cd0, 0x3fff80322ff0 <openaux>, 0x3fffc98e7c80, 0x3fff8034fc10 <__libc_enable_secure_internal>, 0x3fffc98e7880, 0x3fff7f3412d8, 0x0, 0x3fff7f50f210, 0x3066313030303034, 0x0, 0x3fffc98e7880, 
          0x9c687969, 0x3fffc98e8110, 0x3fffc98e7880, 0x3fffc98e7ef0, 0x0, 0x3fff7f388560 <vfprintf@@GLIBC_2.17+336>, 0x3fff7f517400, 0x3fff80326090 <_dl_catch_error+144>, 
          0x9c817b00 <virtio_net_properties+2168>, 0x2, 0x0, 0x3fffc98e7be0, 0xffffffff90000001, 0x3fffc98e7920, 0x3fff80357e00, 0x9c67f46d, 0xffffffffffffffff, 0x10, 0x0, 0x3fffc98e7920, 0x9c66dfb8, 
          0x9c66dfba, 0x3fffc98e7920, 0x3fffc98e7f90, 0x3fffc98e7cc8, 0x3fff7f388560 <vfprintf@@GLIBC_2.17+336>, 0x3fff7f517400, 0x0, 0xfffffffffffffff8, 0x3fffc98e7a10, 0x0, 0x3fffc98e79b0, 0x9c67f498, 
          0x9c67f49c, 0x3fffc98e7980, 0x3fffc98e7ff0, 0x0, 0x3fffc98e79b0, 0x9c687969, 0x3fffc98e8240, 0x3fffc98e79b0, 0x3fffc98e8020, 0x1001a552400, 0x3fff7f388560 <vfprintf@@GLIBC_2.17+336>, 0x3fff7f517400, 
          0x3fffc98e7a90, 0x1001aed0000, 0x1e, 0x9c817b00 <virtio_net_properties+2168>, 0x9c6c6180, 0x0, 0x3031000000000000, 0x1001b005120, 0x3fffc98e7a30, 0x0, 0x9c642bd8 <error_setg_internal+56>, 
          0x1001b005120, 0x3fffc98e7ab0, 0x1001a552400, 0x3fffc98e83b8, 0x11110000, 0x3fffc98e7b10, 0x1001aed0000, 0x1e, 0x9c817b00 <virtio_net_properties+2168>, 0x9c6c6180, 0x0, 0x3fffc98e7e58, 0x1001b005120, 
          0x9c67e36f, 0x0, 0x10, 0x78, 0x3fffc98e7e88, 0x5, 0x9c6d07f3, 0x11110000, 0x3fffc98e7ac0, 0x0, 0x0, 0x0, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffca, 0x3fffc98e7e90, 0xffffffca, 0x0, 
          0x3fffc98e7b70...}
        queues = <optimized out>
        i = <optimized out>
        nf = <optimized out>
        next = <optimized out>
        __PRETTY_FUNCTION__ = "qemu_del_net_client"
#9  0x000000009c4d5fc4 in net_cleanup () at net/net.c:1468
        nc = <optimized out>
#10 0x000000009c21810c in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4813
---Type <return> to continue, or q <return> to quit---
        i = <optimized out>
        snapshot = <optimized out>
        linux_boot = 0
        initrd_filename = <optimized out>
        kernel_filename = 0x0
        kernel_cmdline = <optimized out>
        boot_order = <optimized out>
        boot_once = <optimized out>
        cyls = 0
        heads = 0
        secs = 0
        translation = <optimized out>
        opts = <optimized out>
        machine_opts = <optimized out>
        hda_opts = <optimized out>
        icount_opts = <optimized out>
        accel_opts = <optimized out>
        olist = <optimized out>
        optind = 38
        optarg = 0x3fffc98ef3a9 "tcp:0:1234"
        loadvm = <optimized out>
        machine_class = 0x0
        cpu_model = <optimized out>
        vga_model = 0x0
        qtest_chrdev = <optimized out>
        qtest_log = <optimized out>
        pid_file = <optimized out>
        incoming = 0x3fffc98ef3a9 "tcp:0:1234"
        defconfig = <optimized out>
        userconfig = <optimized out>
        nographic = <optimized out>
        display_type = <optimized out>
        display_remote = <optimized out>
        log_mask = <optimized out>
        log_file = <optimized out>
        trace_file = <optimized out>
        maxram_size = <optimized out>
        ram_slots = <optimized out>
        vmstate_dump_file = 0x0
        main_loop_err = 0x0
        err = 0x0
        list_data_dirs = <optimized out>
        bdo_queue = {sqh_first = 0x0, sqh_last = 0x3fffc98e96c8}
        __func__ = "main"
        __FUNCTION__ = "main"
(gdb)

Comment 2 yilzhang 2017-10-25 09:32:29 UTC
1. If data-plane is not used, live migration succeeds and quitting qemu on the destination side does not abort either; that is, everything works well if I don't use data-plane in my command line

2. If only the system disk has data-plane enabled (that is, there is no data disk), everything works well too.

Comment 3 Karen Noel 2017-10-25 13:22:10 UTC
Does this reproduce on x86?

Comment 5 yilzhang 2017-10-27 08:37:54 UTC
Will try it soon, please stay tuned.

Comment 6 yilzhang 2017-10-30 06:32:59 UTC
X86 also has this bug:
Log on destination side:
(qemu) info status
VM status: paused (inmigrate)
(qemu) q
qemu-kvm: /builddir/build/BUILD/qemu-2.10.0/hw/virtio/virtio.c:212: vring_get_region_caches: Assertion `caches != ((void *)0)' failed.
des_bug1506151.sh: line 22: 12150 Aborted                 (core dumped) /usr/libexec/qemu-kvm -smp 8,sockets=2,cores=4,threads=1 -m 8192 -serial unix:/tmp/dp-serial.log,server,nowait -nodefaults -rtc base=localtime,clock=host -boot menu=on -monitor stdio -monitor unix:/tmp/monitor1,server,nowait -qmp tcp:0:777,server,nowait -device pci-bridge,id=bridge1,chassis_nr=1,bus=pci.0 -object iothread,id=iothread0 -device virtio-scsi-pci,bus=bridge1,addr=0x1f,id=scsi0,iothread=iothread0 -drive file=rhel7.5.qcow2,media=disk,if=none,cache=none,id=drive_sysdisk,aio=native,format=qcow2,werror=stop,rerror=stop -device scsi-hd,drive=drive_sysdisk,bus=scsi0.0,id=sysdisk,bootindex=0 -drive file=/root/test/DISK-image-for-migration.raw,if=none,cache=none,id=drive_ddisk_2,aio=native,format=raw,werror=stop,rerror=stop -device scsi-hd,drive=drive_ddisk_2,bus=scsi0.0,id=ddisk_2 -netdev tap,id=net0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown,vhost=on -device virtio-net-pci,netdev=net0,id=nic0,mac=52:54:00:c3:e7:8a,bus=bridge1,addr=0x1e -incoming tcp:0:1234



Host kernel:   3.10.0-747.el7.x86_64
qemu-kvm-rhev: qemu-kvm-rhev-2.10.0-3.el7
Guest kernel:  3.10.0-747.el7.x86_64

Comment 7 xianwang 2017-11-01 09:24:45 UTC
a) According to my test results, this bug reproduces (migration hangs and cannot complete) only when there are two or more scsi disks with data-plane, whether the two scsi disks are connected to one scsi controller or to two separate controllers;

b) On the other hand, migration completes when there is only one disk with data-plane, whether it is a scsi disk or a blk disk;

c) What's more, migration also completes when there is one scsi disk plus one or two blk disks with data-plane.


I just tried this scenario with local migration:
version:
kernel-3.10.0-760.el7.ppc64le
qemu-kvm-rhev-2.10.0-3.el7.ppc64le
SLOF-20170724-2.git89f519f.el7.noarch

qemu cli:
# /usr/libexec/qemu-kvm -object iothread,id=iothread0 -device virtio-scsi-pci,bus=pci.0,addr=0x1f,id=scsi0,iothread=iothread0 -drive file=/home/rhel75.qcow2,media=disk,if=none,cache=none,id=drive_sysdisk,aio=native,format=qcow2,werror=stop,rerror=stop -device scsi-hd,drive=drive_sysdisk,bus=scsi0.0,id=sysdisk,bootindex=0 -drive file=/home/r1.qcow2,if=none,cache=none,id=drive_ddisk_2,aio=native,format=qcow2,werror=stop,rerror=stop -device scsi-hd,drive=drive_ddisk_2,bus=scsi0.0,id=ddisk_2 -monitor stdio -vga std -vnc :1
QEMU 2.9.0 monitor - type 'help' for more information
(qemu) migrate_set_downtime 10
(qemu) migrate_set_speed 1G
(qemu) info status 
VM status: running
(qemu) migrate -d tcp:127.0.0.1:5801
(qemu) Killed

The qemu process hangs and the migration cannot finish.

Comment 9 Dr. David Alan Gilbert 2017-11-03 09:19:26 UTC
It feels like there are perhaps two separate bugs here:
   a) Why the migration hangs
   b) The destination failing when you quit

What state is the source in when it hangs?
   c) Does the source monitor respond?
   c1) If so what does    info migrate   and   info status    say?
   d) If the source monitor does not respond then please use gdb to get a
      thread apply all bt full  (a minimal invocation is sketched below)
   e) In comment 7 xianwang shows a 'Killed' - where from? Did you kill the qemu or did that happen by itself?
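
A minimal sketch of how the 'thread apply all bt full' requested in d) is typically collected from the hung source qemu (the pgrep filter and the output file name are placeholders, not taken from this report):

# gdb -p $(pgrep -f qemu-kvm | head -n1) -batch -ex 'set pagination off' -ex 'thread apply all bt full' > qemu-src-bt-full.txt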

Comment 10 yilzhang 2017-11-06 06:15:11 UTC
Created attachment 1348417 [details]
GDB-bt__for-d)__inComment9

Comment 11 yilzhang 2017-11-06 06:18:07 UTC
Hi David,

c) The source monitor doesn't respond
d) Please check the gdb backtrace in the attachment named GDB-bt__for-d)__inComment9
e) In comment 7 xianwang shows a 'Killed' - she killed the qemu process herself

Comment 12 Stefan Hajnoczi 2017-11-06 14:41:26 UTC
I agree with David Gilbert, there are two separate bugs.

1. The migration thread hangs in the source QEMU in qemu_savevm_state_complete_precopy() -> bdrv_inactivate_all() -> qcow2_inactivate() -> qcow2_cache_flush() -> bdrv_flush().

This happens because bdrv_inactivate_all() acquires each BlockDriverState's AioContext.  When the guest is launched with 2 disks in the same IOThread, the IOThread's AioContext is acquired twice.

bdrv_flush() hangs in BDRV_POLL_WHILE(bs, flush_co.ret == NOT_DONE) because the IOThread's AioContext is only released once but the migration thread acquired it twice.  Therefore no progress is made and the source QEMU hangs.
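
As an illustration of the locking shape described above, here is a minimal, self-contained sketch in plain C with pthreads -- explicitly not QEMU code, all names are placeholders: a recursive lock taken twice but released only once stays held, so the thread that needs it to signal completion never runs and the polling loop spins forever, the same shape as the migration thread stuck in BDRV_POLL_WHILE() while the IOThread cannot complete the flush.

/* Hedged, self-contained analogue of the double-acquire hang; not QEMU code. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t ctx;        /* stands in for the IOThread's AioContext */
static volatile int flush_done;    /* stands in for flush_co.ret != NOT_DONE */

static void *iothread_fn(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&ctx);      /* blocks forever: main still holds one level */
    flush_done = 1;
    pthread_mutex_unlock(&ctx);
    return NULL;
}

int main(void)
{
    pthread_mutexattr_t attr;
    pthread_t t;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(&ctx, &attr);

    pthread_mutex_lock(&ctx);      /* disk 1: first acquire of the shared context */
    pthread_mutex_lock(&ctx);      /* disk 2, same IOThread: recursive second acquire */

    pthread_create(&t, NULL, iothread_fn, NULL);

    pthread_mutex_unlock(&ctx);    /* only one release: the lock is still held */

    while (!flush_done) {          /* the "BDRV_POLL_WHILE": never terminates */
        printf("still waiting for flush to complete...\n");
        sleep(1);
    }
    pthread_join(t, NULL);
    return 0;
}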

2. The virtio-net device has loaded device state on the destination but the guest hasn't resumed yet.  When the 'quit' command is processed, virtio_net_device_unrealize() -> virtio_net_set_status() attempts to access the vring but the memory region cache is not initialized.  I haven't been able to reproduce this locally with qemu.git/master and I don't see how this can happen in the source code.
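
For reference, the assertion that fires on the destination is the NULL check in vring_get_region_caches(), paraphrased below from the hw/virtio/virtio.c:212 location shown in the backtrace (an excerpt for orientation, not a standalone program):

static VRingMemoryRegionCaches *vring_get_region_caches(struct VirtQueue *vq)
{
    VRingMemoryRegionCaches *caches = atomic_rcu_read(&vq->vring.caches);

    /* The "Assertion `caches != ((void *)0)' failed" from the logs above fires
     * here: virtio_net_set_status() reaches this path during 'quit' while the
     * per-queue memory region caches are still NULL (not initialized). */
    assert(caches != NULL);
    return caches;
}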

Comment 14 Paolo Bonzini 2017-11-16 13:48:49 UTC
I'm working on the first part.

Comment 17 yilzhang 2017-11-17 07:01:39 UTC
Hi Jason,
The backtrace is the same as comment #0.

Comment 20 yilzhang 2017-11-24 09:24:25 UTC
Using the patch in Comment 18, I tried 5 times on Power8; the result is:
Migration still cannot complete (migration hangs), but quitting the qemu-kvm process on the destination side no longer crashes.

Src Host:
kernel:  3.10.0-797.el7.ppc64le
qemu-kvm-rhev-2.10.0-6.el7.root201711221748

Des Host:
kernel:  3.10.0-768.el7.ppc64le
qemu-kvm-rhev-2.10.0-6.el7.root201711221748

Comment 22 Miroslav Rezanina 2017-11-30 16:54:33 UTC
Fix included in qemu-kvm-rhev-2.10.0-10.el7

Comment 24 xianwang 2017-12-05 04:36:53 UTC
I think this bug is not fixed in qemu-kvm-rhev-2.10.0-10.el7. I have re-tested this scenario on both x86 and ppc with qemu-kvm-rhev-2.10.0-10.el7, but the result is the same as comment 7. Test information is as follows:
version:
x86:
3.10.0-792.el7.x86_64
qemu-kvm-rhev-2.10.0-10.el7.x86_64
seabios-bin-1.11.0-1.el7.noarch

qemu cli:
# /usr/libexec/qemu-kvm -object iothread,id=iothread0 -device virtio-scsi-pci,bus=pci.0,addr=0x1f,id=scsi0,iothread=iothread0 -drive file=/home/xianwang/rhel75.qcow2,media=disk,if=none,cache=none,id=drive_sysdisk,aio=native,format=qcow2,werror=stop,rerror=stop -device scsi-hd,drive=drive_sysdisk,bus=scsi0.0,id=sysdisk,bootindex=0 -drive file=/home/xianwang/r1.qcow2,if=none,cache=none,id=drive_ddisk_2,aio=native,format=qcow2,werror=stop,rerror=stop -device scsi-hd,drive=drive_ddisk_2,bus=scsi0.0,id=ddisk_2 -monitor stdio -vga std -vnc :1 -m 4096

src:
QEMU 2.9.0 monitor - type 'help' for more information
(qemu) migrate -d tcp:10.66.10.208:5801
(qemu) info migrate
Migration status: active
........
The qemu process hangs and the migration cannot finish.

dst:
(qemu) info status 
VM status: paused (inmigrate)

The desktop is displayed over VNC on the destination host, but the VM is hung and its status is paused (inmigrate).

ppc:
3.10.0-768.el7.ppc64le
qemu-kvm-rhev-2.10.0-10.el7.ppc64le
SLOF-20170724-2.git89f519f.el7.noarch

Steps and results are the same as on x86.

So this bug is not fixed in qemu-kvm-rhev-2.10.0-10.el7.

Comment 25 jason wang 2017-12-05 06:59:43 UTC
Have you read the comments carefully? There are in fact two bugs, and this fix is for the crash, not the hang; you need to open another bug to track the hang.

Thanks

Comment 26 xianwang 2017-12-05 08:27:50 UTC
(In reply to jason wang from comment #25)
> Have you read the comments carefully? There are in fact two bugs, and this
> fix is for the crash, not the hang; you need to open another bug to track
> the hang.
> 
> Thanks

Sorry, I missed comment 9. Now, on the destination, there is no core dump after quitting qemu, i.e. this bug is fixed, and I will file another bug to track the "hang" issue.

ppc:
3.10.0-768.el7.ppc64le
qemu-kvm-rhev-2.10.0-10.el7.ppc64le
SLOF-20170724-2.git89f519f.el7.noarch

# /usr/libexec/qemu-kvm -nodefaults -object iothread,id=iothread0 -device virtio-scsi-pci,bus=pci.0,addr=0x1f,id=scsi0,iothread=iothread0 -drive file=/home/rhel75.qcow2,media=disk,if=none,cache=none,id=drive_sysdisk,aio=native,format=qcow2,werror=stop,rerror=stop -device scsi-hd,drive=drive_sysdisk,bus=scsi0.0,id=sysdisk,bootindex=0 -drive file=/home/r1.qcow2,if=none,cache=none,id=drive_ddisk_2,aio=native,format=qcow2,werror=stop,rerror=stop -device scsi-hd,drive=drive_ddisk_2,bus=scsi0.0,id=ddisk_2 -monitor stdio -vga std -vnc :1 -m 4096

src:
QEMU 2.9.0 monitor - type 'help' for more information
(qemu) migrate -d tcp:10.66.10.208:5801
(qemu) info migrate
Migration status: active
........
The qemu process hangs and the migration cannot finish.

dst:
(qemu) info status 
VM status: paused (inmigrate)
(qemu) q

there is no core dump.

Comment 27 Dr. David Alan Gilbert 2017-12-05 09:48:13 UTC
+xianwang Please add the bz number of the new bz for the hang here.

Comment 28 Dr. David Alan Gilbert 2017-12-06 14:40:47 UTC
It looks like the bz for the hang was created as:
https://bugzilla.redhat.com/show_bug.cgi?id=1520824

Paolo:  In c14 you say you were working on the double locking causing the hang; did you end up with a fix for that?

Comment 29 Paolo Bonzini 2017-12-15 13:59:35 UTC
David,

I passed that patch to Stefan who has posted it upstream. Either I or you can take care of the backport.

Comment 30 Dr. David Alan Gilbert 2017-12-15 14:52:29 UTC
(In reply to Paolo Bonzini from comment #29)
> David,
> 
> I passed that patch to Stefan who has posted it upstream. Either I or you
> can take care of the backport.

Yep, I'm tracking it on bz 1520824 - I don't think it's been merged yet.

Comment 31 Dr. David Alan Gilbert 2017-12-20 12:46:25 UTC
I've posted the backport for 1520824.

Comment 33 errata-xmlrpc 2018-04-11 00:44:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:1104