Bug 1506151
Summary: [data-plane] Quitting qemu in destination side encounters "core dumped" when doing live migration

| Field | Value | Field | Value |
|---|---|---|---|
| Product | Red Hat Enterprise Linux 7 | Reporter | yilzhang |
| Component | qemu-kvm-rhev | Assignee | jason wang <jasowang> |
| Status | CLOSED ERRATA | QA Contact | xianwang <xianwang> |
| Severity | high | Priority | high |
| Version | 7.5 | CC | ailan, aliang, chayang, coli, dgilbert, juzhang, knoel, lmiksik, michen, pbonzini, qzhang, stefanha, virt-maint, xianwang, yilzhang |
| Target Milestone | rc | Target Release | --- |
| Hardware | All | OS | Linux |
| Fixed In Version | qemu-kvm-rhev-2.10.0-10.el7 | Doc Type | If docs needed, set a value |
| Last Closed | 2018-04-11 00:44:15 UTC | Type | Bug |
| Regression | --- | | |
Description (yilzhang, 2017-10-25 09:17:13 UTC)
1. If data-plane is not used, live migration succeeds and quitting qemu on the destination side does not abort either; that is, everything works if data-plane is absent from my command line.
2. If only the system disk has data-plane enabled (that is, there is no data disk), everything works as well.

Does this reproduce on x86?

Will try it soon, please stay tuned.

x86 also has this bug. Log on the destination side:

```
(qemu) info status
VM status: paused (inmigrate)
(qemu) q
qemu-kvm: /builddir/build/BUILD/qemu-2.10.0/hw/virtio/virtio.c:212: vring_get_region_caches: Assertion `caches != ((void *)0)' failed.
des_bug1506151.sh: line 22: 12150 Aborted (core dumped) /usr/libexec/qemu-kvm -smp 8,sockets=2,cores=4,threads=1 -m 8192 -serial unix:/tmp/dp-serial.log,server,nowait -nodefaults -rtc base=localtime,clock=host -boot menu=on -monitor stdio -monitor unix:/tmp/monitor1,server,nowait -qmp tcp:0:777,server,nowait -device pci-bridge,id=bridge1,chassis_nr=1,bus=pci.0 -object iothread,id=iothread0 -device virtio-scsi-pci,bus=bridge1,addr=0x1f,id=scsi0,iothread=iothread0 -drive file=rhel7.5.qcow2,media=disk,if=none,cache=none,id=drive_sysdisk,aio=native,format=qcow2,werror=stop,rerror=stop -device scsi-hd,drive=drive_sysdisk,bus=scsi0.0,id=sysdisk,bootindex=0 -drive file=/root/test/DISK-image-for-migration.raw,if=none,cache=none,id=drive_ddisk_2,aio=native,format=raw,werror=stop,rerror=stop -device scsi-hd,drive=drive_ddisk_2,bus=scsi0.0,id=ddisk_2 -netdev tap,id=net0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown,vhost=on -device virtio-net-pci,netdev=net0,id=nic0,mac=52:54:00:c3:e7:8a,bus=bridge1,addr=0x1e -incoming tcp:0:1234
```

Host kernel: 3.10.0-747.el7.x86_64
qemu-kvm-rhev: qemu-kvm-rhev-2.10.0-3.el7
Guest kernel: 3.10.0-747.el7.x86_64

a) According to my test results, this bug reproduces (migration hangs and cannot complete) only when there are two or more scsi disks with data-plane, whether the two scsi disks are connected to one scsi controller or to two separate controllers;
b) on the other hand, migration can complete when there is only one disk with data-plane, whether it is a scsi disk or a blk disk;
c) what's more, migration can also complete when there is one scsi disk and one or two blk disks with data-plane.

I just tried this scenario for local migration.

version:
kernel-3.10.0-760.el7.ppc64le
qemu-kvm-rhev-2.10.0-3.el7.ppc64le
SLOF-20170724-2.git89f519f.el7.noarch

qemu cli:

```
# /usr/libexec/qemu-kvm -object iothread,id=iothread0 -device virtio-scsi-pci,bus=pci.0,addr=0x1f,id=scsi0,iothread=iothread0 -drive file=/home/rhel75.qcow2,media=disk,if=none,cache=none,id=drive_sysdisk,aio=native,format=qcow2,werror=stop,rerror=stop -device scsi-hd,drive=drive_sysdisk,bus=scsi0.0,id=sysdisk,bootindex=0 -drive file=/home/r1.qcow2,if=none,cache=none,id=drive_ddisk_2,aio=native,format=qcow2,werror=stop,rerror=stop -device scsi-hd,drive=drive_ddisk_2,bus=scsi0.0,id=ddisk_2 -monitor stdio -vga std -vnc :1
QEMU 2.9.0 monitor - type 'help' for more information
(qemu) migrate_set_downtime 10
(qemu) migrate_set_speed 1G
(qemu) info status
VM status: running
(qemu) migrate -d tcp:127.0.0.1:5801
(qemu) Killed
```

The qemu process hangs and the migration cannot finish.

It feels like there are perhaps two separate bugs here:
a) why the migration hangs;
b) the destination failing when you quit.
What state is the source in when it hangs?
c) Does the source monitor respond?
c1) If so, what do info migrate and info status say?
d) If the source monitor does not respond, please use gdb to get a "thread apply all bt full".
e) In comment 7 xianwang shows a 'Killed' - where from? Did you kill the qemu, or did that happen by itself?

Created attachment 1348417 [details]
GDB-bt__for-d)__inComment9
Hi David,
c) The source monitor doesn't respond.
d) Please check the gdb backtrace in the attachment named GDB-bt__for-d)__inComment9.
e) In comment 7 xianwang shows a 'Killed' - she killed the qemu.

I agree with David Gilbert, there are two separate bugs.

1. The migration thread hangs in the source QEMU in qemu_savevm_state_complete_precopy() -> bdrv_inactivate_all() -> qcow2_inactivate() -> qcow2_cache_flush() -> bdrv_flush(). This happens because bdrv_inactivate_all() acquires each BlockDriverState's AioContext. When the guest is launched with 2 disks in the same IOThread, the IOThread's AioContext is acquired twice. bdrv_flush() hangs in BDRV_POLL_WHILE(bs, flush_co.ret == NOT_DONE) because the IOThread's AioContext is only released once while the migration thread acquired it twice. Therefore no progress is made and the source QEMU hangs.

2. The virtio-net device has loaded device state on the destination but the guest hasn't resumed yet. When the 'quit' command is processed, virtio_net_device_unrealize() -> virtio_net_set_status() attempts to access the vring, but the memory region cache has not been initialized. I haven't been able to reproduce this locally with qemu.git/master and I don't see how this can happen in the source code.

I'm working on the first part.

Hi Jason, the backtrace is the same as comment #0.

Using the patch in Comment 18, I tried 5 times on Power8. The result: migration still cannot complete (migration hangs), but quitting the qemu-kvm process on the destination side no longer crashes.
Src Host: kernel 3.10.0-797.el7.ppc64le, qemu-kvm-rhev-2.10.0-6.el7.root201711221748
Des Host: kernel 3.10.0-768.el7.ppc64le, qemu-kvm-rhev-2.10.0-6.el7.root201711221748

Fix included in qemu-kvm-rhev-2.10.0-10.el7

I think this bug is not fixed in qemu-kvm-rhev-2.10.0-10.el7. I have re-tested this scenario on both x86 and ppc with qemu-kvm-rhev-2.10.0-10.el7, but the result is the same as comment 7. Test information follows.

version (x86):
3.10.0-792.el7.x86_64
qemu-kvm-rhev-2.10.0-10.el7.x86_64
seabios-bin-1.11.0-1.el7.noarch

qemu cli:

```
# /usr/libexec/qemu-kvm -object iothread,id=iothread0 -device virtio-scsi-pci,bus=pci.0,addr=0x1f,id=scsi0,iothread=iothread0 -drive file=/home/xianwang/rhel75.qcow2,media=disk,if=none,cache=none,id=drive_sysdisk,aio=native,format=qcow2,werror=stop,rerror=stop -device scsi-hd,drive=drive_sysdisk,bus=scsi0.0,id=sysdisk,bootindex=0 -drive file=/home/xianwang/r1.qcow2,if=none,cache=none,id=drive_ddisk_2,aio=native,format=qcow2,werror=stop,rerror=stop -device scsi-hd,drive=drive_ddisk_2,bus=scsi0.0,id=ddisk_2 -monitor stdio -vga std -vnc :1 -m 4096
```

src:

```
QEMU 2.9.0 monitor - type 'help' for more information
(qemu) migrate -d tcp:10.66.10.208:5801
(qemu) info migrate
Migration status: active
........
```

The qemu process hangs and the migration cannot finish.

dst:

```
(qemu) info status
VM status: paused (inmigrate)
```

The desktop is displayed over vnc on the destination host, but the vm hangs and its status is paused (inmigrate).

ppc:
3.10.0-768.el7.ppc64le
qemu-kvm-rhev-2.10.0-10.el7.ppc64le
SLOF-20170724-2.git89f519f.el7.noarch
Steps and results are the same as on x86.

So, this bug is not fixed in qemu-kvm-rhev-2.10.0-10.el7.

Have you read the comments carefully? There were in fact two bugs, and the fix is for the crash, not the hang; you need to open another bug to track the hang. Thanks

(In reply to jason wang from comment #25)
> Have you read the comments carefully? There were in fact two bugs, and the
> fix is for the crash, not the hang; you need to open another bug to track
> the hang.
>
> Thanks

Sorry, I missed comment 9. Now, on the destination, there is no core dump after quitting qemu, i.e., this bug is fixed, and I will file another bug to track the "hang" issue.

ppc:
3.10.0-768.el7.ppc64le
qemu-kvm-rhev-2.10.0-10.el7.ppc64le
SLOF-20170724-2.git89f519f.el7.noarch

```
# /usr/libexec/qemu-kvm -nodefaults -object iothread,id=iothread0 -device virtio-scsi-pci,bus=pci.0,addr=0x1f,id=scsi0,iothread=iothread0 -drive file=/home/rhel75.qcow2,media=disk,if=none,cache=none,id=drive_sysdisk,aio=native,format=qcow2,werror=stop,rerror=stop -device scsi-hd,drive=drive_sysdisk,bus=scsi0.0,id=sysdisk,bootindex=0 -drive file=/home/r1.qcow2,if=none,cache=none,id=drive_ddisk_2,aio=native,format=qcow2,werror=stop,rerror=stop -device scsi-hd,drive=drive_ddisk_2,bus=scsi0.0,id=ddisk_2 -monitor stdio -vga std -vnc :1 -m 4096
```

src:

```
QEMU 2.9.0 monitor - type 'help' for more information
(qemu) migrate -d tcp:10.66.10.208:5801
(qemu) info migrate
Migration status: active
........
```

The qemu process hangs and the migration cannot finish.

dst:

```
(qemu) info status
VM status: paused (inmigrate)
(qemu) q
```

There is no core dump.

+xianwang Please add the bz number of the new bz for the hang here.

It looks like the bz for the hang was created as: https://bugzilla.redhat.com/show_bug.cgi?id=1520824

Paolo: in c14 you say you were working on the double locking causing the hang; did you end up with a fix for that?

David, I passed that patch to Stefan, who has posted it upstream. Either I or you can take care of the backport.

(In reply to Paolo Bonzini from comment #29)
> David,
>
> I passed that patch to Stefan, who has posted it upstream. Either I or you
> can take care of the backport.

Yep, I'm tracking it on bz 1520824 - I don't think it's been merged yet.

I've posted the backport for 1520824.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:1104