Bug 1652572
Summary: QEMU core dumped if stop nfs service during migration

Product: Red Hat Enterprise Linux Advanced Virtualization
Component: qemu-kvm
Version: 8.0
Target Release: 8.0
Target Milestone: rc
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Yumei Huang <yuhuang>
Assignee: Hanna Czenczek <hreitz>
QA Contact: Li Xiaohui <xiaohli>
CC: chayang, coli, ddepaula, dgilbert, hreitz, jen, jinzhao, juzhang, knoel, kwolf, mtessun, qzhang, rbalakri, virt-maint, xianwang
Fixed In Version: qemu-kvm-3.1.0-21.module+el8.0.1+3009+b48fff88
Type: Bug
Last Closed: 2019-08-07 10:41:09 UTC
Description (Yumei Huang, 2018-11-22 11:59:04 UTC)
The block code shouldn't assert here; in the end this path comes from the error_abort in blk_root_inactivate. Discussions with kwolf suggest the block code really shouldn't be getting this upset about unlock failures.

Hi, Dave,
I am curious why powerpc doesn't have this issue; I have tried several times and don't hit it.

Host:
4.18.0-40.el8.ppc64le
qemu-kvm-3.0.0-2.module+el8+2208+e41b12e0.ppc64le
SLOF-20171214-4.gitfa98132.module+el8+2179+85112f94.noarch

Guest:
4.18.0-40.el8.ppc64le

Steps (3 hosts: src host + dst host + NFS server; mount type: soft,vers=4):
1. Boot guest on src host.
2. Boot guest with -incoming on dst host.
3. Run dd in guest:
   while true; do dd if=/dev/zero of=file1 bs=1M count=10; done
4. Migrate guest from src to dst host.
5. During migration, on src host, cut off the connection with the NFS server:
   # iptables -I INPUT -s $nfs_server_ip -j DROP

Result:
Both src and dst qemu hang and the guest hangs, but the guest is migrated to dst (checked via the VNC display). After a while, resuming NFS on the src host (iptables -F), src qemu is fine and the migration completes; the VM works well on the dst host and "reboot" also works.
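The reproduction relies on a soft NFS mount, so blocked I/O eventually errors out instead of hanging forever. As a rough back-of-the-envelope sketch (assuming the nfs(5) model where `timeo` is in tenths of a second and a soft mount gives up after `retrans` retransmissions, i.e. `retrans + 1` attempts; `estimate_soft_timeout` is an illustrative helper, not kernel or qemu code, and the kernel's actual retry backoff may differ):

```python
def estimate_soft_timeout(timeo_deciseconds: int, retrans: int) -> float:
    """Rough seconds until a soft NFS mount returns EIO for a request.

    Assumes each of the retrans+1 attempts waits the full timeo before
    giving up (simplification of the nfs(5) major-timeout model).
    """
    per_attempt = timeo_deciseconds / 10.0  # timeo is in tenths of a second
    return per_attempt * (retrans + 1)

# With the options seen in the reproductions here (timeo=600, retrans=2),
# a soft mount would fail I/O after roughly three minutes:
print(estimate_soft_timeout(600, 2))
```

This is consistent with the observation that the hang lasts "some minutes" before qemu reacts.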
src:
(qemu) info migrate
Migration status: completed
(qemu) info status
VM status: paused (postmigrate)

dst:
(qemu) info status
VM status: running

qemu cli:
/usr/libexec/qemu-kvm \
    -name "mouse-vm" \
    -sandbox off \
    -machine pseries \
    -nodefaults \
    -vga std \
    -chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/monitor-qmpmonitor1,server,nowait \
    -chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/monitor-catch_monitor,server,nowait \
    -mon chardev=qmp_id_qmpmonitor1,mode=control \
    -mon chardev=qmp_id_catch_monitor,mode=control \
    -device nec-usb-xhci,id=usb1,bus=pci.0,addr=0x3 \
    -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=0x4 \
    -device scsi-hd,id=image1,drive=drive_image1,bus=virtio_scsi_pci0.0,channel=0,scsi-id=0,lun=0,bootindex=0 \
    -device virtio-net-pci,mac=9a:8a:8b:8c:8d:8e,id=net0,vectors=4,netdev=tap0,bus=pci.0,addr=0x5 \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
    -drive id=drive_image1,if=none,snapshot=off,aio=threads,cache=none,rerror=stop,werror=stop,format=qcow2,file=/home/mount_point/rhel8.0-20181005.1-p8.qcow2 \
    -netdev tap,id=tap0,vhost=on \
    -m 4096 \
    -smp 2,maxcpus=4,cores=2,threads=2,sockets=1 \
    -vnc :10 \
    -rtc base=utc,clock=host \
    -boot menu=off,strict=off,order=cdn,once=c \
    -enable-kvm \
    -device usb-kbd,id=kbd1,bus=usb1.0,port=2 \
    -device usb-mouse,id=mouse1,bus=usb1.0,port=3 \
    -device usb-tablet,id=tablet1,bus=usb1.0,port=4 \
    -qmp tcp:0:3333,server,nowait \
    -serial tcp:0:4444,server,nowait \
    -monitor stdio

Tried with qemu-kvm-2.12.0-42.module+el8+2173+537e5cb5; reproduced the issue after about 20 minutes.

mount option:
xxx:/home/yuhuang on /home/yuhuang type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,soft,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=xxx,local_lock=none,addr=xxx)

(In reply to xianwang from comment #2)
> Hi, Dave,
> I am curious why powerpc don't has this issue, I have tried several times
> and don't hit this issue.
> [...]

I have tried again and reproduced it. This time:
step 5: on the NFS server, stop the nfs service:
# systemctl stop nfs-server
The other steps are the same as above.

Result: after stopping nfs-server and waiting some minutes:

src:
(qemu) Unexpected error in raw_apply_lock_bytes() at block/file-posix.c:696:
qemu-kvm: Failed to lock byte 100
debug.sh: line 30: 68990 Aborted (core dumped) /usr/libexec/qemu-kvm......

dst: qemu quit with the following message:
(qemu) qemu-kvm: Failed to load virtio_pci/modern_queue_state:used
qemu-kvm: Failed to load virtio_pci/modern_state:vqs
qemu-kvm: Failed to load virtio/extra_state:extra_state
qemu-kvm: Failed to load virtio-net:virtio
qemu-kvm: error while loading state for instance 0x0 of device 'pci@800000020000000:05.0/virtio-net'
qemu-kvm: load of migration failed: Input/output error

I seem to remember that Max was looking into handling unlocking errors more gracefully.

I haven't so far. I can only agree with Dave (and you) that the code shouldn't be that upset about it. But I suppose I can look into it, if that was part of the request.

Upstream, Kevin has pointed me to the fact that 2996ffad3acabe890fbb4f84a069cdc325a68108 should have fixed this issue. This has been backported to RHV 7 as part of BZ 1551486; the RHEL 8.1 BZ is BZ 1694148. RHEL 8.0.1 contains that patch already.

Max

The reproducer (2) in BZ 1694148 comment 2 still breaks, so 696aaaed579ac5bf5fa336216909b46d3d8f07a8 needs to be backported.

Max

Setting ITR=8.0.1 as this patch is on queue. We need pm_ack+; can you please grant it, Martin?
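The "Failed to lock byte 100" abort comes from qemu's image locking: block/file-posix.c takes byte-range locks on the image file to advertise how the image is in use, and on an unreachable NFS mount those lock syscalls fail. As a minimal illustrative sketch of byte-range locking (not qemu's actual C code; `apply_lock_byte` is a hypothetical helper using Python's fcntl wrapper):

```python
import fcntl
import tempfile

def apply_lock_byte(fd: int, offset: int, exclusive: bool) -> bool:
    """Try to take a 1-byte lock at `offset`; return True on success."""
    try:
        op = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
        fcntl.lockf(fd, op, 1, offset)  # lock 1 byte starting at `offset`
        return True
    except OSError:
        # On a dead NFS mount this is the path that fires: the lock
        # operation fails, and raw_apply_lock_bytes() used to turn that
        # failure into an error_abort, i.e. the core dump seen above.
        return False

# Demo on a local temp file, where the lock always succeeds:
img = tempfile.NamedTemporaryFile()
img.truncate(4096)
locked = apply_lock_byte(img.fileno(), 100, exclusive=True)
print("locked byte 100:", locked)
```

The upstream fixes referenced in this BZ make qemu report such lock/unlock failures as ordinary errors rather than aborting.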
Fix included in qemu-kvm-3.1.0-21.module+el8.0.1+3009+b48fff88

Hi all,
Using kernel-4.18.0-40.el8.x86_64 & qemu-img-3.0.0-2.module+el8+2208+e41b12e0.x86_64, I reproduced this bz after waiting about 10 minutes after stopping the nfs service on the nfs-server.

I verified the bz with kernel-4.18.0-80.el8.x86_64 & qemu-img-3.1.0-21.module+el8.0.1+3009+b48fff88.x86_64, but the guest restarts and core dumps on the dst host after the nfs service restarts and migration finishes.

1. Set up the nfs server and soft-mount rhel8.0.1 on the src host and dst host:
(src host)# mount -o soft $nfs_server_ip:/home/nfs /mnt
(dst host)# mount -o soft $nfs_server_ip:/home/nfs /mnt

2. Boot guest on src host:
/usr/libexec/qemu-kvm \
    -machine q35 \
    -m 8G \
    -smp 8 \
    -cpu 'EPYC' \
    -device pcie-root-port,id=pcie.0-root-port-2,slot=2,chassis=2,addr=0x2,bus=pcie.0 \
    -device pcie-root-port,id=pcie.0-root-port-3,slot=3,chassis=3,addr=0x3,bus=pcie.0 \
    -device pcie-root-port,id=pcie.0-root-port-4,slot=4,chassis=4,addr=0x4,bus=pcie.0 \
    -object secret,id=sec0,data=redhat \
    -blockdev node-name=back_image,driver=file,cache.direct=on,cache.no-flush=off,filename=/mnt/rhel-image/rhel8-0-1.luks \
    -blockdev node-name=drive-virtio-disk0,driver=luks,cache.direct=off,cache.no-flush=on,file=back_image,key-secret=sec0 \
    -device virtio-blk-pci,drive=drive-virtio-disk0,id=disk0,bus=pcie.0-root-port-2 \
    -device virtio-net-pci,mac=d0:67:26:cc:07:1c,id=idLnLWR0,vectors=4,netdev=idINi0TE,bus=pcie.0-root-port-3,addr=0x0 \
    -netdev tap,id=idINi0TE,vhost=on \
    -vnc :0 \
    -device VGA \
    -monitor stdio \
    -qmp tcp:0:1234,server,nowait

3. Boot guest with "-incoming ..."
on dst host; command like step 2, plus:
    -incoming tcp:0:4567,server,nowait

4. Run "top" in guest:
(guest)# top

5. Migrate guest from src host to dst host:
{"execute": "migrate","arguments":{"uri": "tcp:$dst_host_ip:4567"}}
{"return": {}}
{"execute":"query-migrate"}
{"return": {"expected-downtime": 300, "status": "active", "setup-time": 35, "total-time": 833, "ram": {"total": 8607571968, "postcopy-requests": 0, "dirty-sync-count": 1, "multifd-bytes": 0, "page-size": 4096, "remaining": 7773724672, "mbps": 268.5672, "transferred": 30214708, "duplicate": 196645, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 28389376, "normal": 6931}}}

6. During migration, stop the nfs service on the nfs server:
(nfs server)# service nfs stop

7. After step 6, wait > 20 minutes. The guest only hangs on the src host, and there is no qemu core dump or error prompt in qemu (src or dst) as in comment 0:
(src host qemu) only hangs
(src host qmp) continuously prints messages like the following:
{"timestamp": {"seconds": 1555406362, "microseconds": 363739}, "event": "BLOCK_IO_ERROR", "data": {"device": "", "nospace": false, "node-name": "drive-virtio-disk0", "reason": "Input/output error", "operation": "write", "action": "report"}}
(dst host qemu) info status
VM status: paused (inmigrate)

8. Restart the nfs service on the nfs server; migration continues and succeeds. Check the guest on the dst host:
(nfs server)# service nfs start
After several minutes, the guest migration finishes on the dst host:
(dst host qemu) info status
VM status: running
But checking the guest, I find it restarted and core dumped, as in picture 1.

In step 8, I think the expected result is that the guest should continue to run normally on the dst host and not reboot, so this bz may need to be fixed again.

Best wishes,
Li Xiaohui

Created attachment 1555456 [details]
picture 1
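The QMP exchange in step 5 of the verification above can be scripted. A small illustrative sketch follows (the helper names `build_migrate_cmd` and `migration_status` are hypothetical; the JSON shapes match the messages captured above, with the query-migrate reply abridged):

```python
import json

def build_migrate_cmd(dst_host_ip: str, port: int) -> str:
    # Same shape as the "migrate" command issued in step 5.
    return json.dumps({"execute": "migrate",
                       "arguments": {"uri": f"tcp:{dst_host_ip}:{port}"}})

def migration_status(query_migrate_reply: str) -> str:
    # Pull the status field out of a query-migrate reply.
    return json.loads(query_migrate_reply)["return"]["status"]

# Abridged version of the reply captured during the verification run:
reply = '{"return": {"status": "active", "setup-time": 35, "total-time": 833}}'
print(migration_status(reply))  # prints: active
```

A real client would send these strings over the `-qmp tcp:0:1234,server,nowait` socket after the usual `qmp_capabilities` handshake.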
I think that’s a different bug, then. I’ll investigate, but the fact that qemu itself no longer crashes is sufficient for this BZ, I think.

Max

(In reply to Max Reitz from comment #19)
> I think that’s a different bug, then. I’ll investigate, but the fact that
> qemu itself no longer crashes is sufficient for this BZ, I think.
>
> Max

Hi Max,
Yes, it's a different bz. Even without migration, just booting a guest whose system disk is mounted from the nfs server and then stopping nfs on the nfs server while the guest is running makes the guest's gdm core dump, too.
So regarding comment 17, can I mark the bz verified?

Best Regards,
Li Xiaohui

Hi,

Yes, this BZ should be verified then.

Thanks!

Max

(In reply to Max Reitz from comment #21)
> Hi,
>
> Yes, this BZ should be verified then.

That's good.

> Thanks!
>
> Max

About the guest gdm core dump after stopping the nfs service, I don't know whether I need to file a new bz. Max, if you have any result after your investigation, please tell me; thanks in advance.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2395