Description of problem:
For forward and backward migration tests on powerpc from ALT-7.6 to RHEL-AV-8.1.0, after migration completed on the source end, the guest hung on the destination end. After executing "(qemu) system_reset", the VM failed to reboot and stopped at the SLOF phase; the VM status is "VM status: paused (io-error)".

Version-Release number of selected component (if applicable):
Source host:
4.14.0-115.11.1.el7a.ppc64le
qemu-kvm-rhev-2.12.0-18.el7_6.7.ppc64le
SLOF-20171214-2.gitfa98132.el7.noarch
# cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]
*** Because of the "THP" issue, we disable THP on ALT-7.6.

Destination host:
4.18.0-129.el8.ppc64le
qemu-kvm-4.0.0-6.module+el8.1.0+3736+a2aefea3.ppc64le
SLOF-20190703-1.gitba1ab360.module+el8.1.0+3730+7d905127.noarch
# cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
*** The "THP" issue is fixed in this build, so we use its default value.

Guest: 4.14.0-115.11.1.el7a.ppc64le

How reproducible:
100%

Steps to Reproduce:
1. Boot a guest with the following qemu cli:
/usr/libexec/qemu-kvm \
-name 'avocado-vt-vm1' \
-sandbox off \
-nodefaults \
-machine pseries-rhel7.6.0 \
-uuid 8aeab7e2-f341-4f8c-80e8-59e2968d85c2 \
-object iothread,id=iothread0 \
-chardev socket,id=console0,path=/tmp/console0,server,nowait \
-device spapr-vty,chardev=console0,reg=0x30000000 \
-device nec-usb-xhci,id=usb1,bus=pci.0,addr=0x5 \
-device pci-bridge,chassis_nr=1,id=bridge1,bus=pci.0,addr=0x6 \
-device virtio-scsi-pci,id=scsi1,bus=pci.0,addr=0x7 \
-drive file=/home/xianwang/ALT-Server-7.6-ppc64le-virtio-scsi.qcow2,format=qcow2,if=none,cache=none,id=drive_scsi1,werror=stop,rerror=stop \
-device scsi-hd,drive=drive_scsi1,id=scsi-disk1,bus=scsi1.0,channel=0,scsi-id=0x6,lun=0x3,bootindex=0 \
-device virtio-net-pci,mac=9a:7b:7c:7d:7e:72,id=id9HRc5V,vectors=4,netdev=idjlQN53,bus=pci.0,addr=0xa \
-netdev tap,id=idjlQN53,vhost=on,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown \
-m 2048,slots=4,maxmem=32G \
-smp 4 \
-vga std \
-vnc :11 \
-cpu host \
-device usb-kbd \
-device usb-mouse \
-qmp tcp:0:8881,server,nowait \
-msg timestamp=on \
-rtc base=localtime,clock=vm,driftfix=slew \
-monitor stdio \
-boot order=cdn,once=n,menu=on,strict=off \
-enable-kvm

2. Boot a guest on the destination host with the same qemu cli as above, appending "-incoming tcp:0:5801".

3. On the source host, start migration:
(qemu) migrate -d tcp:10.19.128.149:5801

4. Migration finished on the source end:
(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: off x-colo: off release-ram: off return-path: off pause-before-switchover: off x-multifd: off dirty-bitmaps: off late-block-activate: off
Migration status: completed
total time: 79976 milliseconds
downtime: 264 milliseconds
setup: 24 milliseconds
transferred ram: 2790415 kbytes
throughput: 286.48 mbps
remaining ram: 0 kbytes
total ram: 2113856 kbytes
duplicate: 677827 pages
skipped: 0 pages
normal: 694756 pages
normal bytes: 2779024 kbytes
dirty sync count: 9
page size: 4 kbytes
(qemu) info status
VM status: paused (postmigrate)

Actual results:
The VM hung on the destination end and could not be rebooted; after "system_reset", the VM status is "paused (io-error)".

Destination end:
(qemu) info status
VM status: running
*** In fact, the VM hung on VNC.
(qemu) system_reset
(qemu) info status
VM status: paused (io-error)

After migration completed, destination console output:
Red Hat Enterprise Linux Server 7.6 (Maipo)
Kernel 4.14.0-115.11.1.el7a.ppc64le on an ppc64le

dhcp16-213-225 login: [ 367.511873] INFO: task dbus-daemon:5655 blocked for more than 120 seconds.
[ 367.511918] Not tainted 4.14.0-115.11.1.el7a.ppc64le #1
[ 367.511953] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 367.511995] dbus-daemon     D    0  5655      1 0x00040080
[ 367.512024] Call Trace:
[ 367.512040] [c00000006a0bb160] [0000000000000009] 0x9 (unreliable)
[ 367.512078] [c00000006a0bb330] [c00000000001e480] __switch_to+0x350/0x670
[ 367.512114] [c00000006a0bb390] [c000000000c418c4] __schedule+0x354/0xaf0
[ 367.512150] [c00000006a0bb460] [c000000000c420a8] schedule+0x48/0xc0
[ 367.512186] [c00000006a0bb490] [c000000000c48394] schedule_timeout+0x194/0x580
[ 367.512229] [c00000006a0bb580] [c000000000c437f8] wait_for_completion+0x168/0x270
[ 367.512292] [c00000006a0bb600] [c008000001a65c08] xfs_buf_submit_wait+0xb8/0x3c0 [xfs]
[ 367.512354] [c00000006a0bb640] [c008000001a66114] xfs_buf_read_map+0x194/0x2a0 [xfs]
[ 367.512415] [c00000006a0bb6a0] [c008000001ab6878] xfs_trans_read_buf_map+0x238/0x450 [xfs]
[ 367.512471] [c00000006a0bb710] [c008000001a24bbc] xfs_da_read_buf+0x39c/0x4a0 [xfs]
[ 367.512527] [c00000006a0bb830] [c008000001a30334] xfs_dir2_leaf_lookup_int+0x94/0x390 [xfs]
[ 367.512584] [c00000006a0bb8d0] [c008000001a30678] xfs_dir2_leaf_lookup+0x48/0x1b0 [xfs]
[ 367.512639] [c00000006a0bb930] [c008000001a27de0] xfs_dir_lookup+0x270/0x2c0 [xfs]
[ 367.512697] [c00000006a0bb990] [c008000001a83ffc] xfs_lookup+0x6c/0x190 [xfs]
[ 367.512757] [c00000006a0bb9f0] [c008000001a7ef58] xfs_vn_lookup+0x78/0xd0 [xfs]
[ 367.512806] [c00000006a0bba40] [c000000000456648] lookup_slow+0xd8/0x240
[ 367.512843] [c00000006a0bbac0] [c00000000045bc38] walk_component+0x468/0x690
[ 367.512888] [c00000006a0bbb60] [c00000000045d768] path_lookupat+0x1f8/0x710
[ 367.512925] [c00000006a0bbbe0] [c00000000045dd20] filename_lookup+0xa0/0x270
[ 367.512972] [c00000006a0bbd10] [c00000000044ae5c] vfs_statx.constprop.2+0x5c/0x220
[ 367.513020] [c00000006a0bbd70] [c00000000044b33c] SyS_newstat+0x2c/0x60
[ 367.513062] [c00000006a0bbe30] [c00000000000b288] system_call+0x5c/0x70

Red Hat Enterprise Linux Server 7.6 (Maipo)
Kernel 4.14.0-115.11.1.el7a.ppc64le on an ppc64le

dhcp16-213-225 login:

After "system_reset":
SLOF
**********************************************************************
QEMU Starting
 Build Date = Jul 23 2019 04:40:55
 FW Version = mockbuild@ release 20190703
Press "s" to enter Open Firmware.
Press F12 for boot menu.
Populating /vdevice methods
Populating /vdevice/vty@30000000
Populating /vdevice/nvram@71000000
Populating /pci@800000020000000
        00 0000 (D) : 1234 1111    qemu vga
        00 2800 (D) : 1033 0194    serial bus [ usb-xhci ]
        00 3000 (B) : 1b36 0001    pci*
        01 3800 (D) : 1af4 1004    virtio [ scsi ]
Populating /pci@800000020000000/pci@6/scsi@7
       SCSI: Looking for devices
          106000300000000 DISK : "QEMU QEMU HARDDISK 2.5+"
        00 5000 (D) : 1af4 1000    virtio [ net ]
Installing QEMU fb
Scanning USB
  XHCI: Initializing
    USB Keyboard
    USB mouse
No console specified using screen & keyboard

Welcome to Open Firmware

Copyright (c) 2004, 2017 IBM Corporation All rights reserved.
This program and the accompanying materials are made available
under the terms of the BSD License available at
http://www.opensource.org/licenses/bsd-license.php

Trying to load: from: /pci@800000020000000/pci@6/scsi@7/disk@106000300000000 ...

Expected results:
Migration completes and the VM works well on the destination.

Additional info:
I. This issue is only hit with the fast train on the destination, not with the slow train; i.e., with "qemu-kvm-2.12.0-83.module+el8.1.0+3852+0ba8aef0.ppc64le" on the destination host, I cannot hit this issue with the same steps, all other build information being the same as in this bug report.

II. I can also hit this issue with the following simple qemu cli:
/usr/libexec/qemu-kvm \
-nodefaults \
-machine pseries-rhel7.6.0 \
-monitor stdio \
-device virtio-net-pci,mac=9a:7b:7c:7d:7e:72,id=id9HRc5V,vectors=4,netdev=idjlQN53,bus=pci.0,addr=0xa \
-netdev tap,id=idjlQN53,vhost=on,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown \
-device virtio-scsi-pci,id=scsi1,bus=pci.0,addr=0x7 \
-drive file=/home/xianwang/ALT-Server-7.6-ppc64le-virtio-scsi.qcow2,format=qcow2,if=none,cache=none,id=drive_scsi1,werror=stop,rerror=stop \
-device scsi-hd,drive=drive_scsi1,id=scsi-disk1,bus=scsi1.0,channel=0,scsi-id=0x6,lun=0x3,bootindex=0

After migration completed on the source, the guest status is:
(qemu) info status
VM status: paused (io-error)

But the following qemu cli works well:
/usr/libexec/qemu-kvm \
-nodefaults \
-machine pseries-rhel7.6.0 \
-monitor stdio \
-incoming tcp:0:5801 \
-device virtio-scsi-pci,id=scsi1,bus=pci.0,addr=0x7 \
-drive file=/home/xianwang/ALT-Server-7.6-ppc64le-virtio-scsi.qcow2,format=qcow2,if=none,cache=none,id=drive_scsi1,werror=stop,rerror=stop \
-device scsi-hd,drive=drive_scsi1,id=scsi-disk1,bus=scsi1.0,channel=0,scsi-id=0x6,lun=0x3,bootindex=0

III. I have tried removing "-netdev tap,..." and "-device virtio-net-pci,..." from the qemu cli of this bug report, and then the issue does not occur, so I suspect this issue is related to the virtio-net device.
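When repeating the steps above across several qemu builds, the "(qemu) info status" output is what distinguishes pass from fail. A small, hypothetical Python helper (not part of any tooling mentioned in this report; it only assumes the "VM status: <status> (<reason>)" format quoted above) can classify the HMP status line automatically:

```python
import re

def parse_vm_status(line):
    """Parse an HMP 'info status' line such as
    'VM status: paused (io-error)' or 'VM status: running'.
    Returns (status, reason); reason is None when absent."""
    m = re.match(r"VM status:\s*(\S+)(?:\s*\(([^)]+)\))?", line.strip())
    if not m:
        raise ValueError("unrecognized status line: %r" % line)
    return m.group(1), m.group(2)

# Status lines quoted in this report:
print(parse_vm_status("VM status: paused (io-error)"))    # ('paused', 'io-error')
print(parse_vm_status("VM status: paused (postmigrate)")) # ('paused', 'postmigrate')
print(parse_vm_status("VM status: running"))              # ('running', None)
```

A test script could then flag the failure as soon as the reason is "io-error" instead of the expected "postmigrate"/"running" states.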
ALT-7.6 is not supported on x86_64, so I think we can consider this issue powerpc-only.
I have tried testing this scenario several times on qemu 3.1 but cannot reproduce it, so it is a regression. The detailed build information is as follows:
src host:
4.14.0-115.11.1.el7a.ppc64le
qemu-kvm-rhev-2.12.0-18.el7_6.7.ppc64le
SLOF-20171214-2.gitfa98132.el7.noarch
dst host:
4.18.0-80.10.1.el8_0.ppc64le
qemu-kvm-3.1.0-30.module+el8.0.1+3755+6782b0ed.ppc64le
SLOF-20180702-4.git9b7ab2f.module+el8.0.1+3755+6782b0ed.noarch
Guest: 4.14.0-115.11.1.el7a.ppc64le
Xianwang, could you retest with qemu-kvm-4.1.0? Thanks
(In reply to Laurent Vivier from comment #7)
> Xianwang,
>
> could you retest with qemu-kvm-4.1.0?
>
> Thanks

Yes, I think it is fixed with qemu 4.1; I have tried several times on qemu 4.1 and have not hit it.
source:
4.14.0-115.12.1.el7a.ppc64le
qemu-kvm-rhev-2.12.0-18.el7_6.7.ppc64le
# cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]
destination:
4.18.0-134.el8.ppc64le
qemu-kvm-4.1.0-4.module+el8.1.0+4020+16089f93.ppc64le
# cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
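Since the THP setting has to be checked on both hosts before each retest, a tiny shell helper can extract the active mode, which is the bracketed entry in the sysfs file quoted above. This is a hypothetical convenience function (the `thp_mode` name and the `/tmp` sample files are made up for illustration); it assumes only the sysfs format shown in this report:

```shell
# Hypothetical helper: print the active (bracketed) THP mode from a file
# in the /sys/kernel/mm/transparent_hugepage/enabled format.
thp_mode() {
    grep -o '\[[a-z]*\]' "$1" | tr -d '[]'
}

# Sample files matching the source and destination settings quoted above:
printf 'always madvise [never]\n'  > /tmp/thp_src
printf 'always [madvise] never\n' > /tmp/thp_dst

thp_mode /tmp/thp_src   # prints: never
thp_mode /tmp/thp_dst   # prints: madvise
```

On a real host it would be called as `thp_mode /sys/kernel/mm/transparent_hugepage/enabled`.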
According to comment 8, I close this BZ as fixed in CURRENTRELEASE (qemu-4.1)
Based on this, I'm setting the status to VERIFIED.

Hi Danilo,

Can you help add this bug to the Advanced-Virt-RHEL-8.1.0 errata?

Thanks,
Qunfang
Should be there.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:3723