1741059 – Migrate a vm from ALT-7.6 to RHELAV-8.1.0, after migration completed, vm hang and failed to reboot vm


Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux Advanced Virtualization
Classification:	Red Hat
Component:	qemu-kvm
Sub Component:
Version:	8.1
Hardware:	ppc64le
OS:	Linux
Priority:	urgent
Severity:	urgent
Target Milestone:	rc
Target Release:	8.1
Assignee:	Laurent Vivier
QA Contact:	Gu Nini
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-08-14 07:48 UTC by xianwang
Modified:	2019-11-06 07:18 UTC
CC List:	13 users
Fixed In Version:	qemu-kvm-4.1.0-4.module+el8.1.0+4020+16089f93.ppc64le
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-11-06 07:18:29 UTC
Type:	Bug
Target Upstream Version:
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2019:3723	0	None	None	None	2019-11-06 07:18:50 UTC

Description xianwang 2019-08-14 07:48:13 UTC

Description of problem:
During forward and backward migration testing on powerpc from ALT-7.6 to RHELAV-8.1.0, the guest hung on the destination end after the migration completed on the source end. After executing "(qemu) system_reset", the VM failed to reboot and stopped at the SLOF phase; the VM status was "VM status: paused (io-error)".

Version-Release number of selected component (if applicable):
Source host:
4.14.0-115.11.1.el7a.ppc64le
qemu-kvm-rhev-2.12.0-18.el7_6.7.ppc64le
SLOF-20171214-2.gitfa98132.el7.noarch
# cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]
******* Because of the "THP" issue, we should disable THP on ALT-7.6

Destination host:
4.18.0-129.el8.ppc64le
qemu-kvm-4.0.0-6.module+el8.1.0+3736+a2aefea3.ppc64le
SLOF-20190703-1.gitba1ab360.module+el8.1.0+3730+7d905127.noarch
# cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
******* The "THP" issue is fixed in this build, so we should use its default value

Guest:
4.14.0-115.11.1.el7a.ppc64le
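
In the THP file shown above, the active mode is the word in square brackets ("never" on the source host, "madvise" on the destination host). As a small sketch for comparing both hosts in a script, the helper below extracts that word; the function name thp_active_mode is made up for illustration.

```shell
# Print the active THP mode, i.e. the word inside [brackets].
# $1: the content of /sys/kernel/mm/transparent_hugepage/enabled,
#     for example "always madvise [never]".
thp_active_mode() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([a-z]*\)\].*/\1/p'
}
```

On a host it could be called as thp_active_mode "$(cat /sys/kernel/mm/transparent_hugepage/enabled)". Writing "never" into the same file as root is the usual way to disable THP, as was done on the ALT-7.6 source host.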

How reproducible:
100%

Steps to Reproduce:
1. Boot a guest with the following qemu cli:
/usr/libexec/qemu-kvm \
    -name 'avocado-vt-vm1'  \
    -sandbox off  \
    -nodefaults  \
    -machine pseries-rhel7.6.0 \
    -uuid 8aeab7e2-f341-4f8c-80e8-59e2968d85c2 \
    -object iothread,id=iothread0 \
    -chardev socket,id=console0,path=/tmp/console0,server,nowait \
    -device spapr-vty,chardev=console0,reg=0x30000000 \
    -device nec-usb-xhci,id=usb1,bus=pci.0,addr=0x5 \
    -device pci-bridge,chassis_nr=1,id=bridge1,bus=pci.0,addr=0x6 \
    -device virtio-scsi-pci,id=scsi1,bus=pci.0,addr=0x7 \
    -drive file=/home/xianwang/ALT-Server-7.6-ppc64le-virtio-scsi.qcow2,format=qcow2,if=none,cache=none,id=drive_scsi1,werror=stop,rerror=stop \
    -device scsi-hd,drive=drive_scsi1,id=scsi-disk1,bus=scsi1.0,channel=0,scsi-id=0x6,lun=0x3,bootindex=0 \
    -device virtio-net-pci,mac=9a:7b:7c:7d:7e:72,id=id9HRc5V,vectors=4,netdev=idjlQN53,bus=pci.0,addr=0xa \
    -netdev tap,id=idjlQN53,vhost=on,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown \
    -m 2048,slots=4,maxmem=32G \
    -smp 4 \
    -vga std \
    -vnc :11 \
    -cpu host \
    -device usb-kbd \
    -device usb-mouse \
    -qmp tcp:0:8881,server,nowait \
    -msg timestamp=on \
    -rtc base=localtime,clock=vm,driftfix=slew  \
    -monitor stdio \
    -boot order=cdn,once=n,menu=on,strict=off \
    -enable-kvm

2. Boot a guest on the destination host with the same qemu cli as above, appending "-incoming tcp:0:5801"

3. On the source host, start the migration
(qemu) migrate -d tcp:10.19.128.149:5801

4. The migration finishes on the source end
(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: off x-colo: off release-ram: off return-path: off pause-before-switchover: off x-multifd: off dirty-bitmaps: off late-block-activate: off 
Migration status: completed
total time: 79976 milliseconds
downtime: 264 milliseconds
setup: 24 milliseconds
transferred ram: 2790415 kbytes
throughput: 286.48 mbps
remaining ram: 0 kbytes
total ram: 2113856 kbytes
duplicate: 677827 pages
skipped: 0 pages
normal: 694756 pages
normal bytes: 2779024 kbytes
dirty sync count: 9
page size: 4 kbytes
(qemu) info status 
VM status: paused (postmigrate)
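
When repeating step 4 many times, it can help to pull single fields out of saved "info migrate" output instead of reading it by eye. The following is a minimal sketch, assuming the monitor output has been captured to a file or pipe; the function name migration_field is hypothetical.

```shell
# Print the value of one "name: value" line from "info migrate" output.
# $1: field name exactly as printed, e.g. "Migration status" or "downtime".
# stdin: the captured monitor output.
migration_field() {
    awk -F': ' -v key="$1" '$1 == key { print $2; exit }'
}
```

For the output above, migration_field 'Migration status' would print "completed" and migration_field 'downtime' would print "264 milliseconds". Note that a completed migration on the source says nothing about the guest's health on the destination, which is the point of this report.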


Actual results:
The VM hung on the destination end and could not be rebooted: after "system_reset", the VM status is "paused (io-error)"
Destination end:
(qemu) info status 
VM status: running     ******in fact, vm hung on VNC
(qemu) system_reset 
(qemu) info status 
VM status: paused (io-error)
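
The two paused states in this report mean different things: "postmigrate" on the source is the normal state after a completed migration, while "io-error" on the destination means a block request failed and, because the drive uses werror=stop,rerror=stop, QEMU paused the guest. A small sketch for telling them apart in captured "info status" output follows; the function name vm_pause_reason is an assumption for illustration.

```shell
# Print the reason in parentheses from a "VM status: paused (...)" line.
# stdin: captured "info status" output. Prints nothing if the VM is not paused.
vm_pause_reason() {
    sed -n 's/^VM status: paused (\([^)]*\)).*/\1/p'
}
```

A test loop could treat "postmigrate" on the source as success and any "io-error" on the destination as a failure. As seen above, "VM status: running" alone does not prove the guest is healthy.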



After migration completed, destination console output:
Red Hat Enterprise Linux Server 7.6 (Maipo)
Kernel 4.14.0-115.11.1.el7a.ppc64le on an ppc64le

dhcp16-213-225 login: [  367.511873] INFO: task dbus-daemon:5655 blocked for more than 120 seconds.
[  367.511918]       Not tainted 4.14.0-115.11.1.el7a.ppc64le #1
[  367.511953] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  367.511995] dbus-daemon     D    0  5655      1 0x00040080
[  367.512024] Call Trace:
[  367.512040] [c00000006a0bb160] [0000000000000009] 0x9 (unreliable)
[  367.512078] [c00000006a0bb330] [c00000000001e480] __switch_to+0x350/0x670
[  367.512114] [c00000006a0bb390] [c000000000c418c4] __schedule+0x354/0xaf0
[  367.512150] [c00000006a0bb460] [c000000000c420a8] schedule+0x48/0xc0
[  367.512186] [c00000006a0bb490] [c000000000c48394] schedule_timeout+0x194/0x580
[  367.512229] [c00000006a0bb580] [c000000000c437f8] wait_for_completion+0x168/0x270
[  367.512292] [c00000006a0bb600] [c008000001a65c08] xfs_buf_submit_wait+0xb8/0x3c0 [xfs]
[  367.512354] [c00000006a0bb640] [c008000001a66114] xfs_buf_read_map+0x194/0x2a0 [xfs]
[  367.512415] [c00000006a0bb6a0] [c008000001ab6878] xfs_trans_read_buf_map+0x238/0x450 [xfs]
[  367.512471] [c00000006a0bb710] [c008000001a24bbc] xfs_da_read_buf+0x39c/0x4a0 [xfs]
[  367.512527] [c00000006a0bb830] [c008000001a30334] xfs_dir2_leaf_lookup_int+0x94/0x390 [xfs]
[  367.512584] [c00000006a0bb8d0] [c008000001a30678] xfs_dir2_leaf_lookup+0x48/0x1b0 [xfs]
[  367.512639] [c00000006a0bb930] [c008000001a27de0] xfs_dir_lookup+0x270/0x2c0 [xfs]
[  367.512697] [c00000006a0bb990] [c008000001a83ffc] xfs_lookup+0x6c/0x190 [xfs]
[  367.512757] [c00000006a0bb9f0] [c008000001a7ef58] xfs_vn_lookup+0x78/0xd0 [xfs]
[  367.512806] [c00000006a0bba40] [c000000000456648] lookup_slow+0xd8/0x240
[  367.512843] [c00000006a0bbac0] [c00000000045bc38] walk_component+0x468/0x690
[  367.512888] [c00000006a0bbb60] [c00000000045d768] path_lookupat+0x1f8/0x710
[  367.512925] [c00000006a0bbbe0] [c00000000045dd20] filename_lookup+0xa0/0x270
[  367.512972] [c00000006a0bbd10] [c00000000044ae5c] vfs_statx.constprop.2+0x5c/0x220
[  367.513020] [c00000006a0bbd70] [c00000000044b33c] SyS_newstat+0x2c/0x60
[  367.513062] [c00000006a0bbe30] [c00000000000b288] system_call+0x5c/0x70
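
The call trace above shows dbus-daemon stuck in an XFS buffer read, i.e. waiting for disk I/O that never completes. To spot such hung-task reports in a saved guest console log, a sketch like the following could be used; the function name hung_tasks is hypothetical and the pattern only covers the message format shown in this report.

```shell
# List hung-task reports from a console log as "task pid seconds".
# stdin: guest console output or dmesg text.
hung_tasks() {
    sed -n 's/.*INFO: task \([^:]*\):\([0-9]*\) blocked for more than \([0-9]*\) seconds.*/\1 \2 \3/p'
}
```

For the console output above this would print "dbus-daemon 5655 120".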


Red Hat Enterprise Linux Server 7.6 (Maipo)
Kernel 4.14.0-115.11.1.el7a.ppc64le on an ppc64le

dhcp16-213-225 login: 

SLOF **********************************************************************
QEMU Starting
 Build Date = Jul 23 2019 04:40:55
 FW Version = mockbuild@ release 20190703
 Press "s" to enter Open Firmware.

Press F12 for boot menu.

Populating /vdevice methods
Populating /vdevice/vty@30000000
Populating /vdevice/nvram@71000000
Populating /pci@800000020000000
                     00 0000 (D) : 1234 1111    qemu vga
                     00 2800 (D) : 1033 0194    serial bus [ usb-xhci ]
                     00 3000 (B) : 1b36 0001    pci*
                     01 3800 (D) : 1af4 1004    virtio [ scsi ]
Populating /pci@800000020000000/pci@6/scsi@7
       SCSI: Looking for devices
          106000300000000 DISK     : "QEMU     QEMU HARDDISK    2.5+"
                     00 5000 (D) : 1af4 1000    virtio [ net ]
Installing QEMU fb



Scanning USB 
  XHCI: Initializing
    USB Keyboard 
    USB mouse 
No console specified using screen & keyboard
     




  Welcome to Open Firmware

  Copyright (c) 2004, 2017 IBM Corporation All rights reserved.
  This program and the accompanying materials are made available
  under the terms of the BSD License available at
  http://www.opensource.org/licenses/bsd-license.php


Trying to load:  from: /pci@800000020000000/pci@6/scsi@7/disk@106000300000000 ... 


Expected results:
Migration completes and the VM works well on the destination.

Additional info:

Comment 1 xianwang 2019-08-14 08:22:35 UTC

I. 
This issue is only hit with the fast train on the destination, not with the slow train, i.e., if the destination host runs "qemu-kvm-2.12.0-83.module+el8.1.0+3852+0ba8aef0.ppc64le", I cannot hit this issue with the same steps, with all other build information the same as in the bug report.

II.
I could also hit this issue with the following simple qemu cli:
/usr/libexec/qemu-kvm \
-nodefaults \
-machine pseries-rhel7.6.0 \
-monitor stdio \
-device virtio-net-pci,mac=9a:7b:7c:7d:7e:72,id=id9HRc5V,vectors=4,netdev=idjlQN53,bus=pci.0,addr=0xa \
-netdev tap,id=idjlQN53,vhost=on,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown \
-device virtio-scsi-pci,id=scsi1,bus=pci.0,addr=0x7 \
-drive file=/home/xianwang/ALT-Server-7.6-ppc64le-virtio-scsi.qcow2,format=qcow2,if=none,cache=none,id=drive_scsi1,werror=stop,rerror=stop \
-device scsi-hd,drive=drive_scsi1,id=scsi-disk1,bus=scsi1.0,channel=0,scsi-id=0x6,lun=0x3,bootindex=0

After the migration completed on the source, the guest status is:
(qemu) info status 
VM status: paused (io-error)

But the following qemu cli works well:
/usr/libexec/qemu-kvm \
-nodefaults \
-machine pseries-rhel7.6.0 \
-monitor stdio -incoming tcp:0:5801 \
-device virtio-scsi-pci,id=scsi1,bus=pci.0,addr=0x7 \
-drive file=/home/xianwang/ALT-Server-7.6-ppc64le-virtio-scsi.qcow2,format=qcow2,if=none,cache=none,id=drive_scsi1,werror=stop,rerror=stop \
-device scsi-hd,drive=drive_scsi1,id=scsi-disk1,bus=scsi1.0,channel=0,scsi-id=0x6,lun=0x3,bootindex=0

III.
I have tried removing "-netdev tap,.." and "-device virtio-net-pci,.." from the qemu cli in the bug report, and then the issue does not occur, so I suspect this issue is related to the virtio-net device.
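
The narrowing-down described in II and III (same command line, minus the network device) can be scripted. The following is only a sketch under the assumption that every option and its value are passed as separate arguments; the function name drop_net_args is made up for illustration.

```shell
# Print the given qemu arguments one per line, leaving out every
# "-netdev <value>" pair and every "-device virtio-net-pci,..." pair.
drop_net_args() {
    while [ "$#" -gt 0 ]; do
        if [ "$#" -ge 2 ]; then
            case "$1 $2" in
                "-netdev "*|"-device virtio-net-pci"*)
                    shift 2    # skip the option and its value
                    continue
                    ;;
            esac
        fi
        printf '%s\n' "$1"
        shift
    done
}
```

Running the original and the filtered command line against the same destination build is one way to confirm whether the virtio-net device is needed to trigger the hang.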

Comment 3 xianwang 2019-08-15 03:27:05 UTC

ALT-7.6 is not supported on x86_64, so I think we can treat this issue as powerpc only.

Comment 5 xianwang 2019-08-15 08:15:55 UTC

I have tested this scenario several times on qemu 3.1 but cannot reproduce it, so it is a regression. The detailed build information is as follows:

src host:
4.14.0-115.11.1.el7a.ppc64le
qemu-kvm-rhev-2.12.0-18.el7_6.7.ppc64le
SLOF-20171214-2.gitfa98132.el7.noarch

dst host:
4.18.0-80.10.1.el8_0.ppc64le
qemu-kvm-3.1.0-30.module+el8.0.1+3755+6782b0ed.ppc64le
SLOF-20180702-4.git9b7ab2f.module+el8.0.1+3755+6782b0ed.noarch

Guest:
4.14.0-115.11.1.el7a.ppc64le

Comment 7 Laurent Vivier 2019-08-20 12:50:12 UTC

Xianwang,

could you retest with qemu-kvm-4.1.0?

Thanks

Comment 8 xianwang 2019-08-21 03:26:17 UTC

(In reply to Laurent Vivier from comment #7)
> Xianwang,
> 
> could you retest with qemu-kvm-4.1.0?
> 
> Thanks

Yes, I think it is fixed in qemu 4.1: I have tried several times on qemu 4.1 and have not hit it.
source:
4.14.0-115.12.1.el7a.ppc64le
qemu-kvm-rhev-2.12.0-18.el7_6.7.ppc64le
# cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

destination:
4.18.0-134.el8.ppc64le
qemu-kvm-4.1.0-4.module+el8.1.0+4020+16089f93.ppc64le
# cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
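
Putting the destination builds from the description, comment 1, comment 5 and this comment side by side: the hang was seen with qemu-kvm 4.0.0 and not with 2.12.0, 3.1.0 or 4.1.0. The sketch below simply encodes those observations for use in a test script; the function names are hypothetical and the table reflects only what was tested in this report, not a general statement about those versions.

```shell
# Extract the upstream version from a qemu-kvm package name,
# e.g. "qemu-kvm-4.0.0-6.module+el8.1.0+3736+a2aefea3.ppc64le" -> "4.0.0".
qemu_pkg_version() {
    printf '%s\n' "$1" | sed -n 's/^qemu-kvm-\(rhev-\)\{0,1\}\([0-9][0-9.]*\)-.*/\2/p'
}

# Report what this bug observed for a given destination package.
observed_result() {
    case "$(qemu_pkg_version "$1")" in
        4.0.0) echo "hang reproduced" ;;
        2.12.0|3.1.0|4.1.0) echo "not reproduced" ;;
        *) echo "not tested in this report" ;;
    esac
}
```

For example, observed_result qemu-kvm-4.1.0-4.module+el8.1.0+4020+16089f93.ppc64le prints "not reproduced".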

Comment 9 Laurent Vivier 2019-08-21 12:40:36 UTC

According to comment 8, I am closing this BZ as fixed in CURRENTRELEASE (qemu-4.1)

Comment 14 Qunfang Zhang 2019-09-05 06:26:37 UTC

Based on this, I'm setting status to VERIFIED.

Hi Danilo,

Can you help add this bug to Advanced-Virt-RHEL-8.1.0 errata? 

Thanks,
Qunfang

Comment 15 Danilo de Paula 2019-09-16 19:26:14 UTC

Should be there.

Comment 17 errata-xmlrpc 2019-11-06 07:18:29 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3723
