Bug 1743639 - Backward migration from qemu4.1 to qemu-kvm-rhev-2.12.0 failed with message "qemu-kvm: error while loading state for instance 0x0 of device 'spapr'"
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: qemu-kvm
Version: 8.1
Hardware: ppc64le
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Laurent Vivier
QA Contact: Gu Nini
URL:
Whiteboard:
Depends On: 1744170
Blocks:
 
Reported: 2019-08-20 11:04 UTC by xianwang
Modified: 2019-08-28 09:48 UTC (History)
CC: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-08-28 09:48:56 UTC
Type: Bug
Target Upstream Version:
Embargoed:



Description xianwang 2019-08-20 11:04:55 UTC
Description of problem:
When doing backward migration from rhel8.1.0 (qemu 4.1) to alt7.6 (qemu-kvm-rhev-2.12.0), migration fails with "(qemu) qemu-kvm: error while loading state for instance 0x0 of device 'spapr'". It happens on both p9<->p9 and p8<->p9.

Version-Release number of selected component (if applicable):
Host A p9(alt7.6):
4.14.0-115.8.2.el7a.ppc64le
qemu-kvm-rhev-2.12.0-18.el7_6.7.ppc64le
SLOF-20171214-2.gitfa98132.el7.noarch

Host B p8(rhel8.1.0 fast train):
4.18.0-130.el8.ppc64le
qemu-kvm-4.1.0-4.module+el8.1.0+4020+16089f93.ppc64le
SLOF-20190114-2.gita5b428e.module+el8.1.0+3554+1a3a94a6.noarch

How reproducible:
100%

Steps to Reproduce:
1. Boot a guest on host A:
/usr/libexec/qemu-kvm -M pseries-rhel7.6.0 -nodefaults -monitor stdio
(qemu) info qtree
bus: main-system-bus
  type System
  dev: spapr-pci-host-bridge, id ""
    index = 0 (0x0)
    mem_win_size = 2147483648 (0x80000000)
    mem64_win_size = 1099511627776 (0x10000000000)
    io_win_size = 65536 (0x10000)
    dynamic-reconfiguration = true
    dma_win_addr = 0 (0x0)
    dma_win_size = 1073741824 (0x40000000)
    dma64_win_addr = 576460752303423488 (0x800000000000000)
    ddw = true
    pgsz = 69632 (0x11000)
    numa_node = 4294967295 (0xffffffff)
    pre-2.8-migration = false
    pcie-extended-configuration-space = true
    bus: pci.0
      type PCI
  dev: spapr-vio-bridge, id ""
    bus: spapr-vio
      type spapr-vio-bus
      dev: spapr-nvram, id "nvram@71000000"
        reg = 1895825408 (0x71000000)
        drive = ""
        irq = 4098 (0x1002)


2. Boot the incoming guest on host B:
/usr/libexec/qemu-kvm -M pseries-rhel7.6.0 -nodefaults -monitor stdio -incoming tcp:0:5801
(qemu) info qtree
bus: main-system-bus
  type System
  dev: spapr-pci-host-bridge, id ""
    index = 0 (0x0)
    mem_win_size = 2147483648 (0x80000000)
    mem64_win_size = 1099511627776 (0x10000000000)
    io_win_size = 65536 (0x10000)
    dynamic-reconfiguration = true
    dma_win_addr = 0 (0x0)
    dma_win_size = 1073741824 (0x40000000)
    dma64_win_addr = 576460752303423488 (0x800000000000000)
    ddw = true
    pgsz = 69632 (0x11000)
    numa_node = 4294967295 (0xffffffff)
    pre-2.8-migration = false
    pcie-extended-configuration-space = true
    gpa = 70368744177664 (0x400000000000)
    atsd = 140737488355328 (0x800000000000)
    bus: pci.0
      type PCI
  dev: spapr-vio-bridge, id ""
    bus: spapr-vio
      type spapr-vio-bus
      dev: spapr-nvram, id "nvram@71000000"
        reg = 1895825408 (0x71000000)
        drive = ""


3. Do forward migration from host A to host B:
(qemu) migrate -d tcp:10.16.200.238:5801
Migration completed and the VM is running on host B.

4. Boot the incoming guest on host A:
/usr/libexec/qemu-kvm -M pseries-rhel7.6.0 -nodefaults -monitor stdio -incoming tcp:0:5801

5. Do backward migration from host B to host A:
(qemu) migrate -d tcp:10.19.128.145:5801


Actual results:
Migration completed on host B, but QEMU crashed on host A.
On host A:
(qemu) qemu-kvm: error while loading state for instance 0x0 of device 'spapr'
qemu-kvm: load of migration failed: No such file or directory

Expected results:
Forward and backward migration both work well.

Additional info:
This issue also happens with the following builds (P9<->P9):
Host A p9(alt7.6):
4.14.0-115.8.2.el7a.ppc64le
qemu-kvm-rhev-2.12.0-18.el7_6.7.ppc64le

Host B p8(rhel8.1.0 fast train):
4.18.0-134.el8.ppc64le
qemu-kvm-4.1.0-1.module+el8.1.0+3966+4a23dca1.ppc64le

Comment 1 xianwang 2019-08-20 11:07:21 UTC
This is ppc only bug

Comment 2 xianwang 2019-08-20 11:15:08 UTC
I.
This is a regression in qemu 4.1: with a qemu 4.0 destination the same migration works, e.g. on the following builds:
P9<->P9:
Host A p9(alt7.6):
4.14.0-115.8.2.el7a.ppc64le
qemu-kvm-rhev-2.12.0-18.el7_6.7.ppc64le

Host B p9(rhel8.1.0 fast train):
4.18.0-134.el8.ppc64le
qemu-kvm-4.0.0-6.module+el8.1.0+3736+a2aefea3.ppc64le

In fact, this issue was fixed in qemu 4.0 by the following BZ:
https://bugzilla.redhat.com/show_bug.cgi?id=1709726

Comment 3 xianwang 2019-08-20 11:27:33 UTC
This issue is also hit when migrating from qemu 4.1 to qemu 4.0 and from qemu 4.1 to qemu 3.1, as follows:
I:
Host A p8 (qemu4.1):
4.18.0-130.el8.ppc64le
qemu-kvm-4.1.0-4.module+el8.1.0+4020+16089f93.ppc64le

Host B p9 (qemu4.0):
4.18.0-134.el8.ppc64le
qemu-kvm-4.0.0-6.module+el8.1.0+3736+a2aefea3.ppc64le

qemu cli:
/usr/libexec/qemu-kvm -nodefaults -monitor stdio -machine pseries-rhel8.1.0,max-cpu-compat=power8

The result is the same as in the bug report.


II:
Host A p8 (qemu4.1):
4.18.0-130.el8.ppc64le
qemu-kvm-4.1.0-4.module+el8.1.0+4020+16089f93.ppc64le

Host B p9 (qemu3.1):
4.18.0-134.el8.ppc64le
qemu-kvm-3.1.0-30.module+el8.0.1+3755+6782b0ed.ppc64le

qemu cli:
/usr/libexec/qemu-kvm -nodefaults -monitor stdio -machine pseries-rhel7.6.0,max-cpu-compat=power8

Comment 4 David Gibson 2019-08-21 01:54:13 UTC
Laurent,

Assigning to you; it may or may not be related to the migration bugs you're already looking at.

Comment 5 xianwang 2019-08-21 05:38:16 UTC
This issue is also hit when migrating from a power8 host with rhel8.1.0 (qemu 4.1) to a power8 host with rhel7.6.z (qemu-kvm-rhev-2.12).

source p8:
4.18.0-130.el8.ppc64le
qemu-kvm-4.1.0-4.module+el8.1.0+4020+16089f93.ppc64le
SLOF-20190114-2.gita5b428e.module+el8.1.0+3554+1a3a94a6.noarch

destination p8:
3.10.0-957.35.1.el7.ppc64le
qemu-kvm-rhev-2.12.0-18.el7_6.7.ppc64le
SLOF-20171214-2.gitfa98132.el7.noarch

Comment 9 Laurent Vivier 2019-08-23 09:04:03 UTC
Thank you xianwang.

In the new machine type (BZ 1744170) I think the problem is fixed by:

commit 3725ef1a944bbe1173b55fdabe76fb17876f1d9e
Author: Greg Kurz <groug>
Date:   Wed May 22 15:43:46 2019 +0200

    spapr: Don't migrate the hpt_maxpagesize cap to older machine types
    
    Commit 0b8c89be7f7b added the hpt_maxpagesize capability to the migration
    stream. This is okay for new machine types but it breaks backward migration
    to older QEMUs, which don't expect the extra subsection.
    
    Add a compatibility boolean flag to the sPAPR machine class and use it to
    skip migration of the capability for machine types 4.0 and older. This
    fixes migration to an older QEMU. Note that the destination will emit a
    warning:
    
    qemu-system-ppc64: warning: cap-hpt-max-page-size lower level (16) in incoming stream than on destination (24)
    
    This is expected and harmless though. It is okay to migrate from a lower
    HPT maximum page size (64k) to a greater one (16M).
    
    Fixes: 0b8c89be7f7b "spapr: Add forgotten capability to migration stream"
    Based-on: <20190522074016.10521-3-clg>
    Signed-off-by: Greg Kurz <groug>
    Message-Id: <155853262675.1158324.17301777846476373459.stgit>
    Signed-off-by: David Gibson <david.id.au>

Comment 10 Laurent Vivier 2019-08-23 09:17:44 UTC
As BZ 1744170 has been moved to POST and commit 3725ef1a944b "spapr: Don't migrate the hpt_maxpagesize cap to older machine types" is part of that machine type, I am also moving this BZ to POST. This will allow the package to be retested once the patch is merged.

Comment 12 Gu Nini 2019-08-28 08:37:47 UTC
####Reproduced the bug on the following hosts, with the same steps as in the bug description:

Host A: P9(alt7.6z)
Host kernel: 4.14.0-115.8.2.el7a.ppc64le
Qemu: qemu-kvm-ma-2.12.0-18.el7_6.4.ppc64le
SLOF: SLOF-20171214-2.gitfa98132.el7.noarch

Host B: P9(8.1.0-av)
Host kernel: 4.18.0-137.el8.ppc64le
Qemu: qemu-kvm-4.1.0-4.module+el8.1.0+4020+16089f93.ppc64le
SLOF: SLOF-20190703-1.gitba1ab360.module+el8.1.0+3730+7d905127.noarch

[root@ibm-p9wr-18 home]# /usr/libexec/qemu-kvm -M pseries-rhel7.6.0 -nodefaults -monitor stdio -incoming tcp:0:5801
QEMU 2.12.0 monitor - type 'help' for more information
(qemu) VNC server running on ::1:5900

(qemu) 
(qemu) qemu-kvm: error while loading state for instance 0x0 of device 'spapr'
qemu-kvm: load of migration failed: No such file or directory


####Verified the bug on the same hosts, but with a different qemu version on host B:
Host B: P9(8.1.0-av)
Host kernel: 4.18.0-137.el8.ppc64le
Qemu: qemu-kvm-4.1.0-5.module+el8.1.0+4076+b5e41ebc.ppc64le
SLOF: SLOF-20190703-1.gitba1ab360.module+el8.1.0+3730+7d905127.noarch


####BTW, I also tried to verify the bug with a P8 (8.1.0-av) host as host B, on qemu-kvm-4.1.0-5.module+el8.1.0+4076+b5e41ebc.ppc64le; the bug could not be reproduced there either.
Host B1: P8(8.1.0-av)
Host kernel: 4.18.0-136.el8.ppc64le
Guest kernel: 4.18.0-137.el8.ppc64le
Qemu: qemu-kvm-4.1.0-5.module+el8.1.0+4076+b5e41ebc.ppc64le
SLOF: SLOF-20190703-1.gitba1ab360.module+el8.1.0+3730+7d905127.noarch


####Conclusion: Based on the above test results, the bug is fixed in qemu-kvm-4.1.0-5.module+el8.1.0+4076+b5e41ebc.ppc64le.

Comment 13 Qunfang Zhang 2019-08-28 09:48:56 UTC
Thanks for the feedback; closing since the issue is fixed. Please correct me if something is wrong.

