Bug 1890373

Summary: Kernel version update causes QEMU live migration to fail
Product: Red Hat Enterprise Linux 8 Reporter: 张东旭 <xu910121>
Component: kernel Assignee: Andrew Jones <drjones>
kernel sub component: KVM QA Contact: Zhijian Li (Fujitsu) <zhijli>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: medium CC: bstinson, carl, drjones, eric.auger, hyasuhar, jinzhao, juzhang, jwboyer, knoel, lcapitulino, mmizuma, qzhang, virt-maint, yidliu, yihyu, zhenyzha, zhijli
Version: 8.4 Keywords: TestOnly, Triaged
Target Milestone: rc   
Target Release: 8.4   
Hardware: aarch64   
OS: Linux   
Whiteboard:
Fixed In Version: kernel-4.18.0-283.el8 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-05-18 14:16:08 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1875540, 1907826    
Bug Blocks: 1885655, 1897024    

Description 张东旭 2020-10-22 02:50:18 UTC
On AArch64, QEMU live migration between hosts with different kernel versions:
old kernel version: 4.18.0-80.11.el8 (migration source)
new kernel version: 4.18.0-147.5.el8 (migration destination)

When I live-migrate a VM with QEMU from the source host (host kernel 4.18.0-80.11.el8) to the destination host (host kernel 4.18.0-147.5.el8), the migration fails with these messages:
qemu-kvm: Invalid value 233 expecting positive value <= 232
qemu-kvm: Failed to load cpu:cpreg_vmstate_array_len
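
The two messages are consistent with a bounds check on the length of the incoming register list: the migration stream carries 233 entries, while the destination accepts at most 232. A minimal Python sketch of such a check follows. `load_array_len` is a hypothetical helper written for illustration, not QEMU's actual code, and the rule it applies (accept a value only if it is positive and no larger than the destination's count) is inferred from the wording of the error message.

```python
def load_array_len(incoming_len: int, local_len: int) -> int:
    """Accept the incoming length only if 0 < incoming_len <= local_len.

    Hypothetical helper: the rule is inferred from the error text in this
    report, not copied from QEMU.
    """
    if incoming_len <= 0 or incoming_len > local_len:
        raise ValueError(
            f"Invalid value {incoming_len} expecting positive value <= {local_len}"
        )
    return incoming_len


# The stream from the source carries 233 entries; the destination allows 232.
try:
    load_array_len(233, 232)
except ValueError as err:
    print(err)
```

Under this assumed rule, the opposite case (232 entries arriving at a host that allows 233) would pass this particular length check, which says nothing about any later checks.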

The migration source and destination hosts have the same hardware and the same QEMU version; only the kernel version differs. The hardware on neither side of the migration supports SVE.

I found that the new kernel version includes this patch:
KVM: arm64/sve: System register context switch and access support
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/arch/arm64/kvm/sys_regs.c?id=73433762fcaeb9d59e84d299021c6b15466c96dd

Maybe this patch causes the live migration to fail.

Are there any suggestions for making live migration from the old kernel version to the new kernel version work with QEMU?
Thanks a lot.

Comment 1 Andrew Jones 2020-10-22 06:56:07 UTC
The bug also reproduces upstream and I'm currently testing a fix for it.

Comment 4 Andrew Jones 2020-10-30 13:09:52 UTC
Posted fix upstream https://lists.cs.columbia.edu/pipermail/kvmarm/2020-October/042955.html

Comment 5 张东旭 2020-11-04 02:43:31 UTC
It worked for me.
Live migration cases:
old kernel version to new kernel version
new kernel version to old kernel version
Both succeeded.

Comment 8 Andrew Jones 2020-11-10 08:50:03 UTC
Patches now upstream

f81cb2c3ad41 KVM: arm64: Don't hide ID registers from userspace
01fe5ace92dd KVM: arm64: Consolidate REG_HIDDEN_GUEST/USER
912dee572691 KVM: arm64: Check RAZ visibility in ID register accessors
c512298eed03 KVM: arm64: Remove AA64ZFR0_EL1 accessors

A KVM selftest is also now upstream

fd02029a9e01 KVM: selftests: Add aarch64 get-reg-list test
31d212959179 KVM: selftests: Add blessed SVE registers to get-reg-list

To test, build the KVM selftests on AArch64 and run the aarch64/get-reg-list test. On a kernel without f81cb2c3ad41 ("KVM: arm64: Don't hide ID registers from userspace") the test will fail, complaining about a missing register. On a kernel with the patch the test will exit silently with success (exit code 0). An additional test, aarch64/get-reg-list-sve, can be run to confirm no regressions to the visibility of the register occur when SVE is enabled. That test must be run on a machine that supports SVE.
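
The pass/fail rule described above (exit code 0 means success, anything else means failure) can be scripted. The following Python sketch is an illustration only: `run_selftest` and the "OK"/"NG" verdict strings are hypothetical names, and it assumes the test binary has already been built.

```python
import subprocess


def run_selftest(cmd):
    """Run a selftest command and map its exit code to a verdict.

    Hypothetical helper: exit code 0 is treated as success and any other
    exit code as failure, following the description of get-reg-list above.
    Returns (verdict, exit_code, combined_output).
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    verdict = "OK" if result.returncode == 0 else "NG"
    return verdict, result.returncode, result.stdout + result.stderr
```

For example, calling `run_selftest(["./aarch64/get-reg-list"])` from the `tools/testing/selftests/kvm` directory would be expected to return "NG" on a kernel without f81cb2c3ad41 and "OK" on a kernel with it. The path is an assumption about where the built binary lives.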

Since these patches are now all upstream, they should get picked up by the AArch64 KVM rebase, so I'm making this bug depend on the rebase bug. I'm also marking it as TestOnly and removing the OtherQA flag, since we have some Virt QE resources that can run KVM selftests.

Comment 11 Zhijian Li (Fujitsu) 2020-11-13 07:42:06 UTC
Reproduced this bug with the following versions using kselftests:

kernel-core-4.18.0-240.el8.aarch64
qemu-kvm-core-5.1.0-13.module+el8.3.0+8382+afc3bbea.aarch64
testsuite commit: 585e5b17b92dead8a3aca4e3c9876fbca5f7e0ba

Test steps:
$ cd linux/tools/testing/selftests/kvm
$ make && ./aarch64/get-reg-list
make --no-builtin-rules ARCH=arm64 -C ../../../.. headers_install
make[1]: Entering directory '/home/lizhijian/workspace/linux'
  INSTALL ./usr/include
make[1]: Leaving directory '/home/lizhijian/workspace/linux'
Number blessed registers:   311
Number registers:           310

There are 1 missing registers.
The following lines are missing registers:

	ARM64_SYS_REG(3, 0, 0, 4, 4),

==== Test Assertion Failure ====
  aarch64/get-reg-list.c:453: !missing_regs && !failed_get && !failed_set && !failed_reject
  pid=819317 tid=819317 - Argument list too long
     1	0x0000000000401623: main at get-reg-list.c:450
     2	0x0000ffff90260be3: ?? ??:0
     3	0x00000000004019a3: _start at :?
  There are 1 missing registers; 0 registers failed get; 0 registers failed set; 0 registers failed reject


Test result:  NG
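
The missing entry `ARM64_SYS_REG(3, 0, 0, 4, 4)` identifies a system register by its (Op0, Op1, CRn, CRm, Op2) encoding, which is the encoding of ID_AA64ZFR0_EL1, the SVE feature ID register. As a rough sketch, the 64-bit KVM register ID behind that macro can be reconstructed in Python as shown below. The constants are assumed to match `arch/arm64/include/uapi/asm/kvm.h` and should be verified against that header; the name `arm64_sys_reg` is only illustrative.

```python
# Constants assumed from arch/arm64/include/uapi/asm/kvm.h; verify before use.
KVM_REG_ARM64 = 0x6000000000000000
KVM_REG_SIZE_U64 = 0x0030000000000000
KVM_REG_ARM64_SYSREG = 0x0013 << 16

# Field name -> (shift, maximum value of the field)
FIELDS = {
    "op0": (14, 0x3),
    "op1": (11, 0x7),
    "crn": (7, 0xF),
    "crm": (3, 0xF),
    "op2": (0, 0x7),
}


def arm64_sys_reg(op0, op1, crn, crm, op2):
    """Build the KVM register ID for a 64-bit AArch64 system register."""
    reg_id = KVM_REG_ARM64 | KVM_REG_SIZE_U64 | KVM_REG_ARM64_SYSREG
    for name, value in zip(FIELDS, (op0, op1, crn, crm, op2)):
        shift, mask = FIELDS[name]
        if value & ~mask:
            raise ValueError(f"{name}={value} does not fit in its field")
        reg_id |= value << shift
    return reg_id


# The register reported as missing by get-reg-list in this comment.
print(hex(arm64_sys_reg(3, 0, 0, 4, 4)))
```

An ID computed this way can be compared against a register list dumped on each host to see which host omits the register.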

Comment 21 Andrew Jones 2020-12-14 19:29:36 UTC
Bug 1898489 has been closed wont-fix, but we'll still be backporting more fixes for 8.4, including the patches for this bug. I'll update the bug dependency when the new bug is written. I can also just post the patches for this bug independently if needed.

Comment 23 xianwang 2021-01-12 02:02:32 UTC
Bug reproduction:
Host:
[root@fujitsu-fx700-01-n00 home]# uname -r
4.18.0-80.11.1.el8_0.aarch64
qemu-kvm-2.12.0-63.module+el8+2833+c7d6d092.aarch64

[root@fujitsu-fx700-01-n01 home]# uname -r
4.18.0-147.5.1.el8_1.aarch64
qemu-kvm-2.12.0-88.module+el8.1.0+4233+bc44be3f.aarch64


1. Boot a guest on the source host with this QEMU command line:
/usr/libexec/qemu-kvm \
    -name 'avocado-vt-vm1'  \
    -sandbox on  \
    -machine virt-rhel7.6.0,gic-version=host,graphics=on \
    -nodefaults \
    -m 8192  \
    -smp 8,maxcpus=8,cores=4,threads=1,sockets=2  \
    -cpu 'host' \
    -vnc :10  \
    -enable-kvm \
    -monitor stdio
2. Boot an incoming guest on the destination host and put it in incoming mode:
(qemu) migrate_incoming tcp:0:5801
3. Start the migration on the source host:
(qemu) migrate -d tcp:10.16.207.95:5801
4. Result:
The migration completed on the source end, but QEMU crashed on the destination end.
source:
(qemu) info status 
VM status: paused (postmigrate)
(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: off x-colo: off release-ram: off return-path: off pause-before-switchover: off x-multifd: off dirty-bitmaps: off late-block-activate: off 
Migration status: completed
total time: 4484 milliseconds
downtime: 13 milliseconds
setup: 3 milliseconds
transferred ram: 18745 kbytes
throughput: 34.45 mbps
remaining ram: 0 kbytes
total ram: 8519872 kbytes
duplicate: 2129962 pages
skipped: 0 pages
normal: 6 pages
normal bytes: 24 kbytes
dirty sync count: 3
page size: 4 kbytes

destination:
(qemu) qemu-kvm: Invalid value 233 expecting positive value <= 232
qemu-kvm: Failed to load cpu:cpreg_vmstate_array_len
qemu-kvm: error while loading state for instance 0x0 of device 'cpu'
qemu-kvm: load of migration failed: Invalid argument

Comment 24 Eric Auger 2021-01-12 10:52:14 UTC
Patches listed in comment 8 were submitted as part of
[RHEL8.4 virt 1907826 PATCH 00/16] KVM/ARM: v5.10/v5.11 fixes

Comment 25 Jan Stancek 2021-01-28 07:23:38 UTC
Patch(es) available on kernel-4.18.0-278.el8.dt3

Comment 26 Yiding Liu (Fujitsu) 2021-02-01 08:37:21 UTC
Verified by kselftests on aarch64:

kernel: 4.18.0-278.el8.dt3.aarch64
qemu-kvm: qemu-kvm-5.2.0-4.module+el8.4.0+9676+589043b9.src.rpm
testsuite commit: 585e5b17b92dead8a3aca4e3c9876fbca5f7e0ba

Test steps:
```
[root@fujitsu-fx700-01-n00 ~]# cd linux/tools/testing/selftests/kvm
[root@fujitsu-fx700-01-n00 kvm]# make && ./aarch64/get-reg-list
make --no-builtin-rules ARCH=arm64 -C ../../../.. headers_install
make[1]: Entering directory '/root/linux'
  INSTALL ./usr/include
[snip]
[root@fujitsu-fx700-01-n00 kvm]# echo $?
0
```

Comment 27 Qunfang Zhang 2021-02-02 02:04:10 UTC
Thanks Yiding and Zhijian for the efforts!

Comment 30 Yiding Liu (Fujitsu) 2021-02-18 03:26:24 UTC
Verified by kselftests on aarch64:

kernel: 4.18.0-283.el8.aarch64
qemu-kvm: qemu-kvm-5.2.0-5.module+el8.4.0+9775+0937c167.src.rpm
testsuite commit: f40ddce88593482919761f74910f42f4b84c004b

Test steps:
```
[root@hpe-apollo80-01-n00 ~]# cd linux/tools/testing/selftests/kvm/
[root@hpe-apollo80-01-n00 kvm]# make && ./aarch64/get-reg-list
make --no-builtin-rules ARCH=arm64 -C ../../../.. headers_install
make[1]: Entering directory '/root/linux'
[snip]
[root@hpe-apollo80-01-n00 kvm]# echo $?
0
```

Comment 31 Qunfang Zhang 2021-02-18 05:20:25 UTC
Thanks Yiding for the effort!

Comment 34 errata-xmlrpc 2021-05-18 14:16:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: kernel security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:1578