Description of problem:
Stable Guest ABI failed between rhel 9.0 and rhel 9.2

Version-Release number of selected component (if applicable):
SRC:
kernel-5.14.0-70.49.1.el9_0.aarch64
qemu-kvm-6.2.0-11.el9_0.7.aarch64
DST:
kernel-5.14.0-289.el9.aarch64
qemu-kvm-7.2.0-14.el9_2.aarch64

hostname:
ampere-hr330a-13.khw4.lab.eng.bos.redhat.com
ampere-hr330a-14.khw4.lab.eng.bos.redhat.com

How reproducible:
5/5

Steps to Reproduce:
1. Boot up a guest on the rhel9.0 host
/usr/libexec/qemu-kvm \
-name 'avocado-vt-vm1' \
-sandbox on \
-blockdev node-name=file_aavmf_code,driver=file,filename=/usr/share/edk2/aarch64/QEMU_EFI-silent-pflash.raw,auto-read-only=on,discard=unmap \
-blockdev node-name=drive_aavmf_code,driver=raw,read-only=on,file=file_aavmf_code \
-blockdev node-name=file_aavmf_vars,driver=file,filename=avocado-vt-vm1_rhel900-aarch64-virtio-scsi_qcow2_filesystem_VARS.fd,auto-read-only=on,discard=unmap \
-blockdev node-name=drive_aavmf_vars,driver=raw,read-only=off,file=file_aavmf_vars \
-machine virt-rhel9.0.0,gic-version=host,memory-backend=mem-machine_mem,pflash0=drive_aavmf_code,pflash1=drive_aavmf_vars \
-device pcie-root-port,id=pcie-root-port-0,multifunction=on,bus=pcie.0,addr=0x1,chassis=1 \
-device pcie-pci-bridge,id=pcie-pci-bridge-0,addr=0x0,bus=pcie-root-port-0 \
-nodefaults \
-device pcie-root-port,id=pcie-root-port-1,port=0x1,addr=0x1.0x1,bus=pcie.0,chassis=2 \
-device virtio-gpu-pci,bus=pcie-root-port-1,addr=0x0 \
-m 8192 \
-object '{"size": 8589934592, "id": "mem-machine_mem", "qom-type": "memory-backend-ram"}' \
-smp 8,maxcpus=8,cores=4,threads=1,sockets=2 \
-cpu 'host' \
-chardev socket,server=on,wait=off,path=/tmp/t11,id=qmp_id_qmpmonitor1 \
-mon chardev=qmp_id_qmpmonitor1,mode=control \
-chardev socket,server=on,wait=off,path=/tmp/t11,id=qmp_id_catch_monitor \
-mon chardev=qmp_id_catch_monitor,mode=control \
-serial unix:'/tmp/serialaarch64',server=on,wait=off \
-device pcie-root-port,id=pcie-root-port-2,port=0x2,addr=0x1.0x2,bus=pcie.0,chassis=3 \
-device qemu-xhci,id=usb1,bus=pcie-root-port-2,addr=0x0 \
-device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
-device pcie-root-port,id=pcie-root-port-3,port=0x3,addr=0x1.0x3,bus=pcie.0,chassis=4 \
-device '{"id": "virtio_scsi_pci0", "driver": "virtio-scsi-pci", "bus": "pcie-root-port-3", "addr": "0x0"}' \
-blockdev '{"node-name": "file_image1", "driver": "file", "auto-read-only": true, "discard": "unmap", "aio": "threads", "filename": "rhel900-aarch64-virtio-scsi.qcow2", "cache": {"direct": true, "no-flush": false}}' \
-blockdev '{"node-name": "drive_image1", "driver": "qcow2", "read-only": false, "cache": {"direct": true, "no-flush": false}, "file": "file_image1"}' \
-device '{"driver": "scsi-hd", "id": "image1", "drive": "drive_image1", "write-cache": "on"}' \
-device pcie-root-port,id=pcie-root-port-4,port=0x4,addr=0x1.0x4,bus=pcie.0,chassis=5 \
-device virtio-net-pci,mac=9a:0a:71:f3:69:7d,rombar=0,id=net0,netdev=tap0,bus=pcie-root-port-4,addr=0x0 \
-netdev tap,id=tap0,vhost=on \
-vnc :0 \
-rtc base=utc,clock=host,driftfix=slew \
-no-shutdown \
-enable-kvm \
-device pcie-root-port,id=pcie_extra_root_port_0,multifunction=on,bus=pcie.0,addr=0x2,chassis=6 \
-device pcie-root-port,id=pcie_extra_root_port_1,addr=0x2.0x1,bus=pcie.0,chassis=7 \
-monitor stdio

On the destination host: same command line as on the source side + -incoming defer

2. Do migration src -> dst - test passed
3. Boot the vm with -incoming defer again on src and migrate the guest from rhel9.2 to rhel9.0 (dst -> src); migration failed with the following error

Actual results:
[root@ampere-hr330a-13 rhel900]# sh boot_incoming.sh
QEMU 6.2.0 monitor - type 'help' for more information
(qemu) migrate_incoming tcp:[::]:4000
(qemu)
(qemu)
(qemu) qemu-kvm: Invalid value 241 expecting positive value <= 237
qemu-kvm: Failed to load cpu:cpreg_vmstate_array_len
qemu-kvm: error while loading state for instance 0x0 of device 'cpu'
qemu-kvm: load of migration failed: Invalid argument

Expected results:
No error

Additional info:
Are the host cpu flags matching 100% between the two hosts? I.e. does `cat /proc/cpuinfo` yield the same flags on both machines?
I compared the two hosts and there is no difference between them. Please also refer to the attachments named src.out and dst.output; if there are any issues, please let me know. Thank you!

ampere-hr330a-13.khw4.lab.eng.bos.redhat.com
ampere-hr330a-14.khw4.lab.eng.bos.redhat.com
edk2 info:
SRC: edk2-aarch64-20220126gitbb1bba3d77-3.el9_0.1.noarch
DST: edk2-aarch64-20230301gitf80f052277c8-1.el9.noarch
Thanks for the output, the cpus on the two machines do indeed match, but the kernel versions obviously do not, and I think that's the source of the problem. What the code is complaining about is that the number of registers available via the ONE_REG interface decreases when migrating from the 9.2 host to the 9.0 host. This is not surprising, as new kernel versions may expose more registers (and the code is fine with migrating to a system where that number actually increased.) The unfortunate situation is that the number of registers is directly taken from whatever kernel version the host is running with, and not covered by any compatibility handling at all (neither upstream nor downstream). I.e. depending on the kernel versions used in different releases, there's a good chance that backwards migration will break. What we would need is kind of an extended CPU model that not only covers whatever the host exposes, but also whatever KVM exposes, so that it can filter some features. Unfortunately, we do not even have more basic CPU models yet that could insulate us from small changes in the host feature set... I do not see any quick way to fix this -- unless someone else has a good idea?
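To illustrate the check described above, here is a minimal sketch in Python of the load-side length comparison. It is a simplified model and not QEMU's actual code: the names `check_cpreg_array_len` and `MigrationLoadError` are made up for illustration, and the counts 237 (9.0 host) and 241 (9.2 host) are taken from the error message in the description.

```python
class MigrationLoadError(Exception):
    """Raised when incoming state cannot be accepted by the destination."""


def check_cpreg_array_len(incoming_len: int, local_len: int) -> None:
    """Accept the incoming register count only if it is positive and does
    not exceed the number of registers the destination itself exposes."""
    if not (0 < incoming_len <= local_len):
        raise MigrationLoadError(
            f"Invalid value {incoming_len} expecting positive value <= {local_len}"
        )


# 9.0 -> 9.2: the destination exposes more registers, so the load is accepted.
check_cpreg_array_len(incoming_len=237, local_len=241)

# 9.2 -> 9.0: the destination exposes fewer registers, so the load is rejected.
try:
    check_cpreg_array_len(incoming_len=241, local_len=237)
except MigrationLoadError as err:
    print(f"qemu-kvm: {err}")
```

The asymmetry is the point: a destination with a larger register list passes the check, while a destination with a smaller one cannot, which matches forward migration passing and backward migration failing.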
Hm, the only thing that comes to mind is to let migration tooling warn about or prevent a version mismatch (maybe just for downgrades). The same should be true for a host kernel downgrade without migration where the VM's state is preserved (e.g. VM in suspend, or host kernel downgrade via kexec).
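A rough sketch of what such a downgrade warning could look like, in Python. This is only an assumption about how tooling might approach it, not an existing feature: `kernel_key` and `is_kernel_downgrade` are hypothetical helpers, and comparing the numeric fields of `uname -r` before the `.el` suffix is a heuristic rather than a proper RPM version comparison. The kernel version is also only a proxy for the set of registers KVM exposes.

```python
import re
from typing import Tuple


def kernel_key(release: str) -> Tuple[int, ...]:
    """Turn e.g. '5.14.0-289.el9.aarch64' into (5, 14, 0, 289).

    Heuristic: only the numeric fields before the '.el' dist tag are used.
    """
    numeric_part = release.split(".el", 1)[0]
    return tuple(int(n) for n in re.findall(r"\d+", numeric_part))


def is_kernel_downgrade(src_release: str, dst_release: str) -> bool:
    """Return True if the destination host kernel is older than the source's."""
    return kernel_key(dst_release) < kernel_key(src_release)


# Backward migration from the 9.2 host to the 9.0 host in this report.
if is_kernel_downgrade("5.14.0-289.el9.aarch64", "5.14.0-70.49.1.el9_0.aarch64"):
    print("warning: destination kernel is older than source; migration may fail")
```

The same comparison could be applied before resuming a suspended VM after a host kernel downgrade.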
The issue also can be reproduced between RHEL9.1 to RHEL9.2

qemu-kvm-7.0.0-13.el9_1.1.aarch64
5.14.0-162.23.1.el9_1.aarch64
edk2-aarch64-20220526git16779ede2d36-3.el9.noarch

qemu-kvm-7.2.0-14.el9_2.aarch64
5.14.0-162.23.1.el9_1.aarch64
edk2-aarch64-20220526git16779ede2d36-3.el9.noarch
(In reply to Min Deng from comment #8)
> The issue also can be reproduced between RHEL9.1 to RHEL9.2
> qemu-kvm-7.0.0-13.el9_1.1.aarch64
> 5.14.0-162.23.1.el9_1.aarch64
> edk2-aarch64-20220526git16779ede2d36-3.el9.noarch
> qemu-kvm-7.2.0-14.el9_2.aarch64
> 5.14.0-162.23.1.el9_1.aarch64
> edk2-aarch64-20220526git16779ede2d36-3.el9.noarch

Does that fail with the same error? (I'm surprised that the kernel version seems to match exactly, shouldn't it be an el9_2 version on the second host?)
> Does that fail with the same error? (I'm surprised that the kernel version
> seems to match exactly, shouldn't it be an el9_2 version on the second host?)

Hi Cornelia,
I have pasted the info for these two hosts below, thank you!

ampere-hr350a-08.khw4.lab.eng.bos.redhat.com
ampere-hr350a-09.khw4.lab.eng.bos.redhat.com

SRC:
[root@ampere-hr350a-08 ~]# uname -r
5.14.0-162.23.1.el9_1.aarch64
[root@ampere-hr350a-08 ~]# rpm -qa|grep qemu-kvm
qemu-kvm-common-7.0.0-13.el9_1.1.aarch64
qemu-kvm-audio-pa-7.0.0-13.el9_1.1.aarch64
qemu-kvm-device-display-virtio-gpu-7.0.0-13.el9_1.1.aarch64
qemu-kvm-device-display-virtio-gpu-gl-7.0.0-13.el9_1.1.aarch64
qemu-kvm-device-display-virtio-gpu-pci-7.0.0-13.el9_1.1.aarch64
qemu-kvm-device-display-virtio-gpu-pci-gl-7.0.0-13.el9_1.1.aarch64
qemu-kvm-device-usb-host-7.0.0-13.el9_1.1.aarch64
qemu-kvm-tools-7.0.0-13.el9_1.1.aarch64
qemu-kvm-docs-7.0.0-13.el9_1.1.aarch64
qemu-kvm-core-7.0.0-13.el9_1.1.aarch64
qemu-kvm-block-rbd-7.0.0-13.el9_1.1.aarch64
qemu-kvm-7.0.0-13.el9_1.1.aarch64
qemu-kvm-tests-7.0.0-13.el9_1.1.aarch64
qemu-kvm-block-curl-7.0.0-13.el9_1.1.aarch64
edk2-aarch64-20220526git16779ede2d36-3.el9.noarch

DST:
[root@ampere-hr350a-09 ~]# uname -r
5.14.0-284.8.1.el9_2.aarch64
[root@ampere-hr350a-09 ~]# rpm -qa|grep qemu-kvm
qemu-kvm-common-7.2.0-14.el9_2.aarch64
qemu-kvm-device-display-virtio-gpu-7.2.0-14.el9_2.aarch64
qemu-kvm-device-display-virtio-gpu-pci-7.2.0-14.el9_2.aarch64
qemu-kvm-audio-pa-7.2.0-14.el9_2.aarch64
qemu-kvm-device-usb-host-7.2.0-14.el9_2.aarch64
qemu-kvm-tools-7.2.0-14.el9_2.aarch64
qemu-kvm-docs-7.2.0-14.el9_2.aarch64
qemu-kvm-core-7.2.0-14.el9_2.aarch64
qemu-kvm-block-rbd-7.2.0-14.el9_2.aarch64
qemu-kvm-7.2.0-14.el9_2.aarch64
qemu-kvm-tests-7.2.0-14.el9_2.aarch64
qemu-kvm-block-curl-7.2.0-14.el9_2.aarch64
edk2-aarch64-20221207gitfff6d81270b5-9.el9_2.noarch
Thanks; with the two different kernel versions, this looks like the same problem as for 9.0 <-> 9.2. Unfortunately, I think this needs to be addressed in the general context of CPU models, which means it will take some time -- we first need to build some kind of consensus upstream, and I doubt there will be major progress before (northern hemisphere) summer...
So, this is why we decided to close this as CANTFIX: We currently see a breakage when migrating from 9.2 to anything older. This is caused by a change in the KVM kernel code, which triggers a visible change in the guest ABI exposed by QEMU. To fix this, we need a bigger change in QEMU, which needs to be done in the context of Arm CPU models. This is a non-trivial development item which will need some time to complete. It won't make sense to try to backport any solution we come up with to 9.2 and older versions. Migrations from older versions to newer versions are unaffected.