Description of problem:
qemu-kvm: error while loading state for instance 0x0 of device 'pl011' when migrating a VM from rhel8.4 to rhel8.3

Version-Release number of selected component (if applicable):
source host:
kernel: 4.18.0-240.el8.aarch64
qemu-kvm: qemu-kvm-5.1.0-14.module+el8.3.0+8790+80f9c6d8.1.aarch64.rpm
destination host:
kernel: 4.18.0-278.el8.dt4.aarch64
qemu-kvm: qemu-kvm-5.2.0-4.module+el8.4.0+9676+589043b9.aarch64

How reproducible:

Steps to Reproduce:
1. Start an incoming VM on the destination side
[root@hpe-apollo80-02-n01 ~]# /usr/libexec/qemu-kvm -name 'avocado-vt-vm1' -sandbox on -machine virt-rhel8.2.0,gic-version=host,graphics=on -nodefaults -m 1024 -smp 2 -cpu 'host' -vnc :10 -enable-kvm -monitor stdio -incoming tcp:0:8888
2. Start a VM on the source side and start the migration
[root@hpe-apollo80-01-n01 qemu]# /usr/libexec/qemu-kvm -name 'avocado-vt-vm1' -sandbox on -machine virt-rhel8.2.0,gic-version=host,graphics=on -nodefaults -m 1024 -smp 2 -cpu 'host' -vnc :10 -enable-kvm -monitor stdio
QEMU 5.2.0 monitor - type 'help' for more information
(qemu) migrate -d tcp:10.19.241.167:8888

Actual results:
Check the VM status on both source and destination
source:
(qemu) info status
VM status: paused (postmigrate)
destination:
[root@hpe-apollo80-02-n01 ~]# /usr/libexec/qemu-kvm -name 'avocado-vt-vm1' -sandbox on -machine virt-rhel8.2.0,gic-version=host,graphics=on -nodefaults -m 1024 -smp 2 -cpu 'host' -vnc :10 -enable-kvm -monitor stdio -incoming tcp:0:8888
QEMU 5.1.0 monitor - type 'help' for more information
(qemu) qemu-kvm: error while loading state for instance 0x0 of device 'pl011'
qemu-kvm: load of migration failed: No such file or directory

Expected results:
at least one VM is running

Additional info:
I reproduced this by virt-install'ing a virt-rhel8.2.0 machine type guest and using virsh for the migration between two Apollos, one with the latest 8.4 installed and the other with 8.3. Furthermore, I tried a ping-pong migration starting on the RHEL 8.3 host. The migration succeeded going from 8.3 to 8.4, but then failed when going back to the 8.3 host with the following logs

2021-02-02 09:51:22.755+0000: initiating migration
2021-02-02T09:51:24.063818Z qemu-kvm: Output state validation failed: cpu/m-security/secure MPU_RNR is valid
qemu-kvm: ../migration/vmstate.c:409: vmstate_save_state_v: Assertion `!(field->flags & VMS_MUST_EXIST)' failed.
2021-02-02 09:51:25.679+0000: shutting down, reason=crashed

The VM did indeed crash and the 'virsh migrate' command hung and required a 'kill -9' to destroy it (^C didn't work). Also libvirtd had to be restarted in order to release the state change lock which was still held by monitor=remoteDispatchDomainMigrateBegin3Params (according to the error received when trying another migration from 8.3 to 8.4 after the attempt timed out).

So it looks to me like we have a couple of QEMU migration bugs and at least one libvirt/virsh bug, as it's not handling QEMU's migration crash gracefully.

A virt-rhel8.3.0 machine type guest has the same problems.

Regarding the 'virsh migrate' hang, virsh didn't hang every time the guest crashed while migrating, but it did most of the time. If virsh didn't hang, requiring the 'kill -9', then libvirtd didn't need to be restarted.
I tried to reproduce the issue on Seattle, but hit another migration issue. For now, I can't find two spare Apollo machines in the machine pool (beaker) to reproduce the issue.

According to the error message (from comment#1), this looks like a compatibility issue: -ENOENT is returned from the call chain below. Also, the failing migration must be the one from RHEL8.4 to RHEL8.3, because a new sub-section was added for the PL011 device in RHEL8.4. This means the PL011 devices in RHEL8.{3,4} aren't migration compatible. However, migration from RHEL8.3 to RHEL8.4 should be fine because the newly added sub-section won't be loaded at all.

QEMU 5.1.0 monitor - type 'help' for more information
(qemu) qemu-kvm: error while loading state for instance 0x0 of device 'pl011'
qemu-kvm: load of migration failed: No such file or directory

__startcontext
coroutine_trampoline
process_incoming_migration_co
qemu_loadvm_state
qemu_loadvm_state_main
qemu_loadvm_section_start_full
vmstate_load
vmstate_load_state
vmstate_subsection_load
vmstate_get_subsection # "pl011/clock" sub-section not found and -ENOENT is returned
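The asymmetry described in this comment can be illustrated with a small model. This is only a simplified Python sketch, not QEMU code: `KNOWN_SUBSECTIONS`, `save_device_state`, `load_device_state` and `migrate` are hypothetical names, and the model assumes the destination rejects any sub-section name it does not know, which is what the vmstate_get_subsection step in the call chain suggests.

```python
import errno

# Hypothetical model: the set of pl011 sub-section names each build knows.
# Only "pl011/clock" is modeled; "old" and "new" are illustrative labels.
KNOWN_SUBSECTIONS = {
    "old": set(),              # a build without the pl011 clock sub-section
    "new": {"pl011/clock"},    # a build with the pl011 clock sub-section
}


def save_device_state(version):
    """Return the sub-sections the source would put into the stream."""
    return sorted(KNOWN_SUBSECTIONS[version])


def load_device_state(version, incoming_subsections):
    """Return 0 on success, -ENOENT if an unknown sub-section arrives."""
    known = KNOWN_SUBSECTIONS[version]
    for name in incoming_subsections:
        if name not in known:
            return -errno.ENOENT
    return 0


def migrate(src, dst):
    """Model one migration: save on src, load on dst."""
    return load_device_state(dst, save_device_state(src))


print("old -> new:", migrate("old", "new"))
print("new -> old:", migrate("new", "old"))
```

In this model the old-to-new direction loads cleanly because nothing unknown is sent, while new-to-old fails with -ENOENT, whose standard description on Linux is "No such file or directory".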
Hi Gavin,

One host is enough for this bug to be reproduced. I was able to reproduce with upstream QEMU builds (better to fix it there first) on a ThunderX with these commands

# On one terminal
$ cd /dir/with/build/for/qemu-4.1
qemu-4.1$ aarch64-softmmu/qemu-system-aarch64 -display none -M virt-4.1,accel=kvm,gic-version=host -cpu host -monitor stdio -incoming tcp:0:4444

# On another terminal
$ cd /dir/with/build/for/latest-qemu
latest-qemu$ ./qemu-system-aarch64 -display none -M virt-4.1,accel=kvm,gic-version=host -cpu host -monitor stdio -S
(qemu) migrate -d tcp:0:4444

The first terminal will show

qemu-system-aarch64: error while loading state for instance 0x0 of device 'pl011'
qemu-system-aarch64: load of migration failed: No such file or directory

Also, while fixing it upstream, it'd be nice to post some simple migration tests to QEMU that ensure we can ping-pong migrate all machine type versions we have (or at least a reasonable number of the most recent ones). To do this, we'll probably want to have a collection of pre-compiled QEMUs of each release version we care about in our CI. And, the tests should simply SKIP and suggest building the binaries when they're not present. That's because building the different versions as part of the test building and running would take too long.

Thanks,
drew
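The test idea in this comment could start from something like the Python sketch below. It only plans which migration directions to run and which versions to SKIP; it does not launch QEMU. The function name `plan_pingpong` and the binary paths are hypothetical and not part of any existing QEMU test suite.

```python
import os
from itertools import permutations


def plan_pingpong(binaries):
    """Plan migration runs between pre-built QEMU binaries.

    binaries maps a version label to the path of a pre-compiled binary.
    Returns (pairs, skipped): pairs holds every (source, destination)
    combination of the versions that are present, so both directions of
    a ping-pong are covered; skipped holds one SKIP message per missing
    binary.
    """
    present = {v: p for v, p in binaries.items() if os.access(p, os.X_OK)}
    skipped = [
        f"SKIP {v}: no executable at {p}, build it first"
        for v, p in binaries.items()
        if v not in present
    ]
    pairs = list(permutations(sorted(present), 2))
    return pairs, skipped


if __name__ == "__main__":
    # Hypothetical locations of pre-compiled binaries in a CI cache.
    demo = {
        "4.1": "/opt/qemu-builds/4.1/qemu-system-aarch64",
        "latest": "/opt/qemu-builds/latest/qemu-system-aarch64",
    }
    pairs, skipped = plan_pingpong(demo)
    for line in skipped:
        print(line)
    for src, dst in pairs:
        print(f"RUN migrate {src} -> {dst}")
```

Keeping the planning step separate from the step that launches QEMU means a missing binary produces a SKIP message instead of a build, which matches the concern about build time raised above.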
Hi Drew,

Thanks for the instructions. Yeah, migration from a version where the pl011 clock exists to a version where the clock doesn't exist should fail. I've successfully reproduced the issue on one machine. Also, the fix has been posted for review. Let's see what feedback I get. The fix is not to migrate the clock, but to report the baud rate change at post-load time.

By the way, the issue is caused by aac63e0e6ea3 ("hw/char/pl011: add a clock input"), which first appears in RHEL8.4.0.

Thanks,
Gavin
Upstream fix (v1) was to disable the clock migration for pl011 completely, but it was rejected by the community.
https://patchwork.kernel.org/project/qemu-devel/patch/20210317044441.112313-1-gshan@redhat.com/

Upstream fix (v2), following Drew's suggestion, has been posted for review. Let's see what feedback I get from the community.
https://patchwork.kernel.org/project/qemu-devel/patch/20210318023801.18287-1-gshan@redhat.com/
This is fixed by the following commit, merged to qemu-6.2.0: e6fa978d8343 ("hw/arm/virt: Disable pl011 clock migration if needed")
Correcting the comments for comment#6: It was merged to qemu-6.0 instead of qemu-6.2.0. I'm including Mirek so that the fix can be picked for qemu on RHEL8.5 since it's uncertain if qemu-RHEL8.4 needs this. e6fa978d8343 ("hw/arm/virt: Disable pl011 clock migration if needed")
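As a rough illustration of what "disable pl011 clock migration if needed" means, here is a simplified Python model. It assumes the sub-section is written to the stream only when a per-device flag allows it and that older machine types clear that flag. The names `migrate_clk`, `MACHINE_MIGRATES_CLOCK` and the machine-type labels are illustrative assumptions, not taken from the commit.

```python
def subsections_to_send(migrate_clk):
    """Return the sub-sections whose 'needed' check passes.

    Each entry pairs a sub-section name with a callable that decides
    whether the sub-section is written to the migration stream.
    """
    subsections = [("pl011/clock", lambda: migrate_clk)]
    return [name for name, needed in subsections if needed()]


# Assumption for illustration: older machine types switch the flag off so
# that their stream stays loadable by builds that predate the clock.
MACHINE_MIGRATES_CLOCK = {
    "older-machine-type": False,
    "newer-machine-type": True,
}

for machine, flag in MACHINE_MIGRATES_CLOCK.items():
    print(machine, subsections_to_send(flag))
```

With the flag off, the stream carries no "pl011/clock" sub-section, so a destination that does not know it has nothing to reject.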
Migrate from rhel8.5(qemu-kvm-6.0.0-16.module+el8.5.0+10848+2dccc46d.aarch64) to rhel8.4(qemu-kvm-5.2.0-16.module+el8.4.0+10806+b7d97207.aarch64)
1. Between servers with the same CPU, migration succeeds.
2. Between servers with different CPU flags, the error message "qemu-kvm: error while loading state for instance 0x0 of device 'cpu'" is output.

Migration between servers with different CPU flags:
src:
[root@fujitsu-fx700-01-n00 auto_test_tool]# /usr/libexec/qemu-kvm -name 'avocado-vt-vm1' -sandbox on -machine virt-rhel8.2.0,gic-version=host,graphics=on -nodefaults -m 1024 -smp 2 -cpu 'host' -vnc :10 -enable-kvm -monitor stdio
QEMU 6.0.0 monitor - type 'help' for more information
(qemu) migrate -d tcp:10.19.241.163:8888
(qemu) info status
VM status: paused (postmigrate)

des:
[root@hpe-apollo80-01-n00 ~]# /usr/libexec/qemu-kvm -name 'avocado-vt-vm1' -sandbox on -machine virt-rhel8.2.0,gic-version=host,graphics=on -nodefaults -m 1024 -smp 2 -cpu 'host' -vnc :10 -enable-kvm -monitor stdio -incoming tcp:0:8888
QEMU 5.2.0 monitor - type 'help' for more information
(qemu) qemu-kvm: error while loading state for instance 0x0 of device 'cpu'
qemu-kvm: load of migration failed: Operation not permitted
(In reply to Zhijian Li (Fujitsu) from comment #12)
> Migrate from
> rhel8.5(qemu-kvm-6.0.0-16.module+el8.5.0+10848+2dccc46d.aarch64) to
> rhel8.4(qemu-kvm-5.2.0-16.module+el8.4.0+10806+b7d97207.aarch64)
> 1. Between same cpu server, migration succeed
> 2. Between server with different cpu flags. error message "qemu-kvm: error
> while loading state for instance 0x0 of device 'cpu'" output.
>
>
> Migration between different cpu flags server:
> src:
> [root@fujitsu-fx700-01-n00 auto_test_tool]# /usr/libexec/qemu-kvm -name
> 'avocado-vt-vm1' -sandbox on -machine
> virt-rhel8.2.0,gic-version=host,graphics=on -nodefaults -m 1024
> -smp 2 -cpu 'host' -vnc :10 -enable-kvm -monitor stdio
> QEMU 6.0.0 monitor - type 'help' for more information
> (qemu) migrate -d tcp:10.19.241.163:8888
> (qemu) info status
> VM status: paused (postmigrate)
>
> des:
> [root@hpe-apollo80-01-n00 ~]# /usr/libexec/qemu-kvm -name
> 'avocado-vt-vm1' -sandbox on -machine
> virt-rhel8.2.0,gic-version=host,graphics=on -nodefaults -m 1024
> -smp 2 -cpu 'host' -vnc :10 -enable-kvm -monitor stdio
> -incoming tcp:0:8888
> QEMU 5.2.0 monitor - type 'help' for more information
> (qemu) qemu-kvm: error while loading state for instance 0x0 of device 'cpu'
> qemu-kvm: load of migration failed: Operation not permitted

cpu flags:
fujitsu-fx700-01-n00:
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm fcma dcpop sve
hpe-apollo80-01-n00:
Flags: fp asimd evtstrm sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm fcma dcpop sve
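The two flag lists above are long enough that the difference is easy to miss by eye. A short Python sketch that diffs them follows; the helper name `flag_diff` is hypothetical, and the flag strings are copied from this comment. Whether the differing flags are what makes the 'cpu' state fail to load is not established in this thread.

```python
# Flag lists copied from the comment above.
fx700 = ("fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp "
         "cpuid asimdrdm fcma dcpop sve").split()
apollo80 = ("fp asimd evtstrm sha1 sha2 crc32 atomics fphp asimdhp "
            "cpuid asimdrdm fcma dcpop sve").split()


def flag_diff(src, dst):
    """Return (flags only on src, flags only on dst), each sorted."""
    return sorted(set(src) - set(dst)), sorted(set(dst) - set(src))


only_src, only_dst = flag_diff(fx700, apollo80)
print("only on fujitsu-fx700-01-n00:", only_src)
print("only on hpe-apollo80-01-n00:", only_dst)
```

For these two lists the source host reports aes and pmull, which the destination host does not, and the destination reports nothing that the source lacks.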
Zhijian,

For test case#2, I think the CPUs on these two machines aren't exactly the same. Migration between CPUs that aren't exactly the same is expected to fail. By the way, that issue isn't what this bugzilla is tracking. Please create another bugzilla to track the failure from test case#2 and assign it to me, so that I can investigate. Besides, it'd be better to try the same host kernel and qemu on both hosts to see if migration between these two different types of machines can succeed.
It's verified that the initial problem "qemu-kvm: error while loading state for instance 0x0 of device 'pl011'" has been fixed.
Filed a new bug, https://bugzilla.redhat.com/show_bug.cgi?id=1961519, to track the new issue.
Setting to VERIFIED since we have new bug 1961519 (low priority) to track the new issue.
(In reply to Zhijian Li (Fujitsu) from comment #15)
> It's verified that the initial problem "qemu-kvm: error while loading state
> for instance 0x0 of device 'pl011'" has been fixed.

How was this verified? With upstream QEMU builds? No RHEL version has yet merged the fix. It has only just been posted under bug 1957667, on which this bug should depend.
Yes the issue will be fixed with bug 1957667. No additional work needed.
OK, I'm marking this as TestOnly.
qemu-kvm version:
rhel8.5(qemu-kvm-6.0.0-16.module+el8.5.0+10848+2dccc46d.aarch64)
rhel8.3(qemu-kvm-5.1.0-14.module+el8.3.0+8790+80f9c6d8.1.aarch64)

machine type: virt-rhel8.2.0 and virt-rhel8.3.0

result:
1. Migration from rhel8.5 to rhel8.3 still fails
2. Migration from rhel8.3 to rhel8.5 succeeds

From rhel8.5 to rhel8.3
src:
[root@hpe-apollo80-01-n01 ~]# /usr/libexec/qemu-kvm -name 'avocado-vt-vm1' -sandbox on -machine virt-rhel8.2.0,gic-version=host,graphics=on -nodefaults -m 1024 -smp 2 -cpu 'host' -vnc :10 -enable-kvm -monitor stdio
QEMU 6.0.0 monitor - type 'help' for more information
(qemu) migrate -d tcp:10.19.241.163:8888
(qemu) info status
VM status: paused (postmigrate)

des:
[root@hpe-apollo80-01-n00 auto_test_tool]# /usr/libexec/qemu-kvm -name 'avocado-vt-vm1' -sandbox on -machine virt-rhel8.2.0,gic-version=host,graphics=on -nodefaults -m 1024 -smp 2 -cpu 'host' -vnc :10 -enable-kvm -monitor stdio -incoming tcp:0:8888
QEMU 5.1.0 monitor - type 'help' for more information
(qemu) qemu-kvm: error while loading state for instance 0x0 of device 'pl011'
qemu-kvm: load of migration failed: No such file or directory

From rhel8.3 to rhel8.5:
src:
[root@hpe-apollo80-01-n00 auto_test_tool]# /usr/libexec/qemu-kvm -name 'avocado-vt-vm1' -sandbox on -machine virt-rhel8.2.0,gic-version=host,graphics=on -nodefaults -m 1024 -smp 2 -cpu 'host' -vnc :10 -enable-kvm -monitor stdio
QEMU 5.1.0 monitor - type 'help' for more information
(qemu) migrate -d tcp:10.19.241.164:8888
(qemu) info status
VM status: paused (postmigrate)

des:
[root@hpe-apollo80-01-n01 ~]# /usr/libexec/qemu-kvm -name 'avocado-vt-vm1' -sandbox on -machine virt-rhel8.2.0,gic-version=host,graphics=on -nodefaults -m 1024 -smp 2 -cpu 'host' -vnc :10 -enable-kvm -monitor stdio -incoming tcp:0:8888
QEMU 6.0.0 monitor - type 'help' for more information
(qemu) info status
VM status: running
Xiaohui, the upstream commit won't take effect until the dependency (bug 1957667) is closed. Please hold the verification until the code changes for bug 1957667 are merged.
qemu-kvm version:
qemu-kvm-6.0.0-18.module+el8.5.0+11243+5269aaa1.aarch64
qemu-kvm-5.1.0-21.module+el8.3.1+10464+8ad18d1a.aarch64

result: Migration succeeds in both directions between these two versions

From 6.0.0-18 to 5.1.0-21
src:
# On one terminal
$ cd /dir/with/build/for/qemu-kvm-6.0.0-18
qemu-kvm-6.0.0-18$ ./qemu-kvm -name 'avocado-vt-vm1' -sandbox on -machine virt-rhel8.2.0,gic-version=host,graphics=on -nodefaults -m 1024 -smp 2 -cpu 'host' -vnc :9 -enable-kvm -monitor stdio
QEMU 6.0.0 monitor - type 'help' for more information
(qemu) migrate -d tcp:10.16.207.95:8888
(qemu) info status
VM status: paused (postmigrate)

des:
# On another terminal
$ cd /dir/with/build/for/qemu-kvm-5.1.0-21
qemu-kvm-5.1.0-21$ ./qemu-kvm -name 'avocado-vt-vm1' -sandbox on -machine virt-rhel8.2.0,gic-version=host,graphics=on -nodefaults -m 1024 -smp 2 -cpu 'host' -vnc :10 -enable-kvm -monitor stdio -incoming tcp:0:8888
QEMU 5.1.0 monitor - type 'help' for more information
(qemu) info status
VM status: running

From 5.1.0-21 to 6.0.0-18
src:
# On one terminal
$ cd /dir/with/build/for/qemu-kvm-5.1.0-21
qemu-kvm-5.1.0-21$ ./qemu-kvm -name 'avocado-vt-vm1' -sandbox on -machine virt-rhel8.2.0,gic-version=host,graphics=on -nodefaults -m 1024 -smp 2 -cpu 'host' -vnc :10 -enable-kvm -monitor stdio
QEMU 5.1.0 monitor - type 'help' for more information
(qemu) migrate -d tcp:10.16.207.95:8888
(qemu) info status
VM status: paused (postmigrate)

des:
# On another terminal
$ cd /dir/with/build/for/qemu-kvm-6.0.0-18
qemu-kvm-6.0.0-18$ ./qemu-kvm -name 'avocado-vt-vm1' -sandbox on -machine virt-rhel8.2.0,gic-version=host,graphics=on -nodefaults -m 1024 -smp 2 -cpu 'host' -vnc :9 -enable-kvm -monitor stdio -incoming tcp:0:8888
QEMU 6.0.0 monitor - type 'help' for more information
(qemu) info status
VM status: running
Setting to VERIFIED after confirming with Xinjian.
RHEL9 only supports virt-rhel8.5.0

qemu-kvm version:
qemu-kvm-6.0.0-12.el9
qemu-kvm-6.0.0-29.module+el8.5.0+12386+43574bac

machine type: virt-rhel8.5.0

result: Migration succeeds in both directions between these two versions

From qemu-kvm-6.0.0-12.el9 to qemu-kvm-6.0.0-29.module+el8.5.0+12386+43574bac
src:
# On one terminal
$ cd /dir/with/build/for/qemu-kvm-6.0.0-12.el9
qemu-kvm-6.0.0-12.el9$ ./qemu-system-aarch64 -name 'avocado-vt-vm1' -sandbox on -machine virt-rhel8.5.0,gic-version=host,graphics=on -nodefaults -m 1024 -smp 2 -cpu 'host' -vnc :10 -enable-kvm -monitor stdio
QEMU 6.0.0 monitor - type 'help' for more information
(qemu) migrate -d tcp:10.16.207.95:8888
(qemu) info status
VM status: paused (postmigrate)

des:
# On another terminal
$ cd /dir/with/build/for/qemu-kvm-6.0.0-29.module+el8.5.0+12386+43574bac
qemu-kvm-6.0.0-29.module+el8.5.0+12386+43574bac$ ./qemu-system-aarch64 -name 'avocado-vt-vm1' -sandbox on -machine virt-rhel8.5.0,gic-version=host,graphics=on -nodefaults -m 1024 -smp 2 -cpu 'host' -vnc :9 -enable-kvm -monitor stdio -incoming tcp:0:8888
QEMU 6.0.0 monitor - type 'help' for more information
(qemu) info status
VM status: running

From qemu-kvm-6.0.0-29.module+el8.5.0+12386+43574bac to qemu-kvm-6.0.0-12.el9
src:
# On one terminal
$ cd /dir/with/build/for/qemu-kvm-6.0.0-29.module+el8.5.0+12386+43574bac
qemu-kvm-6.0.0-29.module+el8.5.0+12386+43574bac$ ./qemu-system-aarch64 -name 'avocado-vt-vm1' -sandbox on -machine virt-rhel8.5.0,gic-version=host,graphics=on -nodefaults -m 1024 -smp 2 -cpu 'host' -vnc :9 -enable-kvm -monitor stdio
QEMU 6.0.0 monitor - type 'help' for more information
(qemu) migrate -d tcp:10.16.207.95:8888
(qemu) info status
VM status: paused (postmigrate)

des:
# On another terminal
$ cd /dir/with/build/for/qemu-kvm-6.0.0-12.el9
qemu-kvm-6.0.0-12.el9$ ./qemu-system-aarch64 -name 'avocado-vt-vm1' -sandbox on -machine virt-rhel8.5.0,gic-version=host,graphics=on -nodefaults -m 1024 -smp 2 -cpu 'host' -vnc :10 -enable-kvm -monitor stdio -incoming tcp:0:8888
QEMU 6.0.0 monitor - type 'help' for more information
(qemu) info status
VM status: running
Sorry, please disregard my last comment. Since RHEL9 only supports machine type rhel8.5, this scenario cannot be reproduced on RHEL9.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (virt:av bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:4684