Bug 1923881 - qemu-kvm: migration failure from rhel8.4 to rhel8.3 because of pl011
Summary: qemu-kvm: migration failure from rhel8.4 to rhel8.3 because of pl011
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: qemu-kvm
Version: 8.4
Hardware: aarch64
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Target Release: 8.5
Assignee: Guowen Shan
QA Contact: Xinjian Ma(Fujitsu)
URL:
Whiteboard:
Depends On: 1957667
Blocks: 1875540 1885765 1957194
 
Reported: 2021-02-02 06:41 UTC by Zhijian Li (Fujitsu)
Modified: 2021-11-16 08:10 UTC
CC List: 13 users

Fixed In Version: qemu-kvm-6.0.0-18.module+el8.5.0+11243+5269aaa1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-11-16 07:51:32 UTC
Type: Bug
Target Upstream Version:
Embargoed:




Links:
Red Hat Product Errata RHBA-2021:4684 (Last Updated: 2021-11-16 07:51:58 UTC)

Description Zhijian Li (Fujitsu) 2021-02-02 06:41:08 UTC
Description of problem:
qemu-kvm: error while loading state for instance 0x0 of device 'pl011' when migrating a VM from rhel8.4 to rhel8.3

Version-Release number of selected component (if applicable):
source host:
kernel: 4.18.0-278.el8.dt4.aarch64
qemu-kvm: qemu-kvm-5.2.0-4.module+el8.4.0+9676+589043b9.aarch64

destination host:
kernel: 4.18.0-240.el8.aarch64
qemu-kvm: qemu-kvm-5.1.0-14.module+el8.3.0+8790+80f9c6d8.1.aarch64.rpm

How reproducible:


Steps to Reproduce:
1. Start an incoming VM on the destination side:
[root@hpe-apollo80-02-n01 ~]# /usr/libexec/qemu-kvm     -name 'avocado-vt-vm1'      -sandbox on      -machine virt-rhel8.2.0,gic-version=host,graphics=on     -nodefaults     -m 1024      -smp 2      -cpu 'host'     -vnc :10      -enable-kvm     -monitor stdio -incoming tcp:0:8888

2. Start a VM on the source side and start the migration:
[root@hpe-apollo80-01-n01 qemu]# /usr/libexec/qemu-kvm     -name 'avocado-vt-vm1'      -sandbox on      -machine virt-rhel8.2.0,gic-version=host,graphics=on     -nodefaults     -m 1024      -smp 2      -cpu 'host'     -vnc :10      -enable-kvm     -monitor stdio 
QEMU 5.2.0 monitor - type 'help' for more information
(qemu) migrate -d tcp:10.19.241.167:8888


Actual results:

Check the VM status on both the source and the destination:
source:
(qemu) info status
VM status: paused (postmigrate)

destination:
[root@hpe-apollo80-02-n01 ~]# /usr/libexec/qemu-kvm     -name 'avocado-vt-vm1'      -sandbox on      -machine virt-rhel8.2.0,gic-version=host,graphics=on     -nodefaults     -m 1024      -smp 2      -cpu 'host'     -vnc :10      -enable-kvm     -monitor stdio -incoming tcp:0:8888
QEMU 5.1.0 monitor - type 'help' for more information
(qemu) qemu-kvm: error while loading state for instance 0x0 of device 'pl011'
qemu-kvm: load of migration failed: No such file or directory

Expected results:
at least one VM is running

Additional info:

Comment 1 Andrew Jones 2021-02-02 10:31:11 UTC
I reproduced this by virt-install'ing a virt-rhel8.2.0 machine type guest and using virsh for the migration between two Apollos, one with the latest 8.4 installed and the other with 8.3. Furthermore, I tried a ping-pong migration starting on the RHEL 8.3 host. The migration succeeded going from 8.3 to 8.4, but then failed when going back to the 8.3 host with the following logs:

2021-02-02 09:51:22.755+0000: initiating migration
2021-02-02T09:51:24.063818Z qemu-kvm: Output state validation failed: cpu/m-security/secure MPU_RNR is valid
qemu-kvm: ../migration/vmstate.c:409: vmstate_save_state_v: Assertion `!(field->flags & VMS_MUST_EXIST)' failed.
2021-02-02 09:51:25.679+0000: shutting down, reason=crashed

The VM did indeed crash and the 'virsh migrate' command hung and required a 'kill -9' to destroy it (^C didn't work). Also libvirtd had to be restarted in order to release the state change lock which was still held by monitor=remoteDispatchDomainMigrateBegin3Params (according to the error received when trying another migration from 8.3 to 8.4 after the attempt timed out).

So it looks to me like we have a couple of QEMU migration bugs and at least one libvirt/virsh bug, as it's not handling QEMU's migration crash gracefully.

A virt-rhel8.3.0 machine type guest has the same problems.

Regarding the 'virsh migrate' hang, virsh didn't hang every time the guest crashed while migrating, but it did most of the time. If virsh didn't hang and require the 'kill -9', then libvirtd didn't need to be restarted.

Comment 2 Guowen Shan 2021-03-16 12:59:33 UTC
I tried to reproduce the issue on Seattle, but hit another migration issue.
For now, I can't find two spare Apollo machines in the machine pool (beaker)
to reproduce with.

According to the error message raised in comment#1, it looks like a
compatibility issue, and -ENOENT is returned from the following function
calls. Also, the failing direction should be from RHEL8.4 to RHEL8.3,
because one new sub-section was added to the PL011 device in RHEL8.4. That
means the device state between RHEL8.3 and RHEL8.4 isn't compatible for
migration. However, migration from RHEL8.3 to RHEL8.4 should be fine because
the newly added sub-section won't be present in the stream and so won't be
loaded at all.

   QEMU 5.1.0 monitor - type 'help' for more information
   (qemu) qemu-kvm: error while loading state for instance 0x0 of device 'pl011'
   qemu-kvm: load of migration failed: No such file or directory

   __startcontext
   coroutine_trampoline
   process_incoming_migration_co
   qemu_loadvm_state
   qemu_loadvm_state_main
   qemu_loadvm_section_start_full
   vmstate_load
   vmstate_load_state
   vmstate_subsection_load
   vmstate_get_subsection        # "pl011/clock" sub-section not found and -ENOENT is returned
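
For reference, the new sub-section is wired up roughly like this (a minimal
sketch based on QEMU's VMState API and the "hw/char/pl011: add a clock input"
change; the exact field names in the real pl011 code may differ):

    /* Sketch: the pl011 clock state travels in a vmstate sub-section.
     * A sub-section is only written to the migration stream when its
     * .needed() callback returns true.  An older QEMU that doesn't know
     * the "pl011/clock" name can't match the incoming sub-section, so
     * vmstate_subsection_load() returns -ENOENT, which surfaces as the
     * "No such file or directory" error on the destination. */
    static bool pl011_clock_needed(void *opaque)
    {
        return true;    /* before the fix: always migrated */
    }

    static const VMStateDescription vmstate_pl011_clock = {
        .name = "pl011/clock",
        .version_id = 1,
        .minimum_version_id = 1,
        .needed = pl011_clock_needed,
        .fields = (VMStateField[]) {
            VMSTATE_CLOCK(clk, PL011State),
            VMSTATE_END_OF_LIST()
        }
    };

    static const VMStateDescription vmstate_pl011 = {
        .name = "pl011",
        /* ... existing version_id and .fields ... */
        .subsections = (const VMStateDescription * []) {
            &vmstate_pl011_clock,
            NULL
        }
    };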

Comment 3 Andrew Jones 2021-03-16 15:34:47 UTC
Hi Gavin,

One host is enough for this bug to be reproduced. I was able to reproduce with upstream QEMU builds (better to fix it there first) on a ThunderX with these commands

# On one terminal
$ cd /dir/with/build/for/qemu-4.1
qemu-4.1$ aarch64-softmmu/qemu-system-aarch64 -display none -M virt-4.1,accel=kvm,gic-version=host -cpu host -monitor stdio -incoming tcp:0:4444

# On another terminal
$ cd /dir/with/build/for/latest-qemu
latest-qemu$ ./qemu-system-aarch64 -display none -M virt-4.1,accel=kvm,gic-version=host -cpu host -monitor stdio -S
(qemu) migrate -d tcp:0:4444

The first terminal will show

qemu-system-aarch64: error while loading state for instance 0x0 of device 'pl011'
qemu-system-aarch64: load of migration failed: No such file or directory

Also, while fixing it upstream, it'd be nice to post some simple migration tests to QEMU that ensure we can ping-pong migrate all the machine type versions we have (or at least a reasonable number of the most recent ones). To do this, we'll probably want a collection of pre-compiled QEMUs of each release version we care about in our CI. The tests should simply SKIP and suggest building the binaries when they're not present, because building the different versions as part of running the tests would take too long.

Thanks,
drew

Comment 4 Guowen Shan 2021-03-17 04:59:56 UTC
Hi Drew,

Thanks for the instructions. Yeah, migration from a version where the pl011
clock exists to a version where it doesn't should fail. I've successfully
reproduced the issue on one machine. Also, the fix has been posted for
review; let's see what feedback I get. The fix is not to migrate the clock,
but to report the baud rate change at post-load time.

By the way, the issue is caused by aac63e0e6ea3 ("hw/char/pl011: add a clock input"),
which first appears in RHEL8.4.0.

Thanks,
Gavin

Comment 5 Guowen Shan 2021-03-18 02:46:17 UTC
The upstream fix (v1) was to disable the clock migration for pl011 completely,
but it was rejected by the community.

https://patchwork.kernel.org/project/qemu-devel/patch/20210317044441.112313-1-gshan@redhat.com/

The upstream fix (v2), following Drew's suggestion, has been posted for review.
Let's see what feedback I get from the community.

https://patchwork.kernel.org/project/qemu-devel/patch/20210318023801.18287-1-gshan@redhat.com/

Comment 6 Guowen Shan 2021-03-23 23:50:35 UTC
This is fixed by the following commit, merged to qemu-6.2.0:

   e6fa978d8343 ("hw/arm/virt: Disable pl011 clock migration if needed")

Comment 7 Guowen Shan 2021-03-28 23:20:09 UTC
Correcting comment#6: it was merged into qemu-6.0 instead of qemu-6.2.0. I'm
including Mirek so that the fix can be picked up for qemu on RHEL8.5, since
it's uncertain whether qemu on RHEL8.4 needs it.

   e6fa978d8343 ("hw/arm/virt: Disable pl011 clock migration if needed")
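
The shape of that fix, roughly: gate the sub-section behind a device property
that older versioned machine types turn off via their compat properties, and
report the baud rate at post-load instead. This is a sketch only; the property
name ("migrate-clk"), the compat table entry, and the helper names below are
assumptions and may not match the merged patch exactly:

    /* Sketch of the fix: PL011State gains a bool (set from an assumed
     * "migrate-clk" device property) that the sub-section's .needed()
     * callback consults.  Older virt machine types turn it off through a
     * compat property such as { "pl011", "migrate-clk", "off" }, so the
     * "pl011/clock" sub-section is never put on the wire for them. */
    static bool pl011_clock_needed(void *opaque)
    {
        PL011State *s = opaque;

        return s->migrate_clk;
    }

    static int pl011_post_load(void *opaque, int version_id)
    {
        PL011State *s = opaque;

        /* When the clock state isn't migrated, report the baud rate derived
         * from the destination's own clock after load (helper name assumed). */
        pl011_trace_baudrate_change(s);
        return 0;
    }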

Comment 12 Zhijian Li (Fujitsu) 2021-05-18 03:29:40 UTC
Migrated from rhel8.5 (qemu-kvm-6.0.0-16.module+el8.5.0+10848+2dccc46d.aarch64) to rhel8.4 (qemu-kvm-5.2.0-16.module+el8.4.0+10806+b7d97207.aarch64):
1. Between servers with the same CPU, migration succeeds.
2. Between servers with different CPU flags, the error message "qemu-kvm: error while loading state for instance 0x0 of device 'cpu'" is output.


Migration between servers with different CPU flags:
src:
[root@fujitsu-fx700-01-n00 auto_test_tool]#  /usr/libexec/qemu-kvm     -name 'avocado-vt-vm1'      -sandbox on      -machine virt-rhel8.2.0,gic-version=host,graphics=on     -nodefaults     -m 1024      -smp 2      -cpu 'host'     -vnc :10      -enable-kvm     -monitor stdio
QEMU 6.0.0 monitor - type 'help' for more information
(qemu) migrate -d tcp:10.19.241.163:8888
(qemu) info status
VM status: paused (postmigrate)

des:
[root@hpe-apollo80-01-n00 ~]# /usr/libexec/qemu-kvm     -name 'avocado-vt-vm1'      -sandbox on      -machine virt-rhel8.2.0,gic-version=host,graphics=on     -nodefaults     -m 1024      -smp 2      -cpu 'host'     -vnc :10      -enable-kvm     -monitor stdio -incoming tcp:0:8888
QEMU 5.2.0 monitor - type 'help' for more information
(qemu) qemu-kvm: error while loading state for instance 0x0 of device 'cpu'
qemu-kvm: load of migration failed: Operation not permitted

Comment 13 Zhijian Li (Fujitsu) 2021-05-18 04:27:13 UTC
(In reply to Zhijian Li (Fujitsu) from comment #12)
> Migrate from
> rhel8.5(qemu-kvm-6.0.0-16.module+el8.5.0+10848+2dccc46d.aarch64) to
> rhel8.4(qemu-kvm-5.2.0-16.module+el8.4.0+10806+b7d97207.aarch64)
> 1. Between same cpu server, migration succeed
> 2. Between server with different cpu flags. error message "qemu-kvm: error
> while loading state for instance 0x0 of device 'cpu'" output.
> 
> 
> Migration between different cpu flags server: 
> src:
> [root@fujitsu-fx700-01-n00 auto_test_tool]#  /usr/libexec/qemu-kvm     -name
> 'avocado-vt-vm1'      -sandbox on      -machine
> virt-rhel8.2.0,gic-version=host,graphics=on     -nodefaults     -m 1024     
> -smp 2      -cpu 'host'     -vnc :10      -enable-kvm     -monitor stdio
> QEMU 6.0.0 monitor - type 'help' for more information
> (qemu) migrate -d tcp:10.19.241.163:8888
> (qemu) info status
> VM status: paused (postmigrate)
> 
> des:
> [root@hpe-apollo80-01-n00 ~]# /usr/libexec/qemu-kvm     -name
> 'avocado-vt-vm1'      -sandbox on      -machine
> virt-rhel8.2.0,gic-version=host,graphics=on     -nodefaults     -m 1024     
> -smp 2      -cpu 'host'     -vnc :10      -enable-kvm     -monitor stdio
> -incoming tcp:0:8888
> QEMU 5.2.0 monitor - type 'help' for more information
> (qemu) qemu-kvm: error while loading state for instance 0x0 of device 'cpu'
> qemu-kvm: load of migration failed: Operation not permitted

cpu flags:
fujitsu-fx700-01-n00: Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm fcma dcpop sve
hpe-apollo80-01-n00:  Flags:  fp asimd evtstrm sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm fcma dcpop sve

Comment 14 Guowen Shan 2021-05-18 06:10:53 UTC
Zhijian, for test case#2, I think the CPUs on these two machines aren't exactly
the same. Migration between CPUs that aren't identical is expected to fail. By
the way, that issue isn't what this bugzilla is tracking. Please create another
bugzilla to track the failure from test case#2 and assign it to me, so that I
can investigate.

Besides, it'd be better to try the same host kernel and qemu to see whether
migration between these two different types of machines can succeed.

Comment 15 Zhijian Li (Fujitsu) 2021-05-18 06:12:16 UTC
It's verified that the initial problem "qemu-kvm: error while loading state for instance 0x0 of device 'pl011'" has been fixed.

Comment 16 Zhijian Li (Fujitsu) 2021-05-18 07:50:31 UTC
Filed a new bug, https://bugzilla.redhat.com/show_bug.cgi?id=1961519, to track the new issue.

Comment 17 Qunfang Zhang 2021-05-19 01:18:30 UTC
Setting to VERIFIED since we have new bug 1961519 (low priority) to track the new issue.

Comment 18 Andrew Jones 2021-05-20 11:34:28 UTC
(In reply to Zhijian Li (Fujitsu) from comment #15)
> It's verified that the initial problem "qemu-kvm: error while loading state
> for instance 0x0 of device 'pl011'" has beed fixed.

How was this verified? With upstream QEMU builds? No RHEL version has merged the fix yet. It has just been posted under bug 1957667, on which this bug should depend.

Comment 22 Eric Auger 2021-05-20 12:28:50 UTC
Yes the issue will be fixed with bug 1957667. No additional work needed.

Comment 23 Luiz Capitulino 2021-05-20 13:14:32 UTC
OK, I'm marking this as TestOnly.

Comment 25 Zhijian Li (Fujitsu) 2021-05-21 02:42:47 UTC
qemu-kvm version:
rhel8.5(qemu-kvm-6.0.0-16.module+el8.5.0+10848+2dccc46d.aarch64)
rhel8.3(qemu-kvm-5.1.0-14.module+el8.3.0+8790+80f9c6d8.1.aarch64)

machine type: virt-rhel8.2.0 and virt-rhel8.3.0

result:
1. Migration from rhel8.5 to rhel8.3 still fails.
2. Migration from rhel8.3 to rhel8.5 succeeds.

From rhel8.5 to rhel8.3:
src:
[root@hpe-apollo80-01-n01 ~]# /usr/libexec/qemu-kvm     -name 'avocado-vt-vm1'      -sandbox on      -machine virt-rhel8.2.0,gic-version=host,graphics=on     -nodefaults     -m 1024      -smp 2      -cpu 'host'     -vnc :10      -enable-kvm     -monitor stdio
QEMU 6.0.0 monitor - type 'help' for more information
(qemu) migrate -d tcp:10.19.241.163:8888
(qemu) info status
VM status: paused (postmigrate)

des:
[root@hpe-apollo80-01-n00 auto_test_tool]# /usr/libexec/qemu-kvm     -name 'avocado-vt-vm1'      -sandbox on      -machine virt-rhel8.2.0,gic-version=host,graphics=on     -nodefaults     -m 1024      -smp 2      -cpu 'host'     -vnc :10      -enable-kvm     -monitor stdio -incoming tcp:0:8888
QEMU 5.1.0 monitor - type 'help' for more information
(qemu) qemu-kvm: error while loading state for instance 0x0 of device 'pl011'
qemu-kvm: load of migration failed: No such file or directory

From rhel8.3 to rhel8.5:
src:
[root@hpe-apollo80-01-n00 auto_test_tool]# /usr/libexec/qemu-kvm     -name 'avocado-vt-vm1'      -sandbox on      -machine virt-rhel8.2.0,gic-version=host,graphics=on     -nodefaults     -m 1024      -smp 2      -cpu 'host'     -vnc :10      -enable-kvm     -monitor stdio
QEMU 5.1.0 monitor - type 'help' for more information
(qemu) migrate -d tcp:10.19.241.164:8888
(qemu) info status
VM status: paused (postmigrate)

des:
[root@hpe-apollo80-01-n01 ~]# /usr/libexec/qemu-kvm     -name 'avocado-vt-vm1'      -sandbox on      -machine virt-rhel8.2.0,gic-version=host,graphics=on     -nodefaults     -m 1024      -smp 2      -cpu 'host'     -vnc :10      -enable-kvm     -monitor stdio -incoming tcp:0:8888
QEMU 6.0.0 monitor - type 'help' for more information
(qemu) info status
VM status: running

Comment 26 Guowen Shan 2021-05-21 03:19:16 UTC
Xiaohui, the upstream commit won't take effect until the dependency (bug 1957667)
is closed. Please hold the verification until the code changes for bug 1957667 are merged.

Comment 34 Xinjian Ma(Fujitsu) 2021-06-10 06:16:07 UTC
qemu-kvm version:
qemu-kvm-6.0.0-18.module+el8.5.0+11243+5269aaa1.aarch64
qemu-kvm-5.1.0-21.module+el8.3.1+10464+8ad18d1a.aarch64


result: Migration succeeds in both directions between these two versions.

From 6.0.0-18 to 5.1.0-21
src:
# On one terminal
$ cd /dir/with/build/for/qemu-kvm-6.0.0-18
qemu-kvm-6.0.0-18$ ./qemu-kvm     -name 'avocado-vt-vm1'      -sandbox on      -machine virt-rhel8.2.0,gic-version=host,graphics=on     -nodefaults     -m 1024      -smp 2      -cpu 'host'     -vnc :9      -enable-kvm     -monitor stdio
QEMU 6.0.0 monitor - type 'help' for more information
(qemu) migrate -d tcp:10.16.207.95:8888
(qemu) info status
VM status: paused (postmigrate)

des:
# On another terminal
$ cd /dir/with/build/for/qemu-kvm-5.1.0-21
qemu-kvm-5.1.0-21$ ./qemu-kvm     -name 'avocado-vt-vm1'      -sandbox on      -machine virt-rhel8.2.0,gic-version=host,graphics=on     -nodefaults     -m 1024      -smp 2      -cpu 'host'     -vnc :10      -enable-kvm     -monitor stdio -incoming tcp:0:8888
QEMU 5.1.0 monitor - type 'help' for more information
(qemu) info status
VM status: running

From 5.1.0-21 to 6.0.0-18
src: 
# On one terminal
$ cd /dir/with/build/for/qemu-kvm-5.1.0-21
qemu-kvm-5.1.0-21$ ./qemu-kvm     -name 'avocado-vt-vm1'      -sandbox on      -machine virt-rhel8.2.0,gic-version=host,graphics=on     -nodefaults     -m 1024      -smp 2      -cpu 'host'     -vnc :10      -enable-kvm     -monitor stdio
QEMU 5.1.0 monitor - type 'help' for more information
(qemu)  migrate -d tcp:10.16.207.95:8888
(qemu) info status
VM status: paused (postmigrate)

des:
# On another terminal
$ cd /dir/with/build/for/qemu-kvm-6.0.0-18
qemu-kvm-6.0.0-18$ ./qemu-kvm     -name 'avocado-vt-vm1'      -sandbox on      -machine virt-rhel8.2.0,gic-version=host,graphics=on     -nodefaults     -m 1024      -smp 2      -cpu 'host'     -vnc :9      -enable-kvm     -monitor stdio -incoming tcp:0:8888
QEMU 6.0.0 monitor - type 'help' for more information
(qemu) info status
VM status: running

Comment 35 Qunfang Zhang 2021-06-10 06:54:28 UTC
Setting to VERIFIED after confirming with Xinjian.

Comment 37 Xinjian Ma(Fujitsu) 2021-08-27 03:36:55 UTC
RHEL9 only supports virt-rhel8.5.0

qemu-kvm version:
qemu-kvm-6.0.0-12.el9
qemu-kvm-6.0.0-29.module+el8.5.0+12386+43574bac


machine type: 
virt-rhel8.5.0

result: Migration succeeds in both directions between these two versions.

From qemu-kvm-6.0.0-12.el9 to qemu-kvm-6.0.0-29.module+el8.5.0+12386+43574bac
src:
# On one terminal
$ cd /dir/with/build/for/qemu-kvm-6.0.0-12.el9
qemu-kvm-6.0.0-12.el9$ ./qemu-system-aarch64 -name 'avocado-vt-vm1'      -sandbox on      -machine virt-rhel8.5.0,gic-version=host,graphics=on     -nodefaults     -m 1024      -smp 2      -cpu 'host'     -vnc :10      -enable-kvm     -monitor stdio
QEMU 6.0.0 monitor - type 'help' for more information
(qemu)  migrate -d tcp:10.16.207.95:8888
(qemu) info status
VM status: paused (postmigrate)

des:
# On another terminal
$ cd /dir/with/build/for/qemu-kvm-6.0.0-29.module+el8.5.0+12386+43574bac
qemu-kvm-6.0.0-29.module+el8.5.0+12386+43574bac$ ./qemu-system-aarch64 -name 'avocado-vt-vm1'      -sandbox on      -machine virt-rhel8.5.0,gic-version=host,graphics=on     -nodefaults     -m 1024      -smp 2      -cpu 'host'     -vnc :9      -enable-kvm     -monitor stdio -incoming tcp:0:8888
QEMU 6.0.0 monitor - type 'help' for more information
(qemu) info status
VM status: running


From qemu-kvm-6.0.0-29.module+el8.5.0+12386+43574bac to qemu-kvm-6.0.0-12.el9
src: 
# On one terminal
$ cd /dir/with/build/for/qemu-kvm-6.0.0-29.module+el8.5.0+12386+43574bac
qemu-kvm-6.0.0-29.module+el8.5.0+12386+43574bac$ ./qemu-system-aarch64 -name 'avocado-vt-vm1'      -sandbox on      -machine virt-rhel8.5.0,gic-version=host,graphics=on     -nodefaults     -m 1024      -smp 2      -cpu 'host'     -vnc :9      -enable-kvm     -monitor stdio
QEMU 6.0.0 monitor - type 'help' for more information
(qemu)  migrate -d tcp:10.16.207.95:8888
(qemu) info status
VM status: paused (postmigrate)

des:
# On another terminal
$ cd /dir/with/build/for/qemu-kvm-6.0.0-12.el9
qemu-kvm-6.0.0-12.el9$ ./qemu-system-aarch64 -name 'avocado-vt-vm1'      -sandbox on      -machine virt-rhel8.5.0,gic-version=host,graphics=on     -nodefaults     -m 1024      -smp 2      -cpu 'host'     -vnc :10      -enable-kvm     -monitor stdio -incoming tcp:0:8888
QEMU 6.0.0 monitor - type 'help' for more information
(qemu) info status
VM status: running

Comment 38 Xinjian Ma(Fujitsu) 2021-08-31 06:27:42 UTC
Sorry, please disregard my last comment.

Since RHEL9 only supports machine type rhel8.5, this scenario cannot be reproduced on RHEL9.

Comment 40 errata-xmlrpc 2021-11-16 07:51:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (virt:av bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4684

