Bug 1999141 - migration fails with: "qemu-kvm: get_pci_config_device: Bad config data: i=0x9a read: 3 device: 2 cmask: ff wmask: 0 w1cmask:0"
Summary: migration fails with: "qemu-kvm: get_pci_config_device: Bad config data: i=0x...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: qemu-kvm
Version: 8.5
Hardware: x86_64
OS: All
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: 8.5
Assignee: Eduardo Habkost
QA Contact: jingzhao
URL:
Whiteboard:
Depends On:
Blocks: 2025468 2026443
 
Reported: 2021-08-30 13:49 UTC by Jean-Louis Dupond
Modified: 2023-03-14 07:06 UTC (History)

Fixed In Version: qemu-kvm-6.0.0-33.module+el8.5.0+13041+05be2dc6
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2025468 2026443
Environment:
Last Closed: 2021-11-16 07:55:27 UTC
Type: Bug
Target Upstream Version:


Attachments
qemu args (5.78 KB, text/plain)
2021-08-30 13:49 UTC, Jean-Louis Dupond
libvirt xml (15.00 KB, text/plain)
2021-08-30 13:49 UTC, Jean-Louis Dupond


Links
System ID Private Priority Status Summary Last Updated
Gitlab redhat/rhel/src/qemu-kvm qemu-kvm merge_requests 52 0 None None None 2021-10-20 13:47:03 UTC
Red Hat Issue Tracker RHELPLAN-95601 0 None None None 2021-08-30 18:22:56 UTC
Red Hat Product Errata RHBA-2021:4684 0 None None None 2021-11-16 07:57:17 UTC

Description Jean-Louis Dupond 2021-08-30 13:49:37 UTC
Created attachment 1819096 [details]
qemu args

Description of problem:

When trying to migrate a VM on oVirt from a host running QEMU 5.2 to a host running QEMU 6, the migration fails with the following error:

2021-08-30 13:33:32.937+0000: 2956: error : virNetClientProgramDispatchError:172 : internal error: qemu unexpectedly closed the monitor: 2021-08-30T13:33:31.534090Z qemu-kvm: get_pci_config_device: Bad config data: i=0x9a read: 3 device: 2 cmask: ff wmask: 0 w1cmask:0
2021-08-30T13:33:31.534138Z qemu-kvm: Failed to load PCIDevice:config
2021-08-30T13:33:31.534148Z qemu-kvm: Failed to load virtio-net:virtio
2021-08-30T13:33:31.534157Z qemu-kvm: error while loading state for instance 0x0 of device '0000:00:02.0:00.0:01.0/virtio-net'
2021-08-30T13:33:31.536760Z qemu-kvm: load of migration failed: Invalid argument
2021-08-30 13:33:32.937+0000: 2956: debug : qemuDomainObjExitRemote:6060 : Exited remote (vm=0x7ff93800bce0 name=mail001)

If I stop the VM and start it on the Qemu 6 hypervisor it works fine.
If I then try to migrate it back to the 5.2 hypervisor it fails again.
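
The "Bad config data" line reports one byte of PCI config space (offset 0x9a) where the value read from the migration stream differs from the value on the destination device. A minimal sketch of that kind of check follows. It assumes the comparison is (read XOR device) masked by cmask and by the inverse of wmask and w1cmask; the function name is hypothetical and this is a model, not QEMU's code.

```python
def config_byte_mismatch(read, device, cmask, wmask, w1cmask):
    """Return True when a migrated config byte conflicts with the destination.

    Assumption: only bits set in cmask, and not guest-writable
    (wmask / w1cmask), are required to match.
    """
    checked_bits = cmask & ~wmask & ~w1cmask & 0xFF
    return bool((read ^ device) & checked_bits)


# Values taken from the log line (printed there in hex):
# i=0x9a read: 3 device: 2 cmask: ff wmask: 0 w1cmask:0
print(config_byte_mismatch(0x3, 0x2, 0xFF, 0x00, 0x00))  # True
print(config_byte_mismatch(0x2, 0x2, 0xFF, 0x00, 0x00))  # False
```

With cmask=ff and both write masks at 0, every bit of the byte must match, so read=3 against device=2 is rejected.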

Version-Release number of selected component (if applicable):
qemu-kvm-6.0.0-26.el8s.x86_64

How reproducible:
Always (with this VM config)

Steps to Reproduce:
1. Start VM on oVirt 4.4.8 on a node with Qemu 5.2
2. Migrate the VM to a node with Qemu 6

Actual results:
Fails with error above

Expected results:
Migration should succeed

Additional info:

Comment 1 Jean-Louis Dupond 2021-08-30 13:49:56 UTC
Created attachment 1819097 [details]
libvirt xml

Comment 2 Jean-Louis Dupond 2021-09-01 13:06:14 UTC
After further checking I noticed the following.

VMs that failed during migration had the following virtio-net dev:
-device virtio-net-pci-transitional,host_mtu=1500,netdev=hostua-240f47e3-5263-46a7-9a07-c7e2d6069fbc,id=ua-240f47e3-5263-46a7-9a07-c7e2d6069fbc,mac=xxxxxxx,bus=pci.18,addr=0x1 

The ones that succeeded:
-device virtio-net-pci-transitional,mq=on,vectors=6,host_mtu=1500,netdev=hostua-dcd61290-5507-4750-9225-3611118a647f,id=ua-dcd61290-5507-4750-9225-3611118a647f,mac=xxxx,bus=pci.18,addr=0x1

So the ones that failed did not have mq=on and vectors=X defined.
Could it be related to https://github.com/qemu/qemu/commit/51a81a2118df0c70988f00d61647da9e298483a4 ?

Comment 3 jason wang 2021-09-02 07:42:44 UTC
Have you specified the correct machine type?

Thanks

Comment 4 Jean-Louis Dupond 2021-09-02 07:48:51 UTC
Machine type is set to 'pc-q35-rhel8.4.0' (as you can see in the qemu args attachment).
This is set by oVirt itself.

Comment 6 John Ferlan 2021-09-08 21:28:05 UTC
Move RHEL-AV bugs to RHEL9. If necessary to resolve in RHEL8, then clone to the current RHEL8 release.

Comment 7 Stefan Hajnoczi 2021-09-15 09:25:55 UTC
(In reply to Jean-Louis Dupond from comment #2)
> After further checking I noticed the following.
> 
> VM's that failed during migration had the following virtio-net dev:
> -device
> virtio-net-pci-transitional,host_mtu=1500,netdev=hostua-240f47e3-5263-46a7-
> 9a07-c7e2d6069fbc,id=ua-240f47e3-5263-46a7-9a07-c7e2d6069fbc,mac=xxxxxxx,
> bus=pci.18,addr=0x1 
> 
> The ones that succeeded:
> -device
> virtio-net-pci-transitional,mq=on,vectors=6,host_mtu=1500,netdev=hostua-
> dcd61290-5507-4750-9225-3611118a647f,id=ua-dcd61290-5507-4750-9225-
> 3611118a647f,mac=xxxx,bus=pci.18,addr=0x1
> 
> So the ones that failed did not have mq=on and vectors=X defined.
> Could it be related to
> https://github.com/qemu/qemu/commit/51a81a2118df0c70988f00d61647da9e298483a4
> ?

I think you're right. This commit changed how vectors= is calculated by default and it preserved the old default (3) on old machine types.

However, the command-line you posted has -device virtio-net-pci-transitional, while the commit only preserves the value for "virtio-net-pci".

I haven't verified this but Jason can confirm this theory. It should be possible to test it with a command-line similar to:

  # qemu-system-x86_64 -M pc-q35-rhel8.4.0 -netdev tap,queues=8,id=netdev0,script= -device virtio-net-pci-transitional,netdev=netdev0,mq=on
  (qemu) info qtree

Note that this command-line doesn't quite work for me, the netdev isn't multi-queue. I'm not sure what is missing, but Jason can help with that.

The "info qtree" output from the HMP monitor contains the virtio-net-pci-transitional device properties including vectors=. If the vectors= value is preserved for virtio-net-pci-transitional as it should be, then you'll see vectors=3. Otherwise you'll see a high value (probably 2 * 8 + 2 = 18).

Once you have a working command-line QE will be able to re-use it to verify this BZ. qemu-kvm-6.0.0-26.el8s.x86_64 will report the incorrect value (~18) while the fixed QEMU will report 3.
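
The arithmetic above can be sketched as follows. This assumes the new default is 2 * queue pairs + 2, as guessed in this comment, and that a compat property pins the value to the old default of 3. The function and constant names are hypothetical.

```python
LEGACY_DEFAULT_VECTORS = 3  # old default preserved on old machine types


def default_vectors(queue_pairs, compat_applied):
    """Model the default MSI-X vector count for a virtio-net PCI device.

    Assumption: without the compat property the default is
    2 * queue_pairs + 2; with it, the legacy value is used.
    """
    if compat_applied:
        return LEGACY_DEFAULT_VECTORS
    return 2 * queue_pairs + 2


print(default_vectors(8, compat_applied=False))  # 18
print(default_vectors(8, compat_applied=True))   # 3
```

A source and destination that disagree on compat_applied for the same device end up with different vector counts, which is the mismatch this theory predicts.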

Comment 8 Jean-Louis Dupond 2021-09-15 09:39:29 UTC
(In reply to Stefan Hajnoczi from comment #7)
> 
> The "info qtree" output from the HMP monitor contains the
> virtio-net-pci-transitional device properties including vectors=. If the
> vectors= value is preserved for virtio-net-pci-transitional as it should be,
> then you'll see vectors=3. Otherwise you'll see a high value (probably 2 * 8
> + 2 = 18).
> 

The info qtree gives me the following:

      dev: virtio-net-pci-transitional, id ""
        ioeventfd = true
        vectors = 18 (0x12)

While virtio-net-pci gives me:
      dev: virtio-net-pci, id ""
        disable-legacy = "off"
        disable-modern = false
        ioeventfd = true
        vectors = 3 (0x3)
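
For comparing several devices, the vectors value can be pulled out of saved "info qtree" text with a small script. This is a rough sketch that assumes the output looks like the excerpts above ("dev: NAME, id ..." followed by indented "property = value" lines); the helper name is hypothetical, and two devices of the same type would overwrite each other.

```python
import re


def vectors_by_device(qtree_text):
    """Map each device type seen in qtree text to its 'vectors' value."""
    result = {}
    current = None
    for line in qtree_text.splitlines():
        dev = re.match(r"\s*dev: ([^,]+), id", line)
        if dev:
            current = dev.group(1)
            continue
        vec = re.match(r"\s*vectors = (\d+)", line)
        if vec and current is not None:
            result[current] = int(vec.group(1))
    return result


sample = '''
      dev: virtio-net-pci-transitional, id ""
        ioeventfd = true
        vectors = 18 (0x12)
      dev: virtio-net-pci, id ""
        ioeventfd = true
        vectors = 3 (0x3)
'''
print(vectors_by_device(sample))
```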

Comment 9 Stefan Hajnoczi 2021-09-15 09:53:52 UTC
(In reply to Jean-Louis Dupond from comment #8)
> (In reply to Stefan Hajnoczi from comment #7)
> > 
> > The "info qtree" output from the HMP monitor contains the
> > virtio-net-pci-transitional device properties including vectors=. If the
> > vectors= value is preserved for virtio-net-pci-transitional as it should be,
> > then you'll see vectors=3. Otherwise you'll see a high value (probably 2 * 8
> > + 2 = 18).
> > 
> 
> The info qtree gives me the following:
> 
>       dev: virtio-net-pci-transitional, id ""
>         ioeventfd = true
>         vectors = 18 (0x12)
> 
> While virtio-net-pci gives me:
>       dev: virtio-net-pci, id ""
>         disable-legacy = "off"
>         disable-modern = false
>         ioeventfd = true
>         vectors = 3 (0x3)

Thanks, that explains why migration compatibility broke. Although -device virtio-net-pci is compatible, -device virtio-net-pci-transitional (and other variants) are not :(.

Please post your exact QEMU command-line so QE can use it to verify this BZ.

A quick fix would be to add the virtio-net-pci-* variants to the compat machine type properties so they are protected too. In the long term we should find a way to rule out this type of mistake completely, for example by introducing an aliasing mechanism so that the virtio-net-pci-* variants automatically copy all the compat properties from virtio-net-pci. These are just some ideas; maybe Jason has other solutions in mind.

Comment 10 Jean-Louis Dupond 2021-09-15 09:55:48 UTC
Tested with the following commands:

/usr/libexec/qemu-kvm  -M pc-q35-rhel8.4.0 -netdev tap,queues=8,id=netdev0,script= -device virtio-net-pci-transitional,netdev=netdev0,mq=on -nographic

And

/usr/libexec/qemu-kvm  -M pc-q35-rhel8.4.0 -netdev tap,queues=8,id=netdev0,script= -device virtio-net-pci,netdev=netdev0,mq=on -nographic

Comment 11 jason wang 2021-09-16 02:45:59 UTC
(In reply to Stefan Hajnoczi from comment #9)
> (In reply to Jean-Louis Dupond from comment #8)
> > (In reply to Stefan Hajnoczi from comment #7)
> > > 
> > > The "info qtree" output from the HMP monitor contains the
> > > virtio-net-pci-transitional device properties including vectors=. If the
> > > vectors= value is preserved for virtio-net-pci-transitional as it should be,
> > > then you'll see vectors=3. Otherwise you'll see a high value (probably 2 * 8
> > > + 2 = 18).
> > > 
> > 
> > The info qtree gives me the following:
> > 
> >       dev: virtio-net-pci-transitional, id ""
> >         ioeventfd = true
> >         vectors = 18 (0x12)
> > 
> > While virtio-net-pci gives me:
> >       dev: virtio-net-pci, id ""
> >         disable-legacy = "off"
> >         disable-modern = false
> >         ioeventfd = true
> >         vectors = 3 (0x3)
> 
> Thanks, that explains why migration compatibility broke. Although -device
> virtio-net-pci is compatible, -device virtio-net-pci-transitional (and other
> variants) are not :(.
> 
> Please post your exact QEMU command-line so QE can use it to verify this BZ.
> 
> A quick fix would be add the virtio-net-pci-* variants to the compat machine
> type properties so they are protected too. In the long term we should find a
> way to rule out this type of mistake completely, for example, by introducing
> an aliasing mechanism so that virtio-net-pci-* automatically copy all the
> compat properties from virtio-net-pci. These are just some ideas, maybe
> Jason has other solutions in mind.

I fully agree.

In the short term we need compat entries for virtio-net-pci-{non-}transitional.

Btw, it looks to me like we also need to fix blk, since we have:

GlobalProperty hw_compat_5_2[] = {
    { "ICH9-LPC", "smm-compat", "on"},
    { "PIIX4_PM", "smm-compat", "on"},
    { "virtio-blk-device", "report-discard-granularity", "off" },
      ^^^^^^^^^^^^^^^^^^^
    { "virtio-net-pci", "vectors", "3"},
};
const size_t hw_compat_5_2_len = G_N_ELEMENTS(hw_compat_5_2);

Thanks

Comment 12 Jean-Louis Dupond 2021-10-04 07:41:26 UTC
Any chance this can get fixed soon? I don't think it's a huge change :)
This bug blocks us from migrating to a newer version without downtime.

Comment 13 Jean-Louis Dupond 2021-10-12 08:40:16 UTC
Posted a possible patch for this on the qemu-devel mailing list.

Comment 14 Daniel Berrangé 2021-10-19 10:32:28 UTC
Patch proposed upstream is https://lists.nongnu.org/archive/html/qemu-devel/2021-10/msg02377.html

Comment 15 Stefan Hajnoczi 2021-10-19 11:23:45 UTC
Added "blocker?". David Gilbert mentioned this 8.5 issue needs to be fixed before release to prevent live migration problems. Once a broken release has been made it will be hard to solve the issue.

Comment 17 John Ferlan 2021-10-19 13:09:32 UTC
Per IRC - moving this to block AV 8.5.0

Comment 18 Eduardo Habkost 2021-10-19 17:05:54 UTC
(In reply to jason wang from comment #11)
> GlobalProperty hw_compat_5_2[] = {
>     { "ICH9-LPC", "smm-compat", "on"},
>     { "PIIX4_PM", "smm-compat", "on"},
>     { "virtio-blk-device", "report-discard-granularity", "off" },
>       ^^^^^^^^^^^^^^^^^^^

What do you mean here?  virtio-blk-device has no separate transitional/non-transitional variants, does it?

Comment 19 Eduardo Habkost 2021-10-19 17:22:16 UTC
Note that if the following fix works:

diff --git a/hw/core/machine.c b/hw/core/machine.c
index b8d95eec32d..bd9c6156c1a 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -56,7 +56,7 @@ GlobalProperty hw_compat_5_2[] = {
     { "ICH9-LPC", "smm-compat", "on"},
     { "PIIX4_PM", "smm-compat", "on"},
     { "virtio-blk-device", "report-discard-granularity", "off" },
-    { "virtio-net-pci", "vectors", "3"},
+    { "virtio-net-pci-base", "vectors", "3"},
 };
 const size_t hw_compat_5_2_len = G_N_ELEMENTS(hw_compat_5_2);
 


That means the bug is in the compat_props array and we don't need to wait for the upstream fix to be merged.

We can do this downstream:

diff --git a/hw/core/machine.c b/hw/core/machine.c
index 6c534e14fa3..6b9c0f66d20 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -47,7 +47,8 @@ GlobalProperty hw_compat_rhel_8_4[] = {
     /* hw_compat_rhel_8_4 from hw_compat_5_2 */
     { "virtio-blk-device", "report-discard-granularity", "off" },
     /* hw_compat_rhel_8_4 from hw_compat_5_2 */
-    { "virtio-net-pci", "vectors", "3"},
+    /* Upstream incorrectly had "virtio-net-pci" instead of "virtio-net-pci-base" */
+    { "virtio-net-pci-base", "vectors", "3"},
 };
 const size_t hw_compat_rhel_8_4_len = G_N_ELEMENTS(hw_compat_rhel_8_4);
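
Why naming virtio-net-pci-base would cover the variants can be sketched with a toy model. It assumes a compat property applies to any device whose type is the named type or a descendant of it, and that virtio-net-pci, virtio-net-pci-transitional and virtio-net-pci-non-transitional all derive from virtio-net-pci-base. The PARENT table, the helper names and the "18" default are illustrative only, not QEMU code.

```python
# Assumed type hierarchy: every variant is a child of the -base type.
PARENT = {
    "virtio-net-pci-base": None,
    "virtio-net-pci": "virtio-net-pci-base",
    "virtio-net-pci-transitional": "virtio-net-pci-base",
    "virtio-net-pci-non-transitional": "virtio-net-pci-base",
}


def is_a(type_name, ancestor):
    """True if type_name is ancestor or inherits from it."""
    while type_name is not None:
        if type_name == ancestor:
            return True
        type_name = PARENT.get(type_name)
    return False


def apply_compat(type_name, compat_props, defaults):
    """Return device properties after applying matching compat entries."""
    props = dict(defaults)
    for driver, prop, value in compat_props:
        if is_a(type_name, driver):
            props[prop] = value
    return props


before_fix = [("virtio-net-pci", "vectors", "3")]
after_fix = [("virtio-net-pci-base", "vectors", "3")]
new_default = {"vectors": "18"}  # illustrative value for 8 queue pairs

print(apply_compat("virtio-net-pci-transitional", before_fix, new_default))
print(apply_compat("virtio-net-pci-transitional", after_fix, new_default))
```

In this model the entry keyed on virtio-net-pci never matches the transitional type, because the two are siblings, while the entry keyed on the shared parent matches all three.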

Comment 24 jason wang 2021-10-20 01:30:22 UTC
(In reply to Eduardo Habkost from comment #18)
> Note that if the patch I have (In reply to jason wang from comment #11)
> > GlobalProperty hw_compat_5_2[] = {
> >     { "ICH9-LPC", "smm-compat", "on"},
> >     { "PIIX4_PM", "smm-compat", "on"},
> >     { "virtio-blk-device", "report-discard-granularity", "off" },
> >       ^^^^^^^^^^^^^^^^^^^
> 
> What do you mean here?  virtio-blk-device has no separate
> transitional/non-transitional variants, does it?

Yes, I was wrong.

Thanks

Comment 25 Jean-Louis Dupond 2021-10-20 07:05:05 UTC
(In reply to Eduardo Habkost from comment #19)
> Note that if the following fix works:
> 
> diff --git a/hw/core/machine.c b/hw/core/machine.c
> index b8d95eec32d..bd9c6156c1a 100644
> --- a/hw/core/machine.c
> +++ b/hw/core/machine.c
> @@ -56,7 +56,7 @@ GlobalProperty hw_compat_5_2[] = {
>      { "ICH9-LPC", "smm-compat", "on"},
>      { "PIIX4_PM", "smm-compat", "on"},
>      { "virtio-blk-device", "report-discard-granularity", "off" },
> -    { "virtio-net-pci", "vectors", "3"},
> +    { "virtio-net-pci-base", "vectors", "3"},
>  };
>  const size_t hw_compat_5_2_len = G_N_ELEMENTS(hw_compat_5_2);
>  
> 

Seems to work! I built QEMU with this patch included (and also changed the rhel8.4 compat), and info qtree with the above commands now correctly reports a value of 3.

Comment 32 Yanan Fu 2021-10-26 01:45:17 UTC
QE bot(pre verify): Set 'Verified:Tested,SanityOnly' as gating/tier1 test pass.

Comment 39 errata-xmlrpc 2021-11-16 07:55:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (virt:av bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4684

Comment 40 Dr. David Alan Gilbert 2021-11-23 20:00:04 UTC
probably also needs cloning for 8.6

