Bug 1889268

Summary: Migrate failed with virtio-vgpu device from RHEL-AV 8.0.0 to RHEL-AV 8.3.0
Product: Red Hat Enterprise Linux Advanced Virtualization Reporter: jingzhao <jinzhao>
Component: qemu-kvm Assignee: Virtualization Maintenance <virt-maint>
qemu-kvm sub component: Live Migration QA Contact: jingzhao <jinzhao>
Status: CLOSED CANTFIX Docs Contact:
Severity: high    
Priority: high CC: coli, dgilbert, juzhang, kraxel, mst, virt-maint, zhguo
Version: 8.0 Keywords: TestBlocker
Target Milestone: rc   
Target Release: 8.3   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-10-27 17:40:01 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description jingzhao 2020-10-19 08:30:53 UTC
Description of problem:
Migration fails with a virtio-gpu device (virtio-vga) from RHEL-AV 8.0.0 to RHEL-AV 8.3.0.

Version-Release number of selected component (if applicable):
source host:
kernel-4.18.0-80.29.1.el8_0.x86_64
qemu-kvm-3.1.0-20.module+el8.0.0.z+3438+2851622e.1.x86_64

destination host:
kernel-4.18.0-240.1.1.el8_3.x86_64
qemu-kvm-5.1.0-14.module+el8.3.0+8438+644aff69.x86_64

How reproducible:
3/3

Steps to Reproduce:
1. Boot qemu on the source host:
/usr/libexec/qemu-kvm -M pc-q35-rhel8.0.0 -device virtio-vga -monitor stdio

2. Boot qemu on the destination host:
/usr/libexec/qemu-kvm -M pc-q35-rhel8.0.0 -device virtio-vga -monitor stdio -incoming defer

3. Migrate from the source host to the destination host.

Actual results:
Migration failed with the following error:
(qemu) qemu-kvm: get_pci_config_device: Bad config data: i=0x34 read: 84 device: 98 cmask: ff wmask: 0 w1cmask:0
qemu-kvm: Failed to load PCIDevice:config
qemu-kvm: Failed to load virtio-gpu:virtio
qemu-kvm: error while loading state for instance 0x0 of device '0000:00:02.0/virtio-gpu'
qemu-kvm: warning: TSC frequency mismatch between VM (2397221 kHz) and host (1995191 kHz), and TSC scaling unavailable
qemu-kvm: load of migration failed: Invalid argument

Expected results:
Migration succeeds.

Additional info:

Comment 3 Gerd Hoffmann 2020-10-22 07:22:38 UTC
> (qemu) qemu-kvm: get_pci_config_device: Bad config data: i=0x34 read: 84

#define PCI_CAPABILITY_LIST	0x34	/* Offset of first capability list entry */

Hmm, looks like something in pci (adding mst to cc) ...

What does "lspci -vxxx" print for the virtio-gpu device (as root, both qemu versions please)?
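
For readers unfamiliar with the message, here is a minimal Python model of the per-byte comparison that appears to produce it. It paraphrases the check in QEMU's get_pci_config_device (hw/pci/pci.c) as I understand it and is not the actual C code; the function name config_byte_mismatch is made up for this sketch.

```python
PCI_CAPABILITY_LIST = 0x34  # offset of the first capability list entry


def config_byte_mismatch(read, device, cmask, wmask=0, w1cmask=0):
    """Return True if a migrated config byte conflicts with the destination's.

    read    -- byte taken from the incoming migration stream (source device)
    device  -- byte the destination device was created with
    cmask   -- bits that are compared; bits set in wmask or w1cmask are
               guest-writable and are excluded from the comparison
    """
    return bool((read ^ device) & cmask & ~wmask & ~w1cmask & 0xFF)


# Values from the reported error:
#   i=0x34 read: 84 device: 98 cmask: ff wmask: 0 w1cmask:0
print(config_byte_mismatch(0x84, 0x98, cmask=0xFF))
```

With cmask=0xff and both write masks zero, every bit of the capability pointer has to match, so any difference in where the capability chain starts is enough to reject the incoming state.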

Comment 4 Dr. David Alan Gilbert 2020-10-22 08:53:19 UTC
'info pci' from the two cases:
3.1.0:

  Bus  0, device   2, function 0:
    VGA controller: PCI device 1af4:1050
      PCI subsystem 1af4:1100
      IRQ 11.
      BAR0: 32 bit prefetchable memory at 0xfe000000 [0xfe7fffff].
      BAR2: 64 bit prefetchable memory at 0xfe800000 [0xfe803fff].
      BAR6: 32 bit memory at 0xffffffffffffffff [0x0000fffe].
      id ""

5.1.0:
  Bus  0, device   2, function 0:
    VGA controller: PCI device 1af4:1050
      PCI subsystem 1af4:1100
      IRQ 11, pin A
      BAR0: 32 bit prefetchable memory at 0xfe000000 [0xfe7fffff].
      BAR2: 64 bit prefetchable memory at 0xfe800000 [0xfe803fff].
      BAR4: 32 bit memory at 0xfebd4000 [0xfebd4fff].
      BAR6: 32 bit memory at 0xffffffffffffffff [0x0000fffe].
      id ""

We seem to have gained a BAR (BAR4).

Comment 5 Dr. David Alan Gilbert 2020-10-22 09:04:08 UTC
It looks like the problem here is MSI-X; the full error is:
(qemu) qemu-kvm: get_pci_config_device: Bad config data: i=0x34 read: 98 device: 84 cmask: ff wmask: 0 w1cmask:0

lspci -vxxx from 5.1.0:

00:02.0 VGA compatible controller: Red Hat, Inc. Virtio GPU (rev 01) (prog-if 00 [VGA controller])
        Subsystem: Red Hat, Inc. Device 1100
        Flags: bus master, fast devsel, latency 0, IRQ 22
        Memory at fe000000 (32-bit, prefetchable) [size=8M]
        Memory at fe800000 (64-bit, prefetchable) [size=16K]
        Memory at febd4000 (32-bit, non-prefetchable) [size=4K]
        Expansion ROM at 000c0000 [disabled] [size=128K]
        Capabilities: [98] MSI-X: Enable+ Count=3 Masked-
        Capabilities: [84] Vendor Specific Information: VirtIO: <unknown>
        Capabilities: [70] Vendor Specific Information: VirtIO: Notify
        Capabilities: [60] Vendor Specific Information: VirtIO: DeviceCfg
        Capabilities: [50] Vendor Specific Information: VirtIO: ISR
        Capabilities: [40] Vendor Specific Information: VirtIO: CommonCfg
        Kernel driver in use: virtio-pci
00: f4 1a 50 10 07 05 10 00 01 00 00 03 00 00 00 00
10: 08 00 00 fe 00 00 00 00 0c 00 80 fe 00 00 00 00
20: 00 40 bd fe 00 00 00 00 00 00 00 00 f4 1a 00 11
30: 00 00 bc fe 98 00 00 00 00 00 00 00 0b 01 00 00
40: 09 00 10 01 02 00 00 00 00 10 00 00 00 08 00 00
50: 09 40 10 03 02 00 00 00 00 18 00 00 00 08 00 00
60: 09 50 10 04 02 00 00 00 00 20 00 00 00 10 00 00
70: 09 60 14 02 02 00 00 00 00 30 00 00 00 10 00 00
80: 04 00 00 00 09 70 14 05 00 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 11 84 02 80 04 00 00 00
a0: 04 08 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

lspci -vxxx from 3.1.0:

00:02.0 VGA compatible controller: Red Hat, Inc. Virtio GPU (rev 01) (prog-if 00 [VGA controller])
        Subsystem: Red Hat, Inc. Device 1100
        Flags: bus master, fast devsel, latency 0, IRQ 22
        Memory at fe000000 (32-bit, prefetchable) [size=8M]
        Memory at fe800000 (64-bit, prefetchable) [size=16K]
        Expansion ROM at 000c0000 [disabled] [size=128K]
        Capabilities: [84] Vendor Specific Information: VirtIO: <unknown>
        Capabilities: [70] Vendor Specific Information: VirtIO: Notify
        Capabilities: [60] Vendor Specific Information: VirtIO: DeviceCfg
        Capabilities: [50] Vendor Specific Information: VirtIO: ISR
        Capabilities: [40] Vendor Specific Information: VirtIO: CommonCfg
        Kernel driver in use: virtio-pci
00: f4 1a 50 10 07 01 10 00 01 00 00 03 00 00 00 00
10: 08 00 00 fe 00 00 00 00 0c 00 80 fe 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 f4 1a 00 11
30: 00 00 bc fe 84 00 00 00 00 00 00 00 0b 01 00 00
40: 09 00 10 01 02 00 00 00 00 10 00 00 00 08 00 00
50: 09 40 10 03 02 00 00 00 00 18 00 00 00 08 00 00
60: 09 50 10 04 02 00 00 00 00 20 00 00 00 10 00 00
70: 09 60 14 02 02 00 00 00 00 30 00 00 00 10 00 00
80: 04 00 00 00 09 70 14 05 00 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

The 98/84 is the start of the capability chain (the pointer at offset 0x34), and we've gained an MSI-X capability at 0x98.
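
The chain can be followed mechanically from the hex dumps. The sketch below is plain Python with helper names (parse_dump, walk_caps) invented for illustration; it copies rows 30 through 90 from the two dumps in this comment, starts at the pointer at 0x34, and follows each capability's "next" byte.

```python
PCI_CAPABILITY_LIST = 0x34  # offset of the first capability list entry

# Rows 30..90 copied from the lspci -vxxx dumps in this comment.
DUMP_5_1_0 = """
30: 00 00 bc fe 98 00 00 00 00 00 00 00 0b 01 00 00
40: 09 00 10 01 02 00 00 00 00 10 00 00 00 08 00 00
50: 09 40 10 03 02 00 00 00 00 18 00 00 00 08 00 00
60: 09 50 10 04 02 00 00 00 00 20 00 00 00 10 00 00
70: 09 60 14 02 02 00 00 00 00 30 00 00 00 10 00 00
80: 04 00 00 00 09 70 14 05 00 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 11 84 02 80 04 00 00 00
"""

DUMP_3_1_0 = """
30: 00 00 bc fe 84 00 00 00 00 00 00 00 0b 01 00 00
40: 09 00 10 01 02 00 00 00 00 10 00 00 00 08 00 00
50: 09 40 10 03 02 00 00 00 00 18 00 00 00 08 00 00
60: 09 50 10 04 02 00 00 00 00 20 00 00 00 10 00 00
70: 09 60 14 02 02 00 00 00 00 30 00 00 00 10 00 00
80: 04 00 00 00 09 70 14 05 00 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
"""


def parse_dump(text):
    """Turn 'OFFSET: b0 b1 ...' rows into a 256-byte config space image."""
    cfg = bytearray(256)
    for line in text.strip().splitlines():
        offset, data = line.split(":")
        row = bytes.fromhex(data)
        start = int(offset, 16)
        cfg[start:start + len(row)] = row
    return cfg


def walk_caps(cfg):
    """Return [(offset, capability_id), ...] following the 'next' pointers."""
    chain, seen = [], set()
    ptr = cfg[PCI_CAPABILITY_LIST]
    while ptr and ptr not in seen:  # stop at 0 or on a malformed loop
        seen.add(ptr)
        chain.append((ptr, cfg[ptr]))
        ptr = cfg[ptr + 1]
    return chain


for name, dump in (("5.1.0", DUMP_5_1_0), ("3.1.0", DUMP_3_1_0)):
    caps = walk_caps(parse_dump(dump))
    print(name, [f"{off:#x} (id {cap:#04x})" for off, cap in caps])
```

Both chains are identical from 0x84 onward (five vendor-specific capabilities, id 0x09); the 5.1.0 chain simply has one extra entry in front, id 0x11 (MSI-X) at 0x98, which is what moves the byte at 0x34.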

Comment 6 Dr. David Alan Gilbert 2020-10-22 09:19:30 UTC
This looks to me as if it happened somewhere between 4.0.0 and 4.1.0

Comment 7 Gerd Hoffmann 2020-10-22 14:45:28 UTC
(In reply to Dr. David Alan Gilbert from comment #5)
> It looks like the problem here is MSI-X:

> the 98/84 is the start of the chain, and we've gained an MSI-X capability at
> 98.

Yep.  Seems the default for vectors= changed from 0 to 3.
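
The vector count and the extra BAR4 seen in comment 4 can both be read out of the MSI-X capability bytes at 0x98 in the 5.1.0 dump from comment 5. The following is a small Python sketch that decodes those 12 bytes according to the standard MSI-X capability layout; decode_msix is a name chosen for this example only.

```python
import struct

# 12 bytes starting at offset 0x98 in the 5.1.0 dump (rows "90:" and "a0:").
MSIX_CAP = bytes.fromhex("11 84 02 80 04 00 00 00 04 08 00 00")


def decode_msix(cap):
    """Decode an MSI-X capability: id, next, message control, table, PBA."""
    cap_id, next_ptr, ctrl, table, pba = struct.unpack("<BBHII", cap)
    if cap_id != 0x11:
        raise ValueError("not an MSI-X capability")
    return {
        "next": next_ptr,
        "vectors": (ctrl & 0x07FF) + 1,   # table size is encoded as N - 1
        "function_masked": bool(ctrl & 0x4000),
        "enabled": bool(ctrl & 0x8000),
        "table_bar": table & 0x7,         # low 3 bits select the BAR
        "table_offset": table & ~0x7,
        "pba_bar": pba & 0x7,
        "pba_offset": pba & ~0x7,
    }


info = decode_msix(MSIX_CAP)
for key, value in info.items():
    print(f"{key}: {value}")
```

The decoded values line up with the lspci text ("MSI-X: Enable+ Count=3 Masked-"): three vectors, enabled, not masked, with both the vector table and the PBA living in BAR4, which is the BAR that is absent on 3.1.0.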

Comment 8 Gerd Hoffmann 2020-10-26 12:31:25 UTC
Bisect landed at:

commit c68082c43a3ddeb5e5da4ab401e3f9f422e7a290
Author: Marc-André Lureau <marcandre.lureau>
Date:   Fri May 24 15:09:45 2019 +0200

    virtio-gpu: split virtio-gpu-pci & virtio-vga
    
    Add base classes that are common to vhost-user-gpu-pci and
    vhost-user-vga.
    
    Signed-off-by: Marc-André Lureau <marcandre.lureau>
    Message-id: 20190524130946.31736-9-marcandre.lureau
    Signed-off-by: Gerd Hoffmann <kraxel>

Also, it's not that the vectors default changed.
The vectors property simply isn't there in v4.0.
It isn't obvious why; it should have been there.

Besides that, I don't see an easy way to fix this,
given that both 3.1 (vectors=0) and 4.1 (vectors=3)
are released already.  We can make 5.1 compatible
with one or the other, but not both.

Comment 9 Dr. David Alan Gilbert 2020-10-27 17:40:01 UTC
Yes; so I think we need to close this as 'cantfix' because, as you say, we can fix it one way or the other but not both,
and given the bad choices we may as well keep compatibility with the newer version.
I'm guessing this affects a range of non-AV (!av) versions as well.