Bug 2012026

Summary: [failover vf migration] The failover VF is unregistered if the migration is cancelled while the status is "active"
Product: Red Hat Enterprise Linux 9
Component: qemu-kvm
Sub component: Networking
Version: 9.0
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
Reporter: Yanhui Ma <yama>
Assignee: Laurent Vivier <lvivier>
QA Contact: Yanhui Ma <yama>
Docs Contact:
CC: chayang, coli, jinzhao, juzhang, lvivier, mrezanin, pvlasin, virt-maint, yfu
Keywords: Regression, Triaged
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: qemu-kvm-6.2.0-1.el9
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Cloned As: 2060796 (view as bug list)
Environment:
Last Closed: 2022-05-17 12:24:29 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 2060796

Description Yanhui Ma 2021-10-08 02:51:11 UTC
Description of problem:
The failover VF is hot-unplugged if the migration is cancelled while the migration status is "active".

Version-Release number of selected component (if applicable):

qemu-kvm-6.1.0-3.el9.x86_64
kernel-5.14.0-3.el9.x86_64

How reproducible:

100%
Steps to Reproduce:
1. Create a bridge named br0 based on the PF

nmcli connection add type bridge ifname br0 con-name br0 stp off autoconnect yes
nmcli connection add type bridge-slave ifname "$MAIN_CONN" con-name "$MAIN_CONN" master br0 autoconnect yes
systemctl restart NetworkManager


2. Create a VF from the same PF
# echo 1 > /sys/bus/pci/devices/0000\:d8\:00.0/sriov_numvfs 
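The VF is created by a plain sysfs file write. Below is a minimal sketch of that mechanism run against a mock sysfs tree, so it can execute without SR-IOV hardware (the temp directory stands in for the real `/sys/bus/pci/devices` path, which is an assumption for illustration; on a real host the kernel, not the file, creates the VFs):

```shell
# Sketch of the sriov_numvfs write pattern against a mock sysfs tree.
# Assumption: mock layout only; the real path is
# /sys/bus/pci/devices/0000:d8:00.0/sriov_numvfs on the host.
SYSFS=$(mktemp -d)
mkdir -p "$SYSFS/0000:d8:00.0"
echo 0 > "$SYSFS/0000:d8:00.0/sriov_numvfs"   # no VFs yet
echo 1 > "$SYSFS/0000:d8:00.0/sriov_numvfs"   # request one VF from the PF
cat "$SYSFS/0000:d8:00.0/sriov_numvfs"        # prints 1
```

On real hardware, writing back 0 removes the VFs again, and the current count can always be read from the same file.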


3. Set up the VM networks

# virsh net-dumpxml failover-bridge 
<network>
  <name>failover-bridge</name>
  <uuid>bc8a813f-415d-404f-9996-ba22e27bfea6</uuid>
  <forward mode='bridge'/>
  <bridge name='br0'/>
</network>

# virsh net-dumpxml failover-vf --inactive 
<network>
  <name>failover-vf</name>
  <uuid>8e09aebc-83af-4eda-b72f-e6061c3456a5</uuid>
  <forward mode='hostdev' managed='yes'>
    <pf dev='ens8f0'/>
  </forward>
</network>


4. Start a VM with a failover VF and a virtio-net device

The domain xml:
    <interface type='bridge'>
      <mac address='52:54:11:aa:1c:ef'/>
      <source bridge='br0'/>
      <target dev='vnet0'/>
      <model type='virtio'/>
      <teaming type='persistent'/>
      <alias name='ua-test'/>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </interface>
    <interface type='hostdev' managed='yes'>
      <mac address='52:54:11:aa:1c:ef'/>
      <driver name='vfio'/>
      <source>
        <address type='pci' domain='0x0000' bus='0xd8' slot='0x10' function='0x0'/>
      </source>
      <teaming type='transient' persistent='ua-test'/>
      <alias name='hostdev0'/>
      <address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
    </interface>

The qemu cmd line:
-device virtio-net-pci,failover=on,netdev=hostua-test,id=ua-test,mac=52:54:11:aa:1c:ef,bus=pci.4,addr=0x0 
-device vfio-pci,host=0000:d8:10.0,id=hostdev0,bus=pci.5,addr=0x0,failover_pair_id=ua-test
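The failover pairing in these two -device options rests on one contract: the vfio-pci device's failover_pair_id must equal the id of the virtio-net device that has failover=on. The sketch below checks that contract on the exact option strings above; the option-string parsing is illustrative shell, not a QEMU tool:

```shell
# Illustrative check of the failover pairing contract between the two
# -device option strings above (plain string parsing, not a QEMU utility).
VIRTIO="virtio-net-pci,failover=on,netdev=hostua-test,id=ua-test,mac=52:54:11:aa:1c:ef,bus=pci.4,addr=0x0"
VFIO="vfio-pci,host=0000:d8:10.0,id=hostdev0,bus=pci.5,addr=0x0,failover_pair_id=ua-test"
# Split on commas and pull out the relevant key=value fields.
virtio_id=$(printf '%s\n' "$VIRTIO" | tr ',' '\n' | sed -n 's/^id=//p')
pair_id=$(printf '%s\n' "$VFIO" | tr ',' '\n' | sed -n 's/^failover_pair_id=//p')
# The primary (vfio-pci VF) names its standby (virtio-net) via failover_pair_id.
[ "$virtio_id" = "$pair_id" ] && echo "failover pair matches: $virtio_id"
```

If the two values disagree, QEMU has no way to tie the VF to its standby, and the automatic unplug-before-migration behavior at the center of this bug never engages.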

5. Check the failover device info in the VM

Both the failover virtio-net device and the failover VF exist in the VM:

# dmesg | grep -i failover
[    3.118262] virtio_net virtio2 eth0: failover master:eth0 registered
[    3.125486] virtio_net virtio2 eth0: failover standby slave:eth1 registered
[    7.018876] virtio_net virtio2 enp4s0: failover primary slave:eth0 registered
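The three dmesg lines above are what "both devices exist" means concretely: one failover master plus two registered slaves (the standby virtio-net datapath and the primary VF). A quick sketch of that count, run on the sample lines themselves:

```shell
# Sketch: the healthy state that step 5 verifies, distilled from the dmesg
# lines above - one master and two registered slaves
# (standby = virtio-net datapath, primary = the VF).
DMESG='[    3.118262] virtio_net virtio2 eth0: failover master:eth0 registered
[    3.125486] virtio_net virtio2 eth0: failover standby slave:eth1 registered
[    7.018876] virtio_net virtio2 enp4s0: failover primary slave:eth0 registered'
printf '%s\n' "$DMESG" | grep -c 'slave:.*registered'   # prints 2
```

After step 8 below, the primary slave drops out of this count: the bug is that it stays unregistered even though the migration was cancelled.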

6. Migrate the VM
# virsh migrate --live --verbose $domain qemu+ssh://$target_ip_address/system


7. Cancel the migration while the migration status is "active"

The related command:
# virsh migrate --live --verbose $domain qemu+ssh://$target_ip_address/system
^Cerror: operation aborted: migration out job: canceled by client  <--- press Ctrl+C to cancel the migration

8. Check the failover device info in the VM again

# dmesg 
[  801.443185] pcieport 0000:00:02.4: Slot(0-4): Attention button pressed
[  801.450971] pcieport 0000:00:02.4: Slot(0-4): Powering off due to button press
[  804.216142] pcieport 0000:00:02.4: Slot(0-4): Attention button pressed
[  804.224091] pcieport 0000:00:02.4: Slot(0-4): Button cancel
[  804.228877] pcieport 0000:00:02.4: Slot(0-4): Action canceled due to button press
[  804.234715] pcieport 0000:00:02.4: Slot(0-4): Card not present
[  804.269636] virtio_net virtio2 enp4s0: failover primary slave:enp5s0 unregistered <-- the failover vf has been unregistered

# ifconfig <-- Only the failover virtio-net device exists in the VM at this point
enp4s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.73.33.236  netmask 255.255.254.0  broadcast 10.73.33.255
        inet6 fe80::5197:45ad:a07f:643c  prefixlen 64  scopeid 0x20<link>
        inet6 2620:52:0:4920:a52d:dc6a:d47f:247c  prefixlen 64  scopeid 0x0<global>
        inet6 2001::c5f2:5cb0:5f90:bb17  prefixlen 64  scopeid 0x0<global>
        ether 52:54:11:aa:1c:ef  txqueuelen 1000  (Ethernet)
        RX packets 4582  bytes 340560 (332.5 KiB)
        RX errors 0  dropped 943  overruns 0  frame 0
        TX packets 914  bytes 104119 (101.6 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp4s0nsby: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::88fe:59f7:c9fc:db72  prefixlen 64  scopeid 0x20<link>
        ether 52:54:11:aa:1c:ef  txqueuelen 1000  (Ethernet)
        RX packets 3575  bytes 244274 (238.5 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 466  bytes 54008 (52.7 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Actual results:

The failover VF does not exist in the VM.

Expected results:

The failover VF still exists in the VM.
Additional info:

Comment 3 Yanhui Ma 2021-12-15 09:16:23 UTC
The issue still exists with the following package versions:

[root@dell-per440-22 vfio]# rpm -q qemu-kvm
qemu-kvm-6.1.0-8.el9.x86_64
[root@dell-per440-22 vfio]# uname -r
5.14.0-30.el9.x86_64

Comment 4 Yanhui Ma 2021-12-16 03:09:04 UTC
A win2022 (q35+edk2) guest with the package versions from comment 3 also hits the issue.

Comment 9 Laurent Vivier 2022-02-02 10:31:34 UTC
Moving the BZ to POST as the fix is included in the rebase to 6.2.0.

commit 9323f892b39d133eb69b301484bf7b2f3f49737d
Author: Laurent Vivier <lvivier>
Date:   Thu Nov 18 14:32:23 2021 +0100

    failover: fix unplug pending detection
    
    Failover needs to detect the end of the PCI unplug to start migration
    after the VFIO card has been unplugged.
    
    To do that, a flag is set in pcie_cap_slot_unplug_request_cb() and reset in
    pcie_unplug_device().
    
    But since
        17858a169508 ("hw/acpi/ich9: Set ACPI PCI hot-plug as default on Q35")
    we have switched to ACPI unplug and these functions are not called anymore
    and the flag not set. So failover migration is not able to detect if card
    is really unplugged and acts as it's done as soon as it's started. So it
    doesn't wait the end of the unplug to start the migration. We don't see any
    problem when we test that because ACPI unplug is faster than PCIe native
    hotplug and when the migration really starts the unplug operation is
    already done.
    
    See c000a9bd06ea ("pci: mark device having guest unplug request pending")
        a99c4da9fc2a ("pci: mark devices partially unplugged")
    
    Signed-off-by: Laurent Vivier <lvivier>
    Reviewed-by: Ani Sinha <ani>
    Message-Id: <20211118133225.324937-4-lvivier>
    Reviewed-by: Michael S. Tsirkin <mst>
    Signed-off-by: Michael S. Tsirkin <mst>

Comment 10 Yanan Fu 2022-02-07 06:52:13 UTC
Add 'Verified:Tested,SanityOnly' as gating test with qemu-kvm-6.2.0-1.el9 PASS

Comment 18 Yanhui Ma 2022-03-08 03:18:08 UTC
Finally verified the bug with qemu-kvm-6.2.0-8.el9.x86_64; it works well.

enp4s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.200.132  netmask 255.255.255.0  broadcast 192.168.200.255
        inet6 fe80::7629:599b:a503:e9df  prefixlen 64  scopeid 0x20<link>
        inet6 2001::ca9a:3558:8328:e3e0  prefixlen 64  scopeid 0x0<global>
        ether 52:54:00:aa:1c:ef  txqueuelen 1000  (Ethernet)
        RX packets 133  bytes 17194 (16.7 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 69  bytes 6732 (6.5 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp4s0nsby: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.200.132  netmask 255.255.255.0  broadcast 192.168.200.255
        inet6 fe80::17be:e17c:345e:a239  prefixlen 64  scopeid 0x20<link>
        ether 52:54:00:aa:1c:ef  txqueuelen 1000  (Ethernet)
        RX packets 215  bytes 26934 (26.3 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 53  bytes 5948 (5.8 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp5s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.200.132  netmask 255.255.255.0  broadcast 192.168.200.255
        inet6 fe80::6564:75b3:1b28:8516  prefixlen 64  scopeid 0x20<link>
        ether 52:54:00:aa:1c:ef  txqueuelen 1000  (Ethernet)
        RX packets 5  bytes 462 (462.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 24  bytes 2424 (2.3 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Comment 19 Laurent Vivier 2022-03-08 08:07:25 UTC
(In reply to Yanhui Ma from comment #18)
> Finally verify the bug with qemu-kvm-6.2.0-8.el9.x86_64, it works well.


Could you move the BZ to VERIFIED?

Thanks

Comment 23 Yanhui Ma 2022-03-08 13:32:57 UTC
Based on comment 18, setting the bug to VERIFIED.

Comment 25 errata-xmlrpc 2022-05-17 12:24:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (new packages: qemu-kvm), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:2307