Bug 1927984

Summary: support <teaming> element ("QEMU failover") with plain <hostdev> devices.
Product: Red Hat Enterprise Linux Advanced Virtualization Reporter: Laine Stump <laine>
Component: libvirt Assignee: Laine Stump <laine>
Status: CLOSED ERRATA QA Contact: yalzhang <yalzhang>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 8.3 CC: chhu, fjin, jdenemar, jsuchane, phoracek, virt-maint, yalzhang
Target Milestone: rc Keywords: Triaged
Target Release: 8.4   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: Feature_Enhancement
Fixed In Version: libvirt-7.0.0-7.el8 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-05-25 06:47:28 UTC Type: Feature Request
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Laine Stump 2021-02-12 01:49:21 UTC
Bug 1693587 tracked the addition of support in libvirt for QEMU's "virtio Failover" feature which, when combined with a capable virtio driver in the guest, transparently pairs a virtio-net emulated NIC with an SRIOV VF assigned from the host (via VFIO) into a simple failover bond; qemu then automatically unplugs the VF prior to migration (which is important because a guest with a VFIO assigned device can't be migrated), then plugs in a similar new device on the destination.

The initial support required that the VF be configured in libvirt with "<interface type='hostdev'>", which has the side effect of setting the proper MAC address for the VF via its PF on the host.

Meanwhile, when CNV looked at implementing VFIO assignment of VFs, they found that they were unable to use <interface type='hostdev'>, because they run libvirtd inside a container that doesn't have access to the PF - the libvirtd in the container can only see the VF, and so is unable to initialize its MAC address. So they added VF MAC configuration to their own code that runs outside the container.

Then they began implementing migration for guests with a VFIO-assigned VF. Since a guest with a VFIO-assigned device can't be migrated, this means that they must complicate their migration code by unplugging the VF, then migrating, and then plugging in a different VF on the destination host. On top of that, in order for guest network connectivity to be maintained during the migration, they must rely on the user to add an additional emulated NIC to the guest, and configure some sort of failover bond device in the guest OS' network config.

As an alternative, I suggested that I could enhance libvirt to make <teaming> usable within their limitations - that is the purpose of the patches described in this BZ.

It's quite simple - libvirt's parsing/formatting of the <teaming> element in <interface> is extracted into its own functions, a teaming element is added to <hostdev>, and the requisite functions are called by the parser/formatter for <hostdev>, as well as pointing the QEMU driver at the right place when building the vfio-pci device commandline.
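
As a rough illustration of what "the right place" means on the QEMU side, here is a minimal sketch of the two -device values involved. It is not libvirt's actual command-line builder, and the real command line carries more properties (bus, addr, etc.). It relies on QEMU's failover=on property for virtio-net-pci and failover_pair_id for vfio-pci; the netdev name "hostnet0" and both helper function names are assumptions made for the example.

```python
def virtio_net_device_arg(dev_id: str, netdev: str, mac: str) -> str:
    """Build the -device value for the persistent virtio-net NIC."""
    # failover=on marks this NIC as the persistent half of the pair.
    return f"virtio-net-pci,netdev={netdev},id={dev_id},mac={mac},failover=on"


def vfio_device_arg(host_addr: str, dev_id: str, pair_id: str) -> str:
    """Build the -device value for the transient VF."""
    # failover_pair_id must name the id of the virtio-net device above.
    return f"vfio-pci,host={host_addr},id={dev_id},failover_pair_id={pair_id}"


if __name__ == "__main__":
    # "hostnet0" is a hypothetical netdev id, used only for illustration.
    print("-device", virtio_net_device_arg("ua-backup0", "hostnet0", "52:54:00:96:a4:f1"))
    print("-device", vfio_device_arg("0000:82:10.1", "hostdev0", "ua-backup0"))
```

The point of the sketch is only that the alias given in <teaming persistent='...'/> ends up as the pairing id on the vfio-pci device.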

As a result, as long as the guest OS has a recent version of the virtio-net guest driver (there are versions for both Windows and Linux), CNV will be able to just specify one <interface type='bridge'> (for the virtio "persistent" device) and one <hostdev> (for the VF) in the libvirt config, set the MAC address of the VF, then start up the guest; when it is time to migrate, they won't need to do any hotplugging (which has caused difficulties for them already due to libvirtd being run unprivileged and containerized, see Bug 1916346).

In the end CNV may end up not using <teaming> for their assigned VFs. But if they don't, at least it won't be because of this one simple missing link.
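
For a management layer that generates this config itself, a small sanity check of the pairing might look like the sketch below. It is an illustration only, not part of libvirt: the function name is hypothetical, and it assumes the devices are wrapped in a single root element (for example <devices>) and that each persistent interface carries an explicit <alias>.

```python
import xml.etree.ElementTree as ET
from typing import List


def unpaired_transient_devices(devices_xml: str) -> List[str]:
    """Return aliases referenced by transient devices that no
    <interface> with <teaming type='persistent'/> provides."""
    root = ET.fromstring(devices_xml)

    # Collect the aliases of all persistent (virtio) interfaces.
    persistent = set()
    for iface in root.iter("interface"):
        teaming = iface.find("teaming")
        alias = iface.find("alias")
        if (teaming is not None and teaming.get("type") == "persistent"
                and alias is not None):
            persistent.add(alias.get("name"))

    # Every transient device must point at one of those aliases.
    missing = []
    for dev in list(root.iter("hostdev")) + list(root.iter("interface")):
        teaming = dev.find("teaming")
        if teaming is not None and teaming.get("type") == "transient":
            target = teaming.get("persistent", "")
            if target not in persistent:
                missing.append(target)
    return missing
```

An empty list means every transient device references an existing persistent interface.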

Comment 1 Laine Stump 2021-02-12 01:52:29 UTC
Patches pushed upstream to support this. They will be in libvirt-7.1.0:

commit 5d74e2f168d69541038896925b08f09807a1fa39
Author: Laine Stump <laine>
Date:   Wed Feb 10 20:08:29 2021 -0500

    conf: make teaming info an official type
    
commit 13be68094d47693fd5346d45612d05de425e2529
Author: Laine Stump <laine>
Date:   Wed Feb 10 21:09:58 2021 -0500

    conf: use virDomainNetTeamingInfoPtr instead of virDomainNetTeamingInfo
    
commit dea27109119da13e8ed0c564edd7796d98bb795c
Author: Laine Stump <laine>
Date:   Wed Feb 10 22:44:08 2021 -0500

    conf: separate Parse/Format functions for virDomainNetTeamingInfo
    
commit 5cea59b2b3cbe4218d8311da177de95403c10980
Author: Laine Stump <laine>
Date:   Wed Feb 10 22:59:31 2021 -0500

    schema: separate teaming element definition from interface element
    
commit db64acfbda59ad22b671580fda13968c60bb8c1a
Author: Laine Stump <laine>
Date:   Thu Feb 11 00:58:29 2021 -0500

    conf: parse/format <teaming> element in plain <hostdev>
    
commit 010ed0856bb06f439e6fdf44e4f529f53441c398
Author: Laine Stump <laine>
Date:   Thu Feb 11 02:05:15 2021 -0500

    qemu: plug <teaming> config from <hostdev> into qemu commandline
    
commit bebaafd6b4a54b35f0d6676ab9156ea1489cbf5e (HEAD -> master, upstream/master, active-detach-compare-alias)
Author: Laine Stump <laine>
Date:   Thu Feb 11 02:47:29 2021 -0500

    news: document support for <teaming> in <hostdev>

Comment 6 yalzhang@redhat.com 2021-02-23 13:53:28 UTC
Hi laine, 

Does it support live migration? If yes, we should ensure there is a VF with the same PCI address on the target host. I tried to migrate, and it failed with:
# virsh migrate rhel --live --verbose qemu+ssh://dell-xx.lab.eng.pek2.redhat.com/system 
root.eng.pek2.redhat.com's password: 
error: Operation not supported: cannot migrate a domain with <hostdev mode='subsystem' type='pci'>

The configuration is as below:
<interface type='bridge'>
      <mac address='52:54:00:aa:1c:ef'/>
      <source network='host-bridge' portid='34e111b8-1b34-407e-8e25-b52b6e7d8c54' bridge='br0'/>
      <target dev='vnet0'/>
      <model type='virtio'/>
      <teaming type='persistent'/>
      <alias name='ua-backup0'/>
      <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </interface>
 <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x82' slot='0x10' function='0x1'/>
      </source>
      <teaming type='transient' persistent='ua-backup0'/>
      <alias name='hostdev0'/>
      <address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
    </hostdev>

And when I checked the doc, I found:
[PATCH 5/7] conf: parse/format <teaming> element in plain <hostdev>

+     <hostdev mode='subsystem' type='pci' managed='no'>
+       <source>
+         <address domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
+       </source>
+       <mac address='00:11:22:33:44:55:66'/>
+       <teaming type='transient' persistent='ua-backup0'/>
+     </interface>

The MAC address cannot be set in a hostdev device; I suggest removing the mac address line here.
and
s/</interface>/</hostdev>
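
Both problems with the documentation example can be caught mechanically. The following is only a sketch, not a libvirt tool: the function name is hypothetical, it expects the snippet without the leading "+" diff markers, and the <mac> rule simply encodes the observation above rather than the full libvirt schema.

```python
import xml.etree.ElementTree as ET
from typing import List


def check_hostdev_snippet(snippet: str) -> List[str]:
    """Return a list of problems found in a <hostdev> example."""
    try:
        root = ET.fromstring(snippet)
    except ET.ParseError as err:
        # A mismatched closing tag such as </interface> lands here.
        return [f"not well-formed: {err}"]

    problems = []
    for hostdev in root.iter("hostdev"):
        if hostdev.find("mac") is not None:
            problems.append("<mac> cannot be set in a <hostdev> device")
    return problems
```

Run against the example from the patch, the mismatched closing tag is reported first; once that is fixed, the stray <mac> line is reported.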

Comment 7 Laine Stump 2021-02-23 15:47:02 UTC
(In reply to yalzhang from comment #6)
> Hi laine, 
> 
> Does it support live migration? if yes, we should ensure there is vf with
> the same pci address on the target host.

Yes, it should. The complicated part is that, since you don't have the abstraction of VF pools provided by <interface type='network'>, you have to either (as you suggest) have identical hardware on the destination host, or you have to put hooks into the migration to modify the XML during the migration to change the PCI address of the VF.
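
A hook of that kind could be as small as the sketch below, which rewrites the host-side PCI address of the transient <hostdev> in a dumped domain XML. This is an illustration under assumptions, not an existing libvirt facility: the function name is hypothetical, and addresses are compared as the literal strings libvirt prints (e.g. '0x82').

```python
import xml.etree.ElementTree as ET

ADDR_KEYS = ("domain", "bus", "slot", "function")


def retarget_vf(domain_xml: str, old: dict, new: dict) -> str:
    """Replace the source PCI address of transient <hostdev> devices
    matching `old` with `new`, and return the modified XML."""
    root = ET.fromstring(domain_xml)
    changed = 0
    for hostdev in root.iter("hostdev"):
        teaming = hostdev.find("teaming")
        # source/address is the host-side VF address; the guest-side
        # <address type='pci'> is a direct child and is left untouched.
        addr = hostdev.find("source/address")
        if teaming is None or teaming.get("type") != "transient" or addr is None:
            continue
        if all(addr.get(k) == old[k] for k in ADDR_KEYS):
            for k in ADDR_KEYS:
                addr.set(k, new[k])
            changed += 1
    if changed == 0:
        raise ValueError("no transient <hostdev> with the given source address")
    return ET.tostring(root, encoding="unicode")
```

The result would be written to a file and handed to the migration, for example through virsh migrate's --xml option.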


> I have tried to migrate, it failed
> with:
> # virsh migrate rhel --live --verbose
> qemu+ssh://dell-xx.lab.eng.pek2.redhat.com/system 
> root.eng.pek2.redhat.com's password: 
> error: Operation not supported: cannot migrate a domain with <hostdev
> mode='subsystem' type='pci'>

Ooh. Oops! I need to fix that! (I hadn't tested migration because my secondary migration host is currently out of commission.) There is one check I forgot to add. I will make that patch, get it pushed upstream and backported, then move this BZ back to POST.


> 
> The configuration is as below:
> <interface type='bridge'>
>       <mac address='52:54:00:aa:1c:ef'/>
>       <source network='host-bridge'
> portid='34e111b8-1b34-407e-8e25-b52b6e7d8c54' bridge='br0'/>
>       <target dev='vnet0'/>
>       <model type='virtio'/>
>       <teaming type='persistent'/>
>       <alias name='ua-backup0'/>
>       <address type='pci' domain='0x0000' bus='0x01' slot='0x00'
> function='0x0'/>
>     </interface>
>  <hostdev mode='subsystem' type='pci' managed='yes'>
>       <driver name='vfio'/>
>       <source>
>         <address domain='0x0000' bus='0x82' slot='0x10' function='0x1'/>
>       </source>
>       <teaming type='transient' persistent='ua-backup0'/>
>       <alias name='hostdev0'/>
>       <address type='pci' domain='0x0000' bus='0x05' slot='0x00'
> function='0x0'/>
>     </hostdev>
> 
> And when checked the doc, I found:
> [PATCH 5/7] conf: parse/format <teaming> element in plain <hostdev>
> 
> +     <hostdev mode='subsystem' type='pci' managed='no'>
> +       <source>
> +         <address domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
> +       </source>
> +       <mac address='00:11:22:33:44:55:66'/>
> +       <teaming type='transient' persistent='ua-backup0'/>
> +     </interface>
> 
> The mac address can not be set in hostdev device, suggest removing the mac
> address line here.
> and
> s/</interface>/</hostdev>

Sigh. Correct on both points. Thanks for the proof-reading. I cut-pasted the documentation example at the last minute and was too hasty in committing.

Comment 8 Laine Stump 2021-02-24 03:46:29 UTC
Corrections for both problems listed in Comment 6 have been posted upstream:

https://listman.redhat.com/archives/libvir-list/2021-February/msg01133.html

Comment 9 Laine Stump 2021-02-24 23:23:18 UTC
Fixes for bugs found by Yalzhang in Comment 6 pushed upstream, will be in 7.1.0:

commit 98e67d4d8ca933b5fa2ec4fbc35dfe7cd8b1547b
Author: Laine Stump <laine>
Date:   Tue Feb 23 17:21:56 2021 -0500

    qemu: allow migration of generic <hostdev> with <teaming>
    
commit a0cef16787930c810263f1edd057e038cb6406e3 (HEAD -> master, upstream/master)
Author: Laine Stump <laine>
Date:   Tue Feb 23 17:30:51 2021 -0500

    docs: fix bad cut/paste in <teaming> example

Comment 11 yalzhang@redhat.com 2021-03-04 05:54:15 UTC
Tested on libvirt-7.0.0-7.module+el8.4.0+10195+258bbfb7.x86_64, the result is as expected.

In addition, if the hostdev device on the target host has a different PCI address from the one on the source host, we can migrate using the "--xml" option with a modified XML attached. Besides the MAC address, vlan and virtualport also cannot be set in a hostdev device.

The scenarios are as below:
1. Start the VM with a hostdev device with the teaming setting, and check network functionality;
2. Start the VM without any interface, then hotplug the bridge interface and the hostdev device, check network functionality, then unplug the hostdev device;
3. Migrate the VM with a hostdev device with the teaming setting; this succeeds.

Details:
The hostdev device is one of the VFs.
S1: start vm
1. Set the MAC address of the VF, then start the VM with the interfaces as below:
# ip link set enp130s0f1 vf 0 mac 52:54:00:96:a4:f1
Interface setting:
<interface type='bridge'>
      <mac address='52:54:00:96:a4:f1'/>
      <source network='host-bridge' portid='ac87b686-47bc-4d21-ad6c-2572b0f15776' bridge='br0'/>
      <target dev='vnet0'/>
      <model type='virtio'/>
      <teaming type='persistent'/>
      <alias name='ua-backup0'/>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </interface>
<hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x82' slot='0x10' function='0x1'/>
      </source>
      <teaming type='transient' persistent='ua-backup0'/>
      <alias name='hostdev0'/>
      <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </hostdev>
# virsh start rh
Domain 'rh' started

Log in to the VM to check network functionality:
[root@bootp-73-33-113 ~]# ifconfig -a
enp1s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.73.33.157  netmask 255.255.254.0  broadcast 10.73.33.255
        inet6 fe80::bdc1:4a2b:3428:6dde  prefixlen 64  scopeid 0x20<link>
        ether 52:54:00:96:a4:f1  txqueuelen 1000  (Ethernet)
        RX packets 89  bytes 18707 (18.2 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 140  bytes 17468 (17.0 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp4s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.73.33.113  netmask 255.255.254.0  broadcast 10.73.33.255
        inet6 2620:52:0:4920:383:2ff8:1058:6fb2  prefixlen 64  scopeid 0x0<global>
        inet6 fe80::6c6c:f483:f38b:ed80  prefixlen 64  scopeid 0x20<link>
        ether 52:54:00:96:a4:f1  txqueuelen 1000  (Ethernet)
        RX packets 245  bytes 32325 (31.5 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 171  bytes 23330 (22.7 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp4s0nsby: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::e89b:2415:cc11:54bc  prefixlen 64  scopeid 0x20<link>
        ether 52:54:00:96:a4:f1  txqueuelen 1000  (Ethernet)
        RX packets 156  bytes 13618 (13.2 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 31  bytes 5862 (5.7 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

[root@bootp-73-33-113 ~]# ping www.baidu.com -c 2
PING www.a.shifen.com (220.181.38.150) 56(84) bytes of data.
64 bytes from 220.181.38.150 (220.181.38.150): icmp_seq=1 ttl=46 time=3.33 ms
64 bytes from 220.181.38.150 (220.181.38.150): icmp_seq=2 ttl=46 time=3.39 ms
....

S2: hotplug/hotunplug
Start a VM without any interfaces as below:
# virsh start rh 
Domain 'rh' started
Then do the hotplug:
# cat bridge_interface.xml 
<interface type='network'>
      <mac address='52:54:00:96:a4:f1'/>
      <source network='host-bridge'/>
      <model type='virtio'/>
      <teaming type='persistent'/>
      <alias name='ua-backup0'/>
    </interface>
Set the MAC of the VF, then prepare the XML:
# ip link set enp130s0f1 vf 0 mac 52:54:00:96:a4:f1
# cat hostdev.xml
<hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x82' slot='0x10' function='0x1'/>
      </source>
      <teaming type='transient' persistent='ua-backup0'/>
    </hostdev>
# virsh attach-device rh bridge_interface.xml 
Device attached successfully

# virsh attach-device rh hostdev.xml 
Device attached successfully
Log in to the VM to check the functionality:
# ping www.baidu.com  -c 2
PING www.a.shifen.com (220.181.38.150) 56(84) bytes of data.
64 bytes from 220.181.38.150 (220.181.38.150): icmp_seq=1 ttl=46 time=3.35 ms
64 bytes from 220.181.38.150 (220.181.38.150): icmp_seq=2 ttl=46 time=3.35 ms
...

Hotunplug:
Keep the ping running in the guest, and unplug the hostdev device with:
# cat hostdev.xml
<hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x82' slot='0x10' function='0x1'/>
      </source>
      <teaming type='transient' persistent='ua-backup0'/>
    </hostdev>
# virsh detach-device rh hostdev.xml
Device detached successfully

Check that the ping still works in the VM after the unplug.

S3: migration
1. Prepare the network, enable VFs on the target host, and set the VF MAC address.
If the VF's PCI address on the target host differs from that of the src hostdev device, modify the PCI address in a copy of the XML dumped on the src host:

# virsh dumpxml rh > rh_migrate.xml
# vim rh_migrate.xml
(edit the pci address of the vf in the hostdev device element)

# cat  rh_migrate.xml  | grep /hostdev -B6
      <source>
        <address domain='0x0000' bus='0x04' slot='0x10' function='0x0'/>
      </source>
…

# virsh migrate rh --live --verbose --p2p qemu+ssh://dell-per730-37.lab.eng.pek2.redhat.com/system --xml rh_migrate.xml
Migration: [100 %]

The migration succeeds; network functionality on the VM was checked and works well.

Migrated back to the src host and checked network functionality; it works well.

Comment 12 yalzhang@redhat.com 2021-03-04 06:31:33 UTC
Also tested with a PF, but without a network functionality check on the VM, due to hardware environment limitations.

Comment 13 yalzhang@redhat.com 2021-03-05 12:44:40 UTC
More scenarios were tested as below; the results are as expected except for migration with postcopy. After migration with postcopy, the hostdev device no longer exists in the VM. I think it is the same issue as Bug 1817965, so I will add a comment there and track the issue there.

1. When there is no PCI device with the same address on the target host:
# virsh migrate rh --live --verbose qemu+ssh://dell-per730-58.lab.eng.pek2.redhat.com/system --p2p
error: Device 0000:82:10.0 not found: could not access /sys/bus/pci/devices/0000:82:10.0/config: No such file or directory

2. Cancel the migration, then check the VM's status:
#  virsh migrate rh --live --verbose qemu+ssh://dell-per730-58.lab.eng.pek2.redhat.com/system --p2p --xml rh_migrate.xml
Migration: [ 65 %]^Cerror: operation aborted: migration out job: canceled by client
Checked on the VM: the hostdev device is reattached and the network works well.

3. Migrate with postcopy:
On first terminal:
#  virsh migrate rh --live --verbose qemu+ssh://dell-per730-58.lab.eng.pek2.redhat.com/system --p2p --xml rh_migrate.xml  --bandwidth 10 --postcopy
 
On the 2nd terminal:
# virsh event --all --loop

While the migration is running, execute this command on the 3rd terminal:
# virsh migrate-postcopy rh

Check that the migration succeeds on the first terminal,
and that there is a postcopy event on the 2nd terminal:
# virsh event --all --loop
event 'migration-iteration' for domain 'rh': iteration: '1'
event 'lifecycle' for domain 'rh': Suspended Migrated
event 'lifecycle' for domain 'rh': Suspended Post-copy
event 'migration-iteration' for domain 'rh': iteration: '2'
event 'lifecycle' for domain 'rh': Stopped Migrated
event 'job-completed' for domain 'rh':
…
After migrating to the dst host, the hostdev interface no longer exists in the VM:
[root@bootp-73-33-220 ~]# lspci | grep Eth
01:00.0 Ethernet controller: Red Hat, Inc. Virtio network device (rev 01)
[root@bootp-73-33-220 ~]# ifconfig -a
enp1s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.73.33.220  netmask 255.255.254.0  broadcast 10.73.33.255
        inet6 2620:52:0:4920:8a2c:4e3b:d37b:1c86  prefixlen 64  scopeid 0x0<global>
        inet6 fe80::d0e6:cf2e:3c5d:399b  prefixlen 64  scopeid 0x20<link>
        ether 52:54:00:aa:1c:ef  txqueuelen 1000  (Ethernet)
        RX packets 207  bytes 18457 (18.0 KiB)
        RX errors 0  dropped 92  overruns 0  frame 0
        TX packets 148  bytes 16424 (16.0 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp1s0nsby: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        ether 52:54:00:aa:1c:ef  txqueuelen 1000  (Ethernet)
        RX packets 297  bytes 24920 (24.3 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 144  bytes 17582 (17.1 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
…
Check on the dst host:
# readlink -s /sys/bus/pci/devices/0000:05:10.1/driver
../../../../bus/pci/drivers/vfio-pci
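
The same sysfs check can be scripted when it has to be repeated across hosts. This is a minimal sketch with a hypothetical function name; the sysfs_root parameter exists only so the function can be exercised against a fake tree and would normally stay at its default.

```python
import os
from typing import Optional


def bound_driver(pci_addr: str, sysfs_root: str = "/sys") -> Optional[str]:
    """Return the name of the driver bound to a PCI device, or None
    if the device is missing or has no driver bound."""
    link = os.path.join(sysfs_root, "bus", "pci", "devices", pci_addr, "driver")
    if not os.path.islink(link):
        return None
    # The link target ends in .../drivers/<driver name>.
    return os.path.basename(os.readlink(link))
```

For the device above, bound_driver("0000:05:10.1") would be expected to return "vfio-pci" while the VF is still assigned to the guest.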

Comment 14 yalzhang@redhat.com 2021-03-05 12:57:27 UTC
Tested managedsave and save; they never finish, just as in bug 1815426. Will track the issue there.
terminal 1:
[root@dell-per730-36 ~]# virsh save rh rh.save
^Cerror: Failed to save domain 'rh' to rh.save
error: operation aborted: domain save job: canceled by client


terminal 2:
On the VM, nothing changes while the save/managedsave command runs, until the save/managedsave is canceled. Once the save/managedsave is canceled, the hostdev interface is unregistered from the VM's OS, just as in bug 1815426.
 
[root@bootp-73-33-220 ~]# ping www.baidu.com
PING www.a.shifen.com (220.181.38.149) 56(84) bytes of data.
64 bytes from 220.181.38.149 (220.181.38.149): icmp_seq=1 ttl=46 time=3.68 ms
64 bytes from 220.181.38.149 (220.181.38.149): icmp_seq=2 ttl=46 time=3.65 ms
64 bytes from 220.181.38.149 (220.181.38.149): icmp_seq=4 ttl=46 time=3.64 ms
64 bytes from 220.181.38.149 (220.181.38.149): icmp_seq=5 ttl=46 time=3.62 ms
[   41.034476] pcieport 0000:00:02.3: Slot(0-3): Attention button pressed
[   41.035900] pcieport 0000:00:02.3: Slot(0-3): Powering off due to button press
[   41.037421] pcieport 0000:00:02.3: Slot(0-3): Card not present
[   41.051917] virtio_net virtio0 enp1s0: failover primary slave:enp4s0 unregistered
From bootp-73-33-220.lab.eng.pek2.redhat.com (10.73.33.220) icmp_seq=19 Destination Host Unreachable

Comment 16 errata-xmlrpc 2021-05-25 06:47:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (virt:av bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2098