Bug 1874096 - SR-IOV : NIC is lost after rebooting the vm and ip address/mac goes into unstable state
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Networking
Version: 2.4.1
Hardware: Unspecified
OS: Unspecified
Severity: urgent
Priority: urgent
Target Milestone: ---
Target Release: 2.5.0
Assignee: Edward Haas
QA Contact: Geetika Kapoor
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-08-31 14:41 UTC by Geetika Kapoor
Modified: 2020-10-16 08:30 UTC
CC: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-04 08:47:00 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
- Bridge_before_after_reboot (4.91 KB, text/plain), 2020-08-31 17:04 UTC, Geetika Kapoor
- sriov_before_after_reboot (4.32 KB, text/plain), 2020-08-31 17:06 UTC, Geetika Kapoor
- Host side domxml and lspci (28.46 KB, text/plain), 2020-09-10 07:11 UTC, Edward Haas
- Guest side lspci and dmesg (63.74 KB, text/plain), 2020-09-10 07:11 UTC, Edward Haas

Description Geetika Kapoor 2020-08-31 14:41:51 UTC
Description of problem:

When a VM is set up with SR-IOV, we see that after a reboot the MAC/IP addresses of the NICs get mixed up and NICs are lost.
This is a serious concern.

Attaching the complete ip logs of the VM before/after the reboot.


Version-Release number of selected component (if applicable):


# oc get csv -n openshift-cnv | awk ' { print $4 } ' | tail -n1
2.4.1


How reproducible:

always

Steps to Reproduce:
1. Reboot the VM. If the problem does not appear on the first reboot, reboot again; it will reliably show up.

Actual results:

The NIC, IP address, and MAC address changed.


Expected results:

The NIC, IP address, and MAC address should remain intact after a reboot. This could badly break customer and test environments.


Additional info:


[fedora@sriov-vm-dpdk ~]$ ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 10.0.2.2  netmask 255.255.255.0  broadcast 10.0.2.255
        inet6 fe80::57:92ff:fe00:4  prefixlen 64  scopeid 0x20<link>
        ether 02:57:92:00:00:04  txqueuelen 1000  (Ethernet)
        RX packets 50  bytes 6235 (6.0 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 41  bytes 4060 (3.9 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.200.3.3  netmask 255.255.255.0  broadcast 10.200.3.255
        inet6 fe80::b5ff:feb5:b5fb  prefixlen 64  scopeid 0x20<link>
        ether 02:00:b5:b5:b5:fb  txqueuelen 1000  (Ethernet)
        RX packets 1  bytes 346 (346.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 5  bytes 709 (709.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

[fedora@sriov-vm-dpdk ~]$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc fq_codel state UP group default qlen 1000
    link/ether 02:57:92:00:00:04 brd ff:ff:ff:ff:ff:ff
    altname enp3s0
    inet 10.0.2.2/24 brd 10.0.2.255 scope global dynamic noprefixroute eth0
       valid_lft 86313463sec preferred_lft 86313463sec
    inet6 fe80::57:92ff:fe00:4/64 scope link 
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 02:00:b5:b5:b5:fb brd ff:ff:ff:ff:ff:ff
    altname enp2s1
    altname ens1
    inet 10.200.3.3/24 brd 10.200.3.255 scope global noprefixroute eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::b5ff:feb5:b5fb/64 scope link 
       valid_lft forever preferred_lft forever





reboot machine

After reboot :


eth0: 10.200.3.3 fe80::b5ff:feb5:b5fb
sriov-vm-dpdk login: fedora
Password: 
Last login: Mon Aug 31 14:30:04 on ttyS0
[fedora@sriov-vm-dpdk ~]$ ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.200.3.3  netmask 255.255.255.0  broadcast 10.200.3.255
        inet6 fe80::b5ff:feb5:b5fb  prefixlen 64  scopeid 0x20<link>
        ether 02:00:b5:b5:b5:fb  txqueuelen 1000  (Ethernet)
        RX packets 1  bytes 346 (346.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 2  bytes 431 (431.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0



[fedora@sriov-vm-dpdk ~]$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 02:00:b5:b5:b5:fb brd ff:ff:ff:ff:ff:ff
    altname enp2s1
    altname ens1
    inet 10.200.3.3/24 brd 10.200.3.255 scope global noprefixroute eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::b5ff:feb5:b5fb/64 scope link 
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN group default qlen 1000
    link/ether 02:57:92:00:00:04 brd ff:ff:ff:ff:ff:ff
    altname enp3s0

Comment 1 Geetika Kapoor 2020-08-31 17:04:49 UTC
Created attachment 1713193 [details]
Bridge_before_after_reboot

Comment 2 Geetika Kapoor 2020-08-31 17:06:56 UTC
Created attachment 1713194 [details]
sriov_before_after_reboot

Comment 6 Edward Haas 2020-09-10 07:11:11 UTC
Created attachment 1714382 [details]
Host side domxml and lspci

Comment 7 Edward Haas 2020-09-10 07:11:49 UTC
Created attachment 1714383 [details]
Guest side lspci and dmesg

Comment 8 Edward Haas 2020-09-10 07:12:19 UTC
Comparing the domxml with the actual VM PCI devices, there seems to be a problem with the PCI addresses.
They are out of sync for some reason, even on the first boot when everything appears fine (e.g. the interface is present).

Attaching the relevant information from the host (virt-launcher, `virsh dumpxml 1`) and the guest (from the VM, `lspci`).

Comment 10 Laine Stump 2020-09-22 20:46:53 UTC
Just a short recap from discussions with Edward on IRC:

1) The observed mismatch in bus numbers between the libvirt XML and guest lspci output is because the "bus number" in libvirt is valid/consistent only within the libvirt config. qemu itself doesn't provide a way to explicitly specify a bus number for any given PCI controller, because it has no way of enforcing such a request - the bus number of any given PCI controller as reported by any guest OS is completely up to the guest OS; it usually depends on the order in which the controllers are probed, but that's not anything that's enforced, and anyway the guest OS could probe controllers in any order it wanted to.

2) Note that the two devices do have the same PCI address from one boot to the next, and the systemd-provided "predictable" device names remain consistent (as shown by the "altname" lines in the `ip a` output): The device with MAC 02:00:b5:b5:b5:fb is "enp2s1" both times, and the device with MAC 02:57:92:00:00:04 is "enp3s0" both times.

3) But for some reason, this guest has been configured to use traditional ethN network interface names instead of systemd-provided "predictable network interface names". Use of ethN device names has proven problematic (i.e. "unpredictable and prone to change order from one boot to the next on identical hardware") over many years, which is why systemd provides the "predictable" names. Here is a document that explains the problem, and why systemd came to the conclusion that it's not possible to have predictable names using the ethN naming scheme (unless you only have a single ethernet device):

  https://www.freedesktop.org/wiki/Software/systemd/PredictableNetworkInterfaceNames/

IMO, the proper solution to this is to *not* use ethN network interface names when there is more than a single network device.
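Point 2 above can be checked mechanically against the logs captured in this BZ. A small sketch (the MAC/altname pairs are copied from the `ip a` output earlier in this bug):

```shell
# MAC -> altname pairs taken from the `ip a` output in this BZ,
# before and after the reboot (one "mac=altname" entry per line).
before="02:57:92:00:00:04=enp3s0
02:00:b5:b5:b5:fb=enp2s1"
after="02:00:b5:b5:b5:fb=enp2s1
02:57:92:00:00:04=enp3s0"

# Sorting both lists shows the MAC <-> altname mapping is identical across
# the reboot, even though the ethN index attached to each MAC flipped.
[ "$(printf '%s\n' "$before" | sort)" = "$(printf '%s\n' "$after" | sort)" ] \
  && echo "MAC/altname mapping is stable across the reboot"
```

Only the kernel-assigned ethN indexes moved; the systemd-derived names stayed pinned to the PCI addresses.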

Comment 11 Vladik Romanovsky 2020-09-23 12:33:40 UTC
(In reply to Laine Stump from comment #10)
> Just a short recap from discussions with Edward on IRC:
> 
> 1) The observed mismatch in bus numbers between the libvirt XML and guest
> lspci output is because the "bus number" in libvirt is valid/consistent only
> within the libvirt config. qemu itself doesn't provide a way to explicitly
> specify a bus number for any given PCI controller, because it has no way of
> enforcing such a request - the bus number of any given PCI controller as
> reported by any guest OS is completely up to the guest OS; it usually
> depends on the order in which the controllers are probed, but that's not
> anything that's enforced, and anyway the guest OS could probe controllers in
> any order it wanted to.

That's a bit surprising to me.

Daniel, doesn't this statement invalidate the assumption in Device role tagging?
[1] https://specs.openstack.org/openstack/nova-specs/specs/mitaka/approved/virt-device-role-tagging.html


> 
> 2) Note that the two devices do have the same PCI address from one boot to
> the next, and the systemd-provided "predictable"  device names remain
> consistent (as shown by "altname" in the ifconfig output): The device with
> MAC 02:00:b5:b5:b5:fb is "enp2s1" both times, and the device with MAC
> 02:57:92:00:00:04 is "enp3s0" both times.
> 
> 3) But for some reason, this guest has been configured to use traditional
> ethN network interface names instead of systemd-provided "predictable
> network interface names". Use of ethN device names has proven problematic
> (i.e. "unpredictable and prone to change order from one boot to the next on
> identical hardware") over many years, which is why systemd provides the
> "predictable" names. Here is a document that explains the problem, and why
> systemd came to the conclusion that it's not possible to have predictable
> names using the ethN naming scheme (unless you only have a single ethernet
> device):
> 
>  
> https://www.freedesktop.org/wiki/Software/systemd/
> PredictableNetworkInterfaceNames/
> 
> IMO, the proper solution to this is to *not* use ethN network interface
> names when there is more than a single network device.

Comment 12 Daniel Berrangé 2020-09-23 12:49:35 UTC
(In reply to Vladik Romanovsky from comment #11)
> (In reply to Laine Stump from comment #10)
> > Just a short recap from discussions with Edward on IRC:
> > 
> > 1) The observed mismatch in bus numbers between the libvirt XML and guest
> > lspci output is because the "bus number" in libvirt is valid/consistent only
> > within the libvirt config. qemu itself doesn't provide a way to explicitly
> > specify a bus number for any given PCI controller, because it has no way of
> > enforcing such a request - the bus number of any given PCI controller as
> > reported by any guest OS is completely up to the guest OS; it usually
> > depends on the order in which the controllers are probed, but that's not
> > anything that's enforced, and anyway the guest OS could probe controllers in
> > any order it wanted to.
> 
> That's a bit surprising to me.
> 
> Daniel, doesn't this statement invalidates the assumption in Device role
> tagging?
> [1]
> https://specs.openstack.org/openstack/nova-specs/specs/mitaka/approved/virt-
> device-role-tagging.html

Yes, it is a bit more complex than described in that spec. The bus numbers in libvirt XML and the guest are not a 1-1 match as originally supposed. In simple PCI topologies we can make some simplifying assumptions and it'll still work. For more complex topologies we need extra information to reliably correlate.

There is actually a separate "bus_nr" setting in QEMU on some of their PCI bridges that is in turn exposed to the guest OS. If libvirt supported setting that field, then the host/guest correlation could be made more reliable.

Comment 13 Laine Stump 2020-09-23 16:56:34 UTC
> There is actually a separate "bus_nr" setting in QEMU against some of the
> their PCI bridges that is in turn exposed to the guest OS. If libvirt
> supported setting that field, then the host/guest correlation can be made
> more reliable.

The only PCI controllers in qemu that have a "bus_nr" setting are pci-expander-bus and pcie-expander-bus, and libvirt already supports setting it (via <target busNr='n'/> in the controller XML). But that isn't the purpose of bus_nr for those controllers. bus_nr is set to split the bus numbering space (an 8-bit value) between the pcie-root bus and its descendants, and the pcie-expander-bus and *its* descendants. So, for example, if you set <target busNr='100'/> in the definition of the pcie-expander-bus, then bus numbers 1-99 will be available for descendants of the root bus, and bus numbers 100-255 will be available for descendants of the pcie-expander-bus.
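For reference, the expander-bus syntax described above looks roughly like this in the domain XML (a sketch; the controller index and busNr value are illustrative):

```xml
<!-- pcie-expander-bus reserving bus numbers 100-255 for its descendants;
     buses 1-99 remain available to descendants of the root bus -->
<controller type='pci' index='10' model='pcie-expander-bus'>
  <target busNr='100'/>
</controller>
```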

There are settings called "chassis" and/or "chassis_nr" for some PCI controllers (and libvirt also supports setting those in the <target> subelement, for controllers that have them), but those also have nothing to do with the bus number. Setting those apparently will set a register in the PCI controller, and that register can be seen from within the guest OS, but that register has no effect on the bus number reported by the guest OS. (I was unsure of this so I just tried setting chassis on the pcie-root-ports of a guest and restarting it - the output of lspci was unchanged)

So unfortunately I don't think there is a way for qemu to directly/explicitly set the bus number of any PCI controller.

Comment 14 Edward Haas 2020-10-04 08:46:40 UTC
The reported issue has been investigated and finally classified as NOTABUG.
The lost NIC interface configuration and the inconsistent renaming due to reordering are caused by the use of non-predictable interface naming.

Summarizing the end result of this investigation:
- The initial identification of the PCI address inconsistency between the domxml and the VM's actual mapping:
  - Although in most cases the mapping is in sync (between libvirt and the VM), in some cases there is no way to assure it. Therefore, the mapping sync should not be relied upon.
  - In the current case, the inconsistent addresses are caused by the misidentification of the VF as a PCI device rather than a PCIe device (see the next item for details).
  - The PCI mapping sync is not expected to cause issues with the VF functionality or performance.
- libvirt identifies the VF device as PCI instead of PCIe:
  This is a libvirt issue caused by its inability to access the VF device information in the unprivileged container.
  libvirt runs in privileged mode in the virt-launcher, but the container itself is not privileged. For libvirt to work properly in such a case, it should either have been launched in unprivileged/user mode or not assume full access when in privileged mode.
  In any case, this is a side issue discovered along the way; it does not really affect the problem reported in this BZ (see the previous details).
- There is no mechanism which can guarantee the order in which interfaces are detected (and indexed) at boot time; therefore, no assumptions about the order should be made.
  Note that this is only relevant for multi-interface scenarios and not for the most common case of one interface per VM.
  - The interface naming is an old, well-known issue which has been resolved by consistent/predictable network device naming [1][2].
  - The default cloud images of some distributions ship with consistent naming disabled (probably because most assume a single-interface setup).
    (If you use such an image, look at the grub config for the kernel command line and remove the `net.ifnames=0` specification.)

The resolution of the issue reported in this BZ is to enable consistent network device naming in order to avoid depending on the interface detection order.
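As a sketch of the grub check described above (the sample command line is hypothetical; on a real image you would make the same edit to GRUB_CMDLINE_LINUX in /etc/default/grub and then regenerate the grub config):

```shell
# A hypothetical kernel command line as shipped by such a cloud image
cmdline="BOOT_IMAGE=/boot/vmlinuz root=/dev/vda1 ro net.ifnames=0 console=ttyS0"

# Detect whether predictable naming has been disabled
if printf '%s' "$cmdline" | grep -q 'net\.ifnames=0'; then
  echo "predictable naming is disabled"
fi

# Remove the option (the same edit you would make to GRUB_CMDLINE_LINUX
# before regenerating the grub config)
fixed=$(printf '%s' "$cmdline" | sed 's/ *net\.ifnames=0//')
echo "$fixed"
```

On a running guest, `cat /proc/cmdline` shows the command line the kernel actually booted with.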


1. https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/networking_guide/ch-consistent_network_device_naming#sec-Naming_Schemes_Hierarchy
2. https://www.freedesktop.org/software/systemd/man/systemd.net-naming-scheme.html

Comment 15 Igor Mammedov 2020-10-07 14:16:46 UTC
(In reply to Edward Haas from comment #14)
[...]
> 
> The resolution of the issue reported in the BZ is to enable the consistent
> network device naming in order to avoid dependency on the interface
> detection order.

With current QEMU, consistent network device naming will use either the
slot- or path-based naming scheme. This should work in most cases, assuming
an immutable XML. However, if the PCI topology or the machine type changes,
a previously used interface name may change as well.

Consistent network device naming also supports an 'onboard' naming scheme,
in which case interfaces are named with the 'eno' prefix plus an acpi_index
supplied by the firmware.
So if there is interest, it is possible to add the missing ACPI code to QEMU
and let the user specify an acpi_index for a network card. That way the user
should be able to control interface naming and avoid name changes even if the
machine type flips between PC|Q35 or the PCI topology changes.
(At minimum, it should work for cold-plugged PCI devices.)
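For the record, this proposal was later implemented upstream: QEMU gained an acpi-index property for cold-plugged PCI NICs, and libvirt exposes it as an <acpi index='...'/> subelement of <interface> (to the best of my recollection, QEMU 6.1 and libvirt 7.3). A sketch of the interface XML, with illustrative values:

```xml
<!-- Under the 'onboard' naming scheme, the guest names this NIC "eno1" -->
<interface type='ethernet'>
  <mac address='02:00:b5:b5:b5:fb'/>
  <model type='virtio'/>
  <acpi index='1'/>
</interface>
```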

Comment 16 Laine Stump 2020-10-09 22:04:01 UTC
Igor - now that I've digested what you've said, this sounds like a potentially *very good* thing to have. It would require a proper configuration to be provided by management (so you wouldn't be able to take some random cloud image, boot it up with "just any old config", and have the device named correctly), but at least it would be possible to have a guaranteed name for a device without depending on a specific MAC address (which would mean that the image would need to be modified for each instance).

Count me in if you want to do any experimenting with this idea!

