Bug 2182173 - cloud-init-local.service fails with "Unable to find a system nic" with some Mellanox NICs
Keywords:
Status: CLOSED EOL
Alias: None
Product: Fedora
Classification: Fedora
Component: cloud-init
Version: 37
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Major Hayden 🤠
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2023-03-27 18:45 UTC by Louis Sautier
Modified: 2024-01-12 23:21 UTC (History)
CC List: 7 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2024-01-12 23:21:01 UTC
Type: Bug
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 2001058 0 unspecified CLOSED systemd-udev-settle.service is deprecated. Please fix multipathd.service not to pull it in. 2023-03-27 18:45:13 UTC

Description Louis Sautier 2023-03-27 18:45:14 UTC
cloud-init-local.service fails with "Unable to find a system nic" for some Mellanox NICs. This is caused by the removal of systemd-udev-settle.service from the boot dependencies.

Description of problem:
On some servers with Mellanox NICs, cloud-init-local.service attempts to start before the network interfaces are ready. I believe the problem could also arise on different NICs that take a long time to initialise.

Version-Release number of selected component (if applicable):
cloud-init-22.2-4.fc37.noarch

How reproducible:
I can reproduce this consistently on two servers. One has this NIC:
86:00.0 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
        Subsystem: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] (Stand-up ConnectX-4 Lx EN, 25GbE dual-port SFP28, PCIe3.0 x8, MCX4121A-ACAT) [15b3:0003]
        Kernel driver in use: mlx5_core
The other has:
1a:00.0 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
	Subsystem: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:0068]
	Kernel driver in use: mlx5_core
On a third server with a similar (but not identical) NIC, I *do not* have the issue because the interface appears approximately one second before cloud-init-local.service starts:
86:00.0 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
	Subsystem: Mellanox Technologies Device [15b3:0121]
	Kernel driver in use: mlx5_core

Steps to Reproduce:
1. Prepare a Fedora 37 installation where cloud-init handles the networking configuration
2. Boot the server 

Actual results:
2023-03-27T16:51:36.253467+0000 packer-output kernel: pci 0000:86:00.0: [15b3:1015] type 00 class 0x020000
2023-03-27T16:51:36.253549+0000 packer-output kernel: pci 0000:86:00.0: reg 0x10: [mem 0x397ffc000000-0x397ffdffffff 64bit pref]
2023-03-27T16:51:36.253627+0000 packer-output kernel: pci 0000:86:00.0: reg 0x30: [mem 0xf0e00000-0xf0efffff pref]
2023-03-27T16:51:36.253703+0000 packer-output kernel: pci 0000:86:00.0: PME# supported from D3cold
2023-03-27T16:51:36.253780+0000 packer-output kernel: pci 0000:86:00.0: reg 0x1a4: [mem 0x397ffe800000-0x397ffe8fffff 64bit pref]
2023-03-27T16:51:36.253856+0000 packer-output kernel: pci 0000:86:00.0: VF(n) BAR0 space: [mem 0x397ffe800000-0x397ffeffffff 64bit pref] (contains BAR0 for 8 VFs)
2023-03-27T16:51:42.172737+0000 packer-output kernel: mlx5_core 0000:86:00.0: firmware version: 14.21.1000
2023-03-27T16:51:42.188161+0000 packer-output kernel: mlx5_core 0000:86:00.0: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
2023-03-27T16:51:42.825883+0000 packer-output systemd[1]: Starting cloud-init-local.service - Initial cloud-init job (pre-networking)...
2023-03-27T16:51:43.399795+0000 example.com cloud-init[1368]: ValueError: Unable to find a system nic for {'type': 'physical', 'accept-ra': False, 'subnets': [{'type': 'dhcp4'}, {'type': 'static6', 'dns_nameservers': ['xxx'], 'gateway': 'fe80::1', 'routes': [], 'address': 'xxx', 'ipv6': True}], 'mac_address': 'xxx'}
2023-03-27T16:51:43.820663+0000 example.com kernel: mlx5_core 0000:86:00.0: Rate limit: 13 rates are supported, range: 0Mbps to 24414Mbps
2023-03-27T16:51:43.821147+0000 example.com kernel: mlx5_core 0000:86:00.0: E-Switch: Total vports 10, per vport: max uc(1024) max mc(16384)
2023-03-27T16:51:43.822301+0000 example.com kernel: mlx5_core 0000:86:00.0: Supported tc offload range - chains: 1, prios: 1
2023-03-27T16:51:43.822663+0000 example.com kernel: mlx5_core 0000:86:00.0: mlx5e_tc_post_act_init:40:(pid 957): firmware level support is missing
2023-03-27T16:51:43.823003+0000 example.com kernel: mlx5_core 0000:86:00.0: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0 basic)
2023-03-27T16:51:43.424258+0000 example.com systemd[1]: cloud-init-local.service: Main process exited, code=exited, status=1/FAILURE
2023-03-27T16:51:43.424568+0000 example.com systemd[1]: cloud-init-local.service: Failed with result 'exit-code'.
2023-03-27T16:51:43.425035+0000 example.com systemd[1]: Failed to start cloud-init-local.service - Initial cloud-init job (pre-networking).
2023-03-27T16:51:48.858489+0000 example.com kernel: mlx5_core 0000:86:00.0 ens5f0np0: renamed from eth0
2023-03-27T16:51:50.368505+0000 example.com kernel: mlx5_core 0000:86:00.0 ens5f0np0: Link up

Expected results:
cloud-init-local.service starts correctly.

Additional info:
The issue is not present with Fedora 36 because:
* multipathd.service wants systemd-udev-settle.service
* multipathd.service is wanted by sysinit.target
* therefore, cloud-init-local.service waits for systemd-udev-settle.service

On Fedora 37, systemd-udev-settle.service has been removed from multipathd.service (bug 2001058), making the issue appear.

Forcing cloud-init-local to wait for systemd-udev-settle works. I tested with /etc/systemd/system/cloud-init-local.service.d/override.conf containing:
[Unit]
Wants=systemd-udev-settle.service
After=systemd-udev-settle.service
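
For anyone who wants to try the same override, here is a minimal sketch that installs the drop-in shown above from a script. The function name `install_settle_dropin` and its optional root-prefix argument are hypothetical, added only so the file can be written to a scratch directory first; they are not part of cloud-init or systemd.

```shell
#!/bin/sh
# Hypothetical helper (not part of cloud-init or systemd): write the
# drop-in from this report under an optional root prefix. With no
# argument it targets the live system and therefore needs root.
install_settle_dropin() {
    root="${1:-}"
    dir="$root/etc/systemd/system/cloud-init-local.service.d"
    mkdir -p "$dir" || return 1
    # Same three lines as the override.conf quoted in the report.
    cat > "$dir/override.conf" <<'EOF'
[Unit]
Wants=systemd-udev-settle.service
After=systemd-udev-settle.service
EOF
}
```

On a real system this would be run as root as `install_settle_dropin && systemctl daemon-reload`, followed by a reboot to observe the new ordering.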

After adding the delay:
2023-03-27T18:15:17.480249+0000 packer-output kernel: pci 0000:86:00.0: [15b3:1015] type 00 class 0x020000
2023-03-27T18:15:17.480317+0000 packer-output kernel: pci 0000:86:00.0: reg 0x10: [mem 0x397ffc000000-0x397ffdffffff 64bit pref]
2023-03-27T18:15:17.480383+0000 packer-output kernel: pci 0000:86:00.0: reg 0x30: [mem 0xf0e00000-0xf0efffff pref]
2023-03-27T18:15:17.480477+0000 packer-output kernel: pci 0000:86:00.0: PME# supported from D3cold
2023-03-27T18:15:17.480550+0000 packer-output kernel: pci 0000:86:00.0: reg 0x1a4: [mem 0x397ffe800000-0x397ffe8fffff 64bit pref]
2023-03-27T18:15:17.480621+0000 packer-output kernel: pci 0000:86:00.0: VF(n) BAR0 space: [mem 0x397ffe800000-0x397ffeffffff 64bit pref] (contains BAR0 for 8 VFs)
2023-03-27T18:15:23.379002+0000 example.com udevadm[1100]: systemd-udev-settle.service is deprecated. Please fix cloud-init-local.service not to pull it in.
2023-03-27T18:15:23.745775+0000 example.com kernel: mlx5_core 0000:86:00.0: firmware version: 14.21.1000
2023-03-27T18:15:23.745993+0000 example.com kernel: mlx5_core 0000:86:00.0: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
2023-03-27T18:15:24.996889+0000 example.com kernel: mlx5_core 0000:86:00.0: Rate limit: 13 rates are supported, range: 0Mbps to 24414Mbps
2023-03-27T18:15:24.997480+0000 example.com kernel: mlx5_core 0000:86:00.0: E-Switch: Total vports 10, per vport: max uc(1024) max mc(16384)
2023-03-27T18:15:25.322275+0000 example.com kernel: mlx5_core 0000:86:00.0: Supported tc offload range - chains: 1, prios: 1
2023-03-27T18:15:25.322915+0000 example.com kernel: mlx5_core 0000:86:00.0: mlx5e_tc_post_act_init:40:(pid 964): firmware level support is missing
2023-03-27T18:15:25.344452+0000 example.com kernel: mlx5_core 0000:86:00.0: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0 basic)
2023-03-27T18:15:30.335428+0000 example.com kernel: mlx5_core 0000:86:00.0 ens5f0np0: renamed from eth0
2023-03-27T18:15:30.704857+0000 example.com systemd[1]: Starting cloud-init-local.service - Initial cloud-init job (pre-networking)...
2023-03-27T18:15:31.220462+0000 example.com systemd[1]: Finished cloud-init-local.service - Initial cloud-init job (pre-networking).
2023-03-27T18:15:31.953475+0000 example.com kernel: mlx5_core 0000:86:00.0 ens5f0np0: Link up


Since systemd-udev-settle.service is deprecated, is there a better way to wait for the interfaces to appear?
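
To make the question concrete, one possible approach is to wait only for the specific interface rather than for udev as a whole. The sketch below polls sysfs until an interface with the expected MAC address exists. It is an illustration under assumptions, not what cloud-init does: `wait_for_mac`, its arguments and the `SYSFS_NET` override are hypothetical names.

```shell
#!/bin/sh
# Hypothetical sketch: wait for one interface, identified by MAC address,
# instead of waiting for the whole udev queue to settle.
# SYSFS_NET is not a standard variable; it exists only so the function
# can be pointed at a test directory.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# wait_for_mac MAC [TIMEOUT_SECONDS]
# Prints the interface name and returns 0 once found; returns 1 on timeout.
wait_for_mac() {
    want=$(printf '%s' "$1" | tr 'A-F' 'a-f')
    remaining="${2:-30}"
    while :; do
        for addr in "$SYSFS_NET"/*/address; do
            [ -r "$addr" ] || continue
            have=$(tr 'A-F' 'a-f' < "$addr")
            if [ "$have" = "$want" ]; then
                basename "$(dirname "$addr")"
                return 0
            fi
        done
        # Give up once the timeout budget is used; otherwise retry in 1 s.
        [ "$remaining" -gt 0 ] || return 1
        sleep 1
        remaining=$((remaining - 1))
    done
}
```

Note that the name printed may still be the kernel's initial name (eth0 in the logs above) if udev has not renamed the interface yet, so a caller that needs the final name would have to wait for the rename as well.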

Comment 1 Louis Sautier 2023-03-28 18:52:45 UTC
Hi,
I've found a workaround: including mlx5_core in the dracut-generated initramfs with "add_drivers" makes the interface appear 10 seconds before cloud-init-local.service starts:

2023-03-28T18:02:40.108462+0000 packer-output kernel: pci 0000:86:00.0: [15b3:1015] type 00 class 0x020000
2023-03-28T18:02:40.108535+0000 packer-output kernel: pci 0000:86:00.0: reg 0x10: [mem 0x397ffc000000-0x397ffdffffff 64bit pref]
2023-03-28T18:02:40.108609+0000 packer-output kernel: pci 0000:86:00.0: reg 0x30: [mem 0xf0e00000-0xf0efffff pref]
2023-03-28T18:02:40.108679+0000 packer-output kernel: pci 0000:86:00.0: PME# supported from D3cold
2023-03-28T18:02:40.108755+0000 packer-output kernel: pci 0000:86:00.0: reg 0x1a4: [mem 0x397ffe800000-0x397ffe8fffff 64bit pref]
2023-03-28T18:02:40.108829+0000 packer-output kernel: pci 0000:86:00.0: VF(n) BAR0 space: [mem 0x397ffe800000-0x397ffeffffff 64bit pref] (contains BAR0 for 8 VFs)
2023-03-28T18:02:41.407373+0000 packer-output kernel: mlx5_core 0000:86:00.0: firmware version: 14.21.1000
2023-03-28T18:02:41.407745+0000 packer-output kernel: mlx5_core 0000:86:00.0: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
2023-03-28T18:02:42.684779+0000 packer-output kernel: mlx5_core 0000:86:00.0: Rate limit: 13 rates are supported, range: 0Mbps to 24414Mbps
2023-03-28T18:02:42.704769+0000 packer-output kernel: mlx5_core 0000:86:00.0: E-Switch: Total vports 10, per vport: max uc(1024) max mc(16384)
2023-03-28T18:02:42.726061+0000 packer-output kernel: mlx5_core 0000:86:00.0: Port module event: module 0, Cable plugged
2023-03-28T18:02:43.037711+0000 packer-output kernel: mlx5_core 0000:86:00.0: Supported tc offload range - chains: 1, prios: 1
2023-03-28T18:02:43.038128+0000 packer-output kernel: mlx5_core 0000:86:00.0: mlx5e_tc_post_act_init:40:(pid 957): firmware level support is missing
2023-03-28T18:02:43.061788+0000 packer-output kernel: mlx5_core 0000:86:00.0: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0 basic)
2023-03-28T18:02:51.035763+0000 packer-output kernel: mlx5_core 0000:86:00.0 ens5f0np0: renamed from eth0
2023-03-28T18:02:52.431310+0000 packer-output systemd[1]: Starting cloud-init-local.service - Initial cloud-init job (pre-networking)...
2023-03-28T18:02:53.082897+0000 example.com systemd[1]: Finished cloud-init-local.service - Initial cloud-init job (pre-networking).
2023-03-28T18:02:53.682824+0000 example.com kernel: mlx5_core 0000:86:00.0 ens5f0np0: Link up
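
For reference, the dracut change described in this comment can be sketched as follows. The helper name `write_mlx5_dracut_conf`, its optional root-prefix argument and the file name `mlx5.conf` are arbitrary choices for illustration; only the `add_drivers` setting itself comes from the workaround.

```shell
#!/bin/sh
# Hypothetical helper: write a dracut configuration snippet that forces
# the mlx5_core driver into the initramfs. With no argument it targets
# the live system and therefore needs root.
write_mlx5_dracut_conf() {
    root="${1:-}"
    dir="$root/etc/dracut.conf.d"
    mkdir -p "$dir" || return 1
    # dracut expects the driver list to be surrounded by spaces.
    printf 'add_drivers+=" mlx5_core "\n' > "$dir/mlx5.conf"
}
```

After running `write_mlx5_dracut_conf` as root, the initramfs has to be regenerated (for example with `dracut --force`) before the next boot picks up the driver.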

Comment 2 Fedora Admin user for bugzilla script actions 2023-04-28 12:11:11 UTC
This package has changed maintainer in Fedora. Reassigning to the new maintainer of this component.

Comment 3 Major Hayden 🤠 2023-05-11 22:30:57 UTC
I don't have a Mellanox card to play with (but that would be a fun thing to have!).

Does the mlx5_core module take a while to initialize? I'm just trying to figure out the best solution for what seems to be a race between cloud-init and the module loading.

Comment 4 Aoife Moloney 2023-11-23 01:34:45 UTC
This message is a reminder that Fedora Linux 37 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora Linux 37 on 2023-12-05.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
'version' of '37'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, change the 'version' 
to a later Fedora Linux version. Note that the version field may be hidden.
Click the "Show advanced fields" button if you do not see it.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora Linux 37 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora Linux, you are encouraged to change the 'version' to a later version
prior to this bug being closed.

Comment 5 Aoife Moloney 2024-01-12 23:21:01 UTC
Fedora Linux 37 entered end-of-life (EOL) status on 2023-12-05.

Fedora Linux 37 is no longer maintained, which means that it
will not receive any further security or bug fix updates. As a result we
are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora Linux
please feel free to reopen this bug against that version. Note that the version
field may be hidden. Click the "Show advanced fields" button if you do not see
the version field.

If you are unable to reopen this bug, please file a new report against an
active release.

Thank you for reporting this bug and we are sorry it could not be fixed.

