cloud-init-local.service fails with "Unable to find a system nic" for some Mellanox NICs. This is caused by the removal of systemd-udev-settle.service from the boot dependencies.

Description of problem:
On some servers with Mellanox NICs, cloud-init-local.service attempts to start before the network interfaces are ready. I believe the problem could also arise on different NICs that take a long time to initialise.

Version-Release number of selected component (if applicable):
cloud-init-22.2-4.fc37.noarch

How reproducible:
I can reproduce this consistently on two servers. One has this NIC:

86:00.0 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
        Subsystem: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] (Stand-up ConnectX-4 Lx EN, 25GbE dual-port SFP28, PCIe3.0 x8, MCX4121A-ACAT) [15b3:0003]
        Kernel driver in use: mlx5_core

The other has:

1a:00.0 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
        Subsystem: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:0068]
        Kernel driver in use: mlx5_core

On a third server with a similar (but not identical) NIC, I *do not* have the issue: the interface appears approximately one second before cloud-init-local.service starts:

86:00.0 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
        Subsystem: Mellanox Technologies Device [15b3:0121]
        Kernel driver in use: mlx5_core

Steps to Reproduce:
1. Prepare a Fedora 37 installation where cloud-init handles the networking configuration
2. Boot the server

Actual results:
2023-03-27T16:51:36.253467+0000 packer-output kernel: pci 0000:86:00.0: [15b3:1015] type 00 class 0x020000
2023-03-27T16:51:36.253549+0000 packer-output kernel: pci 0000:86:00.0: reg 0x10: [mem 0x397ffc000000-0x397ffdffffff 64bit pref]
2023-03-27T16:51:36.253627+0000 packer-output kernel: pci 0000:86:00.0: reg 0x30: [mem 0xf0e00000-0xf0efffff pref]
2023-03-27T16:51:36.253703+0000 packer-output kernel: pci 0000:86:00.0: PME# supported from D3cold
2023-03-27T16:51:36.253780+0000 packer-output kernel: pci 0000:86:00.0: reg 0x1a4: [mem 0x397ffe800000-0x397ffe8fffff 64bit pref]
2023-03-27T16:51:36.253856+0000 packer-output kernel: pci 0000:86:00.0: VF(n) BAR0 space: [mem 0x397ffe800000-0x397ffeffffff 64bit pref] (contains BAR0 for 8 VFs)
2023-03-27T16:51:42.172737+0000 packer-output kernel: mlx5_core 0000:86:00.0: firmware version: 14.21.1000
2023-03-27T16:51:42.188161+0000 packer-output kernel: mlx5_core 0000:86:00.0: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
2023-03-27T16:51:42.825883+0000 packer-output systemd[1]: Starting cloud-init-local.service - Initial cloud-init job (pre-networking)...
2023-03-27T16:51:43.399795+0000 example.com cloud-init[1368]: ValueError: Unable to find a system nic for {'type': 'physical', 'accept-ra': False, 'subnets': [{'type': 'dhcp4'}, {'type': 'static6', 'dns_nameservers': ['xxx'], 'gateway': 'fe80::1', 'routes': [], 'address': 'xxx', 'ipv6': True}], 'mac_address': 'xxx'}
2023-03-27T16:51:43.820663+0000 example.com kernel: mlx5_core 0000:86:00.0: Rate limit: 13 rates are supported, range: 0Mbps to 24414Mbps
2023-03-27T16:51:43.821147+0000 example.com kernel: mlx5_core 0000:86:00.0: E-Switch: Total vports 10, per vport: max uc(1024) max mc(16384)
2023-03-27T16:51:43.822301+0000 example.com kernel: mlx5_core 0000:86:00.0: Supported tc offload range - chains: 1, prios: 1
2023-03-27T16:51:43.822663+0000 example.com kernel: mlx5_core 0000:86:00.0: mlx5e_tc_post_act_init:40:(pid 957): firmware level support is missing
2023-03-27T16:51:43.823003+0000 example.com kernel: mlx5_core 0000:86:00.0: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0 basic)
2023-03-27T16:51:43.424258+0000 example.com systemd[1]: cloud-init-local.service: Main process exited, code=exited, status=1/FAILURE
2023-03-27T16:51:43.424568+0000 example.com systemd[1]: cloud-init-local.service: Failed with result 'exit-code'.
2023-03-27T16:51:43.425035+0000 example.com systemd[1]: Failed to start cloud-init-local.service - Initial cloud-init job (pre-networking).
2023-03-27T16:51:48.858489+0000 example.com kernel: mlx5_core 0000:86:00.0 ens5f0np0: renamed from eth0
2023-03-27T16:51:50.368505+0000 example.com kernel: mlx5_core 0000:86:00.0 ens5f0np0: Link up

Expected results:
cloud-init-local.service starts correctly.

Additional info:
The issue is not present with Fedora 36 because:
* multipathd.service wants systemd-udev-settle.service
* multipathd.service is wanted by sysinit.target
* therefore, cloud-init-local.service waits for systemd-udev-settle.service

On Fedora 37, systemd-udev-settle.service has been removed from multipathd.service (bug 2001058), making the issue appear.

Forcing cloud-init-local to wait for systemd-udev-settle works. I tested with /etc/systemd/system/cloud-init-local.service.d/override.conf containing:

[Unit]
Wants=systemd-udev-settle.service
After=systemd-udev-settle.service

After adding the delay:

2023-03-27T18:15:17.480249+0000 packer-output kernel: pci 0000:86:00.0: [15b3:1015] type 00 class 0x020000
2023-03-27T18:15:17.480317+0000 packer-output kernel: pci 0000:86:00.0: reg 0x10: [mem 0x397ffc000000-0x397ffdffffff 64bit pref]
2023-03-27T18:15:17.480383+0000 packer-output kernel: pci 0000:86:00.0: reg 0x30: [mem 0xf0e00000-0xf0efffff pref]
2023-03-27T18:15:17.480477+0000 packer-output kernel: pci 0000:86:00.0: PME# supported from D3cold
2023-03-27T18:15:17.480550+0000 packer-output kernel: pci 0000:86:00.0: reg 0x1a4: [mem 0x397ffe800000-0x397ffe8fffff 64bit pref]
2023-03-27T18:15:17.480621+0000 packer-output kernel: pci 0000:86:00.0: VF(n) BAR0 space: [mem 0x397ffe800000-0x397ffeffffff 64bit pref] (contains BAR0 for 8 VFs)
2023-03-27T18:15:23.379002+0000 example.com udevadm[1100]: systemd-udev-settle.service is deprecated. Please fix cloud-init-local.service not to pull it in.
2023-03-27T18:15:23.745775+0000 example.com kernel: mlx5_core 0000:86:00.0: firmware version: 14.21.1000
2023-03-27T18:15:23.745993+0000 example.com kernel: mlx5_core 0000:86:00.0: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
2023-03-27T18:15:24.996889+0000 example.com kernel: mlx5_core 0000:86:00.0: Rate limit: 13 rates are supported, range: 0Mbps to 24414Mbps
2023-03-27T18:15:24.997480+0000 example.com kernel: mlx5_core 0000:86:00.0: E-Switch: Total vports 10, per vport: max uc(1024) max mc(16384)
2023-03-27T18:15:25.322275+0000 example.com kernel: mlx5_core 0000:86:00.0: Supported tc offload range - chains: 1, prios: 1
2023-03-27T18:15:25.322915+0000 example.com kernel: mlx5_core 0000:86:00.0: mlx5e_tc_post_act_init:40:(pid 964): firmware level support is missing
2023-03-27T18:15:25.344452+0000 example.com kernel: mlx5_core 0000:86:00.0: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0 basic)
2023-03-27T18:15:30.335428+0000 example.com kernel: mlx5_core 0000:86:00.0 ens5f0np0: renamed from eth0
2023-03-27T18:15:30.704857+0000 example.com systemd[1]: Starting cloud-init-local.service - Initial cloud-init job (pre-networking)...
2023-03-27T18:15:31.220462+0000 example.com systemd[1]: Finished cloud-init-local.service - Initial cloud-init job (pre-networking).
2023-03-27T18:15:31.953475+0000 example.com kernel: mlx5_core 0000:86:00.0 ens5f0np0: Link up

Since systemd-udev-settle.service is deprecated, is there a better way to wait for the interfaces to appear?
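
One possible direction, instead of waiting for all of udev to settle, is to wait only for the one interface the network config refers to. The following is a minimal sketch of that idea, not cloud-init's actual code: the function names `find_nic_by_mac` and `wait_for_nic`, the 30-second timeout and the polling interval are all assumptions made for illustration. It matches on the MAC address read from `/sys/class/net/<iface>/address`, so it is unaffected by the later rename from eth0 to ens5f0np0 seen in the logs.

```python
import os
import time


def find_nic_by_mac(mac, sys_class_net="/sys/class/net"):
    """Return the name of the interface whose MAC matches, or None."""
    wanted = mac.strip().lower()
    try:
        names = sorted(os.listdir(sys_class_net))
    except FileNotFoundError:
        return None
    for name in names:
        try:
            with open(os.path.join(sys_class_net, name, "address")) as fh:
                if fh.read().strip().lower() == wanted:
                    return name
        except OSError:
            # Entry is not an interface directory, or it vanished mid-scan.
            continue
    return None


def wait_for_nic(mac, timeout=30.0, interval=0.25,
                 sys_class_net="/sys/class/net"):
    """Poll until an interface with the given MAC exists.

    Returns the interface name, or None once the (hypothetical)
    timeout has elapsed without a match.
    """
    deadline = time.monotonic() + timeout
    while True:
        name = find_nic_by_mac(mac, sys_class_net)
        if name is not None:
            return name
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)
```

Called as `wait_for_nic("aa:bb:cc:dd:ee:ff")` before the network config is rendered, this would block only for as long as the slow NIC needs. Polling is used here purely to keep the sketch short; an event-driven wait on udev would avoid the sleep loop.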
Hi, I've found a workaround: including mlx5_core in the dracut-generated initramfs with "add_drivers" makes the interface appear 10 seconds before cloud-init-local.service starts:

2023-03-28T18:02:40.108462+0000 packer-output kernel: pci 0000:86:00.0: [15b3:1015] type 00 class 0x020000
2023-03-28T18:02:40.108535+0000 packer-output kernel: pci 0000:86:00.0: reg 0x10: [mem 0x397ffc000000-0x397ffdffffff 64bit pref]
2023-03-28T18:02:40.108609+0000 packer-output kernel: pci 0000:86:00.0: reg 0x30: [mem 0xf0e00000-0xf0efffff pref]
2023-03-28T18:02:40.108679+0000 packer-output kernel: pci 0000:86:00.0: PME# supported from D3cold
2023-03-28T18:02:40.108755+0000 packer-output kernel: pci 0000:86:00.0: reg 0x1a4: [mem 0x397ffe800000-0x397ffe8fffff 64bit pref]
2023-03-28T18:02:40.108829+0000 packer-output kernel: pci 0000:86:00.0: VF(n) BAR0 space: [mem 0x397ffe800000-0x397ffeffffff 64bit pref] (contains BAR0 for 8 VFs)
2023-03-28T18:02:41.407373+0000 packer-output kernel: mlx5_core 0000:86:00.0: firmware version: 14.21.1000
2023-03-28T18:02:41.407745+0000 packer-output kernel: mlx5_core 0000:86:00.0: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
2023-03-28T18:02:42.684779+0000 packer-output kernel: mlx5_core 0000:86:00.0: Rate limit: 13 rates are supported, range: 0Mbps to 24414Mbps
2023-03-28T18:02:42.704769+0000 packer-output kernel: mlx5_core 0000:86:00.0: E-Switch: Total vports 10, per vport: max uc(1024) max mc(16384)
2023-03-28T18:02:42.726061+0000 packer-output kernel: mlx5_core 0000:86:00.0: Port module event: module 0, Cable plugged
2023-03-28T18:02:43.037711+0000 packer-output kernel: mlx5_core 0000:86:00.0: Supported tc offload range - chains: 1, prios: 1
2023-03-28T18:02:43.038128+0000 packer-output kernel: mlx5_core 0000:86:00.0: mlx5e_tc_post_act_init:40:(pid 957): firmware level support is missing
2023-03-28T18:02:43.061788+0000 packer-output kernel: mlx5_core 0000:86:00.0: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0 basic)
2023-03-28T18:02:51.035763+0000 packer-output kernel: mlx5_core 0000:86:00.0 ens5f0np0: renamed from eth0
2023-03-28T18:02:52.431310+0000 packer-output systemd[1]: Starting cloud-init-local.service - Initial cloud-init job (pre-networking)...
2023-03-28T18:02:53.082897+0000 example.com systemd[1]: Finished cloud-init-local.service - Initial cloud-init job (pre-networking).
2023-03-28T18:02:53.682824+0000 example.com kernel: mlx5_core 0000:86:00.0 ens5f0np0: Link up
This package has changed maintainer in Fedora. Reassigning to the new maintainer of this component.
I don't have a Mellanox card to play with (but that would be a fun thing to have!). Does the mlx5_core module take a while to initialize? I'm just trying to figure out the best solution for what seems to be a race between cloud-init and the module loading.
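
The timestamps already posted in this report give a rough answer to the timing question. The sketch below only does the arithmetic on four lines copied from the failing boot; the helper names `parse_ts` and `seconds_between` are made up for illustration, and the logs alone cannot say how much of each gap is spent in the driver itself versus module loading or udev processing.

```python
from datetime import datetime


def parse_ts(line):
    """Parse the leading ISO-8601 timestamp of a journal line."""
    stamp = line.split(" ", 1)[0]
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%f%z")


def seconds_between(earlier, later):
    """Seconds elapsed from one journal line to another."""
    return (parse_ts(later) - parse_ts(earlier)).total_seconds()


# Lines copied verbatim from the failing boot in the original report.
pci_seen = ("2023-03-27T16:51:36.253467+0000 packer-output kernel: "
            "pci 0000:86:00.0: [15b3:1015] type 00 class 0x020000")
driver_probe = ("2023-03-27T16:51:42.172737+0000 packer-output kernel: "
                "mlx5_core 0000:86:00.0: firmware version: 14.21.1000")
cloud_init_start = ("2023-03-27T16:51:42.825883+0000 packer-output systemd[1]: "
                    "Starting cloud-init-local.service - Initial cloud-init job "
                    "(pre-networking)...")
renamed = ("2023-03-27T16:51:48.858489+0000 example.com kernel: "
           "mlx5_core 0000:86:00.0 ens5f0np0: renamed from eth0")

if __name__ == "__main__":
    print(f"PCI enumeration -> first mlx5_core message: "
          f"{seconds_between(pci_seen, driver_probe):.2f} s")
    print(f"first mlx5_core message -> interface renamed: "
          f"{seconds_between(driver_probe, renamed):.2f} s")
    print(f"cloud-init-local start -> interface renamed: "
          f"{seconds_between(cloud_init_start, renamed):.2f} s")
```

On that boot the first mlx5_core message appears about 5.9 s after PCI enumeration, the interface gets its final name about 6.7 s after that, and cloud-init-local.service had already started about 6.0 s before the rename.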
This message is a reminder that Fedora Linux 37 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora Linux 37 on 2023-12-05. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a 'version' of '37'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, change the 'version' to a later Fedora Linux version. Note that the version field may be hidden. Click the "Show advanced fields" button if you do not see it. Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora Linux 37 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora Linux, you are encouraged to change the 'version' to a later version prior to this bug being closed.
Fedora Linux 37 entered end-of-life (EOL) status on 2023-12-05. Fedora Linux 37 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora Linux please feel free to reopen this bug against that version. Note that the version field may be hidden. Click the "Show advanced fields" button if you do not see the version field. If you are unable to reopen this bug, please file a new report against an active release. Thank you for reporting this bug and we are sorry it could not be fixed.