2079277 – [UPI][Baremetal] RHCOS is not able to configure network interfaces to reach ignition file

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 8
Classification:	Red Hat
Component:	NetworkManager
Sub Component:
Version:	8.4
Hardware:	x86_64
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Beniamino Galvani
QA Contact:	Vladimir Benes
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2022-04-27 10:19 UTC by Ignacio
Modified:	2022-11-08 11:22 UTC
CC List:	23 users
Fixed In Version:	NetworkManager-1.39.10-1.el8
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-11-08 10:10:31 UTC
Type:	Bug
Target Upstream Version:
Embargoed:
Dependent Products:

Links
System	ID	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	RHELPLAN-124218	None	None	None	2022-06-03 14:47:20 UTC
Red Hat Product Errata	RHBA-2022:7680	None	None	None	2022-11-08 10:10:55 UTC
freedesktop.org Gitlab	NetworkManager NetworkManager merge_requests 1239	None	merged	initrd: set a default carrier timeout of 15 seconds in initrd	2022-07-11 13:50:24 UTC

Comment 1 Luca BRUNO 2022-04-27 10:29:26 UTC

Thanks for the report and the attached logs.
From the emergency shell, can you please check what `ls -la /dev/disk/by-label/` shows?

Comment 21 Amit Ugol 2022-05-11 07:08:00 UTC

Can we please get the following:
Machine firmware version.
NIC models and firmware versions.
The network layout.
The type of bond that was configured, and between which NICs.

Comment 32 Thomas Haller 2022-05-12 16:08:57 UTC

In comment 11, there is no route to 10.x.x.x/8.
It would be NetworkManager's job to configure that.

Is it possible to get debug logs of NetworkManager (and just the entire boot)?
AFAIS, there are none present. I think you get debug logs by setting "rd.debug" on the kernel command line.

Comment 35 Ignacio 2022-05-13 15:14:15 UTC

(In reply to Thomas Haller from comment #32)
> in comment 11, there is no route to 10.x.x.x/8.
> It would be NetworkManager's job to configure that.
> 
> Is it possible to get debug logs of NetworkManager (and just the entire
> boot)?
> AFAIS, there are none present. I think you get debug logs by setting
> "rd.debug" on the kernel command line.

We have the entire boot logs, captured with "systemd.log_level=debug systemd.journald.forward_to_console=1 inst.loglevel=debug"; please see comment 12.

Let us know if those logs are enough. If you still need the debug logs by setting "rd.debug" on the kernel command line, please specify which kind of test you want, forcing to go to dracut emergency shell or letting it boot from disk and getting the loop trying to fetch the ignition file.

Comment 36 Dusty Mabe 2022-05-13 19:26:17 UTC

Those options set systemd into debug logging. For NetworkManager, IIUC, rd.debug is needed to set NetworkManager into TRACE logging.

See https://github.com/dracutdevs/dracut/blob/9bef71094eba84a9eac161fc45386ccd73bd2b34/modules.d/35network-manager/nm-config.sh#L9-L18

Comment 37 Dusty Mabe 2022-05-13 20:33:18 UTC

I forgot to answer the last part. Probably the scenario where the system gets into a loop trying to fetch the Ignition config.

Comment 38 Dusty Mabe 2022-05-17 17:21:32 UTC

Any updates here?

Comment 44 Beniamino Galvani 2022-05-25 10:09:05 UTC

Hi, after discussing with Thomas, we think that the problem might be related to the long time that the interfaces take to get carrier after they are added to the bond. NetworkManager has a built-in timeout of 6 seconds, which doesn't seem to be enough in this case. Therefore, it quits too early, without activating the VLAN.

A possible solution is to add the argument "rd.net.timeout.carrier=60" to the kernel command line, which increases the carrier timeout to 60 seconds.

Ignacio, would it be possible to try again with the new argument?
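
As a rough way to confirm that the workaround from comment 44 is actually present on a booted node, the sketch below parses a kernel command line for "rd.net.timeout.carrier". The function name and the sample command line are hypothetical, and treating the last occurrence as the effective one is an assumption, not something confirmed in this bug.

```python
from typing import Optional

CARRIER_ARG = "rd.net.timeout.carrier"


def carrier_timeout_from_cmdline(cmdline: str) -> Optional[int]:
    """Return the carrier timeout (seconds) found on a kernel command line.

    Returns None when the argument is absent or has a non-numeric value.
    """
    value = None
    for token in cmdline.split():
        key, sep, raw = token.partition("=")
        if key == CARRIER_ARG and sep and raw.isdigit():
            # Assumption: if the argument is repeated, the last one wins.
            value = int(raw)
    return value


# Hypothetical command line, for illustration only.
example = "BOOT_IMAGE=/vmlinuz rd.neednet=1 rd.net.timeout.carrier=60"
print(carrier_timeout_from_cmdline(example))  # prints 60
```

On a live system, the same function could be fed the contents of /proc/cmdline instead of the sample string.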

Comment 45 Ignacio 2022-05-25 10:56:01 UTC

Sure, I will let you know the outcome. It may take a few days because they have some days off this week.

Comment 46 Ignacio 2022-05-26 08:02:17 UTC

Good news. The first test looks promising. The installation continues and the node is able to reach the cluster.
I'll upload the log so you can check how much time the system really needed to get carrier.

Now the question is whether the default carrier timeout in NetworkManager should be revisited and increased. What do you think?

Comment 48 Beniamino Galvani 2022-05-27 14:41:35 UTC

In the new logs, I see that ens2f0 needs 6.53 seconds to get carrier after it's added to the bond:

  [   36.812315] bond0: (slave ens2f0): Enslaving as a backup interface with a down link
  [   43.346666] ixgbe 0000:05:00.0 ens2f0: NIC Link is Up 10 Gbps, Flow Control: RX/TX

So it's just half a second more than NM's timeout (6 seconds).
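
The 6.53-second figure can be reproduced from the two kernel log lines quoted above. This is a minimal sketch: the helper names are hypothetical, and it assumes the usual "[ seconds.microseconds]" timestamp prefix seen in those lines.

```python
import re

# The two kernel log lines quoted in this comment.
ENSLAVE_LINE = "[   36.812315] bond0: (slave ens2f0): Enslaving as a backup interface with a down link"
LINK_UP_LINE = "[   43.346666] ixgbe 0000:05:00.0 ens2f0: NIC Link is Up 10 Gbps, Flow Control: RX/TX"

NM_CARRIER_TIMEOUT = 6.0  # seconds, the built-in timeout mentioned in comment 44

TIMESTAMP = re.compile(r"^\s*\[\s*(\d+\.\d+)\]")


def kernel_timestamp(line: str) -> float:
    """Extract the leading kernel timestamp, in seconds since boot."""
    match = TIMESTAMP.match(line)
    if match is None:
        raise ValueError(f"no kernel timestamp in: {line!r}")
    return float(match.group(1))


def carrier_delay(enslave_line: str, link_up_line: str) -> float:
    """Seconds between enslaving the port and the link coming up."""
    return kernel_timestamp(link_up_line) - kernel_timestamp(enslave_line)


delay = carrier_delay(ENSLAVE_LINE, LINK_UP_LINE)
print(f"carrier delay: {delay:.2f} s")                          # 6.53 s
print(f"over NM timeout: {delay - NM_CARRIER_TIMEOUT:.2f} s")   # 0.53 s
```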

>  Now the question would be if the default carrier timeout in Networkmanager should be revisited and increased. What do you think?

You are correct, it would be wiser to increase the default timeout in initrd. The old dracut network module waited for 10 seconds. I submitted a patch to increase it to 15 seconds in NM:

https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/merge_requests/1239

Comment 49 Dusty Mabe 2022-06-03 14:31:42 UTC

OK, I'm moving this BZ to NetworkManager. IIUC they have a workaround for now and, regardless of whether NM changes the default timeout in RHEL8 or not, RHEL9 is starting NetworkManager via systemd in the initrd (IIUC), which means that the bond will eventually come up on RHEL9 and Ignition will be able to fetch the config.

Comment 56 Vladimir Benes 2022-08-02 10:30:37 UTC

The timeout is set to dracut's original 10s.

Comment 59 errata-xmlrpc 2022-11-08 10:10:31 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (NetworkManager bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:7680
