Bug 1917773 - NetworkManager doesn't initialize an interface, reports startup complete when booting for the first time
Summary: NetworkManager doesn't initialize an interface, reports startup complete when...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: NetworkManager
Version: 8.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: 8.0
Assignee: Thomas Haller
QA Contact: Desktop QE
URL:
Whiteboard:
Depends On:
Blocks: 1922417 1940071 1943809
 
Reported: 2021-01-19 11:37 UTC by vemporop
Modified: 2024-06-13 23:58 UTC (History)

Fixed In Version: NetworkManager-1.30.0-2.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1940071
Environment:
Last Closed: 2021-05-18 13:32:37 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments
rdsosreport.txt collected via dracut shell after ignition timed out (122.73 KB, text/plain)
2021-01-19 11:37 UTC, vemporop
logfiles of a similar scenario (32.14 KB, application/x-xz)
2021-02-23 07:28 UTC, Dominik Holler


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1931852 1 urgent CLOSED Ignition HTTP GET is failing, because DHCP IPv4 config is failing silently 2023-09-18 00:24:54 UTC
Red Hat Issue Tracker NMT-1143 0 None None None 2024-06-13 23:58:59 UTC

Description vemporop 2021-01-19 11:37:17 UTC
Created attachment 1748701 [details]
rdsosreport.txt collected via dracut shell after ignition timed out

Description of problem:

In a customer environment that has three bare-metal hosts (NUK), a random host fails to acquire IP addresses on boot. As a result, ignition fails to download a remote ignition file and keeps retrying forever. The problem goes away after the host is rebooted, sometimes only after a few reboots.

Version-Release number of selected component (if applicable):

version 1.22.8-6.el8_2

How reproducible:

It can be consistently reproduced in the customer environment. Please contact me if you need access to it.

Steps to Reproduce:

1. Install an OpenShift 4.6 cluster using OpenShift Assisted Installer.
2. Wait until the hosts reboot after installation.

Actual results:

One of the hosts (picked at random) cannot boot and is stuck trying to download the ignition config:

[   10.963282] localhost ignition[851]: GET https://192.168.3.251:22623/config/master: attempt #3
[   10.963823] localhost ignition[851]: GET error: Get "https://192.168.3.251:22623/config/master": dial tcp 192.168.3.251:22623: connect: network is unreachable

When we configure a timeout for downloading ignition, the boot eventually drops to an emergency shell. Running "ip a" there produces the following output:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 1c:69:7a:6d:b6:ed brd ff:ff:ff:ff:ff:ff
    inet6 fe80::1e69:7aff:fe6d:b6ed/64 scope link 
       valid_lft forever preferred_lft forever
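
The output above shows eno1 up with carrier (UP, LOWER_UP) but holding only a link-local IPv6 address. As a rough sketch for spotting the same state on other hosts, the function below lists interfaces that have no "inet" line in "ip a"-style output. The name ifaces_without_ipv4 is hypothetical and not part of any tool mentioned in this report, and the sketch assumes awk is available in the shell where it is run.

```shell
#!/bin/sh
# ifaces_without_ipv4 is a hypothetical helper, not an existing tool.
# It reads `ip a`-style text on stdin and prints the name of every
# interface that has no IPv4 ("inet") address line.
ifaces_without_ipv4() {
    awk '
        # A line such as "2: eno1: <...>" starts a new interface block.
        /^[0-9]+: / {
            if (name != "" && !has4) print name
            name = $2
            sub(/:$/, "", name)   # strip the trailing colon from "eno1:"
            has4 = 0
        }
        # "inet" marks an IPv4 address; "inet6" lines do not match.
        $1 == "inet" { has4 = 1 }
        END { if (name != "" && !has4) print name }
    '
}

# Usage on a live system:
#   ip a | ifaces_without_ipv4
```

For the output quoted in this report, the function prints only eno1, since lo has 127.0.0.1/8.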

Expected results:

All network interfaces are initialized properly and receive IP addresses. A remote ignition configuration can be downloaded. The host boots successfully.
If there's no better solution, we could manage with `rd.net.timeout.carrier` karg as suggested in https://github.com/coreos/fedora-coreos-tracker/issues/708#issuecomment-756846248

Additional info:

The customer has static IP binding to the host MAC addresses in his router.

Comment 1 Thomas Haller 2021-01-19 17:36:24 UTC
It seems that the interface does not have carrier yet; after 6 seconds, NetworkManager assumes the cable is unplugged and continues.

NetworkManager does not yet support rd.net.timeout.carrier, which would help there (that will be https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/merge_requests/730).


Could you check whether it helps to create a NetworkManager configuration snippet in the initrd, such as /etc/NetworkManager/conf.d/90-carrier-wait-timeout.conf (the file name must end in .conf to be read), with the content:

  [device]
  carrier-wait-timeout=20000

that probably requires you to regenerate the initrd.

(alternatively, you can also somehow write it to /run/NetworkManager/conf.d/ (before NetworkManager runs), if that is more convenient)
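
As a sketch of the second option (placing the snippet under /run before NetworkManager starts), something like the following could be used. write_carrier_timeout is a hypothetical helper name; the file name is the one suggested above, with the .conf suffix that NetworkManager requires for files in conf.d. Where and when to hook this into the initrd is left open here.

```shell
#!/bin/sh
# write_carrier_timeout is a hypothetical helper, not part of NetworkManager.
#   $1 = carrier wait timeout in milliseconds
#   $2 = target conf.d directory (defaults to the runtime directory)
write_carrier_timeout() {
    timeout_ms=$1
    conf_dir=${2:-/run/NetworkManager/conf.d}

    mkdir -p "$conf_dir" || return 1

    # Same two lines as the snippet suggested in this comment.
    printf '[device]\ncarrier-wait-timeout=%s\n' "$timeout_ms" \
        > "$conf_dir/90-carrier-wait-timeout.conf"
}

# Example (writes to a scratch directory so the result can be inspected):
#   write_carrier_timeout 20000 /tmp/nm-conf.d
```

Taking the directory as a parameter makes it possible to try the helper outside the initrd before deploying it.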

Comment 2 Beniamino Galvani 2021-01-19 18:09:59 UTC
When rebuilding the initrd, note that you need to explicitly tell dracut to include the file:

 # dracut -f -v -I /etc/NetworkManager/conf.d/90-carrier-wait-timeout.conf

Comment 3 vemporop 2021-01-24 08:57:35 UTC
I don't have direct access to the environment where the problem happens. How can I make sure my changes have any effect? I do see the file when a host is booted, but want to see that the timeout setting works before deploying it.

Comment 4 vemporop 2021-01-24 20:15:52 UTC
Also, I'm not sure I understand in what context to run the dracut command. What I have is an RHCOS image where initramfs-xxx.img is under /boot/ostree/rhcos-8531b9a3611a5ad880a2479e1cf8b720a1a72d49e05aad05ca15fa8ac553353d/initramfs-4.18.0-193.29.1.el8_2.x86_64.img. Let's assume I can modify that (the dracut command above produces a significantly smaller .img for some reason), but then I'll need to repackage the RHCOS image. How do I do that? Or did you have another scenario in mind?

Comment 5 Luca BRUNO 2021-02-18 15:03:41 UTC
Thomas, it is my understanding that the upstream fix is already present in the RPM targeted at RHEL 8.4.
This works great for future Openshift roadmap but leaves a bit of a gap for the current upcoming release (4.7).
May I ask to get MR #730 backported to the current RHEL 8.3 package/branch? We can re-use this ticket to track such backport.

Vitaliy, for RHCOS and this scenario specifically the flow would be indeed to use a `rd.net.timeout.carrier` karg, injected through the Assisted Installer.
If you want to test that flow, there are already some pre-built RHCOS images with 8.4 content, see the mail thread on aos-devel@ with all the references.

Comment 6 vemporop 2021-02-18 15:18:58 UTC
@lucab thanks, will do and update the ticket

Comment 7 vemporop 2021-02-21 13:00:59 UTC
Tried https://releases-rhcos.cloud.privileged.psi.redhat.com/storage/testing/travier-4.7-8.4/47.84.202102161611-0/x86_64/rhcos-47.84.202102161611-0-live.x86_64.iso, but it won't boot on customer's hardware with error

error: ../../grub-core/loader/i386/efi/linux.c:215:/images/pxeboot/vmlinuz has invalid signature.
error: ../../grub-core/loader/i386/efi/linux.c:94:you need to load the kernel first.

No problem booting a VM or old hardware (BIOS) from the same image.

Comment 8 Dominik Holler 2021-02-23 07:28:40 UTC
Created attachment 1758771 [details]
logfiles of a similar scenario

Comment 9 Timothée Ravier 2021-02-23 11:35:07 UTC
4.7-8.4 RHCOS kernels/bootloaders are signed with a non-production Secure Boot signing key so will not boot with Secure Boot on.

Comment 12 vemporop 2021-03-01 09:30:07 UTC
@lucab I got the customer to try https://releases-rhcos.cloud.privileged.psi.redhat.com/storage/testing/travier-4.7-8.4/47.84.202102221652-0/x86_64/rhcos-47.84.202102221652-0-live.x86_64.iso. The installation was successful *without* any changes to kargs - "rd.net.timeout.carrier" or anything else. The only change we had to apply was disabling secure boot because the ISO wasn't signed.

Can you think of any changes that might have fixed the issue without increasing carrier timeout?

Comment 13 Luca BRUNO 2021-03-01 14:21:36 UTC
Vitaliy, I can't really say without seeing the boot logs, but since this issue is a timing race in carrier detection, a few non-deterministic factors are involved. I'd say some variation is possible.

I'm not surprised if any other delay in the initramfs startup made this less likely for a specific customer, or if the carrier detection is slightly faster in some cases and thus fits into the default timeout value.

Comment 14 Gris Ge 2021-03-03 06:03:55 UTC
Hi Luca,

Any action required from NetworkManager side?

Comment 16 Luca BRUNO 2021-03-03 10:22:05 UTC
Gris, yes, see the request at https://bugzilla.redhat.com/show_bug.cgi?id=1917773#c5.

Specifically, we would need https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/merge_requests/730 to be available in RHEL 8.4 and 8.3 (so that latest OpenShift can use it).
For 8.4, it is my understanding that the newer upstream release is already covering that (but a double-check would be nice).
For 8.3, it is my understanding that a backport is required, and we can freely use this BZ to keep track of that.

Comment 17 vemporop 2021-03-03 10:37:44 UTC
I can confirm that 8.4 includes the rd.net.timeout.carrier PR. I was able to see the conf file generated with a correct value from kargs, with https://releases-rhcos.cloud.privileged.psi.redhat.com/storage/testing/travier-4.7-8.4/47.84.202102221652-0/x86_64/rhcos-47.84.202102221652-0-live.x86_64.iso

Comment 18 Gris Ge 2021-03-03 12:03:13 UTC
Hi Thomas,

Is it feasible for us to backport the patch to RHEL 8.3?

Comment 20 Thomas Haller 2021-03-11 13:58:55 UTC
(In reply to Luca BRUNO from comment #5)
> Thomas, it is my understanding that the upstream fix is already present in
> the RPM targeted at RHEL 8.4.
> This works great for future Openshift roadmap but leaves a bit of a gap for
> the current upcoming release (4.7).
> May I ask to get MR #730 backported to the current RHEL 8.3 package/branch?
> We can re-use this ticket to track such backport.
> 
> Vitaliy, for RHCOS and this scenario specifically the flow would be indeed
> to use a `rd.net.timeout.carrier` karg, injected through the Assisted
> Installer.
> If you want to test that flow, there are already some pre-built RHCOS images
> with 8.4 content, see the mail thread on aos-devel@ with all the references.

Yes, it's possible.

It's a bit larger than what we usually would like for a Z-stream, but if necessary, it can be done.

sorry for the slow response.

Comment 22 Thomas Haller 2021-03-12 20:34:45 UTC
OK, so the suggested solution is that nm-initrd-generator honors the "rd.net.timeout.carrier" kernel command-line option, which is an additional delay in seconds to wait for carrier; see `man dracut.cmdline`.

This gives a NetworkManager.conf snippet /run/NetworkManager/conf.d/15-carrier-timeout.conf that sets "[device-15-carrier-timeout] carrier-wait-timeout" (which is documented in `man NetworkManager.conf`).

That feature was added upstream by commit https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/commit/e300138892ee0fc3824d38b527b60103a01758ab and is already present in the latest NetworkManager builds for rhel-8.4 (which ships NetworkManager version 1.30.x), as confirmed in comment 17.
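
As a rough illustration of the mapping described in this comment (seconds on the kernel command line, milliseconds in the generated snippet), a shell sketch could look like the one below. It is not the nm-initrd-generator code: carrier_conf_from_cmdline is a hypothetical name, and letting the last occurrence win and rejecting non-numeric values are assumptions of this sketch.

```shell
#!/bin/sh
# carrier_conf_from_cmdline is a hypothetical stand-in, not the real generator.
#   $1 = kernel command line as a single string
# Prints a conf snippet on stdout; exits non-zero if the option is absent
# or is not a plain non-negative integer.
# The body runs in a subshell so that `set -f` does not leak to the caller.
carrier_conf_from_cmdline() (
    set -f   # no globbing while the command line is split into words
    secs=
    for arg in $1; do
        case $arg in
            # Assumption: the last occurrence on the command line wins.
            rd.net.timeout.carrier=*) secs=${arg#rd.net.timeout.carrier=} ;;
        esac
    done

    case $secs in
        ''|*[!0-9]*) exit 1 ;;
    esac

    # Drop leading zeros so shell arithmetic does not treat the value as octal.
    while [ "${#secs}" -gt 1 ] && [ "${secs#0}" != "$secs" ]; do
        secs=${secs#0}
    done

    printf '[device-15-carrier-timeout]\nmatch-device=*\ncarrier-wait-timeout=%s\n' \
        "$((secs * 1000))"
)

# Usage on a live system:
#   carrier_conf_from_cmdline "$(cat /proc/cmdline)"
```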

I have now backported the patch to the upstream "nm-1-26" branch, which is a prerequisite for the rhel-8.3-z-stream update (rhel-8.3 ships NetworkManager version 1.26.x).


https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=35410569 is a scratch build of that patch for rhel-8.3. This is what I plan to do for rhel-8.3-z-stream.

If possible, please test the 8.3 package. Thank you!! :)

Comment 23 Luca BRUNO 2021-03-15 10:53:53 UTC
> If possible, please test the 8.3 package.

I did a manual OS build (RHCOS image) with proposed 8.3 content:

```
rpm-md repo 'brew-task-repo-NetworkManager-1.26.0-13.rh1917773.1.el8_3-scratch' (cached); generated: 2021-03-12T20:35:11Z
Importing rpm-md... done
Resolving dependencies... done
Installing 468 packages:
  NetworkManager-1:1.26.0-13.rh1917773.1.el8_3.x86_64 (brew-task-repo-NetworkManager-1.26.0-13.rh1917773.1.el8_3-scratch)
  NetworkManager-libnm-1:1.26.0-13.rh1917773.1.el8_3.x86_64 (brew-task-repo-NetworkManager-1.26.0-13.rh1917773.1.el8_3-scratch)
  NetworkManager-ovs-1:1.26.0-13.rh1917773.1.el8_3.x86_64 (brew-task-repo-NetworkManager-1.26.0-13.rh1917773.1.el8_3-scratch)
  NetworkManager-team-1:1.26.0-13.rh1917773.1.el8_3.x86_64 (brew-task-repo-NetworkManager-1.26.0-13.rh1917773.1.el8_3-scratch)
  NetworkManager-tui-1:1.26.0-13.rh1917773.1.el8_3.x86_64 (brew-task-repo-NetworkManager-1.26.0-13.rh1917773.1.el8_3-scratch)

New build ID: 48.83.202103151019-0
```

The initrd cmdline logic worked:

```
# grep PRETTY /etc/initrd-release
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 48.83.202103151019-0 (Ootpa) dracut-049-95.git20200804.el8_3.4 (Initramfs)"

# grep -o rd.net.timeout.carrier='[[:alnum:]]*' /proc/cmdline
rd.net.timeout.carrier=30

# cat /run/NetworkManager/conf.d/15-carrier-timeout.conf
[device-15-carrier-timeout]
match-device=*
carrier-wait-timeout=30000
```
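
The manual check above can be wrapped into a small script for repeated test runs. This is only a sketch: check_carrier_timeout is a hypothetical name, the default paths are the ones quoted in this bug, and using the last matching karg is an assumption.

```shell
#!/bin/sh
# check_carrier_timeout is a hypothetical helper, not an existing tool.
#   $1 = file holding the kernel command line (default: /proc/cmdline)
#   $2 = generated snippet (default: the path named in comment 22)
# Succeeds if carrier-wait-timeout (ms) equals rd.net.timeout.carrier (s) * 1000.
check_carrier_timeout() {
    cmdline_file=${1:-/proc/cmdline}
    conf_file=${2:-/run/NetworkManager/conf.d/15-carrier-timeout.conf}

    # Assumption: if the karg appears more than once, the last one counts.
    secs=$(grep -o 'rd\.net\.timeout\.carrier=[0-9][0-9]*' "$cmdline_file" \
        | tail -n 1 | cut -d= -f2)
    ms=$(sed -n 's/^carrier-wait-timeout=\([0-9][0-9]*\)$/\1/p' "$conf_file" \
        | tail -n 1)

    if [ -z "$secs" ] || [ -z "$ms" ]; then
        echo "missing karg or carrier-wait-timeout setting" >&2
        return 1
    fi

    # Appending "000" multiplies by 1000 without shell arithmetic,
    # which sidesteps octal parsing of values with leading zeros.
    if [ "$ms" -eq "${secs}000" ]; then
        echo "ok: ${secs}s -> ${ms}ms"
    else
        echo "mismatch: karg=${secs}s conf=${ms}ms" >&2
        return 1
    fi
}

# Usage inside the booted initrd or real root:
#   check_carrier_timeout
```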

Comment 24 Thomas Haller 2021-03-15 11:26:44 UTC
(In reply to Luca BRUNO from comment #23)
> > If possible, please test the 8.3 package.
> 
> I did a manual OS build (RHCOS image) with proposed 8.3 content:
> 

that's good, right?

Thanks for testing.


For 8.3, we already provided a custom NetworkManager build to RHCOS.

Does that suffice, or do we also need a proper 8.3 Z-stream build (in addition to the custom RHCOS build)?

Comment 25 Luca BRUNO 2021-03-15 16:09:16 UTC
Yes that's good :)

The `rhaos` build is the one which is strictly needed for RHCOS, that would suffice for us.

It is generally good to keep that in sync with the RHEL 8.3 z-stream content, but I'll defer that to your call. All the tickets I've seen so far on this topic were coming from CoreOS users.

Comment 33 Thomas Haller 2021-03-18 12:34:04 UTC
For rhel-8.3.z this will be fixed by bug 1940071 and build NetworkManager-1.26.0-14.el8_3.

For RHCOS there is now also "NetworkManager-1.26.0-14.1.rhaos4.7.el8"
You can already use it from here: https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1540189

Comment 37 errata-xmlrpc 2021-05-18 13:32:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: NetworkManager and libnma security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:1574

