Bug 1917773
Summary: | NetworkManager doesn't initialize an interface, reports startup complete when booting for the first time | |
---|---|---|---
Product: | Red Hat Enterprise Linux 8 | Reporter: | vemporop
Component: | NetworkManager | Assignee: | Thomas Haller <thaller>
Status: | CLOSED ERRATA | QA Contact: | Desktop QE <desktop-qa-list>
Severity: | unspecified | Docs Contact: |
Priority: | unspecified | |
Version: | 8.2 | CC: | acardace, atragler, bgalvani, bradnichols, dholler, dornelas, fge, fgiloux, jmaxwell, keyoung, lrintel, lucab, rkhan, rsdeor, sukulkar, thaller, till, tpelka, travier, vbenes, ykashtan
Target Milestone: | rc | Keywords: | Triaged, ZStream
Target Release: | 8.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | NetworkManager-1.30.0-2.el8 | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | |
: | 1940071 | Environment: |
Last Closed: | 2021-05-18 13:32:37 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1922417, 1940071, 1943809 | |
Attachments: | | |
Description
vemporop
2021-01-19 11:37:17 UTC
It seems that the interface does not have carrier, and after 6 seconds NetworkManager assumes the cable is unplugged and continues. NetworkManager does not yet support `rd.net.timeout.carrier`, which would help there (that will be https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/merge_requests/730).

Could you try whether it helps to create a NetworkManager configuration snippet in the initrd, like /etc/NetworkManager/conf.d/90-carrier-wait-timeout.conf, with the content `[device]` followed by `carrier-wait-timeout=20000` on the next line? That probably requires you to regenerate the initrd. Alternatively, you can write it to /run/NetworkManager/conf.d/ (before NetworkManager runs), if that is more convenient.

When rebuilding the initrd, note that you need to explicitly tell dracut to include the file: `dracut -f -v -I /etc/NetworkManager/conf.d/90-carrier-wait-timeout.conf`

I don't have direct access to the environment where the problem happens. How can I make sure my changes have any effect? I do see the file when a host is booted, but I want to see that the timeout setting works before deploying it. Also, I'm not sure I understand in what context to run the dracut command. What I have is an RHCOS image where initramfs-xxx.img is under /boot/ostree/rhcos-8531b9a3611a5ad880a2479e1cf8b720a1a72d49e05aad05ca15fa8ac553353d/initramfs-4.18.0-193.29.1.el8_2.x86_64.img. Let's assume I can modify that (the dracut command above produces a significantly smaller .img for some reason), but then I'll need to repackage the RHCOS image. How do I do that? Or did you have another scenario in mind?

Thomas, it is my understanding that the upstream fix is already present in the RPM targeted at RHEL 8.4. This works great for the future OpenShift roadmap but leaves a bit of a gap for the current upcoming release (4.7). May I ask to get MR #730 backported to the current RHEL 8.3 package/branch? We can re-use this ticket to track such a backport.
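
A minimal sketch of the configuration-snippet workaround described above, written to a scratch directory so it can be tried without touching a real system. The `CONF_DIR` variable and the use of `mktemp` are assumptions of this sketch; the file name and its contents are taken from the comment.

```shell
#!/bin/sh
# Assumption: CONF_DIR defaults to a scratch directory, so nothing under /etc is modified.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"
CONF_FILE="$CONF_DIR/90-carrier-wait-timeout.conf"

# Write the snippet. NetworkManager only reads conf.d files whose names end in ".conf".
cat > "$CONF_FILE" <<'EOF'
[device]
carrier-wait-timeout=20000
EOF

echo "wrote $CONF_FILE"

# On a real host (as root), with CONF_DIR=/etc/NetworkManager/conf.d, the initrd
# would then be rebuilt with the command quoted in the comment:
#   dracut -f -v -I /etc/NetworkManager/conf.d/90-carrier-wait-timeout.conf
```

Whether the setting takes effect still has to be checked on the affected hardware; the sketch only produces the file in the expected format.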
Vitaliy, for RHCOS and this scenario specifically, the flow would indeed be to use a `rd.net.timeout.carrier` karg, injected through the Assisted Installer. If you want to test that flow, there are already some pre-built RHCOS images with 8.4 content, see the mail thread on aos-devel@ with all the references.

@lucab thanks, will do and update the ticket.

Tried https://releases-rhcos.cloud.privileged.psi.redhat.com/storage/testing/travier-4.7-8.4/47.84.202102161611-0/x86_64/rhcos-47.84.202102161611-0-live.x86_64.iso, but it won't boot on the customer's hardware, failing with these errors:

`error: ../../grub-core/loader/i386/efi/linux.c:215:/images/pxeboot/vmlinuz has invalid signature.`
`error: ../../grub-core/loader/i386/efi/linux.c:94:you need to load the kernel first.`

No problem booting a VM or old hardware (BIOS) from the same image.

Created attachment 1758771
logfiles of a similar scenario
4.7-8.4 RHCOS kernels/bootloaders are signed with a non-production Secure Boot signing key, so they will not boot with Secure Boot on.

@lucab I got the customer to try https://releases-rhcos.cloud.privileged.psi.redhat.com/storage/testing/travier-4.7-8.4/47.84.202102221652-0/x86_64/rhcos-47.84.202102221652-0-live.x86_64.iso. The installation was successful *without* any changes to kargs - "rd.net.timeout.carrier" or anything else. The only change we had to apply was disabling Secure Boot because the ISO wasn't signed. Can you think of any changes that might have fixed the issue without increasing the carrier timeout?

Vitaliy, I can't really say without seeing the boot logs, but as this issue is about a timing race related to carrier detection, there are a few non-deterministic factors in it. I'd say some variations/fluctuations are possible. I wouldn't be surprised if some other delay in the initramfs startup made this less likely for a specific customer, or if the carrier detection is slightly faster in some cases and thus fits into the default timeout value.

Hi Luca, is any action required from the NetworkManager side?

Gris, yes, see the request at https://bugzilla.redhat.com/show_bug.cgi?id=1917773#c5. Specifically, we would need https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/merge_requests/730 to be available in RHEL 8.4 and 8.3 (so that the latest OpenShift can use it). For 8.4, it is my understanding that the newer upstream release already covers that (but a double-check would be nice). For 8.3, it is my understanding that a backport is required, and we can freely use this BZ to keep track of that.

I can confirm that 8.4 includes the rd.net.timeout.carrier PR. I was able to see the conf file generated with a correct value from kargs, with https://releases-rhcos.cloud.privileged.psi.redhat.com/storage/testing/travier-4.7-8.4/47.84.202102221652-0/x86_64/rhcos-47.84.202102221652-0-live.x86_64.iso

Hi Thomas, is it doable for us to backport the patch to RHEL 8.3?
(In reply to Luca BRUNO from comment #5)
> Thomas, it is my understanding that the upstream fix is already present in the RPM targeted at RHEL 8.4.
> This works great for future Openshift roadmap but leaves a bit of a gap for the current upcoming release (4.7).
> May I ask to get MR #730 backported to the current RHEL 8.3 package/branch?
> We can re-use this ticket to track such backport.
>
> Vitaliy, for RHCOS and this scenario specifically the flow would be indeed to use a `rd.net.timeout.carrier` karg, injected through the Assisted Installer.
> If you want to test that flow, there are already some pre-built RHCOS images with 8.4 content, see the mail thread on aos-devel@ with all the references.

Yes, it's possible. It's a bit larger than what we would usually like for a Z-stream, but if necessary, it can be done.

Sorry for the slow response.

OK, so the suggested solution for this is that nm-initrd-generator honors the "rd.net.timeout.carrier" kernel command line option, which is an additional delay in seconds to wait for carrier, see `man dracut.cmdline`. This generates a NetworkManager.conf snippet /run/NetworkManager/conf.d/15-carrier-timeout.conf that sets "[device-15-carrier-timeout] carrier-wait-timeout" (which is documented in `man NetworkManager.conf`).

That feature was added upstream by commit https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/commit/e300138892ee0fc3824d38b527b60103a01758ab, and is already present in the latest NetworkManager builds for rhel-8.4 (which ships NetworkManager version 1.30.x), as said in comment 17.

I have now backported the patch to the upstream "nm-1-26" branch, which is a prerequisite for the rhel-8.3-z-stream update (rhel-8.3 ships NetworkManager version 1.26.x).

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=35410569 is a scratch build of that patch for rhel-8.3. This is what I plan to do for rhel-8.3-z-stream. If possible, please test the 8.3 package.

Thank you!! :)

> If possible, please test the 8.3 package.
I did a manual OS build (RHCOS image) with proposed 8.3 content:
```
rpm-md repo 'brew-task-repo-NetworkManager-1.26.0-13.rh1917773.1.el8_3-scratch' (cached); generated: 2021-03-12T20:35:11Z
Importing rpm-md... done
Resolving dependencies... done
Installing 468 packages:
NetworkManager-1:1.26.0-13.rh1917773.1.el8_3.x86_64 (brew-task-repo-NetworkManager-1.26.0-13.rh1917773.1.el8_3-scratch)
NetworkManager-libnm-1:1.26.0-13.rh1917773.1.el8_3.x86_64 (brew-task-repo-NetworkManager-1.26.0-13.rh1917773.1.el8_3-scratch)
NetworkManager-ovs-1:1.26.0-13.rh1917773.1.el8_3.x86_64 (brew-task-repo-NetworkManager-1.26.0-13.rh1917773.1.el8_3-scratch)
NetworkManager-team-1:1.26.0-13.rh1917773.1.el8_3.x86_64 (brew-task-repo-NetworkManager-1.26.0-13.rh1917773.1.el8_3-scratch)
NetworkManager-tui-1:1.26.0-13.rh1917773.1.el8_3.x86_64 (brew-task-repo-NetworkManager-1.26.0-13.rh1917773.1.el8_3-scratch)
New build ID: 48.83.202103151019-0
```
The initrd cmdline logic worked:
```
# grep PRETTY /etc/initrd-release
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 48.83.202103151019-0 (Ootpa) dracut-049-95.git20200804.el8_3.4 (Initramfs)"
# grep -o rd.net.timeout.carrier='[[:alnum:]]*' /proc/cmdline
rd.net.timeout.carrier=30
# cat /run/NetworkManager/conf.d/15-carrier-timeout.conf
[device-15-carrier-timeout]
match-device=*
carrier-wait-timeout=30000
```
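
The mapping visible in the output above (a value in seconds on the kernel command line becoming `carrier-wait-timeout` in milliseconds) can be sketched in shell. This is not how nm-initrd-generator is implemented; it only reproduces the observed result. The function name `carrier_conf_from_cmdline` and the sample command line are hypothetical, and rejecting anything other than a plain integer is an assumption of the sketch.

```shell
#!/bin/sh
# carrier_conf_from_cmdline CMDLINE
# Prints a NetworkManager.conf snippet when rd.net.timeout.carrier=<seconds> is present.
# The body runs in a subshell so that "set -f" and the variables do not leak out.
carrier_conf_from_cmdline() (
    set -f                      # no glob expansion while word-splitting the command line
    seconds=
    for arg in $1; do
        case "$arg" in
            rd.net.timeout.carrier=*) seconds="${arg#rd.net.timeout.carrier=}" ;;
        esac
    done
    # Assumption: accept only a plain non-negative integer without leading zeros.
    case "$seconds" in
        ''|*[!0-9]*|0?*) exit 1 ;;
    esac
    # Seconds on the command line become milliseconds in the snippet.
    printf '[device-15-carrier-timeout]\nmatch-device=*\ncarrier-wait-timeout=%s\n' \
        "$((seconds * 1000))"
)

# Hypothetical sample command line; on a booted system one could pass "$(cat /proc/cmdline)".
carrier_conf_from_cmdline "BOOT_IMAGE=/vmlinuz rd.net.timeout.carrier=30 ip=dhcp"
```

With the sample input, the printed snippet matches the three lines of 15-carrier-timeout.conf shown in the transcript.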
(In reply to Luca BRUNO from comment #23)
> > If possible, please test the 8.3 package.
>
> I did a manual OS build (RHCOS image) with proposed 8.3 content:

That's good, right? Thanks for testing.

For 8.3, we already provided a custom NetworkManager build to RHCOS. Does that suffice, or do we need a proper 8.3-Z-stream build (in addition to the customer RHCOS build)?

Yes that's good :) The `rhaos` build is the one which is strictly needed for RHCOS, so that would suffice for us. It is generally good to keep that in sync with the RHEL 8.3 z-stream content, but I'll defer that to your call. All the tickets I've seen so far on this topic were coming from CoreOS users.

For rhel-8.3.z this will be fixed by bug 1940071 and build NetworkManager-1.26.0-14.el8_3.

For RHCOS there is now also "NetworkManager-1.26.0-14.1.rhaos4.7.el8". You can already use it from here: https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1540189

Covered in the unit tests' cmdline generator test: https://src.osci.redhat.com/rpms/NetworkManager/blob/000a199ac0f8dfe5beed5604edaf6ef649a60bba/f/1025-initrd-timeout-carrier-rh1917773.patch

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: NetworkManager and libnma security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:1574