Description of problem:
We tried to increase the DHCP retry count and timeout to suit our networking environment, but the settings had no effect. Time syncing all the servers is definitely an option, but it is strange that the kernel parameters are not working. We passed the following parameters to the kernel, and no changes were reflected:

rd.net.timeout.dhcp=100 rd.net.dhcp.retry=10

Is there any way to customize the DHCP timeout and retry count?

Version-Release number of selected component (if applicable):
4.6

How reproducible:
Random

Steps to Reproduce:
1. Pass the parameters `rd.net.timeout.dhcp=100` and `rd.net.dhcp.retry=10` to the kernel and boot the machine.

Actual results:
The kernel parameters are not reflected, and DHCP does not retry assigning an IP to the machine, resulting in boot failure.

Expected results:
The kernel parameters take effect, and DHCP retries and assigns an IP, resulting in a successful machine boot.

Additional info:
NA
Networking support in RHCOS 4.6 was improved, so that more complex configurations can be supported. Kernel args should still be a supported mechanism for configuring the network. Could you add `rd.break` to the kernel args and collect the journal from the system, so that we can see what the networking logs look like?
This bug has not been selected for work in the current sprint.
The NetworkManager team added support for `rd.net.timeout.dhcp` upstream in https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/commit/fbf54ab.

I did a test with `rd.net.timeout.dhcp=100` in the `next` stream of Fedora CoreOS (based on Fedora 33 with NetworkManager-1.26.2-2.fc33.x86_64) and I see:

```
[ 2.828429] NetworkManager[496]: <info> [1603384675.3589] dhcp4 (ens2): activation: beginning transaction (timeout in 100 seconds)
```

whereas in RHCOS 4.6 right now I see it using the default of 45 seconds:

```
[ 4.247269] NetworkManager[734]: <info> [1603383872.0092] dhcp4 (ens2): activation: beginning transaction (timeout in 45 seconds)
```

Support for the `rd.net.timeout.dhcp` option should exist in OCP/RHCOS 4.7, since the version of NetworkManager in RHEL 8.3 will include it.

As for `rd.net.dhcp.retry`, NM does automatically retry a few times when it times out, but it doesn't look like the number of retries is configurable. If you think support for `rd.net.dhcp.retry` is needed, please open an RFE against NetworkManager.
Since `rd.net.timeout.dhcp` isn't supported by the NetworkManager in 4.6, we'll need to help you work around the problem for now. In my tests, the default timeout is 45 seconds and the number of retries is 4, so the final timeout occurs at approximately 180 seconds. It sounds like your DHCP server is taking much longer than that to service the request?
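For reference, the arithmetic behind the 180-second figure can be sketched as below. This is only a rough estimate: it assumes the attempts run back to back with no delay between them, and the second call assumes the attempt count stays at 4 when the timeout is raised, which has not been confirmed in this bug. The function name `total_dhcp_wait` is made up for illustration.

```shell
# Rough worst-case DHCP wait: per-attempt timeout (seconds) times attempts.
# Assumption: attempts are sequential with no pause in between.
total_dhcp_wait() {
    echo $(( $1 * $2 ))
}

# Defaults observed on RHCOS 4.6: 45 seconds, 4 attempts.
total_dhcp_wait 45 4     # prints 180

# With rd.net.timeout.dhcp=100, assuming the attempt count is still 4.
total_dhcp_wait 100 4    # prints 400
```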
This is being worked on, but is currently awaiting more investigation or more information and won't be completed this sprint.
This bug will be fixed when we rebase RHCOS on top of RHEL 8.3. This will occur in the 4.7 timeframe in a future sprint.
Moving to POST, as we expect to see 8.3 in RHCOS 4.7 soon.
RHCOS 47.83.202012020056-0 includes RHEL 8.3 and `NetworkManager-1.26.0-9.el8_3`, which should include the fix in comment #3. Moving to MODIFIED.
Verified with RHCOS 47.83.202012072242-0.

As noted in comment #3, only `rd.net.timeout.dhcp` is supported by NetworkManager at this time, so I confirmed that NM honors that parameter:

```
[core@cosa-devsh ~]$ rpm-ostree status
State: idle
Deployments:
* ostree://d70e44dde4765c2b59cedae6c399c7255a4bb877cc80b1be5c93cbe614b1d395
    Version: 47.83.202012072242-0 (2020-12-07T22:46:11Z)

[core@cosa-devsh ~]$ rpm -q NetworkManager
NetworkManager-1.26.0-9.el8_3.x86_64

[core@cosa-devsh ~]$ cat /proc/cmdline | more
BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-da2a55fc8655016771f867e78910e69d6ee3b93e3cbc5
aad74660e2b8d9c8e19/vmlinuz-4.18.0-240.7.1.el8_3.x86_64 random.trust_cpu=on cons
ole=tty0 console=ttyS0,115200n8 ignition.platform.id=qemu ignition.firstboot ost
ree=/ostree/boot.1/rhcos/da2a55fc8655016771f867e78910e69d6ee3b93e3cbc5aad74660e2
b8d9c8e19/0 rd.net.timeout.dhcp=100

[core@cosa-devsh ~]$ journalctl -b -u NetworkManager --no-pager | grep timeout
Dec 10 17:09:35 localhost NetworkManager[1447]: <info> [1607620175.9195] dhcp4 (ens5): activation: beginning transaction (timeout in 100 seconds)
```
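The manual check above could be scripted roughly as follows. This is a minimal sketch, not something that was run as part of the verification: the helper names `cmdline_dhcp_timeout` and `nm_logged_dhcp_timeout` are hypothetical, and the log parsing assumes the "beginning transaction (timeout in N seconds)" message format shown in this bug.

```shell
# Print the value of rd.net.timeout.dhcp from a kernel command line string
# (last occurrence wins; prints nothing if the parameter is absent).
cmdline_dhcp_timeout() {
    for arg in $1; do
        case "$arg" in
            rd.net.timeout.dhcp=*) echo "${arg#rd.net.timeout.dhcp=}" ;;
        esac
    done | tail -n 1
}

# Read NetworkManager journal lines on stdin and print the DHCP timeout
# from the last "beginning transaction" message.
nm_logged_dhcp_timeout() {
    sed -n 's/.*beginning transaction (timeout in \([0-9][0-9]*\) seconds).*/\1/p' | tail -n 1
}

# Example use on a booted system (left commented out, as it needs a live host):
#   want=$(cmdline_dhcp_timeout "$(cat /proc/cmdline)")
#   got=$(journalctl -b -u NetworkManager --no-pager | nm_logged_dhcp_timeout)
#   [ "$want" = "$got" ] && echo "DHCP timeout applied: ${got}s"
```

If the two values differ, the NetworkManager build in use most likely predates support for the parameter, as was the case with RHCOS 4.6.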
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633