Bug 1879094 - RHCOS dhcp kernel parameters not working as expected
Summary: RHCOS dhcp kernel parameters not working as expected
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.6
Hardware: ppc64le
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Dusty Mabe
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-09-15 12:36 UTC by Prajyot Parab
Modified: 2022-04-11 19:00 UTC (History)

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Feature: Ability to configure DHCP timeout Reason: In certain DHCP environments, acquiring a DHCP lease may take longer than the default 45 seconds. Result: Users now have the ability to configure the timeout value used when trying to acquire a DHCP lease.
Clone Of:
Environment:
Last Closed: 2021-02-24 15:18:03 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Github coreos fedora-coreos-config pull 1607 0 None Waiting on Red Hat Some repositories can not be disabled 2022-06-10 08:44:27 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:18:40 UTC

Internal Links: 1877740

Description Prajyot Parab 2020-09-15 12:36:58 UTC
Description of problem:

We tried to increase the DHCP retry count and timeout to suit our networking environment, but the settings had no effect. Time-syncing all the servers is an option, but it is strange that the kernel parameters are not working.

We passed the following parameters to the kernel, but no changes were reflected:
rd.net.timeout.dhcp=100 rd.net.dhcp.retry=10

Is there any way to customize the DHCP timeout and retry count?

Version-Release number of selected component (if applicable):
4.6

How reproducible:
Random


Steps to Reproduce:
1. Pass the params `rd.net.timeout.dhcp=100` `rd.net.dhcp.retry=10` to the kernel and boot the machine.

Actual results:
The kernel params are not reflected and DHCP does not retry assigning an IP to the machine, resulting in boot failure.

Expected results:
The kernel params are applied and DHCP retries and assigns an IP, resulting in a successful machine boot.

Additional info:
NA

Comment 1 Micah Abbott 2020-09-15 14:08:34 UTC
Networking support in RHCOS 4.6 was improved, so that more complex configurations can be supported.  Kernel args should still be a supported mechanism for configuring the network.

Could you add `rd.break` to the kernel args and collect the journal from the system, so that we can see what the networking logs look like?

Comment 2 Dusty Mabe 2020-10-02 15:22:26 UTC
This bug has not been selected for work in the current sprint.

Comment 3 Dusty Mabe 2020-10-22 16:47:49 UTC
The NetworkManager team added support for `rd.net.timeout.dhcp` upstream in https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/commit/fbf54ab.

I did a test with `rd.net.timeout.dhcp=100` in the `next` stream of Fedora CoreOS (based on Fedora 33 with NetworkManager-1.26.2-2.fc33.x86_64) and I see:


```
[    2.828429] NetworkManager[496]: <info>  [1603384675.3589] dhcp4 (ens2): activation: beginning transaction (timeout in 100 seconds)
```


whereas in RHCOS 4.6 right now I see it using the default of 45 seconds:


```
[    4.247269] NetworkManager[734]: <info>  [1603383872.0092] dhcp4 (ens2): activation: beginning transaction (timeout in 45 seconds)
```

Support for the `rd.net.timeout.dhcp` option should exist in OCP/RHCOS 4.7 since the version of NetworkManager in RHEL 8.3 will include it.
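
For anyone comparing the two behaviours on their own systems, here is a minimal sketch that extracts the interface and timeout from a log line like the two quoted above. The regular expression is derived only from those two excerpts, and the helper name `dhcp_timeout_from_log` is hypothetical; other NetworkManager versions may word the message differently.

```python
import re

# Pattern based solely on the two NetworkManager log lines quoted in this
# comment; treat the exact wording as an assumption for other versions.
DHCP_TIMEOUT_RE = re.compile(
    r"dhcp4 \((?P<iface>[^)]+)\): activation: beginning transaction "
    r"\(timeout in (?P<seconds>\d+) seconds\)"
)


def dhcp_timeout_from_log(line):
    """Return (interface, timeout_seconds), or None if the line does not match."""
    match = DHCP_TIMEOUT_RE.search(line)
    if match is None:
        return None
    return match.group("iface"), int(match.group("seconds"))


fcos_line = (
    "[    2.828429] NetworkManager[496]: <info>  [1603384675.3589] "
    "dhcp4 (ens2): activation: beginning transaction (timeout in 100 seconds)"
)
rhcos_line = (
    "[    4.247269] NetworkManager[734]: <info>  [1603383872.0092] "
    "dhcp4 (ens2): activation: beginning transaction (timeout in 45 seconds)"
)

print(dhcp_timeout_from_log(fcos_line))   # ('ens2', 100)
print(dhcp_timeout_from_log(rhcos_line))  # ('ens2', 45)
```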

As for `rd.net.dhcp.retry`, NM does automatically retry a few times when it times out, but it doesn't look like the number of retries is configurable. If you think support for `rd.net.dhcp.retry` is needed, please open an RFE against NetworkManager.

Comment 4 Dusty Mabe 2020-10-22 16:51:12 UTC
Since `rd.net.timeout.dhcp` isn't supported by the NetworkManager in 4.6, we'll need to help you work around the problem for now. In my tests the default timeout is 45 seconds and the number of attempts is 4, so the final timeout occurs at approximately 180 seconds. It sounds like your DHCP server is taking much longer than that to service the request?
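
The 180-second figure is simply the per-attempt timeout multiplied by the number of attempts. A minimal sketch of that arithmetic follows, assuming every attempt runs to its full timeout with no pause in between, and assuming (unverified) that the attempt count stays at 4 when the timeout is raised; `total_dhcp_wait` is a hypothetical helper, not a NetworkManager API.

```python
def total_dhcp_wait(timeout_seconds, attempts):
    """Worst-case wait before DHCP gives up.

    Assumes each attempt uses its full timeout and that there is no
    delay between attempts; real behaviour may add small gaps.
    """
    return timeout_seconds * attempts


# Values observed in this comment: 45 second timeout, 4 attempts.
print(total_dhcp_wait(45, 4))   # 180

# Hypothetical: the same 4 attempts with rd.net.timeout.dhcp=100.
print(total_dhcp_wait(100, 4))  # 400
```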

Comment 5 Dusty Mabe 2020-10-23 19:41:15 UTC
This is being worked on, but is currently awaiting more investigation or more information and won't be completed this sprint.

Comment 6 Dusty Mabe 2020-11-14 15:25:13 UTC
This bug will be fixed when we rebase RHCOS on top of RHEL 8.3. This will occur in the 4.7 timeframe in a future sprint.

Comment 7 Micah Abbott 2020-11-16 20:38:26 UTC
Moving to POST, as we expect to see 8.3 in RHCOS 4.7 soon.

Comment 8 Micah Abbott 2020-12-02 14:38:18 UTC
RHCOS 47.83.202012020056-0 includes RHEL 8.3 and `NetworkManager-1.26.0-9.el8_3` which should include the fix in comment #3.  Moving to MODIFIED

Comment 10 Micah Abbott 2020-12-10 17:14:09 UTC
Verified with RHCOS 47.83.202012072242-0

As noted in comment #3, only `rd.net.timeout.dhcp` is supported by NetworkManager at this time, so I confirmed that NM correctly honored that param:

```
[core@cosa-devsh ~]$ rpm-ostree status
State: idle
Deployments:
* ostree://d70e44dde4765c2b59cedae6c399c7255a4bb877cc80b1be5c93cbe614b1d395
                   Version: 47.83.202012072242-0 (2020-12-07T22:46:11Z)
[core@cosa-devsh ~]$ rpm -q NetworkManager
NetworkManager-1.26.0-9.el8_3.x86_64
[core@cosa-devsh ~]$ cat /proc/cmdline | more 
BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-da2a55fc8655016771f867e78910e69d6ee3b93e3cbc5
aad74660e2b8d9c8e19/vmlinuz-4.18.0-240.7.1.el8_3.x86_64 random.trust_cpu=on cons
ole=tty0 console=ttyS0,115200n8 ignition.platform.id=qemu ignition.firstboot ost
ree=/ostree/boot.1/rhcos/da2a55fc8655016771f867e78910e69d6ee3b93e3cbc5aad74660e2
b8d9c8e19/0 rd.net.timeout.dhcp=100 
[core@cosa-devsh ~]$ journalctl -b -u NetworkManager --no-pager | grep timeout
Dec 10 17:09:35 localhost NetworkManager[1447]: <info>  [1607620175.9195] dhcp4 (ens5): activation: beginning transaction (timeout in 100 seconds)
```
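
The check in the session above compares the value on the kernel command line with the timeout NetworkManager reports. As a rough sketch, the command-line side of that check could be scripted as below. `kernel_arg` is a hypothetical helper, the sample string is an abbreviated copy of the `/proc/cmdline` output above, and the simple whitespace split does not handle quoted values.

```python
def kernel_arg(cmdline, name):
    """Return the value of the last `name=value` token, or None if absent."""
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if sep and key == name:
            value = val  # later occurrences override earlier ones
    return value


# Abbreviated from the /proc/cmdline output shown above.
sample = (
    "random.trust_cpu=on console=tty0 console=ttyS0,115200n8 "
    "ignition.platform.id=qemu ignition.firstboot rd.net.timeout.dhcp=100"
)

print(kernel_arg(sample, "rd.net.timeout.dhcp"))  # 100
print(kernel_arg(sample, "rd.net.dhcp.retry"))    # None
```

On a live system, the same function could be given the contents of `/proc/cmdline` instead of the sample string.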

Comment 13 errata-xmlrpc 2021-02-24 15:18:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

