Bug 1879094

Summary: RHCOS dhcp kernel parameters not working as expected
Product: OpenShift Container Platform Reporter: Prajyot Parab <pparab>
Component: RHCOS Assignee: Dusty Mabe <dustymabe>
Status: CLOSED ERRATA QA Contact: Michael Nguyen <mnguyen>
Severity: high Docs Contact:
Priority: medium    
Version: 4.6 CC: alogan, bbreard, hhei, imcleod, jligon, miabbott, nstielau, pradikum, travier, yshaikh
Target Milestone: ---   
Target Release: 4.7.0   
Hardware: ppc64le   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Enhancement
Doc Text:
Feature: Ability to configure DHCP timeout
Reason: In certain DHCP environments, acquiring a DHCP lease may take longer than the default 45 seconds.
Result: Users now have the ability to configure the timeout value used when trying to acquire a DHCP lease.
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-02-24 15:18:03 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Prajyot Parab 2020-09-15 12:36:58 UTC
Description of problem:

We tried to increase the DHCP retries and timeout to suit our networking environment, but the change does not take effect. Time-syncing all the servers is definitely an option, but it's strange that the kernel params are not working.

Tried passing following params to kernel (but no changes were reflected)
rd.net.timeout.dhcp=100 rd.net.dhcp.retry=10

Any way we can customize the dhcp timeouts and retry counts?

Version-Release number of selected component (if applicable):
4.6

How reproducible:
Random


Steps to Reproduce:
1. Passing params `rd.net.timeout.dhcp=100` `rd.net.dhcp.retry=10` to kernel and booting machine

Actual results:
Kernel params are not reflected, and DHCP does not retry assigning an IP to the machine, resulting in boot failure.

Expected results:
Kernel params configured and DHCP retries and assigns IP resulting in successful machine boot.

Additional info:
NA

Comment 1 Micah Abbott 2020-09-15 14:08:34 UTC
Networking support in RHCOS 4.6 was improved, so that more complex configurations can be supported.  Kernel args should still be a supported mechanism for configuring the network.

Could you add `rd.break` to the kernel args and collect the journal from the system, so that we can see what the networking logs look like?

Comment 2 Dusty Mabe 2020-10-02 15:22:26 UTC
This bug has not been selected for work in the current sprint.

Comment 3 Dusty Mabe 2020-10-22 16:47:49 UTC
The NetworkManager team added support for `rd.net.timeout.dhcp` upstream in https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/commit/fbf54ab.

I did a test with `rd.net.timeout.dhcp=100` in the `next` stream of Fedora CoreOS (based on Fedora 33 with NetworkManager-1.26.2-2.fc33.x86_64) and I see:


```
[    2.828429] NetworkManager[496]: <info>  [1603384675.3589] dhcp4 (ens2): activation: beginning transaction (timeout in 100 seconds)
```


whereas in RHCOS 4.6 right now I see it using the default of 45 seconds:


```
[    4.247269] NetworkManager[734]: <info>  [1603383872.0092] dhcp4 (ens2): activation: beginning transaction (timeout in 45 seconds)
```

Support for the `rd.net.timeout.dhcp` option should exist in OCP/RHCOS 4.7 since the version of NetworkManager in RHEL 8.3 will include it.

As for `rd.net.dhcp.retry`, NM does automatically retry a few times when it times out, but it doesn't look like the number of retries is configurable. If you think support for `rd.net.dhcp.retry` is needed, then please open an RFE against NetworkManager.
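
To compare the effective timeout across builds without reading the logs by eye, something like the following sketch could pull the value out of the NetworkManager message. The regex only assumes the message format seen in the two log lines quoted in this comment; the helper name `dhcp_timeout` is made up for illustration and is not part of NetworkManager.

```python
import re

# Matches the "beginning transaction" message format quoted in this comment.
# Other NetworkManager versions may word the message differently.
TIMEOUT_RE = re.compile(
    r"dhcp4 \((?P<iface>[^)]+)\): activation: beginning transaction "
    r"\(timeout in (?P<seconds>\d+) seconds\)"
)


def dhcp_timeout(line):
    """Return (interface, timeout_seconds), or None if the line does not match."""
    match = TIMEOUT_RE.search(line)
    if match is None:
        return None
    return match.group("iface"), int(match.group("seconds"))


# The two log lines from the tests above (Fedora CoreOS `next` and RHCOS 4.6).
fcos_line = (
    "[    2.828429] NetworkManager[496]: <info>  [1603384675.3589] "
    "dhcp4 (ens2): activation: beginning transaction (timeout in 100 seconds)"
)
rhcos_line = (
    "[    4.247269] NetworkManager[734]: <info>  [1603383872.0092] "
    "dhcp4 (ens2): activation: beginning transaction (timeout in 45 seconds)"
)

print(dhcp_timeout(fcos_line))   # ('ens2', 100)
print(dhcp_timeout(rhcos_line))  # ('ens2', 45)
```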

Comment 4 Dusty Mabe 2020-10-22 16:51:12 UTC
Since `rd.net.timeout.dhcp` isn't supported by the NetworkManager in 4.6, we'll need to help you work around the problem for now. In my tests it looks like the default timeout is 45 seconds and the number of retries is 4, so the final timeout will occur at approximately 180 seconds. It sounds like your DHCP server is taking much longer than that to be able to service the request?
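
The 180-second figure is just the per-attempt timeout multiplied by the number of attempts. A rough sketch of that arithmetic, assuming the attempts run back to back with no pause in between (the actual NetworkManager retry scheduling was not checked here), and assuming the attempt count would stay at 4 if the timeout were raised:

```python
DEFAULT_TIMEOUT_S = 45  # per-attempt DHCP timeout observed in RHCOS 4.6
ATTEMPTS = 4            # number of attempts observed in the tests above


def total_wait(timeout_s, attempts):
    """Approximate time until the final timeout, ignoring any gap between attempts."""
    return timeout_s * attempts


print(total_wait(DEFAULT_TIMEOUT_S, ATTEMPTS))  # 180

# Hypothetical: rd.net.timeout.dhcp=100 with the attempt count unchanged.
print(total_wait(100, ATTEMPTS))  # 400
```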

Comment 5 Dusty Mabe 2020-10-23 19:41:15 UTC
This is being worked on, but is currently awaiting more investigation or more information and won't be completed this sprint.

Comment 6 Dusty Mabe 2020-11-14 15:25:13 UTC
This bug will be fixed when we rebase RHCOS on top of RHEL 8.3. This will occur in the 4.7 timeframe in a future sprint.

Comment 7 Micah Abbott 2020-11-16 20:38:26 UTC
Moving to POST, as we expect to see 8.3 in RHCOS 4.7 soon.

Comment 8 Micah Abbott 2020-12-02 14:38:18 UTC
RHCOS 47.83.202012020056-0 includes RHEL 8.3 and `NetworkManager-1.26.0-9.el8_3` which should include the fix in comment #3.  Moving to MODIFIED

Comment 10 Micah Abbott 2020-12-10 17:14:09 UTC
Verified with RHCOS 47.83.202012072242-0

As noted in comment #3, only `rd.net.timeout.dhcp` is supported by NetworkManager at this time, so I confirmed that the param was picked up by NM correctly:

```
[core@cosa-devsh ~]$ rpm-ostree status
State: idle
Deployments:
* ostree://d70e44dde4765c2b59cedae6c399c7255a4bb877cc80b1be5c93cbe614b1d395
                   Version: 47.83.202012072242-0 (2020-12-07T22:46:11Z)
[core@cosa-devsh ~]$ rpm -q NetworkManager
NetworkManager-1.26.0-9.el8_3.x86_64
[core@cosa-devsh ~]$ cat /proc/cmdline | more 
BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-da2a55fc8655016771f867e78910e69d6ee3b93e3cbc5
aad74660e2b8d9c8e19/vmlinuz-4.18.0-240.7.1.el8_3.x86_64 random.trust_cpu=on cons
ole=tty0 console=ttyS0,115200n8 ignition.platform.id=qemu ignition.firstboot ost
ree=/ostree/boot.1/rhcos/da2a55fc8655016771f867e78910e69d6ee3b93e3cbc5aad74660e2
b8d9c8e19/0 rd.net.timeout.dhcp=100 
[core@cosa-devsh ~]$ journalctl -b -u NetworkManager --no-pager | grep timeout
Dec 10 17:09:35 localhost NetworkManager[1447]: <info>  [1607620175.9195] dhcp4 (ens5): activation: beginning transaction (timeout in 100 seconds)
```
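
For scripting the same check, a minimal sketch that looks up an argument in a kernel command line string is shown below. The helper name `kernel_arg` is hypothetical, the sample string is a shortened version of the `/proc/cmdline` output above, and returning the last occurrence when a key repeats is an assumption of this sketch rather than documented NetworkManager behaviour.

```python
def kernel_arg(cmdline, key):
    """Return the value of the last `key=value` argument, or None if absent."""
    value = None
    for token in cmdline.split():
        name, sep, val = token.partition("=")
        if sep and name == key:
            value = val  # keep scanning so a later duplicate overrides
    return value


# Shortened sample; on a live system this would be open("/proc/cmdline").read().
cmdline = (
    "random.trust_cpu=on console=tty0 console=ttyS0,115200n8 "
    "ignition.platform.id=qemu ignition.firstboot rd.net.timeout.dhcp=100"
)

print(kernel_arg(cmdline, "rd.net.timeout.dhcp"))  # 100
print(kernel_arg(cmdline, "rd.net.dhcp.retry"))    # None
```

The returned value is a string; comparing `int(value)` against the timeout reported in the NetworkManager journal line would confirm that the argument took effect.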

Comment 13 errata-xmlrpc 2021-02-24 15:18:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633