Bug 1836248 - [telco] OpenShift 4.4.3 Bare Metal IPI: RHCOS DHCPDISCOVERs each NIC
Summary: [telco] OpenShift 4.4.3 Bare Metal IPI: RHCOS DHCPDISCOVERs each NIC
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: Dusty Mabe
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks: 1186913
 
Reported: 2020-05-15 13:15 UTC by Jean-Francois Saucier
Modified: 2023-10-06 20:04 UTC
CC: 17 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Feature: It was requested that machines with a large number of NICs on networks without DHCP not take a long time to boot.
Reason: The legacy network scripts implementation used for networking bringup in the initramfs attempts DHCP on every interface in the machine one by one, which leads to long boot delays if the machine has many interfaces.
Result: RHCOS was switched to NetworkManager in the initramfs instead of the legacy network scripts. NetworkManager does two things that help with the long timeout problem. First, it won't attempt DHCP on any interface without a physical connection. Second, it attempts DHCP on all qualifying interfaces in parallel, leading to much less time spent waiting on DHCP timeouts.
Clone Of:
Environment:
Last Closed: 2020-10-27 16:00:02 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:00:07 UTC

Description Jean-Francois Saucier 2020-05-15 13:15:52 UTC
Description of problem:
During a bare-metal IPI installation of OpenShift 4.4.3, master nodes successfully boot via PXE and reboot into RHCOS for the first time, so that their ignition configuration can be applied.

In this very first reboot into RHCOS, the bare-metal machines can take more than an hour to complete their first boot.

These machines are equipped with multiple network interface cards, and the messages on the RHCOS console suggest that:
(a) the DHCP client tries each NIC *in series*
(b) the DHCP client waits 3-5 minutes before timing out and moving on to the next NIC

On these machines, most NICs have no cables attached to them.  We must wait until (something like) the first 7 NICs have timed out waiting for a DHCP lease before the correct NIC is attempted.

It would be useful to have all these NICs attempted in parallel, so that we're not waiting in series for DHCP leases.


Version-Release number of the following components:
OpenShift 4.4.3


How reproducible:
Every time

Comment 10 Dusty Mabe 2020-05-15 16:56:57 UTC
To limit the number of interfaces DHCP is tried on, it should be sufficient to replace the `ip=dhcp` argument on the kernel command line with `ip=$NIC:dhcp`, where $NIC is the name of the single NIC you want DHCP on (e.g. ens2). You should be able to apply this kernel parameter for the "install boot" (the boot where you are doing the bare metal install) via PXE and it will get propagated to the first boot of the installed system.
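As a sketch, in a PXE config this looks something like the following (the interface name `ens2` and the surrounding arguments are illustrative, not taken from this cluster's config):

```
# before: DHCP is attempted on every interface
append coreos.inst=yes ... ip=dhcp

# after: DHCP is attempted only on ens2
append coreos.inst=yes ... ip=ens2:dhcp
```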

Comment 11 Jean-Francois Saucier 2020-05-15 17:14:47 UTC
@Dusty, sorry if this is a trivial question, but do you have an idea of where we should specify this so it is taken into account during the deployment of OpenShift BM IPI?

Comment 12 Dusty Mabe 2020-05-15 17:31:31 UTC
I assume you're using kargs to do the install over PXE, where you have to specify things like coreos.inst=yes. In that same place where you are specifying kargs you most likely also have an `ip=dhcp` kernel argument. Change it to `ip=$NIC:dhcp`.

Comment 13 Jean-Francois Saucier 2020-05-15 18:51:36 UTC
Do we need to generate the manifests with openshift-baremetal-install to get to this setting, or is it something we can configure in install-config.yaml? I checked the docs and my environment but did not find information on how to pass this parameter to the OpenShift install.

Comment 14 Dusty Mabe 2020-05-15 19:47:11 UTC
hey Jean-Francois - It is the RHCOS install, which I think should be separate from the openshift-installer (though I admittedly don't have experience with IPI). In IPI do you ever execute a step like this https://docs.openshift.com/container-platform/4.4/installing/installing_bare_metal/installing-bare-metal.html#installation-user-infra-machines-pxe_installing-bare-metal ?

Comment 15 Jean-Francois Saucier 2020-05-15 19:57:24 UTC
@Dusty, no, this step is automated by the installer when doing a baremetal IPI deployment. You create an install-config.yaml file and run "openshift-baremetal-install create cluster", and it launches everything it needs.

I tried to find a way to customize this parameter but have not found one yet.

Comment 18 Julia Kreger 2020-05-19 16:36:42 UTC
For the purposes of additional clarity, the image that has been deployed is a disk image, after the network boot operation has completed. In other words, there is no ability to specify a parameter for the initial ramdisk's processing.

Comment 20 Dusty Mabe 2020-05-19 18:21:18 UTC
@julia, the suggested workaround (https://bugzilla.redhat.com/show_bug.cgi?id=1836248#c10) is to change the karg for the install, which is the "network boot operation" you are referring to. That will get propagated forward into the first boot of the machine (i.e. the boot from disk). Basically we need the ability to tweak some of the kernel arguments in the PXE config. Can you advise on that front?

Comment 22 Colin Walters 2020-05-19 19:33:32 UTC
This is basically https://github.com/coreos/ignition/issues/979

Comment 29 Neil Horman 2020-05-28 14:14:24 UTC
FWIW, dracut can also be configured with DHCP timeout and retry parameters so that we send only a single DHCP discover message and wait a short amount of time for a response, which would accelerate this process significantly.

Comment 30 Neil Horman 2020-05-28 14:16:06 UTC
for reference:
rd.net.timeout.dhcp
rd.net.dhcp.retry

are the kernel command line parameters that direct dracut in how often to retry and how long to wait between retries
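For example (the values here are illustrative, not tested recommendations), appended to the kernel command line these would wait 10 seconds for a DHCP response and retry once:

```
rd.net.timeout.dhcp=10 rd.net.dhcp.retry=1
```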

Comment 37 Dusty Mabe 2020-06-10 04:31:34 UTC
This bug has not been selected for work in the current sprint.

Comment 38 Colin Walters 2020-06-17 02:23:08 UTC
We are having a parallel discussion/debate on this here https://github.com/coreos/ignition/issues/979

Comment 39 Dusty Mabe 2020-06-19 16:18:27 UTC
Suggestion for possible future IPI deployer behavior: https://github.com/coreos/ignition/issues/979#issuecomment-646725569

Comment 40 Dusty Mabe 2020-06-22 21:57:30 UTC
From what I understand there are a few changes coming that could attack this problem from different angles.

First there is the discussion going on in https://github.com/coreos/ignition/issues/979 about the future of IPI provisioning RHCOS. I'm not sure if all of that will land in 4.6, but most of it should.

Second, in 4.6 we already landed a change that moves network bringup in the initrd to NetworkManager. It appears that NM does try to bring up all interfaces in parallel. I just performed some tests and I believe this will solve the customer's immediate need. The original description also states: "On these machines, most NICs have no cables attached to them." I did some more testing and verified that if there is no network cable plugged in then NM won't even try DHCP.
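To illustrate why the parallel behavior shortens boot so much, here is a toy shell sketch (this is not NetworkManager itself; each `sleep` stands in for one interface's DHCP timeout): in series the waits add up, in parallel you only wait for the slowest one.

```shell
#!/bin/sh
# Stand-in for one interface waiting out its DHCP timeout (1 second here).
wait_for_lease() { sleep 1; }

# Legacy network scripts style: try each interface one after another.
t0=$(date +%s)
for nic in nic1 nic2 nic3; do wait_for_lease "$nic"; done
serial_elapsed=$(( $(date +%s) - t0 ))

# NetworkManager style: try all interfaces at once, wait for the slowest.
t0=$(date +%s)
for nic in nic1 nic2 nic3; do wait_for_lease "$nic" & done
wait
parallel_elapsed=$(( $(date +%s) - t0 ))

echo "serial: ${serial_elapsed}s, parallel: ${parallel_elapsed}s"
```

With real DHCP timeouts of minutes per NIC and 7+ unplugged NICs, the difference between the sum and the maximum is what turns an hour-long boot into under a minute.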

The way I tested this was by using a VM and simulating unplugging the network cables (see https://unix.stackexchange.com/questions/81044/emulate-unplugging-a-network-cable-with-qemu-kvm). I started a machine on a bridge without DHCP:

```
$ sudo virsh net-dumpxml nodhcp
<network>
  <name>nodhcp</name>
  <uuid>626e6e74-49c3-4eb2-87f9-4539f944888e</uuid>
  <forward mode='nat'>
    <nat>
      <port start='1024' end='65535'/>
    </nat>
  </forward>
  <bridge name='virbr100' stp='on' delay='0'/>
  <mac address='52:54:00:3a:b3:4d'/>
  <ip address='192.168.130.1' netmask='255.255.255.0'>
  </ip>
</network>
```

Then started a VM with 8 interfaces on that network

```
virt-install --import --name tester \
  --cpu host-passthrough --ram 2048 --vcpus 2 \
  --boot menu=on,useserial=on --accelerate --graphics none --force \
  --qemu-commandline="-fw_cfg name=opt/com.coreos/config,file=/var/b/images/fcct-auto-login-ttyS0.ign" \
  --disk /var/b/images/rhcos-46.82.202006221550-0-qemu.x86_64.qcow2 \
  --rng random \
  --network bridge=virbr100,model=virtio \
  --network bridge=virbr100,model=virtio \
  --network bridge=virbr100,model=virtio \
  --network bridge=virbr100,model=virtio \
  --network bridge=virbr100,model=virtio \
  --network bridge=virbr100,model=virtio \
  --network bridge=virbr100,model=virtio \
  --network bridge=virbr100,model=virtio
```

I stopped the VM at the grub menu and ran a command for each interface to "unplug" the cable:


```
virsh dumpxml tester | grep -i mac
virsh domif-setlink tester 52:54:00:3d:07:22 down
...
...
```

I then pressed enter at the grub prompt to continue boot. I was presented with a Login prompt in around 20 seconds. If I don't "unplug" the cable from the interfaces then DHCP is attempted we have to wait for timeouts before the boot continues (as is the experience of the customer here in this issue).


Jean-Francois Saucier - I think the immediate need here will be solved by the move to NetworkManager in 4.6, and in the future (maybe in 4.6) we'll have other pieces in place so networking won't even be brought up in the initramfs. Can you confirm that what I'm proposing here is sufficient?

Comment 41 Dusty Mabe 2020-06-22 21:59:45 UTC
(In reply to Dusty Mabe from comment #40)

> I then pressed enter at the grub prompt to continue boot. I was presented
> with a Login prompt in around 20 seconds. If I don't "unplug" the cable from
> the interfaces then DHCP is attempted we have to wait for timeouts before
> the boot continues (as is the experience of the customer here in this issue).

Ignore that 2nd sentence, it's inaccurate.

Comment 42 Dusty Mabe 2020-06-26 20:46:24 UTC
This is being worked on, but is currently awaiting more investigation or more information and won't be completed this sprint.

Comment 44 Dusty Mabe 2020-07-06 17:29:52 UTC
RHCOS/OCP 4.6 will use NetworkManager to bring up the network in the initramfs, which will make the long timeout issue go away as described in https://bugzilla.redhat.com/show_bug.cgi?id=1836248#c40 . Moving this bug to MODIFIED.

Comment 47 Dusty Mabe 2020-07-07 17:47:10 UTC
(In reply to Dusty Mabe from comment #41)
> (In reply to Dusty Mabe from comment #40)
> 
> > I then pressed enter at the grub prompt to continue boot. I was presented
> > with a Login prompt in around 20 seconds. If I don't "unplug" the cable from
> > the interfaces then DHCP is attempted we have to wait for timeouts before
> > the boot continues (as is the experience of the customer here in this issue).
> 
> Ignore that 2nd sentence, it's inaccurate.

To further clarify, the statement should have read:

I then pressed enter at the grub prompt to continue boot. I was presented with a login prompt in around 20 seconds. If I don't "unplug" the cable from the interfaces then we are still OK because DHCP is attempted in parallel. The DHCP attempts will time out, but since they all happen in parallel the time to get to the login prompt is <60 seconds for my trivial VM test case. This is much more reasonable than the hour-long wait the BZ opener reported.

Comment 48 Michael Nguyen 2020-07-07 18:09:29 UTC
Verified on RHCOS 46.82.202007062141-0 which is a part of 4.6.0-0.nightly-2020-07-07-083718 using steps from https://bugzilla.redhat.com/show_bug.cgi?id=1836248#c40

# cat << EOF > nodhcp.xml 
<network>
  <name>nodhcp</name>
  <uuid>626e6e74-49c3-4eb2-87f9-4539f944888e</uuid>
  <forward mode='nat'>
    <nat>
      <port start='1024' end='65535'/>
    </nat>
  </forward>
  <bridge name='virbr100' stp='on' delay='0'/>
  <mac address='52:54:00:3a:b3:4d'/>
  <ip address='192.168.131.1' netmask='255.255.255.0'>
  </ip>
</network>
EOF

# virsh net-create -f nodhcp.xml
# virt-install --import --name tester \
    --cpu host-passthrough --ram 2048 --vcpus 2 \
    --boot menu=on,useserial=on --accelerate --graphics none --force \
    --qemu-commandline="-fw_cfg name=opt/com.coreos/config,file=/var/lib/libvirt/images/rhah/ignition" \
    --disk /var/lib/libvirt/images/rhah/rhcos-46.82.202007071437-0-qemu.x86_64.qcow2 \
    --rng random \
    --network bridge=virbr100,model=virtio \
    --network bridge=virbr100,model=virtio \
    --network bridge=virbr100,model=virtio \
    --network bridge=virbr100,model=virtio \
    --network bridge=virbr100,model=virtio \
    --network bridge=virbr100,model=virtio \
    --network bridge=virbr100,model=virtio \
    --network bridge=virbr100,model=virtio

*interrupt grub menu*

# virsh dumpxml tester| grep -i 'mac address' | cut -d\' -f2 | xargs -I % sudo virsh domif-setlink tester % down

*continue boot*.

Results:  With the link down, there is no additional wait for unplugged interfaces.  With the links up, there is a single wait of 45 seconds for DHCP to time out regardless of how many interfaces there are.

Comment 49 Michael Nguyen 2020-07-09 21:05:33 UTC
This was actually verified on rhcos 46.82.202007071437-0

Comment 51 errata-xmlrpc 2020-10-27 16:00:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

