Bug 1931852 - Ignition HTTP GET is failing, because DHCP IPv4 config is failing silently
Summary: Ignition HTTP GET is failing, because DHCP IPv4 config is failing silently
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: assisted-installer
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: vemporop
QA Contact: Yuri Obshansky
URL:
Whiteboard: AI-Team-Core
Duplicates: 1940011
Depends On:
Blocks: 1940011 1940454
Reported: 2021-02-23 12:03 UTC by Dominik Holler
Modified: 2023-09-18 00:24 UTC (History)

Fixed In Version: OCP-Metal-v1.0.19.1
Doc Type: Bug Fix
Doc Text:
Cause: In dracut mode, NetworkManager waits for a first address received from DHCP for an interface. Therefore, in environments where DHCP allocates both IPv4 and IPv6 addresses, and one of the address families takes more time to be allocated, an interface may end up not having an address of that family.
Consequence: An interface quickly receives IPv6 addresses and the system proceeds to booting, never receiving IPv4 addresses. In this case, if the machine network is IPv4, a node will never be able to download ignition from the bootstrap node. And vice versa.
Fix: If the user does not provide a custom networking configuration for installation, force NetworkManager to wait for a specific address family according to the machine network CIDR before proceeding to ignition download.
Result: Network interfaces are initialized with addresses allocated by DHCP in the address family (IPv4 or IPv6) that is required for successful communication between nodes. Ignition can be successfully downloaded, and installation completed.
Clone Of:
Environment:
Last Closed: 2021-07-27 22:47:39 UTC
Target Upstream Version:
Embargoed:


Attachments
logfiles (32.14 KB, application/x-xz)
2021-02-23 12:03 UTC, Dominik Holler


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 22:48:02 UTC

Internal Links: 1917773

Description Dominik Holler 2021-02-23 12:03:39 UTC
Created attachment 1758815 [details]
logfiles

Description of problem: After installing a new OCP node with the assisted installer, the new node reboots. After the reboot, DHCPv4 is failing silently, and ignition blocks for this reason.


Version-Release number of selected component (if applicable):


How reproducible: In my bare metal environment 100%.


Steps to Reproduce:
1. Install OCP with the Assisted Installer

Actual results: After the reboot ignition blocks the host.

Expected results: NetworkManager configures the network interface using DHCPv4 and ignition can continue.


Additional info: The workaround is to use static IP configuration in kernel command line.

Comment 1 Yuval Kashtan 2021-02-24 11:06:46 UTC
this happens for us (telco) as well.
but not all the time.

I think that usually it's related to slow DHCP response

Comment 3 Gris Ge 2021-03-03 06:57:37 UTC
Hi Dominik,

Does changing the DHCP timeout to infinity help?

nmcli connection modify <connection_name> ipv4.dhcp-timeout infinity ipv6.dhcp-timeout infinity

Comment 4 Bella Khizgiyaev 2021-03-03 09:34:41 UTC
Hi,

We are currently trying to install the cluster with a private ISO provided by the assisted installer team that may solve the issue;
if it doesn't work as expected, we will try modifying the DHCP timeout as suggested.

Comment 5 Yuval Kashtan 2021-03-09 20:35:45 UTC
In our case, it's the opposite:
i.e. the IPv6 addresses come in fast,
NM reaches the GLOBAL CONNECTIVITY state and exits
before the DHCPv4 reply arrives,
and we end up in a state with no IPv4, then try to fetch the ignition over IPv4.

even if that suggested workaround (infinite timeout) works, we'll still need it implemented in ipi.

I actually think that the installer can be enhanced so that if cluster IPs are ipv4 - NM should wait for ipv4
and if they are IPv6 - wait for IPv6

Comment 6 Beniamino Galvani 2021-03-10 08:33:33 UTC
(In reply to Yuval Kashtan from comment #5)
> I actually think that the installer can be enhanced so that if cluster IPs
> are ipv4 - NM should wait for ipv4
> and if they are IPv6 - wait for IPv6

Yes, the problem currently is that the machine is booted with ip=dhcp,dhcp6. This is not a valid syntax and generates a connection that waits only for the first of {IPv4,IPv6} that completes.

The ideal solution would be to use ip=dhcp in IPv4 environments and ip=dhcp6 in IPv6 environments.

Comment 7 Gris Ge 2021-03-22 05:54:19 UTC
(In reply to Beniamino Galvani from comment #6)
> (In reply to Yuval Kashtan from comment #5)
> > I actually think that the installer can be enhanced so that if cluster IPs
> > are ipv4 - NM should wait for ipv4
> > and if they are IPv6 - wait for IPv6
> 
> Yes, the problem currently is that the machine is booted with ip=dhcp,dhcp6.
> This is not a valid syntax and generates a connection that waits only for
> the first of {IPv4,IPv6} that completes.
> 
> The ideal solution would be to use ip=dhcp in IPv4 environments and ip=dhcp6
> in IPv6 environments.

Hi Beniamino,

Are you suggesting that `ip=dhcp,dhcp6` will wait for DHCPv4 and `ip=dhcp6,dhcp` will wait for DHCPv6?

Comment 8 Beniamino Galvani 2021-03-22 08:41:58 UTC
(In reply to Gris Ge from comment #7)
> Are you suggesting that `ip=dhcp,dhcp6` will wait for DHCPv4 and
> `ip=dhcp6,dhcp` will wait for DHCPv6?

No; "ip=" accepts only one method and therefore both "ip=dhcp,dhcp6" and "ip=dhcp6,dhcp" (as well as "ip=foobar") are invalid syntax. They all generate a connection with default values, i.e. one that does both IPv4 and IPv6 automatic configuration and that waits for whichever address family finishes first.
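
For illustration only, the behavior described in this comment can be modeled as a small lookup. This is a toy sketch in Python, not NetworkManager or dracut code: the function name `wait_policy` and the returned labels are hypothetical, and it covers only the values discussed in this bug (`dhcp`, `dhcp6`, and anything else treated as invalid), ignoring the other forms that `ip=` accepts.

```python
def wait_policy(ip_value: str) -> str:
    """Model which address family the boot waits for, per comment 8.

    Hypothetical helper: only "dhcp" and "dhcp6" are treated as valid here.
    Any other value (for example "dhcp,dhcp6" or "foobar") is modeled as
    falling back to the default connection, which proceeds as soon as
    either address family has finished.
    """
    single_method = {
        "dhcp": "ipv4",   # wait for DHCPv4
        "dhcp6": "ipv6",  # wait for DHCPv6
    }
    return single_method.get(ip_value, "first-to-complete")


if __name__ == "__main__":
    for value in ("dhcp", "dhcp6", "dhcp,dhcp6", "dhcp6,dhcp", "foobar"):
        print(f"ip={value}: waits for {wait_policy(value)}")
```

The point of the model is that the order of the methods in "ip=dhcp,dhcp6" carries no meaning: both orderings land in the same fallback case.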

Comment 9 Gris Ge 2021-03-23 04:10:46 UTC
(In reply to Beniamino Galvani from comment #8)
> (In reply to Gris Ge from comment #7)
> > Are you suggesting that `ip=dhcp,dhcp6` will wait for DHCPv4 and
> > `ip=dhcp6,dhcp` will wait for DHCPv6?
> 
> No; "ip=" accepts only one method and therefore both "ip=dhcp,dhcp6" and
> "ip=dhcp6,dhcp" (as well as "ip=foobar") are invalid syntax. They all
> generate a connection with default values, i.e. one that does both IPv4 and
> IPv6 automatic configuration and that waits for whichever address family
> finishes first.

Then, please provide a solution for the use case in this bug and confirm whether
it is doable in RHEL 8.5.

Comment 10 Beniamino Galvani 2021-03-23 07:07:06 UTC
> Then, please provide a solution for the use case in this bug 

As mentioned in comment 6 the solution is to use "ip=dhcp" when ignition needs to wait for DHCPv4 and "ip=dhcp6" when it needs IPv6.

> and confirm whether it is doable in RHEL 8.5.

No change is required in NM with this solution.

Comment 11 Gris Ge 2021-03-23 07:42:17 UTC
Hi Dominik,

It seems the change needs to be made on the Ignition side.
Can you try it based on the above comment?

Comment 12 Dominik Holler 2021-03-23 09:23:22 UTC
(In reply to Gris Ge from comment #11)
> Hi Dominik,
> 
> It seems the change needs to be made on the Ignition side.
> Can you try it based on the above comment?

Miguel, do you think there is a way using our new automation to check if the kernel parameter ip=dhcp instead of the static IP works?

Comment 13 Miguel Martin 2021-03-23 11:37:45 UTC
Passing ip=dhcp seems to work.

Comment 14 Gris Ge 2021-03-23 12:31:26 UTC
Hi Dominik,

Is there any additional work required from NM side?
If not, can we close this as not a bug?

Comment 15 Dominik Holler 2021-03-23 13:51:57 UTC
(In reply to Gris Ge from comment #14)
> Hi Dominik,
> 
> Is there any additional work required from NM side?

It looks like the layered product was not aware that NetworkManager requires the kernel command line parameter, so let's discuss this with the layered product.

> If not, can we close this as not a bug?

The bug is still present in the OpenShift Assisted Installer, so we have to decide which component has to adapt.

Comment 16 vemporop 2021-03-24 15:42:20 UTC
@dholler do you want to try the suggested solution with Assisted Installer, and if it works we'll automate the selection of ip=dhcp vs ip=dhcp6 depending on the user's selected machine network?

Comment 17 Dominik Holler 2021-03-24 15:44:24 UTC
(In reply to vemporop from comment #16)
> @dholler do you want to try the suggested solution with Assisted
> Installer, and if it works we'll automate the selection of ip=dhcp vs
> ip=dhcp6 depending on the user's selected machine network?

Yes, I am afraid that this is what I understood is required by NetworkManager.

Comment 18 vemporop 2021-03-24 16:19:15 UTC
In the assisted installer, we need to set the right DHCP value in the kargs (ip=dhcp or ip=dhcp6) depending on which IP stack will be used to download ignition.
Also, we need to make sure that other operations over the network at the reboot stage do not use a different stack (in a dual-stack setup), because otherwise they in turn will fail because of address allocation.
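
A minimal sketch of the selection described here, assuming the machine network is available as a CIDR string. The function name `dhcp_karg_for_machine_network` and the example CIDRs are hypothetical and are not taken from the assisted installer code; only Python's standard `ipaddress` module is used.

```python
import ipaddress


def dhcp_karg_for_machine_network(machine_cidr: str) -> str:
    """Pick the kernel argument matching the machine network's address family.

    Hypothetical helper: returns "ip=dhcp" for an IPv4 CIDR and "ip=dhcp6"
    for an IPv6 CIDR. Raises ValueError if the CIDR cannot be parsed.
    """
    # strict=False tolerates host bits being set, e.g. "192.168.1.10/24".
    network = ipaddress.ip_network(machine_cidr, strict=False)
    return "ip=dhcp" if network.version == 4 else "ip=dhcp6"


if __name__ == "__main__":
    print(dhcp_karg_for_machine_network("192.168.111.0/24"))
    print(dhcp_karg_for_machine_network("fd2e:6f44:5dd8::/64"))
```

Note that this only chooses the address family; it does not address the later concern about booting getting stuck on interfaces that never receive an address of that family.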

Comment 19 vemporop 2021-04-08 16:03:07 UTC
@bgalvani we need this only for the initramfs stage, because NetworkManager doesn't seem to have this problem when running as a service. But how does passing ip=dhcp or ip=dhcp6 to the installer affect an installed system? For instance, when used in combination with --copy-network.

Comment 20 Yuval Kashtan 2021-04-26 06:37:09 UTC
1. We see the same issue with openshift-baremetal-install (IPI).
2. an MC with kernel arg might not be enough, because, well you need network to get that ignition.
3. hence I think installer should be enhanced to add these according to api ip (if it's v4, add ip=dhcp, if it's v6, add ip=dhcp6)
WDYT?

Comment 22 vemporop 2021-05-06 07:16:10 UTC
@bgalvani unfortunately, the suggested fix leads to undesired behavior where booting is stuck forever waiting for an IPv4 (ip=dhcp) or IPv6 (ip=dhcp6) address on a NIC we don't care about. Is there a timeout parameter? For now it seems we'll have to implement more complex logic that cherry-picks the interfaces for which we want a particular address family.

@alazar FYI

Comment 23 Dominik Holler 2021-05-06 08:06:11 UTC
The final /proc/cmdline is
BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-7ac2827aaf1f8821ff4f20932ef8702cdf349bc1b028d9bd818c5b8cfad05821/vmlinuz-4.18.0-240.22.1.el8_3.x86_64 random.trust_cpu=on console=tty0 console=ttyS0,115200n8 ignition.platform.id=metal ostree=/ostree/boot.0/rhcos/7ac2827aaf1f8821ff4f20932ef8702cdf349bc1b028d9bd818c5b8cfad05821/0 ip=dhcp root=UUID=4c37b511-7cfd-44f1-a466-72e11708a8bd rw rootflags=prjquota

and the installation succeeded smoothly on the previously affected host.
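
To check for the argument on a booted host, a short Python sketch along these lines could be used. The helper name `ip_kargs` is hypothetical, the sample string is an abridged version of the command line above, and the whitespace split is a simplification that does not handle quoted arguments containing spaces.

```python
from typing import List


def ip_kargs(cmdline: str) -> List[str]:
    """Return the values of all ip= arguments found on a kernel command line."""
    prefix = "ip="
    return [tok[len(prefix):] for tok in cmdline.split() if tok.startswith(prefix)]


# Abridged sample based on the command line quoted in this comment.
SAMPLE_CMDLINE = (
    "random.trust_cpu=on console=tty0 console=ttyS0,115200n8 "
    "ignition.platform.id=metal ip=dhcp rw rootflags=prjquota"
)

if __name__ == "__main__":
    print(ip_kargs(SAMPLE_CMDLINE))
```

On a live system the same function can be applied to the contents of /proc/cmdline, for example `ip_kargs(open("/proc/cmdline").read())`.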

Comment 24 Ronnie Lazar 2021-05-06 09:45:50 UTC
@dholler what is the kernel param that fixes this?

Comment 25 Yuval Kashtan 2021-05-06 09:57:53 UTC
@ronnie l
> what is the kernel param that fixes this?
IIRC it's the `ip=dhcp` part
but I'd still consider that a workaround/hack as it won't solve the problem for dual-stack environments

Comment 26 vemporop 2021-05-06 10:01:29 UTC
Yuval is right, it's `ip=dhcp`. A proper solution should be on the NetworkManager side IMO.

Comment 27 Ronnie Lazar 2021-05-06 10:13:32 UTC
@vemporop , @yshnaidm can't we handle this using ignition overrides? in the pointer ignition?

Comment 28 Till Maas 2021-05-06 10:48:12 UTC
(In reply to vemporop from comment #26)
> Yuval is right, it's `ip=dhcp`. A proper solution should be on the
> NetworkManager side IMO.

What logic would a proper solution on the NetworkManager side use to decide which IP configuration method to use on which interfaces?

(In reply to Yuval Kashtan from comment #25)
> @ronnie l
> > what is the kernel param that fixes this?
> IIRC it's the `ip=dhcp` part
> but I'd still consider that a workaround/hack as it won't solve the problem
> for dual-stack environments

Do you need a way to make NM block on both DHCPv4 and DHCPv6?

Comment 29 vemporop 2021-05-06 11:08:48 UTC
@alazar no, we can't, which is why we went with the kernel argument solution. This problem prevents ignition from working properly because the NICs don't get a chance to obtain the right IP addresses that would allow ignition to be downloaded over the machine network.

@till IMO an option to wait for both IPv4 and IPv6 with a timeout would be a good solution. Moving on once either family is allocated, without waiting for the other family, is not good. Waiting forever on all NICs when a particular family is requested (e.g. ip=dhcp) isn't good either.

Comment 30 Yuval Kashtan 2021-05-06 11:31:20 UTC
> Do you need a way to make NM block on both DHCPv4 and DHCPv6?
yes

maybe something like
`ip=dhcp,dhcp6`

so if nothing specified, NM will wait for either (as it does today)
`ip=dhcp` - wait for ipv4
`ip=dhcp6` - wait for ipv6
`ip=dhcp,dhcp6` - wait for both (indefinitely? maybe allow to set a timeout with an additional param)
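
The proposal above can be written down as a small decision function. This is a hypothetical sketch of the proposed semantics only, not of anything NetworkManager implements: the function name `may_proceed`, its parameters, and the timeout handling are assumptions made for illustration.

```python
from typing import Optional


def may_proceed(ip_value: Optional[str], have_v4: bool, have_v6: bool,
                elapsed: float = 0.0, timeout: Optional[float] = None) -> bool:
    """Decide whether boot may continue under the semantics proposed above.

    ip_value is None when no ip= argument was given. A timeout of None
    means "wait indefinitely"; otherwise boot continues once it expires.
    """
    if timeout is not None and elapsed >= timeout:
        return True  # assumed behavior: give up waiting after the timeout
    if ip_value is None:
        return have_v4 or have_v6   # wait for either family
    if ip_value == "dhcp":
        return have_v4              # wait for IPv4
    if ip_value == "dhcp6":
        return have_v6              # wait for IPv6
    if ip_value == "dhcp,dhcp6":
        return have_v4 and have_v6  # proposed: wait for both
    raise ValueError(f"unsupported ip= value in this sketch: {ip_value!r}")


if __name__ == "__main__":
    # IPv6 arrived first; an IPv4 machine network would still be waiting.
    print(may_proceed("dhcp", have_v4=False, have_v6=True))
```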

Comment 31 vemporop 2021-05-06 11:58:10 UTC
@yobshans you've changed to FailedQA. Could you please provide more details, because according to https://bugzilla.redhat.com/show_bug.cgi?id=1931852#c23 the original issue was fixed.

Comment 32 Yuri Obshansky 2021-05-06 12:09:25 UTC
@vemporop,
Actually, I do not understand the status of the current issue.
Does it depend on other bugs? Or should the flow be changed?
Let's change it back to ON-QA, and @dholler can mark it as verified.

Comment 33 Dominik Holler 2021-05-06 13:14:43 UTC
I mark the bug as verified because this bug is about getting a single interface working in a dynamic (DHCP) dual-stack environment, which works now.
We can open new bugs to address multiple interfaces.

Comment 34 Till Maas 2021-05-11 13:52:02 UTC
(In reply to vemporop from comment #29)

> @till IMO an option to wait for both IPv4 and IPv6 with a timeout
> would be a good solution. Moving on once either family is allocated, without
> waiting for the other family, is not good. Waiting forever on all NICs when
> a particular family is requested (e.g. ip=dhcp) isn't good either.

Since these are two new use cases, please file new BZs for them, so we can discuss them there. Thank you.

Comment 35 Federico Paolinelli 2021-05-11 16:03:18 UTC
*** Bug 1940011 has been marked as a duplicate of this bug. ***

Comment 38 errata-xmlrpc 2021-07-27 22:47:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

Comment 39 Red Hat Bugzilla 2023-09-18 00:24:55 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

