Created attachment 1758815 (logfiles)

Description of problem:
After installing a new OCP node with the assisted installer, the new node reboots. After the reboot, DHCPv4 fails silently, and ignition blocks for this reason.

Version-Release number of selected component (if applicable):

How reproducible:
In my bare metal environment 100%.

Steps to Reproduce:
1. Install OCP with the Assisted Installer

Actual results:
After the reboot ignition blocks the host.

Expected results:
NetworkManager configures the network interface using DHCPv4 and ignition can continue.

Additional info:
The workaround is to use static IP configuration in the kernel command line.
This happens for us (telco) as well, but not all the time. I think it is usually related to a slow DHCP response.
Hi Dominik,

Does changing the DHCP timeout to infinity help?

nmcli connection modify <connection_name> ipv4.dhcp-timeout infinity ipv6.dhcp-timeout infinity
Hi,

We are currently trying to install the cluster with a private ISO provided by the assisted installer team that may solve the issue. If it doesn't work as expected, we will try to modify the DHCP timeout as suggested.
In our case it's the opposite, i.e. IPv6 comes in fast, NM reaches the GLOBAL CONNECTIVITY state and exits before the DHCPv4 reply arrives, so we end up in a state with no IPv4 and then try to fetch the ignition over IPv4.

Even if the suggested workaround (infinite timeout) works, we'll still need it implemented in IPI.

I actually think that the installer can be enhanced so that if the cluster IPs are IPv4, NM should wait for IPv4, and if they are IPv6, wait for IPv6.
(In reply to Yuval Kashtan from comment #5)
> I actually think that the installer can be enhanced so that if cluster IPs
> are ipv4 - NM should wait for ipv4
> and if they are ipv6 - wait for IPv6

Yes, the problem currently is that the machine is booted with ip=dhcp,dhcp6. This is not valid syntax and generates a connection that waits only for the first of {IPv4,IPv6} that completes.

The ideal solution would be to use ip=dhcp in IPv4 environments and ip=dhcp6 in IPv6 environments.
(In reply to Beniamino Galvani from comment #6)
> (In reply to Yuval Kashtan from comment #5)
> > I actually think that the installer can be enhanced so that if cluster IPs
> > are ipv4 - NM should wait for ipv4
> > and if they are ipv6 - wait for IPv6
>
> Yes, the problem currently is that the machine is booted with ip=dhcp,dhcp6.
> This is not a valid syntax and generates a connection that waits only for
> the first of {IPv4,IPv6} that completes.
>
> The ideal solution would be to use ip=dhcp in IPv4 environments and ip=dhcp6
> in IPv6 environments.

Hi Beniamino,

Are you suggesting that `ip=dhcp,dhcp6` will wait for DHCPv4 and `ip=dhcp6,dhcp` will wait for DHCPv6?
(In reply to Gris Ge from comment #7)
> Are you suggesting that `ip=dhcp,dhcp6` will wait DHCPv4. `ip=dhcp6,dhcp`
> will wait DHCPv6?

No; "ip=" accepts only one method, and therefore both "ip=dhcp,dhcp6" and "ip=dhcp6,dhcp" (as well as "ip=foobar") are invalid syntax. They all generate a connection with default values, i.e. one that does both IPv4 and IPv6 automatic configuration and that waits for the address family that finishes first.
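To make the behaviour described in this comment concrete, here is a minimal Python sketch. It is only a model of what the comment says, not NetworkManager's actual parsing code; the function name `waited_family` and the return labels are hypothetical, and valid `ip=` methods other than `dhcp` and `dhcp6` are deliberately not modelled.

```python
def waited_family(ip_value: str) -> str:
    """Return which address family the initrd connection waits for.

    Hypothetical model of the semantics described in this comment:
    only the values discussed in this bug are covered.
    """
    if ip_value == "dhcp":
        return "ipv4"
    if ip_value == "dhcp6":
        return "ipv6"
    # "dhcp,dhcp6", "dhcp6,dhcp" and "foobar" are invalid syntax, so the
    # connection gets default values: both families are configured and
    # whichever finishes first is enough.
    return "first-to-finish"


for value in ("dhcp", "dhcp6", "dhcp,dhcp6", "dhcp6,dhcp", "foobar"):
    print(f"ip={value} -> {waited_family(value)}")
```

Note that the order of the two methods makes no difference in this model, which is the point of the answer above.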
(In reply to Beniamino Galvani from comment #8)
> (In reply to Gris Ge from comment #7)
> > Are you suggesting that `ip=dhcp,dhcp6` will wait DHCPv4. `ip=dhcp6,dhcp`
> > will wait DHCPv6?
>
> No; "ip=" accepts only one method and therefore both "ip=dhcp,dhcp6" and
> "ip=dhcp6,dhcp" (as well as "ip=foobar") are an invalid syntax. They all
> generate a connection with default values, i.e. that does both IPv4 and IPv6
> automatic configuration and that waits the address family that finishes
> first.

Then, please provide a solution for the use case in this bug and confirm whether it is doable in RHEL 8.5.
> Then, please provide a solution for the use case in this bug

As mentioned in comment 6, the solution is to use "ip=dhcp" when ignition needs to wait for DHCPv4 and "ip=dhcp6" when it needs IPv6.

> and confirm whether it is doable in RHEL 8.5.

No change is required in NM with this solution.
Hi Dominik,

It seems the change needs to be made on the Ignition side. Can you try it based on the comment above?
(In reply to Gris Ge from comment #11)
> Hi Dominik,
>
> It seems the change is required to be done at Ignition part.
> Can you try it base on above comment?

Miguel, do you think there is a way, using our new automation, to check whether the kernel parameter ip=dhcp works instead of the static IP?
Passing ip=dhcp seems to work.
Hi Dominik,

Is there any additional work required from the NM side? If not, can we close this as not a bug?
(In reply to Gris Ge from comment #14)
> Hi Dominik,
>
> Is there any additional work required from NM side?

It looks like the layered product did not know that NetworkManager requires the kernel command line parameter, so let's discuss this with the layered product.

> If not, can we close this as not a bug?

The bug is still present in the OpenShift Assisted Installer, so we have to decide which component has to adapt.
@dholler do you want to try the suggested solution with Assisted Installer, and if it works we'll automate the selection of ip=dhcp vs ip=dhcp6 depending on the user's selected machine network?
(In reply to vemporop from comment #16)
> @dholler do you want to try the suggested solution with Assisted
> Installer, and if it works we'll automate the selection of ip=dhcp vs
> ip=dhcp6 depending on the user's selected machine network?

Yes, I am afraid that this is what I understood is required by NetworkManager.
In the assisted installer, we need to set the right DHCP value in the kargs (ip=dhcp or ip=dhcp6) depending on which IP stack will be used to download ignition. We also need to make sure that other network operations at the reboot stage do not use a different stack (in a dual-stack setup), because otherwise they in turn will fail for lack of an allocated address.
@bgalvani we need this only for the initramfs stage, because NetworkManager doesn't seem to have this problem when running as a service. But how does passing ip=dhcp or ip=dhcp6 to the installer affect an installed system? For instance, when used in combination with --copy-network.
1. We see the same issue with openshift-baremetal-install (IPI).
2. An MC with a kernel arg might not be enough, because you need the network to get that ignition.
3. Hence I think the installer should be enhanced to add these according to the API IP (if it's v4, add ip=dhcp; if it's v6, add ip=dhcp6).

WDYT?
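The selection proposed in point 3 could look roughly like the following Python sketch. It assumes the installer knows the API IP as a string; the function name `dhcp_karg_for` is hypothetical and is not part of any installer code.

```python
import ipaddress


def dhcp_karg_for(api_ip: str) -> str:
    """Pick the kernel argument from the address family of the API IP.

    Hypothetical helper: raises ValueError if api_ip is not a valid
    IPv4 or IPv6 address.
    """
    version = ipaddress.ip_address(api_ip).version
    return "ip=dhcp" if version == 4 else "ip=dhcp6"


# Documentation-only example addresses (RFC 5737 / RFC 3849 ranges).
print(dhcp_karg_for("192.0.2.10"))
print(dhcp_karg_for("2001:db8::10"))
```

This only covers single-stack selection; it does not address the dual-stack case raised elsewhere in this bug.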
@bgalvani unfortunately, the suggested fix leads to undesired behavior where booting is stuck forever waiting for an IPv4 (ip=dhcp) or IPv6 (ip=dhcp6) address on a NIC we don't care about. Is there a timeout parameter? For now it seems we'll have to implement more complex logic that cherry-picks the interfaces for which we want a particular address family. @alazar FYI
The final /proc/cmdline is

BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-7ac2827aaf1f8821ff4f20932ef8702cdf349bc1b028d9bd818c5b8cfad05821/vmlinuz-4.18.0-240.22.1.el8_3.x86_64 random.trust_cpu=on console=tty0 console=ttyS0,115200n8 ignition.platform.id=metal ostree=/ostree/boot.0/rhcos/7ac2827aaf1f8821ff4f20932ef8702cdf349bc1b028d9bd818c5b8cfad05821/0 ip=dhcp root=UUID=4c37b511-7cfd-44f1-a466-72e11708a8bd rw rootflags=prjquota

and the installation succeeded smoothly on the previously affected host.
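For anyone who wants to check a host the same way, here is a small Python sketch that extracts the `ip=` arguments from a kernel command line string. The helper name `ip_kargs` is hypothetical, and the sample string is a shortened version of the command line quoted above.

```python
def ip_kargs(cmdline: str) -> list[str]:
    """Return the values of all ip= arguments on a kernel command line."""
    return [
        token.split("=", 1)[1]
        for token in cmdline.split()
        if token.startswith("ip=")
    ]


# Shortened sample; on a live host the string would come from /proc/cmdline.
sample = "console=tty0 ignition.platform.id=metal ip=dhcp rw rootflags=prjquota"
print(ip_kargs(sample))
```

Arguments such as `ignition.platform.id=metal` are not matched, because only tokens that start with `ip=` are considered.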
@dholler what is the kernel param that fixes this?
@ronnie l
> what is the kernel param that fixes this?

IIRC it's the `ip=dhcp` part, but I'd still consider that a workaround/hack, as it won't solve the problem for dual-stack environments.
Yuval is right, it's `ip=dhcp`. A proper solution should be on the NetworkManager side IMO.
@vemporop, @yshnaidm can't we handle this using ignition overrides in the pointer ignition?
(In reply to vemporop from comment #26)
> Yuval is right, it's `ip=dhcp`. A proper solution should be on the
> NetworkManager side IMO.

What logic would a proper solution on the NetworkManager side use to decide which IP configuration method to use on which interfaces?

(In reply to Yuval Kashtan from comment #25)
> @ronnie l
> > what is the kernel param that fixes this?
> IIRC it's the `ip=dhcp` part
> but I'd still consider that a workaround/hack as it wont solve the problem
> for dual-stack environments

Do you need a way to make NM block on both DHCPv4 and DHCPv6?
@alazar no, we can't, which is why we need the kernel argument solution. This problem prevents ignition from working properly because the NICs don't get a chance to obtain the right IP addresses that would allow downloading ignition over the machine network.

@till IMO an option to wait for both IPv4 and IPv6 with a timeout would be a good solution. Moving on once either family is allocated, without waiting for the other family, is not good. Waiting forever on all NICs when a particular family is requested (e.g. ip=dhcp) isn't good either.
> Do you need a way to make NM block on both DHCPv4 and DHCPv6?

Yes, maybe something like `ip=dhcp,dhcp6`, so:
if nothing is specified, NM will wait for either (as it does today)
`ip=dhcp` - wait for IPv4
`ip=dhcp6` - wait for IPv6
`ip=dhcp,dhcp6` - wait for both (indefinitely? maybe allow setting a timeout with an additional param)
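The semantics proposed in this comment can be sketched in Python as follows. This models the proposal only; it is not existing NetworkManager behaviour (today `ip=dhcp,dhcp6` is invalid syntax), the name `is_ready` is hypothetical, and the suggested timeout is not modelled.

```python
from typing import Optional

# Families that must be configured before boot continues, per proposed value.
PROPOSED_REQUIREMENTS = {
    "dhcp": {"ipv4"},
    "dhcp6": {"ipv6"},
    "dhcp,dhcp6": {"ipv4", "ipv6"},
}


def is_ready(ip_value: Optional[str], done: set[str]) -> bool:
    """Decide whether waiting may stop, given the families configured so far.

    ip_value is None when no ip= argument is given; done is a subset of
    {"ipv4", "ipv6"}.
    """
    if ip_value is None:
        # Nothing specified: either family is enough (as it works today).
        return bool(done & {"ipv4", "ipv6"})
    return PROPOSED_REQUIREMENTS[ip_value] <= done


# IPv6 finished first, DHCPv4 reply still outstanding.
print(is_ready(None, {"ipv6"}))
print(is_ready("dhcp,dhcp6", {"ipv6"}))
```

With only IPv6 configured, the no-argument case would proceed while the proposed `ip=dhcp,dhcp6` case would keep waiting, which is the difference this comment is asking for.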
@yobshans you've changed the status to FailedQA. Could you please provide more details? According to https://bugzilla.redhat.com/show_bug.cgi?id=1931852#c23 the original issue was fixed.
@vemporop, actually I do not understand the status of the current issue. Does it depend on other bugs? Or should the flow be changed? Let's change it back to ON-QA, and @dholler can mark it as verified.
I mark the bug as verified because this bug is about getting a single interface working in a dynamic (DHCP) dual-stack environment, which works now. We can open new bugs to address multiple interfaces.
(In reply to vemporop from comment #29)
> @till IMO an option to wait for both IPv4 and IPv6 with a timeout
> would be a good solution. Moving on once either family is allocated, without
> waiting for the other family, is not good. Waiting forever on all NICs when
> a particular family is requested (e.g. ip=dhcp) isn't good either.

Since these are two new use cases, please file new BZs for them so we can discuss them there. Thank you.
*** Bug 1940011 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days