Bug 1961666

Summary: In dracut allow enough time for DHCP allocation with dual stack, don't get stuck forever on missing IP family
Product: Red Hat Enterprise Linux 8
Reporter: vemporop
Component: NetworkManager
Assignee: Beniamino Galvani <bgalvani>
Status: CLOSED ERRATA
QA Contact: Filip Pokryvka <fpokryvk>
Severity: unspecified
Docs Contact:
Priority: urgent
Version: 8.2
CC: benoit, bgalvani, dustymabe, ferferna, fge, fpokryvk, jkonecny, keyoung, lrintel, mhrivnak, mko, rkhan, sfaye, sukulkar, till, vbenes
Target Milestone: beta
Keywords: Triaged
Target Release: ---
Flags: pm-rhel: mirror+
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: NetworkManager-1.36.0-0.2.el8
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
: 1990460
Environment:
Last Closed: 2022-05-10 14:54:08 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1954580    
Bug Blocks: 1990460    

Description vemporop 2021-05-18 13:03:30 UTC
Description of problem:

In dracut mode in a dual-stack environment, ignition download fails if the ignition URL points to an IPv4 address but DHCP (IPv4) allocation is slower than DHCP6 (IPv6). This happens because by default NetworkManager waits for whichever address family is assigned to an interface first and immediately moves on, even if the interface is supposed to have another address family as well. On the other hand, if I globally enforce a particular address family, e.g. `ip=dhcp6`, NetworkManager will wait forever for that address family to be assigned even to those interfaces that aren't supposed to have it.

We need a robust and generic way to tell NetworkManager to allow enough time for both stacks to be initialized, while at the same time not getting stuck forever.
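
To make the three behaviours concrete, here is a minimal Python sketch that models them. It is not NetworkManager code: the function names, the `NEVER` marker, and the assumption that the timeout is counted from the start of activation are all illustrative.

```python
import math
from typing import Dict, Set, Tuple

NEVER = math.inf  # hypothetical marker: the family is never configured

# family name -> seconds from activation start until it is configured
Arrival = Dict[str, float]


def _configured_by(arrival: Arrival, when: float) -> Set[str]:
    """Families that are actually configured at time `when`."""
    return {f for f, t in arrival.items() if t != NEVER and t <= when}


def first_wins(arrival: Arrival) -> Tuple[float, Set[str]]:
    """Proceed as soon as any family is configured (the reported default)."""
    ready_at = min(arrival.values())
    return ready_at, _configured_by(arrival, ready_at)


def require_family(arrival: Arrival, family: str) -> Tuple[float, Set[str]]:
    """Block until `family` is configured; NEVER means the boot hangs."""
    ready_at = arrival[family]
    return ready_at, _configured_by(arrival, ready_at)


def bounded_wait(arrival: Arrival, timeout: float) -> Tuple[float, Set[str]]:
    """Wait for all families, but give up on missing ones after `timeout`.

    Never proceeds before at least one family is configured.
    """
    first = min(arrival.values())
    ready_at = max(first, min(max(arrival.values()), timeout))
    return ready_at, _configured_by(arrival, ready_at)
```

With `{"ipv4": 8.0, "ipv6": 1.0}` (slow DHCPv4), `first_wins` proceeds at 1 s with IPv6 only, while `bounded_wait(..., 20.0)` proceeds at 8 s with both families. On an IPv6-only NIC, `require_family(..., "ipv4")` never returns a finite time, whereas `bounded_wait` proceeds after the timeout.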

This is especially important for automated tools like Assisted Installer where it's not always possible to know in advance which address families are being used in a setup.

How reproducible:
Somewhat tricky, happens in customer environments.

Steps to Reproduce:
1. DHCP (IPv4) is much slower than DHCP6 (IPv6), or vice versa.
2. Two NICs, one with both IPv4 and IPv6, the other with IPv6 only.
3. Boot with an ignition configuration that pulls an additional ignition file from a remote server over IPv4.

Actual results:
Ignition can't be downloaded because the machine never gets an IPv4 address. When booted with the kernel argument `ip=dhcp` to enforce an IPv4 address, the machine gets an IPv4 address from DHCP, but the boot process gets stuck waiting for an IPv4 address on the second NIC (IPv6-only).

Expected results:
There is an option to allow both stacks/address families to be initialized within a reasonable time period.

Additional info:
This is a follow-up ticket on https://bugzilla.redhat.com/show_bug.cgi?id=1931852

Comment 2 Michael Hrivnak 2021-06-08 18:58:33 UTC
I reported a similar issue that likely has the same root cause. Summary: a minimal ISO created by assisted-installer can't retrieve the rootfs from mirror.openshift.com (which is ipv4-only) if it gets a dhcp6 lease first. https://bugzilla.redhat.com/show_bug.cgi?id=1967632

Comment 3 Till Maas 2021-06-22 14:25:59 UTC
*** Bug 1928345 has been marked as a duplicate of this bug. ***

Comment 4 Dusty Mabe 2021-06-22 20:09:11 UTC
*** Bug 1967632 has been marked as a duplicate of this bug. ***

Comment 5 Dusty Mabe 2021-06-22 20:11:30 UTC
This upstream issue/discussion targets this same problem: https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/issues/729

Comment 6 Beniamino Galvani 2021-06-29 13:52:22 UTC
*** Bug 1947832 has been marked as a duplicate of this bug. ***

Comment 11 Filip Pokryvka 2021-07-22 17:31:18 UTC
I added a few dracut tests with delayed DHCP4 and 'ip=dhcp,dhcp6' kernel argument to NetworkManager-ci (including 2 NICs setup described above). PASS on latest RHEL8.5 build, FAIL on older builds.

Comment 15 Mat Kowalski 2021-10-11 09:11:28 UTC
I wonder if during fixing this bug the following scenario was kept in mind

* dual-stack system
* IPv4 is obtained from DHCP as the first one
* DHCPv6 is delayed
* ignition is served from the IPv6-only host

In this scenario we have IPv4 immediately, but we would need to wait for IPv6. Given that the fix mentions only "set required-timeout by default for IPv4 configuration", I wonder if the described scenario is still prone to fail. The test here would be to delay DHCPv6 and see if the machine waits for both IP addresses to be available.

Related BZ in the Assisted Installer project - https://bugzilla.redhat.com/show_bug.cgi?id=2005498

Comment 16 Gris Ge 2021-10-14 13:04:51 UTC
The test failure is not breaking anything, just not fixing enough stuff.
This is not important enough to block the 8.5 GA.

I am moving this bug to the verified state and will continue the fix in 8.6.

Comment 18 Gris Ge 2021-10-15 13:39:07 UTC
Test feedback indicates the bug is only partially fixed.

Change this bug to 8.6 and revoke the zstream approval.

Once the fix is confirmed by testing on a scratch build, we will review zstream for 8.5.0 and 8.4.0 again.

Comment 19 Dusty Mabe 2021-10-18 15:39:05 UTC
(In reply to Mat Kowalski from comment #15)
> I wonder if during fixing this bug the following scenario was kept in mind
> 
> * dual-stack system
> * IPv4 is obtained from DHCP as the first one
> * DHCPv6 is delayed
> * ignition is served from the IPv6-only host
> 
> In this scenario we have IPv4 immediately, but we would need to wait for
> IPv6. Given that the fix mentions only "set required-timeout by default for
> IPv4 configuration", I wonder if the described scenario is still prone to
> fail. The test here would be to delay DHCPv6 and see if the machine waits
> for both IP addresses to be available.
> 
> Related BZ in the Assisted Installer project -
> https://bugzilla.redhat.com/show_bug.cgi?id=2005498

Hey Mat. The current default of ip=dhcp,dhcp6 was set to try to make
sure that if someone had ipv4 or ipv6 networks the OS would still come
up without needing to be configured. Then we hit issues where the
first one would win and we massaged the behavior a bit to make it
match the legacy network dracut module a bit so that it would wait
for ipv4 a little longer. This was reasonable because it matched
the legacy behavior and was likely to match more environments (ipv4
being more common than ipv6).

Unfortunately for you, we decided that forcing an extra wait/timeout
for ipv6 wasn't reasonable to do in the default case since most
environments probably don't have an ipv6+DHCP6 setup and would be
waiting 20s for nothing. 

If you need ipv6, can you add `ip=dhcp6` to your setup?

Comment 20 Dusty Mabe 2021-10-20 20:19:21 UTC
Talked with Mat. A general workaround for the following use case:

- "I need ipv6 in my initramfs, but both ipv4 and ipv6 in my real root"

is to provide `ip=dhcp6 coreos.no_persist_ip` on the kernel command line. This will give you ipv6 in the initramfs and BOTH ipv4 and ipv6 in your real root (because it won't propagate initramfs networking forward and the default behavior is both ipv6 and ipv4).
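
For tooling that assembles the kernel command line, the workaround could be applied along these lines. This is only a sketch: the helper name is hypothetical, it assumes arguments are separated by whitespace with no quoted values, and it drops every existing `ip=` argument, which may not be what you want if per-interface settings are present.

```python
from typing import List

# The two arguments from the workaround described above.
WORKAROUND_ARGS = ["ip=dhcp6", "coreos.no_persist_ip"]


def apply_ipv6_initramfs_workaround(cmdline: str) -> str:
    """Return `cmdline` with existing ip= arguments replaced by the workaround.

    Assumption: arguments are whitespace-separated and contain no quoted spaces.
    """
    kept: List[str] = [
        arg
        for arg in cmdline.split()
        if not arg.startswith("ip=") and arg not in WORKAROUND_ARGS
    ]
    return " ".join(kept + WORKAROUND_ARGS)
```

The helper is idempotent, so running it on an already patched command line leaves it unchanged.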


In general though, we don't foresee changing the default behavior of RHCOS to add a 20s timeout for DHCPv6, even if the NetworkManager team changes what `ip=dhcp,dhcp6` means in https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/merge_requests/994. See https://github.com/coreos/fedora-coreos-tracker/issues/1000

Comment 21 Dusty Mabe 2021-10-20 20:22:09 UTC
Note that my comments in comment#19 and comment#20 are RHCOS specific and talk about defaults for RHCOS since that is where the original bug was reported and also where Mat is working.

Comment 22 Gris Ge 2021-11-23 08:36:03 UTC
Please correct me if I get it wrong:


The goal of this bug is to define how NetworkManager handles `ip=dhcp,dhcp6` and `ip=dhcp6,dhcp`:

 1. `ip=dhcp,dhcp6` and `ip=dhcp6,dhcp` generate identical results.
 2. Both DHCPv4 and IPv6 autoconf are enabled and run (DHCPv6 is started through IPv6 autoconf when the router advertisement (RA) indicates so).
 3. Only a single IP family is required to pass.
 4. Wait 20 seconds for the secondary IP family's DHCP/autoconf.


If users don't want to wait these extra 20 seconds, they can use:
 1. ip=dhcp : DHCPv4, DHCPv6, and autoconf are all enabled, but only DHCPv4 is required.
 2. ip=dhcp6: Only IPv6 autoconf/DHCPv6 is enabled. IPv4 is disabled.
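
The proposal above can be written down as a small lookup, which may help when reviewing it. This is a sketch of the semantics as stated in this comment, not of what NetworkManager's initrd generator actually emits; `IpPlan` and `plan_for` are hypothetical names.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class IpPlan:
    enabled: FrozenSet[str]    # address families that are attempted
    must_have: FrozenSet[str]  # families that must succeed
    any_one_suffices: bool     # True if either family alone is enough
    secondary_wait_s: int      # extra seconds to wait for the other family


BOTH = frozenset({"ipv4", "ipv6"})


def plan_for(ip_arg: str) -> IpPlan:
    """Map the value of ip= (e.g. "dhcp,dhcp6") to the proposed behaviour."""
    methods = frozenset(ip_arg.split(","))
    if methods == {"dhcp", "dhcp6"}:
        # Order does not matter; one family must pass, wait 20 s for the other.
        return IpPlan(BOTH, frozenset(), True, 20)
    if methods == {"dhcp"}:
        # IPv6 is still attempted, but only IPv4 is required.
        return IpPlan(BOTH, frozenset({"ipv4"}), False, 0)
    if methods == {"dhcp6"}:
        # IPv4 is disabled entirely.
        return IpPlan(frozenset({"ipv6"}), frozenset({"ipv6"}), False, 0)
    raise ValueError(f"ip={ip_arg} is not covered by this sketch")
```

Using a set for the parsed methods makes point 1 hold by construction: `plan_for("dhcp,dhcp6")` and `plan_for("dhcp6,dhcp")` return equal plans.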

Comment 29 Filip Pokryvka 2021-12-03 14:06:21 UTC
I have added the test cases with slow ipv6, so now we have covered:

* ip=dhcp,dhcp6, slow IPv4, nfsroot over IPv4
* ip=dhcp,dhcp6, slow IPv6, nfsroot over IPv6
* ip=dhcp,dhcp6, NIC1:IPv4 + IPv6, NIC2:IPv6, nfsroot over IPv4
* ip=dhcp,dhcp6, NIC1:slow IPv4 + IPv6, NIC2:IPv6, nfsroot over IPv4
* ip=dhcp,dhcp6, NIC1:IPv4 + slow IPv6, NIC2:IPv6, nfsroot over IPv6

Comment 30 Filip Pokryvka 2021-12-03 19:12:24 UTC
Hi Beniamino,

I have noticed the tests are passing on 1.36.0-0.1.el8 as well [1][2]. Does it have the fix or the tests do not cover the bug (after FailedQA)? 

Also, as you see, there are too many combinations to check (possibility might be also slowing down the IPv6 only NIC, or try only ip=dhcp with IPv6 only NIC, or ip=dhcp6 when IPv4 only nic is present...), but dracut tests are consuming a lot of resources now. Is it worth we add tests for some of these combinations? Which you consider the most important? Thank you! 

[1] https://desktopqe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/beaker-NetworkManager-gitlab-trigger-test-upstream/2074/artifact/artifacts/report_NetworkManager-ci_Test0013_dracut_NM_NFS_root_nfs_ip_dhcp_dhcp6_with_slow_ip64_and_ip6_nic.html
[2] https://desktopqe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/beaker-NetworkManager-gitlab-trigger-test-upstream/2074/artifact/artifacts/report_NetworkManager-ci_Test0010_dracut_NM_NFS_root_nfs_ip_dhcp_dhcp6_slow_ip6.html

Comment 31 Beniamino Galvani 2021-12-06 09:00:33 UTC
(In reply to Filip Pokryvka from comment #30)
> Hi Beniamino,
> 
> I have noticed the tests are passing on 1.36.0-0.1.el8 as well [1][2]. Does
> it have the fix or the tests do not cover the bug (after FailedQA)? 

The fix was added in 1.33.4, so 1.36.0-0.1.el8 already includes it.

> Which you consider the most important?

These two seem the most important to cover this bz:

* ip=dhcp,dhcp6, slow IPv4, nfsroot over IPv4
* ip=dhcp,dhcp6, slow IPv6, nfsroot over IPv6

The others are combinations of the two above with another NIC... I think they can be dropped to save resources.

> possibility might be also slowing down the IPv6 only NIC

This would test that NM signals completion only after all connections are done. I'm not sure we need it, as there are already other tests with multiple NICs, right? And so, a problem in this area would probably be caught by other tests...

> or try only ip=dhcp with IPv6 only NIC
> or ip=dhcp6 when IPv4 only nic

For these, we already test in NM unit tests that a correct connection is generated:

 - for ip=dhcp with IPv4 enabled/required and IPv6 enabled/not-required
 - for ip=dhcp6 with IPv6 enabled/required and IPv4 disabled

Once a correct connection is created, I don't expect we need to test anything else in integration tests.

Comment 34 errata-xmlrpc 2022-05-10 14:54:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (NetworkManager bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2022:1985