Bug 1791624 - [RFE] support preserving the interface when DHCP timeout
Summary: [RFE] support preserving the interface when DHCP timeout
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: NetworkManager
Version: 9.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: beta
Target Release: ---
Assignee: NetworkManager Development Team
QA Contact: Desktop QE
URL:
Whiteboard:
Duplicates: 1791372 (view as bug list)
Depends On:
Blocks:
 
Reported: 2020-01-16 09:46 UTC by Dominik Holler
Modified: 2022-02-01 07:27 UTC (History)
CC: 12 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Feature: Support preserving the interface when DHCP times out. Reason: If any aspect of the configuration of a bridge fails, e.g. a timeout on a DHCP renewal or DHCP not succeeding on the first attempt, NetworkManager removes the bridge, even if the bridge is still in use, e.g. by a VM. Result:
Clone Of:
Environment:
Last Closed: 2022-02-01 07:27:24 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
terminal log (234.60 KB, text/plain)
2020-01-16 09:46 UTC, Dominik Holler
simple tap example (67.67 KB, text/plain)
2020-09-02 06:42 UTC, Dominik Holler


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1818697 0 high CLOSED Vlan over bond is not active after first boot 2022-12-22 22:01:47 UTC

Internal Links: 1818697

Description Dominik Holler 2020-01-16 09:46:24 UTC
Created attachment 1652695 [details]
terminal log

Description of problem:
If any aspect of the configuration of a bridge fails, e.g. a timeout on a DHCP renewal or DHCP not succeeding on the first attempt, NetworkManager removes the bridge, even if the bridge is still in use, e.g. by a VM.

Version-Release number of selected component (if applicable):
NetworkManager-1.20.0-3.el8.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create a bridge
2. Connect a VM
3. Let something on the bridge fail, e.g. dynamic IP address

Actual results:
Bridge is removed; users of the bridge lose network connectivity

Expected results:
Bridge is not removed; users of the bridge do not lose network connectivity

Comment 1 Dominik Holler 2020-01-16 11:24:45 UTC
Expected results:
Bridge is not removed; users of the bridge do NOT lose network connectivity.

Comment 2 Thomas Haller 2020-08-31 16:00:55 UTC
> nmcli con add ifname br0 type bridge con-name br0

this command creates a profile with DHCP and autoconf6 enabled (and ipv4.may-fail=yes, ipv6.may-fail=yes).

In the current form ipv4.may-fail=yes, ipv6.may-fail=yes means that at least one of the address families must succeed. In the log you see that both fail, and consequently the profile goes down (and the bridge gets removed).

This is similar to other issues that we discussed (bug 1801158, bug 1791378), isn't it?

The solution here is to set ipv4.dhcp-timeout and ipv6.ra-timeout to infinity -- as that is apparently what you want.
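The suggestion above can be sketched as a couple of nmcli invocations. This is a hedged sketch, not an exact transcript from the bug: it assumes a live NetworkManager system and a bridge connection profile named "br0" (the profile name is an assumption; adjust it to your setup).

```shell
# Keep waiting indefinitely for DHCPv4 / IPv6 router advertisements
# instead of failing the profile (and tearing down the bridge):
nmcli connection modify br0 ipv4.dhcp-timeout infinity
nmcli connection modify br0 ipv6.ra-timeout infinity

# Re-activate the profile so the new timeouts take effect:
nmcli connection up br0
```

With these settings the profile stays in the activating/connected state while address configuration keeps retrying, so the bridge interface is not removed on a DHCP timeout.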

`nmcli con add` does not do that by default, because such behavior does not seem a desirable default. Granted, it would be desirable in your case, but the problem with defaults is that they never cover every possible use case optimally. The solution is to configure the profile in a manner that suits you -- especially if the default behavior is not what you need.

Does that work for you? Why not?




Long term, we want more flexible ways to enable individual address configuration mechanisms. For example, currently:

  - you cannot enable DHCPv4 and ipv4ll together (which would be very useful);

  - you cannot say that ipv4ll is allowed to fail while DHCPv4 is not (you can only set ipv4.dhcp-timeout/ipv6.ra-timeout to infinity, which is a subset of the things that should be possible);

  - you also cannot configure that certain addressing methods are mandatory/required, optional (at least one must succeed), or entirely optional.

Currently, you can only approximate certain combinations by configuring ipv4.method, ipv6.method, ipv4.may-fail, ipv6.may-fail, ipv4.dhcp-timeout, and ipv6.ra-timeout.

But such an entirely optional DHCP/autoconf6 configuration probably still would not be created by `nmcli con add`. Even with that feature, you would still have to opt in to that behavior during `nmcli con add`.

Comment 3 Dominik Holler 2020-09-01 08:45:03 UTC
(In reply to Thomas Haller from comment #2)
> > nmcli con add ifname br0 type bridge con-name br0
> 
> this command creates a profile with DHCP and autoconf6 enabled (and
> ipv4.may-fail=yes, ipv6.may-fail=yes).
> 
> In the current form ipv4.may-fail=yes, ipv6.may-fail=yes means that at least
> one of the address families must succeed. In the log you see that both fail,
> and consequently the profile goes down (and the bridge gets removed).
> 
> This is similar to other issues that we discussed (bug 1801158, bug
> 1791378), isn't it?
> 
> The solution here is to set ipv4.dhcp-timeout and ipv6.ra-timeout to
> infinity -- as that is apparently what you want.
> 
> `nmcli con add` does not do that by default, because such a behavior does
> not seem a desirable default. Granted, it would be desirable in your case,
> but the problem with defaults is that they never cover every possible use case
> optimally. The solution is to configure the profile in a manner that suits
> you -- especially if the default behavior is not what you need.
> 
> Does that work for you? Why not?
> 
> 

In oVirt we can work around this issue by setting the timeouts as suggested.

> 
> 
> Long term, we want more flexible ways how to enable individual address
> configuration mechanisms. For example, currently 
> 
>   - cannot enable DHCPv4 and ipv4ll together (which would be very useful)
> 
>   - or, you cannot say that ipv4ll is allowed to fail while DHCPv4 not (you
> can only set ipv4.dhcp-timeout/ipv6.ra-timeout to infinity, which is a
> subset of the things that should be possible).
> 
>   - you also cannot configure that certain addressing methods are
> mandatory/required, optionally (at least one must succeed), or entirely
> optional.
> 
> Currently, you can only approximate certain combinations by configuring
> ipv4.method, ipv6.method, ipv4.may-fail, ipv6.may-fail, ipv4.dhcp-timeout,
> ipv6.ra-timeout.
> 
> But such an entirely optional DHCP/autoconf6 configuration probably still
> would not be created by `nmcli con add`. Even with that feature, you would
> still have to opt in to that behavior during `nmcli con add`.


I will try to explain the bug in the way I understood NetworkManager to be thinking:
1.a) NetworkManager adds a bridge (layer 2) and tries to get a dynamic IP address (layer 3).
2.b) The dynamic IP address did not work; NetworkManager assumes that if NetworkManager does not succeed on layer 3,
     nobody can use the bridge, so NetworkManager removes the bridge.

The point NetworkManager misses is that there might be a step
1.b) Attach (e.g. by libvirt) another interface to the bridge,
     and use this other interface in a way which is not visible to NetworkManager,
     e.g. a static IP address inside a VM.

1.b) might happen before or after 2.b).

For this reason NetworkManager should not conclude from
"NetworkManager detects an issue on layer 3" -> "NetworkManager is responsible for removing the device on layer 2",
if not explicitly requested.

Comment 4 Thomas Haller 2020-09-01 09:59:47 UTC
> For this reason NetworkManager should not conclude from
> "NetworkManager detects an issue on layer 3" -> "NetworkManager is responsible for removing the device on layer 2",
> if not explicitly requested.

You did explicitly request it. With `nmcli con add` a profile was created that requested having either DHCP or autoconf6 (at least one of both) succeeding.

Comment 5 Dominik Holler 2020-09-01 10:36:56 UTC
(In reply to Thomas Haller from comment #4)
> > For this reason NetworkManager should not conclude from
> > "NetworkManager detects an issue on layer 3" -> "NetworkManager is responsible for removing the device on layer 2",
> > if not explicitly requested.
> 
> You did explicitly request it. With `nmcli con add` a profile was created
> that requested having either DHCP or autoconf6 (at least one of both)
> succeeding.

I do not agree. That I want to have DHCP does not imply that I want the bridge to be removed if DHCP fails.

Comment 6 Thomas Haller 2020-09-01 11:53:45 UTC
(In reply to Dominik Holler from comment #5)
> (In reply to Thomas Haller from comment #4)
>
> I do not agree. That I want to have DHCP does not imply that I want the
> bridge to be removed if DHCP fails.

if you create a profile with `nmcli con add` with DHCP enabled and ipv4.dhcp-timeout unspecified, then you get a profile with a finite DHCP timeout. And a profile with a finite DHCP timeout can fail if no lease was obtained in time. And when that happens (depending on other defined circumstances) the profile can go down -- thereby removing the bridge interface.

"That I want to have dhcp, does not imply ..." seems to me an odd way to put it. You created a profile by calling `nmcli con add ifname br0 type bridge con-name br0` which results in a profile with a certain defined behavior. Yes, the profile doesn't have the behavior you wished for, but nothing is left ambiguous for being "implied".

The solution is to create the profile with the behavior you want. Or is your issue that a plain `nmcli con add ifname br0 type bridge con-name br0` (without additional parameters) doesn't result in the desired profile?
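The alternative raised here -- creating the profile with the desired behavior from the start, instead of modifying it afterwards -- can be sketched in a single command. This is an illustrative sketch on a system with NetworkManager; the connection name is an assumption.

```shell
# Create the bridge profile with infinite address-configuration timeouts
# up front, so a DHCP/RA failure never brings the bridge down:
nmcli con add ifname br0 type bridge con-name br0 \
    ipv4.dhcp-timeout infinity ipv6.ra-timeout infinity
```

The extra properties override the finite default timeouts that a plain `nmcli con add ifname br0 type bridge con-name br0` would leave in place.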

Comment 7 Dominik Holler 2020-09-02 06:42:17 UTC
Created attachment 1713406 [details]
simple tap example

(In reply to Thomas Haller from comment #6)
> (In reply to Dominik Holler from comment #5)
> > (In reply to Thomas Haller from comment #4)
> >
> > I do not agree. That I want to have DHCP does not imply that I want the
> > bridge to be removed if DHCP fails.
> 
> if you create a profile with `nmcli con add` with DHCP enabled and
> ipv4.dhcp-timeout unspecified, then you get a profile with a finite DHCP
> timeout. And a profile with finite DHCP timeout can fail if no lease was
> obtained in time. And when that happens (depending on other defined
> circumstances) the profile can go down -- thereby removing the bridge
> interface.
> 
> "That I want to have dhcp, does not imply ..." seems to me an odd way to put
> it. You created a profile by calling `nmcli con add ifname br0 type bridge
> con-name br0` which results in a profile with a certain defined behavior.
> Yes, the profile doesn't have the behavior you wished for, but nothing is
> left ambiguous for being "implied".
> 
> The solution is to create the profile with the behavior you want. Or is your
> issue that a plain `nmcli con add ifname br0 type bridge con-name br0`
> (without additional parameters) doesn't result in the desired profile?

No doubt this philosophy is helpful for discussing the desired layer 3 behavior,
but I created this bug to take layer 2 into consideration, too.
From my point of view it is confusing, but acceptable, if the bridge is removed
before step 2.b) of comment #3, so never mind.

But from my point of view it is catastrophic if the bridge is removed, after
step 2.b), while it has slaves which are not managed by NetworkManager.
This might happen if libvirt is using the bridge, like in attachment 1652695 [details].
A simpler example could look like this:

nmcli con add type bridge ifname br0
nmcli c m bridge-br0 ipv4.dhcp-timeout 60
nmcli con add type bridge-slave ifname ens3  master br0
nmcli con up br0
nmcli con up bridge-br0
nmcli con up bridge-slave-ens3
reboot

nmcli c s
journalctl -f &
ip tuntap add name tap0 mode tap
ip link set tap0 master br0
ip l set tap0 up

-> NetworkManager removes the bridge, even though it is in use by tap0.

Maybe NetworkManager could remove the bridge only if it is sure that it is managing all slaves?
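The distinction in the question above -- which bridge ports NetworkManager manages versus which the kernel knows about -- can be inspected directly. A hedged sketch, assuming a live NetworkManager system and the br0/tap0 names from the reproducer above:

```shell
# NetworkManager's view: a tap device attached with iproute2, as above,
# typically shows up with STATE "unmanaged" here:
nmcli device status

# The kernel's view: lists all ports enslaved to br0,
# whether NetworkManager manages them or not:
ip link show master br0
```

Comparing the two outputs shows exactly the gap the reporter describes: tap0 is a port of br0 in the kernel's view, while being invisible to (or unmanaged by) NetworkManager.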

Comment 8 Thomas Haller 2020-09-02 11:57:08 UTC
NetworkManager's API (its world view) is all about profiles and activating them. You configure an interface by activating a profile. When the profile goes down, the configuration gets undone. So, asking to leave the bridge in place can mean either

 (a) prevent the profile from failing and stay connected even after the timeout to get IP configuration. That is already possible (the current way (*) is setting ipv4.dhcp-timeout=infinity). Profiles are the only real API that NetworkManager has, and you created a profile that asked to fail the connection.

 (b) let the profile fail, then the connection goes down and the device deactivates, but somehow some parts of the resources are not cleaned up. The only API that NetworkManager understands is activating and deactivating profiles. This asks for a way to have resources on your system without having a profile. Such an API does not exist, nor is it clear what this would mean or how it should be modelled.


You ask "to take layer 2 in consideration". You used the only API that NetworkManager understands (profiles) and activated a profile that is configured to go down when IP addressing times out -- and then you expect it not to do the very thing you asked for.


In NetworkManager you configure both Layer 2 and Layer 3 in one profile. And if Layer 3 fails, it also brings down Layer 2. So, in NetworkManager Layer 3 and Layer 2 succeed and fail together. The solution to the "problem" is to not configure Layer 3 in a way that explicitly asks to fail if DHCP times out.

The reason why this is so (and why this is actually good) is that in NetworkManager a profile (and a device) has one important property: the overall state of whether the profile/device is activated/connected/up or not. When this overall state goes bad, a defined state should be reached (by cleaning up all resources). Granted, you have substates, like "is DHCP successful/enabled/failed". But a DHCP failure does not necessarily mean that the overall state fails; that depends on whether you asked for that. NetworkManager will consider both Layer 2 and Layer 3.


(*) granted, setting ipv4.dhcp-timeout=infinity is a bit of a limited/ugly API for saying "ignore failures of IP configuration". We will extend that so you can mark IP methods as optional/required. But that won't change the discussion here.




I guess one aspect of the problem is that you are sidestepping NetworkManager's API by enslaving the interface to the bridge with `iproute2`, and expecting NetworkManager to accept that external action as an indication of what you want (despite the actual configuration telling NetworkManager otherwise). I don't see a solution for that, except to adjust your expectations. If the party who activated the profile requested the profile to fail when DHCP times out, then enslaving an interface with iproute2 should not overrule that. Why would it? But even if you use the NetworkManager API to enslave an interface to the bridge, the same thing still happens.

Comment 14 Till Maas 2021-04-30 13:43:26 UTC
*** Bug 1791372 has been marked as a duplicate of this bug. ***

Comment 17 RHEL Program Management 2022-02-01 07:27:24 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

