1503587 – better handle DHCP outages and retry DHCP indefinitely

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1503587 - better handle DHCP outages and retry DHCP indefinitely

Summary: better handle DHCP outages and retry DHCP indefinitely

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	NetworkManager
Sub Component:
Version:	7.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Francesco Giudici
QA Contact:	Desktop QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2017-10-18 12:39 UTC by Thomas Haller
Modified:	2023-12-15 16:02 UTC (History)
CC List:	13 users (show)
Fixed In Version:	NetworkManager-1.10.2-13.el7
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2018-04-10 13:31:31 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
Alternative "check_ip_state" patch (sample) (2.06 KB, patch) 2018-01-15 11:51 UTC, Francesco Giudici	no flags	Details \| Diff
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	NMT-995	0	None	None	None	2023-12-15 16:02:04 UTC
Red Hat Product Errata	RHBA-2018:0778	0	None	None	None	2018-04-10 13:32:56 UTC

Description Thomas Haller 2017-10-18 12:39:50 UTC

I am not entirely sure about what exactly happens.
This bug is also a reminder to investigate.


I think, then the DHCP server goes away, the IPv4 method will fail.

in https://cgit.freedesktop.org/NetworkManager/NetworkManager/tree/src/devices/nm-device.c?id=92774592d4c423e36d940f90d044a7bb38ffcbf8#n84 there is DHCP_NUM_TRIES_MAX and DHCP_RESTART_TIMEOUT. So, we retry for 3 times whether DHCP is still around.

Afterwards, we may fail the connection. Failing the connection at this point, is a possible useful behavior, because at that point, autoconnect kicks in and we retry indefinitely to get autoactivate a connection. Autoconnect has retry counters, backoff-timeouts, etc.

But if ipv4.may-fail=yes and there is a working IPv6 configuration, then we don't fail the connection, and autoconnect does not kick in.
Instead, DHCP stays dead indefinitely.


That needs fixing. Or at least, we need to ensure that all these scenarios work properly.


also, it might be worth to never actually stop our DHCP client. All DHCP clients have their own build-in retry handling. There is little point during NM_DHCP_STATE_TIMEOUT to kill the current DHCP client and restart it 120 seconds later. Especially, with this backoff timeout of 120 seconds, there is a long time when we would miss the return of the DHCP server. Maybe the DHCP client knows better how long to wait, and which backoff timeouts to use.

Comment 3 Francesco Giudici 2017-10-20 13:18:12 UTC

Recap of current behavior:
(1) When a dhcp connection is upped, if the dhcp client is not able to
get a lease in "dhcp-timeout" seconds, the connection fails.
If the connection has "autoconnect" enabled, it will be tried
"autoconnect-retries" times (default is 4), basically respawning the
dhcp client. If the connection still cannot acquire the dhcp address,
activation will fail.
After 5 minutes, the auto-connection process will start from scratch,
giving again a chance to the dhcp client to acquire an address.

(2)If a dhcp connection is upped successfully but later the lease
acquired cannot be renewed, the dhcp client will be restarted for 3
times every 2 minutes to try to get a lease: each time it will have
"dhcp-timeout" seconds to get the lease.
If all the three attempts fail, what happens depends on the
"ipvX.may-fail" value:
(a) if "yes" (default): connection is kept up but the dhcp client will
never be spawned again.
(b) if "no": connection will be deactivated. At this point if the
connection has "autoconnect" enabled, NetworkManager will try to
activate the connection from scratch and will behave like in (1).


Patch proposed in fg/dhcp_lease keeps trying acquiring the lease forever for case (2)(a), which is the default configuration for newly created connections.

This is a minimal patch: maybe a rework to better manager retries is required.
For this reason not moving the bug to POST.

Please give a look to fg/dhcp_lease.

Comment 4 Thomas Haller 2017-10-20 18:51:20 UTC

> Please give a look to fg/dhcp_lease.

If you set ipv4.may-fail=yes and ipv6.may-fail=yes (the default), then a connection should still fail if it has not the corresponding other IP address familiy configured.
"may-fail" really means, may-fail-if-the-other-address-family-is-working.

Assumption (2.b) is not correct and nm_device_ip_method_failed() may not fail the active connection entirely.

If DHCP is enabled, we must indefinitely retry, never stop until the connection deactivates. That is independent from ip4_state.

Comment 5 Francesco Giudici 2017-10-21 22:33:00 UTC

(In reply to Thomas Haller from comment #4)
> > Please give a look to fg/dhcp_lease.
> 
> If you set ipv4.may-fail=yes and ipv6.may-fail=yes (the default), then a
> connection should still fail if it has not the corresponding other IP
> address familiy configured.
> "may-fail" really means, may-fail-if-the-other-address-family-is-working.

Yes, that's is the current behavior on connection start-up.
But on dhcp lease renewal, on current NetworkManager what happens is that only the "may.fail" belonging to the dhcp address family is checked (i.e., if dhcp4 fails, only ipv4.may-fail is checked).
This due to check_ip_state() returning immediately if
nm_device_get_state (self) != NM_DEVICE_STATE_IP_CONFIG

So, to be clear, the state of the other address family is not taken into account at all. Behavior is determined only on the corresponding address family "may-fail" as described in point (2), comment 3

> 
> Assumption (2.b) is not correct and nm_device_ip_method_failed() may not
> fail the active connection entirely.

This is what happens currently: if may-fail is no and dhcp fails to renew the lease, the connection is terminated.

> 
> If DHCP is enabled, we must indefinitely retry, never stop until the
> connection deactivates. That is independent from ip4_state.

Well, don't know what would be better here. We loose the address in the case the lease cannot be renewed: could be useful to allow to fail the connection to allow autoconnection to jump in and activate a lower priority backup connection with a static ip. Or anyway, if there is only that dhcp connection and autoconnect has not been disabled, it would be retried forever.

But for sure, if we keep the connection up and we are not able to renew the lease, we should keep trying to renew it.

Current behavior is:
may-fail=yes --> keep connection up
may-fail=no --> fail the whole connection and allow autoconnect to jump in

The only thing that is clearly wrong is that if the connection is kept up, at some point there will be no more lease renewal attempt also if the connection has no ip configuration. The patch fixes just this.
Maybe we could rework a bit the functionality instead.

Comment 6 Beniamino Galvani 2017-10-23 12:00:44 UTC

I think it's a bug in the current code that the connection is never failed after the renewal failure when may-fail=yes. Instead we should use the same logic implemented during connection activation (check both methods).

Comment 8 Francesco Giudici 2018-01-11 15:28:58 UTC

On lease renewal failure (all 3 attempts failed):
* Check if the other ip family is properly configured on the device before taking a decision if the connection should stay up or be disconnected.
* Keep trying renewing the dhcp if the connection is up thanks to the other ip family configured.

Please review branch fg/dhcp_lease

Comment 9 Thomas Haller 2018-01-12 13:54:41 UTC

> dhcp: simplify condition checks in dhcpX_fail

I don't understand the code that was there before, and it's not clear to me how the condition behaves now.

+ } else    if (   priv->ip4_st
        ^^^^
    


> device: always consider both ip families when deciding to fail

 {
+
     g_return_if_fail (NM_IS_DEVICE (self));

spurious newline.

Previously, the code would either check_ip_state() or nm_device_state_changed().
Now, it might not do anything. It basically moves additional logic on how to fail to nm_device_ip_method_failed(). I think it would be better to extend check_ip_state() to handle this case. That way, all the conditions are at one place. Also, check_ip_state() does already something similar to nm_device_ip_other_family_successful(). It dupliates the logic in parts.
Maybe you need another argument to check_ip_state(), like "gboolean method_just_failed".


> device: never stop trying renewing the lease

Now NMDevice makes uses of the active connections state flags. Although, it is NMDevice who sets the flags in the first place. So, priv->ip4_state results in setting active connection state flags and then later the state depends on these flags. That is a bit ugly. Is that necessary? Can you change that easily?
Could you add and use
   bool ip4_done_state_reached:1
   bool ip6_done_state_reached:1
flags?
You can set them to true in _set_ip_state() when IP_DONE is reached, and clear them when setting IP_NONE.

Comment 10 Francesco Giudici 2018-01-15 10:16:02 UTC

(In reply to Thomas Haller from comment #9)
> > dhcp: simplify condition checks in dhcpX_fail
> 
> I don't understand the code that was there before, and it's not clear to me
> how the condition behaves now.

previusly was:
- priv->dhcp4.num_tries_left == DHCP_NUM_TRIES_MAX
  "if we have never retried dhcp till now" AND
- timeout
  "we get no reply from a dhcp server"
- priv->ip4_state == IP_CONF
  "we are trying to configure ip (first time)

so, it was: "if we never had a dhcp failure on this ip configuration AND (the DHCP server timed out OR we are configuring ip for the first time in this activation)" --> schedule ip timeout
"else if ip was already configured successfully for this activation or we assumed the connection (so we need to renew the lease), retry dhcp"

The (priv->dhcp4.num_tries_left == DHCP_NUM_TRIES_MAX) was required because when the lease expires after dhcp conf was successful we get a DHCP "EXPIRE" event which has "timeout=FALSE": but after that, we respawn dhcp and we get the "timeout=TRUE" and we don't want to do a nm_device_activate_schedule_ip4_config_timeout()
anymore: just retry or fail dhcp.


With the change it is:
"if ip was successfully configured or we inherited dhcp from an assumed connection" --> restart or fail dhcp
"else if ip was configuring or we got a timeout schedule an ip conf timeout"


It seemed to me more easy in this way... but if it is not, the goal of the commit is missed and I think would be better to drop it. Let's drop it?


> 
> + } else    if (   priv->ip4_st
>         ^^^^

Thanks, fixed

>     
> 
> 
> > device: always consider both ip families when deciding to fail
> 
>  {
> +
>      g_return_if_fail (NM_IS_DEVICE (self));
> 
> spurious newline.

Thanks, fixed


> 
> Previously, the code would either check_ip_state() or
> nm_device_state_changed().
> Now, it might not do anything.
Well, this is not exact, now it does something more indeed:
"check_ip_state()" does nothing when
"nm_device_get_state (self) != NM_DEVICE_STATE_IP_CONFIG".
I anticipated the check in order to fail the device one time more than before, without touching "check_ip_state()" which main duty is:
"Transition the device from IP_CONFIG to the next state according to the outcome of IPv4 and IPv6 configuration."

> It basically moves additional logic on how to
> fail to nm_device_ip_method_failed(). I think it would be better to extend
> check_ip_state() to handle this case. That way, all the conditions are at
> one place. Also, check_ip_state() does already something similar to
> nm_device_ip_other_family_successful(). It dupliates the logic in parts.
> Maybe you need another argument to check_ip_state(), like "gboolean
> method_just_failed".

I started doing something like what you suggested... but then I stopped as it looked bad to me as:
- check_ip_state() main duty is to transit device states: I had to keep the logic but with the "method_just_failed" boolean I should skip state updates when != from FAILED
- I had to skip basically all the code just keeping few lines of code

Looked better to me from a functional PoV to re-implement the logic there. Sure, not happy to duplicate checks, but seemed the least worst. Still I would slightly prefer to keep it apart from check_ip_state()... but I can move it easily in check_ip_state() if we prefer this way.

> 
> 
> > device: never stop trying renewing the lease
> 
> Now NMDevice makes uses of the active connections state flags. Although, it
> is NMDevice who sets the flags in the first place. So, priv->ip4_state
> results in setting active connection state flags and then later the state
> depends on these flags. That is a bit ugly. Is that necessary? Can you
> change that easily?
> Could you add and use
>    bool ip4_done_state_reached:1
>    bool ip6_done_state_reached:1
> flags?
> You can set them to true in _set_ip_state() when IP_DONE is reached, and
> clear them when setting IP_NONE.

Yes, could be done. Or we can leverage "dhcp4.num_tries_left" as before, this time checking for != DHCP_NUM_TRIES_MAX. If we drop the first commit "dhcp: simplify condition checks in dhcpX_fail" this comes for free. Maybe then the first commit should be dropped...

Comment 11 Francesco Giudici 2018-01-15 11:51:31 UTC

Created attachment 1381385 [details]
Alternative "check_ip_state" patch (sample)

Comment 12 Francesco Giudici 2018-01-15 11:54:57 UTC

Updated branch fg/dhcp_lease:
. dropped "dhcp: simplify condition checks in dhcpX_fail" commit
. switched from checking on NM_ACTIVATION_STATE_FLAG_IP4_READY to
  DHCP_NUM_TRIES_MAX
. still left instead the standalone "other_ip_method" check: adapting
  check_ip_state would require in our case to selectively drop the device state
  updates when different from FAILED. Sample patch of how would it look in 
  attachment 1381385 [details]. If looks better this one, just tell.

Thanks

Comment 13 Thomas Haller 2018-01-22 10:56:34 UTC

(In reply to Francesco Giudici from comment #10)
> (In reply to Thomas Haller from comment #9)

> > It basically moves additional logic on how to
> > fail to nm_device_ip_method_failed(). I think it would be better to extend
> > check_ip_state() to handle this case. That way, all the conditions are at
> > one place. Also, check_ip_state() does already something similar to
> > nm_device_ip_other_family_successful(). It dupliates the logic in parts.
> > Maybe you need another argument to check_ip_state(), like "gboolean
> > method_just_failed".
> 
> I started doing something like what you suggested... but then I stopped as
> it looked bad to me as:

These two arguments don't convince me.

> - check_ip_state() main duty is to transit device states: I had to keep the
> logic but with the "method_just_failed" boolean I should skip state updates
> when != from FAILED

check_ip_state() is a series of checks and optional changes of the device state. Just like:

    else if (!nm_device_ip_other_family_successful (self, addr_family))
        nm_device_state_changed (self, NM_DEVICE_STATE_FAILED, reason);


> - I had to skip basically all the code just keeping few lines of code

The entire check_ip_state() is a series of "if() { do_something(); return; }". To me, this return-early pattern makes it a bit clearer what happens, because you can read it from top-to-bottom and identify situations where we do something and are done.
"skip all the code" is what check_ip_state() does already heavily, and in this case it seems preferably to me -- contrary to patterns where you would want to minimize the number of return-points.
 

attachment 1381385 [details] seems nice to me, only the name "do_state_update" seems not great.



> > Now NMDevice makes uses of the active connections state flags. Although, it
> > is NMDevice who sets the flags in the first place. So, priv->ip4_state
> > results in setting active connection state flags and then later the state
> > depends on these flags. That is a bit ugly. Is that necessary? Can you
> > change that easily?
> > Could you add and use
> >    bool ip4_done_state_reached:1
> >    bool ip6_done_state_reached:1
> > flags?
> > You can set them to true in _set_ip_state() when IP_DONE is reached, and
> > clear them when setting IP_NONE.
> 
> Yes, could be done. Or we can leverage "dhcp4.num_tries_left" as before,
> this time checking for != DHCP_NUM_TRIES_MAX. If we drop the first commit
> "dhcp: simplify condition checks in dhcpX_fail" this comes for free. Maybe
> then the first commit should be dropped...

Is "priv->dhcp4.num_tries_left < DHCP_NUM_TRIES_MAX" really the same as ip4_done_state_reached? If DHCP keeps failing and retrying, num_tries_left will count down, but it never actually reached the IP_DONE state.

Comment 14 Francesco Giudici 2018-01-22 18:13:03 UTC

(In reply to Thomas Haller from comment #13)
> (In reply to Francesco Giudici from comment #10)
> > (In reply to Thomas Haller from comment #9)
> 
[...]
>  
> 
> attachment 1381385 [details] seems nice to me, only the name
> "do_state_update" seems not great.

Ok, changed the code leveraging the patch... changed also "do_state_update" to "full_state_update" (it is the best I could think of... if you have a better name, please tell! ;-) )
> 
> 
> 
> > > Now NMDevice makes uses of the active connections state flags. Although, it
> > > is NMDevice who sets the flags in the first place. So, priv->ip4_state
> > > results in setting active connection state flags and then later the state
> > > depends on these flags. That is a bit ugly. Is that necessary? Can you
> > > change that easily?
> > > Could you add and use
> > >    bool ip4_done_state_reached:1
> > >    bool ip6_done_state_reached:1
> > > flags?
> > > You can set them to true in _set_ip_state() when IP_DONE is reached, and
> > > clear them when setting IP_NONE.
> > 
> > Yes, could be done. Or we can leverage "dhcp4.num_tries_left" as before,
> > this time checking for != DHCP_NUM_TRIES_MAX. If we drop the first commit
> > "dhcp: simplify condition checks in dhcpX_fail" this comes for free. Maybe
> > then the first commit should be dropped...
> 
> Is "priv->dhcp4.num_tries_left < DHCP_NUM_TRIES_MAX" really the same as
> ip4_done_state_reached? If DHCP keeps failing and retrying, num_tries_left
> will count down, but it never actually reached the IP_DONE state.

"priv->dhcp4.num_tries_left--" is guarded by:

  else if (   priv->dhcp4.num_tries_left < DHCP_NUM_TRIES_MAX
           || priv->ip4_state == IP_DONE
           || priv->dhcp4.was_active) {

no other code decrements it. So, the only way for it to be < DHCP_NUM_TRIES_MAX is that we reached ip4_state == IP_DONE (or that we assumed the connection which had a configured dhcp already there).
DHCP retries apply only on lease renewal.
So, yes, I think should be safe to assume that.

Updated patch on new branch: fg/dhcp_lease-rh1503587

Comment 15 Thomas Haller 2018-01-23 08:40:38 UTC

               nm_device_ip_method_failed (self, AF_INET, NM_DEVICE_STATE_REASON_IP_CONFIG_EXPIRED);
+              /* We failed but schedule again the retries as the connection may stay up
+               * if the ipv6 method is configured and we want to keep trying renewing
+               * our lost lease in this case.
+               */
+              dhcp_schedule_restart (self, AF_INET, "renewal failed");


the order seems wrong. nm_device_ip_method_failed() might already change state, in which case you don't want to restart DHCP. Either, first restart, or only restart if the state is still as expected.

If you do that, do you even still need
+         /* Stop DHCP renewal if any in progress (as too many lease renewal
+          * failures may bring us here) */
+         dhcp4_cleanup (self, CLEANUP_TYPE_DECONFIGURE, FALSE);
+         dhcp6_cleanup (self, CLEANUP_TYPE_DECONFIGURE, FALSE);
?


+         /* After long time we have been able to renew the lease:
+          * update the ip state
+          */
+         if (   priv->dhcp4.num_tries_left < DHCP_NUM_TRIES_MAX
+             && priv->ip4_state == IP_FAIL) {
+              _set_ip_state (self, AF_INET, IP_CONF);
+         }

why IP_CONF?

Comment 16 Francesco Giudici 2018-01-23 09:54:49 UTC

(In reply to Thomas Haller from comment #15)
>                nm_device_ip_method_failed (self, AF_INET,
> NM_DEVICE_STATE_REASON_IP_CONFIG_EXPIRED);
> +              /* We failed but schedule again the retries as the connection
> may stay up
> +               * if the ipv6 method is configured and we want to keep
> trying renewing
> +               * our lost lease in this case.
> +               */
> +              dhcp_schedule_restart (self, AF_INET, "renewal failed");
> 
> 
> the order seems wrong. nm_device_ip_method_failed() might already change
> state, in which case you don't want to restart DHCP. Either, first restart,
> or only restart if the state is still as expected.

Yeah, better to restart just after checking that device state is not FAILED.

> 
> If you do that, do you even still need
> +         /* Stop DHCP renewal if any in progress (as too many lease renewal
> +          * failures may bring us here) */
> +         dhcp4_cleanup (self, CLEANUP_TYPE_DECONFIGURE, FALSE);
> +         dhcp6_cleanup (self, CLEANUP_TYPE_DECONFIGURE, FALSE);
> ?

No :-) Removed.

> 
> 
> +         /* After long time we have been able to renew the lease:
> +          * update the ip state
> +          */
> +         if (   priv->dhcp4.num_tries_left < DHCP_NUM_TRIES_MAX
> +             && priv->ip4_state == IP_FAIL) {
> +              _set_ip_state (self, AF_INET, IP_CONF);
> +         }
> 
> why IP_CONF?

This is a transition state: we will in a while fall in the "if (ip4_state == IP_CONF)" few lines below that will do everything needed to transit to the IP_DONE state. But in the meanwhile be coherent with the ip state.

Thanks, updated branch fg/dhcp_lease-rh1503587

Comment 17 Scott Shambarger 2018-01-24 23:31:16 UTC

I've been trying to get NetworkManager to correctly handle may_fail on my server, but the high level logic appears to not support reliable connectivity. (I'm forced to use ipv6.ignore and roll my own dhclient scripts).

Example: ipv4 works, but DHCP6 vanishes for awhile, and the lease expires.  When DHCP6 comes back, ipv6 connectivity is never restored, despite ipv6.may_fail=no!

ipv{4,6}.may_fail should independently determine if state should return to IP_CONF after failure.  A may_fail=no config should never reach IP_DONE without a valid address.  ie. Retries should basically not stop; ipv6.may_fail=no should mean what it says.  ipv4 state in this case is really immaterial (and vice-versa).

I understand the desire to have ipv4/6 give up spamming DHCP if the associated may_fail=yes (by default)... ideally there should be a connection.may_fail=no setting that can say this config needs to work, but doesn't matter if it's ipv4 or 6 (and both ipv{4,6}.may_fail=yes). With connection.may_fail=yes (the default), you have the current behavior and alternative configs get a shot.

In short, losing ipv6 connectivity when ipv6.may_fail=no (and a DHCP6 server is available) is a bug... one I'd love to have fixed to I can ditch my custom scripts :)

Comment 18 Thomas Haller 2018-01-25 10:12:34 UTC

(In reply to Scott Shambarger from comment #17)
> I've been trying to get NetworkManager to correctly handle may_fail ...

Did you try the WIP branch on this bug? Are you saying, that this branch does not fix your issue?

Comment 19 Scott Shambarger 2018-01-25 18:13:59 UTC

I tried the fg/dhcp_lease-rh1503587 branch (commit 137ace5...), and with ipv6.may-fail=no, the problem I described in comment 17 still exists:

- Static ipv4 (so that never fails)
- DHCPD6 server goes away
- Lease expires (and no retries are even attempted, but I already logged this as https://bugzilla.gnome.org/show_bug.cgi?id=792745, but that just delays the problem...)
- DHCPD6 server returns
- No ipv6 connectivity is re-established

(Please let me know if you'd like more details on reproducing this... or perhaps I'm doing something wrong? :)

Comment 20 Francesco Giudici 2018-01-28 14:22:10 UTC

(In reply to Scott Shambarger from comment #19)
> I tried the fg/dhcp_lease-rh1503587 branch (commit 137ace5...), and with
> ipv6.may-fail=no, the problem I described in comment 17 still exists:
> 
> - Static ipv4 (so that never fails)
> - DHCPD6 server goes away
> - Lease expires (and no retries are even attempted, but I already logged
> this as https://bugzilla.gnome.org/show_bug.cgi?id=792745, but that just
> delays the problem...)
> - DHCPD6 server returns
> - No ipv6 connectivity is re-established
> 
> (Please let me know if you'd like more details on reproducing this... or
> perhaps I'm doing something wrong? :)

Hi Scott, thanks for reporting and jumping on this early!
The issue is the one you reported in https://bugzilla.gnome.org/show_bug.cgi?id=792745. Good catch.
Let's make the patch here deal also with that too. I will deal with it too.

I would like anyway to add a thing here: current logic of may-fail means that if you set one ipX to "no", failing renewing the lease for that ip family would cause your connection to be teared down (after the 3 checks).
At his point if the connection has autoconnect=yes, the connection should be retried in 5 minutes otherwise it will stay down.

Comment 21 Scott Shambarger 2018-01-28 18:01:01 UTC

(In reply to Francesco Giudici from comment #20)
> https://bugzilla.gnome.org/show_bug.cgi?id=792745. Good catch.
> Let's make the patch here deal also with that too. I will deal with it too.

Excellent... hopefully the patch might make it to Fedora too? (which I use :)

> I would like anyway to add a thing here: current logic of may-fail means
> that if you set one ipX to "no", failing renewing the lease for that ip
> family would cause your connection to be teared down (after the 3 checks).
> At his point if the connection has autoconnect=yes, the connection should be
> retried in 5 minutes otherwise it will stay down.

Hmmm... well, fg/dhcp_lease-rh1503587 definitely still has that bug then. If the connection already had a working address for an ipX.may-fail=no family, and then fails to renew (really rebind), the connection isn't brought down - so the autoconnect logic never kicks in.

Also, sounds like the may-fail logic makes the connection quite fragile...either family goes down, and the other family loses connectivity for 5 minutes! (why the delay? and why no check if the teardown has an alternative config before breaking an existing, working, connection family?).

Since NetworkManager is intentionally overriding the quite robust retry logic in the dhclient to implement may-fail (or not as it appears :), the least it could do if have a setting to restore it.

How do I tell NetworkManager that I want the address family to retry forever (dhclient logic), and that I'm not using may-fail/autoconnect? (especially since it affects other addresses families which have their own lifecycle, or are static and have none!).  Perhaps ipX.dhcp-retry=-1 (infinity) so that the address family can return without affecting the whole connection? ... Ideally with no delays between retries... if the dhcp server returns, the host should get an address immediately;  dhclient does this! :)

In short, how about dhcp-retry=-1 means don't kill dhclient, let it do it's thing!

Comment 22 Francesco Giudici 2018-02-08 15:17:26 UTC

(In reply to Scott Shambarger from comment #21)
> (In reply to Francesco Giudici from comment #20)
> > https://bugzilla.gnome.org/show_bug.cgi?id=792745. Good catch.
> > Let's make the patch here deal also with that too. I will deal with it too.
> 
> Excellent... hopefully the patch might make it to Fedora too? (which I use :)

Sure, it will land there too.

> 
> > I would like anyway to add a thing here: current logic of may-fail means
> > that if you set one ipX to "no", failing renewing the lease for that ip
> > family would cause your connection to be teared down (after the 3 checks).
> > At his point if the connection has autoconnect=yes, the connection should be
> > retried in 5 minutes otherwise it will stay down.
> 
> Hmmm... well, fg/dhcp_lease-rh1503587 definitely still has that bug then. If
> the connection already had a working address for an ipX.may-fail=no family,
> and then fails to renew (really rebind), the connection isn't brought down -
> so the autoconnect logic never kicks in.

Not sure we are on the same page here... may-fail=no means if it fails (during lease renewal are 3 checks to declare the "failure") the connection is brought down. Rationale is "may-fail=no" means that connection is not valid if that ip family fails.

The 5 minutes are due to the autoconnect feature... but let's skip it from dhcp lease renewal discussion, as it is something "more", not really the way to let dhcp retry ;-)

> 
> Also, sounds like the may-fail logic makes the connection quite
> fragile...either family goes down, and the other family loses connectivity
> for 5 minutes! (why the delay? and why no check if the teardown has an
> alternative config before breaking an existing, working, connection family?).

Well, let's drop as said the autoconnectivity feature from the discussion: if you configured both the "ipvX.may-fail=no", the connection will go down if one of them fails.
May-fail=no means that connection becomes invalid if that ip family fails.


> 
> Since NetworkManager is intentionally overriding the quite robust retry
> logic in the dhclient to implement may-fail (or not as it appears :), the
> least it could do if have a setting to restore it.
> 
> How do I tell NetworkManager that I want the address family to retry forever
> (dhclient logic), and that I'm not using may-fail/autoconnect? (especially
> since it affects other addresses families which have their own lifecycle, or
> are static and have none!).  Perhaps ipX.dhcp-retry=-1 (infinity) so that
> the address family can return without affecting the whole connection? ...
> Ideally with no delays between retries... if the dhcp server returns, the
> host should get an address immediately;  dhclient does this! :)
> 
> In short, how about dhcp-retry=-1 means don't kill dhclient, let it do it's
> thing!

There is already a conf like that for ipv4: just set "ipv4.dhcp-timeout" to maxint32 (2147483647)... or, via nmcli, you can use "infinity" to do that:
$> nmcli conn modify $CONN ipv4.dhcp-timeout infinity
A similar property is missing for ipv6. Maybe you can open a RFE (feature request) if you need that.

Please, let's move further discussion for improvements on https://bugzilla.gnome.org/show_bug.cgi?id=792745
as current discussion is no more strictly related to this bug.

Comment 23 Francesco Giudici 2018-02-19 14:41:00 UTC

Added a new patch on top of the fg/dhcp_lease-rh1503587 branch to address the IPv6 DHCP renew issue.

Comment 24 Beniamino Galvani 2018-02-20 13:35:18 UTC

LGTM

Comment 25 Francesco Giudici 2018-02-20 15:57:17 UTC

Merged on master:
https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=b9e22ece2d

Comment 26 Francesco Giudici 2018-02-20 17:57:56 UTC

Backported on nm-1-10:
https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=bcb37e8093

Comment 28 Vladimir Benes 2018-02-21 12:02:39 UTC

inbuild into CI, working well.

Comment 31 errata-xmlrpc 2018-04-10 13:31:31 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0778

Note You need to log in before you can comment on or make changes to this bug.