Bug 1573780 - An ethernet link state change during a DHCP lease renewal causes NM to stop trying to renew the lease
Summary: An ethernet link state change during a DHCP lease renewal causes NM to stop t...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: NetworkManager
Version: 7.5
Hardware: All
OS: Linux
medium
medium
Target Milestone: rc
: ---
Assignee: Beniamino Galvani
QA Contact: Desktop QE
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-05-02 09:10 UTC by Patrick Talbert
Modified: 2018-10-31 08:28 UTC (History)
12 users (show)

Fixed In Version: NetworkManager-1.12.0-0.1.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-10-30 11:13:09 UTC
Target Upstream Version:


Attachments (Terms of Use)
debug logs (237.49 KB, text/plain)
2018-05-02 09:10 UTC, Patrick Talbert
no flags Details
[PATCH] dhcp: fix handling of failure events (3.05 KB, patch)
2018-05-13 08:08 UTC, Beniamino Galvani
no flags Details | Diff


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:3207 None None None 2018-10-30 11:14:29 UTC
Red Hat Knowledge Base (Solution) 3672661 None None None 2018-10-31 08:28:01 UTC

Description Patrick Talbert 2018-05-02 09:10:26 UTC
Created attachment 1429803 [details]
debug logs

Description of problem:
This is a bit of an edge-case but it is straight forward to reproduce and should not happen. I am guessing some timer does not get rearmed but I have not looked that far yet...


Version-Release number of selected component (if applicable):
NetworkManager-1.10.2-13.el7.x86_64


How reproducible:
Always


Steps to Reproduce:
1. Create a connection which uses DHCP addressing and allow it to get a lease:

# nmcli con add type ethernet ifname ens3 con-name ens3


2. After some time, physically disconnect the interface AND disable the DHCP server.

3. Wait for dhclient to begin sending renewal requests... and then connect the interface again. Leave the DHCP server disabled.


Actual results:

As soon as the interface link comes back up, NM will kill the current dhclient process and start a new one. The new dhclient process will go through a single 45 second activation cycle and then time out. NetworkManager never retries and the connection is left in a 'connected' state without an IP address:

[root@r74client ~]# nmcli dev show ens3
GENERAL.DEVICE:                         ens3
GENERAL.TYPE:                           ethernet
GENERAL.HWADDR:                         52:54:00:C9:1A:2D
GENERAL.MTU:                            1500
GENERAL.STATE:                          100 (connected)
GENERAL.CONNECTION:                     ens3
GENERAL.CON-PATH:                       /org/freedesktop/NetworkManager/ActiveConnection/1
WIRED-PROPERTIES.CARRIER:               on
IP4.GATEWAY:                            --
IP4.DNS[1]:                             192.168.123.1
IP4.DOMAIN[1]:                          dhcptest.local
IP6.ADDRESS[1]:                         fe80::5054:ff:fec9:1a2d/64
IP6.GATEWAY:                            --
IP6.ROUTE[1]:                           dst = ff00::/8, nh = ::, mt = 256, table=255
IP6.ROUTE[2]:                           dst = fe80::/64, nh = ::, mt = 256
IP6.ROUTE[3]:                           dst = fe80::/64, nh = ::, mt = 100

May 01 17:08:28 r74client.example.com dhclient[824]: bound to 192.168.123.128 -- renewal in 141 seconds.
May 01 17:10:49 r74client.example.com dhclient[824]: DHCPREQUEST on ens3 to 192.168.123.254 port 67 (xid=0x79cb3b31)
May 01 17:10:54 r74client.example.com dhclient[824]: DHCPREQUEST on ens3 to 192.168.123.254 port 67 (xid=0x79cb3b31)
May 01 17:11:00 r74client.example.com dhclient[824]: DHCPREQUEST on ens3 to 192.168.123.254 port 67 (xid=0x79cb3b31)
May 01 17:11:08 r74client.example.com dhclient[824]: DHCPREQUEST on ens3 to 192.168.123.254 port 67 (xid=0x79cb3b31)
May 01 17:11:17 r74client.example.com NetworkManager[698]: <info>  [1525187477.8961] device (ens3): carrier: link connected
May 01 17:11:17 r74client.example.com NetworkManager[698]: <info>  [1525187477.8974] device (ens3): DHCPv4 lease renewal requested
May 01 17:11:17 r74client.example.com NetworkManager[698]: <info>  [1525187477.9009] dhcp4 (ens3): canceled DHCP transaction, DHCP client pid 824
May 01 17:11:17 r74client.example.com NetworkManager[698]: <info>  [1525187477.9010] dhcp4 (ens3): state changed bound -> done
May 01 17:11:17 r74client.example.com NetworkManager[698]: <info>  [1525187477.9012] dhcp4 (ens3): activation: beginning transaction (timeout in 20 seconds)
May 01 17:11:17 r74client.example.com NetworkManager[698]: <info>  [1525187477.9136] dhcp4 (ens3): dhclient started with pid 1279
May 01 17:11:17 r74client.example.com dhclient[1279]: DHCPREQUEST on ens3 to 255.255.255.255 port 67 (xid=0x5370ef35)
May 01 17:11:21 r74client.example.com dhclient[1279]: DHCPREQUEST on ens3 to 255.255.255.255 port 67 (xid=0x5370ef35)
May 01 17:11:26 r74client.example.com dhclient[1279]: DHCPREQUEST on ens3 to 255.255.255.255 port 67 (xid=0x5370ef35)
May 01 17:11:35 r74client.example.com dhclient[1279]: DHCPDISCOVER on ens3 to 255.255.255.255 port 67 interval 5 (xid=0x6712a97)
May 01 17:11:37 r74client.example.com NetworkManager[698]: <warn>  [1525187497.9664] dhcp4 (ens3): request timed out
May 01 17:11:37 r74client.example.com NetworkManager[698]: <info>  [1525187497.9665] dhcp4 (ens3): state changed unknown -> timeout
May 01 17:11:37 r74client.example.com NetworkManager[698]: <info>  [1525187497.9710] dhcp4 (ens3): canceled DHCP transaction, DHCP client pid 1279
May 01 17:11:37 r74client.example.com NetworkManager[698]: <info>  [1525187497.9710] dhcp4 (ens3): state changed timeout -> done
May 01 17:13:29 r74client.example.com NetworkManager[698]: <info>  [1525187609.0025] manager: NetworkManager state is now CONNECTED_SITE
Tue May  1 17:19:48 CEST 2018


Expected results:

NM should indefinitely continue to try to acquire the lease as it normally would.



Additional info:

Comment 3 Charles Slivkoff 2018-05-02 18:15:08 UTC
Just so that this doesn't get overlooked, the actual scenario I used to report the issue to RH support was this:

- boot normally with both switch ports connected, local and uplink
- disconnect device port
- disconnect uplink port which block traffic to DHCP server
- wait at least one minute
- connect device port
- wait at least one minute
- connect uplink port such that DHCP server can be reached

1) observe that NetworkManager uses the old lease information to configure the NIC 2) observe that NetworkManager is not running dhclient at all
3) observe that NO RETRY is ever attempted without some manual intervention or carrier state change

The issue occurred during an actual network outage where the connectivity to the DHCP server exceeded the dhclient timeout after the network was restored locally, but not fully. Communication with the DHCP server was not possible but the local switch was online.

The problem was not noticed until after the DHCP lease expired and NetworkManager proceeded to bring the NIC down. This was a considerable time after the network outage had been resolved. 

Had NetworkManager continued to spawn dhclient, the lease would not have expired and the DOS would not have exceeded the original network outage.

Comment 4 Beniamino Galvani 2018-05-13 08:08:13 UTC
Created attachment 1435612 [details]
[PATCH] dhcp: fix handling of failure events

Comment 5 Thomas Haller 2018-05-14 06:08:13 UTC
(In reply to Beniamino Galvani from comment #4)
> Created attachment 1435612 [details]
> [PATCH] dhcp: fix handling of failure events

lgtm

Comment 8 Vladimir Benes 2018-09-05 11:06:21 UTC
tested on all architectures, not reproducible anymore

Comment 10 errata-xmlrpc 2018-10-30 11:13:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3207


Note You need to log in before you can comment on or make changes to this bug.