Bug 1573780

Summary: An ethernet link state change during a DHCP lease renewal causes NM to stop trying to renew the lease
Product: Red Hat Enterprise Linux 7 Reporter: Patrick Talbert <ptalbert>
Component: NetworkManagerAssignee: Beniamino Galvani <bgalvani>
Status: CLOSED ERRATA QA Contact: Desktop QE <desktop-qa-list>
Severity: medium Docs Contact:
Priority: medium    
Version: 7.5CC: atragler, bgalvani, caslivkoff, cg2v, fgiudici, fpokryvk, kdudka, lrintel, rkhan, sukulkar, thaller, vbenes
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: NetworkManager-1.12.0-0.1.el7 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-10-30 11:13:09 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Attachments:
Description Flags
debug logs
none
[PATCH] dhcp: fix handling of failure events none

Description Patrick Talbert 2018-05-02 09:10:26 UTC
Created attachment 1429803 [details]
debug logs

Description of problem:
This is a bit of an edge-case but it is straight forward to reproduce and should not happen. I am guessing some timer does not get rearmed but I have not looked that far yet...


Version-Release number of selected component (if applicable):
NetworkManager-1.10.2-13.el7.x86_64


How reproducible:
Always


Steps to Reproduce:
1. Create a connection which uses DHCP addressing and allow it to get a lease:

# nmcli con add type ethernet ifname ens3 con-name ens3


2. After some time, physically disconnect the interface AND disable the DHCP server.

3. Wait for dhclient to begin sending renewal requests... and then connect the interface again. Leave the DHCP server disabled.


Actual results:

As soon as the interface link comes back up, NM will kill the current dhclient process and start a new one. The new dhclient process will go through a single 45 second activation cycle and then time out. NetworkManager never retries and the connection is left in a 'connected' state without an IP address:

[root@r74client ~]# nmcli dev show ens3
GENERAL.DEVICE:                         ens3
GENERAL.TYPE:                           ethernet
GENERAL.HWADDR:                         52:54:00:C9:1A:2D
GENERAL.MTU:                            1500
GENERAL.STATE:                          100 (connected)
GENERAL.CONNECTION:                     ens3
GENERAL.CON-PATH:                       /org/freedesktop/NetworkManager/ActiveConnection/1
WIRED-PROPERTIES.CARRIER:               on
IP4.GATEWAY:                            --
IP4.DNS[1]:                             192.168.123.1
IP4.DOMAIN[1]:                          dhcptest.local
IP6.ADDRESS[1]:                         fe80::5054:ff:fec9:1a2d/64
IP6.GATEWAY:                            --
IP6.ROUTE[1]:                           dst = ff00::/8, nh = ::, mt = 256, table=255
IP6.ROUTE[2]:                           dst = fe80::/64, nh = ::, mt = 256
IP6.ROUTE[3]:                           dst = fe80::/64, nh = ::, mt = 100

May 01 17:08:28 r74client.example.com dhclient[824]: bound to 192.168.123.128 -- renewal in 141 seconds.
May 01 17:10:49 r74client.example.com dhclient[824]: DHCPREQUEST on ens3 to 192.168.123.254 port 67 (xid=0x79cb3b31)
May 01 17:10:54 r74client.example.com dhclient[824]: DHCPREQUEST on ens3 to 192.168.123.254 port 67 (xid=0x79cb3b31)
May 01 17:11:00 r74client.example.com dhclient[824]: DHCPREQUEST on ens3 to 192.168.123.254 port 67 (xid=0x79cb3b31)
May 01 17:11:08 r74client.example.com dhclient[824]: DHCPREQUEST on ens3 to 192.168.123.254 port 67 (xid=0x79cb3b31)
May 01 17:11:17 r74client.example.com NetworkManager[698]: <info>  [1525187477.8961] device (ens3): carrier: link connected
May 01 17:11:17 r74client.example.com NetworkManager[698]: <info>  [1525187477.8974] device (ens3): DHCPv4 lease renewal requested
May 01 17:11:17 r74client.example.com NetworkManager[698]: <info>  [1525187477.9009] dhcp4 (ens3): canceled DHCP transaction, DHCP client pid 824
May 01 17:11:17 r74client.example.com NetworkManager[698]: <info>  [1525187477.9010] dhcp4 (ens3): state changed bound -> done
May 01 17:11:17 r74client.example.com NetworkManager[698]: <info>  [1525187477.9012] dhcp4 (ens3): activation: beginning transaction (timeout in 20 seconds)
May 01 17:11:17 r74client.example.com NetworkManager[698]: <info>  [1525187477.9136] dhcp4 (ens3): dhclient started with pid 1279
May 01 17:11:17 r74client.example.com dhclient[1279]: DHCPREQUEST on ens3 to 255.255.255.255 port 67 (xid=0x5370ef35)
May 01 17:11:21 r74client.example.com dhclient[1279]: DHCPREQUEST on ens3 to 255.255.255.255 port 67 (xid=0x5370ef35)
May 01 17:11:26 r74client.example.com dhclient[1279]: DHCPREQUEST on ens3 to 255.255.255.255 port 67 (xid=0x5370ef35)
May 01 17:11:35 r74client.example.com dhclient[1279]: DHCPDISCOVER on ens3 to 255.255.255.255 port 67 interval 5 (xid=0x6712a97)
May 01 17:11:37 r74client.example.com NetworkManager[698]: <warn>  [1525187497.9664] dhcp4 (ens3): request timed out
May 01 17:11:37 r74client.example.com NetworkManager[698]: <info>  [1525187497.9665] dhcp4 (ens3): state changed unknown -> timeout
May 01 17:11:37 r74client.example.com NetworkManager[698]: <info>  [1525187497.9710] dhcp4 (ens3): canceled DHCP transaction, DHCP client pid 1279
May 01 17:11:37 r74client.example.com NetworkManager[698]: <info>  [1525187497.9710] dhcp4 (ens3): state changed timeout -> done
May 01 17:13:29 r74client.example.com NetworkManager[698]: <info>  [1525187609.0025] manager: NetworkManager state is now CONNECTED_SITE
Tue May  1 17:19:48 CEST 2018


Expected results:

NM should indefinitely continue to try to acquire the lease as it normally would.



Additional info:

Comment 3 Charles Slivkoff 2018-05-02 18:15:08 UTC
Just so that this doesn't get overlooked, the actual scenario I used to report the issue to RH support was this:

- boot normally with both switch ports connected, local and uplink
- disconnect device port
- disconnect uplink port which block traffic to DHCP server
- wait at least one minute
- connect device port
- wait at least one minute
- connect uplink port such that DHCP server can be reached

1) observe that NetworkManager uses the old lease information to configure the NIC 2) observe that NetworkManager is not running dhclient at all
3) observe that NO RETRY is ever attempted without some manual intervention or carrier state change

The issue occurred during an actual network outage where the connectivity to the DHCP server exceeded the dhclient timeout after the network was restored locally, but not fully. Communication with the DHCP server was not possible but the local switch was online.

The problem was not noticed until after the DHCP lease expired and NetworkManager proceeded to bring the NIC down. This was a considerable time after the network outage had been resolved. 

Had NetworkManager continued to spawn dhclient, the lease would not have expired and the DOS would not have exceeded the original network outage.

Comment 4 Beniamino Galvani 2018-05-13 08:08:13 UTC
Created attachment 1435612 [details]
[PATCH] dhcp: fix handling of failure events

Comment 5 Thomas Haller 2018-05-14 06:08:13 UTC
(In reply to Beniamino Galvani from comment #4)
> Created attachment 1435612 [details]
> [PATCH] dhcp: fix handling of failure events

lgtm

Comment 8 Vladimir Benes 2018-09-05 11:06:21 UTC
tested on all architectures, not reproducible anymore

Comment 10 errata-xmlrpc 2018-10-30 11:13:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3207