Bug 1573780
Summary: | An ethernet link state change during a DHCP lease renewal causes NM to stop trying to renew the lease | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Patrick Talbert <ptalbert> | ||||||
Component: | NetworkManager | Assignee: | Beniamino Galvani <bgalvani> | ||||||
Status: | CLOSED ERRATA | QA Contact: | Desktop QE <desktop-qa-list> | ||||||
Severity: | medium | Docs Contact: | |||||||
Priority: | medium | ||||||||
Version: | 7.5 | CC: | atragler, bgalvani, caslivkoff, cg2v, fgiudici, fpokryvk, kdudka, lrintel, rkhan, sukulkar, thaller, vbenes | ||||||
Target Milestone: | rc | ||||||||
Target Release: | --- | ||||||||
Hardware: | All | ||||||||
OS: | Linux | ||||||||
Whiteboard: | |||||||||
Fixed In Version: | NetworkManager-1.12.0-0.1.el7 | Doc Type: | If docs needed, set a value | ||||||
Doc Text: | Story Points: | --- | |||||||
Clone Of: | Environment: | ||||||||
Last Closed: | 2018-10-30 11:13:09 UTC | Type: | Bug | ||||||
Regression: | --- | Mount Type: | --- | ||||||
Documentation: | --- | CRM: | |||||||
Verified Versions: | Category: | --- | |||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||
Embargoed: | |||||||||
Attachments: |
|
Just so that this doesn't get overlooked, the actual scenario I used to report the issue to RH support was this: - boot normally with both switch ports connected, local and uplink - disconnect device port - disconnect uplink port which block traffic to DHCP server - wait at least one minute - connect device port - wait at least one minute - connect uplink port such that DHCP server can be reached 1) observe that NetworkManager uses the old lease information to configure the NIC 2) observe that NetworkManager is not running dhclient at all 3) observe that NO RETRY is ever attempted without some manual intervention or carrier state change The issue occurred during an actual network outage where the connectivity to the DHCP server exceeded the dhclient timeout after the network was restored locally, but not fully. Communication with the DHCP server was not possible but the local switch was online. The problem was not noticed until after the DHCP lease expired and NetworkManager proceeded to bring the NIC down. This was a considerable time after the network outage had been resolved. Had NetworkManager continued to spawn dhclient, the lease would not have expired and the DOS would not have exceeded the original network outage. Created attachment 1435612 [details]
[PATCH] dhcp: fix handling of failure events
(In reply to Beniamino Galvani from comment #4) > Created attachment 1435612 [details] > [PATCH] dhcp: fix handling of failure events lgtm Applied to master: https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=bd63d39252eab03ea2c3f02b9c9b7d3ab338ef30 tested on all architectures, not reproducible anymore Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:3207 |
Created attachment 1429803 [details] debug logs Description of problem: This is a bit of an edge-case but it is straight forward to reproduce and should not happen. I am guessing some timer does not get rearmed but I have not looked that far yet... Version-Release number of selected component (if applicable): NetworkManager-1.10.2-13.el7.x86_64 How reproducible: Always Steps to Reproduce: 1. Create a connection which uses DHCP addressing and allow it to get a lease: # nmcli con add type ethernet ifname ens3 con-name ens3 2. After some time, physically disconnect the interface AND disable the DHCP server. 3. Wait for dhclient to begin sending renewal requests... and then connect the interface again. Leave the DHCP server disabled. Actual results: As soon as the interface link comes back up, NM will kill the current dhclient process and start a new one. The new dhclient process will go through a single 45 second activation cycle and then time out. NetworkManager never retries and the connection is left in a 'connected' state without an IP address: [root@r74client ~]# nmcli dev show ens3 GENERAL.DEVICE: ens3 GENERAL.TYPE: ethernet GENERAL.HWADDR: 52:54:00:C9:1A:2D GENERAL.MTU: 1500 GENERAL.STATE: 100 (connected) GENERAL.CONNECTION: ens3 GENERAL.CON-PATH: /org/freedesktop/NetworkManager/ActiveConnection/1 WIRED-PROPERTIES.CARRIER: on IP4.GATEWAY: -- IP4.DNS[1]: 192.168.123.1 IP4.DOMAIN[1]: dhcptest.local IP6.ADDRESS[1]: fe80::5054:ff:fec9:1a2d/64 IP6.GATEWAY: -- IP6.ROUTE[1]: dst = ff00::/8, nh = ::, mt = 256, table=255 IP6.ROUTE[2]: dst = fe80::/64, nh = ::, mt = 256 IP6.ROUTE[3]: dst = fe80::/64, nh = ::, mt = 100 May 01 17:08:28 r74client.example.com dhclient[824]: bound to 192.168.123.128 -- renewal in 141 seconds. May 01 17:10:49 r74client.example.com dhclient[824]: DHCPREQUEST on ens3 to 192.168.123.254 port 67 (xid=0x79cb3b31) May 01 17:10:54 r74client.example.com dhclient[824]: DHCPREQUEST on ens3 to 192.168.123.254 port 67 (xid=0x79cb3b31) May 01 17:11:00 r74client.example.com dhclient[824]: DHCPREQUEST on ens3 to 192.168.123.254 port 67 (xid=0x79cb3b31) May 01 17:11:08 r74client.example.com dhclient[824]: DHCPREQUEST on ens3 to 192.168.123.254 port 67 (xid=0x79cb3b31) May 01 17:11:17 r74client.example.com NetworkManager[698]: <info> [1525187477.8961] device (ens3): carrier: link connected May 01 17:11:17 r74client.example.com NetworkManager[698]: <info> [1525187477.8974] device (ens3): DHCPv4 lease renewal requested May 01 17:11:17 r74client.example.com NetworkManager[698]: <info> [1525187477.9009] dhcp4 (ens3): canceled DHCP transaction, DHCP client pid 824 May 01 17:11:17 r74client.example.com NetworkManager[698]: <info> [1525187477.9010] dhcp4 (ens3): state changed bound -> done May 01 17:11:17 r74client.example.com NetworkManager[698]: <info> [1525187477.9012] dhcp4 (ens3): activation: beginning transaction (timeout in 20 seconds) May 01 17:11:17 r74client.example.com NetworkManager[698]: <info> [1525187477.9136] dhcp4 (ens3): dhclient started with pid 1279 May 01 17:11:17 r74client.example.com dhclient[1279]: DHCPREQUEST on ens3 to 255.255.255.255 port 67 (xid=0x5370ef35) May 01 17:11:21 r74client.example.com dhclient[1279]: DHCPREQUEST on ens3 to 255.255.255.255 port 67 (xid=0x5370ef35) May 01 17:11:26 r74client.example.com dhclient[1279]: DHCPREQUEST on ens3 to 255.255.255.255 port 67 (xid=0x5370ef35) May 01 17:11:35 r74client.example.com dhclient[1279]: DHCPDISCOVER on ens3 to 255.255.255.255 port 67 interval 5 (xid=0x6712a97) May 01 17:11:37 r74client.example.com NetworkManager[698]: <warn> [1525187497.9664] dhcp4 (ens3): request timed out May 01 17:11:37 r74client.example.com NetworkManager[698]: <info> [1525187497.9665] dhcp4 (ens3): state changed unknown -> timeout May 01 17:11:37 r74client.example.com NetworkManager[698]: <info> [1525187497.9710] dhcp4 (ens3): canceled DHCP transaction, DHCP client pid 1279 May 01 17:11:37 r74client.example.com NetworkManager[698]: <info> [1525187497.9710] dhcp4 (ens3): state changed timeout -> done May 01 17:13:29 r74client.example.com NetworkManager[698]: <info> [1525187609.0025] manager: NetworkManager state is now CONNECTED_SITE Tue May 1 17:19:48 CEST 2018 Expected results: NM should indefinitely continue to try to acquire the lease as it normally would. Additional info: