Bug 1330893

Summary: NetworkManager.service never reaches its 'startup complete' state IFF MTU=9000 (ixgbe driver)
Product: Red Hat Enterprise Linux 7
Reporter: Karsten Weiss <knweiss>
Component: NetworkManager
Assignee: Beniamino Galvani <bgalvani>
Status: CLOSED ERRATA
QA Contact: Desktop QE <desktop-qa-list>
Severity: medium
Priority: medium
Version: 7.2
CC: aloughla, atragler, bgalvani, fgiudici, lrintel, mjtrangoni, rkhan, thaller, vbenes
Target Milestone: rc
Hardware: x86_64
OS: Linux
Fixed In Version: NetworkManager-1.4.0-0.1.git20160606.b769b4df.el7
Doc Type: Bug Fix
Last Closed: 2016-11-03 19:09:12 UTC
Type: Bug
Attachments:
  Description                                                        Flags
  NetworkManager logs, nm-online strace, systemd-analyze plots       none
  [PATCH] device: remove pending dhcp actions also in IP_DONE state  none

Description Karsten Weiss 2016-04-27 09:13:31 UTC
Created attachment 1151252
NetworkManager logs, nm-online strace, systemd-analyze plots

Description of problem:

NetworkManager.service never reaches its 'startup complete' state, and
thus NetworkManager-wait-online.service (nm-online -s --timeout=30)
always times out and fails, if and only if I set MTU=9000 on ens1f0
(ixgbe driver).

Version-Release number of selected component (if applicable):

(Full disclosure: I see this on a CentOS 7.2 system. So feel free to ignore
this bug report.)

NetworkManager-1.0.6-29.el7_2.x86_64
kernel-3.10.0-327.13.1.el7.x86_64
initscripts-9.49.30-1.el7_2.2.x86_64

How reproducible:

Always, if I use "MTU=9000" in /etc/sysconfig/network-scripts/ifcfg-ens1f0.

It works fine if I comment out this line (=> default MTU=1500).

Steps to Reproduce:
1. Set "MTU=9000" in /etc/sysconfig/network-scripts/ifcfg-ens1f0
2. Reboot
3. "systemctl --failed" will show NetworkManager-wait-online.service as a
failed service.
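
When scripting the reproduction, it can help to confirm which MTU the
ifcfg file actually requests before rebooting. The following is a minimal
sketch in Python. The helper name ifcfg_mtu is hypothetical, and it assumes
plain KEY=VALUE lines without shell quoting subtleties.

```python
import re


def ifcfg_mtu(text):
    """Return the MTU requested by ifcfg-style text, or None if unset.

    Commented-out lines (e.g. '#MTU=9000') are ignored, so None
    corresponds to the default-MTU case described in this report.
    """
    mtu = None
    for line in text.splitlines():
        m = re.match(r'\s*MTU=["\']?(\d+)["\']?\s*$', line)
        if m:
            # Last assignment wins, as it would in a sourced shell file.
            mtu = int(m.group(1))
    return mtu
```

To use it, pass the contents of
/etc/sysconfig/network-scripts/ifcfg-ens1f0 as the text argument.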

Actual results:

NetworkManager-wait-online.service times out after 30s and fails during
startup.

Reason: "nm-online -s --timeout=30" times out because NM never reaches
the "startup complete" state. ("-s" : --wait-for-startup)

Please note that, despite this, all the network devices are actually
configured correctly, including MTU=9000 on ens1f0!

Expected results:

NetworkManager-wait-online.service (nm-online) finishes successfully after
a reasonable number of seconds (as it does with MTU=1500).

Additional info:

# grep 'complete' NetworkManager_MTU1500_info.txt 
Apr 26 16:20:34 smtcfc0157 NetworkManager[1331]: <info>  startup complete
# grep 'complete' NetworkManager_MTU9000_info.txt 
#
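
The same check as the two grep commands above can be scripted. This is a
small sketch in Python; the function name is hypothetical, and it assumes
the log line format shown in the MTU=1500 output.

```python
def reached_startup_complete(log_lines):
    """Return True if any NetworkManager log line reports 'startup complete'."""
    return any(
        "NetworkManager[" in line and line.rstrip().endswith("startup complete")
        for line in log_lines
    )
```

Passing the lines of NetworkManager_MTU1500_info.txt should give True, and
the lines of NetworkManager_MTU9000_info.txt should give False, matching the
grep results.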

Setting MTU=9000 seems to trigger a DHCPv4 renewal:

$ grep ens1f0 NetworkManager-dispatcher_MTU9000.txt |cut -d: -f4-
 ------------ Action ID 0x7f352c0031e0 'up' Interface ens1f0 Environment ------------
   DEVICE_IP_IFACE=ens1f0
   DEVICE_IFACE=ens1f0
   CONNECTION_FILENAME=/etc/sysconfig/network-scripts/ifcfg-ens1f0
 Dispatching action 'up' for ens1f0
 Dispatch 'up' on ens1f0 complete
 ------------ Action ID 0x7f352c003170 'dhcp4-change' Interface ens1f0 Environment ------------
   DEVICE_IP_IFACE=ens1f0
   DEVICE_IFACE=ens1f0
   CONNECTION_FILENAME=/etc/sysconfig/network-scripts/ifcfg-ens1f0
 Dispatching action 'dhcp4-change' for ens1f0
 Dispatch 'dhcp4-change' on ens1f0 complete

Compare this to the MTU=1500 case:

$ grep ens1f0 NetworkManager-dispatcher_MTU1500.txt |cut -d: -f4-
 ------------ Action ID 0x7f344c0031e0 'up' Interface ens1f0 Environment ------------
   DEVICE_IP_IFACE=ens1f0
   DEVICE_IFACE=ens1f0
   CONNECTION_FILENAME=/etc/sysconfig/network-scripts/ifcfg-ens1f0
 Dispatching action 'up' for ens1f0
 Dispatch 'up' on ens1f0 complete
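
The comparison between the two dispatcher logs can also be automated. This
is a sketch in Python under the assumption that every dispatched action
appears as a "Dispatching action '<name>' for <iface>" line, as in the
excerpts above. The function names are hypothetical.

```python
import re

ACTION_RE = re.compile(r"Dispatching action '([^']+)' for (\S+)")


def dispatched_actions(log_text, iface):
    """Return the dispatcher actions logged for iface, in log order."""
    return [
        m.group(1)
        for m in ACTION_RE.finditer(log_text)
        if m.group(2) == iface
    ]


def extra_actions(failing_log, working_log, iface):
    """Return actions present in the failing log but not in the working one."""
    baseline = set(dispatched_actions(working_log, iface))
    return [a for a in dispatched_actions(failing_log, iface) if a not in baseline]
```

For the two excerpts above, the only extra action in the MTU=9000 log is
'dhcp4-change'.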

$ ethtool -i ens1f0
driver: ixgbe
version: 4.0.1-k-rh7.2
firmware-version: 0x80000868
bus-info: 0000:06:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

# nmcli d s
DEVICE  TYPE        STATE        CONNECTION  
eno1    ethernet    connected    team1-slave 
eno2    ethernet    connected    team1-slave 
eno3    ethernet    connected    team1-slave 
eno4    ethernet    connected    team1-slave 
ens1f0  ethernet    connected    private     
team1   team        connected    team1       
ens1f1  ethernet    unavailable  --          
ens5f0  ethernet    unavailable  --          
ens5f1  ethernet    unavailable  --          
ib0     infiniband  unmanaged    --          
ib1     infiniband  unmanaged    --          
lo      loopback    unmanaged    --          

# nmcli c s
NAME           UUID                                  TYPE            DEVICE 
System ens1f1  3ba7a201-5d77-d373-4bef-c46ac05ad53e  802-3-ethernet  --     
System ens5f1  d6a47d27-79f0-63fc-251b-e991514a87a6  802-3-ethernet  --     
team1-slave    24871ea9-4411-efbd-924f-49cd9fbda6e2  802-3-ethernet  eno3   
team1-slave    abf4c85b-57cc-4484-4fa9-b4a71689c359  802-3-ethernet  eno1   
team1          4293abb7-d898-84ff-dae6-bffba04cbee9  team            team1  
System ens5f0  c7ca5207-4897-488b-a379-6ba658e133cf  802-3-ethernet  --     
private        0720bdf0-87bd-7885-f805-bbeef9d40ecb  802-3-ethernet  ens1f0 
team1-slave    b186f945-cc80-911d-668c-b51be8596980  802-3-ethernet  eno2   
team1-slave    8e777a66-a032-83ef-59c9-77e69b94ede4  802-3-ethernet  eno4 

Increasing the NetworkManager-wait-online.service timeout does not help
as it will still time out. The boot process will just take more time.

Disabling NetworkManager-wait-online.service does not help as it is pulled
in by network.service anyway.

Setting MTU=9000 via DHCP doesn't help either.

I've attached some log files (info and debug) from NetworkManager
and NetworkManager-dispatcher, an strace of NetworkManager-wait-online's
nm-online, and a systemd-analyze plot.

Comment 2 Beniamino Galvani 2016-05-02 15:31:33 UTC
Created attachment 1152996
[PATCH] device: remove pending dhcp actions also in IP_DONE state

Untested fix.
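
The patch itself is not reproduced in this report, so the following is only
a toy model in Python of the behaviour the patch title suggests. It is not
NetworkManager's code. It assumes that startup counts as complete once no
device has a pending action, and that a DHCP pending action added while the
device is already in an IP_DONE state was previously never removed. All
class, method, and state names here are hypothetical.

```python
class ToyDevice:
    """Hypothetical stand-in for a managed device; not NetworkManager's API."""

    def __init__(self, fixed):
        self.fixed = fixed        # True models the assumed patched behaviour
        self.state = "IP_CONFIG"
        self.pending = set()

    def dhcp_started(self):
        self.pending.add("dhcp4")

    def dhcp_bound(self):
        # Assumed pre-patch behaviour: the pending action is cleared only
        # while IP configuration is still in progress.
        clear_states = {"IP_CONFIG", "IP_DONE"} if self.fixed else {"IP_CONFIG"}
        if self.state in clear_states:
            self.pending.discard("dhcp4")
        self.state = "IP_DONE"


def startup_complete(devices):
    """Assumed rule: startup is complete when no device has pending actions."""
    return not any(d.pending for d in devices)


def simulate(fixed):
    dev = ToyDevice(fixed)
    dev.dhcp_started()
    dev.dhcp_bound()      # initial lease, device reaches IP_DONE
    dev.dhcp_started()    # assumption: DHCP restarts after the MTU change
    dev.dhcp_bound()
    return startup_complete([dev])
```

In this model, simulate(False) leaves a stale "dhcp4" pending action, so
startup never completes, while simulate(True) clears it.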

Comment 3 Francesco Giudici 2016-05-10 13:31:14 UTC
LGTM

Comment 6 Vladimir Benes 2016-09-26 15:06:39 UTC
[root ~]# systemctl --failed --all
  UNIT                               LOAD   ACTIVE SUB    DESCRIPTION
● NetworkManager-wait-online.service loaded failed failed Network Manager Wait Online

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

1 loaded units listed.
To show all installed unit files use 'systemctl list-unit-files'.


The failure above occurs with 1.0.6; there is no failure with 1.4.0-11.

Verified with:
01:00.0 Ethernet controller [0200]: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection [8086:10fb] (rev 01)
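
For automated verification, the "systemctl --failed" output shown in this
comment can be parsed for failed units. This is a sketch in Python that
assumes the column layout UNIT, LOAD, ACTIVE, SUB, DESCRIPTION seen above.
The function name is hypothetical.

```python
def failed_units(systemctl_output):
    """Return unit names whose ACTIVE column reads 'failed'."""
    units = []
    for line in systemctl_output.splitlines():
        # Drop the leading status bullet, then split on whitespace.
        fields = line.replace("\u25cf", " ").split()
        # Expected columns: UNIT LOAD ACTIVE SUB DESCRIPTION...
        if len(fields) >= 4 and "." in fields[0] and fields[2] == "failed":
            units.append(fields[0])
    return units
```

An empty result on 1.4.0-11 and a list containing
NetworkManager-wait-online.service on 1.0.6 would match the observation in
this comment.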

Comment 8 errata-xmlrpc 2016-11-03 19:09:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2581.html