911301 – r8169: NetworkManager takes too long to bring up interface during boot; problems with DHCP packets

Bug 911301 - r8169: NetworkManager takes too long to bring up interface during boot; problems with DHCP packets

Summary: r8169: NetworkManager takes too long to bring up interface during boot; probl...

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	17
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2013-02-14 16:59 UTC by Ray Mikkelson
Modified:	2013-08-01 02:25 UTC (History)
CC List:	9 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:	859210
Environment:
Last Closed:	2013-08-01 02:25:04 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
Requested HW info (10.64 KB, text/plain) 2013-02-15 19:36 UTC, Ray Mikkelson	no flags	Details
Logs (both client and DHCP server) of long NM start during boot (6.65 KB, text/plain) 2013-05-03 23:40 UTC, Ray Mikkelson	no flags	Details
View All

Description Ray Mikkelson 2013-02-14 16:59:56 UTC

+++ This bug was initially created as a clone of Bug #859210 +++

Created attachment 615086 [details]
/var/log/messages for 3 separate boot sequences

Description of problem:

NetworkManager takes too long to bring up interface during boot which causes ypbind to fail, which causes cron to not work for non-root (i.e. NIS-served) users, which causes....

On my Dell Optiplex 390 with a Intel i5 2400 CPU with Intel H61 Express Chipset, it takes between 34s and 36s for the p4p1 interface activation to complete during the boot process.  This causes ypbind to fail upon boot, which causes a host of other problems.

I also get the following error messages (twice during the boot):

Sep 19 16:20:45 myserver NetworkManager[890]: <warn> failed to allocate link cache: (-12) Netlink Error (errno = Operation not supported)
Sep 19 16:20:45 myserver NetworkManager[890]: <warn> /sys/devices/virtual/net/lo: couldn't determine device driver; ignoring...

And it seems that NM fails to see the first couple of DHCPREQUEST responses because it sends multiple requests (I think that this is what consumes most of the time for NM to start):

Sep 19 17:13:34 myserver dhclient[925]: DHCPREQUEST on p4p1 to 255.255.255.255 port 67
Sep 19 17:13:42 myserver dhclient[925]: DHCPREQUEST on p4p1 to 255.255.255.255 port 67
Sep 19 17:14:04 myserver dhclient[925]: DHCPREQUEST on p4p1 to 255.255.255.255 port 67






Version-Release number of selected component (if applicable):

NetworkManager.x86_64                   1:0.9.4-6.git20120521.fc16   @updates   


How reproducible:

The time interval varies, but it's always > 30s.


Steps to Reproduce:
1. boot the server
2.
3.
  


Expected results:

Since it only takes NM 2s to send out the first DHCPREQUEST message, I would think that it should only take 3 or 4s for NM to start.


Additional info:

I'll attach an edited (to focus on the NM messages) /var/log/messages file for 3 different boot sequences.  They take between 34s and 40s for NM to activate.  The 3rd sequence has ypbind waiting for the NetworkManager-wait-online service, so ypbind actually starts cleanly on the last boot.

--- Additional comment from Ray Mikkelson on 2012-09-20 16:19:18 EDT ---

Just thought to look at the logs of my DHCP server.  I think that the problem might actually be that the first couple of DHCPREQUEST messages may never go out on the wire (sent to the DHCP server).  Here's what my DHCP server sees:


Sep 19 17:14:41 dhcpserver dhcpd: DHCPDISCOVER from d0:67:e5:1c:95:09 via eth0
Sep 19 17:14:41 dhcpserver dhcpd: DHCPOFFER on 192.168.1.164 to d0:67:e5:1c:95:09 via eth0
Sep 19 17:14:41 dhcpserver dhcpd: DHCPREQUEST for 192.168.1.164 (192.168.1.12) from d0:67:e5:1c:95:09 via eth0
Sep 19 17:14:41 dhcpserver dhcpd: DHCPACK on 192.168.1.164 to d0:67:e5:1c:95:09 via eth0


I don't know why the times are skewed between the 2 boxes.  I run NTP (my DHCP server also happens to be my NTP server as well) on both boxes, and right now, they are within 1s of each other.  Maybe Fedora is not updating the HW clock when it shuts down??

I just checked, the HW clock is 38s slow on the client.  So the dhcpserver's timestamp of Sep 19 17:14:41 is roughly the same time as Sep 19 17:14:03 on the client box.  That would match up with the last DHCPREQUEST message of:

Sep 19 17:14:04 myserver dhclient[925]: DHCPREQUEST on p4p1 to 255.255.255.255 port 67

--- Additional comment from Fedora End Of Life on 2013-01-16 08:14:40 EST ---

This message is a reminder that Fedora 16 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 16. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '16'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 16's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 16 is end of life. If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora, you are encouraged to click on 
"Clone This Bug" and open it against that version of Fedora.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

--- Additional comment from Fedora End Of Life on 2013-02-13 08:56:59 EST ---

Fedora 16 changed to end-of-life (EOL) status on 2013-02-12. Fedora 16 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.

Comment 1 Ray Mikkelson 2013-02-14 17:11:33 UTC

This problem still happens with F17.  I am running:

Installed Packages
Name        : NetworkManager
Arch        : x86_64
Epoch       : 1
Version     : 0.9.6.4
Release     : 3.fc17
Size        : 7.0 M
Repo        : installed
From repo   : updates



From my /var/log/messages (42s for NM to complete startup):

Feb 12 14:49:18 myserver NetworkManager[761]: <info> NetworkManager (version 0.9.6.4-3.fc17) is starting...
Feb 12 14:49:19 myserver NetworkManager[761]: <warn> failed to allocate link cache: (-10) Operation not supported
Feb 12 14:49:21 myserver NetworkManager[761]: <info> dhclient started with pid 813
Feb 12 14:49:21 myserver dhclient[813]: DHCPREQUEST on p4p1 to 255.255.255.255 port 67 (xid=0x309d92ef)
Feb 12 14:49:28 myserver dhclient[813]: DHCPREQUEST on p4p1 to 255.255.255.255 port 67 (xid=0x309d92ef)
Feb 12 14:49:40 myserver dhclient[813]: DHCPDISCOVER on p4p1 to 255.255.255.255 port 67 interval 5 (xid=0x5c93a11b)
Feb 12 14:49:45 myserver dhclient[813]: DHCPDISCOVER on p4p1 to 255.255.255.255 port 67 interval 13 (xid=0x5c93a11b)
Feb 12 14:49:51 myserver sm-notify[893]: DNS resolution of myclient.mydomain.com failed; retrying later
Feb 12 14:49:51 myserver ypbind: Binding NIS service
Feb 12 14:49:53 myserver sm-notify[893]: DNS resolution of myclient.mydomain.com failed; retrying later
Feb 12 14:49:57 myserver sm-notify[893]: DNS resolution of myclient.mydomain.com failed; retrying later
Feb 12 14:49:58 myserver dhclient[813]: DHCPDISCOVER on p4p1 to 255.255.255.255 port 67 interval 16 (xid=0x5c93a11b)
Feb 12 14:49:58 myserver dhclient[813]: DHCPREQUEST on p4p1 to 255.255.255.255 port 67 (xid=0x5c93a11b)
Feb 12 14:49:58 myserver dhclient[813]: DHCPOFFER from 192.168.1.12
Feb 12 14:49:58 myserver dhclient[813]: DHCPACK from 192.168.1.12 (xid=0x5c93a11b)
Feb 12 14:50:00 myserver NetworkManager[761]: <info> Activation (p4p1) Stage 5 of 5 (IPv4 Commit) complete.

Comment 2 Jirka Klimes 2013-02-15 11:59:38 UTC

It looks like a DHCP problem in getting DHCPOFFER and thus it could be DHCP misconfiguration.
However, what seems more probable is there is an issue with r8169 driver.

If you search for "r8169" on the Internet, you will encounter many complaints about that, like your problems with DHCP, dropping packets, bad performance, etc.  Some suggest r8168 should be used instead:
http://www.foxhop.net/realtek-dropping-packets-on-linux-ubuntu-and-fedora
http://nelsonslog.wordpress.com/2012/01/22/realtek-ethernet-drivers-r8169-vs-r8168/

Please would you grab more information about yout card and driver, like this:
$ lspci -v
$ dmesg | grep r8169
$ nmcli -f general dev list iface p4p1
$ modinfo r8169

-> reassigning to kernel.

Comment 3 Ray Mikkelson 2013-02-15 19:33:09 UTC

I'm not sure that it is the driver, since the long startup time (>30s) only happens during boot.  I just did a test now with the system up and stable, and I consistently get startup times of about 1s (see below).  I also don't believe that it is a DHCP server problem, because the server doesn't know whether the client is just booting or not (like the tests I just ran), and the DHCP server works just fine when the system's stable.  Also, I have many other client OS's (Windows XP/7/8, CentOS 5.X/6.X, Ubuntu, Solaris) that do not have problems.  

However, I do agree that the problem does have something to do with the hardware.  I have a similarly set up F17 box on an Optiplex GX620 manufactured in 2006, and NM consistently starts up in about 4-8s at boot time on that box.


From /var/log/messages (on the Optiplex 390, when the system is already up):

root> systemctl restart NetworkManager.service
Feb 15 12:48:53 myserver NetworkManager[2360]: <info> NetworkManager (version 0.9.6.4-3.fc17) is starting...
Feb 15 12:48:54 myserver NetworkManager[2360]: <info> Activation (p4p1) Stage 5 of 5 (IPv4 Commit) complete.

root> systemctl restart NetworkManager.service
Feb 15 12:49:29 myserver NetworkManager[2492]: <info> NetworkManager (version 0.9.6.4-3.fc17) is starting...
Feb 15 12:49:29 myserver NetworkManager[2492]: <info> Activation (p4p1) Stage 5 of 5 (IPv4 Commit) complete.

root> systemctl restart NetworkManager.service
Feb 15 13:01:24 myserver NetworkManager[2851]: <info> NetworkManager (version 0.9.6.4-3.fc17) is starting...
Feb 15 13:01:24 myserver NetworkManager[2851]: <info> Activation (p4p1) Stage 5 of 5 (IPv4 Commit) complete.


I'll attach the data you requested.

Thanks for your help.

Comment 4 Ray Mikkelson 2013-02-15 19:34:40 UTC

Forgot to mention that I don't see any dropped packets, etc...

From ifconfig:

p4p1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.1.164  netmask 255.255.255.0  broadcast 192.168.1.255
        inet6 fe80::d267:e5ff:fe1c:9509  prefixlen 64  scopeid 0x20<link>
        ether d0:67:e5:1c:95:09  txqueuelen 1000  (Ethernet)
        RX packets 2284994  bytes 1440198213 (1.3 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 2967469  bytes 2969410060 (2.7 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Comment 5 Ray Mikkelson 2013-02-15 19:36:30 UTC

Created attachment 697963 [details]
Requested HW info

Here's the output from the requested commands to execute...

Comment 6 Neil Horman 2013-05-01 18:27:32 UTC

my guess would be that the phy is misreporting link status, indicating that the phy is ready for use when its really not ready yet.  That may explain why the driver is announcing it is up, and why NM is trying to DHCP the interface, despite not sending/receiving anything, and not generating any errors (if the phy just didn't transmit them you would likely just silently loose the frames).  If you take a Tcpdump from a mirrored link on the switch that the system was connected to, you would likely see no traffic at all from the system until the phy was actually enabled, and that would further confirm the cause.

Comment 7 Ray Mikkelson 2013-05-03 23:39:13 UTC

Sorry, I don't have the time to figure out how to mirror a port and then get a tcpdump. That's why I had included the logs of the DHCP server in the first place; it's acting as my TCP/IP monitor. I have absolutely no reason to doubt it's accuracy. I have all sorts of clients (Fedora, Windows, CentOS, Ubuntu, Solaris, etc.) and they all behave properly WRT DHCP. Furthermore, the DHCP server cannot distinguish between the scenarios of when the client is booting and when it has been up for a while -- and I get similar logs for each from the server's perspective. Why would it ignore packets from the client when booting and not when it's up?? Makes no sense...

I did rerun a test with my Fedora 17 client (NetworkManager.x86_64 1:0.9.6.4-3.fc17 @updates) with the clocks of the client and DHCP server sync'ed together so that it's easier to analyze the log files. To summarize the attached logs:

18:29:54 NM starts
18:29:57 Carrier is declared ON, Activation scheduled, Stage 3 of 5 completes
18:29:57 First DHCPREQUEST goes out (but not really)
18:30:01 Next DHCPREQUEST goes out (but not really)
18:30:12 First DHCPDISCOVER goes out (but not really)
18:30:15 Next DHCPDISCOVER goes out (but not really)
18:30:19 Next DHCPDISCOVER goes out (but not really)
18:30:24 Next DHCPDISCOVER goes out (but not really)
18:30:34 Next DHCPDISCOVER goes out & is finally seen by the DHCP server

Note that this is 37 seconds after the first attempt! Why would it take so long? This takes about 1 second once the system is completely up...

18:30:35 Stage 5 of 5 completes

I'm attaching the new logs with the clocks sync'ed with the same times...

Comment 8 Ray Mikkelson 2013-05-03 23:40:50 UTC

Created attachment 743402 [details]
Logs (both client and DHCP server) of long NM start during boot

Logs go with the 5/3/13 comment

Comment 9 Neil Horman 2013-05-04 22:34:21 UTC

I'm sorry, at what point in time did I suggest the dhcp server was a problem here?  The logs are irrelevant.  All they show is that eventually, you get a dhcpdiscover at the server (1 request), and it sends an offer to the client (again, 1 offer), which is recieved and properly completes the dhcp process.  What you want to know is why the other requests were never acted on, and I think its pretty obvious that those logs tell you nothing about that.  Thats why I want the tcpdumps, to see if those initial discovers make it out on the wire.  If you cant figure out how to mirror a port on a switch to capture a dump, you can take the tcpdump on the dhcp server itself, but you'll want to filter out other clients if you have alot of traffic there.  If you don't want to figure out how to take a tcpdump, theres not much I can help you with.

Comment 10 Ray Mikkelson 2013-05-06 15:52:12 UTC

My point was that a tcpdump, with 99.99% certainty, will show nothing different than the logs from the DHCP server, and will be a complete waste of my time.  That said, I will try to get you the dumps today.  And I will eat crow if I'm wrong...

Comment 11 Neil Horman 2013-05-06 16:33:19 UTC

you're probably right, it won't, but its nice to be sure (which is why I was hoping to get a mirrored port tcpdump).  If you're taking it from the server, there is still some question as to weather or not your loosing the frame in the phy, or on the network (like a spanning tree port state issue).

Comment 12 Fedora End Of Life 2013-07-03 23:35:44 UTC

This message is a reminder that Fedora 17 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 17. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '17'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 17's end of life.

Bug Reporter:  Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 17 is end of life. If you 
would still like  to see this bug fixed and are able to reproduce it 
against a later version  of Fedora, you are encouraged  change the 
'version' to a later Fedora version prior to Fedora 17's end of life.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 13 Fedora End Of Life 2013-08-01 02:25:12 UTC

Fedora 17 changed to end-of-life (EOL) status on 2013-07-30. Fedora 17 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.

Note You need to log in before you can comment on or make changes to this bug.