+++ This bug was initially created as a clone of Bug #859210 +++ Created attachment 615086 [details] /var/log/messages for 3 separate boot sequences Description of problem: NetworkManager takes too long to bring up interface during boot which causes ypbind to fail, which causes cron to not work for non-root (i.e. NIS-served) users, which causes.... On my Dell Optiplex 390 with a Intel i5 2400 CPU with Intel H61 Express Chipset, it takes between 34s and 36s for the p4p1 interface activation to complete during the boot process. This causes ypbind to fail upon boot, which causes a host of other problems. I also get the following error messages (twice during the boot): Sep 19 16:20:45 myserver NetworkManager[890]: <warn> failed to allocate link cache: (-12) Netlink Error (errno = Operation not supported) Sep 19 16:20:45 myserver NetworkManager[890]: <warn> /sys/devices/virtual/net/lo: couldn't determine device driver; ignoring... And it seems that NM fails to see the first couple of DHCPREQUEST responses because it sends multiple requests (I think that this is what consumes most of the time for NM to start): Sep 19 17:13:34 myserver dhclient[925]: DHCPREQUEST on p4p1 to 255.255.255.255 port 67 Sep 19 17:13:42 myserver dhclient[925]: DHCPREQUEST on p4p1 to 255.255.255.255 port 67 Sep 19 17:14:04 myserver dhclient[925]: DHCPREQUEST on p4p1 to 255.255.255.255 port 67 Version-Release number of selected component (if applicable): NetworkManager.x86_64 1:0.9.4-6.git20120521.fc16 @updates How reproducible: The time interval varies, but it's always > 30s. Steps to Reproduce: 1. boot the server 2. 3. Expected results: Since it only takes NM 2s to send out the first DHCPREQUEST message, I would think that it should only take 3 or 4s for NM to start. Additional info: I'll attach an edited (to focus on the NM messages) /var/log/messages file for 3 different boot sequences. They take between 34s and 40s for NM to activate. The 3rd sequence has ypbind waiting for the NetworkManager-wait-online service, so ypbind actually starts cleanly on the last boot. --- Additional comment from Ray Mikkelson on 2012-09-20 16:19:18 EDT --- Just thought to look at the logs of my DHCP server. I think that the problem might actually be that the first couple of DHCPREQUEST messages may never go out on the wire (sent to the DHCP server). Here's what my DHCP server sees: Sep 19 17:14:41 dhcpserver dhcpd: DHCPDISCOVER from d0:67:e5:1c:95:09 via eth0 Sep 19 17:14:41 dhcpserver dhcpd: DHCPOFFER on 192.168.1.164 to d0:67:e5:1c:95:09 via eth0 Sep 19 17:14:41 dhcpserver dhcpd: DHCPREQUEST for 192.168.1.164 (192.168.1.12) from d0:67:e5:1c:95:09 via eth0 Sep 19 17:14:41 dhcpserver dhcpd: DHCPACK on 192.168.1.164 to d0:67:e5:1c:95:09 via eth0 I don't know why the times are skewed between the 2 boxes. I run NTP (my DHCP server also happens to be my NTP server as well) on both boxes, and right now, they are within 1s of each other. Maybe Fedora is not updating the HW clock when it shuts down?? I just checked, the HW clock is 38s slow on the client. So the dhcpserver's timestamp of Sep 19 17:14:41 is roughly the same time as Sep 19 17:14:03 on the client box. That would match up with the last DHCPREQUEST message of: Sep 19 17:14:04 myserver dhclient[925]: DHCPREQUEST on p4p1 to 255.255.255.255 port 67 --- Additional comment from Fedora End Of Life on 2013-01-16 08:14:40 EST --- This message is a reminder that Fedora 16 is nearing its end of life. Approximately 4 (four) weeks from now Fedora will stop maintaining and issuing updates for Fedora 16. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '16'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 16's end of life. Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 16 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to click on "Clone This Bug" and open it against that version of Fedora. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete. The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping --- Additional comment from Fedora End Of Life on 2013-02-13 08:56:59 EST --- Fedora 16 changed to end-of-life (EOL) status on 2013-02-12. Fedora 16 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. Thank you for reporting this bug and we are sorry it could not be fixed.
This problem still happens with F17. I am running: Installed Packages Name : NetworkManager Arch : x86_64 Epoch : 1 Version : 0.9.6.4 Release : 3.fc17 Size : 7.0 M Repo : installed From repo : updates From my /var/log/messages (42s for NM to complete startup): Feb 12 14:49:18 myserver NetworkManager[761]: <info> NetworkManager (version 0.9.6.4-3.fc17) is starting... Feb 12 14:49:19 myserver NetworkManager[761]: <warn> failed to allocate link cache: (-10) Operation not supported Feb 12 14:49:21 myserver NetworkManager[761]: <info> dhclient started with pid 813 Feb 12 14:49:21 myserver dhclient[813]: DHCPREQUEST on p4p1 to 255.255.255.255 port 67 (xid=0x309d92ef) Feb 12 14:49:28 myserver dhclient[813]: DHCPREQUEST on p4p1 to 255.255.255.255 port 67 (xid=0x309d92ef) Feb 12 14:49:40 myserver dhclient[813]: DHCPDISCOVER on p4p1 to 255.255.255.255 port 67 interval 5 (xid=0x5c93a11b) Feb 12 14:49:45 myserver dhclient[813]: DHCPDISCOVER on p4p1 to 255.255.255.255 port 67 interval 13 (xid=0x5c93a11b) Feb 12 14:49:51 myserver sm-notify[893]: DNS resolution of myclient.mydomain.com failed; retrying later Feb 12 14:49:51 myserver ypbind: Binding NIS service Feb 12 14:49:53 myserver sm-notify[893]: DNS resolution of myclient.mydomain.com failed; retrying later Feb 12 14:49:57 myserver sm-notify[893]: DNS resolution of myclient.mydomain.com failed; retrying later Feb 12 14:49:58 myserver dhclient[813]: DHCPDISCOVER on p4p1 to 255.255.255.255 port 67 interval 16 (xid=0x5c93a11b) Feb 12 14:49:58 myserver dhclient[813]: DHCPREQUEST on p4p1 to 255.255.255.255 port 67 (xid=0x5c93a11b) Feb 12 14:49:58 myserver dhclient[813]: DHCPOFFER from 192.168.1.12 Feb 12 14:49:58 myserver dhclient[813]: DHCPACK from 192.168.1.12 (xid=0x5c93a11b) Feb 12 14:50:00 myserver NetworkManager[761]: <info> Activation (p4p1) Stage 5 of 5 (IPv4 Commit) complete.
It looks like a DHCP problem in getting DHCPOFFER and thus it could be DHCP misconfiguration. However, what seems more probable is there is an issue with r8169 driver. If you search for "r8169" on the Internet, you will encounter many complaints about that, like your problems with DHCP, dropping packets, bad performance, etc. Some suggest r8168 should be used instead: http://www.foxhop.net/realtek-dropping-packets-on-linux-ubuntu-and-fedora http://nelsonslog.wordpress.com/2012/01/22/realtek-ethernet-drivers-r8169-vs-r8168/ Please would you grab more information about yout card and driver, like this: $ lspci -v $ dmesg | grep r8169 $ nmcli -f general dev list iface p4p1 $ modinfo r8169 -> reassigning to kernel.
I'm not sure that it is the driver, since the long startup time (>30s) only happens during boot. I just did a test now with the system up and stable, and I consistently get startup times of about 1s (see below). I also don't believe that it is a DHCP server problem, because the server doesn't know whether the client is just booting or not (like the tests I just ran), and the DHCP server works just fine when the system's stable. Also, I have many other client OS's (Windows XP/7/8, CentOS 5.X/6.X, Ubuntu, Solaris) that do not have problems. However, I do agree that the problem does have something to do with the hardware. I have a similarly set up F17 box on an Optiplex GX620 manufactured in 2006, and NM consistently starts up in about 4-8s at boot time on that box. From /var/log/messages (on the Optiplex 390, when the system is already up): root> systemctl restart NetworkManager.service Feb 15 12:48:53 myserver NetworkManager[2360]: <info> NetworkManager (version 0.9.6.4-3.fc17) is starting... Feb 15 12:48:54 myserver NetworkManager[2360]: <info> Activation (p4p1) Stage 5 of 5 (IPv4 Commit) complete. root> systemctl restart NetworkManager.service Feb 15 12:49:29 myserver NetworkManager[2492]: <info> NetworkManager (version 0.9.6.4-3.fc17) is starting... Feb 15 12:49:29 myserver NetworkManager[2492]: <info> Activation (p4p1) Stage 5 of 5 (IPv4 Commit) complete. root> systemctl restart NetworkManager.service Feb 15 13:01:24 myserver NetworkManager[2851]: <info> NetworkManager (version 0.9.6.4-3.fc17) is starting... Feb 15 13:01:24 myserver NetworkManager[2851]: <info> Activation (p4p1) Stage 5 of 5 (IPv4 Commit) complete. I'll attach the data you requested. Thanks for your help.
Forgot to mention that I don't see any dropped packets, etc... From ifconfig: p4p1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500 inet 192.168.1.164 netmask 255.255.255.0 broadcast 192.168.1.255 inet6 fe80::d267:e5ff:fe1c:9509 prefixlen 64 scopeid 0x20<link> ether d0:67:e5:1c:95:09 txqueuelen 1000 (Ethernet) RX packets 2284994 bytes 1440198213 (1.3 GiB) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 2967469 bytes 2969410060 (2.7 GiB) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
Created attachment 697963 [details] Requested HW info Here's the output from the requested commands to execute...
my guess would be that the phy is misreporting link status, indicating that the phy is ready for use when its really not ready yet. That may explain why the driver is announcing it is up, and why NM is trying to DHCP the interface, despite not sending/receiving anything, and not generating any errors (if the phy just didn't transmit them you would likely just silently loose the frames). If you take a Tcpdump from a mirrored link on the switch that the system was connected to, you would likely see no traffic at all from the system until the phy was actually enabled, and that would further confirm the cause.
Sorry, I don't have the time to figure out how to mirror a port and then get a tcpdump. That's why I had included the logs of the DHCP server in the first place; it's acting as my TCP/IP monitor. I have absolutely no reason to doubt it's accuracy. I have all sorts of clients (Fedora, Windows, CentOS, Ubuntu, Solaris, etc.) and they all behave properly WRT DHCP. Furthermore, the DHCP server cannot distinguish between the scenarios of when the client is booting and when it has been up for a while -- and I get similar logs for each from the server's perspective. Why would it ignore packets from the client when booting and not when it's up?? Makes no sense... I did rerun a test with my Fedora 17 client (NetworkManager.x86_64 1:0.9.6.4-3.fc17 @updates) with the clocks of the client and DHCP server sync'ed together so that it's easier to analyze the log files. To summarize the attached logs: 18:29:54 NM starts 18:29:57 Carrier is declared ON, Activation scheduled, Stage 3 of 5 completes 18:29:57 First DHCPREQUEST goes out (but not really) 18:30:01 Next DHCPREQUEST goes out (but not really) 18:30:12 First DHCPDISCOVER goes out (but not really) 18:30:15 Next DHCPDISCOVER goes out (but not really) 18:30:19 Next DHCPDISCOVER goes out (but not really) 18:30:24 Next DHCPDISCOVER goes out (but not really) 18:30:34 Next DHCPDISCOVER goes out & is finally seen by the DHCP server Note that this is 37 seconds after the first attempt! Why would it take so long? This takes about 1 second once the system is completely up... 18:30:35 Stage 5 of 5 completes I'm attaching the new logs with the clocks sync'ed with the same times...
Created attachment 743402 [details] Logs (both client and DHCP server) of long NM start during boot Logs go with the 5/3/13 comment
I'm sorry, at what point in time did I suggest the dhcp server was a problem here? The logs are irrelevant. All they show is that eventually, you get a dhcpdiscover at the server (1 request), and it sends an offer to the client (again, 1 offer), which is recieved and properly completes the dhcp process. What you want to know is why the other requests were never acted on, and I think its pretty obvious that those logs tell you nothing about that. Thats why I want the tcpdumps, to see if those initial discovers make it out on the wire. If you cant figure out how to mirror a port on a switch to capture a dump, you can take the tcpdump on the dhcp server itself, but you'll want to filter out other clients if you have alot of traffic there. If you don't want to figure out how to take a tcpdump, theres not much I can help you with.
My point was that a tcpdump, with 99.99% certainty, will show nothing different than the logs from the DHCP server, and will be a complete waste of my time. That said, I will try to get you the dumps today. And I will eat crow if I'm wrong...
you're probably right, it won't, but its nice to be sure (which is why I was hoping to get a mirrored port tcpdump). If you're taking it from the server, there is still some question as to weather or not your loosing the frame in the phy, or on the network (like a spanning tree port state issue).
This message is a reminder that Fedora 17 is nearing its end of life. Approximately 4 (four) weeks from now Fedora will stop maintaining and issuing updates for Fedora 17. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '17'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 17's end of life. Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 17 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged change the 'version' to a later Fedora version prior to Fedora 17's end of life. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Fedora 17 changed to end-of-life (EOL) status on 2013-07-30. Fedora 17 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. Thank you for reporting this bug and we are sorry it could not be fixed.