Hide Forgot
Description of problem: In a large environment, DHCP can take a while to complete. It appears that autofs tries to start before the process is finished. And thus, it can't contact the LDAP servers to get the automount options. Version-Release number of selected component (if applicable): NetworkManager-0.8.1-5.el6_0.1.x86_64 autofs-5.0.5-23.el6.x86_64 How reproducible: Intermittent Steps to Reproduce: 1. Machine boots up 2. NetworkManager starts the network process 3. autofs failes to query ldap 4. NetworkManager finishes network process Actual results: autofs fails to load the automount maps, and doesn't mount the nfs shares. After stage 3 of 5 for NetworkManager, automount kicks in. Expected results: autofs should wait for NetworkManager to finish. Additional info: This may well be a NetworkManager problem, but I figured I'd start here. Mar 2 11:06:50 build7900 NetworkManager[1601]: <info> monitoring kernel firmware directory '/lib/firmware'. Mar 2 11:06:50 build7900 NetworkManager[1601]: ifcfg-rh: Acquired D-Bus service com.redhat.ifcfgrh1 Mar 2 11:06:50 build7900 NetworkManager[1601]: <info> Loaded plugin ifcfg-rh: (c) 2007 - 2008 Red Hat, Inc. To report bugs please use the NetworkManager mailing list. Mar 2 11:06:50 build7900 NetworkManager[1601]: ifcfg-rh: parsing /etc/sysconfig/network-scripts/ifcfg-eth0 ... Mar 2 11:06:51 build7900 NetworkManager[1601]: ifcfg-rh: read connection 'System eth0' Mar 2 11:06:51 build7900 NetworkManager[1601]: ifcfg-rh: parsing /etc/sysconfig/network-scripts/ifcfg-lo ... Mar 2 11:06:51 build7900 NetworkManager[1601]: <info> WiFi enabled by radio killswitch; enabled by state file Mar 2 11:06:51 build7900 NetworkManager[1601]: <info> WWAN enabled by radio killswitch; enabled by state file Mar 2 11:06:51 build7900 NetworkManager[1601]: <info> WiMAX enabled by radio killswitch; enabled by state file Mar 2 11:06:51 build7900 NetworkManager[1601]: <info> Networking is enabled by state file Mar 2 11:06:51 build7900 NetworkManager[1601]: <info> (eth0): carrier is ON Mar 2 11:06:51 build7900 NetworkManager[1601]: <info> (eth0): new Ethernet device (driver: 'e1000e' ifindex: 2) Mar 2 11:06:51 build7900 NetworkManager[1601]: <info> (eth0): exported as /org/freedesktop/NetworkManager/Devices/0 Mar 2 11:06:51 build7900 NetworkManager[1601]: <info> (eth0): now managed Mar 2 11:06:51 build7900 NetworkManager[1601]: <info> (eth0): device state change: 1 -> 2 (reason 41) Mar 2 11:06:51 build7900 NetworkManager[1601]: <info> (eth0): preparing device. Mar 2 11:06:51 build7900 NetworkManager[1601]: <info> Activation (eth0) starting connection 'System eth0' Mar 2 11:06:51 build7900 NetworkManager[1601]: <info> (eth0): device state change: 2 -> 7 (reason 0) Mar 2 11:06:51 build7900 NetworkManager[1601]: <info> Activation (eth0) Stage 3 of 5 (IP Configure Start) scheduled. Mar 2 11:06:51 build7900 NetworkManager[1601]: <info> modem-manager is now available Mar 2 11:06:51 build7900 NetworkManager[1601]: <warn> bluez error getting default adapter: The name org.bluez was not provided by any .service files Mar 2 11:06:51 build7900 NetworkManager[1601]: <info> Trying to start the supplicant... Mar 2 11:06:51 build7900 NetworkManager[1601]: <info> Activation (eth0) Stage 3 of 5 (IP Configure Start) started... Mar 2 11:06:51 build7900 NetworkManager[1601]: <info> Activation (eth0) Beginning DHCPv4 transaction (timeout in 45 seconds) Mar 2 11:06:51 build7900 NetworkManager[1601]: <info> dhclient started with pid 1638 Mar 2 11:06:51 build7900 NetworkManager[1601]: <info> Activation (eth0) Stage 3 of 5 (IP Configure Start) complete. Mar 2 11:06:51 build7900 dhclient: Internet Systems Consortium DHCP Client 4.1.1-P1 Mar 2 11:06:51 build7900 dhclient: Copyright 2004-2010 Internet Systems Consortium. Mar 2 11:06:51 build7900 dhclient: All rights reserved. Mar 2 11:06:51 build7900 dhclient: For info, please visit https://www.isc.org/software/dhcp/ Mar 2 11:06:51 build7900 dhclient: Mar 2 11:06:51 build7900 NetworkManager[1601]: <info> (eth0): DHCPv4 state changed nbi -> preinit Mar 2 11:06:51 build7900 dhclient: Listening on LPF/eth0/18:a9:05:b1:49:44 Mar 2 11:06:51 build7900 dhclient: Sending on LPF/eth0/18:a9:05:b1:49:44 Mar 2 11:06:51 build7900 dhclient: Sending on Socket/fallback Mar 2 11:06:51 build7900 kernel: RPC: Registered udp transport module. Mar 2 11:06:51 build7900 kernel: RPC: Registered tcp transport module. Mar 2 11:06:51 build7900 kernel: RPC: Registered tcp NFSv4.1 backchannel transport module. Mar 2 11:06:52 build7900 acpid: starting up Mar 2 11:06:52 build7900 acpid: 1 rule loaded Mar 2 11:06:52 build7900 acpid: waiting for events: event logging is off Mar 2 11:06:52 build7900 acpid: client connected from 1779[68:68] Mar 2 11:06:52 build7900 acpid: 1 client rule loaded Mar 2 11:06:52 build7900 pcscd: pcscdaemon.c:506:main() pcsc-lite 1.5.2 daemon ready. Mar 2 11:06:54 build7900 automount[1820]: bind_ldap_anonymous: lookup(ldap): Unable to bind to the LDAP server: (default), error Can't contact LDAP server Mar 2 11:06:54 build7900 automount[1820]: bind_ldap_anonymous: lookup(ldap): Unable to bind to the LDAP server: (default), error Can't contact LDAP server Mar 2 11:06:54 build7900 sssd: Starting up Mar 2 11:06:55 build7900 dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 67 Mar 2 11:06:55 build7900 dhclient: DHCPACK from 10.193.54.250 Mar 2 11:06:55 build7900 dhclient: bound to 10.193.54.58 -- renewal in 63949 seconds. Mar 2 11:06:55 build7900 NetworkManager[1601]: <info> (eth0): DHCPv4 state changed preinit -> reboot Mar 2 11:06:55 build7900 NetworkManager[1601]: <info> Activation (eth0) Stage 4 of 5 (IP4 Configure Get) scheduled... Mar 2 11:06:55 build7900 NetworkManager[1601]: <info> Activation (eth0) Stage 4 of 5 (IP4 Configure Get) started... Mar 2 11:06:55 build7900 NetworkManager[1601]: <info> address 10.193.54.58 Mar 2 11:06:55 build7900 NetworkManager[1601]: <info> prefix 23 (255.255.254.0) Mar 2 11:06:55 build7900 NetworkManager[1601]: <info> gateway 128.193.54.250 Mar 2 11:06:55 build7900 NetworkManager[1601]: <info> hostname 'build7900.site.edu' Mar 2 11:06:55 build7900 NetworkManager[1601]: <info> nameserver '10.193.40.2' Mar 2 11:06:55 build7900 NetworkManager[1601]: <info> nameserver '10.193.15.13' Mar 2 11:06:55 build7900 NetworkManager[1601]: <info> nameserver '10.193.15.12' Mar 2 11:06:55 build7900 NetworkManager[1601]: <info> domain name 'site.edu' Mar 2 11:06:55 build7900 NetworkManager[1601]: <info> Activation (eth0) Stage 5 of 5 (IP Configure Commit) scheduled... Mar 2 11:06:55 build7900 NetworkManager[1601]: <info> Activation (eth0) Stage 4 of 5 (IP4 Configure Get) complete. Mar 2 11:06:55 build7900 NetworkManager[1601]: <info> Activation (eth0) Stage 5 of 5 (IP Configure Commit) started... Mar 2 11:06:55 build7900 sssd[be[default]]: Starting up Mar 2 11:06:55 build7900 NetworkManager[1601]: <info> (eth0): device state change: 7 -> 8 (reason 0) Mar 2 11:06:55 build7900 NetworkManager[1601]: <info> Policy set 'System eth0' (eth0) as default for IPv4 routing and DNS. Mar 2 11:06:55 build7900 NetworkManager[1601]: <info> Activation (eth0) successful, device activated. Mar 2 11:06:55 build7900 NetworkManager[1601]: <info> Activation (eth0) Stage 5 of 5 (IP Configure Commit) complete.
I have patches which may resolve this but they are completely untested (a recent second attempt in fact) so far. I can make a test rpm which includes them if you are willing to try it.
Yes, I would be willing to test it. I'm still in the early stages of testing deployments for rhel6, so I have machines to test on.
Created attachment 482822 [details] Patch - fix master map source server unavailable handling
Created attachment 482823 [details] Patch - wait for master map available at start
Created attachment 482824 [details] Patch - add master read wait option
A package containing the above three patches (which may help with this problem) is available at: http://people.redhat.com/~ikent/autofs-5.0.5-29.bz681986.1 Could you please test the package please.
After installing the package and rebooting, things got worse. Even trying to restart autofs. It doesn't appear to be able to read the automount map. Mar 8 08:09:44 build7900 NetworkManager[1582]: <info> Trying to start the supplicant... Mar 8 08:09:44 build7900 NetworkManager[1582]: <info> Activation (eth0) Stage 3 of 5 (IP Configure Start) started... Mar 8 08:09:44 build7900 NetworkManager[1582]: <info> Activation (eth0) Beginning DHCPv4 transaction (timeout in 45 seconds) Mar 8 08:09:44 build7900 NetworkManager[1582]: <info> dhclient started with pid 1613 Mar 8 08:09:44 build7900 NetworkManager[1582]: <info> Activation (eth0) Stage 3 of 5 (IP Configure Start) complete. Mar 8 08:09:44 build7900 dhclient: Internet Systems Consortium DHCP Client 4.1.1-P1 Mar 8 08:09:44 build7900 dhclient: Copyright 2004-2010 Internet Systems Consortium. Mar 8 08:09:44 build7900 dhclient: All rights reserved. Mar 8 08:09:44 build7900 dhclient: For info, please visit https://www.isc.org/software/dhcp/ Mar 8 08:09:44 build7900 dhclient: Mar 8 08:09:44 build7900 NetworkManager[1582]: <info> (eth0): DHCPv4 state chang Mar 8 08:09:47 build7900 automount[1796]: bind_ldap_simple: lookup(ldap): Unable to bind to the LDAP server: (default), error Can't contact LDAP server Mar 8 08:09:47 build7900 automount[1796]: bind_ldap_simple: lookup(ldap): Unable to bind to the LDAP server: (default), error Can't contact LDAP server Mar 8 08:09:47 build7900 automount[1796]: main:2176: automount: failed to read master map, will retry in background! Mar 8 08:09:47 build7900 automount[1796]: bind_ldap_simple: lookup(ldap): Unable to bind to the LDAP server: (default), error Can't contact LDAP server Mar 8 08:09:47 build7900 automount[1796]: bind_ldap_simple: lookup(ldap): Unable to bind to the LDAP server: (default), error Can't contact LDAP server Mar 8 08:09:48 build7900 dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 67 Mar 8 08:09:48 build7900 dhclient: DHCPACK from 10.193.54.250 Mar 8 08:09:48 build7900 dhclient: bound to 10.193.54.146 -- renewal in 57706 seconds. Mar 8 08:09:48 build7900 NetworkManager[1582]: <info> (eth0): DHCPv4 state changed preinit -> reboot Mar 8 08:09:48 build7900 NetworkManager[1582]: <info> Activation (eth0) Stage 4 of 5 (IP4 Configure Get) scheduled... Mar 8 08:09:48 build7900 NetworkManager[1582]: <info> Activation (eth0) Stage 4 of 5 (IP4 Configure Get) started... Mar 8 08:09:48 build7900 NetworkManager[1582]: <info> address 10.193.54.146 Mar 8 08:09:48 build7900 NetworkManager[1582]: <info> prefix 23 (255.255.254.0) Mar 8 08:09:48 build7900 NetworkManager[1582]: <info> gateway 10.193.54.250 Mar 8 08:09:48 build7900 NetworkManager[1582]: <info> hostname 'build7900.site.edu' Mar 8 08:09:48 build7900 NetworkManager[1582]: <info> nameserver '10.193.40.2' Mar 8 08:09:48 build7900 NetworkManager[1582]: <info> nameserver '10.193.15.13' Mar 8 08:09:48 build7900 NetworkManager[1582]: <info> nameserver '10.193.15.12' Mar 8 08:09:48 build7900 NetworkManager[1582]: <info> domain name 'site.edu' Mar 8 08:09:48 build7900 NetworkManager[1582]: <info> Activation (eth0) Stage 5 of 5 (IP Configure Commit) scheduled... Mar 8 08:09:48 build7900 NetworkManager[1582]: <info> Activation (eth0) Stage 4 of 5 (IP4 Configure Get) complete. Mar 8 08:09:48 build7900 NetworkManager[1582]: <info> Activation (eth0) Stage 5 of 5 (IP Configure Commit) started... Mar 8 08:09:48 build7900 NetworkManager[1582]: <info> (eth0): device state change: 7 -> 8 (reason 0) Mar 8 08:09:48 build7900 NetworkManager[1582]: <info> Policy set 'System eth0' (eth0) as default for IPv4 routing and DNS. Mar 8 08:09:48 build7900 NetworkManager[1582]: <info> Activation (eth0) successful, device activated. Mar 8 08:09:48 build7900 NetworkManager[1582]: <info> Activation (eth0) Stage 5 of 5 (IP Configure Commit) complete. Mar 8 08:09:49 build7900 automount[1796]: bind_ldap_simple: lookup(ldap): Unable to bind to the LDAP server: (default), error Can't contact LDAP server Mar 8 08:09:49 build7900 automount[1796]: bind_ldap_simple: lookup(ldap): Unable to bind to the LDAP server: (default), error Can't contact LDAP server Mar 8 08:09:51 build7900 automount[1796]: bind_ldap_simple: lookup(ldap): Unable to bind to the LDAP server: (default), error Can't contact LDAP server Mar 8 08:09:51 build7900 automount[1796]: bind_ldap_simple: lookup(ldap): Unable to bind to the LDAP server: (default), error Can't contact LDAP server Mar 8 08:09:53 build7900 automount[1796]: bind_ldap_simple: lookup(ldap): Unable to bind to the LDAP server: (default), error Can't contact LDAP server Mar 8 08:09:53 build7900 automount[1796]: bind_ldap_simple: lookup(ldap): Unable to bind to the LDAP server: (default), error Can't contact LDAP server ... Mar 8 08:12:49 build7900 automount[1796]: problem reading master map, maximum wait exceeded Mar 8 08:12:49 build7900 automount[1796]: main:2197: automount: failed to read master map after 180 seconds!
(In reply to comment #8) > After installing the package and rebooting, things got worse. Even trying to > restart autofs. It doesn't appear to be able to read the automount map. I did say it is untested. snip ... > Mar 8 08:12:49 build7900 automount[1796]: problem reading master map, maximum > wait exceeded > Mar 8 08:12:49 build7900 automount[1796]: main:2197: automount: failed to read > master map after 180 seconds! At the moment, if the master map can't be read after a given amount of time the daemon will exit. Did you try increasing the time to wait? If you updated the package you may have a .rpmnew configuration file. In it you should see: +# MASTER_WAIT - set the default maximum time to wait for the +# master map to become available if it cannot +# be read at program start (default 180). +# +#MASTER_WAIT=180 Uncomment MASTER_WAIT=180 line and increase the timeout to a value that should work for your site. Of course there may be some other problem. Ian
(In reply to comment #9) > > Did you try increasing the time to wait? > > If you updated the package you may have a .rpmnew configuration > file. In it you should see: > > +# MASTER_WAIT - set the default maximum time to wait for the > +# master map to become available if it cannot > +# be read at program start (default 180). > +# > +#MASTER_WAIT=180 > > Uncomment MASTER_WAIT=180 line and increase the timeout to a > value that should work for your site. Um ... of course the configuration entry needs to be in /etc/sysconfig/autofs, not the .rpmnew file.
Not sure your new package is working. With the machine fully up and on the network, starting the autofs service generates the same errors. Doesn't matter what the MASTER_WAIT is set to. Rolling back to the old version works fine with the machine is already up. The machine is up way before the 180 seconds passes.
I tried really hard to get autofs to start in the middle of the NetworkManager initialization but couldn't. Even changing the S number to be equal to the NerworkManager S number and removing the network service dependency didn't do it. But clearly you do have autofs starting during this time. So what about making the NetworkManager init script wait for the network to come up. In /etc/sysconfig/network, set NETWORKWAIT=yes and if that isn't enough set NETWORKDELAY=<seconds> to delay a further <seconds> after NetworkManager thinks the network is up. Ensure you remove the test package, of course, since it doesn't work for you (although it did work ok for me, when dropping the network, starting autofs, then bringing the network up).
Setting NETWORKWAIT=yes seems to have solved my problem. Automount is now waiting until after DHCP has finished. I've rebooted many times the last couple of days and it has worked every time. As I said, our DHCP service can sometimes take a while. Kickstart even sometimes times out trying to get an IP. That may be why it's difficult to reproduce. Even here, it was only failing every once in a while. But when it failed, the machine would be unusable. I didn't want to be in a situation where I had to check every machine every time it rebooted. Thanks for the help!
(In reply to comment #13) > Setting NETWORKWAIT=yes seems to have solved my problem. Automount is now > waiting until after DHCP has finished. I've rebooted many times the last > couple of days and it has worked every time. Yep, but it's the NetworkManager init script that is waiting for the network interface to complete initialization. Actually nothing to do with autofs. > > As I said, our DHCP service can sometimes take a while. Kickstart even > sometimes times out trying to get an IP. That may be why it's difficult to > reproduce. Even here, it was only failing every once in a while. But when it > failed, the machine would be unusable. I didn't want to be in a situation > where I had to check every machine every time it rebooted. It's my opinion that the NetworkManager init script should always wait for interface initialization to complete since autofs may not only be service that needs the network to start properly. But there may be other reason this isn't the default, I don't know. At least I finally spotted this and will be able to advise people immediately in future. I'm guessing we can close this now?
Yes, I agree completely it's an issue with NetworkManager. Thank you for your help with pinning that down. I think we can close it now.