| Summary: | autofs tries to start before NetworkManager is finished | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Steve Cleveland <steve.cleveland> | ||||||||
| Component: | autofs | Assignee: | Ian Kent <ikent> | ||||||||
| Status: | CLOSED NOTABUG | QA Contact: | yanfu,wang <yanwang> | ||||||||
| Severity: | high | Docs Contact: | |||||||||
| Priority: | unspecified | ||||||||||
| Version: | 6.0 | CC: | ikent, rwheeler | ||||||||
| Target Milestone: | rc | ||||||||||
| Target Release: | --- | ||||||||||
| Hardware: | Unspecified | ||||||||||
| OS: | Unspecified | ||||||||||
| Whiteboard: | |||||||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||||||
| Doc Text: | Story Points: | --- | |||||||||
| Clone Of: | Environment: | ||||||||||
| Last Closed: | 2011-03-17 22:45:10 UTC | Type: | --- | ||||||||
| Regression: | --- | Mount Type: | --- | ||||||||
| Documentation: | --- | CRM: | |||||||||
| Verified Versions: | Category: | --- | |||||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||||
| Attachments: |
|
||||||||||
|
Description
Steve Cleveland
2011-03-03 19:20:13 UTC
I have patches which may resolve this but they are completely untested (a recent second attempt in fact) so far. I can make a test rpm which includes them if you are willing to try it. Yes, I would be willing to test it. I'm still in the early stages of testing deployments for rhel6, so I have machines to test on. Created attachment 482822 [details]
Patch - fix master map source server unavailable handling
Created attachment 482823 [details]
Patch - wait for master map available at start
Created attachment 482824 [details]
Patch - add master read wait option
A package containing the above three patches (which may help with this problem) is available at: http://people.redhat.com/~ikent/autofs-5.0.5-29.bz681986.1 Could you please test the package please. After installing the package and rebooting, things got worse. Even trying to restart autofs. It doesn't appear to be able to read the automount map. Mar 8 08:09:44 build7900 NetworkManager[1582]: <info> Trying to start the supplicant... Mar 8 08:09:44 build7900 NetworkManager[1582]: <info> Activation (eth0) Stage 3 of 5 (IP Configure Start) started... Mar 8 08:09:44 build7900 NetworkManager[1582]: <info> Activation (eth0) Beginning DHCPv4 transaction (timeout in 45 seconds) Mar 8 08:09:44 build7900 NetworkManager[1582]: <info> dhclient started with pid 1613 Mar 8 08:09:44 build7900 NetworkManager[1582]: <info> Activation (eth0) Stage 3 of 5 (IP Configure Start) complete. Mar 8 08:09:44 build7900 dhclient: Internet Systems Consortium DHCP Client 4.1.1-P1 Mar 8 08:09:44 build7900 dhclient: Copyright 2004-2010 Internet Systems Consortium. Mar 8 08:09:44 build7900 dhclient: All rights reserved. Mar 8 08:09:44 build7900 dhclient: For info, please visit https://www.isc.org/software/dhcp/ Mar 8 08:09:44 build7900 dhclient: Mar 8 08:09:44 build7900 NetworkManager[1582]: <info> (eth0): DHCPv4 state chang Mar 8 08:09:47 build7900 automount[1796]: bind_ldap_simple: lookup(ldap): Unable to bind to the LDAP server: (default), error Can't contact LDAP server Mar 8 08:09:47 build7900 automount[1796]: bind_ldap_simple: lookup(ldap): Unable to bind to the LDAP server: (default), error Can't contact LDAP server Mar 8 08:09:47 build7900 automount[1796]: main:2176: automount: failed to read master map, will retry in background! Mar 8 08:09:47 build7900 automount[1796]: bind_ldap_simple: lookup(ldap): Unable to bind to the LDAP server: (default), error Can't contact LDAP server Mar 8 08:09:47 build7900 automount[1796]: bind_ldap_simple: lookup(ldap): Unable to bind to the LDAP server: (default), error Can't contact LDAP server Mar 8 08:09:48 build7900 dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 67 Mar 8 08:09:48 build7900 dhclient: DHCPACK from 10.193.54.250 Mar 8 08:09:48 build7900 dhclient: bound to 10.193.54.146 -- renewal in 57706 seconds. Mar 8 08:09:48 build7900 NetworkManager[1582]: <info> (eth0): DHCPv4 state changed preinit -> reboot Mar 8 08:09:48 build7900 NetworkManager[1582]: <info> Activation (eth0) Stage 4 of 5 (IP4 Configure Get) scheduled... Mar 8 08:09:48 build7900 NetworkManager[1582]: <info> Activation (eth0) Stage 4 of 5 (IP4 Configure Get) started... Mar 8 08:09:48 build7900 NetworkManager[1582]: <info> address 10.193.54.146 Mar 8 08:09:48 build7900 NetworkManager[1582]: <info> prefix 23 (255.255.254.0) Mar 8 08:09:48 build7900 NetworkManager[1582]: <info> gateway 10.193.54.250 Mar 8 08:09:48 build7900 NetworkManager[1582]: <info> hostname 'build7900.site.edu' Mar 8 08:09:48 build7900 NetworkManager[1582]: <info> nameserver '10.193.40.2' Mar 8 08:09:48 build7900 NetworkManager[1582]: <info> nameserver '10.193.15.13' Mar 8 08:09:48 build7900 NetworkManager[1582]: <info> nameserver '10.193.15.12' Mar 8 08:09:48 build7900 NetworkManager[1582]: <info> domain name 'site.edu' Mar 8 08:09:48 build7900 NetworkManager[1582]: <info> Activation (eth0) Stage 5 of 5 (IP Configure Commit) scheduled... Mar 8 08:09:48 build7900 NetworkManager[1582]: <info> Activation (eth0) Stage 4 of 5 (IP4 Configure Get) complete. Mar 8 08:09:48 build7900 NetworkManager[1582]: <info> Activation (eth0) Stage 5 of 5 (IP Configure Commit) started... Mar 8 08:09:48 build7900 NetworkManager[1582]: <info> (eth0): device state change: 7 -> 8 (reason 0) Mar 8 08:09:48 build7900 NetworkManager[1582]: <info> Policy set 'System eth0' (eth0) as default for IPv4 routing and DNS. Mar 8 08:09:48 build7900 NetworkManager[1582]: <info> Activation (eth0) successful, device activated. Mar 8 08:09:48 build7900 NetworkManager[1582]: <info> Activation (eth0) Stage 5 of 5 (IP Configure Commit) complete. Mar 8 08:09:49 build7900 automount[1796]: bind_ldap_simple: lookup(ldap): Unable to bind to the LDAP server: (default), error Can't contact LDAP server Mar 8 08:09:49 build7900 automount[1796]: bind_ldap_simple: lookup(ldap): Unable to bind to the LDAP server: (default), error Can't contact LDAP server Mar 8 08:09:51 build7900 automount[1796]: bind_ldap_simple: lookup(ldap): Unable to bind to the LDAP server: (default), error Can't contact LDAP server Mar 8 08:09:51 build7900 automount[1796]: bind_ldap_simple: lookup(ldap): Unable to bind to the LDAP server: (default), error Can't contact LDAP server Mar 8 08:09:53 build7900 automount[1796]: bind_ldap_simple: lookup(ldap): Unable to bind to the LDAP server: (default), error Can't contact LDAP server Mar 8 08:09:53 build7900 automount[1796]: bind_ldap_simple: lookup(ldap): Unable to bind to the LDAP server: (default), error Can't contact LDAP server ... Mar 8 08:12:49 build7900 automount[1796]: problem reading master map, maximum wait exceeded Mar 8 08:12:49 build7900 automount[1796]: main:2197: automount: failed to read master map after 180 seconds! (In reply to comment #8) > After installing the package and rebooting, things got worse. Even trying to > restart autofs. It doesn't appear to be able to read the automount map. I did say it is untested. snip ... > Mar 8 08:12:49 build7900 automount[1796]: problem reading master map, maximum > wait exceeded > Mar 8 08:12:49 build7900 automount[1796]: main:2197: automount: failed to read > master map after 180 seconds! At the moment, if the master map can't be read after a given amount of time the daemon will exit. Did you try increasing the time to wait? If you updated the package you may have a .rpmnew configuration file. In it you should see: +# MASTER_WAIT - set the default maximum time to wait for the +# master map to become available if it cannot +# be read at program start (default 180). +# +#MASTER_WAIT=180 Uncomment MASTER_WAIT=180 line and increase the timeout to a value that should work for your site. Of course there may be some other problem. Ian (In reply to comment #9) > > Did you try increasing the time to wait? > > If you updated the package you may have a .rpmnew configuration > file. In it you should see: > > +# MASTER_WAIT - set the default maximum time to wait for the > +# master map to become available if it cannot > +# be read at program start (default 180). > +# > +#MASTER_WAIT=180 > > Uncomment MASTER_WAIT=180 line and increase the timeout to a > value that should work for your site. Um ... of course the configuration entry needs to be in /etc/sysconfig/autofs, not the .rpmnew file. Not sure your new package is working. With the machine fully up and on the network, starting the autofs service generates the same errors. Doesn't matter what the MASTER_WAIT is set to. Rolling back to the old version works fine with the machine is already up. The machine is up way before the 180 seconds passes. I tried really hard to get autofs to start in the middle of the NetworkManager initialization but couldn't. Even changing the S number to be equal to the NerworkManager S number and removing the network service dependency didn't do it. But clearly you do have autofs starting during this time. So what about making the NetworkManager init script wait for the network to come up. In /etc/sysconfig/network, set NETWORKWAIT=yes and if that isn't enough set NETWORKDELAY=<seconds> to delay a further <seconds> after NetworkManager thinks the network is up. Ensure you remove the test package, of course, since it doesn't work for you (although it did work ok for me, when dropping the network, starting autofs, then bringing the network up). Setting NETWORKWAIT=yes seems to have solved my problem. Automount is now waiting until after DHCP has finished. I've rebooted many times the last couple of days and it has worked every time. As I said, our DHCP service can sometimes take a while. Kickstart even sometimes times out trying to get an IP. That may be why it's difficult to reproduce. Even here, it was only failing every once in a while. But when it failed, the machine would be unusable. I didn't want to be in a situation where I had to check every machine every time it rebooted. Thanks for the help! (In reply to comment #13) > Setting NETWORKWAIT=yes seems to have solved my problem. Automount is now > waiting until after DHCP has finished. I've rebooted many times the last > couple of days and it has worked every time. Yep, but it's the NetworkManager init script that is waiting for the network interface to complete initialization. Actually nothing to do with autofs. > > As I said, our DHCP service can sometimes take a while. Kickstart even > sometimes times out trying to get an IP. That may be why it's difficult to > reproduce. Even here, it was only failing every once in a while. But when it > failed, the machine would be unusable. I didn't want to be in a situation > where I had to check every machine every time it rebooted. It's my opinion that the NetworkManager init script should always wait for interface initialization to complete since autofs may not only be service that needs the network to start properly. But there may be other reason this isn't the default, I don't know. At least I finally spotted this and will be able to advise people immediately in future. I'm guessing we can close this now? Yes, I agree completely it's an issue with NetworkManager. Thank you for your help with pinning that down. I think we can close it now. |