Bug 681986

Summary: autofs tries to start before NetworkManager is finished
Product: Red Hat Enterprise Linux 6 Reporter: Steve Cleveland <steve.cleveland>
Component: autofsAssignee: Ian Kent <ikent>
Status: CLOSED NOTABUG QA Contact: yanfu,wang <yanwang>
Severity: high Docs Contact:
Priority: unspecified    
Version: 6.0CC: ikent, rwheeler
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-03-17 22:45:10 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Attachments:
Description Flags
Patch - fix master map source server unavailable handling
none
Patch - wait for master map available at start
none
Patch - add master read wait option none

Description Steve Cleveland 2011-03-03 19:20:13 UTC
Description of problem:

In a large environment, DHCP can take a while to complete.  It appears that autofs tries to start before the process is finished.  And thus, it can't contact the LDAP servers to get the automount options.

Version-Release number of selected component (if applicable):

NetworkManager-0.8.1-5.el6_0.1.x86_64
autofs-5.0.5-23.el6.x86_64

How reproducible:

Intermittent

Steps to Reproduce:
1. Machine boots up
2. NetworkManager starts the network process
3. autofs failes to query ldap
4. NetworkManager finishes network process
  
Actual results:

autofs fails to load the automount maps, and doesn't mount the nfs shares.

After stage 3 of 5 for NetworkManager, automount kicks in.

Expected results:

autofs should wait for NetworkManager to finish.

Additional info:

This may well be a NetworkManager problem, but I figured I'd start here.


Mar  2 11:06:50 build7900 NetworkManager[1601]: <info> monitoring kernel firmware directory '/lib/firmware'.
Mar  2 11:06:50 build7900 NetworkManager[1601]:    ifcfg-rh: Acquired D-Bus service com.redhat.ifcfgrh1
Mar  2 11:06:50 build7900 NetworkManager[1601]: <info> Loaded plugin ifcfg-rh: (c) 2007 - 2008 Red Hat, Inc.  To report bugs please use the NetworkManager mailing list.
Mar  2 11:06:50 build7900 NetworkManager[1601]:    ifcfg-rh: parsing /etc/sysconfig/network-scripts/ifcfg-eth0 ... 
Mar  2 11:06:51 build7900 NetworkManager[1601]:    ifcfg-rh:     read connection 'System eth0'
Mar  2 11:06:51 build7900 NetworkManager[1601]:    ifcfg-rh: parsing /etc/sysconfig/network-scripts/ifcfg-lo ... 
Mar  2 11:06:51 build7900 NetworkManager[1601]: <info> WiFi enabled by radio killswitch; enabled by state file
Mar  2 11:06:51 build7900 NetworkManager[1601]: <info> WWAN enabled by radio killswitch; enabled by state file
Mar  2 11:06:51 build7900 NetworkManager[1601]: <info> WiMAX enabled by radio killswitch; enabled by state file
Mar  2 11:06:51 build7900 NetworkManager[1601]: <info> Networking is enabled by state file
Mar  2 11:06:51 build7900 NetworkManager[1601]: <info> (eth0): carrier is ON
Mar  2 11:06:51 build7900 NetworkManager[1601]: <info> (eth0): new Ethernet device (driver: 'e1000e' ifindex: 2)
Mar  2 11:06:51 build7900 NetworkManager[1601]: <info> (eth0): exported as /org/freedesktop/NetworkManager/Devices/0
Mar  2 11:06:51 build7900 NetworkManager[1601]: <info> (eth0): now managed
Mar  2 11:06:51 build7900 NetworkManager[1601]: <info> (eth0): device state change: 1 -> 2 (reason 41)
Mar  2 11:06:51 build7900 NetworkManager[1601]: <info> (eth0): preparing device.
Mar  2 11:06:51 build7900 NetworkManager[1601]: <info> Activation (eth0) starting connection 'System eth0'
Mar  2 11:06:51 build7900 NetworkManager[1601]: <info> (eth0): device state change: 2 -> 7 (reason 0)
Mar  2 11:06:51 build7900 NetworkManager[1601]: <info> Activation (eth0) Stage 3 of 5 (IP Configure Start) scheduled.
Mar  2 11:06:51 build7900 NetworkManager[1601]: <info> modem-manager is now available
Mar  2 11:06:51 build7900 NetworkManager[1601]: <warn> bluez error getting default adapter: The name org.bluez was not provided by any .service files
Mar  2 11:06:51 build7900 NetworkManager[1601]: <info> Trying to start the supplicant...
Mar  2 11:06:51 build7900 NetworkManager[1601]: <info> Activation (eth0) Stage 3 of 5 (IP Configure Start) started...
Mar  2 11:06:51 build7900 NetworkManager[1601]: <info> Activation (eth0) Beginning DHCPv4 transaction (timeout in 45 seconds)
Mar  2 11:06:51 build7900 NetworkManager[1601]: <info> dhclient started with pid 1638
Mar  2 11:06:51 build7900 NetworkManager[1601]: <info> Activation (eth0) Stage 3 of 5 (IP Configure Start) complete.
Mar  2 11:06:51 build7900 dhclient: Internet Systems Consortium DHCP Client 4.1.1-P1
Mar  2 11:06:51 build7900 dhclient: Copyright 2004-2010 Internet Systems Consortium.
Mar  2 11:06:51 build7900 dhclient: All rights reserved.
Mar  2 11:06:51 build7900 dhclient: For info, please visit https://www.isc.org/software/dhcp/
Mar  2 11:06:51 build7900 dhclient: 
Mar  2 11:06:51 build7900 NetworkManager[1601]: <info> (eth0): DHCPv4 state changed nbi -> preinit
Mar  2 11:06:51 build7900 dhclient: Listening on LPF/eth0/18:a9:05:b1:49:44
Mar  2 11:06:51 build7900 dhclient: Sending on   LPF/eth0/18:a9:05:b1:49:44
Mar  2 11:06:51 build7900 dhclient: Sending on   Socket/fallback
Mar  2 11:06:51 build7900 kernel: RPC: Registered udp transport module.
Mar  2 11:06:51 build7900 kernel: RPC: Registered tcp transport module.
Mar  2 11:06:51 build7900 kernel: RPC: Registered tcp NFSv4.1 backchannel transport module.
Mar  2 11:06:52 build7900 acpid: starting up
Mar  2 11:06:52 build7900 acpid: 1 rule loaded
Mar  2 11:06:52 build7900 acpid: waiting for events: event logging is off
Mar  2 11:06:52 build7900 acpid: client connected from 1779[68:68]
Mar  2 11:06:52 build7900 acpid: 1 client rule loaded
Mar  2 11:06:52 build7900 pcscd: pcscdaemon.c:506:main() pcsc-lite 1.5.2 daemon ready.
Mar  2 11:06:54 build7900 automount[1820]: bind_ldap_anonymous: lookup(ldap): Unable to bind to the LDAP server: (default), error Can't contact LDAP server
Mar  2 11:06:54 build7900 automount[1820]: bind_ldap_anonymous: lookup(ldap): Unable to bind to the LDAP server: (default), error Can't contact LDAP server
Mar  2 11:06:54 build7900 sssd: Starting up
Mar  2 11:06:55 build7900 dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 67
Mar  2 11:06:55 build7900 dhclient: DHCPACK from 10.193.54.250
Mar  2 11:06:55 build7900 dhclient: bound to 10.193.54.58 -- renewal in 63949 seconds.
Mar  2 11:06:55 build7900 NetworkManager[1601]: <info> (eth0): DHCPv4 state changed preinit -> reboot
Mar  2 11:06:55 build7900 NetworkManager[1601]: <info> Activation (eth0) Stage 4 of 5 (IP4 Configure Get) scheduled...
Mar  2 11:06:55 build7900 NetworkManager[1601]: <info> Activation (eth0) Stage 4 of 5 (IP4 Configure Get) started...
Mar  2 11:06:55 build7900 NetworkManager[1601]: <info>   address 10.193.54.58
Mar  2 11:06:55 build7900 NetworkManager[1601]: <info>   prefix 23 (255.255.254.0)
Mar  2 11:06:55 build7900 NetworkManager[1601]: <info>   gateway 128.193.54.250
Mar  2 11:06:55 build7900 NetworkManager[1601]: <info>   hostname 'build7900.site.edu'
Mar  2 11:06:55 build7900 NetworkManager[1601]: <info>   nameserver '10.193.40.2'
Mar  2 11:06:55 build7900 NetworkManager[1601]: <info>   nameserver '10.193.15.13'
Mar  2 11:06:55 build7900 NetworkManager[1601]: <info>   nameserver '10.193.15.12'
Mar  2 11:06:55 build7900 NetworkManager[1601]: <info>   domain name 'site.edu'
Mar  2 11:06:55 build7900 NetworkManager[1601]: <info> Activation (eth0) Stage 5 of 5 (IP Configure Commit) scheduled...
Mar  2 11:06:55 build7900 NetworkManager[1601]: <info> Activation (eth0) Stage 4 of 5 (IP4 Configure Get) complete.
Mar  2 11:06:55 build7900 NetworkManager[1601]: <info> Activation (eth0) Stage 5 of 5 (IP Configure Commit) started...
Mar  2 11:06:55 build7900 sssd[be[default]]: Starting up
Mar  2 11:06:55 build7900 NetworkManager[1601]: <info> (eth0): device state change: 7 -> 8 (reason 0)
Mar  2 11:06:55 build7900 NetworkManager[1601]: <info> Policy set 'System eth0' (eth0) as default for IPv4 routing and DNS.
Mar  2 11:06:55 build7900 NetworkManager[1601]: <info> Activation (eth0) successful, device activated.
Mar  2 11:06:55 build7900 NetworkManager[1601]: <info> Activation (eth0) Stage 5 of 5 (IP Configure Commit) complete.

Comment 2 Ian Kent 2011-03-04 04:43:50 UTC
I have patches which may resolve this but they are completely
untested (a recent second attempt in fact) so far.

I can make a test rpm which includes them if you are willing
to try it.

Comment 3 Steve Cleveland 2011-03-04 16:05:08 UTC
Yes, I would be willing to test it.  I'm still in the early stages of testing deployments for rhel6, so I have machines to test on.

Comment 4 Ian Kent 2011-03-08 02:19:18 UTC
Created attachment 482822 [details]
Patch - fix master map source server unavailable handling

Comment 5 Ian Kent 2011-03-08 02:20:26 UTC
Created attachment 482823 [details]
Patch - wait for master map available at start

Comment 6 Ian Kent 2011-03-08 02:21:28 UTC
Created attachment 482824 [details]
Patch - add master read wait option

Comment 7 Ian Kent 2011-03-08 02:23:19 UTC
A package containing the above three patches (which may
help with this problem) is available at:
http://people.redhat.com/~ikent/autofs-5.0.5-29.bz681986.1

Could you please test the package please.

Comment 8 Steve Cleveland 2011-03-08 16:24:30 UTC
After installing the package and rebooting, things got worse.  Even trying to restart autofs.  It doesn't appear to be able to read the automount map.


Mar  8 08:09:44 build7900 NetworkManager[1582]: <info> Trying to start the supplicant...
Mar  8 08:09:44 build7900 NetworkManager[1582]: <info> Activation (eth0) Stage 3 of 5 (IP Configure Start) started...
Mar  8 08:09:44 build7900 NetworkManager[1582]: <info> Activation (eth0) Beginning DHCPv4 transaction (timeout in 45 seconds)
Mar  8 08:09:44 build7900 NetworkManager[1582]: <info> dhclient started with pid 1613
Mar  8 08:09:44 build7900 NetworkManager[1582]: <info> Activation (eth0) Stage 3 of 5 (IP Configure Start) complete.
Mar  8 08:09:44 build7900 dhclient: Internet Systems Consortium DHCP Client 4.1.1-P1
Mar  8 08:09:44 build7900 dhclient: Copyright 2004-2010 Internet Systems Consortium.
Mar  8 08:09:44 build7900 dhclient: All rights reserved.
Mar  8 08:09:44 build7900 dhclient: For info, please visit https://www.isc.org/software/dhcp/
Mar  8 08:09:44 build7900 dhclient: 
Mar  8 08:09:44 build7900 NetworkManager[1582]: <info> (eth0): DHCPv4 state chang
Mar  8 08:09:47 build7900 automount[1796]: bind_ldap_simple: lookup(ldap): Unable to bind to the LDAP server: (default), error Can't contact LDAP server
Mar  8 08:09:47 build7900 automount[1796]: bind_ldap_simple: lookup(ldap): Unable to bind to the LDAP server: (default), error Can't contact LDAP server
Mar  8 08:09:47 build7900 automount[1796]: main:2176: automount: failed to read master map, will retry in background!
Mar  8 08:09:47 build7900 automount[1796]: bind_ldap_simple: lookup(ldap): Unable to bind to the LDAP server: (default), error Can't contact LDAP server
Mar  8 08:09:47 build7900 automount[1796]: bind_ldap_simple: lookup(ldap): Unable to bind to the LDAP server: (default), error Can't contact LDAP server
Mar  8 08:09:48 build7900 dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 67
Mar  8 08:09:48 build7900 dhclient: DHCPACK from 10.193.54.250
Mar  8 08:09:48 build7900 dhclient: bound to 10.193.54.146 -- renewal in 57706 seconds.
Mar  8 08:09:48 build7900 NetworkManager[1582]: <info> (eth0): DHCPv4 state changed preinit -> reboot
Mar  8 08:09:48 build7900 NetworkManager[1582]: <info> Activation (eth0) Stage 4 of 5 (IP4 Configure Get) scheduled...
Mar  8 08:09:48 build7900 NetworkManager[1582]: <info> Activation (eth0) Stage 4 of 5 (IP4 Configure Get) started...
Mar  8 08:09:48 build7900 NetworkManager[1582]: <info>   address 10.193.54.146
Mar  8 08:09:48 build7900 NetworkManager[1582]: <info>   prefix 23 (255.255.254.0)
Mar  8 08:09:48 build7900 NetworkManager[1582]: <info>   gateway 10.193.54.250
Mar  8 08:09:48 build7900 NetworkManager[1582]: <info>   hostname 'build7900.site.edu'
Mar  8 08:09:48 build7900 NetworkManager[1582]: <info>   nameserver '10.193.40.2'
Mar  8 08:09:48 build7900 NetworkManager[1582]: <info>   nameserver '10.193.15.13'
Mar  8 08:09:48 build7900 NetworkManager[1582]: <info>   nameserver '10.193.15.12'
Mar  8 08:09:48 build7900 NetworkManager[1582]: <info>   domain name 'site.edu'
Mar  8 08:09:48 build7900 NetworkManager[1582]: <info> Activation (eth0) Stage 5 of 5 (IP Configure Commit) scheduled...
Mar  8 08:09:48 build7900 NetworkManager[1582]: <info> Activation (eth0) Stage 4 of 5 (IP4 Configure Get) complete.
Mar  8 08:09:48 build7900 NetworkManager[1582]: <info> Activation (eth0) Stage 5 of 5 (IP Configure Commit) started...
Mar  8 08:09:48 build7900 NetworkManager[1582]: <info> (eth0): device state change: 7 -> 8 (reason 0)
Mar  8 08:09:48 build7900 NetworkManager[1582]: <info> Policy set 'System eth0' (eth0) as default for IPv4 routing and DNS.
Mar  8 08:09:48 build7900 NetworkManager[1582]: <info> Activation (eth0) successful, device activated.
Mar  8 08:09:48 build7900 NetworkManager[1582]: <info> Activation (eth0) Stage 5 of 5 (IP Configure Commit) complete.
Mar  8 08:09:49 build7900 automount[1796]: bind_ldap_simple: lookup(ldap): Unable to bind to the LDAP server: (default), error Can't contact LDAP server
Mar  8 08:09:49 build7900 automount[1796]: bind_ldap_simple: lookup(ldap): Unable to bind to the LDAP server: (default), error Can't contact LDAP server
Mar  8 08:09:51 build7900 automount[1796]: bind_ldap_simple: lookup(ldap): Unable to bind to the LDAP server: (default), error Can't contact LDAP server
Mar  8 08:09:51 build7900 automount[1796]: bind_ldap_simple: lookup(ldap): Unable to bind to the LDAP server: (default), error Can't contact LDAP server
Mar  8 08:09:53 build7900 automount[1796]: bind_ldap_simple: lookup(ldap): Unable to bind to the LDAP server: (default), error Can't contact LDAP server
Mar  8 08:09:53 build7900 automount[1796]: bind_ldap_simple: lookup(ldap): Unable to bind to the LDAP server: (default), error Can't contact LDAP server
...
Mar  8 08:12:49 build7900 automount[1796]: problem reading master map, maximum wait exceeded
Mar  8 08:12:49 build7900 automount[1796]: main:2197: automount: failed to read master map after 180 seconds!

Comment 9 Ian Kent 2011-03-09 03:22:20 UTC
(In reply to comment #8)
> After installing the package and rebooting, things got worse.  Even trying to
> restart autofs.  It doesn't appear to be able to read the automount map.

I did say it is untested.

snip ...

> Mar  8 08:12:49 build7900 automount[1796]: problem reading master map, maximum
> wait exceeded
> Mar  8 08:12:49 build7900 automount[1796]: main:2197: automount: failed to read
> master map after 180 seconds!

At the moment, if the master map can't be read after a given
amount of time the daemon will exit.

Did you try increasing the time to wait?

If you updated the package you may have a .rpmnew configuration
file. In it you should see:

+# MASTER_WAIT - set the default maximum time to wait for the
+#              master map to become available if it cannot
+#              be read at program start (default 180).
+#
+#MASTER_WAIT=180

Uncomment MASTER_WAIT=180 line and increase the timeout to a
value that should work for your site.

Of course there may be some other problem.

Ian

Comment 10 Ian Kent 2011-03-09 03:24:45 UTC
(In reply to comment #9)
> 
> Did you try increasing the time to wait?
> 
> If you updated the package you may have a .rpmnew configuration
> file. In it you should see:
> 
> +# MASTER_WAIT - set the default maximum time to wait for the
> +#              master map to become available if it cannot
> +#              be read at program start (default 180).
> +#
> +#MASTER_WAIT=180
> 
> Uncomment MASTER_WAIT=180 line and increase the timeout to a
> value that should work for your site.

Um ... of course the configuration entry needs to be in
/etc/sysconfig/autofs, not the .rpmnew file.

Comment 11 Steve Cleveland 2011-03-11 23:16:06 UTC
Not sure your new package is working.  With the machine fully up and on the network, starting the autofs service generates the same errors.  Doesn't matter what the MASTER_WAIT is set to.  Rolling back to the old version works fine with the machine is already up.

The machine is up way before the 180 seconds passes.

Comment 12 Ian Kent 2011-03-14 14:05:48 UTC
I tried really hard to get autofs to start in the middle of the
NetworkManager initialization but couldn't. Even changing the
S number to be equal to the NerworkManager S number and removing
the network service dependency didn't do it.

But clearly you do have autofs starting during this time.

So what about making the NetworkManager init script wait for
the network to come up.

In /etc/sysconfig/network, set NETWORKWAIT=yes and if that isn't
enough set NETWORKDELAY=<seconds> to delay a further <seconds>
after NetworkManager thinks the network is up.

Ensure you remove the test package, of course, since it doesn't
work for you (although it did work ok for me, when dropping the
network, starting autofs, then bringing the network up).

Comment 13 Steve Cleveland 2011-03-15 16:26:23 UTC
Setting NETWORKWAIT=yes seems to have solved my problem.  Automount is now waiting until after DHCP has finished.  I've rebooted many times the last couple of days and it has worked every time.

As I said, our DHCP service can sometimes take a while.  Kickstart even sometimes times out trying to get an IP.  That may be why it's difficult to reproduce.  Even here, it was only failing every once in a while.  But when it failed, the machine would be unusable.  I didn't want to be in a situation where I had to check every machine every time it rebooted.

Thanks for the help!

Comment 14 Ian Kent 2011-03-16 03:40:28 UTC
(In reply to comment #13)
> Setting NETWORKWAIT=yes seems to have solved my problem.  Automount is now
> waiting until after DHCP has finished.  I've rebooted many times the last
> couple of days and it has worked every time.

Yep, but it's the NetworkManager init script that is waiting
for the network interface to complete initialization. Actually
nothing to do with autofs.

> 
> As I said, our DHCP service can sometimes take a while.  Kickstart even
> sometimes times out trying to get an IP.  That may be why it's difficult to
> reproduce.  Even here, it was only failing every once in a while.  But when it
> failed, the machine would be unusable.  I didn't want to be in a situation
> where I had to check every machine every time it rebooted.

It's my opinion that the NetworkManager init script should
always wait for interface initialization to complete since
autofs may not only be service that needs the network to
start properly. But there may be other reason this isn't the
default, I don't know.

At least I finally spotted this and will be able to advise
people immediately in future.

I'm guessing we can close this now?

Comment 16 Steve Cleveland 2011-03-17 21:38:32 UTC
Yes, I agree completely it's an issue with NetworkManager.  Thank you for your help with pinning that down.  I think we can close it now.