Bug 739902

Summary: Service does not start after NetworkManager-wait-online.service
Product: [Fedora] Fedora Reporter: Marcus Moeller <marcus.moeller>
Component: NetworkManagerAssignee: Dan Williams <dcbw>
Status: CLOSED CURRENTRELEASE QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 15CC: dcbw, harald, jklimes, johannbg, kay, lpoetter, metherid, mschmidt, notting, plautrba
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-06-13 14:05:53 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
dmesg.out
none
/var/log/messages during service startup none

Description Marcus Moeller 2011-09-20 11:29:47 UTC
Description of problem:

I have created a service called krb5cc_0.service, that is responsible for fetching kerberos credentials with the following settings:

[Unit]
Description=Provides the credential cache for naming services
Before=nslcd.service
After=syslog.target network.target time-sync.target NetworkManager-wait-online.service
Requires=time-sync.target network.target

but the service is still started before nm has changed resolv.conf, which leads to a service fail.

Sep 20 08:27:50 mymachine NetworkManager[858]: <info>   nameserver '192.168.0.1'
Sep 20 08:27:50 mymachine NetworkManager[858]: <info>   domain name 'domain.example'
Sep 20 08:27:50 mymachine NetworkManager[858]: <info> Activation (eth0) Stage 5 of 5 (IP Configure Commit) scheduled...
Sep 20 08:27:50 mymachine NetworkManager[858]: <info> Activation (eth0) Stage 4 of 5 (IP4 Configure Get) complete.
Sep 20 08:27:50 mymachine NetworkManager[858]: <info> Activation (eth0) Stage 5 of 5 (IP Configure Commit) started...
Sep 20 08:27:50 mymachine avahi-daemon[844]: Joining mDNS multicast group on interface eth0.IPv4 with address 192.168.0.19.
Sep 20 08:27:50 mymachine avahi-daemon[844]: New relevant interface eth0.IPv4 for mDNS.
Sep 20 08:27:50 mymachine avahi-daemon[844]: Registering new address record for 192.168.0.19 on eth0.IPv4.
Sep 20 08:27:58 mymachine modem-manager[936]: <info>  (ttyS0) closing serial port...
Sep 20 08:27:58 mymachine modem-manager[936]: <info>  (ttyS0) serial port closed
Sep 20 08:27:58 mymachine modem-manager[936]: <info>  (ttyS0) opening serial port...
Sep 20 08:27:59 mymachine ntpdate[1193]: name server cannot be used: Temporary failure in name resolution (-3)
Sep 20 08:28:04 mymachine modem-manager[936]: <info>  (ttyS0) closing serial port...
Sep 20 08:28:04 mymachine modem-manager[936]: <info>  (ttyS0) serial port closed
Sep 20 08:28:19 mymachine ntpdate[1197]: name server cannot be used: Temporary failure in name resolution (-3)
Sep 20 08:28:19 mymachine systemd[1]: ntpdate.service: main process exited, code=exited, status=1
Sep 20 08:28:19 mymachine systemd[1]: Unit ntpdate.service entered failed state.
Sep 20 08:28:20 mymachine ntpd[1202]: ntpd 4.2.6p3 Fri May  6 16:26:49 UTC 2011 (1)
Sep 20 08:28:20 mymachine ntpd[1202]: proto: precision = 1.187 usec
Sep 20 08:28:20 mymachine ntpd[1202]: 0.0.0.0 c01d 0d kern kernel time sync enabled
Sep 20 08:28:20 mymachine ntpd[1202]: Listen and drop on 0 v4wildcard 0.0.0.0 UDP 123
Sep 20 08:28:20 mymachine ntpd[1202]: Listen normally on 1 lo 127.0.0.1 UDP 123
Sep 20 08:28:20 mymachine ntpd[1202]: Listen normally on 2 eth0 192.168.0.19 UDP 123
Sep 20 08:28:20 mymachine ntpd[1202]: peers refreshed
Sep 20 08:28:20 mymachine ntpd[1202]: Listening on routing socket on fd #19 for interface updates
Sep 20 08:28:20 mymachine ntpd[1202]: restrict: error in address '::' on line 7. Ignoring...
Sep 20 08:28:20 mymachine ntpd[1202]: restrict: error in address '::1' on line 13. Ignoring...
Sep 20 08:28:20 mymachine ntpd[1202]: Deferring DNS for time.ethz.ch 1
Sep 20 08:28:20 mymachine systemd[1]: krb5cc_0.service: main process exited, code=exited, status=1
Sep 20 08:28:20 mymachine ntpd[1202]: Deferring DNS for swisstime.ethz.ch 1
Sep 20 08:28:20 mymachine ntpd[1202]: 0.0.0.0 c016 06 restart
Sep 20 08:28:20 mymachine ntpd[1202]: 0.0.0.0 c012 02 freq_set kernel -6.492 PPM
Sep 20 08:28:20 mymachine systemd[1]: Unit krb5cc_0.service entered failed state.
Sep 20 08:28:20 mymachine nslcd[1213]: version 0.7.13 starting
Sep 20 08:28:20 mymachine nslcd[1213]: accepting connections
Sep 20 08:28:20 mymachine rpcbind: cannot create socket for udp6
Sep 20 08:28:20 mymachine rpcbind: cannot create socket for tcp6
Sep 20 08:28:20 mymachine dbus: [system] Activating service name='org.freedesktop.ColorManager' (using servicehelper)
Sep 20 08:28:20 mymachine dbus: [system] Successfully activated service 'org.freedesktop.ColorManager'
Sep 20 08:28:20 mymachine nscd: 1349 Access Vector Cache (AVC) started
Sep 20 08:28:20 mymachine NetworkManager[858]: <info> (eth0): device state change: ip-config -> activated (reason 'none') [70 100 0]
Sep 20 08:28:20 mymachine nscd: Can't send to audit system: USER_AVC avc:  netlink poll: error 4#012: exe="?" sauid=28 hostname=? addr=? terminal=?
Sep 20 08:28:20 mymachine nscd: Can't send to audit system: USER_AVC avc:  netlink recvfrom: error 1#012: exe="?" sauid=28 hostname=? addr=? terminal=?
Sep 20 08:28:20 mymachine nscd: Can't send to audit system: USER_AVC avc:  netlink thread: errors encountered, terminating#012: exe="?" sauid=28 hostname=? addr=? terminal=?
Sep 20 08:28:20 mymachine NetworkManager[858]: <info> Policy set 'System eth0' (eth0) as default for IPv4 routing and DNS.
Sep 20 08:28:20 mymachine NetworkManager[858]: <info> Activation (eth0) successful, device activated.
Sep 20 08:28:20 mymachine NetworkManager[858]: <info> Activation (eth0) Stage 5 of 5 (IP Configure Commit) complete.

As seen above, this also applies to ntpdate, which is not able to query time servers, because DNS resolution is not available, yet.

Not sure if it's related but I got the following interface configuration setup:

IPV6INIT="no"
BOOTPROTO="dhcp"
DEVICE="eth0"
ONBOOT="yes"
PEERDNS="no"
DNS1="192.168.0.1"
DNS2="192.168.0.2"
DOMAIN="domain.example"

Comment 1 Marcus Moeller 2011-09-20 14:16:39 UTC
Created attachment 524043 [details]
dmesg.out

dmesg output with log_buf_len=1M systemd.log_target=kmsg systemd.log_level=debug set

Comment 2 Marcus Moeller 2011-09-20 14:26:16 UTC
Created attachment 524045 [details]
/var/log/messages during service startup

Comment 3 Marcus Moeller 2011-09-20 14:56:08 UTC
It seems to be related to nscd.service

Stopping LSB: Starts the Name Switch Cache Daemon...

has been logged during system startup.

Disabling nscd (which is not a good idea in general) let's the other services start correctly.

Comment 4 Bill Nottingham 2011-09-20 15:55:27 UTC
The wait-online service appears to have timed out. It has a default timeout of 30 seconds.

Comment 5 Marcus Moeller 2011-09-20 16:12:53 UTC
I have already tried to set it to 60sec, but it only takes longer for the service to fail.

Comment 6 Marcus Moeller 2011-09-21 08:16:14 UTC
Besides that, everything is fine, if I disable nscd, so I guess it has to be related to that one.

Comment 7 Marcus Moeller 2011-09-21 14:17:56 UTC
Ok, it seems to work fine again, using these systemd service definitions:

# cat nscd.service
[Unit]
Description=Name Switch Cache Daemon
After=syslog.target network.target
    
[Service]
Type=forking
PIDFile=/run/nscd/nscd.pid
EnvironmentFile=-/etc/sysconfig/nscd
ExecStart=/usr/sbin/nscd $NSCD_OPTIONS
ExecReload=/usr/sbin/nscd -i passwd
ExecReload=/usr/sbin/nscd -i group
ExecReload=/usr/sbin/nscd -i hosts
ExecReload=/usr/sbin/nscd -i services
ExecStop=/usr/sbin/nscd -K
     
[Install]
WantedBy=multi-user.target
     
# cat /etc/tmpfiles.d/nscd.conf
d /run/nscd 0755 root root -

Comment 8 Jirka Klimes 2011-09-21 16:29:21 UTC
(In reply to comment #7)
> Ok, it seems to work fine again, using these systemd service definitions:

Nice to hear that. However, that's not the default service file, right?
nscd is still using an initscript /etc/init.d/nscd. That should be fixed, I guess.

(In reply to comment #6)
> Besides that, everything is fine, if I disable nscd, so I guess it has to be
> related to that one.

There's a bug 700507 (RHEL 6) on nscd due to these lines:

Sep 20 15:35:02 mymachine nscd: Can't send to audit system: USER_AVC avc: 
netlink poll: error 4#012: exe="?" sauid=28 hostname=? addr=? terminal=?
Sep 20 15:35:02 mymachine nscd: Can't send to audit system: USER_AVC avc: 
netlink recvfrom: error 1#012: exe="?" sauid=28 hostname=? addr=? terminal=?
Sep 20 15:35:02 mymachine nscd: Can't send to audit system: USER_AVC avc: 
netlink thread: errors encountered, terminating#012: exe="?" sauid=28
hostname=? addr=? terminal=?

It's fixed in RHEL 6, however the fix is not delivered to Fedora 15 yet.
Looking into bodhi, the fix is included in glibc-2.14.90-9 (F16) that is in
'testing' state, no valid update for F15.
I'm not sure if the bug is connected with the issue here.

From NM point of view, there is a big delay (29sec!) between these two lines:
Sep 20 15:35:06 mymachine NetworkManager[889]: <info> Activation (eth0) Stage 5
of 5 (IP Configure Commit) started...
Sep 20 15:36:35 mymachine NetworkManager[889]: <info> Activation (eth0) Stage 5
of 5 (IP Configure Commit) complete.
even if they are logged in the same function:
src/nm-device.c:nm_device_activate_stage5_ip_config_commit()

nscd somehow badly interferes with NM, probably. It would be useful to find out
where NM is stuck. E.g. with: 

sudo pstack `pidof NetworkManager`

Also appending '--log-level=DEBUG' to ExecStart in
/lib/systemd/system/NetworkManager.service will help increasing logging level
of NM.

Comment 9 Marcus Moeller 2012-06-13 14:05:53 UTC
Works in more recent Fedora versions