Bug 754299 - Boot hangs for 60s mounting NFS filesystems
Summary: Boot hangs for 60s mounting NFS filesystems
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Fedora
Classification: Fedora
Component: initscripts
Version: 16
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Bill Nottingham
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-11-16 01:26 UTC by Ben Webb
Modified: 2014-03-17 03:29 UTC
CC List: 12 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-02-13 19:31:31 UTC
Type: ---


Attachments
dmesg (99.83 KB, text/plain)
2011-11-16 01:27 UTC, Ben Webb
blame (1.80 KB, text/plain)
2011-11-16 01:28 UTC, Ben Webb
blame-F15 (1.77 KB, text/plain)
2011-11-16 01:29 UTC, Ben Webb

Description Ben Webb 2011-11-16 01:26:30 UTC
Description of problem:
Fedora 16 hangs for 60s trying to mount NFS filesystems. This is on several systems with 'always on' network connections (em1, static IP, using the old-style network script rather than NetworkManager) and 4 similar 'always on' NFS mounts in fstab. It looks like all of the mount processes are started before the network is fully up, and they wait a while before retrying.

See attachments for the 'systemd-analyze blame' output and dmesg. Note that of the 4 mounts (salilab-*.mount), one mounts in a reasonable (but not great) time, while the others take about 60s. This behavior is consistent across reboots, although the one 'reasonable' mount changes each time. Note that not much is going on in dmesg between 14s and 74s, while it waits for the mounts to complete.

The mounts themselves in fstab look like:
diva:/diva1             /salilab/diva1  	nfs   rsize=32768,wsize=32768,noatime,bg,hard,intr,nfsvers=3 0 0
xylophone:/park1        /salilab/park1  	nfs    rsize=32768,wsize=32768,noatime,bg,hard,intr,nfsvers=3 0 0
xylophone:/park2        /salilab/park2  	nfs    rsize=32768,wsize=32768,noatime,bg,hard,intr,nfsvers=3 0 0
kazoo:/park3            /salilab/park3  	nfs    rsize=32768,wsize=32768,noatime,bg,hard,intr,nfsvers=3 0 0

For comparison, this 60s timeout does not occur on a similar F15 system. (See blame-F15 for the timings on a similarly configured F15 NFS client.)

If I comment out these entries from fstab, booting proceeds normally. I can then mount the filesystems with 'mount -t nfs -a' without having to wait 60s - it completes in less than a second.

We could probably lessen the problem by tweaking our NFS timeouts, but it would be preferable if systemd simply didn't start the mount until the network was up.

FWIW, these NFS mounts do not contain home directories or other data that is needed for logins to proceed, so there's no reason (in our case) for systemd to wait for them to complete before starting up X (note the 'bg' mount option).
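One possible mitigation, not something tried in this report, is to mark these mounts as non-essential for boot with the standard nofail fstab option, so that systemd should not block startup on them; later systemd releases also accept x-systemd.* mount options for ordering a mount after the network is online. A sketch of one such entry, based on the lines above:

diva:/diva1             /salilab/diva1  	nfs   rsize=32768,wsize=32768,noatime,bg,hard,intr,nfsvers=3,nofail 0 0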

Version-Release number of selected component (if applicable):
systemd-37-3.fc16.x86_64

How reproducible:
Always.

Steps to Reproduce:
1. Configure an NFS client with a static IP and an NFS filesystem in /etc/fstab.
2. Remove NetworkManager if installed.
3. Boot.
  
Actual results:
System hangs for 60s trying to mount NFS filesystems (but does finally succeed).

Expected results:
No hang.

Comment 1 Ben Webb 2011-11-16 01:27:43 UTC
Created attachment 533873 [details]
dmesg

dmesg output from boot with systemd debugging turned on

Comment 2 Ben Webb 2011-11-16 01:28:41 UTC
Created attachment 533874 [details]
blame

Output of "systemd-analyze blame" on an affected system.

Comment 3 Ben Webb 2011-11-16 01:29:20 UTC
Created attachment 533875 [details]
blame-F15

Output of "systemd-analyze blame" on an unaffected (F15) system for comparison.

Comment 4 Ben Webb 2011-11-16 07:40:18 UTC
An observation: adding "mountproto=tcp" to the mount options (the default - on my systems at least - is udp) causes systemd to not wait for the mount to complete, so bootup proceeds. The mounts still take 60s to complete, but at least X and other services can start up in the meantime. Still not an ideal situation though.
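For reference, the same kind of fstab entry with that option appended would look like this (illustrative, mirroring the lines in the description):

diva:/diva1             /salilab/diva1  	nfs   rsize=32768,wsize=32768,noatime,bg,hard,intr,nfsvers=3,mountproto=tcp 0 0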

Comment 5 Michal Schmidt 2011-11-16 10:08:05 UTC
The dmesg log shows that the network service completed successfully. The network should be up at this point.

Try if putting NETWORKDELAY=10 into /etc/sysconfig/network is a workaround.
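For reference, this is a single variable added to /etc/sysconfig/network (a sketch; any existing settings in the file stay as they are), which the network initscript reads and which delays completion of network.service by roughly that many seconds:

NETWORKDELAY=10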

...
[   14.477932] systemd[1]: network.service: control process exited, code=exited status=0
[   14.477937] systemd[1]: network.service got final SIGCHLD for state start
[   14.478017] systemd[1]: network.service changed start -> running
[   14.478032] systemd[1]: Job network.service/start finished, result=done
...
[   14.511111] systemd[1]: network.target changed dead -> active
[   14.511119] systemd[1]: Job network.target/start finished, result=done
[   14.522224] systemd[1]: About to execute: /bin/mount /salilab/diva1
[   14.552046] systemd[1]: Forked /bin/mount as 1139
[   14.552075] systemd[1]: salilab-diva1.mount changed dead -> mounting
...

Comment 6 Ben Webb 2011-11-16 15:56:21 UTC
(In reply to comment #5)
> The dmesg log shows that the network service completed successfully.

While the service completed, the actual network device doesn't come up until later:

[   15.422130] tg3 0000:08:00.0: em1: Link is up at 1000 Mbps, full duplex
[   15.422334] tg3 0000:08:00.0: em1: Flow control is on for TX and on for RX
[   15.424853] ADDRCONF(NETDEV_CHANGE): em1: link becomes ready

This is a fraction of a second *after* the NFS mount processes have been spawned.

> Try if putting NETWORKDELAY=10 into /etc/sysconfig/network is a workaround.

I'll give that a try, thanks.

Comment 7 Ben Webb 2011-11-16 16:48:28 UTC
(In reply to comment #5)
> Try if putting NETWORKDELAY=10 into /etc/sysconfig/network is a workaround.

No, that doesn't work - it just makes the network service take 10s longer to complete - systemd-analyze blame now says:
 60203ms salilab-diva1.mount
 60163ms salilab-park1.mount
 60107ms salilab-park2.mount
 60059ms salilab-park3.mount
 12524ms network.service

It does now, however, start mount processes after the network device comes up. So I guess that wasn't the issue.

But I notice that it also starts the mounts before the rpcbind service - maybe this is why they initially fail and then wait 60s before retrying? (My understanding is that systemd provides fake sockets so things that need them will block until the real socket server comes online, but maybe NFS mounts do a non-blocking connect, so this mechanism does not work correctly.)
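For context, socket activation means a *.socket unit owns the listening socket and systemd starts the service on the first connection. An illustrative sketch of such a unit (not necessarily Fedora's actual rpcbind.socket) would look roughly like:

[Unit]
Description=RPCbind activation socket (illustrative sketch)

[Socket]
# rpcbind listens on port 111 over both TCP and UDP
ListenStream=111
ListenDatagram=111

[Install]
WantedBy=sockets.target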

Comment 8 Bill Nottingham 2011-11-16 19:09:17 UTC
If you add:

WantedBy=remote-fs-pre.target

to rpcbind.service, does it start working? (This wouldn't be the final fix, I would think rpcbind should be working correctly with socket activation here.)

Also, your boot log exposes a bunch of flaws with netfs, which needs rework. But you end up not using it, so that's not the cause of this bug.

Comment 9 Ben Webb 2011-11-16 19:40:46 UTC
(In reply to comment #8)
> If you add:
> 
> WantedBy=remote-fs-pre.target
> 
> to rpcbind.service, does it start working?

No, that doesn't change anything. I also tried
WantedBy=multi-user.target remote-fs-pre.target
since it already had a WantedBy in there - same effect.

However, this doesn't prove that rpcbind *isn't* the culprit, because systemd still insists on starting the mounts before rpcbind, even with this change:

[   14.486122] systemd[1]: Job network.target/start finished, result=done
[   14.494871] systemd[1]: About to execute: /bin/mount /salilab/diva1
[   14.527054] systemd[1]: Forked /bin/mount as 1137
[   14.527083] systemd[1]: salilab-diva1.mount changed dead -> mounting
[   14.527480] systemd[1]: About to execute: /bin/mount /salilab/park1
[   14.559050] systemd[1]: Forked /bin/mount as 1140
[   14.559080] systemd[1]: salilab-park1.mount changed dead -> mounting
[   14.559404] systemd[1]: About to execute: /bin/mount /salilab/park2
[   14.599040] systemd[1]: Forked /bin/mount as 1148
[   14.599066] systemd[1]: salilab-park2.mount changed dead -> mounting
[   14.599334] systemd[1]: About to execute: /bin/mount /salilab/park3
[   14.648040] systemd[1]: Forked /bin/mount as 1151
[   14.648066] systemd[1]: salilab-park3.mount changed dead -> mounting
[   14.648189] systemd[1]: About to execute: /sbin/rpcbind -w
[   14.688040] systemd[1]: Forked /sbin/rpcbind as 1156
[   14.688078] systemd[1]: rpcbind.service changed dead -> start

Comment 10 Michal Schmidt 2011-11-18 14:53:04 UTC
(In reply to comment #8)
> WantedBy=remote-fs-pre.target
> 
> to rpcbind.service, does it start working? (This wouldn't be the final fix, I
> would think rpcbind should be working correctly with socket activation here.)

WantedBy= does not influence ordering, but Before= would.
And rpcbind does not really support socket activation (bug 747363).
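As a minimal illustration of the distinction (a sketch, not a proposed final fix), the two directives live in different sections and do different jobs:

[Unit]
# Ordering only: if both units are queued in the same transaction,
# start this one before remote-fs-pre.target is reached.
Before=remote-fs-pre.target

[Install]
# Installation/dependency only: 'systemctl enable' records that
# remote-fs-pre.target wants this unit pulled in, but it says
# nothing about which starts first.
WantedBy=remote-fs-pre.target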

Comment 11 Ben Webb 2011-11-18 16:49:13 UTC
(In reply to comment #10)
> WantedBy= does not influence ordering, but Before= would.
> And rpcbind does not really support socket activation (bug 747363).

OK, I changed my /lib/systemd/system/rpcbind.service to read

[Unit]
Description=RPC bind service
Requires=rpcbind.socket
After=syslog.target network.target
Before=remote-fs-pre.target

[Service]
Type=forking
ExecStart=/sbin/rpcbind -w

[Install]
WantedBy=multi-user.target
Also=rpcbind.socket

and rebooted the system. It still doesn't seem to influence the ordering though:

[   15.124047] systemd[1]: Forked /bin/mount as 1126
[   15.124072] systemd[1]: salilab-park3.mount changed dead -> mounting
[   15.124171] systemd[1]: About to execute: /sbin/rpcbind -w
[   15.164069] systemd[1]: Forked /sbin/rpcbind as 1131
[   15.164116] systemd[1]: rpcbind.service changed dead -> start

Did I edit the wrong file, or edit it incorrectly? (There is no rpcbind.service file in /etc/systemd/system/.) Do I need to explicitly tell systemd that I changed this file so it can update the dependencies?
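For what it's worth, the conventional way to customize a packaged unit so that systemd definitely picks the change up (and so package updates do not overwrite it) is roughly the following; it would not necessarily change the outcome here, since a reboot also reloads unit files:

cp /lib/systemd/system/rpcbind.service /etc/systemd/system/rpcbind.service
# edit /etc/systemd/system/rpcbind.service (it takes precedence over the copy in /lib), then:
systemctl daemon-reload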

Comment 12 Fedora End Of Life 2013-01-16 16:07:59 UTC
This message is a reminder that Fedora 16 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 16. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '16'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 16's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 16 is end of life. If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora, you are encouraged to click on 
"Clone This Bug" and open it against that version of Fedora.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 13 Fedora End Of Life 2013-02-13 19:31:35 UTC
Fedora 16 changed to end-of-life (EOL) status on 2013-02-12. Fedora 16 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.

