| Summary: | Boot hangs for 60s mounting NFS filesystems |
|---|---|
| Product: | Fedora |
| Component: | initscripts |
| Version: | 16 |
| Hardware: | x86_64 |
| OS: | Linux |
| Status: | CLOSED WONTFIX |
| Severity: | medium |
| Priority: | unspecified |
| Reporter: | Ben Webb <ben> |
| Assignee: | Bill Nottingham <notting> |
| QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| CC: | iarlyy, initscripts-maint-list, johannbg, jonathan, lpoetter, metherid, mschmidt, notting, plautrba, ron, rvokal, systemd-maint |
| Doc Type: | Bug Fix |
| Last Closed: | 2013-02-13 19:31:31 UTC |
| Attachments: | dmesg, blame, blame-F15 (see Description) |
Description
Ben Webb
2011-11-16 01:26:30 UTC
Created attachment 533873: dmesg
dmesg output from boot with systemd debugging turned on

Created attachment 533874: blame
Output of "systemd-analyze blame" on an affected system.

Created attachment 533875: blame-F15
Output of "systemd-analyze blame" on an unaffected (F15) system for comparison.
An observation: adding "mountproto=tcp" to the mount options (the default, on my systems at least, is udp) causes systemd not to wait for the mount to complete, so bootup proceeds. The mounts still take 60s to complete, but at least X and other services can start up in the meantime. Still not an ideal situation, though.

The dmesg log shows that the network service completed successfully. The network should be up at this point. Try if putting NETWORKDELAY=10 into /etc/sysconfig/network is a workaround.

```
...
[ 14.477932] systemd[1]: network.service: control process exited, code=exited status=0
[ 14.477937] systemd[1]: network.service got final SIGCHLD for state start
[ 14.478017] systemd[1]: network.service changed start -> running
[ 14.478032] systemd[1]: Job network.service/start finished, result=done
...
[ 14.511111] systemd[1]: network.target changed dead -> active
[ 14.511119] systemd[1]: Job network.target/start finished, result=done
[ 14.522224] systemd[1]: About to execute: /bin/mount /salilab/diva1
[ 14.552046] systemd[1]: Forked /bin/mount as 1139
[ 14.552075] systemd[1]: salilab-diva1.mount changed dead -> mounting
...
```

(In reply to comment #5)
> The dmesg log shows that the network service completed successfully.

While the service completed, the actual network device doesn't come up until later:

```
[ 15.422130] tg3 0000:08:00.0: em1: Link is up at 1000 Mbps, full duplex
[ 15.422334] tg3 0000:08:00.0: em1: Flow control is on for TX and on for RX
[ 15.424853] ADDRCONF(NETDEV_CHANGE): em1: link becomes ready
```

This is a fraction of a second *after* the NFS mount processes have been spawned.

> Try if putting NETWORKDELAY=10 into /etc/sysconfig/network is a workaround.

I'll give that a try, thanks.

(In reply to comment #5)
> Try if putting NETWORKDELAY=10 into /etc/sysconfig/network is a workaround.

No, that doesn't work; it just makes the network service take 10s longer to complete. systemd-analyze blame now says:

```
60203ms salilab-diva1.mount
60163ms salilab-park1.mount
60107ms salilab-park2.mount
60059ms salilab-park3.mount
12524ms network.service
```

It does now, however, start the mount processes after the network device comes up, so I guess that wasn't the issue. But I notice that it also starts the mounts before the rpcbind service; maybe this is why they initially fail and then wait 60s before retrying? (My understanding is that systemd provides fake sockets so things that need them will block until the real socket server comes on line, but maybe NFS mounts do a non-blocking connect, so this mechanism does not work correctly.)

If you add:

```
WantedBy=remote-fs-pre.target
```

to rpcbind.service, does it start working? (This wouldn't be the final fix; I would think rpcbind should be working correctly with socket activation here.) Also, your boot log exposes a bunch of flaws in netfs, which needs rework. But you end up not using it, so that's not the cause of this bug.

(In reply to comment #8)
> If you add:
>
> WantedBy=remote-fs-pre.target
>
> to rpcbind.service, does it start working?

No, that doesn't change anything. I also tried

```
WantedBy=multi-user.target remote-fs-pre.target
```

since it already had a WantedBy in there; same effect.
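To make the first workaround in this thread concrete: mountproto=tcp is a standard NFS mount option (see nfs(5)) and goes in the options field of the mount's /etc/fstab entry. A minimal sketch; the server name is hypothetical, since the actual export hosts are not recorded in this bug:

```
# /etc/fstab: force the MNT (mount) protocol over TCP instead of the
# UDP default. "fileserver.example.com" is a placeholder.
fileserver.example.com:/export/diva1  /salilab/diva1  nfs  mountproto=tcp  0 0
```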
However, this doesn't prove that rpcbind *isn't* the culprit, because systemd still insists on starting the mounts before rpcbind, even with this change:

```
[ 14.486122] systemd[1]: Job network.target/start finished, result=done
[ 14.494871] systemd[1]: About to execute: /bin/mount /salilab/diva1
[ 14.527054] systemd[1]: Forked /bin/mount as 1137
[ 14.527083] systemd[1]: salilab-diva1.mount changed dead -> mounting
[ 14.527480] systemd[1]: About to execute: /bin/mount /salilab/park1
[ 14.559050] systemd[1]: Forked /bin/mount as 1140
[ 14.559080] systemd[1]: salilab-park1.mount changed dead -> mounting
[ 14.559404] systemd[1]: About to execute: /bin/mount /salilab/park2
[ 14.599040] systemd[1]: Forked /bin/mount as 1148
[ 14.599066] systemd[1]: salilab-park2.mount changed dead -> mounting
[ 14.599334] systemd[1]: About to execute: /bin/mount /salilab/park3
[ 14.648040] systemd[1]: Forked /bin/mount as 1151
[ 14.648066] systemd[1]: salilab-park3.mount changed dead -> mounting
[ 14.648189] systemd[1]: About to execute: /sbin/rpcbind -w
[ 14.688040] systemd[1]: Forked /sbin/rpcbind as 1156
[ 14.688078] systemd[1]: rpcbind.service changed dead -> start
```

(In reply to comment #8)
> WantedBy=remote-fs-pre.target
>
> to rpcbind.service, does it start working? (This wouldn't be the final fix, I
> would think rpcbind should be working correctly with socket activation here.)

WantedBy= does not influence ordering, but Before= would. And rpcbind does not really support socket activation (bug 747363).

(In reply to comment #10)
> WantedBy= does not influence ordering, but Before= would.
> And rpcbind does not really support socket activation (bug 747363).

OK, I changed my /lib/systemd/system/rpcbind.service to read

```
[Unit]
Description=RPC bind service
Requires=rpcbind.socket
After=syslog.target network.target
Before=remote-fs-pre.target

[Service]
Type=forking
ExecStart=/sbin/rpcbind -w

[Install]
WantedBy=multi-user.target
Also=rpcbind.socket
```

and rebooted the system. It still doesn't seem to influence the ordering, though:

```
[ 15.124047] systemd[1]: Forked /bin/mount as 1126
[ 15.124072] systemd[1]: salilab-park3.mount changed dead -> mounting
[ 15.124171] systemd[1]: About to execute: /sbin/rpcbind -w
[ 15.164069] systemd[1]: Forked /sbin/rpcbind as 1131
[ 15.164116] systemd[1]: rpcbind.service changed dead -> start
```

Did I edit the wrong file, or edit it incorrectly? (There is no rpcbind.service file in /etc/systemd/system/.) Do I need to explicitly tell systemd that I changed this file so it can update the dependencies?
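The two questions above go unanswered in this bug, but for reference: systemd only rereads unit files on an explicit daemon reload, and local modifications conventionally go in /etc/systemd/system, which takes precedence over /lib/systemd/system. A minimal sketch of that procedure (whether it would have changed the mount ordering on Fedora 16 is not recorded here):

```
# Copy the vendor unit into /etc, which shadows /lib/systemd/system,
# then make the Before=remote-fs-pre.target edit in the copy.
cp /lib/systemd/system/rpcbind.service /etc/systemd/system/rpcbind.service

# Tell systemd to reread its unit files and recompute dependencies.
systemctl daemon-reload

# Verify the ordering dependencies it ended up with.
systemctl show -p After,Before rpcbind.service
```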
This message is a reminder that Fedora 16 is nearing its end of life. Approximately four weeks from now Fedora will stop maintaining and issuing updates for Fedora 16. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '16'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 16's end of life.

Bug Reporter: Thank you for reporting this issue, and we are sorry that we may not be able to fix it before Fedora 16 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to click on "Clone This Bug" and open it against that version of Fedora.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete. The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Fedora 16 changed to end-of-life (EOL) status on 2013-02-12. Fedora 16 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result, we are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora, please feel free to reopen this bug against that version. Thank you for reporting this bug, and we are sorry it could not be fixed.