Bug 983915
| Summary: | rpc.statd (nfs-lock.service) fails to properly start causing NFS mounts at boot to fail | | |
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Jason <teradon> |
| Component: | nfs-utils | Assignee: | Steve Dickson <steved> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 19 | CC: | arancox, aschorr, bcostescu, bfields, chupaka, Colin.Simpson, iand, jeff, johannbg, lnykryn, mattia.verga, mschmidt, msekleta, plautrba, steeve.mccauley, steved, systemd-maint, vpavlin, zbyszek |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2015-01-17 14:01:36 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description
Jason
2013-07-12 09:27:33 UTC
Note: In testing I used a Fedora 18 install, a Fedora 18 -> Fedora 19 upgrade via fedup followed by a full upgrade (distro-sync), and a clean Fedora 19 install with a full "yum update" on it. The Fedora 18 install was for comparison, since the NFS mounts worked reliably there; the rpc.statd and nfs-lock.service issue (many instances being started) appears to have existed there as well, but the mounts waited long enough that rpc.statd was always running and responsive.

To test, I created an NFS export allowing sub-directory mounts, made directories 1-40, and mounted both the top level and directories 1-40 (41 NFS mounts in total). Mount options in fstab: nfs tcp,vers=3,hard,bg 0 0

On a clean install, after updating, rebooting, and adding the mounts, "mount -a -t nfs" works as expected: exactly one rpc.statd is started and all mounts appear. On reboot, however, systemd runs a mount command for each mount in parallel and tries to start 41 rpc.statd instances. Some fail, some start late enough to see one already running, and anywhere from 4 to 24 or so actually start and show up in "ps -ef | grep stat".

After "systemctl enable nfs-lock.service" and "systemctl enable nfs.target" and a reboot, the logs show the system come up, reach the network, start the mount process and 41 rpc.statd instances, then do the NFS startup and start the NFS file locking rpc.statd; only then does it wait for nfs-lock.service before finishing the actual mount commands (even though rpc.statd was already started). As a result, nfs-lock.service sometimes works and goes active (its rpc.statd becomes one of the processes that survives in memory) and sometimes fails ("rpc.statd already running" or "failed to create RPC listeners"). Mounts still fail at random: some work, some do not. On each reboot, nfs-lock.service may or may not work, the set of mounts that succeed changes, and almost always several fail.
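For reference, a complete fstab line using the options quoted above might look like the following; the server name and export path are placeholders, since the report does not include them:

```
# Hypothetical /etc/fstab entry; "nfsserver:/export/loki/1" stands in
# for the real (unspecified) server and export path from the report.
nfsserver:/export/loki/1  /loki/1  nfs  tcp,vers=3,hard,bg  0  0
```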
Tried "systemctl enable NetworkManager-dispatcher.service" and "systemctl enable NetworkManager-wait-online.service" just in case (wait-online already seemed to be in use even though it was disabled), but there was no change. Changing the mount option from "bg" to "fg" in this test setup did not change the results either. For some reason, in my Fedora 18 -> Fedora 19 fedup scenario this made the mounts reliably work, although the rpc.statd and nfs-lock.service issue still remained. Using a fresh, fully upgraded Fedora 19 install in this setup did not change the results.

nfs-utils-1.2.8-3.0.fc19 has been submitted as an update for Fedora 19. https://admin.fedoraproject.org/updates/nfs-utils-1.2.8-3.0.fc19

Package nfs-utils-1.2.8-3.0.fc19:
* should fix your issue,
* was pushed to the Fedora 19 testing repository,
* should be available at your local mirror within two days.

Update it with:
# su -c 'yum update --enablerepo=updates-testing nfs-utils-1.2.8-3.0.fc19'
as soon as you are able to. Please go to the following url: https://admin.fedoraproject.org/updates/FEDORA-2013-13591/nfs-utils-1.2.8-3.0.fc19 then log in and leave karma (feedback).

Well, I went back to a fresh Fedora 19 install, fully upgraded, made 41 NFS mounts, and then ran ("systemctl enable NetworkManager-dispatcher.service" had been done on this box): yum update --enablerepo=updates-testing nfs-utils-1.2.8-3.0.fc19 and rebooted. Although I did not expect this to work (the bug is really two parts if approached this way: the units/spec file still seem to be misbehaving, as noted above and noted again below), I did expect to see nfs.target and nfs-lock.service enabled...?
Seems they were still disabled for me, and this is a fresh install, freshly updated, running the given yum command to upgrade to nfs-utils-1.2.8-3.0:

====
[root@localhost ~]# rpm -q nfs-utils
nfs-utils-1.2.8-3.0.fc19.x86_64
[root@localhost ~]# systemctl status nfs.target
nfs.target - Network File System Server
   Loaded: loaded (/usr/lib/systemd/system/nfs.target; disabled)
   Active: inactive (dead)
[root@localhost ~]# systemctl status nfs-lock.service
nfs-lock.service - NFS file locking service.
   Loaded: loaded (/usr/lib/systemd/system/nfs-lock.service; disabled)
   Active: inactive (dead)
====

Well, let's enable both and reboot again anyway. This should bring us to the state we expected to be in, based on my understanding.

====
[root@localhost ~]# systemctl enable nfs-lock.service
ln -s '/usr/lib/systemd/system/nfs-lock.service' '/etc/systemd/system/nfs.target.wants/nfs-lock.service'
[root@localhost ~]# systemctl enable nfs.target
ln -s '/usr/lib/systemd/system/nfs.target' '/etc/systemd/system/multi-user.target.wants/nfs.target'
[root@localhost ~]# systemctl status nfs.target
nfs.target - Network File System Server
   Loaded: loaded (/usr/lib/systemd/system/nfs.target; enabled)
   Active: inactive (dead)
[root@localhost ~]# systemctl status nfs-lock.service
nfs-lock.service - NFS file locking service.
   Loaded: loaded (/usr/lib/systemd/system/nfs-lock.service; enabled)
   Active: inactive (dead)
====

After reboot:

====
[root@localhost ~]# ps -ef | grep statd
rpcuser 877 1 0 00:58 ? 00:00:00 rpc.statd --no-notify
rpcuser 879 1 0 00:58 ? 00:00:00 rpc.statd --no-notify
rpcuser 880 1 0 00:58 ? 00:00:00 rpc.statd --no-notify
rpcuser 881 1 0 00:58 ? 00:00:00 rpc.statd --no-notify
rpcuser 883 1 0 00:58 ? 00:00:00 rpc.statd --no-notify
rpcuser 884 1 0 00:58 ? 00:00:00 rpc.statd --no-notify
rpcuser 885 1 0 00:58 ? 00:00:00 rpc.statd --no-notify
rpcuser 889 1 0 00:58 ? 00:00:00 rpc.statd --no-notify
rpcuser 891 1 0 00:58 ? 00:00:00 rpc.statd --no-notify
rpcuser 907 1 0 00:58 ? 00:00:00 rpc.statd --no-notify
rpcuser 908 1 0 00:58 ? 00:00:00 rpc.statd --no-notify
rpcuser 909 1 0 00:58 ? 00:00:00 rpc.statd --no-notify
rpcuser 910 1 0 00:58 ? 00:00:00 rpc.statd --no-notify
rpcuser 911 1 0 00:58 ? 00:00:00 rpc.statd --no-notify
rpcuser 913 1 0 00:58 ? 00:00:00 rpc.statd --no-notify
rpcuser 915 1 0 00:58 ? 00:00:00 rpc.statd --no-notify
rpcuser 919 1 0 00:58 ? 00:00:00 rpc.statd --no-notify
rpcuser 921 1 0 00:58 ? 00:00:00 rpc.statd --no-notify
rpcuser 926 1 0 00:58 ? 00:00:00 rpc.statd --no-notify
rpcuser 927 1 0 00:58 ? 00:00:00 rpc.statd --no-notify
root 1343 1275 0 01:03 pts/0 00:00:00 grep --color=auto statd
[root@localhost ~]# systemctl status nfs.target
nfs.target - Network File System Server
   Loaded: loaded (/usr/lib/systemd/system/nfs.target; enabled)
   Active: active since Fri 2013-07-26 00:58:40 EDT; 3min 26s ago
Jul 26 00:58:40 localhost.localdomain systemd[1]: Starting Network File System Server.
Jul 26 00:58:40 localhost.localdomain systemd[1]: Reached target Network File System Server.
[root@localhost ~]# systemctl status nfs-lock.service
nfs-lock.service - NFS file locking service.
   Loaded: loaded (/usr/lib/systemd/system/nfs-lock.service; enabled)
   Active: failed (Result: exit-code) since Fri 2013-07-26 00:58:40 EDT; 3min 38s ago
  Process: 900 ExecStart=/sbin/rpc.statd $STATDARG (code=exited, status=1/FAILURE)
  Process: 897 ExecStartPre=/usr/lib/nfs-utils/scripts/nfs-lock.preconfig (code=exited, status=0/SUCCESS)
Jul 26 00:58:33 localhost.localdomain rpc.statd[928]: Version 1.2.8 starting
Jul 26 00:58:33 localhost.localdomain sm-notify[929]: Version 1.2.8 starting
Jul 26 00:58:40 localhost.localdomain systemd[1]: nfs-lock.service: control process exited, code=exited status=1
Jul 26 00:58:40 localhost.localdomain systemd[1]: Failed to start NFS file locking service..
Jul 26 00:58:40 localhost.localdomain systemd[1]: Unit nfs-lock.service entered failed state.
====

And I'm back to the original issue: rpc.statd / nfs-lock.service fails when it is enabled AND any mount in fstab is set to mount at boot.

Note: This is somewhat random; it sometimes does not fail. It seems to be a timing issue, decided by which process wins the race across cores and time slices. The root cause produces several symptoms: the nfs-lock.service unit randomly fails (although rpc.statd is actually running, just not under systemd's control), NFS mounts set to mount at boot can randomly fail (the more mounts defined, the more likely; this is an actual failure), multiple rpc.statd daemons end up running, and the pid file may show no pid even though rpc.statd is running.

Looking at the unit files, nfs-lock.service appears to be defined to always start before remote-fs. Yet the logs show remote-fs always beginning first and starting the rpc.statd instances, and only THEN does nfs-lock.service start. This seems to be what gets us into trouble, is why I opened this bug, and is what causes my mounts to fail. remote-fs (because of the NFS mounts) really needs to wait for nfs-lock.service and not start any rpc.statd of its own. I could be interpreting something wrong, but the unit files/spec file do not seem to be behaving properly here.

In addition, I noted that nfs-lock.service is now disabled by default, and this is also a problem. If you have any NFS mounts, nfs-lock.service and thus nfs.target need to be enabled, or you will get the same problem: systemd will try to mount every single NFS mount in parallel, and as soon as you have more than one, duplicate rpc.statd instances get started when only one should be.
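A quick way to check how many statd instances survived a boot is to count them in the process listing. A small diagnostic sketch (the helper name `count_statd` is my own; the sample lines are taken from the listing above):

```shell
#!/bin/sh
# Diagnostic sketch: count rpc.statd instances in a "ps -ef" listing.
# The [r] in the pattern keeps the grep process itself from matching.
count_statd() {
    grep -c '[r]pc\.statd'
}

# Example against a captured listing (two daemons plus the grep itself):
printf '%s\n' \
    'rpcuser 877 1 0 00:58 ? 00:00:00 rpc.statd --no-notify' \
    'rpcuser 879 1 0 00:58 ? 00:00:00 rpc.statd --no-notify' \
    'root 1343 1275 0 01:03 pts/0 00:00:00 grep --color=auto statd' \
    | count_statd    # prints 2
```

On a live system this would be run as `ps -ef | count_statd`; a healthy boot should report exactly 1.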
The changes in nfs-utils-1.2.8-3.0.fc19 from Bug 970595 to enable them by default should have taken care of that portion. (As noted originally, in Fedora 18 nfs-lock.service was enabled by default under the multi-user target; Fedora 19 got into trouble by moving it under nfs.target, which is disabled by default. nfs-lock.service itself is also disabled by default, which is fine until you start adding NFS client mounts.)

Every boot means a random chance of NFS mount failures and of nfs-lock.service failing, because each and every NFS client mount starts its own rpc.statd instance before nfs-lock.service does. nfs-lock.service fails if its rpc.statd cannot start due to conflicts with the rpc.statd instances from the client mounts. A mount fails if it starts too soon for a running rpc.statd pid to exist, yet late enough that its own rpc.statd fails due to a conflicting rpc.statd from another client mount. Mounts before and after that window work.

====
/var/log/messages
Jul 26 00:57:50 localhost kernel: [ 0.000000] Initializing cgroup subsys cpuset
Jul 26 00:57:50 localhost kernel: [ 0.000000] Initializing cgroup subsys cpu
Jul 26 00:57:50 localhost systemd[1]: systemd 204 running in system mode. (+PAM +LIBWRAP +AUDIT +SELINUX +IMA +SYSVINIT +LIBCRYPTSETUP +GCRYPT +ACL +XZ)
Jul 26 00:57:50 localhost kernel: [ 0.000000] Linux version 3.9.9-302.fc19.x86_64 (mockbuild.fedoraproject.org) (gcc version 4.8.1 20130603 (Red Hat 4.8.1-1) (GCC) ) #1 SMP Sat Jul 6 13:41:07 UTC 2013
Jul 26 00:57:50 localhost kernel: [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-3.9.9-302.fc19.x86_64 root=UUID=ab1eb936-0ece-49a1-a0d0-3d1c2c4459d9 ro rd.md=0 rd.lvm=0 rd.dm=0 vconsole.keymap=us rd.luks=0 vconsole.font=latarcyrheb-sun16 LANG=en_US.UTF-8 divider=10
... skipping ...
Jul 26 00:58:31 localhost systemd[1]: Started Network Manager Wait Online.
Jul 26 00:58:31 localhost systemd[1]: Starting Network is Online.
Jul 26 00:58:31 localhost systemd[1]: Reached target Network is Online.
Jul 26 00:58:31 localhost systemd[1]: Starting Network.
Jul 26 00:58:31 localhost systemd[1]: Reached target Network.
Jul 26 00:58:31 localhost systemd[1]: Starting Sendmail Mail Transport Agent...
Jul 26 00:58:31 localhost systemd[1]: Starting OpenSSH server daemon...
Jul 26 00:58:31 localhost NetworkManager[348]: <info> Activation (p2p1) successful, device activated.
Jul 26 00:58:31 localhost NetworkManager[348]: <info> Activation (p2p1) Stage 4 of 5 (IPv6 Configure Timeout) started...
Jul 26 00:58:31 localhost NetworkManager[348]: <info> Activation (p2p1) Stage 4 of 5 (IPv6 Configure Timeout) complete.
Jul 26 00:58:31 localhost dbus-daemon[303]: dbus[303]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service'
Jul 26 00:58:31 localhost dbus[303]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service'
Jul 26 00:58:31 localhost systemd[1]: Started Login and scanning of iSCSI devices.
Jul 26 00:58:31 localhost systemd[1]: Mounting /loki/40...
Jul 26 00:58:31 localhost systemd[1]: Mounting /loki/39...
Jul 26 00:58:31 localhost systemd[1]: Mounting /loki/38...
Jul 26 00:58:31 localhost systemd[1]: Mounting /loki/37...
Jul 26 00:58:31 localhost systemd[1]: Mounting /loki/36...
Jul 26 00:58:31 localhost systemd[1]: Mounting /loki/35...
Jul 26 00:58:31 localhost systemd[1]: Mounting /loki/34...
Jul 26 00:58:31 localhost systemd[1]: Mounting /loki/33...
Jul 26 00:58:31 localhost systemd[1]: Mounting /loki/32...
Jul 26 00:58:31 localhost systemd[1]: Mounting /loki/31...
Jul 26 00:58:31 localhost systemd[1]: Mounting /loki/30...
Jul 26 00:58:31 localhost systemd[1]: Mounting /loki/29...
Jul 26 00:58:31 localhost systemd[1]: Mounting /loki/28...
Jul 26 00:58:31 localhost systemd[1]: Mounting /loki/27...
Jul 26 00:58:31 localhost systemd[1]: Mounting /loki/26...
Jul 26 00:58:31 localhost systemd[1]: Mounting /loki/25...
Jul 26 00:58:31 localhost systemd[1]: Mounting /loki/24...
Jul 26 00:58:31 localhost systemd[1]: Mounting /loki/23...
Jul 26 00:58:31 localhost systemd[1]: Mounting /loki/22...
Jul 26 00:58:31 localhost systemd[1]: Mounting /loki/21...
Jul 26 00:58:32 localhost systemd[1]: Mounting /loki/20...
Jul 26 00:58:32 localhost systemd[1]: Mounting /loki/19...
Jul 26 00:58:32 localhost systemd[1]: Mounting /loki/18...
Jul 26 00:58:32 localhost systemd[1]: Mounting /loki/17...
Jul 26 00:58:32 localhost systemd[1]: Mounting /loki/16...
Jul 26 00:58:32 localhost systemd[1]: Mounting /loki/15...
Jul 26 00:58:32 localhost systemd[1]: Mounting /loki/14...
Jul 26 00:58:32 localhost systemd[1]: Mounting /loki/13...
Jul 26 00:58:32 localhost systemd[1]: Mounting /loki/12...
Jul 26 00:58:32 localhost systemd[1]: Mounting /loki/11...
Jul 26 00:58:32 localhost systemd[1]: Mounting /loki/10...
Jul 26 00:58:32 localhost systemd[1]: Mounting /loki/9...
Jul 26 00:58:32 localhost systemd[1]: Mounting /loki/8...
Jul 26 00:58:32 localhost systemd[1]: Mounting /loki/7...
Jul 26 00:58:32 localhost systemd[1]: Mounting /loki/6...
Jul 26 00:58:32 localhost systemd[1]: Mounting /loki/5...
Jul 26 00:58:32 localhost systemd[1]: Mounting /loki/4...
Jul 26 00:58:32 localhost systemd[1]: Mounting /loki/3...
Jul 26 00:58:32 localhost systemd[1]: Mounting /loki/2...
Jul 26 00:58:32 localhost systemd[1]: Mounting /loki/1...
Jul 26 00:58:32 localhost systemd[1]: Starting RPC bind service...
Jul 26 00:58:32 localhost systemd[1]: Mounting /loki/tmp...
Jul 26 00:58:32 localhost systemd[1]: Started OpenSSH server daemon.
Jul 26 00:58:32 localhost systemd[1]: Starting Network Manager Script Dispatcher Service...
Jul 26 00:58:32 localhost rpc.statd[877]: Version 1.2.8 starting
Jul 26 00:58:32 localhost dbus-daemon[303]: dbus[303]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Jul 26 00:58:32 localhost dbus[303]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Jul 26 00:58:32 localhost systemd[1]: Started Network Manager Script Dispatcher Service.
Jul 26 00:58:32 localhost rpc.statd[883]: Version 1.2.8 starting
Jul 26 00:58:32 localhost rpc.statd[879]: Version 1.2.8 starting
Jul 26 00:58:32 localhost rpc.statd[884]: Version 1.2.8 starting
Jul 26 00:58:32 localhost rpc.statd[881]: Version 1.2.8 starting
Jul 26 00:58:32 localhost rpc.statd[877]: Flags: TI-RPC
Jul 26 00:58:32 localhost rpc.statd[881]: Flags: TI-RPC
Jul 26 00:58:32 localhost rpc.statd[879]: Flags: TI-RPC
Jul 26 00:58:32 localhost rpc.statd[885]: Version 1.2.8 starting
Jul 26 00:58:32 localhost rpc.statd[885]: Flags: TI-RPC
Jul 26 00:58:32 localhost rpc.statd[880]: Version 1.2.8 starting
Jul 26 00:58:32 localhost rpc.statd[880]: Flags: TI-RPC
Jul 26 00:58:32 localhost rpc.statd[891]: Version 1.2.8 starting
Jul 26 00:58:32 localhost rpc.statd[891]: Flags: TI-RPC
Jul 26 00:58:32 localhost rpc.statd[889]: Version 1.2.8 starting
Jul 26 00:58:32 localhost rpc.statd[889]: Flags: TI-RPC
Jul 26 00:58:32 localhost rpc.statd[883]: Flags: TI-RPC
Jul 26 00:58:32 localhost rpc.statd[884]: Flags: TI-RPC
Jul 26 00:58:32 localhost systemd[1]: Started RPC bind service.
Jul 26 00:58:32 localhost systemd[1]: Starting NFS file locking service....
Jul 26 00:58:33 localhost rpc.statd[903]: Version 1.2.8 starting
Jul 26 00:58:33 localhost rpc.statd[903]: Flags: TI-RPC
Jul 26 00:58:33 localhost rpc.statd[906]: Version 1.2.8 starting
Jul 26 00:58:33 localhost rpc.statd[906]: Flags: TI-RPC
... skipping ...
Jul 26 00:58:33 localhost sm-notify[929]: Version 1.2.8 starting
Jul 26 00:58:33 localhost systemd[1]: Stopping Sendmail Mail Transport Agent...
Jul 26 00:58:33 localhost systemd[1]: Starting Sendmail Mail Transport Agent...
Jul 26 00:58:33 localhost mount[813]: Statd service already running!
Jul 26 00:58:33 localhost mount[747]: Statd service already running!
Jul 26 00:58:33 localhost mount[771]: Statd service already running!
Jul 26 00:58:33 localhost mount[824]: Statd service already running!
Jul 26 00:58:33 localhost mount[810]: Statd service already running!
Jul 26 00:58:33 localhost systemd[1]: PID file /run/sendmail.pid not readable (yet?) after start.
Jul 26 00:58:33 localhost systemd[1]: Started Sendmail Mail Transport Agent.
Jul 26 00:58:33 localhost systemd[1]: Starting Sendmail Mail Transport Client...
Jul 26 00:58:33 localhost mount[792]: Statd service already running!
Jul 26 00:58:33 localhost mount[786]: mount.nfs: rpc.statd is not running but is required for remote locking.
Jul 26 00:58:33 localhost mount[786]: mount.nfs: Either use '-o nolock' to keep locks local, or start statd.
Jul 26 00:58:33 localhost mount[786]: mount.nfs: an incorrect mount option was specified
Jul 26 00:58:33 localhost systemd[1]: loki-21.mount mount process exited, code=exited status=32
Jul 26 00:58:33 localhost systemd[1]: Failed to mount /loki/21.
Jul 26 00:58:33 localhost systemd[1]: Dependency failed for Remote File Systems.
Jul 26 00:58:33 localhost systemd[1]: Unit loki-21.mount entered failed state.
Jul 26 00:58:33 localhost systemd[1]: Starting Trigger Flushing of Journal to Persistent Storage...
Jul 26 00:58:33 localhost systemd[1]: loki-34.mount mount process exited, code=exited status=32
Jul 26 00:58:33 localhost systemd[1]: Failed to mount /loki/34.
Jul 26 00:58:33 localhost systemd[1]: Unit loki-34.mount entered failed state.
Jul 26 00:58:33 localhost systemd[1]: loki-17.mount mount process exited, code=exited status=32
Jul 26 00:58:33 localhost systemd[1]: Failed to mount /loki/17.
Jul 26 00:58:33 localhost systemd[1]: loki-15.mount mount process exited, code=exited status=32
... skipping ...
Jul 26 00:58:33 localhost systemd[1]: loki-29.mount mount process exited, code=exited status=32
Jul 26 00:58:33 localhost mount[798]: mount.nfs: rpc.statd is not running but is required for remote locking.
Jul 26 00:58:33 localhost mount[798]: mount.nfs: Either use '-o nolock' to keep locks local, or start statd.
Jul 26 00:58:33 localhost mount[798]: mount.nfs: an incorrect mount option was specified
Jul 26 00:58:33 localhost mount[802]: mount.nfs: rpc.statd is not running but is required for remote locking.
Jul 26 00:58:33 localhost mount[802]: mount.nfs: Either use '-o nolock' to keep locks local, or start statd.
Jul 26 00:58:33 localhost mount[802]: mount.nfs: an incorrect mount option was specified
Jul 26 00:58:34 localhost kernel: [ 63.148622] FS-Cache: Loaded
Jul 26 00:58:34 localhost kernel: [ 63.393935] Key type dns_resolver registered
Jul 26 00:58:34 localhost kernel: [ 63.680649] FS-Cache: Netfs 'nfs' registered for caching
Jul 26 00:58:40 localhost mount[781]: Statd service already running!
...
Jul 26 00:58:40 localhost mount[769]: mount.nfs: Either use '-o nolock' to keep locks local, or start statd.
Jul 26 00:58:40 localhost mount[769]: mount.nfs: an incorrect mount option was specified
Jul 26 00:58:40 localhost systemd[1]: Failed to mount /loki/29.
Jul 26 00:58:40 localhost systemd[1]: Unit loki-29.mount entered failed state.
Jul 26 00:58:40 localhost systemd[1]: loki-23.mount mount process exited, code=exited status=32
Jul 26 00:58:40 localhost systemd[1]: Failed to mount /loki/23.
Jul 26 00:58:40 localhost systemd[1]: Unit loki-23.mount entered failed state.
Jul 26 00:58:40 localhost systemd[1]: Started Permit User Sessions.
Jul 26 00:58:40 localhost systemd[1]: Starting Command Scheduler...
Jul 26 00:58:40 localhost systemd[1]: Started Command Scheduler.
Jul 26 00:58:40 localhost systemd[1]: Starting Job spooling tools...
Jul 26 00:58:40 localhost systemd[1]: Started Job spooling tools.
Jul 26 00:58:40 localhost systemd[1]: Starting Terminate Plymouth Boot Screen...
Jul 26 00:58:40 localhost mount[779]: mount.nfs: rpc.statd is not running but is required for remote locking.
Jul 26 00:58:40 localhost mount[779]: mount.nfs: Either use '-o nolock' to keep locks local, or start statd.
Jul 26 00:58:40 localhost mount[779]: mount.nfs: an incorrect mount option was specified
Jul 26 00:58:40 localhost rpc.statd[912]: failed to create RPC listeners, exiting
Jul 26 00:58:40 localhost rpc.statd[906]: failed to create RPC listeners, exiting
...
Jul 26 00:58:40 localhost mount[783]: mount.nfs: an incorrect mount option was specified
Jul 26 00:58:40 localhost systemd[1]: Started Sendmail Mail Transport Client.
Jul 26 00:58:40 localhost systemd[1]: nfs-lock.service: control process exited, code=exited status=1
Jul 26 00:58:40 localhost systemd[1]: Failed to start NFS file locking service..
Jul 26 00:58:40 localhost systemd[1]: Unit nfs-lock.service entered failed state.
Jul 26 00:58:40 localhost systemd[1]: Starting Network File System Server.
Jul 26 00:58:40 localhost systemd[1]: Reached target Network File System Server.
Jul 26 00:58:40 localhost systemd[1]: Starting Multi-User System.
Jul 26 00:58:40 localhost systemd[1]: Reached target Multi-User System.
Jul 26 00:58:40 localhost systemd[1]: Starting Update UTMP about System Runlevel Changes...
Jul 26 00:58:40 localhost systemd[1]: Starting Stop Read-Ahead Data Collection 10s After Completed Startup.
Jul 26 00:58:40 localhost systemd[1]: Started Stop Read-Ahead Data Collection 10s After Completed Startup.
Jul 26 00:58:40 localhost systemd[1]: Mounted /loki/5.
...
...
====

Note: Even with nfs.target and nfs-lock.service enabled, this still seems to be an issue (start-up order, and remote-fs not waiting for nfs-lock.service), based on /var/log/messages:

...
Jul 26 00:58:31 localhost systemd[1]: Started Network Manager Wait Online.
Jul 26 00:58:31 localhost systemd[1]: Starting Network is Online.
Jul 26 00:58:31 localhost systemd[1]: Reached target Network is Online.
Jul 26 00:58:31 localhost systemd[1]: Starting Network.
Jul 26 00:58:31 localhost systemd[1]: Reached target Network.
...
Jul 26 00:58:31 localhost systemd[1]: Mounting /loki/40...
...    <---- Note: Why are we starting mounts here? NFS isn't on yet.
Jul 26 00:58:32 localhost systemd[1]: Mounting /loki/1...
Jul 26 00:58:32 localhost systemd[1]: Starting RPC bind service...
Jul 26 00:58:32 localhost systemd[1]: Mounting /loki/tmp...
Jul 26 00:58:32 localhost systemd[1]: Started OpenSSH server daemon.
Jul 26 00:58:32 localhost systemd[1]: Starting Network Manager Script Dispatcher Service...
Jul 26 00:58:32 localhost rpc.statd[877]: Version 1.2.8 starting
...
Jul 26 00:58:32 localhost rpc.statd[881]: Version 1.2.8 starting
Jul 26 00:58:32 localhost rpc.statd[877]: Flags: TI-RPC
...    <--- From the NFS mounts in fstab we see concurrent rpc.statd instances
...    <--- being started... again, why? NFS isn't on yet; specifically,
...    <--- nfs-lock.service hasn't even tried to start yet, and it is what
...    <--- 'should' be bringing rpc.statd online and preventing these from
...    <--- even trying to start.
Jul 26 00:58:32 localhost rpc.statd[884]: Flags: TI-RPC
Jul 26 00:58:32 localhost systemd[1]: Started RPC bind service.
Jul 26 00:58:32 localhost systemd[1]: Starting NFS file locking service....
Jul 26 00:58:33 localhost rpc.statd[903]: Version 1.2.8 starting

Note: "Starting NFS file locking service..." shows up far too late; by then rpc.statd has already been started (or has tried to start) countless times, which causes the nfs-lock.service unit to fail and the mount flakiness. If you force nfs-lock.service to come online before remote-fs, and remote-fs to wait for it, the issue never happens and only one rpc.statd starts. From the unit files, this appears to be what is defined, but it does not look like what is actually happening. Hopefully that clarifies what I'm seeing. If you need anything specific, let me know. So far it has been incredibly reliable to reproduce.
nfs-utils-1.2.8-3.0.fc19 has been pushed to the Fedora 19 stable repository. If problems still persist, please make note of it in this bug report.

(In reply to Fedora Update System from comment #5)
> nfs-utils-1.2.8-3.0.fc19 has been pushed to the Fedora 19 stable repository.
> If problems still persist, please make note of it in this bug report.

Please review my comment #4, where I already reviewed the package nfs-utils-1.2.8-3.0.fc19; it does not address this bug. Looking at the spec file diff, it only enables nfs.target and nfs-lock.service on install and does a reenable on upgrade (but only if already enabled, which is appropriate). So on a fresh Fedora 19, where these are disabled by default, they would still be disabled after the yum update, which at least explains why there was no change.

This still leaves the root of this bug: rpc.statd / nfs-lock.service does not start up properly, and nfs-lock.service can be seen failing as in the Description and comment #4. This can cause NFS mounts at boot to fail, as well as other symptoms such as multiple rpc.statd instances showing up in the process list.

I noticed that my Fedora 18 box, which I had restored from a backup because of this issue, suddenly started hitting this same issue after a recent yum update, where it had no issues before. Based on /var/log/messages it shows the same behavior: mounts start before the NFS file locking service, so each mount starts its own rpc.statd before NFS file locking can start it. Oddly, I still seem to have a lower chance of problems under F18, while under F19 failure remains certain. The mounts do sometimes fail, but not always; if I reboot a few times, everything eventually mounts fine on its own, although I still have twenty rpc.statd instances in my process list and a bunch of "rpc.statd already running" errors in my logs, because NFS file locking started after the mount attempts.
The same fix seems to apply: if you manually tweak things to force a single rpc.statd to start early in the boot cycle, everything works properly (for example, getting nfs-lock.service to start early and making sure rpc.statd is started before the NFS client mounts start). As it stands now, nfs-lock.service together with an NFS client mount in fstab seems to cause nfs-lock.service to fail ("rpc.statd already running") in Fedora 19.

I have experienced what appears to be the same problem, with rpc.statd getting started more than once. I notice that the nfs-lock.service file says "Before=remote-fs-pre.target", but remote-fs-pre.target is not actually being activated. I patched remote-fs.target by adding "Requires=remote-fs-pre.target", and that fixed the problem for me. It causes nfs-lock.service to start before remote-fs-pre.target; then the NFS mounts occur, and finally remote-fs.target is reached. So I think the problem was simply that remote-fs-pre.target was not being activated.

Created attachment 795785 [details]
remote-fs.target should require that remote-fs-pre.target start before it
This fixes the problem for me -- it activates remote-fs-pre.target, so nfs-lock.service will get started before that and before all NFS mounts.
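The attachment itself is not inlined in this report, but based on the description the change amounts to something like the following in /usr/lib/systemd/system/remote-fs.target. Only the added Requires= line is the fix being described; the surrounding directives are illustrative of a stock remote-fs.target, not copied from the attachment:

```ini
[Unit]
Description=Remote File Systems
Documentation=man:systemd.special(7)
After=remote-fs-pre.target
# Added line (attachment 795785): actually pull remote-fs-pre.target into
# the boot transaction, so the Before=/After= ordering on it takes effect.
Requires=remote-fs-pre.target
```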
(In reply to Andrew J. Schorr from comment #9)
> Created attachment 795785 [details]
> remote-fs.target should require that remote-fs-pre.target start before it
>
> This fixes the problem for me -- it activate the remote-fs-pre.target, so
> the nfs-lock.service will get started before that and before all NFS mounts.

I can verify this. I was experiencing the same issue, where remote-fs-pre.target did not seem to be working; I had not noticed that it was not getting activated, but after checking, it was indeed inactive, even though from looking at the units I was under the impression it should be activated. I applied "Requires=remote-fs-pre.target" to remote-fs.target as suggested, using the supplied diff (attachment 795785 [details]), and rebooted. remote-fs-pre.target activated this time and worked as I expected it should have since I created this bug: nfs-lock.service started before any mounts, and the mounts ran after it [rpc.statd] was running, with no issues or mount failures. I tried this in my test setup, which previously had a 100% failure rate, without any problem. This seems to be the appropriate fix [along with the changes already made in nfs-utils-1.2.8-3.0.fc19 to make nfs-lock.service and nfs.target enabled by default]. Thanks for catching that - works for me.

I had similar problems with F19 which did not appear in earlier releases up to F17 (I did not use F18). The number of NFS mounts on the clients was smaller (4-5), but almost always one of them would fail at boot. Systems are up to date, now using nfs-utils-1.2.8-4.0.fc19.x86_64. The patch supplied in comment 9 seems to fix the problem for me too. Please release it as a fix.

Yes, I do believe that my patch works. However, I am not sure it solves the problem in the "correct" way. There appears to be some logic in src/core/mount.c that attempts to make this all happen automatically.
It could be that the proper fix is to tweak the logic in src/core/mount.c:mount_add_default_dependencies, but I'm not sufficiently familiar with systemd's internals to provide such a patch. For example, this old (and subsequently obsoleted) git commit was attempting to address such issues, I think:

http://cgit.freedesktop.org/systemd/systemd/commit/?id=db1355b1c181a4b1ac277064918d7c794dfb6edd

But there have clearly been a lot of changes to the code since then. It looks as if somebody previously tried a patch similar to mine here:

http://cgit.freedesktop.org/systemd/systemd/commit/?id=6bde0b3220e95a08cefb70846f73b2cf24b7734a

But it was reverted here:

http://cgit.freedesktop.org/systemd/systemd/commit/?id=a69fe051b85865cad5df74f9f58915e36c4b1418

So this seems to be an area of conceptual difficulty. I hope somebody can find the "right" fix. Until then, my patch in comment 9 should work.

Reassigning to systemd, since that is the owner of /usr/lib/systemd/system/remote-fs.target.

If I understand correctly, the issue is that nothing pulled remote-fs-pre.target into the boot transaction, so all the ordering related to that unit did not come into effect. The patch in comment #9 helps, but it is not the proper way to fix this.

(In reply to Andrew J. Schorr from comment #12)
[...]
> But it was reverted here:
>
> http://cgit.freedesktop.org/systemd/systemd/commit/
> ?id=a69fe051b85865cad5df74f9f58915e36c4b1418

Right. Let's paste the relevant commit message here:

We should not pull in remote-fs-pre.target unconditionally. It's supposed to be pulled in by the implementors of it, rather then its users.

The "implementor" in this case is nfs-lock.service (it is a service that implements functionality to allow remote mounts to work). It should be nfs-lock.service that pulls remote-fs-pre.target in (using "Wants=remote-fs-pre.target"), not remote-fs.target. See also "man systemd.special", where remote-fs-pre.target is described.

Created attachment 800040 [details]
nfs-lock.service should pull in remote-fs-pre.target
Patched nfs-lock.service; if I understood correctly, the [Unit] section should Want remote-fs-pre.target. I can verify that without this patch things break as I originally submitted in the bug (and remote-fs-pre.target does not activate), but after applying the patch and rebooting, things work as expected (and remote-fs-pre.target is activated). Based on my understanding now, along with the previous update in this bug, this seems to be the appropriate fix.
The patch adds the following to /usr/lib/systemd/system/nfs-lock.service, under the '[Unit]' section:
Wants=remote-fs-pre.target
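The same change can be tried without patching the packaged unit file by using a systemd drop-in, which survives package updates. This is a sketch: the drop-in file name is illustrative, the Before= line duplicates what the stock unit is said to already carry (included here for completeness), and UNIT_DIR defaults to a local directory so the sketch can be tried safely; on a real system the drop-in directory is /etc/systemd/system/nfs-lock.service.d/.

```shell
# Sketch: install the Wants=/Before= dependency on remote-fs-pre.target
# as a systemd drop-in rather than editing the packaged nfs-lock.service.
# UNIT_DIR defaults to a local directory for demonstration purposes.
UNIT_DIR="${UNIT_DIR:-./nfs-lock.service.d}"
mkdir -p "$UNIT_DIR"
cat > "$UNIT_DIR/remote-fs-pre.conf" <<'EOF'
[Unit]
Wants=remote-fs-pre.target
Before=remote-fs-pre.target
EOF
echo "wrote $UNIT_DIR/remote-fs-pre.conf"
# On a real system, follow up with:
#   systemctl daemon-reload
#   systemctl list-dependencies nfs-lock.service | grep remote-fs-pre
```

The list-dependencies check mirrors the diagnostics discussed later in this bug: with the Wants= in place, remote-fs-pre.target should appear among nfs-lock.service's dependencies.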
Created attachment 800042 [details]
nfs-lock.service should pull in remote-fs-pre.target
Corrected how I generated the diff - the previous one didn't include the proper path.
With the fix suggested in comments #14 and #15, I again got NFS volumes not mounted at boot. So far, the fix in comment #9 seems to actually resolve the issue for me.

That is strange. I think both patches should have an equivalent effect. The patch in comment #16 seems to work for me. If you run "systemctl list-dependencies nfs-lock.service", do you see remote-fs-pre.target in the output? And if you run "systemctl --before list-dependencies nfs-lock.service", do you see it there as well? Also, what do "systemctl status nfs-lock.service" and "systemctl status remote-fs-pre.target" show?

Regards,
Andy

With the change to remote-fs.target (as in comment #9), I don't see remote-fs-pre.target for 'systemctl list-dependencies nfs-lock.service', but I do see it for 'systemctl --before list-dependencies nfs-lock.service'. For both nfs-lock.service and remote-fs-pre.target I see no errors in the status output. With the change to nfs-lock.service (as in comment #14), I see remote-fs-pre.target listed in both cases.

While doing these tests, including multiple reboots, I have observed my initial problem (some NFS volumes not mounted at boot) in both cases. It's the first time I have seen it happen after the change to remote-fs.target, so it's likely that I'm encountering a different issue...

My apologies for taking so long to get back to this... but...

With https://bugzilla.redhat.com/show_bug.cgi?id=786050 I added both

+Wants=remote-fs-pre.target
+Before=remote-fs-pre.target

(see commit e1db7048)

For https://bugzilla.redhat.com/show_bug.cgi?id=817895 I was told the Wants= was "superfluous", so I removed it (see commit 996074322).

Now with comment 14 you're saying I should add the Wants= back??? I'm a bit confused...

Well, the semantics of the targets may have changed a bit. I haven't been paying close attention to upstream development in this regard lately, but I know some new network-related targets have been introduced.
Michal, could you check if the current NFS units have sane dependencies?

Any news on this? I've run into the same problem on F21 and opened a bug before seeing this. I think there are several other possible duplicates, see #1007982 #1046015 #1055201.

*** Bug 1158164 has been marked as a duplicate of this bug. ***

This message is a notice that Fedora 19 is now at end of life. Fedora has stopped maintaining and issuing updates for Fedora 19. It is Fedora's policy to close all bug reports from releases that are no longer maintained. Approximately 4 (four) weeks from now this bug will be closed as EOL if it remains open with a Fedora 'version' of '19'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version.

Thank you for reporting this issue, and we are sorry that we were not able to fix it before Fedora 19 reached end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.

Since this was reported against F19, which is now EOL, I didn't investigate further. However, I did a quick test on F21 and mounting NFS shares on boot seems to work fine.

Later updates to F19 as well as F20 have both been working fine for some time. I believe the problem was solved along the way, although I cannot point out exactly when.

(In reply to Bogdan Costescu from comment #26)
> Later updates to F19 as well as F20 have both been working fine for some
> time. I believe the problem was solved along the way, although I cannot
> point out exactly when.
Due to this comment, I'm going to close this out.