Bug 1158164 - Start of rpc-statd.service is very slow
Summary: Start of rpc-statd.service is very slow
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: rpcbind
Version: 21
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Steve Dickson
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2014-10-28 18:35 UTC by Mattia Verga
Modified: 2014-12-17 15:14 UTC
CC List: 3 users

Fixed In Version: rpcbind-0.2.2-0.0.fc21
Clone Of:
Environment:
Last Closed: 2014-12-06 10:41:41 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
/var/log/messages (172.71 KB, text/plain)
2014-10-28 18:35 UTC, Mattia Verga
Strace output of '/usr/sbin/rpc.statd --no-notify $STATDARGS' (22.45 KB, text/plain)
2014-11-10 17:13 UTC, Mattia Verga
Strace of forked process (400.28 KB, text/plain)
2014-11-10 18:39 UTC, Mattia Verga

Description Mattia Verga 2014-10-28 18:35:29 UTC
Created attachment 951481 [details]
/var/log/messages

I have two directories on my NAS mounted in NFS on my Fedora workstation.

With F19 I had no problem mounting them in /etc/fstab with:

'192.168.1.200:/Mattia/    /mnt/qnap   nfs defaults   0 0'

After formatting and upgrading the workstation to F21, startup is very slow: there is a wait of over 60 seconds just before entering the graphical interface, caused by these NFS mounts.
I tried modifying my fstab entries like this:

'192.168.1.200:/Mattia/    /mnt/qnap   nfs rw   0 0'

or

'192.168.1.200:/Mattia/    /mnt/qnap   nfs rsize=8192,wsize=8192,timeo=14,soft   0 0'

I also tried enabling nfs-client.service, but I always get the delay at startup.
If I comment out the NFS share lines in fstab, the delay disappears.

After booting without mounting the NFS shares, if I re-enable the lines in fstab and manually mount the shares with:
'mount /mnt/qnap'
there is a long delay before the command completes.

Again, I did not have this problem on F19, so it's not NAS related. I'm attaching the output of /var/log/messages: the delay is clearly visible between 19:12:58 and 19:13:58, when I get the message
'rpc.statd[1240]: failed to create RPC listeners, exiting'

With the NFS mounts disabled in fstab, I don't get the RPC listeners error.
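
For reference, a generic way to see which unit is responsible for a boot-time delay like this (these commands are a suggestion, not something from the original report):

# systemd-analyze blame | head
# systemd-analyze critical-chain
# journalctl -b -u rpc-statd.service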

Comment 1 Mattia Verga 2014-11-03 08:22:13 UTC
It seems mount.nfs tries to start rpc-statd.service but fails.
During boot I can see "A start job is running for NFS status monitor for NFSv2/3", which causes a 60-second timeout. After that, rpc-statd.service is in a failed state.

I then tried enabling nfs-server.service, and that way I can see that rpc-statd.service is started correctly, but I still get the timeout on boot. It seems mount.nfs tries to start the service even though it's already running...

Comment 2 Mattia Verga 2014-11-06 18:26:52 UTC

*** This bug has been marked as a duplicate of bug 983915 ***

Comment 3 Mattia Verga 2014-11-09 19:26:32 UTC
I wrongly marked this bug as a duplicate of bug 983915, but it's not.

On a clean F21 installation with no NFS mountpoints in /etc/fstab, simply try to start rpc-statd.service and you will see it takes 60 seconds to start.
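
A minimal way to reproduce and time this on such an installation (commands are a suggestion, not part of the original comment):

# time systemctl start rpc-statd.service
# systemctl status rpc-statd.service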

Comment 4 J. Bruce Fields 2014-11-10 14:17:11 UTC
Is that true even if you start rpc.statd by hand instead of from the unit file?  I wonder if it would be useful in that case to, say, run it under strace, and see if we can figure out what system call it's hanging on.

Or it might be worth watching network traffic in wireshark while rpc.statd is starting--maybe it's waiting for a DNS query or something.
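
A sketch of the kind of tracing suggested here (the exact strace/tcpdump invocations are assumptions, not from the comment):

# strace -f -tt -T -o /tmp/statd.trace /usr/sbin/rpc.statd --no-notify
# tcpdump -i any -w /tmp/statd.pcap port 111 or port 2049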

Comment 5 Mattia Verga 2014-11-10 17:13:15 UTC
Created attachment 955897 [details]
Strace output of '/usr/sbin/rpc.statd --no-notify $STATDARGS'

I'm no expert with strace, but I gave it a try and made a trace (see attached).

The problem seems to be at the end of the trace, where the command "read(3, "\0", 1)" takes 60 seconds to complete.

I have no idea what it means; I hope you find it useful in some way. If you need a wireshark trace too, let me know and I'll figure out how it works...

Comment 6 J. Bruce Fields 2014-11-10 17:39:57 UTC
(In reply to Mattia Verga from comment #5)
> Created attachment 955897 [details]
> Strace output of '/usr/sbin/rpc.statd --no-notify $STATDARGS'
> 
> I'm no expert with strace, but I gave it a try and made a trace (see
> attached).
> 
> The problem seems to be at the end of the trace, where the command "read(3,
> "\0", 1)" takes 60 seconds to complete.

The end of the trace looks like:

  pipe([3, 4])                   = 0 <0.000017>
  clone(...) = 1723 <0.000186>
  close(4)                       = 0 <0.000015>
  read(3, "\0", 1)               = 1 <60.075102>

So I think what happened is it forked a child and just used the read() on its end of the pipe as a way to wait for that child to report that it's ready.

So what would really be interesting to know is what happened in the child.  I think the simplest would be to just add the "--foreground" option to statd to tell it not to do that.

Comment 7 Mattia Verga 2014-11-10 18:39:43 UTC
Created attachment 955925 [details]
Strace of forked process

The --foreground option has no effect: I always get the delay when mounting NFS filesystems (only for the first filesystem I mount; subsequent ones have no delay since statd has already been started).

Without the --foreground option, I captured an strace of the forked process, which shows a 60-second timeout in the output.

Comment 8 J. Bruce Fields 2014-11-10 19:14:41 UTC
(In reply to Mattia Verga from comment #7)
> The --foreground option has no effect: I always get the delay when mounting NFS
> filesystems (only for the first filesystem I mount

OK, that sounds expected, assuming by "has no effect" you just mean "doesn't fix the delay".  (I was suggesting --foreground just as a way to make statd easier to trace.)

> Without the --foreground option, I captured an strace of the forked process,
> which shows a 60-second timeout in the output.

Searching for "<6" and then working backwards, I see:

19:16:54 socket(PF_LOCAL, SOCK_STREAM, 0) = 5 <0.000060>
...
19:16:54 connect(5, {sa_family=AF_LOCAL, sun_path="/var/run/rpcbind.sock"}, 23) = 0 <0.000390>
19:16:54 write(5, "\200\0\0@\3\275q\306\0\0\0\0\0\0\0\2\0\1\206\240\0\0\0\3\0\0\0\2\0\0\0\0"..., 68) = 68 <0.000022>
19:16:54 poll([{fd=5, events=POLLIN}], 1, 60000) = 0 (Timeout) <60.060231>

So, it looks like it sent a request to rpcbind and didn't get a response.  I wonder what's going on with rpcbind?  Is it even running at this point?
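
Some quick checks that could answer that question (commands are suggestions, not from the comment):

# systemctl status rpcbind.service rpcbind.socket
# rpcinfo -p localhost
# ss -xlp | grep rpcbind.sock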

Comment 9 Mattia Verga 2014-11-11 17:51:27 UTC
Good suggestion. Looking at rpcbind I found another strange behaviour: if I manually start rpcbind.service and then start rpc-statd.service, there is no lag in starting rpc-statd.

But if I start rpc-statd.service first, I can see that rpcbind.service is started as a requirement, yet I still get the 60-second timeout... and rpc-statd ends up as PID 1719 and rpcbind as PID 1721.

# systemctl status rpc-statd.service
● rpc-statd.service - NFS status monitor for NFSv2/3 locking.
   Loaded: loaded (/usr/lib/systemd/system/rpc-statd.service; static)
   Active: active (running) since mar 2014-11-11 18:35:21 CET; 5min ago
  Process: 1718 ExecStart=/usr/sbin/rpc.statd --no-notify $STATDARGS (code=exited, status=0/SUCCESS)
 Main PID: 1719 (rpc.statd)
   CGroup: /system.slice/rpc-statd.service
           └─1719 /usr/sbin/rpc.statd --no-notify

# systemctl status rpcbind.service
● rpcbind.service - RPC bind service
   Loaded: loaded (/usr/lib/systemd/system/rpcbind.service; static)
   Active: active (running) since mar 2014-11-11 18:34:21 CET; 6min ago
  Process: 1720 ExecStart=/sbin/rpcbind -w ${RPCBIND_ARGS} (code=exited, status=0/SUCCESS)
 Main PID: 1721 (rpcbind)
   CGroup: /system.slice/rpcbind.service
           └─1721 /sbin/rpcbind -w

Shouldn't rpcbind have a lower PID, since it must be started before statd? It seems to me that rpcbind is started after rpc-statd even though the rpc-statd.service file clearly states that it requires rpcbind.target... and that generates the 60-second timeout we saw in strace.
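
One way to check what ordering and dependency information systemd actually has for these units (a suggestion, not from the comment):

# systemctl show -p Wants -p Requires -p After rpc-statd.service
# systemctl list-dependencies --after rpc-statd.service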

Comment 10 Mattia Verga 2014-11-11 18:30:11 UTC
I've noticed that there was no rpcbind.target.wants directory in /etc/systemd/system, and that doing 'systemctl enable/disable rpcbind.service' only enabled/disabled rpcbind.socket.

I don't know if I'm doing things right, but I edited the rpcbind.service file, adding the line 'WantedBy=rpcbind.target' under the [Install] section. After that I did a 'systemctl daemon-reload' and a 'systemctl enable rpcbind.service', which created the rpcbind.target.wants directory and put a symlink to rpcbind.service in it.

Now mounting NFS filesystems at boot gives decent times (10 seconds, which I think is normal, although with Fedora on an SSD it still seems like a lot). At least I no longer have the 60-second wait.

Did I do things right, or can my workaround lead to other problems?
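
For clarity, the workaround described above amounts roughly to this (a sketch reconstructed from the description; the rest of the unit file is not shown):

  (added to /usr/lib/systemd/system/rpcbind.service)
  [Install]
  WantedBy=rpcbind.target

# systemctl daemon-reload
# systemctl enable rpcbind.service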

Comment 11 Jeff Layton 2014-11-11 19:30:42 UTC
AIUI, rpcbind is one of those services that's supposed to be started on demand by systemd when a request comes in on the socket. Perhaps there's some delay in that startup?

Comment 12 J. Bruce Fields 2014-11-11 19:51:47 UTC
Yes, there are systemd .socket and .service unit files.

There was a patch back in 2012 that added socket activation support, but it still doesn't seem to have been applied: http://lists.freedesktop.org/archives/systemd-devel/2012-February/004352.html

I wonder if the systemd setup assumes that rpcbind supports socket activation when it actually doesn't?  But I don't understand well how this is meant to work.

Comment 13 Jeff Layton 2014-11-11 19:58:28 UTC
That may very well be the case. I'll note that on one of my client VMs, rpcbind doesn't get started at boot time, but systemd is listening on the unix socket:

systemd   1 root   41u     unix 0xffff880035d64a40      0t0      21133 /var/run/rpcbind.sock type=STREAM

Perhaps we should transition this to an rpcbind bug and pursue the problem there?

Comment 14 J. Bruce Fields 2014-11-11 20:53:34 UTC
I wonder if there's anything else we could check quickly.  Is there some systemd debugging that would show whether it's trying to hand the socket off to rpcbind and failing?

Maybe the simplest thing is just to apply the old patch and see what happens.
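
One possible way to get that kind of detail out of systemd (an assumption, not something proposed in the comment):

# systemd-analyze set-log-level debug
# systemctl start rpc-statd.service
# journalctl -b -u rpcbind.socket -u rpcbind.service -u rpc-statd.service
# systemd-analyze set-log-level info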

Comment 15 Mattia Verga 2014-11-12 16:19:58 UTC
I will build a patched version of rpcbind to try that as soon as I have enough spare time.

Meanwhile I noticed that the bug (bug #747363) requesting that the old patch be applied was closed as "wontfix" at F16 EOL without any activity from the package maintainer. Should I reopen that bug and move it to Rawhide?

A word from the maintainer would be appreciated.

Comment 16 J. Bruce Fields 2014-11-12 18:53:14 UTC
(In reply to Mattia Verga from comment #15)
> I will build a patched version of rpcbind to try that as soon as I have
> enough spare time.
> 
> Meanwhile I noticed that the bug (bug #747363) requesting that the old patch
> be applied was closed as "wontfix" at F16 EOL without any activity from the
> package maintainer. Should I reopen that bug and move it to Rawhide?

Good grief, thanks for spotting that.  So yes, I think it makes sense to update that one, and then, once you've confirmed this is the problem, we could maybe close this bug as a dup of that one.

Comment 17 Mattia Verga 2014-11-15 18:41:10 UTC
With rpcbind rebuilt with the patch I can confirm that rpcbind.service, rpcbind.socket and rpc-statd.service are being started correctly.

However, something else must still be wrong, because I'm now unable to mount NFS directories: mount says that rpc-statd.service is not running, but systemctl says it is...

Comment 18 J. Bruce Fields 2014-11-17 16:46:11 UTC
Thanks for your persistence!  That sounds like progress, at least....

I took a quick look at nfs-utils/systemd/ to try to figure out how it ensures things get started on mount.  I guess it's the "Before=remote-fs-pre.target" in nfs-client.target.

Except that there's a special comment about statd: "we don't "Wants=rpc-statd.service" as "mount.nfs" will arrange to start that on demand if needed."  I wonder how that works.

The unit files here all refer to "rpcbind.target"--is there even such a thing?  Looks to me like those should all be rpcbind.socket.
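
A quick way to check whether such a unit actually exists on a given system (a suggestion, not from the comment):

# systemctl list-unit-files 'rpcbind*'
# systemctl cat rpcbind.socket rpcbind.service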

Comment 19 J. Bruce Fields 2014-11-17 16:55:21 UTC
(In reply to J. Bruce Fields from comment #18)
> Except that there's a special comment about statd: "we don't
> "Wants=rpc-statd.service" as "mount.nfs" will arrange to start that on
> demand if needed."  I wonder how that works.

From systemd/README: " the rpc-statd.service unit is started by /usr/sbin/start-statd which mount.nfs will run if statd is needed."  I guess that's so we can skip starting statd if we're doing v4 or mounting with "nolock".

I can't see any evidence that we're actually installing such a thing.  Looks like we need to: something that just does "systemctl start rpc-statd.service" should be enough, I think?
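
A minimal sketch of what such a /usr/sbin/start-statd helper could look like (an illustration of the suggestion above, not the script actually shipped by nfs-utils):

  #!/bin/sh
  # run by mount.nfs when statd is needed; just ask systemd to start the unit
  exec systemctl start rpc-statd.service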

Comment 20 Mattia Verga 2014-11-17 17:37:30 UTC
(In reply to J. Bruce Fields from comment #18)
> 
> The unit files here all refer to "rpcbind.target"--is there even such a
> thing?  Looks to me like those should all be rpcbind.socket.

No, there's no "rpcbind.target"; at least I can't find any evidence of it, and I think this is the real problem. Mounting an NFS share, whether manually or at startup, makes rpc-statd start before rpcbind, and that leads to the 60-second timeout (see comment #9).

After testing the patched rpcbind version I went back to the official one and, as expected, hit the problem again.
I then modified rpc-statd.service by adding "Wants=nfs-config.service rpcbind.service" and "After=nfs-config.service rpcbind.service". This way everything goes well at startup, because rpcbind (service and socket) is guaranteed to start BEFORE rpc-statd.

I think the actual "Wants=rpcbind.target" and "After=rpcbind.target" do nothing since there's no rpcbind.target at all...
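
The change described above corresponds roughly to the following (shown as a systemd drop-in for illustration; the reporter edited the unit file directly, and the drop-in path is hypothetical):

  /etc/systemd/system/rpc-statd.service.d/order.conf:
  [Unit]
  Wants=nfs-config.service rpcbind.service
  After=nfs-config.service rpcbind.service

# systemctl daemon-reload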

Comment 21 J. Bruce Fields 2014-11-17 22:22:19 UTC
(In reply to Mattia Verga from comment #20)
> No, there's no "rpcbind.target"; at least I can't find any evidence of it,
> and I think this is the real problem. Mounting an NFS share, whether manually
> or at startup, makes rpc-statd start before rpcbind, and that leads to the
> 60-second timeout (see comment #9).
> 
> After testing the patched rpcbind version I went back to the official one
> and, as expected, hit the problem again.
> I then modified rpc-statd.service by adding "Wants=nfs-config.service
> rpcbind.service" and "After=nfs-config.service rpcbind.service". This way
> everything goes well at startup, because rpcbind (service and socket) is
> guaranteed to start BEFORE rpc-statd.
> 
> I think the actual "Wants=rpcbind.target" and "After=rpcbind.target" do
> nothing since there's no rpcbind.target at all...

OK, so just replacing "rpcbind.target" by "rpcbind.service" everywhere in nfs-utils/systemd works for you?

I'm still curious now how statd is getting started despite the missing /usr/sbin/start-statd.  I guess that's because you explicitly enabled statd and/or nfs-server earlier?  So there may still be something to fix there.

I'd also be happier if we could get this working with socket activation.  I'm not sure how to debug that.

Comment 22 Mattia Verga 2014-11-18 17:13:20 UTC
Yes, replacing rpcbind.target with rpcbind.service makes it work. I did not explicitly enable either statd or nfs-server; simply mounting an NFS share starts statd automatically, both from fstab and from the command line.

I don't know anything about rpcbind, but when I built the patched version I saw that both the service and the socket were started at boot. Even so, statd did not find them. Maybe that's just because the patch is many years old and no longer valid?

Comment 23 J. Bruce Fields 2014-11-25 22:45:41 UTC
Steved's latest version of the socket-activation patch would be worth trying:

http://marc.info/?l=linux-nfs&m=141693514115346&w=2

Comment 24 Fedora Update System 2014-11-26 14:51:32 UTC
rpcbind-0.2.2-0.0.fc21 has been submitted as an update for Fedora 21.
https://admin.fedoraproject.org/updates/rpcbind-0.2.2-0.0.fc21

Comment 25 Fedora Update System 2014-11-26 14:51:52 UTC
rpcbind-0.2.2-0.0.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/rpcbind-0.2.2-0.0.fc20

Comment 26 Fedora Update System 2014-11-27 08:37:58 UTC
Package rpcbind-0.2.2-0.0.fc20:
* should fix your issue,
* was pushed to the Fedora 20 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing rpcbind-0.2.2-0.0.fc20'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2014-15835/rpcbind-0.2.2-0.0.fc20
then log in and leave karma (feedback).

Comment 27 Fedora Update System 2014-12-06 10:41:41 UTC
rpcbind-0.2.2-0.0.fc21 has been pushed to the Fedora 21 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 28 Fedora Update System 2014-12-17 15:14:53 UTC
rpcbind-0.2.2-1.0.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/rpcbind-0.2.2-1.0.fc20

