Bug 995861 - /var partition at 100% caused by rpc.statd errors: "my_svc_run() - select: Bad file descriptor" filling up /var/log/messages
Status: CLOSED EOL
Product: Fedora
Classification: Fedora
Component: nfs-utils
Version: 19
Hardware: x86_64 Linux
Priority: unspecified
Severity: high
Assigned To: Steve Dickson
QA Contact: Fedora Extras Quality Assurance
Depends On:
Blocks: 1153007
Reported: 2013-08-11 10:50 EDT by Marek
Modified: 2015-03-05 15:55 EST
CC List: 10 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Clones: 1153007
Environment:
Last Closed: 2015-02-17 11:43:06 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
"patched" SRPM (786.83 KB, application/x-rpm)
2013-09-15 07:56 EDT, Marek

Description Marek 2013-08-11 10:50:03 EDT
Description of problem: /var/log/messages grows to several GB in size, filled by rpc.statd errors. The machine is an NFS client, mounting a D-Link ShareCenter NAS via /etc/fstab like this:

192.168.1.40:/mnt/HD/HD_a2/media/audio  /data/audio     nfs     rw,noatime,hard,intr,async,rsize=8192,wsize=8192,timeo=14       0 0
192.168.1.40:/mnt/HD/HD_a2/media/video  /data/video     nfs     rw,noatime,hard,intr,async,rsize=8192,wsize=8192,timeo=14       0 0
192.168.1.40:/mnt/HD/HD_a2/media/photos /data/photos    nfs     rw,noatime,hard,intr,async,rsize=8192,wsize=8192,timeo=14       0 0

I have /var as a separate partition; I thought 8GB was enough for a home PC...

Version-Release number of selected component (if applicable): nfs-utils-1.2.8-3.0.fc19.x86_64


How reproducible: intermittent; I can't reproduce it at will.


Steps to Reproduce:
1.
2.
3.

Actual results: /var at 100%; the machine either doesn't start or, once started, becomes sluggish and unresponsive. I "solved" it with 'chmod 000 /usr/sbin/rpc.statd', but now I can't use NFS anymore.


Expected results: this should not happen


Additional info:
Comment 1 Steve 2013-09-13 14:19:26 EDT
I have had this happen on F19 x86_64 Linux as well (3 times so far). This laptop is also an NFS client, NOT a server.
Comment 2 Marek 2013-09-15 07:56:24 EDT
Created attachment 797881 [details]
"patched" SRPM

I have hopefully "fixed" the issue by rebuilding nfs-utils from the SRPM (commented-out these lines in the source):

utils/statd/svc_run.c:119:                      xlog(L_ERROR, "my_svc_run() - select: %m");
utils/mountd/svc_run.c:113:                     xlog(L_ERROR, "my_svc_run() - select: %m");
Comment 3 J. Bruce Fields 2013-09-16 11:23:49 EDT
Is this something that started happening with some update, or that only affects some (e.g., only F19) clients?

So it sounds like the select() in my_svc_run is returning EBADF.  So presumably one of the descriptors in readfds (a copy of SVC_FDSET) is bad.

I'm a little suspicious of utils/statd/rmtcall.c:statd_get_socket(): first, because it modifies SVC_FDSET, which "man rpc" claims is a constant, and second because one of the few recent changes to statd is a change to that function (eb8229338f "rpc.statd: Fix socket finding loop.").  I wonder if something's wrong there.

The handling of bindresvport failures doesn't look right--shouldn't it return -1 in that case?  (Or set sockfd to -1 to ensure that happens later?)  But that's not new behavior.

Alternatively maybe something's changed in the rpc library we link to.
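
To make that failure mode concrete, here is a minimal, self-contained sketch (not the actual nfs-utils source; the variable names are invented for illustration) of the shape of such a select loop: once a descriptor that is still marked in the fd set has been closed, every call to select() fails immediately with EBADF, the error is logged, nothing removes the stale fd, and the loop spins, which is exactly what fills /var/log/messages.

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <sys/select.h>

int main(void)
{
    fd_set svc_fdset;            /* stand-in for the library's SVC_FDSET */
    FD_ZERO(&svc_fdset);

    int fd = dup(0);             /* a real descriptor... */
    FD_SET(fd, &svc_fdset);
    close(fd);                   /* ...closed behind the set's back: now stale */

    for (int i = 0; i < 3; i++) {        /* the real loop never exits */
        fd_set readfds = svc_fdset;      /* my_svc_run() selects on a copy of SVC_FDSET */
        int rc = select(FD_SETSIZE, &readfds, NULL, NULL, NULL);
        if (rc < 0) {
            if (errno == EINTR)
                continue;
            /* this is the kind of line that floods /var/log/messages */
            fprintf(stderr, "my_svc_run() - select: %s\n", strerror(errno));
            continue;            /* the stale fd is never removed, so EBADF repeats */
        }
    }
    return 0;
}

Compiled and run, this prints the same error line a few times; the real daemon loops forever, producing it continuously.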
Comment 4 Marek 2013-09-19 11:44:25 EDT
I can only speak for myself, but I started seeing this in F19. I didn't have this issue from F15 through F18 on the same machine.
Comment 5 J. Bruce Fields 2013-09-19 14:07:21 EDT
Hm, maybe we could figure this out with strace.  If you can run rpc.statd under strace, then find the "select" calls that are returning EBADF, perhaps you can work out which file descriptor is bad....
Comment 6 Marek 2013-10-06 03:26:01 EDT
Well, the issue being intermittent, I would have to be quite lucky to strace it just when it happens. To put it into context, between the upgrade to F19 (about a week after it was released) and the point where I 'fixed' it by removing the log message from the source, it happened maybe 5 times. But I'll try.
Comment 7 Marek 2013-10-06 03:28:15 EDT
An afterthought: it seems that the problem happens at boot. My PC being a home machine, I turn it on and off every day.
Comment 8 Steve Dickson 2013-10-07 07:58:30 EDT
(In reply to Marek from comment #2)
> Created attachment 797881 [details]
> "patched" SRPM
> 
> I have hopefully "fixed" the issue by rebuilding nfs-utils from the SRPM
> (commented-out these lines in the source):
> 
> utils/statd/svc_run.c:119:                      xlog(L_ERROR, "my_svc_run()
> - select: %m");
> utils/mountd/svc_run.c:113:                     xlog(L_ERROR, "my_svc_run()
> - select: %m");

Well, you fixed the problem of /var/log/messages filling up, not
the reason select is returning EBADF, which means rpc.statd
will still be spinning in the same loop and you will probably
not notice it.

I too am looking very suspiciously at  commit eb822933...
I wonder if that 
+
+       while (--i >= 0)
+               close(tmp_sockets[i]);
is closing too many fds....
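
For illustration only, here is a hypothetical reduction of the failure mode being suspected here (it is not the real statd_get_socket(); the helper and array names are invented): if the descriptor the function decides to keep also sits in the temporary-socket array, a cleanup loop of that shape closes it along with the throwaway sockets, and anything that later uses the returned fd, including a select() on a set that still contains it, gets EBADF.

#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>

#define MAX_TRIES 8

/* Hypothetical reduction of the pattern under suspicion. */
static int get_socket_buggy(void)
{
    int tmp_sockets[MAX_TRIES];
    int i, keep = -1;

    for (i = 0; i < MAX_TRIES; i++) {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0)
            break;
        tmp_sockets[i] = fd;      /* every socket lands in the array... */
        if (i == 2) {             /* pretend the third one is acceptable */
            keep = fd;
            i++;                  /* count it, then stop trying */
            break;
        }
    }

    /* cleanup loop of the same shape as the one in commit eb822933 */
    while (--i >= 0)
        close(tmp_sockets[i]);    /* ...so this also closes 'keep' */

    return keep;                  /* the returned fd is already closed */
}

int main(void)
{
    int fd = get_socket_buggy();
    /* write() on the closed descriptor fails with EBADF, just as
       select() does once such an fd is left behind in SVC_FDSET */
    if (write(fd, "x", 1) < 0)
        perror("write on kept socket");
    return 0;
}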
Comment 9 Marek 2013-10-09 11:58:24 EDT
(In reply to Steve Dickson from comment #8)
> (In reply to Marek from comment #2)
> > Created attachment 797881 [details]
> > "patched" SRPM
> > 
> > I have hopefully "fixed" the issue by rebuilding nfs-utils from the SRPM
> > (commented-out these lines in the source):
> > 
> > utils/statd/svc_run.c:119:                      xlog(L_ERROR, "my_svc_run()
> > - select: %m");
> > utils/mountd/svc_run.c:113:                     xlog(L_ERROR, "my_svc_run()
> > - select: %m");
> 
> Well, you fixed the problem of /var/log/messages filling up, not
> the reason select is returning EBADF, which means rpc.statd
> will still be spinning in the same loop and you will probably
> not notice it.
> 
> I too am looking very suspiciously at  commit eb822933...
> I wonder if that 
> +
> +       while (--i >= 0)
> +               close(tmp_sockets[i]);
> is closing too many fds....

Well that's why I wrote "fixed" and not just _fixed_.
Comment 10 Mustafa Mehmed 2013-10-24 02:45:47 EDT
I am having the same problem. /var/log/messages is full of rpc.statd[1271]: my_svc_run() - select: Bad file descriptor error messages. 

I mount all of my data from a QNAP NAS TS-410 model which is running nfs3.
Comment 11 Mustafa Mehmed 2013-10-24 02:54:28 EDT
Sorry forgot to add :-

1. Running F19 64bit
2. kernel is 3.11.6-200.fc19.x86_64
3. nfs-utils-1.2.8-6.0.fc19.x86_64
Comment 12 Mustafa Mehmed 2013-11-19 17:30:55 EST
Problem has returned.

1. F19 64bit
2. kernel 3.11.8-200.fc19.x86_64
3. nfs-utils-1.2.8-6.0

I mount all of my data from a QNAP NAS TS-410 model which is running nfs3.
Comment 13 Mustafa Mehmed 2013-11-19 17:36:36 EST
More info.

I can boot to :-

3.11.6-201.fc19.x86_64

without the bug appearing.
Comment 14 Ben Greear 2014-04-11 14:56:47 EDT
One of our users sees this from time to time.  They are running a 3.9.11+ kernel currently, on Fedora 19.

Aside from just commenting out the messages that are filling the logs, are there any other fixes worth testing?
Comment 15 Ollie Walsh 2014-06-19 07:33:09 EDT
Just hit this on F19 3.11.8-200.fc19.x86_64.

Strace output for rpc.statd (pid 1286):
select(1024, [7 8 9 10 105], NULL, NULL, NULL) = -1 EBADF (Bad file descriptor)
sendto(4, "<27>Jun 19 11:39:58 rpc.statd[12"..., 79, MSG_NOSIGNAL, NULL, 0) = 79
select(1024, [7 8 9 10 105], NULL, NULL, NULL) = -1 EBADF (Bad file descriptor)
sendto(4, "<27>Jun 19 11:39:58 rpc.statd[12"..., 79, MSG_NOSIGNAL, NULL, 0) = 79
...
...

ls -l /proc/1286/fd/
total 0
lrwx------ 1 root root 64 Jun 18 16:07 0 -> /dev/null
lrwx------ 1 root root 64 Jun 18 16:07 1 -> /dev/null
lrwx------ 1 root root 64 Jun 19 11:41 10 -> socket:[1770674]
lrwx------ 1 root root 64 Jun 18 16:07 2 -> /dev/null
lrwx------ 1 root root 64 Jun 18 16:07 4 -> socket:[1770513]
l-wx------ 1 root root 64 Jun 19 11:41 6 -> /run/rpc.statd.pid (deleted)
lrwx------ 1 root root 64 Jun 19 11:41 7 -> socket:[1770656]
lrwx------ 1 root root 64 Jun 19 11:41 8 -> socket:[1770664]
lrwx------ 1 root root 64 Jun 19 11:41 9 -> socket:[1770670]
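
Worth noting: the set passed to select() contains fd 105, but no fd 105 appears under /proc/1286/fd, which matches a descriptor that was registered and then closed. One way to confirm which registered descriptor is stale, sketched below under the assumption that equivalent logic can be run against the live set (for example from a debugger or an added debug probe; the function name here is invented), is to test each fd in the set with fcntl(F_GETFD).

#include <stdio.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/select.h>

/* Print every fd that is marked in 'set' but is not actually open.
   In a live statd this could be run against SVC_FDSET. */
static void report_stale_fds(const fd_set *set)
{
    for (int fd = 0; fd < FD_SETSIZE; fd++) {
        if (!FD_ISSET(fd, set))
            continue;
        if (fcntl(fd, F_GETFD) < 0 && errno == EBADF)
            printf("fd %d is in the set but closed\n", fd);
    }
}

int main(void)
{
    fd_set set;
    FD_ZERO(&set);
    FD_SET(105, &set);            /* simulate the situation in this comment */
    report_stale_fds(&set);       /* fd 105 is not open, so it is reported */
    return 0;
}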
Comment 16 Ohad Basan 2014-10-07 06:52:26 EDT
I just hit that on RHEL 6.6 Beta
Comment 17 Steve Dickson 2014-12-13 11:01:38 EST
When this bug hits... how many rpc.statd processes are running?

I saw in one instance there were 48 daemons running when
there should only be one.
Comment 18 Ollie Walsh 2014-12-15 11:54:23 EST
I don't recall the exact number but it was roughly equal to the number of nfs mounts.
Comment 19 Steve Dickson 2014-12-15 19:08:43 EST
(In reply to Ollie Walsh from comment #18)
> I don't recall the exact number but it was roughly equal to the number of
> nfs mounts.
Hmm... interesting... The machine I saw that was affected by this issue
didn't have any NFS mounts...

Can I assume those mounts were successful, or did they error out?
Comment 20 Ollie Walsh 2014-12-16 09:36:21 EST
Successfully mounted IIRC (it was some time ago).

This was seen in a number of client VMs in an automated test environment for NFS servers, similar to the environment mentioned in Bug #1153007 I expect.

I think you're on the right track in that bug...

In my case each client is mounting nfs3 exports from multiple servers. Mount commands are potentially run concurrently.
I don't expect rpc.statd is running at boot as they do not mount any nfs exports in /etc/fstab and do not have any nfs exports themselves.

From a quick glance at the source for mount I can't see any form of locking between the poll to check if rpc.statd is running and starting rpc.statd. If multiple nfs exports are mounted concurrently, and rpc.statd is not already running, then I suspect multiple processes would be spawned.

The rpc.statd pid file logic doesn't appear to deal with this situation - it unlinks the file at creation and truncates it at exit. This seems a bit odd to me, but maybe there is a very good reason for it.
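
For context, one conventional way to close that kind of window, shown here as a generic sketch rather than a claim about what nfs-utils does or should do (the pid file path below is a made-up example; the real daemon uses /run/rpc.statd.pid), is to hold an exclusive lock on the pid file for the daemon's lifetime; a second instance started concurrently then fails to take the lock instead of winning a check-then-create race.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

#define PIDFILE "/tmp/example-statd.pid"   /* hypothetical path */

int main(void)
{
    int fd = open(PIDFILE, O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        perror("open pidfile");
        return 1;
    }

    /* Non-blocking exclusive lock, held for the daemon's lifetime: a
       concurrently started second instance fails here instead of both
       instances passing an "is the pidfile there?" check. */
    if (lockf(fd, F_TLOCK, 0) < 0) {
        fprintf(stderr, "another instance already holds %s\n", PIDFILE);
        return 1;
    }

    char buf[32];
    int len = snprintf(buf, sizeof(buf), "%d\n", getpid());
    if (ftruncate(fd, 0) < 0 || write(fd, buf, len) != len) {
        perror("write pidfile");
        return 1;
    }

    pause();   /* stand-in for the daemon's main loop; the lock dies with us */
    return 0;
}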
Comment 21 J. Bruce Fields 2014-12-16 10:00:29 EST
(In reply to Ollie Walsh from comment #20)
> Successfully mounted IIRC (it was some time ago).
> 
> This was seen in a number of client VMs in an automated test environment for
> NFS servers, similar to the environment mentioned in Bug #1153007 I expect.
> 
> I think you're on the right track in that bug...
> 
> In my case each client is mounting nfs3 exports from multiple servers. Mount
> commands are potentially run concurrently.
> I don't expect rpc.statd is running at boot as they do not mount any nfs
> exports in /etc/fstab and do not have any nfs exports themselves.
> 
> From a quick glance at the source for mount I can't see any form of locking
> between the poll to check if rpc.statd is running and starting rpc.statd. If
> multiple nfs exports are mounted concurrently, and rpc.statd is not already
> running, then I suspect multiple processes would be spawned.
> 
> The rpc.statd pid file logic doesn't appear to deal with this situation - it
> unlinks the file at creation and truncates it at exit. This seems a bit odd
> to me, but maybe there is a very good reason for it.

My understanding from nfs-utils/systemd/README is that mount is supposed to call out to /usr/sbin/start-statd.  If we install a script there that just does "systemctl start rpc-statd.service" then systemd can ensure that only one instance is started.  Is start-statd getting installed, and in the right place?

(I assume we're still talking about Fedora here.  There's also a reference to RHEL6.6 above.)
Comment 22 Ollie Walsh 2014-12-16 10:15:46 EST
The script exists:

http://git.linux-nfs.org/?p=steved/nfs-utils.git;a=blob;f=utils/statd/start-statd;h=ec9383beaf953ab5afcee0f12764e493974af760;hb=HEAD

... but it execs rpc.statd if systemctl fails, so that's not going to help.
Comment 23 J. Bruce Fields 2014-12-16 11:11:46 EST
(In reply to Ollie Walsh from comment #22)
> The script exists:
> 
> http://git.linux-nfs.org/?p=steved/nfs-utils.git;a=blob;f=utils/statd/start-
> statd;h=ec9383beaf953ab5afcee0f12764e493974af760;hb=HEAD
> 
> ... but execs rpc.stat.d if systemctl fails so that's not going to help.

That fallback certainly looks wrong.

But I don't understand why it's actually causing a problem in your case.  The "systemctl" start should just be a successful no-op in the case statd is already running.
Comment 24 Ollie Walsh 2014-12-16 11:45:37 EST
Ah, start-statd isn't using systemctl on f19 (nfs-utils 1.2.8):

[... ~]# cat /usr/sbin/start-statd
#!/bin/bash -p
# nfsmount calls this script when mounting a filesystem with locking
# enabled, but when statd does not seem to be running (based on
# /var/run/rpc.statd.pid).
# It should run statd with whatever flags are apropriate for this
# site.
PATH=/sbin:/usr/sbin
exec rpc.statd --no-notify
Comment 25 Ollie Walsh 2014-12-16 11:48:14 EST
Also there has been a change to prevent multiple rpc.statd instances since v1.2.8:
http://git.linux-nfs.org/?p=steved/nfs-utils.git;a=commit;h=342446a4a624d4ee8254af859bb7f1de6d268679

I'm not sure about this. It still looks like a race condition to me, just with a narrow time window.
Comment 26 J. Bruce Fields 2014-12-16 11:58:29 EST
(In reply to Ollie Walsh from comment #25)
> Also there has been a change to prevent multiple rpc.statd instances since
> v1.2.8:
> http://git.linux-nfs.org/?p=steved/nfs-utils.git;a=commit;
> h=342446a4a624d4ee8254af859bb7f1de6d268679
> 
> I'm not sure about this. It still looks like a race condition to me, just
> with a narrow time window.

Agreed, I don't understand that either.
Comment 27 Fedora End Of Life 2015-01-09 14:24:31 EST
This message is a notice that Fedora 19 is now at end of life. Fedora 
has stopped maintaining and issuing updates for Fedora 19. It is 
Fedora's policy to close all bug reports from releases that are no 
longer maintained. Approximately 4 (four) weeks from now this bug will
be closed as EOL if it remains open with a Fedora 'version' of '19'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 19 reached end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version prior to this bug being closed, as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.
Comment 28 Fedora End Of Life 2015-02-17 11:43:06 EST
Fedora 19 changed to end-of-life (EOL) status on 2015-01-06. Fedora 19 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.
Comment 29 Brett Yoakum 2015-03-05 15:53:29 EST
As stated by Ohad Basan (2014-10-07 06:52:26 EDT), this is happening on RHEL 6, which is still supported. I can also attest that this is happening on RHEL 6.
Comment 30 Brett Yoakum 2015-03-05 15:55:20 EST
And... just realized this is for Fedora only. I'll open a case with Red Hat.
