684308 – nfsd: nfsv4 idmapping failing: has idmapd died?

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	nfs-utils
Sub Component:
Version:	14
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Steve Dickson
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2011-03-11 18:15 UTC by Andrew McNabb
Modified:	2011-10-14 17:08 UTC
CC List:	5 users
Fixed In Version:	nfs-utils-1.2.5-2.fc17
Clone Of:
Environment:
Last Closed:	2011-10-04 17:42:03 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
Patch swapping the fcntl's (497 bytes, patch) 2011-05-14 17:37 UTC, Luca Giuzzi

Description Andrew McNabb 2011-03-11 18:15:06 UTC

Description of problem:

I've had idmapd occasionally die without giving any error message.  The kernel reports "kernel: [505261.523335] nfsd: nfsv4 idmapping failing: has idmapd died?"  I imagine it would be helpful to increase the verbosity level of rpc.idmapd, by setting "-v" in RPCIDMAPDARGS in /etc/sysconfig/nfs.  Is there a certain number of times that "-v" should be repeated to give helpful information but without significantly slowing down the machine?  Is there anything else that we should set to give more information about what's happening?


Version-Release number of selected component (if applicable):
nfs-utils-1.2.3-5.fc14.x86_64

How reproducible:
It's an intermittent problem, but it seems to happen on average to about one machine out of 30 per week.

Comment 1 Luca Giuzzi 2011-05-13 12:47:17 UTC

We are having similar problems with two F14 machines: rpc.idmapd dies randomly.

Perhaps I can provide some further information.
A strace of rpc.idmapd (and a quick check of idmapd.c) showed that sometimes
a SIGUSR1 is received between invocations of
                fcntl(ic->ic_dirfd, F_SETSIG, 0);
                fcntl(ic->ic_dirfd, F_NOTIFY, 0);
in nfsopen()  [the relevant lines are 781-782 in idmapd.c].
This leads to an unprocessed SIGIO and the untimely and unfortunate death of the process, or so it appears. 

I wonder if a patch reversing the order of the two fcntl calls could be enough to solve the problem, or whether more might be needed (I have to admit that I have not looked at the code with enough care and attention, though).
Some further remarks: both machines where we found this problem have 32-bit kernels. On the slowest one, attaching strace to rpc.idmapd apparently solved the problem (!), suggesting that this could indeed be a timing issue.
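
For anyone who wants to see the suspected window without waiting for a daemon to die, here is a minimal sketch. It is not taken from idmapd.c: the helper name sigio_in_teardown_window and the temporary file name are hypothetical, and it assumes a Linux kernel with dnotify enabled and a writable directory. It arms a dnotify watch with SIGUSR2, performs only the first half of the old teardown (F_SETSIG to 0), creates a file, and then reports whether that event was delivered as SIGIO. A handler is installed for SIGIO so that the demo survives, since the default action for SIGIO is to terminate the process.

```c
#define _GNU_SOURCE   /* F_SETSIG, F_NOTIFY and DN_* are Linux-specific */
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t saw_sigio;

static void note_sigio(int sig)
{
    (void)sig;
    saw_sigio = 1;
}

/*
 * Hypothetical helper, not from nfs-utils.  Returns 1 if an event landing
 * between the two teardown calls was delivered as SIGIO, 0 if it was not,
 * and -1 if the demonstration could not be set up.
 */
static int sigio_in_teardown_window(const char *dir)
{
    char path[4096];
    int dirfd, fd;

    saw_sigio = 0;
    /* Catch SIGIO so the demo survives; ignore stray SIGUSR2 events. */
    if (signal(SIGIO, note_sigio) == SIG_ERR ||
        signal(SIGUSR2, SIG_IGN) == SIG_ERR)
        return -1;

    dirfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dirfd == -1)
        return -1;

    /* Arm the watch: choose the signal first, then enable notification. */
    if (fcntl(dirfd, F_SETSIG, SIGUSR2) == -1 ||
        fcntl(dirfd, F_NOTIFY, DN_CREATE | DN_DELETE | DN_MULTISHOT) == -1) {
        close(dirfd);
        return -1;
    }

    /* First half of the old teardown: the watch is still armed, but its
     * signal has just been reset to the default, SIGIO. */
    if (fcntl(dirfd, F_SETSIG, 0) == -1) {
        close(dirfd);
        return -1;
    }

    /* A directory event arrives inside the window. */
    snprintf(path, sizeof(path), "%s/teardown-race-%ld", dir, (long)getpid());
    fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0600);
    if (fd == -1) {
        fcntl(dirfd, F_NOTIFY, 0);
        close(dirfd);
        return -1;
    }
    close(fd);
    unlink(path);

    /* Second half of the old teardown: only now is the watch removed. */
    fcntl(dirfd, F_NOTIFY, 0);
    close(dirfd);
    return saw_sigio ? 1 : 0;
}
```

Calling sigio_in_teardown_window("/tmp") from a small main() should return 1 on a kernel with dnotify support. In the daemon the event comes from another process at an unlucky moment rather than from the same thread, which would explain why the failure is so intermittent.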

Comment 2 Luca Giuzzi 2011-05-14 17:37:40 UTC

Created attachment 498941
Patch swapping the fcntl's

As a follow-up to my previous comment: rpc.idmapd used to die with an 'I/O possible' error message (uncaught SIGIO). This trivial patch seems to have solved the issue (so far). Unfortunately we do not have any test case to reproduce the bug deterministically; thus it is just a matter of seeing if the daemon dies once more.

Comment 3 Jeff Layton 2011-05-16 12:25:15 UTC

ACK -- patch looks sane to me. We shouldn't be resetting the signal type until after we disable the notification.

Comment 4 Jeff Layton 2011-05-16 12:25:42 UTC

Might be good to propose this patch formally on linux-nfs.org

Comment 5 Luca Giuzzi 2011-05-16 13:39:12 UTC

Done!
Thanks,
 lg

Comment 6 Steve Dickson 2011-05-16 14:20:50 UTC

(In reply to comment #1)
> We are having similar problems with two F14 machines: rpc.idmapd dies randomly.
> 
> Perhaps, I can provide some further information.
> A strace of rpc.idmapd (and a quick check to idmapd.c) showed that sometimes
> a SIGUSR1 is received between invocations of
>                 fcntl(ic->ic_dirfd, F_SETSIG, 0);
>                 fcntl(ic->ic_dirfd, F_NOTIFY, 0);
> in nfsopen()  [the relevant lines are 781-782 in idmapd.c].
> This leads to an unprocessed SIGIO and the untimely and unfortunate death of
> the process, or so it appears. 
> 
Hmm... I'm a bit concerned by the fact that there are
a couple of places that set F_SETSIG and then set F_NOTIFY, 
so does that mean those places are potential race cases as 
well?

Also, after you applied this patch, did the problem go away?
Finally, how often did this problem occur?

Comment 7 Steve Dickson 2011-05-16 14:24:55 UTC

Hey Luca,

Let's go ahead and have this conversation on the
linux-nfs mailing list... More eyes are always better! :-)

Comment 8 Jeff Layton 2011-05-16 14:39:23 UTC

(In reply to comment #6)
> Hmm... I'm a bit concerned by the fact that there are
> a couple of places that sets F_SETSIG and then sets F_NOTIFY, 
> so does that mean those places are potential race cases as 
> well?
> 

I think those places are already correct. They're actually setting dnotify watches and so they should set the signal first. Luca's patch fixes the one place where we're disabling a dnotify. In that case, we don't want to change the signal first since it's possible a notification will race in between the two calls and send a SIGIO instead of SIGUSR2. The right thing to do is to disable the notification first and then set the signal back to the default (SIGIO).
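
To spell out the two orderings described above, here is a minimal sketch. It is not the actual idmapd.c code or the attached patch: the helper names arm_dir_watch and disarm_dir_watch are hypothetical, and the event mask is an assumption chosen for illustration. Setting a watch chooses the signal before enabling notification; removing a watch disables notification before restoring the default signal.

```c
#define _GNU_SOURCE   /* F_SETSIG, F_NOTIFY and DN_* are Linux-specific */
#include <fcntl.h>
#include <signal.h>

/*
 * Hypothetical helper: arm a dnotify watch on an open directory fd.
 * The signal is chosen first, so no event can be delivered with the
 * default SIGIO once the watch is live.
 */
static int arm_dir_watch(int dirfd, int sig)
{
    if (fcntl(dirfd, F_SETSIG, sig) == -1)
        return -1;
    /* Event mask is an assumption made for this sketch. */
    return fcntl(dirfd, F_NOTIFY, DN_CREATE | DN_DELETE | DN_MULTISHOT);
}

/*
 * Hypothetical helper: disarm the watch.  Notification is removed
 * first, so nothing can fire while the signal is being reset; only
 * then is the signal put back to the default (0 selects SIGIO).
 */
static int disarm_dir_watch(int dirfd)
{
    if (fcntl(dirfd, F_NOTIFY, 0) == -1)
        return -1;
    return fcntl(dirfd, F_SETSIG, 0);
}
```

With this pairing there is no moment at which a live watch is bound to SIGIO, which is the property the patch is after.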

Comment 9 Roderick Johnstone 2011-10-05 09:17:01 UTC

Is there a chance that the fix for this can make it into F16 please?

Comment 10 Paul Raines 2011-10-14 17:08:35 UTC

And a fix for RHEL6 too please.
