Bug 684308

Summary: nfsd: nfsv4 idmapping failing: has idmapd died?
Product: [Fedora] Fedora
Reporter: Andrew McNabb <amcnabb>
Component: nfs-utils
Assignee: Steve Dickson <steved>
Status: CLOSED RAWHIDE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Priority: unspecified
Version: 14
CC: jlayton, luca.giuzzi, raines, rmj, steved
Hardware: Unspecified
OS: Unspecified
Fixed In Version: nfs-utils-1.2.5-2.fc17
Doc Type: Bug Fix
Last Closed: 2011-10-04 17:42:03 UTC
Attachments:
Patch swapping the fcntl's (flags: none)

Description Andrew McNabb 2011-03-11 18:15:06 UTC
Description of problem:

I've had idmapd occasionally die without giving any error message. The kernel reports "kernel: [505261.523335] nfsd: nfsv4 idmapping failing: has idmapd died?" I imagine it would be helpful to increase the verbosity level of rpc.idmapd by setting "-v" in RPCIDMAPDARGS in /etc/sysconfig/nfs. Is there a certain number of times that "-v" should be repeated to give helpful information without significantly slowing down the machine? Is there anything else that we should set to get more information about what's happening?


Version-Release number of selected component (if applicable):
nfs-utils-1.2.3-5.fc14.x86_64

How reproducible:
It's an intermittent problem, but it seems to happen on average to about one machine out of 30 per week.

Comment 1 Luca Giuzzi 2011-05-13 12:47:17 UTC
We are having similar problems with two F14 machines: rpc.idmapd dies randomly.

Perhaps, I can provide some further information.
A strace of rpc.idmapd (and a quick check to idmapd.c) showed that sometimes
a SIGUSR1 is received between invocations of
                fcntl(ic->ic_dirfd, F_SETSIG, 0);
                fcntl(ic->ic_dirfd, F_NOTIFY, 0);
in nfsopen()  [the relevant lines are 781-782 in idmapd.c].
This leads to an unprocessed SIGIO and the untimely and unfortunate death of the process, or so it appears. 

I wonder if a patch reversing the order of the two fcntl calls could be enough to solve the problem, or whether more might be needed (I have to admit that I have not looked at the code with enough care and attention, though).
Some further remarks: both machines where we found this problem have 32-bit kernels. On the slower one, attaching strace to rpc.idmapd apparently made the problem go away (!), suggesting that this could indeed be a timing issue.

Comment 2 Luca Giuzzi 2011-05-14 17:37:40 UTC
Created attachment 498941 [details]
Patch swapping the fcntl's

As a follow-up to my previous comment: rpc.idmapd used to die with an 'I/O possible' error message (uncaught SIGIO). This trivial patch seems to have solved the issue (so far). Unfortunately we do not have a test case that reproduces the bug deterministically, so it is just a matter of waiting to see whether the daemon dies again.

Comment 3 Jeff Layton 2011-05-16 12:25:15 UTC
ACK -- patch looks sane to me. We shouldn't be resetting the signal type until after we disable the notification.

Comment 4 Jeff Layton 2011-05-16 12:25:42 UTC
Might be good to propose this patch formally on linux-nfs.org

Comment 5 Luca Giuzzi 2011-05-16 13:39:12 UTC
Done!
Thanks,
 lg

Comment 6 Steve Dickson 2011-05-16 14:20:50 UTC
(In reply to comment #1)
> We are having similar problems with two F14 machines: rpc.idmapd dies randomly.
> 
> Perhaps, I can provide some further information.
> A strace of rpc.idmapd (and a quick check to idmapd.c) showed that sometimes
> a SIGUSR1 is received between invocations of
>                 fcntl(ic->ic_dirfd, F_SETSIG, 0);
>                 fcntl(ic->ic_dirfd, F_NOTIFY, 0);
> in nfsopen()  [the relevant lines are 781-782 in idmapd.c].
> This leads to an unprocessed SIGIO and the untimely and unfortunate death of
> the process, or so it appears. 
> 
Hmm... I'm a bit concerned by the fact that there are
a couple of places that sets F_SETSIG and then sets F_NOTIFY, 
so does that mean those places are potential race cases as 
well?

Also, after you applied this patch, did the problem go away?
Finally, how often did this problem occur?

Comment 7 Steve Dickson 2011-05-16 14:24:55 UTC
Hey Luca,

Let's go ahead and have this conversation on the
linux-nfs mailing list... More eyes are always better! :-)

Comment 8 Jeff Layton 2011-05-16 14:39:23 UTC
(In reply to comment #6)
> Hmm... I'm a bit concerned by the fact that there are
> a couple of places that sets F_SETSIG and then sets F_NOTIFY, 
> so does that mean those places are potential race cases as 
> well?
> 

I think those places are already correct. They're actually setting dnotify watches and so they should set the signal first. Luca's patch fixes the one place where we're disabling a dnotify. In that case, we don't want to change the signal first since it's possible a notification will race in between the two calls and send a SIGIO instead of SIGUSR2. The right thing to do is to disable the notification first and then set the signal back to the default (SIGIO).
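
The ordering described above can be sketched as a pair of helpers. This is an illustration only, not the code in idmapd.c and not the attached patch; `watch_dir` and `unwatch_dir` are hypothetical names, and the event mask is an assumption chosen for the example. Arming picks the signal before enabling notification; disarming disables notification before restoring the default signal.

```c
#define _GNU_SOURCE   /* F_SETSIG, F_GETSIG, F_NOTIFY and DN_* are Linux-specific */
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

/*
 * Arm a dnotify watch on an open directory descriptor.
 * The signal is selected first, so every notification that follows is
 * delivered as `sig` and never as the default SIGIO.
 */
int watch_dir(int dirfd, int sig)
{
    assert(dirfd >= 0);
    if (fcntl(dirfd, F_SETSIG, sig) < 0)
        return -1;
    return fcntl(dirfd, F_NOTIFY, DN_CREATE | DN_DELETE | DN_MULTISHOT);
}

/*
 * Disarm the watch.
 * Notification is switched off first; only then is the signal reset
 * (0 selects the default, SIGIO). With the opposite order, an event
 * arriving between the two calls would be delivered as SIGIO.
 */
int unwatch_dir(int dirfd)
{
    assert(dirfd >= 0);
    if (fcntl(dirfd, F_NOTIFY, 0) < 0)
        return -1;
    return fcntl(dirfd, F_SETSIG, 0);
}
```

A caller would open the directory with `open(path, O_RDONLY | O_DIRECTORY)`, install a handler for the chosen signal, and then call `watch_dir(fd, SIGUSR2)`; `unwatch_dir(fd)` undoes both settings in the safe order.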

Comment 9 Roderick Johnstone 2011-10-05 09:17:01 UTC
Is there a chance that the fix for this can make it into F16 please?

Comment 10 Paul Raines 2011-10-14 17:08:35 UTC
And a fix for RHEL6 too please.