Bug 684308 - nfsd: nfsv4 idmapping failing: has idmapd died?
Status: CLOSED RAWHIDE
Product: Fedora
Classification: Fedora
Component: nfs-utils
Version: 14
Assigned To: Steve Dickson
QA Contact: Fedora Extras Quality Assurance

Reported: 2011-03-11 13:15 EST by Andrew McNabb
Modified: 2011-10-14 13:08 EDT

Fixed In Version: nfs-utils-1.2.5-2.fc17
Doc Type: Bug Fix
Last Closed: 2011-10-04 13:42:03 EDT

Attachments:
Patch swapping the fcntl's (497 bytes, patch)
2011-05-14 13:37 EDT, Luca Giuzzi

Description Andrew McNabb 2011-03-11 13:15:06 EST
Description of problem:

I've had idmapd occasionally die without giving any error message.  The kernel reports "kernel: [505261.523335] nfsd: nfsv4 idmapping failing: has idmapd died?"  I imagine it would help to increase the verbosity of rpc.idmapd by adding "-v" to RPCIDMAPDARGS in /etc/sysconfig/nfs.  How many times should "-v" be repeated to give useful information without significantly slowing down the machine?  Is there anything else we should set to get more information about what's happening?


Version-Release number of selected component (if applicable):
nfs-utils-1.2.3-5.fc14.x86_64

How reproducible:
It's an intermittent problem, but it seems to happen on average to about one machine out of 30 per week.
Comment 1 Luca Giuzzi 2011-05-13 08:47:17 EDT
We are having similar problems with two F14 machines: rpc.idmapd dies randomly.

Perhaps I can provide some further information.
An strace of rpc.idmapd (and a quick look at idmapd.c) showed that sometimes
a SIGUSR1 is received between the invocations of
                fcntl(ic->ic_dirfd, F_SETSIG, 0);
                fcntl(ic->ic_dirfd, F_NOTIFY, 0);
in nfsopen()  [the relevant lines are 781-782 in idmapd.c].
This leads to an unprocessed SIGIO and the untimely and unfortunate death of the process, or so it appears. 

I wonder whether a patch reversing the order of the two fcntl calls would be enough to solve the problem, or whether more is needed (I have to admit that I have not yet looked at the code with enough care and attention).
Some further remarks: both machines where we found this problem have 32-bit kernels. On the slower one, attaching strace to rpc.idmapd apparently made the problem go away (!), suggesting that this could indeed be a timing issue.
Comment 2 Luca Giuzzi 2011-05-14 13:37:40 EDT
Created attachment 498941 [details]
Patch swapping the fcntl's

As a follow-up to my previous comment: rpc.idmapd used to die with an 'I/O possible' error message (uncaught SIGIO). This trivial patch seems to have solved the issue (so far). Unfortunately we do not have a test case that reproduces the bug deterministically, so it is just a matter of waiting to see whether the daemon dies again.
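
For readers unfamiliar with the 'I/O possible' message, here is a minimal sketch of why an uncaught SIGIO ends the process. It is not taken from idmapd; the helper name `signal_that_kills_child` is hypothetical. The default disposition of SIGIO is to terminate the process, and glibc's description of that signal is the string "I/O possible", which is what a shell reports when a child dies this way.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (illustration only, not idmapd code): fork a child
 * that receives SIGIO with the default disposition, and return the number
 * of the signal that terminated it (0 if it exited normally, -1 on error). */
int signal_that_kills_child(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        sigset_t set;

        /* Make sure SIGIO is neither ignored nor blocked in the child. */
        signal(SIGIO, SIG_DFL);
        sigemptyset(&set);
        sigaddset(&set, SIGIO);
        sigprocmask(SIG_UNBLOCK, &set, NULL);

        raise(SIGIO);   /* default action: terminate the process */
        _exit(0);       /* only reached if SIGIO did not kill us */
    }

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

Calling `signal_that_kills_child()` from a small `main` and passing the result to `strsignal()` gives the text of the message on the system at hand.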
Comment 3 Jeff Layton 2011-05-16 08:25:15 EDT
ACK -- patch looks sane to me. We shouldn't be resetting the signal type until after we disable the notification.
Comment 4 Jeff Layton 2011-05-16 08:25:42 EDT
Might be good to propose this patch formally on linux-nfs@vger.kernel.org
Comment 5 Luca Giuzzi 2011-05-16 09:39:12 EDT
Done!
Thanks,
 lg
Comment 6 Steve Dickson 2011-05-16 10:20:50 EDT
(In reply to comment #1)
> We are having similar problems with two F14 machines: rpc.idmapd dies randomly.
> 
> Perhaps, I can provide some further information.
> A strace of rpc.idmapd (and a quick check to idmapd.c) showed that sometimes
> a SIGUSR1 is received between invocations of
>                 fcntl(ic->ic_dirfd, F_SETSIG, 0);
>                 fcntl(ic->ic_dirfd, F_NOTIFY, 0);
> in nfsopen()  [the relevant lines are 781-782 in idmapd.c].
> This leads to an unprocessed SIGIO and the untimely and unfortunate death of
> the process, or so it appears. 
> 
Hmm... I'm a bit concerned that there are
a couple of places that set F_SETSIG and then set F_NOTIFY,
so does that mean those places are potential races as
well?

Also, after you applied this patch, did the problem go away?
Finally, how often did this problem occur?
Comment 7 Steve Dickson 2011-05-16 10:24:55 EDT
Hey Luca,

Let's go ahead and have this conversation on the
linux-nfs mailing list... More eyes are always better! :-)
Comment 8 Jeff Layton 2011-05-16 10:39:23 EDT
(In reply to comment #6)
> Hmm... I'm a bit concerned by the fact that there are
> a couple of places that sets F_SETSIG and then sets F_NOTIFY, 
> so does that mean those places are potential race cases as 
> well?
> 

I think those places are already correct. They're actually setting dnotify watches and so they should set the signal first. Luca's patch fixes the one place where we're disabling a dnotify. In that case, we don't want to change the signal first since it's possible a notification will race in between the two calls and send a SIGIO instead of SIGUSR2. The right thing to do is to disable the notification first and then set the signal back to the default (SIGIO).
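
The ordering argument above can be tried out with a small simulation. This is only a sketch under stated assumptions, not the idmapd code or the attached patch: it needs Linux with dnotify support, every function name in it is hypothetical, and a file created in the watched directory stands in for the notification that races in between the two calls. A child process arms a watch (signal first, then F_NOTIFY) and then tears it down in one of the two orders.

```c
#define _GNU_SOURCE            /* F_SETSIG, F_NOTIFY and DN_* are Linux-specific */
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void on_usr2(int sig) { (void)sig; }   /* stand-in for the real handler */

/* Stand-in for a notification racing in: create a file in the watched dir. */
static void create_event(const char *dir)
{
    char path[4096];
    snprintf(path, sizeof(path), "%s/event", dir);
    int fd = open(path, O_CREAT | O_WRONLY, 0600);
    if (fd >= 0)
        close(fd);
}

/* Child: arm a dnotify watch, then tear it down in the requested order. */
static void child(const char *dir, int notify_first)
{
    sigset_t set;
    int dirfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dirfd < 0)
        _exit(2);

    signal(SIGUSR2, on_usr2);
    signal(SIGIO, SIG_DFL);
    sigemptyset(&set);
    sigaddset(&set, SIGIO);
    sigprocmask(SIG_UNBLOCK, &set, NULL);

    /* Arming: choose the signal first, then enable the watch. */
    if (fcntl(dirfd, F_SETSIG, SIGUSR2) < 0 ||
        fcntl(dirfd, F_NOTIFY, DN_CREATE | DN_DELETE | DN_MULTISHOT) < 0)
        _exit(3);

    if (notify_first) {
        fcntl(dirfd, F_NOTIFY, 0);   /* disable the watch first ...           */
        create_event(dir);           /* ... so this event raises no signal    */
        fcntl(dirfd, F_SETSIG, 0);   /* then restore the default (SIGIO)      */
    } else {
        fcntl(dirfd, F_SETSIG, 0);   /* signal reset while the watch is live  */
        create_event(dir);           /* event is now delivered as SIGIO       */
        fcntl(dirfd, F_NOTIFY, 0);
    }
    _exit(0);
}

/* Returns the signal that killed the child, 0 if it survived, -1 on error. */
int teardown_result(int notify_first)
{
    char dir[] = "/tmp/dnotify-demo-XXXXXX";
    char path[4096];
    int status, rc = -1;

    if (mkdtemp(dir) == NULL)
        return -1;

    pid_t pid = fork();
    if (pid == 0)
        child(dir, notify_first);
    if (pid > 0 && waitpid(pid, &status, 0) == pid) {
        if (WIFSIGNALED(status))
            rc = WTERMSIG(status);
        else if (WEXITSTATUS(status) == 0)
            rc = 0;
    }

    /* Clean up the event file and the temporary directory. */
    snprintf(path, sizeof(path), "%s/event", dir);
    unlink(path);
    rmdir(dir);
    return rc;
}
```

On a kernel with dnotify enabled, `teardown_result(0)` (signal reset first) should report that the child was killed by SIGIO, while `teardown_result(1)` (notification disabled first) should report that the child survived.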
Comment 9 Roderick Johnstone 2011-10-05 05:17:01 EDT
Is there a chance that the fix for this can make it into F16 please?
Comment 10 Paul Raines 2011-10-14 13:08:35 EDT
And a fix for RHEL6 too please.
