Bug 684308

Summary:

nfsd: nfsv4 idmapping failing: has idmapd died?

Product:

[Fedora] Fedora

Reporter:

Andrew McNabb <amcnabb>

Component:

nfs-utils

Assignee:

Steve Dickson <steved>

Status:

CLOSED RAWHIDE

QA Contact:

Fedora Extras Quality Assurance <extras-qa>

Severity:

unspecified

Docs Contact:

Priority:

unspecified

Version:

CC:

jlayton, luca.giuzzi, raines, rmj, steved

Target Milestone:

---

Target Release:

---

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

nfs-utils-1.2.5-2.fc17

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2011-10-04 17:42:03 UTC

Attachments:

Description	Flags
Patch swapping the fcntl's	none

Description Andrew McNabb 2011-03-11 18:15:06 UTC

Description of problem:

I've had idmapd occasionally die without giving any error message.  The kernel reports "kernel: [505261.523335] nfsd: nfsv4 idmapping failing: has idmapd died?"  I imagine it would be helpful to increase the verbosity level of rpc.idmapd by setting "-v" in RPCIDMAPDARGS in /etc/sysconfig/nfs.  Is there a certain number of times that "-v" should be repeated to give helpful information without significantly slowing down the machine?  Is there anything else that we should set to get more information about what's happening?


Version-Release number of selected component (if applicable):
nfs-utils-1.2.3-5.fc14.x86_64

How reproducible:
It's an intermittent problem, but it seems to happen on average to about one machine out of 30 per week.

Comment 1 Luca Giuzzi 2011-05-13 12:47:17 UTC

We are having similar problems with two F14 machines: rpc.idmapd dies randomly.

Perhaps, I can provide some further information.
A strace of rpc.idmapd (and a quick check to idmapd.c) showed that sometimes
a SIGUSR1 is received between invocations of
                fcntl(ic->ic_dirfd, F_SETSIG, 0);
                fcntl(ic->ic_dirfd, F_NOTIFY, 0);
in nfsopen()  [the relevant lines are 781-782 in idmapd.c].
This leads to an unprocessed SIGIO and the untimely and unfortunate death of the process, or so it appears. 

I wonder whether a patch reversing the order of the two fcntl calls would be enough to solve the problem, or whether more is needed (I have to admit that I have not looked at the code with enough care and attention, though).
Some further remarks: both machines where we found this problem have 32-bit kernels. On the slower one, attaching strace to rpc.idmapd apparently made the problem go away (!), suggesting that this could indeed be a timing issue.

Comment 2 Luca Giuzzi 2011-05-14 17:37:40 UTC

Created attachment 498941 [details]
Patch swapping the fcntl's

As a follow-up to my previous comment: rpc.idmapd used to die with an 'I/O possible' error message (uncaught SIGIO). This trivial patch seems to have solved the issue (so far). Unfortunately we do not have a test case that reproduces the bug deterministically, so all we can do is wait and see whether the daemon dies again.

Comment 3 Jeff Layton 2011-05-16 12:25:15 UTC

ACK -- patch looks sane to me. We shouldn't be resetting the signal type until after we disable the notification.

Comment 4 Jeff Layton 2011-05-16 12:25:42 UTC

Might be good to propose this patch formally on linux-nfs.org

Comment 5 Luca Giuzzi 2011-05-16 13:39:12 UTC

Done!
Thanks,
 lg

Comment 6 Steve Dickson 2011-05-16 14:20:50 UTC

(In reply to comment #1)
> We are having similar problems with two F14 machines: rpc.idmapd dies randomly.
> 
> Perhaps, I can provide some further information.
> A strace of rpc.idmapd (and a quick check to idmapd.c) showed that sometimes
> a SIGUSR1 is received between invocations of
>                 fcntl(ic->ic_dirfd, F_SETSIG, 0);
>                 fcntl(ic->ic_dirfd, F_NOTIFY, 0);
> in nfsopen()  [the relevant lines are 781-782 in idmapd.c].
> This leads to an unprocessed SIGIO and the untimely and unfortunate death of
> the process, or so it appears. 
> 
Hmm... I'm a bit concerned by the fact that there are
a couple of places that set F_SETSIG and then set F_NOTIFY,
so does that mean those places are potential races as
well?

Also, after you applied this patch, did the problem go away?
Finally, how often did this problem occur?

Comment 7 Steve Dickson 2011-05-16 14:24:55 UTC

Hey Luca,

Let's go ahead and have this conversation on the
linux-nfs mailing list... More eyes are always better! :-)

Comment 8 Jeff Layton 2011-05-16 14:39:23 UTC

(In reply to comment #6)
> Hmm... I'm a bit concerned by the fact that there are
> a couple of places that sets F_SETSIG and then sets F_NOTIFY, 
> so does that mean those places are potential race cases as 
> well?
> 

I think those places are already correct. They're actually setting dnotify watches and so they should set the signal first. Luca's patch fixes the one place where we're disabling a dnotify. In that case, we don't want to change the signal first since it's possible a notification will race in between the two calls and send a SIGIO instead of SIGUSR2. The right thing to do is to disable the notification first and then set the signal back to the default (SIGIO).
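
To illustrate the ordering described above, here is a minimal sketch. It is not the actual idmapd.c code: watch_dir() and unwatch_dir() are hypothetical helper names, and the DN_CREATE | DN_DELETE | DN_MULTISHOT mask is only an assumed example of a dnotify event mask.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* F_SETSIG, F_NOTIFY and the DN_* flags are Linux extensions */
#endif
#include <fcntl.h>
#include <signal.h>

/* Hypothetical helper: start a dnotify watch on dirfd, delivered as signo
 * (for example SIGUSR2). The signal is selected first, so no event can be
 * delivered as the default SIGIO once the watch is active. */
int watch_dir(int dirfd, int signo)
{
    if (fcntl(dirfd, F_SETSIG, signo) == -1)
        return -1;
    if (fcntl(dirfd, F_NOTIFY, DN_CREATE | DN_DELETE | DN_MULTISHOT) == -1)
        return -1;
    return 0;
}

/* Hypothetical helper: stop the watch. The notification is disabled first;
 * only then is the signal reset to the default (0 means SIGIO). Doing it in
 * the opposite order leaves a window in which a pending event is delivered
 * as SIGIO, whose default action terminates the process. */
int unwatch_dir(int dirfd)
{
    if (fcntl(dirfd, F_NOTIFY, 0) == -1)
        return -1;
    if (fcntl(dirfd, F_SETSIG, 0) == -1)
        return -1;
    return 0;
}
```

A caller would pair watch_dir(fd, SIGUSR2) with a later unwatch_dir(fd) on the same directory descriptor, with a SIGUSR2 handler installed beforehand.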

Comment 9 Roderick Johnstone 2011-10-05 09:17:01 UTC

Is there a chance that the fix for this can make it into F16 please?

Comment 10 Paul Raines 2011-10-14 17:08:35 UTC

And a fix for RHEL6 too please.