Red Hat Bugzilla – Full Text Bug Listing
|Summary:||nfsd: nfsv4 idmapping failing: has idmapd died?|
|Product:||[Fedora] Fedora||Reporter:||Andrew McNabb <amcnabb>|
|Component:||nfs-utils||Assignee:||Steve Dickson <steved>|
|Status:||CLOSED RAWHIDE||QA Contact:||Fedora Extras Quality Assurance <extras-qa>|
|Version:||14||CC:||jlayton, luca.giuzzi, raines, rmj, steved|
|Fixed In Version:||nfs-utils-1.2.5-2.fc17||Doc Type:||Bug Fix|
|Last Closed:||2011-10-04 13:42:03 EDT||Type:||---|
Description Andrew McNabb 2011-03-11 13:15:06 EST
Description of problem:
I've had idmapd occasionally die without giving any error message. The kernel reports "kernel: [505261.523335] nfsd: nfsv4 idmapping failing: has idmapd died?"

I imagine it would be helpful to increase the verbosity level of rpc.idmapd, by setting "-v" in RPCIDMAPDARGS in /etc/sysconfig/nfs. Is there a certain number of times that "-v" should be repeated to give helpful information but without significantly slowing down the machine? Is there anything else that we should set to give more information about what's happening?

Version-Release number of selected component (if applicable):
nfs-utils-1.2.3-5.fc14.x86_64

How reproducible:
It's an intermittent problem, but it seems to happen on average to about one machine out of 30 per week.
Comment 1 Luca Giuzzi 2011-05-13 08:47:17 EDT
We are having similar problems with two F14 machines: rpc.idmapd dies randomly.

Perhaps, I can provide some further information.
A strace of rpc.idmapd (and a quick check to idmapd.c) showed that sometimes a SIGUSR1 is received between invocations of
    fcntl(ic->ic_dirfd, F_SETSIG, 0);
    fcntl(ic->ic_dirfd, F_NOTIFY, 0);
in nfsopen() [the relevant lines are 781-782 in idmapd.c].
This leads to an unprocessed SIGIO and the untimely and unfortunate death of the process, or so it appears.

I wonder if a patch reversing the order of the fcntl could be enough to solve the problem or more might be needed (I have to admit that I have not looked at the code with enough care and attention, though).

Some further remarks: both machines where we found this problem have 32bit kernels. On the slowest one, attaching a strace to rpc.idmapd apparently solved the problem (!) suggesting that this could indeed be a timing issue.
Comment 2 Luca Giuzzi 2011-05-14 13:37:40 EDT
Created attachment 498941
Patch swapping the fcntl's

As a follow-up to my previous comment: rpc.idmapd used to die with an 'I/O possible' error message (uncaught SIGIO). This trivial patch seems to have solved the issue (so far). Unfortunately we do not have any test case to deterministically reproduce the bug; thus it is just a matter of seeing if the daemon dies once more.
Comment 3 Jeff Layton 2011-05-16 08:25:15 EDT
ACK -- patch looks sane to me. We shouldn't be resetting the signal type until after we disable the notification.
Comment 4 Jeff Layton 2011-05-16 08:25:42 EDT
Might be good to propose this patch formally on the linux-nfs mailing list.
Comment 5 Luca Giuzzi 2011-05-16 09:39:12 EDT
Done! Thanks, lg
Comment 6 Steve Dickson 2011-05-16 10:20:50 EDT
(In reply to comment #1)
> We are having similar problems with two F14 machines: rpc.idmapd dies randomly.
>
> Perhaps, I can provide some further information.
> A strace of rpc.idmapd (and a quick check to idmapd.c) showed that sometimes
> a SIGUSR1 is received between invocations of
> fcntl(ic->ic_dirfd, F_SETSIG, 0);
> fcntl(ic->ic_dirfd, F_NOTIFY, 0);
> in nfsopen() [the relevant lines are 781-782 in idmapd.c].
> This leads to an unprocessed SIGIO and the untimely and unfortunate death of
> the process, or so it appears.
>

Hmm... I'm a bit concerned by the fact that there are a couple of places that set F_SETSIG and then set F_NOTIFY, so does that mean those places are potential race cases as well?

Also, after you applied this patch, did the problem go away?

Finally, how often did this problem occur?
Comment 7 Steve Dickson 2011-05-16 10:24:55 EDT
Hey Luca,

Let's go ahead and have this conversation on the linux-nfs mailing list... More eyes are always better! :-)
Comment 8 Jeff Layton 2011-05-16 10:39:23 EDT
(In reply to comment #6)
> Hmm... I'm a bit concerned by the fact that there are
> a couple of places that sets F_SETSIG and then sets F_NOTIFY,
> so does that mean those places are potential race cases as
> well?
>

I think those places are already correct. They're actually setting dnotify watches and so they should set the signal first.

Luca's patch fixes the one place where we're disabling a dnotify. In that case, we don't want to change the signal first since it's possible a notification will race in between the two calls and send a SIGIO instead of SIGUSR2. The right thing to do is to disable the notification first and then set the signal back to the default (SIGIO).
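The ordering described above can be sketched as a pair of helpers. This is only an illustration under stated assumptions, not the actual idmapd.c code or the attached patch: the names watch_dir and unwatch_dir are hypothetical, and the DN_CREATE | DN_DELETE | DN_MULTISHOT mask is assumed for the example.

```c
#define _GNU_SOURCE /* needed for F_SETSIG, F_NOTIFY and the DN_* flags */
#include <fcntl.h>
#include <signal.h>

/*
 * Hypothetical helper: arm a dnotify watch on an open directory.
 * The signal is chosen BEFORE notification is enabled, so the first
 * event is already delivered as signo rather than as SIGIO.
 * Returns 0 on success, -1 on error.
 */
int watch_dir(int dirfd, int signo)
{
    if (fcntl(dirfd, F_SETSIG, signo) == -1)
        return -1;
    /* Event mask is an assumption made for this sketch. */
    return fcntl(dirfd, F_NOTIFY, DN_CREATE | DN_DELETE | DN_MULTISHOT);
}

/*
 * Hypothetical helper: disarm the watch.
 * Notification is disabled BEFORE the signal is reset to the default,
 * so no event can arrive while the descriptor is set to send SIGIO.
 * Returns 0 on success, -1 on error.
 */
int unwatch_dir(int dirfd)
{
    if (fcntl(dirfd, F_NOTIFY, 0) == -1)
        return -1;
    return fcntl(dirfd, F_SETSIG, 0); /* 0 restores the default, SIGIO */
}
```

A caller would install a handler for the chosen signal (SIGUSR2 in the discussion above) before calling watch_dir(), and call unwatch_dir() once the watch is no longer needed. Reversing the two calls in unwatch_dir() would reopen the window in which a pending notification is delivered as SIGIO.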
Comment 9 Roderick Johnstone 2011-10-05 05:17:01 EDT
Is there a chance that the fix for this can make it into F16 please?
Comment 10 Paul Raines 2011-10-14 13:08:35 EDT
And a fix for RHEL6 too please.