Description of problem:
I've had idmapd occasionally die without giving any error message. The kernel reports:

kernel: [505261.523335] nfsd: nfsv4 idmapping failing: has idmapd died?

I imagine it would be helpful to increase the verbosity level of rpc.idmapd by setting "-v" in RPCIDMAPDARGS in /etc/sysconfig/nfs. Is there a certain number of times that "-v" should be repeated to give helpful information without significantly slowing down the machine? Is there anything else that we should set to give more information about what's happening?

Version-Release number of selected component (if applicable):
nfs-utils-1.2.3-5.fc14.x86_64

How reproducible:
It's an intermittent problem, but on average it seems to happen to about one machine out of 30 per week.
We are having similar problems with two F14 machines: rpc.idmapd dies randomly.

Perhaps I can provide some further information.
A strace of rpc.idmapd (and a quick check of idmapd.c) showed that sometimes a SIGUSR1 is received between invocations of
fcntl(ic->ic_dirfd, F_SETSIG, 0);
fcntl(ic->ic_dirfd, F_NOTIFY, 0);
in nfsopen() [the relevant lines are 781-782 in idmapd.c].
This leads to an unprocessed SIGIO and the untimely and unfortunate death of the process, or so it appears.

I wonder whether a patch reversing the order of the two fcntl calls would be enough to solve the problem, or whether more is needed (I have to admit that I have not looked at the code with enough care and attention, though).

Some further remarks: both machines where we found this problem have 32-bit kernels. On the slower one, attaching strace to rpc.idmapd apparently solved the problem (!), suggesting that this could indeed be a timing issue.
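To make the suspected window concrete, here is a minimal standalone sketch; it is not code from idmapd.c. The function name signal_seen_in_window() and the "probe" file name are hypothetical, and the sketch installs a handler for SIGIO so that the process survives, whereas the report above is about SIGIO arriving with no handler installed. It assumes a Linux kernel with dnotify support.

```c
#define _GNU_SOURCE          /* needed for F_SETSIG, F_NOTIFY and DN_* */
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static volatile sig_atomic_t last_signal;

/* Record which signal the kernel delivered. */
static void record(int sig)
{
    last_signal = sig;
}

/*
 * Hypothetical helper: arm a dnotify watch on 'dir', tear it down in the
 * order quoted in the report (F_SETSIG first, F_NOTIFY second), and create
 * a file between the two calls. Returns the signal number that was
 * delivered for that event, 0 if none was, or -1 on setup failure.
 */
int signal_seen_in_window(const char *dir)
{
    char path[4096];
    int fd, probe;

    assert(dir != NULL);
    last_signal = 0;
    if (signal(SIGUSR1, record) == SIG_ERR || signal(SIGIO, record) == SIG_ERR)
        return -1;

    fd = open(dir, O_RDONLY | O_DIRECTORY);
    if (fd == -1)
        return -1;

    /* Arm the watch: choose the signal, then enable notification. */
    if (fcntl(fd, F_SETSIG, SIGUSR1) == -1 ||
        fcntl(fd, F_NOTIFY, DN_CREATE | DN_MULTISHOT) == -1) {
        close(fd);
        return -1;
    }

    /* Teardown, first call: the signal goes back to the default (SIGIO). */
    fcntl(fd, F_SETSIG, 0);

    /* A directory event lands inside the window. */
    snprintf(path, sizeof(path), "%s/probe", dir);
    probe = open(path, O_CREAT | O_WRONLY, 0600);
    if (probe != -1)
        close(probe);

    /* Teardown, second call: only now is the notification disabled. */
    fcntl(fd, F_NOTIFY, 0);

    unlink(path);
    close(fd);
    return last_signal;
}
```

Called on an empty scratch directory, this is expected to return SIGIO rather than SIGUSR1, because the watch is still armed when the file is created but the signal has already been reset.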
Created attachment 498941
Patch swapping the fcntl's

As a follow-up to my previous comment: rpc.idmapd used to die with an 'I/O possible' error message (uncaught SIGIO). This trivial patch seems to have solved the issue (so far). Unfortunately we do not have a test case that reproduces the bug deterministically, so it is just a matter of waiting to see whether the daemon dies again.
ACK -- patch looks sane to me. We shouldn't be resetting the signal type until after we disable the notification.
Might be good to propose this patch formally on linux-nfs.org
Done! Thanks, lg
(In reply to comment #1)
> We are having similar problems with two F14 machines: rpc.idmapd dies randomly.
>
> Perhaps, I can provide some further information.
> A strace of rpc.idmapd (and a quick check to idmapd.c) showed that sometimes
> a SIGUSR1 is received between invocations of
> fcntl(ic->ic_dirfd, F_SETSIG, 0);
> fcntl(ic->ic_dirfd, F_NOTIFY, 0);
> in nfsopen() [the relevant lines are 781-782 in idmapd.c].
> This leads to an unprocessed SIGIO and the untimely and unfortunate death of
> the process, or so it appears.
>
Hmm... I'm a bit concerned by the fact that there are a couple of places that set F_SETSIG and then set F_NOTIFY, so does that mean those places are potential race cases as well?

Also, after you applied this patch, did the problem go away?

Finally, how often did this problem occur?
Hey Luca,

Let's go ahead and have this conversation on the linux-nfs mailing list... More eyes are always better! :-)
(In reply to comment #6)
> Hmm... I'm a bit concerned by the fact that there are
> a couple of places that sets F_SETSIG and then sets F_NOTIFY,
> so does that mean those places are potential race cases as
> well?
>
I think those places are already correct. They're actually setting dnotify watches and so they should set the signal first.

Luca's patch fixes the one place where we're disabling a dnotify. In that case, we don't want to change the signal first since it's possible a notification will race in between the two calls and send a SIGIO instead of SIGUSR2. The right thing to do is to disable the notification first and then set the signal back to the default (SIGIO).
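The ordering rule described above can be sketched as a pair of helpers. This is only an illustration under stated assumptions, not the attached patch and not the actual idmapd.c code: watch_dir() and unwatch_dir() are hypothetical names, and the event mask is chosen for the example.

```c
#define _GNU_SOURCE          /* needed for F_SETSIG, F_NOTIFY and DN_* */
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

/*
 * Hypothetical helper: arm a dnotify watch on an open directory.
 * The signal is chosen BEFORE notification is enabled, so no event
 * can be reported with the default SIGIO.
 */
int watch_dir(int dirfd, int sig)
{
    assert(dirfd >= 0);
    if (fcntl(dirfd, F_SETSIG, sig) == -1)
        return -1;
    return fcntl(dirfd, F_NOTIFY, DN_CREATE | DN_DELETE | DN_MULTISHOT);
}

/*
 * Hypothetical helper: disarm the watch.
 * Notification is disabled BEFORE the signal is reset, so an event
 * racing with the teardown is still reported with the handled signal.
 */
int unwatch_dir(int dirfd)
{
    assert(dirfd >= 0);
    if (fcntl(dirfd, F_NOTIFY, 0) == -1)
        return -1;
    return fcntl(dirfd, F_SETSIG, 0);   /* 0 restores the default, SIGIO */
}
```

The two helpers are deliberately mirror images: whichever call makes SIGIO delivery possible is always the one made while no watch is armed.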
Is there a chance that the fix for this can make it into F16 please?
And a fix for RHEL6 too please.