Bug 684308
| Summary: | nfsd: nfsv4 idmapping failing: has idmapd died? | | |
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Andrew McNabb <amcnabb> |
| Component: | nfs-utils | Assignee: | Steve Dickson <steved> |
| Status: | CLOSED RAWHIDE | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 14 | CC: | jlayton, luca.giuzzi, raines, rmj, steved |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | nfs-utils-1.2.5-2.fc17 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2011-10-04 17:42:03 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description
Andrew McNabb
2011-03-11 18:15:06 UTC
We are having similar problems with two F14 machines: rpc.idmapd dies randomly.
Perhaps I can provide some further information.
A strace of rpc.idmapd (and a quick check of idmapd.c) showed that sometimes
a SIGUSR1 is received between invocations of
fcntl(ic->ic_dirfd, F_SETSIG, 0);
fcntl(ic->ic_dirfd, F_NOTIFY, 0);
in nfsopen() [the relevant lines are 781-782 in idmapd.c].
This leads to an unprocessed SIGIO and the untimely and unfortunate death of the process, or so it appears.
I wonder whether a patch reversing the order of the two fcntl calls would be enough to solve the problem, or whether more is needed (I have to admit that I have not looked at the code with enough care and attention, though).
Some further remarks: both machines on which we found this problem have 32-bit kernels. On the slower one, attaching strace to rpc.idmapd apparently made the problem go away (!), suggesting that this could indeed be a timing issue.
Created attachment 498941 [details]
Patch swapping the fcntl's
As a follow-up to my previous comment: rpc.idmapd used to die with an 'I/O possible' error message (uncaught SIGIO). This trivial patch seems to have solved the issue (so far). Unfortunately we do not have a test case that reproduces the bug deterministically, so it is just a matter of seeing whether the daemon dies again.
ACK -- patch looks sane to me. We shouldn't be resetting the signal type until after we disable the notification. Might be good to propose this patch formally on linux-nfs.org.

Done! Thanks, lg

(In reply to comment #1)
> We are having similar problems with two F14 machines: rpc.idmapd dies randomly.
>
> Perhaps, I can provide some further information.
> A strace of rpc.idmapd (and a quick check to idmapd.c) showed that sometimes
> a SIGUSR1 is received between invocations of
> fcntl(ic->ic_dirfd, F_SETSIG, 0);
> fcntl(ic->ic_dirfd, F_NOTIFY, 0);
> in nfsopen() [the relevant lines are 781-782 in idmapd.c].
> This leads to an unprocessed SIGIO and the untimely and unfortunate death of
> the process, or so it appears.
>
Hmm... I'm a bit concerned by the fact that there are a couple of places that set F_SETSIG and then set F_NOTIFY, so does that mean those places are potential race cases as well?
Also, after you applied this patch, did the problem go away?
Finally, how often did this problem occur?

Hey Luca, Let's go ahead and have this conversation on the linux-nfs mailing... More eyes are always better! :-)

(In reply to comment #6)
> Hmm... I'm a bit concerned by the fact that there are
> a couple of places that set F_SETSIG and then set F_NOTIFY,
> so does that mean those places are potential race cases as
> well?
>
I think those places are already correct. They're actually setting dnotify watches, and so they should set the signal first. Luca's patch fixes the one place where we're disabling a dnotify. In that case, we don't want to change the signal first, since it's possible a notification will race in between the two calls and send a SIGIO instead of SIGUSR2. The right thing to do is to disable the notification first and then set the signal back to the default (SIGIO).

Is there a chance that the fix for this can make it into F16, please?

And a fix for RHEL6 too, please.
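
For reference, the call ordering discussed in this thread can be sketched as below. This is a minimal illustration, not the code from idmapd.c or from the attached patch: the helper names watch_dir() and unwatch_dir(), the choice of DN_* event flags, and passing the signal as a parameter are all assumptions made for the example.

```c
#define _GNU_SOURCE          /* F_SETSIG, F_GETSIG and F_NOTIFY are Linux-specific */
#include <assert.h>
#include <fcntl.h>
#include <signal.h>

/*
 * Hypothetical helper: arm a dnotify watch on an open directory.
 * The signal is chosen BEFORE the watch is armed, so no event can
 * be delivered as the default SIGIO.
 */
static int watch_dir(int dirfd, int sig)
{
    assert(dirfd >= 0);
    if (fcntl(dirfd, F_SETSIG, sig) == -1)
        return -1;
    /* The event mask here is illustrative only. */
    return fcntl(dirfd, F_NOTIFY, DN_CREATE | DN_DELETE | DN_MULTISHOT);
}

/*
 * Hypothetical helper: cancel the watch.
 * The order is the reverse of watch_dir(): notification is disabled
 * first, and only then is the signal reset. Resetting the signal
 * first leaves a window in which a pending directory event is
 * delivered as SIGIO, whose default action terminates the process.
 */
static int unwatch_dir(int dirfd)
{
    assert(dirfd >= 0);
    if (fcntl(dirfd, F_NOTIFY, 0) == -1)    /* 1. stop notifications */
        return -1;
    return fcntl(dirfd, F_SETSIG, 0);       /* 2. 0 restores the default, SIGIO */
}
```

A caller would install a handler for the chosen signal before calling watch_dir(), and call unwatch_dir() before closing or reopening the directory.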