| Summary: | Processes hanging while accessing automounted filesystems | | |
|---|---|---|---|
| Product: | [Fedora] Fedora EPEL | Reporter: | Martin Simmons <martin> |
| Component: | am-utils | Assignee: | Ian Kent <ikent> |
| Status: | CLOSED EOL | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | el6 | CC: | ikent, martin |
| Target Milestone: | --- | Target Release: | --- |
| Hardware: | Unspecified | OS: | Linux |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-11-30 15:56:59 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Attachments: | 1225309, 1225310, 1225311, 1226653 (see comments below) | | |
Created attachment 1225310 [details]
Session log from running try-hang-nfs.sh -x

Created attachment 1225311 [details]
Debug log from running try-hang-nfs.sh -x
(In reply to Martin Simmons from comment #0)
> Created attachment 1225309 [details]
> Script to reproduce
>
> Description of problem:
>
> Since updating from am-utils 6.2.0-8.el6 to am-utils 6.2.0-22.el6, I have
> had major problems with processes hanging while trying to access automounted
> filesystems.
>
> I've attached a self-contained script (try-hang-nfs.sh) showing the problem
> (triggered in this case by using the -n option to amd). I will also attach
> the session log (session-hang-nfs.log) and debug log (amd-hang-nfs.log).
>
> Version-Release number of selected component (if applicable): 6.2.0-22.el6
>
> How reproducible: 100% after some time
>
> Steps to Reproduce:
> Run the attached script try-hang-nfs.sh. It assumes an NFS server called
> lwfs1-cam with CNAME mailhost exporting /var/mail.
>
> Actual results:
> There is a 52s delay at 14:20:37 in the session log (session-hang-nfs.log).
> The kernel reported "/nfs not responding, still trying" during this time.
>
> Expected results:
> Should not have this delay.
>
> Additional info:
> I've deliberately shortened the amd timeouts in the script. The delay
> increases with the default timeouts.
>
> On the production machine, I also have a second map that shares its
> underlying NFS mount point in ${autodir}, which causes similar problems.
>
> This kind of setup has worked since the mid-1990s at least.
>
> Rebuilding amd without am-utils-6.2-fix-umount-to-mount-race.patch fixes it.
> I think this patch is flawed because mntfs structures are shared, so cannot
> be marked as not mounted as the patch does.

Thanks very much for the information, I'll have a look at it.

I'm not sure how my time will go, so if I don't get to it soon I'll drop the
patch from the package until I can return to it.

Ian

(In reply to Martin Simmons from comment #0)
>
> Rebuilding amd without am-utils-6.2-fix-umount-to-mount-race.patch fixes it.
> I think this patch is flawed because mntfs structures are shared, so cannot
> be marked as not mounted as the patch does.

I was thinking about what you said on the am-utils mailing list and I agree I
didn't properly take into account the shared mounts case.

That'll make it tricky to fix the original reporter's bug ;)

Ian

(In reply to Ian Kent from comment #4)
> (In reply to Martin Simmons from comment #0)
> >
> > Rebuilding amd without am-utils-6.2-fix-umount-to-mount-race.patch fixes it.
> > I think this patch is flawed because mntfs structures are shared, so cannot
> > be marked as not mounted as the patch does.
>
> I was thinking about what you said on the am-utils mailing list
> and I agree I didn't properly take into account the shared
> mounts case.

This, together with the original race, is quite an interesting bug.

On the one hand, clearing the MFF_MOUNTED flag is wrong, but the test in
amfs_lookup_node() of:

  if (!(mf->mf_flags & MFF_MOUNTED) || (mf->mf_flags & MFF_UNMOUNTING)) {
      emit an "in progress" message and continue without mounting
  }

has unclear intent too.

Ignoring the MFF_UNMOUNTING check, it says that if already mounted, assume
there's a mount in progress. But IIRC, in the original case a umount has just
occurred, and MFF_MOUNTED doesn't get cleared because a mount attempt is also
in progress and the ref count has been incremented. Consequently the needed
new mount isn't done.

Perhaps MFF_MOUNTED should be conditionally cleared so it is cleared early
enough for the new mount to be done. I'll need to think about that.

Ian

Created attachment 1226653 [details]
Patch - fix umount to mount race (updated)
Here is a patch I believe fixes the problem here.
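To make the sequence described in comment #5 easier to follow, here is a
minimal, self-contained C model of the quoted check. It is illustrative only:
the flag values, the mntfs layout, and the assumed fall-through behaviour when
MFF_MOUNTED is still set are guesses for the sketch, not the am-utils source.

```c
#include <stdio.h>

/* Illustrative flag values; the real MFF_* definitions live in am-utils. */
#define MFF_MOUNTED     0x01   /* mntfs believed to be mounted */
#define MFF_UNMOUNTING  0x02   /* an unmount of the mntfs is in flight */

/* Illustrative layout; only the fields mentioned in the discussion. */
struct mntfs {
    int mf_flags;   /* MFF_* state bits */
    int mf_refc;    /* reference count; a shared mntfs keeps this raised */
};

/* The quoted lookup-time test, plus the assumed behaviour when it is false. */
static const char *lookup_decision(const struct mntfs *mf)
{
    if (!(mf->mf_flags & MFF_MOUNTED) || (mf->mf_flags & MFF_UNMOUNTING))
        return "report 'mount in progress' and do not start a mount";
    return "treat the node as already mounted; no fresh mount is started";
}

int main(void)
{
    /* An unmount genuinely in flight: skipping a new mount here is sensible. */
    struct mntfs unmounting = { MFF_MOUNTED | MFF_UNMOUNTING, 1 };

    /* The racy case from comment #5: the umount has just completed, but
       MFF_MOUNTED was never cleared because the shared mntfs is still
       referenced by a concurrent mount attempt. */
    struct mntfs stale = { MFF_MOUNTED, 2 };

    printf("unmounting mntfs: %s\n", lookup_decision(&unmounting));
    printf("stale mntfs:      %s\n", lookup_decision(&stale));
    return 0;
}
```

In this simplified reading, neither outcome starts the fresh mount that the
just-unmounted node needs, which matches the hangs reported above.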
A package with the change from comment #6 has been pushed to the testing
repositories, so it should show up fairly soon. Could you please test the
package to see if it resolves the reported problem in your environment?

Certainly the test script you provided was very useful (thanks a lot for the
effort you put in to provide it), and the package passes that test. It also
still passes the test for the original problem, so it looks good from my POV.

Posting a comment to the update at
https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2016-8616f525e3
would also be helpful.

Ian

Thanks, I'll leave it running for a while with the testing package and report
back.

I can confirm that the package in testing fixes the problem. At least I now
have some idea how amd works after more than 20 years of using it :-)

This message is a reminder that EPEL 6 is nearing its end of life. Fedora will
stop maintaining and issuing updates for EPEL 6 on 2020-11-30. It is our
policy to close all bug reports from releases that are no longer maintained.
At that time this bug will be closed as EOL if it remains open with a
'version' of 'el6'.

Package Maintainer: If you wish for this bug to remain open because you plan
to fix it in a currently maintained version, simply change the 'version' to a
later EPEL version.

Thank you for reporting this issue and we are sorry that we were not able to
fix it before EPEL 6 reached end of life. If you would still like to see this
bug fixed and are able to reproduce it against a later version of Fedora, you
are encouraged to change the 'version' to a later Fedora version before this
bug is closed, as described in the policy above.

EPEL el6 changed to end-of-life (EOL) status on 2020-11-30. EPEL el6 is no
longer maintained, which means that it will not receive any further security
or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of EPEL,
please feel free to reopen this bug against that version. If you are unable
to reopen this bug, please file a new report against the current release.

If you experience problems, please add a comment to this bug.

Thank you for reporting this bug and we are sorry it could not be fixed.
Created attachment 1225309 [details]
Script to reproduce

Description of problem:

Since updating from am-utils 6.2.0-8.el6 to am-utils 6.2.0-22.el6, I have had
major problems with processes hanging while trying to access automounted
filesystems.

I've attached a self-contained script (try-hang-nfs.sh) showing the problem
(triggered in this case by using the -n option to amd). I will also attach
the session log (session-hang-nfs.log) and debug log (amd-hang-nfs.log).

Version-Release number of selected component (if applicable): 6.2.0-22.el6

How reproducible: 100% after some time

Steps to Reproduce:
Run the attached script try-hang-nfs.sh. It assumes an NFS server called
lwfs1-cam with CNAME mailhost exporting /var/mail.

Actual results:
There is a 52s delay at 14:20:37 in the session log (session-hang-nfs.log).
The kernel reported "/nfs not responding, still trying" during this time.

Expected results:
Should not have this delay.

Additional info:
I've deliberately shortened the amd timeouts in the script. The delay
increases with the default timeouts.

On the production machine, I also have a second map that shares its
underlying NFS mount point in ${autodir}, which causes similar problems.

This kind of setup has worked since the mid-1990s at least.

Rebuilding amd without am-utils-6.2-fix-umount-to-mount-race.patch fixes it.
I think this patch is flawed because mntfs structures are shared, so they
cannot be marked as not mounted as the patch does.
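For readers without the attachments, the setup described above can be pictured
with a rough sketch along the following lines. It is not the attached
try-hang-nfs.sh: the map contents, the /nfs mount directory, and the access
loop are assumptions based only on the description in this report (the real
script also shortens amd's timeouts).

```sh
#!/bin/sh
# Hypothetical sketch only -- not the attached try-hang-nfs.sh.
# It assumes the NFS server lwfs1-cam (CNAME mailhost) exporting /var/mail,
# as described in the report, and serves a one-entry amd map on /nfs.

cat > /tmp/amd.mail <<'EOF'
/defaults  type:=nfs;opts:=rw,intr
mail       rhost:=mailhost;rfs:=/var/mail
EOF

# -n asks amd to normalize hostnames, so the mailhost CNAME is resolved to
# its canonical name; the report identifies this option as the trigger.
amd -n /nfs /tmp/amd.mail

# Hammer the automounted path so amd repeatedly mounts and unmounts it.
# With the affected package, an access eventually hangs ("/nfs not
# responding, still trying") while a umount races a new mount.
while :; do
    ls /nfs/mail >/dev/null
    sleep 1
done
```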