This bug has been copied from bug #174821 and has been proposed to be backported to 4.5 z-stream (EUS).
A patch addressing this issue has been included in build 2.6.9-55.0.3.EL.
I have tested the bug on 2.6.9-55.EL and 2.6.9-55.0.4. The bug reproduces on 2.6.9-55.0.2 and does not occur on 2.6.9-55.0.4, so it appears to have been fixed.
I retested this bug and found that it failed on amd64-4as.lab.boston.redhat.com after running the test case for more than 4 hours. The output file is attached. On the other 3 hosts, ppcp-4as-bos.lab.boston.redhat.com, i386-4as-bos.lab.boston.redhat.com and ia64-4as.lab.boston.redhat.com, the test has not failed so far, having run for about 16 hours.
Created attachment 161833
The output of test case breakme.sh; the test fails.
(In reply to comment #6)
> Created an attachment (id=161833)
> the output of test case breakme.sh, the test fails.

I did uncover a problem with this patch during further testing when preparing the patch for posting upstream. I resolved it late last week and I'm about to update the patch for the RHEL4 and RHEL5 kernels. I'll post it here as soon as I've cast the patch against the RHEL4 kernel, later today.

The symptom for this problem is different to the one produced by the original bug. You should see the test script exit upon receiving an ENOENT (rather than hanging indefinitely) and autofs should continue to run. The problem is that, with more than one process on the wait queue, occasionally the order of the woken processes can lead to an error return from the lookup calls.

Ian
(In reply to comment #7)
> (In reply to comment #6)
> > Created an attachment (id=161833)
> > the output of test case breakme.sh, the test fails.

Sorry, the statements below actually refer to another bug (see bz#253231) and, although it's a RHEL5 bug, the code is the same so this bug also exists in RHEL4. However, the test I was using when I discovered this was the same as the test used to reproduce this bug. The problem with waiter wakeup order does also exist here and it causes the mount callback to the daemon to not happen and consequently ENOENT is returned.

> I did uncover a problem with this patch during further
> testing when preparing the patch for posting upstream.
> I resolved it late last week and I'm about to update
> the patch for the RHEL4 and RHEL5 kernels.
>
> I'll post it here as soon as I've cast the patch against
> the RHEL4 kernel, later today.
>
> The symptom for this problem is different to the one
> produced by the original bug. You should see the test
> script exit upon receiving an ENOENT (rather than hanging
> indefinitely) and autofs should continue to run. The
> problem is that, with more than one process on the wait
> queue, occasionally the order of the woken processes
> can lead to an error return from the lookup calls.

Ian
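For anyone retesting, the two symptoms described above (an ENOENT return versus an indefinite hang) can be told apart with a small concurrent access loop. The following is only a rough sketch and is not the breakme.sh test case: the function names `probe` and `hammer`, the use of threads rather than separate processes, and the worker and round counts are all assumptions made for illustration.

```python
import errno
import os
from concurrent.futures import ThreadPoolExecutor


def probe(path):
    """Stat one path; return the errno name on failure, else None."""
    try:
        os.stat(path)
        return None
    except OSError as exc:
        # Map the numeric errno to its symbolic name, e.g. 2 -> "ENOENT".
        return errno.errorcode.get(exc.errno, str(exc.errno))


def hammer(paths, workers=8, rounds=100):
    """Stat all paths concurrently, round after round.

    Returns (path, errno_name) for the first failure seen, or None if
    every access in every round succeeded. A hung lookup would show up
    as this call never returning.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(rounds):
            for path, err in zip(paths, pool.map(probe, paths)):
                if err is not None:
                    return path, err
    return None
```

Passing a list of directories under an automount point (paths chosen by the tester) should return `None` on a fixed kernel, while the wakeup-order problem would be expected to surface as a `(path, "ENOENT")` result.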
Created attachment 161841
Patch to fix wakeup order of processes when rehashing dentry

Please apply and test this fix. I think it was a coincidence that I also discovered this when testing on x86_64; it is a potential problem for all archs.

Ian
(In reply to comment #10)
> Hi Ian
>
> Is the test failure that QA uncovered a result of bz246530? If so, would it be
> reasonable to proceed with the patch for this issue as is and address bz246530
> in the next async update?

Oddly enough I thought it was when I discovered it, but in fact it's an error with the patch for the issue here, the mount expire race.

Ian
From: Zhang Kexin <kzhang>
Subject: Re: [Fwd: kernel 4.5.z test build]

Hi Don, Martin,

I have run the test for bug248126 on all six architectures except IA64 (the kernel for ia64 cannot be installed); the bug was not reproduced. On ppciseries the test has been running for more than 5 hours; the other hosts have run the test for more than 8 hours.

Thanks,
Kexin
Created attachment 173241
Patch to sync autofs4 with upstream

There is a risk of some confusion regarding the various patches. To be able to use the same patches everywhere, we need to sync the source in the various kernels with upstream. This patch brings the RHEL 4 kernel in line with upstream.
Created attachment 173261
Patch to fix issue reported during QA

This patch fixes a failure reported during QA testing. It is in fact a hunk from another autofs4 patch that resolves a deadlock during directory creation under load (see bug #246530 for info). The deadlock patch delays hashing of dentries at directory creation until the actual create operation, so dentries remain unhashed for a relatively long time and the code in this patch was needed there.

With the expire/mount race fix here, dentries are unhashed for a relatively brief time, so the code in this patch was not identified as needed during development. However, if there are many processes concurrently accessing directories, it's possible there will be two or more waiters in the queue. Only one of the waiters will have the dentry required to complete the lookup, and the others need to perform a d_lookup to get the correct dentry. This patch allows these processes to perform the needed d_lookup.

Ian
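To illustrate the waiter situation described in this comment, here is a minimal userspace model, not kernel code and not the patch itself. All class and function names (`Dentry`, `DentryCache`, `finish_lookup`, `run_waiters`) are hypothetical, and the model assumes that each concurrent lookup holds its own dentry object while only one of them ends up rehashed.

```python
class Dentry:
    """Stand-in for a kernel dentry: a name plus a 'hashed' flag."""

    def __init__(self, name):
        self.name = name
        self.hashed = False


class DentryCache:
    """Stand-in for the dcache hash table, keyed by name."""

    def __init__(self):
        self._table = {}

    def rehash(self, dentry):
        dentry.hashed = True
        self._table[dentry.name] = dentry

    def lookup(self, name):
        # Plays the role of d_lookup(): the hashed dentry, or None.
        return self._table.get(name)


def finish_lookup(cache, held, relookup):
    """What a waiter does once woken; None models an ENOENT return."""
    if held.hashed:
        return held  # this waiter already holds the right dentry
    if relookup:
        return cache.lookup(held.name)  # retry through the cache
    return None  # no retry allowed: the lookup fails


def run_waiters(n_waiters, relookup):
    cache = DentryCache()
    # Assumption: every waiter sleeps holding its own unhashed dentry.
    waiters = [Dentry("dir") for _ in range(n_waiters)]
    # The mount completes and only the first waiter's dentry is rehashed.
    cache.rehash(waiters[0])
    return [finish_lookup(cache, d, relookup) for d in waiters]
```

In this model, `run_waiters(3, relookup=False)` leaves the second and third waiters with `None` (the ENOENT case), while `run_waiters(3, relookup=True)` gives all three waiters the same rehashed dentry.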
(In reply to comment #19)
> Hi Ian -
>
> Does rhel4.5 need lookup-expire-race patch? I noticed that it's attached to bz
> 174821, but not here.

Yes, definitely. Would you like me to verify it against the kernel we're using here and post it here?

Ian
I don't think there is a need to repost the patches. Thanks.
A patch for this issue has been included in build 2.6.9-55.0.7.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2007-0939.html