Description of problem:
Customer has a very large number of direct maps. They ran into an issue where "automounter WILL cease functioning after about 24-72 hours." To work around the issue they added 'ulimit -n 20480; ulimit -s 65535' to /etc/init.d/autofs. Ian Kent has a patch for autofs to add a configuration option to increase max open files and max stack size. This way the customer doesn't have to hack /etc/init.d/autofs.
(In reply to comment #0)
> Description of problem:
> Customer has a very large number of direct maps. They ran into an issue where
> "automounter WILL cease functioning after about 24-72 hours." To work around
> the issue they added 'ulimit -n 20480; ulimit -s 65535' to /etc/init.d/autofs.
> Ian Kent has a patch for autofs to add a configuration option to increase max
> open files and max stack size. This way the customer doesn't have to hack
> /etc/init.d/autofs

Right, but there are a couple of unanswered questions at this stage.

First, assuming we do need to add these configuration options, there is the question of their initial values. The values above are quite large and most people won't come close to needing them set that high. Is the customer happy to update their autofs configuration with the values they need to use?

Second, I don't think the "ulimit -n 20480" does anything because autofs explicitly sets the maximum open files when it starts. We need to know what happens when that setting isn't used. Of course, that doesn't mean we then wouldn't add it as a configuration option, but there would be no reason to increase it beyond the current 10240 that we use.

Finally, there is the question of whether the problem is being caused by another issue. Recently an upstream change brought to light some out-of-bounds (by one) array references to a stack variable in several functions in the LDAP lookup module. RHEL autofs doesn't have this change but those illegal references are present, and even though we haven't had any other reports of a problem due to this, it may be the cause here.

So, I recommend I fix the array reference issue and provide a scratch build for testing before we go any further. Is that OK?

Ian
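To illustrate the second point above: a limit raised with ulimit in the init script is only inherited by the daemon, so a process that sets RLIMIT_NOFILE itself at startup replaces the inherited soft limit. The following is a minimal sketch of that behaviour in Python, not the autofs code (autofs is written in C). The helper name set_open_files_limit is hypothetical, and capping the request at the hard limit is an assumption made so the sketch also runs unprivileged.

```python
import resource

# Value quoted in the comment above as the maximum open files autofs sets.
AUTOFS_MAX_OPEN_FILES = 10240


def set_open_files_limit(wanted=AUTOFS_MAX_OPEN_FILES):
    """Set this process's soft RLIMIT_NOFILE, replacing the inherited value.

    Hypothetical helper for illustration only. The request is capped at the
    hard limit because an unprivileged process cannot raise the hard limit.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        wanted = min(wanted, hard)
    # Only the soft limit changes; the hard limit is passed back unchanged.
    resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    # Whatever "ulimit -n" the parent shell used shows up here ...
    print("inherited:", resource.getrlimit(resource.RLIMIT_NOFILE))
    # ... and is replaced once the process sets its own limit.
    print("after explicit set:", set_open_files_limit())
```

Run from a shell after 'ulimit -n 20480', the second line printed should show a soft limit of 10240 (or the hard limit, if that is lower), which is why the ulimit in the init script would not be expected to take effect for open files.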
Event posted on 07-29-2009 09:19am EDT by jbrier

>So, I recommend I fix the array reference issue and provide a
>scratch build for testing before we go any further.
>Is that OK?

In the email correspondence between Frank Hirtz, Ian Kent, et al. I think that was determined to be the best thing to try *first*.

Please do that, Ian, and I will pass the test packages along to the customer.

Just for clarification, the Issue-Tracker was originally opened as an RFE, but I meant this to go to Engineering with the expectation that we would try to fix the array reference issue first, as a potential Bug, NOT an RFE. Jeremy West reassigned the escalation group as such. The title of the IT/BZ is probably not currently accurate.

John Brier

This event sent from IssueTracker by jbrier
issue 322595
(In reply to comment #2)
> Event posted on 07-29-2009 09:19am EDT by jbrier
>
> >So, I recommend I fix the array reference issue and provide a
> >scratch build for testing before we go any further.
>
> >Is that OK?
>
> In the email correspondence between Frank Hirtz, Ian Kent, et al I think
> that was determined to be the best thing to try, *first*
>
> Please do that Ian and I will pass the test packages along to the
> customer.

Done, the package with the array reference fix can be found at:
http://people.redhat.com/~ikent/autofs-5.0.1-0.rc2.130.bz514412.1

>
> Just for clarification, the Issue-Tracker was originally opened as an RFE
> but I meant this to go to Engineering with the expectation that we would
> try to fix the array reference issue first, as a potential Bug, NOT an
> RFE. Jeremy West reassigned the escalation group as such. The title of
> the IT/BZ is probably not currently accurate.

It matters not; both changes are straightforward. We just needed a bug so we could track the work.

Ian
I'll have a look at the strace but I don't think it will be useful.

When autofs hangs we need some specific information, in particular a gdb backtrace of the threads in automount. The corresponding debuginfo package needs to be installed for the backtrace to be useful. The debuginfo packages are also available on my people page for the packages being used.

When autofs hangs do:

    gdb -p <automount pid> /usr/sbin/automount
    gdb> thr a a bt

and save the output and post it to the bug. A copy of /proc/mounts would also be useful.

Also, if using the RHEL-5.4 kernel, could we use two different setups for this, one with the autofs configuration USE_MISC_DEVICE="yes" and the other with USE_MISC_DEVICE="no" or commented out?

If we aren't using the 5.4 kernel, which revision is in use?

Ian
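The collection steps above could be scripted so they are easy to repeat each time the hang occurs. This is only a sketch: the function names and output file names are hypothetical, and it relies on gdb's standard -batch and -ex options to run "thread apply all bt" (the long form of "thr a a bt") non-interactively. It needs to be run as root with the matching debuginfo package installed.

```python
import subprocess
import sys
from pathlib import Path


def build_gdb_backtrace_command(pid, binary="/usr/sbin/automount"):
    """Return a gdb command line that prints a backtrace of every thread."""
    return [
        "gdb", "-batch",
        "-ex", "thread apply all bt",
        "-p", str(pid),
        binary,
    ]


def collect_hang_info(pid, out_dir="."):
    """Save the thread backtraces and a copy of /proc/mounts into out_dir.

    The output file names are illustrative assumptions, not a convention
    from this bug.
    """
    out = Path(out_dir)
    result = subprocess.run(
        build_gdb_backtrace_command(pid),
        capture_output=True, text=True, check=False,
    )
    # Keep stderr too: gdb reports missing debuginfo and attach errors there.
    (out / "automount-backtrace.txt").write_text(result.stdout + result.stderr)
    (out / "proc-mounts.txt").write_text(Path("/proc/mounts").read_text())


if __name__ == "__main__":
    # Usage: python collect_hang_info.py <automount pid>
    collect_hang_info(int(sys.argv[1]))
```

Both output files can then be attached to the bug as they are.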
Created attachment 358519 [details] Patch to add back omitted locking on direct map re-read
Created attachment 358810 [details] Patch to fix recent "dont umount existing direct mount on reread" change

This latest debug and trace information led me to look again at the area of code that needed the locking fix above. It certainly looks like there is an incorrect check in another recent fix which is causing a deadlock. Have a look at the description in the patch for a more detailed explanation.
Created attachment 358812 [details] Patch to fix libxml2 thread safety issue

I see from the debug log you're using LDAP (yeah, I knew that anyway). Recent changes seem to have altered the concurrency behaviour of autofs a little, which has caused an issue with libxml2 to show up. We have just received feedback from a customer who tested this patch advising us it fixes the issue for them, so I've included it in the update here as well.
I can't be sure this revision fixes the hang we are seeing but the evidence appears to match. It looks like a deadlock has been introduced by a recent bug fix, which makes sense because the really significant changes that went into 5.4 have been tested extensively.

Could you please test revision 0.rc2.130.bz514412.3. It can be found at: http://people.redhat.com/~ikent/autofs-5.0.1-0.rc2.130.bz514412.3.

Ian
I've marked this bug as dependent on bug 513289 because the package tested here also included that correction. The changes for this bug and 513289 are relatively small, so they are low risk, and they have been verified by customers.

I believe I will be able to write an RHTS regression test for the issue in this bug and will postpone marking the two bugs here as MODIFIED until that task has been done. However, the changes have been committed to CVS and autofs built as revision 0.rc2.132.

We should have this build tested again by the customers concerned for completeness (while the test is written).

Ian
(In reply to comment #37)
>
> I believe I will be able to write an RHTS regression test
> for the issue in this bug and will postpone marking the two
> bugs here as MODIFIED until that task has been done. However,
> the changes have been committed to CVS and autofs built as
> revision 0.rc2.132.

The RHTS test bugzillas/bz493791 has been updated to check for the regression identified and resolved in this bug.

Setting bug status to MODIFIED.

Ian
(In reply to comment #38)
> (In reply to comment #37)
>
> The RHTS test bugzillas/bz493791 has been updated to check
> for the regression identified and resolved in this bug.

It turns out that this regression is triggered by requesting a map re-load after an entry in a direct map has been removed, not after an entry has been modified, as was initially thought.

Ian
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0265.html