Description of problem:
The automount daemon has been found hung at times: new mounts do not happen, and at least one thread is waiting in an expire ioctl.

Version-Release number of selected component (if applicable):
autofs-5.0.1-0.rc2.131.el5_4.1-x86_64 on kernel-2.6.18-164.10.1.el5

How reproducible:
Still unknown, but the customer has over 100 servers running into this issue without any known or observable common factor.

Steps to Reproduce:
Not known.

Actual results:
automount is hung with threads like below:

(gdb) thr a a bt

Thread 6 (process 3406):
#0 0x00002b05c9c21658 in do_sigwait () from /lib64/libpthread.so.0
#1 0x00002b05c9c216fd in sigwait () from /lib64/libpthread.so.0
#2 0x00002b05c97b855d in statemachine (arg=<value optimized out>) at automount.c:1315
#3 0x00002b05c97b970b in main (argc=-758954912, argv=<value optimized out>) at automount.c:2143

Thread 5 (process 17655):
#0 0x00002b05caae2557 in ioctl () from /lib64/libc.so.6
#1 0x00002b05c97d1eae in expire (logopt=0, cmd=<value optimized out>, fd=3, ioctlfd=11, path=0x2b05d2c3a6a0 "/home", arg=0x42366020) at dev-ioctl-lib.c:669
#2 0x00002b05c97d23bb in dev_ioctl_expire (logopt=3, ioctlfd=-1, path=0x2b05d2c3a6a0 "/home", when=<value optimized out>) at dev-ioctl-lib.c:706
#3 0x00002b05c97bccb0 in expire_proc_indirect (arg=<value optimized out>) at indirect.c:499
#4 0x00002b05c9c19617 in start_thread () from /lib64/libpthread.so.0
#5 0x00002b05caae9c2d in clone () from /lib64/libc.so.6

Thread 4 (process 3414):
#0 0x00002b05caae0e46 in poll () from /lib64/libc.so.6
#1 0x00002b05c97bb204 in handle_mounts (arg=0x7fff9c360a20) at automount.c:866
#2 0x00002b05c9c19617 in start_thread () from /lib64/libpthread.so.0
#3 0x00002b05caae9c2d in clone () from /lib64/libc.so.6

Thread 3 (process 3411):
#0 0x00002b05c9c1df70 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00002b05c97bd9f9 in handle_packet_expire_indirect (ap=<value optimized out>, pkt=<value optimized out>) at indirect.c:678
#2 0x00002b05c97bb752 in handle_mounts (arg=0x7fff9c360a20) at automount.c:1039
#3 0x00002b05c9c19617 in start_thread () from /lib64/libpthread.so.0
#4 0x00002b05caae9c2d in clone () from /lib64/libc.so.6

Thread 2 (process 3408):
#0 0x00002b05c9c1df70 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00002b05c97c6b38 in st_queue_handler (arg=<value optimized out>) at state.c:1117
#2 0x00002b05c9c19617 in start_thread () from /lib64/libpthread.so.0
#3 0x00002b05caae9c2d in clone () from /lib64/libc.so.6

Thread 1 (process 3407):
#0 0x00002b05c9c1df70 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00002b05c97cd50c in alarm_handler (arg=<value optimized out>) at alarm.c:223
#2 0x00002b05c9c19617 in start_thread () from /lib64/libpthread.so.0
#3 0x00002b05caae9c2d in clone () from /lib64/libc.so.6
#0 0x00002b05caae2557 in ioctl () from /lib64/libc.so.6

The threads have the following roles:

Thread 6  Main thread
Thread 1  Alarm thread
Thread 2  State queue thread
Thread 3  Handles mount calls
Thread 5  Called to expire indirect mounts

Thread 6 is the main thread. It creates the alarm handler thread (Thread 1), the state queue thread (Thread 2) and Thread 3 to handle mount calls, among others, and finally calls statemachine() and waits for a signal.

Thread 1 is the alarm thread created by Thread 6 to handle any SIGALRM.

Thread 2 is the state queue thread.

Thread 3 is used to handle mount calls. It listens on a file descriptor which receives commands from the kernel. In this case, it has received a request to expire indirect mounts. This results in a new thread, Thread 5, being created to expire the indirect mount.

Thread 5 is called to expire indirect mounts.

Expected results:
Smooth operation. :)

Additional info:
Customer maps are not exactly simple, with perl program maps that consult NIS (not sure if LDAP is also used) and also check for local mounts and fastest servers, and white/black server lists depending on client locality. Still, it seems to hang only during expire, so the map complexity should not be related.

SysRq-T logs do not show automount processes, even though automount *is* still running.

strace on existing hung automount processes yields some ioctl calls (perhaps repeated expire calls), but strace won't provide autofs ioctl details. :-/

Ian Kent has been notified and apparently has a patch that resolves a race in the expire code. Need to talk this through with him.
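For readers unfamiliar with this layout, the thread structure described above (a main thread that spawns workers and then sits in sigwait(), as Thread 6 does in statemachine()) can be sketched roughly as follows. This is a minimal illustration, not automount code: the function names worker() and run_thread_layout(), the choice of SIGTERM and SIGUSR1, and the self-sent signal are all assumptions made for the demo.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

#define NWORKERS 3

/* Hypothetical stand-in for the alarm, state queue and mount handler threads. */
static void *worker(void *arg)
{
    printf("%s thread started\n", (const char *)arg);
    return NULL;
}

/*
 * Hypothetical model of the main thread: block signals, create the
 * workers, then wait synchronously in sigwait().
 * Returns the signal number that woke the main thread.
 */
int run_thread_layout(void)
{
    static const char *names[NWORKERS] = { "alarm", "state queue", "handle mounts" };
    pthread_t tids[NWORKERS];
    sigset_t set;
    int sig = 0;
    int rc, i;

    /* Block the signals before creating threads so that every thread
     * inherits the mask and only sigwait() ever consumes them. */
    sigemptyset(&set);
    sigaddset(&set, SIGTERM);   /* assumed signal set, for illustration */
    sigaddset(&set, SIGUSR1);
    rc = pthread_sigmask(SIG_BLOCK, &set, NULL);
    assert(rc == 0);

    for (i = 0; i < NWORKERS; i++) {
        rc = pthread_create(&tids[i], NULL, worker, (void *)names[i]);
        assert(rc == 0);
    }

    /* Demo only: send ourselves a signal so that sigwait() returns. */
    kill(getpid(), SIGUSR1);

    rc = sigwait(&set, &sig);
    assert(rc == 0);

    for (i = 0; i < NWORKERS; i++)
        pthread_join(tids[i], NULL);

    return sig;
}
```

Calling run_thread_layout() from a main() function starts the three workers and returns once the self-sent signal has been collected. In a layout like this, a main thread parked in sigwait() is normal and does not by itself indicate a hang.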
Created attachment 422457
Patch - fix incorrect pthreads condition handling for expire requests

This patch brings RHEL-5 autofs in line with upstream expire thread handling.
Created attachment 422459
Patch - expire thread use pending mutex

This patch brings the expire thread creation in line with the mount thread creation handling.
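The patch contents are not reproduced in this report, so the following is not the patch itself. It is only a generic sketch of the kind of pthreads startup handshake the two patch descriptions above refer to: the creator holds a mutex across pthread_create() and waits on a condition variable, in a predicate loop, until the new thread has copied its arguments. All names here (struct startup_args, expire_worker(), start_expire_worker(), the timeout field) are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical startup block shared between creator and new thread. */
struct startup_args {
    pthread_mutex_t mutex;
    pthread_cond_t cond;
    int signaled;   /* predicate: set once the new thread has copied its args */
    int timeout;    /* example argument the new thread must copy */
};

static void *expire_worker(void *arg)
{
    struct startup_args *sa = arg;
    int timeout;

    pthread_mutex_lock(&sa->mutex);
    timeout = sa->timeout;          /* copy while the creator's struct is valid */
    sa->signaled = 1;               /* set the predicate before signalling */
    pthread_cond_signal(&sa->cond);
    pthread_mutex_unlock(&sa->mutex);

    /* From here on sa must not be touched: the creator may reuse it. */
    return (void *)(intptr_t)timeout;
}

/* Starts the worker and returns the timeout value the worker copied,
 * or -1 if the thread could not be created. */
int start_expire_worker(int timeout)
{
    struct startup_args sa;
    pthread_t tid;
    void *ret = NULL;
    int rc;

    pthread_mutex_init(&sa.mutex, NULL);
    pthread_cond_init(&sa.cond, NULL);
    sa.signaled = 0;
    sa.timeout = timeout;

    /* Take the mutex before creating the thread so the worker cannot
     * signal before the creator is ready to wait. */
    pthread_mutex_lock(&sa.mutex);
    rc = pthread_create(&tid, NULL, expire_worker, &sa);
    if (rc != 0) {
        pthread_mutex_unlock(&sa.mutex);
        pthread_cond_destroy(&sa.cond);
        pthread_mutex_destroy(&sa.mutex);
        return -1;
    }

    /* Wait on the predicate, not on the bare condition: this tolerates
     * spurious wakeups and a signal sent before the wait begins. */
    while (!sa.signaled)
        pthread_cond_wait(&sa.cond, &sa.mutex);
    pthread_mutex_unlock(&sa.mutex);

    /* Joined here only so the demo can return the copied value. */
    pthread_join(tid, &ret);
    pthread_cond_destroy(&sa.cond);
    pthread_mutex_destroy(&sa.mutex);
    return (int)(intptr_t)ret;
}
```

For example, start_expire_worker(600) returns 600 once the worker has copied its argument. Waiting on a condition without a predicate, or signalling without holding the mutex, can lose the wakeup and leave the creator blocked indefinitely, which is the general class of problem that incorrect condition handling produces.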
The symptoms we are seeing are similar to thread creation hangs that have been seen before, but we don't know yet whether the problem here is due to that. However, we do need to try these patches before we can move on. A number of changes were needed to backport these to RHEL-5, so I must carry out some sanity testing before posting a test build.
A test build with the above patches is available at:
http://people.redhat.com/~ikent/autofs-5.0.1-0.rc2.144.bz601935.1.el5

As I said, I'm not sure yet that this addresses the problem we are seeing, but we need to check, so please give this a try.
Event posted on 14-06-2010 01:35pm BST by dswegen

The customer has tried the test package over the weekend on 700+ systems, with no recurrence of the issue apart from one machine, which was found not to have the updated rpm in place.

dswegen assigned to issue for Nomura EMEA.

This event sent from IssueTracker by dswegen
issue 984963
Build autofs-5.0.1-0.rc2.145.el5 includes the above correction.
Ran the regression test suite, following the How to test notes in the errata, and found no unexpected failures.

How to test:
We can't directly reproduce bug 601935. However, the change is to the code that creates expire threads. If this change had introduced a regression, running the autofs regression test suite would result in unexpected failures. So the verification, as best we can do, is to run the test suite and check that any failures are justified.

Job links are shown below.

Against RHEL5.6:
x86_64: https://beaker.engineering.redhat.com/jobs/85225
i386: https://beaker.engineering.redhat.com/jobs/85226
ppc64: https://beaker.engineering.redhat.com/jobs/85227
s390x: https://beaker.engineering.redhat.com/jobs/85228
ia64: https://beaker.engineering.redhat.com/jobs/85229

Against 5.7, autofs-5.0.1-0.rc2.156.el5:
i386: https://beaker.engineering.redhat.com/jobs/85861
s390x: https://beaker.engineering.redhat.com/jobs/85862
x86_64: https://beaker.engineering.redhat.com/jobs/86464
ppc64: https://beaker.engineering.redhat.com/jobs/86488
ia64: https://beaker.engineering.redhat.com/jobs/86504

The bz130467 and bz248152 tests fail because of bug 706794. In fact the mount succeeded, but it was made using the IP address rather than the hostname, and the test case checks for the hostname to determine the result.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-1079.html