601935 – Race condition or other deadlocking issue on expire code path

Summary: Race condition or other deadlocking issue on expire code path

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	autofs
Sub Component:
Version:	5.4
Hardware:	All
OS:	Linux
Priority:	urgent
Severity:	urgent
Target Milestone:	rc
Target Release:	---
Assignee:	Ian Kent
QA Contact:	yanfu,wang
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	615259 1002896

Reported:	2010-06-08 21:31 UTC by Fabio Olive Leite
Modified:	2018-10-27 12:00 UTC
CC List:	5 users
Fixed In Version:	autofs-5.0.1-0.rc2.145.el5
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Clones:	1002896
Environment:
Last Closed:	2011-07-21 08:39:02 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
Patch - fix incorrect pthreads condition handling for expire requests (4.92 KB, patch) 2010-06-09 07:28 UTC, Ian Kent
Patch - expire thread use pending mutex (12.69 KB, patch) 2010-06-09 07:32 UTC, Ian Kent

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2011:1079	0	normal	SHIPPED_LIVE	autofs bug fix and enhancement update	2011-07-21 08:37:25 UTC

Description Fabio Olive Leite 2010-06-08 21:31:45 UTC

Description of problem:

The automount daemon has been found hung at times, with new mounts not happening and at least one thread waiting in an expire ioctl.

Version-Release number of selected component (if applicable):

autofs-5.0.1-0.rc2.131.el5_4.1-x86_64 on kernel-2.6.18-164.10.1.el5

How reproducible:

Still unknown, but the customer has over 100 servers running into this issue without any known or observable common factor.

Steps to Reproduce:

Not known.
  
Actual results:

automount is hung with threads like below:

(gdb) thr a a bt

Thread 6 (process 3406):
#0  0x00002b05c9c21658 in do_sigwait () from /lib64/libpthread.so.0
#1  0x00002b05c9c216fd in sigwait () from /lib64/libpthread.so.0
#2  0x00002b05c97b855d in statemachine (arg=<value optimized out>) at automount.c:1315
#3  0x00002b05c97b970b in main (argc=-758954912, argv=<value optimized out>) at automount.c:2143

Thread 5 (process 17655):
#0  0x00002b05caae2557 in ioctl () from /lib64/libc.so.6
#1  0x00002b05c97d1eae in expire (logopt=0, cmd=<value optimized out>, fd=3, ioctlfd=11, path=0x2b05d2c3a6a0 "/home", arg=0x42366020) at dev-ioctl-lib.c:669
#2  0x00002b05c97d23bb in dev_ioctl_expire (logopt=3, ioctlfd=-1, path=0x2b05d2c3a6a0 "/home", when=<value optimized out>) at dev-ioctl-lib.c:706
#3  0x00002b05c97bccb0 in expire_proc_indirect (arg=<value optimized out>) at indirect.c:499
#4  0x00002b05c9c19617 in start_thread () from /lib64/libpthread.so.0
#5  0x00002b05caae9c2d in clone () from /lib64/libc.so.6

Thread 4 (process 3414):
#0  0x00002b05caae0e46 in poll () from /lib64/libc.so.6
#1  0x00002b05c97bb204 in handle_mounts (arg=0x7fff9c360a20) at automount.c:866
#2  0x00002b05c9c19617 in start_thread () from /lib64/libpthread.so.0
#3  0x00002b05caae9c2d in clone () from /lib64/libc.so.6

Thread 3 (process 3411):
#0  0x00002b05c9c1df70 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00002b05c97bd9f9 in handle_packet_expire_indirect (ap=<value optimized out>, pkt=<value optimized out>) at indirect.c:678
#2  0x00002b05c97bb752 in handle_mounts (arg=0x7fff9c360a20) at automount.c:1039
#3  0x00002b05c9c19617 in start_thread () from /lib64/libpthread.so.0
#4  0x00002b05caae9c2d in clone () from /lib64/libc.so.6

Thread 2 (process 3408):
#0  0x00002b05c9c1df70 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00002b05c97c6b38 in st_queue_handler (arg=<value optimized out>) at state.c:1117
#2  0x00002b05c9c19617 in start_thread () from /lib64/libpthread.so.0
#3  0x00002b05caae9c2d in clone () from /lib64/libc.so.6

Thread 1 (process 3407):
#0  0x00002b05c9c1df70 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00002b05c97cd50c in alarm_handler (arg=<value optimized out>) at alarm.c:223
#2  0x00002b05c9c19617 in start_thread () from /lib64/libpthread.so.0
#3  0x00002b05caae9c2d in clone () from /lib64/libc.so.6
#0  0x00002b05caae2557 in ioctl () from /lib64/libc.so.6

The threads are created in the following hierarchy:
Thread 6 Main thread
       Thread 1 Alarm Thread
       Thread 2 State Queue Thread
       Thread 3 Handle Mount calls.
               Thread 5 Called to expire indirect mounts.

Thread 6 is the main thread. It creates the alarm handler thread (Thread 1), the state queue thread (Thread 2) and the thread that handles mount calls (Thread 3), among others, and finally calls statemachine() and waits for a signal.
Thread 1 is the alarm thread created by Thread 6 to handle any SIGALRM.
Thread 2 is the state queue thread.
Thread 3 is used to handle mount calls. It listens on a file descriptor that receives commands from the kernel. In this case, it has received a request to expire indirect mounts, which results in a new thread, Thread 5, being created to expire the indirect mount.
Thread 5 was created to expire indirect mounts.
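
For readers unfamiliar with this layout, here is a minimal sketch of the general pattern described above: a main thread blocks a signal, starts a worker, and then sits in sigwait() as Thread 6 does in the backtrace. It is not autofs code. The names run_signal_thread_demo and worker and the choice of SIGUSR1 are assumptions made for illustration only.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <signal.h>
#include <unistd.h>

/* Hypothetical worker: stands in for any thread the main thread creates. */
static void *worker(void *arg)
{
    (void)arg;
    /* Process-directed signal. Every thread has SIGUSR1 blocked, so it
     * stays pending until the main thread collects it with sigwait(). */
    kill(getpid(), SIGUSR1);
    return NULL;
}

/* Returns the signal number collected by sigwait(), or -1 on error. */
int run_signal_thread_demo(void)
{
    sigset_t set;
    pthread_t thid;
    int sig = 0;

    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);

    /* Block the signal before creating threads: new threads inherit
     * the creating thread's signal mask. */
    if (pthread_sigmask(SIG_BLOCK, &set, NULL) != 0)
        return -1;

    if (pthread_create(&thid, NULL, worker, NULL) != 0)
        return -1;

    /* The main thread parks here, much like statemachine() in the trace. */
    if (sigwait(&set, &sig) != 0)
        sig = -1;

    pthread_join(thid, NULL);
    return sig;
}
```

Blocking the signal before any thread is created matters in this pattern: if a thread were left with the signal unblocked, the default action could run there instead of the signal being handed to sigwait().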


Expected results:

Smooth operation. :)

Additional info:

Customer maps are not exactly simple, with perl program maps that consult NIS (not sure if LDAP is also used) and also check for local mounts and fastest servers, and white/black server lists depending on client locality. Still, it seems to only hang during expire, so it should not be related.

SysRq-T logs do not show automount processes, even though automount *is* still running. strace on existing hung automount processes yields some ioctl calls (perhaps repeated expire calls) but strace won't provide autofs ioctl details. :-/

Ian Kent has been notified and apparently has a patch that resolves a race in the expire code. Need to talk this through with him.

Comment 4 Ian Kent 2010-06-09 07:28:52 UTC

Created attachment 422457 [details]
Patch - fix incorrect pthreads condition handling for expire requests

This patch brings RHEL-5 autofs in line with upstream expire
thread handling.
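
As background on what correct pthreads condition handling for a thread start-up handshake generally looks like, here is a minimal sketch. It is not the attached patch and not autofs code. The struct startup_sync type and the functions expire_worker and start_expire_worker are hypothetical names chosen for illustration. The sketch assumes the parent must wait until the new thread reports that it has started.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical handshake state; none of these names come from autofs. */
struct startup_sync {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    bool            signaled;  /* predicate, only touched under mutex */
    int             status;    /* value the worker hands back */
};

static void *expire_worker(void *arg)
{
    struct startup_sync *sync = arg;

    /* Set the predicate and signal while holding the mutex, so the
     * wakeup cannot slip in before the parent starts waiting. */
    pthread_mutex_lock(&sync->mutex);
    sync->status = 0;
    sync->signaled = true;
    pthread_cond_signal(&sync->cond);
    pthread_mutex_unlock(&sync->mutex);

    /* The actual expire work would run here, after the handshake. */
    return NULL;
}

/* Returns the status reported by the worker, or -1 on failure. */
int start_expire_worker(void)
{
    struct startup_sync sync;
    pthread_t thid;
    int status;

    pthread_mutex_init(&sync.mutex, NULL);
    pthread_cond_init(&sync.cond, NULL);
    sync.signaled = false;
    sync.status = -1;

    /* Take the mutex before creating the thread: the worker cannot
     * signal until pthread_cond_wait() releases the mutex below. */
    pthread_mutex_lock(&sync.mutex);
    if (pthread_create(&thid, NULL, expire_worker, &sync) != 0) {
        pthread_mutex_unlock(&sync.mutex);
        pthread_cond_destroy(&sync.cond);
        pthread_mutex_destroy(&sync.mutex);
        return -1;
    }

    /* Loop on the predicate: condition waits may wake spuriously. */
    while (!sync.signaled)
        pthread_cond_wait(&sync.cond, &sync.mutex);
    status = sync.status;
    pthread_mutex_unlock(&sync.mutex);

    /* Join so the stack-allocated sync outlives the worker. */
    pthread_join(thid, NULL);
    pthread_cond_destroy(&sync.cond);
    pthread_mutex_destroy(&sync.mutex);
    return status;
}
```

The two points the sketch is meant to show are the predicate re-check around the wait and the mutex being held across both the predicate update and the signal. Leaving out either one is a common way for a waiter to miss its wakeup.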

Comment 5 Ian Kent 2010-06-09 07:32:17 UTC

Created attachment 422459 [details]
Patch - expire thread use pending mutex

This patch brings the expire thread creation in line with
the mount thread creation handling.

Comment 6 Ian Kent 2010-06-09 07:36:20 UTC

The symptoms we are seeing are similar to thread creation hangs
that have been seen before, but we don't yet know if the problem
here is due to that. However, we do need to try these
patches before we can move on.

A number of changes were needed to backport these to RHEL-5
so I must carry out some sanity testing before posting a
test build.

Comment 7 Ian Kent 2010-06-09 12:32:58 UTC

A test build with the above patches is available at:
http://people.redhat.com/~ikent/autofs-5.0.1-0.rc2.144.bz601935.1.el5

As I said I'm not sure yet that this addresses the problem
we are seeing but we need to check so please give this a
try.

Comment 14 Issue Tracker 2010-06-14 12:35:52 UTC

Event posted on 14-06-2010 01:35pm BST by dswegen

The customer has tried the test package over the weekend on 700+ systems,
with no recurrence of the issue apart from one machine which was found to
not have the updated rpm in place.

dswegen assigned to issue for Nomura EMEA.

This event sent from IssueTracker by dswegen 
 issue 984963

Comment 20 Ian Kent 2010-06-21 05:55:41 UTC

Build autofs-5.0.1-0.rc2.145.el5 includes the above correction.

Comment 28 yanfu,wang 2011-06-01 03:17:32 UTC

Ran the regression test suite and found no unexpected failures. Per the errata "How to test" notes: we can't directly reproduce bug 601935. However, the change is to the code that creates expire threads. If this change had introduced a regression, running the autofs regression test suite would result in unexpected failures. So the verification, as best we can do, is to run the test suite and check that any failures are justified.

job links shown as below: 
against RHEL5.6: 
x86_64: https://beaker.engineering.redhat.com/jobs/85225 
i386: https://beaker.engineering.redhat.com/jobs/85226 
ppc64: https://beaker.engineering.redhat.com/jobs/85227 
s390x: https://beaker.engineering.redhat.com/jobs/85228 
ia64: https://beaker.engineering.redhat.com/jobs/85229 

test against 5.7 autofs-5.0.1-0.rc2.156.el5: 
i386: https://beaker.engineering.redhat.com/jobs/85861 
s390x: https://beaker.engineering.redhat.com/jobs/85862 
x86_64: https://beaker.engineering.redhat.com/jobs/86464 
ppc64: https://beaker.engineering.redhat.com/jobs/86488 
ia64: https://beaker.engineering.redhat.com/jobs/86504 
There are failures in bz130467 and bz248152 caused by bug 706794; in fact the mount succeeded, but it was made using the IP address rather than the hostname that the test case checks for.

Comment 29 errata-xmlrpc 2011-07-21 08:39:02 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-1079.html

Comment 30 errata-xmlrpc 2011-07-21 12:34:05 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-1079.html
