Description of problem:
autofs can deadlock if multiple simultaneous access requests
are made for a program map.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Run at least 3 instances of the autofs RHTS connectathon
test suite simultaneously in different windows.
Actual results:
The test will hang in all three windows when the deadlock occurs.
Expected results:
All three test runs should run through to completion.
Created attachment 228481 [details]
Prevent deadlock during multiple simultaneous accesses to a program map
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update
release.
Why does this bug refer to a deadlock in program maps, when the patch addresses
the mount handling code?
This bug report is really devoid of useful information. Ian, can you provide at
least *some* insight into the real problem being fixed? Some stack traces or
debug logs that show the actual problem would be helpful.
I can't reproduce this myself now, on F-8 or RHEL-5 with
The patch as well as the bug title look suspect.
The patch itself is incorrect as it doesn't add the mutex
unlock in case the following "if" test fails.
I'm tempted to revert this but let me do some more digging
OK, I think I understand the point of this patch now but it is
incomplete. And it relates to submounts, not program mounts.
I've had some difficulty understanding pthread condition handling
and this is an example. Maybe I need to revisit how this is done
at some later stage and possibly use an alternate method.
The difficulty comes in handling the pthread condition used to pass
parameters to handle_mounts() at startup. Since pthread_cond_wait()
unlocks the condition mutex while waiting to be signaled, another
thread can lock it before the thread being started locks it to send
the completion signal. This can cause the creator thread to be
signaled at the wrong time. The symptom we see looks like both signals
aren't seen, leaving one thread waiting forever. But this shouldn't
be possible and it should only affect the execution order, so I have
to assume the reason the signal isn't seen is corruption of
the passed parameters caused by this out-of-order execution. I also
believe that this corruption often results in threads mysteriously
"going away" rather than a segv (probably why it's has caused me so
much confusion). Taking the "mounts" mutex prior to starting the
handle_mounts() thread prevents this. The original purpose of the
mounts mutex was to protect the list of submounts in the owning
autofs_point struct, but I can't see any conflict in using it to protect
the startup condition as well.
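To make the race easier to see, here is a minimal sketch of that kind of
startup handshake. This is not the autofs source; the names (startup_cond,
start_worker, worker) are made up, but the shape is the same: a single shared
condition and parameter block used for every thread start.

#include <pthread.h>
#include <stddef.h>

/* One shared condition and parameter block used for every thread start. */
struct startup_cond {
        pthread_mutex_t mutex;
        pthread_cond_t cond;
        void *args;     /* parameters for the thread being started */
        int done;       /* set once the new thread has taken its copy */
};

static struct startup_cond sc = {
        PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, NULL, 0
};

static void *worker(void *unused)
{
        void *my_args;

        (void)unused;
        pthread_mutex_lock(&sc.mutex);
        my_args = sc.args;              /* take our copy of the parameters */
        sc.done = 1;
        pthread_cond_signal(&sc.cond);  /* completion signal to the creator */
        pthread_mutex_unlock(&sc.mutex);

        /* ... do the real work with my_args ... */
        return NULL;
}

int start_worker(void *args)
{
        pthread_t t;

        pthread_mutex_lock(&sc.mutex);
        sc.args = args;
        sc.done = 0;

        if (pthread_create(&t, NULL, worker, NULL)) {
                pthread_mutex_unlock(&sc.mutex);
                return -1;
        }

        /*
         * pthread_cond_wait() drops sc.mutex while it sleeps, so a second
         * creator can get in here, overwrite sc.args and wait on the same
         * condition before the first worker has taken its parameters.
         * That is the out-of-order execution described above.
         */
        while (!sc.done)
                pthread_cond_wait(&sc.cond, &sc.mutex);
        pthread_mutex_unlock(&sc.mutex);

        return 0;
}

With several starts happening at once, two creators can interleave inside
that window, which matches the symptoms described above.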
Anyway, the upshot of all this is that, since we can potentially have
several submounts happening at once, this is where we see the problem.
We don't see it at startup because initial mounts are done serially
but I think we're open to this race during map re-reads, and we
haven't seen it because they aren't done frequently. Consequently,
I think the mounts mutex should be taken in both places where we start
the handle_mounts() thread, prior to using the startup condition.
Have a look around and see if you concur with my analysis.
(In reply to comment #9)
> much confusion). Taking the "mounts" mutex prior to starting the
> handle_mounts() thread prevents this. The original purpose of the
That should be, prior to locking the startup condition mutex.
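For discussion, here is a minimal sketch of that serialization, with the
correction above applied. It builds on the hypothetical start_worker()
handshake sketched in comment #9 and is not the actual autofs code;
mounts_mutex just stands in for the role the "mounts" mutex plays.

#include <pthread.h>

/* The shared-condition handshake from the earlier sketch. */
extern int start_worker(void *args);

/* Stand-in for the "mounts" mutex in the owning autofs_point. */
static pthread_mutex_t mounts_mutex = PTHREAD_MUTEX_INITIALIZER;

int start_worker_serialized(void *args)
{
        int ret;

        /*
         * Held across the whole handshake, and taken before the startup
         * condition mutex, so the shared condition and parameter block
         * are only ever used by one creator at a time.
         */
        pthread_mutex_lock(&mounts_mutex);
        ret = start_worker(args);
        pthread_mutex_unlock(&mounts_mutex);

        return ret;
}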
Created attachment 302658 [details]
Create a separate startup conditional for each handle_mounts invocation
If there can be multiple handle_mounts threads starting in parallel, then each
should have its own startup conditional. I think that makes more sense than
trying to serialize via other, unrelated locks. I also fixed some of the locking.
I smoke tested this patch, but really it's for discussion.
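For comparison, a minimal sketch of the per-invocation approach (illustrative
names again, not the patch itself): each start gets its own condition, mutex
and parameter block on the creator's stack, so concurrent starts can't touch
each other's state and no extra serialization is needed.

#include <pthread.h>

/* One private startup block per thread start, owned by the creator. */
struct startup {
        pthread_mutex_t mutex;
        pthread_cond_t cond;
        void *args;
        int done;
};

static void *worker(void *arg)
{
        struct startup *su = arg;
        void *my_args;

        pthread_mutex_lock(&su->mutex);
        my_args = su->args;             /* take our copy of the parameters */
        su->done = 1;
        pthread_cond_signal(&su->cond);
        pthread_mutex_unlock(&su->mutex);
        /* su must not be touched after this point; the creator may return. */

        /* ... do the real work with my_args ... */
        return NULL;
}

int start_worker_private(void *args)
{
        struct startup su = {
                PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, args, 0
        };
        pthread_t t;

        pthread_mutex_lock(&su.mutex);
        if (pthread_create(&t, NULL, worker, &su)) {
                pthread_mutex_unlock(&su.mutex);
                return -1;
        }

        /*
         * Only this creator waits on this condition, so the completion
         * signal cannot be consumed or raced by another startup.
         */
        while (!su.done)
                pthread_cond_wait(&su.cond, &su.mutex);
        pthread_mutex_unlock(&su.mutex);

        return 0;
}

The trade-off is simply ensuring the started thread never touches the
creator's startup block after it has signalled completion.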
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.