Description of problem:
autofs can deadlock if multiple simultaneous access requests
are made for a program map.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Run at least 3 instances of the autofs RHTS connectathon
test suite simultaneously in different windows.
Actual results:
The test will hang in all three windows when the deadlock occurs.
Expected results:
All three test runs should run through to completion.
Created attachment 228481 [details]
Prevent deadlock during multiple simultaneous accesses to a program map
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update
release.
Why does this bug refer to a deadlock in program maps, when the patch addresses
the mount handling code?
This bug report is really devoid of useful information. Ian, can you provide at
least *some* insight into the real problem being fixed? Some stack traces or
debug logs that show the actual problem would be helpful.
I can't reproduce this myself now, on F-8 or RHEL-5 with
The patch as well as the bug title look suspect.
The patch itself is incorrect as it doesn't add the mutex
unlock in case the following "if" test fails.
I'm tempted to revert this but let me do some more digging
OK, I think I understand the point of this patch now but it is
incomplete. And it relates to submounts, not program mounts.
I've had some difficulty understanding pthread condition handling
and this is an example. Maybe I need to revisit how this is done
at some later stage and possibly use an alternate method.
The difficulty comes in handling the pthread condition used to pass
parameters to handle_mounts() at startup. Since pthread_cond_wait()
unlocks the condition mutex while waiting to be signaled, another
thread can lock it before the thread being started locks it to send
the completion signal. This can cause the creator thread to be
signaled at the wrong time. The symptom we see looks like both signals
aren't seen, leaving one thread waiting forever. But this shouldn't
be possible and it should only affect the execution order, so I have
to assume the reason the signal isn't seen is corruption of
the passed parameters caused by this out-of-order execution. I also
believe that this corruption often results in threads mysteriously
"going away" rather than a segv (probably why it's has caused me so
much confusion). Taking the "mounts" mutex prior to starting the
handle_mounts() thread prevents this. The original purpose of the
mounts mutex was to protect the list of submounts in the owning
autofs_point struct, but I can't see any conflict in using it to protect
the startup condition as well.
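To make the race easier to see, here is a minimal sketch of that kind of
startup handshake. This is not the autofs source; the names (startup_cond,
start_worker, worker) are made up, but the shape is the same: a single shared
condition and parameter block used for every thread start.

#include <pthread.h>
#include <stddef.h>

/* One shared condition and parameter block used for every thread start. */
struct startup_cond {
        pthread_mutex_t mutex;
        pthread_cond_t cond;
        void *args;     /* parameters for the thread being started */
        int done;       /* set once the new thread has taken its copy */
};

static struct startup_cond sc = {
        PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, NULL, 0
};

static void *worker(void *unused)
{
        void *my_args;

        (void)unused;
        pthread_mutex_lock(&sc.mutex);
        my_args = sc.args;              /* take our copy of the parameters */
        sc.done = 1;
        pthread_cond_signal(&sc.cond);  /* completion signal to the creator */
        pthread_mutex_unlock(&sc.mutex);

        /* ... do the real work with my_args ... */
        return NULL;
}

int start_worker(void *args)
{
        pthread_t t;

        pthread_mutex_lock(&sc.mutex);
        sc.args = args;
        sc.done = 0;

        if (pthread_create(&t, NULL, worker, NULL)) {
                pthread_mutex_unlock(&sc.mutex);
                return -1;
        }

        /*
         * pthread_cond_wait() drops sc.mutex while it sleeps, so a second
         * creator can get in here, overwrite sc.args and wait on the same
         * condition before the first worker has taken its parameters.
         * That is the out-of-order execution described above.
         */
        while (!sc.done)
                pthread_cond_wait(&sc.cond, &sc.mutex);
        pthread_mutex_unlock(&sc.mutex);

        return 0;
}

With several starts happening at once, two creators can interleave inside
that window, which matches the symptoms described above.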
Anyway, the upshot of all this is that, since we can potentially have
several submounts happening at once, this is where we see the problem.
We don't see it at startup because initial mounts are done serially
but I think we're open to this race during map re-reads, and we
haven't seen it because they aren't done frequently. Consequently,
I think the mounts mutex should be taken in both places where we start
the handle_mounts() thread, prior to using the startup condition.
Have a look around and see if you concur with my analysis.
(In reply to comment #9)
> much confusion). Taking the "mounts" mutex prior to starting the
> handle_mounts() thread prevents this. The original purpose of the
That should be, prior to locking the startup condition mutex.
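For discussion, here is a minimal sketch of that serialization, with the
correction above applied. It builds on the hypothetical start_worker()
handshake sketched in comment #9 and is not the actual autofs code;
mounts_mutex just stands in for the role the "mounts" mutex plays.

#include <pthread.h>

/* The shared-condition handshake from the earlier sketch. */
extern int start_worker(void *args);

/* Stand-in for the "mounts" mutex in the owning autofs_point. */
static pthread_mutex_t mounts_mutex = PTHREAD_MUTEX_INITIALIZER;

int start_worker_serialized(void *args)
{
        int ret;

        /*
         * Held across the whole handshake, and taken before the startup
         * condition mutex, so the shared condition and parameter block
         * are only ever used by one creator at a time.
         */
        pthread_mutex_lock(&mounts_mutex);
        ret = start_worker(args);
        pthread_mutex_unlock(&mounts_mutex);

        return ret;
}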
Created attachment 302658 [details]
Create a separate startup conditional for each handle_mounts invocation
If there can be multiple handle_mounts threads starting in parallel, then each
should have its own startup conditional. I think that makes more sense than
trying to serialize via other, unrelated locks. I also fixed some of the locking.
I smoke tested this patch, but really it's for discussion.
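For comparison, a minimal sketch of the per-invocation approach (illustrative
names again, not the patch itself): each start gets its own condition, mutex
and parameter block on the creator's stack, so concurrent starts can't touch
each other's state and no extra serialization is needed.

#include <pthread.h>

/* One private startup block per thread start, owned by the creator. */
struct startup {
        pthread_mutex_t mutex;
        pthread_cond_t cond;
        void *args;
        int done;
};

static void *worker(void *arg)
{
        struct startup *su = arg;
        void *my_args;

        pthread_mutex_lock(&su->mutex);
        my_args = su->args;             /* take our copy of the parameters */
        su->done = 1;
        pthread_cond_signal(&su->cond);
        pthread_mutex_unlock(&su->mutex);
        /* su must not be touched after this point; the creator may return. */

        /* ... do the real work with my_args ... */
        return NULL;
}

int start_worker_private(void *args)
{
        struct startup su = {
                PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, args, 0
        };
        pthread_t t;

        pthread_mutex_lock(&su.mutex);
        if (pthread_create(&t, NULL, worker, &su)) {
                pthread_mutex_unlock(&su.mutex);
                return -1;
        }

        /*
         * Only this creator waits on this condition, so the completion
         * signal cannot be consumed or raced by another startup.
         */
        while (!su.done)
                pthread_cond_wait(&su.cond, &su.mutex);
        pthread_mutex_unlock(&su.mutex);

        return 0;
}

The trade-off is simply ensuring the started thread never touches the
creator's startup block after it has signalled completion.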
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.