Bug 1496901
| Summary: | Autofs processes hung while waiting for the release of an entry master_lock that is held by another thread waiting on a bind mount | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Thiago Rafael Becker <tbecker> |
| Component: | autofs | Assignee: | Ian Kent <ikent> |
| Status: | CLOSED ERRATA | QA Contact: | xiaoli feng <xifeng> |
| Severity: | urgent | Docs Contact: | Marc Muehlfeld <mmuehlfe> |
| Priority: | urgent | | |
| Version: | 6.8 | CC: | djeffery, ikent, james.hofmeister, jshivers, mjones, mthacker, swhiteho, tbecker, xifeng, xzhou |
| Target Milestone: | rc | Keywords: | ZStream |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | autofs-5.0.5-134.el6 | Doc Type: | Bug Fix |
| Doc Text: | Previously, autofs did not hold a lock long enough during a master map re-read. Additionally, autofs unnecessarily took a lock for map read operations. As a consequence, the original lookup failed to complete, and autofs did not respond. This update corrects the locking of the master map and map read operations. As a result, autofs no longer hangs due to map dependencies. | | |
| Story Points: | --- | | |
| Clone Of: | | : | 1499287 1501922 (view as bug list) |
| Environment: | | | |
| Last Closed: | 2018-06-19 05:22:12 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1499287, 1501922 | | |
| Attachments: | | | |
HPE reports that the patch seems to fix the issue. They have asked for another week to monitor the servers.

Update from HPE:

I installed the package on 3 systems on Sep 27th. Prior to installing the package, one of the systems had been experiencing the automount hangups as much as 3 times per day. Since installing the package there have been no issues. I think we're ready to call this a fix.

Ian, if this is good enough for you, I'll proceed with the request for an accelerated fix.

(In reply to Thiago Rafael Becker from comment #3)
> Update from HPE:
>
> I installed the package on 3 systems on Sep 27th. Prior to installing the
> package one of the systems had been experiencing the automount hangups as
> much as 3 times per day. Since installing the package there have been no
> issues. I think we're ready to call this a fix.
>
> Ian, if this is good enough for you, I'll proceed with the request for an
> accelerated fix.

Indeed it is, but before we can start working on any type of accelerated fix we need pm_ack and rhel-6.10.0 ack on the bug, so that the change will be included in rhel-6.10.0 and cannot become "unfixed" in a subsequent release.

Not sure who to ask for review of this to get the pm_ack (which should be enough to cover the rhel-6.10.0 ack too).

I also need to clone the bug for RHEL-7, since it is a problem there too. *sigh*

I've been trying to reproduce the problem seen here for some time now, without any success. The problem is that the window in which the deadlock can occur is very small. For indirect autofs mounts (and the customer mounts are all indirect) a map read is not actually done. Once autofs discovers that the mount settings don't require a map read, it returns without doing anything, so the master map lock is held only briefly. In all my tests, using both file maps and NIS as the map source, the map reads complete far too quickly and no deadlock is seen.
The network doesn't even come into play, because the code doesn't get that far. I can't think of any way to alter the timing of the map read by any means that might be expected, so I'm stuck. I think we'll need to classify this one as sanity testing only! What do you think?

I couldn't agree more. From my tests, I also can't trigger this issue, even with a network delay. So the only thing QE can do is the sanity check. And thanks so much for your help with the reproducer.

(In reply to xiaoli feng from comment #22)
> I couldn't agree more. From my tests, I also can't trigger this issue, even
> with a network delay. So the only thing QE can do is the sanity check. And
> thanks so much for your help with the reproducer.

In our defence, those customer maps are probably the most complex amd maps we'll ever see. That complexity has built up over a long time, probably to satisfy access-consistency requirements, and it has led to the use of some of the more difficult-to-understand amd map features. The amd map sublink option, for example: I had to go back to the amd source three times during development to work out again how it was supposed to behave, not to mention several other advanced map options. And then there's the complex key matching and map re-use within the same lookup as well.... I can tell you it hurts my head when I have to look at this customer's maps. ;)

Hello,

Once the world has to deal with multiple architectures at scale again (that is, once Intel x86 is no longer the only kid on the block), you'll appreciate the effort put into map differentiation in this way. We were able to provide common application frameworks across RS6K, MIPS, SPARC, Intel, Itanium, and Power, over AIX, Solaris, Linux, IRIX, DG/UX and mainframes. Please take into consideration the complexity of your customers' environments when they operate heterogeneous Unixes at a global scale.

Sincerest regards,
the customer.
:)

(In reply to Michael Jones from comment #24)
> Hello,
> Once the world has to deal with multiple architectures at scale again (that
> is, once Intel x86 is no longer the only kid on the block), you'll
> appreciate the effort put into map differentiation in this way. We were
> able to provide common application frameworks across RS6K, MIPS, SPARC,
> Intel, Itanium, and Power, over AIX, Solaris, Linux, IRIX, DG/UX and
> mainframes. Please take into consideration the complexity of your
> customers' environments when they operate heterogeneous Unixes at a global
> scale.

The fact is, the complexity of the maps was a big factor in making the implementation worthwhile. There were only a few other map features not covered by these maps, and without the maps provided we would still be tracking down mistakes and omissions in the implementation.

Thanks Chevron for the help.
Ian

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1917
Created attachment 1332098 [details]
Patch being tested by the client at the opening of this ticket.

Description of problem:

Automount may deadlock if a dependent path of an entry is in the same map.

The client is seeing automount not mounting properly and automount processes in uninterruptible state for a long time, and has provided a vmcore. In the vmcore, several automount processes were stuck in autofs4_wait waiting for bind mounts. The vmcore showed no mount processes on the system, and no pending signals for the hung tasks.

PID: 48555  TASK: ffff883016f0a040  CPU: 20  COMMAND: "automount"
 #0 [ffff88301a39b9c8] schedule at ffffffff8154a640
 #1 [ffff88301a39baa0] autofs4_wait at ffffffffa0cc17b5 [autofs4]
 #2 [ffff88301a39bb50] autofs4_d_automount at ffffffffa0cc00e9 [autofs4]
 #3 [ffff88301a39bb90] follow_managed at ffffffff811a9906
 #4 [ffff88301a39bbf0] do_lookup at ffffffff811a9a5f
 #5 [ffff88301a39bc50] __link_path_walk at ffffffff811aa6e3
 #6 [ffff88301a39bd30] path_walk at ffffffff811ab29a
 #7 [ffff88301a39bd70] filename_lookup at ffffffff811ab4ab
 #8 [ffff88301a39bdb0] do_filp_open at ffffffff811ac984
 #9 [ffff88301a39bf20] do_sys_open at ffffffff81196aa7
#10 [ffff88301a39bf70] sys_open at ffffffff81196bb0
#11 [ffff88301a39bf80] system_call_fastpath at ffffffff8100b0d2

We requested an application core. In this core, several automount threads were waiting for the master_lock to be released; thread 15 was holding the master_lock and waiting for the master_mapent lock of autofs_point 0x7f2fc8049420 to be released in order to access this map. Thread 16 was holding this master_mapent lock and waiting for a bind mount to a directory in the same master map.
Thread 15 (Thread 0x7f2fda5b6700 (LWP 3857)):
#0  pthread_rwlock_wrlock () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_rwlock_wrlock.S:83
#1  0x00007f2ffd41f51a in master_source_writelock (entry=<value optimized out>) at master.c:573
#2  0x00007f2ffd40e2d3 in do_read_map (ap=0x7f2fc8049a40, map=0x7f2fc800b960, age=1504998001) at lookup.c:318
#3  0x00007f2ffd40e657 in lookup_map_read_map (ap=0x7f2fc8049a40, source=<value optimized out>, age=1504998001) at lookup.c:471
#4  lookup_nss_read_map (ap=0x7f2fc8049a40, source=<value optimized out>, age=1504998001) at lookup.c:576
#5  0x00007f2ffd41031e in do_readmap (arg=0x7f2fdc03bff0) at state.c:479
#6  0x00007f2ffcfbdaa1 in start_thread (arg=0x7f2fda5b6700) at pthread_create.c:301
#7  0x00007f2ffbedabbd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 16 (Thread 0x7f2fda6e3700 (LWP 3858)):
#0  0x00007f2ffbed1383 in __poll (fds=<value optimized out>, nfds=<value optimized out>, timeout=<value optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:87
#1  0x00007f2ffd40a647 in timed_read (logopt=0, wait=4294967295, options=<value optimized out>, prog=<value optimized out>, argv=<value optimized out>) at spawn.c:107
#2  do_spawn (logopt=0, wait=4294967295, options=<value optimized out>, prog=<value optimized out>, argv=<value optimized out>) at spawn.c:272
#3  0x00007f2ffd40b137 in spawn_bind_mount (logopt=0) at spawn.c:536
#4  0x00007f2ffa736876 in mount_mount (ap=0x7f2fc8049a40, root=0x7f2fc806a3e0 "/a/b/c", name=0x7f2fda6e0990 "d", name_len=4, what=0x7f2fda6df610 "/a/e/c/d", fstype=0x7f2ff84ed73a "bind", options=0x7f2fc81254a0 "rw,hard,nosuid,intr,nosuid,retrans=12,timeo=45", context=0x35343d6f656d) at mount_bind.c:171
#5  0x00007f2ffd40c77f in do_mount (ap=0x7f2fc8049a40, root=0x7f2fc806a3e0 "/a/b/c", name=0x7f2fda6e0990 "d", name_len=4, what=0x7f2fda6df610 "/a/e/c/d", fstype=0x7f2ff84ed73a "bind", options=0x7f2fc81254a0 "rw,hard,nosuid,intr,nosuid,retrans=12,timeo=45") at mount.c:78
#6  0x00007f2ff84d0d27 in
do_link_mount (ap=0x7f2fc8049a40, name=0x7f2fda6e0990 "d", entry=0x7f2fc8279240, flags=<value optimized out>) at parse_amd.c:965
#7  0x00007f2ff84d1381 in amd_mount (ap=0x7f2fc8049a40, name=0x7f2fda6e0990 "d", entry=0x7f2fc8279240, source=<value optimized out>, sv=<value optimized out>, flags=30736, ctxt=0x7f2fcc006be0) at parse_amd.c:1402
#8  0x00007f2ff84d3367 in parse_mount (ap=0x7f2fc8049a40, name=0x7f2fda6e0990 "d", name_len=<value optimized out>, mapent=<value optimized out>, context=0x7f2fcc006be0) at parse_amd.c:1988
#9  0x00007f2ffada993d in lookup_mount (ap=0x7f2fc8049a40, name=<value optimized out>, name_len=<value optimized out>, context=0x7f2fc8216c80) at lookup_yp.c:938
#10 0x00007f2ffd40d380 in do_lookup_mount (ap=0x7f2fc8049a40, map=0x7f2fc800b960, name=0x7f2fda6e0e50 "d", name_len=4) at lookup.c:780
#11 0x00007f2ffd40d928 in lookup_nss_mount (ap=0x7f2fc8049a40, source=0x0, name=0x7f2fda6e0e50 "d", name_len=4) at lookup.c:1133
#12 0x00007f2ffd405bd8 in do_mount_indirect (arg=<value optimized out>) at indirect.c:769
#13 0x00007f2ffcfbdaa1 in start_thread (arg=0x7f2fda6e3700) at pthread_create.c:301
#14 0x00007f2ffbedabbd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Our initial hypothesis was a SIGCHLD race: one child process never receiving the signal that a mount had finished, thus not allowing the threads to proceed. We implemented a change in automount to add a timeout to do_bind_mount, but the problem manifested itself again with this patch. The current hypothesis is being tested by the client, and the patch is attached to this bug.

Version-Release number of selected component (if applicable):
autofs-5.0.5-132.el6.x86_64

How reproducible:
Often.

Steps to Reproduce:
TBD.

Actual results:
Applications attempting to open an automounted path get stuck or fail.

Expected results:
Applications continue normally.