Bug 1496901
| Summary: | Autofs processes hung while waiting for the release of an entry master_lock that is held by another thread waiting on a bind mount | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Thiago Rafael Becker <tbecker> |
| Component: | autofs | Assignee: | Ian Kent <ikent> |
| Status: | CLOSED ERRATA | QA Contact: | xiaoli feng <xifeng> |
| Severity: | urgent | Docs Contact: | Marc Muehlfeld <mmuehlfe> |
| Priority: | urgent | | |
| Version: | 6.8 | CC: | djeffery, ikent, james.hofmeister, jshivers, mjones, mthacker, swhiteho, tbecker, xifeng, xzhou |
| Target Milestone: | rc | Keywords: | ZStream |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | autofs-5.0.5-134.el6 | Doc Type: | Bug Fix |
| Doc Text: | Previously, autofs did not hold a lock long enough during a master map re-read. Additionally, autofs unnecessarily took a lock for map read operations. As a consequence, the original lookup failed to complete, and autofs did not respond. This update corrects the locking of the master map and map read operations. As a result, autofs no longer hangs due to map dependencies. | | |
| Story Points: | --- | | |
| Clone Of: | | : | 1499287 1501922 (view as bug list) |
| Environment: | | | |
| Last Closed: | 2018-06-19 05:22:12 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1499287, 1501922 | | |
| Attachments: | | | |
HPE reports that the patch seems to fix the issue. They have asked for another week to monitor the servers.

Update from HPE:

I installed the package on 3 systems on Sep 27th. Prior to installing the package, one of the systems had been experiencing the automount hangups as much as 3 times per day. Since installing the package there have been no issues. I think we're ready to call this a fix.

Ian, if this is good enough for you, I'll proceed with the request for an accelerated fix.

(In reply to Thiago Rafael Becker from comment #3)
> Update from HPE:
>
> I installed the package on 3 systems on Sep 27th. Prior to installing the
> package one of the systems had been experiencing the automount hangups as
> much as 3 times per day. Since installing the package there have been no
> issues. I think we're ready to call this a fix.
>
> Ian, if this is good enough for you, I'll proceed with the request for an
> accelerated fix.

Indeed it is, but before we can start working on any type of accelerated fix we need pm_ack and rhel-6.10.0 ack on the bug, so that the change will be included in rhel-6.10.0 and cannot become "unfixed" in a subsequent release.

Not sure who to ask for review of this to get the pm_ack (which should be enough to cover the rhel-6.10.0 ack too).

I also need to clone the bug for RHEL-7, since it is a problem there too. *sigh*

I've been trying to reproduce the problem seen here for some time now, without any success. The problem is that the window in which the deadlock can occur is very small. For indirect autofs mounts (and the customer mounts are all indirect) a map read is not actually done. Once autofs discovers that the mount settings don't require a map read, it returns without doing anything, so the master map lock is held only briefly. In all my tests, using both file maps and NIS as the map source, the map reads complete far too quickly and no deadlock is seen.
The network doesn't even come into play, because the code doesn't get that far. I can't think of any way to alter the timing of the map read by any means that might be expected, so I'm stuck. I think we'll need to classify this one as sanity testing only! What do you think?

I couldn't agree more. From my tests, I also can't trigger this issue, even with a network delay. So the only thing QE can do is the sanity check. And thanks so much for your help with the reproducer.

(In reply to xiaoli feng from comment #22)
> I couldn't agree more. From my tests, I also can't trigger this issue, even
> with a network delay. So the only thing QE can do is the sanity check. And
> thanks so much for your help with the reproducer.

In our defence, those customer maps are probably the most complex amd maps we'll ever see. That complexity has built up over a long time, probably to satisfy access-consistency requirements, and it has led to the use of some of the more difficult-to-understand amd map features. The amd map sublink option, for example: I had to go back to the amd source three times during development to work out again how it was supposed to behave, not to mention several other advanced map options. And then there's the complex key matching and map re-use within the same lookup as well.... I can tell you it hurts my head when I have to look at this customer's maps. ;)

Hello,

Once the world has to deal with multiple architectures at scale again (that is, once Intel x86 is no longer the only kid on the block), you'll appreciate the effort put into map differentiation in this way. We were able to provide common application frameworks across RS6K, MIPS, SPARC, Intel, Itanium, and Power, over AIX, Solaris, Linux, IRIX, DG/UX and mainframes. Please take into consideration the complexity of your customers' environments when they operate heterogeneous Unixes at a global scale.

Sincerest regards,
the customer.
:)

(In reply to Michael Jones from comment #24)
> Hello,
> Once the world has to deal with multiple architectures at scale again (that
> is, once Intel x86 is no longer the only kid on the block), you'll
> appreciate the effort put into map differentiation in this way. We were
> able to provide common application frameworks across RS6K, MIPS, SPARC,
> Intel, Itanium, and Power, over AIX, Solaris, Linux, IRIX, DG/UX and
> mainframes. Please take into consideration the complexity of your
> customers' environments when they operate heterogeneous Unixes at a global
> scale.

The fact is, the complexity of the maps was a big factor in making the implementation worthwhile. There were only a few other map features not covered by these maps, and without the maps provided we would still be tracking down mistakes and omissions in the implementation.

Thanks Chevron for the help.
Ian

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1917
Created attachment 1332098 [details]
Patch being tested by the client at the opening of this ticket.

Description of problem:

Automount may deadlock if a dependent path of an entry is in the same map.

The client is seeing automount not mounting properly and automount processes in uninterruptible state for a long time, and has provided a vmcore. In the vmcore, several automount processes were stuck in autofs4_wait waiting for bind mounts. The vmcore showed no mount processes on the system, and no pending signals for the hung tasks.

PID: 48555  TASK: ffff883016f0a040  CPU: 20  COMMAND: "automount"
 #0 [ffff88301a39b9c8] schedule at ffffffff8154a640
 #1 [ffff88301a39baa0] autofs4_wait at ffffffffa0cc17b5 [autofs4]
 #2 [ffff88301a39bb50] autofs4_d_automount at ffffffffa0cc00e9 [autofs4]
 #3 [ffff88301a39bb90] follow_managed at ffffffff811a9906
 #4 [ffff88301a39bbf0] do_lookup at ffffffff811a9a5f
 #5 [ffff88301a39bc50] __link_path_walk at ffffffff811aa6e3
 #6 [ffff88301a39bd30] path_walk at ffffffff811ab29a
 #7 [ffff88301a39bd70] filename_lookup at ffffffff811ab4ab
 #8 [ffff88301a39bdb0] do_filp_open at ffffffff811ac984
 #9 [ffff88301a39bf20] do_sys_open at ffffffff81196aa7
#10 [ffff88301a39bf70] sys_open at ffffffff81196bb0
#11 [ffff88301a39bf80] system_call_fastpath at ffffffff8100b0d2

We requested an application core. In this core, several automount threads were waiting for the master_lock to be released; thread 15 was holding the master_lock and waiting for the master_mapent lock of autofs_point 0x7f2fc8049420 to be released in order to access this map. Thread 16 was holding this master_mapent lock and waiting for a bind mount to a directory in the same master map.
Thread 15 (Thread 0x7f2fda5b6700 (LWP 3857)):
#0  pthread_rwlock_wrlock () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_rwlock_wrlock.S:83
#1  0x00007f2ffd41f51a in master_source_writelock (entry=<value optimized out>) at master.c:573
#2  0x00007f2ffd40e2d3 in do_read_map (ap=0x7f2fc8049a40, map=0x7f2fc800b960, age=1504998001) at lookup.c:318
#3  0x00007f2ffd40e657 in lookup_map_read_map (ap=0x7f2fc8049a40, source=<value optimized out>, age=1504998001) at lookup.c:471
#4  lookup_nss_read_map (ap=0x7f2fc8049a40, source=<value optimized out>, age=1504998001) at lookup.c:576
#5  0x00007f2ffd41031e in do_readmap (arg=0x7f2fdc03bff0) at state.c:479
#6  0x00007f2ffcfbdaa1 in start_thread (arg=0x7f2fda5b6700) at pthread_create.c:301
#7  0x00007f2ffbedabbd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 16 (Thread 0x7f2fda6e3700 (LWP 3858)):
#0  0x00007f2ffbed1383 in __poll (fds=<value optimized out>, nfds=<value optimized out>, timeout=<value optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:87
#1  0x00007f2ffd40a647 in timed_read (logopt=0, wait=4294967295, options=<value optimized out>, prog=<value optimized out>, argv=<value optimized out>) at spawn.c:107
#2  do_spawn (logopt=0, wait=4294967295, options=<value optimized out>, prog=<value optimized out>, argv=<value optimized out>) at spawn.c:272
#3  0x00007f2ffd40b137 in spawn_bind_mount (logopt=0) at spawn.c:536
#4  0x00007f2ffa736876 in mount_mount (ap=0x7f2fc8049a40, root=0x7f2fc806a3e0 "/a/b/c", name=0x7f2fda6e0990 "d", name_len=4, what=0x7f2fda6df610 "/a/e/c/d", fstype=0x7f2ff84ed73a "bind", options=0x7f2fc81254a0 "rw,hard,nosuid,intr,nosuid,retrans=12,timeo=45", context=0x35343d6f656d) at mount_bind.c:171
#5  0x00007f2ffd40c77f in do_mount (ap=0x7f2fc8049a40, root=0x7f2fc806a3e0 "/a/b/c", name=0x7f2fda6e0990 "d", name_len=4, what=0x7f2fda6df610 "/a/e/c/d", fstype=0x7f2ff84ed73a "bind", options=0x7f2fc81254a0 "rw,hard,nosuid,intr,nosuid,retrans=12,timeo=45") at mount.c:78
#6  0x00007f2ff84d0d27 in
do_link_mount (ap=0x7f2fc8049a40, name=0x7f2fda6e0990 "d", entry=0x7f2fc8279240, flags=<value optimized out>) at parse_amd.c:965
#7  0x00007f2ff84d1381 in amd_mount (ap=0x7f2fc8049a40, name=0x7f2fda6e0990 "d", entry=0x7f2fc8279240, source=<value optimized out>, sv=<value optimized out>, flags=30736, ctxt=0x7f2fcc006be0) at parse_amd.c:1402
#8  0x00007f2ff84d3367 in parse_mount (ap=0x7f2fc8049a40, name=0x7f2fda6e0990 "d", name_len=<value optimized out>, mapent=<value optimized out>, context=0x7f2fcc006be0) at parse_amd.c:1988
#9  0x00007f2ffada993d in lookup_mount (ap=0x7f2fc8049a40, name=<value optimized out>, name_len=<value optimized out>, context=0x7f2fc8216c80) at lookup_yp.c:938
#10 0x00007f2ffd40d380 in do_lookup_mount (ap=0x7f2fc8049a40, map=0x7f2fc800b960, name=0x7f2fda6e0e50 "d", name_len=4) at lookup.c:780
#11 0x00007f2ffd40d928 in lookup_nss_mount (ap=0x7f2fc8049a40, source=0x0, name=0x7f2fda6e0e50 "d", name_len=4) at lookup.c:1133
#12 0x00007f2ffd405bd8 in do_mount_indirect (arg=<value optimized out>) at indirect.c:769
#13 0x00007f2ffcfbdaa1 in start_thread (arg=0x7f2fda6e3700) at pthread_create.c:301
#14 0x00007f2ffbedabbd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Our initial hypothesis was a SIGCHLD race: one child process never receiving the signal that a mount had finished, thus not allowing the threads to proceed. We implemented a change in automount to add a timeout to do_bind_mount, but the problem manifested itself again with this patch. The current hypothesis is being tested by the client, and the patch is attached to this bug.

Version-Release number of selected component (if applicable):
autofs-5.0.5-132.el6.x86_64

How reproducible:
Often.

Steps to Reproduce:
TBD.

Actual results:
Applications attempting to open an automounted path get stuck or fail.

Expected results:
Applications continue normally.