Bug 1083744 - autofs can ghost non-existent map entries given the right timing
Summary: autofs can ghost non-existent map entries given the right timing
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: autofs
Version: 6.4
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: high
Target Milestone: rc
Assignee: Ian Kent
QA Contact: JianHong Yin
URL:
Whiteboard:
Depends On:
Blocks: 1089889 1094661
 
Reported: 2014-04-02 21:01 UTC by cperl
Modified: 2018-12-05 17:59 UTC
CC List: 7 users

Fixed In Version: autofs-5.0.5-94.el6
Doc Type: Bug Fix
Doc Text:
Previously, when an autofs mount used the "browse" option, autofs could create directories that did not correspond to any entry in its map. This was because autofs did not properly check for entries corresponding to previously failed mount attempts, the so-called "negative entries". The check for negative map entries has been corrected, and autofs now works as intended.
Clone Of:
Clones: 1094661
Environment:
Last Closed: 2014-10-14 08:18:17 UTC
Target Upstream Version:
Embargoed:


Attachments
  Only skip if the mapent is not a negative cache entry (480 bytes, patch) [2014-04-02 21:04 UTC, cperl]
  Patch - fix negative status being reset on map read (1.23 KB, patch) [2014-04-03 03:09 UTC, Ian Kent]
  Patch - fix use cache entry after free mistake (1.29 KB, patch) [2014-04-08 06:15 UTC, Ian Kent]
  Patch - check for non existent negative entries in lookup_ghost() (1.62 KB, patch) [2014-04-08 06:19 UTC, Ian Kent]
  Slightly modified reproducer script (145 bytes, text/plain) [2014-04-08 08:05 UTC, Ian Kent]
  Patch - check for non existent negative entries in lookup_ghost() (2.08 KB, patch) [2014-04-24 06:23 UTC, Ian Kent]
  Patch - check for non existent negative entries in lookup_ghost() (1.29 KB, patch) [2014-05-08 06:26 UTC, Ian Kent]


Links
  Red Hat Product Errata RHBA-2014:1587 (normal, SHIPPED_LIVE): autofs bug fix update, last updated 2014-10-14 01:39:54 UTC

Description cperl 2014-04-02 21:01:10 UTC
Description of problem:

Autofs can sometimes wind up ghosting (creating browse directories for) entries that don't exist in its maps.

Our users tend to use our systems in such a way that Mercurial (hg) will wind
up stat(2)'ing for ".hg" directories on autofs file systems.  We do not have any
".hg" entries in our autofs maps, yet sometimes a directory for ".hg" will be
ghosted.

I've debugged the problem and believe it's a cache interaction issue.  In
short, a negative cache entry is added to the cache for ".hg" by one thread
(because ".hg" is not in the map), while at the same time another thread is
processing a HUP signal (or perhaps some other part of the code has submitted
st_add_task(ap, ST_READMAP)).  This other thread will eventually call
do_readmap, and eventually lookup_prune_cache.  If we're still in the same
*second* (because the "age" and "status" fields of struct mapent are time_t and
therefore have second granularity), then the initial check in
lookup_prune_one_cache will be true for the negative cache entry for ".hg":

if (me->age >= age) {
  /*
   * Reset time of last fail for valid map entries to
   * force entry update and subsequent mount retry.
   */
  me->status = 0;
  me = cache_enumerate(mc, me);
  continue;
}

And the entry will be skipped.  Next, do_readmap will move on and call
lookup_ghost, which will iterate over the cache entries and ghost each of them,
including the negative one.
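
A minimal self-contained sketch of that timing (hypothetical structures and
names modelled on the description above, not the actual autofs source):

#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in for the cache entry fields discussed above. */
struct entry {
	const char *key;
	char *mapent;	/* NULL marks a negative (not-in-map) entry */
	time_t age;	/* second granularity, like the real time_t fields */
};

int main(void)
{
	time_t now = time(NULL);
	struct entry dot_hg = { ".hg", NULL, now };	/* negative entry just added */
	time_t prune_age = now;				/* prune pass runs in the same second */

	/* The initial check in lookup_prune_one_cache(): same second, so skipped. */
	if (dot_hg.age >= prune_age)
		printf("prune: skipping %s\n", dot_hg.key);

	/* lookup_ghost() then walks the cache; without a check for
	 * me->mapent == NULL it creates a directory for ".hg" too. */
	if (dot_hg.mapent == NULL)
		printf("ghost: %s is negative and should be skipped\n", dot_hg.key);

	return 0;
}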


Version-Release number of selected component (if applicable):

autofs-5.0.5-75.el6_4:

Name        : autofs                       Relocations: (not relocatable)
Version     : 5.0.5                             Vendor: CentOS
Release     : 75.el6_4                      Build Date: Mon 23 Sep 2013 05:40:10 AM EDT
Install Date: (not installed)               Build Host: c6b10.bsys.dev.centos.org
Group       : System Environment/Daemons    Source RPM: autofs-5.0.5-75.el6_4.src.rpm
Size        : 3167907                          License: GPLv2+
Signature   : RSA/SHA1, Mon 23 Sep 2013 08:44:22 AM EDT, Key ID 0946fca2c105b9de
Packager    : CentOS BuildSystem <http://bugs.centos.org>
URL         : http://wiki.autofs.net/
Summary     : A tool for automatically mounting and unmounting filesystems
Description :
autofs is a daemon which automatically mounts filesystems when you use
them, and unmounts them later when you are not using them.  This can
include network filesystems, CD-ROMs, floppies, and so forth.

How reproducible:

Very.

Steps to Reproduce:
1. Create a simple test setup 

$ cat /tmp/auto.master 
/foo file:/tmp/auto.foo --ghost

$ cat /tmp/auto.foo 
a -fstype=nfs,intr,hard,vers=3 server-host:/export/a

(note that in this case the host name and share do not exist; they don't need to)

2. Start the automount daemon

# /usr/sbin/automount --foreground --debug --pid-file /var/run/autofs.pid -v /tmp/auto.master 

... debug output here ...

3. Try stat'ing /foo/.hg while sending automount a HUP signal to force
   do_readmap(s):

$ while true
do
  date
  stat /foo/.hg &
  sleep $((${RANDOM}%2))
  sudo kill -HUP $(pgrep -f auto.master)
  sleep 5
done

Wed Apr  2 16:56:18 EDT 2014
[1] 7065
stat: cannot stat `/foo/.hg': No such file or directory
[1]+  Exit 1                  stat /foo/.hg

Wed Apr  2 16:56:23 EDT 2014
[1] 7098
  File: `/foo/.hg'
    Size: 0               Blocks: 0          IO Block: 1024   directory
    Device: 15h/21d Inode: 9559314     Links: 2
    Access: (0555/dr-xr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
    Access: 2014-04-02 16:56:18.480756064 -0400
    Modify: 2014-04-02 16:56:18.480756064 -0400
    Change: 2014-04-02 16:56:18.480756064 -0400
[1]+  Done                    stat /foo/.hg
Wed Apr  2 16:56:29 EDT 2014
[1] 7108
stat: cannot stat `/foo/.hg': No such file or directory
[1]+  Exit 1                  stat /foo/.hg
^C

As you can see, at one point the stat of /foo/.hg returns results.  This is
because it was temporarily ghosted (mkdir_path was called for it), but it was
then cleaned up by the next HUP signal.  If you continue running this, you'll
encounter the same thing.  Obviously, it depends on the timing.

Actual results:

See the steps to reproduce.

Expected results:

The directory should never exist because it's not in the map.

Additional info:

I've debugged this fairly extensively using SystemTap and am happy to share the
output which shows the interaction I describe above.

I've attached a simple patch that seems to fix the problem for me, although I
don't have a good enough mental model of how the automount code works to know
if there are other implications of this patch.

Comment 1 cperl 2014-04-02 21:04:08 UTC
Created attachment 882008 [details]
Only skip if the mapent is not a negative cache entry

Comment 3 Ian Kent 2014-04-03 03:03:27 UTC
(In reply to cperl from comment #1)
> Created attachment 882008 [details]
> Only skip if the mapent is not a negative cache entry

Thanks for your effort analyzing this problem.

After looking at this I initially thought this could be due to
another bug with a similar cause but I think you have identified
yet another corner case.

With the correction for the other bug, which doesn't reset the
status for entries that remain negative, your patch won't behave
as you expect any more (since me->status is no longer 0 for
negative entries).

But your description raises a couple of interesting questions
for me.

Why am I creating directories at all when pruning stale entries?

The cache prune is meant to clean stale entries from the cache
and the ghost function is supposed to create the directories for
the browse functionality.

Why am I not checking for negative cache entries in the ghost
function?

We don't want directories created for map entries that don't
exist (i.e. me->mapent == NULL), so surely that has to be a
mistake on my part.

There are a bunch of cases so let me think about it for a
while and have a closer look.

Ian

Comment 4 Ian Kent 2014-04-03 03:09:44 UTC
Created attachment 882079 [details]
Patch - fix negative status being reset on map read

If you do happen to feel compelled to spend more time on
this, then you should do it with this patch included so
you're up to date with changes in this area of the code.

Comment 5 cperl 2014-04-03 12:13:27 UTC
(In reply to Ian Kent from comment #4)
> Created attachment 882079 [details]
> Patch - fix negative status being reset on map read
> 
> If you do happen to feel compelled to spend more time on
> this, then you should do it with this patch included so
> you're up to date with changes in this area of the code.

Actually, that's kind of amusing.  My original attempt to fix this was by
adding code to lookup_ghost to skip those mapents whose status was non-zero,
but because of the status reset in the code you reference, that failed.  The
patch you pointed out now changes that.

Although I'm not sure I see why that would affect what I added in my patch
(not that I think my patch is the right approach, just trying to better
understand the code).  With the additional patch you reference, status would
only be reset for those struct mapents with a valid "mapent" field.  But what I
added would make it so you'd only ever enter the conditional if status was 0,
meaning resetting status in any case is unnecessary (also meaning that I left
unnecessary code in my patch).

I think your questions are good ones and I wondered similar things myself;
however, I don't possess enough knowledge about the code to make larger changes,
so I was shooting for the smallest targeted fix that would address my issue. :)

Assuming I was going for a targeted fix that I could apply to my environment
asap, perhaps the conditional could be something like:

if (me->age >= age && me->mapent) {
	me->status = 0;
	me = cache_enumerate(mc, me);
	continue;
}

So the second part of that would fail for the negative cache entry, causing it
to hit all the logic lower in the function, and ultimately to be pruned.
However, it would allow valid struct mapents that perhaps just failed to mount
to come in and have their status reset so another mount could be attempted.

Comment 6 cperl 2014-04-03 21:10:49 UTC
I've given this a little more thought and I think you can basically ignore my
last comment.

I believe I misunderstood the meaning of the fields in struct mapent and
therefore was led away from fixing things in lookup_ghost and instead moved to
lookup_prune_one_cache.

I thought the "status" field encoded whether or not a struct mapent was a
positive or negative cache entry, but based on the patch you referenced and
your comments, I now believe that is determined based on whether or not the
"mapent" field is a non NULL pointer.

In that case, it seems like the easiest fix would be to simply look at
me->mapent in lookup_ghost and skip those that are NULL.  It doesn't make sense
to prune the negative cache entry as it's still a valid negative entry.

Anyway, I defer to you at this point as you know much more than I do about this
code. :)

Comment 7 Ian Kent 2014-04-04 02:04:30 UTC
(In reply to cperl from comment #6)
> I've given this a little more thought and I think you can basically ignore my
> last comment.
> 
> I believe I misunderstood the meaning of the fields in struct mapent and
> therefore was led away from fixing things in lookup_ghost and instead moved
> to lookup_prune_one_cache.
> 
> I thought the "status" field encoded whether or not a struct mapent was a
> positive or negative cache entry, but based on the patch you referenced and
> your comments, I now believe that is determined based on whether or not the
> "mapent" field is a non NULL pointer.

That's right, it almost does, but ->mapent is what's used to
work out if a lookup can return a fail early.  Umm ... sorry,
I don't see the early check in RHEL-6; the check is done
later.  Odd, I thought we added that.

There's a couple of cases to worry about.

During a map read, negative map entries that are still valid
in the map will fill in ->mapent again, and so need to have
->status cleared so a subsequent mount request will be attempted.

Then there are those that aren't in the map at all; they need to
be kept for the duration of the timeout in ->status and then
removed.

The ->status field is just the negative timeout and, once that
triggers, the map entry is deleted.

Because people complained about the currency of indirect map
entries, a lookup is done against the map source on every mount
request that comes in, so the map entry (and so ->mapent) will
get refreshed (actually removed and added again in the lookup)
if the timeout has expired.
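
Putting those semantics together, a simplified sketch of the fields
under discussion (assumed layout for illustration only; the real
struct mapent has many more fields):

#include <time.h>

/* Simplified sketch, not the actual autofs definition. */
struct mapent_sketch {
	char   *mapent;  /* map entry text; NULL means the entry is negative
	                  * (the key isn't in the map or the lookup failed) */
	time_t  age;     /* stamp from the last map read, second granularity */
	time_t  status;  /* negative timeout; non-zero while a failed lookup
	                  * or mount is being remembered, cleared when the
	                  * entry becomes valid again */
};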

> 
> In that case, it seems like the easiest fix would be to simply look at
> me->mapent in lookup_ghost and skip those that are NULL.  It doesn't make
> sense to prune the negative cache entry as it's still a valid negative entry.

Yes, maybe, that's the obvious first thought.

Valid entries that have become negative should already have
a directory created.  Entries (valid or negative) that have
been removed from the map should have been removed from the
cache.  Invalid entries (always negative) probably shouldn't
ever have directories created and should only ever exist in
the cache until they time out and are removed, unless an entry
corresponding to one is added to the map, in which case it
belongs to the first case: a valid entry that has become
negative but probably with no directory yet created.

This is for the case where the master map entry is using
the "browse" option or the ghost option.

I'm pretty sure I've missed a case or two.
This sort of thing usually makes my head spin after a while.
It's actually pretty difficult to follow at times.

Ian

Comment 8 Ian Kent 2014-04-04 02:17:39 UTC
(In reply to cperl from comment #6)
> 
> Anyway, I defer to you at this point as you know much more than I do about
> this code. :)

Yeah, I've got a couple of pressing things to take care of.
I'll get back to this, hopefully, on Monday.

Ian

Comment 9 Ian Kent 2014-04-08 06:15:41 UTC
Created attachment 883889 [details]
Patch - fix use cache entry after free mistake

Comment 10 Ian Kent 2014-04-08 06:19:55 UTC
Created attachment 883890 [details]
Patch - check for non existent negative entries in lookup_ghost()

I tested this using autofs-5.1.0-beta1; without the patch I was
able to reproduce the problem within 5 iterations of the
reproducer each time.

With this patch the problem didn't appear after 10 iterations.

The previous patch is a bug fix (so it's sensible to include
it) that allows this patch to be applied without fuzz.

Comment 11 Ian Kent 2014-04-08 07:52:18 UTC
I have made a test build with the above patches and placed
it at:
http://people.redhat.com/~ikent/autofs-5.0.5-92.bz1081285.2.el6/

It is against the current rhel-6.6 repository so you might
need to grab the source rpm and build it yourself.

Please test and report results here.

Ian

Comment 12 Ian Kent 2014-04-08 08:05:47 UTC
Created attachment 883929 [details]
Slightly modified reproducer script

This script assumes the following:

/test	/etc/auto.test

is present in the master map, and that /etc/auto.test exists and
contains an entry (I don't think the entry even needs to work; the
example below doesn't) like:

a		test:/autofs

Ian

Comment 13 cperl 2014-04-08 13:32:32 UTC
With all 3 patches applied I can no longer reproduce.  I also confirmed via SystemTap tracing that during testing I was hitting the same conditions that previously caused the invalid negative entry to be ghosted.  Now, lookup_ghost() skips the invalid negative entry as intended.

Will these changes eventually land as updates to autofs in 6.4 or 6.5, or will we need to maintain our own build for those releases until we move to 6.6?

Comment 14 cperl 2014-04-08 13:43:21 UTC
Just to be clear, I applied your patches to autofs-5.0.5-75.el6_4 to test.  If you still want me to test the version you posted, let me know.

Comment 15 Ian Kent 2014-04-09 10:56:48 UTC
(In reply to cperl from comment #13)
> With all 3 patches applied I can no longer reproduce.  I also confirmed via
> SystemTap tracing that during testing I was hitting the same conditions that
> previously caused the invalid negative entry to be ghosted.  Now,
> lookup_ghost() skips the invalid negative entry as intended.
> 
> Will these changes eventually land as updates to autofs in 6.4 or 6.5 or
> will we need to maintain our own built version for those releases until we
> move to 6.6?

Not without quite a bit more work.

This problem has been logged in public Bugzilla, so it's lowest
priority from an engineering POV and GSS is not in the loop.

To get this backported, the bug needs to be proposed and
approved for backport.  That means logging the problem via the
usual support channels and making a strong business case as to
why the change should be backported.

Then I think you'll only see it if you have an appropriate
subscription.

I'm not that in touch with that side of things; I just do my
best to fix problems and leave these issues to support.

Ian

Comment 16 cperl 2014-04-09 11:59:33 UTC
No problem, I can start things going through the official support channels.

Thanks for the help!

Comment 20 cperl 2014-04-23 13:33:50 UTC
I've just encountered a crash that I believe may be due to the patch "Patch - check for non existent negative entries in lookup_ghost()".

I'm not entirely clear on the cause at the moment, but I do know the following:

Leading up to the crash, syslog recorded the following:

2014-04-23T08:20:47.212929-04:00 tot-qws-u12114c automount[5033]: rmdir_path: lstat of /j/_ackrc failed
2014-04-23T08:20:47.212993-04:00 tot-qws-u12114c automount[5033]: rmdir_path: lstat of /j/OFFICE failed
2014-04-23T08:20:47.213762-04:00 tot-qws-u12114c kernel: automount[30629]: segfault at 98 ip 00007f408d92d6d0 sp 00007f408d836dc0 error 4 in automount[7f408d919000+3c000]

Note that /j/_ackrc and /j/OFFICE are *not* valid entries in our map configuration, so they would only be there by virtue of being negative cache entries.

automount also produced a core file.  Running gdb against that core file shows the following threads:

(gdb) info threads
  14 Thread 0x7f408d8da700 (LWP 5033)  do_sigwait (set=<value optimized out>, sig=0x7fff4429a60c) at ../sysdeps/unix/sysv/linux/sigwait.c:65
  13 Thread 0x7f408aca0700 (LWP 30628)  0x00007f408c3f8293 in __poll (fds=<value optimized out>, nfds=<value optimized out>, timeout=<value optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:87
  12 Thread 0x7f4088ada700 (LWP 30630)  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  11 Thread 0x7f40896f8700 (LWP 5048)  0x00007f408c3f8293 in __poll (fds=<value optimized out>, nfds=<value optimized out>, timeout=<value optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:87
  10 Thread 0x7f408a43e700 (LWP 5044)  0x00007f408c3f8293 in __poll (fds=<value optimized out>, nfds=<value optimized out>, timeout=<value optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:87
  9 Thread 0x7f408a1fa700 (LWP 5046)  0x00007f408c3f8293 in __poll (fds=<value optimized out>, nfds=<value optimized out>, timeout=<value optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:87
  8 Thread 0x7f408a560700 (LWP 5043)  0x00007f408c3f8293 in __poll (fds=<value optimized out>, nfds=<value optimized out>, timeout=<value optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:87
  7 Thread 0x7f408a31c700 (LWP 5045)  0x00007f408c3f8293 in __poll (fds=<value optimized out>, nfds=<value optimized out>, timeout=<value optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:87
  6 Thread 0x7f408aa85700 (LWP 30631)  _int_free (av=0x7f4080000020, p=0x7f40800008d0, have_lock=0) at malloc.c:4893
  5 Thread 0x7f408a0f9700 (LWP 5047)  0x00007f408c3f8293 in __poll (fds=<value optimized out>, nfds=<value optimized out>, timeout=<value optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:87
  4 Thread 0x7f408d903700 (LWP 5035)  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  3 Thread 0x7f408bef6700 (LWP 5038)  0x00007f408c3f8293 in __poll (fds=<value optimized out>, nfds=<value optimized out>, timeout=<value optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:87
  2 Thread 0x7f408d914700 (LWP 5034)  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:216
* 1 Thread 0x7f408d837700 (LWP 30629)  lookup_prune_one_cache (ap=0x7f408f654830, mc=0x7f408f654a00, age=1398255647) at lookup.c:1107


According to syslog, the thread that caused the SIGSEGV was 30629, so here is its backtrace:

(gdb) bt
#0  lookup_prune_one_cache (ap=0x7f408f654830, mc=0x7f408f654a00, age=1398255647) at lookup.c:1107
#1  0x00007f408d92d823 in lookup_prune_cache (ap=0x7f408f654830, age=1398255647) at lookup.c:1157
#2  0x00007f408d930757 in do_readmap (arg=0x7f4058000b80) at state.c:485
#3  0x00007f408d4e1851 in start_thread (arg=0x7f408d837700) at pthread_create.c:301
#4  0x00007f408c40194d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

And the current registers:

(gdb) i r
rax            0x7f4050002b90   139914196822928
rbx            0x0      0
rcx            0x0      0
rdx            0x6      6
rsi            0x7f4050000ef0   139914196815600
rdi            0x7f407c00b570   139914935055728
rbp            0x7f407c00b570   0x7f407c00b570
rsp            0x7f408d836dc0   0x7f408d836dc0
r8             0x7f407c000088   139914935009416
r9             0x7f407c00b570   139914935055728
r10            0x0      0
r11            0x246    582
r12            0x7f408f654a00   139915260414464
r13            0x5357b01f       1398255647
r14            0x7f408f654830   139915260414000
r15            0x0      0
rip            0x7f408d92d6d0   0x7f408d92d6d0 <lookup_prune_one_cache+560>
eflags         0x10246  [ PF ZF IF RF ]
cs             0x33     51
ss             0x2b     43
ds             0x0      0
es             0x0      0
fs             0x0      0
gs             0x0      0

And finally, when switching threads, it points to "me->status" as being the culprit for thread 1:

(gdb) thread 2
[Switching to thread 2 (Thread 0x7f408d914700 (LWP 5034))]#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:216
216     62:     movq    %rax, %r14

(gdb) thread 1
[Switching to thread 1 (Thread 0x7f408d837700 (LWP 30629))]#0  lookup_prune_one_cache (ap=0x7f408f654830, mc=0x7f408f654a00, age=1398255647) at lookup.c:1107
1107                    else if (me->status) {

Which seems like something that was added in the patch I referenced.

Comment 21 cperl 2014-04-23 13:49:23 UTC
I wonder if this is just due to reaching the end of the cache (cache_enumerate() returns NULL) and having the 'if (valid)' test fail, meaning the 'else if' is evaluated, causing a NULL pointer dereference.

You can definitely see that the pointer is NULL.  The relevant instruction that faulted was:

=> 0x00007f408d92d6d0 <+560>:   cmp    QWORD PTR [rbx+0x98],0x0

And you can look at the previous register output I posted to see that rbx is 0.
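
For reference, the fault address lines up with a field offset into a
NULL pointer.  A small self-contained illustration (the 0x98 offset and
the field name are assumptions taken from the crash data above, not the
real struct layout):

#include <stddef.h>
#include <stdio.h>
#include <time.h>

/* Illustration only: a struct padded so a field lands at offset 0x98,
 * matching "segfault at 98" and "cmp QWORD PTR [rbx+0x98],0x0" above.
 * Not the actual autofs struct mapent layout. */
struct fake_mapent {
	char pad[0x98];		/* stand-in for the fields before "status" */
	time_t status;
};

int main(void)
{
	struct fake_mapent *me = NULL;	/* what cache_enumerate() can return at the end of the cache */

	printf("offsetof(status) = %#zx\n", offsetof(struct fake_mapent, status));
	/* Evaluating "else if (me->status)" with me == NULL would read from
	 * address 0x98, i.e. the observed fault (not executed here). */
	(void)me;
	return 0;
}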

Comment 22 Ian Kent 2014-04-24 06:21:55 UTC
(In reply to cperl from comment #20)
> 
> (gdb) thread 1
> [Switching to thread 1 (Thread 0x7f408d837700 (LWP 30629))]#0 
> lookup_prune_one_cache (ap=0x7f408f654830, mc=0x7f408f654a00,
> age=1398255647) at lookup.c:1107
> 1107                    else if (me->status) {

Right.
This appears to be the patch I posted here but it isn't quite the
patch I committed and used in autofs-5.0.5-92.bz1081285.2.el6.

Sorry, but I see I haven't updated the patch, I'll do that.

Hopefully this is what I think it is.

The updated patch takes the cache write lock and does the lookup
for the cache entry again after acquiring it, in case the entry
went away while waiting for the lock.
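
A generic sketch of that pattern (illustrative names and a plain
pthread rwlock; not the autofs cache API):

#include <pthread.h>
#include <stdio.h>
#include <string.h>

static pthread_rwlock_t cache_lock = PTHREAD_RWLOCK_INITIALIZER;
static const char *cached_key = ".hg";	/* pretend cache holding one entry */

/* Illustrative lookup; returns NULL if the entry is gone. */
static const char *lookup_entry(const char *key)
{
	return (cached_key && strcmp(cached_key, key) == 0) ? cached_key : NULL;
}

static void prune_one(const char *key)
{
	pthread_rwlock_wrlock(&cache_lock);
	/* The entry may have been removed while we waited for the write
	 * lock, so look it up again before touching it. */
	const char *me = lookup_entry(key);
	if (!me) {
		pthread_rwlock_unlock(&cache_lock);
		return;
	}
	printf("pruning %s\n", me);
	pthread_rwlock_unlock(&cache_lock);
}

int main(void)
{
	prune_one(".hg");
	return 0;
}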

Could you try using the package referred to in comment #11?
I usually include source rpms when I place test packages on
people.redhat.com so you should be able to use that if you
need to rebuild the package for any reason. 

Ian

Comment 23 Ian Kent 2014-04-24 06:23:45 UTC
Created attachment 889191 [details]
Patch - check for non existent negative entries in lookup_ghost()

Comment 24 Ian Kent 2014-04-24 06:34:09 UTC
(In reply to Ian Kent from comment #22)
> (In reply to cperl from comment #20)
> > 
> > (gdb) thread 1
> > [Switching to thread 1 (Thread 0x7f408d837700 (LWP 30629))]#0 
> > lookup_prune_one_cache (ap=0x7f408f654830, mc=0x7f408f654a00,
> > age=1398255647) at lookup.c:1107
> > 1107                    else if (me->status) {
> 
> Right.
> This appears to be the patch I posted here but it isn't quite the
> patch I committed and used in autofs-5.0.5-92.bz1081285.2.el6.

Oh dear, that's not quite correct either.

It seems that the outdated patch is used in the package of comment #11
but it isn't the one I actually committed to the RHEL repository.

The updated patch I've just posted was what was committed.

Let me put the correct package on people.redhat.com for you
to test.

Again, sorry for the inconvenience.

Ian

Comment 25 Ian Kent 2014-04-24 06:48:29 UTC
The RHEL package with what I believe are the correct patches
is located at:
http://people.redhat.com/~ikent/autofs-5.0.5-94.el6/

Please try testing again with this package.

Comment 26 cperl 2014-04-24 11:29:56 UTC
Thanks, I've grabbed the latest src rpm you referenced from people.redhat.com (autofs-5.0.5-94.el6) and am testing with that.

Comment 27 cperl 2014-05-06 12:10:09 UTC
I've been testing since my last comment and things have been going well.  I'm planning on rolling out autofs-5.0.5-94.el6 internally over the next few weeks.

Comment 28 Ian Kent 2014-05-07 00:36:11 UTC
(In reply to cperl from comment #27)
> I've been testing since my last comment and things have been going well. 
> I'm planning on rolling out autofs-5.0.5-94.el6 internally over the next few
> weeks.

Sounds good, I've been waiting for this feedback.

I'd recommend waiting for the zstream update.

That would have been well underway by now except for the mistake
with the patches.

So I can go ahead with the zstream change now, once I get to
the bottom of a related regression test failure.

Ian

Comment 29 Ian Kent 2014-05-08 06:26:30 UTC
Created attachment 893513 [details]
Patch - check for non existent negative entries in lookup_ghost()

A regression has been identified due to another recent
change.  It's due to a catch-22-like dependence on the map
entry status, and this patch uses that status itself.

It turns out that it isn't necessary to use the map entry
status to resolve this problem, at least as far as the
reproducer here goes, which should be sufficient.

This updated patch removes the dependence on the map entry
status leaving us free to work on resolving the conflict
between the other two bugs.

The other problem we face is that the rhel-6.6 development
has seen a large amount of change and QA on these changes
hasn't started.

In this case I think the best way to deal with this is to
go ahead with the zstream update and use that package and
make customer testing and acceptance a requirement of the
zstream QA process.

If that's agreeable to all, I'll go ahead with this.

Ian

Comment 32 JianHong Yin 2014-05-08 08:20:02 UTC
(In reply to Ian Kent from comment #25)
> The RHEL package with what I believe are the correct patches
> is located at:
> http://people.redhat.com/~ikent/autofs-5.0.5-94.el6/
> 
> Please try testing again with this package.

Test passed, according to the test job in Comment 31.

Comment 33 Ian Kent 2014-05-08 09:49:31 UTC
(In reply to Yin.JianHong from comment #32)
> (In reply to Ian Kent from comment #25)
> > The RHEL package with what I believe are the correct patches
> > is located at:
> > http://people.redhat.com/~ikent/autofs-5.0.5-94.el6/
> > 
> > Please try testing again with this package.
> 
> Test passed; according test job in Comment 31.

Not quite what I was after.

We can't use revision 94 since the patch in it is now different
from the one in comment #29 and there's been a large amount of
change to autofs since revision 94.  Testing isn't complete
for those changes, so we can't use the current repository just
yet.

Having removed the dependency on the map entry status, we're
left with the patch in comment #4 (from another rhel-6.6
bug), which introduces a regression, so it needs more work.

But using the reproducer we have here with rhel-5.5 autofs
and the revised patch of comment #29 I don't see a failure.

What I'm recommending is that, since we can't use the current
rhel-6.6 package quite yet, I will add the patch of comment #29
"only" to our rhel-6.5 branch and carry out the testing as part
of the rhel-6.5 zstream bug 1089889 and return to fixing the
regression introduced by bug 1059549 later.

Importantly, we need confirmation that the updated rhel-5.5
package fixes the problem during day to day use.

Inconvenient, I know, but it couldn't be avoided, sorry.

Ian

Comment 34 cperl 2014-05-08 12:21:33 UTC
Just want to clarify, when you say rhel-5.5, are you actually talking about Red Hat Enterprise Linux 5.5?  We're not running that in our environment.

Please tell me if this summary of the current status is correct:

The patch from comment #4 isn't actually correct irrespective of anything else.  It introduces a regression.  But the patch you posted in comment #10 was dependent on the patch from comment #4 (for me->status being 0), so if you modify the patch from comment #4, you might break the patch from comment #10.  However, it turns out that we don't need to look at me->status at all for this bug, just me->mapent; therefore you'd like to replace the patch from comment #10 with the patch from comment #29.

Assuming that is the case, do you have another src rpm that you want me to use, or would you prefer I take the latest patches attached to this bug, apply them on top of something else, and test that?  If the latter, what version should I apply them on top of?

Comment 35 Ian Kent 2014-05-09 00:33:50 UTC
(In reply to cperl from comment #34)
> Just want to clarify, when you say rhel-5.5, are you actually talking about
> Red Hat Enterprise Linux 5.5?  We're not running that in our environment.

Umm ... a typo on my part.
That should be rhel-6.5, sorry.

> 
> Please tell me if this summary of the current status is correct:
> 
> The patch from comment #4 isn't actually correct irrespective of anything
> else.  It introduces a regression.  But, the patch you posted from comment
> #10 was dependent on the patch from comment #4 (for me->status being 0) and
> so if you modify the patch from comment #4, you might break the patch from
> comment #10.  However, it turns out that we don't need to look at me->status
> at all for this bug, just me->mapent, therefore you'd like to replace the
> patch from comment #10 with the patch from comment #29.

Exactly so, yes.
Using the reproducer worked fine without the patch from
comment #4, and using only the patch from comment #29.

The main problem is that resolving a regression involving
three patches is much harder than between two, especially when
there's a catch-22 situation like this.

> 
> Assuming that is the case, do you have another src rpm that you want me to
> use or would prefer I take the latest patches attached to this bug and apply
> them on top of something else and test that?  If that latter, what version
> should I apply them on top of?

My problem is that we have a zstream bug for this that needs
to be resolved so the change can be pushed out to rhel-6.5.

Given the above, that's being held up due to the regression
I've found, which isn't acceptable.

But now I've added the ability for autofs to parse amd format
maps to the rhel-6.6 development branch, and that is going
to take quite a while to test, so we can't use that source.

Bottom line: the patch in comment #29 passed the reproducer,
and we really need to know it resolves the problem in day-to-day
use before pushing out the zstream change.

So, if it is possible for you to add only the patch from
comment #29 to a rhel-6.5 source rpm and test it, that's exactly
what we need.

Sorry for the inconvenience, but I seem to have got myself
into a difficult situation wrt regressions. ;)

Ian

Comment 36 cperl 2014-05-09 00:55:59 UTC
Ok, I'll test autofs-5.0.5-89.el6_5.1 (the latest version I currently see for 6.5) with the patch from comment #29 applied.  What about the patch from comment #9, an obvious use-after-free mistake?  Should I apply that patch when testing as well?

Can you give more details about the regression that the patch from comment #4 introduces?  I'm happy to test thoroughly with what I mentioned above, but if the regression isn't too serious, I may continue on with rolling out autofs-5.0.5-94.el6 internally as it's already been through 2 weeks of testing on many machines and has been working pretty well for us.

Comment 37 Ian Kent 2014-05-09 12:45:24 UTC
(In reply to cperl from comment #36)
> Ok, I'll test autofs-5.0.5-89.el6_5.1 (the latest version I currently see
> for 6.5) with the patch from comment #29 applied.  What about the patch from
> comment #9, an obvious use after free mistake?  Should I apply that patch
> when testing as well?

Yep, you're right, I'll be adding that patch too; although it doesn't
affect the case we're trying to resolve, it's just sensible to include
while we're here.

> 
> Can you give more details about the regression that the patch from comment
> #4 introduces?  I'm happy to test thoroughly with what I mentioned above,
> but if the regression isn't too serious, I may continue on with rolling out
> autofs-5.0.5-94.el6 internally as it's already been through 2 weeks of
> testing on many machines and has been working pretty well for us.

The problem is pretty straightforward.

If an attempt to mount fails due to an error in a map entry,
and the problem is then fixed and a HUP signal sent, autofs
would refuse to mount it because the negative status wasn't
being reset on the HUP signal.

That was fixed by always resetting the status on a HUP signal.

Then we got a bug where there were excessive "key not found"
messages in the log, because the status was unconditionally
reset.  Easy enough: only reset the status for non-negative
entries ....

But the regression test for the first fix also tests using
a wildcard map entry, which the second change breaks, since
it is a valid map entry that doesn't have the mapent field
reset on receiving a HUP signal, so the status isn't reset.

It isn't a big deal but it's incorrect behaviour.

I just need to work out how to reliably identify entries
that have been created using a wildcard lookup so I can
reset the status of those too.

Ian

Comment 38 cperl 2014-05-12 21:22:40 UTC
I've tested autofs-5.0.5-89.el6_5.1 with the patches from comment #9 and comment #29 applied and I am not able to reproduce the problem.

I inspected things with SystemTap as well and verified that we are hitting the case where the ".hg" entry has a new enough timestamp and is therefore not pruned by lookup_prune_cache().  It is being passed to lookup_ghost(), but lookup_ghost() is correctly skipping the entry because of the check for !me->mapent.

Comment 39 Ian Kent 2014-05-13 10:31:30 UTC
(In reply to cperl from comment #38)
> I've tested autofs-5.0.5-89.el6_5.1 with the patches from comment #9 and
> comment #29 applied and I am not able to reproduce the problem.
> 
> I inspected things with SystemTap as well and verified that we are hitting
> the case where the ".hg" has a new enough timestamp and is therefore not
> pruned by lookup_prune_cache().  It is being passed to lookup_ghost() but
> lookup_ghost() is correctly skipping the entry because of the check for
> !me->mapent.

Great, thanks very much.
I'll get on with the change.

Comment 45 JianHong Yin 2014-08-20 04:14:50 UTC
Verified:
  https://beaker.engineering.redhat.com/jobs/725980

Comment 46 errata-xmlrpc 2014-10-14 08:18:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1587.html

