Bug 1965593 - Infinite loop at lookup_prune_one_cache
Summary: Infinite loop at lookup_prune_one_cache
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: autofs
Version: 34
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Ian Kent
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-05-28 09:16 UTC by Ian Collier
Modified: 2021-06-05 01:04 UTC (History)
1 user (show)

Fixed In Version: autofs-5.1.7-13.fc34
Clone Of:
Environment:
Last Closed: 2021-06-05 01:04:49 UTC
Type: Bug
Embargoed:



Description Ian Collier 2021-05-28 09:16:46 UTC
After using a Fedora 34 workstation with automounted filesystems for 1-2 days,
new mounts become unresponsive with autofs running at high CPU.  The syslog
shows this soon after the last successful mount:

May 28 09:37:10 clpc80 automount[202813]: do_init: parse(sun): init gathered global options: nosuid,intr
May 28 09:37:10 clpc80 automount[202813]: read_one_map: map read not needed, so not done
May 28 09:37:10 clpc80 automount[202813]: lookup_prune_one_cache: removed map entry detected, mark negative
May 28 09:37:10 clpc80 automount[202813]: lookup_prune_one_cache: removed map entry detected, mark negative
May 28 09:37:10 clpc80 automount[202813]: lookup_prune_one_cache: prune posponed, /auto/users/imc mounted
May 28 09:37:10 clpc80 automount[202813]: lookup_prune_one_cache: removed map entry detected, mark negative
May 28 09:37:10 clpc80 automount[202813]: lookup_prune_one_cache: prune posponed, /auto/users/imc mounted
May 28 09:37:10 clpc80 automount[202813]: lookup_prune_one_cache: removed map entry detected, mark negative

with the last two entries repeated ad infinitum.

The thread that is looping appears to be at:

#3  log_debug (logopt=<optimized out>, 
    msg=msg@entry=0x560a9e10f058 "%s: removed map entry detected, mark negative") at /usr/src/debug/autofs-5.1.7-12.fc34.x86_64/lib/log.c:262
#4  0x0000560a9e109719 in lookup_prune_one_cache (ap=ap@entry=0x560a9fb82100, 
    mc=0x560a9fb8f550, age=age@entry=233729)
    at /usr/src/debug/autofs-5.1.7-12.fc34.x86_64/daemon/lookup.c:1406
#5  0x0000560a9e109843 in lookup_prune_cache (ap=0x560a9fb82100, age=233729)
    at /usr/src/debug/autofs-5.1.7-12.fc34.x86_64/daemon/lookup.c:1504
#6  0x0000560a9e10a52f in do_readmap (arg=0x7fd5cc000dc0)
    at /usr/src/debug/autofs-5.1.7-12.fc34.x86_64/daemon/state.c:462
#7  0x00007fd5f8e0d299 in start_thread () from /lib64/libpthread.so.0

and the thread is in the loop which starts at lookup.c:1335 "while (me) {"
and is hitting the "continue" at line 1419 without ever changing the value
of "me".

Comment 1 Ian Kent 2021-05-28 10:27:12 UTC
Arrgh, yes, I moved a statement and didn't properly account for it.
I'll sort that out straight away, thanks for reporting it.

Comment 2 Ian Kent 2021-05-28 11:00:44 UTC
Btw, what does your map look like? Does it have a wildcard entry by any chance?

Comment 3 Ian Collier 2021-05-28 12:02:27 UTC
The maps (including auto.master) are managed by IPA and there is no wildcard
entry that I am aware of.  So each entry has an explicit key and info.

$ ipa automountkey-find cs auto-users
--------------------------
656 automount keys matched
--------------------------
...
  Key: imc
  Mount information: nfs-server:/homes/staff/imc
...

Comment 4 Ian Collier 2021-05-28 12:03:49 UTC
BTW on a tangentially related topic, there's a spelling error in "posponed" in the debug log exhibited above. :-)

Comment 5 Ian Kent 2021-05-28 12:22:25 UTC
(In reply to Ian Collier from comment #4)
> BTW on a tangentially related topic, there's a spelling error in "posponed"
> in the debug log exhibited above. :-)

Thanks, I'll fix that while I'm there.

At the moment I'm trying to write a simple reproducer script.
It's easiest to reproduce using a wildcard map entry.

It looks like maps that have explicit <key, location> entries would
need a map entry removal (or a memory allocation failure) for the
loop behavior to trigger.

Comment 6 Ian Collier 2021-05-28 13:44:32 UTC
Hmm, so there might be another bug somewhere.  Although we do remove map
entries from time to time, that hasn't happened in the last couple of days
as far as I am aware.  Could a map entry be considered stale if the system
had been left idle for several hours and then resumed activity?

Let me know if there's something I should look at next time it crashes.

Comment 7 Ian Collier 2021-05-28 15:07:25 UTC
Or would it be considered stale if any one entry in the map changes?  Because
we are certainly doing that at the moment on a few accounts (not necessarily
ones that are currently mounted, though).

Comment 8 Ian Kent 2021-05-29 01:03:54 UTC
(In reply to Ian Collier from comment #7)
> Or would it be considered stale if any one entry in the map changes?  Because
> we are certainly doing that at the moment on a few accounts (not necessarily
> ones that are currently mounted, though).

I'm still thinking about it. Essentially, what happens is that the map
entries have a time stamp that is set when the map is read. The cache
prune that occurs after reading the map should pass over entries whose
time stamp matches the one used for the current map read. Entries
younger than 2 minutes are also passed over (I've considered making
that 2 minutes configurable so people can tune cache behavior, but
haven't done so yet).

Since entries that have been created from a wildcard map entry aren't
actually present in the map, a wildcard map is the easiest way to
reproduce the problem. An attempt to prune them will be made on a map
read once they are older than 2 minutes. If they have an active mount
on them they are left alone.

For now I'm most concerned with fixing that looping behavior.

Comment 9 Ian Kent 2021-05-31 00:06:08 UTC
(In reply to Ian Kent from comment #8)
> For now I'm most concerned with fixing that looping behavior.

I've had a look around and there's a lot more to consider when re-loading
the map.

So, for now, I'm just going to fix the regression introduced by commit
256963d6b (autofs-5.1.7 - refactor lookup_prune_one_cache() a bit) and
look at the map re-read code later.

Comment 10 Fedora Update System 2021-05-31 01:51:29 UTC
FEDORA-2021-99a8333987 has been submitted as an update to Fedora 34. https://bodhi.fedoraproject.org/updates/FEDORA-2021-99a8333987

Comment 11 Fedora Update System 2021-06-01 01:42:57 UTC
FEDORA-2021-99a8333987 has been pushed to the Fedora 34 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2021-99a8333987`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2021-99a8333987

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 12 Fedora Update System 2021-06-05 01:04:49 UTC
FEDORA-2021-99a8333987 has been pushed to the Fedora 34 stable repository.
If problem still persists, please make note of it in this bug report.

