Bug 2013218 - autofs lookups for unknown mounts are delayed for 50s
Summary: autofs lookups for unknown mounts are delayed for 50s
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: sssd
Version: 8.4
Hardware: All
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: rc
Assignee: Pavel Březina
QA Contact: shridhar
URL:
Whiteboard: sync-to-jira
Depends On:
Blocks: 2016922 2016923
 
Reported: 2021-10-12 12:09 UTC by u.sibiller@atos.net
Modified: 2022-05-10 16:48 UTC
CC List: 12 users

Fixed In Version: sssd-2.6.1-1.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2016922 2016923 (view as bug list)
Environment:
Last Closed: 2022-05-10 15:26:45 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
our sssd.conf (3.01 KB, text/plain), 2021-10-12 12:23 UTC, u.sibiller@atos.net
logfile sssd_autofs.log (686.91 KB, text/plain), 2021-10-12 12:36 UTC, u.sibiller@atos.net
sssdlogs with debug_level 9 everywhere (names and IPs changed with sed) (66.25 KB, application/gzip), 2021-10-15 14:45 UTC, u.sibiller@atos.net
sssd logs, names and IPs mangled (66.25 KB, application/octet-stream), 2021-10-15 14:47 UTC, u.sibiller@atos.net
log with fix included in sssd-2.4.0.el8-4.2.1 (430.00 KB, application/x-tar), 2021-10-18 11:35 UTC, u.sibiller@atos.net


Links
Github SSSD sssd issue 5832 (open): autofs lookups for unknown mounts are delayed for 50s (last updated 2021-10-19 11:08:42 UTC)
Red Hat Issue Tracker RHELPLAN-99515 (last updated 2021-10-12 12:12:27 UTC)
Red Hat Issue Tracker SSSD-4042 (last updated 2021-10-21 08:17:49 UTC)
Red Hat Product Errata RHBA-2022:2070 (last updated 2022-05-10 15:27:13 UTC)

Description u.sibiller@atos.net 2021-10-12 12:09:38 UTC
Description of problem:

We are using an sssd.conf that refers to two AD domains and a third "catchall" domain. All three have the same autofs configuration to look up autofs mount points in LDAP. This worked unchanged for years, including on RHEL 7.9 and 8.3. Starting with 8.4, calls like "stat /share/unknown_share" take more than 50s to complete, every single time. On 8.3 this returns "stat: cannot stat '/share/unknown_share': No such file or directory" immediately.

Version-Release number of selected component (if applicable):
sssd-common-2.4.0-9.el8_4.2.x86_64

How reproducible:
Run stat on an automounter mount point that does not exist, e.g. /share/unknown_share (/share is handled by autofs).


Actual results:

After more than 50s we see "stat: cannot stat '/share/unknown_share': No such file or directory"

Expected results:

"stat: cannot stat '/share/unknown_share': No such file or directory" should appear immediately

Additional info:

I have found two workarounds for this:
1. disable autofs for the first two (AD) domains by setting "autofs_provider = none", then restart sssd and autofs
2. replace /usr/libexec/sssd/sssd_autofs (only this file, nothing else) with the one from sssd-2.3.0-9 (RHEL 8.3), then restart sssd and autofs
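Workaround 1 amounts to a change like the following in each AD domain section of sssd.conf. The section names below are placeholders; the real ones are in the attached sssd.conf:

```ini
; Hypothetical sssd.conf excerpt; section names are placeholders for the
; two AD domain sections. Disabling the autofs provider here stops these
; domains from participating in autofs map lookups.
[domain/DOMAIN1.OURNET]
autofs_provider = none

[domain/DOMAIN2.OURNET]
autofs_provider = none
```

After editing, both sssd and autofs need a restart for the change to take effect, e.g. `systemctl restart sssd autofs`.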

sssd.conf is attached.

Comment 1 u.sibiller@atos.net 2021-10-12 12:23:16 UTC
Created attachment 1832193 [details]
our sssd.conf

Comment 2 u.sibiller@atos.net 2021-10-12 12:36:27 UTC
Created attachment 1832194 [details]
logfile sssd_autofs.log

Here you can see sssd/autofs doing 51 CRs (Cache requests, I guess).

Comment 3 Pavel Březina 2021-10-15 12:28:41 UTC
Looks like SSSD is offline, which makes autofs retry again and again. Can you please also share your domain logs so we can see why SSSD is offline? Or is it expected that SSSD is offline?

Comment 5 u.sibiller@atos.net 2021-10-15 14:38:21 UTC
Well, I do not get what "offline" means in that context. The sssd service is up and running:
● sssd.service - System Security Services Daemon
   Loaded: loaded (/usr/lib/systemd/system/sssd.service; enabled; vendor preset: enabled)
   Active: active (running) since Fri 2021-10-15 16:26:51 CEST; 7min ago
 Main PID: 118292 (sssd)
    Tasks: 8 (limit: 11388)
   Memory: 61.8M
   CGroup: /system.slice/sssd.service
           ├─118292 /usr/sbin/sssd -i --logger=files
           ├─118293 /usr/libexec/sssd/sssd_be --domain implicit_files --uid 0 --gid 0 --logger=files
           ├─118294 /usr/libexec/sssd/sssd_be --domain DOMAIN1.OURNET --uid 0 --gid 0 --logger=files
           ├─118295 /usr/libexec/sssd/sssd_be --domain DOMAIN2.OURNET --uid 0 --gid 0 --logger=files
           ├─118296 /usr/libexec/sssd/sssd_be --domain unmatched --uid 0 --gid 0 --logger=files
           ├─118297 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --logger=files
           ├─118298 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --logger=files
           └─118299 /usr/libexec/sssd/sssd_autofs --uid 0 --gid 0 --logger=files

Oct 15 16:26:51 vumca615 systemd[1]: Starting System Security Services Daemon...
Oct 15 16:26:51 vumca615 sssd[sssd][118292]: Starting up
Oct 15 16:26:51 vumca615 sssd[be[implicit_files]][118293]: Starting up
Oct 15 16:26:51 vumca615 sssd[be[DOMAIN2.OURNET]][118295]: Starting up
Oct 15 16:26:51 vumca615 sssd[be[DOMAIN1.OURNET]][118294]: Starting up
Oct 15 16:26:51 vumca615 sssd[be[unmatched]][118296]: Starting up
Oct 15 16:26:51 vumca615 sssd[autofs][118299]: Starting up
Oct 15 16:26:51 vumca615 sssd[nss][118297]: Starting up
Oct 15 16:26:51 vumca615 sssd[pam][118298]: Starting up
Oct 15 16:26:51 vumca615 systemd[1]: Started System Security Services Daemon.

I'll attach a new log.

Comment 6 u.sibiller@atos.net 2021-10-15 14:45:16 UTC
Created attachment 1833464 [details]
sssdlogs with debug_level 9 everywhere (names and IPs changed with sed)

Comment 7 u.sibiller@atos.net 2021-10-15 14:47:41 UTC
Created attachment 1833465 [details]
sssd logs, names and IPs mangled

Comment 8 u.sibiller@atos.net 2021-10-15 15:14:28 UTC
This sssd status is the one that matches the attached log tarball; the one above was wrong (copy-and-paste mistake).

systemctl status sssd
● sssd.service - System Security Services Daemon
   Loaded: loaded (/usr/lib/systemd/system/sssd.service; enabled; vendor preset: enabled)
   Active: active (running) since Fri 2021-10-15 16:37:09 CEST; 35min ago
 Main PID: 118621 (sssd)
    Tasks: 8 (limit: 11388)
   Memory: 62.5M
   CGroup: /system.slice/sssd.service
           ├─118621 /usr/sbin/sssd -i --logger=files
           ├─118623 /usr/libexec/sssd/sssd_be --domain implicit_files --uid 0 --gid 0 --logger=files
           ├─118624 /usr/libexec/sssd/sssd_be --domain DOMAIN1.OURNET --uid 0 --gid 0 --logger=files
           ├─118625 /usr/libexec/sssd/sssd_be --domain DOMAIN2.OUTNET --uid 0 --gid 0 --logger=files
           ├─118626 /usr/libexec/sssd/sssd_be --domain unmatched --uid 0 --gid 0 --logger=files
           ├─118627 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --logger=files
           ├─118628 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --logger=files
           └─118629 /usr/libexec/sssd/sssd_autofs --uid 0 --gid 0 --logger=files

Oct 15 16:37:09 vumca615 systemd[1]: Starting System Security Services Daemon...
Oct 15 16:37:09 vumca615 sssd[sssd][118621]: Starting up
Oct 15 16:37:09 vumca615 sssd[be[DOMAIN2.OURNET]][118625]: Starting up
Oct 15 16:37:09 vumca615 sssd[be[implicit_files]][118623]: Starting up
Oct 15 16:37:09 vumca615 sssd[be[DOMAIN1.OURNET]][118624]: Starting up
Oct 15 16:37:09 vumca615 sssd[be[unmatched]][118626]: Starting up
Oct 15 16:37:09 vumca615 sssd[pam][118628]: Starting up
Oct 15 16:37:09 vumca615 sssd[autofs][118629]: Starting up
Oct 15 16:37:09 vumca615 sssd[nss][118627]: Starting up
Oct 15 16:37:09 vumca615 systemd[1]: Started System Security Services Daemon.

Comment 10 Ian Kent 2021-10-16 02:14:02 UTC
(In reply to Pavel Březina from comment #3)
> Looks like SSSD is offline which makes autofs to retry again and again. Can
> you please also share your domain logs to see why SSSD is offline? Or is it
> expected that SSSD is offline?

I had a similar situation with an autofs report a while ago.

From logs it looked like the reporter had an incorrect configuration
so I couldn't get further.

This might be a case where there is no "correct" return.

In this case the map key doesn't exist so an ENOENT return is needed.

But we might not be able to do that.

For example, the cache isn't populated, the backend isn't responding
for some reason, sss returns server down to autofs (need to verify
that) and autofs waits ... so, in theory, both sss and autofs are
doing the right thing but the user experience isn't ok.

The bit that gets confusing is that the server shouldn't be down in
this case so something isn't quite right.

Comment 11 Ian Kent 2021-10-16 02:56:24 UTC
(In reply to u.sibiller from comment #2)
> Created attachment 1832194 [details]
> logfile sssd_autofs.log
> 
> Here you can see sssd/autofs doing 51 CRs (Cache requests, I guess).

The log appears to indicate there are multiple cache lookups
returning key not found, first from cache, then from provider
and then from cache again.

For some reason (and there's no indication sss thinks the provider
is down) ENOENT isn't being returned. 

Perhaps the logic: cache lookup fail <- not found
                   -> lookup from provider <- not found
                   -> lookup cache again <- not found

isn't quite right in that the second cache lookup shouldn't
be happening since the provider lookup has been done and appears
to have succeeded at returning not found.

But maybe the provider "looks" like it's down so the second cache
lookup is done. Or the provider lookup is returning not found but
ENOENT isn't being returned as it should.

Not knowing the sss code I don't know ...

It could be autofs not handling the return correctly, causing those
retries seen in the sss log, probably a server-down return ... since
that's pretty much the only return that will cause autofs to continue
the wait in this case (getautomntbyname_r()).
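The retry behavior described above can be sketched as follows. This is a simplified, hypothetical model, not the actual autofs or sss code: a getautomntbyname_r-style caller that retries only while the answer looks like "server down", but stops immediately on a definitive not-found.

```python
# Simplified model of the retry loop Ian describes; names and the
# max_retries value (the 51 cache requests seen in comment 2) are
# illustrative, not taken from the real autofs/sss sources.

def lookup_with_retry(lookup, key, max_retries=51):
    """Retry only while the backend looks down; ENOENT ends the loop."""
    attempts = 0
    for _ in range(max_retries):
        attempts += 1
        result = lookup(key)
        if result == "found":
            return "found", attempts
        if result == "not_found":
            # Definitive ENOENT: return to the caller at once.
            return "not_found", attempts
        # "server_down": treated as transient, so the caller retries.
    return "gave_up", attempts

# A backend that wrongly reports "server_down" for a missing key keeps
# the caller retrying (the >50s delay); one that reports "not_found"
# returns immediately.
buggy_result, buggy_attempts = lookup_with_retry(lambda k: "server_down", "unknown_share")
fixed_result, fixed_attempts = lookup_with_retry(lambda k: "not_found", "unknown_share")
```

In this model both sides behave "correctly" in isolation, which matches Ian's point: the bad user experience comes from the backend answer looking retryable when it should be a final ENOENT.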

Comment 12 Pavel Březina 2021-10-18 10:47:54 UTC
I might have found the issue in SSSD, specifically this: https://github.com/pbrezina/sssd/commit/9a977270d802ed9506b9acc82011f04a06ead7ec

Please try this scratch build and check whether the issue still persists:
https://pbrezina.fedorapeople.org/scratch/2013218/

Comment 13 u.sibiller@atos.net 2021-10-18 11:34:16 UTC
Yes, this reduces the number of cache requests to 3, see attached log.

Thanks for the quick fix!

Comment 14 u.sibiller@atos.net 2021-10-18 11:35:18 UTC
Created attachment 1834208 [details]
log with fix included in sssd-2.4.0.el8-4.2.1

Comment 15 u.sibiller@atos.net 2021-10-18 12:03:55 UTC
Please note that these packages depend on samba-client-libs 4.13.3-5.el8_4, which RHN does not seem to know about.

Also note the packages point to https://pagure.io/SSSD/sssd/, which is no longer the home of the sssd project.

Comment 16 Ian Kent 2021-10-18 22:25:53 UTC
(In reply to u.sibiller from comment #13)
> Yes, this reduces the number of cache requests to 3, see attached log.

That's interesting, it sounds right.

If the lookup key isn't found, autofs will check for wildcard matches:
for historic reasons it will look for '/', and if that's not found it
will look for '*' (the usual wildcard key).
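That fallback sequence explains why exactly three cache requests remain after the fix. A minimal sketch of the lookup order (illustrative names, not the actual autofs code):

```python
# Simplified sketch of the lookup sequence Ian describes: after the
# real key misses, autofs falls back to "/" (historic) and then "*"
# (the usual wildcard key).

def autofs_lookup_sequence(map_entries, key):
    """Return the list of keys queried until one hits or all miss."""
    queried = []
    for candidate in (key, "/", "*"):
        queried.append(candidate)
        if candidate in map_entries:
            break
    return queried

# With no wildcard entry in the map, an unknown key produces exactly
# three requests: the key itself, "/", and "*".
queries = autofs_lookup_sequence({"known_share": "server:/export"}, "unknown_share")
```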

Comment 17 Pavel Březina 2021-10-19 11:08:11 UTC
Upstream ticket:
https://github.com/SSSD/sssd/issues/5832

Comment 18 Pavel Březina 2021-10-19 11:09:27 UTC
Upstream PR:
https://github.com/SSSD/sssd/pull/5833

Comment 19 Pavel Březina 2021-10-19 11:10:04 UTC
Thank you for testing the scratch build. I have opened a pull request against upstream.

Comment 21 u.sibiller@atos.net 2021-10-19 12:19:01 UTC
Hmm, "internal target release" 8.6.0 is quite far away. So is there a chance to get fixed packages for 8.4? (As written before, the above packages depend on samba-client-libs 4.13.3-5.el8_4, which is not available (yet?).)

Comment 22 Alexey Tikhonov 2021-10-19 12:33:28 UTC
(In reply to u.sibiller from comment #21)
> Hmm, "internal target release" 8.6.0 is quite far away. So is there a change
> to get fixed packages for 8.4 (as written before, above package depend on
> sama.client-libs  4.13.3-5.el8_4 which is not available (yet?)?

The package from comment 12 is merely a "test" build (i.e. "for testing purposes only"); it is not "supported".
The reported bug has to be fixed in the release currently under development first, i.e. 8.6/9.0.
A backport of the fix to previously released minor RHEL versions will be considered as well. So "there is a chance", but no promise.

Comment 27 Alexey Tikhonov 2021-10-25 10:16:21 UTC
(In reply to u.sibiller from comment #21)
> Hmm, "internal target release" 8.6.0 is quite far away. So is there a change
> to get fixed packages for 8.4 (as written before, above package depend on
> sama.client-libs  4.13.3-5.el8_4 which is not available (yet?)?

Fix for 8.4 will be tracked via bz 2016922

Comment 28 Alexey Tikhonov 2021-10-25 12:54:38 UTC
Pushed PR: https://github.com/SSSD/sssd/pull/5833

* `master`
    * bb94a18f0f0cba1e9fb5abf78b995d69e5f3c559 - cache_req: return success for autofs when ENOENT is returned from provider
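The commit message summarizes the fix's effect. A minimal before/after sketch of that behavior, assuming a much-simplified model of SSSD's cache_req (not the actual code):

```python
# Illustrative model only: before the fix, a provider ENOENT for an
# autofs map entry propagated as an error, which looked retryable to
# the client; after the fix, provider ENOENT is treated as a
# successful "not found" answer, so the client gets ENOENT once.

ENOENT = 2
EOK = 0

def cache_req_autofs(provider_errno, fixed):
    """Return (status, result) for an autofs lookup, pre/post fix."""
    if provider_errno == ENOENT and fixed:
        return EOK, None      # success with an empty result set
    if provider_errno == ENOENT:
        return ENOENT, None   # error path: client keeps retrying
    return EOK, "entry"

old = cache_req_autofs(ENOENT, fixed=False)
new = cache_req_autofs(ENOENT, fixed=True)
```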

Comment 37 errata-xmlrpc 2022-05-10 15:26:45 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (sssd bug fix and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:2070

