Bug 2013218
| Field | Value | Field | Value |
|---|---|---|---|
| Summary: | autofs lookups for unknown mounts are delayed for 50s | | |
| Product: | Red Hat Enterprise Linux 8 | Reporter: | u.sibiller <u.sibiller> |
| Component: | sssd | Assignee: | Pavel Březina <pbrezina> |
| Status: | CLOSED ERRATA | QA Contact: | shridhar <sgadekar> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 8.4 | CC: | aboscatt, atikhono, dlavu, grajaiya, ikent, jhrozek, knweiss, lslebodn, mzidek, pbrezina, sgoveas, tscherf |
| Target Milestone: | rc | Keywords: | Regression, Triaged, ZStream |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Unspecified | | |
| Whiteboard: | sync-to-jira | | |
| Fixed In Version: | sssd-2.6.1-1.el8 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| | 2016922 2016923 (view as bug list) | Environment: | |
| Last Closed: | 2022-05-10 15:26:45 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2016922, 2016923 | | |
| Attachments: | | | |
Description (u.sibiller@atos.net, 2021-10-12 12:09:38 UTC)
Created attachment 1832193 [details]
our sssd.conf
Created attachment 1832194 [details]
logfile sssd_autofs.log
Here you can see sssd/autofs doing 51 CRs (Cache requests, I guess).
Looks like SSSD is offline, which makes autofs retry again and again. Can you please also share your domain logs to see why SSSD is offline? Or is it expected that SSSD is offline?

Well, I do not get what "offline" means in that context. The sssd service is up and running:

    ● sssd.service - System Security Services Daemon
       Loaded: loaded (/usr/lib/systemd/system/sssd.service; enabled; vendor preset: enabled)
       Active: active (running) since Fri 2021-10-15 16:26:51 CEST; 7min ago
     Main PID: 118292 (sssd)
        Tasks: 8 (limit: 11388)
       Memory: 61.8M
       CGroup: /system.slice/sssd.service
               ├─118292 /usr/sbin/sssd -i --logger=files
               ├─118293 /usr/libexec/sssd/sssd_be --domain implicit_files --uid 0 --gid 0 --logger=files
               ├─118294 /usr/libexec/sssd/sssd_be --domain DOMAIN1.OURNET --uid 0 --gid 0 --logger=files
               ├─118295 /usr/libexec/sssd/sssd_be --domain DOMAIN2.OURNET --uid 0 --gid 0 --logger=files
               ├─118296 /usr/libexec/sssd/sssd_be --domain unmatched --uid 0 --gid 0 --logger=files
               ├─118297 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --logger=files
               ├─118298 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --logger=files
               └─118299 /usr/libexec/sssd/sssd_autofs --uid 0 --gid 0 --logger=files

    Oct 15 16:26:51 vumca615 systemd[1]: Starting System Security Services Daemon...
    Oct 15 16:26:51 vumca615 sssd[sssd][118292]: Starting up
    Oct 15 16:26:51 vumca615 sssd[be[implicit_files]][118293]: Starting up
    Oct 15 16:26:51 vumca615 sssd[be[DOMAIN2.OURNET]][118295]: Starting up
    Oct 15 16:26:51 vumca615 sssd[be[DOMAIN1.OURNET]][118294]: Starting up
    Oct 15 16:26:51 vumca615 sssd[be[unmatched]][118296]: Starting up
    Oct 15 16:26:51 vumca615 sssd[autofs][118299]: Starting up
    Oct 15 16:26:51 vumca615 sssd[nss][118297]: Starting up
    Oct 15 16:26:51 vumca615 sssd[pam][118298]: Starting up
    Oct 15 16:26:51 vumca615 systemd[1]: Started System Security Services Daemon.

I'll attach a new log.

Created attachment 1833464 [details]
sssdlogs with debug_level 9 everywhere (names and IPs changed with sed)
Created attachment 1833465 [details]
sssd logs, names and IPs mangled
This sssd status is the one that matches the attached log tarball; the one above was the wrong one (copy and paste mistake).

    systemctl status sssd
    ● sssd.service - System Security Services Daemon
       Loaded: loaded (/usr/lib/systemd/system/sssd.service; enabled; vendor preset: enabled)
       Active: active (running) since Fri 2021-10-15 16:37:09 CEST; 35min ago
     Main PID: 118621 (sssd)
        Tasks: 8 (limit: 11388)
       Memory: 62.5M
       CGroup: /system.slice/sssd.service
               ├─118621 /usr/sbin/sssd -i --logger=files
               ├─118623 /usr/libexec/sssd/sssd_be --domain implicit_files --uid 0 --gid 0 --logger=files
               ├─118624 /usr/libexec/sssd/sssd_be --domain DOMAIN1.OURNET --uid 0 --gid 0 --logger=files
               ├─118625 /usr/libexec/sssd/sssd_be --domain DOMAIN2.OURNET --uid 0 --gid 0 --logger=files
               ├─118626 /usr/libexec/sssd/sssd_be --domain unmatched --uid 0 --gid 0 --logger=files
               ├─118627 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --logger=files
               ├─118628 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --logger=files
               └─118629 /usr/libexec/sssd/sssd_autofs --uid 0 --gid 0 --logger=files

    Oct 15 16:37:09 vumca615 systemd[1]: Starting System Security Services Daemon...
    Oct 15 16:37:09 vumca615 sssd[sssd][118621]: Starting up
    Oct 15 16:37:09 vumca615 sssd[be[DOMAIN2.OURNET]][118625]: Starting up
    Oct 15 16:37:09 vumca615 sssd[be[implicit_files]][118623]: Starting up
    Oct 15 16:37:09 vumca615 sssd[be[DOMAIN1.OURNET]][118624]: Starting up
    Oct 15 16:37:09 vumca615 sssd[be[unmatched]][118626]: Starting up
    Oct 15 16:37:09 vumca615 sssd[pam][118628]: Starting up
    Oct 15 16:37:09 vumca615 sssd[autofs][118629]: Starting up
    Oct 15 16:37:09 vumca615 sssd[nss][118627]: Starting up
    Oct 15 16:37:09 vumca615 systemd[1]: Started System Security Services Daemon.

(In reply to Pavel Březina from comment #3)
> Looks like SSSD is offline, which makes autofs retry again and again. Can
> you please also share your domain logs to see why SSSD is offline? Or is it
> expected that SSSD is offline?

I had a similar situation with an autofs report a while ago. From the logs it looked like the reporter had an incorrect configuration, so I couldn't get further.

This might be a case where there is no "correct" return. In this case the map key doesn't exist, so an ENOENT return is needed. But we might not be able to do that. For example: the cache isn't populated, the backend isn't responding for some reason, sss returns server down to autofs (need to verify that) and autofs waits ... so, in theory, both sss and autofs are doing the right thing, but the user experience isn't ok.

The bit that gets confusing is that the server shouldn't be down in this case, so something isn't quite right.

(In reply to u.sibiller from comment #2)
> Created attachment 1832194 [details]
> logfile sssd_autofs.log
>
> Here you can see sssd/autofs doing 51 CRs (Cache requests, I guess).

The log appears to indicate there are multiple cache lookups returning key not found: first from the cache, then from the provider, and then from the cache again. For some reason (and there's no indication sss thinks the provider is down) ENOENT isn't being returned.

Perhaps the logic

    cache lookup <- not found -> lookup from provider <- not found -> lookup cache again <- not found

isn't quite right, in that the second cache lookup shouldn't be happening since the provider lookup has been done and appears to have succeeded at returning not found. But maybe the provider "looks" like it's down, so the second cache lookup is done. Or the provider lookup is returning not found but ENOENT isn't being returned as it should.
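To make the lookup sequence described above easier to follow, here is a minimal C sketch of that flow. This is not the real sssd cache_req code: lookup_cache(), lookup_provider() and struct autofs_req are invented names used purely for illustration, and the stub return values simply model the "key does not exist" case from this report.

```c
/*
 * Hypothetical sketch of the lookup sequence discussed above:
 * cache miss -> ask the provider -> re-check the cache.
 * All names and return values are invented for illustration only.
 */
#include <errno.h>
#include <stdio.h>

struct autofs_req {
    const char *map;
    const char *key;
};

/* Pretend cache lookup: 0 on hit, ENOENT on miss. */
static int lookup_cache(const struct autofs_req *req) { (void)req; return ENOENT; }

/* Pretend provider lookup: 0 if the key was found and cached,
 * ENOENT if the provider confirmed the key does not exist,
 * EAGAIN if the provider could not be reached ("offline"). */
static int lookup_provider(const struct autofs_req *req) { (void)req; return ENOENT; }

static int resolve_autofs_key(const struct autofs_req *req)
{
    int ret = lookup_cache(req);
    if (ret == 0) {
        return 0;                 /* answered from cache */
    }

    ret = lookup_provider(req);
    if (ret == EAGAIN) {
        /* Only here is a "server down" style answer justified; the
         * client (autofs) may then legitimately wait and retry. */
        return EAGAIN;
    }

    if (ret == ENOENT) {
        /* The provider answered and positively said "not found", so the
         * request should finish with ENOENT here; going back to the cache
         * and treating the miss as an error is what would produce the
         * repeated cache requests seen in the attached log. */
        return ENOENT;
    }

    return lookup_cache(req);     /* provider refreshed the cache */
}

int main(void)
{
    struct autofs_req req = { .map = "auto.home", .key = "does-not-exist" };
    int ret = resolve_autofs_key(&req);
    printf("lookup result: %d (%s)\n", ret, ret == ENOENT ? "ENOENT" : "other");
    return 0;
}
```

The design point is simply that a confirmed "not found" from the provider is a definitive answer, distinct from "provider unreachable", and should terminate the request rather than trigger further lookups.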
Not knowing the sss code I don't know ... It could be autofs not handling the return correctly, causing those retries seen in the sss log; probably a server down return ... since that's pretty much the only return that will cause autofs to continue the wait in this case (getautomntbyname_r()).

I might have found the issue in SSSD, specifically this:
https://github.com/pbrezina/sssd/commit/9a977270d802ed9506b9acc82011f04a06ead7ec

Please try this scratch build and check whether the issue still persists:
https://pbrezina.fedorapeople.org/scratch/2013218/

Yes, this reduces the number of cache requests to 3, see attached log. Thanks for the quick fix!

Created attachment 1834208 [details]
log with fix included in sssd-2.4.0.el8-4.2.1
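For context on the autofs side of the delay discussed above, here is a hedged sketch of how a caller of a getautomntbyname_r()-style interface might react to the different results. The function name sss_getautomntbyname(), the retry constants and the loop shape are assumptions made for illustration; they do not reproduce the real autofs lookup_sss module.

```c
/*
 * Illustration only: a definitive answer (found or ENOENT) ends the lookup
 * immediately, while a "service unavailable" style return is the only case
 * that keeps the caller waiting and retrying. All names and constants here
 * are invented; the real autofs code differs in detail.
 */
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

#define MAX_RETRIES     10   /* invented value */
#define RETRY_DELAY_SEC 5    /* invented value */

/* Stand-in for the sssd autofs client call: 0 = found, ENOENT = no such key,
 * EAGAIN = service unavailable / try again later. */
static int sss_getautomntbyname(const char *map, const char *key,
                                char *value, size_t valsz)
{
    (void)map; (void)key; (void)value; (void)valsz;
    return ENOENT;
}

static int lookup_mount(const char *map, const char *key,
                        char *value, size_t valsz)
{
    for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
        int ret = sss_getautomntbyname(map, key, value, valsz);

        if (ret == 0 || ret == ENOENT) {
            /* Found or definitively not found: stop here, no delay. */
            return ret;
        }

        /* Anything that looks like "server down" is retried, and each
         * retry adds to the delay the user sees for an unknown mount. */
        sleep(RETRY_DELAY_SEC);
    }
    return EAGAIN;
}

int main(void)
{
    char value[4096];
    int ret = lookup_mount("auto.home", "no-such-key", value, sizeof(value));
    printf("mount lookup finished with %d\n", ret);
    return 0;
}
```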
Please note that these packages depend on samba-client-libs 4.13.3-5.el8_4, which RHN does not seem to know. Also note the packages point to https://pagure.io/SSSD/sssd/ which is no longer the current home of the sssd project.

(In reply to u.sibiller from comment #13)
> Yes, this reduces the number of cache requests to 3, see attached log.

That's interesting, it sounds right. If the lookup key isn't found, autofs will check for wildcard matches: for historic reasons it will look for '/' and, if that's not found, it will look for '*' (the usual wildcard key).

Upstream ticket: https://github.com/SSSD/sssd/issues/5832

Upstream PR: https://github.com/SSSD/sssd/pull/5833

Thank you for testing the scratch build. I have opened a pull request against upstream.

Hmm, "internal target release" 8.6.0 is quite far away. So is there a chance to get fixed packages for 8.4 (as written before, the above packages depend on samba-client-libs 4.13.3-5.el8_4, which is not available (yet?))?

(In reply to u.sibiller from comment #21)
> Hmm, "internal target release" 8.6.0 is quite far away. So is there a chance
> to get fixed packages for 8.4 (as written before, the above packages depend on
> samba-client-libs 4.13.3-5.el8_4, which is not available (yet?))?

The package from comment 12 is merely a "test" build (i.e. "for testing purposes only"); it is not "supported". The reported bug has to be fixed in the release currently under development first, i.e. 8.6/9.0. A backport of the fix to previously released minor RHEL versions will be considered as well. So "there is a chance", but no promise.

(In reply to u.sibiller from comment #21)
> Hmm, "internal target release" 8.6.0 is quite far away. So is there a chance
> to get fixed packages for 8.4 (as written before, the above packages depend on
> samba-client-libs 4.13.3-5.el8_4, which is not available (yet?))?

The fix for 8.4 will be tracked via bz 2016922.

Pushed PR: https://github.com/SSSD/sssd/pull/5833

* `master`
    * bb94a18f0f0cba1e9fb5abf78b995d69e5f3c559 - cache_req: return success for autofs when ENOENT is returned from provider

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (sssd bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:2070
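As an editorial illustration of the direction named in the pushed commit above ("cache_req: return success for autofs when ENOENT is returned from provider"), the following sketch shows the idea under stated assumptions. It is not the actual patch: handle_provider_result(), its types and the request-type enum are invented names.

```c
/*
 * Sketch of the fix idea only, not the real sssd change: for autofs map
 * requests, a provider answer of ENOENT is treated as a successful request
 * whose result is "entry not found", rather than as an error the client
 * would interpret as a reason to keep retrying.
 */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

enum req_type { REQ_USER, REQ_GROUP, REQ_AUTOFS_ENTRY };

struct req_result {
    bool found;   /* definitive answer: entry exists or not */
    int  error;   /* non-zero only for real errors worth retrying */
};

static struct req_result handle_provider_result(enum req_type type, int provider_ret)
{
    struct req_result res = { .found = false, .error = 0 };

    if (provider_ret == 0) {
        res.found = true;                 /* entry exists */
        return res;
    }

    if (provider_ret == ENOENT && type == REQ_AUTOFS_ENTRY) {
        /* The provider positively confirmed the map key does not exist:
         * report success with "not found" so the autofs client gets an
         * immediate negative answer instead of a long retry delay. */
        return res;
    }

    res.error = provider_ret;             /* genuine failure: propagate */
    return res;
}

int main(void)
{
    struct req_result r = handle_provider_result(REQ_AUTOFS_ENTRY, ENOENT);
    printf("found=%d error=%d\n", r.found, r.error);
    return 0;
}
```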