Bug 2165143

Summary:	Autofs reports can't connect to sssd, retry for 10 seconds when real problem is empty LDAP object
Product:	Red Hat Enterprise Linux 8	Reporter:	Bill Marshall <bill.marshall>
Component:	autofs	Assignee:	Ian Kent <ikent>
Status:	CLOSED ERRATA	QA Contact:	Kun Wang <kunwan>
Severity:	medium	Docs Contact:
Priority:	unspecified
Version:	8.7	CC:	abokovoy, atikhono, xzhou
Target Milestone:	rc	Keywords:	Triaged
Target Release:	---	Flags:	pm-rhel: mirror+
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:	autofs-5.1.4-93.el8	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	2170287 (view as bug list)		Environment:
Last Closed:	2023-05-16 09:05:44 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	2170287

Description Bill Marshall 2023-01-27 21:20:42 UTC

Description of problem:
Customer is using autofs LDAP mappings stored in AD LDAP They have multiple sites, multiple environments (prod,preprod,dev,etc.) and we believed each environment had its own place in LDAP.

automount -m doesn't report any mappings for /home

Only thing visible in logs:
`automount[2400]: can't connect to sssd, retry for 10 seconds`
 
In reality, sss was fine, connection to LDAP was fine, but there was no data returned from LDAP. sss<>autofs didn’t communicate the actual details or the problem.

Version-Release number of selected component (if applicable):
autofs-5.1.4-83.el8.x86_64
sssd-2.7.3-4.el8_7.1.x86_64

How reproducible:
Once the environment is setup, easy to reproduce. We could fix/break/fix by changing the base DN in sssd.conf

Steps to Reproduce:
1. configure sssd against AD
2. set the ldap_autofs_search_base in sssd.conf to point at an empty space
3. restart sssd, autofs as needed

Actual results:
automount[2400]: can't connect to sssd, retry for 10 seconds


Expected results:
A useful error message that sssd is fine, but no data is available at the ldap_autofs_search_base 

Additional info:
strace of autofs during failure:


[pid 17986] socket(AF_UNIX, SOCK_STREAM, 0) = 4
[pid 17986] fcntl(4, F_GETFL)           = 0x2 (flags O_RDWR)
[pid 17986] fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 17986] fcntl(4, F_GETFD)           = 0
[pid 17986] fcntl(4, F_SETFD, FD_CLOEXEC) = 0
[pid 17986] connect(4, {sa_family=AF_UNIX, sun_path="/var/lib/sss/pipes/autofs"}, 110) = 0
[pid 17986] fstat(4, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
[pid 17986] poll([{fd=4, events=POLLOUT}], 1, 300000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 17986] sendto(4, "\24\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0", 16, MSG_NOSIGNAL, NULL, 0) = 16
[pid 17986] poll([{fd=4, events=POLLOUT}], 1, 300000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 17986] sendto(4, "\1\0\0\0", 4, MSG_NOSIGNAL, NULL, 0) = 4
[pid 17986] poll([{fd=4, events=POLLIN}], 1, 300000) = 1 ([{fd=4, revents=POLLIN}])
[pid 17986] read(4, "\24\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0", 16) = 16
[pid 17986] poll([{fd=4, events=POLLIN}], 1, 300000) = 1 ([{fd=4, revents=POLLIN}])
[pid 17986] read(4, "\1\0\0\0", 4)      = 4
[pid 17986] poll([{fd=4, events=POLLOUT}], 1, 300000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 17986] sendto(4, "\34\0\0\0\321\0\0\0\0\0\0\0\0\0\0\0", 16, MSG_NOSIGNAL, NULL, 0) = 16
[pid 17986] poll([{fd=4, events=POLLOUT}], 1, 300000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 17986] sendto(4, "auto.master\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
[pid 17986] poll([{fd=4, events=POLLIN}], 1, 300000) = 1 ([{fd=4, revents=POLLIN}])
[pid 17986] read(4, "\30\0\0\0\321\0\0\0\0\0\0\0\0\0\0\0", 16) = 16
[pid 17986] poll([{fd=4, events=POLLIN}], 1, 300000) = 1 ([{fd=4, revents=POLLIN}])
[pid 17986] read(4, "\0\0\0\0\0\0\0\0", 8) = 8
[pid 17986] write(2, "can't connect to sssd, retry for"..., 43can't connect to sssd, retry for 10 seconds) = 43
[pid 17986] write(2, "\n", 1

Comment 1 Ian Kent 2023-01-29 01:21:25 UTC

Yes, it does look like the handling of this failure case isn't right.

Could you please provide an autofs debug log including the start up of autofs
and an example of reproducing the problem.

Possibly the easiest way to do that is to set "logging = debug" in /etc/autofs.conf
and use "journalctl -f | tee autofs.log".

Comment 2 Ian Kent 2023-01-30 03:04:08 UTC

(In reply to Bill Marshall from comment #0)
> Description of problem:
> Customer is using autofs LDAP mappings stored in AD LDAP They have multiple
> sites, multiple environments (prod,preprod,dev,etc.) and we believed each
> environment had its own place in LDAP.
> 
> automount -m doesn't report any mappings for /home
> 
> Only thing visible in logs:
> `automount[2400]: can't connect to sssd, retry for 10 seconds`
>  
> In reality, sss was fine, connection to LDAP was fine, but there was no data
> returned from LDAP. sss<>autofs didn’t communicate the actual details or the
> problem.

I may have duplicated this based on your problem description.
I would still like the debug log if you can get it please.

> 
> Version-Release number of selected component (if applicable):
> autofs-5.1.4-83.el8.x86_64
> sssd-2.7.3-4.el8_7.1.x86_64

It looks like we have all the current fixes for sss communication here, that's
good.

> 
> How reproducible:
> Once the environment is setup, easy to reproduce. We could fix/break/fix by
> changing the base DN in sssd.conf
> 
> Steps to Reproduce:
> 1. configure sssd against AD
> 2. set the ldap_autofs_search_base in sssd.conf to point at an empty space
> 3. restart sssd, autofs as needed
> 
> Actual results:
> automount[2400]: can't connect to sssd, retry for 10 seconds
> 
> 
> Expected results:
> A useful error message that sssd is fine, but no data is available at the
> ldap_autofs_search_base 

autofs doesn't know anything about the sss configuration so the best I can
do is try and work out what the actual error condition is.

The difficulty identifying the actual failure cases so that autofs can take
sensible action and issue sensible messages has frequently been a challenge.

If I have duplicated this correctly those messages "can't connect to sssd,
 retry for 10 seconds" don't actually sleep for 10 seconds before continuing,
right?

Comment 4 Bill Marshall 2023-01-30 14:45:48 UTC

Alexander from sss is copied on this bug, so he can help w/ the autofs<>sss communication.

Re: lack of 10 sec sleep.

I do see repeated groups of 3 in messages:
2023-01-23T16:00:32.803144+00:00 host automount[483501]: can't connect to sssd, retry for 10 seconds
2023-01-23T16:00:32.831333+00:00 host automount[483501]: can't connect to sssd, retry for 10 seconds
2023-01-23T16:00:32.857864+00:00 host automount[483501]: can't connect to sssd, retry for 10 seconds
2023-01-23T16:03:57.231849+00:00 host automount[484120]: can't connect to sssd, retry for 10 seconds
2023-01-23T16:03:57.249833+00:00 host automount[484120]: can't connect to sssd, retry for 10 seconds
2023-01-23T16:03:57.275312+00:00 host automount[484120]: can't connect to sssd, retry for 10 seconds
2023-01-23T19:01:47.758438+00:00 host automount[505339]: can't connect to sssd, retry for 10 seconds
2023-01-23T19:01:47.791376+00:00 host automount[505339]: can't connect to sssd, retry for 10 seconds
2023-01-23T19:01:47.813129+00:00 host automount[505339]: can't connect to sssd, retry for 10 seconds
2023-01-23T19:15:16.102579+00:00 host automount[507408]: can't connect to sssd, retry for 10 seconds
2023-01-23T19:15:16.146749+00:00 host automount[507408]: can't connect to sssd, retry for 10 seconds
2023-01-23T19:15:16.164480+00:00 host automount[507408]: can't connect to sssd, retry for 10 seconds

Also in the shorter strace we ran:
% grep retry strace-automount1.txt
read(3, " nfs_retry_interval, nfs_retry_i"..., 4096) = 4096
[pid 17986] write(2, "can't connect to sssd, retry for"..., 43can't connect to sssd, retry for 10 seconds) = 43
[pid 17986] write(2, "can't connect to sssd, retry for"..., 43can't connect to sssd, retry for 10 seconds) = 43
[pid 17986] write(2, "can't connect to sssd, retry for"..., 43can't connect to sssd, retry for 10 seconds) = 43

As for autofs in debug mode. The system we were working on is no longer available. I can see if a test environment can be setup.

Otherwise, I realized I had an strace of a debug session. If this useful? I can likely attach the whole strace if I sanitize it.

% grep write  strace-automount1.txt | grep -v '\\n'
write(2, "Starting automounter version 5.1"..., 65Starting automounter version 5.1.4-83.el8, master map auto.master) = 65
write(2, "using kernel protocol version 5."..., 34using kernel protocol version 5.05) = 34
[pid 17986] write(2, "lookup_nss_read_master: reading "..., 54lookup_nss_read_master: reading master sss auto.master) = 54
[pid 17986] write(2, "do_init: parse(sun): init gather"..., 57do_init: parse(sun): init gathered global options: (null)) = 57
[pid 17986] write(2, "can't connect to sssd, retry for"..., 43can't connect to sssd, retry for 10 seconds) = 43
[pid 17986] write(2, "lookup_nss_read_master: auto.mas"..., 69lookup_nss_read_master: auto.master not found, replacing '.' with '_') = 69
[pid 17986] write(2, "do_init: parse(sun): init gather"..., 57do_init: parse(sun): init gathered global options: (null)) = 57
[pid 17986] write(2, "can't connect to sssd, retry for"..., 43can't connect to sssd, retry for 10 seconds) = 43
[pid 17986] write(2, "lookup_nss_read_master: no map -"..., 58lookup_nss_read_master: no map - continuing to next source) = 58
[pid 17986] write(2, "lookup_nss_read_master: reading "..., 56lookup_nss_read_master: reading master files auto.master) = 56
[pid 17986] write(2, "do_init: parse(sun): init gather"..., 57do_init: parse(sun): init gathered global options: (null)) = 57
[pid 17986] write(2, "lookup_read_master: lookup(file)"..., 50lookup_read_master: lookup(file): read entry /misc) = 50
[pid 17986] write(2, "lookup_read_master: lookup(file)"..., 49lookup_read_master: lookup(file): read entry /net) = 49
[pid 17986] write(2, "lookup_read_master: lookup(file)"..., 68lookup_read_master: lookup(file): read entry +dir:/etc/auto.master.d) = 68
[pid 17986] write(2, "lookup_nss_read_master: reading "..., 61lookup_nss_read_master: reading master dir /etc/auto.master.d) = 61
[pid 17986] write(2, "lookup_read_master: lookup(dir):"..., 60lookup_read_master: lookup(dir): scandir: /etc/auto.master.d) = 60
[pid 17986] write(2, "lookup_read_master: lookup(file)"..., 57lookup_read_master: lookup(file): read entry +auto.master) = 57
[pid 17986] write(2, "lookup_nss_read_master: reading "..., 54lookup_nss_read_master: reading master sss auto.master) = 54
[pid 17986] write(2, "do_init: parse(sun): init gather"..., 57do_init: parse(sun): init gathered global options: (null)) = 57
[pid 17986] write(2, "can't connect to sssd, retry for"..., 43can't connect to sssd, retry for 10 seconds) = 43
[pid 17986] write(2, "lookup_nss_read_master: no map -"..., 58lookup_nss_read_master: no map - continuing to next source) = 58
[pid 17986] write(2, "lookup_nss_read_master: reading "..., 56lookup_nss_read_master: reading master files auto.master) = 56
[pid 17986] write(2, "do_init: parse(sun): init gather"..., 57do_init: parse(sun): init gathered global options: (null)) = 57
[pid 17986] write(2, "master_do_mount: mounting /misc", 31master_do_mount: mounting /misc) = 31
[pid 17996] write(2, "automount_path_to_fifo: fifo nam"..., 55automount_path_to_fifo: fifo name /run/autofs.fifo-misc) = 55
[pid 17996] write(2, "lookup_nss_read_map: reading map"..., 52lookup_nss_read_map: reading map file /etc/auto.misc) = 52
[pid 17996] write(2, "do_init: parse(sun): init gather"..., 57do_init: parse(sun): init gathered global options: (null)) = 57
[pid 17996] write(2, "remount_active_mount: trying to "..., 57remount_active_mount: trying to re-connect to mount /misc) = 57
[pid 17996] write(2, "mounted indirect on /misc with t"..., 57mounted indirect on /misc with timeout 30, freq 8 seconds) = 57
[pid 17996] write(2, "remount_active_mount: re-connect"..., 49remount_active_mount: re-connected to mount /misc) = 49
[pid 17996] write(2, "st_ready: st_ready(): state = 0 "..., 42st_ready: st_ready(): state = 0 path /misc) = 42
[pid 17986] write(2, "master_do_mount: mounting /net", 30master_do_mount: mounting /net) = 30
[pid 17999] write(2, "automount_path_to_fifo: fifo nam"..., 54automount_path_to_fifo: fifo name /run/autofs.fifo-net) = 54
[pid 17999] write(2, "lookup_nss_read_map: reading map"..., 45lookup_nss_read_map: reading map hosts (null)) = 45
[pid 17999] write(2, "do_init: parse(sun): init gather"..., 57do_init: parse(sun): init gathered global options: (null)) = 57
[pid 17999] write(2, "lookup_read_map: lookup(hosts): "..., 46lookup_read_map: lookup(hosts): read hosts map) = 46
[pid 17999] write(2, "lookup_read_map: lookup(hosts): "..., 84lookup_read_map: lookup(hosts): map not browsable, update existing host entries only) = 84
[pid 17999] write(2, "remount_active_mount: trying to "..., 56remount_active_mount: trying to re-connect to mount /net) = 56
[pid 17999] write(2, "mounted indirect on /net with ti"..., 56mounted indirect on /net with timeout 30, freq 8 seconds) = 56
[pid 17999] write(2, "remount_active_mount: re-connect"..., 48remount_active_mount: re-connected to mount /net) = 48
[pid 17999] write(2, "st_ready: st_ready(): state = 0 "..., 41st_ready: st_ready(): state = 0 path /net) = 41

Comment 5 Alexander Bokovoy 2023-01-30 16:06:55 UTC

Add Alexey to give a help hand from SSSD side.

Comment 6 Ian Kent 2023-01-31 00:13:36 UTC

(In reply to Bill Marshall from comment #4)
> Alexander from sss is copied on this bug, so he can help w/ the autofs<>sss
> communication.
> 
> Re: lack of 10 sec sleep.
> 
> I do see repeated groups of 3 in messages:
> 2023-01-23T16:00:32.803144+00:00 host automount[483501]: can't connect to
> sssd, retry for 10 seconds
> 2023-01-23T16:00:32.831333+00:00 host automount[483501]: can't connect to
> sssd, retry for 10 seconds
> 2023-01-23T16:00:32.857864+00:00 host automount[483501]: can't connect to
> sssd, retry for 10 seconds
> 2023-01-23T16:03:57.231849+00:00 host automount[484120]: can't connect to
> sssd, retry for 10 seconds
> 2023-01-23T16:03:57.249833+00:00 host automount[484120]: can't connect to
> sssd, retry for 10 seconds
> 2023-01-23T16:03:57.275312+00:00 host automount[484120]: can't connect to
> sssd, retry for 10 seconds
> 2023-01-23T19:01:47.758438+00:00 host automount[505339]: can't connect to
> sssd, retry for 10 seconds
> 2023-01-23T19:01:47.791376+00:00 host automount[505339]: can't connect to
> sssd, retry for 10 seconds
> 2023-01-23T19:01:47.813129+00:00 host automount[505339]: can't connect to
> sssd, retry for 10 seconds
> 2023-01-23T19:15:16.102579+00:00 host automount[507408]: can't connect to
> sssd, retry for 10 seconds
> 2023-01-23T19:15:16.146749+00:00 host automount[507408]: can't connect to
> sssd, retry for 10 seconds
> 2023-01-23T19:15:16.164480+00:00 host automount[507408]: can't connect to
> sssd, retry for 10 seconds
> 
> Also in the shorter strace we ran:
> % grep retry strace-automount1.txt
> read(3, " nfs_retry_interval, nfs_retry_i"..., 4096) = 4096
> [pid 17986] write(2, "can't connect to sssd, retry for"..., 43can't connect
> to sssd, retry for 10 seconds) = 43
> [pid 17986] write(2, "can't connect to sssd, retry for"..., 43can't connect
> to sssd, retry for 10 seconds) = 43
> [pid 17986] write(2, "can't connect to sssd, retry for"..., 43can't connect
> to sssd, retry for 10 seconds) = 43
> 
> As for autofs in debug mode. The system we were working on is no longer
> available. I can see if a test environment can be setup.

Ok, probably not a problem now since I'm pretty sure I am reproducing what
your seeing.

> 
> Otherwise, I realized I had an strace of a debug session. If this useful? I
> can likely attach the whole strace if I sanitize it.

Unfortunately I use strace only for specific things and this isn't one
of those cases so it's not that useful.

I'll continue looking at what I have, at the very least there looks to
be an ordering problem that adds an additional call as well as the error
handling problem.


Ian

Comment 7 Alexey Tikhonov 2023-01-31 14:17:54 UTC

(In reply to Alexander Bokovoy from comment #5)
> Add Alexey to give a help hand from SSSD side.

It looks like Ian already nailed an issue. But please let me know if anything required from SSSD side.

Comment 15 errata-xmlrpc 2023-05-16 09:05:44 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (autofs bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:2970