Bug 2165143
| Summary: | Autofs reports can't connect to sssd, retry for 10 seconds when real problem is empty LDAP object | |||
|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Bill Marshall <bill.marshall> | |
| Component: | autofs | Assignee: | Ian Kent <ikent> | |
| Status: | CLOSED ERRATA | QA Contact: | Kun Wang <kunwan> | |
| Severity: | medium | Docs Contact: | ||
| Priority: | unspecified | |||
| Version: | 8.7 | CC: | abokovoy, atikhono, xzhou | |
| Target Milestone: | rc | Keywords: | Triaged | |
| Target Release: | --- | Flags: | pm-rhel:
mirror+
|
|
| Hardware: | x86_64 | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | autofs-5.1.4-93.el8 | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 2170287 (view as bug list) | Environment: | ||
| Last Closed: | 2023-05-16 09:05:44 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 2170287 | |||
Yes, it does look like the handling of this failure case isn't right. Could you please provide an autofs debug log including the start up of autofs and an example of reproducing the problem. Possibly the easiest way to do that is to set "logging = debug" in /etc/autofs.conf and use "journalctl -f | tee autofs.log". (In reply to Bill Marshall from comment #0) > Description of problem: > Customer is using autofs LDAP mappings stored in AD LDAP They have multiple > sites, multiple environments (prod,preprod,dev,etc.) and we believed each > environment had its own place in LDAP. > > automount -m doesn't report any mappings for /home > > Only thing visible in logs: > `automount[2400]: can't connect to sssd, retry for 10 seconds` > > In reality, sss was fine, connection to LDAP was fine, but there was no data > returned from LDAP. sss<>autofs didn’t communicate the actual details or the > problem. I may have duplicated this based on your problem description. I would still like the debug log if you can get it please. > > Version-Release number of selected component (if applicable): > autofs-5.1.4-83.el8.x86_64 > sssd-2.7.3-4.el8_7.1.x86_64 It looks like we have all the current fixes for sss communication here, that's good. > > How reproducible: > Once the environment is setup, easy to reproduce. We could fix/break/fix by > changing the base DN in sssd.conf > > Steps to Reproduce: > 1. configure sssd against AD > 2. set the ldap_autofs_search_base in sssd.conf to point at an empty space > 3. restart sssd, autofs as needed > > Actual results: > automount[2400]: can't connect to sssd, retry for 10 seconds > > > Expected results: > A useful error message that sssd is fine, but no data is available at the > ldap_autofs_search_base autofs doesn't know anything about the sss configuration so the best I can do is try and work out what the actual error condition is. The difficulty identifying the actual failure cases so that autofs can take sensible action and issue sensible messages has frequently been a challenge. If I have duplicated this correctly those messages "can't connect to sssd, retry for 10 seconds" don't actually sleep for 10 seconds before continuing, right? Alexander from sss is copied on this bug, so he can help w/ the autofs<>sss communication. Re: lack of 10 sec sleep. I do see repeated groups of 3 in messages: 2023-01-23T16:00:32.803144+00:00 host automount[483501]: can't connect to sssd, retry for 10 seconds 2023-01-23T16:00:32.831333+00:00 host automount[483501]: can't connect to sssd, retry for 10 seconds 2023-01-23T16:00:32.857864+00:00 host automount[483501]: can't connect to sssd, retry for 10 seconds 2023-01-23T16:03:57.231849+00:00 host automount[484120]: can't connect to sssd, retry for 10 seconds 2023-01-23T16:03:57.249833+00:00 host automount[484120]: can't connect to sssd, retry for 10 seconds 2023-01-23T16:03:57.275312+00:00 host automount[484120]: can't connect to sssd, retry for 10 seconds 2023-01-23T19:01:47.758438+00:00 host automount[505339]: can't connect to sssd, retry for 10 seconds 2023-01-23T19:01:47.791376+00:00 host automount[505339]: can't connect to sssd, retry for 10 seconds 2023-01-23T19:01:47.813129+00:00 host automount[505339]: can't connect to sssd, retry for 10 seconds 2023-01-23T19:15:16.102579+00:00 host automount[507408]: can't connect to sssd, retry for 10 seconds 2023-01-23T19:15:16.146749+00:00 host automount[507408]: can't connect to sssd, retry for 10 seconds 2023-01-23T19:15:16.164480+00:00 host automount[507408]: can't connect to sssd, retry for 10 seconds Also in the shorter strace we ran: % grep retry strace-automount1.txt read(3, " nfs_retry_interval, nfs_retry_i"..., 4096) = 4096 [pid 17986] write(2, "can't connect to sssd, retry for"..., 43can't connect to sssd, retry for 10 seconds) = 43 [pid 17986] write(2, "can't connect to sssd, retry for"..., 43can't connect to sssd, retry for 10 seconds) = 43 [pid 17986] write(2, "can't connect to sssd, retry for"..., 43can't connect to sssd, retry for 10 seconds) = 43 As for autofs in debug mode. The system we were working on is no longer available. I can see if a test environment can be setup. Otherwise, I realized I had an strace of a debug session. If this useful? I can likely attach the whole strace if I sanitize it. % grep write strace-automount1.txt | grep -v '\\n' write(2, "Starting automounter version 5.1"..., 65Starting automounter version 5.1.4-83.el8, master map auto.master) = 65 write(2, "using kernel protocol version 5."..., 34using kernel protocol version 5.05) = 34 [pid 17986] write(2, "lookup_nss_read_master: reading "..., 54lookup_nss_read_master: reading master sss auto.master) = 54 [pid 17986] write(2, "do_init: parse(sun): init gather"..., 57do_init: parse(sun): init gathered global options: (null)) = 57 [pid 17986] write(2, "can't connect to sssd, retry for"..., 43can't connect to sssd, retry for 10 seconds) = 43 [pid 17986] write(2, "lookup_nss_read_master: auto.mas"..., 69lookup_nss_read_master: auto.master not found, replacing '.' with '_') = 69 [pid 17986] write(2, "do_init: parse(sun): init gather"..., 57do_init: parse(sun): init gathered global options: (null)) = 57 [pid 17986] write(2, "can't connect to sssd, retry for"..., 43can't connect to sssd, retry for 10 seconds) = 43 [pid 17986] write(2, "lookup_nss_read_master: no map -"..., 58lookup_nss_read_master: no map - continuing to next source) = 58 [pid 17986] write(2, "lookup_nss_read_master: reading "..., 56lookup_nss_read_master: reading master files auto.master) = 56 [pid 17986] write(2, "do_init: parse(sun): init gather"..., 57do_init: parse(sun): init gathered global options: (null)) = 57 [pid 17986] write(2, "lookup_read_master: lookup(file)"..., 50lookup_read_master: lookup(file): read entry /misc) = 50 [pid 17986] write(2, "lookup_read_master: lookup(file)"..., 49lookup_read_master: lookup(file): read entry /net) = 49 [pid 17986] write(2, "lookup_read_master: lookup(file)"..., 68lookup_read_master: lookup(file): read entry +dir:/etc/auto.master.d) = 68 [pid 17986] write(2, "lookup_nss_read_master: reading "..., 61lookup_nss_read_master: reading master dir /etc/auto.master.d) = 61 [pid 17986] write(2, "lookup_read_master: lookup(dir):"..., 60lookup_read_master: lookup(dir): scandir: /etc/auto.master.d) = 60 [pid 17986] write(2, "lookup_read_master: lookup(file)"..., 57lookup_read_master: lookup(file): read entry +auto.master) = 57 [pid 17986] write(2, "lookup_nss_read_master: reading "..., 54lookup_nss_read_master: reading master sss auto.master) = 54 [pid 17986] write(2, "do_init: parse(sun): init gather"..., 57do_init: parse(sun): init gathered global options: (null)) = 57 [pid 17986] write(2, "can't connect to sssd, retry for"..., 43can't connect to sssd, retry for 10 seconds) = 43 [pid 17986] write(2, "lookup_nss_read_master: no map -"..., 58lookup_nss_read_master: no map - continuing to next source) = 58 [pid 17986] write(2, "lookup_nss_read_master: reading "..., 56lookup_nss_read_master: reading master files auto.master) = 56 [pid 17986] write(2, "do_init: parse(sun): init gather"..., 57do_init: parse(sun): init gathered global options: (null)) = 57 [pid 17986] write(2, "master_do_mount: mounting /misc", 31master_do_mount: mounting /misc) = 31 [pid 17996] write(2, "automount_path_to_fifo: fifo nam"..., 55automount_path_to_fifo: fifo name /run/autofs.fifo-misc) = 55 [pid 17996] write(2, "lookup_nss_read_map: reading map"..., 52lookup_nss_read_map: reading map file /etc/auto.misc) = 52 [pid 17996] write(2, "do_init: parse(sun): init gather"..., 57do_init: parse(sun): init gathered global options: (null)) = 57 [pid 17996] write(2, "remount_active_mount: trying to "..., 57remount_active_mount: trying to re-connect to mount /misc) = 57 [pid 17996] write(2, "mounted indirect on /misc with t"..., 57mounted indirect on /misc with timeout 30, freq 8 seconds) = 57 [pid 17996] write(2, "remount_active_mount: re-connect"..., 49remount_active_mount: re-connected to mount /misc) = 49 [pid 17996] write(2, "st_ready: st_ready(): state = 0 "..., 42st_ready: st_ready(): state = 0 path /misc) = 42 [pid 17986] write(2, "master_do_mount: mounting /net", 30master_do_mount: mounting /net) = 30 [pid 17999] write(2, "automount_path_to_fifo: fifo nam"..., 54automount_path_to_fifo: fifo name /run/autofs.fifo-net) = 54 [pid 17999] write(2, "lookup_nss_read_map: reading map"..., 45lookup_nss_read_map: reading map hosts (null)) = 45 [pid 17999] write(2, "do_init: parse(sun): init gather"..., 57do_init: parse(sun): init gathered global options: (null)) = 57 [pid 17999] write(2, "lookup_read_map: lookup(hosts): "..., 46lookup_read_map: lookup(hosts): read hosts map) = 46 [pid 17999] write(2, "lookup_read_map: lookup(hosts): "..., 84lookup_read_map: lookup(hosts): map not browsable, update existing host entries only) = 84 [pid 17999] write(2, "remount_active_mount: trying to "..., 56remount_active_mount: trying to re-connect to mount /net) = 56 [pid 17999] write(2, "mounted indirect on /net with ti"..., 56mounted indirect on /net with timeout 30, freq 8 seconds) = 56 [pid 17999] write(2, "remount_active_mount: re-connect"..., 48remount_active_mount: re-connected to mount /net) = 48 [pid 17999] write(2, "st_ready: st_ready(): state = 0 "..., 41st_ready: st_ready(): state = 0 path /net) = 41 Add Alexey to give a help hand from SSSD side. (In reply to Bill Marshall from comment #4) > Alexander from sss is copied on this bug, so he can help w/ the autofs<>sss > communication. > > Re: lack of 10 sec sleep. > > I do see repeated groups of 3 in messages: > 2023-01-23T16:00:32.803144+00:00 host automount[483501]: can't connect to > sssd, retry for 10 seconds > 2023-01-23T16:00:32.831333+00:00 host automount[483501]: can't connect to > sssd, retry for 10 seconds > 2023-01-23T16:00:32.857864+00:00 host automount[483501]: can't connect to > sssd, retry for 10 seconds > 2023-01-23T16:03:57.231849+00:00 host automount[484120]: can't connect to > sssd, retry for 10 seconds > 2023-01-23T16:03:57.249833+00:00 host automount[484120]: can't connect to > sssd, retry for 10 seconds > 2023-01-23T16:03:57.275312+00:00 host automount[484120]: can't connect to > sssd, retry for 10 seconds > 2023-01-23T19:01:47.758438+00:00 host automount[505339]: can't connect to > sssd, retry for 10 seconds > 2023-01-23T19:01:47.791376+00:00 host automount[505339]: can't connect to > sssd, retry for 10 seconds > 2023-01-23T19:01:47.813129+00:00 host automount[505339]: can't connect to > sssd, retry for 10 seconds > 2023-01-23T19:15:16.102579+00:00 host automount[507408]: can't connect to > sssd, retry for 10 seconds > 2023-01-23T19:15:16.146749+00:00 host automount[507408]: can't connect to > sssd, retry for 10 seconds > 2023-01-23T19:15:16.164480+00:00 host automount[507408]: can't connect to > sssd, retry for 10 seconds > > Also in the shorter strace we ran: > % grep retry strace-automount1.txt > read(3, " nfs_retry_interval, nfs_retry_i"..., 4096) = 4096 > [pid 17986] write(2, "can't connect to sssd, retry for"..., 43can't connect > to sssd, retry for 10 seconds) = 43 > [pid 17986] write(2, "can't connect to sssd, retry for"..., 43can't connect > to sssd, retry for 10 seconds) = 43 > [pid 17986] write(2, "can't connect to sssd, retry for"..., 43can't connect > to sssd, retry for 10 seconds) = 43 > > As for autofs in debug mode. The system we were working on is no longer > available. I can see if a test environment can be setup. Ok, probably not a problem now since I'm pretty sure I am reproducing what your seeing. > > Otherwise, I realized I had an strace of a debug session. If this useful? I > can likely attach the whole strace if I sanitize it. Unfortunately I use strace only for specific things and this isn't one of those cases so it's not that useful. I'll continue looking at what I have, at the very least there looks to be an ordering problem that adds an additional call as well as the error handling problem. Ian (In reply to Alexander Bokovoy from comment #5) > Add Alexey to give a help hand from SSSD side. It looks like Ian already nailed an issue. But please let me know if anything required from SSSD side. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (autofs bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:2970 |
Description of problem: Customer is using autofs LDAP mappings stored in AD LDAP They have multiple sites, multiple environments (prod,preprod,dev,etc.) and we believed each environment had its own place in LDAP. automount -m doesn't report any mappings for /home Only thing visible in logs: `automount[2400]: can't connect to sssd, retry for 10 seconds` In reality, sss was fine, connection to LDAP was fine, but there was no data returned from LDAP. sss<>autofs didn’t communicate the actual details or the problem. Version-Release number of selected component (if applicable): autofs-5.1.4-83.el8.x86_64 sssd-2.7.3-4.el8_7.1.x86_64 How reproducible: Once the environment is setup, easy to reproduce. We could fix/break/fix by changing the base DN in sssd.conf Steps to Reproduce: 1. configure sssd against AD 2. set the ldap_autofs_search_base in sssd.conf to point at an empty space 3. restart sssd, autofs as needed Actual results: automount[2400]: can't connect to sssd, retry for 10 seconds Expected results: A useful error message that sssd is fine, but no data is available at the ldap_autofs_search_base Additional info: strace of autofs during failure: [pid 17986] socket(AF_UNIX, SOCK_STREAM, 0) = 4 [pid 17986] fcntl(4, F_GETFL) = 0x2 (flags O_RDWR) [pid 17986] fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0 [pid 17986] fcntl(4, F_GETFD) = 0 [pid 17986] fcntl(4, F_SETFD, FD_CLOEXEC) = 0 [pid 17986] connect(4, {sa_family=AF_UNIX, sun_path="/var/lib/sss/pipes/autofs"}, 110) = 0 [pid 17986] fstat(4, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0 [pid 17986] poll([{fd=4, events=POLLOUT}], 1, 300000) = 1 ([{fd=4, revents=POLLOUT}]) [pid 17986] sendto(4, "\24\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0", 16, MSG_NOSIGNAL, NULL, 0) = 16 [pid 17986] poll([{fd=4, events=POLLOUT}], 1, 300000) = 1 ([{fd=4, revents=POLLOUT}]) [pid 17986] sendto(4, "\1\0\0\0", 4, MSG_NOSIGNAL, NULL, 0) = 4 [pid 17986] poll([{fd=4, events=POLLIN}], 1, 300000) = 1 ([{fd=4, revents=POLLIN}]) [pid 17986] read(4, "\24\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0", 16) = 16 [pid 17986] poll([{fd=4, events=POLLIN}], 1, 300000) = 1 ([{fd=4, revents=POLLIN}]) [pid 17986] read(4, "\1\0\0\0", 4) = 4 [pid 17986] poll([{fd=4, events=POLLOUT}], 1, 300000) = 1 ([{fd=4, revents=POLLOUT}]) [pid 17986] sendto(4, "\34\0\0\0\321\0\0\0\0\0\0\0\0\0\0\0", 16, MSG_NOSIGNAL, NULL, 0) = 16 [pid 17986] poll([{fd=4, events=POLLOUT}], 1, 300000) = 1 ([{fd=4, revents=POLLOUT}]) [pid 17986] sendto(4, "auto.master\0", 12, MSG_NOSIGNAL, NULL, 0) = 12 [pid 17986] poll([{fd=4, events=POLLIN}], 1, 300000) = 1 ([{fd=4, revents=POLLIN}]) [pid 17986] read(4, "\30\0\0\0\321\0\0\0\0\0\0\0\0\0\0\0", 16) = 16 [pid 17986] poll([{fd=4, events=POLLIN}], 1, 300000) = 1 ([{fd=4, revents=POLLIN}]) [pid 17986] read(4, "\0\0\0\0\0\0\0\0", 8) = 8 [pid 17986] write(2, "can't connect to sssd, retry for"..., 43can't connect to sssd, retry for 10 seconds) = 43 [pid 17986] write(2, "\n", 1