Bug 2155070
| Summary: | file descriptor leak making autofs hit max open files soft limit |
|---|---|
| Product: | Red Hat Enterprise Linux 8 |
| Component: | autofs |
| Version: | CentOS Stream |
| Hardware: | x86_64 |
| OS: | Linux |
| Status: | CLOSED MIGRATED |
| Severity: | high |
| Priority: | unspecified |
| Reporter: | Paul Raines <raines> |
| Assignee: | Ian Kent <ikent> |
| QA Contact: | Kun Wang <kunwan> |
| CC: | atikhono, bstinson, jwboyer, xzhou |
| Keywords: | MigratedToJIRA |
| Flags: | pm-rhel: mirror+ |
| Target Milestone: | rc |
| Target Release: | --- |
| Doc Type: | If docs needed, set a value |
| Last Closed: | 2023-09-23 11:26:56 UTC |
| Type: | Bug |
Description (Paul Raines, 2022-12-19 20:28:13 UTC)
Basically leaking an fd with every mount:

```
# for f in {1..20} ; do
> /bin/ls -l /proc/$(systemctl show --property MainPID --value autofs)/fd/ | wc -l
> ls -ald /autofs/space/sis_001/
> sleep 1
> umount /autofs/space/sis_001
> done
758
drwxr-xr-x. 8 root root 4096 Jun 20 14:18 /autofs/space/sis_001/
759
drwxr-xr-x. 8 root root 4096 Jun 20 14:18 /autofs/space/sis_001/
760
drwxr-xr-x. 8 root root 4096 Jun 20 14:18 /autofs/space/sis_001/
761
drwxr-xr-x. 8 root root 4096 Jun 20 14:18 /autofs/space/sis_001/
762
drwxr-xr-x. 8 root root 4096 Jun 20 14:18 /autofs/space/sis_001/
763
drwxr-xr-x. 8 root root 4096 Jun 20 14:18 /autofs/space/sis_001/
764
drwxr-xr-x. 8 root root 4096 Jun 20 14:18 /autofs/space/sis_001/
765
drwxr-xr-x. 8 root root 4096 Jun 20 14:18 /autofs/space/sis_001/
766
drwxr-xr-x. 8 root root 4096 Jun 20 14:18 /autofs/space/sis_001/
767
drwxr-xr-x. 8 root root 4096 Jun 20 14:18 /autofs/space/sis_001/
768
drwxr-xr-x. 8 root root 4096 Jun 20 14:18 /autofs/space/sis_001/
769
drwxr-xr-x. 8 root root 4096 Jun 20 14:18 /autofs/space/sis_001/
770
drwxr-xr-x. 8 root root 4096 Jun 20 14:18 /autofs/space/sis_001/
771
drwxr-xr-x. 8 root root 4096 Jun 20 14:18 /autofs/space/sis_001/
772
drwxr-xr-x. 8 root root 4096 Jun 20 14:18 /autofs/space/sis_001/
773
drwxr-xr-x. 8 root root 4096 Jun 20 14:18 /autofs/space/sis_001/
774
drwxr-xr-x. 8 root root 4096 Jun 20 14:18 /autofs/space/sis_001/
775
drwxr-xr-x. 8 root root 4096 Jun 20 14:18 /autofs/space/sis_001/
776
drwxr-xr-x. 8 root root 4096 Jun 20 14:18 /autofs/space/sis_001/
777
drwxr-xr-x. 8 root root 4096 Jun 20 14:18 /autofs/space/sis_001/
```
Can you post your configuration, in particular the maps? If you don't want to post them, describe what they are, e.g. indirect or direct mount maps, whether they have map entries with offsets, etc. Or you could provide a debug log.

(In reply to Paul Raines from comment #1)
> Basically leaking a fd with every mount:

This example doesn't necessarily provide evidence of a file handle leak. Direct and offset mounts do consume a file handle when they are covered by a real mount (since, originally, you couldn't open a file handle for ioctls after they were covered; that's not the case now though). Changing that is a very large amount of work. It isn't out of the question though, and I have thought about it from time to time.

However, your point that there might be a leak that may be fairly straightforward to fix is well taken; I'll have a look.

Ian

(In reply to Paul Raines from comment #1)
> Basically leaking a fd with every mount:
>
> # for f in {1..20} ; do
> > /bin/ls -l /proc/$(systemctl show --property MainPID --value autofs)/fd/ | wc -l
> > ls -ald /autofs/space/sis_001/
> > sleep 1
> > umount /autofs/space/sis_001
> > done
> 758
> drwxr-xr-x. 8 root root 4096 Jun 20 14:18 /autofs/space/sis_001/
> 759
> drwxr-xr-x. 8 root root 4096 Jun 20 14:18 /autofs/space/sis_001/
> 760

I've tried this with rev 83 and a map that would consume a file handle, and observed that autofs does cater for correcting the manual umount, which would otherwise leave an open file handle. I'll need more specific information about the map used and, ideally, a debug log from the startup of autofs to the end of the above test run.
(In reply to Paul Raines from comment #0)
> Reproducible but takes months to reach limit in normal usage
>
> Steps to Reproduce:
> Just running autofs on department clients with decently sized maps
>
> Actual results:
> autofs eventually stops working with the above error, as I find it using
> the soft limit of open files of 20480
>
> Expected results:
> autofs working normally with < 100 open file descriptors

Again, whether you use fewer file handles or more depends on your map. Plain old indirect mount maps should use just a few file handles regardless of the number of active mounts, and that should remain fairly consistent over time.

Ian

Can you tell me how to get the debug logs you want?

My maps are indirect and quite large, but the lines in them are all one of two forms like:

sis_001 sis:/local_data/001

or

foo_002 :/autofs/space/bar_003

and all indirect submaps are served via LDAP/sssd (nsswitch.conf has automount: files sss).

The /etc/auto.master on each box is actually the untouched RedHat8 default:

/misc /etc/auto.misc
/net -hosts
+dir:/etc/auto.master.d
+auto.master

where in LDAP auto.master is:

/autofs/space auto.space intr,rw,hard,nodev,nosuid
/autofs/homes auto.homes intr,rw,hard,nodev,nosuid,vers=3,proto=tcp
/autofs/cmas auto.cmas intr,rw,hard,nodev,nosuid,nodev,vers=3
/autofs/links auto.links intr,rw,hard,nodev,nosuid,nodev,vers=3
/autofs/cluster auto.cluster intr,rw,hard,nodev,nosuid
/autofs/eris auto.eris intr,rw,hard,nodev,nosuid,vers=3
/autofs/nas auto.nas intr,rw,hard,nodev,nosuid
/autofs/vast auto.vast intr,rw,hard,nodev,rsize=1048576,wsize=1048567,nosuid,vers=3,proto=tcp

and, for example, auto.space has 2458 entries of the two forms of lines above.

After some experimenting, in fact I think LDAP must be part of the issue. With /etc/auto.master simply being the line (all 2458 entries in the local file /etc/auto.space)

/autofs/space /etc/auto.space intr,rw,hard,nodev/nosuid

there is no issue with increasing fd's in the for loop.
But if it is changed to

/autofs/space auto.space intr,rw,hard,nodev,nosuid

so it pulls the indirect map from LDAP, then the issue of increasing fd's appears in the for loop.

I have the exact same autofs config and maps on all my machines, CentOS Stream 8 and Rocky 8, and only see the issue on the former.

I noticed the ver 84 RPM is out for CentOS Stream 8 and upgraded a few boxes to it, but the problem still persists.

In more weirdness, upon downgrading a CentOS Stream 8 box to the autofs-5.1.4-82.el8.x86_64.rpm package from Rocky 8.6, I had the issue go away there. And then upgrading that same box to the ver 84 RPM, it is still "fixed" on that box. But when I tried to repeat this on 3 other CentOS Stream 8 boxes, the issue did NOT go away.

Also, oddly, upgrading to the ver 83 RPM from Rocky 8.7 on my Rocky box does not result in the issue appearing there.

This may be a more subtle interaction between the various autofs, glibc, sssd, and openldap packages.

(In reply to Paul Raines from comment #6)
> Can you tell me how to get the debug logs you want?

Sure, but it depends on whether you want to capture the log from syslog or just catch the output of the systemd journal. In either case it's a matter of setting the config entry "logging = debug" in /etc/autofs.conf and "reload"ing the config, or restarting autofs if you've changed it.

If you choose to use syslog then you need to make sure the syslog log level actually sends the entries to syslog. If you really want to do this I'll dig out my notes. Catching the output from systemd is as simple as "journalctl -f | tee autofs.log" or similar.

> My maps are indirect and are quite large, but the lines in them are all one
> of two forms like:
>
> sis_001 sis:/local_data/001
>
> or
>
> foo_002 :/autofs/space/bar_003

Sub-mounts, yes. I checked that case and couldn't reproduce it with them either.
> and all indirect submaps are served via LDAP/sssd (nsswitch.conf has ->
> automount: files sss)
>
> The /etc/auto.master on each box is actually the untouched RedHat8 default
>
> /misc /etc/auto.misc
> /net -hosts
> +dir:/etc/auto.master.d
> +auto.master
>
> where in LDAP auto.master is
>
> /autofs/space auto.space intr,rw,hard,nodev,nosuid
> /autofs/homes auto.homes intr,rw,hard,nodev,nosuid,vers=3,proto=tcp
> /autofs/cmas auto.cmas intr,rw,hard,nodev,nosuid,nodev,vers=3
> /autofs/links auto.links intr,rw,hard,nodev,nosuid,nodev,vers=3
> /autofs/cluster auto.cluster intr,rw,hard,nodev,nosuid
> /autofs/eris auto.eris intr,rw,hard,nodev,nosuid,vers=3
> /autofs/nas auto.nas intr,rw,hard,nodev,nosuid
> /autofs/vast auto.vast
> intr,rw,hard,nodev,rsize=1048576,wsize=1048567,nosuid,vers=3,proto=tcp
>
> and for example auto.space has 2458 entries of the two forms of lines above
>
> After some experimenting, in fact I think LDAP must be part of the issue.

I don't think I open any file handles in the LDAP lookups ... mmm ... I wonder if the sss interface is opening a socket to communicate with the sssd daemon and maybe not closing it. I just get several entry points from the sss autofs shared library, so we'll need to consult with the sss folks on this.

> With /etc/auto.master simply being the line (all 2458 in the local file
> /etc/auto.space)
>
> /autofs/space /etc/auto.space intr,rw,hard,nodev/nosuid
>
> there is no issue with increasing fd's in the for loop. But if it is
> changed to
>
> /autofs/space auto.space intr,rw,hard,nodev,nosuid
>
> so it pulls the indirect map from LDAP, then the issue of increasing fd's
> appears in the for loop

It is tempting to just say LDAP, but the autofs LDAP lookup module is very different to the sss lookup module, so let's say sss.
I'll need to set up both direct LDAP and sss to test things out; that'll take a while since I'm already on leave for Christmas. ;)

> I have the exact same autofs config and maps on all my machines, CentOS
> Stream 8 and Rocky 8, and only see the issue on the former.
>
> I noticed ver 84 RPM is out for CentOS Stream 8 and upgraded a few boxes to
> it, but the problem still persists.
>
> In more weirdness, upon downgrading a CentOS Stream 8 box to the
> autofs-5.1.4-82.el8.x86_64.rpm package from Rocky 8.6, I had the issue go
> away there. And then upgrading that same box to ver 84 RPM, it is still
> "fixed" on that box. But when I tried to repeat this on 3 other CentOS
> Stream 8 boxes, the issue did NOT go away
>
> Also, oddly, upgrading to ver 83 RPM from Rocky 8.7 on my Rocky box does not
> result in the issue appearing there.
>
> This may be a more subtle interaction with various autofs, glibc, sssd,
> openldap packages.

Right, this is going to be fun ... ***sigh***.

Ian

Alexey, could you have a look at this please? Check comments #6 and #7. Any advice on this, or on who I should contact to further investigate, would be much appreciated.

Not sure how the sssd socket communication works. I tried restarting sssd on a CS8 box to see if that reset the fd's on autofs back down, but it did not.

I now have the latest CS8 packages of autofs and sssd on a box (autofs-5.1.4-84.el8.x86_64 and sssd-2.7.3-5.0.1.el8.x86_64) and restarted both, but it still shows the issue of ever-increasing fd's on each mount/umount.

I then changed /etc/nsswitch.conf for automount: to use ldap instead of sss. Then the problem went away!!!! Then changed it back to sss and the problem is back.
So this is definitely an sssd / autofs interaction thing.

My sssd config is:

```
[domain/default]
ldap_id_use_start_tls = True
cache_credentials = False
ldap_search_base = dc=foobar,dc=org
id_provider = ldap
auth_provider = ldap
chpass_provider = ldap
ldap_uri = ldap://ldap3.foobar.org, ldap://ldap4.foobar.org
ldap_tls_cacertdir = /etc/openldap/cacerts
ldap_tls_reqcert = never
entry_cache_timeout = 600
autofs_provider=ldap
ldap_autofs_search_base = dc=foobar,dc=org
ldap_autofs_map_object_class=automountMap
ldap_autofs_entry_object_class=automount
ldap_autofs_map_name=automountMapName
ldap_autofs_entry_key=automountKey
ldap_autofs_entry_value=automountInformation

[sssd]
services = nss, pam, autofs
config_file_version = 2
domains = default
```

All other sections are empty.

Hi,

(In reply to Paul Raines from comment #0)
> # \ls -l /proc/2693/fd | wc -l
> 20481
...
> # /bin/ls -l /proc/$(systemctl show --property MainPID --value autofs)/fd/ | grep socket | wc -l
> 674

Where do those sockets point to? `lsof -p <autofs pid>`?

(In reply to Ian Kent from comment #7)
> I wonder if the sss interface is opening a socket to communicate with
> the sssd daemon and maybe not closing it.

Those should be seen as UNIX sockets pointing to `/var/lib/sss/*`.

On RHEL9/C9S this is a socket per service, per thread that makes a call to the sss_client libs (but the socket should be closed when the thread exits). On RHEL8/C8S this should be a single socket per service (`man sssd` shouldn't mention `SSS_LOCKFREE` on those platforms).

https://github.com/SSSD/sssd/blob/ddd85f95db69b49202d8da012597460b743ce460/src/sss_client/common.c#L67

What is the `glibc` version?

(JFTR: under 'service' I mean NSS/PAM/autofs/...)
> In more weirdness, upon downgrading a CentOS Stream 8 box to the autofs-5.1.4-82.el8.x86_64.rpm
> package from Rocky 8.6, I had the issue go away there.
Was sssd downgraded as well?
Created attachment 1933796 [details]
lsof -p output

> > In more weirdness, upon downgrading a CentOS Stream 8 box to the autofs-5.1.4-82.el8.x86_64.rpm
> > package from Rocky 8.6, I had the issue go away there.
>
> Was sssd downgraded as well?

No, just autofs. And I have since upgraded it to ver 84 and even rebooted. That box still remains "fixed". But every other CS8 box I tried this on remained "unfixed". So I am not sure why that box is an outlier now.

The glibc version is: glibc-2.28-209.el8.x86_64

See the cs14_autofs.txt attachment, which shows the lsof output requested right after a for loop that increased the fd's.

Created attachment 1933797 [details]
ss -a --unix -p output
This might be useful too. I think it shows all these "extra" sockets connect to nothing.
An established connection to an SSSD service looks like:
```
u_str ESTAB 0 0 /var/lib/sss/pipes/nss 22462 * 27091 users:(("sssd_nss",pid=1039,fd=25))
u_str ESTAB 0 0 * 27091 * 22462 users:(("agetty",pid=1051,fd=4))
```
-- local/peer match (and `lsof` says `type=STREAM (CONNECTED)`).
I'm not sure, but those "u_str ESTAB 0 0 * 643868241 * 0" entries look like the other side of the connection closed the socket, but `automount` (or an underlying lib) didn't process this event (and didn't close the fd).
Since the fd count increases with every "automount - sleep 1 - umount" cycle, I don't think those are sss_client/SSSD connections (those would be established; the default `client_idle_timeout` is 60 secs, not less than 10 secs).
But to make sure you can enable `debug_level = 9` in `[nss]` and `[autofs]` sections of `sssd.conf`, restart SSSD and watch for "Client [cmd %s][uid %u][%p][%d] connected" messages in `/var/log/sssd/*.log`, where cmd == 'automount'.
(In reply to Paul Raines from comment #9)
> I then changed /etc/nsswitch.conf for automount: to use ldap instead of sss.
> Then the problem went away!!!! Then changed it back to sss and the problem is back.

But this ^^ is suspicious, I agree. Checking the SSSD logs is worth a try.

Does automount load/unload plugins dynamically for every mount attempt?

> Does automount load/unload plugins dynamically for every mount attempt?

Not sure what this means. Maybe this is a question for Ian.
I put in the two debug_level lines in sssd.conf and restarted sssd and then ran the for loop test again. This is what I found:
```
# grep Client sssd_autofs.log | grep connected
(2022-12-20 16:39:09): [autofs] [accept_fd_handler] (0x0400): [CID#1] Client [cmd /usr/sbin/automount][uid 0][0x55cee5f62c60][23] connected!
(2022-12-20 16:39:12): [autofs] [accept_fd_handler] (0x0400): [CID#2] Client [cmd /usr/sbin/automount][uid 0][0x55cee5f87880][24] connected!
(2022-12-20 16:39:14): [autofs] [accept_fd_handler] (0x0400): [CID#3] Client [cmd /usr/sbin/automount][uid 0][0x55cee5fa4de0][25] connected!
(2022-12-20 16:39:17): [autofs] [accept_fd_handler] (0x0400): [CID#4] Client [cmd /usr/sbin/automount][uid 0][0x55cee6006bd0][26] connected!
(2022-12-20 16:39:19): [autofs] [accept_fd_handler] (0x0400): [CID#5] Client [cmd /usr/sbin/automount][uid 0][0x55cee605dec0][27] connected!
(2022-12-20 16:39:21): [autofs] [accept_fd_handler] (0x0400): [CID#6] Client [cmd /usr/sbin/automount][uid 0][0x55cee5fd2250][28] connected!
(2022-12-20 16:39:24): [autofs] [accept_fd_handler] (0x0400): [CID#7] Client [cmd /usr/sbin/automount][uid 0][0x55cee600af30][29] connected!
(2022-12-20 16:39:26): [autofs] [accept_fd_handler] (0x0400): [CID#8] Client [cmd /usr/sbin/automount][uid 0][0x55cee6067cc0][30] connected!
(2022-12-20 16:39:28): [autofs] [accept_fd_handler] (0x0400): [CID#9] Client [cmd /usr/sbin/automount][uid 0][0x55cee6013db0][31] connected!
(2022-12-20 16:39:31): [autofs] [accept_fd_handler] (0x0400): [CID#10] Client [cmd /usr/sbin/automount][uid 0][0x55cee605d9f0][32] connected!
(2022-12-20 16:39:33): [autofs] [accept_fd_handler] (0x0400): [CID#11] Client [cmd /usr/sbin/automount][uid 0][0x55cee603c740][33] connected!
(2022-12-20 16:39:35): [autofs] [accept_fd_handler] (0x0400): [CID#12] Client [cmd /usr/sbin/automount][uid 0][0x55cee608e5a0][34] connected!
(2022-12-20 16:39:38): [autofs] [accept_fd_handler] (0x0400): [CID#13] Client [cmd /usr/sbin/automount][uid 0][0x55cee6081190][35] connected!
(2022-12-20 16:39:40): [autofs] [accept_fd_handler] (0x0400): [CID#14] Client [cmd /usr/sbin/automount][uid 0][0x55cee60f2a20][36] connected!
(2022-12-20 16:39:42): [autofs] [accept_fd_handler] (0x0400): [CID#15] Client [cmd /usr/sbin/automount][uid 0][0x55cee6067770][37] connected!
(2022-12-20 16:39:45): [autofs] [accept_fd_handler] (0x0400): [CID#16] Client [cmd /usr/sbin/automount][uid 0][0x55cee60064c0][38] connected!
(2022-12-20 16:39:47): [autofs] [accept_fd_handler] (0x0400): [CID#17] Client [cmd /usr/sbin/automount][uid 0][0x55cee5fb2b60][39] connected!
(2022-12-20 16:39:49): [autofs] [accept_fd_handler] (0x0400): [CID#18] Client [cmd /usr/sbin/automount][uid 0][0x55cee5fd5d60][40] connected!
(2022-12-20 16:39:52): [autofs] [accept_fd_handler] (0x0400): [CID#19] Client [cmd /usr/sbin/automount][uid 0][0x55cee5fc3d70][41] connected!
(2022-12-20 16:39:54): [autofs] [accept_fd_handler] (0x0400): [CID#20] Client [cmd /usr/sbin/automount][uid 0][0x55cee6059680][42] connected!
```
Nothing for automount in sssd_nss.log.
(In reply to Paul Raines from comment #16)
> > Does automount load/unload plugins dynamically for every mount attempt?

Well, not really, but there's a but ...

> Not sure what this means. Maybe this is a question for Ian

Yes, I think so.

> I put in the two debug_level lines in sssd.conf and restarted sssd and then
> ran the for loop test again. This is what I found:
>
> # grep Client sssd_autofs.log |grep connected
> (2022-12-20 16:39:09): [autofs] [accept_fd_handler] (0x0400): [CID#1] Client
> [cmd /usr/sbin/automount][uid 0][0x55cee5f62c60][23] connected!

Don't really know what this is, but it did make me think! I do open the sss shared library (via dlopen()) to get the sss library entry points. That has to be opening a file handle.

What's more, a while ago I changed to re-initialising the lookup context in some/most cases. It looks like that dlopen()ed module handle is leaking on re-init. I'll need to check the locking so I can fix it without causing a problem, so leave it with me. ;)

Ian

(In reply to Paul Raines from comment #16)
> I put in the two debug_level lines in sssd.conf and restarted sssd and then
> ran the for loop test again. This is what I found:
>
> # grep Client sssd_autofs.log |grep connected
> (2022-12-20 16:39:09): [autofs] [accept_fd_handler] (0x0400): [CID#1] Client [cmd /usr/sbin/automount][uid 0][0x55cee5f62c60][23] connected!
> (2022-12-20 16:39:12): [autofs] [accept_fd_handler] (0x0400): [CID#2] Client [cmd /usr/sbin/automount][uid 0][0x55cee5f87880][24] connected!
...
> (2022-12-20 16:39:54): [autofs] [accept_fd_handler] (0x0400): [CID#20] Client [cmd /usr/sbin/automount][uid 0][0x55cee6059680][42] connected!

So it does indeed re-connect to SSSD... Could you please attach the entire `sssd_autofs.log`?

(In reply to Ian Kent from comment #17)
> (In reply to Paul Raines from comment #16)
> > > Does automount load/unload plugins dynamically for every mount attempt?
>
> Well, not really, but there's a but ...
> I do open the sss shared library (aka. dlopen()) to get the sss library
> entry points. That has to be opening a file handle.
>
> What's more, a while ago I changed to re-initialising the lookup context
> in some/most cases.

Is `dlclose()` executed?
If "yes" then I need to check if it's expected that the fd opened by sss_client is closed at this moment...

> It looks like that dlopen()ed module handle is leaking on re-init, I'll
> need to check the locking so I can fix it without causing a problem so
> leave it with me, ;)

But those handles hardly would be "unix socket type=stream"...

(In reply to Alexey Tikhonov from comment #19)
> > What's more, a while ago I changed to re-initialising the lookup context
> > in some/most cases.
>
> Is `dlclose()` executed?
> If "yes" then I need to check if it's expected that fd open by sss_client is
> closed at this moment...

It should be:
https://github.com/SSSD/sssd/blob/02bdef7d7577aa0eaf10e1920714870d36c277a6/src/sss_client/common.c#L75

`man dlclose`:
```
Shared objects may export functions using the __attribute__((constructor))
and __attribute__((destructor)) function attributes. Constructor functions
are executed before dlopen() returns, and destructor functions are executed
before dlclose() returns.
```

So the question is if automount executes `dlclose()`.

Another question is whether the map/unmap of the shared library is justified, but I don't have insights here.

Created attachment 1933922 [details]
sssd_autofs.log
In case you still needed it, here is the sssd_autofs.log
```
(16:39:09): [accept_fd_handler] (0x0400): [CID#1] Client [cmd /usr/sbin/automount][uid 0][0x55cee5f62c60][23] connected!
(16:39:12): [accept_fd_handler] (0x0400): [CID#2] Client [cmd /usr/sbin/automount][uid 0][0x55cee5f87880][24] connected!
(16:39:14): [accept_fd_handler] (0x0400): [CID#3] Client [cmd /usr/sbin/automount][uid 0][0x55cee5fa4de0][25] connected!
(16:39:17): [accept_fd_handler] (0x0400): [CID#4] Client [cmd /usr/sbin/automount][uid 0][0x55cee6006bd0][26] connected!
(16:39:19): [accept_fd_handler] (0x0400): [CID#5] Client [cmd /usr/sbin/automount][uid 0][0x55cee605dec0][27] connected!
(16:39:21): [accept_fd_handler] (0x0400): [CID#6] Client [cmd /usr/sbin/automount][uid 0][0x55cee5fd2250][28] connected!
(16:39:24): [accept_fd_handler] (0x0400): [CID#7] Client [cmd /usr/sbin/automount][uid 0][0x55cee600af30][29] connected!
(16:39:26): [accept_fd_handler] (0x0400): [CID#8] Client [cmd /usr/sbin/automount][uid 0][0x55cee6067cc0][30] connected!
(16:39:28): [accept_fd_handler] (0x0400): [CID#9] Client [cmd /usr/sbin/automount][uid 0][0x55cee6013db0][31] connected!
(16:39:31): [accept_fd_handler] (0x0400): [CID#10] Client [cmd /usr/sbin/automount][uid 0][0x55cee605d9f0][32] connected!
(16:39:33): [accept_fd_handler] (0x0400): [CID#11] Client [cmd /usr/sbin/automount][uid 0][0x55cee603c740][33] connected!
(16:39:35): [accept_fd_handler] (0x0400): [CID#12] Client [cmd /usr/sbin/automount][uid 0][0x55cee608e5a0][34] connected!
(16:39:38): [accept_fd_handler] (0x0400): [CID#13] Client [cmd /usr/sbin/automount][uid 0][0x55cee6081190][35] connected!
(16:39:40): [accept_fd_handler] (0x0400): [CID#14] Client [cmd /usr/sbin/automount][uid 0][0x55cee60f2a20][36] connected!
(16:39:42): [accept_fd_handler] (0x0400): [CID#15] Client [cmd /usr/sbin/automount][uid 0][0x55cee6067770][37] connected!
(16:39:45): [accept_fd_handler] (0x0400): [CID#16] Client [cmd /usr/sbin/automount][uid 0][0x55cee60064c0][38] connected!
(16:39:47): [accept_fd_handler] (0x0400): [CID#17] Client [cmd /usr/sbin/automount][uid 0][0x55cee5fb2b60][39] connected!
(16:39:49): [accept_fd_handler] (0x0400): [CID#18] Client [cmd /usr/sbin/automount][uid 0][0x55cee5fd5d60][40] connected!
(16:39:52): [accept_fd_handler] (0x0400): [CID#19] Client [cmd /usr/sbin/automount][uid 0][0x55cee5fc3d70][41] connected!
(16:39:54): [accept_fd_handler] (0x0400): [CID#20] Client [cmd /usr/sbin/automount][uid 0][0x55cee6059680][42] connected!
(16:40:39): [client_close_fn] (0x2000): Terminated client [0x55cee5f62c60][23]
(16:40:42): [client_close_fn] (0x2000): Terminated client [0x55cee5f87880][24]
(16:40:44): [client_close_fn] (0x2000): Terminated client [0x55cee5fa4de0][25]
(16:40:47): [client_close_fn] (0x2000): Terminated client [0x55cee6006bd0][26]
(16:40:49): [client_close_fn] (0x2000): Terminated client [0x55cee605dec0][27]
(16:40:51): [client_close_fn] (0x2000): Terminated client [0x55cee5fd2250][28]
(16:40:54): [client_close_fn] (0x2000): Terminated client [0x55cee600af30][29]
(16:40:56): [client_close_fn] (0x2000): Terminated client [0x55cee6067cc0][30]
(16:40:58): [client_close_fn] (0x2000): Terminated client [0x55cee6013db0][31]
(16:41:01): [client_close_fn] (0x2000): Terminated client [0x55cee605d9f0][32]
(16:41:03): [client_close_fn] (0x2000): Terminated client [0x55cee603c740][33]
(16:41:05): [client_close_fn] (0x2000): Terminated client [0x55cee608e5a0][34]
(16:41:08): [client_close_fn] (0x2000): Terminated client [0x55cee6081190][35]
(16:41:10): [client_close_fn] (0x2000): Terminated client [0x55cee60f2a20][36]
(16:41:12): [client_close_fn] (0x2000): Terminated client [0x55cee6067770][37]
(16:41:15): [client_close_fn] (0x2000): Terminated client [0x55cee60064c0][38]
(16:41:17): [client_close_fn] (0x2000): Terminated client [0x55cee5fb2b60][39]
(16:41:19): [client_close_fn] (0x2000): Terminated client [0x55cee5fd5d60][40]
(16:41:22): [client_close_fn] (0x2000): Terminated client [0x55cee5fc3d70][41]
(16:41:24): [client_close_fn] (0x2000): Terminated client [0x55cee6059680][42]
(16:46:04): [accept_fd_handler] (0x0400): [CID#21] Client [cmd /usr/sbin/automount][uid 0][0x55cee5fc3ae0][23] connected!
(16:47:34): [client_close_fn] (0x2000): Terminated client [0x55cee5fc3ae0][23]
```

Every mount attempt makes a new sss_client->SSSD connection (which is later closed on the server side by the idle timeout). So far this confirms that either `dlclose()` is missing or `__attribute__((destructor))` doesn't work as expected.

(In reply to Alexey Tikhonov from comment #20)
> (In reply to Alexey Tikhonov from comment #19)
> > > What's more, a while ago I changed to re-initialising the lookup context
> > > in some/most cases.
> >
> > Is `dlclose()` executed?
> > If "yes" then I need to check if it's expected that fd open by sss_client
> > is closed at this moment...
>
> It should be:
> https://github.com/SSSD/sssd/blob/02bdef7d7577aa0eaf10e1920714870d36c277a6/src/sss_client/common.c#L75
>
> `man dlclose`:
> ```
> Shared objects may export functions using the __attribute__((constructor))
> and __attribute__((destructor)) function attributes. Constructor functions
> are executed before dlopen() returns, and destructor functions are executed
> before dlclose() returns.
> ```
>
> So the question is if automount executes `dlclose()`.

It does, but there is a case, when a map is re-read, where autofs will re-init the lookup module, and that doesn't perform a dlclose() before doing the dlopen().

I need to fix that.

The logic around whether to re-read the map on lookup is a bit complicated, but it should only trigger if the map has changed, so maybe this is not the only problem.

Ian

(In reply to Alexey Tikhonov from comment #20)
> It should be:
> https://github.com/SSSD/sssd/blob/02bdef7d7577aa0eaf10e1920714870d36c277a6/src/sss_client/common.c#L75

Just a quick aside, can this be called in the context of the autofs shared library ...
```
#ifdef HAVE_PTHREAD_EXT
static void sss_at_thread_exit(void *v)
{
    sss_cli_close_socket();
}

static void init_sd_key(void)
{
    pthread_key_create(&sss_sd_key, sss_at_thread_exit);
}
#endif
```

Shared libraries have a habit of being unloaded, which can cause the destructor function to be called without it being in the address space of the caller. I'm not sure if that same problem exists with the at-thread-exit mechanism, but I have had SEGV reports calling a key destructor recently ... anyway, another time ...

Not sure if this illuminates or confuses things more, but that one CS8 box I said was mysteriously "fixed" doesn't really appear to be fixed.

On the "for loop" test it still does not show incrementing fd's, but over time (it is a batch node running many jobs from different users automounting many different data volumes) the fd's are still creeping up. It is at 209 now, whereas all my Rocky boxes are still comfortably staying in the 70's. The 'ss -a --unix -p' command shows many unconnected automount sockets.

Maybe some kind of cache makes it not need to make a new sss connection on each mount in the for loop.

NOTE: the following applies to RHEL9 only, so it is not relevant to this BZ.

(In reply to Ian Kent from comment #24)
> Just a quick aside, can this be called in the context of the autofs shared
> library ...
>
> #ifdef HAVE_PTHREAD_EXT
> static void sss_at_thread_exit(void *v)
> {
>     sss_cli_close_socket();
> }
>
> static void init_sd_key(void)
> {
>     pthread_key_create(&sss_sd_key, sss_at_thread_exit);
> }
> #endif
>
> Shared libraries have a habit of being unloaded which can cause the
> destructor function to be called without it being in the address
> space of the caller.
There are two different things used in the sssd code:
- `__attribute__((destructor))` - the general module destructor
- a thread-specific destructor set using `pthread_key_create()`

> I'm not sure if that same problem exists with the at thread exit
> mechanism but I have got SEGV reports calling key destructor
> recently ... anyway another time ...

It seems you are right: if a thread uses any function from sss_client (and thus `pthread_setspecific()` sets the `sss_sd_key` value for this thread, enabling the destructor) and sss_client is then unloaded before the thread exits, the app will crash when the thread exits (trying to execute the unloaded `sss_at_thread_exit()`).

But the same would happen if the thread tried to execute any previously dlsym()-ed function from this lib. A bit of a grey area...

I need to figure out whether this is an oversight in glibc (but probably it's just impossible to clean up properly during dlclose()). If not, then we have to handle this case in the SSSD code.

(Once again: this is not relevant to this BZ)

(In reply to Alexey Tikhonov from comment #26)
> NOTE: following applies to RHEL9 only so not relevant to this BZ.

Yes, that's right, it's just something I'm struggling with atm.

> (In reply to Ian Kent from comment #24)
> > Just a quick aside, can this be called in the context of the autofs shared
> > library ...
> >
> > Shared libraries have a habit of being unloaded which can cause the
> > destructor function to be called without it being in the address
> > space of the caller.
> There are two different things used in sssd code:
> - `__attribute__((destructor))` - general module destructor
> - thread specific destructor set using `pthread_key_create()`
>
> > I'm not sure if that same problem exists with the at thread exit
> > mechanism but I have got SEGV reports calling key destructor
> > recently ... anyway another time ...
>
> It seems you are right:
> If a thread uses any function from sss_client (and thus
> `pthread_setspecific()` sets `sss_sd_key` value for this thread, enabling
> destructor) and then sss_client is unloaded before thread exits, then app
> will crash when thread exits (trying to execute unloaded
> `sss_at_thread_exit()`).

My understanding is that (and it's not the first time I've had to deal with it), by definition, a thread-specific destructor is called at thread exit, but the library isn't necessarily the thread creator; therefore it (the library) can be unloaded before thread exit, so the needed function isn't available.

So, in this case the only safe way is to use globally available functions for the destructor.

Ian

(In reply to Ian Kent from comment #27)
> > But the same would happen if thread would try to execute any previously
> > dlsym()-ed function from this lib. A bit a grey area...
> >
> > I need to figure out if this is an oversight in glibc (but probably it's
> > just impossible to cleanup properly during dlclose()). If not then we have
> > to handle this case in SSSD code.
> > My understanding is that (and it's not the first time I've had to deal
> > with it), by definition, thread specific destructor is called at thread
> > exit but the library isn't necessarily the thread creator therefore it
> > (the library) can be unloaded before thread exit so the needed function
> > isn't available.
> >
> > So, in this case the only safe way is to use globally available functions
> > for the destructor.

However, surely while the dlopen()ed handle is valid the library won't be unloaded. That has to be a reason to use dlopen()ing rather than demand loading ...

My experience with this (so far) has been with demand loaded shared libraries.

(In reply to Ian Kent from comment #23)
> (In reply to Alexey Tikhonov from comment #20)
> > (In reply to Alexey Tikhonov from comment #19)
> > > >
> > > > What's more, a while ago I changed to re-initialising the lookup context
> > > > in some/most cases.
> > >
> > > Is `dlclose()` executed?
> > > If "yes" then I need to check if it's expected that fd open by sss_client is
> > > closed at this moment...
> >
> > It should be:
> > https://github.com/SSSD/sssd/blob/02bdef7d7577aa0eaf10e1920714870d36c277a6/src/sss_client/common.c#L75
> >
> > `man dlclose`:
> > ```
> > Shared objects may export functions using the __attribute__((constructor))
> > and __attribute__((destructor)) function attributes. Constructor functions
> > are executed before dlopen() returns, and destructor functions are
> > executed before dlclose() returns.
> > ```
> >
> > So the question is if automount executes `dlclose()`.
>
> It does but there is a case, when a map is re-read, that autofs will re-init
> the lookup module and that doesn't perform a dlclose() before doing the
> dlopen().
>
> I need to fix that.
>
> The logic around whether to re-read the map on lookup is a bit complicated
> but should only trigger if the map has changed so maybe this is not the only
> problem.
Upon further investigation the module handle isn't leaking, we'll need to look further for this.

Ian

(In reply to Paul Raines from comment #25)
> Not sure if this illuminates or confuses things more, but that one CS8 box I
> said was mysteriously "fixed" doesn't really appear to be fixed.
>
> On the "for loop" test it still does not show incrementing fd's but over
> time (it is a batch node running many jobs from different users automounting
> many different data volumes) the fd's are still creeping up. It is at 209
> now whereas all my Rocky boxes are still comfortably staying in the 70's.
> The 'ss -a --unix -p' command shows many unconnected automount sockets.

Are they increasing?

> Maybe some kind of cache makes it not need to make a new sss connection on
> each mount on the for loop

Isn't it the case that you're using sub-mounts?

If you try and umount(8) a sub-mount (the base of the mount, not a map key within the sub-mount) it should error with EBUSY "because" a file handle is open on the base of the mount and there will be one for every sub-mount. It's always been that way. So there is a reason for the file handles in use to increase if the number of active sub-mounts increases but, of course, that file handle should be closed when the sub-mount expires away.

If the mount you are umounting is an offset (ie. part of a multi-mount) then it too will have an open file handle while it's mounted, but that file handle will be opened on the directory covered by the mount so it will not prevent the umount. These user-initiated umounts should be checked for and the file handles cleaned up during expire runs.

Ian

I will need to duplicate it after all ...

(In reply to Ian Kent from comment #27)
> (In reply to Alexey Tikhonov from comment #26)
> > NOTE: following applies to RHEL9 only so not relevant to this BZ.
>
> Yes, that's right, it's just something I'm struggling with atm.
JFTR: opened upstream ticket https://github.com/SSSD/sssd/issues/6505

(In reply to Ian Kent from comment #29)
> Upon further investigation the module handle isn't leaking, we'll need to
> look further for this.

Imo, the attached `ss` output and log in comment 22 strongly hint that the leaking FDs are from `libsss_autofs` being re-loaded (either without `dlclose()`, or `__attribute__((destructor))` doesn't work as expected - I'll test it).

(In reply to Alexey Tikhonov from comment #33)
> (In reply to Ian Kent from comment #29)
> > Upon further investigation the module handle isn't leaking, we'll need to
> > look further for this.
>
> Imo, attached `ss` output and log in comment 22 strongly hints that leaking
> FDs are from `libsss_autofs` being re-loaded (either without `dlclose()` or
> `__attribute__((destructor))` doesn't work as expected - I'll test it).

I tested this and `__attribute__((destructor))` works as expected.
Since on RHEL 8 `sss_cli_sd` doesn't have the `__thread` storage specifier, it should be pretty straightforward and shouldn't depend on what thread calls dlclose().
So probably dlclose() is missing after all.

Btw, what is the reason to re-load plugins?
Wouldn't it be possible to have a single shared map {"plugin_name", handle} (lazily initialized once)?

> > Not sure if this illuminates or confuses things more, but that one CS8 box I
> > said was mysteriously "fixed" doesn't really appear to be fixed.
> >
> > On the "for loop" test it still does not show incrementing fd's but over
> > time (it is a batch node running many jobs from different users automounting
> > many different data volumes) the fd's are still creeping up. It is at 209
> > now where as all my Rocky boxes are still comfortably staying in the 70's.
> > The 'ss -a --unix -p' command shows many unconnected automount sockets.
>
> Are they increasing?

The fd's are increasing over time with normal use on this box.
But my "for loop" test does not show the fd's increasing on this one box while it still does on all my other CS8 boxes I tested. So that is the mystery.

However, on this box when I tried the "for loop" test on a different mount in the same indirect map that had never been mounted since the last reboot, it showed the fd's increasing like on other boxes. So I suspect this was just some cache thing with /autofs/space/sis_001.

> > Maybe some kind of cache makes it not need to make a new sss connection on
> > each mount on the for loop
>
> Isn't it the case that your using sub-mounts?
> ...

Sorry, I don't follow this discussion. The map in question is indirect on /autofs/space with entries like in comment 6.

(In reply to Alexey Tikhonov from comment #34)
> (In reply to Alexey Tikhonov from comment #33)
> > (In reply to Ian Kent from comment #29)
> > > Upon further investigation the module handle isn't leaking, we'll need to
> > > look further for this.
> >
> > Imo, attached `ss` output and log in comment 22 strongly hints that leaking
> > FDs are from `libsss_autofs` being re-loaded (either without `dlclose()` or
> > `__attribute__((destructor))` doesn't work as expected - I'll test it).
>
> I tested this and `__attribute__((destructor))` works as expected.
> Since on RHEL 8 `sss_cli_sd` doesn't have `__thread` storage specifier, it
> should be pretty straightforward and shouldn't depend on what thread calls
> dlclose().
> So probably dlclose() is missing after all.
>
> Btw, what is the reason to re-load plugins?
> Wouldn't it be possible to have a single shared map {"plugin_name", handle}
> (lazily initialized once)?

The lookup module reinit entry point is only called upon map reload.

For each master map entry all the module structure fields are copied (such as the module handle) but the corresponding autofs master entry map options may have changed so they need to be re-calculated.

Ian

(In reply to Paul Raines from comment #35)
> > Are they increasing?
>
> The fd's are increasing over time with normal use on this box. But my "for
> loop" test does not show the fds' increasing on this one box while
> it still does on all my other CS8 boxes I tested. So that is the mystery.
>
> However, on this box when I tried the "for loop" test on a different mount
> in same indirect map that had never been mounted since the last reboot it
> showed the fd's increasing like on other boxes. So I suspect this was just
> some cache thing with /autofs/space/sis_001.
>
> > > Maybe some kind of cache makes it not need to make a new sss connection on
> > > each mount on the for loop
> >
> > Isn't it the case that your using sub-mounts?
> > ...
>
> Sorry, I don't follow this discussion. The map in question is indirect on
> /autofs/space with entires like in comment 6

This was one of the cases used in your maps:
foo_002 :/autofs/space/bar_003

I mistakenly thought this was a sub-mount but I think it's actually a bind mount, sorry.

If you were using sub-mounts (aka. option -fstype=autofs) then we can expect the file handle count to increase with each one until it expires away.

Ian

(In reply to Ian Kent from comment #36)
> > I tested this and `__attribute__((destructor))` works as expected.
> > Since on RHEL 8 `sss_cli_sd` doesn't have `__thread` storage specifier, it
> > should be pretty straightforward and shouldn't depend on what thread calls
> > dlclose().
> > So probably dlclose() is missing after all.
> >
> > Btw, what is the reason to re-load plugins?
> > Wouldn't it be possible to have a single shared map {"plugin_name", handle}
> > (lazily initialized once)?

(This question is from a quick glance at autofs-5.1.4 sources, so sorry if I'm asking something silly.)

> The lookup module reinit entry point is only called upon map reload.

Is this `modules/lookup_sss.c::lookup_reinit()` (that uses the existing handle: `do_init(..., 1)` -> `reinit_parse()`)?
> For each master map entry all the module structure fields are copied (such
> as the module handle) but the corresponding autofs master entry map options
> may have changed so they need to be re-calculated.

I'm not sure I properly understand the above, so I tried to trace `dlopen()` usage.

It seems `open_lookup()` -> ... -> `dlopen()` is executed in:
- do_read_master() (and immediately closed)
- do_read_map()
- do_lookup_mount()

Where is the handle closed in the latter 2 cases?
Is this only in `umount_autofs()` (-> lookup_close_lookup() -> ... -> close_lookup() -> lookup_done() -> dlclose())?

`open_lookup()` is also executed in lookup_multi.c -- is this code used to handle the case of multiple entries in nsswitch.conf (i.e. as in this ticket)?

(In reply to Alexey Tikhonov from comment #38)
> (In reply to Ian Kent from comment #36)
> > > I tested this and `__attribute__((destructor))` works as expected.
> > > Since on RHEL 8 `sss_cli_sd` doesn't have `__thread` storage specifier, it
> > > should be pretty straightforward and shouldn't depend on what thread calls
> > > dlclose().
> > > So probably dlclose() is missing after all.
> > >
> > > Btw, what is the reason to re-load plugins?
> > > Wouldn't it be possible to have a single shared map {"plugin_name", handle}
> > > (lazily initialized once)?
>
> (This question is from a quick glance at autofs-5.1.4 sources, so sorry if
> I'm asking something silly.)
>
> > The lookup module reinit entry point is only called upon map reload.
>
> Is this `modules/lookup_sss.c::lookup_reinit()` (that uses existing handle:
> `do_init(..., 1)` -> `reinit_parse()`)?

It is.

> > For each master map entry all the module structure fields are copied (such
> > as the module handle) but the corresponding autofs master entry map options
> > may have changed so they need to be re-calculated.
>
> I'm not sure I properly understand the above, so I tried to trace `dlopen()`
> usage.
>
> It seems `open_lookup()` -> ...
> -> `dlopen()` is executed in:
> - do_read_master() (and immediately closed)

The master map is used to define the automount managed mounts.
It's read and closed once done.

> - do_read_map()

Only if it hasn't already been opened.

> - do_lookup_mount()

Again, only if it hasn't already been opened.

> Where is handle closed in the latter 2 cases?
> Is this only in `umount_autofs()` (-> lookup_close_lookup() -> ... ->
> close_lookup() -> lookup_done() -> dlclose())?

Basically yes, but it isn't quite that simple, which you've probably noticed.

> `open_lookup()` is also executed in lookup_multi.c -- is this a code used
> to handle case of multiple entries in nsswitch.conf (i.e. as in this ticket)?

It isn't, the type "multi" is a very old mechanism that provides a similar function to nsswitch but isn't the same at all. It isn't used much at all as far as I know.

Ian

Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.

This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there.

Due to differences in account names between systems, some fields were not replicated. Be sure to add yourself to the Jira issue's "Watchers" field to continue receiving updates and add others to the "Need Info From" field to continue requesting information.

To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of 2 footprints next to it, and begin with "RHEL-" followed by an integer. You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like:

"Bugzilla Bug" = 1234567

In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues.
You can also visit https://access.redhat.com/articles/7032570 for general account information. |