This bug has been migrated to another issue tracking site. It has been closed here and may no longer be monitored.

If you would like to get updates for this issue, or to participate in it, you may do so at the Red Hat Issue Tracker.
RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets there.

Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against the components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED".

If you cannot log in to RH Jira, please consult article #7032570. Failing that, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry; the e-mail creates a ServiceNow ticket with Red Hat.

Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of the form "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). The same link will be available in a blue banner at the top of the page informing you that the bug has been migrated.
Bug 2155070 - file descriptor leak making autofs hit max open files soft limit
Summary: file descriptor leak making autofs hit max open files soft limit
Keywords:
Status: CLOSED MIGRATED
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: autofs
Version: CentOS Stream
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Ian Kent
QA Contact: Kun Wang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-12-19 20:28 UTC by Paul Raines
Modified: 2023-09-23 11:26 UTC (History)
4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-09-23 11:26:56 UTC
Type: Bug
Target Upstream Version:
Embargoed:
pm-rhel: mirror+


Attachments
lsof -p output (16.62 KB, text/plain), 2022-12-20 17:02 UTC, Paul Raines
ss -a --unix -p output (18.71 KB, text/plain), 2022-12-20 17:12 UTC, Paul Raines
sssd_autofs.log (208.70 KB, text/plain), 2022-12-21 14:34 UTC, Paul Raines


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker   RHEL-7915 0 None Migrated None 2023-09-23 11:26:49 UTC
Red Hat Issue Tracker RHELPLAN-142918 0 None None None 2022-12-19 20:31:24 UTC

Description Paul Raines 2022-12-19 20:28:13 UTC
Description of problem:

I had a CentOS 8 Stream system start failing to automount with

automount[2693]: open_fopen_r:212: failed to open file: Too many open files

Version-Release number of selected component (if applicable):

autofs-5.1.4-83.el8.x86_64

How reproducible:

Reproducible but takes months to reach limit in normal usage

Steps to Reproduce:

Just running autofs on department clients with decently sized maps

Actual results:

autofs eventually stops working with the above error; I find the process has hit
its soft limit of 20480 open files

Expected results:

autofs working normally with < 100 open file descriptors

Additional info:

After upgrading my 100+ CentOS Stream 8 systems in September 2022, I had one system report today:

Dec 19 13:16:38 r01 automount[2693]: attempting to mount entry /autofs/space/sis_001
Dec 19 13:16:38 r01 automount[2693]: open_fopen_r:212: failed to open file: Too many open files
Dec 19 13:16:38 r01 automount[2693]: nsswitch_parse:172: couldn't open /etc/nsswitch.conf
Dec 19 13:16:38 r01 automount[2693]: lookup_nss_mount: can't to read name service switch config.
Dec 19 13:16:38 r01 automount[2693]: failed to mount /autofs/space/sis_001

On this system I find:

# cat /proc/sys/fs/file-max
158269986
# cat /proc/sys/fs/file-nr
26304   0       158269986
# sysctl fs.file-max
fs.file-max = 158269986
# systemctl show --property MainPID --value autofs
2693
# grep open.files /proc/2693/limits
Max open files            20480                262144               files
# \ls -l /proc/2693/fd | wc -l
  20481
# wc -l /etc/mtab
44 /etc/mtab
# grep nfs /etc/mtab | wc -l
9

Surveying my other CentOS Stream 8 boxes, I find a fairly even (by eye) distribution of the number of fd's in /proc/*/fd for the autofs process, ranging from 0 to 20480.  Most of these have been up and running since September, though the higher counts do seem correlated with longer uptime.

Since September on new installs I have mainly been installing Rocky 8.  On these boxes and my old CentOS 7 boxes, I find all have fd's for autofs under 100.

Restarting autofs on a CentOS Stream 8 box will immediately bring the fd's back down in the <100 range and automounts start working again.

I note that on my Rocky 8 boxes the autofs version is slightly older than on the CentOS Stream 8 boxes, autofs-5.1.4-82.el8.x86_64.  Most have been running over 30 days.

Here is a CentOS Stream 8 box running for just 7 days:

# uptime
 15:19:12 up 7 days,  3:05,  2 users,  load average: 0.19, 0.12, 0.05
# /bin/ls /proc/$(systemctl show --property MainPID --value autofs)/fd/ | wc -l
745
# /bin/ls -l /proc/$(systemctl show --property MainPID --value autofs)/fd/ | grep socket | wc -l
674
# /bin/ls -l /proc/$(systemctl show --property MainPID --value autofs)/fd/ | grep -v socket | wc -l
72

So still an abnormally large number of fd's for the autofs process.  The excess fd's always appear to be sockets.

For example on a Rocky box with 81 days uptime I have:

# /bin/ls -l /proc/$(systemctl show --property MainPID --value autofs)/fd/ | grep socket | wc -l
5
# /bin/ls -l /proc/$(systemctl show --property MainPID --value autofs)/fd/ | grep -v socket | wc -l
71

Most CentOS Stream 8 boxes are running the 4.18.0-408.el8.x86_64 kernel, while the Rocky boxes are running 4.18.0-372.9.1.el8.x86_64.

Comment 1 Paul Raines 2022-12-19 21:14:26 UTC
Basically leaking a fd with every mount:

# for f in {1..20} ; do
> /bin/ls -l /proc/$(systemctl show --property MainPID --value autofs)/fd/ | wc -l
> ls -ald /autofs/space/sis_001/
> sleep 1
> umount /autofs/space/sis_001
> done
758
drwxr-xr-x. 8 root root 4096 Jun 20 14:18 /autofs/space/sis_001/
759
drwxr-xr-x. 8 root root 4096 Jun 20 14:18 /autofs/space/sis_001/
760
drwxr-xr-x. 8 root root 4096 Jun 20 14:18 /autofs/space/sis_001/
761
drwxr-xr-x. 8 root root 4096 Jun 20 14:18 /autofs/space/sis_001/
762
drwxr-xr-x. 8 root root 4096 Jun 20 14:18 /autofs/space/sis_001/
763
drwxr-xr-x. 8 root root 4096 Jun 20 14:18 /autofs/space/sis_001/
764
drwxr-xr-x. 8 root root 4096 Jun 20 14:18 /autofs/space/sis_001/
765
drwxr-xr-x. 8 root root 4096 Jun 20 14:18 /autofs/space/sis_001/
766
drwxr-xr-x. 8 root root 4096 Jun 20 14:18 /autofs/space/sis_001/
767
drwxr-xr-x. 8 root root 4096 Jun 20 14:18 /autofs/space/sis_001/
768
drwxr-xr-x. 8 root root 4096 Jun 20 14:18 /autofs/space/sis_001/
769
drwxr-xr-x. 8 root root 4096 Jun 20 14:18 /autofs/space/sis_001/
770
drwxr-xr-x. 8 root root 4096 Jun 20 14:18 /autofs/space/sis_001/
771
drwxr-xr-x. 8 root root 4096 Jun 20 14:18 /autofs/space/sis_001/
772
drwxr-xr-x. 8 root root 4096 Jun 20 14:18 /autofs/space/sis_001/
773
drwxr-xr-x. 8 root root 4096 Jun 20 14:18 /autofs/space/sis_001/
774
drwxr-xr-x. 8 root root 4096 Jun 20 14:18 /autofs/space/sis_001/
775
drwxr-xr-x. 8 root root 4096 Jun 20 14:18 /autofs/space/sis_001/
776
drwxr-xr-x. 8 root root 4096 Jun 20 14:18 /autofs/space/sis_001/
777
drwxr-xr-x. 8 root root 4096 Jun 20 14:18 /autofs/space/sis_001/
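The growth pattern above (exactly one extra fd per mount/umount cycle) can be checked mechanically. Below is a minimal sketch, assuming a Linux /proc filesystem; the helper names are illustrative, not part of autofs:

```python
import os

def open_fd_count(pid: int) -> int:
    """Count a process's open file descriptors via /proc/<pid>/fd."""
    return len(os.listdir(f"/proc/{pid}/fd"))

def leaks_per_cycle(counts: list[int]) -> float:
    """Average fd growth per observed mount/umount cycle; > 0 suggests a leak."""
    if len(counts) < 2:
        return 0.0
    return (counts[-1] - counts[0]) / (len(counts) - 1)
```

Sampling open_fd_count on the automount PID once per cycle of the loop above gives counts like [758, 759, ..., 777], i.e. exactly 1.0 leaked fd per cycle.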

Comment 2 Ian Kent 2022-12-19 23:04:14 UTC
Can you post your configuration, in particular the maps.

If you don't want to post them describe what they are, eg. indirect
or direct mount maps, do they have map entries with offsets etc.

Or you could provide a debug log.

Comment 3 Ian Kent 2022-12-19 23:14:38 UTC
(In reply to Paul Raines from comment #1)
> Basically leaking a fd with every mount:

This example doesn't necessarily provide evidence of a file handle leak.

Direct and offset mounts do consume a file handle when they are covered
by a real mount (since, originally you couldn't open a file handle for
ioctls after they were covered, not the case now though).

To change that is a very large amount of work.
It isn't out of the question though and I have thought about it from
time to time.

However, your point that there might be a leak that may be fairly
straightforward to fix is well taken; I'll have a look.

Ian

Comment 4 Ian Kent 2022-12-20 00:51:30 UTC
(In reply to Paul Raines from comment #1)
> Basically leaking a fd with every mount:
> 
> # for f in {1..20} ; do
> > /bin/ls -l /proc/$(systemctl show --property MainPID --value autofs)/fd/ | wc -l
> > ls -ald /autofs/space/sis_001/
> > sleep 1
> > umount /autofs/space/sis_001
> > done
> 758
> drwxr-xr-x. 8 root root 4096 Jun 20 14:18 /autofs/space/sis_001/
> 759
> drwxr-xr-x. 8 root root 4096 Jun 20 14:18 /autofs/space/sis_001/
> 760

I've tried this with rev 83 and a map that would consume a file handle,
and I observe that autofs does correct for the manual umount, which
would otherwise leave an open file handle.

I'll need more specific information about the map used, and ideally, a
debug log from the startup of autofs to the end of the above test run.

Comment 5 Ian Kent 2022-12-20 00:58:14 UTC
(In reply to Paul Raines from comment #0)
> 
> Reproducible but takes months to reach limit in normal usage
> 
> Steps to Reproduce:
> 
> Just running autofs on department clients with decently sized maps
> 
> Actual results:
> 
> autofs eventually stops working with above error as I find it using
> the soft limit of open files of 20480
> 
> Expected results:
> 
> autofs working normally with < 100 open file descriptors
> 

Again, whether you will use fewer file handles or more depends on
your map.

Plain old indirect mount maps should use just a few file handles
regardless of the number of active mounts and that should remain
fairly consistent over time.

Ian

Comment 6 Paul Raines 2022-12-20 03:52:21 UTC
Can you tell me how to get the debug logs you want?

My maps are indirect and are quite large, but the lines in them are all one of two forms like:

sis_001  sis:/local_data/001

or

foo_002  :/autofs/space/bar_003

and all indirect submaps are served via LDAP/sssd (nsswitch.conf has ->  automount:  files sss)

The /etc/auto.master on each box is actually the untouched RedHat8 default

/misc   /etc/auto.misc
/net    -hosts
+dir:/etc/auto.master.d
+auto.master

where in LDAP auto.master is

/autofs/space   auto.space      intr,rw,hard,nodev,nosuid
/autofs/homes   auto.homes      intr,rw,hard,nodev,nosuid,vers=3,proto=tcp
/autofs/cmas    auto.cmas       intr,rw,hard,nodev,nosuid,nodev,vers=3
/autofs/links   auto.links      intr,rw,hard,nodev,nosuid,nodev,vers=3
/autofs/cluster auto.cluster    intr,rw,hard,nodev,nosuid
/autofs/eris    auto.eris       intr,rw,hard,nodev,nosuid,vers=3
/autofs/nas     auto.nas        intr,rw,hard,nodev,nosuid
/autofs/vast    auto.vast       intr,rw,hard,nodev,rsize=1048576,wsize=1048567,nosuid,vers=3,proto=tcp


and for example auto.space has 2458 entries of the two forms of lines above

After some experimenting, in fact I think LDAP must be part of the issue.

With /etc/auto.master simply being the line (all 2458 in the local file /etc/auto.space)

/autofs/space  /etc/auto.space  intr,rw,hard,nodev,nosuid

there is no issue with increasing fd's in the for loop.  But if it is changed to 

/autofs/space auto.space intr,rw,hard,nodev,nosuid

so it pulls the indirect map from LDAP, then the issue of increasing fd's appears in the for loop

I have the exact same autofs config and maps on all my machines, CentOS Stream 8 and Rocky 8, and only 
see the issue on the former.

I noticed ver 84 RPM is out for CentOS Stream 8 and upgraded a few boxes to it, but the problem 
still persists.

In more weirdness, upon downgrading a CentOS Stream 8 box to the autofs-5.1.4-82.el8.x86_64.rpm
package from Rocky 8.6, I had the issue go away there. And then upgrading that same box to the ver 84 RPM,
it is still "fixed" on that box. But when I tried to repeat this on 3 other CentOS Stream 8 boxes,
the issue did NOT go away.

Also, oddly, upgrading to ver 83 RPM from Rocky 8.7 on my Rocky box does not result in the issue appearing there.

This may be a more subtle interaction with various autofs, glibc, sssd, openldap packages.

Comment 7 Ian Kent 2022-12-20 09:27:35 UTC
(In reply to Paul Raines from comment #6)
> Can you tell me how to get the debug logs you want?

Sure, but it depends on whether you want to capture the log from syslog
or just catch the output of the systemd journal.

In either case it's a matter of setting the config entry "logging = debug"
in /etc/autofs.conf and "reload"ing the config or restarting autofs if
you've changed it.

If you choose to use syslog then you need to make sure the syslog log
level actually sends the entries to syslog. If you really want to do
this I'll dig out my notes.

Catching the output from systemd is as simple as 
"journalctl -f | tee autofs.log" or similar.

> 
> My maps are indirect and are quite large, but the lines in them are all one
> of two forms like:
> 
> sis_001  sis:/local_data/001
> 
> or
> 
> foo_002  :/autofs/space/bar_003

Sub-mounts, yes, I checked that case and couldn't reproduce it with
them either.

> 
> and all indirect submaps are served via LDAP/sssd (nsswitch.conf has -> 
> automount:  files sss)
> 
> The /etc/auto.master on each box is actually the untouched RedHat8 default
> 
> /misc   /etc/auto.misc
> /net    -hosts
> +dir:/etc/auto.master.d
> +auto.master
> 
> where in LDAP auto.master is
> 
> /autofs/space   auto.space      intr,rw,hard,nodev,nosuid
> /autofs/homes   auto.homes      intr,rw,hard,nodev,nosuid,vers=3,proto=tcp
> /autofs/cmas    auto.cmas       intr,rw,hard,nodev,nosuid,nodev,vers=3
> /autofs/links   auto.links      intr,rw,hard,nodev,nosuid,nodev,vers=3
> /autofs/cluster auto.cluster    intr,rw,hard,nodev,nosuid
> /autofs/eris    auto.eris       intr,rw,hard,nodev,nosuid,vers=3
> /autofs/nas     auto.nas        intr,rw,hard,nodev,nosuid
> /autofs/vast    auto.vast      
> intr,rw,hard,nodev,rsize=1048576,wsize=1048567,nosuid,vers=3,proto=tcp
> 
> 
> and for example auto.space has 2458 entries of the two forms of lines above
> 
> After some experimenting, in fact I think LDAP must be part of the issue.  

Don't think I open any file handles in the LDAP lookups ... mmm ...

I wonder if the sss interface is opening a socket to communicate with
the sssd daemon and maybe not closing it. I just get several entry
points from the sss autofs shared library so we'll need to consult
with the sss folks on this.

> 
> With /etc/auto.master simply being the line (all 2458 in the local file
> /etc/auto.space)
> 
> /autofs/space  /etc/auto.space  intr,rw,hard,nodev/nosuid
> 
> there is no issue with increasing fd's in the for loop.  But if it is
> changed to 
> 
> /autofs/space auto.space intr,rw,hard,nodev,nosuid
> 
> so it pulls the indirect map from LDAP, then the issue of increasing fd's
> appears in the for loop

It is tempting to just say LDAP, but the autofs LDAP lookup module
is very different from the sss lookup module, so let's say sss.

I'll need to set up both direct LDAP and sss to test things out; that'll
take a while since I'm already on leave for Christmas, ;)

> 
> I have the exact same autofs config and maps on all my machines, CentOS
> Stream 8 and Rocky 8, and only 
> see the issue on the former.
> 
> I noticed ver 84 RPM is out for CentOS Stream 8 and upgraded a few boxes to
> it, but the problem 
> still persists.
> 
> In more weirdness, upon downgrading a CentOS Stream 8 box to the
> autofs-5.1.4-82.el8.x86_64.rpm 
> package from Rocky 8.6, I had the issue go away there. And then upgrading
> that same box to ver 84 RPM, 
> it is still  "fixed" on that box.  But when I tried to repeat this on 3
> other CentOS Stream 8 boxes, 
> the issue did NOT go away
> 
> Also, oddly, upgrading to ver 83 RPM from Rocky 8.7 on my Rocky box does not
> result in the issue appearing there.
> 
> This may be a more subtle interaction with various autofs, glibc, sssd,
> openldap packages.

Right, this is going to be fun ... ***sigh***.

Ian

Comment 8 Ian Kent 2022-12-20 09:33:36 UTC
Alexey, could you have a look at this please, check comments #6 and #7.

Any advice on this or who I should contact to further investigate would
be much appreciated.

Comment 9 Paul Raines 2022-12-20 14:49:08 UTC
Not sure how the sssd socket communication works.  I tried restarting sssd on a CS8 box to see if that 
reset the fd's on autofs back down but it did not.

I now have the latest CS8 packages of autofs and sssd on a box (autofs-5.1.4-84.el8.x86_64 and 
sssd-2.7.3-5.0.1.el8.x86_64) and restarted both but it still shows the issue of ever increasing
fd's on each mount/umount.

I then changed /etc/nsswitch.conf for automount: to use ldap instead of sss.  Then the problem 
went away!!!!  Then changed it back to sss and the problem is back.

So this is definitely a sssd / autofs interaction thing

My sssd config is

[domain/default]

ldap_id_use_start_tls = True
cache_credentials = False
ldap_search_base = dc=foobar,dc=org
id_provider = ldap
auth_provider = ldap
chpass_provider = ldap
ldap_uri = ldap://ldap3.foobar.org, ldap://ldap4.foobar.org
ldap_tls_cacertdir = /etc/openldap/cacerts
ldap_tls_reqcert = never

entry_cache_timeout = 600

autofs_provider=ldap
ldap_autofs_search_base = dc=foobar,dc=org
ldap_autofs_map_object_class=automountMap
ldap_autofs_entry_object_class=automount
ldap_autofs_map_name=automountMapName
ldap_autofs_entry_key=automountKey
ldap_autofs_entry_value=automountInformation

[sssd]
services = nss, pam, autofs
config_file_version = 2

domains = default

All other sections empty

Comment 10 Alexey Tikhonov 2022-12-20 16:40:21 UTC
Hi,

(In reply to Paul Raines from comment #0)
> 
> # \ls -l /proc/2693/fd | wc -l
>   20481
...
> # /bin/ls -l /proc/$(systemctl show --property MainPID --value autofs)/fd/ | grep socket | wc -l
> 674

Where do those sockets point to?
`lsof -p <automount PID>`?



(In reply to Ian Kent from comment #7)
> 
> I wonder if the sss interface is opening a socket to communicate with
> the sssd daemon and maybe not closing it.

Those should be seen as UNIX sockets pointing to `/var/lib/sss/*`

On RHEL9/C9S this is a socket / per service / per thread that makes a call to sss_client libs (but socket should be closed when thread exits).
On RHEL8/C8S this should be a single socket per service (`man sssd` shouldn't mention `SSS_LOCKFREE` on those platforms).

https://github.com/SSSD/sssd/blob/ddd85f95db69b49202d8da012597460b743ce460/src/sss_client/common.c#L67

What is the `glibc` version?

Comment 11 Alexey Tikhonov 2022-12-20 16:49:00 UTC
(JFTR: under 'service' I mean NSS/PAM/autofs/...)


> In more weirdness, upon downgrading a CentOS Stream 8 box to the autofs-5.1.4-82.el8.x86_64.rpm 
> package from Rocky 8.6, I had the issue go away there.

Was sssd downgraded as well?

Comment 12 Paul Raines 2022-12-20 17:02:47 UTC
Created attachment 1933796 [details]
lsof -p output

> > In more weirdness, upon downgrading a CentOS Stream 8 box to the autofs-5.1.4-82.el8.x86_64.rpm 
> > package from Rocky 8.6, I had the issue go away there.

> Was sssd downgraded as well?

No, just autofs.  And I have since upgraded it to ver 84 and even rebooted.  That box still remains "fixed".

But every other CS8 box I tried this on remained "unfixed".  So I am not sure why that box is an outlier now.

GLIBC ver is:  glibc-2.28-209.el8.x86_64

See the cs14_autofs.txt attachment, which shows the requested lsof output right after a for loop that increased the fd's.

Comment 13 Paul Raines 2022-12-20 17:12:34 UTC
Created attachment 1933797 [details]
ss -a --unix -p | output

This might be useful too.  I think it shows all these "extra" sockets connect to nothing.

Comment 14 Alexey Tikhonov 2022-12-20 19:27:56 UTC
Established connection to SSSD service looks like:
```
u_str ESTAB  0      0                                      /var/lib/sss/pipes/nss 22462            * 27091 users:(("sssd_nss",pid=1039,fd=25))                                                
u_str ESTAB  0      0                                                           * 27091            * 22462 users:(("agetty",pid=1051,fd=4))                                                   
```
  --  local/peer match (and `lsof` says `type=STREAM (CONNECTED)`).


I'm not sure, but those "u_str ESTAB  0   0 * 643868241 * 0" entries look like the other side of the connection closed the socket, but `automount` (or an underlying lib) didn't process this event (and didn't close the fd).

Since the fd count increases with every "automount - sleep 1 - umount" cycle, I don't think those are sss_client/SSSD connections (those would stay established; the default `client_idle_timeout` is 60 secs, not less than 10 secs).
But to make sure you can enable `debug_level = 9` in `[nss]` and `[autofs]` sections of `sssd.conf`, restart SSSD and watch for "Client [cmd %s][uid %u][%p][%d] connected" messages in `/var/log/sssd/*.log`, where cmd == 'automount'.

Comment 15 Alexey Tikhonov 2022-12-20 19:51:43 UTC
(In reply to Paul Raines from comment #9)
> 
> I then changed /etc/nsswitch.conf for automount: to use ldap instead of sss.
> Then the problem 
> went away!!!!  Then changed it back to sss and the problem is back.

But this ^^ is suspicious, I agree. Checking the SSSD logs is worth a try.

Does automount load/unload plugins dynamically for every mount attempt?

Comment 16 Paul Raines 2022-12-20 21:48:07 UTC
> Does automount load/unload plugins dynamically for every mount attempt?

Not sure what this means.  Maybe this is a question for Ian

I put in the two debug_level lines in sssd.conf and restarted sssd and then ran the for loop test again.  This is what I found:

# grep Client sssd_autofs.log |grep connected
(2022-12-20 16:39:09): [autofs] [accept_fd_handler] (0x0400): [CID#1] Client [cmd /usr/sbin/automount][uid 0][0x55cee5f62c60][23] connected!
(2022-12-20 16:39:12): [autofs] [accept_fd_handler] (0x0400): [CID#2] Client [cmd /usr/sbin/automount][uid 0][0x55cee5f87880][24] connected!
(2022-12-20 16:39:14): [autofs] [accept_fd_handler] (0x0400): [CID#3] Client [cmd /usr/sbin/automount][uid 0][0x55cee5fa4de0][25] connected!
(2022-12-20 16:39:17): [autofs] [accept_fd_handler] (0x0400): [CID#4] Client [cmd /usr/sbin/automount][uid 0][0x55cee6006bd0][26] connected!
(2022-12-20 16:39:19): [autofs] [accept_fd_handler] (0x0400): [CID#5] Client [cmd /usr/sbin/automount][uid 0][0x55cee605dec0][27] connected!
(2022-12-20 16:39:21): [autofs] [accept_fd_handler] (0x0400): [CID#6] Client [cmd /usr/sbin/automount][uid 0][0x55cee5fd2250][28] connected!
(2022-12-20 16:39:24): [autofs] [accept_fd_handler] (0x0400): [CID#7] Client [cmd /usr/sbin/automount][uid 0][0x55cee600af30][29] connected!
(2022-12-20 16:39:26): [autofs] [accept_fd_handler] (0x0400): [CID#8] Client [cmd /usr/sbin/automount][uid 0][0x55cee6067cc0][30] connected!
(2022-12-20 16:39:28): [autofs] [accept_fd_handler] (0x0400): [CID#9] Client [cmd /usr/sbin/automount][uid 0][0x55cee6013db0][31] connected!
(2022-12-20 16:39:31): [autofs] [accept_fd_handler] (0x0400): [CID#10] Client [cmd /usr/sbin/automount][uid 0][0x55cee605d9f0][32] connected!
(2022-12-20 16:39:33): [autofs] [accept_fd_handler] (0x0400): [CID#11] Client [cmd /usr/sbin/automount][uid 0][0x55cee603c740][33] connected!
(2022-12-20 16:39:35): [autofs] [accept_fd_handler] (0x0400): [CID#12] Client [cmd /usr/sbin/automount][uid 0][0x55cee608e5a0][34] connected!
(2022-12-20 16:39:38): [autofs] [accept_fd_handler] (0x0400): [CID#13] Client [cmd /usr/sbin/automount][uid 0][0x55cee6081190][35] connected!
(2022-12-20 16:39:40): [autofs] [accept_fd_handler] (0x0400): [CID#14] Client [cmd /usr/sbin/automount][uid 0][0x55cee60f2a20][36] connected!
(2022-12-20 16:39:42): [autofs] [accept_fd_handler] (0x0400): [CID#15] Client [cmd /usr/sbin/automount][uid 0][0x55cee6067770][37] connected!
(2022-12-20 16:39:45): [autofs] [accept_fd_handler] (0x0400): [CID#16] Client [cmd /usr/sbin/automount][uid 0][0x55cee60064c0][38] connected!
(2022-12-20 16:39:47): [autofs] [accept_fd_handler] (0x0400): [CID#17] Client [cmd /usr/sbin/automount][uid 0][0x55cee5fb2b60][39] connected!
(2022-12-20 16:39:49): [autofs] [accept_fd_handler] (0x0400): [CID#18] Client [cmd /usr/sbin/automount][uid 0][0x55cee5fd5d60][40] connected!
(2022-12-20 16:39:52): [autofs] [accept_fd_handler] (0x0400): [CID#19] Client [cmd /usr/sbin/automount][uid 0][0x55cee5fc3d70][41] connected!
(2022-12-20 16:39:54): [autofs] [accept_fd_handler] (0x0400): [CID#20] Client [cmd /usr/sbin/automount][uid 0][0x55cee6059680][42] connected!


nothing for automount in sssd_nss.log

Comment 17 Ian Kent 2022-12-21 02:56:25 UTC
(In reply to Paul Raines from comment #16)
> > Does automount load/unload plugins dynamically for every mount attempt?

Well, not really, but there's a but ...

> 
> Not sure what this means.  Maybe this is a question for Ian

Yes, I think so.

> 
> I put in the two debug_level lines in sssd.conf and restarted sssd and then
> ran the for loop test again.  This is what I found:
> 
> # grep Client sssd_autofs.log |grep connected
> (2022-12-20 16:39:09): [autofs] [accept_fd_handler] (0x0400): [CID#1] Client
> [cmd /usr/sbin/automount][uid 0][0x55cee5f62c60][23] connected!

Don't really know what this is but it did make me think!

I do open the sss shared library (aka. dlopen()) to get the sss library
entry points. That has to be opening a file handle.

What's more, a while ago I changed to re-initialising the lookup context
in some/most cases.

It looks like that dlopen()ed module handle is leaking on re-init, I'll
need to check the locking so I can fix it without causing a problem so
leave it with me, ;)

Ian

Comment 18 Alexey Tikhonov 2022-12-21 09:33:38 UTC
(In reply to Paul Raines from comment #16)
> 
> I put in the two debug_level lines in sssd.conf and restarted sssd and then
> ran the for loop test again.  This is what I found:
> 
> # grep Client sssd_autofs.log |grep connected
> (2022-12-20 16:39:09): [autofs] [accept_fd_handler] (0x0400): [CID#1] Client [cmd /usr/sbin/automount][uid 0][0x55cee5f62c60][23] connected!
> (2022-12-20 16:39:12): [autofs] [accept_fd_handler] (0x0400): [CID#2] Client [cmd /usr/sbin/automount][uid 0][0x55cee5f87880][24] connected!
...
> (2022-12-20 16:39:54): [autofs] [accept_fd_handler] (0x0400): [CID#20] Client [cmd /usr/sbin/automount][uid 0][0x55cee6059680][42] connected!

So it does indeed re-connect to SSSD...
Could you please attach entire `sssd_autofs.log`?

Comment 19 Alexey Tikhonov 2022-12-21 09:37:59 UTC
(In reply to Ian Kent from comment #17)
> (In reply to Paul Raines from comment #16)
> > > Does automount load/unload plugins dynamically for every mount attempt?
> 
> Well, not really, but there's a but ...
> 
> I do open the sss shared library (aka. dlopen()) to get the sss library
> entry points. That has to be opening a file handle.
> 
> What's more, a while ago I changed to re-initialising the lookup context
> in some/most cases.

Is `dlclose()` executed?
If "yes", then I need to check whether it's expected that the fd opened by sss_client is closed at this moment...

> It looks like that dlopen()ed module handle is leaking on re-init, I'll
> need to check the locking so I can fix it without causing a problem so
> leave it with me, ;)

But those handles would hardly be "unix socket type=stream"...

Comment 20 Alexey Tikhonov 2022-12-21 10:17:39 UTC
(In reply to Alexey Tikhonov from comment #19)
> > 
> > What's more, a while ago I changed to re-initialising the lookup context
> > in some/most cases.
> 
> Is `dlclose()` executed?
> If "yes" then I need to check if it's expected that fd open by sss_client is
> closed at this moment...


It should be:
https://github.com/SSSD/sssd/blob/02bdef7d7577aa0eaf10e1920714870d36c277a6/src/sss_client/common.c#L75

`man dlclose`:
```
Shared  objects may export functions using the __attribute__((constructor)) and __attribute__((destructor)) function attributes.  Constructor func‐
tions are executed before dlopen() returns, and destructor functions are executed before dlclose() returns.
```

So the question is whether automount executes `dlclose()`.

Another question is whether the map/unmap of the shared library is justified, but I don't have insights here.
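The question in this comment (does automount call dlclose(), letting the library destructor close the socket?) can be illustrated with a toy model. This is a sketch, not autofs or sssd code: ModuleHandle stands in for the dlopen()ed lookup module, /dev/null substitutes for the sssd socket, and the two reinit functions model the suspected leaky path versus the corrected one.

```python
import os

class ModuleHandle:
    """Stand-in for a dlopen()ed module: the constructor opens an fd,
    and only dlclose() (the destructor) closes it. Hypothetical names."""

    def __init__(self):
        # constructor side effect: open a resource (stands in for the socket)
        self.fd = os.open("/dev/null", os.O_RDONLY)

    def dlclose(self):
        # destructor side effect: release the resource
        os.close(self.fd)

def reinit_leaky(handle):
    """Suspected bug: re-init opens a new handle; the old one is dropped
    without dlclose(), so its fd stays open."""
    return ModuleHandle()

def reinit_fixed(handle):
    """Fix: dlclose() the old handle before opening the new one."""
    if handle is not None:
        handle.dlclose()
    return ModuleHandle()
```

Running reinit_leaky in a loop grows /proc/self/fd by one entry per call, matching the one-fd-per-mount growth seen in comment 1, while reinit_fixed keeps the count flat.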

Comment 21 Paul Raines 2022-12-21 14:34:30 UTC
Created attachment 1933922 [details]
sssd_autofs.log

In case you still needed it, here is the sssd_autofs.log

Comment 22 Alexey Tikhonov 2022-12-21 16:29:27 UTC
```
(16:39:09): [accept_fd_handler] (0x0400): [CID#1] Client [cmd /usr/sbin/automount][uid 0][0x55cee5f62c60][23] connected!
(16:39:12): [accept_fd_handler] (0x0400): [CID#2] Client [cmd /usr/sbin/automount][uid 0][0x55cee5f87880][24] connected!
(16:39:14): [accept_fd_handler] (0x0400): [CID#3] Client [cmd /usr/sbin/automount][uid 0][0x55cee5fa4de0][25] connected!
(16:39:17): [accept_fd_handler] (0x0400): [CID#4] Client [cmd /usr/sbin/automount][uid 0][0x55cee6006bd0][26] connected!
(16:39:19): [accept_fd_handler] (0x0400): [CID#5] Client [cmd /usr/sbin/automount][uid 0][0x55cee605dec0][27] connected!
(16:39:21): [accept_fd_handler] (0x0400): [CID#6] Client [cmd /usr/sbin/automount][uid 0][0x55cee5fd2250][28] connected!
(16:39:24): [accept_fd_handler] (0x0400): [CID#7] Client [cmd /usr/sbin/automount][uid 0][0x55cee600af30][29] connected!
(16:39:26): [accept_fd_handler] (0x0400): [CID#8] Client [cmd /usr/sbin/automount][uid 0][0x55cee6067cc0][30] connected!
(16:39:28): [accept_fd_handler] (0x0400): [CID#9] Client [cmd /usr/sbin/automount][uid 0][0x55cee6013db0][31] connected!
(16:39:31): [accept_fd_handler] (0x0400): [CID#10] Client [cmd /usr/sbin/automount][uid 0][0x55cee605d9f0][32] connected!
(16:39:33): [accept_fd_handler] (0x0400): [CID#11] Client [cmd /usr/sbin/automount][uid 0][0x55cee603c740][33] connected!
(16:39:35): [accept_fd_handler] (0x0400): [CID#12] Client [cmd /usr/sbin/automount][uid 0][0x55cee608e5a0][34] connected!
(16:39:38): [accept_fd_handler] (0x0400): [CID#13] Client [cmd /usr/sbin/automount][uid 0][0x55cee6081190][35] connected!
(16:39:40): [accept_fd_handler] (0x0400): [CID#14] Client [cmd /usr/sbin/automount][uid 0][0x55cee60f2a20][36] connected!
(16:39:42): [accept_fd_handler] (0x0400): [CID#15] Client [cmd /usr/sbin/automount][uid 0][0x55cee6067770][37] connected!
(16:39:45): [accept_fd_handler] (0x0400): [CID#16] Client [cmd /usr/sbin/automount][uid 0][0x55cee60064c0][38] connected!
(16:39:47): [accept_fd_handler] (0x0400): [CID#17] Client [cmd /usr/sbin/automount][uid 0][0x55cee5fb2b60][39] connected!
(16:39:49): [accept_fd_handler] (0x0400): [CID#18] Client [cmd /usr/sbin/automount][uid 0][0x55cee5fd5d60][40] connected!
(16:39:52): [accept_fd_handler] (0x0400): [CID#19] Client [cmd /usr/sbin/automount][uid 0][0x55cee5fc3d70][41] connected!
(16:39:54): [accept_fd_handler] (0x0400): [CID#20] Client [cmd /usr/sbin/automount][uid 0][0x55cee6059680][42] connected!
(16:40:39): [client_close_fn] (0x2000): Terminated client [0x55cee5f62c60][23]
(16:40:42): [client_close_fn] (0x2000): Terminated client [0x55cee5f87880][24]
(16:40:44): [client_close_fn] (0x2000): Terminated client [0x55cee5fa4de0][25]
(16:40:47): [client_close_fn] (0x2000): Terminated client [0x55cee6006bd0][26]
(16:40:49): [client_close_fn] (0x2000): Terminated client [0x55cee605dec0][27]
(16:40:51): [client_close_fn] (0x2000): Terminated client [0x55cee5fd2250][28]
(16:40:54): [client_close_fn] (0x2000): Terminated client [0x55cee600af30][29]
(16:40:56): [client_close_fn] (0x2000): Terminated client [0x55cee6067cc0][30]
(16:40:58): [client_close_fn] (0x2000): Terminated client [0x55cee6013db0][31]
(16:41:01): [client_close_fn] (0x2000): Terminated client [0x55cee605d9f0][32]
(16:41:03): [client_close_fn] (0x2000): Terminated client [0x55cee603c740][33]
(16:41:05): [client_close_fn] (0x2000): Terminated client [0x55cee608e5a0][34]
(16:41:08): [client_close_fn] (0x2000): Terminated client [0x55cee6081190][35]
(16:41:10): [client_close_fn] (0x2000): Terminated client [0x55cee60f2a20][36]
(16:41:12): [client_close_fn] (0x2000): Terminated client [0x55cee6067770][37]
(16:41:15): [client_close_fn] (0x2000): Terminated client [0x55cee60064c0][38]
(16:41:17): [client_close_fn] (0x2000): Terminated client [0x55cee5fb2b60][39]
(16:41:19): [client_close_fn] (0x2000): Terminated client [0x55cee5fd5d60][40]
(16:41:22): [client_close_fn] (0x2000): Terminated client [0x55cee5fc3d70][41]
(16:41:24): [client_close_fn] (0x2000): Terminated client [0x55cee6059680][42]
(16:46:04): [accept_fd_handler] (0x0400): [CID#21] Client [cmd /usr/sbin/automount][uid 0][0x55cee5fc3ae0][23] connected!
(16:47:34): [client_close_fn] (0x2000): Terminated client [0x55cee5fc3ae0][23]
```
  --  every mount attempt makes a new sss_client->SSSD connection (these are later closed on the server side by an idle timeout).

So far this confirms that either `dlclose()` is missing or `__attribute__((destructor))` doesn't work as expected.

Comment 23 Ian Kent 2022-12-22 00:36:37 UTC
(In reply to Alexey Tikhonov from comment #20)
> (In reply to Alexey Tikhonov from comment #19)
> > > 
> > > What's more, a while ago I changed to re-initialising the lookup context
> > > in some/most cases.
> > 
> > Is `dlclose()` executed?
> > If "yes" then I need to check if it's expected that fd open by sss_client is
> > closed at this moment...
> 
> 
> It should be:
> https://github.com/SSSD/sssd/blob/02bdef7d7577aa0eaf10e1920714870d36c277a6/
> src/sss_client/common.c#L75
> 
> `man dlclose`:
> ```
> Shared objects may export functions using the __attribute__((constructor))
> and __attribute__((destructor)) function attributes. Constructor functions
> are executed before dlopen() returns, and destructor functions are
> executed before dlclose() returns.
> ```
> 
> So the question is if automount executes `dlclose()`.

It does, but there is a case, when a map is re-read, where autofs re-inits
the lookup module without performing a dlclose() before doing the dlopen().

I need to fix that.

The logic around whether to re-read the map on lookup is a bit complicated,
but it should only trigger if the map has changed, so this may not be the
only problem.

Ian

Comment 24 Ian Kent 2022-12-22 00:52:38 UTC
(In reply to Alexey Tikhonov from comment #20)
> 
> It should be:
> https://github.com/SSSD/sssd/blob/02bdef7d7577aa0eaf10e1920714870d36c277a6/
> src/sss_client/common.c#L75
> 

Just a quick aside, can this be called in the context of the autofs shared
library ...

#ifdef HAVE_PTHREAD_EXT
static void sss_at_thread_exit(void *v)
{
    sss_cli_close_socket();
}

static void init_sd_key(void)
{
    pthread_key_create(&sss_sd_key, sss_at_thread_exit);
}
#endif

Shared libraries have a habit of being unloaded, which can cause the
destructor function to be called when it is no longer in the address
space of the caller.

I'm not sure if that same problem exists with the at-thread-exit
mechanism, but I have had SEGV reports from calling the key destructor
recently ... anyway, another time ...

Comment 25 Paul Raines 2022-12-22 14:36:06 UTC
Not sure if this illuminates or confuses things more, but that one CS8 box I said was mysteriously "fixed" doesn't really appear to be fixed.

On the "for loop" test it still does not show incrementing fd's  but over time (it is a batch node running many jobs from different users automounting many different data volumes) the fd's are still creeping up.  It is at 209 now where as all my Rocky boxes are still comfortably staying in the 70's.
The 'ss -a --unix -p' command shows many unconnected automount sockets.

Maybe some kind of cache means it does not need to make a new sss connection on each mount in the for loop.

Comment 26 Alexey Tikhonov 2022-12-22 22:22:15 UTC
NOTE: following applies to RHEL9 only so not relevant to this BZ.

(In reply to Ian Kent from comment #24)
> 
> Just a quick aside, can this be called in the context of the autofs shared
> library ...
> 
> #ifdef HAVE_PTHREAD_EXT
> static void sss_at_thread_exit(void *v)
> {
>     sss_cli_close_socket();
> }
> 
> static void init_sd_key(void)
> {
>     pthread_key_create(&sss_sd_key, sss_at_thread_exit);
> }
> #endif
> 
> Shared libraries have a habit of being unloaded which can cause the
> destructor function to be called without it being in the address
> space of the caller.

There are two different things used in sssd code:
 - `__attribute__((destructor))` - general module destructor
 - thread specific destructor set using `pthread_key_create()`


> I'm not sure if that same problem exists with the at thread exit
> mechanism but I have got SEGV reports calling key destructor
> recently ... anyway another time ...


It seems you are right:
If a thread uses any function from sss_client (and thus `pthread_setspecific()` sets the `sss_sd_key` value for this thread, enabling the destructor) and sss_client is then unloaded before the thread exits, the app will crash when the thread exits (trying to execute the now-unloaded `sss_at_thread_exit()`).

But the same would happen if a thread tried to execute any previously dlsym()-ed function from this lib. A bit of a grey area...

I need to figure out if this is an oversight in glibc (though probably it's just impossible to clean up properly during dlclose()). If not, then we have to handle this case in SSSD code.

(Once again: this is not relevant to this BZ)

Comment 27 Ian Kent 2022-12-22 23:49:06 UTC
(In reply to Alexey Tikhonov from comment #26)
> NOTE: following applies to RHEL9 only so not relevant to this BZ.

Yes, that's right, it's just something I'm struggling with atm.

> 
> (In reply to Ian Kent from comment #24)
> > 
> > Just a quick aside, can this be called in the context of the autofs shared
> > library ...
> > 
> > #ifdef HAVE_PTHREAD_EXT
> > static void sss_at_thread_exit(void *v)
> > {
> >     sss_cli_close_socket();
> > }
> > 
> > static void init_sd_key(void)
> > {
> >     pthread_key_create(&sss_sd_key, sss_at_thread_exit);
> > }
> > #endif
> > 
> > Shared libraries have a habit of being unloaded which can cause the
> > destructor function to be called without it being in the address
> > space of the caller.
> 
> There are two different things used in sssd code:
>  - `__attribute__((destructor))` - general module destructor
>  - thread specific destructor set using `pthread_key_create()`
> 
> 
> > I'm not sure if that same problem exists with the at thread exit
> > mechanism but I have got SEGV reports calling key destructor
> > recently ... anyway another time ...
> 
> 
> It seems you are right:
> If a thread uses any function from sss_client (and thus
> `pthread_setspecific()` sets `sss_sd_key` value for this thread, enabling
> destructor) and then sss_client is unloaded before thread exits, then app
> will crash when thread exits (trying to execute unloaded
> `sss_at_thread_exit()`).
> 
> But the same would happen if thread would try to execute any previously
> dlsym()-ed function from this lib. A bit a grey area...
> 
> I need to figure out if this is an oversight in glibc (but probably it's
> just impossible to cleanup properly during dlclose()). If not then we have
> to handle this case in SSSD code.

My understanding is (and it's not the first time I've had to deal
with this) that, by definition, the thread-specific destructor is called
at thread exit; but the library isn't necessarily the thread creator, so
it (the library) can be unloaded before the thread exits and the needed
function isn't available.

So, in this case the only safe way is to use globally available functions
for the destructor.

Ian

Comment 28 Ian Kent 2022-12-22 23:53:28 UTC
(In reply to Ian Kent from comment #27)
> > 
> > But the same would happen if thread would try to execute any previously
> > dlsym()-ed function from this lib. A bit a grey area...
> > 
> > I need to figure out if this is an oversight in glibc (but probably it's
> > just impossible to cleanup properly during dlclose()). If not then we have
> > to handle this case in SSSD code.
> 
> My understanding is that (and it's not the first time I've had to deal
> with it), by definition, thread specific destructor is called at thread
> exit but the library isn't necessarily the thread creator therefore it
> (the library) can be unloaded before thread exit so the needed function
> isn't available.
> 
> So, in this case the only safe way is to use globally available functions
> for the destructor.

However, surely while the dlopen()ed handle is valid the library won't
be unloaded. That has to be a reason to prefer dlopen()ing over
demand loading ...

My experience with this (so far) has been with demand loaded shared
libraries.

Comment 29 Ian Kent 2022-12-22 23:55:36 UTC
(In reply to Ian Kent from comment #23)
> (In reply to Alexey Tikhonov from comment #20)
> > (In reply to Alexey Tikhonov from comment #19)
> > > > 
> > > > What's more, a while ago I changed to re-initialising the lookup context
> > > > in some/most cases.
> > > 
> > > Is `dlclose()` executed?
> > > If "yes" then I need to check if it's expected that fd open by sss_client is
> > > closed at this moment...
> > 
> > 
> > It should be:
> > https://github.com/SSSD/sssd/blob/02bdef7d7577aa0eaf10e1920714870d36c277a6/
> > src/sss_client/common.c#L75
> > 
> > `man dlclose`:
> > ```
> > Shared objects may export functions using the __attribute__((constructor))
> > and __attribute__((destructor)) function attributes. Constructor functions
> > are executed before dlopen() returns, and destructor functions are
> > executed before dlclose() returns.
> > ```
> > 
> > So the question is if automount executes `dlclose()`.
> 
> It does but there is a case, when a map is re-read, that autofs will re-init
> the lookup module and that doesn't perform a dlclose() before doing the
> dlopen().
> 
> I need to fix that.
> 
> The logic around whether to re-read the map on lookup is a bit complicated
> but should only trigger if the map has changed so maybe this is not the only
> problem.

Upon further investigation the module handle isn't leaking; we'll need to
keep looking for the cause of this.

> 
> Ian

Comment 30 Ian Kent 2022-12-23 00:09:09 UTC
(In reply to Paul Raines from comment #25)
> Not sure if this illuminates or confuses things more, but that one CS8 box I
> said was mysteriously "fixed" doesn't really appear to be fixed.
> 
> On the "for loop" test it still does not show incrementing fd's  but over
> time (it is a batch node running many jobs from different users automounting
> many different data volumes) the fd's are still creeping up.  It is at 209
> now where as all my Rocky boxes are still comfortably staying in the 70's.
> The 'ss -a --unix -p' command shows many unconnected automount sockets.

Are they increasing?

> 
> Maybe some kind of cache makes it not need to make a new sss connection on
> each mount on the for loop

Isn't it the case that you're using sub-mounts?

If you try to umount(8) a sub-mount (the base of the mount, not a map key
within the sub-mount) it should fail with EBUSY "because" a file handle
is open on the base of the mount, and there will be one for every sub-mount.
It's always been that way.

So there is a reason for the file handles in use to increase if the number
of active sub-mounts increases but, of course, that file handle should be
closed when the sub-mount expires away.

If the mount you are umounting is an offset (i.e. part of a multi-mount)
then it too will have an open file handle while it's mounted, but that
file handle will be opened on the directory covered by the mount, so it
will not prevent the umount. These user-initiated umounts should be
checked for and the file handles cleaned up during expire runs.

Ian

Comment 31 Ian Kent 2022-12-23 00:10:25 UTC
I will need to duplicate it after all ...

Comment 32 Alexey Tikhonov 2022-12-23 10:58:32 UTC
(In reply to Ian Kent from comment #27)
> (In reply to Alexey Tikhonov from comment #26)
> > NOTE: following applies to RHEL9 only so not relevant to this BZ.
> 
> Yes, that's right, it's just something I'm struggling with atm.

JFTR: opened upstream ticket https://github.com/SSSD/sssd/issues/6505

Comment 33 Alexey Tikhonov 2022-12-23 11:03:39 UTC
(In reply to Ian Kent from comment #29)
> 
> Upon further investigation the module handle isn't leaking, we'll need to
> look further for this.

Imo, the attached `ss` output and the log in comment 22 strongly hint that the leaking FDs are from `libsss_autofs` being re-loaded (either without `dlclose()`, or `__attribute__((destructor))` doesn't work as expected - I'll test it).

Comment 34 Alexey Tikhonov 2022-12-23 12:14:34 UTC
(In reply to Alexey Tikhonov from comment #33)
> (In reply to Ian Kent from comment #29)
> > 
> > Upon further investigation the module handle isn't leaking, we'll need to
> > look further for this.
> 
> Imo, attached `ss` output and log in comment 22 strongly hints that leaking
> FDs are from `libsss_autofs` being re-loaded (either without `dlclose()` or
> `__attribute__((destructor))` doesn't work as expected - I'll test it).

I tested this and `__attribute__((destructor))` works as expected.
Since on RHEL 8 `sss_cli_sd` doesn't have the `__thread` storage specifier, it should be pretty straightforward and shouldn't depend on which thread calls dlclose().
So probably dlclose() is missing after all.


Btw, what is the reason to re-load plugins?
Wouldn't it be possible to have a single shared map {"plugin_name", handle} (lazily initialized once)?

Comment 35 Paul Raines 2022-12-23 15:30:18 UTC
> > Not sure if this illuminates or confuses things more, but that one CS8 box I
> > said was mysteriously "fixed" doesn't really appear to be fixed.
> > 
> > On the "for loop" test it still does not show incrementing fd's  but over
> > time (it is a batch node running many jobs from different users automounting
> > many different data volumes) the fd's are still creeping up.  It is at 209
> > now where as all my Rocky boxes are still comfortably staying in the 70's.
> > The 'ss -a --unix -p' command shows many unconnected automount sockets.
> 
> Are they increasing?

The fds are increasing over time with normal use on this box.  But my "for loop" test does not show the fds increasing on this one box while
it still does on all my other CS8 boxes I tested.  So that is the mystery.

However, on this box when I tried the "for loop" test on a different mount in the same indirect map that had never been mounted since the last reboot, it showed the fds increasing like on the other boxes.  So I suspect this was just some cache thing with /autofs/space/sis_001.

> > 
> > Maybe some kind of cache makes it not need to make a new sss connection on
> > each mount on the for loop
>
> Isn't it the case that you're using sub-mounts?
> ...

Sorry, I don't follow this discussion.  The map in question is indirect on /autofs/space with entries like in comment 6.

Comment 36 Ian Kent 2022-12-24 04:06:26 UTC
(In reply to Alexey Tikhonov from comment #34)
> (In reply to Alexey Tikhonov from comment #33)
> > (In reply to Ian Kent from comment #29)
> > > 
> > > Upon further investigation the module handle isn't leaking, we'll need to
> > > look further for this.
> > 
> > Imo, attached `ss` output and log in comment 22 strongly hints that leaking
> > FDs are from `libsss_autofs` being re-loaded (either without `dlclose()` or
> > `__attribute__((destructor))` doesn't work as expected - I'll test it).
> 
> I tested this and `__attribute__((destructor))` works as expected.
> Since on RHEL 8 `sss_cli_sd` doesn't have `__thread` storage specifier, it
> should be pretty straightforward and shouldn't depend on what thread calls
> dlclose().
> So probably dlclose() is missing after all.
> 
> 
> Btw, what is the reason to re-load plugins?
> Wouldn't it be possible to have a single shared map {"plugin_name", handle}
> (lazily initialized once)?

The lookup module reinit entry point is only called upon map reload.

For each master map entry all the module structure fields are copied (such
as the module handle), but the corresponding autofs master entry map options
may have changed, so they need to be re-calculated.


Ian

Comment 37 Ian Kent 2022-12-24 04:14:25 UTC
(In reply to Paul Raines from comment #35)
> > 
> > Are they increasing?
> 
> The fd's are increasing over time with normal use on this box.  But my "for
> loop" test does not show the fds' increasing on this one box while
> it still does on all my other CS8 boxes I tested.  So that is the mystery.
> 
> However, on this box when I tried the "for loop" test on a different mount
> in same indirect map that had never been mounted since the last reboot it
> showed the fd's increasing like on other boxes.  So I suspect this was just
> some cache thing with /autofs/space/sis_001.
> 
> > > 
> > > Maybe some kind of cache makes it not need to make a new sss connection on
> > > each mount on the for loop
> >
> > Isn't it the case that you're using sub-mounts?
> > ...
> 
> Sorry, I don't follow this discussion.  The map in question is indirect on
> /autofs/space with entires like in comment 6

This was one of the cases used in your maps:

foo_002  :/autofs/space/bar_003

I mistakenly thought this was a sub-mount but I think it's actually a bind
mount, sorry.

If you were using sub-mounts (i.e. the -fstype=autofs option) then we can expect
the file handle count to increase with each one until it expires away.

Ian

Comment 38 Alexey Tikhonov 2022-12-27 14:30:45 UTC
(In reply to Ian Kent from comment #36)
> > 
> > I tested this and `__attribute__((destructor))` works as expected.
> > Since on RHEL 8 `sss_cli_sd` doesn't have `__thread` storage specifier, it
> > should be pretty straightforward and shouldn't depend on what thread calls
> > dlclose().
> > So probably dlclose() is missing after all.
> > 
> > 
> > Btw, what is the reason to re-load plugins?
> > Wouldn't it be possible to have a single shared map {"plugin_name", handle}
> > (lazily initialized once)?


(This question is from a quick glance at autofs-5.1.4 sources, so sorry if I'm asking something silly.)

> The lookup module reinit entry point is only called upon map reload.

Is this `modules/lookup_sss.c::lookup_reinit()` (which uses the existing handle: `do_init(..., 1)` -> `reinit_parse()`)?


> For each master map entry all the module structure fields are copied (such
> as the module handle) but the corresponding autofs master entry map options
> may have changed so they need to be re-calculated.

I'm not sure I properly understand the above, so I tried to trace `dlopen()` usage.

It seems `open_lookup()` -> ... -> `dlopen()` is executed in:
 - do_read_master() (and immediately closed)
 - do_read_map()
 - do_lookup_mount()
Where is the handle closed in the latter two cases?
Is this only in `umount_autofs()` (-> lookup_close_lookup() -> ... -> close_lookup() -> lookup_done() -> dlclose())?

`open_lookup()` is also executed in lookup_multi.c  --  is this the code used to handle the case of multiple entries in nsswitch.conf (i.e. as in this ticket)?

Comment 39 Ian Kent 2022-12-31 13:52:22 UTC
(In reply to Alexey Tikhonov from comment #38)
> (In reply to Ian Kent from comment #36)
> > > 
> > > I tested this and `__attribute__((destructor))` works as expected.
> > > Since on RHEL 8 `sss_cli_sd` doesn't have `__thread` storage specifier, it
> > > should be pretty straightforward and shouldn't depend on what thread calls
> > > dlclose().
> > > So probably dlclose() is missing after all.
> > > 
> > > 
> > > Btw, what is the reason to re-load plugins?
> > > Wouldn't it be possible to have a single shared map {"plugin_name", handle}
> > > (lazily initialized once)?
> 
> 
> (This question is from a quick glance at autofs-5.1.4 sources, so sorry if
> I'm asking something silly.)
> 
> > The lookup module reinit entry point is only called upon map reload.
> 
> Is this `modules/lookup_sss.c::lookup_reinit()` (that uses existing handle:
> `do_init(..., 1)` -> `reinit_parse()`)?

It is.

> 
> 
> > For each master map entry all the module structure fields are copied (such
> > as the module handle) but the corresponding autofs master entry map options
> > may have changed so they need to be re-calculated.
> 
> I'm not sure I properly understand the above, so I tried to trace `dlopen()`
> usage.
> 
> It seems `open_lookup()` -> ... -> `dlopen()` is executed in:
>  - do_read_master() (and immediately closed)

The master map is used to define the automount managed mounts.
It's read and closed once done.

>  - do_read_map()

Only if it hasn't already been opened.

>  - do_lookup_mount()

Again, only if it hasn't already been opened.

> Where is handle closed in the latter 2 cases?
> Is this only in `umount_autofs()` (-> lookup_close_lookup() -> ... ->
> close_lookup() ->  lookup_done() -> dlclose())?

Basically yes, but it isn't quite that simple, which you've probably
noticed.

> 
> `open_lookup()` is also executed in lookup_multi.c  --  is this a code used
> to handle case of multiple entries in nsswitch.conf (i.e. as in this ticket)?

It isn't; the "multi" type is a very old mechanism that provides similar
functionality to nsswitch but isn't the same at all.

It isn't used much at all as far as I know.

Ian

Comment 40 RHEL Program Management 2023-09-23 11:24:49 UTC
Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.

Comment 41 RHEL Program Management 2023-09-23 11:26:56 UTC
This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there.

Due to differences in account names between systems, some fields were not replicated.  Be sure to add yourself to Jira issue's "Watchers" field to continue receiving updates and add others to the "Need Info From" field to continue requesting information.

To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of 2 footprints next to it, and begin with "RHEL-" followed by an integer.  You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like:

"Bugzilla Bug" = 1234567

In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues. You can also visit https://access.redhat.com/articles/7032570 for general account information.

