2264945 – Large number of file descriptors used in RHEL 9

Summary: Large number of file descriptors used in RHEL 9

Keywords:
Status:	NEW
Alias:	None
Product:	Fedora EPEL
Classification:	Fedora
Component:	opendkim
Sub Component:
Version:	epel9
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	Jonathan Wright
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2024-02-19 18:52 UTC by razorbladex401
Modified:	2024-12-03 10:58 UTC
CC List:	4 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:
Type:	Bug
Embargoed:

Description razorbladex401 2024-02-19 18:52:42 UTC

Description of problem:
We are seeing opendkim exhaust all of its file descriptors, which leads to postfix milter-reject errors. This started recently, after the systems had been running on RHEL 9 for several months. As far as I can tell, the problem occurs when opendkim runs on a system with sssd 2.9.1-4.el9_3.1 or later. I have already opened a case with Red Hat, but since opendkim is not part of their repos, they can't say whether it's a bug in sssd or in opendkim.

For some background:
We send from several domains, using more than 50 keys. Normally we only see maybe 11 streams at a time, and they don't stay open long. Now, with the command below, we can see streams accumulate after each email that gets signed. A small example:

# ss -anp | grep 'opendkim' | sed 's/ *$//'


u_str ESTAB 0 0 * 53817391 * 0 users:(("opendkim",pid=1856103,fd=77))
u_str ESTAB 0 0 * 53832275 * 0 users:(("opendkim",pid=1856103,fd=218))
u_str ESTAB 0 0 * 53824876 * 0 users:(("opendkim",pid=1856103,fd=124))
u_str ESTAB 0 0 * 53824756 * 0 users:(("opendkim",pid=1856103,fd=133))
u_str ESTAB 0 0 * 53836020 * 0 users:(("opendkim",pid=1856103,fd=255))
u_str ESTAB 0 0 * 53838661 * 0 users:(("opendkim",pid=1856103,fd=285))
u_str ESTAB 0 0 * 53808986 * 0 users:(("opendkim",pid=1856103,fd=30))
u_str ESTAB 0 0 * 53835631 * 0 users:(("opendkim",pid=1856103,fd=270))
u_str ESTAB 0 0 * 53832887 * 0 users:(("opendkim",pid=1856103,fd=245))
u_str ESTAB 0 0 * 53838157 * 0 users:(("opendkim",pid=1856103,fd=283))
u_str ESTAB 0 0 * 53840434 * 0 users:(("opendkim",pid=1856103,fd=305))
u_str ESTAB 0 0 * 53830151 * 0 users:(("opendkim",pid=1856103,fd=182))
u_str ESTAB 0 0 * 53830225 * 0 users:(("opendkim",pid=1856103,fd=187))
u_str ESTAB 0 0 * 53830723 * 0 users:(("opendkim",pid=1856103,fd=191))
u_str ESTAB 0 0 * 53828693 * 0 users:(("opendkim",pid=1856103,fd=176))
u_str ESTAB 0 0 * 53827800 * 0 users:(("opendkim",pid=1856103,fd=171))
u_str ESTAB 0 0 * 53825776 * 0 users:(("opendkim",pid=1856103,fd=146))
u_str ESTAB 0 0 * 53834975 * 0 users:(("opendkim",pid=1856103,fd=259))
u_str ESTAB 0 0 * 53808916 * 0 users:(("opendkim",pid=1856103,fd=23))
u_str ESTAB 0 0 * 53810481 * 0 users:(("opendkim",pid=1856103,fd=33))
u_str ESTAB 0 0 * 53832456 * 0 users:(("opendkim",pid=1856103,fd=234))
u_str ESTAB 0 0 * 53810634 * 0 users:(("opendkim",pid=1856103,fd=47))
u_str ESTAB 0 0 * 53835245 * 0 users:(("opendkim",pid=1856103,fd=260))
u_str ESTAB 0 0 * 53836391 * 0 users:(("opendkim",pid=1856103,fd=265))
u_str ESTAB 0 0 * 53827932 * 0 users:(("opendkim",pid=1856103,fd=175))
u_str ESTAB 0 0 * 53826487 * 0 users:(("opendkim",pid=1856103,fd=163))
tcp LISTEN 0 4096 127.0.0.1:8891 0.0.0.0:* users:(("opendkim",pid=1856103,fd=3))


After some time there are several hundred open streams, if not more. The count keeps growing until we start seeing milter-reject errors in the postfix logs. Restarting opendkim gets things working again.

warning: milter inet:127.0.0.1:8891: can't read SMFIC_OPTNEG reply packet header: Connection reset by peer
warning: milter inet:127.0.0.1:8891: read error in initial handshake
NOQUEUE: milter-reject: CONNECT from server [IP]: 451 4.7.1 Service unavailable - try again later; proto=SMTP
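
For anyone trying to confirm the same pattern, the stream growth described above can be tracked over time instead of eyeballing the ss output. The following is only a minimal sketch: count_opendkim_streams is a hypothetical helper name, and it assumes the ss output format shown in this report.

```shell
# Count UNIX stream sockets held by opendkim, reading `ss -anp` output on stdin.
# count_opendkim_streams is a hypothetical helper, not part of opendkim or ss.
count_opendkim_streams() {
    # grep -c prints 0 but exits non-zero when nothing matches; keep the status clean.
    grep -c '^u_str.*("opendkim"' || true
}

# Example: log a timestamped count once a minute (uncomment to run).
# while sleep 60; do
#     printf '%s %s\n' "$(date -u +%FT%TZ)" "$(ss -anp | count_opendkim_streams)"
# done
```

A count that keeps climbing between restarts, rather than hovering around a small number, would match what we see.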

This only started after we updated sssd to version 2.9.1-4.el9_3.1. When we downgrade to 2.9.1-4.el9_3, it doesn't happen. As stated above, Red Hat is not able to confirm whether this is a bug in sssd or in opendkim, since opendkim is not from their repos. We use the version from EPEL.

There is a similar post in the Rocky Linux forums with some additional information that seems to match the issue we are seeing: https://forums.rockylinux.org/t/rocky-9-3-opendkim-latest-sssd-packages/12327

I believe this is tied to NSS. With debug logging enabled in sssd, I can see opendkim show up as a client.


Version-Release number of selected component (if applicable):
opendkim-2.11.0-0.36.el9.x86_64

How reproducible:
In our environment this is pretty reproducible, though I'm not sure how reproducible it will be in others. It happens every time we run an sssd newer than sssd-2.9.1-4.el9_3. We currently maintain over 50 DKIM keys for many different domains for our customers.

Steps to Reproduce:
1. Deploy a RHEL 9 server with postfix and opendkim.
2. Configure multiple keys for different domains.
3. Ensure sssd 2.9.1-4.el9_3.1 or later is installed.
4. Send several thousand emails "from" different domains.

Actual results:
opendkim exhausts all of its file descriptors after some time and crashes.
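
To check how close a running opendkim is to exhaustion, the number of open descriptors can be compared with the process's soft limit. This is a rough sketch that assumes Linux procfs (/proc/PID/fd and /proc/PID/limits). fd_usage is a hypothetical helper name, and it needs to run as root or as the process owner to read another process's fd directory.

```shell
# Print "<open fds> <soft limit>" for a PID, using Linux /proc.
# fd_usage is a hypothetical helper name, not an existing tool.
fd_usage() {
    pid=$1
    # Each entry in /proc/PID/fd is one open descriptor.
    open=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l | tr -d ' ')
    # The fourth field of the "Max open files" row is the soft limit.
    limit=$(awk '/^Max open files/ { print $4 }' "/proc/$pid/limits" 2>/dev/null)
    printf '%s %s\n' "$open" "$limit"
}

# Example (as root): fd_usage "$(pgrep -o -x opendkim)"
```

When the first number approaches the second, the milter-reject errors should be about to start.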

Expected results:
opendkim signs emails and postfix delivers them without issue.

Additional info:

Comment 1 Fedora Admin user for bugzilla script actions 2024-10-26 14:06:33 UTC

This package has changed maintainer in Fedora. Reassigning to the new maintainer of this component.

Comment 2 Fedora Admin user for bugzilla script actions 2024-10-28 00:31:04 UTC

This package has changed maintainer in Fedora. Reassigning to the new maintainer of this component.

Comment 3 Jeffrey Goh 2024-12-03 10:58:47 UTC

# ss -anp | grep 'opendkim' | sed 's/ *$//'
u_str ESTAB      0      0                                                           * 1121513887                * 1121527909 users:(("opendkim",pid=639839,fd=2),("opendkim",pid=639839,fd=1))
u_str ESTAB      0      0                                                           * 1203114549                * 0          users:(("opendkim",pid=639839,fd=8))
u_dgr ESTAB      0      0                                                           * 1121513898                * 18898      users:(("opendkim",pid=639839,fd=6))
udp   ESTAB      0      0                                                   127.0.0.1:56806             127.0.0.1:53         users:(("opendkim",pid=639839,fd=45))
... (truncated about 30 connections to DNS)
udp   ESTAB      0      0                                                   127.0.0.1:40305             127.0.0.1:53         users:(("opendkim",pid=639839,fd=57))
tcp   LISTEN     0      4096                                                127.0.0.1:8891                0.0.0.0:*          users:(("opendkim",pid=639839,fd=5))

My observation on Fedora 34-41 is that this happens whether or not sssd is installed. For me, it seems to happen only when the recursive server pointed to in resolv.conf is unbound (rather than bind).
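
To quantify the DNS sockets in output like the above, something along these lines could be used. It is only a sketch: count_opendkim_dns_sockets is a hypothetical helper name, and it assumes the ss column order shown here (peer address in the sixth column).

```shell
# Count opendkim UDP sockets whose peer is port 53, reading `ss -anp` output on stdin.
# count_opendkim_dns_sockets is a hypothetical helper name.
count_opendkim_dns_sockets() {
    awk '$1 == "udp" && $6 ~ /:53$/ && /"opendkim"/ { n++ } END { print n + 0 }'
}

# Example: ss -anp | count_opendkim_dns_sockets
```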

In any case, my quick fix is to add these two settings to opendkim.conf, after which the problem mysteriously disappears:

# Nameservers (string)
# Provides a comma-separated list of IP addresses that are to be used when doing DNS queries to retrieve DKIM keys, VBR records, etc. 
# These override any local defaults built in to the resolver in use, which may be defined in /etc/resolv.conf or hard-coded into the software.
Nameservers 1.1.1.1, 8.8.8.8
# ip numbers = no TLS
# QueryCache (Boolean)
# Instructs the DKIM library to maintain its own local cache of keys and policies retrieved from DNS, rather than relying on the nameserver for caching service. Useful if the nameserver being used by the filter is not local. 
QueryCache yes

Strangely, a tcpdump on the default routing interface appears to show that no queries are being sent to those servers, and yet opendkim seems able to both sign and verify:

#  tcpdump -i eno1 host 1.1.1.1 or host 8.8.8.8
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eno1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
