Description of problem:
Currently experiencing an issue where opendkim exhausts all of its file descriptors, leading to postfix milter-reject errors. This started recently, after the systems had been running for several months on RHEL 9. From what I've been able to tell, the issue appears when opendkim runs on a system that uses sssd 2.9.1-4.el9_3.1 or greater. I have already opened a case with Red Hat, but since opendkim is not part of their repos they can't say whether it's a bug in sssd or in opendkim.

For some background: we send from several domains, with 50+ keys. Running

ss -anp | grep 'opendkim' | sed 's/ *$//'

we can see the number of streams grow after each email that gets signed. Normally we see only about 11 streams at a time, and they don't stay open long. A small sample of the current output:

u_str ESTAB 0 0 * 53817391 * 0 users:(("opendkim",pid=1856103,fd=77))
u_str ESTAB 0 0 * 53832275 * 0 users:(("opendkim",pid=1856103,fd=218))
u_str ESTAB 0 0 * 53824876 * 0 users:(("opendkim",pid=1856103,fd=124))
u_str ESTAB 0 0 * 53824756 * 0 users:(("opendkim",pid=1856103,fd=133))
u_str ESTAB 0 0 * 53836020 * 0 users:(("opendkim",pid=1856103,fd=255))
u_str ESTAB 0 0 * 53838661 * 0 users:(("opendkim",pid=1856103,fd=285))
u_str ESTAB 0 0 * 53808986 * 0 users:(("opendkim",pid=1856103,fd=30))
u_str ESTAB 0 0 * 53835631 * 0 users:(("opendkim",pid=1856103,fd=270))
u_str ESTAB 0 0 * 53832887 * 0 users:(("opendkim",pid=1856103,fd=245))
u_str ESTAB 0 0 * 53838157 * 0 users:(("opendkim",pid=1856103,fd=283))
u_str ESTAB 0 0 * 53840434 * 0 users:(("opendkim",pid=1856103,fd=305))
u_str ESTAB 0 0 * 53830151 * 0 users:(("opendkim",pid=1856103,fd=182))
u_str ESTAB 0 0 * 53830225 * 0 users:(("opendkim",pid=1856103,fd=187))
u_str ESTAB 0 0 * 53830723 * 0 users:(("opendkim",pid=1856103,fd=191))
u_str ESTAB 0 0 * 53828693 * 0 users:(("opendkim",pid=1856103,fd=176))
u_str ESTAB 0 0 * 53827800 * 0 users:(("opendkim",pid=1856103,fd=171))
u_str ESTAB 0 0 * 53825776 * 0 users:(("opendkim",pid=1856103,fd=146))
u_str ESTAB 0 0 * 53834975 * 0 users:(("opendkim",pid=1856103,fd=259))
u_str ESTAB 0 0 * 53808916 * 0 users:(("opendkim",pid=1856103,fd=23))
u_str ESTAB 0 0 * 53810481 * 0 users:(("opendkim",pid=1856103,fd=33))
u_str ESTAB 0 0 * 53832456 * 0 users:(("opendkim",pid=1856103,fd=234))
u_str ESTAB 0 0 * 53810634 * 0 users:(("opendkim",pid=1856103,fd=47))
u_str ESTAB 0 0 * 53835245 * 0 users:(("opendkim",pid=1856103,fd=260))
u_str ESTAB 0 0 * 53836391 * 0 users:(("opendkim",pid=1856103,fd=265))
u_str ESTAB 0 0 * 53827932 * 0 users:(("opendkim",pid=1856103,fd=175))
u_str ESTAB 0 0 * 53826487 * 0 users:(("opendkim",pid=1856103,fd=163))
tcp LISTEN 0 4096 127.0.0.1:8891 0.0.0.0:* users:(("opendkim",pid=1856103,fd=3))

After some time there will be several hundred (if not more) open streams. Eventually the count grows far enough that we start seeing milter-rejects in the postfix logs. We can get things working again by restarting opendkim.

warning: milter inet:127.0.0.1:8891: can't read SMFIC_OPTNEG reply packet header: Connection reset by peer
warning: milter inet:127.0.0.1:8891: read error in initial handshake
NOQUEUE: milter-reject: CONNECT from server [IP]: 451 4.7.1 Service unavailable - try again later; proto=SMTP

This only started after we updated sssd to version 2.9.1-4.el9_3.1. When we downgrade to 2.9.1-4.el9_3 it doesn't happen. As stated above, Red Hat is not able to confirm whether this is a bug in sssd or in opendkim, since opendkim is not from their repos. We use the version from EPEL.

There is a similar post in the Rocky Linux forums with some additional information that seems to match the issue we are seeing: https://forums.rockylinux.org/t/rocky-9-3-opendkim-latest-sssd-packages/12327

I believe this is tied to nss. With sssd debug logging enabled, I can see opendkim show up as a client.

Version-Release number of selected component (if applicable):
opendkim-2.11.0-0.36.el9.x86_64

How reproducible:
In our environment this is pretty reproducible, though I'm not sure how reproducible it will be in others. It happens every time we run an sssd newer than sssd-2.9.1-4.el9_3. We currently maintain over 50 DKIM keys for many different customer domains.

Steps to Reproduce:
1. Deploy a RHEL 9 server with postfix and opendkim.
2. Configure multiple keys for different domains.
3. Ensure sssd 2.9.1-4.el9_3.1 or greater is installed.
4. Send several thousand emails "from" different domains.

Actual results:
opendkim exhausts all of its file descriptors after some time and crashes.

Expected results:
opendkim signs emails and postfix delivers them without issue.

Additional info:
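A minimal sketch for tracking the growth described above without reading the listing by eye. It takes the text printed by the ss command from the description and tallies opendkim's sockets per type, plus the highest fd number seen. The names summarize_ss and USER_RE are made up for this illustration, and the sample rows are copied from the listing in this report; nothing here is part of opendkim or sssd.

```python
import re
from collections import Counter

# Matches one ("name",pid=N,fd=N) entry inside the users:(...) column of `ss -p`.
USER_RE = re.compile(r'\("(?P<name>[^"]+)",pid=(?P<pid>\d+),fd=(?P<fd>\d+)\)')


def summarize_ss(ss_output, process="opendkim"):
    """Count sockets per netid (u_str, udp, tcp, ...) held by `process`.

    Returns (counts, highest_fd). A row that lists several fds of the
    same process is counted as one socket. highest_fd is -1 if the
    process does not appear at all.
    """
    counts = Counter()
    highest_fd = -1
    for line in ss_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        fds = [int(m.group("fd")) for m in USER_RE.finditer(line)
               if m.group("name") == process]
        if not fds:
            continue
        counts[fields[0]] += 1  # first column of ss output is the netid
        highest_fd = max(highest_fd, max(fds))
    return counts, highest_fd


# Three rows taken from the report, for illustration only.
SAMPLE = """\
u_str ESTAB 0 0 * 53817391 * 0 users:(("opendkim",pid=1856103,fd=77))
u_str ESTAB 0 0 * 53832275 * 0 users:(("opendkim",pid=1856103,fd=218))
tcp LISTEN 0 4096 127.0.0.1:8891 0.0.0.0:* users:(("opendkim",pid=1856103,fd=3))
"""

counts, highest_fd = summarize_ss(SAMPLE)
print(dict(counts), highest_fd)
```

Running this against periodic captures of the ss output (for example from cron) would show whether the u_str count climbs steadily between opendkim restarts, which is the pattern reported here.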
This package has changed maintainer in Fedora. Reassigning to the new maintainer of this component.
# ss -anp | grep 'opendkim' | sed 's/ *$//'
u_str ESTAB 0 0 * 1121513887 * 1121527909 users:(("opendkim",pid=639839,fd=2),("opendkim",pid=639839,fd=1))
u_str ESTAB 0 0 * 1203114549 * 0 users:(("opendkim",pid=639839,fd=8))
u_dgr ESTAB 0 0 * 1121513898 * 18898 users:(("opendkim",pid=639839,fd=6))
udp ESTAB 0 0 127.0.0.1:56806 127.0.0.1:53 users:(("opendkim",pid=639839,fd=45))
... (truncated about 30 connections to DNS)
udp ESTAB 0 0 127.0.0.1:40305 127.0.0.1:53 users:(("opendkim",pid=639839,fd=57))
tcp LISTEN 0 4096 127.0.0.1:8891 0.0.0.0:* users:(("opendkim",pid=639839,fd=5))

My observation on Fedora 34-41 is that this happens whether or not sssd is installed. For me, it does seem to happen only when the recursive server pointed to in resolv.conf is unbound (vs bind).

In any case, my quick fix is to add these two clauses to opendkim.conf, after which the problem mysteriously disappears:

# Nameservers (string)
# Provides a comma-separated list of IP addresses that are to be used when doing DNS queries to retrieve DKIM keys, VBR records, etc.
# These override any local defaults built in to the resolver in use, which may be defined in /etc/resolv.conf or hard-coded into the software.
Nameservers 1.1.1.1, 8.8.8.8 # ip numbers = no TLS

# QueryCache (Boolean)
# Instructs the DKIM library to maintain its own local cache of keys and policies retrieved from DNS, rather than relying on the nameserver for caching service. Useful if the nameserver being used by the filter is not local.
QueryCache yes

Strangely, though, a tcpdump on the default routing interface shows that the queries aren't happening, and yet opendkim seems able to both sign and verify:

# tcpdump -i eno1 host 1.1.1.1 or host 8.8.8.8
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eno1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
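
One way to cross-check the empty tcpdump is to look at which peers opendkim's own UDP/TCP sockets point to, before and after the config change. The sketch below does that from saved ss output; it does not explain why no queries show up on eno1, it only reports where the established sockets go. The function name peers_for_process is invented for this example, and the sample rows are copied from the listing in this comment.

```python
from collections import Counter


def peers_for_process(ss_output, process="opendkim"):
    """Count established udp/tcp sockets of `process` per peer address.

    Expects `ss -anp` rows laid out as:
    Netid State Recv-Q Send-Q Local:Port Peer:Port users:(...)
    """
    marker = '("%s",' % process
    peers = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) < 7 or marker not in fields[-1]:
            continue
        # Unix sockets (u_str, u_dgr) have no IP peer, so skip them.
        if fields[0] in ("udp", "tcp") and fields[1] == "ESTAB":
            peers[fields[5]] += 1
    return peers


# Rows taken from the listing in this comment, for illustration only.
SAMPLE = """\
u_str ESTAB 0 0 * 1203114549 * 0 users:(("opendkim",pid=639839,fd=8))
udp ESTAB 0 0 127.0.0.1:56806 127.0.0.1:53 users:(("opendkim",pid=639839,fd=45))
udp ESTAB 0 0 127.0.0.1:40305 127.0.0.1:53 users:(("opendkim",pid=639839,fd=57))
tcp LISTEN 0 4096 127.0.0.1:8891 0.0.0.0:* users:(("opendkim",pid=639839,fd=5))
"""

for peer, n in peers_for_process(SAMPLE).items():
    print(peer, n)
```

If the Nameservers setting is in effect, one would expect any remaining DNS sockets to show 1.1.1.1:53 or 8.8.8.8:53 as the peer rather than 127.0.0.1:53; that expectation is an assumption to verify, not something confirmed in this report.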