Bug 2013524
| Summary: | RHEL-7.9 ipa-replica-install "hangs" remote IPA LDAP server | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Marc Sauton <msauton> |
| Component: | 389-ds-base | Assignee: | thierry bordaz <tbordaz> |
| Status: | CLOSED DUPLICATE | QA Contact: | RHDS QE <ds-qe-bugs> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 7.9 | CC: | abokovoy, kurathod, ldap-maint, pcech, progier, spichugi, stanislav.moravec, tbordaz, tmihinto |
| Target Milestone: | rc | Flags: | pm-rhel: mirror+ |
| Target Release: | --- | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | sync-to-jira | ||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2021-10-28 16:13:40 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description of problem:
Running ipa-replica-install on RHEL-7.9 renders the remote peer IPA LDAP server unresponsive for a long period of time.

Version-Release number of selected component (if applicable):
RHEL-7.9
389-ds-base-1.3.10.2-12.el7_9.x86_64
ipa-server-4.6.8-5.el7_9.7.x86_64
redhat-release-server-7.9-6.el7_9.x86_64

How reproducible:
N/A

Steps to Reproduce:
1. N/A
2.
3.

Actual results:
The LDAP service on the remote/master IPA replica must be killed and restarted.

replica 85 install:

Done configuring the web interface (httpd).
Configuring ipa-otpd
  [1/2]: starting ipa-otpd
  [2/2]: configuring ipa-otpd to start on boot
Done configuring ipa-otpd.
Configuring ipa-custodia
  [1/4]: Generating ipa-custodia config file
  [2/4]: Generating ipa-custodia keys
  [3/4]: starting ipa-custodia
  [4/4]: configuring ipa-custodia to start on boot
Done configuring ipa-custodia.
('SEB:', {'ccache': 'MEMORY:Custodia_fMyDiDoK/iI=', 'client_keytab': '/etc/krb5.keytab'}, Name(host, <OID 1.2.840.113554.1.2.1.4>), None)
Configuring certificate server (pki-tomcatd). Estimated time: 3 minutes
  [1/30]: creating certificate server db
  [2/30]: setting up initial replication
Starting replication, please wait until this has completed.
<--------------------------------- More replicating caused 83/84 to hang again
Update in progress, 21 seconds elapsed
Update succeeded
  [3/30]: creating ACIs for admin
  [4/30]: creating installation admin user
  [5/30]: configuring certificate server instance
  [6/30]: secure AJP connector
  [7/30]: reindex attributes
  [8/30]: exporting Dogtag certificate store pin
  [9/30]: stopping certificate server instance to update CS.cfg
  [10/30]: backing up CS.cfg
  [11/30]: disabling nonces
  [12/30]: set up CRL publishing
  [13/30]: enable PKIX certificate path discovery and validation
  [14/30]: destroying installation admin user
  [15/30]: starting certificate server instance
  [16/30]: Finalize replication settings

The install then runs to completion, except for a last LDAP connection from replica 85 to master 84, which fails:

2021-10-12T21:16:27Z ERROR cannot connect to 'ldap://84.edited:389':
2021-10-12T21:16:27Z ERROR The ipa-replica-install command failed. See /var/log/ipareplica-install.log for more information
(END)

This looks more like extremely slow connection processing than a complete hang/deadlock.
An ldapsearch with a simple BIND would not prompt for credentials.

replica 84 (may have needed more samples):

dn: cn=config
nsslapd-idletimeout: 0
nsslapd-ioblocktimeout: 10000
nsslapd-listen-backlog-size: 128
nsslapd-threadnumber: 192
nsslapd-maxdescriptors: 16384
nsslapd-reservedescriptors: 64

dn: cn=monitor
threads: 194
currentconnections: 3462
totalconnections: 11486
currentconnectionsatmaxthreads: 0
maxthreadsperconnhits: 1911
dtablesize: 16384
readwaiters: 0
opsinitiated: 943235
opscompleted: 943231
currenttime: 20211012213934Z
starttime: 20211012211635Z
nbackends: 3

ldapsearch -o ldif-wrap=no -LLLxD cn=Directory\ Manager -W -b cn=monitor -s base connection
-> Only 71 connections out of 3400-plus showed a sign of being blocked, which is not significant in the sample taken at that moment.
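As a rough aid for reading the cn=monitor sample above, the counters can be reduced to a few derived figures. The sketch below is only illustrative: the helper names (`parse_counters`, `summarize`) are hypothetical, the values are copied from the replica 84 sample, and treating `opsinitiated - opscompleted` as "operations in flight" is an interpretation, not something the report states.

```python
from datetime import datetime

# Counters copied verbatim from the replica 84 cn=monitor sample in this report.
MONITOR_SAMPLE = """\
threads: 194
currentconnections: 3462
totalconnections: 11486
dtablesize: 16384
opsinitiated: 943235
opscompleted: 943231
currenttime: 20211012213934Z
starttime: 20211012211635Z
"""


def parse_counters(text):
    """Split 'name: value' lines into a dict of strings (hypothetical helper)."""
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            counters[name.strip()] = value.strip()
    return counters


def summarize(counters):
    """Derive a few figures from one sample; one sample is not a trend."""
    fmt = "%Y%m%d%H%M%SZ"  # timestamp layout as it appears in the sample
    uptime = (
        datetime.strptime(counters["currenttime"], fmt)
        - datetime.strptime(counters["starttime"], fmt)
    ).total_seconds()
    return {
        # Assumed reading: operations started but not yet completed.
        "ops_in_flight": int(counters["opsinitiated"]) - int(counters["opscompleted"]),
        # Share of the descriptor table used by current connections.
        "descriptor_usage": int(counters["currentconnections"]) / int(counters["dtablesize"]),
        "uptime_seconds": uptime,
        # Average new connections per second since the server started.
        "connections_per_second": int(counters["totalconnections"]) / uptime,
    }


summary = summarize(parse_counters(MONITOR_SAMPLE))
for key, value in summary.items():
    print(key, value)
```

On this single sample the server had been up for about 23 minutes and held 3462 connections against a descriptor table of 16384, so descriptor exhaustion does not look like the limiting factor here. More samples would be needed to say anything stronger.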
The system already has sysctl net.core.somaxconn = 65535.

The stack trace shows nearly all threads blocked (199 out of 211), for example:

Thread 199 (Thread 0x7f335a91f700 (LWP 85062)):
#0  0x00007f3390db6184 in pthread_rwlock_rdlock () at /lib64/libpthread.so.0
#1  0x00007f339368c80a in slapi_rwlock_rdlock (rwlock=<optimized out>) at ldap/servers/slapd/slapi2nspr.c:246
#2  0x00007f33936a4f7d in vattr_rdlock () at ldap/servers/slapd/vattr.c:188

The netstat outputs showed hundreds of connections in CLOSE_WAIT state related to the IPA LDAP service.

nsslapd-idletimeout is in use with its default value of 0, which means no timeout.
I would set nsslapd-idletimeout to 5 minutes (300 seconds) on replicas 84 and 83, for example, changing from

nsslapd-idletimeout: 0
nsslapd-listen-backlog-size: 128

to

nsslapd-idletimeout: 300
nsslapd-listen-backlog-size: 2048

This can be related to a non-responding LDAP service and a cascade of problems, including failover to another replica that then hits the same issue.

Expected results:
yes

Additional info:
The netstat output on the remote/master IPA LDAP system shows
- hundreds of KDC connections in TIME_WAIT state
- hundreds of LDAP connections in ESTABLISHED state

A remote session with the customer showed hundreds of LDAP connections in CLOSE_WAIT state.
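The suggested cn=config change could be expressed as an LDIF modify record and fed to ldapmodify. The sketch below only builds that LDIF text; `build_modify_ldif` is a hypothetical helper, the attribute values are the ones proposed in this report, and whether either attribute needs a server restart to take effect is not stated here.

```python
def build_modify_ldif(dn, changes):
    """Return an LDIF 'changetype: modify' record replacing each attribute.

    Hypothetical helper for illustration; `changes` maps attribute name
    to its new value, in the order the replacements should appear.
    """
    lines = [f"dn: {dn}", "changetype: modify"]
    for index, (attr, value) in enumerate(changes.items()):
        if index:
            lines.append("-")  # separator between modifications in one record
        lines.append(f"replace: {attr}")
        lines.append(f"{attr}: {value}")
    return "\n".join(lines) + "\n"


# Values proposed in this report for replicas 84 and 83.
ldif = build_modify_ldif(
    "cn=config",
    {
        "nsslapd-idletimeout": 300,
        "nsslapd-listen-backlog-size": 2048,
    },
)
print(ldif, end="")
```

The printed record can be saved to a file and applied with ldapmodify bound as Directory Manager, once per replica.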