Bug 2013524

Summary: RHEL-7.9 ipa-replica-install "hangs" remote IPA LDAP server
Product: Red Hat Enterprise Linux 7 Reporter: Marc Sauton <msauton>
Component: 389-ds-base Assignee: thierry bordaz <tbordaz>
Status: CLOSED DUPLICATE QA Contact: RHDS QE <ds-qe-bugs>
Severity: high Docs Contact:
Priority: high    
Version: 7.9 CC: abokovoy, kurathod, ldap-maint, pcech, progier, spichugi, stanislav.moravec, tbordaz, tmihinto
Target Milestone: rc Flags: pm-rhel: mirror+
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard: sync-to-jira
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-10-28 16:13:40 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Marc Sauton 2021-10-13 05:47:09 UTC
Description of problem:

A RHEL-7.9 ipa-replica-install renders its remote peer IPA LDAP server unresponsive for a long period of time.


Version-Release number of selected component (if applicable):

RHEL-7.9
389-ds-base-1.3.10.2-12.el7_9.x86_64
ipa-server-4.6.8-5.el7_9.7.x86_64
redhat-release-server-7.9-6.el7_9.x86_64


How reproducible:
N/A

Steps to Reproduce:
1. N/A
2.
3.


Actual results:

the LDAP service on the remote/master IPA replica must be killed and restarted

replica 85 install:

Done configuring the web interface (httpd).
Configuring ipa-otpd
  [1/2]: starting ipa-otpd
  [2/2]: configuring ipa-otpd to start on boot
Done configuring ipa-otpd.
Configuring ipa-custodia
  [1/4]: Generating ipa-custodia config file
  [2/4]: Generating ipa-custodia keys
  [3/4]: starting ipa-custodia
  [4/4]: configuring ipa-custodia to start on boot
Done configuring ipa-custodia.
('SEB:', {'ccache': 'MEMORY:Custodia_fMyDiDoK/iI=', 'client_keytab': '/etc/krb5.keytab'}, Name(host, <OID 1.2.840.113554.1.2.1.4>), None)
Configuring certificate server (pki-tomcatd). Estimated time: 3 minutes
  [1/30]: creating certificate server db
  [2/30]: setting up initial replication
Starting replication, please wait until this has completed.   <--------------------------------- More replication caused 83/84 to hang again
Update in progress, 21 seconds elapsed
Update succeeded

  [3/30]: creating ACIs for admin
  [4/30]: creating installation admin user
  [5/30]: configuring certificate server instance
  [6/30]: secure AJP connector
  [7/30]: reindex attributes
  [8/30]: exporting Dogtag certificate store pin
  [9/30]: stopping certificate server instance to update CS.cfg
  [10/30]: backing up CS.cfg
  [11/30]: disabling nonces
  [12/30]: set up CRL publishing
  [13/30]: enable PKIX certificate path discovery and validation
  [14/30]: destroying installation admin user
  [15/30]: starting certificate server instance
  [16/30]: Finalize replication settings


The install then runs to the end, except for a last LDAP connection from replica 85 to master 84 that fails:

2021-10-12T21:16:27Z ERROR cannot connect to 'ldap://84.edited:389':
2021-10-12T21:16:27Z ERROR The ipa-replica-install command failed. See /var/log/ipareplica-install.log for more information
(END)


This looks more like extremely slow connection processing than a complete hang/deadlock.
An ldapsearch with a simple BIND would not prompt for credentials.


replica 84 (more samples may have been needed)

dn: cn=config
nsslapd-idletimeout: 0
nsslapd-ioblocktimeout: 10000
nsslapd-listen-backlog-size: 128
nsslapd-threadnumber: 192
nsslapd-maxdescriptors: 16384
nsslapd-reservedescriptors: 64

dn: cn=monitor
threads: 194
currentconnections: 3462
totalconnections: 11486
currentconnectionsatmaxthreads: 0
maxthreadsperconnhits: 1911
dtablesize: 16384
readwaiters: 0
opsinitiated: 943235
opscompleted: 943231
currenttime: 20211012213934Z
starttime: 20211012211635Z
nbackends: 3
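
As a rough aid for reading the cn=monitor sample above, the following Python sketch parses the counters and derives two figures: operations in flight (opsinitiated minus opscompleted, which is 4 for this sample) and the share of the descriptor table used by current connections. The function names `parse_monitor` and `summarize` are hypothetical helpers, not part of 389-ds-base or IPA, and the derived figures are only arithmetic on one sample, not a diagnosis.

```python
# Values copied from the cn=monitor sample in this report (replica 84).
MONITOR_SAMPLE = """\
threads: 194
currentconnections: 3462
totalconnections: 11486
currentconnectionsatmaxthreads: 0
maxthreadsperconnhits: 1911
dtablesize: 16384
readwaiters: 0
opsinitiated: 943235
opscompleted: 943231
"""


def parse_monitor(text):
    """Parse 'name: value' lines into a dict of ints, skipping non-numeric values."""
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():
            counters[name.strip()] = int(value)
    return counters


def summarize(counters):
    """Derive simple figures from the parsed counters (hypothetical helper)."""
    return {
        "ops_in_flight": counters["opsinitiated"] - counters["opscompleted"],
        "descriptor_usage": counters["currentconnections"] / counters["dtablesize"],
    }


if __name__ == "__main__":
    summary = summarize(parse_monitor(MONITOR_SAMPLE))
    print("operations in flight:", summary["ops_in_flight"])
    print("descriptor table usage: {:.0%}".format(summary["descriptor_usage"]))
```

Taking several samples a few seconds apart and comparing the figures would say more than a single snapshot.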

ldapsearch -o ldif-wrap=no -LLLxD cn=Directory\ Manager -W -b cn=monitor -s base connection
->
Only 71 of the 3400-plus connections showed a sign of being blocked, which is not significant in the sample taken at that moment.

The system already has this sysctl setting:
net.core.somaxconn = 65535


The stack trace shows nearly all threads blocked (199 out of 211), for example:

Thread 199 (Thread 0x7f335a91f700 (LWP 85062)):
#0  0x00007f3390db6184 in pthread_rwlock_rdlock () at /lib64/libpthread.so.0
#1  0x00007f339368c80a in slapi_rwlock_rdlock (rwlock=<optimized out>) at ldap/servers/slapd/slapi2nspr.c:246
#2  0x00007f33936a4f7d in vattr_rdlock () at ldap/servers/slapd/vattr.c:188


netstat output showed hundreds of connections in CLOSE_WAIT state related to the IPA LDAP service.

nsslapd-idletimeout is in use with its default value of 0, which means no timeout.

I would set nsslapd-idletimeout to 5 minutes (300 seconds) and raise nsslapd-listen-backlog-size on replicas 84 and 83, for example from
nsslapd-idletimeout: 0
nsslapd-listen-backlog-size: 128

to
nsslapd-idletimeout: 300
nsslapd-listen-backlog-size: 2048
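
One way to apply the suggested values is an LDIF modify operation against cn=config. The Python sketch below only builds that LDIF text; `build_modify_ldif` is a hypothetical helper, and the values are the ones proposed above, not a tested recommendation.

```python
def build_modify_ldif(dn, changes):
    """Build an LDIF 'changetype: modify' record that replaces each attribute.

    `changes` maps attribute name to new value. Hypothetical helper for
    illustration; it does not contact any server.
    """
    lines = ["dn: " + dn, "changetype: modify"]
    for index, (attr, value) in enumerate(changes.items()):
        if index:
            # A line with a single '-' separates modifications in one record.
            lines.append("-")
        lines.append("replace: " + attr)
        lines.append("{}: {}".format(attr, value))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Values proposed in this report for replicas 84 and 83.
    print(build_modify_ldif("cn=config", {
        "nsslapd-idletimeout": 300,
        "nsslapd-listen-backlog-size": 2048,
    }), end="")
```

The printed record could then be passed to ldapmodify bound as Directory Manager. Whether each setting takes effect without a restart should be checked against the 389-ds-base documentation for the installed version.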



This can be related to a non-responding LDAP service and a cascade of problems, including failover to another replica, which then hits the same issue again.




Expected results:
yes


Additional info:

The netstat output from the remote/master IPA LDAP system shows:
- hundreds of KDC connections in TIME_WAIT state
- hundreds of LDAP connections in ESTABLISHED state

A remote session with the customer showed hundreds of LDAP connections in CLOSE_WAIT state.
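
To put numbers on the connection states mentioned in this report, a small Python sketch like the following could tally TCP states per local port from saved `netstat -ant` output. `count_states` is a hypothetical helper, the column layout is assumed to be the usual Linux one (proto, recv-q, send-q, local address, foreign address, state), and the sample lines are made up for illustration, not taken from the customer system.

```python
from collections import Counter


def count_states(netstat_text, local_port):
    """Count TCP states for sockets whose local address ends in :local_port."""
    states = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Assumed columns: proto recv-q send-q local-address foreign-address state
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            if fields[3].rsplit(":", 1)[-1] == str(local_port):
                states[fields[5]] += 1
    return states


# Made-up sample lines (documentation addresses), NOT from the customer system.
SAMPLE = """\
tcp        0      0 192.0.2.84:389     192.0.2.85:40112    ESTABLISHED
tcp        1      0 192.0.2.84:389     192.0.2.85:40113    CLOSE_WAIT
tcp        1      0 192.0.2.84:389     192.0.2.10:51820    CLOSE_WAIT
tcp        0      0 192.0.2.84:88      192.0.2.10:51821    TIME_WAIT
"""

if __name__ == "__main__":
    print("LDAP (389):", dict(count_states(SAMPLE, 389)))
    print("KDC (88):", dict(count_states(SAMPLE, 88)))
```

Running it against captures taken before and during an ipa-replica-install would show whether the CLOSE_WAIT count on port 389 grows over time.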