Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1330171

Summary: Two IPA conjoined faults
Product: Red Hat Enterprise Linux 7 Reporter: Denis <denich>
Component: ipaAssignee: IPA Maintainers <ipa-maint>
Status: CLOSED INSUFFICIENT_DATA QA Contact: Kaleem <ksiddiqu>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 7.1CC: bermoshin, lkrispen, pvoborni, rcritten, tbordaz
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-05-17 15:35:57 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Denis 2016-04-25 13:59:57 UTC
Description of problem:


I have two FreeIPA servers (ipa-server-4.1.0-18, krb5-server-1.12.2-15, cyrus-sasl-gssapi-2.1.26-17, 389-ds-base-1.3.3.1-20). Both of them configured as masters and replicate between.  At same time, at both of servers, krb5 service failed.  Time interval between fails is 25 minutes. 
Server one logs:
Krb5kdc.log:


Apr 19 23:07:41 ipa-01.domain.com krb5kdc[2322](info): AS_REQ (6 etypes {18 17 16 23 25 26}) 10.255.210.3: NEEDED_PREAUTH: host/dc01.domain.com.COM for krbtgt/DC.DOMAIN.COM.COM, Additional pre-authentication required-01.dc.domain.com krb5kdc[2322](info): AS_REQ (6 etypes {18 17 16 23 25 26}) 10.255.210.3: ISSUE: authtime 1461096461, etypes {rep=18 tkt=18 ses=18}, host/dc01.domain.com.COM for krbtgt/DC.DOMAIN.COM.COM.dc.domain.com krb5kdc[2322](info): AS_REQ (6 etypes {18 17 16 23 25 26}) 10.224.8.46: LOOKING_UP_CLIENT: username.COM for krbtgt/DC.DOMAIN.COM.COM, Server error
Apr 19 23:13:02 ipa-01.domain.com krb5kdc[2322](info): AS_REQ (4 etypes {18 17 16 23}) 10.254.210.28: LOOKING_UP_CLIENT: host/dc02.domain.com.COM for krbtgt/DC.DOMAIN.COM.COM, Server error
Apr 19 23:13:07 ipa-01.domain.com krb5kdc[2322](info): AS_REQ (6 etypes {18 17 16 23 25 26}) 10.255.19.20: LOOKING_UP_CLIENT: host/dc01-vm-proxy01.domain.com.COM for krbtgt/DC.DOMAIN.COM.COM, Server error 
Apr 19 23:13:12  ipa-01.domain.com krb5kdc[2322](info): AS_REQ (6 etypes {18 17 16 23 25 26}) 10.255.210.3: LOOKING_UP_CLIENT: host/dc01.domain.com.COM for krbtgt/DC.DOMAIN.COM, Server error
Apr 19 23:13:12 ipa-01.domain.com krb5kdc[2322](info): DISPATCH: repeated (retransmitted?) request from 10.224.8.46, resending previous response
Apr 19 23:13:12 ipa-01.domain.com krb5kdc[2322](info): DISPATCH: repeated (retransmitted?) request from 10.224.8.46, resending previous response
Apr 19 23:13:12 ipa-01.domain.com krb5kdc[2322](info): closing down fd 4
Apr 19 23:13:12 ipa-01.domain.com krb5kdc[2322](info): DISPATCH: repeated (retransmitted?) request from 10.254.210.28, resending previous response
Apr 19 23:13:12 ipa-01.domain.com  krb5kdc[2322](info): DISPATCH: repeated (retransmitted?) request from 10.254.210.28, resending previous response
Apr 19 23:13:12 ipa-01.domain.com krb5kdc[2322](info): closing down fd 11


dirsrv/slapd-DC-DOMAIN-COM/errors:
[19/Apr/2016:23:12:28 +0300] slapd_ldap_sasl_interactive_bind - Error: could not perform interactive bind for id [] mech [GSSAPI]: LDAP error -2 (Local error) (SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (Ticket expired)) errno 2 (No such file or directory)
[19/Apr/2016:23:12:55 +0300] set_krb5_creds - Could not get initial credentials for principal [ldap/ ipa-01.domain.com @DC.DOMAIN.COM] in keytab [FILE:/etc/dirsrv/ds.keytab]: -1765328228 (Cannot contact any KDC for requested realm)
[19/Apr/2016:23:12:55 +0300] slapd_ldap_sasl_interactive_bind - Error: could not perform interactive bind for id [] mech [GSSAPI]: LDAP error -2 (Local error) (SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (Ticket expired)) errno 2 (No such file or directory)
[19/Apr/2016:23:12:55 +0300] slapi_ldap_bind - Error: could not perform interactive bind for id [] authentication mechanism [GSSAPI]: error -2 (Local error)
[19/Apr/2016:23:12:55 +0300] NSMMReplicationPlugin - agmt="cn=meToipa-02.domain.com " (ipa-02:389): Replication bind with GSSAPI auth failed: LDAP error -2 (Local error) (SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (Ticket expired))
[19/Apr/2016:23:45:58 +0300] - slapd shutting down - signaling operation threads - op stack size 0 max work q size 1854 max work q stack size 87
[19/Apr/2016:23:45:58 +0300] - slapd shutting down - waiting for 30 threads to terminate
[19/Apr/2016:23:48:45 +0300] - SSL alert: Configured NSS Ciphers
[19/Apr/2016:23:48:45 +0300] - SSL alert:       TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256: enabled
[19/Apr/2016:23:48:45 +0300] - SSL alert:       TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384: enabled


Server two logs
Krb5kdc.log:
Apr 19 23:39:21 ipa-02.domain.com krb5kdc[2310](info): AS_REQ (4 etypes {18 17 16 23}) 10.224.202.3: ISSUE: authtime 1461098361, etypes {rep=18 tkt=18 ses=18}, host/dc02-srv -02.dc.domain.com.COM for krbtgt/DC.DOMAIN.COM.COM pa-02.dc.domain.com krb5kdc[2310](info): DISPATCH: repeated (retransmitted?) request from 10.254.206.32, resending previous response
Apr 19 23:39:31 ipa-02.domain.com krb5kdc[2310](info): DISPATCH: repeated (retransmitted?) request from 10.254.206.39, resending previous response
Apr 19 23:39:31 ipa-02.domain.com krb5kdc[2310](info): DISPATCH: repeated (retransmitted?) request from 10.224.200.2, resending previous response
Apr 19 23:39:36 ipa-02.domain.com krb5kdc[2310](info): AS_REQ (4 etypes {18 17 16 23}) 10.254.210.15: LOOKING_UP_CLIENT: host/dc02-srv-tstdb-01.dc.domain.com.COM for krbtgt/DC.DOMAIN.COM.COM, Server error
Apr 19 23:39:36 ipa-02.domain.com krb5kdc[2310](info): DISPATCH: repeated (retransmitted?) request from 10.255.206.41, resending previous response
Apr 19 23:39:41 ipa-02.domain.com krb5kdc[2310](info): AS_REQ (6 etypes {18 17 16 23 25 26}) 10.255.19.8: LOOKING_UP_CLIENT: host/dc01-vm-deb02.dc.domain.com.COM for krbtgt/DC.DOMAIN.COM.COM, Server error
Apr 19 23:39:41 ipa-02.domain.com krb5kdc[2310](info): closing down fd 4
dirsrv/slapd-DC-DOMAIN-COM/errors:


[19/Apr/2016:23:09:00 +0300] NSMMReplicationPlugin - agmt="cn=meToipa-01.domain.com" (ipa-01:389): Warning: Attempting to release replica, but unable to receive endReplication extended operation response from the replica. Error -5 (Timed out) NSMMReplicationPlugin - agmt="cn=meToipa-01.domain.com" (ipa-01:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[19/Apr/2016:23:13:03 +0300] NSMMReplicationPlugin - agmt="cn=meToipa-01.domain.com" (ipa-01:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[19/Apr/2016:23:15:02 +0300] NSMMReplicationPlugin - agmt="cn=meToipa-01.domain.com" (ipa-01:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[19/Apr/2016:23:17:06 +0300] NSMMReplicationPlugin - agmt="cn=meToipa-01.domain.com" (ipa-01:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[19/Apr/2016:23:19:06 +0300] NSMMReplicationPlugin - agmt="cn=meToipa-01.domain.com" (ipa-01:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[19/Apr/2016:23:21:11 +0300] NSMMReplicationPlugin - agmt="cn=meToipa-01.domain.com" (ipa-01:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[19/Apr/2016:23:23:11 +0300] NSMMReplicationPlugin - agmt="cn=meToipa-01.domain.com" (ipa-01:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[19/Apr/2016:23:25:14 +0300] NSMMReplicationPlugin - agmt="cn=meToipa-01.domain.com" (ipa-01:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[19/Apr/2016:23:27:13 +0300] NSMMReplicationPlugin - agmt="cn=meToipa-01.domain.com" (ipa-01:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[19/Apr/2016:23:29:17 +0300] NSMMReplicationPlugin - agmt="cn=meToipa-01.domain.com" (ipa-01:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[19/Apr/2016:23:31:17 +0300] NSMMReplicationPlugin - agmt="cn=meToipa-01.domain.com" (ipa-01:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[19/Apr/2016:23:33:21 +0300] NSMMReplicationPlugin - agmt="cn=meToipa-01.domain.com" (ipa-01:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[19/Apr/2016:23:35:21 +0300] NSMMReplicationPlugin - agmt="cn=meToipa-01.domain.com" (ipa-01:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[19/Apr/2016:23:37:26 +0300] NSMMReplicationPlugin - agmt="cn=meToipa-01.domain.com" (ipa-01:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[19/Apr/2016:23:39:26 +0300] NSMMReplicationPlugin - agmt="cn=meToipa-01.domain.com" (ipa-01:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[19/Apr/2016:23:48:40 +0300] - slapd shutting down - signaling operation threads - op stack size 0 max work q size 1949 max work q stack size 85
[19/Apr/2016:23:48:40 +0300] - slapd shutting down - waiting for 30 threads to terminate
[20/Apr/2016:00:00:12 +0300] - SSL alert: Configured NSS Ciphers
[20/Apr/2016:00:00:12 +0300] - SSL alert:       TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256: enabled
[20/Apr/2016:00:00:12 +0300] - SSL alert:       TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384: enabled
[20/Apr/2016:00:00:12 +0300] - SSL alert:       TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256: enabled
[20/Apr/2016:00:00:12 +0300] - SSL alert:       TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384: enabled
[20/Apr/2016:00:00:12 +0300] - SSL alert:       TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA384: enabled
[20/Apr/2016:00:00:12 +0300] - SSL alert:       TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384: enabled
[20/Apr/2016:00:00:12 +0300] - SSL alert:       TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA: enabled
…


It important  to fix the cause of conjoined faults.

Comment 1 Petr Vobornik 2016-04-25 15:49:44 UTC
Sounds bit like https://www.redhat.com/archives/freeipa-users/2015-July/msg00283.html

Where enlarging DS cache helped: https://www.redhat.com/archives/freeipa-users/2015-July/msg00331.html

If it doesn't help , I'd suggest to start a thread of FreeIPA users mailing list <https://www.redhat.com/mailman/listinfo/freeipa-users> where more members from community would reply.

Comment 2 Petr Vobornik 2016-04-25 16:06:11 UTC
Upstream ticket:
https://fedorahosted.org/freeipa/ticket/5847

Comment 3 Petr Vobornik 2016-04-25 16:37:28 UTC
Is schema compatibility plugin enabled on the server?

Comment 5 Denis 2016-04-26 08:12:37 UTC
Hi!
Thanks for your response!
Schema compatibility plugin is not enabled. 
The   DS cache size is set to default value.

Comment 6 bermoshin 2016-04-29 20:28:33 UTC
Petr, hello!

I have similar mistakes too.

messages

Apr 29 15:05:43 vm-ipa-01 ns-slapd: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (Ticket expired)
Apr 29 15:06:10 vm-ipa-01 ns-slapd: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (Ticket expired)

Apr 29 15:47:54 vm-ipa-01 rsyslogd-2040: fatal error on disk queue 'action 1 queue[DA]', emergency switch to direct mode [try http://www.rsyslog.com/e/2040 ]



Apr 29 15:55:56 vm-ipa-01 [sssd[ldap_child[3080]]]: Failed to initialize credentials using keytab [MEMORY:/etc/krb5.keytab]: Cannot contact any KDC for realm 'DC.domain.com'. Unable to create GSSAPI-encrypted LDAP connection.
Apr 29 15:55:56 vm-ipa-01 [sssd[ldap_child[3080]]]: Cannot contact any KDC for realm 'DC.domain.com'
Apr 29 15:55:56 vm-ipa-01 [sssd[ldap_child[3081]]]: Failed to initialize credentials using keytab [MEMORY:/etc/krb5.keytab]: Cannot contact any KDC for realm 'DC.domain.com'. Unable to create GSSAPI-encrypted LDAP connection.
Apr 29 15:55:56 vm-ipa-01 [sssd[ldap_child[3081]]]: Cannot contact any KDC for realm 'DC.domain.com'
Apr 29 15:55:57 vm-ipa-01 audispd: node=vm-ipa-01.dc.domain.com type=TTY msg=audit(1461934557.343:277): tty pid=2931 uid=0 auid=857800024 ses=6 major=136 minor=3 comm="bash" data=1B5B41207C206872657F7F1B7F67726570206B64630D
Apr 29 15:55:57 vm-ipa-01 audispd: node=vm-ipa-01.dc.domain.com type=USER_TTY msg=audit(1461934557.343:278): pid=2931 uid=0 auid=857800024 ses=6 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 data=7073202D6566207C2067726570206B6463
Apr 29 15:55:57 vm-ipa-01 ns-slapd: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (No Kerberos credentials available)






---
slapd errors

[29/Apr/2016:15:05:43 +0300] set_krb5_creds - Could not get initial credentials for principal [ldap/vm-ipa-01.dc.domain.com.com] in keytab [FILE:/etc/dirsrv/ds.keytab]: -1765328228 (Cannot contact any KDC for requested realm)
[29/Apr/2016:15:05:43 +0300] slapd_ldap_sasl_interactive_bind - Error: could not perform interactive bind for id [] mech [GSSAPI]: LDAP error -2 (Local error) (SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (Ticket expired)) errno 2 (No such file or directory)
[29/Apr/2016:15:06:10 +0300] set_krb5_creds - Could not get initial credentials for principal [ldap/vm-ipa-01.dc.domain.com.com] in keytab [FILE:/etc/dirsrv/ds.keytab]: -1765328228 (Cannot contact any KDC for requested realm)
[29/Apr/2016:15:06:10 +0300] slapd_ldap_sasl_interactive_bind - Error: could not perform interactive bind for id [] mech [GSSAPI]: LDAP error -2 (Local error) (SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (Ticket expired)) errno 2 (No such file or directory)
[29/Apr/2016:15:06:10 +0300] slapi_ldap_bind - Error: could not perform interactive bind for id [] authentication mechanism [GSSAPI]: error -2 (Local error)
[29/Apr/2016:15:06:10 +0300] NSMMReplicationPlugin - agmt="cn=meTovm-ipa-02.dc.domain.com" (vm-ipa-02:389): Replication bind with GSSAPI auth failed: LDAP error -2 (Local error) (SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (Ticket expired)) 


---
krb5

Apr 29 15:00:54 vm-ipa-01.dc.domain.com krb5kdc[4215](info): closing down fd 14
Apr 29 15:06:00 vm-ipa-01.dc.domain.com krb5kdc[4215](info): AS_REQ (6 etypes {18 17 16 23 25 26}) 10.224.8.30: LOOKING_UP_CLIENT: host/dc02-vm-node0.dc.domain.com.com for krbtgt/DC.domain.com.com, Server error

---
389-ds-base-1.3.3.1-20.el7_1.x86_64 
ipa-server-4.1.0-18.el7.centos.4.x86_64   
krb5-server-1.12.2-15.el7_1.x86_64 
cyrus-sasl-gssapi-2.1.26-17.el7.x86_64 

=====

We did the enlarging of the changelog cache, but slapd fell again .. 
Whether it is necessary  enable compatibility plug-in?
We have made sosreport from a host. What logs are necessary to you?

Comment 7 bermoshin 2016-05-06 08:24:57 UTC
A colleague spent an additional study of the dump, shot after receiving the error.

Taking the hang ns-slapd as the main source of the problem, we get the following:

crash> ps -g ns-slapd
PID: 760    TASK: ffff8801362ef1c0  CPU: 1   COMMAND: "ns-slapd"
  PID: 762    TASK: ffff8801383a0b60  CPU: 1   COMMAND: "ns-slapd"
  PID: 763    TASK: ffff8801383a16c0  CPU: 3   COMMAND: "ns-slapd"
  PID: 764    TASK: ffff8801383a6660  CPU: 3   COMMAND: "ns-slapd"
  PID: 765    TASK: ffff8801383a0000  CPU: 0   COMMAND: "ns-slapd"
  PID: 766    TASK: ffff8800361ac440  CPU: 0   COMMAND: "ns-slapd"
  PID: 767    TASK: ffff8800361ab8e0  CPU: 3   COMMAND: "ns-slapd"
  PID: 768    TASK: ffff8800361aad80  CPU: 3   COMMAND: "ns-slapd"
  PID: 769    TASK: ffff880139b338e0  CPU: 2   COMMAND: "ns-slapd"
  PID: 770    TASK: ffff880139b32220  CPU: 1   COMMAND: "ns-slapd"
  PID: 771    TASK: ffff8800b61a6660  CPU: 1   COMMAND: "ns-slapd"
  PID: 772    TASK: ffff8800b61a71c0  CPU: 1   COMMAND: "ns-slapd"
  PID: 773    TASK: ffff8800b61a4fa0  CPU: 3   COMMAND: "ns-slapd"
  PID: 774    TASK: ffff8800b9040000  CPU: 3   COMMAND: "ns-slapd"
  PID: 776    TASK: ffff8800b9044fa0  CPU: 3   COMMAND: "ns-slapd"
  PID: 777    TASK: ffff8800b9042220  CPU: 0   COMMAND: "ns-slapd"
  PID: 778    TASK: ffff8800b90438e0  CPU: 0   COMMAND: "ns-slapd"
  PID: 779    TASK: ffff8800b9046660  CPU: 1   COMMAND: "ns-slapd"
  PID: 780    TASK: ffff8800b9042d80  CPU: 1   COMMAND: "ns-slapd"
  PID: 781    TASK: ffff8801362da220  CPU: 0   COMMAND: "ns-slapd"
  PID: 782    TASK: ffff8801362dad80  CPU: 1   COMMAND: "ns-slapd"
  PID: 783    TASK: ffff8801397bf1c0  CPU: 2   COMMAND: "ns-slapd"
  PID: 784    TASK: ffff8801397b8b60  CPU: 3   COMMAND: "ns-slapd"
  PID: 785    TASK: ffff8801397b96c0  CPU: 2   COMMAND: "ns-slapd"
  PID: 786    TASK: ffff8800bb67a220  CPU: 1   COMMAND: "ns-slapd"
  PID: 787    TASK: ffff8800bb678000  CPU: 3   COMMAND: "ns-slapd"
  PID: 788    TASK: ffff8800bb678b60  CPU: 3   COMMAND: "ns-slapd"
  PID: 789    TASK: ffff8800bb67e660  CPU: 1   COMMAND: "ns-slapd"
  PID: 790    TASK: ffff8800bb67c440  CPU: 3   COMMAND: "ns-slapd"
  PID: 791    TASK: ffff8800b28da220  CPU: 0   COMMAND: "ns-slapd"
  PID: 792    TASK: ffff8800b28dc440  CPU: 0   COMMAND: "ns-slapd"
  PID: 793    TASK: ffff8800b28de660  CPU: 0   COMMAND: "ns-slapd"
  PID: 794    TASK: ffff8800b28ddb00  CPU: 2   COMMAND: "ns-slapd"
  PID: 795    TASK: ffff8800b28db8e0  CPU: 1   COMMAND: "ns-slapd"
  PID: 796    TASK: ffff8800b28dad80  CPU: 3   COMMAND: "ns-slapd"
  PID: 797    TASK: ffff8800b28d8b60  CPU: 1   COMMAND: "ns-slapd"
  PID: 798    TASK: ffff8800b28dcfa0  CPU: 0   COMMAND: "ns-slapd"
  PID: 799    TASK: ffff8800b28d8000  CPU: 2   COMMAND: "ns-slapd"
  PID: 800    TASK: ffff8800b28df1c0  CPU: 0   COMMAND: "ns-slapd"
  PID: 801    TASK: ffff88013615e660  CPU: 2   COMMAND: "ns-slapd"
  PID: 802    TASK: ffff8801361596c0  CPU: 1   COMMAND: "ns-slapd"
  PID: 803    TASK: ffff88013615f1c0  CPU: 2   COMMAND: "ns-slapd"
  PID: 804    TASK: ffff880137722220  CPU: 1   COMMAND: "ns-slapd"
  PID: 805    TASK: ffff8801377271c0  CPU: 1   COMMAND: "ns-slapd"
  PID: 806    TASK: ffff880137720b60  CPU: 3   COMMAND: "ns-slapd"
  PID: 836    TASK: ffff8800a91f8000  CPU: 2   COMMAND: "ns-slapd"
  PID: 20553  TASK: ffff8800b1dc2220  CPU: 2   COMMAND: "ns-slapd"
  PID: 20555  TASK: ffff8800b1dc6660  CPU: 1   COMMAND: "ns-slapd"

crash> foreach ns-slapd bt -sx
PID: 760    TASK: ffff8801362ef1c0  CPU: 1   COMMAND: "ns-slapd"
#0 [ffff8800a921b9d0] __schedule+0x28d at ffffffff816092dd
#1 [ffff8800a921ba38] schedule+0x29 at ffffffff81609839
#2 [ffff8800a921ba48] schedule_hrtimeout_range_clock+0x12c at ffffffff81608b4c
#3 [ffff8800a921bae0] schedule_hrtimeout_range+0x13 at ffffffff81608ba3
#4 [ffff8800a921baf0] poll_schedule_timeout+0x55 at ffffffff811dac65
#5 [ffff8800a921bb20] do_sys_poll+0x4cd at ffffffff811dc1ed
#6 [ffff8800a921bf40] sys_poll+0x74 at ffffffff811dc3a4
#7 [ffff8800a921bf80] system_call_fastpath+0x16 at ffffffff81614389
    RIP: 00007f3dce0f3b7d  RSP: 00007ffe405bd378  RFLAGS: 00000246
    RAX: 0000000000000007  RBX: ffffffff81614389  RCX: 0000000000000001
    RDX: 00000000000000fa  RSI: 000000000000000e  RDI: 00007ffe405bd140
    RBP: 00000000000000fa   R8: 000000004c26d05a   R9: 0000000000000001
    R10: 0000000000000001  R11: 0000000000000293  R12: 000000000000000e
    R13: 00007ffe405bd140  R14: 000000000000000e  R15: 00000000c849164f
    ORIG_RAX: 0000000000000007  CS: 0033  SS: 002b

PID: 762    TASK: ffff8801383a0b60  CPU: 1   COMMAND: "ns-slapd"
#0 [ffff880023743888] __schedule+0x28d at ffffffff816092dd
#1 [ffff8800237438f0] schedule+0x29 at ffffffff81609839
#2 [ffff880023743900] schedule_hrtimeout_range_clock+0x12c at ffffffff81608b4c
#3 [ffff880023743998] schedule_hrtimeout_range+0x13 at ffffffff81608ba3
#4 [ffff8800237439a8] poll_schedule_timeout+0x55 at ffffffff811dac65
#5 [ffff8800237439d8] do_select+0x6d1 at ffffffff811db5e1
#6 [ffff880023743d80] core_sys_select+0x1db at ffffffff811db8ab
#7 [ffff880023743f18] sys_select+0xba at ffffffff811dba8a
#8 [ffff880023743f80] system_call_fastpath+0x16 at ffffffff81614389
    RIP: 00007f3dce0f58f3  RSP: 00007f3dbf4e8d60  RFLAGS: 00000202
    RAX: 0000000000000017  RBX: ffffffff81614389  RCX: 00000000ffffffff
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: 0000000000000000
    RBP: 0000000000000064   R8: 00007f3dbf4e8d60   R9: 00007f3d406c9f20
    R10: 0000000000000000  R11: 0000000000000293  R12: 00007f3dbf4e8d94
    R13: 00007f3dd08bedb4  R14: 0000000000000009  R15: 0000000000000000
    ORIG_RAX: 0000000000000017  CS: 0033  SS: 002b

PID: 763    TASK: ffff8801383a16c0  CPU: 3   COMMAND: "ns-slapd"
#0 [ffff88002368f888] __schedule+0x28d at ffffffff816092dd
#1 [ffff88002368f8f0] schedule+0x29 at ffffffff81609839
#2 [ffff88002368f900] schedule_hrtimeout_range_clock+0x12c at ffffffff81608b4c
#3 [ffff88002368f998] schedule_hrtimeout_range+0x13 at ffffffff81608ba3
#4 [ffff88002368f9a8] poll_schedule_timeout+0x55 at ffffffff811dac65
#5 [ffff88002368f9d8] do_select+0x6d1 at ffffffff811db5e1
#6 [ffff88002368fd80] core_sys_select+0x1db at ffffffff811db8ab
#7 [ffff88002368ff18] sys_select+0xba at ffffffff811dba8a
#8 [ffff88002368ff80] system_call_fastpath+0x16 at ffffffff81614389
    RIP: 00007f3dce0f58f3  RSP: 00007f3dbece6c98  RFLAGS: 00000246
    RAX: 0000000000000017  RBX: ffffffff81614389  RCX: 00007f3dce0f58f3
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: 0000000000000000
    RBP: 00007f3dd290e7c0   R8: 00007f3dbece6cb0   R9: 00000000000002fb
    R10: 0000000000000000  R11: 0000000000000293  R12: 00000000000000fa
    R13: 0000000000278d00  R14: 000000000000003c  R15: 00007f3dd08bedb4
    ORIG_RAX: 0000000000000017  CS: 0033  SS: 002b

PID: 764    TASK: ffff8801383a6660  CPU: 3   COMMAND: "ns-slapd"
#0 [ffff8800a1c93888] __schedule+0x28d at ffffffff816092dd
#1 [ffff8800a1c938f0] schedule+0x29 at ffffffff81609839
#2 [ffff8800a1c93900] schedule_hrtimeout_range_clock+0x12c at ffffffff81608b4c
#3 [ffff8800a1c93998] schedule_hrtimeout_range+0x13 at ffffffff81608ba3
#4 [ffff8800a1c939a8] poll_schedule_timeout+0x55 at ffffffff811dac65
#5 [ffff8800a1c939d8] do_select+0x6d1 at ffffffff811db5e1
#6 [ffff8800a1c93d80] core_sys_select+0x1db at ffffffff811db8ab
#7 [ffff8800a1c93f18] sys_select+0xba at ffffffff811dba8a
#8 [ffff8800a1c93f80] system_call_fastpath+0x16 at ffffffff81614389
    RIP: 00007f3dce0f58f3  RSP: 00007f3dbe4e6d10  RFLAGS: 00000202
    RAX: 0000000000000017  RBX: ffffffff81614389  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: 0000000000000000
    RBP: 00000000000000fa   R8: 00007f3dbe4e6d60   R9: 00007f3dc932cbc0
    R10: 0000000000000000  R11: 0000000000000293  R12: 0000000000000000
    R13: 0000000000000000  R14: 00007f3dbe4e6d94  R15: 00007f3dd08bedb4
    ORIG_RAX: 0000000000000017  CS: 0033  SS: 002b

crash> dis -rl system_call_fastpath+0x16
/usr/src/debug/kernel-3.10.0-229.14.1.el7/linux-3.10.0-229.14.1.el7.x86_64/arch/x86/kernel/entry_64.S: 436
0xffffffff81614373 <system_call_fastpath>:      cmp    $0x140,%rax
/usr/src/debug/kernel-3.10.0-229.14.1.el7/linux-3.10.0-229.14.1.el7.x86_64/arch/x86/kernel/entry_64.S: 441
0xffffffff81614379 <system_call_fastpath+6>:    ja     0xffffffff81614441 <badsys>
/usr/src/debug/kernel-3.10.0-229.14.1.el7/linux-3.10.0-229.14.1.el7.x86_64/arch/x86/kernel/entry_64.S: 442
0xffffffff8161437f <system_call_fastpath+12>:   mov    %r10,%rcx
/usr/src/debug/kernel-3.10.0-229.14.1.el7/linux-3.10.0-229.14.1.el7.x86_64/arch/x86/kernel/entry_64.S: 443
0xffffffff81614382 <system_call_fastpath+15>:   callq  *-0x7e9e3c40(,%rax,8)
/usr/src/debug/kernel-3.10.0-229.14.1.el7/linux-3.10.0-229.14.1.el7.x86_64/arch/x86/kernel/entry_64.S: 444
0xffffffff81614389 <system_call_fastpath+22>:   mov    %rax,0x20(%rsp)

That is to hang all the threads of the process takes place in one place - syscall NR_select, other than a parent who expects the 7th siskol ie NR_poll. That is, the data is waiting with the network interface. 

[root@msk1-vm-imap01 morfy]# grep 7 /usr/src/kernels/3.10.0-229.14.1.el7.x86_64/arch/x86/include/generated/uapi/asm/unistd_64.h
#define __NR_poll 7
[root@msk1-vm-imap01 morfy]# grep 23 /usr/src/kernels/3.10.0-229.14.1.el7.x86_64/arch/x86/include/generated/uapi/asm/unistd_64.h
#define __NR_select 23


Of all the compounds, in an amount of 1067 pieces all are in unconnected, except one, on which there is expectation:

999 ffff880020dd6f80 ffff880037de0000 INET6:STREAM
57 ffff8800b41d5e00 ffff88010c36da40 UNIX:DGRAM
410 ffff8801055e8500 ffff8800b908b480 INET:STREAM  10.224.8.5-44492 10.224.8.4-389
PID: 760    TASK: ffff8801362ef1c0  CPU: 1   COMMAND: "ns-slapd"
FD      SOCKET            SOCK       FAMILY:TYPE SOURCE-PORT DESTINATION-PORT

crash> net -S ffff8801362ef1c0
PID: 760    TASK: ffff8801362ef1c0  CPU: 1   COMMAND: "ns-slapd"
FD       SOCKET             SOCK
6  ffff8800a73d2800  ffff880135917000

struct socket {
  state = SS_UNCONNECTED,
  type = 1,
  flags = 0,
  wq = 0xffff8800ba2702c0,
  file = 0xffff8800b28fa600,
  sk = 0xffff880135917000,
  ops = 0xffffffff816b7940 <inet6_stream_ops>
}

Comment 8 thierry bordaz 2016-05-09 07:58:19 UTC
Hello,

Original logs from April 19th is somehow related to a DS instance unable to reach the other (ipa-01):
[19/Apr/2016:23:39:26 +0300] NSMMReplicationPlugin - agmt="cn=meToipa-01.domain.com" (ipa-01:389): Unable to receive the response for a startReplication extended 

It is looking like a periodic (each 25min) DS or network issue but it would require more data to go further. When this problem occurs please, monitor DS (top -H -p <ds_pid) for 1 or two minutes, then get 2-3 pstacks, then attach the error/access logs to the BZ.

In https://bugzilla.redhat.com/show_bug.cgi?id=1330171#c6, the problem seems more related to KRB failure because of ticket expiration. Is it also a periodic symptom ?

thanks
thierry

Comment 9 Petr Vobornik 2016-05-09 11:27:29 UTC
Maybe update to IPA 4.2 in RHEL 7.2 could help, hard to say.

Comment 10 Ludwig 2016-05-09 11:36:11 UTC
If the slapd process hangs again, can we get a pstack of the process or a core file or a gdb 'thread apply all bt -full' from the core

Comment 11 Denis 2016-05-11 11:31:33 UTC
O course, next time i give you pstack.

Comment 12 Petr Vobornik 2016-05-17 15:35:57 UTC
I'm closing it because of lack of data. Please reopen when more information is provided.

Comment 13 bermoshin 2016-05-30 13:33:03 UTC
Bug: https://access.redhat.com/solutions/2149601

ipactl restart sometimes fails
high cpu usage by ns-slapd process
high io utilisation (queue > 1) on the disk containing both the logging and the database files
clients using this IPA server get odd results (does not know users anymore)
kerberos not responsive (kinit takes a long time)
DNS is working properly but probably using cache

Solution: update the 389-ds-base package to 1.3.4.0-19.el7 or later to resolve this issue. (ipa-server 4.2.0-15)