Bug 1330171

Summary: Two IPA conjoined faults
Product: Red Hat Enterprise Linux 7 Reporter: Denis <denich>
Component: ipaAssignee: IPA Maintainers <ipa-maint>
Status: CLOSED INSUFFICIENT_DATA QA Contact: Kaleem <ksiddiqu>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 7.1CC: bermoshin, lkrispen, pvoborni, rcritten, tbordaz
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-05-17 15:35:57 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Denis 2016-04-25 13:59:57 UTC
Description of problem:


I have two FreeIPA servers (ipa-server-4.1.0-18, krb5-server-1.12.2-15, cyrus-sasl-gssapi-2.1.26-17, 389-ds-base-1.3.3.1-20). Both of them configured as masters and replicate between.  At same time, at both of servers, krb5 service failed.  Time interval between fails is 25 minutes. 
Server one logs:
Krb5kdc.log:


Apr 19 23:07:41 ipa-01.domain.com krb5kdc[2322](info): AS_REQ (6 etypes {18 17 16 23 25 26}) 10.255.210.3: NEEDED_PREAUTH: host/dc01.domain.com.COM for krbtgt/DC.DOMAIN.COM.COM, Additional pre-authentication required-01.dc.domain.com krb5kdc[2322](info): AS_REQ (6 etypes {18 17 16 23 25 26}) 10.255.210.3: ISSUE: authtime 1461096461, etypes {rep=18 tkt=18 ses=18}, host/dc01.domain.com.COM for krbtgt/DC.DOMAIN.COM.COM.dc.domain.com krb5kdc[2322](info): AS_REQ (6 etypes {18 17 16 23 25 26}) 10.224.8.46: LOOKING_UP_CLIENT: username.COM for krbtgt/DC.DOMAIN.COM.COM, Server error
Apr 19 23:13:02 ipa-01.domain.com krb5kdc[2322](info): AS_REQ (4 etypes {18 17 16 23}) 10.254.210.28: LOOKING_UP_CLIENT: host/dc02.domain.com.COM for krbtgt/DC.DOMAIN.COM.COM, Server error
Apr 19 23:13:07 ipa-01.domain.com krb5kdc[2322](info): AS_REQ (6 etypes {18 17 16 23 25 26}) 10.255.19.20: LOOKING_UP_CLIENT: host/dc01-vm-proxy01.domain.com.COM for krbtgt/DC.DOMAIN.COM.COM, Server error 
Apr 19 23:13:12  ipa-01.domain.com krb5kdc[2322](info): AS_REQ (6 etypes {18 17 16 23 25 26}) 10.255.210.3: LOOKING_UP_CLIENT: host/dc01.domain.com.COM for krbtgt/DC.DOMAIN.COM, Server error
Apr 19 23:13:12 ipa-01.domain.com krb5kdc[2322](info): DISPATCH: repeated (retransmitted?) request from 10.224.8.46, resending previous response
Apr 19 23:13:12 ipa-01.domain.com krb5kdc[2322](info): DISPATCH: repeated (retransmitted?) request from 10.224.8.46, resending previous response
Apr 19 23:13:12 ipa-01.domain.com krb5kdc[2322](info): closing down fd 4
Apr 19 23:13:12 ipa-01.domain.com krb5kdc[2322](info): DISPATCH: repeated (retransmitted?) request from 10.254.210.28, resending previous response
Apr 19 23:13:12 ipa-01.domain.com  krb5kdc[2322](info): DISPATCH: repeated (retransmitted?) request from 10.254.210.28, resending previous response
Apr 19 23:13:12 ipa-01.domain.com krb5kdc[2322](info): closing down fd 11


dirsrv/slapd-DC-DOMAIN-COM/errors:
[19/Apr/2016:23:12:28 +0300] slapd_ldap_sasl_interactive_bind - Error: could not perform interactive bind for id [] mech [GSSAPI]: LDAP error -2 (Local error) (SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (Ticket expired)) errno 2 (No such file or directory)
[19/Apr/2016:23:12:55 +0300] set_krb5_creds - Could not get initial credentials for principal [ldap/ ipa-01.domain.com @DC.DOMAIN.COM] in keytab [FILE:/etc/dirsrv/ds.keytab]: -1765328228 (Cannot contact any KDC for requested realm)
[19/Apr/2016:23:12:55 +0300] slapd_ldap_sasl_interactive_bind - Error: could not perform interactive bind for id [] mech [GSSAPI]: LDAP error -2 (Local error) (SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (Ticket expired)) errno 2 (No such file or directory)
[19/Apr/2016:23:12:55 +0300] slapi_ldap_bind - Error: could not perform interactive bind for id [] authentication mechanism [GSSAPI]: error -2 (Local error)
[19/Apr/2016:23:12:55 +0300] NSMMReplicationPlugin - agmt="cn=meToipa-02.domain.com " (ipa-02:389): Replication bind with GSSAPI auth failed: LDAP error -2 (Local error) (SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (Ticket expired))
[19/Apr/2016:23:45:58 +0300] - slapd shutting down - signaling operation threads - op stack size 0 max work q size 1854 max work q stack size 87
[19/Apr/2016:23:45:58 +0300] - slapd shutting down - waiting for 30 threads to terminate
[19/Apr/2016:23:48:45 +0300] - SSL alert: Configured NSS Ciphers
[19/Apr/2016:23:48:45 +0300] - SSL alert:       TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256: enabled
[19/Apr/2016:23:48:45 +0300] - SSL alert:       TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384: enabled


Server two logs
Krb5kdc.log:
Apr 19 23:39:21 ipa-02.domain.com krb5kdc[2310](info): AS_REQ (4 etypes {18 17 16 23}) 10.224.202.3: ISSUE: authtime 1461098361, etypes {rep=18 tkt=18 ses=18}, host/dc02-srv -02.dc.domain.com.COM for krbtgt/DC.DOMAIN.COM.COM pa-02.dc.domain.com krb5kdc[2310](info): DISPATCH: repeated (retransmitted?) request from 10.254.206.32, resending previous response
Apr 19 23:39:31 ipa-02.domain.com krb5kdc[2310](info): DISPATCH: repeated (retransmitted?) request from 10.254.206.39, resending previous response
Apr 19 23:39:31 ipa-02.domain.com krb5kdc[2310](info): DISPATCH: repeated (retransmitted?) request from 10.224.200.2, resending previous response
Apr 19 23:39:36 ipa-02.domain.com krb5kdc[2310](info): AS_REQ (4 etypes {18 17 16 23}) 10.254.210.15: LOOKING_UP_CLIENT: host/dc02-srv-tstdb-01.dc.domain.com.COM for krbtgt/DC.DOMAIN.COM.COM, Server error
Apr 19 23:39:36 ipa-02.domain.com krb5kdc[2310](info): DISPATCH: repeated (retransmitted?) request from 10.255.206.41, resending previous response
Apr 19 23:39:41 ipa-02.domain.com krb5kdc[2310](info): AS_REQ (6 etypes {18 17 16 23 25 26}) 10.255.19.8: LOOKING_UP_CLIENT: host/dc01-vm-deb02.dc.domain.com.COM for krbtgt/DC.DOMAIN.COM.COM, Server error
Apr 19 23:39:41 ipa-02.domain.com krb5kdc[2310](info): closing down fd 4
dirsrv/slapd-DC-DOMAIN-COM/errors:


[19/Apr/2016:23:09:00 +0300] NSMMReplicationPlugin - agmt="cn=meToipa-01.domain.com" (ipa-01:389): Warning: Attempting to release replica, but unable to receive endReplication extended operation response from the replica. Error -5 (Timed out) NSMMReplicationPlugin - agmt="cn=meToipa-01.domain.com" (ipa-01:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[19/Apr/2016:23:13:03 +0300] NSMMReplicationPlugin - agmt="cn=meToipa-01.domain.com" (ipa-01:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[19/Apr/2016:23:15:02 +0300] NSMMReplicationPlugin - agmt="cn=meToipa-01.domain.com" (ipa-01:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[19/Apr/2016:23:17:06 +0300] NSMMReplicationPlugin - agmt="cn=meToipa-01.domain.com" (ipa-01:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[19/Apr/2016:23:19:06 +0300] NSMMReplicationPlugin - agmt="cn=meToipa-01.domain.com" (ipa-01:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[19/Apr/2016:23:21:11 +0300] NSMMReplicationPlugin - agmt="cn=meToipa-01.domain.com" (ipa-01:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[19/Apr/2016:23:23:11 +0300] NSMMReplicationPlugin - agmt="cn=meToipa-01.domain.com" (ipa-01:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[19/Apr/2016:23:25:14 +0300] NSMMReplicationPlugin - agmt="cn=meToipa-01.domain.com" (ipa-01:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[19/Apr/2016:23:27:13 +0300] NSMMReplicationPlugin - agmt="cn=meToipa-01.domain.com" (ipa-01:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[19/Apr/2016:23:29:17 +0300] NSMMReplicationPlugin - agmt="cn=meToipa-01.domain.com" (ipa-01:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[19/Apr/2016:23:31:17 +0300] NSMMReplicationPlugin - agmt="cn=meToipa-01.domain.com" (ipa-01:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[19/Apr/2016:23:33:21 +0300] NSMMReplicationPlugin - agmt="cn=meToipa-01.domain.com" (ipa-01:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[19/Apr/2016:23:35:21 +0300] NSMMReplicationPlugin - agmt="cn=meToipa-01.domain.com" (ipa-01:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[19/Apr/2016:23:37:26 +0300] NSMMReplicationPlugin - agmt="cn=meToipa-01.domain.com" (ipa-01:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[19/Apr/2016:23:39:26 +0300] NSMMReplicationPlugin - agmt="cn=meToipa-01.domain.com" (ipa-01:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[19/Apr/2016:23:48:40 +0300] - slapd shutting down - signaling operation threads - op stack size 0 max work q size 1949 max work q stack size 85
[19/Apr/2016:23:48:40 +0300] - slapd shutting down - waiting for 30 threads to terminate
[20/Apr/2016:00:00:12 +0300] - SSL alert: Configured NSS Ciphers
[20/Apr/2016:00:00:12 +0300] - SSL alert:       TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256: enabled
[20/Apr/2016:00:00:12 +0300] - SSL alert:       TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384: enabled
[20/Apr/2016:00:00:12 +0300] - SSL alert:       TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256: enabled
[20/Apr/2016:00:00:12 +0300] - SSL alert:       TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384: enabled
[20/Apr/2016:00:00:12 +0300] - SSL alert:       TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA384: enabled
[20/Apr/2016:00:00:12 +0300] - SSL alert:       TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384: enabled
[20/Apr/2016:00:00:12 +0300] - SSL alert:       TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA: enabled
…


It important  to fix the cause of conjoined faults.

Comment 1 Petr Vobornik 2016-04-25 15:49:44 UTC
Sounds bit like https://www.redhat.com/archives/freeipa-users/2015-July/msg00283.html

Where enlarging DS cache helped: https://www.redhat.com/archives/freeipa-users/2015-July/msg00331.html

If it doesn't help , I'd suggest to start a thread of FreeIPA users mailing list <https://www.redhat.com/mailman/listinfo/freeipa-users> where more members from community would reply.

Comment 2 Petr Vobornik 2016-04-25 16:06:11 UTC
Upstream ticket:
https://fedorahosted.org/freeipa/ticket/5847

Comment 3 Petr Vobornik 2016-04-25 16:37:28 UTC
Is schema compatibility plugin enabled on the server?

Comment 5 Denis 2016-04-26 08:12:37 UTC
Hi!
Thanks for your response!
Schema compatibility plugin is not enabled. 
The   DS cache size is set to default value.

Comment 6 bermoshin 2016-04-29 20:28:33 UTC
Petr, hello!

I have similar mistakes too.

messages

Apr 29 15:05:43 vm-ipa-01 ns-slapd: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (Ticket expired)
Apr 29 15:06:10 vm-ipa-01 ns-slapd: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (Ticket expired)

Apr 29 15:47:54 vm-ipa-01 rsyslogd-2040: fatal error on disk queue 'action 1 queue[DA]', emergency switch to direct mode [try http://www.rsyslog.com/e/2040 ]



Apr 29 15:55:56 vm-ipa-01 [sssd[ldap_child[3080]]]: Failed to initialize credentials using keytab [MEMORY:/etc/krb5.keytab]: Cannot contact any KDC for realm 'DC.domain.com'. Unable to create GSSAPI-encrypted LDAP connection.
Apr 29 15:55:56 vm-ipa-01 [sssd[ldap_child[3080]]]: Cannot contact any KDC for realm 'DC.domain.com'
Apr 29 15:55:56 vm-ipa-01 [sssd[ldap_child[3081]]]: Failed to initialize credentials using keytab [MEMORY:/etc/krb5.keytab]: Cannot contact any KDC for realm 'DC.domain.com'. Unable to create GSSAPI-encrypted LDAP connection.
Apr 29 15:55:56 vm-ipa-01 [sssd[ldap_child[3081]]]: Cannot contact any KDC for realm 'DC.domain.com'
Apr 29 15:55:57 vm-ipa-01 audispd: node=vm-ipa-01.dc.domain.com type=TTY msg=audit(1461934557.343:277): tty pid=2931 uid=0 auid=857800024 ses=6 major=136 minor=3 comm="bash" data=1B5B41207C206872657F7F1B7F67726570206B64630D
Apr 29 15:55:57 vm-ipa-01 audispd: node=vm-ipa-01.dc.domain.com type=USER_TTY msg=audit(1461934557.343:278): pid=2931 uid=0 auid=857800024 ses=6 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 data=7073202D6566207C2067726570206B6463
Apr 29 15:55:57 vm-ipa-01 ns-slapd: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (No Kerberos credentials available)






---
slapd errors

[29/Apr/2016:15:05:43 +0300] set_krb5_creds - Could not get initial credentials for principal [ldap/vm-ipa-01.dc.domain.com.com] in keytab [FILE:/etc/dirsrv/ds.keytab]: -1765328228 (Cannot contact any KDC for requested realm)
[29/Apr/2016:15:05:43 +0300] slapd_ldap_sasl_interactive_bind - Error: could not perform interactive bind for id [] mech [GSSAPI]: LDAP error -2 (Local error) (SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (Ticket expired)) errno 2 (No such file or directory)
[29/Apr/2016:15:06:10 +0300] set_krb5_creds - Could not get initial credentials for principal [ldap/vm-ipa-01.dc.domain.com.com] in keytab [FILE:/etc/dirsrv/ds.keytab]: -1765328228 (Cannot contact any KDC for requested realm)
[29/Apr/2016:15:06:10 +0300] slapd_ldap_sasl_interactive_bind - Error: could not perform interactive bind for id [] mech [GSSAPI]: LDAP error -2 (Local error) (SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (Ticket expired)) errno 2 (No such file or directory)
[29/Apr/2016:15:06:10 +0300] slapi_ldap_bind - Error: could not perform interactive bind for id [] authentication mechanism [GSSAPI]: error -2 (Local error)
[29/Apr/2016:15:06:10 +0300] NSMMReplicationPlugin - agmt="cn=meTovm-ipa-02.dc.domain.com" (vm-ipa-02:389): Replication bind with GSSAPI auth failed: LDAP error -2 (Local error) (SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (Ticket expired)) 


---
krb5

Apr 29 15:00:54 vm-ipa-01.dc.domain.com krb5kdc[4215](info): closing down fd 14
Apr 29 15:06:00 vm-ipa-01.dc.domain.com krb5kdc[4215](info): AS_REQ (6 etypes {18 17 16 23 25 26}) 10.224.8.30: LOOKING_UP_CLIENT: host/dc02-vm-node0.dc.domain.com.com for krbtgt/DC.domain.com.com, Server error

---
389-ds-base-1.3.3.1-20.el7_1.x86_64 
ipa-server-4.1.0-18.el7.centos.4.x86_64   
krb5-server-1.12.2-15.el7_1.x86_64 
cyrus-sasl-gssapi-2.1.26-17.el7.x86_64 

=====

We did the enlarging of the changelog cache, but slapd fell again .. 
Whether it is necessary  enable compatibility plug-in?
We have made sosreport from a host. What logs are necessary to you?

Comment 7 bermoshin 2016-05-06 08:24:57 UTC
A colleague spent an additional study of the dump, shot after receiving the error.

Taking the hang ns-slapd as the main source of the problem, we get the following:

crash> ps -g ns-slapd
PID: 760    TASK: ffff8801362ef1c0  CPU: 1   COMMAND: "ns-slapd"
  PID: 762    TASK: ffff8801383a0b60  CPU: 1   COMMAND: "ns-slapd"
  PID: 763    TASK: ffff8801383a16c0  CPU: 3   COMMAND: "ns-slapd"
  PID: 764    TASK: ffff8801383a6660  CPU: 3   COMMAND: "ns-slapd"
  PID: 765    TASK: ffff8801383a0000  CPU: 0   COMMAND: "ns-slapd"
  PID: 766    TASK: ffff8800361ac440  CPU: 0   COMMAND: "ns-slapd"
  PID: 767    TASK: ffff8800361ab8e0  CPU: 3   COMMAND: "ns-slapd"
  PID: 768    TASK: ffff8800361aad80  CPU: 3   COMMAND: "ns-slapd"
  PID: 769    TASK: ffff880139b338e0  CPU: 2   COMMAND: "ns-slapd"
  PID: 770    TASK: ffff880139b32220  CPU: 1   COMMAND: "ns-slapd"
  PID: 771    TASK: ffff8800b61a6660  CPU: 1   COMMAND: "ns-slapd"
  PID: 772    TASK: ffff8800b61a71c0  CPU: 1   COMMAND: "ns-slapd"
  PID: 773    TASK: ffff8800b61a4fa0  CPU: 3   COMMAND: "ns-slapd"
  PID: 774    TASK: ffff8800b9040000  CPU: 3   COMMAND: "ns-slapd"
  PID: 776    TASK: ffff8800b9044fa0  CPU: 3   COMMAND: "ns-slapd"
  PID: 777    TASK: ffff8800b9042220  CPU: 0   COMMAND: "ns-slapd"
  PID: 778    TASK: ffff8800b90438e0  CPU: 0   COMMAND: "ns-slapd"
  PID: 779    TASK: ffff8800b9046660  CPU: 1   COMMAND: "ns-slapd"
  PID: 780    TASK: ffff8800b9042d80  CPU: 1   COMMAND: "ns-slapd"
  PID: 781    TASK: ffff8801362da220  CPU: 0   COMMAND: "ns-slapd"
  PID: 782    TASK: ffff8801362dad80  CPU: 1   COMMAND: "ns-slapd"
  PID: 783    TASK: ffff8801397bf1c0  CPU: 2   COMMAND: "ns-slapd"
  PID: 784    TASK: ffff8801397b8b60  CPU: 3   COMMAND: "ns-slapd"
  PID: 785    TASK: ffff8801397b96c0  CPU: 2   COMMAND: "ns-slapd"
  PID: 786    TASK: ffff8800bb67a220  CPU: 1   COMMAND: "ns-slapd"
  PID: 787    TASK: ffff8800bb678000  CPU: 3   COMMAND: "ns-slapd"
  PID: 788    TASK: ffff8800bb678b60  CPU: 3   COMMAND: "ns-slapd"
  PID: 789    TASK: ffff8800bb67e660  CPU: 1   COMMAND: "ns-slapd"
  PID: 790    TASK: ffff8800bb67c440  CPU: 3   COMMAND: "ns-slapd"
  PID: 791    TASK: ffff8800b28da220  CPU: 0   COMMAND: "ns-slapd"
  PID: 792    TASK: ffff8800b28dc440  CPU: 0   COMMAND: "ns-slapd"
  PID: 793    TASK: ffff8800b28de660  CPU: 0   COMMAND: "ns-slapd"
  PID: 794    TASK: ffff8800b28ddb00  CPU: 2   COMMAND: "ns-slapd"
  PID: 795    TASK: ffff8800b28db8e0  CPU: 1   COMMAND: "ns-slapd"
  PID: 796    TASK: ffff8800b28dad80  CPU: 3   COMMAND: "ns-slapd"
  PID: 797    TASK: ffff8800b28d8b60  CPU: 1   COMMAND: "ns-slapd"
  PID: 798    TASK: ffff8800b28dcfa0  CPU: 0   COMMAND: "ns-slapd"
  PID: 799    TASK: ffff8800b28d8000  CPU: 2   COMMAND: "ns-slapd"
  PID: 800    TASK: ffff8800b28df1c0  CPU: 0   COMMAND: "ns-slapd"
  PID: 801    TASK: ffff88013615e660  CPU: 2   COMMAND: "ns-slapd"
  PID: 802    TASK: ffff8801361596c0  CPU: 1   COMMAND: "ns-slapd"
  PID: 803    TASK: ffff88013615f1c0  CPU: 2   COMMAND: "ns-slapd"
  PID: 804    TASK: ffff880137722220  CPU: 1   COMMAND: "ns-slapd"
  PID: 805    TASK: ffff8801377271c0  CPU: 1   COMMAND: "ns-slapd"
  PID: 806    TASK: ffff880137720b60  CPU: 3   COMMAND: "ns-slapd"
  PID: 836    TASK: ffff8800a91f8000  CPU: 2   COMMAND: "ns-slapd"
  PID: 20553  TASK: ffff8800b1dc2220  CPU: 2   COMMAND: "ns-slapd"
  PID: 20555  TASK: ffff8800b1dc6660  CPU: 1   COMMAND: "ns-slapd"

crash> foreach ns-slapd bt -sx
PID: 760    TASK: ffff8801362ef1c0  CPU: 1   COMMAND: "ns-slapd"
#0 [ffff8800a921b9d0] __schedule+0x28d at ffffffff816092dd
#1 [ffff8800a921ba38] schedule+0x29 at ffffffff81609839
#2 [ffff8800a921ba48] schedule_hrtimeout_range_clock+0x12c at ffffffff81608b4c
#3 [ffff8800a921bae0] schedule_hrtimeout_range+0x13 at ffffffff81608ba3
#4 [ffff8800a921baf0] poll_schedule_timeout+0x55 at ffffffff811dac65
#5 [ffff8800a921bb20] do_sys_poll+0x4cd at ffffffff811dc1ed
#6 [ffff8800a921bf40] sys_poll+0x74 at ffffffff811dc3a4
#7 [ffff8800a921bf80] system_call_fastpath+0x16 at ffffffff81614389
    RIP: 00007f3dce0f3b7d  RSP: 00007ffe405bd378  RFLAGS: 00000246
    RAX: 0000000000000007  RBX: ffffffff81614389  RCX: 0000000000000001
    RDX: 00000000000000fa  RSI: 000000000000000e  RDI: 00007ffe405bd140
    RBP: 00000000000000fa   R8: 000000004c26d05a   R9: 0000000000000001
    R10: 0000000000000001  R11: 0000000000000293  R12: 000000000000000e
    R13: 00007ffe405bd140  R14: 000000000000000e  R15: 00000000c849164f
    ORIG_RAX: 0000000000000007  CS: 0033  SS: 002b

PID: 762    TASK: ffff8801383a0b60  CPU: 1   COMMAND: "ns-slapd"
#0 [ffff880023743888] __schedule+0x28d at ffffffff816092dd
#1 [ffff8800237438f0] schedule+0x29 at ffffffff81609839
#2 [ffff880023743900] schedule_hrtimeout_range_clock+0x12c at ffffffff81608b4c
#3 [ffff880023743998] schedule_hrtimeout_range+0x13 at ffffffff81608ba3
#4 [ffff8800237439a8] poll_schedule_timeout+0x55 at ffffffff811dac65
#5 [ffff8800237439d8] do_select+0x6d1 at ffffffff811db5e1
#6 [ffff880023743d80] core_sys_select+0x1db at ffffffff811db8ab
#7 [ffff880023743f18] sys_select+0xba at ffffffff811dba8a
#8 [ffff880023743f80] system_call_fastpath+0x16 at ffffffff81614389
    RIP: 00007f3dce0f58f3  RSP: 00007f3dbf4e8d60  RFLAGS: 00000202
    RAX: 0000000000000017  RBX: ffffffff81614389  RCX: 00000000ffffffff
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: 0000000000000000
    RBP: 0000000000000064   R8: 00007f3dbf4e8d60   R9: 00007f3d406c9f20
    R10: 0000000000000000  R11: 0000000000000293  R12: 00007f3dbf4e8d94
    R13: 00007f3dd08bedb4  R14: 0000000000000009  R15: 0000000000000000
    ORIG_RAX: 0000000000000017  CS: 0033  SS: 002b

PID: 763    TASK: ffff8801383a16c0  CPU: 3   COMMAND: "ns-slapd"
#0 [ffff88002368f888] __schedule+0x28d at ffffffff816092dd
#1 [ffff88002368f8f0] schedule+0x29 at ffffffff81609839
#2 [ffff88002368f900] schedule_hrtimeout_range_clock+0x12c at ffffffff81608b4c
#3 [ffff88002368f998] schedule_hrtimeout_range+0x13 at ffffffff81608ba3
#4 [ffff88002368f9a8] poll_schedule_timeout+0x55 at ffffffff811dac65
#5 [ffff88002368f9d8] do_select+0x6d1 at ffffffff811db5e1
#6 [ffff88002368fd80] core_sys_select+0x1db at ffffffff811db8ab
#7 [ffff88002368ff18] sys_select+0xba at ffffffff811dba8a
#8 [ffff88002368ff80] system_call_fastpath+0x16 at ffffffff81614389
    RIP: 00007f3dce0f58f3  RSP: 00007f3dbece6c98  RFLAGS: 00000246
    RAX: 0000000000000017  RBX: ffffffff81614389  RCX: 00007f3dce0f58f3
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: 0000000000000000
    RBP: 00007f3dd290e7c0   R8: 00007f3dbece6cb0   R9: 00000000000002fb
    R10: 0000000000000000  R11: 0000000000000293  R12: 00000000000000fa
    R13: 0000000000278d00  R14: 000000000000003c  R15: 00007f3dd08bedb4
    ORIG_RAX: 0000000000000017  CS: 0033  SS: 002b

PID: 764    TASK: ffff8801383a6660  CPU: 3   COMMAND: "ns-slapd"
#0 [ffff8800a1c93888] __schedule+0x28d at ffffffff816092dd
#1 [ffff8800a1c938f0] schedule+0x29 at ffffffff81609839
#2 [ffff8800a1c93900] schedule_hrtimeout_range_clock+0x12c at ffffffff81608b4c
#3 [ffff8800a1c93998] schedule_hrtimeout_range+0x13 at ffffffff81608ba3
#4 [ffff8800a1c939a8] poll_schedule_timeout+0x55 at ffffffff811dac65
#5 [ffff8800a1c939d8] do_select+0x6d1 at ffffffff811db5e1
#6 [ffff8800a1c93d80] core_sys_select+0x1db at ffffffff811db8ab
#7 [ffff8800a1c93f18] sys_select+0xba at ffffffff811dba8a
#8 [ffff8800a1c93f80] system_call_fastpath+0x16 at ffffffff81614389
    RIP: 00007f3dce0f58f3  RSP: 00007f3dbe4e6d10  RFLAGS: 00000202
    RAX: 0000000000000017  RBX: ffffffff81614389  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: 0000000000000000
    RBP: 00000000000000fa   R8: 00007f3dbe4e6d60   R9: 00007f3dc932cbc0
    R10: 0000000000000000  R11: 0000000000000293  R12: 0000000000000000
    R13: 0000000000000000  R14: 00007f3dbe4e6d94  R15: 00007f3dd08bedb4
    ORIG_RAX: 0000000000000017  CS: 0033  SS: 002b

crash> dis -rl system_call_fastpath+0x16
/usr/src/debug/kernel-3.10.0-229.14.1.el7/linux-3.10.0-229.14.1.el7.x86_64/arch/x86/kernel/entry_64.S: 436
0xffffffff81614373 <system_call_fastpath>:      cmp    $0x140,%rax
/usr/src/debug/kernel-3.10.0-229.14.1.el7/linux-3.10.0-229.14.1.el7.x86_64/arch/x86/kernel/entry_64.S: 441
0xffffffff81614379 <system_call_fastpath+6>:    ja     0xffffffff81614441 <badsys>
/usr/src/debug/kernel-3.10.0-229.14.1.el7/linux-3.10.0-229.14.1.el7.x86_64/arch/x86/kernel/entry_64.S: 442
0xffffffff8161437f <system_call_fastpath+12>:   mov    %r10,%rcx
/usr/src/debug/kernel-3.10.0-229.14.1.el7/linux-3.10.0-229.14.1.el7.x86_64/arch/x86/kernel/entry_64.S: 443
0xffffffff81614382 <system_call_fastpath+15>:   callq  *-0x7e9e3c40(,%rax,8)
/usr/src/debug/kernel-3.10.0-229.14.1.el7/linux-3.10.0-229.14.1.el7.x86_64/arch/x86/kernel/entry_64.S: 444
0xffffffff81614389 <system_call_fastpath+22>:   mov    %rax,0x20(%rsp)

That is to hang all the threads of the process takes place in one place - syscall NR_select, other than a parent who expects the 7th siskol ie NR_poll. That is, the data is waiting with the network interface. 

[root@msk1-vm-imap01 morfy]# grep 7 /usr/src/kernels/3.10.0-229.14.1.el7.x86_64/arch/x86/include/generated/uapi/asm/unistd_64.h
#define __NR_poll 7
[root@msk1-vm-imap01 morfy]# grep 23 /usr/src/kernels/3.10.0-229.14.1.el7.x86_64/arch/x86/include/generated/uapi/asm/unistd_64.h
#define __NR_select 23


Of all the compounds, in an amount of 1067 pieces all are in unconnected, except one, on which there is expectation:

999 ffff880020dd6f80 ffff880037de0000 INET6:STREAM
57 ffff8800b41d5e00 ffff88010c36da40 UNIX:DGRAM
410 ffff8801055e8500 ffff8800b908b480 INET:STREAM  10.224.8.5-44492 10.224.8.4-389
PID: 760    TASK: ffff8801362ef1c0  CPU: 1   COMMAND: "ns-slapd"
FD      SOCKET            SOCK       FAMILY:TYPE SOURCE-PORT DESTINATION-PORT

crash> net -S ffff8801362ef1c0
PID: 760    TASK: ffff8801362ef1c0  CPU: 1   COMMAND: "ns-slapd"
FD       SOCKET             SOCK
6  ffff8800a73d2800  ffff880135917000

struct socket {
  state = SS_UNCONNECTED,
  type = 1,
  flags = 0,
  wq = 0xffff8800ba2702c0,
  file = 0xffff8800b28fa600,
  sk = 0xffff880135917000,
  ops = 0xffffffff816b7940 <inet6_stream_ops>
}

Comment 8 thierry bordaz 2016-05-09 07:58:19 UTC
Hello,

Original logs from April 19th is somehow related to a DS instance unable to reach the other (ipa-01):
[19/Apr/2016:23:39:26 +0300] NSMMReplicationPlugin - agmt="cn=meToipa-01.domain.com" (ipa-01:389): Unable to receive the response for a startReplication extended 

It is looking like a periodic (each 25min) DS or network issue but it would require more data to go further. When this problem occurs please, monitor DS (top -H -p <ds_pid) for 1 or two minutes, then get 2-3 pstacks, then attach the error/access logs to the BZ.

In https://bugzilla.redhat.com/show_bug.cgi?id=1330171#c6, the problem seems more related to KRB failure because of ticket expiration. Is it also a periodic symptom ?

thanks
thierry

Comment 9 Petr Vobornik 2016-05-09 11:27:29 UTC
Maybe update to IPA 4.2 in RHEL 7.2 could help, hard to say.

Comment 10 Ludwig 2016-05-09 11:36:11 UTC
If the slapd process hangs again, can we get a pstack of the process or a core file or a gdb 'thread apply all bt -full' from the core

Comment 11 Denis 2016-05-11 11:31:33 UTC
O course, next time i give you pstack.

Comment 12 Petr Vobornik 2016-05-17 15:35:57 UTC
I'm closing it because of lack of data. Please reopen when more information is provided.

Comment 13 bermoshin 2016-05-30 13:33:03 UTC
Bug: https://access.redhat.com/solutions/2149601

ipactl restart sometimes fails
high cpu usage by ns-slapd process
high io utilisation (queue > 1) on the disk containing both the logging and the database files
clients using this IPA server get odd results (does not know users anymore)
kerberos not responsive (kinit takes a long time)
DNS is working properly but probably using cache

Solution: update the 389-ds-base package to 1.3.4.0-19.el7 or later to resolve this issue. (ipa-server 4.2.0-15)