Bug 1544477

Summary: IPA server is not responding, all authentication and admin tests failed
Product: Red Hat Enterprise Linux 7 Reporter: milei
Component: 389-ds-base    Assignee: mreynolds
Status: CLOSED ERRATA QA Contact: RHDS QE <ds-qe-bugs>
Severity: urgent Docs Contact: Marc Muehlfeld <mmuehlfe>
Priority: high    
Version: 7.5    CC: baiesi, fhanzelk, gswami, jgalipea, jneedle, ksiddiqu, lkrispen, lmiksik, milei, mkosek, nkinder, nsoman, pvoborni, rcritten, rmeggins, salmy, spoore, tbordaz, tmihinto, toneata, tscherf, yann.lopez
Target Milestone: rc    Keywords: Regression, ZStream
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: 389-ds-base-1.3.8.2-1.el7 Doc Type: Bug Fix
Doc Text:
Directory Server uses the nunc-stans framework to manage connection events. Jobs in this framework can compete to acquire a connection. Previously, in certain situations, a job could lose a connection and still be scheduled. As a consequence, the server managed the wrong job for the connection. This sometimes led to deadlock situations or to Directory Server terminating unexpectedly during runtime. The problem has been fixed and, as a result, deadlocks and crashes during runtime no longer occur.
Story Points: ---
Clone Of:
: 1563107 Environment:
Last Closed: 2018-10-30 10:13:31 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1563107    
Attachments:
Description Flags
This is additional stack trace information.
none
Gdb stacktrace is attached.
none
Dirsrv error log
none
Dirsrv access log 1
none
Dirsrv access log 2
none
Krb5kdc log is attached
none
dse.ldif is attached
none
pstack log attached
none
top stack trace output file is attached
none

Description milei 2018-02-12 15:53:07 UTC
Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 milei 2018-02-12 16:31:16 UTC
Description of problem:
Running system tests against IdM RHEL 7.5. Authentication and admin tests were scheduled to run for 3 days. After 1 day of running, all authentication and admin tests failed.
Tried "kinit admin"; it failed with "kinit: Generic error (see e-text) while getting initial credentials".

Version-Release number of selected component (if applicable): 
RHEL 7.5 

How reproducible:
Install 1 IPA server and 1 replica.
   
Config:
1000 users
100 groups
200 totp tokens
200 hotp tokens
10 admin users

Run the reliability system test load and monitor test results.


Steps to Reproduce:
1. Build a test environment as defined below
2. Configure the user/group/token environment as defined below
3. Apply STR periodic concurrent authn/admin system pos/neg load tests such as:
    -authn kerberos test load
    -authn totp / hotp 2factor test load
    -sudo runtime test
    -admin user test scenarios
    -admin group test scenarios
    -admin enable / disable users
    -admin active stage user test scenario 

Actual results:
Authentication and admin tests all failed after 1 day of load.

Expected results:
Test failure rate should remain consistent as tests continue running.

Additional info:

RPM info:
rpm -qi ipa-server
Name        : ipa-server
Version     : 4.5.4
Release     : 9.el7
Architecture: x86_64
Install Date: Tue 30 Jan 2018 08:19:29 PM EST
Group       : System Environment/Base
Size        : 1052958
License     : GPLv3+
Signature   : RSA/SHA256, Wed 24 Jan 2018 02:23:32 PM EST, Key ID 199e2f91fd431d51
Source RPM  : ipa-4.5.4-9.el7.src.rpm
Build Date  : Wed 24 Jan 2018 12:11:56 PM EST
Build Host  : x86-019.build.eng.bos.redhat.com
Relocations : (not relocatable)
Packager    : Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>
Vendor      : Red Hat, Inc.
URL         : http://www.freeipa.org/
Summary     : The IPA authentication server


rpm -qi 389-ds-base
Name        : 389-ds-base
Version     : 1.3.7.5
Release     : 14.el7
Architecture: x86_64
Install Date: Tue 30 Jan 2018 08:19:28 PM EST
Group       : System Environment/Daemons
Size        : 5509074
License     : GPLv3+
Signature   : RSA/SHA256, Fri 26 Jan 2018 12:08:52 PM EST, Key ID 199e2f91fd431d51
Source RPM  : 389-ds-base-1.3.7.5-14.el7.src.rpm
Build Date  : Fri 26 Jan 2018 10:17:57 AM EST
Build Host  : x86-041.build.eng.bos.redhat.com
Relocations : (not relocatable)
Packager    : Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>
Vendor      : Red Hat, Inc.
URL         : https://www.port389.org/
Summary     : 389 Directory Server (base)



**Test Environment:
IdM RHEL 7.5

1. IPA server - metal
RedHat-Linux-7.5-Maipo
master.qe.sti.com
3.10.0-838.el7.x86_64
ram:       65GB
cpu 10 core / 40 proc

2. replica 1 -metal 
RedHat-Linux-7.5-Maipo
replica1.qe.sti.com 
3.10.0-843.el7.x86_64
ram:       16GB
cpu 10 core / 16 proc


**users:
str_user1k/ 10 groups/ 400 tokens

Comment 3 milei 2018-02-12 16:35:40 UTC
**********************dirsrv error log*****************
[10/Feb/2018:11:45:50.847697465 -0500] - ERR - set_krb5_creds - Could not get initial credentials for principal [ldap/master95.qe.sti.com.COM] in keytab [FILE:/etc/dirsrv/ds.keytab
]: -1765328324 (Generic error (see e-text))
[10/Feb/2018:11:45:54.689541815 -0500] - ERR - set_krb5_creds - Could not get initial credentials for principal [ldap/master95.qe.sti.com.COM] in keytab [FILE:/etc/dirsrv/ds.keytab
]: -1765328324 (Generic error (see e-text))


********************************************************************************

*********************pstack <ns-slapd pid>*************
pstack 71735
Thread 1 (process 71735):
#0  0x00007f525ea68bb3 in select () at /lib64/libc.so.6
#1  0x00007f5261c970f9 in DS_Sleep () at /usr/lib64/dirsrv/libslapd.so.0
#2  0x00007f525388efc7 in trickle_threadmain () at /usr/lib64/dirsrv/plugins/libback-ldbm.so
#3  0x00007f525fa23c8b in _pt_root () at /lib64/libnspr4.so
#4  0x00007f525f3c3dd5 in start_thread () at /lib64/libpthread.so.0
#5  0x00007f525ea71aed in clone () at /lib64/libc.so.6
********************************************************************************

[root@master95 ~]# pstack 71648
Thread 1 (process 71648):
#0  0x00007f525f3c7945 in pthread_cond_wait@@GLIBC_2.3.2 () at /lib64/libpthread.so.0
#1  0x00007f5261efee63 in [IDLE THREAD] work_q_wait (tp=0x5644661cb740) at src/nunc-stans/ns/ns_thrpool.c:344
#2  0x00007f5261efee63 in worker_thread_func (arg=<optimized out>) at src/nunc-stans/ns/ns_thrpool.c:396
#3  0x00007f525f3c3dd5 in start_thread () at /lib64/libpthread.so.0
#4  0x00007f525ea71aed in clone () at /lib64/libc.so.6
********************************************************************************
[root@master95 ~]# pstack 71649
Thread 1 (process 71649):
#0  0x00007f525f3c7945 in pthread_cond_wait@@GLIBC_2.3.2 () at /lib64/libpthread.so.0
#1  0x00007f5261efee63 in [IDLE THREAD] work_q_wait (tp=0x5644661cb740) at src/nunc-stans/ns/ns_thrpool.c:344
#2  0x00007f5261efee63 in worker_thread_func (arg=<optimized out>) at src/nunc-stans/ns/ns_thrpool.c:396
#3  0x00007f525f3c3dd5 in start_thread () at /lib64/libpthread.so.0
#4  0x00007f525ea71aed in clone () at /lib64/libc.so.6
[root@master95 ~]#
********************************************************************************

Comment 4 milei 2018-02-12 16:37:10 UTC
Created attachment 1395010 [details]
This is additional stack trace information.

Comment 5 Rob Crittenden 2018-02-12 17:56:52 UTC
More information is needed. You are pinning this on 389-ds, but it is unclear what led you to that conclusion. We need the full logs from 389-ds and the krb5kdc log at a minimum.

If 389-ds is hung then a gdb traceback is more useful: 

gdb -ex 'set confirm off' -ex 'set pagination off' -ex 'thread apply all bt full' -ex 'quit' /usr/sbin/ns-slapd `pidof ns-slapd` > stacktrace.`date +%s`.txt 2>&1

Comment 6 milei 2018-02-12 18:21:47 UTC
Created attachment 1395042 [details]
Gdb stacktrace is attached.

Comment 7 milei 2018-02-12 18:29:59 UTC
*********************krb5kdc stack trace**********************************
[root@master95 ~]# pstack 71832
#0  0x00007fa3fc5b3032 in wait () from /lib64/libc.so.6
#1  0x000055aa39d68cbf in main ()

root@master95 ~]# pstack 71833
#0  0x00007fa3fc5ed0a3 in __epoll_wait_nocancel () from /lib64/libc.so.6
#1  0x00007fa3ee61b803 in epoll_dispatch () from /lib64/libevent-2.0.so.5
#2  0x00007fa3ee6073ea in event_base_loop () from /lib64/libevent-2.0.so.5
#3  0x000055aa39d68a76 in main ()

[root@master95 ~]# pstack 71834
#0  0x00007fa3fc5ed0a3 in __epoll_wait_nocancel () from /lib64/libc.so.6
#1  0x00007fa3ee61b803 in epoll_dispatch () from /lib64/libevent-2.0.so.5
#2  0x00007fa3ee6073ea in event_base_loop () from /lib64/libevent-2.0.so.5
#3  0x000055aa39d68a76 in main ()

[root@master95 ~]# pstack 71844
#0  0x00007fa3fc5ed0a3 in __epoll_wait_nocancel () from /lib64/libc.so.6
#1  0x00007fa3ee61b803 in epoll_dispatch () from /lib64/libevent-2.0.so.5
#2  0x00007fa3ee6073ea in event_base_loop () from /lib64/libevent-2.0.so.5
#3  0x000055aa39d68a76 in main ()

Comment 10 milei 2018-02-12 19:12:15 UTC
Created attachment 1395072 [details]
Dirsrv error log

Comment 11 milei 2018-02-12 19:13:01 UTC
Created attachment 1395073 [details]
Dirsrv access log 1

Comment 12 milei 2018-02-12 21:52:18 UTC
Created attachment 1395146 [details]
Dirsrv access log 2

Comment 13 milei 2018-02-12 22:03:34 UTC
Created attachment 1395148 [details]
Krb5kdc log is attached

Comment 14 milei 2018-02-12 22:15:46 UTC
Created attachment 1395149 [details]
dse.ldif is attached

Comment 15 milei 2018-02-13 04:29:32 UTC
Created attachment 1395207 [details]
pstack log attached

Comment 16 milei 2018-02-13 04:30:54 UTC
Created attachment 1395208 [details]
top stack trace output file is attached

Comment 17 milei 2018-02-13 04:40:24 UTC
The first error in the DS error log happened at 2/09/2018 11:35:55.
The test failure rate was increasing, but not at 100% failure.

On 2/10/2018 at 00:41:00, all authentication tests failed.

 ***********************from DS error log**********************************
[09/Feb/2018:11:35:55.587426531 -0500] - ERR - schema-compat-plugin - scheduled schema-compat-plugin tree scan in about 5 seconds after the server startup

Comment 20 thierry bordaz 2018-02-13 14:14:55 UTC

    - The two threads are in deadlock because they acquire the locks in opposite order
            - monitor lock  (conn->c_job->monitor)
            - connection lock (conn->c_mutex)
      on the same connection/job
                (gdb) thread 46
                (gdb) frame 7
                (gdb) print opconnid
                $4 = 31572
        
                (gdb) thread 100
                (gdb) frame 2
                (gdb) print c->c_connid
                $3 = 31572
                

    - The two threads are

(gdb) thread 46
[Switching to thread 46 (Thread 0x7f51a408d700 (LWP 71786))]
#0  0x00007f525f3ca4cd in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) where
#0  0x00007f525f3ca4cd in __lll_lock_wait () at /lib64/libpthread.so.0
#1  0x00007f525f3c5de6 in _L_lock_870 () at /lib64/libpthread.so.0
#2  0x00007f525f3c5cdf in pthread_mutex_lock () at /lib64/libpthread.so.0
#3  0x00007f5261efebc6 in ns_job_done (job=0x5644764a3770) at src/nunc-stans/ns/ns_thrpool.c:760
#4  0x0000564465373cd7 in ns_connection_post_io_or_closing (conn=conn@entry=0x56447030bd40) at ldap/servers/slapd/daemon.c:1693
#5  0x000056446536fb04 in disconnect_server_nomutex_ext (conn=conn@entry=0x56447030bd40, opconnid=opconnid@entry=31572, opid=opid@entry=276, reason=reason@entry=-4996, error=error@entry=0, schedule_closure_job=schedule_closure_job@entry=1)
    at ldap/servers/slapd/connection.c:2301
#6  0x00005644653703a3 in disconnect_server_nomutex (error=0, reason=-4996, opid=276, opconnid=31572, conn=0x56447030bd40)
    at ldap/servers/slapd/connection.c:2314
#7  0x00005644653703a3 in disconnect_server (conn=0x56447030bd40, opconnid=31572, opid=276, reason=-4996, error=0)
    at ldap/servers/slapd/connection.c:2209
#8  0x00007f5261c73c13 in flush_ber (pb=pb@entry=0x5644702952c0, conn=0x56447030bd40, op=0x564478a12000, ber=ber@entry=0x56447899bef0, type=type@entry=0) at ldap/servers/slapd/result.c:1780
#9  0x00007f5261c75db0 in send_ldap_result_ext (pb=0x5644702952c0, err=0, matched=<optimized out>, text=<optimized out>, nentries=0, urls=<optimized out>, ber=0x56447899bef0, ber@entry=0x0) at ldap/servers/slapd/result.c:600
#10 0x00007f5261c75f71 in send_ldap_result (pb=<optimized out>, err=<optimized out>, matched=<optimized out>, text=<optimized out>, nentries=<optimized out>, urls=<optimized out>) at ldap/servers/slapd/result.c:199
#11 0x00007f5261c5f673 in slapi_send_ldap_result (pb=pb@entry=0x5644702952c0, err=0, matched=matched@entry=0x0, text=0x0, nentries=nentries@entry=0, urls=urls@entry=0x0) at ldap/servers/slapd/plugin.c:781
#12 0x00007f52538cce49 in ldbm_back_modify (pb=<optimized out>) at ldap/servers/slapd/back-ldbm/ldbm_modify.c:1000
#13 0x00007f5261c4d656 in op_shared_modify (pb=pb@entry=0x5644702952c0, pw_change=pw_change@entry=0, old_pw=0x0)
    at ldap/servers/slapd/modify.c:1021
#14 0x00007f5261c4eb7b in do_modify (pb=pb@entry=0x5644702952c0) at ldap/servers/slapd/modify.c:391
#15 0x00005644653711e2 in connection_dispatch_operation (pb=0x5644702952c0, op=0x564478a12000, conn=0x56447030bd40)
    at ldap/servers/slapd/connection.c:624
#16 0x00005644653711e2 in connection_threadmain () at ldap/servers/slapd/connection.c:1769
#17 0x00007f525fa23c8b in _pt_root () at /lib64/libnspr4.so
#18 0x00007f525f3c3dd5 in start_thread () at /lib64/libpthread.so.0
#19 0x00007f525ea71aed in clone () at /lib64/libc.so.6


Thread 100 (Thread 0x7f52284e6700 (LWP 71728)):
#0  0x00007f525f3c7945 in pthread_cond_wait@@GLIBC_2.3.2 () at /lib64/libpthread.so.0
#1  0x00007f525fa1e503 in PR_EnterMonitor () at /lib64/libnspr4.so
#2  0x0000564465373ffa in ns_handle_pr_read_ready ()
#3  0x00007f5261efec89 in work_job_execute (job=0x5644764a3770) at src/nunc-stans/ns/ns_thrpool.c:291
#4  0x00007f5261effbe3 in event_cb (fd=<optimized out>, event=<optimized out>, arg=<optimized out>) at src/nunc-stans/ns/ns_event_fw_event.c:118
#5  0x00007f525ef54a14 in event_base_loop () at /lib64/libevent-2.0.so.5
#6  0x00007f5261effeae in ns_event_fw_loop (ns_event_fw_ctx=<optimized out>) at src/nunc-stans/ns/ns_event_fw_event.c:308
#7  0x00007f5261efeac9 in event_loop_thread_func (arg=0x5644661cb740) at src/nunc-stans/ns/ns_thrpool.c:581
#8  0x00007f525f3c3dd5 in start_thread () at /lib64/libpthread.so.0
#9  0x00007f525ea71aed in clone () at /lib64/libc.so.6


    - Details of the debug
    Thread 100 acquired the 'monitor' lock (in work_job_execute) before trying to acquire the 'c_mutex' (ns_handle_pr_read_ready)
        Thread 100 waits on conn->c_mutex
        (gdb) thread 100
        (gdb) frame 2
        (gdb) print c->c_mutex
        $12 = (PRMonitor *) 0x564472171130
        (gdb) print (void *)((struct MY_PRMonitor *) 0x564472171130)->owner
        $37 = (void *) 0x7f51a408d700       --> That is held by Thread 46

    Thread 46 acquired the 'c_mutex' lock (disconnect_server) before trying to acquire the 'monitor' (ns_job_done)
        Thread 46 waits on conn->c_job->monitor
        (gdb) print job
        $5 = (ns_job_t *) 0x5644764a3770
        print job->monitor.__data.__owner
        $3 = 71728                          --> That is held by Thread 100

    

    - The scenario is rare because it happens only when, at the same time, the server
        - adds the connection to the event queue to "poll" it
        - closes the connection

Comment 21 milei 2018-02-13 18:37:15 UTC
[root@master95 ~]# rpm -qi libdb
Name        : libdb
Version     : 5.3.21
Release     : 24.el7
Architecture: x86_64
Install Date: Tue 30 Jan 2018 03:42:19 PM EST
Group       : System Environment/Libraries
Size        : 1853360
License     : BSD and LGPLv2 and Sleepycat
Signature   : RSA/SHA256, Thu 11 Jan 2018 10:27:08 AM EST, Key ID 199e2f91fd431d51
Source RPM  : libdb-5.3.21-24.el7.src.rpm
Build Date  : Thu 11 Jan 2018 07:54:21 AM EST
Build Host  : x86-019.build.eng.bos.redhat.com
Relocations : (not relocatable)
Packager    : Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>
Vendor      : Red Hat, Inc.
URL         : http://www.oracle.com/database/berkeley-db/
Summary     : The Berkeley DB database library for C

Comment 24 baiesi 2018-02-21 13:49:09 UTC
So under this condition, when the threads are in deadlock and the system is no longer operational, could you please indicate which service / process needs to be restarted for the IdM environment to become operational again?
thanks

Comment 25 wibrown@redhat.com 2018-02-22 00:09:40 UTC
Hi, 

This has been found and reported upstream. For more see, https://pagure.io/389-ds-base/issue/49569

I'll be working on it soon.

Comment 29 milei 2018-03-06 13:33:42 UTC
We ran the same test on RHEL 7.4 for 12 days during the last holiday shutdown and did not see this issue during the 12-day test period.

Comment 30 thierry bordaz 2018-03-06 13:39:12 UTC
Thank you sooo much for having already tested it!!!. 

We can try to determine what differs between 7.4 and 7.5 in the connection handling.
By any chance, do you know which 7.4 version you tested (1.3.6.xx)?

Comment 31 milei 2018-03-06 13:59:13 UTC
RPM info when running IPA system testing on RHEL 7.4:

rpm -qi ipa-server
Name        : ipa-server
Version     : 4.5.0
Release     : 21.el7_4.1.2
Architecture: x86_64
Install Date: Mon 16 Oct 2017 10:58:54 AM EDT
Group       : System Environment/Base
Size        : 1050087
License     : GPLv3+
Signature   : RSA/SHA256, Thu 17 Aug 2017 03:51:44 AM EDT, Key ID 199e2f91fd431d51
Source RPM  : ipa-4.5.0-21.el7_4.1.2.src.rpm
Build Date  : Wed 16 Aug 2017 02:31:08 PM EDT
Build Host  : x86-041.build.eng.bos.redhat.com

rpm -qi 389-ds-base
Name        : 389-ds-base
Version     : 1.3.6.1
Release     : 19.el7_4
Architecture: x86_64
Install Date: Mon 16 Oct 2017 10:58:39 AM EDT
Group       : System Environment/Daemons
Size        : 5480103
License     : GPLv3+
Signature   : RSA/SHA256, Fri 25 Aug 2017 03:59:01 AM EDT, Key ID 199e2f91fd431d51
Source RPM  : 389-ds-base-1.3.6.1-19.el7_4.src.rpm
Build Date  : Tue 22 Aug 2017 09:33:10 AM EDT
Build Host  : x86-019.build.eng.bos.redhat.com

Comment 32 thierry bordaz 2018-03-06 17:31:55 UTC
Hi Min,

The investigation to fix the current deadlock can be long/risky.
We know the deadlock condition was introduced in one of the 7.5 builds.
We identified possible culprit fixes that came after 1.3.7.5-9.
Would it be possible to run the STR test with 1.3.7.5-9 (https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=618728) ?

If it does not hang, it would greatly help to reduce the scope of investigations.

Comment 33 thierry bordaz 2018-03-13 15:30:51 UTC
STR tests complete successfully (*without* hang) on 1.3.7.5-9.
So the problem was introduced in the interval ]1.3.7.5-9, 1.3.7.5-14].

Suspecting a batch of Coverity fixes made in 1.3.7.5-14, the next tests will be run on 1.3.7.5-13.

Comment 39 milei 2018-03-21 23:55:37 UTC
Running the system load test with DS versions 1.3.7.5-9 and 1.3.7.5-13, we did not observe the IPA server not-responding issue during a two-and-a-half-day test period.


The server not-responding issue is reproduced with DS version 1.3.7.5-14 after 27 hours of system load testing. The load type and thread counts are the same as when we reported the bug on 2018-02-12.

The server stopped responding and all tests started failing around 14:29 EDT on March 21st, 2018.

Comment 40 milei 2018-03-22 00:47:17 UTC
The correct failure time is 21/Mar/2018:14:36.

DIRSRV log error:

[21/Mar/2018:14:36:09.887351063 -0400] - ERR - set_krb5_creds - Could not get initial credentials for principal [ldap/master95.qe.sti.com.COM] in keytab [FILE:/etc/dirsrv/ds.keytab]: -1765328228 (Cannot contact any KDC for requested realm)
[21/Mar/2018:14:36:13.161005418 -0400] - ERR - set_krb5_creds - Could not get initial credentials for principal [ldap/master95.qe.sti.com.COM] in keytab [FILE:/etc/dirsrv/ds.keytab]: -1765328228 (Cannot contact any KDC for requested realm)

Comment 41 thierry bordaz 2018-03-22 09:27:46 UTC
Thanks Min for this great piece of information. Your tests on -14 hit the same deadlock as https://bugzilla.redhat.com/show_bug.cgi?id=1544477#c20

Thread 60 (Thread 0x7f67513d0700 (LWP 2414)):
#0  0x00007f68048be4cd in __lll_lock_wait () at /lib64/libpthread.so.0
#1  0x00007f68048b9de6 in _L_lock_870 () at /lib64/libpthread.so.0
#2  0x00007f68048b9cdf in pthread_mutex_lock () at /lib64/libpthread.so.0
#3  0x00007f68073f2bc6 in ns_job_done (job=0x5636f47962d0) at src/nunc-stans/ns/ns_thrpool.c:760
#4  0x00005636e583dcd7 in ns_connection_post_io_or_closing (conn=conn@entry=0x5636f09ecb18) at ldap/servers/slapd/daemon.c:1693
#5  0x00005636e5839b04 in disconnect_server_nomutex_ext (conn=conn@entry=0x5636f09ecb18, opconnid=opconnid@entry=23341, opid=opid@entry=608, reason=reason@entry=-4996, error=error@entry=0, schedule_closure_job=schedule_closure_job@entry=1) at ldap/servers/slapd/connection.c:2301
#6  0x00005636e583a3a3 in disconnect_server_nomutex (error=0, reason=-4996, opid=608, opconnid=23341, conn=0x5636f09ecb18) at ldap/servers/slapd/connection.c:2314
#7  0x00005636e583a3a3 in disconnect_server (conn=0x5636f09ecb18, opconnid=23341, opid=608, reason=-4996, error=0) at ldap/servers/slapd/connection.c:2209
#8  0x00007f6807167c13 in flush_ber (pb=pb@entry=0x5636f0d46240, conn=0x5636f09ecb18, op=0x5636fbfae000, ber=ber@entry=0x5636f57e1450, type=type@entry=0) at ldap/servers/slapd/result.c:1780
#9  0x00007f6807169db0 in send_ldap_result_ext (pb=0x5636f0d46240, err=0, matched=<optimized out>, text=<optimized out>, nentries=0, urls=<optimized out>, ber=0x5636f57e1450, ber@entry=0x0) at ldap/servers/slapd/result.c:600
#10 0x00007f6807169f71 in send_ldap_result (pb=<optimized out>, err=<optimized out>, matched=<optimized out>, text=<optimized out>, nentries=<optimized out>, urls=<optimized out>) at ldap/servers/slapd/result.c:199
#11 0x00007f6807153673 in slapi_send_ldap_result (pb=pb@entry=0x5636f0d46240, err=0, matched=matched@entry=0x0, text=0x0, nentries=nentries@entry=0, urls=urls@entry=0x0) at ldap/servers/slapd/plugin.c:781
#12 0x00007f67f8dbfe49 in ldbm_back_modify (pb=<optimized out>) at ldap/servers/slapd/back-ldbm/ldbm_modify.c:1000
#13 0x00007f6807141656 in op_shared_modify (pb=pb@entry=0x5636f0d46240, pw_change=pw_change@entry=0, old_pw=0x0) at ldap/servers/slapd/modify.c:1021
#14 0x00007f6807142b7b in do_modify (pb=pb@entry=0x5636f0d46240) at ldap/servers/slapd/modify.c:391
#15 0x00005636e583b1e2 in connection_dispatch_operation (pb=0x5636f0d46240, op=0x5636fbfae000, conn=0x5636f09ecb18) at ldap/servers/slapd/connection.c:624
#16 0x00005636e583b1e2 in connection_threadmain () at ldap/servers/slapd/connection.c:1769
#17 0x00007f6804f17c8b in _pt_root () at /lib64/libnspr4.so
#18 0x00007f68048b7dd5 in start_thread () at /lib64/libpthread.so.0
#19 0x00007f6803f64b3d in clone () at /lib64/libc.so.6

Thread 99 (Thread 0x7f67cd9d9700 (LWP 2372)):
#0  0x00007f68048bb945 in pthread_cond_wait@@GLIBC_2.3.2 () at /lib64/libpthread.so.0
#1  0x00007f6804f12503 in PR_EnterMonitor () at /lib64/libnspr4.so
#2  0x00005636e583dffa in ns_handle_pr_read_ready (job=0x5636f47962d0) at ldap/servers/slapd/daemon.c:1782
#3  0x00007f68073f2c89 in work_job_execute (job=0x5636f47962d0) at src/nunc-stans/ns/ns_thrpool.c:291
#4  0x00007f68073f3be3 in event_cb (fd=<optimized out>, event=<optimized out>, arg=<optimized out>) at src/nunc-stans/ns/ns_event_fw_event.c:118
#5  0x00007f6804447a14 in event_base_loop () at /lib64/libevent-2.0.so.5
#6  0x00007f68073f3eae in ns_event_fw_loop (ns_event_fw_ctx=<optimized out>) at src/nunc-stans/ns/ns_event_fw_event.c:308
#7  0x00007f68073f2ac9 in event_loop_thread_func (arg=0x5636e68df740) at src/nunc-stans/ns/ns_thrpool.c:581
#8  0x00007f68048b7dd5 in start_thread () at /lib64/libpthread.so.0
#9  0x00007f6803f64b3d in clone () at /lib64/libc.so.6


I will investigate what could trigger this issue in -14 that was not in -13.

Regarding your tests on -14, have you noticed any failures around OTP authentication like those you reported during the tests of -13?

Comment 44 milei 2018-03-22 14:18:15 UTC
OTP failure rate is high on build -13 and -14. 
OTP Bug: https://bugzilla.redhat.com/show_bug.cgi?id=1508498

The only difference between the test runs against build -13 and build -14 is:
When running build -13, we did not run the HOTP test, only TOTP.

When running build -14, we ran both HOTP and TOTP.

We could rerun build -13 with HOTP.

Comment 45 thierry bordaz 2018-03-22 14:33:14 UTC
If the load differs between -13 and -14, it can contribute to the connection hang (which is timing-dependent).

I agree it would be good to test -13 with HOTP and TOTP, because we expect it to fail like -14.

Comment 53 baiesi 2018-04-02 20:44:20 UTC
Hi Viktor, the STR team successfully ran the IdM test with Directory Server version/build 1.3.7.5-11 with no deadlock issue. System test duration: 4+ days.

Run ID 	R-3909
Profile Name 	idm-75-sys-load-test
Run Name 	idm system test_11_lce-2
Tester 	milei
State 	Complete
Group 	idm
Start Date 	2018-03-28 16:04:24
Stop Date 	2018-04-02 08:04:24
Test Duration 	Days:4, Hrs:16, Mins:0, Secs:0
Lab Controller Engine 	LCE-2-10.8.48.121:8085

Comment 55 thierry bordaz 2018-04-03 08:22:23 UTC
(In reply to baiesi from comment #53)
> Hi Viktor, The STR team successfully ran Idm Test with DirServer ver/build
> 1.3.7.5-11 with no deadlock issue. System test duration 4+ days.
> 
> Run ID 	R-3909
> Profile Name 	idm-75-sys-load-test
> Run Name 	idm system test_11_lce-2
> Tester 	milei
> State 	Complete
> Group 	idm
> Start Date 	2018-03-28 16:04:24
> Stop Date 	2018-04-02 08:04:24
> Test Duration 	Days:4, Hrs:16, Mins:0, Secs:0
> Lab Controller Engine 	LCE-2-10.8.48.121:8085

Hi Bruce, thanks for this great news. It confirms the current plan is the right one. A new build (0day) that reverts a patch (DS ticket 48184) introduced in 1.3.7.5-12 will be available soon (https://bugzilla.redhat.com/show_bug.cgi?id=1563107).
Will you have bandwidth to test this 0day build (when available)?

Comment 56 milei 2018-04-03 12:59:03 UTC
Hi Thierry:

The STR team will install the 0day build with the fix when it is available.

Comment 57 thierry bordaz 2018-04-03 13:01:14 UTC
Min, Bruce, thank you soooo much.
The build is now available https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=668919

Comment 67 Scott Poore 2018-08-29 18:05:25 UTC
Verified.

Version ::

ipa-server-4.6.4-6.el7.x86_64
389-ds-base-1.3.8.4-10.el7.x86_64

Results ::

Run ID 	R-4740
Profile Name 	idm-76-sys-load-test-bz1544477-qe.sti.com-2-7day
Run Name 	idm system test_3_lce-1
Tester 	spoore
State 	Complete
Group 	idm
Start Date 	2018-08-21 18:30:39
Stop Date 	2018-08-25 18:30:39
Test Duration 	Days:4, Hrs:0, Mins:0, Secs:0
Lab Controller Engine 	LCE-1-10.8.48.121:8084

Regardless of the name, this was a 4-day system load test run. The profile was created with the original intention of 7-day runs but, due to timing, it was shortened for this particular run.

Excluding OTP tests, which were affected by another bug, the results for this 4-day run were:

Passed:  26,446
Failed:  8

The failures above were not related to the authentication and admin test failures seen initially. No drastic numbers or rates of failure were seen. So far, the same appears to be true for another run currently in progress.

Comment 70 errata-xmlrpc 2018-10-30 10:13:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:3127