Bug 1419051

Summary: replication: unable to receive response till nsds5replicaTimeout
Product: Red Hat Enterprise Linux 6 Reporter: German Parente <gparente>
Component: 389-ds-base Assignee: Noriko Hosoi <nhosoi>
Status: CLOSED WONTFIX QA Contact: Viktor Ashirov <vashirov>
Severity: urgent Docs Contact:
Priority: high    
Version: 6.8 CC: arajendr, cww, gscott, mkadmiel, msauton, nkinder, rmeggins
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-08-15 20:47:00 UTC Type: Bug

Description German Parente 2017-02-03 14:32:57 UTC
Description of problem:

Very often, both in IPA deployments and in standalone RHDS, we see these errors in the logs:

[03/Feb/2017:10:36:16.125219254 +0100] NSMMReplicationPlugin - agmt="cn=agmt" (host:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[03/Feb/2017:10:36:27.091234210 +0100] NSMMReplicationPlugin - agmt="cn=agmt2" (host2:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
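
To see which agreements are affected and how often, the error log can be scanned for these lines. The following is a minimal Python sketch, not a supported tool. The regular expression is built only from the two sample lines above, so other 389-ds-base versions may need a different pattern, and the function names are hypothetical.

```python
import re
from collections import Counter

# Pattern derived from the two sample log lines in this report (assumption:
# other server versions may format the message differently).
TIMEOUT_RE = re.compile(
    r'^\[(?P<timestamp>[^\]]+)\]\s+NSMMReplicationPlugin - '
    r'agmt="(?P<agmt>[^"]+)"\s+\((?P<host>[^:()]+):(?P<port>\d+)\): '
    r'Unable to receive the response for a startReplication extended '
    r'operation to consumer \(Timed out\)'
)


def find_start_replication_timeouts(lines):
    """Return (timestamp, agreement, host, port) for each matching line."""
    hits = []
    for line in lines:
        match = TIMEOUT_RE.match(line)
        if match:
            hits.append(
                (match["timestamp"], match["agmt"], match["host"], int(match["port"]))
            )
    return hits


def count_by_agreement(lines):
    """Count startReplication timeouts per replication agreement name."""
    return Counter(agmt for _, agmt, _, _ in find_start_replication_timeouts(lines))
```

Passing an open error log file object to count_by_agreement() gives a per-agreement tally of the timeouts.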

In RHDS, the default timeout is 10 minutes (#define DEFAULT_TIMEOUT 600), so at shutdown we may have to wait up to 10 minutes for the server to stop.
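
The timeout named in the summary is set per replication agreement through the nsDS5ReplicaTimeout attribute, expressed in seconds. As an illustration only (this report does not propose it as a fix), here is a small Python sketch that builds the LDIF for changing that value, which could then be fed to ldapmodify. The function name, the agreement DN, and the value 120 are assumptions made for the example.

```python
def build_timeout_ldif(agreement_dn, timeout_seconds):
    """Build LDIF that replaces nsDS5ReplicaTimeout on one agreement entry."""
    if not isinstance(timeout_seconds, int) or timeout_seconds < 0:
        raise ValueError("timeout_seconds must be a non-negative integer")
    return "\n".join(
        [
            f"dn: {agreement_dn}",
            "changetype: modify",
            "replace: nsDS5ReplicaTimeout",
            f"nsDS5ReplicaTimeout: {timeout_seconds}",
            "",  # blank line terminates the LDIF record
        ]
    )


if __name__ == "__main__":
    # Hypothetical agreement DN; substitute the DN of the real agreement entry.
    example_dn = r"cn=agmt,cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config"
    print(build_timeout_ldif(example_dn, 120))
```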

For this situation, I have a pstack from a customer on RHEL 6:

Thread 2 (Thread 0x7eff4b5fe700 (LWP 31219)):
#0  0x00007eff6b658334 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007eff6b65360e in _L_lock_995 () from /lib64/libpthread.so.0
#2  0x00007eff6b653576 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007eff6bca8669 in PR_Lock () from /lib64/libnspr4.so
#4  0x00007eff63c1f914 in conn_read_result_ex () from /usr/lib64/dirsrv/plugins/libreplication-plugin.so
#5  0x00007eff63c282ea in release_replica () from /usr/lib64/dirsrv/plugins/libreplication-plugin.so
#6  0x00007eff63c224a3 in repl5_inc_run () from /usr/lib64/dirsrv/plugins/libreplication-plugin.so
#7  0x00007eff63c27a15 in prot_thread_main () from /usr/lib64/dirsrv/plugins/libreplication-plugin.so
#8  0x00007eff6bcaec13 in ?? () from /lib64/libnspr4.so
#9  0x00007eff6b651aa1 in start_thread () from /lib64/libpthread.so.0
#10 0x00007eff6b39e93d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7eff6dd197c0 (LWP 9193)):
#0  0x00007eff6b65568c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007eff6bca914e in PR_WaitCondVar () from /lib64/libnspr4.so
#2  0x00007eff6bcae671 in PR_Cleanup () from /lib64/libnspr4.so
#3  0x000000000041f232 in main ()
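
The shape of this stack (main waiting in PR_Cleanup while a replication thread is still running) can be modelled with a deliberately simplified Python sketch. It assumes the worker is simply waiting for a reply that never arrives; it does not reproduce the PR_Lock contention in conn_read_result_ex seen in Thread 2, and all names are hypothetical. A short timeout stands in for DEFAULT_TIMEOUT.

```python
import threading
import time


def wait_for_response(response_arrived, timeout_seconds):
    """Stand-in for a replication thread waiting on a consumer reply."""
    # Event.wait() returns False if the timeout expires with no reply.
    return response_arrived.wait(timeout=timeout_seconds)


def timed_shutdown(timeout_seconds):
    """Start a worker that waits for a reply, then time how long join takes."""
    response_arrived = threading.Event()  # never set: the consumer stays silent
    worker = threading.Thread(
        target=wait_for_response, args=(response_arrived, timeout_seconds)
    )
    started = time.monotonic()
    worker.start()
    # Like main() waiting for threads to exit: blocks until the worker
    # gives up, which only happens once its timeout has expired.
    worker.join()
    return time.monotonic() - started


if __name__ == "__main__":
    # 0.5 s stands in for the 600 s default to keep the demo short.
    print(f"shutdown took {timed_shutdown(0.5):.2f}s")
```

In this model the shutdown time tracks the timeout directly, which is the behaviour described above for the 600-second default.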

The equivalent bug for RHEL 7 is bug 1419050.