Bug 1317000 - LDAP Servers in Multi-master configuration locked up and unresponsive
LDAP Servers in Multi-master configuration locked up and unresponsive
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: openldap (Show other bugs)
6.4
x86_64 Linux
medium Severity medium
: rc
: ---
Assigned To: Matus Honek
BaseOS QE Security Team
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2016-03-11 11:42 EST by Kevin
Modified: 2016-09-05 07:31 EDT (History)
4 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-09-05 07:31:16 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)


External Trackers
Tracker ID Priority Status Summary Last Updated
OpenLDAP ITS 8384 None None None 2016-03-11 11:42 EST

  None (edit)
Description Kevin 2016-03-11 11:42:36 EST
Description of problem:

On rare occasions when one of my OpenLDAP servers is started, both servers will lock up and be unresponsive. Any query to either server will time out. The servers will stay in this state indefinitely.


Version-Release number of selected component (if applicable):

openldap-2.4.39-8.el6.x86_64.rpm
openldap-servers-2.4.39-8.el6.x86_64.rpm


How reproducible:

Rarely. This has happened multiple times, but I am unable to reproduce it with any consistency.


Steps to Reproduce:
1. Start slapd with replication configured on one machine.
2. Start slapd with replication configured on a different machine.


Actual results:

The servers will get stuck during the replication process and not respond to any queries (e.g. ldapsearch).

Expected results:

The servers will replicate and respond to queries without delay. 


Additional info:

I have two OpenLDAP servers running on separate hosts and replicating via delta
syncrepl in mirrormode. 

- Killing either of the servers will resolve this problem.
- Both servers have a thread that is stuck in a do_syncrep1() call.
- Netstat shows that each server has unacknowledged data in one of their
socket's receive queue.


GDB on host1:

(gdb) info threads
  18 Thread 0x7f8b7b80d700 (LWP 31620)  0x00007f8ba7fa6f03 in epoll_wait () from
/lib64/libc.so.6
  17 Thread 0x7f8b7b00c700 (LWP 31621)  0x00007f8ba846a054 in __lll_lock_wait ()
from /lib64/libpthread.so.0
  16 Thread 0x7f8b7a80b700 (LWP 31622)  0x00007f8ba846a054 in __lll_lock_wait ()
from /lib64/libpthread.so.0
  15 Thread 0x7f8b7a00a700 (LWP 31623)  0x00007f8ba846a054 in __lll_lock_wait ()
from /lib64/libpthread.so.0
  14 Thread 0x7f8b79809700 (LWP 31625)  0x00007f8ba846a54d in read () from
/lib64/libpthread.so.0
  13 Thread 0x7f8b79008700 (LWP 31626)  0x00007f8ba846a054 in __lll_lock_wait ()
from /lib64/libpthread.so.0
  12 Thread 0x7f8b59ffd700 (LWP 25459)  0x00007f8ba846a054 in __lll_lock_wait ()
from /lib64/libpthread.so.0
  11 Thread 0x7f8b597fc700 (LWP 10195)  0x00007f8ba846a054 in __lll_lock_wait ()
from /lib64/libpthread.so.0
  10 Thread 0x7f8b70b37700 (LWP 28996)  0x00007f8ba846a054 in __lll_lock_wait ()
from /lib64/libpthread.so.0
  9 Thread 0x7f8b58ffb700 (LWP 8763)  0x00007f8ba846a054 in __lll_lock_wait ()
from /lib64/libpthread.so.0
  8 Thread 0x7f8b4bfff700 (LWP 28614)  0x00007f8ba846a054 in __lll_lock_wait ()
from /lib64/libpthread.so.0
  7 Thread 0x7f8b4b7fe700 (LWP 23367)  0x00007f8ba846a054 in __lll_lock_wait ()
from /lib64/libpthread.so.0
  6 Thread 0x7f8b4affd700 (LWP 13359)  0x00007f8ba846a054 in __lll_lock_wait ()
from /lib64/libpthread.so.0
  5 Thread 0x7f8b4a7fc700 (LWP 29064)  0x00007f8ba846a054 in __lll_lock_wait ()
from /lib64/libpthread.so.0
  4 Thread 0x7f8b49ffb700 (LWP 2184)  0x00007f8ba846a054 in __lll_lock_wait ()
from /lib64/libpthread.so.0
  3 Thread 0x7f8b497fa700 (LWP 31175)  0x00007f8ba846a054 in __lll_lock_wait ()
from /lib64/libpthread.so.0
  2 Thread 0x7f8b48ff9700 (LWP 2316)  0x00007f8ba846a054 in __lll_lock_wait ()
from /lib64/libpthread.so.0
* 1 Thread 0x7f8baa8db700 (LWP 31615)  0x00007f8ba84640ad in pthread_join ()
from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f8ba84640ad in pthread_join () from /lib64/libpthread.so.0
#1  0x00007f8baa93c969 in slapd_daemon () at
../../%./servers/slapd/daemon.c:2929
#2  0x00007f8baa927733 in main (argc=6, argv=<value optimized out>) at
../../../servers/slapd/main.c:1012
(gdb) thread 2
[Switching to thread 2 (Thread 0x7f8b48ff9700 (LWP 2316))]#0  0x00007f8ba846a054
in __lll_lock_wait (929 from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f8ba846a054 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f8ba8465388 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x00007f8ba8465257 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f8baa4b96a0 in tlsm_session_accept_or_connect
(session=0x7f8b2c104110, is_accept=<value optimized out>) at tls_m.c:2656
#4  0x00007f8baa4b8787 in ldap_pvt_tls_accept (sb=0x7f8b2c104150,
ctx_arg=0x7f8bac6f8230) at tls2.c:425
#5  0x00007f8baa943f23 in connection_read (ctx=0x7f8b48ff8b70, argv=0x2f) at
../../../servers/slapd/connection.c:1372
#6  connection_read_thread (ctx=0x7f8b48ff8b70, argv=0x2f) at
../../../servers/slapd/connection.c:1284
#7  0x00007f8baa490a98 in ldap_int_thread_pool_wrapper (xpool=0x7f8bac5c8fb0) at
../../../libraries/libldap_r/tpool.c:688
#8  0x00007f8ba8463851 in start_thread () from /lib64/libpthread.so.0
#9  0x00007f8ba7fa690d in clone () from /lib64/libc.so.6
(gdb) thread 14
[Switching to thread 14 (Thread 0x7f8b79809700 (LWP 31625))]#0 
0x00007f8ba846a54d in read () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f8ba846a54d in read () from /lib64/libpthread.so.0
#1  0x00007f8baa278b1c in sb_debug_read (sbiod=0x7f8b681def80,
buf=0x7f8b6819c830, len=5) at ../../../libraries/liblber/sockbuf.c:829
#2  0x00007f8baa4b9984 in tlsm_PR_Recv (fd=<value optimized out>,
buf=0x7f8b6819c830, len=5, flags=<value optimized out>, timeout=<value
optimized
out>) at tls_m.c:3007
#3  0x00007f8ba946c1ed in ?? () from /usr/lib64/libssl3.so
#4  0000007f8ba9467480 in ?? () from /usr/lib64/libssl3.so
#5  0x00007f8ba9469ed2 in ?? () from /usr/lib64/libssl3.so
#6  0x00007f8ba9470135 in ?? () from /usr/lib64/libssl3.so
#7  0x00007f8ba947196f in SSL_ForceHandshake () from /usr/lib64/libssl3o%o
#8  0x00007f8baa4b96a8 in tlsm_session_accept_or_connect
(session=0x7f8b681b0f20, is_accept=<value optimized out>) at tls_m.c:2658
#9  0x00007f8baa4b82e2 in ldap_int_tls_connect (ld=0x7f8b681d9250,
conn=<value
optimized out>) at tls2.c:362
#10 0x00007f8baa4b857inin ldap_int_tls_start (ld=0x7f8b681d9250,
conn=0x7f8b681b0ba0, srv=<value optimized out>) at tls2.c:860
#11 0x00007f8baa4b86ce in ldap_start_tls_s (ld=0x7f8b681d9250, serverctrls=0x0,
clientctrls=0x0) at tls2.c:1040
#12 0x00007f8baa937ff6 in slap_client_connect (ldp=0x7f8bac6d35a8,
sb=0x7f8bac6d3380) at ../../../servers/slapd/config.c:2012
#13 0x00007f8baa9ae3f5 in do_syncrep1 (ctx=<value optimized out>,
arg=0x7f8bac6cc430) at ../../../servers/slapd/syncrepl.c:613
#14 do_syncrepl (ctx=<value optimized out>, arg=0x7f8bac6cc430) at
../../../servers/slapd/syncrepl.c:1527
#15 0x00007f8baa490a98 in ldap_int_thread_pool_wrapper (xpool=0x7f8bac5c8fb0) at
../../../libraries/libldap_r/tpool.c:688
#16 0x00007f8ba8463851 in start_thread () from /lib64/libpthread.so.0
#17 0x00007f8ba7fa690d in clone () from /lib64/libc.so.6
(gdb) thread 18
[Switching to thread 18 (Thread 0x7f8b7b80d700 (LWP 31620))]#0 
0x00007f8ba7fa6f03 in epoll_wait () from /lib64/libc.so.6
(gdb) bt
#0  0x00007f8ba7fa6f03 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f8baa93d9e2 in slapd_daemon_task (ptr=<value optimized out>) at
../../../servers/slapd/daemon.c:2536
#2  0x00007f8ba8463851 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f8ba7fa690d in clone () from /lib64/libc.so.6


GDB on host2:

(gdb) info threads
  7 Thread 0x7f5c942d5700 (LWP 3869)  0x00007f5cc0797f03 in epoll_wait () from
/lib64/libc.so.6
  6 Thread 0x7f5c93ad4700 (LWP 3870)  0x00007f5cc0c5b54d in read () from
/lib64/libpthread.so.0
  5 Thread 0x7f5c932d3700 (LWP 3871)  0x00007f5cc0c5b054 in __lll_lock_wait ()
from /lib64/libpthread.so.0
  4 Thread 0x7f5c92ad2700 (LWP 3872)  0x00007f5cc0c5843c in
pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  3 Thread 0x7f5c922d1700 (LWP 3873)  0x00007f5cc0c5843c in
pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  2 Thread 0x7f5c91ad0700 (LWP 3874)  0x00007f5cc0c5843c in
pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
* 1 Thread 0x7f5cc30cc700 (LWP 3868)  0x00007f5cc0c550ad in pthread_join () from
/lib64/libpthread.so.0
(gdb) bt
#0  0x00007f5cc0c550ad in pthread_join () from /lib64/libpthread.so.0
#1  0x00007f5cc312d969 in slapd_daemon () at
../../../servers/slapd/daemon.c:2929
#2  0x00007f5cc3118733 in main (argc=6, argv=<value optimized out>) at
../../../servers/slapd/main.c:1012
(gdb) thread 2
[Switching to thread 2 (Thread 0x7f5c91ad0700 (LWP 3874))]#0  0x00007f5cc0c5843c
in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f5cc0c5843c in pthread_cond_wait@@GLIBC_2.3.2 () from
/lib64/libpthread.so.0
#1  0x00007f5cc2c81af5 in ldap_int_thread_pool_wrapper (xpool=0x7f5cc3d18030) at
../../../libraries/libldap_r/tpool.c:675
#2  0x00007f5cc0c54851 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f5cc079790d in clone () from /lib64/libc.so.6
(gdb) thread 5
[Switching to thread 5 (Thread 0x7f5c932d3700 (LWP 3871))]#0  0x00007f5cc0c5b054
in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f5cc0c5b054 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f5cc0c56388 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x00007f5cc0c56257 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f5cc2caa6a0 in tlsm_session_accept_or_connect
(session=0x7f5c841165c0, is_accept=<value optimized out>) at tls_m.c:2656
#4  0x00007f5cc2ca9787 in ldap_pvt_tls_accept (sb=0x7f5c800008c0,
ctx_arg=0x7f5cc3e3b240) at tls2.c:425
#5  0x00007f5cc3134f23 in connection_read (ctx=0x7f5c932d2b70, argv=0x10) at
../../../servers/slapd/connection.c:1372
#6  connection_read_thread (ctx=0x7f5c932d2b70, argv=0x10) at
../../../servers/slapd/connection.c:1284
#7  0x00007f5cc2c81a98 in ldap_int_thread_pool_wrapper (xpool=0x7f5cc3d18030) at
../../../libraries/libldap_r%tptpool.c:688
#8  0x00007f5cc0c54851 in start_thread () from /lib64/libpthread.so.0
#9  0x00007f5cc079790d in clone () from /lib64/libc.so.6
(gdb) thread 6
[Switching to thread 6 (Thread 0x7f5c93ad4700 (LWP 3870))]#0  0x00007f5cc0c5b54d
in read () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f5cc0c5b54d in read () from /lib64/libpthread.so.0
#1  0x00007f5cc2a69b1c in sb_debug_read (sbiod=0x7f5c7c185e10,
buf=0x7f5c7c18a7f0, len=5) at ../../../libraries/liblber/sockbuf.c:829
#2  0x00007f5cc2caa984 in tlsm_PR_Recv (fd=<value optimized out>,
buf=0x7f5c7c18a7f0, len=5, flags=<value optimized out>, timeout=<value
optimized
out>) at tls_m.c:3007
#3  0x00007f5cc1c5d1ed in ?? () from /usr/lib64/libssl3.so
#4  0x00007f5cc1c58480 in ?? () from /usr/lib64/libssl3.so
#5  0x00007f5cc1c5aed2 in ?? () from /usr/lib64/libssl3.so
#6  0x00007f5cc1c61135 in ?? () from /usr/lib64/libssl3.so
#7  0x00007f5cc1c6296f in SSL_ForceHandshake () from /usr/lib64/libssl3.so
#8  0x00007f5cc2caa6a8 in tlsm_session_accept_or_connect
(session=0x7f5c7c182100, is_accept=<value optimized out>) at tls_m.c:2658
#9  0x00007f5cc2ca92e2 in ldap_int_tls_connect (ld=0x7f5c7c100910,
conn=<value
optimized out>) at tls2.c:362
#10 0x00007f5cc2ca9574 in ldap_int_tls_start (ld=0x7f5c7c100910,
conn=0x7f5c7c10a020, srv=<value optimized out>) at tls2.c:860
#11 0x00007f5cc2ca96ce in ldap_start_tls_s (ld=0x7f5c7c100910, serverctrls=0x0,
clientctrls=0x0) at tls2.c:1040
#12 0x00007f5cc3128ff6 in slap_client_connect (ldp=0x7f5cc3e22558,
sb=0x7f5cc3e22330) at ../../../servers/slapd/config.c:2012
#13 0x00007f5cc319f3f5 in do_syncrep1 (ctx=<value optimized out>,
arg=0x7f5cc3e1f410) at ../../../servers/slapd/syncrepl.c:613
#14 do_syncrepl (ctx=<value optimized out>, arg=0x7f5cc3e1f410) at
../../../servers/slapd/syncrepl.c:1527
#15 0x00007f5cc2c81a98 in ldap_int_thread_pool_wrapper (xpool=0x7f5cc3d18030) at
../../../libraries/libldap_r/tpool.c:688
#16 0x00007f5cc0c54851 in start_thread () from /lib64/libpthread.so.0
#17 0x00007f5cc079790d in clone () from /lib64/libc.so.6
(gdb) thread 7
[Switching to thread 7 (Thread 0x7f5c942d5700 (LWP 3869))]#0  0x00007f5cc0797f03
in epoll_wait () from /lib64/libc.so.6
(gdb) bt
#0  0x00007f5cc0797f03 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f5cc312e9e2 in slapd_daemon_task (ptr=<value optimized out>) at
../../../servers/slapd/daemon.c:2536
#2  0x00007f5cc0c54851 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f5cc079790d in clone () from /lib64/libc.so.6


Relevant configuration information (from host2):

syncrepl rid=001
    provider=ldap://host1/
    type=refreshAndPersist
    retry="10 +"
    searchbase="dc=example,dc=com"
    starttls=critical
    tls_cacert=/etc/openldap/ssl-certs/cacert.pem
    tls_cert=/etc/openldap/ssl-certs/host1.crt
    tls_key=/etc/openldap/ssl-certs/host1.key
    tls_reqcert=demand
    bindmethod=sasl
    saslmech=EXTERNAL
    sizelimit=unlimited
    keepalive="10:2:10"
    logfilter="(&(objectClass=auditWriteObject)(reqResult=0))"
    logbase="cn=accesslog"
    syncdata=accesslog


Netstat of host1:

[root@host1 ~]# netstat -anp | grep slapd
tcp        0      0 0.0.0.0:389                 0.0.0.0:*                  
LISTEN      31615/slapd
tcp       56      0 192.168.1.1:389               192.168.1.2:57276            
ESTABLISHED 31615/slapd
tcp        0      0 192.168.1.1:45569             192.168.1.2:389              
ESTABLISHED 31615/slapd
unix  2      [ ]         STREAM     CONNECTED     45346152 31615/slapd
unix  2      [ ]         DGRAM                    45346144 31615/slapd

Netstat of host2:

[root@host2 ~]# netstat -anp | grep slapd
tcp        0      0 0.0.0.0:389                 0.0.0.0:*                  
LISTEN      3868/slapd
tcp       56      0 192.168.1.2:389               192.168.1.1:45569            
ESTABLISHED 3868/slapd
tcp        0      0 192.168.1.2:57276             192.168.1.1:389              
ESTABLISHED 3868/slapd
unix  2      [ ]         STREAM     CONNECTED     19144  3868/slapd
unix         [ ]         DGRAM                    19136  3868/slapd
Comment 2 Patrik Kis 2016-03-14 04:20:16 EDT
This issue really looks like as duplicate of bug 1263477 that will be fixed in upcoming release.
Comment 4 Matus Honek 2016-08-08 09:10:14 EDT
Hello Kevin,

could you, please, given #c2, test if the issue is present in the latest release?
Thank you.
Comment 6 Kevin 2016-08-14 22:53:46 EDT
I apologize, I am confused by '#c2'. Are you referring to comment #2?

I will respond with more information about this issue soon.
Comment 7 Matus Honek 2016-08-15 03:40:57 EDT
(In reply to Kevin from comment #6)
> I apologize, I am confused by '#c2'. Are you referring to comment #2?
Sorry for the confusion. Yes, I meant comment#2.
> 
> I will respond with more information about this issue soon.
Thank you. Also please, provide a `thread apply all bt full' from gdb when this happens again, so that we get full backtrace including local variables' values. Thanks!
Comment 8 Kevin 2016-09-01 23:09:39 EDT
Sorry for taking so long to respond.

I have been running with openldap-2.4.40-10.el6.x86_64 and openldap-servers-2.4.40-10.el6.x86_64 since late March and I haven't seen a lock up since then. I will admit that the lockup didn't occur very often, but I think that I would have seen one by now.

Thanks for all of your help.
Comment 9 Matus Honek 2016-09-05 07:31:16 EDT
Thanks Kevin. Given your comment #8, I am closing this as CURRENTRELEASE. Should you encounter the bug again, feel free to reopen this bugzilla with an update. Thanks!

Note You need to log in before you can comment on or make changes to this bug.