Bug 996373
| Summary: | deadlock during SSL_ForceHandshake when getting connection to replica | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Jatin Nansi <jnansi> |
| Component: | openldap | Assignee: | Jan Synacek <jsynacek> |
| Status: | CLOSED ERRATA | QA Contact: | David Spurek <dspurek> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 6.4 | CC: | dspurek, ebenes, ekeck, emaldona, jkurik, jsvarova, jsynacek, ksrot, mschuppe, nkinder, pmoravec, rmeggins, rrelyea, yjog |
| Target Milestone: | rc | Keywords: | ZStream |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | openldap-2.4.39-6.el6 | Doc Type: | Bug Fix |
| Doc Text: | Previously, OpenLDAP did not properly handle a number of simultaneous updates. As a consequence, sending a number of parallel update requests to the server could cause a deadlock. With this update, a superfluous locking mechanism causing the deadlock has been removed, thus fixing the bug. | | |
| Story Points: | --- | | |
| Clone Of: | | | |
| : | 1125152 (view as bug list) | Environment: | |
| Last Closed: | 2014-10-14 04:57:45 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 994246, 1056121, 1056124 | | |
| Attachments: | | | |
Description
Jatin Nansi
2013-08-13 05:18:48 UTC
Is the customer using a custom DB_CONFIG, possibly with "lock_timeout" and/or "txn_timeout"? Or is the "writetimeout" slapd option set to anything but 0?

(In reply to Pavel Moravec from comment #7)
> Created attachment 812544 [details]
> reproducer

Checking whether the deadlock / reproducer depends on some setup:

- It does NOT depend on using root or chained certificates (i.e. with root certs the deadlock appears as frequently as with chained certs).
- The deadlock appears even with olcLogLevel set to -1 (while the customer was unable to see the error in that case - probably their I/O operations are much faster or slower than on the reproducer VMs, so the thread interleaving necessary for the deadlock does not occur).
- The deadlock also appears with the "none" logLevel. Together with the previous point, this means the deadlock is fully independent of logging (barring very slow or fast disk operations).
- The deadlock is present (though less frequently) even if the clients don't use SSL ("-ZZ" option removed).
- The deadlock is present regardless of the syncrepl type (refreshOnly|refreshAndPersist) - as expected, since the replication connection is not even established.

The shell script from the reproducer requires a 'moduser3.ldif' file, which is missing.

moduser3.ldif, used by the reproducer but missing there:

```
# cat moduser3.ldif
dn: uid=student5,ou=People,dc=example,dc=org
changetype: modify
replace: loginShell
loginShell: /bin/sh
-
replace: homeDirectory
homeDirectory: /home/student5sh
#
```

Anyway, any valid modification would work.

I'm unable to reproduce this on my machines. I tried two installations of Fedora 20 and two installations of RHEL-6.4. Could anyone who is able to reproduce the problem please try again with http://people.redhat.com/jsynacek/openldap-2.4.35-1.el6/ ?

Of course... Right after I commented about how unable I am to reproduce this, it reproduced. Ok, continuing with the investigation.

This reproduces even with my rebase packages and even with the latest packages on Fedora 20.

Looking at the patch at http://www.openldap.org/lists/openldap-bugs/201109/msg00005.html, I think this was exactly one of those patches needed because of NSS. I don't think this can be done from within openldap. Rich, could you please advise?

(In reply to Jan Synacek from comment #14)
> This reproduces even with my rebase packages and even with the latest
> packages on Fedora 20.
>
> Looking at the patch at
> http://www.openldap.org/lists/openldap-bugs/201109/msg00005.html, I think
> this was exactly one of those patches needed because of NSS.

I don't see the deadlock. In every case, there is a thread doing a read(). Unless the read() is blocked, or is waiting on a lock, the read() should complete and eventually release its locks. Why isn't the read() completing? Are you able to reproduce this exact same condition in the debugger? If so, if you go to the thread that is in read(), what is it doing? If you do a "(gdb) finish", does it ever finish?

> I don't think this can be done from within openldap.

Why?

> Rich, could you please advise?

(In reply to Rich Megginson from comment #15)
> I don't see the deadlock. In every case, there is a thread doing a read().
> Unless the read() is blocked, or is waiting on a lock, the read() should
> complete and eventually release its locks. Why isn't the read() completing?
> Are you able to reproduce this exact same condition in the debugger? If so,
> if you go to the thread that is in read(), what is it doing? If you do a
> "(gdb) finish" does it ever finish?
```
(gdb) finish
Run till exit from #0  0x00007f00b8466297 in pthread_join () from /lib64/libpthread.so.0
```

No response after that. It still waits.

If I install glibc-debuginfo, running gdb freezes:

```
# gdb /usr/sbin/slapd $(pgrep slapd)
...
Loaded symbols for /lib64/libnss3.so
Reading symbols from /lib64/libpthread.so.0...Reading symbols from /usr/lib/debug/usr/lib64/libpthread-2.18.so.debug...done.
done.
```

Nothing happens after this, I have to kill gdb from another terminal.

> Why?

I didn't make myself clear. What I meant was that there seems to be a problem in NSS (now it even looks like a glibc problem), so it has to be fixed there, not in openldap.

(In reply to Jan Synacek from comment #16)
> (In reply to Rich Megginson from comment #15)
> > I don't see the deadlock. In every case, there is a thread doing a read().
> > Unless the read() is blocked, or is waiting on a lock, the read() should
> > complete and eventually release its locks. Why isn't the read() completing?
> > Are you able to reproduce this exact same condition in the debugger? If so,
> > if you go to the thread that is in read(), what is it doing? If you do a
> > "(gdb) finish" does it ever finish?
>
> (gdb) finish
> Run till exit from #0  0x00007f00b8466297 in pthread_join () from
> /lib64/libpthread.so.0

But that's not in the same thread as the read: https://bugzilla.redhat.com/show_bug.cgi?id=996373#c7 shows the read in Thread 15. That is, reproduce the same stack trace as in https://bugzilla.redhat.com/show_bug.cgi?id=996373#c7 in gdb, change to the thread that is doing the read, and keep doing "(gdb) finish" while staying in that thread (if gdb switches you to another thread, just type "thread N" to go back to the original thread that was doing the read(); you can find out the current thread with "(gdb) p $_thread").

> No response after that. It still waits.

What does the "thread apply all bt" output look like at that point?

> If I install glibc-debuginfo, running gdb freezes:
> # gdb /usr/sbin/slapd $(pgrep slapd)
> ...
> Loaded symbols for /lib64/libnss3.so
> Reading symbols from /lib64/libpthread.so.0...Reading symbols from
> /usr/lib/debug/usr/lib64/libpthread-2.18.so.debug...done.
> done.
>
> Nothing happens after this, I have to kill gdb from another terminal.

Ok. Then I guess avoid using glibc-debuginfo for this problem.

> > Why?
>
> I didn't make myself clear. What I meant was that there seems to be a problem
> in NSS (now it even looks like a glibc problem), so it has to be fixed there,
> not in openldap.

I believe one of the main problems we had with openldap + NSS integration in the beginning was that the NSS PEM module was not thread safe, so we had to introduce locking. I believe there may have been other NSS thread safety issues.

Created attachment 814412 [details]
gdb session server 1
GDB session from server #1.
After switching to thread 16 (the one blocking on the read()) and trying to 'finish', the thread stays blocked. I have to cancel with ^C, then I show that the backtrace is still the same.
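For reference, a minimal sketch of how this kind of per-thread backtrace can be captured non-interactively (the pgrep lookup and the thread number 16 are assumptions taken from the session described above, not from the attachment itself):

```sh
# Sketch only: dump every thread's backtrace from the running slapd,
# then print the backtrace of the thread that sits in read().
# "finish" is deliberately omitted - as described above, it never returns.
PID=$(pgrep -x slapd)        # assumes a single slapd instance
gdb -p "$PID" -batch \
    -ex 'thread apply all bt' \
    -ex 'thread 16' \
    -ex 'bt'
```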
Created attachment 814413 [details]
gdb session server 2
GDB session from server #2.
More or less the same as the first session, but fewer threads.
Created attachment 814415 [details]
main script used to reproduce the issue
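The actual reproducer is only available as the attachment above. Purely as an illustration of the parallel-update pattern described in the comments, a loop along the following lines could be used; the host name, bind DN, and password are made-up placeholders, and moduser3.ldif is the file quoted earlier:

```sh
# Illustrative sketch only - not the attached reproducer script.
# Sends several concurrent modify requests over StartTLS (-ZZ),
# mirroring the "parallel update requests" scenario from this bug.
for i in $(seq 1 20); do
    ldapmodify -x -ZZ \
        -H ldap://server1.example.org \
        -D "cn=Manager,dc=example,dc=org" -w secret \
        -f moduser3.ldif &
done
wait
```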
This doesn't look like a deadlock where thread A holds lock A and is waiting on lock B, and thread B holds lock B and is waiting on lock A. It looks like the server thinks it needs to read something from the other server, but the other server is not sending anything. What is the other server doing when this server is in read()?

They both wait in read(). I attached gdb sessions from both servers.

Ok, I see. This is a deadlock between two different servers. I wonder if we need two different locks here - one for accept and one for connect - rather than a single lock for both accept and connect. In this case, neither connect can proceed because both connects are waiting for the other side to do the accept, but the accept is blocked by the connect. That would be something like this:

```c
static int
tlsm_session_accept_or_connect( tls_session *session, int is_accept )
{
	...
#ifdef LDAP_R_COMPILE
	ldap_pvt_thread_mutex_t *local_pem_mutex = is_accept ?
		&tlsm_accept_pem_mutex : &tlsm_connect_pem_mutex;
#endif
	...
	if ( pem_module ) {
		LDAP_MUTEX_LOCK( local_pem_mutex );
	}
	rc = SSL_ForceHandshake( s );
	if ( pem_module ) {
		LDAP_MUTEX_UNLOCK( local_pem_mutex );
	}
```

That is, have separate mutexes for accept and connect. This assumes the PEM module and the rest of NSS are thread safe enough to have a connect thread and an accept thread running at the same time. On the other hand, perhaps the PEM module and the rest of NSS have improved their thread safety so that the mutex is no longer required?

It would be great to have thread safety inside NSS and remove the locks protecting SSL_ForceHandshake().

Elio? Has NSS improved in this regard and is it safe to remove the locks?

(In reply to Jan Synacek from comment #24)
> It would be great to have thread safety inside NSS and remove the locks
> protecting SSL_ForceHandshake().
>
> Elio? Has NSS improved in this regard and is it safe to remove the locks?

As far as the PEM module goes we still have issues. See Bug 736410 - pem_Initialize should return CKR_CANT_LOCK as it isn't thread safe yet. We do likewise on RHEL. From the PKCS #11 standard:

CKR_CANT_LOCK: This value can only be returned by C_Initialize. It means that the type of locking requested by the application for thread-safety is not available in this library, and so the application cannot make use of this library in the specified fashion.

I defer to Bob for his assessment of NSS in general.

The PEM module was changed to return CKR_CANT_LOCK, which will cause NSS to do the appropriate locking for PEM. NSS, in general, is thread safe, though certain things (like calling multiple reads on the same SSL socket on the same thread) were supposed to be the responsibility of the caller. Because that was historically too complex for some servers, NSS has separate read and write locks in place to prevent multiple simultaneous reads or writes (or closes while still reading and writing). So you should be safe without setting locks specifically for NSS (though you may still need to protect your own data structures). The tlsm_ functions look like wrappers around NSS and would presumably need to protect their own data structures.

bob

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1426.html