Bug 996373
| Summary: | deadlock during SSL_ForceHandshake when getting connection to replica | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Jatin Nansi <jnansi> |
| Component: | openldap | Assignee: | Jan Synacek <jsynacek> |
| Status: | CLOSED ERRATA | QA Contact: | David Spurek <dspurek> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 6.4 | CC: | dspurek, ebenes, ekeck, emaldona, jkurik, jsvarova, jsynacek, ksrot, mschuppe, nkinder, pmoravec, rmeggins, rrelyea, yjog |
| Target Milestone: | rc | Keywords: | ZStream |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | openldap-2.4.39-6.el6 | Doc Type: | Bug Fix |
| Doc Text: | Previously, OpenLDAP did not properly handle a number of simultaneous updates. As a consequence, sending a number of parallel update requests to the server could cause a deadlock. With this update, a superfluous locking mechanism causing the deadlock has been removed, thus fixing the bug. | | |
| Story Points: | --- | | |
| Clone Of: | | | |
| : | 1125152 (view as bug list) | Environment: | |
| Last Closed: | 2014-10-14 04:57:45 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 994246, 1056121, 1056124 | | |
| Attachments: | | | |
Description
Jatin Nansi
2013-08-13 05:18:48 UTC
Is the customer using a custom DB_CONFIG, possibly with "lock_timeout" and/or "txn_timeout"? Or is the "writetimeout" slapd option set to anything but 0?

(In reply to Pavel Moravec from comment #7)
> Created attachment 812544 [details]
> reproducer

Checking whether the deadlock / reproducer depends on some setup:

- It does NOT depend on using root or chained certificates (i.e. with root certs the deadlock appears as frequently as with chained certs).
- The deadlock appears even with olcLogLevel set to -1 (while the customer was unable to see the error in that case - probably their I/O operations are much faster or slower than on the reproducer VMs, so the thread interleaving necessary for the deadlock does not occur).
- The deadlock also appears with the "none" logLevel. Together with the previous point, this means the deadlock is fully independent of logging (barring very slow or fast disk operations).
- The deadlock is present (though less frequently) even if the clients don't use SSL ("-ZZ" option removed).
- The deadlock is present regardless of the syncrepl type (refreshOnly|refreshAndPersist) - as expected, since the replication connection is not even established.

The shell script from the reproducer requires a 'moduser3.ldif' file, which is missing.

moduser3.ldif, used by the reproducer but missing there:

```
# cat moduser3.ldif
dn: uid=student5,ou=People,dc=example,dc=org
changetype: modify
replace: loginShell
loginShell: /bin/sh
-
replace: homeDirectory
homeDirectory: /home/student5sh
#
```

Anyway, any valid modification would work.

I'm unable to reproduce this on my machines. I tried two installations of Fedora 20 and two installations of RHEL-6.4. Could anyone who is able to reproduce the problem please try again with http://people.redhat.com/jsynacek/openldap-2.4.35-1.el6/ ?

Of course... Right after I commented about how unable I am to reproduce this, it reproduced. Ok, continuing with the investigation.

This reproduces even with my rebase packages and even with the latest packages on Fedora 20.

Looking at the patch at http://www.openldap.org/lists/openldap-bugs/201109/msg00005.html, I think this was exactly one of those patches needed because of NSS. I don't think this can be done from within openldap. Rich, could you please advise?

(In reply to Jan Synacek from comment #14)
> This reproduces even with my rebase packages and even with the latest
> packages on Fedora 20.
>
> Looking at the patch at
> http://www.openldap.org/lists/openldap-bugs/201109/msg00005.html, I think
> this was exactly one of those patches needed because of NSS.

I don't see the deadlock. In every case, there is a thread doing a read(). Unless the read() is blocked, or is waiting on a lock, the read() should complete and eventually release its locks. Why isn't the read() completing? Are you able to reproduce this exact same condition in the debugger? If so, if you go to the thread that is in read(), what is it doing? If you do a "(gdb) finish", does it ever finish?

> I don't think this can be done from within openldap.

Why?

> Rich, could you please advise?

(In reply to Rich Megginson from comment #15)
> I don't see the deadlock. In every case, there is a thread doing a read().
> Unless the read() is blocked, or is waiting on a lock, the read() should
> complete and eventually release its locks. Why isn't the read() completing?
> Are you able to reproduce this exact same condition in the debugger? If so,
> if you go to the thread that is in read(), what is it doing? If you do a
> "(gdb) finish" does it ever finish?
```
(gdb) finish
Run till exit from #0  0x00007f00b8466297 in pthread_join () from /lib64/libpthread.so.0
```

No response after that. It still waits.

If I install glibc-debuginfo, running gdb freezes:

```
# gdb /usr/sbin/slapd $(pgrep slapd)
...
Loaded symbols for /lib64/libnss3.so
Reading symbols from /lib64/libpthread.so.0...Reading symbols from /usr/lib/debug/usr/lib64/libpthread-2.18.so.debug...done.
done.
```

Nothing happens after this, I have to kill gdb from another terminal.

> Why?

I didn't make myself clear. What I meant was that there seems to be a problem in NSS (now it even looks like a glibc problem), so it has to be fixed there, not in openldap.

(In reply to Jan Synacek from comment #16)
> (In reply to Rich Megginson from comment #15)
> > I don't see the deadlock. In every case, there is a thread doing a read().
> > Unless the read() is blocked, or is waiting on a lock, the read() should
> > complete and eventually release its locks. Why isn't the read() completing?
> > Are you able to reproduce this exact same condition in the debugger? If so,
> > if you go to the thread that is in read(), what is it doing? If you do a
> > "(gdb) finish" does it ever finish?
>
> (gdb) finish
> Run till exit from #0  0x00007f00b8466297 in pthread_join () from
> /lib64/libpthread.so.0

But that's not in the same thread as the read: https://bugzilla.redhat.com/show_bug.cgi?id=996373#c7 shows the read in Thread 15. That is, reproduce the same stack trace as in https://bugzilla.redhat.com/show_bug.cgi?id=996373#c7 in gdb, change to the thread that is doing the read, and keep doing "(gdb) finish" while staying in that thread (if gdb switches you to another thread, just type "thread N" to go back to the original thread that was doing the read(); you can find out the current thread with "(gdb) p $_thread").

> No response after that. It still waits.

What does the "thread apply all bt" output look like at that point?

> If I install glibc-debuginfo, running gdb freezes:
> # gdb /usr/sbin/slapd $(pgrep slapd)
> ...
> Loaded symbols for /lib64/libnss3.so
> Reading symbols from /lib64/libpthread.so.0...Reading symbols from
> /usr/lib/debug/usr/lib64/libpthread-2.18.so.debug...done.
> done.
>
> Nothing happens after this, I have to kill gdb from another terminal.

Ok. Then I guess avoid using glibc-debuginfo for this problem.

> > Why?
>
> I didn't make myself clear. What I meant was that there seems to be a problem
> in NSS (now it even looks like a glibc problem), so it has to be fixed there,
> not in openldap.

I believe one of the main problems we had with openldap + NSS integration in the beginning was that the NSS PEM module was not thread safe, so we had to introduce locking. I believe there may have been other NSS thread safety issues.

Created attachment 814412 [details]
gdb session server 1
GDB session from server #1.
After switching to thread 16 (the one blocking on the read()) and trying to 'finish', the thread stays blocked. I have to cancel with ^C, then I show that the backtrace is still the same.
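For reference, a minimal sketch of how this kind of per-thread backtrace can be captured non-interactively (the pgrep lookup and the thread number 16 are assumptions taken from the session described above, not from the attachment itself):

```sh
# Sketch only: dump every thread's backtrace from the running slapd,
# then print the backtrace of the thread that sits in read().
# "finish" is deliberately omitted - as described above, it never returns.
PID=$(pgrep -x slapd)        # assumes a single slapd instance
gdb -p "$PID" -batch \
    -ex 'thread apply all bt' \
    -ex 'thread 16' \
    -ex 'bt'
```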
Created attachment 814413 [details]
gdb session server 2
GDB session from server #2.
More or less the same as the first session, but fewer threads.
Created attachment 814415 [details]
main script used to reproduce the issue
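The actual reproducer is only available as the attachment above. Purely as an illustration of the parallel-update pattern described in the comments, a loop along the following lines could be used; the host name, bind DN, and password are made-up placeholders, and moduser3.ldif is the file quoted earlier:

```sh
# Illustrative sketch only - not the attached reproducer script.
# Sends several concurrent modify requests over StartTLS (-ZZ),
# mirroring the "parallel update requests" scenario from this bug.
for i in $(seq 1 20); do
    ldapmodify -x -ZZ \
        -H ldap://server1.example.org \
        -D "cn=Manager,dc=example,dc=org" -w secret \
        -f moduser3.ldif &
done
wait
```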
This doesn't look like a deadlock where thread A holds lock A and is waiting on lock B, and thread B holds lock B and is waiting on lock A. It looks like the server thinks it needs to read something from the other server, but the other server is not sending anything. What is the other server doing when this server is in read()?

They both wait in read(). I attached gdb sessions from both servers.

Ok, I see. This is a deadlock between two different servers. I wonder if we need two different locks here - one for accept and one for connect - rather than a single lock for both accept and connect. In this case, neither connect can proceed because both connects are waiting for the other side to do the accept, but the accept is blocked by the connect. That would be something like this:

```c
static int
tlsm_session_accept_or_connect( tls_session *session, int is_accept )
{
	...
#ifdef LDAP_R_COMPILE
	ldap_pvt_thread_mutex_t *local_pem_mutex = is_accept ?
		&tlsm_accept_pem_mutex : &tlsm_connect_pem_mutex;
#endif
	...
	if ( pem_module ) {
		LDAP_MUTEX_LOCK( local_pem_mutex );
	}
	rc = SSL_ForceHandshake( s );
	if ( pem_module ) {
		LDAP_MUTEX_UNLOCK( local_pem_mutex );
	}
```

That is, have separate mutexes for accept and connect. This assumes the PEM module and the rest of NSS are thread safe enough to have a connect thread and an accept thread running at the same time. On the other hand, perhaps the PEM module and the rest of NSS have improved their thread safety so that the mutex is no longer required?

It would be great to have thread safety inside NSS and remove the locks protecting SSL_ForceHandshake().

Elio? Has NSS improved in this regard and is it safe to remove the locks?

(In reply to Jan Synacek from comment #24)
> It would be great to have thread safety inside NSS and remove the locks
> protecting SSL_ForceHandshake().
>
> Elio? Has NSS improved in this regard and is it safe to remove the locks?

As far as the PEM module goes we still have issues. See Bug 736410 - pem_Initialize should return CKR_CANT_LOCK as it isn't thread safe yet. We do likewise on RHEL. From the PKCS #11 standard:

CKR_CANT_LOCK: This value can only be returned by C_Initialize. It means that the type of locking requested by the application for thread-safety is not available in this library, and so the application cannot make use of this library in the specified fashion.

I defer to Bob for his assessment of NSS in general.

The PEM module was changed to return CKR_CANT_LOCK, which will cause NSS to do the appropriate locking for PEM. NSS, in general, is thread safe, though certain things (like calling multiple reads on the same SSL socket on the same thread) were supposed to be the responsibility of the caller. Because that was historically too complex for some servers, NSS has separate read and write locks in place to prevent multiple simultaneous reads or writes (or closes while still reading and writing). So you should be safe without setting locks specifically for NSS (though you may still need to protect your own data structures). The tlsm_ functions look like wrappers around NSS and would presumably need to protect their own data structures.

bob

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1426.html