Bug 818570 - NSS segfaults in nssTrustDomain_LockCertCache()
NSS segfaults in nssTrustDomain_LockCertCache()
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: nss (Show other bugs)
6.3
Unspecified Unspecified
high Severity high
: beta
: ---
Assigned To: Elio Maldonado Batiz
BaseOS QE Security Team
:
Depends On: 818572
Blocks: 960054 719895
  Show dependency treegraph
 
Reported: 2012-05-03 08:11 EDT by Karel Srot
Modified: 2013-05-23 16:20 EDT (History)
3 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-05-20 21:42:56 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)

  None (edit)
Description Karel Srot 2012-05-03 08:11:35 EDT
This is a RHEL6 clone of following Fedora bug. We need this fixed because of Bug 719895.

+++ This bug was initially created as a clone of Bug #804229 +++

NSS has a segfault which is aborting ldap clients using SSL to connect
to the LDAP server.

I'm not sure where to pin the blame, with openldap or with NSS. But I
do understand exactly what is causing the problem. First of all it's
important to understand openldap is using NSS as it's TLS provider
(implemented in libraries/libldap/tls_m.c in the openldap
distribution).

NSS segfaults in nssTrustDomain_LockCertCache() because the td
parameter is NULL. nssCertificate_Destroy() calls
nssTrustDomain_LockCertCache() passing it the td obtained by
STAN_GetDefaultTrustDomain().

STAN_GetDefaultTrustDomain() passes back the global
g_default_trust_domain which is initialized in
STAN_LoadDefaultNSS3TrustDomain() and freed and set to NULL in
STAN_Shutdown().

The actual order of events when an openldap SSL connection is
destroyed is this (as seen in the attached backtrace):

ldap_ld_free() calls ldap_int_tls_destroy() which in turn calls
NSS_ShutdownContext() which calls STAN_Shutdown() which causes
g_default_trust_domain to be set to NULL.

Then later ldap_ld_free() calls ber_sockbuf_free() which invokes
tlsm_sb_remove() invoking PR_Close() invoking ssl_close() invoking
CERT_DestroyCertificate which then passes NULL to
nssTrustDomain_LockCertCache() and we SEGFAULT.

One or both of the following seem to be a culprit:

1) The order of shutdown in openldap seems wrong. The SSL context
should not be destroyed before freeing the SSL socket (or so one would
think).

2) NSS should be more robust with handling the order of shutdown
operations. If a shutdown operation has already occurred (e.g. the
TrustDomain has been freed) then during the shutdown/close operation
that should be tolerated without error.

I thought I would take a stab at patching NSS to protect against a
NULL trust domain. But the more I looked at the code the less certain
I was of how that might change the expectations in other parts of the
code. In fact I have previously dealt with NSS shutdown problems and
discovered NSS shutdown is exceedingly complicated and fragile. I'm of
the opinion only an NSS developer should touch this code.

With respect to issue 1, the ordering of operations in openldap, Rich
Megginson tells me that openldap has been doing it this way for years
without problems, although Rich conceeds the ordering would appear to
be wrong. I don't know how long openldap has had an NSS backend, but I
think it's fairly recent. I believe normally openldap uses openssl
which may be tolerant of the order of operations. FWIW openldap also
seems to support GNU TLS.

Like I said in the beginning I don't know who to pin the blame on
here, all I know is I can reliably make openldap clients segfault.

All of this came about because of bug #719895. In bug #719895
freeradius is acting as a SSL client connecting to an openldap SSL
server. The ldap code in freeradius had some issues, including not
calling LDAP_OPT_X_TLS_NEWCTX, I cleaned up that code and went through
all the test exercises which are a bit convoluted. I factored out the
ldap code into a simple C program which I've also attached (which
hopefully I can make private because it references internal test
virtual machines). That test program will segfault consistently
without the hassel of doing a dance with the radius server.

Also I should note I see this behavior on both Fedora 16 and RHEL 6.2,
the package versions are:

nss-3.13.1-11.fc16.i686
openldap-2.4.26-6.fc16.i686

--- Additional comment from jdennis@redhat.com on 2012-03-16 18:58:26 EDT ---

Created attachment 570731 [details]
GDB backtrace

This backtrace shows with openldap that STAN_Shutdown is called before the SSL socket is closed, that causes g_default_trust_domain to be set to NULL which is subsequently passed to nssTrustDomain_LockCertCache() causing the segfault.

--- Additional comment from jdennis@redhat.com on 2012-03-16 19:04:34 EDT ---

Created attachment 570733 [details]
C program reproducer

This C program will reproduce the problem. The initial comment at the head of the file gives the gcc compile/link command.

I've been running the test on vm-105.idm.lab.bos.redhat.com, a virtual machine I'll leave around for a little while, it's already set up with the openldap certs, etc.

I also have a shell script which can set up the certs, etc. ping me if you need that as well.

--- Additional comment from kengert@redhat.com on 2012-03-17 07:06:16 EDT ---

John, thanks for your analysis.

Yes, I've also had to learn that NSS is very picky of correct init and shutdown. After you have called nss_shutdown, any use of NSS data structures or APIs is undefined. 

The application must ensure that nss_shutdown is only done after any NSS resources have been released.

--- Additional comment from kengert@redhat.com on 2012-03-17 07:20:29 EDT ---

The earlier code that triggers the shutdown is triggered by

  0x0079bc80 in ldap_ld_free (ld=0x80181358, close=1, sctrls=0x0, cctrls=0x0)
  at unbind.c:206


A couple of lines later some other cleanup code triggers the shutdown of an SSL connection

  0x0079bcd5 in ldap_ld_free (ld=0x80181358, close=1, sctrls=0x0, cctrls=0x0)
  at unbind.c:220


At the very least the order of cleanup should be adjusted, freeing of the tls context must be done after any associated connections have been released.


(Does NSS use a counter mechanism to match the number of init/shutdown calls? If not, you would be at additional risk if an application used multiple ldap connections with overlapping lifetime.)

--- Additional comment from jdennis@redhat.com on 2012-03-17 19:47:07 EDT ---

My personal belief is that both NSS and openlap need to be fixed. Here's my reasoning:

NSS should never segfault, even if the API is abused. NSS should protect against dereferencing a NULL value and either silently ignore the condition if that makes sense or return an error code, but it should never segfault, ever.

openldap should reorder it's shutdown/close operations, it seems clearly wrong, at least at first blush. Let me explain, openlap defines a set of function pointers that TLS providers populate. I have not seen any documentation as to exactly what operations are supposed to occur when the upper layer calls a TLS function. I suppose it's possible tls_m.c might be performing an operation in a exported function it's not supposed to, in which case the NSS TLS provider (tls_m.c) needs to be adjusted. But when I look at the upper layer (ldap_ld_free) it seems wrong to me, but I'm new to that body of code so I don't want to rush to conclusions.

--- Additional comment from jvcelak@redhat.com on 2012-03-18 18:31:41 EDT ---

> openldap should reorder it's shutdown/close operations, it seems clearly wrong,
> at least at first blush. Let me explain, openlap defines a set of function
> pointers that TLS providers populate. I have not seen any documentation as to
> exactly what operations are supposed to occur when the upper layer calls a TLS
> function. I suppose it's possible tls_m.c might be performing an operation in a
> exported function it's not supposed to, in which case the NSS TLS provider
> (tls_m.c) needs to be adjusted. But when I look at the upper layer
> (ldap_ld_free) it seems wrong to me, but I'm new to that body of code so I
> don't want to rush to conclusions.

This is how it works in OpenLDAP, from what I know:

'tlsm_init' and 'tlsm_destroy' are called just once, at the library initialization and at the exit. In fact we do almost nothing here.

For each OpenLDAP TLS context in the application, 'tlsm_ctx_new' and 'tlsm_ctx_init' is called. We just initialize the structures and still don't call any NSS function - the initialization is deferred till we really need it. The real initialization happens in 'tlsm_deferred_ctx_init' which is protected by 'PR_CallOnceWithArg'. There we call 'NSS_InitContext' for the first time. (Each OpenLDAP TLS context has its NSS context.)

When the OpenLDAP TLS connection is closed, context is destroyed in 'tlsm_ctx_free'. The NSS context is closed using 'NSS_ShutdownContext'. 

For cleaning up, 'NSS_RegisterShutdown' is used. 'SSL_ShutdownServerSessionIDCache' is called. After that if PEM module was loaded, it is destroyed by calling 'SECMOD_UnloadUserModule' and 'SECMOD_DestoryModule'.

Is there anything particular you are interested in?

--- Additional comment from jdennis@redhat.com on 2012-03-19 09:30:38 EDT ---

re comment #6

Thanks Jan, yes I too was able to read the tls_m.c code and see what it does. What I was trying to suggest is that either ldap_ld_free() or some of the code in tls_m.c has things occurring in the wrong order, this should be obvious from the attached backtrace. Both Kai and I are suggesting that CERT_DestroyCertificate should not be called after the NSS context it was created from to was destroyed, it's the wrong order of operations.

In other words what is happening is an SSL socket is created using an NSS context (i.e. the NSS context must exist prior to creating the SSL socket). But when the socket is disposed of the order of operations isn't mirrored in reverse. The socket should be disposed of first because it belongs to the NSS context, then the NSS context should be destroyed. But openldap is performing the opposite, it's destroying the NSS context first and then the socket.

So it seems to me that openldap got this wrong due to one or both of the following: 

In ldap_ld_free() it first calls ldap_int_tls_destroy() which via the tls function pointers will call tlsm_ctx_free destroying the NSS context. Then later ldap_ld_free() calls ber_sockbuf_free() which via the tls function pointers calls NSS to close the socket. This is the wrong order of operations.

So either ldap_int_tls_destroy() needs to be moved after ber_sockbuf_free() or what's performed in the tls functions exported by tls_m.c need to be modified it guarantees the right order of operations.

Make sense now?

--- Additional comment from jvcelak@redhat.com on 2012-03-19 11:01:22 EDT ---

(In reply to comment #7)
> Thanks Jan, yes I too was able to read the tls_m.c code and see what it does.

Sorry, I was trying to help.

> In other words what is happening is an SSL socket is created using an NSS
> context (i.e. the NSS context must exist prior to creating the SSL socket). But
> when the socket is disposed of the order of operations isn't mirrored in
> reverse. The socket should be disposed of first because it belongs to the NSS
> context, then the NSS context should be destroyed. But openldap is performing
> the opposite, it's destroying the NSS context first and then the socket.

True.

> So either ldap_int_tls_destroy() needs to be moved after ber_sockbuf_free() or
> what's performed in the tls functions exported by tls_m.c need to be modified
> it guarantees the right order of operations.

I have just quickly scanned OpenSSL and GnuTLS backends, and it seems that this change would be possible without harming the other backends. As for OpenSSL, I found some references that closing the connection (SSL_Shutdown) should appear before cleaning up the resources (SSL_CTX_free). Which is the same case. I'm not very sure about GnuTLS. But the current order is illogical IMO.

--- Additional comment from kengert@redhat.com on 2012-04-16 13:27:08 EDT ---

I understand that openldap is the culprit, reassigning bug.

--- Additional comment from kengert@redhat.com on 2012-04-16 13:28:49 EDT ---

Just some thoughts about the proposal that NSS should get fixed, too:

I agree in general with comment 5 that NSS should get some additional defensiveness about incorrect use, but in the past some developers have argued, that NSS should behave exactly like it is now -- why? because:
this allows users of NSS to immediately know they are doing it wrong.
If NSS used more defensiveness, then users of NSS would complain that NSS leaks memory, or misbehaves, etc., and those users would no longer be able to immediately understand what they did wrong - that the user of NSS is the culprit.
I didn't invent this strategy, I'm saying that to explain why there is sometimes lack of defensiveness, and a tendency to punish bad use of NSS with a crash.
Regardingless of this discussion, IMHO it's clear that openldap must be fixed anyway.

--- Additional comment from jdennis@redhat.com on 2012-04-16 14:08:45 EDT ---

re comment #10

(Not trying to shoot the messenger here Kai, I know you're just passing along information, but ...) 

IMHO the notion a system library which an application links against should cause the application to segfault by design rather than generating an error is so fundamentally wrong on so many counts I don't know where to begin.

Generating a segfault and aborting the application is NEVER the correct behaviour.

As an aside it's very easy to get NSS to segfault, one only needs to omit some critical piece of configuration and it aborts deep inside some obscure piece of code because some obscure pointer is NULL. So what did the application do incorrectly? Well you have to take the NSS source code and the backtrace in the debugger and work backwards to figure out where and how that variable got initialized. Woe to you if there is more than one way. It's not an easy process. Imaging how much simpler if instead NSS reported an error like "Security context not initialized". The notion it's better to segfault and leave it up to someone to spend hours or days analysing the code to find the problem rather than reporting the condition is beyond me.

Many of the segfault by design problems can be caught during development or testing. Howver that's not always the case. Imagine if the application was deployed with these lurking time bombs whose purpose is to destroy the application in the field. Now instead imagine if the deployed application provided a meaningful error message in it's log file and exited cleanly. Which scenario is preferred?

--- Additional comment from jvcelak@redhat.com on 2012-04-17 03:42:20 EDT ---

(In reply to comment #9)
> I understand that openldap is the culprit, reassigning bug.

I already have this cloned for OpenLDAP as #808464. Ressigning back.
Comment 1 RHEL Product and Program Management 2012-07-10 03:42:26 EDT
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.
Comment 2 RHEL Product and Program Management 2012-07-10 21:43:06 EDT
This request was erroneously removed from consideration in Red Hat Enterprise Linux 6.4, which is currently under development.  This request will be evaluated for inclusion in Red Hat Enterprise Linux 6.4.
Comment 3 David Spurek 2012-10-03 05:30:26 EDT
Is possible fix this in rhel 6.4?
Comment 6 Karel Srot 2012-12-12 03:50:24 EST
This bug didn't make it to 6.4, moving to 6.5.
Comment 8 Karel Srot 2013-05-07 03:35:36 EDT
Fedora 16 bug 804229 ended as WONTFIX. Can we ask for a devel review on this one? Thank you.
Comment 9 Elio Maldonado Batiz 2013-05-07 12:09:02 EDT
See Bob Relyea's comment https://bugzilla.redhat.com/show_bug.cgi?id=804229#c15
It must be fixed upstream and this bug is an example of the broader problem of how can NSS detect and report to applications that there was an error probably due the client code not using the API's correctly so that nss reports errors correctly and not crash the client or provide misleading diagnostics. NSS has layers, NSS and the cryptographic module for example, and the information available at each level is different and one layer has to make very minimal assumptions about what goes on the other one.  This calls for an error handling strategy. I have seen the same problem on similar libraries I have worked on the past and is good error handling is an easy task to tackle. Once logged upstream it will take a lot of discussion as to the suitable design, implementation, unforeseen side effects, and many iterations. It will be quite some time before it can be dealt with properly.
Comment 10 Eric Paris 2013-05-20 21:42:56 EDT
This bug has been reviewed by the development team and will not be 'fixed' in 6.5.  It seems we all agree that the real problem lays in the application mis-using NSS, not NSS itself.  If applications are found to have problems they should be fixed (as was done here).

Our devel team may work upstream to support more error reporting instead of segfaults on misuse, but in RHEL6, if a segfault is found the application should be fixed.

I am going to close this bug as we do not believe the possibility of NSS segfault-ing when incorrectly used by an application is solvable in RHEL6.  Our suggestion is to fix the calling application.

Note You need to log in before you can comment on or make changes to this bug.