Red Hat Bugzilla – Bug 804229
NSS segfaults in nssTrustDomain_LockCertCache()
Last modified: 2013-02-26 10:20:27 EST
NSS has a segfault which is aborting ldap clients using SSL to connect
to the LDAP server.
I'm not sure where to pin the blame, with openldap or with NSS. But I
do understand exactly what is causing the problem. First of all it's
important to understand openldap is using NSS as it's TLS provider
(implemented in libraries/libldap/tls_m.c in the openldap
NSS segfaults in nssTrustDomain_LockCertCache() because the td
parameter is NULL. nssCertificate_Destroy() calls
nssTrustDomain_LockCertCache() passing it the td obtained by
STAN_GetDefaultTrustDomain() passes back the global
g_default_trust_domain which is initialized in
STAN_LoadDefaultNSS3TrustDomain() and freed and set to NULL in
The actual order of events when an openldap SSL connection is
destroyed is this (as seen in the attached backtrace):
ldap_ld_free() calls ldap_int_tls_destroy() which in turn calls
NSS_ShutdownContext() which calls STAN_Shutdown() which causes
g_default_trust_domain to be set to NULL.
Then later ldap_ld_free() calls ber_sockbuf_free() which invokes
tlsm_sb_remove() invoking PR_Close() invoking ssl_close() invoking
CERT_DestroyCertificate which then passes NULL to
nssTrustDomain_LockCertCache() and we SEGFAULT.
One or both of the following seem to be a culprit:
1) The order of shutdown in openldap seems wrong. The SSL context
should not be destroyed before freeing the SSL socket (or so one would
2) NSS should be more robust with handling the order of shutdown
operations. If a shutdown operation has already occurred (e.g. the
TrustDomain has been freed) then during the shutdown/close operation
that should be tolerated without error.
I thought I would take a stab at patching NSS to protect against a
NULL trust domain. But the more I looked at the code the less certain
I was of how that might change the expectations in other parts of the
code. In fact I have previously dealt with NSS shutdown problems and
discovered NSS shutdown is exceedingly complicated and fragile. I'm of
the opinion only an NSS developer should touch this code.
With respect to issue 1, the ordering of operations in openldap, Rich
Megginson tells me that openldap has been doing it this way for years
without problems, although Rich conceeds the ordering would appear to
be wrong. I don't know how long openldap has had an NSS backend, but I
think it's fairly recent. I believe normally openldap uses openssl
which may be tolerant of the order of operations. FWIW openldap also
seems to support GNU TLS.
Like I said in the beginning I don't know who to pin the blame on
here, all I know is I can reliably make openldap clients segfault.
All of this came about because of bug #719895. In bug #719895
freeradius is acting as a SSL client connecting to an openldap SSL
server. The ldap code in freeradius had some issues, including not
calling LDAP_OPT_X_TLS_NEWCTX, I cleaned up that code and went through
all the test exercises which are a bit convoluted. I factored out the
ldap code into a simple C program which I've also attached (which
hopefully I can make private because it references internal test
virtual machines). That test program will segfault consistently
without the hassel of doing a dance with the radius server.
Also I should note I see this behavior on both Fedora 16 and RHEL 6.2,
the package versions are:
Created attachment 570731 [details]
This backtrace shows with openldap that STAN_Shutdown is called before the SSL socket is closed, that causes g_default_trust_domain to be set to NULL which is subsequently passed to nssTrustDomain_LockCertCache() causing the segfault.
John, thanks for your analysis.
Yes, I've also had to learn that NSS is very picky of correct init and shutdown. After you have called nss_shutdown, any use of NSS data structures or APIs is undefined.
The application must ensure that nss_shutdown is only done after any NSS resources have been released.
The earlier code that triggers the shutdown is triggered by
0x0079bc80 in ldap_ld_free (ld=0x80181358, close=1, sctrls=0x0, cctrls=0x0)
A couple of lines later some other cleanup code triggers the shutdown of an SSL connection
0x0079bcd5 in ldap_ld_free (ld=0x80181358, close=1, sctrls=0x0, cctrls=0x0)
At the very least the order of cleanup should be adjusted, freeing of the tls context must be done after any associated connections have been released.
(Does NSS use a counter mechanism to match the number of init/shutdown calls? If not, you would be at additional risk if an application used multiple ldap connections with overlapping lifetime.)
My personal belief is that both NSS and openlap need to be fixed. Here's my reasoning:
NSS should never segfault, even if the API is abused. NSS should protect against dereferencing a NULL value and either silently ignore the condition if that makes sense or return an error code, but it should never segfault, ever.
openldap should reorder it's shutdown/close operations, it seems clearly wrong, at least at first blush. Let me explain, openlap defines a set of function pointers that TLS providers populate. I have not seen any documentation as to exactly what operations are supposed to occur when the upper layer calls a TLS function. I suppose it's possible tls_m.c might be performing an operation in a exported function it's not supposed to, in which case the NSS TLS provider (tls_m.c) needs to be adjusted. But when I look at the upper layer (ldap_ld_free) it seems wrong to me, but I'm new to that body of code so I don't want to rush to conclusions.
> openldap should reorder it's shutdown/close operations, it seems clearly wrong,
> at least at first blush. Let me explain, openlap defines a set of function
> pointers that TLS providers populate. I have not seen any documentation as to
> exactly what operations are supposed to occur when the upper layer calls a TLS
> function. I suppose it's possible tls_m.c might be performing an operation in a
> exported function it's not supposed to, in which case the NSS TLS provider
> (tls_m.c) needs to be adjusted. But when I look at the upper layer
> (ldap_ld_free) it seems wrong to me, but I'm new to that body of code so I
> don't want to rush to conclusions.
This is how it works in OpenLDAP, from what I know:
'tlsm_init' and 'tlsm_destroy' are called just once, at the library initialization and at the exit. In fact we do almost nothing here.
For each OpenLDAP TLS context in the application, 'tlsm_ctx_new' and 'tlsm_ctx_init' is called. We just initialize the structures and still don't call any NSS function - the initialization is deferred till we really need it. The real initialization happens in 'tlsm_deferred_ctx_init' which is protected by 'PR_CallOnceWithArg'. There we call 'NSS_InitContext' for the first time. (Each OpenLDAP TLS context has its NSS context.)
When the OpenLDAP TLS connection is closed, context is destroyed in 'tlsm_ctx_free'. The NSS context is closed using 'NSS_ShutdownContext'.
For cleaning up, 'NSS_RegisterShutdown' is used. 'SSL_ShutdownServerSessionIDCache' is called. After that if PEM module was loaded, it is destroyed by calling 'SECMOD_UnloadUserModule' and 'SECMOD_DestoryModule'.
Is there anything particular you are interested in?
re comment #6
Thanks Jan, yes I too was able to read the tls_m.c code and see what it does. What I was trying to suggest is that either ldap_ld_free() or some of the code in tls_m.c has things occurring in the wrong order, this should be obvious from the attached backtrace. Both Kai and I are suggesting that CERT_DestroyCertificate should not be called after the NSS context it was created from to was destroyed, it's the wrong order of operations.
In other words what is happening is an SSL socket is created using an NSS context (i.e. the NSS context must exist prior to creating the SSL socket). But when the socket is disposed of the order of operations isn't mirrored in reverse. The socket should be disposed of first because it belongs to the NSS context, then the NSS context should be destroyed. But openldap is performing the opposite, it's destroying the NSS context first and then the socket.
So it seems to me that openldap got this wrong due to one or both of the following:
In ldap_ld_free() it first calls ldap_int_tls_destroy() which via the tls function pointers will call tlsm_ctx_free destroying the NSS context. Then later ldap_ld_free() calls ber_sockbuf_free() which via the tls function pointers calls NSS to close the socket. This is the wrong order of operations.
So either ldap_int_tls_destroy() needs to be moved after ber_sockbuf_free() or what's performed in the tls functions exported by tls_m.c need to be modified it guarantees the right order of operations.
Make sense now?
(In reply to comment #7)
> Thanks Jan, yes I too was able to read the tls_m.c code and see what it does.
Sorry, I was trying to help.
> In other words what is happening is an SSL socket is created using an NSS
> context (i.e. the NSS context must exist prior to creating the SSL socket). But
> when the socket is disposed of the order of operations isn't mirrored in
> reverse. The socket should be disposed of first because it belongs to the NSS
> context, then the NSS context should be destroyed. But openldap is performing
> the opposite, it's destroying the NSS context first and then the socket.
> So either ldap_int_tls_destroy() needs to be moved after ber_sockbuf_free() or
> what's performed in the tls functions exported by tls_m.c need to be modified
> it guarantees the right order of operations.
I have just quickly scanned OpenSSL and GnuTLS backends, and it seems that this change would be possible without harming the other backends. As for OpenSSL, I found some references that closing the connection (SSL_Shutdown) should appear before cleaning up the resources (SSL_CTX_free). Which is the same case. I'm not very sure about GnuTLS. But the current order is illogical IMO.
I understand that openldap is the culprit, reassigning bug.
Just some thoughts about the proposal that NSS should get fixed, too:
I agree in general with comment 5 that NSS should get some additional defensiveness about incorrect use, but in the past some developers have argued, that NSS should behave exactly like it is now -- why? because:
this allows users of NSS to immediately know they are doing it wrong.
If NSS used more defensiveness, then users of NSS would complain that NSS leaks memory, or misbehaves, etc., and those users would no longer be able to immediately understand what they did wrong - that the user of NSS is the culprit.
I didn't invent this strategy, I'm saying that to explain why there is sometimes lack of defensiveness, and a tendency to punish bad use of NSS with a crash.
Regardingless of this discussion, IMHO it's clear that openldap must be fixed anyway.
re comment #10
(Not trying to shoot the messenger here Kai, I know you're just passing along information, but ...)
IMHO the notion a system library which an application links against should cause the application to segfault by design rather than generating an error is so fundamentally wrong on so many counts I don't know where to begin.
Generating a segfault and aborting the application is NEVER the correct behaviour.
As an aside it's very easy to get NSS to segfault, one only needs to omit some critical piece of configuration and it aborts deep inside some obscure piece of code because some obscure pointer is NULL. So what did the application do incorrectly? Well you have to take the NSS source code and the backtrace in the debugger and work backwards to figure out where and how that variable got initialized. Woe to you if there is more than one way. It's not an easy process. Imaging how much simpler if instead NSS reported an error like "Security context not initialized". The notion it's better to segfault and leave it up to someone to spend hours or days analysing the code to find the problem rather than reporting the condition is beyond me.
Many of the segfault by design problems can be caught during development or testing. Howver that's not always the case. Imagine if the application was deployed with these lurking time bombs whose purpose is to destroy the application in the field. Now instead imagine if the deployed application provided a meaningful error message in it's log file and exited cleanly. Which scenario is preferred?
(In reply to comment #9)
> I understand that openldap is the culprit, reassigning bug.
I already have this cloned for OpenLDAP as #808464. Ressigning back.
This message is a reminder that Fedora 16 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 16. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora
'version' of '16'.
Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version prior to Fedora 16's end of life.
Bug Reporter: Thank you for reporting this issue and we are sorry that
we may not be able to fix it before Fedora 16 is end of life. If you
would still like to see this bug fixed and are able to reproduce it
against a later version of Fedora, you are encouraged to click on
"Clone This Bug" and open it against that version of Fedora.
Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.
The process we are following is described here:
I understand the openldap code has been fixed but doesn't the fundamental problem of not checking values remain in NSS? Should this be opened in the NSS bug tracker?
It's hardly the only place NSS doesn't check values. If you call NSS functions without a successfull NSS_Init, you will crash.
Changing that would be an upstream bug and would cover a lot more than just nssTrustDomain_LockCertCache(). While it's probably a reasonable thing to ask for (as NSS gets used by more less well behaved clients), it's probably not worth tracking a single function in a read hat bug.
Fedora 16 changed to end-of-life (EOL) status on 2013-02-12. Fedora 16 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.
If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version.
Thank you for reporting this bug and we are sorry it could not be fixed.