Bug 804229
Summary: | NSS segfaults in nssTrustDomain_LockCertCache() | ||||||
---|---|---|---|---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | John Dennis <jdennis> | ||||
Component: | nss | Assignee: | Elio Maldonado Batiz <emaldona> | ||||
Status: | CLOSED WONTFIX | QA Contact: | Fedora Extras Quality Assurance <extras-qa> | ||||
Severity: | urgent | Docs Contact: | |||||
Priority: | urgent | ||||||
Version: | 16 | CC: | emaldona, extras-orphan, jsynacek, jvcelak, kdudka, kengert, notting, rmeggins, rrelyea | ||||
Target Milestone: | --- | ||||||
Target Release: | --- | ||||||
Hardware: | Unspecified | ||||||
OS: | Unspecified | ||||||
Whiteboard: | |||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||
Doc Text: | Story Points: | --- | |||||
Clone Of: | |||||||
: | 808464 812974 (view as bug list) | Environment: | |||||
Last Closed: | 2013-02-26 15:20:20 UTC | Type: | --- | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Bug Depends On: | |||||||
Bug Blocks: | 808464, 812974 | ||||||
Attachments: |
|
Description
John Dennis
2012-03-16 22:54:09 UTC
Created attachment 570731 [details]
GDB backtrace
This backtrace shows with openldap that STAN_Shutdown is called before the SSL socket is closed, that causes g_default_trust_domain to be set to NULL which is subsequently passed to nssTrustDomain_LockCertCache() causing the segfault.
John, thanks for your analysis. Yes, I've also had to learn that NSS is very picky of correct init and shutdown. After you have called nss_shutdown, any use of NSS data structures or APIs is undefined. The application must ensure that nss_shutdown is only done after any NSS resources have been released. The earlier code that triggers the shutdown is triggered by 0x0079bc80 in ldap_ld_free (ld=0x80181358, close=1, sctrls=0x0, cctrls=0x0) at unbind.c:206 A couple of lines later some other cleanup code triggers the shutdown of an SSL connection 0x0079bcd5 in ldap_ld_free (ld=0x80181358, close=1, sctrls=0x0, cctrls=0x0) at unbind.c:220 At the very least the order of cleanup should be adjusted, freeing of the tls context must be done after any associated connections have been released. (Does NSS use a counter mechanism to match the number of init/shutdown calls? If not, you would be at additional risk if an application used multiple ldap connections with overlapping lifetime.) My personal belief is that both NSS and openlap need to be fixed. Here's my reasoning: NSS should never segfault, even if the API is abused. NSS should protect against dereferencing a NULL value and either silently ignore the condition if that makes sense or return an error code, but it should never segfault, ever. openldap should reorder it's shutdown/close operations, it seems clearly wrong, at least at first blush. Let me explain, openlap defines a set of function pointers that TLS providers populate. I have not seen any documentation as to exactly what operations are supposed to occur when the upper layer calls a TLS function. I suppose it's possible tls_m.c might be performing an operation in a exported function it's not supposed to, in which case the NSS TLS provider (tls_m.c) needs to be adjusted. But when I look at the upper layer (ldap_ld_free) it seems wrong to me, but I'm new to that body of code so I don't want to rush to conclusions. > openldap should reorder it's shutdown/close operations, it seems clearly wrong,
> at least at first blush. Let me explain, openlap defines a set of function
> pointers that TLS providers populate. I have not seen any documentation as to
> exactly what operations are supposed to occur when the upper layer calls a TLS
> function. I suppose it's possible tls_m.c might be performing an operation in a
> exported function it's not supposed to, in which case the NSS TLS provider
> (tls_m.c) needs to be adjusted. But when I look at the upper layer
> (ldap_ld_free) it seems wrong to me, but I'm new to that body of code so I
> don't want to rush to conclusions.
This is how it works in OpenLDAP, from what I know:
'tlsm_init' and 'tlsm_destroy' are called just once, at the library initialization and at the exit. In fact we do almost nothing here.
For each OpenLDAP TLS context in the application, 'tlsm_ctx_new' and 'tlsm_ctx_init' is called. We just initialize the structures and still don't call any NSS function - the initialization is deferred till we really need it. The real initialization happens in 'tlsm_deferred_ctx_init' which is protected by 'PR_CallOnceWithArg'. There we call 'NSS_InitContext' for the first time. (Each OpenLDAP TLS context has its NSS context.)
When the OpenLDAP TLS connection is closed, context is destroyed in 'tlsm_ctx_free'. The NSS context is closed using 'NSS_ShutdownContext'.
For cleaning up, 'NSS_RegisterShutdown' is used. 'SSL_ShutdownServerSessionIDCache' is called. After that if PEM module was loaded, it is destroyed by calling 'SECMOD_UnloadUserModule' and 'SECMOD_DestoryModule'.
Is there anything particular you are interested in?
re comment #6 Thanks Jan, yes I too was able to read the tls_m.c code and see what it does. What I was trying to suggest is that either ldap_ld_free() or some of the code in tls_m.c has things occurring in the wrong order, this should be obvious from the attached backtrace. Both Kai and I are suggesting that CERT_DestroyCertificate should not be called after the NSS context it was created from to was destroyed, it's the wrong order of operations. In other words what is happening is an SSL socket is created using an NSS context (i.e. the NSS context must exist prior to creating the SSL socket). But when the socket is disposed of the order of operations isn't mirrored in reverse. The socket should be disposed of first because it belongs to the NSS context, then the NSS context should be destroyed. But openldap is performing the opposite, it's destroying the NSS context first and then the socket. So it seems to me that openldap got this wrong due to one or both of the following: In ldap_ld_free() it first calls ldap_int_tls_destroy() which via the tls function pointers will call tlsm_ctx_free destroying the NSS context. Then later ldap_ld_free() calls ber_sockbuf_free() which via the tls function pointers calls NSS to close the socket. This is the wrong order of operations. So either ldap_int_tls_destroy() needs to be moved after ber_sockbuf_free() or what's performed in the tls functions exported by tls_m.c need to be modified it guarantees the right order of operations. Make sense now? (In reply to comment #7) > Thanks Jan, yes I too was able to read the tls_m.c code and see what it does. Sorry, I was trying to help. > In other words what is happening is an SSL socket is created using an NSS > context (i.e. the NSS context must exist prior to creating the SSL socket). But > when the socket is disposed of the order of operations isn't mirrored in > reverse. The socket should be disposed of first because it belongs to the NSS > context, then the NSS context should be destroyed. But openldap is performing > the opposite, it's destroying the NSS context first and then the socket. True. > So either ldap_int_tls_destroy() needs to be moved after ber_sockbuf_free() or > what's performed in the tls functions exported by tls_m.c need to be modified > it guarantees the right order of operations. I have just quickly scanned OpenSSL and GnuTLS backends, and it seems that this change would be possible without harming the other backends. As for OpenSSL, I found some references that closing the connection (SSL_Shutdown) should appear before cleaning up the resources (SSL_CTX_free). Which is the same case. I'm not very sure about GnuTLS. But the current order is illogical IMO. I understand that openldap is the culprit, reassigning bug. Just some thoughts about the proposal that NSS should get fixed, too: I agree in general with comment 5 that NSS should get some additional defensiveness about incorrect use, but in the past some developers have argued, that NSS should behave exactly like it is now -- why? because: this allows users of NSS to immediately know they are doing it wrong. If NSS used more defensiveness, then users of NSS would complain that NSS leaks memory, or misbehaves, etc., and those users would no longer be able to immediately understand what they did wrong - that the user of NSS is the culprit. I didn't invent this strategy, I'm saying that to explain why there is sometimes lack of defensiveness, and a tendency to punish bad use of NSS with a crash. Regardingless of this discussion, IMHO it's clear that openldap must be fixed anyway. re comment #10 (Not trying to shoot the messenger here Kai, I know you're just passing along information, but ...) IMHO the notion a system library which an application links against should cause the application to segfault by design rather than generating an error is so fundamentally wrong on so many counts I don't know where to begin. Generating a segfault and aborting the application is NEVER the correct behaviour. As an aside it's very easy to get NSS to segfault, one only needs to omit some critical piece of configuration and it aborts deep inside some obscure piece of code because some obscure pointer is NULL. So what did the application do incorrectly? Well you have to take the NSS source code and the backtrace in the debugger and work backwards to figure out where and how that variable got initialized. Woe to you if there is more than one way. It's not an easy process. Imaging how much simpler if instead NSS reported an error like "Security context not initialized". The notion it's better to segfault and leave it up to someone to spend hours or days analysing the code to find the problem rather than reporting the condition is beyond me. Many of the segfault by design problems can be caught during development or testing. Howver that's not always the case. Imagine if the application was deployed with these lurking time bombs whose purpose is to destroy the application in the field. Now instead imagine if the deployed application provided a meaningful error message in it's log file and exited cleanly. Which scenario is preferred? (In reply to comment #9) > I understand that openldap is the culprit, reassigning bug. I already have this cloned for OpenLDAP as #808464. Ressigning back. This message is a reminder that Fedora 16 is nearing its end of life. Approximately 4 (four) weeks from now Fedora will stop maintaining and issuing updates for Fedora 16. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '16'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 16's end of life. Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 16 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to click on "Clone This Bug" and open it against that version of Fedora. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete. The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping I understand the openldap code has been fixed but doesn't the fundamental problem of not checking values remain in NSS? Should this be opened in the NSS bug tracker? It's hardly the only place NSS doesn't check values. If you call NSS functions without a successfull NSS_Init, you will crash. Changing that would be an upstream bug and would cover a lot more than just nssTrustDomain_LockCertCache(). While it's probably a reasonable thing to ask for (as NSS gets used by more less well behaved clients), it's probably not worth tracking a single function in a read hat bug. bob Fedora 16 changed to end-of-life (EOL) status on 2013-02-12. Fedora 16 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. Thank you for reporting this bug and we are sorry it could not be fixed. |