Bug 453508

Summary: TPS Segfaults on startup on Fedora 9
Product: [Retired] Dogtag Certificate System
Reporter: Andrew Bartlett <abartlet>
Component: TPS
Assignee: Ade Lee <alee>
Status: CLOSED ERRATA
QA Contact: Chandrasekar Kannan <ckannan>
Severity: high
Priority: high
Version: 1.0
CC: benl, bob.lord, cfu, david.k.stutzman2.ctr, rcritten, rmeggins, rrelyea
Target Milestone: ---
Keywords: Reopened
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: 1.0.7-8.fc8
Doc Type: Bug Fix
Last Closed: 2009-07-22 23:29:08 UTC
Bug Blocks: 443788, 450345    
Attachments:
Description | Flags
let the apache module unload function shut down NSS | none
patch v1 | none
patch take 2 | none
patch to fix v3 | none
patch to fix v4 | none

Description Andrew Bartlett 2008-07-01 04:28:12 UTC
Description of problem:
On Fedora 9, x86_64 TPS segfaults on startup.

Version-Release number of selected component (if applicable):
pki-tps-1.0.0-2.fc9.x86_64

How reproducible:
Every time

Steps to Reproduce:
1. Start from a completely new install.
2. Fix Modutil.pm as per bug 453504.
3. Run: service tps start
  
Actual results:

Re-running the command under gdb

[abartlet@naomi SPECS]$ sudo gdb --args runcon -t unconfined_t --
/usr/sbin/httpd.worker -f /etc/pki-tps/httpd.conf
GNU gdb Fedora (6.8-10.fc9)
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...
(no debugging symbols found)
Missing separate debuginfos, use: debuginfo-install coreutils.x86_64
(gdb) run
Starting program: /usr/bin/runcon -t unconfined_t -- /usr/sbin/httpd.worker -f
/etc/pki-tps/httpd.conf
Executing new program: /usr/sbin/httpd.worker

Program received signal SIGSEGV, Segmentation fault.
NSSCryptoContext_FindBestCertificateByNickname (cc=<value optimized out>,
name=<value optimized out>, timeOpt=<value optimized out>, usage=<value
optimized out>, policiesOpt=<value optimized out>)
    at cryptocontext.c:245
245	    if (!cc->certStore) {
Missing separate debuginfos, use: debuginfo-install httpd.x86_64
(gdb) p cc
$1 = <value optimized out>
(gdb) bt full
#0  NSSCryptoContext_FindBestCertificateByNickname (cc=<value optimized out>,
name=<value optimized out>, timeOpt=<value optimized out>, usage=<value
optimized out>, policiesOpt=<value optimized out>)
    at cryptocontext.c:245
	certs = <value optimized out>
	rvCert = <value optimized out>
#1  0x0000000003e0a277 in CERT_FindCertByNickname (handle=<value optimized out>,
nickname=<value optimized out>) at stanpcertdb.c:586
	cc = <value optimized out>
	c = <value optimized out>
	ct = <value optimized out>
	cert = <value optimized out>
	usage = Could not find the frame base for "CERT_FindCertByNickname".
(gdb) l
240	)
241	{
242	    NSSCertificate **certs;
243	    NSSCertificate *rvCert = NULL;
244	    PORT_Assert(cc->certStore);
245	    if (!cc->certStore) {
246		return NULL;
247	    }
248	    certs = nssCertificateStore_FindCertificatesByNickname(cc->certStore,
249	                                                           name,
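
The guard at line 245 of the listing tests cc->certStore, which itself dereferences cc. If the context pointer is NULL (a plausible but unconfirmed reading of this crash, since gdb reports cc as optimized out), the guard faults before it can return. A minimal sketch of a guard that checks the pointer first, using hypothetical stand-in types rather than the real NSS definitions:

```c
#include <stddef.h>

/* Hypothetical stand-ins for the NSS types; not the real definitions. */
typedef struct { int dummy; } cert_store;
typedef struct { cert_store *certStore; } crypto_context;

/* Returns 1 if a certificate lookup may proceed, 0 otherwise.
 * Testing cc before cc->certStore avoids dereferencing a NULL context. */
int context_is_usable(const crypto_context *cc)
{
    if (cc == NULL) {
        return 0;   /* no context at all, e.g. library not initialized */
    }
    if (cc->certStore == NULL) {
        return 0;   /* the case the guard in the listing handles */
    }
    return 1;
}
```

This only illustrates the failure mode; it is not a proposed change to NSS.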

Expected results:
Correct startup

Additional info:

Comment 1 Ade Lee 2008-07-14 20:07:11 UTC
The segfault is a result of changes made to mod_nss as part of the no_fork patch
from mod_nss-1.0.7-2.fc8 to mod_nss-1.0.7-3.fc8.  

The way mod_nss starts up has changed.  See the following note from rcritten:

**********
The way mod_nss used to work is it would open the database during 
initialization and close it when the module was unloaded. Now it closes 
it much quicker. We can probably make an exception during the first init 
when the config is being loaded, I suspect this is where you are seeing 
the crash.

So rebuild mod_nss (I build with: ./configure -with-apr-config) and look 
in nss_init_Module().

There are 2 calls:

         nss_init_ChildKill(base_server);
         nss_init_ModuleKill(base_server);

These shut things down. Now maybe these can be moved/removed or another 
special case added so the database remains initialized until module 
unload the first go around. You can tell with mc->nInitCount. If it == 1 
then it is the first load where the Apache configuration is verified and 
STDIN/STDOUT are available.
**********

Removing these function calls does in fact allow the TPS to start up - but
probably results in a leak on the mod_nss side.  Reassigning to rcritten for fix
in mod_nss.


Comment 2 Rob Crittenden 2008-07-16 15:16:47 UTC
Created attachment 311957 [details]
let the apache module unload function shut down NSS

Comment 3 Rob Crittenden 2008-07-16 15:18:01 UTC
Committed upstream:

Checking in nss_engine_init.c;
/cvs/dirsec/mod_nss/nss_engine_init.c,v  <--  nss_engine_init.c
new revision: 1.34; previous revision: 1.33
done


Comment 4 Fedora Update System 2008-07-16 18:22:29 UTC
mod_nss-1.0.7-8.fc8 has been submitted as an update for Fedora 8

Comment 5 Fedora Update System 2008-07-16 18:23:53 UTC
mod_nss-1.0.7-9.fc9 has been submitted as an update for Fedora 9

Comment 6 Fedora Update System 2008-07-17 14:16:17 UTC
mod_nss-1.0.7-8.fc8 has been pushed to the Fedora 8 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 7 Fedora Update System 2008-07-17 14:18:06 UTC
mod_nss-1.0.7-9.fc9 has been pushed to the Fedora 9 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 8 Andrew Bartlett 2008-07-28 00:53:30 UTC
Confirmed, this no longer segfaults with the new mod_nss.

Comment 9 Ade Lee 2008-08-05 16:49:33 UTC
Reopening this issue -- 

It turns out that if the tps executable is started from the command line, it does not segfault.

If it is started from the init script, it does segfault, albeit quietly in the background in a child process.  The difference is that the init script sets the following LD_PRELOAD:

LD_PRELOAD="/usr/lib64/libldap60.so"
LD_PRELOAD="/usr/lib64/libssl3.so:${LD_PRELOAD}"

When tps is started with this preload, it segfaults with the following trace:

Core was generated by `/usr/sbin/httpd.worker -f /etc/pki-tps/httpd.conf'.
Program terminated with signal 11, Segmentation fault.
#0  0x00000000022095bb in ?? () from /usr/lib64/libnss3.so
(gdb) bt
#0  0x00000000022095bb in ?? () from /usr/lib64/libnss3.so
#1  0x0000000002204977 in CERT_FindCertByNickname () from /usr/lib64/libnss3.so
#2  0x000000000577f157 in RA::InitializeHttpConnections (id=0x579a3c3 "ca", 
    len=0x59b14e4, conn=0x59b1500, ctx=0x7f2b7c25f520)
    at ../src/engine/RA.cpp:1787
#3  0x000000000578052e in RA::Initialize (cfg_path=<value optimized out>, 
    ctx=0x7f2b7c25f520) at ../src/engine/RA.cpp:292
#4  0x00000000654e7e89 in mod_tps_initialize (p=0x7f2b7a22f708, 
    plog=<value optimized out>, ptemp=<value optimized out>, sv=0x7f2b7a234e08)
    at ../src/modules/tps/mod_tps.cpp:283
#5  0x00007f2b794ad57c in ap_run_post_config (pconf=0x7f2b7a22f708, 
    plog=0x7f2b7a261898, ptemp=0x7f2b7a235738, s=0x7f2b7a234e08)
    at /usr/src/debug/httpd-2.2.8/server/config.c:91
#6  0x00007f2b7949ab0d in main (argc=3, argv=0x7fff814d2b48)
    at /usr/src/debug/httpd-2.2.8/server/main.c:719

Versions of nss, mod_nss, and mozldap :

[root@goofy-vm2 ~]# rpm -q mod_nss nss mozldap
mod_nss-1.0.7-8.fc8
nss-3.12.0.3-0.8.2.fc8
nss-3.12.0.3-0.8.2.fc8
mozldap-6.0.5-1.fc8
mozldap-6.0.5-1.fc8

Comment 10 Rich Megginson 2008-08-06 23:05:13 UTC
I have a similar problem on F-8 x86_64 with the latest Fedora DS admin server.  If I configure it to be both an SSL server and an SSL client (to the directory server), I get this message in the admin server error log:

[<timestamp>] [Info] Init: Re-initializing NSS library

The server hangs at this point.  I have no idea where this is coming from.  This string does not appear in the mod_admserv code nor the mod_nss code.

Comment 11 Rob Crittenden 2008-08-10 03:50:21 UTC
Sent this to Bob Relyea, one of the NSS developers:

As you may recall, Apache does some interesting things when it starts up. It loads all modules to let them check their configuration, unloads them, closes all ttys, then reloads them all. Next, the processing model takes over, which for us is either forked or threaded. Basically, the children get spawned.

Apache has two ways to initialize things: in post_config stage (basically the parent process) and per-child.

What we used to do is initialize NSS in post_config and just leave it to work for any children that got spawned. They would inherit the NSS database. This also allowed any other modules that wanted to use NSS to piggy-back on top of mod_nss.

That was, of course, wrong. What we really need to do is wait until after the fork to initialize NSS. So what I do now is:

On the first module load, initialize NSS so we can verify token passwords, certs, etc. Then shut it down when the module is unloaded.

When it is reloaded again for the final time, we do not initialize NSS. We let each child thread/process do the initialization and because it is post-fork everything seems to be working fine.

The problem is those modules that used to piggy-back on our initialization. They work during the first load/configuration check stage but fail when the modules are loaded for the last time because NSS is not available in the parent.

I looked into initializing in the parent anyway but soon ran into several problems:

1. If we leave it initialized then all NSS_Initialize() in the children will fail and we're in the pre-fork problem again
2. If I initialize it and try to shut it down right before spawning children it is likely to fail because some other module is holding a reference to it (via a cert, key, whatever).

This is affecting modules in the DS admin server and the CS TPS subsystem.

I suspect that this is going to require a re-write of both of those modules to do their NSS work per-child instead of in the parent.

He responded with:

I think there are 3 options here. The optimum will depend on the environment.

1)  Your solution, where all the subordinate operations are moved to the child.
2)  Hybrid: you create a single child that does this work and communicates via an rpc to the parent.
3)  You do fork/exec instead of just fork in the child.

Obviously 3 would be a non-starter if you are depending on lots of other state, or if the fork() is managed by apache itself rather than the mod_ package, but if it's viable, you could try it. (Hmmm, it also has the disadvantage that the shared SSL cache code won't work.)

Option 2 works if you have a few local functions you need to perform, and not if you are trying to set up an encryption environment for the child. For example, if you just need to fetch some schema from a peer server, you could make an rpc to a child which initializes NSS itself and does the actual operation. If the number of operations is small, but each is expensive, this might be worthwhile to do.

Otherwise your option 1 is your best bet.
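
The load, unload, reload, and per-child sequence described in this comment can be sketched as a small simulation. This is a hedged illustration only: fake_lib, module_state, on_post_config, on_unload, and on_child_init are hypothetical names standing in for NSS state and the Apache hooks, and copying a struct stands in for fork().

```c
#include <stdbool.h>

/* Simulated crypto library state; a stand-in for NSS, not its API. */
typedef struct {
    bool initialized;
    int  init_calls;
} fake_lib;

/* State that is assumed to survive module unload and reload. */
typedef struct {
    int load_count;
} module_state;

static void lib_init(fake_lib *lib)
{
    lib->initialized = true;
    lib->init_calls++;
}

static void lib_shutdown(fake_lib *lib)
{
    lib->initialized = false;
}

/* Parent-side hook, run on every module load. */
void on_post_config(module_state *m, fake_lib *parent)
{
    m->load_count++;
    if (m->load_count == 1) {
        lib_init(parent);   /* first load: verify passwords, certs, etc. */
    }
    /* later loads: the parent stays uninitialized */
}

/* Module unload, run between the first and second load. */
void on_unload(fake_lib *parent)
{
    if (parent->initialized) {
        lib_shutdown(parent);
    }
}

/* Per-child hook: each child initializes its own copy after the fork. */
void on_child_init(fake_lib *child)
{
    if (!child->initialized) {
        lib_init(child);
    }
}
```

In this model a module that expects the parent to be initialized on the second load finds it uninitialized, which matches the failure described for modules that piggy-backed on mod_nss.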

Comment 12 Andrew Bartlett 2008-08-18 21:51:34 UTC
What kind of an answer is 'WONTFIX'?

Surely this (a segfault on startup) is either 'not a bug' or 'fixed', or you must withdraw the packages?

Comment 13 Rob Crittenden 2008-08-18 23:43:13 UTC
Ok, NOTABUG it is. The end result is the same.

Comment 14 Rob Crittenden 2008-08-18 23:56:50 UTC
It's sad that back doesn't work the way it used to in bugzilla...

This is my fault. I didn't notice that it was filed against Dogtag and not mod_nss. I'll have my crow now.

Re-opened and assigned back to alee.

Comment 15 Ade Lee 2008-08-22 16:47:31 UTC
Created attachment 314820 [details]
patch v1

cfu, please review.

Not totally happy with this -- seems like we should get the cert database path from "nss.conf" instead of ap_server_root/alias.

Ade

Comment 16 Rich Megginson 2008-08-22 17:14:24 UTC
Where does tps store its key/cert database?  Can you get that path in the tps module?  If not, you probably want to add that as a config parameter.

Admin server gets around this problem because we ship our own nss.conf, and define the parameters that mod_nss uses in nss.conf and other config files (console.conf).

Comment 17 Ade Lee 2008-08-22 17:55:19 UTC
Created attachment 314824 [details]
patch take 2

Much better -- reads from TPS's CS.cfg

cfu, please review.

Comment 18 Rob Crittenden 2008-08-22 18:09:26 UTC
I'm not sure it's a good idea to initialize the same database that mod_nss is going to initialize later. This will likely not work at all in the Apache forked model.

Comment 19 Ade Lee 2008-11-24 21:46:12 UTC
Bob Relyea,

Can you please comment on the patch?  Will it work, or is there more that needs to be done?

Ade

Comment 20 Ade Lee 2008-11-25 17:06:44 UTC
So, when I upgrade nss from nss-3.11.7-10.fc8 to nss-3.12.1.1-2.fc8, the tps still starts but the page fails to load (with a message about the connection being interrupted).

The apache error log is below:

[Tue Nov 25 02:51:02 2008] [notice] SELinux policy enabled; httpd running as context system_u:system_r:unconfined_t:s0-s0:c0.c1023
[Tue Nov 25 02:51:02 2008] [info] Initializing SSL Session Cache of size 10000. SSL2 timeout = 100, SSL3/TLS timeout = 86400.
[Tue Nov 25 02:51:02 2008] [info] Init: Initializing (virtual) servers for SSL
[Tue Nov 25 02:51:02 2008] [info] Configuring server for SSL protocol
[Tue Nov 25 02:51:02 2008] [debug] nss_engine_init.c(592): Enabling SSL3
[Tue Nov 25 02:51:02 2008] [debug] nss_engine_init.c(597): Enabling TLS
[Tue Nov 25 02:51:02 2008] [debug] nss_engine_init.c(768): Configuring permitted SSL ciphers [-des,-desede3,-rc2,-rc2export,-rc4,-rc4export,+rsa_3des_sha,-rsa_des_56_sha,+rsa_des_sha,-rsa_null_md5,-rsa_null_sha,-rsa_rc2_40_md5,+rsa_rc4_128_md5,-rsa_rc4_128_sha,-rsa_rc4_40_md5,-rsa_rc4_56_sha,-fortezza,-fortezza_rc4_128_sha,-fortezza_null,-fips_des_sha,+fips_3des_sha,-rsa_aes_128_sha,-rsa_aes_256_sha,+ecdhe_ecdsa_aes_256_sha]
[Tue Nov 25 02:51:02 2008] [error] Unknown cipher ecdhe_ecdsa_aes_256_sha
[Tue Nov 25 02:51:02 2008] [info] Using nickname Server-Cert cert-pki-tps.
[Tue Nov 25 02:51:02 2008] [info] Server: Apache/2.2.8, Interface: mod_nss/2.2.8, Library: NSS/3.12.0.3
[Tue Nov 25 02:51:02 2008] [info] The TPS plugin was successfully loaded!
[Tue Nov 25 02:51:03 2008] [info] Shutting down SSL Session ID Cache
[Tue Nov 25 02:51:06 2008] [info] Initializing SSL Session Cache of size 10000. SSL2 timeout = 100, SSL3/TLS timeout = 86400.
[Tue Nov 25 02:51:06 2008] [info] Server: Apache/2.2.8, Interface: mod_nss/2.2.8, Library: NSS/3.12.0.3
[Tue Nov 25 02:51:06 2008] [info] The TPS plugin was successfully loaded!
[Tue Nov 25 02:51:07 2008] [notice] Apache/2.2.9 (Unix) mod_nss/2.2.8 NSS/3.12.0.3 mod_perl/2.0.3 Perl/v5.8.8 configured -- resuming normal operations
[Tue Nov 25 02:51:07 2008] [info] Server built: Jul 14 2008 15:28:30
[Tue Nov 25 02:51:07 2008] [debug] worker.c(1740): AcceptMutex: sysvsem (default: sysvsem)
[Tue Nov 25 02:51:07 2008] [error] Password for slot internal is incorrect.
[Tue Nov 25 02:51:07 2008] [error] NSS initialization failed. Certificate database: /var/lib/pki-tps/alias.
[Tue Nov 25 02:51:07 2008] [error] SSL Library Error: -8192 I/O Error
[Tue Nov 25 02:51:33 2008] [info] Connection to child 0 established (server goofy-vm2.dsdev.sjc.redhat.com:7889, client 10.11.13.37)
[Tue Nov 25 02:51:33 2008] [info] SSL input filter read failed.
[Tue Nov 25 02:51:33 2008] [error] SSL Library Error: -12268 Cannot connect: SSL is disabled
[Tue Nov 25 02:51:33 2008] [info] Connection to child 0 closed (server goofy-vm2.dsdev.sjc.redhat.com:7889, client 10.11.13.37)
[Tue Nov 25 02:51:33 2008] [info] Connection to child 1 established (server goofy-vm2.dsdev.sjc.redhat.com:7889, client 10.11.13.37)
[Tue Nov 25 02:51:33 2008] [info] SSL input filter read failed.
[Tue Nov 25 02:51:33 2008] [error] SSL Library Error: -12268 Cannot connect: SSL is disabled
[Tue Nov 25 02:51:33 2008] [info] Connection to child 1 closed (server goofy-vm2.dsdev.sjc.redhat.com:7889, client 10.11.13.37)
[Tue Nov 25 02:51:37 2008] [info] Connection to child 2 established (server goofy-vm2.dsdev.sjc.redhat.com:7889, client 10.11.13.37)
[Tue Nov 25 02:51:37 2008] [info] SSL input filter read failed.
[Tue Nov 25 02:51:37 2008] [error] SSL Library Error: -12268 Cannot connect: SSL is disabled
[Tue Nov 25 02:51:37 2008] [info] Connection to child 2 closed (server goofy-vm2.dsdev.sjc.redhat.com:7889, client 10.11.13.37)

Comment 21 Bob Relyea 2008-11-26 01:54:05 UTC
Your patch is not likely to work if you are in an environment where you can fork().

I can't see everything you are doing in that patch, but you need to do what you need to do with NSS and then free all the NSS objects and shutdown NSS before your process does the fork().

Better yet, move whatever initialization you are doing here before the fork() to the child.
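
The ordering constraint above (release every object, then shut down, then fork) can be illustrated with a simulated library that refuses to shut down while references are outstanding. This is a sketch under stated assumptions: sim_lib, sim_cert, and the functions below are hypothetical stand-ins, not NSS calls, and the refusal models the "holding a reference" failure mentioned in comment 11.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simulated library; live_refs counts objects still held by callers. */
typedef struct {
    bool initialized;
    int  live_refs;
} sim_lib;

/* Simulated certificate handle that pins the library while held. */
typedef struct {
    sim_lib *owner;
} sim_cert;

void sim_init(sim_lib *lib)
{
    lib->initialized = true;
    lib->live_refs = 0;
}

/* Hands out a reference; the caller must release it. */
sim_cert sim_cert_find(sim_lib *lib)
{
    sim_cert c = { lib };
    lib->live_refs++;
    return c;
}

void sim_cert_release(sim_cert *c)
{
    if (c->owner != NULL) {
        c->owner->live_refs--;
        c->owner = NULL;    /* guard against double release */
    }
}

/* Returns true on success; fails while any reference is outstanding. */
bool sim_shutdown(sim_lib *lib)
{
    if (!lib->initialized || lib->live_refs > 0) {
        return false;
    }
    lib->initialized = false;
    return true;
}
```

A caller would release every handle it obtained before calling sim_shutdown(), and only then fork.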

Comment 22 Christina Fu 2008-11-26 16:18:23 UTC
Sorry for jumping in this late...

Rob, maybe you can help me understand this more clearly. In comment #11, it looks like you were describing what you would do to mod_nss.  Is that correct?  Which option did you end up choosing?  As a result, does every child or every thread need to be initialized again?

Ade, you are trying to initialize NSS for every HTTP connection initialization.
The module code is in pki/base/tps/src/modules/tps.
If I understand correctly, you don't want to put the "child's NSS init" in mod_tps_initialize() (you put it deep in the calls in a different file), because it gets loaded and unloaded as described by Rob.

Ade, perhaps you want to ask Rich if he could give you some sample code from his Apache module that you can consult.

Comment 23 Ade Lee 2008-12-02 18:31:34 UTC
Created attachment 325404 [details]
patch to fix v3

This patch should now do the right thing -- i.e., do the NSS initialization only on the first "config" load and in the child initialization.

TPS now installs and starts up ok.

cfu, please review.

Comment 24 Ade Lee 2008-12-02 19:27:16 UTC
For comparison of the fix, see richm's changes:

https://bugzilla.redhat.com/show_bug.cgi?id=461028

I pretty much followed those.

Comment 25 Rob Crittenden 2008-12-03 18:38:12 UTC
In this change:

@@ -300,7 +337,26 @@
 
         goto loser;
     }
+  
+    if (sc->gconfig->nInitCount < 2 ) {
+        status = RA::InitializeInChild( sc->context); 
+    } else {

Shouldn't you be testing against 1 and not 2?

Comment 26 Ade Lee 2008-12-03 19:18:58 UTC
No, the check against nInitCount < 2 is correct here.  The implementation is slightly different.  The function mod_tps_initialize (of which the above is a code fragment) does the following:

mod_tps_initialize():
    sc->gconfig->nInitCount++;
    do parent initialization
    if (sc->gconfig->nInitCount < 2) {
        do child initialization stuff
    }

So, on the first module load, nInitCount is set to 1 and we do both the parent and child initialization.
On the second module load, nInitCount is set to 2 and we only do parent initialization.

Basically, this works because we increment nInitCount before doing the check.
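
The increment-then-test behaviour described above can be checked with a small stand-alone sketch. The struct and function names here are hypothetical, and the two counters only record which branch ran; this is not the code from the patch.

```c
/* Hypothetical model of the global config that survives module reloads. */
typedef struct {
    int nInitCount;     /* number of times the module has been loaded */
    int parent_inits;   /* how often parent initialization ran */
    int child_inits;    /* how often child-style initialization ran */
} global_config;

/* Models the control flow described for mod_tps_initialize(). */
void module_initialize(global_config *g)
{
    g->nInitCount++;            /* incremented before the test below */
    g->parent_inits++;          /* parent initialization: every load */
    if (g->nInitCount < 2) {
        g->child_inits++;       /* child initialization: first load only */
    }
}
```

Because the counter is already 1 when it is tested on the first load, comparing against 2 selects exactly that first load.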

Comment 27 Christina Fu 2008-12-06 18:00:56 UTC
Does TPS not need to clear SSL session caches like the admin server?

Other than that, the code seems fine.  Although I'd like you to make sure the tests are more complete.
From what I understand, Ade, you have tested the following cases:
1. admin/agent SSL client auth to tus interface
2. format and enrollment of an actual token

Could you perform the following if you have not done so already?
* set up SSL authentication between ESC client and TPS, and test the format and enrollment

Please observe the various logs to see if there are any new error messages that might seem alarming.

Comment 28 Ade Lee 2008-12-08 20:10:36 UTC
Created attachment 326173 [details]
patch to fix v4

Added call to SSL_ClearCache.

Seems like we should need it - although my testing shows no error messages either way.  

Also included spec file.

cfu, please approve.

Comment 29 Ade Lee 2008-12-08 20:18:35 UTC
*** Bug 472509 has been marked as a duplicate of this bug. ***

Comment 30 Christina Fu 2008-12-08 23:41:11 UTC
(In reply to comment #28)
> Created an attachment (id=326173) [details]
> patch to fix v4
> 
> Added call to SSL_ClearCache.
> 
> Seems like we should need it - although my testing shows no error messages
> either way.  
> 
> Also included spec file.
> 
> cfu, please approve.

cfu+

Comment 31 Bob Relyea 2008-12-09 01:03:39 UTC
Clearing the cache allows shutdown to complete if you started any SSL connections. It's possible that no SSL sessions were started, in which case there was no need to clear the cache. You probably want the code in there for safety reasons.

bob

Comment 32 Ade Lee 2008-12-09 01:13:51 UTC
Checked in .. 

[builder@dhcp231-124 pki]$ svn ci -m "changes to fix BZ#453508" base/tps/src/include/engine/RA.h base/tps/src/engine/RA.cpp base/tps/src/modules/tps/mod_tps.cpp base/tps/src/httpClient/engine.cpp dogtag/tps/pki-tps.spec 
Sending        base/tps/src/engine/RA.cpp
Sending        base/tps/src/httpClient/engine.cpp
Sending        base/tps/src/include/engine/RA.h
Sending        base/tps/src/modules/tps/mod_tps.cpp
Sending        dogtag/tps/pki-tps.spec
Transmitting file data .....
Committed revision 165.