Bug 453508
Field | Value |
---|---|
Summary | TPS Segfaults on startup on Fedora 9 |
Product | [Retired] Dogtag Certificate System |
Component | TPS |
Version | 1.0 |
Status | CLOSED ERRATA |
Severity | high |
Priority | high |
Hardware | All |
OS | Linux |
Reporter | Andrew Bartlett <abartlet> |
Assignee | Ade Lee <alee> |
QA Contact | Chandrasekar Kannan <ckannan> |
CC | benl, bob.lord, cfu, david.k.stutzman2.ctr, rcritten, rmeggins, rrelyea |
Keywords | Reopened |
Fixed In Version | 1.0.7-8.fc8 |
Doc Type | Bug Fix |
Last Closed | 2009-07-22 23:29:08 UTC |
Bug Blocks | 443788, 450345 |
Description
Andrew Bartlett
2008-07-01 04:28:12 UTC
The segfault is a result of changes made to mod_nss as part of the no_fork patch from mod_nss-1.0.7-2.fc8 to mod_nss-1.0.7-3.fc8. The way mod_nss starts up has changed. See the following note from rcritten:

**********
The way mod_nss used to work is it would open the database during initialization and close it when the module was unloaded. Now it closes it much quicker. We can probably make an exception during the first init when the config is being loaded, I suspect this is where you are seeing the crash.

So rebuild mod_nss (I build with: ./configure -with-apr-config) and look in nss_init_Module(). There are 2 calls:

nss_init_ChildKill(base_server);
nss_init_ModuleKill(base_server);

These shut things down. Now maybe these can be moved/removed or another special case added so the database remains initialized until module unload the first go around. You can tell with mc->nInitCount. If it == 1 then it is the first load where the Apache configuration is verified and STDIN/STDOUT are available.
**********

Removing these function calls does in fact allow the TPS to start up - but probably results in a leak on the mod_nss side.

Reassigning to rcritten for fix in mod_nss.

Created attachment 311957 [details]
let the apache module unload function shut down NSS
Committed upstream:

Checking in nss_engine_init.c;
/cvs/dirsec/mod_nss/nss_engine_init.c,v <-- nss_engine_init.c
new revision: 1.34; previous revision: 1.33
done

mod_nss-1.0.7-8.fc8 has been submitted as an update for Fedora 8

mod_nss-1.0.7-9.fc9 has been submitted as an update for Fedora 9

mod_nss-1.0.7-8.fc8 has been pushed to the Fedora 8 stable repository. If problems still persist, please make note of it in this bug report.

mod_nss-1.0.7-9.fc9 has been pushed to the Fedora 9 stable repository. If problems still persist, please make note of it in this bug report.

Confirmed, this no longer segfaults with the new mod_nss.

Reopening this issue -- It turns out that if the tps executable is started from the command line, it does not segfault. If it is started from the init script, it does segfault, albeit quietly in the background in a child process.

The difference is that the init script sets the following LD_PRELOAD:

LD_PRELOAD="/usr/lib64/libldap60.so"
LD_PRELOAD="/usr/lib64/libssl3.so:${LD_PRELOAD}"

When tps is started with this preload, it segfaults with the following trace:

Core was generated by `/usr/sbin/httpd.worker -f /etc/pki-tps/httpd.conf'.
Program terminated with signal 11, Segmentation fault.
#0 0x00000000022095bb in ?? () from /usr/lib64/libnss3.so
(gdb) bt
#0 0x00000000022095bb in ?? () from /usr/lib64/libnss3.so
#1 0x0000000002204977 in CERT_FindCertByNickname () from /usr/lib64/libnss3.so
#2 0x000000000577f157 in RA::InitializeHttpConnections (id=0x579a3c3 "ca", len=0x59b14e4, conn=0x59b1500, ctx=0x7f2b7c25f520) at ../src/engine/RA.cpp:1787
#3 0x000000000578052e in RA::Initialize (cfg_path=<value optimized out>, ctx=0x7f2b7c25f520) at ../src/engine/RA.cpp:292
#4 0x00000000654e7e89 in mod_tps_initialize (p=0x7f2b7a22f708, plog=<value optimized out>, ptemp=<value optimized out>, sv=0x7f2b7a234e08) at ../src/modules/tps/mod_tps.cpp:283
#5 0x00007f2b794ad57c in ap_run_post_config (pconf=0x7f2b7a22f708, plog=0x7f2b7a261898, ptemp=0x7f2b7a235738, s=0x7f2b7a234e08) at /usr/src/debug/httpd-2.2.8/server/config.c:91
#6 0x00007f2b7949ab0d in main (argc=3, argv=0x7fff814d2b48) at /usr/src/debug/httpd-2.2.8/server/main.c:719

Versions of nss, mod_nss, and mozldap:

[root@goofy-vm2 ~]# rpm -q mod_nss nss mozldap
mod_nss-1.0.7-8.fc8
nss-3.12.0.3-0.8.2.fc8
nss-3.12.0.3-0.8.2.fc8
mozldap-6.0.5-1.fc8
mozldap-6.0.5-1.fc8

I have a similar problem on F-8 x86_64 with the latest Fedora DS admin server. If I configure it to be both an SSL server and an SSL client (to the directory server), I get this message in the admin server error log:

[<timestamp>] [Info] Init: Re-initializing NSS library

The server hangs at this point. I have no idea where this is coming from. This string does not appear in the mod_admserv code nor the mod_nss code.

Sent this to Bob Relyea, one of the NSS developers:

As you may recall, Apache does some interesting things when it starts up. It loads all modules to let them check their configuration, unloads them, closes all ttys, then reloads them all. The next step is the model takes over which for us is either forked or threaded. Basically the children get spawned.

Apache has two ways to initialize things: in post_config stage (basically the parent process) and per-child.
What we used to do is initialize NSS in post_config and just leave it to work for any children that got spawned. They would inherit the NSS database. This also allowed any other modules that wanted to use NSS to piggy-back on top of mod_nss.

That was, of course, wrong. What we really need to do is wait until after the fork to initialize NSS.

So what I do now is: In the first module load, initialize NSS so we can verify token passwords, certs, etc. Then shut it down when the module is unloaded. When it is reloaded again for the final time, we do not initialize NSS. We let each child thread/process do the initialization and because it is post-fork everything seems to be working fine.

The problem is those modules that used to piggy-back on our initialization. They work during the first load/configuration check stage but fail when the modules are loaded for the last time because NSS is not available in the parent.

I looked into initializing in the parent anyway but soon ran into several problems:

1. If we leave it initialized then all NSS_Initialize() in the children will fail and we're in the pre-fork problem again
2. If I initialize it and try to shut it down right before spawning children it is likely to fail because some other module is holding a reference to it (via a cert, key, whatever).

This is affecting modules in the DS admin server and the CS TPS subsystem. I suspect that this is going to require a re-write of both of those modules to do their NSS work per-child instead of in the parent.

He responded with:

I think there are 3 options here. The optimum will depend on the environment.

1) Your solution, where all the subordinate operations are moved to the child.
2) Hybrid: you create a single child that does this work and communicates via an rpc to the parent.
3) You do fork/exec instead of just fork in the child.

Obviously 3 would be a non-starter if you are depending on lots of other state, or if the fork() is managed by apache itself rather than the mod_ package, but if it's viable, you could try it (Hmmm, it also has the disadvantage that the shared SSL cache code won't work).

Option 2 works if you have a few local functions you need to perform, and not if you are trying to set up an encryption environment for the child. For example, if you just need to fetch some schema from a peer server, you could make an rpc to a child which initializes NSS itself and does the actual operation. If the number of operations is small, but they are expensive, this might be worthwhile to do. Otherwise your option 1 is your best bet.

What kind of an answer is 'WONTFIX'? Surely this (a segfault on startup) is either 'not a bug' or 'fixed', or you must withdraw the packages?

Ok, NOTABUG it is. The end result is the same. It's sad that back doesn't work the way it used to in bugzilla...

This is my fault. I didn't notice that it was filed against Dogtag and not mod_nss. I'll have my crow now. Re-opened and assigned back to alee.

Created attachment 314820 [details]
patch v1
cfu, please review.
Not totally happy with this -- seems like we should get the cert database path from "nss.conf" instead of ap_server_root/alias.
Ade
Where does tps store its key/cert database? Can you get that path in the tps module? If not, you probably want to add that as a config parameter. Admin server gets around this problem because we ship our own nss.conf, and define the parameters that mod_nss uses in nss.conf and other config files (console.conf).

Created attachment 314824 [details]
patch take 2
Much better -- reads from TPS's CS.cfg
cfu, please review.
I'm not sure it's a good idea to initialize the same database that mod_nss is going to initialize later. This will likely not work at all in the Apache forked model.

Bob Relyea. Can you please comment on the patch? Will it work or is there more that needs to be done?
Ade

So, when I upgrade nss from nss-3.11.7-10.fc8 to nss-3.12.1.1-2.fc8, the tps still starts but the page fails to load (with a message about the connection being interrupted.) The apache error log is below:

[Tue Nov 25 02:51:02 2008] [notice] SELinux policy enabled; httpd running as context system_u:system_r:unconfined_t:s0-s0:c0.c1023
[Tue Nov 25 02:51:02 2008] [info] Initializing SSL Session Cache of size 10000. SSL2 timeout = 100, SSL3/TLS timeout = 86400.
[Tue Nov 25 02:51:02 2008] [info] Init: Initializing (virtual) servers for SSL
[Tue Nov 25 02:51:02 2008] [info] Configuring server for SSL protocol
[Tue Nov 25 02:51:02 2008] [debug] nss_engine_init.c(592): Enabling SSL3
[Tue Nov 25 02:51:02 2008] [debug] nss_engine_init.c(597): Enabling TLS
[Tue Nov 25 02:51:02 2008] [debug] nss_engine_init.c(768): Configuring permitted SSL ciphers [-des,-desede3,-rc2,-rc2export,-rc4,-rc4export,+rsa_3des_sha,-rsa_des_56_sha,+rsa_des_sha,-rsa_null_md5,-rsa_null_sha,-rsa_rc2_40_md5,+rsa_rc4_128_md5,-rsa_rc4_128_sha,-rsa_rc4_40_md5,-rsa_rc4_56_sha,-fortezza,-fortezza_rc4_128_sha,-fortezza_null,-fips_des_sha,+fips_3des_sha,-rsa_aes_128_sha,-rsa_aes_256_sha,+ecdhe_ecdsa_aes_256_sha]
[Tue Nov 25 02:51:02 2008] [error] Unknown cipher ecdhe_ecdsa_aes_256_sha
[Tue Nov 25 02:51:02 2008] [info] Using nickname Server-Cert cert-pki-tps.
[Tue Nov 25 02:51:02 2008] [info] Server: Apache/2.2.8, Interface: mod_nss/2.2.8, Library: NSS/3.12.0.3
[Tue Nov 25 02:51:02 2008] [info] The TPS plugin was successfully loaded!
[Tue Nov 25 02:51:03 2008] [info] Shutting down SSL Session ID Cache
[Tue Nov 25 02:51:06 2008] [info] Initializing SSL Session Cache of size 10000. SSL2 timeout = 100, SSL3/TLS timeout = 86400.
[Tue Nov 25 02:51:06 2008] [info] Server: Apache/2.2.8, Interface: mod_nss/2.2.8, Library: NSS/3.12.0.3
[Tue Nov 25 02:51:06 2008] [info] The TPS plugin was successfully loaded!
[Tue Nov 25 02:51:07 2008] [notice] Apache/2.2.9 (Unix) mod_nss/2.2.8 NSS/3.12.0.3 mod_perl/2.0.3 Perl/v5.8.8 configured -- resuming normal operations
[Tue Nov 25 02:51:07 2008] [info] Server built: Jul 14 2008 15:28:30
[Tue Nov 25 02:51:07 2008] [debug] worker.c(1740): AcceptMutex: sysvsem (default: sysvsem)
[Tue Nov 25 02:51:07 2008] [error] Password for slot internal is incorrect.
[Tue Nov 25 02:51:07 2008] [error] NSS initialization failed. Certificate database: /var/lib/pki-tps/alias.
[Tue Nov 25 02:51:07 2008] [error] SSL Library Error: -8192 I/O Error
[Tue Nov 25 02:51:33 2008] [info] Connection to child 0 established (server goofy-vm2.dsdev.sjc.redhat.com:7889, client 10.11.13.37)
[Tue Nov 25 02:51:33 2008] [info] SSL input filter read failed.
[Tue Nov 25 02:51:33 2008] [error] SSL Library Error: -12268 Cannot connect: SSL is disabled
[Tue Nov 25 02:51:33 2008] [info] Connection to child 0 closed (server goofy-vm2.dsdev.sjc.redhat.com:7889, client 10.11.13.37)
[Tue Nov 25 02:51:33 2008] [info] Connection to child 1 established (server goofy-vm2.dsdev.sjc.redhat.com:7889, client 10.11.13.37)
[Tue Nov 25 02:51:33 2008] [info] SSL input filter read failed.
[Tue Nov 25 02:51:33 2008] [error] SSL Library Error: -12268 Cannot connect: SSL is disabled
[Tue Nov 25 02:51:33 2008] [info] Connection to child 1 closed (server goofy-vm2.dsdev.sjc.redhat.com:7889, client 10.11.13.37)
[Tue Nov 25 02:51:37 2008] [info] Connection to child 2 established (server goofy-vm2.dsdev.sjc.redhat.com:7889, client 10.11.13.37)
[Tue Nov 25 02:51:37 2008] [info] SSL input filter read failed.
[Tue Nov 25 02:51:37 2008] [error] SSL Library Error: -12268 Cannot connect: SSL is disabled
[Tue Nov 25 02:51:37 2008] [info] Connection to child 2 closed (server goofy-vm2.dsdev.sjc.redhat.com:7889, client 10.11.13.37)

Your patch is not likely to work if you are in an environment where you can fork(). I can't see everything you are doing in that patch, but you need to do what you need to do with NSS and then free all the NSS objects and shut down NSS before your process does the fork(). Better yet, move whatever initialization you are doing here before the fork() to the child.

Sorry for jumping in this late... Rob, maybe you can help me understand it more clearly. In comment#11, it looks like you were describing what you would do to mod_nss. Is that correct? Which option did you end up doing? As a result, does every child or every thread need to be initialized again?

Ade, you are trying to initialize NSS for every HTTP connection initialization. The module code is in pki/base/tps/src/modules/tps. If I understand it correctly, you don't want to put the "child's NSS init" in mod_tps_initialize() (you put it deep in the calls in a different file), because it gets loaded and unloaded as described by Rob.

Ade, perhaps you want to ask Rich if he could give you some sample code from his Apache module that you can consult.

Created attachment 325404 [details]
patch to fix v3
This patch should now do the right thing --
i.e. do the NSS initialization only on the first "config" load
and in the child initialization.
TPS now installs and starts up ok.
cfu, please review.
For comparison of the fix, see richm's changes: https://bugzilla.redhat.com/show_bug.cgi?id=461028
I pretty much followed those.

In this change:

@@ -300,7 +337,26 @@
         goto loser;
     }
+
+    if (sc->gconfig->nInitCount < 2 ) {
+        status = RA::InitializeInChild( sc->context);
+    } else {

Shouldn't you be testing against 1 and not 2?

No, the check against nInitCount < 2 is correct here. The implementation is slightly different.

The function mod_tps_initialize (of which the above is a code fragment) does the following:

mod_tps_initialize()
    sc->gconfig->nInitCount ++;
    do parent initialization
    if (initcount < 2) {
        do child initialization stuff
    }

So, on the first module load, initCount is set to 1 and we do both the parent and child initialization. On the second module load, initCount is set to 2 and we only do parent initialization.

Basically, this works because we increment nInitCount before doing the check.

Does TPS not need to clear SSL session caches like the admin server? Other than that, the code seems fine. Although I'd like you to make sure the tests are more complete. From what I understand, Ade, you have tested the following cases:

1. admin/agent SSL client auth to tus interface
2. format and enrollment of an actual token

Could you perform the following if you have not done so already?

* set up SSL authentication between ESC client and TPS, and test the format and enrollment

Please observe the various logs to see if there are any new error messages that might seem alarming.

Created attachment 326173 [details]
patch to fix v4
Added call to SSL_ClearCache.
Seems like we should need it - although my testing shows no error messages either way.
Also included spec file.
cfu, please approve.
*** Bug 472509 has been marked as a duplicate of this bug. ***

(In reply to comment #28)
> Created an attachment (id=326173) [details]
> patch to fix v4
>
> Added call to SSL_ClearCache.
>
> Seems like we should need it - although my testing shows no error messages
> either way.
>
> Also included spec file.
>
> cfu, please approve.

cfu+

Clearing the cache allows shutdown to complete if you started any SSL connections. It's possible that no SSL sessions were started, in which case there was no need to clear the cache. You probably want the code in there for safety reasons.

bob

Checked in:

[builder@dhcp231-124 pki]$ svn ci -m "changes to fix BZ#453508" base/tps/src/include/engine/RA.h base/tps/src/engine/RA.cpp base/tps/src/modules/tps/mod_tps.cpp base/tps/src/httpClient/engine.cpp dogtag/tps/pki-tps.spec
Sending base/tps/src/engine/RA.cpp
Sending base/tps/src/httpClient/engine.cpp
Sending base/tps/src/include/engine/RA.h
Sending base/tps/src/modules/tps/mod_tps.cpp
Sending dogtag/tps/pki-tps.spec
Transmitting file data .....
Committed revision 165.
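Bob's point about clearing the cache before shutdown can be illustrated with a toy model. This is only a sketch under stated assumptions: `toy_nss`, `toy_open_ssl_session`, `toy_clear_session_cache` and `toy_shutdown` are hypothetical names that do not call into NSS, and the model assumes that every cached session holds a reference that blocks shutdown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a library whose shutdown fails while references are held. */
typedef struct {
    bool initialized;
    int cached_sessions; /* each cached session pins library objects */
} toy_nss;

/* Starting an SSL connection leaves a session in the cache. */
void toy_open_ssl_session(toy_nss *nss)
{
    assert(nss != NULL && nss->initialized);
    nss->cached_sessions++;
}

/* Dropping the cache releases the references the sessions were holding. */
void toy_clear_session_cache(toy_nss *nss)
{
    assert(nss != NULL);
    nss->cached_sessions = 0;
}

/* Shutdown only completes when nothing is still referenced. */
bool toy_shutdown(toy_nss *nss)
{
    assert(nss != NULL);
    if (nss->cached_sessions > 0) {
        return false; /* still busy: the caller forgot to clear the cache */
    }
    nss->initialized = false;
    return true;
}
```

With no sessions opened, shutdown succeeds either way, which would match the testing result reported for patch v4; the clear call only matters once at least one SSL connection has been made.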