Description of problem: When logged in as a kerberos user I cannot unlock the screensaver even though the dialog appears to correctly take the password. Version-Release number of selected component (if applicable): How reproducible: Always Steps to Reproduce: 1.Log into gnome session as a kerberos user 2.lock screen 3.provide password to unlock screensaver and get back to session Actual results: screensaver stays locked Expected results: Screensaver unlocked Additional info:
So I looked at this a bit today with Matthias. The lock dialog segfaulted when its pam helper thread exited because it was trying to access resources from pkinit's address block after pkinit was unloaded (libc was trying to clean up the thread's local storage). We were able to work around the problem by running PR_Cleanup from the pkinit's fini function. I'm not sure if this is the right fix however. Will calling PR_Cleanup mess with other plugins or libraries using nss in the same process? I don't think this will be a problem for gnome-screensaver because it doesn't use NSS elsewhere inprocess (except conceivably by another pam module, but those are executed synchronously from each other I believe).
Created attachment 141814 [details] the patch that made the problem go away
adding an NSPR expert to this mix.
The right fix is to synchronize with the termination of the lock dialog's pam helper thread before pkinit is unloaded. That can be done with the PR_JoinThread/pthread_join function or a condition variable. For PR_JoinThread to work, the thread must be created with PR_CreateThread(..., PR_JOINTABLE_THREAD, ...). Your workaround is depending on certain side effects of PR_Cleanup. If you don't know why it works, it's probably not the right fix. PR_Cleanup does the following: - frees all the memory and system resources (file descriptors, etc.) used by NSPR - synchronizes with the termination of all the threads created by PR_CreateThread(PR_USER_THREAD, ...). See the first while loop in http://lxr.mozilla.org/nspr/source/nsprpub/pr/src/pthreads/ptthread.c#927 Perhaps your workaround is depending on the second side effect?
Hi Wan-Teh, I'm not sure I understand what you're saying. the pkinit plugin is loaded, run, and unloaded by pam_krb5 in the pam helper thread. There are only two threads at play here, the main thread that processes UI events, and the helper thread that is created for the authentication conversation. When the pam conversation completes the helper thread wakes up the main thread and exits. when the main thread wakes up it reaps the helper thread with g_thread_join (which is just a wrapper around pthread_join). The crash happens when the helper thread exits, however. Make sense?
*** Bug 216615 has been marked as a duplicate of this bug. ***
Ray, sorry about the confusion. Could you get a stack trace of the pam helper thread when it segfaulted? What's the thread local storage that libc was trying to clean up? Your patch (attachment 141814 [details]) in comment 2 calls both NSS_Shutdown and PR_Cleanup. Do you know which function call made the problem go away?
If it's the NSS_Shutdown() call, then it's something that needs to be fixed there anyway (if the module initializes the library when it gets loaded, then it should be shutting it down before it's unloaded).
The PR_Cleanup made it go away
Thanks. I still need a stack trace and information about the thread local storage that Ray referred to in comment 1. But at the risk of causing more confusion, I speculate that the crash is caused by the thread local storage used by NSPR for NSPR thread IDs (pointers to the PRThread structure), and your workaround is depending on this pthread_setspecific call in PR_Cleanup: http://lxr.mozilla.org/nspr/source/nsprpub/pr/src/pthreads/ptthread.c#950 If you know how to build NSPR, you can comment out these three lines in PR_Cleanup (mozilla/nsprpub/pr/src/pthreads/ptthread.c): _pt_thread_death(me); rv = pthread_setspecific(pt_book.key, NULL); PR_ASSERT(0 == rv); If I guessed right, commenting out these three lines will make the problem come back. If you don't know how to build NSPR, I can do that for you when I get to work later.
I built three NSPR libraries. Please perform these three experiments. In each experiment, install the NSPR library as /usr/lib/libnspr4.so. Remember to back up the original /usr/lib/libnspr4.so. 1. /home/wtchang/nspr-4.6.4/libnspr4.so.1 and your PR_Cleanup workaround This is the original NSPR 4.6.4 library. It should not crash. 2. /home/wtchang/nspr-4.6.4/libnspr4.so.2 and your PR_Cleanup workaround In this version, I removed those three lines from PR_Cleanup. It should crash. 3. /home/wtchang/nspr-4.6.4/libnspr4.so.3 and replace PR_Cleanup() by PR_DetachThread() in your workaround In this version, I rewrote PR_DetachThread() to clean up the thread local storage. It should not crash.
*** Bug 216933 has been marked as a duplicate of this bug. ***
Created attachment 141966 [details] NSPR test program (detach.c) Based on Ray's comment 6, I wrote an NSPR test program that crashed. Below is the stack trace. Does your stack trace look like this? Program received signal SIGSEGV, Segmentation fault. [Switching to Thread -1208894544 (LWP 31364)] 0x00b4eeaa in ?? () (gdb) where #0 0x00b4eeaa in ?? () #1 0x009d3228 in __nptl_deallocate_tsd () from /lib/tls/libpthread.so.0 #2 0x009d337f in start_thread () from /lib/tls/libpthread.so.0 #3 0x0084affe in clone () from /lib/tls/libc.so.6 (gdb)
Hi Wan-teh, Sorry for the late reply; I've been on a road trip today. As Matthias said, PR_Cleanup was the function that seemed to make the problem go away. The NSS_Shutdown was something we added when we noticed that NSS_Init was called elsewhere in the module. I don't know the exact stack trace off hand, but it was short. Something along the lines of: <signal SEGV received> nptl_deallocate_tsd() ?? start_thread(...) Also, valgrind said the segfault was because a resource was accessed at an unmapped address and that the address fell in the range that nspr was mapped at ealier. I'll tackle Ed in his cube on monday and we can try your built libraries.
ah yes, the backtrace in comment 14 looks right.
Ray, thanks for confirming. Your PR_Cleanup workaround is fine provided that nobody else in that process is using NSPR. Another workaround is for the main thread to call dlopen("libnspr4.so", RTLD_NOW) and leak the returned library handle. This prevents libnspr4.so from being unloaded. Alternatively, link the program with -lnspr4 and call an NSPR function such as PR_GetCurrentThread() or PR_SetError(0, 0) to create a reference.
Are we going to need to work around this indefinitely? Can we fix this in NSPR in time for 5.0.0?
We don't need to work around this indefinitely. I will fix this in NSPR. Since NSPR is used by many products, the testing will take a long time. So for 5.0.0 I recommend we work around this. However, I'd still appreciate it if Ray could perform the three experiments I described in comment 12 to verify it's indeed an NSPR bug. The NSPR bug report is https://bugzilla.mozilla.org/show_bug.cgi?id=254983 I set its target milestone to the next NSPR release, 4.6.5, yesterday.
Sorry Wan-Teh, I forgot to follow up. Ed, is your machine x86-64 or i386?
Ray, this is also reproducible on the Dell on the blue table in my office (hooked to the flat panel monitor). Usual Westford root pwd. gnome-screensaver not currently running for rather obvious reasons.
I'm also consistently getting this every day on my main workstation in the Westford office, on i386, running the RHEL5Beta2 with some updates from the QA instance of RHN. I get a segfault with a backtrace similar to the one in comment #15 if I attach to the gnome-screensaver-dialog process in gdb.
Dave, are you using i386 and not x86-64? If so, would you mind building attachment 141814 [details] into a cvs checkout of pkinit-nss and performing the steps in comment 12?
Ray, I built x86-64 NSPR binaries for you. They are /home/wtchang/nspr-4.6.4/x86-64/libnspr4.so.1 /home/wtchang/nspr-4.6.4/x86-64/libnspr4.so.2 /home/wtchang/nspr-4.6.4/x86-64/libnspr4.so.3 Follow the steps in comment 12 with these binaries. Thanks.
Hi Wan-Teh, I snuck into Ed's cube today while he was on PTO and performed the steps you outlined in comment 12. Things seem to be exactly as you've predicted. 1) doesn't crash 2) crashes 3) doesn't crash after rebuilding pkinit with your proposed change (I didn't try the original patch with libnspr4.so.3 though) So how should be move forward on this for RHEL5 and beyond? I guess we could 1) Fix DetachThread in NSPR and call it from a fini function or __attribute__((destructor)) function in NSPR 2) Don't touch DetachThread, but just clean up thread local storage in the fini method 3) make pkinit explicitly dlopen nspr with RTLD_NODELETE 4) make the kerberos libs link against nspr explicitly 5) when the kerberos libs dlopen pkinit they could do it with RTLD_NODELETE 6) make gnome-screensaver-dialog link against nspr explicitly Do all those workarounds sound plausible? Are there any other possibilities? Which should we do?
oh, we could also fix DetachThread and then call it from pkinit like step 3 in comment 12.
Ray, thank you for running those experiments for me. For RHEL 5, my original recommendation in comment 17 was to use your workaround "the patch that made the problem go away" (attachment 141814 [details]). Since I didn't know about the RTLD_NODELETE flag for dlopen, my other suggestions in comment 17 should be ignored; your suggestions are much better. Among them, #3 seems the best. It is better that we use a workaround in RHEL 5 that doesn't require a new NSPR release because I may not have enough time to test the new NSPR code thoroughly. Here are my recommendations: Near term: the PR_Cleanup workaround, or use one of the proposals to prevent libnspr4.so from being unloaded. Long term: fix PR_DetachThread and then call it from pkinit
Hi Wan-Teh, I don't think expecting pkinit to call PR_DetachThread is very nice. It doesn't after all do anything with threads. Why can't nspr call PR_DetachThread on its own when it gets unloaded?
Nalin, which work around would you feel best about?
Ray, nspr calling PR_DetachThread on its own when it gets unloaded is a great idea and I will definitely pursue it. Although it will fix this bug, it isn't sufficient in general when there are multiple threads that have "touched" NSPR. If all of them stay around after NSPR is unloaded, all of them need to clean up their thread local storage, and PR_Cleanup can only take care of one of them. As you pointed out, the need to call PR_DetachThread exposes an implementation detail that isn't obvious to the caller. I will try to hide this as much as I can. But I am worried that there will be places where it is necessary.
Created attachment 142782 [details] NSPR patch: add a __attribute__ ((destructor)) function that cleans up the thread local storage In this patch, I add a __attribute__ ((destructor)) function to NSPR that cleans up the thread local storage of the current thread. I built an i386 NSPR library with this patch and emailed it to Ray for testing. This NSPR patch alone, without any change to pkinit, should be able to fix the crash. This patch takes less time to test thoroughly because it doesn't change the PR_DetachThread function. So it is fine to apply this patch to the nspr-4.6.4 package and use it in RHEL 5.
So I tested the library on ed's machine and I can report that it seems to work great. I rolled back pkinit to the version we ship in Beta2, used your updated nspr library and everything worked fine. I'd say we should get this built into nspr as quickly as possible. Kai, can you build it? or are you going to Wan-Teh? (I can, too, just ask)
Ray, thanks again for testing my NSPR patch. I opened the bug https://bugzilla.mozilla.org/show_bug.cgi?id=362768 for the NSPR upstream and submitted the patch there for code review. Are we supposed to only apply a patch that has been accepted by the upstream?
I built a package for RHEL-5 that includes Wan-Teh's patch. nspr-4.6.4-3.el5
I'm going to reopen and move to MODIFIED so that QA can verify the fix and ensure there aren't any regressions, etc.
Don't forget to check in a patch like attachment 141814 [details] for pkinit to call NSS_Shutdown (if pkinit isn't calling NSS_Shutdown now) and PR_Cleanup (for double protection). In attachment 141814 [details], the NSS_Shutdown call needs to be moved after the PORT_FreeArena call because PORT_FreeArena is an NSS function. It doesn't matter in this case, but it's a better style. One of you should review the NSPR patch. RHEL must have some code review requirements, especially for patches that go in at this stage. The work to get this patch into the NSPR upstream is several times greater (more porting, testing, and code review are required), so it won't happen right away. If we use the patch before it's accepted by the upstream, we better make sure the patch is in a good shape that will be accepted.
I ported the NSPR patch to all the other Unix platforms we support and attached a new patch to the NSPR upstream bug report. For RHEL 5 the NSPR patch in this bug report is fine.
(In reply to comment #35) > I built a package for RHEL-5 that includes Wan-Teh's patch. > nspr-4.6.4-3.el5 > I have tested this build on my personal workstaton today, and can report that it works fine, insofar as gnome-screensaver unlock works fine. Updated to GA snapshot 3 on x86_64.
A package has been built which should help the problem described in this bug report. This report is therefore being closed with a resolution of CURRENTRELEASE. You may reopen this bug report if the solution does not work for you.