Red Hat Bugzilla – Bug 216718
nspr needs to cleanup thread local storage before getting unloaded
Last modified: 2015-04-12 19:17:26 EDT
Description of problem:
When logged in as a kerberos user I cannot unlock the screensaver even though
the dialog appears to correctly take the password.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1.Log into gnome session as a kerberos user
3.provide password to unlock screensaver and get back to session
screensaver stays locked
So I looked at this a bit today with Matthias. The lock dialog segfaulted when
its pam helper thread exited because it was trying to access resources from
pkinit's address block after pkinit was unloaded (libc was trying to clean up
the thread's local storage).
We were able to work around the problem by running PR_Cleanup from the pkinit's
I'm not sure if this is the right fix however. Will calling PR_Cleanup mess
with other plugins or libraries using nss in the same process? I don't think
this will be a problem for gnome-screensaver because it doesn't use NSS
elsewhere inprocess (except conceivably by another pam module, but those are
executed synchronously from each other I believe).
Created attachment 141814 [details]
the patch that made the problem go away
adding an NSPR expert to this mix.
The right fix is to synchronize with the termination of
the lock dialog's pam helper thread before pkinit is unloaded.
That can be done with the PR_JoinThread/pthread_join function
or a condition variable. For PR_JoinThread to work, the thread
must be created with PR_CreateThread(..., PR_JOINTABLE_THREAD, ...).
Your workaround is depending on certain side effects of
PR_Cleanup. If you don't know why it works, it's probably
not the right fix.
PR_Cleanup does the following:
- frees all the memory and system resources (file descriptors,
etc.) used by NSPR
- synchronizes with the termination of all the threads created
by PR_CreateThread(PR_USER_THREAD, ...). See the first while
Perhaps your workaround is depending on the second side effect?
I'm not sure I understand what you're saying.
the pkinit plugin is loaded, run, and unloaded by pam_krb5 in the pam helper
thread. There are only two threads at play here, the main thread that processes
UI events, and the helper thread that is created for the authentication
When the pam conversation completes the helper thread wakes up the main thread
and exits. when the main thread wakes up it reaps the helper thread with
g_thread_join (which is just a wrapper around pthread_join).
The crash happens when the helper thread exits, however. Make sense?
*** Bug 216615 has been marked as a duplicate of this bug. ***
Ray, sorry about the confusion. Could you get a stack trace of
the pam helper thread when it segfaulted? What's the thread local
storage that libc was trying to clean up? Your patch (attachment 141814 [details])
in comment 2 calls both NSS_Shutdown and PR_Cleanup. Do you know
which function call made the problem go away?
If it's the NSS_Shutdown() call, then it's something that needs to be fixed
there anyway (if the module initializes the library when it gets loaded, then it
should be shutting it down before it's unloaded).
The PR_Cleanup made it go away
Thanks. I still need a stack trace and information about the
thread local storage that Ray referred to in comment 1. But
at the risk of causing more confusion, I speculate that the
crash is caused by the thread local storage used by NSPR for
NSPR thread IDs (pointers to the PRThread structure), and your
workaround is depending on this pthread_setspecific call in PR_Cleanup:
If you know how to build NSPR, you can comment out these three lines
in PR_Cleanup (mozilla/nsprpub/pr/src/pthreads/ptthread.c):
rv = pthread_setspecific(pt_book.key, NULL);
PR_ASSERT(0 == rv);
If I guessed right, commenting out these three lines will make
the problem come back. If you don't know how to build NSPR, I can
do that for you when I get to work later.
I built three NSPR libraries. Please perform these three experiments.
In each experiment, install the NSPR library as /usr/lib/libnspr4.so.
Remember to back up the original /usr/lib/libnspr4.so.
and your PR_Cleanup workaround
This is the original NSPR 4.6.4 library. It should not crash.
and your PR_Cleanup workaround
In this version, I removed those three lines from PR_Cleanup.
It should crash.
and replace PR_Cleanup() by PR_DetachThread() in your workaround
In this version, I rewrote PR_DetachThread() to clean up the
thread local storage. It should not crash.
*** Bug 216933 has been marked as a duplicate of this bug. ***
Created attachment 141966 [details]
NSPR test program (detach.c)
Based on Ray's comment 6, I wrote an NSPR test program that
crashed. Below is the stack trace. Does your stack trace
look like this?
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread -1208894544 (LWP 31364)]
0x00b4eeaa in ?? ()
#0 0x00b4eeaa in ?? ()
#1 0x009d3228 in __nptl_deallocate_tsd () from /lib/tls/libpthread.so.0
#2 0x009d337f in start_thread () from /lib/tls/libpthread.so.0
#3 0x0084affe in clone () from /lib/tls/libc.so.6
Sorry for the late reply; I've been on a road trip today.
As Matthias said, PR_Cleanup was the function that seemed to make the problem go
away. The NSS_Shutdown was something we added when we noticed that NSS_Init was
called elsewhere in the module.
I don't know the exact stack trace off hand, but it was short. Something along
the lines of:
<signal SEGV received>
Also, valgrind said the segfault was because a resource was accessed at an
unmapped address and that the address fell in the range that nspr was mapped at
I'll tackle Ed in his cube on monday and we can try your built libraries.
ah yes, the backtrace in comment 14 looks right.
Ray, thanks for confirming.
Your PR_Cleanup workaround is fine provided that nobody else
in that process is using NSPR.
Another workaround is for the main thread to call
dlopen("libnspr4.so", RTLD_NOW) and leak the returned library
handle. This prevents libnspr4.so from being unloaded.
Alternatively, link the program with -lnspr4 and call
an NSPR function such as PR_GetCurrentThread() or PR_SetError(0, 0)
to create a reference.
Are we going to need to work around this indefinitely? Can we fix this in NSPR
in time for 5.0.0?
We don't need to work around this indefinitely. I will fix
this in NSPR. Since NSPR is used by many products, the testing
will take a long time. So for 5.0.0 I recommend we work around
this. However, I'd still appreciate it if Ray could perform the
three experiments I described in comment 12 to verify it's indeed
an NSPR bug.
The NSPR bug report is
I set its target milestone to the next NSPR release, 4.6.5,
Sorry Wan-Teh, I forgot to follow up. Ed, is your machine x86-64 or i386?
Ray, this is also reproducible on the Dell on the blue table in my office
(hooked to the flat panel monitor). Usual Westford root pwd. gnome-screensaver
not currently running for rather obvious reasons.
I'm also consistently getting this every day on my main workstation in the
Westford office, on i386, running the RHEL5Beta2 with some updates from the QA
instance of RHN. I get a segfault with a backtrace similar to the one in
comment #15 if I attach to the gnome-screensaver-dialog process in gdb.
Dave, are you using i386 and not x86-64? If so, would you mind building
attachment 141814 [details] into a cvs checkout of pkinit-nss and performing the steps in
Ray, I built x86-64 NSPR binaries for you. They are
Follow the steps in comment 12 with these binaries. Thanks.
I snuck into Ed's cube today while he was on PTO and performed the steps you
outlined in comment 12.
Things seem to be exactly as you've predicted.
1) doesn't crash
3) doesn't crash after rebuilding pkinit with your proposed change (I didn't try
the original patch with libnspr4.so.3 though)
So how should be move forward on this for RHEL5 and beyond?
I guess we could
1) Fix DetachThread in NSPR and call it from a fini function or
__attribute__((destructor)) function in NSPR
2) Don't touch DetachThread, but just clean up thread local storage in the fini
3) make pkinit explicitly dlopen nspr with RTLD_NODELETE
4) make the kerberos libs link against nspr explicitly
5) when the kerberos libs dlopen pkinit they could do it with RTLD_NODELETE
6) make gnome-screensaver-dialog link against nspr explicitly
Do all those workarounds sound plausible? Are there any other possibilities?
Which should we do?
oh, we could also fix DetachThread and then call it from pkinit like step 3 in
Ray, thank you for running those experiments for me.
For RHEL 5, my original recommendation in comment 17 was to use your
workaround "the patch that made the problem go away" (attachment 141814 [details]).
Since I didn't know about the RTLD_NODELETE flag for dlopen, my other
suggestions in comment 17 should be ignored; your suggestions are much
better. Among them, #3 seems the best.
It is better that we use a workaround in RHEL 5 that doesn't require a
new NSPR release because I may not have enough time to test the new
NSPR code thoroughly.
Here are my recommendations:
Near term: the PR_Cleanup workaround, or use one of the proposals to
prevent libnspr4.so from being unloaded.
Long term: fix PR_DetachThread and then call it from pkinit
I don't think expecting pkinit to call PR_DetachThread is very nice. It doesn't
after all do anything with threads.
Why can't nspr call PR_DetachThread on its own when it gets unloaded?
Nalin, which work around would you feel best about?
Ray, nspr calling PR_DetachThread on its own when it gets unloaded is
a great idea and I will definitely pursue it. Although it will fix this
bug, it isn't sufficient in general when there are multiple threads that
have "touched" NSPR. If all of them stay around after NSPR is unloaded,
all of them need to clean up their thread local storage, and PR_Cleanup
can only take care of one of them.
As you pointed out, the need to call PR_DetachThread exposes an implementation
detail that isn't obvious to the caller. I will try to hide this as much
as I can. But I am worried that there will be places where it is necessary.
Created attachment 142782 [details]
NSPR patch: add a __attribute__ ((destructor)) function that cleans up the thread local storage
In this patch, I add a __attribute__ ((destructor)) function to
NSPR that cleans up the thread local storage of the current thread.
I built an i386 NSPR library with this patch and emailed it to Ray
for testing. This NSPR patch alone, without any change to pkinit,
should be able to fix the crash.
This patch takes less time to test thoroughly because it doesn't
change the PR_DetachThread function. So it is fine to apply this
patch to the nspr-4.6.4 package and use it in RHEL 5.
So I tested the library on ed's machine and I can report that it seems to work
great. I rolled back pkinit to the version we ship in Beta2, used your updated
nspr library and everything worked fine.
I'd say we should get this built into nspr as quickly as possible.
Kai, can you build it? or are you going to Wan-Teh? (I can, too, just ask)
Ray, thanks again for testing my NSPR patch.
I opened the bug https://bugzilla.mozilla.org/show_bug.cgi?id=362768
for the NSPR upstream and submitted the patch there for code review.
Are we supposed to only apply a patch that has been accepted by the
I built a package for RHEL-5 that includes Wan-Teh's patch.
I'm going to reopen and move to MODIFIED so that QA can verify the fix and
ensure there aren't any regressions, etc.
Don't forget to check in a patch like attachment 141814 [details] for pkinit
to call NSS_Shutdown (if pkinit isn't calling NSS_Shutdown now)
and PR_Cleanup (for double protection). In attachment 141814 [details],
the NSS_Shutdown call needs to be moved after the PORT_FreeArena
call because PORT_FreeArena is an NSS function. It doesn't
matter in this case, but it's a better style.
One of you should review the NSPR patch. RHEL must have some
code review requirements, especially for patches that go in at
this stage. The work to get this patch into the NSPR upstream
is several times greater (more porting, testing, and code review
are required), so it won't happen right away. If we use the patch
before it's accepted by the upstream, we better make sure the patch
is in a good shape that will be accepted.
I ported the NSPR patch to all the other Unix platforms we support
and attached a new patch to the NSPR upstream bug report. For RHEL 5
the NSPR patch in this bug report is fine.
(In reply to comment #35)
> I built a package for RHEL-5 that includes Wan-Teh's patch.
I have tested this build on my personal workstaton today, and can report that it
works fine, insofar as gnome-screensaver unlock works fine. Updated to GA
snapshot 3 on x86_64.
A package has been built which should help the problem described in
this bug report. This report is therefore being closed with a resolution
of CURRENTRELEASE. You may reopen this bug report if the solution does
not work for you.