Bug 216718 - nspr needs to cleanup thread local storage before getting unloaded
nspr needs to cleanup thread local storage before getting unloaded
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: nspr (Show other bugs)
5.0
i386 Linux
urgent Severity high
: ---
: ---
Assigned To: Kai Engert (:kaie)
: Desktop, Regression, Reopened
: 216615 216933 (view as bug list)
Depends On:
Blocks: 181509 251070
  Show dependency treegraph
 
Reported: 2006-11-21 12:56 EST by Edward Rousseau
Modified: 2015-04-12 19:17 EDT (History)
12 users (show)

See Also:
Fixed In Version: RC
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2007-02-07 19:43:41 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
the patch that made the problem go away (545 bytes, patch)
2006-11-21 14:36 EST, Ray Strode [halfline]
no flags Details | Diff
NSPR test program (detach.c) (1.25 KB, text/plain)
2006-11-22 21:27 EST, Wan-Teh Chang
no flags Details
NSPR patch: add a __attribute__ ((destructor)) function that cleans up the thread local storage (1.62 KB, patch)
2006-12-04 17:22 EST, Wan-Teh Chang
no flags Details | Diff


External Trackers
Tracker ID Priority Status Summary Last Updated
Mozilla Foundation 362768 None None None Never

  None (edit)
Description Edward Rousseau 2006-11-21 12:56:53 EST
Description of problem:
When logged in as a kerberos user I cannot unlock the screensaver even though
the dialog appears to correctly take the password.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1.Log into gnome session as a kerberos user
2.lock screen
3.provide password to unlock screensaver and get back to session
  
Actual results:
screensaver stays locked


Expected results:
Screensaver unlocked

Additional info:
Comment 1 Ray Strode [halfline] 2006-11-21 14:34:31 EST
So I looked at this a bit today with Matthias. The lock dialog segfaulted when
its pam helper thread exited because it was trying to access resources from
pkinit's address block after pkinit was unloaded (libc was trying to clean up
the thread's local storage).

We were able to work around the problem by running PR_Cleanup from the pkinit's
fini function.

I'm not sure if this is the right fix however.  Will calling PR_Cleanup mess
with other plugins or libraries using nss in the same process? I don't think
this will be a problem for gnome-screensaver because it doesn't use NSS
elsewhere inprocess (except conceivably by another pam module, but those are
executed synchronously from each other I believe).
Comment 2 Ray Strode [halfline] 2006-11-21 14:36:47 EST
Created attachment 141814 [details]
the patch that made the problem go away
Comment 4 Bob Relyea 2006-11-21 17:58:15 EST
adding an NSPR expert to this mix.
Comment 5 Wan-Teh Chang 2006-11-21 22:59:37 EST
The right fix is to synchronize with the termination of
the lock dialog's pam helper thread before pkinit is unloaded.
That can be done with the PR_JoinThread/pthread_join function
or a condition variable.  For PR_JoinThread to work, the thread
must be created with PR_CreateThread(..., PR_JOINTABLE_THREAD, ...).

Your workaround is depending on certain side effects of
PR_Cleanup.  If you don't know why it works, it's probably
not the right fix.

PR_Cleanup does the following:
- frees all the memory and system resources (file descriptors,
  etc.) used by NSPR
- synchronizes with the termination of all the threads created
  by PR_CreateThread(PR_USER_THREAD, ...).  See the first while
  loop in
  http://lxr.mozilla.org/nspr/source/nsprpub/pr/src/pthreads/ptthread.c#927

Perhaps your workaround is depending on the second side effect?

Comment 6 Ray Strode [halfline] 2006-11-22 01:31:39 EST
Hi Wan-Teh,

I'm not sure I understand what you're saying.

the pkinit plugin is loaded, run, and unloaded by pam_krb5 in the pam helper
thread.  There are only two threads at play here, the main thread that processes
UI events, and the helper thread that is created for the authentication
conversation. 

When the pam conversation completes the helper thread wakes up the main thread
and exits.  when the main thread wakes up it reaps the helper thread with
g_thread_join (which is just a wrapper around pthread_join).

The crash happens when the helper thread exits, however.  Make sense?
Comment 7 Ray Strode [halfline] 2006-11-22 01:45:45 EST
*** Bug 216615 has been marked as a duplicate of this bug. ***
Comment 8 Wan-Teh Chang 2006-11-22 10:20:15 EST
Ray, sorry about the confusion.  Could you get a stack trace of
the pam helper thread when it segfaulted?  What's the thread local
storage that libc was trying to clean up?  Your patch (attachment 141814 [details])
in comment 2 calls both NSS_Shutdown and PR_Cleanup.  Do you know
which function call made the problem go away?
Comment 9 Nalin Dahyabhai 2006-11-22 10:28:54 EST
If it's the NSS_Shutdown() call, then it's something that needs to be fixed
there anyway (if the module initializes the library when it gets loaded, then it
should be shutting it down before it's unloaded).
Comment 10 Matthias Clasen 2006-11-22 10:33:16 EST
The PR_Cleanup made it go away
Comment 11 Wan-Teh Chang 2006-11-22 10:52:33 EST
Thanks.  I still need a stack trace and information about the
thread local storage that Ray referred to in comment 1.  But
at the risk of causing more confusion, I speculate that the
crash is caused by the thread local storage used by NSPR for
NSPR thread IDs (pointers to the PRThread structure), and your
workaround is depending on this pthread_setspecific call in PR_Cleanup:
http://lxr.mozilla.org/nspr/source/nsprpub/pr/src/pthreads/ptthread.c#950

If you know how to build NSPR, you can comment out these three lines
in PR_Cleanup (mozilla/nsprpub/pr/src/pthreads/ptthread.c):
        _pt_thread_death(me);
        rv = pthread_setspecific(pt_book.key, NULL);
        PR_ASSERT(0 == rv);
If I guessed right, commenting out these three lines will make
the problem come back.  If you don't know how to build NSPR, I can
do that for you when I get to work later.
Comment 12 Wan-Teh Chang 2006-11-22 13:21:59 EST
I built three NSPR libraries.  Please perform these three experiments.
In each experiment, install the NSPR library as /usr/lib/libnspr4.so.
Remember to back up the original /usr/lib/libnspr4.so.

1. /home/wtchang/nspr-4.6.4/libnspr4.so.1
and your PR_Cleanup workaround

This is the original NSPR 4.6.4 library.  It should not crash.

2. /home/wtchang/nspr-4.6.4/libnspr4.so.2
and your PR_Cleanup workaround

In this version, I removed those three lines from PR_Cleanup.
It should crash.

3. /home/wtchang/nspr-4.6.4/libnspr4.so.3
and replace PR_Cleanup() by PR_DetachThread() in your workaround

In this version, I rewrote PR_DetachThread() to clean up the
thread local storage.  It should not crash.
Comment 13 Edward Rousseau 2006-11-22 16:38:20 EST
*** Bug 216933 has been marked as a duplicate of this bug. ***
Comment 14 Wan-Teh Chang 2006-11-22 21:27:54 EST
Created attachment 141966 [details]
NSPR test program (detach.c)

Based on Ray's comment 6, I wrote an NSPR test program that
crashed.  Below is the stack trace.  Does your stack trace
look like this?

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread -1208894544 (LWP 31364)]
0x00b4eeaa in ?? ()
(gdb) where
#0  0x00b4eeaa in ?? ()
#1  0x009d3228 in __nptl_deallocate_tsd () from /lib/tls/libpthread.so.0
#2  0x009d337f in start_thread () from /lib/tls/libpthread.so.0
#3  0x0084affe in clone () from /lib/tls/libc.so.6
(gdb)
Comment 15 Ray Strode [halfline] 2006-11-22 22:26:19 EST
Hi Wan-teh,

Sorry for the late reply; I've been on a road trip today.

As Matthias said, PR_Cleanup was the function that seemed to make the problem go
away.  The NSS_Shutdown was something we added when we noticed that NSS_Init was
called elsewhere in the module.

I don't know the exact stack trace off hand, but it was short.  Something along
the lines of:

 <signal SEGV received>
 nptl_deallocate_tsd()
 ??
 start_thread(...) 
 
Also, valgrind said the segfault was because a resource was accessed at an
unmapped address and that the address fell in the range that nspr was mapped at
ealier.

I'll tackle Ed in his cube on monday and we can try your built libraries.
Comment 16 Ray Strode [halfline] 2006-11-22 22:28:33 EST
ah yes, the backtrace in comment 14 looks right.
Comment 17 Wan-Teh Chang 2006-11-22 22:45:39 EST
Ray, thanks for confirming.

Your PR_Cleanup workaround is fine provided that nobody else
in that process is using NSPR.

Another workaround is for the main thread to call
dlopen("libnspr4.so", RTLD_NOW) and leak the returned library
handle.  This prevents libnspr4.so from being unloaded.
Alternatively, link the program with -lnspr4 and call
an NSPR function such as PR_GetCurrentThread() or PR_SetError(0, 0)
to create a reference.
Comment 18 Nalin Dahyabhai 2006-11-30 17:31:28 EST
Are we going to need to work around this indefinitely?  Can we fix this in NSPR
in time for 5.0.0?
Comment 19 Wan-Teh Chang 2006-11-30 17:57:40 EST
We don't need to work around this indefinitely.  I will fix
this in NSPR.  Since NSPR is used by many products, the testing
will take a long time.  So for 5.0.0 I recommend we work around
this.  However, I'd still appreciate it if Ray could perform the
three experiments I described in comment 12 to verify it's indeed
an NSPR bug.

The NSPR bug report is
https://bugzilla.mozilla.org/show_bug.cgi?id=254983

I set its target milestone to the next NSPR release, 4.6.5,
yesterday.
Comment 20 Ray Strode [halfline] 2006-11-30 18:05:33 EST
Sorry Wan-Teh, I forgot to follow up.  Ed, is your machine x86-64 or i386?
Comment 21 Jeff Needle 2006-11-30 18:11:54 EST
Ray, this is also reproducible on the Dell on the blue table in my office
(hooked to the flat panel monitor).  Usual Westford root pwd.  gnome-screensaver
not currently running for rather obvious reasons.
Comment 22 Dave Malcolm 2006-12-01 11:25:52 EST
I'm also consistently getting this every day on my main workstation in the
Westford office, on i386, running the RHEL5Beta2 with some updates from the QA
instance of RHN.  I get a segfault with a backtrace similar to the one in
comment #15 if I attach to the gnome-screensaver-dialog process in gdb.

Comment 23 Ray Strode [halfline] 2006-12-01 11:39:30 EST
Dave, are you using i386 and not x86-64?  If so, would you mind building
attachment 141814 [details] into a cvs checkout of pkinit-nss and performing the steps in
comment 12?
Comment 24 Wan-Teh Chang 2006-12-01 19:41:31 EST
Ray, I built x86-64 NSPR binaries for you.  They are
/home/wtchang/nspr-4.6.4/x86-64/libnspr4.so.1
/home/wtchang/nspr-4.6.4/x86-64/libnspr4.so.2
/home/wtchang/nspr-4.6.4/x86-64/libnspr4.so.3

Follow the steps in comment 12 with these binaries. Thanks.
Comment 25 Ray Strode [halfline] 2006-12-04 13:33:18 EST
Hi Wan-Teh,

I snuck into Ed's cube today while he was on PTO and performed the steps you
outlined in comment 12.

Things seem to be exactly as you've predicted.

1) doesn't crash
2) crashes
3) doesn't crash after rebuilding pkinit with your proposed change (I didn't try
the original patch with libnspr4.so.3 though)

So how should be move forward on this for RHEL5 and beyond?

I guess we could 
1) Fix DetachThread in NSPR and call it from a fini function or
__attribute__((destructor)) function in NSPR
2) Don't touch DetachThread, but just clean up thread local storage in the fini
method
3) make pkinit explicitly dlopen nspr with RTLD_NODELETE 
4) make the kerberos libs link against nspr explicitly 
5) when the kerberos libs dlopen pkinit they could do it with RTLD_NODELETE 
6) make gnome-screensaver-dialog link against nspr explicitly

Do all those workarounds sound plausible?  Are there any other possibilities? 
Which should we do?
Comment 26 Ray Strode [halfline] 2006-12-04 14:07:54 EST
oh, we could also fix DetachThread and then call it from pkinit like step 3 in
comment 12.
Comment 27 Wan-Teh Chang 2006-12-04 14:17:13 EST
Ray, thank you for running those experiments for me.

For RHEL 5, my original recommendation in comment 17 was to use your
workaround "the patch that made the problem go away" (attachment 141814 [details]).

Since I didn't know about the RTLD_NODELETE flag for dlopen, my other
suggestions in comment 17 should be ignored; your suggestions are much
better.  Among them, #3 seems the best.

It is better that we use a workaround in RHEL 5 that doesn't require a
new NSPR release because I may not have enough time to test the new
NSPR code thoroughly.

Here are my recommendations:
Near term: the PR_Cleanup workaround, or use one of the proposals to
prevent libnspr4.so from being unloaded.
Long term: fix PR_DetachThread and then call it from pkinit
Comment 28 Ray Strode [halfline] 2006-12-04 14:26:48 EST
Hi Wan-Teh,

I don't think expecting pkinit to call PR_DetachThread is very nice.  It doesn't
after all do anything with threads.

Why can't nspr call PR_DetachThread on its own when it gets unloaded?
Comment 29 Ray Strode [halfline] 2006-12-04 14:27:42 EST
Nalin, which work around would you feel best about?
Comment 30 Wan-Teh Chang 2006-12-04 14:40:02 EST
Ray, nspr calling PR_DetachThread on its own when it gets unloaded is
a great idea and I will definitely pursue it.  Although it will fix this
bug, it isn't sufficient in general when there are multiple threads that
have "touched" NSPR.  If all of them stay around after NSPR is unloaded,
all of them need to clean up their thread local storage, and PR_Cleanup
can only take care of one of them.

As you pointed out, the need to call PR_DetachThread exposes an implementation
detail that isn't obvious to the caller.  I will try to hide this as much
as I can.  But I am worried that there will be places where it is necessary.

Comment 31 Wan-Teh Chang 2006-12-04 17:22:05 EST
Created attachment 142782 [details]
NSPR patch: add a __attribute__ ((destructor)) function that cleans up the thread local storage

In this patch, I add a __attribute__ ((destructor)) function to
NSPR that cleans up the thread local storage of the current thread.
I built an i386 NSPR library with this patch and emailed it to Ray
for testing.  This NSPR patch alone, without any change to pkinit,
should be able to fix the crash.

This patch takes less time to test thoroughly because it doesn't
change the PR_DetachThread function.  So it is fine to apply this
patch to the nspr-4.6.4 package and use it in RHEL 5.
Comment 32 Ray Strode [halfline] 2006-12-04 17:31:49 EST
So I tested the library on ed's machine and I can report that it seems to work
great.  I rolled back pkinit to the version we ship in Beta2, used your updated
nspr library and everything worked fine.

I'd say we should get this built into nspr as quickly as possible.

Kai, can you build it? or are you going to Wan-Teh?  (I can, too, just ask)
Comment 33 Wan-Teh Chang 2006-12-04 18:15:53 EST
Ray, thanks again for testing my NSPR patch.

I opened the bug https://bugzilla.mozilla.org/show_bug.cgi?id=362768
for the NSPR upstream and submitted the patch there for code review.

Are we supposed to only apply a patch that has been accepted by the
upstream?
Comment 35 Kai Engert (:kaie) 2006-12-05 10:30:04 EST
I built a package for RHEL-5 that includes Wan-Teh's patch.
nspr-4.6.4-3.el5
Comment 36 Ray Strode [halfline] 2006-12-05 10:50:01 EST
I'm going to reopen and move to MODIFIED so that QA can verify the fix and
ensure there aren't any regressions, etc.
Comment 37 Wan-Teh Chang 2006-12-05 11:49:27 EST
Don't forget to check in a patch like attachment 141814 [details] for pkinit
to call NSS_Shutdown (if pkinit isn't calling NSS_Shutdown now)
and PR_Cleanup (for double protection).  In attachment 141814 [details],
the NSS_Shutdown call needs to be moved after the PORT_FreeArena
call because PORT_FreeArena is an NSS function.  It doesn't
matter in this case, but it's a better style.

One of you should review the NSPR patch.  RHEL must have some
code review requirements, especially for patches that go in at
this stage.  The work to get this patch into the NSPR upstream
is several times greater (more porting, testing, and code review
are required), so it won't happen right away.  If we use the patch
before it's accepted by the upstream, we better make sure the patch
is in a good shape that will be accepted.
Comment 39 Wan-Teh Chang 2006-12-05 20:07:47 EST
I ported the NSPR patch to all the other Unix platforms we support
and attached a new patch to the NSPR upstream bug report.  For RHEL 5
the NSPR patch in this bug report is fine.
Comment 40 Jon Orris 2006-12-11 15:27:51 EST
(In reply to comment #35)
> I built a package for RHEL-5 that includes Wan-Teh's patch.
> nspr-4.6.4-3.el5
> 

I have tested this build on my personal workstaton today, and can report that it
works fine, insofar as gnome-screensaver unlock works fine. Updated to GA
snapshot 3 on x86_64.



Comment 41 RHEL Product and Program Management 2007-02-07 19:43:41 EST
A package has been built which should help the problem described in 
this bug report. This report is therefore being closed with a resolution 
of CURRENTRELEASE. You may reopen this bug report if the solution does 
not work for you.

Note You need to log in before you can comment on or make changes to this bug.