Bug 306611 - autofs-5.0.1-0.rc2.43.0.2.x86_64 segfaults when '/etc/init.d/autofs stop' is run
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: autofs5
Version: 4.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Ian Kent
QA Contact:
URL:
Whiteboard:
Depends On: 286541
Blocks:
Reported: 2007-09-26 09:39 UTC by Ian Kent
Modified: 2008-07-24 19:08 UTC (History)

Fixed In Version: RHBA-2008-0659
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-07-24 19:08:40 UTC
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHBA-2008:0659 | 0 | normal | SHIPPED_LIVE | autofs5 bug fix and enhancement update | 2008-07-23 14:54:24 UTC

Description Ian Kent 2007-09-26 09:39:09 UTC
+++ This bug was initially created as a clone of Bug #286541 +++

Description of problem:
This is an escalation from Issue tracker, describing it here to avoid
unnecessary cruft.
-------------------------------------------------------------------
/etc/init.d/autofs stop

generates logs:
Sep 11 10:03:14 why kernel: automount[15749]: segfault at 00002aaaac1a7d80 rip
00002aaaac1a7d80 rsp 00000000404230e8 error 14

The autofs daemon has been configured to consult a ldap server.
-------------------------------------------------------------------

Version-Release number of selected component (if applicable):
autofs-5.0.1-0.rc2.43.0.2.x86_64

How reproducible:
Quite frequently, though not consistently

Steps to Reproduce:
1. Configure autofs to point to an LDAP server
2. Run the following script:

[root@why it119213]# cat test-autofs-segfault.sh
#!/bin/bash

while [ -z "$fault" ]; do
    date
    service autofs start
    ls /studio >& /dev/null
    service autofs stop
    fault=$(tail /var/log/messages | grep segfault)
done

3. The script will stop when the segfault occurs.

Actual results:
The automount daemon receives a segfault while shutting down.

Expected results:
The automount daemon should not receive a segfault and should shut down cleanly.

Additional info:
1. Strangely enough a 'service autofs restart' is less consistent in reproducing
this error (the fact that a mount is actually accessed between the start/stop
may have something to do with this)
2. Problem exists with the upstream version of autofs too (autofs-5.0.2)
3. The problem is reproducible on a local system - why.sfbay.redhat.com
4. I am attaching the following files:

core.26374 ..... core file generated by 'service autofs stop'
bt.txt ......... A capture of the gdb session of the core with the commands 'bt'
followed by 'thread apply all bt'


5. Now, I do not know enough to take this further and the autofs code is new to
me, but I suspect there is a race condition someplace which causes the
pthread_create in daemon/automount.c::do_signals() to receive a SIGSEGV.
6. Upstream comments seem to suggest that the thread cancellation at
shutdown[1], has been a problem earlier too. This might be related, but that is
just a wild guess.

Please let me know if you need additional information.

- steve

[1] Commits from 2006-08-19 to 2006-08-25
http://git.kernel.org/?p=linux/storage/autofs/autofs.git;a=shortlog;h=407e21d657cc9937ffbad3c0c1c932050d25defd;pg=1

-- Additional comment from sfernand on 2007-09-11 14:16 EST --
Created an attachment (id=192781)
gdb session with 'bt' and 'thread apply all bt' on the core


-- Additional comment from sfernand on 2007-09-11 14:24 EST --
Created an attachment (id=192811)
bzip2 of core file generated by 'service autofs stop'


-- Additional comment from ikent on 2007-09-11 23:20 EST --
(In reply to comment #1)
> Created an attachment (id=192781)
> gdb session with 'bt' and 'thread apply all bt' on the core
> 

Excellent information.
I'm aware of this issue.
Please see bug 207260 for more information.

I've already spent quite a bit of time on this and I'll take
a closer look at this info. to ensure that my analysis holds.

Ian


-- Additional comment from ikent on 2007-09-11 23:31 EST --
*** Bug 207260 has been marked as a duplicate of this bug. ***

-- Additional comment from ikent on 2007-09-11 23:47 EST --
Created an attachment (id=193181)
Don't race with pthread library when deleting thread specific key

I'm testing this patch and it appears to resolve the issue.
What still needs to be done is to establish that this is the
right thing to do with respect to the library code.

I believe the problem is that the pthreads library is trying
to delete the thread specific key at the same time as the
library during library unload following the dlclose. I have to
wonder why the key delete call is present at all in the library
code as this is usually better left to the pthread library.

Ian

-- Additional comment from ikent on 2007-09-12 00:33 EST --
(In reply to comment #1)
> Created an attachment (id=192781)
> gdb session with 'bt' and 'thread apply all bt' on the core
> 

Yes, all the threads in this trace are in a location that
follows the autofs lookup library close. This concurs with
my current thinking as to the cause of this issue.

I've built the krb5 package with the patch posted here from
a private branch, could you test it out please.

You can find the x86_64 build rpms at
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=962316

Ian


-- Additional comment from sfernand on 2007-09-12 07:22 EST --
Hi Ian,

Thanks for the quick response on this, however ...

(In reply to comment #6)
> (In reply to comment #1)
> > Created an attachment (id=192781)
> > gdb session with 'bt' and 'thread apply all bt' on the core
> >
>
> Yes, all the threads in this trace are in a location that
> follows the autofs lookup library close. This concurs with
> my current thinking as to the cause of this issue.
>
> I've built the krb5 package with the patch posted here from
> a private branch, could you test it out please.

Umm, I am not sure that this specific issue is related to the
pthread_key_delete() within the krb5 libs. In this particular case, we are not
using Kerberos at all. Moreover, the RIP in the segfault message does not appear
to be in any krb related libs.

I am attaching the pmap of the automount process to this BZ.

Also, in case it helps, as mentioned in comment #1 --

> 3. The problem is reproducible on a local system - why.sfbay.redhat.com


- steve


-- Additional comment from sfernand on 2007-09-12 07:25 EST --
Created an attachment (id=193351)
pmap `pidof automount`


-- Additional comment from ikent on 2007-09-12 09:32 EST --
(In reply to comment #7)
> Hi Ian,
> 
> Thanks for the quick response on this, however ...
> 
> (In reply to comment #6)
> > (In reply to comment #1)
> > > Created an attachment (id=192781)
> > > gdb session with 'bt' and 'thread apply all bt' on the core
> > >
> >
> > Yes, all the threads in this trace are in a location that
> > follows the autofs lookup library close. This concurs with
> > my current thinking as to the cause of this issue.
> >
> > I've built the krb5 package with the patch posted here from
> > a private branch, could you test it out please.
> 
> Umm, I am not sure that this specific issue is related to the
> pthread_key_delete() within the krb5 libs. In this particular case, we are not
> using Kerberos at all. Moreover, the RIP in the segfault message does not appear
> to be in any krb related libs.

You are loading the library indirectly as a dependency by using
LDAP which causes the thread specific key to be created when
the autofs LDAP lookup module is opened and deleted when the last
mount closes it. This is done by the DSO constructor and destructor
functions so you don't have to actually be using Kerberos for
the thread specific key to be created and then deleted. The
lookup module is closed just before the thread handling the
autofs mount exits which also lends evidence to this as a
possible cause.

Also see the comment in daemon/automount.c at about line
1245. It's been there for a long time.

So, if nothing else, try this for me.
We can discuss root cause based on the result of the test
but right now I need to know if this prevents the problem
from happening for you as it does for me.

Ian


-- Additional comment from ikent on 2007-09-12 09:37 EST --
(In reply to comment #8)
> Created an attachment (id=193351)
> pmap `pidof automount`
> 

And sure enough the dependency list of horror shows up in
this list.

libkrb5 -> libgssapi_krb5 -> libkrb5support

Ian




-- Additional comment from sfernand on 2007-09-12 10:09 EST --
> > Umm, I am not sure that this specific issue is related to the
> > pthread_key_delete() within the krb5 libs. In this particular case, we are not
> > using Kerberos at all. Moreover, the RIP in the segfault message does not appear
> > to be in any krb related libs.
> 
> You are loading the library indirectly as a dependency by using
> LDAP which causes the thread specific key to be created when
> the autofs LDAP lookup module is opened and deleted when the last
> mount closes it. This is done by the DSO constructor and destructor
> functions so you don't have to actually be using Kerberos for
> the thread specific key to be created and then deleted. The
> lookup module is closed just before the thread handling the
> autofs mount exits which also lends evidence to this as a
> possible cause.

Thanks for the clarification. I think, I understand.

> Also see the comment in daemon/automount.c at about line
> 1245. It's been there for a long time.
Yes, I noticed that when I was searching whether automount itself called
pthread_key_delete() someplace.

> So, if nothing else, try this for me.
Sure, will do. I just asked because I was curious :).


> We can discuss root cause based on the result of the test
> but right now I need to know if this prevents the problem
> from happening for you as it does for me.

...shall update you soon.

- steve



-- Additional comment from ikent on 2007-09-12 10:28 EST --
(In reply to comment #11)
> > Also see the comment in daemon/automount.c at about line
> > 1245. It's been there for a long time.
> Yes, I noticed that when I was searching whether automount itself called
> pthread_key_delete() someplace.

autofs can't call pthread_key_delete because it has no way of
knowing if the tsd key is still in use. As far as I know it's
best to let pthreads take care of this.

Ian



-- Additional comment from sfernand on 2007-09-12 10:44 EST --
Hi Ian,

> > We can discuss root cause based on the result of the test
> > but right now I need to know if this prevents the problem
> > from happening for you as it does for me.
> 
> ...shall update you soon.

Installing the patched packages doesn't seem to help for this issue. The RIP and
bt appear to be unchanged. (trying this on the same system with the reproducer).

- steve



-- Additional comment from ikent on 2007-09-12 11:40 EST --
(In reply to comment #13)
> Hi Ian,
> 
> > > We can discuss root cause based on the result of the test
> > > but right now I need to know if this prevents the problem
> > > from happening for you as it does for me.
> > 
> > ...shall update you soon.
> 
> Installing the patched packages doesn't seem to help for this issue. The RIP and
> bt appear to be unchanged. (trying this on the same system with the reproducer).

Oops .. sorry.
Did everything except apply the patch in the spec file.

Could you give the rpms at
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=962765
a go please.

Ian



-- Additional comment from sfernand on 2007-09-12 12:06 EST --
Hi Ian,

> Oops .. sorry.
> Did everything except apply the patch in the spec file.
> 
> Could you give the rpms at
> http://brewweb.devel.redhat.com/brew/taskinfo?taskID=962765
> a go please.

Sorry, still no change.

I will be heading back home now, so any more testing would possibly have to
wait till tomorrow, or Bryan (The TAM contact on the associated Issue Tracker)
may update you.

thanks,
- steve


-- Additional comment from ikent on 2007-09-12 12:14 EST --
(In reply to comment #15)
> Hi Ian,
> 
> > Oops .. sorry.
> > Did everything except apply the patch in the spec file.
> > 
> > Could you give the rpms at
> > http://brewweb.devel.redhat.com/brew/taskinfo?taskID=962765
> > a go please.
> 
> Sorry, still no change.
> 
> I will be heading back home now, so any more testing would possibly have to
> wait till tomorrow, or Bryan (The TAM contact on the associated Issue Tracker)
> may update you.

Mmmm, it's back to square one then.
Ian


-- Additional comment from bjmason on 2007-09-12 13:25 EST --
Hi Ian - the reproducer machine is in my office, so let me know if there's
anything I can do to help.  ~Bryan

-- Additional comment from ikent on 2007-09-12 14:11 EST --
(In reply to comment #17)
> Hi Ian - the reproducer machine is in my office, so let me know if there's
> anything I can do to help.  ~Bryan

Thanks, could you find out what the map entry for /studio
(from the reproducer script) is please.

Ian


-- Additional comment from bjmason on 2007-09-12 14:56 EST --
Created an attachment (id=193801)
dump of /studio map from reproducer

Hi Ian,

Here's the output of "ldapsearch -x -b
'ou=automount,dc=anim,dc=dreamworks,dc=com' '(nisMapName=auto.studio-gld)'"
which should dump the contents of the map entry for /studio.

However, I can still reproduce the problem if I comment out the line

    ls /studio >& /dev/null

from the reproducer script, so I'm not sure if the contents of the map are
important.  (I'm not actually sure why I put that line in there in the first
place . . . I probably thought autofs needed something to do to :^).

-- Bryan

-- Additional comment from ikent on 2007-09-13 11:07 EST --
(In reply to comment #16)
> (In reply to comment #15)
> > Hi Ian,
> > 
> > > Oops .. sorry.
> > > Did everything except apply the patch in the spec file.
> > > 
> > > Could you give the rpms at
> > > http://brewweb.devel.redhat.com/brew/taskinfo?taskID=962765
> > > a go please.
> > 
> > Sorry, still no change.
> > 
> > I will be heading back home now, so any more testing would possibly have to
> > wait till tomorrow, or Bryan (The TAM contact on the associated Issue Tracker)
> > may update you.
> 
> Mmmm, it's back to square one then.

*sigh*

I went through this again from the start.

It still looks like pthreads is trying to call an
invalid thread specific data (tsd) destructor function
at thread termination. I tried everything I could to
show it's autofs but all I succeeded in doing is convincing
myself it isn't. The bit that really seals it is that
if I set the tsd destructor to NULL for all autofs tsds
the segv still happens and pthreads explicitly checks
this before calling them. And I checked the ldap, krb5
and openssl packages and the only tsd is in libkrb5support.

Come to think of it I haven't checked the sasl libraries yet.

Anyway, now I think that this may be due to pthreads
not clearing the destructor field of the key when it's
deleted (pthread_key_delete call). So the sequence might
be unload library, pthread_key_delete called, function disappears,
thread terminates, pthreads calls invalid destructor and boom.
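
The hypothesis above can be checked without rebuilding glibc. The following is a minimal sketch, not code from this bug: all names (probe_key, probe_worker and so on) are made up for illustration. It counts how often a destructor runs when the key is deleted while a thread still holds a non-NULL value. POSIX says a destructor associated with a deleted key is no longer called at thread exit, so a conforming pthreads library should report zero calls.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical names for illustration only. */
static pthread_key_t probe_key;
static int probe_destructor_calls;

/* Would be called by pthreads at thread exit if the key were still valid. */
static void probe_destructor(void *value)
{
    free(value);
    probe_destructor_calls++;
}

static void *probe_worker(void *unused)
{
    (void)unused;
    void *data = malloc(16);
    if (data == NULL)
        return NULL;
    pthread_setspecific(probe_key, data);
    /* Delete the key while this thread still holds a non-NULL value. */
    pthread_key_delete(probe_key);
    /* pthread_key_delete() never runs destructors, so free the data here. */
    free(data);
    return NULL;
}

/* Returns the number of destructor calls seen, or -1 on setup failure. */
int destructor_calls_after_key_delete(void)
{
    pthread_t tid;

    probe_destructor_calls = 0;
    if (pthread_key_create(&probe_key, probe_destructor) != 0)
        return -1;
    if (pthread_create(&tid, NULL, probe_worker, NULL) != 0) {
        pthread_key_delete(probe_key);
        return -1;
    }
    pthread_join(tid, NULL);
    return probe_destructor_calls;
}
```

If this returns 0, the stale destructor call has to come from a key that was never deleted rather than from one that was.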

I'm building a patched glibc now to check this.

Ian


-- Additional comment from ikent on 2007-09-17 20:55 EST --
(In reply to comment #20)
> *sigh*
> 
> I went through this again from the start.
> 
> It still looks like pthreads is trying to call an
> invalid thread specific data (tsd) destructor function
> at thread termination.

And it is!
Think I've finally found the guilty party, totally
unexpected, it's libxml2.

libxml2 uses a tsd key, sets a destructor function and
never deletes the key. Consequently, when the LDAP lookup
module is dlclosed the function can go away before autofs
exits.

Ian


-- Additional comment from ikent on 2007-09-17 21:12 EST --
I've patched libxml2 to work around this in a private
branch so we can test it.

The patch is by no means my recommendation as to how
this should be fixed, the libxml2 folks will need to
decide that. It should, however, provide a workaround
while we're waiting.

Please try the packages at:
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=969304
and let me know how it goes.

Ian


-- Additional comment from ikent on 2007-09-17 21:15 EST --
Created an attachment (id=197961)
Patch to add library dtor function to delete tsd key at library exit

And this is the workaround patch I'm testing.
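
The attachment itself is not reproduced on this page. As a rough sketch of the general technique only (not the actual patch), a library can create its key in a constructor and delete it in a destructor, using the GCC/Clang function attributes. The names lib_key, lib_init and lib_fini are assumptions made for this example.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical names; this is not the libxml2 code. */
static pthread_key_t lib_key;
static int lib_key_ready;

/* Per-thread cleanup that pthreads calls at thread exit. */
static void lib_free_value(void *value)
{
    free(value);
}

/* Runs when the shared object is loaded (or before main() in a program). */
__attribute__((constructor))
static void lib_init(void)
{
    if (pthread_key_create(&lib_key, lib_free_value) == 0)
        lib_key_ready = 1;
}

/* Runs at dlclose() or process exit: after this, pthreads will no longer
 * call lib_free_value(), so a dangling function pointer cannot be invoked. */
__attribute__((destructor))
static void lib_fini(void)
{
    if (lib_key_ready) {
        pthread_key_delete(lib_key);
        lib_key_ready = 0;
    }
}

/* Small accessor so callers can see whether the key exists. */
int lib_key_is_ready(void)
{
    return lib_key_ready;
}
```

Deleting the key does not free values that threads have already stored, so data still attached at unload time is leaked unless it is released separately.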

Ian


-- Additional comment from bjmason on 2007-09-18 15:19 EST --
Works for me!

I installed the updated libxml2 packages and was able to start/stop autofs over
100 times with no segfaults.

-- Additional comment from veillard on 2007-09-18 17:56 EST --
patch looks simple ... except for __attribute ((destructor))
this is not just conditional on HAVE_PTHREAD_H but also on the
compiler used, so would need a bit more looking before pushing it
upstream. I wonder if there is a more portable way to run a library
destructor...
I'm still a bit puzzled about this but well I assume it's
right, I will have to run libxml2 regression tests with it enabled 
too !

Daniel

-- Additional comment from ikent on 2007-09-18 23:45 EST --
(In reply to comment #25)
> patch looks simple ... except for __attribute ((destructor))
> this is not just conditional on HAVE_PTHREAD_H but also on the
> compiler used, so would need a bit more looking before pushing it
> upstream. I wonder if there is a more portable way to run a library
> destructor...
> I'm still a bit puzzled about this but well I assume it's
> right, I will have to run libxml2 regression tests with it enabled 
> too !

Thanks for the reply Daniel,

The patch isn't portable, for sure, and isn't appropriate
for upstream, it's really just a workaround for the interim.
As it stands the patch may be allowing a memory leak for the
key data it's trying to free at the exit of autofs.

Let me try and fill in the blanks. I'm not sure how much you
know about pthreads "thread specific data" functions
pthread_key_create, pthread_getspecific, pthread_setspecific
and pthread_key_delete so I'll assume not much.

Basically, using these functions one can create a "key" that
is used to hang allocated data that is distinct in each
thread by using pthread_setspecific and pthread_getspecific.
Often, when this "key" is created, a destructor function is
given that's used to deallocate the data and it's called
internally by pthreads at the exit of each thread. The call
is conditional on two things that the "key" creator has control
over, a non-null data pointer for the "key" of the owning thread
and whether the key is valid, which it is if pthread_key_delete
hasn't been called for the given "key".

So, the bottom line is that normally you can just not delete
the key and let pthreads clean up at application exit knowing
that the allocated data will be freed and everyone will be
happy. But if you're a shared library that can be unloaded
by the user before the application exits this assumption no
longer holds. In this case, if a key value is non-null and the
key hasn't been deleted, the function given to clean up the data
at exit may not exist any more because the library is gone.
Hence the segv.
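
The normal case described above can be sketched as follows. This is an illustrative example with made-up names (demo_key, demo_worker), not autofs or libxml2 code: a key is created with a destructor, a thread stores a non-NULL value, and pthreads calls the destructor when that thread exits.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical names for illustration only. */
static pthread_key_t demo_key;
static int demo_destructor_calls;  /* written at thread exit, read after join */

/* Called by pthreads at thread exit for a thread whose value is non-NULL. */
static void demo_destructor(void *value)
{
    free(value);
    demo_destructor_calls++;
}

static void *demo_worker(void *unused)
{
    (void)unused;
    int *data = malloc(sizeof *data);
    if (data != NULL) {
        *data = 42;
        /* Attach per-thread data; the key is valid and the value non-NULL. */
        pthread_setspecific(demo_key, data);
    }
    return NULL;  /* the destructor runs after this return */
}

/* Returns the number of destructor calls seen, or -1 on setup failure. */
int run_demo_worker(void)
{
    pthread_t tid;

    demo_destructor_calls = 0;
    if (pthread_key_create(&demo_key, demo_destructor) != 0)
        return -1;
    if (pthread_create(&tid, NULL, demo_worker, NULL) != 0) {
        pthread_key_delete(demo_key);
        return -1;
    }
    pthread_join(tid, NULL);
    pthread_key_delete(demo_key);
    return demo_destructor_calls;
}
```

If demo_destructor lived in a shared library that was dlclosed before the worker exited, the same call would jump to an unmapped address.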

The problem happens with autofs because it "dlopens" lookup
modules for different map sources such as LDAP, NIS etc.
and "dlcloses" them when it's finished with them. The LDAP
lookup module (uses) depends on the libxml2 library and so
this problem happens. We can see other libraries, such as
libkrb5support, do perform a key delete in their library unload
routine similar to the patch that I included here.

So, a more portable way to deal with this is for the library
to always deallocate this data once it's finished using it
and then call pthread_setspecific to set the key value to NULL.
This should be enough to prevent the segv that has been reported
here. But this has the problem of leaking pthreads keys which are
also limited in number and could also lead to a problem if the
library is loaded and unloaded a lot. So, in reality, both these
things should be done, the deallocation and the key delete for
a proper fix.
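
The two steps can be sketched together as below. This is a hedged illustration with invented names (state_key, state_release), not a proposed libxml2 change: each thread frees its own data and clears its slot when it is finished, and the key is deleted once no thread needs it.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical names for illustration only. */
static pthread_key_t state_key;
static int state_destructor_calls;

/* Safety net: only runs if a thread exits with a non-NULL value. */
static void state_destructor(void *value)
{
    free(value);
    state_destructor_calls++;
}

/* Step 1: free the calling thread's data and set its key value to NULL. */
static void state_release(void)
{
    void *value = pthread_getspecific(state_key);
    if (value != NULL) {
        free(value);
        pthread_setspecific(state_key, NULL);
    }
}

static void *state_worker(void *unused)
{
    (void)unused;
    void *value = malloc(32);
    if (value != NULL && pthread_setspecific(state_key, value) != 0)
        free(value);  /* could not attach the data, so drop it */
    /* ... the per-thread data would be used here ... */
    state_release();
    return NULL;
}

/* Returns destructor calls seen at thread exit, or -1 on setup failure. */
int run_state_worker(void)
{
    pthread_t tid;

    state_destructor_calls = 0;
    if (pthread_key_create(&state_key, state_destructor) != 0)
        return -1;
    if (pthread_create(&tid, NULL, state_worker, NULL) != 0) {
        pthread_key_delete(state_key);
        return -1;
    }
    pthread_join(tid, NULL);
    /* Step 2: delete the key so the slot can be reused. */
    if (pthread_key_delete(state_key) != 0)
        return -1;
    return state_destructor_calls;
}
```

Because the slot is already NULL at thread exit, pthreads has no reason to call the destructor, and the key delete returns the slot to the pool.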

I really don't have enough understanding of libxml2 to be able
to implement the deallocation bit nor sufficient knowledge of
its configure environment you mentioned to do this properly.
It's really something that upstream needs to work on. Hey,
I don't really know much about anything so all this could be
completely wrong, ;), but I don't think it's that far from the
mark.

Let me know how you go upstream please.
Ian


-- Additional comment from ikent on 2007-09-18 23:57 EST --
(In reply to comment #26)
> 
> The problem happens with autofs because it "dlopens" lookup
> modules for different map sources such as LDAP, NIS etc.
> and "dlcloses" them when it's finished with them. The LDAP
> lookup module (uses) depends on the libxml2 library and so
> this problem happens. We can see other libraries, such as
> libkrb5support, do perform a key delete in their library unload
> routine similar to the patch that I included here.
> 

But wait, there's more.

An RFE that I'm working on will lead to the LDAP lookup
module possibly being loaded and unloaded over time so
I suspect that, even if autofs never exits, the key
exhaustion problem will pop up as well.

Sorry, Ian


-- Additional comment from ikent on 2007-09-19 12:33 EST --
(In reply to comment #26)
> (In reply to comment #25)
> > patch looks simple ... except for __attribute ((destructor))
> > this is not just conditional on HAVE_PTHREAD_H but also on the
> > compiler used, so would need a bit more looking before pushing it
> > upstream. I wonder if there is a more portable way to run a library
> > destructor...
> > I'm still a bit puzzled about this but well I assume it's
> > right, I will have to run libxml2 regression tests with it enabled 
> > too !
> 
> Thanks for the reply Daniel,
> 
> The patch isn't portable, for sure, and isn't appropriate
> for upstream, it's really just a workaround for the interim.
> As it stands the patch may be allowing a memory leak for the
> key data it's trying to free at the exit of autofs.

Actually, now I think about it I need to work around this in
autofs anyway. I can't assume that people will have an updated
libxml2 even if it is fixed.

But please send this upstream also.

Ian


-- Additional comment from rkhadgar on 2007-09-26 01:17 EST --
the patched libxml2 package helps :)

-- Additional comment from ikent on 2007-09-26 05:02 EST --
(In reply to comment #29)
> the patched libxml2 package helps :)

Sorry, I do have a patch to work around this.
I just haven't got to building it just yet.

I'll get to it soon as I can.
Ian

Comment 1 RHEL Program Management 2007-12-11 04:15:12 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 2 Ian Kent 2008-03-18 04:15:30 UTC
The RHTS sub test bz286541 under autofs-tests/bugzillas can
be used to verify this bug.

This issue is fixed in autofs5-5.0.1-0.rc2.87.

Comment 4 Barry Donahue 2008-06-12 13:32:31 UTC
Verified RHTS job #23575.

Comment 6 errata-xmlrpc 2008-07-24 19:08:40 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0659.html

