+++ This bug was initially created as a clone of Bug #286541 +++

Description of problem:

This is an escalation from Issue tracker, describing it here to avoid unnecessary cruft.

-------------------------------------------------------------------
/etc/init.d/autofs stop generates logs:

Sep 11 10:03:14 why kernel: automount[15749]: segfault at 00002aaaac1a7d80 rip 00002aaaac1a7d80 rsp 00000000404230e8 error 14

The autofs daemon has been configured to consult an LDAP server.
-------------------------------------------------------------------

Version-Release number of selected component (if applicable):
autofs-5.0.1-0.rc2.43.0.2.x86_64

How reproducible:
Quite frequently, though not consistently

Steps to Reproduce:
1. Configure autofs to point at an LDAP server
2. Run the following script:

[root@why it119213]# cat test-autofs-segfault.sh
#!/bin/bash
while [ -z "$fault" ]; do
    date
    service autofs start
    ls /studio >& /dev/null
    service autofs stop
    fault=$(tail /var/log/messages | grep segfault)
done

3. The script will stop when the segfault occurs.

Actual results:
The automount daemon receives a segfault while shutting down.

Expected results:
The automount daemon should not receive a segfault and should shut down cleanly.

Additional info:
1. Strangely enough, a 'service autofs restart' is less consistent in reproducing this error (the fact that a mount is actually accessed between the start/stop may have something to do with this)
2. Problem exists with the upstream version of autofs too (autofs-5.0.2)
3. The problem is reproducible on a local system - why.sfbay.redhat.com
4. I am attaching the following files:
   core.26374 ..... core file generated by 'service autofs stop'
   bt.txt ......... A capture of the gdb session of the core with the commands 'bt' followed by 'thread apply all bt'
5. Now, I do not know enough to take this further and the autofs code is new to me, but I suspect there is a race condition someplace which causes the pthread_create in daemon/automount.c::do_signals() to receive a SIGSEGV.
6. Upstream comments seem to suggest that the thread cancellation at shutdown[1] has been a problem earlier too. This might be related, but that is just a wild guess.

Please let me know if you need additional information.

- steve

[1] Commits from 2006-08-19 to 2006-08-25
http://git.kernel.org/?p=linux/storage/autofs/autofs.git;a=shortlog;h=407e21d657cc9937ffbad3c0c1c932050d25defd;pg=1

-- Additional comment from sfernand on 2007-09-11 14:16 EST --
Created an attachment (id=192781)
gdb session with 'bt' and 'thread apply all bt' on the core

-- Additional comment from sfernand on 2007-09-11 14:24 EST --
Created an attachment (id=192811)
bzip2 of core file generated by 'service autofs stop'

-- Additional comment from ikent on 2007-09-11 23:20 EST --
(In reply to comment #1)
> Created an attachment (id=192781)
> gdb session with 'bt' and 'thread apply all bt' on the core
>

Excellent information.
I'm aware of this issue.
Please see bug 207260 for more information.

I've already spent quite a bit of time on this and I'll take a closer look at this info to ensure that my analysis holds.

Ian

-- Additional comment from ikent on 2007-09-11 23:31 EST --
*** Bug 207260 has been marked as a duplicate of this bug. ***

-- Additional comment from ikent on 2007-09-11 23:47 EST --
Created an attachment (id=193181)
Don't race with pthread library when deleting thread specific key

I'm testing this patch and it appears to resolve the issue. What still needs to be done is to establish that this is the right thing to do wrt. the library code.

I believe the problem is that the pthreads library is trying to delete the thread specific key at the same time as the library during library unload following the dlclose.
I have to wonder why the key delete call is present at all in the library code as this is usually better left to the pthread library. Ian -- Additional comment from ikent on 2007-09-12 00:33 EST -- (In reply to comment #1) > Created an attachment (id=192781) [edit] > gdb session with 'bt' and 'thread apply all bt' on the core > Yes, all the threads in this trace are in a location that follows the autofs lookup library close. This concurs with my current thinking as to the cause of this issue. I've built the krb5 package with the patch posted here from a private branch, could you test it out please. You can find the x86_64 build rpms at http://brewweb.devel.redhat.com/brew/taskinfo?taskID=962316 Ian -- Additional comment from sfernand on 2007-09-12 07:22 EST -- Hi Ian, Thanks for the quick response on this, however ... (In reply to comment #6) > (In reply to comment #1) > > Created an attachment (id=192781) [edit] [edit] > > gdb session with 'bt' and 'thread apply all bt' on the core > > > > Yes, all the threads in this trace are in a location that > follows the autofs lookup library close. This concurs with > my current thinking as to the cause of this issue. > > I've built the krb5 package with the patch posted here from > a private branch, could you test it out please. Umm, I am not sure that this specific issue is related to the pthread_key_delete() within the krb5 libs. In this particular case, we are not using Kerberos at all. Moreover, the RIP in the segfault message does not appear to be in any krb related libs. I am attaching the pmap of the automount process to this BZ. Also, in case it helps, as mentioned in comment #1 -- > 3. 
The problem is reproducible on a local system - why.sfbay.redhat.com - steve -- Additional comment from sfernand on 2007-09-12 07:25 EST -- Created an attachment (id=193351) pmap `pidof automount` -- Additional comment from ikent on 2007-09-12 09:32 EST -- (In reply to comment #7) > Hi Ian, > > Thanks for the quick response on this, however ... > > (In reply to comment #6) > > (In reply to comment #1) > > > Created an attachment (id=192781) [edit] [edit] [edit] > > > gdb session with 'bt' and 'thread apply all bt' on the core > > > > > > > Yes, all the threads in this trace are in a location that > > follows the autofs lookup library close. This concurs with > > my current thinking as to the cause of this issue. > > > > I've built the krb5 package with the patch posted here from > > a private branch, could you test it out please. > > Umm, I am not sure that this specific issue is related to the > pthread_key_delete() within the krb5 libs. In this particular case, we are not > using Kerberos at all. Moreover, the RIP in the segfault message does not appear > to be in any krb related libs. You are loading the library indirectly as a dependency by using LDAP which causes the thread specific key to be created when the autofs LDAP lookup module is opened and deleted when the last mount closes it. This is done by the DSO constructor and destructor functions so you don't have to actually be using Kerberos for the thread specific key to be created and then deleted. The lookup module is closed just before the thread handling the autofs mount exits which also lends evidence to this as a possible cause. Also see the comment in daemon/automount.c at about line 1245. It's been there for a long time. So, if nothing else, try this for me. We can discuss root cause based on the result of the test but right now I need to know if this prevents the problem from happening for you as it does for me. 
Ian -- Additional comment from ikent on 2007-09-12 09:37 EST -- (In reply to comment #8) > Created an attachment (id=193351) [edit] > pmap `pidof automount` > And sure enough the dependency list of horror shows up in this list. libkrb5 -> libgssapi_krb5 -> libkrb5support Ian -- Additional comment from sfernand on 2007-09-12 10:09 EST -- > > Umm, I am not sure that this specific issue is related to the > > pthread_key_delete() within the krb5 libs. In this particular case, we are not > > using Kerberos at all. Moreover, the RIP in the segfault message does not appear > > to be in any krb related libs. > > You are loading the library indirectly as a dependency by using > LDAP which causes the thread specific key to be created when > the autofs LDAP lookup module is opened and deleted when the last > mount closes it. This is done by the DSO constructor and destructor > functions so you don't have to actually be using Kerberos for > the thread specific key to be created and then deleted. The > lookup module is closed just before the thread handling the > autofs mount exits which also lends evidence to this as a > possible cause. Thanks for the clarification. I think, I understand. > Also see the comment in daemon/automount.c at about line > 1245. It's been there for a long time. Yes, I noticed that when I was searching whether automount itself called pthread_key_delete() someplace. > So, if nothing else, try this for me. Sure, will do. I just asked because I was curious :). > We can discuss root cause based on the result of the test > but right now I need to know if this prevents the problem > from happening for you as it does for me. ...shall update you soon. - steve -- Additional comment from ikent on 2007-09-12 10:28 EST -- (In reply to comment #11) > > Also see the comment in daemon/automount.c at about line > > 1245. It's been there for a long time. > Yes, I noticed that when I was searching whether automount itself called > pthread_key_delete() someplace. 
autofs can't call pthread_key_delete because it has no way of knowing if the tsd key is still in use. As far as I know it's best to let pthreads take care of this. Ian -- Additional comment from sfernand on 2007-09-12 10:44 EST -- Hi Ian, > > We can discuss root cause based on the result of the test > > but right now I need to know if this prevents the problem > > from happening for you as it does for me. > > ...shall update you soon. Installing the patched packages doesn't seem to help for this issue. The RIP and bt appear to be unchanged. (trying this on the same system with the reproducer). - steve -- Additional comment from ikent on 2007-09-12 11:40 EST -- (In reply to comment #13) > Hi Ian, > > > > We can discuss root cause based on the result of the test > > > but right now I need to know if this prevents the problem > > > from happening for you as it does for me. > > > > ...shall update you soon. > > Installing the patched packages doesn't seem to help for this issue. The RIP and > bt appear to be unchanged. (trying this on the same system with the reproducer). Oops .. sorry. Did everything except apply the patch in the spec file. Could you give the rpms at http://brewweb.devel.redhat.com/brew/taskinfo?taskID=962765 a go please. Ian -- Additional comment from sfernand on 2007-09-12 12:06 EST -- Hi Ian, > Oops .. sorry. > Did everything except apply the patch in the spec file. > > Could you give the rpms at > http://brewweb.devel.redhat.com/brew/taskinfo?taskID=962765 > a go please. Sorry, still no change. I will be heading back home now, so any more testing would possibly have to wait till tomorrow, or Bryan (The TAM contact on the associated Issue Tracker) may update you. thanks, - steve -- Additional comment from ikent on 2007-09-12 12:14 EST -- (In reply to comment #15) > Hi Ian, > > > Oops .. sorry. > > Did everything except apply the patch in the spec file. 
> > > > Could you give the rpms at > > http://brewweb.devel.redhat.com/brew/taskinfo?taskID=962765 > > a go please. > > Sorry, still no change. > > I will be heading back home now, so any more testing would possibly have to > wait till tomorrow, or Bryan (The TAM contact on the associated Issue Tracker) > may update you. Mmmm, it's back to square one then. Ian -- Additional comment from bjmason on 2007-09-12 13:25 EST -- Hi Ian - the reproducer machine is in my office, so let me know if there's anything I can do to help. ~Bryan -- Additional comment from ikent on 2007-09-12 14:11 EST -- (In reply to comment #17) > Hi Ian - the reproducer machine is in my office, so let me know if there's > anything I can do to help. ~Bryan Thanks, could you find out what the map entry for /studio (from the reproducer script) is please. Ian -- Additional comment from bjmason on 2007-09-12 14:56 EST -- Created an attachment (id=193801) dump of /studio map from reproducer Hi Ian, Here's the output of "ldapsearch -x -b 'ou=automount,dc=anim,dc=dreamworks,dc=com' '(nisMapName=auto.studio-gld)'" which should dump the contents of the map entry for /studio. However, I can still reproduce the problem if I comment out the line ls /studio >& /dev/null from the reproducer script, so I'm not sure if the contents of the map are important. (I'm not actually sure why I put that line in there in the first place . . . I probably thought autofs needed something to do to :^). -- Bryan -- Additional comment from ikent on 2007-09-13 11:07 EST -- (In reply to comment #16) > (In reply to comment #15) > > Hi Ian, > > > > > Oops .. sorry. > > > Did everything except apply the patch in the spec file. > > > > > > Could you give the rpms at > > > http://brewweb.devel.redhat.com/brew/taskinfo?taskID=962765 > > > a go please. > > > > Sorry, still no change. 
> > > > I will be heading back home now, so any more testing would possibly have to > > wait till tomorrow, or Bryan (The TAM contact on the associated Issue Tracker) > > may update you. > > Mmmm, it's back to square one then. *sigh* I went through this again from the start. It still looks like pthreads is trying to call an invalid thread specific data (tsd) destructor function at thread termination. I tried everything I could to show it's autofs but all I succeeded in doing is convincing myself it isn't. The bit that really seals it is that if I set the tsd destructor to NULL for all autofs tsds the segv still happens and pthreads explicitly checks this before calling them. And I checked the ldap, krb5 and openssl packages and the only tsd is in libkrb5support. Come to think of it I haven't checked the sasl libraries yet. Anyway, now I think that this may be due to pthreads not clearing the destructor field of the key when it's deleted (pthread_key_delete call). So the sequence might be unload library, pthread_key_delete called, function disappears, thread terminates, pthreads calls invalid destructor and boom. I'm building a patched glibc now to check this. Ian -- Additional comment from ikent on 2007-09-17 20:55 EST -- (In reply to comment #20) > *sigh* > > I went through this again from the start. > > It still looks like pthreads is trying to call an > invalid thread specific data (tsd) destructor function > at thread termination. And it is! Think I've finally found the guilty party, totally unexpected, it's libxml2. libxml2 uses a tsd key, sets a destructor function and never deletes the key. Consequently, when the LDAP lookup module is dlclosed the function can go away before autofs exits. Ian -- Additional comment from ikent on 2007-09-17 21:12 EST -- I've patched libxml2 to work around this in a private branch so we can test it. The patch is by no means my recommendation as to how this should be fixed, the libxml2 folks will need to decide that. 
It should, however, provide a workaround while we're waiting. Please try the packages at: http://brewweb.devel.redhat.com/brew/taskinfo?taskID=969304 and let me know how it goes. Ian -- Additional comment from ikent on 2007-09-17 21:15 EST -- Created an attachment (id=197961) Patch to add library ctor function to delete tsd key at library exit And this is the workaround patch I'm testing. Ian -- Additional comment from bjmason on 2007-09-18 15:19 EST -- Works for me! I installed the updated libxml2 packages and was able to start/stop autofs over 100 times with no segfaults. -- Additional comment from veillard on 2007-09-18 17:56 EST -- patch looks simple ... except for __attribute ((destructor)) this is not just conditional on HAVE_PTHREAD_H but also on the compiler used, so would need a bit more looking before pushing it upstream. I wonder if there is a more portable way to run a library destructor... I'm still a bit puzzled about this but well I assume it's right, I will have to run libxml2 regression tests with it enabled too ! Daniel -- Additional comment from ikent on 2007-09-18 23:45 EST -- (In reply to comment #25) > patch looks simple ... except for __attribute ((destructor)) > this is not just conditional on HAVE_PTHREAD_H but also on the > compiler used, so would need a bit more looking before pushing it > upstream. I wonder if there is a more portable way to run a library > destructor... > I'm still a bit puzzled about this but well I assume it's > right, I will have to run libxml2 regression tests with it enabled > too ! Thanks for the reply Daniel, The patch isn't portable, for sure, and isn't appropriate for upstream, it's really just a workaround for the interim. As it stands the patch may be allowing a memory leak for the key data it's trying to free at the exit of autofs. Let me try and fill in the blanks. 
I'm not sure how much you know about the pthreads "thread specific data" functions pthread_key_create, pthread_getspecific, pthread_setspecific and pthread_key_delete so I'll assume not much.

Basically, using these functions one can create a "key" that is used to hang allocated data that is distinct in each thread by using pthread_setspecific and pthread_getspecific. Often, when this "key" is created, a destructor function is given that's used to deallocate the data and it's called internally by pthreads at the exit of each thread. The call is conditional on two things that the "key" creator has control over: a non-null data pointer for the "key" in the owning thread and whether the key is valid, which it is if pthread_key_delete hasn't been called for the given "key".

So, the bottom line is that normally you can just not delete the key and let pthreads clean up at application exit knowing that the allocated data will be freed and everyone will be happy. But if you're a shared library that can be unloaded by the user before the application exits this assumption no longer holds. In this case, if a key value is non-null and the key hasn't been deleted, the function given to clean up the data at exit may not exist any more because the library is gone. Hence the segv.

The problem happens with autofs because it "dlopens" lookup modules for different map sources such as LDAP, NIS etc. and "dlcloses" them when it's finished with them. The LDAP lookup module depends on the libxml2 library and so this problem happens. We can see other libraries, such as libkrb5support, do perform a key delete in their library unload routine similar to the patch that I included here.

So, a more portable way to deal with this is for the library to always deallocate this data once it's finished using it and then call pthread_setspecific to set the key value to NULL. This should be enough to prevent the segv that has been reported here.
But this has the problem of leaking pthreads keys, which are limited in number, and could also lead to a problem if the library is loaded and unloaded a lot. So, in reality, both these things should be done, the deallocation and the key delete, for a proper fix.

I really don't have enough understanding of libxml2 to be able to implement the deallocation bit nor sufficient knowledge of its configure environment you mentioned to do this properly. It's really something that upstream needs to work on.

Hey, I don't really know much about anything so all this could be completely wrong, ;), but I don't think it's that far from the mark.

Let me know how you go upstream please.

Ian

-- Additional comment from ikent on 2007-09-18 23:57 EST --
(In reply to comment #26)
>
> The problem happens with autofs because it "dlopens" lookup
> modules for different map sources such as LDAP, NIS etc.
> and "dlcloses" them when it's finished with them. The LDAP
> lookup module (uses) depends on the libxml2 library and so
> this problem happens. We can see other libraries, such as
> libkrb5support, do do a key delete in their library unload
> routine similar to the patch that I included here.
>

But wait, there's more.

An RFE that I'm working on will lead to the LDAP lookup module possibly being loaded and unloaded over time so I suspect that, even if autofs never exits, the key exhaustion problem will pop up as well.

Sorry,
Ian

-- Additional comment from ikent on 2007-09-19 12:33 EST --
(In reply to comment #26)
> (In reply to comment #25)
> > patch looks simple ... except for __attribute ((destructor))
> > this is not just conditional on HAVE_PTHREAD_H but also on the
> > compiler used, so would need a bit more looking before pushing it
> > upstream. I wonder if there is a more portable way to run a library
> > destructor...
> > I'm still a bit puzzled about this but well I assume it's
> > right, I will have to run libxml2 regression tests with it enabled
> > too !
>
> Thanks for the reply Daniel,
>
> The patch isn't portable, for sure, and isn't appropriate
> for upstream, it's really just a workaround for the interim.
> As it stands the patch may be allowing a memory leak for the
> key data it's trying to free at the exit of autofs.

Actually, now I think about it, I need to work around this in autofs anyway. I can't assume that people will have an updated libxml2 even if it is fixed.

But please send this upstream also.

Ian

-- Additional comment from rkhadgar on 2007-09-26 01:17 EST --
the patched libxml2 package helps :)

-- Additional comment from ikent on 2007-09-26 05:02 EST --
(In reply to comment #29)
> the patched libxml2 package helps :)

Sorry, I do have a patch to work around this.
I just haven't got to building it just yet.
I'll get to it as soon as I can.

Ian
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
The RHTS sub test bz286541 under autofs-tests/bugzillas can be used to verify this bug. This issue is fixed in autofs5-5.0.1-0.rc2.87.
Verified RHTS job #23575.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2008-0659.html
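For anyone reading this report later, the thread specific data (tsd) handling that Ian describes in his reply to Daniel can be sketched in C roughly as below. This is an illustration only and is not the libxml2 workaround patch or the autofs fix: the mod_* functions, the 64-byte buffer, the free counter and demo_run() are all hypothetical names made up for the example. It shows the two steps the thread says a proper fix needs, namely freeing the per-thread data and setting the key value to NULL, then deleting the key before the code that contains the destructor can go away. In a real shared library mod_fini() would presumably be hooked into the library unload path (the workaround patch used GCC's __attribute__((destructor)), which comment #25 points out is compiler-specific); here it is simply called by hand.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical module state: one tsd key plus a counter of freed buffers. */
static pthread_key_t mod_key;
static atomic_int mod_frees;

/* tsd destructor: pthreads calls this at thread exit for non-NULL values,
 * but only while the key has not been deleted. */
static void mod_free_tsd(void *data)
{
    free(data);
    atomic_fetch_add(&mod_frees, 1);
}

/* "Library load": create the key and register the destructor. */
static int mod_init(void)
{
    return pthread_key_create(&mod_key, mod_free_tsd);
}

/* Return this thread's buffer, allocating it on first use. */
static void *mod_get_buffer(void)
{
    void *buf = pthread_getspecific(mod_key);
    if (buf == NULL) {
        buf = calloc(1, 64);
        if (buf != NULL && pthread_setspecific(mod_key, buf) != 0) {
            free(buf);
            buf = NULL;
        }
    }
    return buf;
}

/* Portable part of the fix: free the calling thread's data and set the
 * key value to NULL so pthreads has nothing left to destroy for it. */
static void mod_release_thread(void)
{
    void *buf = pthread_getspecific(mod_key);
    if (buf != NULL) {
        int rc;
        mod_free_tsd(buf);
        rc = pthread_setspecific(mod_key, NULL);
        assert(rc == 0); /* clearing a slot on a valid key should not fail */
        (void)rc;
    }
}

/* "Library unload": release our own data, then delete the key so the
 * destructor is never called after its code has been unmapped. */
static int mod_fini(void)
{
    mod_release_thread();
    return pthread_key_delete(mod_key);
}

static void *worker(void *arg)
{
    (void)arg;
    (void)mod_get_buffer(); /* freed by mod_free_tsd when this thread exits */
    return NULL;
}

/* Returns the number of buffers freed, or -1 on error. */
int demo_run(void)
{
    pthread_t tid;

    atomic_store(&mod_frees, 0);
    if (mod_init() != 0)
        return -1;
    if (pthread_create(&tid, NULL, worker, NULL) != 0)
        return -1;
    if (pthread_join(tid, NULL) != 0)
        return -1;
    if (mod_get_buffer() == NULL) /* calling thread gets its own buffer */
        return -1;
    if (mod_fini() != 0)
        return -1;
    return atomic_load(&mod_frees);
}
```

Called from a main() and built with -pthread, demo_run() should count two frees: one done by pthreads when the worker thread exits while the key is still valid, and one done explicitly by mod_release_thread() before the key is deleted. Once pthread_key_delete has been called, pthreads no longer invokes mod_free_tsd for that key, which is the property the libkrb5support-style unload routine mentioned in the thread relies on.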