Bug 1197622
Summary: automount segfaults in error code path (reproducible PRErrorTableList corruption)
Product: Red Hat Enterprise Linux 6
Reporter: Karsten Weiss <knweiss>
Component: autofs
Assignee: Ian Kent <ikent>
Status: CLOSED ERRATA
QA Contact: Kun Wang <kunwan>
Severity: high
Docs Contact:
Priority: urgent
Version: 6.6
CC: den1987, dueno, eguan, ext.gfi.guillaume.hilt, ikent, knweiss, m.camen, mjtrangoni, rbergant, realitix, rhandlin, rmeggins, rpalco, swhiteho, toneata, xifeng, xzhou
Target Milestone: rc
Keywords: Reopened
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: autofs-5.0.5-140.el6_10.1
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1381924 (view as bug list)
Environment:
Last Closed: 2019-04-09 13:35:02 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1268411
Attachments:
Description  Karsten Weiss  2015-03-02 09:04:35 UTC
As I mentioned in email, I'm wondering if this is a side effect of closing the autofs shared library module on HUP signal, or when autofs initiates a map re-read because it thinks the map has changed.

Before I dive into the code of related packages we really need to know if this library re-open is causing the problem somehow. We can check if this is the case fairly easily by just not doing the close then open on map re-read and then working through the failure case you have.

Beware that some of the configuration entries relating to things like the LDAP server etc. won't be updated if they have changed. So if you're using this test rpm you'll need to restart autofs to apply configuration changes.

Can you work through your procedure using the rpms here please:
http://people.redhat.com/~ikent/autofs-5.0.5-109.debug.1.el6.1

Ian

(In reply to Karsten Weiss from comment #0)
> tl;dr I think the initialization of automount (autofs) is broken using
> ldap automount maps and causes segfaults in some error code paths. The
> reason is (reproducible) corruption of the global PRErrorTableList.
> I've seen it both in 6.6 and 7.0! (Full disclosure: My 6.6 test machine
> runs CentOS; I've disclosed this problem in a private mail to Ian Kent and he
> encouraged me to open this bug.)

I don't think it matters if you're running CentOS, at least not to me. Logging a bug directly in Red Hat Bugzilla you face the same problem as anyone else doing so, including RHEL subscribers. If there are tasks that need my attention that come in through support then they must get priority, so there's a risk this bug won't get the time it needs. But I think that's fair since, apart from the obvious business sense, for these bugzillas there's also a risk I will be stuck burning time gathering information that should be done by front line support.

I've tested your autofs test rpms in gdb. When the code enters PR_ErrorInstallTable() for the 4th time we now see this situation:

(gdb) print *Table_List
$30 = {next = 0x7ffff8239f30, table = 0x7ffff4b12750, table_private = 0x0}
(gdb) print *Table_List.table
$31 = {msgs = 0x0, name = 0x0, base = 0x0, n_msgs = 0x0}    !!!!!!!!!
(gdb) print *Table_List.next
$32 = {next = 0x7ffff8239490, table = 0x7ffff4366820, table_private = 0x0}
(gdb) print *Table_List.next.table
$33 = {msgs = 0x7ffff4366840, name = 0x7ffff415b874 "secerrstrings", base = 0xffffe000, n_msgs = 0xb4}
(gdb) print *Table_List.next.next
$34 = {next = 0x0, table = 0x7ffff3d33600, table_private = 0x0}
(gdb) print *Table_List.next.next.table
$35 = {msgs = 0x7ffff3d33620, name = 0x7ffff3b27de0 "prerr", base = 0xffffe890, n_msgs = 0x4d}

The first table entry now contains a nullified/empty table (gdb output $31), which was a "sslerr" table before.
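For readers following the gdb output: the fields being printed come from NSPR's error-table machinery. A paraphrased sketch of the relevant structures (based on nspr's prerror.h and prerrortable.c; a reading aid, not the exact shipped source):

    struct PRErrorMessage {                /* one entry of an error table */
        const char *name;                  /* macro name of the error */
        const char *en_text;               /* default English text */
    };

    struct PRErrorTable {                  /* what $31, $33 and $35 print */
        const struct PRErrorMessage *msgs; /* the message array */
        const char *name;                  /* e.g. "sslerr", "prerr" */
        PRErrorCode base;                  /* first error code covered */
        int n_msgs;                        /* number of messages */
    };

    struct PRErrorTableList {              /* what *Table_List prints */
        struct PRErrorTableList *next;
        const struct PRErrorTable *table;  /* still owned by the registering library */
        struct PRErrorCallbackTablePrivate *table_private;
    };

The detail that matters for this bug: PR_ErrorInstallTable() allocates the list node itself but stores the caller's table pointer as-is, so each `table` keeps pointing into the data segment of whichever library registered it.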
[ With the *official* autofs rpm the same "print" would have shown this output instead:
(gdb) print *Table_List.table
Cannot access memory at address 0x7ffff4b12750 ]

When we finish this 4th call to PR_ErrorInstallTable() we get this result:

(gdb) finish
Run till exit from #0 PR_CallOnce (once=0x7ffff176cca8, func=<value optimized out>) at ../../../nspr/pr/src/misc/prinit.c:776
st_expire: state 1 path /soft
handle_packet: type = 3
handle_packet_missing_indirect: token 906, name sana2, request pid 32684
0x00007ffff154e3c7 in ssl_InitializePRErrorTable () at sslerrstrs.c:32
32          return (PR_SUCCESS == PR_CallOnce(&once, ssl_InitializePRErrorTableOnce))
Value returned is $45 = PR_SUCCESS
(gdb) print *Table_List
$46 = {next = 0x7ffff8292450, table = 0x7ffff1769750, table_private = 0x0}
(gdb) print *Table_List.table
$47 = {msgs = 0x7ffff1769780, name = 0x7ffff155f34a "sslerr", base = 0xffffd000, n_msgs = 0x85}
(gdb) print *Table_List.next
$48 = {next = 0x7ffff8239f30, table = 0x7ffff4b12750, table_private = 0x0}
(gdb) print *Table_List.next.table
$49 = {msgs = 0x0, name = 0x0, base = 0x0, n_msgs = 0x0}
(gdb) print *Table_List.next.next
$50 = {next = 0x7ffff8239490, table = 0x7ffff4366820, table_private = 0x0}
(gdb) print *Table_List.next.next.table
$51 = {msgs = 0x7ffff4366840, name = 0x7ffff415b874 "secerrstrings", base = 0xffffe000, n_msgs = 0xb4}
(gdb) print *Table_List.next.next.next
$52 = {next = 0x0, table = 0x7ffff3d33600, table_private = 0x0}
(gdb) print *Table_List.next.next.next.table
$53 = {msgs = 0x7ffff3d33620, name = 0x7ffff3b27de0 "prerr", base = 0xffffe890, n_msgs = 0x4d}

If we take another look at the (segfaulting) PR_ErrorToString() code in nspr, things look better now:

PR_IMPLEMENT(const char *)
PR_ErrorToString(PRErrorCode code, PRLanguageCode language)
{
    [...]
    struct PRErrorTableList *et;
    [...]
    for (et = Table_List; et; et = et->next) {
!!!     if (et->table->base <= code &&
!!!         et->table->base + et->table->n_msgs > code) {
            /* This is the right table */
            [...]
            if (callback_lookup) {
                msg = callback_lookup(code, language, et->table,
                                      callback_private, et->table_private);
                if (msg)
                    return msg;
            }
            return(et->table->msgs[code - et->table->base].en_text);
        }
    }
    [...]

As far as I can see the marked "if" now can't segfault anymore:

a) access of et->table is valid for all four PRErrorTableList table entries (no more invalid memory accesses)
b) et->table->base (0x0) <= code (-5961) (=> "TCP connection reset by peer") is false, so we don't enter the if statement's body. Thus the nullified table should be skipped now.

This looks good to me!

Having said that, I still wonder if it really makes sense that we see the 4th PR_ErrorInstallTable() in the worker thread at all. I.e. why does the first registration of the "sslerr" table get nullified and thrown away? Why does the code register the "sslerr" table a 2nd time in the worker thread - thereby re-doing all the initial (nspr) ssl_Init() work of the main thread? Shouldn't there only be a single "sslerr" table registration (probably from the main thread)?

But then, once again: I'm not really familiar with this code. I'm just wondering.

I will keep this autofs debug rpm running on my test system and keep you updated if I see anything unexpected.
(In reply to Karsten Weiss from comment #3)
> Having said that, I still wonder if it really makes sense that we see the 4th
> PR_ErrorInstallTable() in the worker thread at all?
>
> I.e. why does the first registration of the "sslerr" table get nullified and
> thrown away? Why does the code register the "sslerr" table a 2nd time in the
> worker thread - thereby re-doing all the initial (nspr) ssl_Init() work of
> the main thread?
>
> Shouldn't there only be a single "sslerr" table registration (probably from
> the main thread)?

These are questions I can't answer. As I've said, I have deliberately avoided using anything but what I have to, namely the LDAP libraries in this case, and leave the handling of fragile libraries to those libraries.

> But then, once again: I'm not really familiar with this code. I'm just
> wondering.

I'm not familiar with the code we're looking at either. And after the last time I needed to look at it I'm not keen at all to do it again if I can avoid it.

The only change made to autofs in that rpm is to comment out the close and then open of autofs's own libraries, nothing more. That's why a HUP signal won't now pick up some configuration changes.

The point of the exercise is to work out if the dlclose() and then dlopen() (of autofs's own libraries only) when a HUP signal is received or a map re-read is triggered is causing a library data segment, containing things this library assumes won't go away, to go away and then come back clean. If that is the case then the library probably assumes global variables in data segments it doesn't "own" won't go away, which isn't good.

If we're sure this is the case, or at least that not doing the close then open resolves the problem, I'll need to change autofs. That's not straightforward but is better than facing this problem with who knows how many libraries behaving this way over time.

Looking forward to your feedback and thanks for the effort.

Ian

I've been working on eliminating the autofs library modules close then open on receiving a HUP signal. Doing that was based on the feedback of comment #3.

Now I need to test the changes (particularly so because one of the modules was quite difficult and fairly complicated to change), which will be tedious and time consuming, so I thought I'd check whether you have been using the package of comment #1 and how that went. If I'm wrong about the cause I'd rather not spend even more time on this change.

Unfortunately, the debug autofs rpm was accidentally replaced by the "official" one on this test system after a couple of days.

However, we installed the debug rpm again on another test system that recently ran into several autofs crashes. I'll come back to you on Monday.

(Side note: Did you try to call PR_ErrorToString(-5961, 0x0) unconditionally after the autofs error table initializations on a debug build? Maybe this would result in a simple testcase that can be tested without waiting for TCP problems. It may be worth a try.)

(In reply to Karsten Weiss from comment #6)
> Unfortunately, the debug autofs rpm was accidentally replaced by the
> "official" one on this test system after a couple of days.

LOL, it happens.

> However, we installed the debug rpm again on another test system that
> recently ran into several autofs crashes. I'll come back to you on Monday.

That doesn't sound good, I'll be interested to see what happened. There is only very little difference in the rpm.

> (Side note: Did you try to call PR_ErrorToString(-5961, 0x0)
> unconditionally after the autofs error table initializations on a debug
> build? Maybe this would result in a simple testcase that can be tested
> without waiting for TCP problems. It may be worth a try.)
There's a fair bit of procedure in initializing ssl, none of which is done in autofs, and there really isn't any need for autofs to call the ssl layer directly. As I've said before, I leave that to the libraries that need to do it.

Thanks for getting back to me.

Ian

On the 2nd test machine we saw two autofs-5.0.5-109.el6_6.1 crashes (2015-06-30 and 2015-07-03).

We then decided to update autofs to your autofs-5.0.5-109.debug.1.el6.1.x86_64 (2015-07-07) and so far it is running fine and did not crash anymore.

However, I just took a closer look at the two existing backtraces of this 2nd test system and noticed that they are a little bit different than on the first system. This time the crash happens because of a "double free or corruption (fasttop)" during an ldap reconnect.

Here's the 2015-06-30 backtrace (the other one is identical) - please notice that this is very similar to Red Hat solution #717803 "automount core dumps on RHEL 6.5" (https://access.redhat.com/solutions/717803):

Core was generated by `automount --pid-file /var/run/autofs.pid'.
Program terminated with signal 6, Aborted.
#0  0x00007fc852e09625 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64        return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
Missing separate debuginfos, use: debuginfo-install libcom_err-1.41.12-21.el6.x86_64
(gdb) bt
#0  0x00007fc852e09625 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007fc852e0ae05 in abort () at abort.c:92
#2  0x00007fc852e47537 in __libc_message (do_abort=2, fmt=0x7fc852f2f840 "*** glibc detected *** %s: %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:198
#3  0x00007fc852e4ce66 in malloc_printerr (action=3, str=0x7fc852f2fb18 "double free or corruption (fasttop)", ptr=<value optimized out>) at malloc.c:6336
#4  0x00007fc85255ba35 in ldap_int_tls_destroy (lo=0x7fc7ec020860) at ../../../libraries/libldap/tls2.c:105
#5  0x00007fc85254204f in ldap_ld_free (ld=0x7fc7ec020a70, close=<value optimized out>, sctrls=<value optimized out>, cctrls=<value optimized out>) at ../../../libraries/libldap/unbind.c:209
#6  0x00007fc852784660 in __unbind_ldap_connection (logopt=0, ldap=0x7fc7ec020a70, ctxt=<value optimized out>) at lookup_ldap.c:227
#7  0x00007fc8527849bd in unbind_ldap_connection (logopt=0, ldap=0x7fc7ec020a70, ctxt=0x7fc7f8023ba0) at lookup_ldap.c:239
#8  0x00007fc852784bcd in do_connect (logopt=0, uri=<value optimized out>, ctxt=0x7fc7f8023ba0) at lookup_ldap.c:668
#9  0x00007fc8527853fc in do_reconnect (logopt=0, ctxt=0x7fc7f8023ba0) at lookup_ldap.c:924
#10 0x00007fc852785d93 in lookup_one (ap=0x7fc855e73100, source=0x7fc838000950, key=0x7fc7ec02c0f0 "session.zip", key_len=11, ctxt=0x7fc7f8023ba0) at lookup_ldap.c:2875
#11 match_key (ap=0x7fc855e73100, source=0x7fc838000950, key=0x7fc7ec02c0f0 "session.zip", key_len=11, ctxt=0x7fc7f8023ba0) at lookup_ldap.c:3366
#12 0x00007fc852787057 in check_map_indirect (ap=0x7fc855e73100, name=<value optimized out>, name_len=<value optimized out>, context=0x7fc7f8023ba0) at lookup_ldap.c:3453
#13 lookup_mount (ap=0x7fc855e73100, name=<value optimized out>, name_len=<value optimized out>, context=0x7fc7f8023ba0) at lookup_ldap.c:3606
#14 0x00007fc8543f04c3 in lookup_nss_mount ()
#15 0x00007fc8543e8758 in ?? ()
#16 0x00007fc853fa19d1 in start_thread (arg=0x7fc7fffff700) at pthread_create.c:301
#17 0x00007fc852ebf8fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
(gdb) up 4
#4  0x00007fc85255ba35 in ldap_int_tls_destroy (lo=0x7fc7ec020860) at ../../../libraries/libldap/tls2.c:105
105             ldap_pvt_tls_ctx_free( lo->ldo_tls_ctx );
(gdb) list
100
101     void
102     ldap_int_tls_destroy( struct ldapoptions *lo )
103     {
104         if ( lo->ldo_tls_ctx ) {
105             ldap_pvt_tls_ctx_free( lo->ldo_tls_ctx );
106             lo->ldo_tls_ctx = NULL;
107         }
108
109         if ( lo->ldo_tls_certfile ) {
(gdb) print *lo
$1 = {ldo_valid = 2, ldo_debug = 0, ldo_version = 3, ldo_deref = 0, ldo_timelimit = 10, ldo_sizelimit = 0,
  ldo_tm_api = {tv_sec = -1, tv_usec = 0}, ldo_tm_net = {tv_sec = 8, tv_usec = 0}, ldo_defludp = 0x0,
  ldo_defport = 389, ldo_defbase = 0x7fc808000a10 "dc=<redacted>", ldo_defbinddn = 0x0, ldo_keepalive_idle = 0,
  ldo_keepalive_probes = 0, ldo_keepalive_interval = 0, ldo_refhoplimit = 5, ldo_sctrls = 0x0, ldo_cctrls = 0x0,
  ldo_rebind_proc = 0, ldo_rebind_params = 0x0, ldo_nextref_proc = 0, ldo_nextref_params = 0x0,
  ldo_urllist_proc = 0, ldo_urllist_params = 0x0, ldo_conn_cbs = 0x0, ldo_booleans = 1, ldo_peer = 0x0,
  ldo_cldapdn = 0x0, ldo_is_udp = 0, ldo_tls_ctx = 0x7fc80800ad30, ldo_tls_connect_cb = 0,
  ldo_tls_connect_arg = 0x0, ldo_tls_info = {lt_certfile = 0x0, lt_keyfile = 0x0, lt_dhfile = 0x0,
    lt_cacertfile = 0x0, lt_cacertdir = 0x0, lt_ciphersuite = 0x0, lt_crlfile = 0x0, lt_randfile = 0x0,
    lt_protocol_min = 0}, ldo_tls_mode = 0, ldo_tls_require_cert = 2, ldo_tls_impl = 0, ldo_tls_crlcheck = 0,
  ldo_def_sasl_mech = 0x0, ldo_def_sasl_realm = 0x0, ldo_def_sasl_authcid = 0x0, ldo_def_sasl_authzid = 0x0,
  ldo_sasl_secprops = {min_ssf = 0, max_ssf = 2147483647, maxbufsize = 16777215, security_flags = 17,
    property_names = 0x0, property_values = 0x0}}

(Notice that there are many automount messages like this in the system log:

automount[3166]: key "session.zip" not found in map source(s).

This is caused by software that is not under our control which performs some strange map lookups like that. automount still shouldn't crash, though.)

(In reply to Karsten Weiss from comment #8)
> On the 2nd test machine we saw two autofs-5.0.5-109.el6_6.1 crashes
> (2015-06-30 and 2015-07-03).
>
> We then decided to update autofs to your
> autofs-5.0.5-109.debug.1.el6.1.x86_64 (2015-07-07)
> and so far it is running fine and did not crash anymore.
>
> However, I just took a closer look at the two existing backtraces of this
> 2nd test system and noticed that they are a little bit different than on
> the first system. This time the crash happens because of a "double free
> or corruption (fasttop)" during an ldap reconnect:

I'm aware of this (at least this looks like the same problem) and there were a couple of corrections that came out of that investigation, although they are not included in the current RHEL release because the customer hasn't tested the change.

I thought I included them in the test rpm I provided but I will check.

Problem is that I thought the customer with this issue (and another similar one) had reported the corrections didn't fix the problem, but now I look I see they haven't given any feedback from testing. That's a problem because I had started thinking the shared library close to open on HUP signal (or certain type map modified) was also a factor with this problem.

So, assuming I have included all the changes as well as disabling the close to open on HUP signal in the provided rpm, we should be good to test.
Will confirm what I've included in the rpm.

Ian

(In reply to Ian Kent from comment #9)
> I thought I included them in the test rpm I provided but I will
> check.
snip ...
> Will confirm what I've included in the rpm.

I see that one change was included in rev 104 but there's another that I should have included. Let me add that patch and provide a new test rpm tomorrow.

The backtrace above could be covered by the missing change or it could be due to the close to open behaviour. But if it was due to the missing change I'd have expected to see get_query_dn() in the backtrace. If it occurs with the updated rpm I'll need to dig deeper.

Ian

autofs-5.0.5-109.debug.1.el6.1.x86_64 is still up and running on the 2nd test machine.

(In reply to Karsten Weiss from comment #11)
> autofs-5.0.5-109.debug.1.el6.1.x86_64 is still up and running on the 2nd
> test machine.

That's good to know. After thinking about this it doesn't give us conclusive evidence that my assumption of the problem cause is correct. The problem being: the change in the rpm you have prevents the code being executed, and while some of that code is still well used for lookups, the problem could still be with code that is not used.

I think the better way to go now is to test the actual change that I've done. It's certain to have mistakes in it so it will be a challenge for us both, if you're willing to help with testing it.

So that's the question, are you willing to set aside a machine or two to do this? If so I'll get onto back porting the patches and produce a test rpm.

Ian

Yes, I can test a new debug version.

(In reply to Karsten Weiss from comment #13)
> Yes, I can test a new debug version.

Thanks, I'll get onto the back port, I've written it against the upstream source, and do some initial testing.

I've back ported the changes that I hope will help with this problem. I've done some general testing and that worked out OK.

I think it's worth trying this out to see if it does in fact help with the problem. If it doesn't, I don't really want to keep spending time on this and I'll need to start looking deeper into dependent library code used by autofs.

The packages can be found at:
http://people.redhat.com/~ikent/autofs-5.0.5-114.bz1243525.1.el6/

These packages have been built against rhel-6.7 but should work OK with earlier rhel revisions. If not, there is a source rpm included at the location above that can be used to re-build the package on the target system.

Please be conservative when selecting systems to use for testing as we don't know what mistakes I've made and there are quite a few changes included in the build.

Please test and let me know how it goes.

Ian

Thanks. I've just updated the test system from autofs-5.0.5-109.debug.1.el6.1.x86_64 to this new version.

There are updated test packages which can be found at:
http://people.redhat.com/~ikent/autofs-5.0.5-114.bz1243525.3.el6/

Changes since revision 1:
- fixed incorrect use of the new lookup module reinit interface.
- fixed a couple of problems resulting from the above change.
- added patch to prevent calling ldap unbind on an ldap connection that had never been bound.

Please use these packages for testing.

My test system is now using autofs-5.0.5-114.bz1243525.3.el6.x86_64.

Ian, I've just examined the Table_List of the new autofs-5.0.5-114.bz1243525.3.el6.x86_64 right after initialization and noticed the inaccessible table entry again.
I.e. launch autofs, then attach gdb to the running process and do this:

(gdb) print *Table_List
$1 = {next = 0x7f3c967a7490, table = 0x7f3c912c8710, table_private = 0x0}
(gdb) print *Table_List.table
$2 = {msgs = 0x7f3c912c8740, name = 0x7f3c910bdb2a "sslerr", base = -12288, n_msgs = 134}
(gdb) print *Table_List.next
$3 = {next = 0x7f3c96735bc0, table = 0x7f3c92b33710, table_private = 0x0}
(gdb) print *Table_List.next.table
!!!!!!!!!!!!!!!!!!!!
Cannot access memory at address 0x7f3c92b33710
!!!!!!!!!!!!!!!!!!!!
(gdb) print *Table_List.next.next
$4 = {next = 0x7f3c96735100, table = 0x7f3c92386820, table_private = 0x0}
(gdb) print *Table_List.next.next.table
$5 = {msgs = 0x7f3c92386840, name = 0x7f3c9217b8f4 "secerrstrings", base = -8192, n_msgs = 180}
(gdb) print *Table_List.next.next.next
$6 = {next = 0x0, table = 0x7f3c91d53600, table_private = 0x0}
(gdb) print *Table_List.next.next.next.table
$7 = {msgs = 0x7f3c91d53620, name = 0x7f3c91b47bc0 "prerr", base = -6000, n_msgs = 77}
(gdb) print *Table_List.next.next.next.next
Cannot access memory at address 0x0

I'm pretty sure this version would crash in PR_ErrorToString() again (as initially reported) and we're back to square one. Don't you think so?

Could you please try to reproduce this simple test by yourself? (You'll need the matching nspr-debuginfo package to resolve the Table_List symbol.)

(In reply to Karsten Weiss from comment #0)
> If we take a closer look at the backtrace of the 4th registration (in
> the worker thread) the frames 0-5 look suspicious because even
> ssl_Init() shouldn't run a 2nd time! It uses the 'ssl_inited' static
> variable to prevent a 2nd initialization. The same in nspr which uses
> the "once" data structure to prevent a second initialization:
>
> (gdb) bt
> #0  PR_ErrorInstallTable (table=0x7ffff393d750) at ../../../nspr/pr/src/misc/prerrortable.c:166
> #1  0x00007ffff43ad16a in PR_CallOnce (once=0x7ffff3940ca8, func=<value optimized out>) at ../../../nspr/pr/src/misc/prinit.c:775
> #2  0x00007ffff37223c7 in ssl_InitializePRErrorTable () at sslerrstrs.c:32
> #3  0x00007ffff372240d in ssl_Init () at sslinit.c:21
> #4  0x00007ffff372c9f7 in SSL_CipherPolicySet (which=49195, policy=1) at sslsock.c:1154
> #5  0x00007ffff372cae1 in NSS_SetDomesticPolicy () at sslsock.c:1310
> #6  0x00007ffff5740078 in tlsm_deferred_init (arg=0x7fffac00b5d0) at ../../../libraries/libldap/tls_m.c:1914
>
> Unfortunately, from the worker thread's point of view these variables
> all say "uninitialized" and it executes the initialization code once
> again and the global Table_List ends up with the broken entry.

And that's what made me think it could be the autofs library close to open causing library data segments to be discarded on some libraries but not on others. Changing that hasn't helped.

> Also, something (I don't know what) releases the data structure / memory
> of the first "sslerr" PRErrorTable and makes its fields
> inaccessible without de-registering it from the global Table_List
> (PRErrorTableList).

Possibly something is keeping track of what it has allocated and is freeing it on connection shutdown; I can't find where that is happening either.

But the nspr code doesn't provide any way to remove an error table once registered. It appears to assume they will always be available once registered. So even the work done on ensuring the nss shutdown function is called at library unload (in rhel-7.2 I believe), if it hasn't already been called, probably won't help with this either.
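To make the missing piece concrete, here is what a removal counterpart to PR_ErrorInstallTable() might look like. This is purely hypothetical: NSPR has no such function, and the lock shown is also an assumption (the real Table_List is modified without serialization):

    /* Hypothetical sketch only - no PR_ErrorRemoveTable() exists in NSPR. */
    static PRLock *table_lock;    /* assumed; added because the list needs serialization */

    PRStatus PR_ErrorRemoveTable(const struct PRErrorTable *table)
    {
        struct PRErrorTableList **etp;

        PR_Lock(table_lock);
        for (etp = &Table_List; *etp != NULL; etp = &(*etp)->next) {
            if ((*etp)->table == table) {
                struct PRErrorTableList *doomed = *etp;
                *etp = doomed->next;    /* unlink before the owner's
                                           data segment can vanish */
                PR_Free(doomed);
                PR_Unlock(table_lock);
                return PR_SUCCESS;
            }
        }
        PR_Unlock(table_lock);
        return PR_FAILURE;
    }

With something like this, libssl3 could deregister "sslerr" from a library destructor and Table_List would never be left pointing into an unmapped segment.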
In nspr there's no consideration given to the possibility there could be multiple concurrent processes modifying the list of tables either, but then the openldap library is said to not be thread safe for initialization, so perhaps that's not a problem in itself since callers need to serialize their calls anyway.

I've been working on expanding the autofs ldap serialization even further than it is now; shouldn't be long before that's done.

I've also been considering dropping the re-init patch series and adding code to dlopen() the openldap, nss and nspr libraries at automount program start to see if that will help with retaining the expected library state.

Is there anything you care to add regarding this before I dive into it?

Ian

No.
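For reference, the dlopen()-at-start idea Ian mentions above could look roughly like the sketch below. This is not autofs code; the library names, and the reliance on RTLD_NODELETE to keep the mappings alive, are assumptions for illustration:

    #include <dlfcn.h>
    #include <stdio.h>

    /* Pin the fragile libraries for the life of the process so that a
     * later dlclose() of a lookup module can never drop the last
     * reference and discard their data segments. */
    static const char *pinned[] = {
        "libldap-2.4.so.2", "libnss3.so", "libnspr4.so", NULL
    };

    void pin_fragile_libs(void)
    {
        int i;

        for (i = 0; pinned[i] != NULL; i++) {
            /* The handle is deliberately leaked; RTLD_NODELETE keeps the
             * library resident even if its refcount later reaches zero. */
            if (dlopen(pinned[i], RTLD_NOW | RTLD_NODELETE) == NULL)
                fprintf(stderr, "cannot pin %s: %s\n", pinned[i], dlerror());
        }
    }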
> I.e. why does the first registration of the "sslerr" table get nullified and
> thrown away? Why does the code register the "sslerr" table a 2nd time in the
> worker thread - thereby re-doing all the initial (nspr) ssl_Init() work of
> the main thread?
This is really mysterious to me. Here is the code from sslerrstrs.c:
static const struct PRErrorTable ssl_et = {
    ssltext, "sslerr", SSL_ERROR_BASE,
    (sizeof ssltext)/(sizeof ssltext[0])
};

static PRStatus
ssl_InitializePRErrorTableOnce(void) {
    return PR_ErrorInstallTable(&ssl_et);
}

static PRCallOnceType once;

SECStatus
ssl_InitializePRErrorTable(void)
{
    return (PR_SUCCESS == PR_CallOnce(&once, ssl_InitializePRErrorTableOnce))
        ? SECSuccess : SECFailure;
}
If I understand this correctly, there should be only 1 copy of `once`, ssl_et, and the functions in the libssl3.so library, and there should be only 1 copy of that library in the process. Each one of these symbols should always have the same address. But this is not the case - the first call stack:
#0 PR_ErrorInstallTable (table=0x7ffff53aa750) at ../../../nspr/pr/src/misc/prerrortable.c:166
#1 0x00007ffff43ad16a in PR_CallOnce (once=0x7ffff53adca8, func=<value optimized out>) at ../../../nspr/pr/src/misc/prinit.c:775
#2 0x00007ffff518f3c7 in ssl_InitializePRErrorTable () at sslerrstrs.c:32
#3 0x00007ffff518f40d in ssl_Init () at sslinit.c:21
#4 0x00007ffff51999f7 in SSL_CipherPolicySet (which=49195, policy=1) at sslsock.c:1154
#5 0x00007ffff5199ae1 in NSS_SetDomesticPolicy () at sslsock.c:1310
#6 0x00007ffff6141078 in tlsm_deferred_init (arg=0x7ffff8203300) at ../../../libraries/libldap/tls_m.c:1914
You'll notice that the address of once=0x7ffff53adca8.
Here is the second call stack:
#0 PR_ErrorInstallTable (table=0x7ffff393d750) at ../../../nspr/pr/src/misc/prerrortable.c:166
#1 0x00007ffff43ad16a in PR_CallOnce (once=0x7ffff3940ca8, func=<value optimized out>) at ../../../nspr/pr/src/misc/prinit.c:775
#2 0x00007ffff37223c7 in ssl_InitializePRErrorTable () at sslerrstrs.c:32
#3 0x00007ffff372240d in ssl_Init () at sslinit.c:21
#4 0x00007ffff372c9f7 in SSL_CipherPolicySet (which=49195, policy=1) at sslsock.c:1154
#5 0x00007ffff372cae1 in NSS_SetDomesticPolicy () at sslsock.c:1310
#6 0x00007ffff5740078 in tlsm_deferred_init (arg=0x7fffac00b5d0) at ../../../libraries/libldap/tls_m.c:1914
The addresses of once, table, ssl_InitializePRErrorTable, ssl_Init, SSL_CipherPolicySet, tlsm_deferred_init, etc. are all different, which suggests that the NSS libraries and the openldap libraries were loaded somehow at different locations.
But, the address of PR_CallOnce is the same - 0x00007ffff43ad16a. So why would there be some sort of per-process shared library symbol address randomization for some of the libraries but not others?
It's almost as if there are two different openldap and NSS libraries loaded in the same process.
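A standalone way to picture the "two different libraries" hypothesis (an illustration, not autofs code; whether the loader actually reuses the old addresses on a reload is entirely up to it):

    #include <dlfcn.h>
    #include <stdio.h>

    int main(void)
    {
        /* First load: libssl3's code and data (including `once` and
         * `ssl_et`) land at some address. */
        void *h = dlopen("libssl3.so", RTLD_NOW);
        if (h == NULL) { fprintf(stderr, "%s\n", dlerror()); return 1; }
        void *a1 = dlsym(h, "SSL_CipherPolicySet");
        dlclose(h);  /* last reference gone: the mapping may be discarded */

        /* Second load: the loader is free to map the library elsewhere,
         * giving a fresh zeroed `once` and a new `ssl_et` address. */
        h = dlopen("libssl3.so", RTLD_NOW);
        if (h == NULL) { fprintf(stderr, "%s\n", dlerror()); return 1; }
        void *a2 = dlsym(h, "SSL_CipherPolicySet");

        printf("first: %p  second: %p\n", a1, a2);
        /* Any pointer still held from the first load (like NSPR's
         * Table_List entry for "sslerr") is now stale. */
        return 0;
    }

If the two printed addresses differ, every pointer recorded during the first load is dangling, which is exactly the "Cannot access memory" pattern in the gdb sessions above.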
> And here's a backtrace from this point in time (ignore the addresses in
> this and the next bt because I captured them from a different session -
> it's the call trace that's important and interesting!):

So, never mind my analysis, back to the drawing board.

Does the automount code call fork() anywhere? That would cause NSS to unload its modules and reload them again: from the openldap tls_m.c:

    /* NSS enforces the pkcs11 requirement that modules should be unloaded after
       a fork() - since there is no portable way to determine if NSS has been
       already initialized in a parent process, we just call SECMOD_RestartModules
       with force == FALSE - if the module has been unloaded due to a fork, it will
       be reloaded, otherwise, it is a no-op */
    if ( SECFailure == ( rc = SECMOD_RestartModules(PR_FALSE /* do not force */) ) ) {

That would explain why

> What is happening here? "sslerr" wants to register *again* although it is already registered (in the main thread).

Both the `once` and the `ssl_et` are addresses in the library. If the library is unloaded and reloaded, those addresses will be assigned different values. So now the Table_List is pointing to memory that is no longer used by the process, or possibly in a completely different memory region of the program.

It would seem that the correct thing for NSS to do would be to remove its error tables from NSPR when unloaded, but unfortunately there appears to be no way to do that - there is nothing like a PR_ErrorRemoveTable() call. There definitely needs to be one to allow libraries that are unloaded to remove their tables.

Another way would be for NSS to use malloc() memory for the `once` object and for the `ssl_et` table. That would avoid having a corrupted pointer, since the malloc() memory would live forever, but that would leak every time fork() is called - NSS would re-malloc() `once` and `ssl_et`.

An insecure (violates PKCS11 standards) workaround might be to set NSS_STRICT_NOFORK=DISABLED in the automount daemon environment. openldap is already doing this:

http://www.openldap.org/devel/gitweb.cgi?p=openldap.git;a=blob;f=libraries/libldap/tls_m.c;h=9b101f954d38c8f804aadbddb5de1282e4c36e60;hb=refs/heads/OPENLDAP_REL_ENG_2_4#l3082

but perhaps there is another call somewhere to NSS before this is called, that is controlling the fork behavior.

(In reply to Rich Megginson from comment #24)

Ha, I wrote a response to the previous comment before seeing this. It's still mostly relevant and I'll add a slightly modified version of it below. But there's also a couple of things to comment on here now too.

> > And here's a backtrace from this point in time (ignore the addresses in
> > this and the next bt because I captured them from a different session -
> > it's the call trace that's important and interesting!):
>
> So, never mind my analysis, back to the drawing board.
>
> Does the automount code call fork() anywhere? That would cause NSS to
> unload its modules and reload them again: from the openldap tls_m.c:

That depends on your definition of fork; certainly autofs does fork mount(8) to perform mounts but does no ldap interaction in that code.

But it also does most of its actions in separate threads (see below for a procedural description).
> > /* NSS enforces the pkcs11 requirement that modules should be unloaded after
> >    a fork() - since there is no portable way to determine if NSS has been
> >    already initialized in a parent process, we just call SECMOD_RestartModules
> >    with force == FALSE - if the module has been unloaded due to a fork, it will
> >    be reloaded, otherwise, it is a no-op */
> > if ( SECFailure == ( rc = SECMOD_RestartModules(PR_FALSE /* do not force */) ) ) {
>
> That would explain why
>
> > What is happening here? "sslerr" wants to register *again* although it
> > is already registered (in the main thread).
>
> Both the `once` and the `ssl_et` are addresses in the library. If the
> library is unloaded and reloaded, those addresses will be assigned different
> values. So now the Table_List is pointing to memory that is no longer used
> by the process, or possibly in a completely different memory region of the
> program.
>
> It would seem that the correct thing for NSS to do would be to remove its
> error tables from NSPR when unloaded, but unfortunately there appears to be
> no way to do that - there is nothing like a PR_ErrorRemoveTable() call.
> There definitely needs to be one to allow libraries that are unloaded to
> remove their tables.

That was, more or less, my conclusion after looking at the nspr code. And there probably needs to be some serialization on the list modifications too.

> Another way would be for NSS to use malloc() memory for the `once` object
> and for the `ssl_et` table. That would avoid having a corrupted pointer,
> since the malloc() memory would live forever, but that would leak every time
> fork() is called - NSS would re-malloc() `once` and `ssl_et`.

Sure, the structures need to have clear ownership, either all nss or all nspr, with associated cleanup for each. But in this case there appears to be a mix of data ownership between the two libraries.

> An insecure (violates PKCS11 standards) workaround might be to set
> NSS_STRICT_NOFORK=DISABLED in the automount daemon environment. openldap is
> already doing this:
>
> http://www.openldap.org/devel/gitweb.cgi?p=openldap.git;a=blob;f=libraries/libldap/tls_m.c;h=9b101f954d38c8f804aadbddb5de1282e4c36e60;hb=refs/heads/OPENLDAP_REL_ENG_2_4#l3082
>
> but perhaps there is another call somewhere to NSS before this is called,
> that is controlling the fork behavior.

From a procedural point of view here's how autofs does things.

First autofs does almost all of its actions in distinct threads, so they're always going to be running from different thread processes. Clearly that implies clone(2) will be called, but it isn't a full fork(2). I'm not sure if that meets the criteria above for an nss library re-load.

It gets a little more complicated from this point.

autofs will dlopen() its lookup module shared library for a given map source, ldap in this case, once for each map that is being used, each with its own private data segment that holds map specific data.

On every lookup a connection to the ldap server is opened, the map queried, and the connection closed.

Now, autofs tries to work out if a map has changed since last access, based on what it gets back from the map entry query. And if it thinks it has changed it will initiate a map re-read. This essentially results in the equivalent of sending a HUP signal to the application.

We know that there was a recent regression where autofs wasn't quite getting this right and excessive map re-reads were being triggered, so we know re-reads were happening.
On receiving a HUP signal autofs will take a write lock (against a read-write mutex) on the given map to ensure all lookups for this particular map are completed, then dlclose() the lookup module and then re-open it to effect the required per map initialization, since the map arguments may have changed.

So there is plenty of opportunity for dependent demand loaded shared libraries to think they are no longer in use and to be unloaded.

One thing is sure though: all co-operating libraries should not assume state of the data segment of another library when there is a degree of context shared between the two. But that does seem to be happening here.

The look of things makes me wonder if there's a library reference count bug somewhere, causing inconsistent library unloads between dependent libraries ...

Not sure this is helpful but thought a procedural description of what happens might help make sense of the above observations.

Ian

(In reply to Rich Megginson from comment #24)
> An insecure (violates PKCS11 standards) workaround might be to set
> NSS_STRICT_NOFORK=DISABLED in the automount daemon environment. openldap is
> already doing this:
>
> http://www.openldap.org/devel/gitweb.cgi?p=openldap.git;a=blob;f=libraries/libldap/tls_m.c;h=9b101f954d38c8f804aadbddb5de1282e4c36e60;hb=refs/heads/OPENLDAP_REL_ENG_2_4#l3082
>
> but perhaps there is another call somewhere to NSS before this is called,
> that is controlling the fork behavior.

I've been careful to not call anything related to libraries that ldap depends on. Certainly there are no calls to the nss or nspr libraries in the autofs ldap code.

But there are sasl ldap and Kerberos auth calls going on here too, with multi-step authentication, so what's called and when within dependent libraries isn't at all straightforward.
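Condensing Ian's procedural description above into code form, the map re-read path looks roughly like this (a sketch with hypothetical names and types, not the actual autofs source):

    #include <dlfcn.h>
    #include <pthread.h>

    struct map_source {                  /* hypothetical, for illustration */
        pthread_rwlock_t module_rwlock;  /* readers: in-flight lookups */
        const char *lookup_lib;          /* e.g. the ldap lookup module */
        void *lookup_module;
    };

    /* On HUP (or a detected map change): wait out lookups, then close and
     * reopen the lookup module so per-map arguments are re-evaluated.
     * The dlclose() here is what can drop the last reference to libldap
     * and everything libldap demand-loaded (nss, nspr, ...). */
    void reread_map(struct map_source *map)
    {
        pthread_rwlock_wrlock(&map->module_rwlock);

        dlclose(map->lookup_module);
        map->lookup_module = dlopen(map->lookup_lib, RTLD_NOW);
        /* ... re-run per-map initialization here ... */

        pthread_rwlock_unlock(&map->module_rwlock);
    }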
(In reply to Ian Kent from comment #25)
> (In reply to Rich Megginson from comment #24)
>
> Ha, I wrote a response to the previous comment before seeing this.
>
> It's still mostly relevant and I'll add a slightly modified version of it below.
>
> But there's also a couple of things to comment on here now too.
>
> > And here's a backtrace from this point in time (ignore the addresses in
> > this and the next bt because I captured them from a different session -
> > it's the call trace that's important and interesting!):
> >
> > So, never mind my analysis, back to the drawing board.
> >
> > Does the automount code call fork() anywhere? That would cause NSS to
> > unload its modules and reload them again: from the openldap tls_m.c:
>
> That depends on your definition of fork, certainly autofs does fork mount(8)
> to perform mounts but does no ldap interaction in that code.

That would trigger the NSS fork detection code, if NSS is used anywhere in the forked process.

> But it also does most of its actions in separate threads (see below for a
> procedural description).
>
> > /* NSS enforces the pkcs11 requirement that modules should be unloaded after
> >    a fork() - since there is no portable way to determine if NSS has been
> >    already initialized in a parent process, we just call SECMOD_RestartModules
> >    with force == FALSE - if the module has been unloaded due to a fork, it will
> >    be reloaded, otherwise, it is a no-op */
> > if ( SECFailure == ( rc = SECMOD_RestartModules(PR_FALSE /* do not force */) ) ) {
> >
> > That would explain why
> >
> > > What is happening here? "sslerr" wants to register *again* although it
> > > is already registered (in the main thread).
> >
> > Both the `once` and the `ssl_et` are addresses in the library. If the
> > library is unloaded and reloaded, those addresses will be assigned different
> > values. So now the Table_List is pointing to memory that is no longer used
> > by the process, or possibly in a completely different memory region of the
> > program.
> >
> > It would seem that the correct thing for NSS to do would be to remove its
> > error tables from NSPR when unloaded, but unfortunately there appears to be
> > no way to do that - there is nothing like a PR_ErrorRemoveTable() call.
> > There definitely needs to be one to allow libraries that are unloaded to
> > remove their tables.
>
> That was, more or less, my conclusion after looking at the nspr code. And
> there probably needs to be some serialization on the list modifications too.
>
> > Another way would be for NSS to use malloc() memory for the `once` object
> > and for the `ssl_et` table. That would avoid having a corrupted pointer,
> > since the malloc() memory would live forever, but that would leak every time
> > fork() is called - NSS would re-malloc() `once` and `ssl_et`.
>
> Sure, the structures need to have clear ownership, either all nss or all
> nspr, with associated cleanup for each. But in this case there appears to be
> a mix of data ownership between the two libraries.
>
> > An insecure (violates PKCS11 standards) workaround might be to set
> > NSS_STRICT_NOFORK=DISABLED in the automount daemon environment. openldap is
> > already doing this:
> >
> > http://www.openldap.org/devel/gitweb.cgi?p=openldap.git;a=blob;f=libraries/libldap/tls_m.c;h=9b101f954d38c8f804aadbddb5de1282e4c36e60;hb=refs/heads/OPENLDAP_REL_ENG_2_4#l3082
> >
> > but perhaps there is another call somewhere to NSS before this is called,
> > that is controlling the fork behavior.
>
> From a procedural point of view here's how autofs does things.
>
> First autofs does almost all of its actions in distinct threads.
> So they're always going to be running from different thread processes.
> Clearly that implies clone(2) will be called but it isn't a full fork(2).

It must be a fork(2) to trigger the NSS pkcs11 code.

> I'm not sure if that meets the criteria above for an nss library re-load.

No, it doesn't.

> It gets a little more complicated from this point.
>
> autofs will dlopen() its lookup module shared library for a given map
> source, ldap in this case, once for each map that is being used, each with
> its own private data segment that holds map specific data.
>
> On every lookup a connection to the ldap server is opened, the map queried,
> and the connection closed.
>
> Now, autofs tries to work out if a map has changed since last access, based
> on what it gets back from the map entry query. And if it thinks it has
> changed it will initiate a map re-read.
>
> This essentially results in the equivalent of sending a HUP signal to the
> application.
>
> We know that there was a recent regression where autofs wasn't quite getting
> this right and excessive map re-reads were being triggered so we know
> re-reads were happening.
>
> On receiving a HUP signal autofs will take a write lock (against a
> read-write mutex) on the given map to ensure all lookups for this particular
> map are completed, then dlclose() the lookup module and then re-open it to
> effect the required per map initialization, since the map arguments may have
> changed.
> So there is plenty of opportunity for dependent demand loaded shared
> libraries to think they are no longer in use and to be unloaded.

Do we know if, in this case, automount is being HUP'd? If it is, that would certainly explain why symbols are being assigned different addresses. It may also explain why NSPR is not, since NSPR is used by a wide variety of software, and may have already been loaded into the process address space before openldap and NSS.

If automount is never being HUP'd, and we are getting this problem, then I'm not sure what's going on.

How can we find out if the problem is caused by automount being HUP'd?

> One thing is sure though: all co-operating libraries should not assume
> state of the data segment of another library when there is a degree of
> context shared between the two. But that does seem to be happening here.
>
> The look of things makes me wonder if there's a library reference count bug
> somewhere, causing inconsistent library unloads between dependent
> libraries ...
>
> Not sure this is helpful but thought a procedural description of what
> happens might help make sense of the above observations.
>
> Ian

(In reply to Ian Kent from comment #26)
> (In reply to Rich Megginson from comment #24)
> > An insecure (violates PKCS11 standards) workaround might be to set
> > NSS_STRICT_NOFORK=DISABLED in the automount daemon environment. openldap is
> > already doing this:
> >
> > http://www.openldap.org/devel/gitweb.cgi?p=openldap.git;a=blob;f=libraries/libldap/tls_m.c;h=9b101f954d38c8f804aadbddb5de1282e4c36e60;hb=refs/heads/OPENLDAP_REL_ENG_2_4#l3082
> >
> > but perhaps there is another call somewhere to NSS before this is called,
> > that is controlling the fork behavior.
>
> I've been careful to not call anything related to libraries that ldap
> depends on. Certainly there are no calls to the nss or nspr libraries in
> the autofs ldap code.

Not explicitly, but if you are calling into openldap code to handle TLS/SSL connections, then you are implicitly calling into NSS and NSPR.

> But there are sasl ldap and Kerberos auth calls going on here too, with
> multi-step authentication, so what's called and when within dependent
> libraries isn't at all straightforward.

There may be sasl or Kerberos libraries that use NSS for crypto.

(In reply to Rich Megginson from comment #27)
> > That depends on your definition of fork, certainly autofs does fork
> > mount(8) to perform mounts but does no ldap interaction in that code.
>
> That would trigger the NSS fork detection code, if NSS is used anywhere in
> the forked process.

On second thoughts we have to assume that the fork()ed process is calling NSS related functions because we don't know if the autofs map entry being mounted uses, for example, Kerberos to authenticate to the NFS server.

snip ...

> > It gets a little more complicated from this point.
> >
> > autofs will dlopen() its lookup module shared library for a given map
> > source, ldap in this case, once for each map that is being used, each
> > with its own private data segment that holds map specific data.
> >
> > On every lookup a connection to the ldap server is opened, the map
> > queried, and the connection closed.
> >
> > Now, autofs tries to work out if a map has changed since last access,
> > based on what it gets back from the map entry query. And if it thinks it
> > has changed it will initiate a map re-read.
> >
> > This essentially results in the equivalent of sending a HUP signal to the
> > application.
> > We know that there was a recent regression where autofs wasn't quite
> > getting this right and excessive map re-reads were being triggered so we
> > know re-reads were happening.
> >
> > On receiving a HUP signal autofs will take a write lock (against a
> > read-write mutex) on the given map to ensure all lookups for this
> > particular map are completed, then dlclose() the lookup module and then
> > re-open it to effect the required per map initialization, since the map
> > arguments may have changed.
> >
> > So there is plenty of opportunity for dependent demand loaded shared
> > libraries to think they are no longer in use and to be unloaded.
>
> Do we know if, in this case, automount is being HUP'd? If it is, that would
> certainly explain why symbols are being assigned different addresses. It
> may also explain why NSPR is not, since NSPR is used by a wide variety of
> software, and may have already been loaded into the process address space
> before openldap and NSS.
>
> If automount is never being HUP'd, and we are getting this problem, then I'm
> not sure what's going on.
>
> How can we find out if the problem is caused by automount being HUP'd?

automount(8) catches HUP signals and deals with them. There's no process restart of any kind.

There is a lot of detached thread creation and most of them run to completion in a fairly brief amount of time.

So I'm not sure that receiving a HUP signal counts.

Ian

(In reply to Ian Kent from comment #29)
> (In reply to Rich Megginson from comment #27)
> > > That depends on your definition of fork, certainly autofs does fork
> > > mount(8) to perform mounts but does no ldap interaction in that code.
> >
> > That would trigger the NSS fork detection code, if NSS is used anywhere in
> > the forked process.
>
> On second thoughts we have to assume that the fork()ed process is calling
> NSS related functions because we don't know if the autofs map entry being
> mounted uses, for example, Kerberos to authenticate to the NFS server.

We could easily verify that this is the problem by setting NSS_STRICT_NOFORK=DISABLED in the automount daemon environment, and trying to reproduce the problem.

> snip ...
>
> > > It gets a little more complicated from this point.
> > >
> > > autofs will dlopen() its lookup module shared library for a given map
> > > source, ldap in this case, once for each map that is being used, each
> > > with its own private data segment that holds map specific data.
> > >
> > > On every lookup a connection to the ldap server is opened, the map
> > > queried, and the connection closed.
> > >
> > > Now, autofs tries to work out if a map has changed since last access,
> > > based on what it gets back from the map entry query. And if it thinks
> > > it has changed it will initiate a map re-read.
> > >
> > > This essentially results in the equivalent of sending a HUP signal to
> > > the application.
> > > So there is plenty of opportunity for dependent demand loaded shared
> > > libraries to think they are no longer in use and to be unloaded.
> >
> > Do we know if, in this case, automount is being HUP'd? If it is, that
> > would certainly explain why symbols are being assigned different
> > addresses. It may also explain why NSPR is not, since NSPR is used by a
> > wide variety of software, and may have already been loaded into the
> > process address space before openldap and NSS.
> >
> > If automount is never being HUP'd, and we are getting this problem, then
> > I'm not sure what's going on.
> >
> > How can we find out if the problem is caused by automount being HUP'd?
>
> automount(8) catches HUP signals and deals with them.

Right, as I understand it as you have explained it, it catches the HUP to reload configuration. As part of that process, libraries can be unloaded and reloaded. Is this not the case?

> There's no process restart of any kind.

Right. Did I mention process restart?

> There is a lot of detached thread creation and most of them run to
> completion in a fairly brief amount of time.

Ok.

> So I'm not sure that receiving a HUP signal counts.

What I want to know is this: Does sending the automount daemon a SIGHUP cause the crash? Not directly, but eventually via the NSS crash detailed above? Does automount crash if it is never sent a SIGHUP?

> Ian

(In reply to Rich Megginson from comment #30)
> (In reply to Ian Kent from comment #29)
> > (In reply to Rich Megginson from comment #27)
> > > > That depends on your definition of fork, certainly autofs does fork
> > > > mount(8) to perform mounts but does no ldap interaction in that code.
> > >
> > > That would trigger the NSS fork detection code, if NSS is used anywhere
> > > in the forked process.
> >
> > On second thoughts we have to assume that the fork()ed process is calling
> > NSS related functions because we don't know if the autofs map entry being
> > mounted uses, for example, Kerberos to authenticate to the NFS server.
>
> We could easily verify that this is the problem by setting
> NSS_STRICT_NOFORK=DISABLED in the automount daemon environment, and trying
> to reproduce the problem.

Karsten, this can be done by adding the "NSS_STRICT_NOFORK=DISABLED" to /etc/sysconfig/autofs and restarting autofs. Could you check this for us please.

> snip ...
>
> > > > It gets a little more complicated from this point.
> > > >
> > > > autofs will dlopen() its lookup module shared library for a given map
> > > > source, ldap in this case, once for each map that is being used, each
> > > > with its own private data segment that holds map specific data.
> > > >
> > > > On every lookup a connection to the ldap server is opened, the map
> > > > queried, and the connection closed.
> > > >
> > > > Now, autofs tries to work out if a map has changed since last access,
> > > > based on what it gets back from the map entry query. And if it thinks
> > > > it has changed it will initiate a map re-read.
> > > >
> > > > This essentially results in the equivalent of sending a HUP signal to
> > > > the application.
> > > >
> > > > We know that there was a recent regression where autofs wasn't quite
> > > > getting this right and excessive map re-reads were being triggered so
> > > > we know re-reads were happening.
> > > > On receiving a HUP signal autofs will take a write lock (against a
> > > > read-write mutex) on the given map to ensure all lookups for this
> > > > particular map are completed, then dlclose() the lookup module and
> > > > then re-open it to effect the required per map initialization, since
> > > > the map arguments may have changed.
> > > >
> > > > So there is plenty of opportunity for dependent demand loaded shared
> > > > libraries to think they are no longer in use and to be unloaded.
> > >
> > > Do we know if, in this case, automount is being HUP'd? If it is, that
> > > would certainly explain why symbols are being assigned different
> > > addresses. It may also explain why NSPR is not, since NSPR is used by a
> > > wide variety of software, and may have already been loaded into the
> > > process address space before openldap and NSS.
> > >
> > > If automount is never being HUP'd, and we are getting this problem,
> > > then I'm not sure what's going on.
> > >
> > > How can we find out if the problem is caused by automount being HUP'd?
> >
> > automount(8) catches HUP signals and deals with them.
>
> Right, as I understand it as you have explained it, it catches the HUP to
> reload configuration. As part of that process, libraries can be unloaded
> and reloaded. Is this not the case?

There's a little more to it due to the history in this bug.

Basically, yes, automount(8) will close and reopen the map lookup shared library on receiving a HUP signal (or a service reload) and also when it thinks the map has been modified, although a HUP signal isn't needed for the latter.

The history here is that, way back at the start, I provided an rpm with the close then open code commented out to test exactly this, and Karsten reported he didn't see the corruption with this package. That led to the rpm with changes that are supposed to reinit the lookup modules rather than do the close then open, but then Karsten reported he had again seen the corruption with this change in place. At this point I was stumped, although my change could be at fault, of course.

I don't think we verified the corruption wasn't occurring with the original rpm to the point we could be sure the lack of corruption was completely reproducible.

Karsten, do you still have that original rpm? If not I can recreate it and provide it again. Could you help us with this by performing that test again please?

I've repeated my test in Comment #19 with this autofs version

# rpm -qa|grep autofs
autofs-debuginfo-5.0.5-114.bz1243525.3.el6.x86_64
autofs-5.0.5-114.bz1243525.3.el6.x86_64

and set NSS_STRICT_NOFORK=DISABLED. Result:

(gdb) print *Table_List
$1 = {next = 0x7f8401f78e20, table = 0x7f83fe831710, table_private = 0x0}
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
(gdb) print *Table_List.table
Cannot access memory at address 0x7f83fe831710
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
(gdb) print *Table_List.next
$2 = {next = 0x7f8401f78350, table = 0x7f83fe084820, table_private = 0x0}
(gdb) print *Table_List.next.table
$3 = {msgs = 0x7f83fe084840, name = 0x7f83fde798f4 "secerrstrings", base = 0xffffe000, n_msgs = 0xb4}
(gdb) print *Table_List.next.next
$4 = {next = 0x0, table = 0x7f83fda51600, table_private = 0x0}
(gdb) print *Table_List.next.next.table
$5 = {msgs = 0x7f83fda51620, name = 0x7f83fd845bc0 "prerr", base = 0xffffe890, n_msgs = 0x4d}

I.e. there's inaccessible memory, too, although this time it's *Table_List.table instead of *Table_List.next.table.

Am I right that this should not happen?

Can you reproduce this test?
(See comment #19 for instructions)

(In reply to Karsten Weiss from comment #32)
> I've repeated my test in Comment #19 with this autofs version
>
> # rpm -qa|grep autofs
> autofs-debuginfo-5.0.5-114.bz1243525.3.el6.x86_64
> autofs-5.0.5-114.bz1243525.3.el6.x86_64
>
> and set NSS_STRICT_NOFORK=DISABLED. Result:
>
> (gdb) print *Table_List
> $1 = {next = 0x7f8401f78e20, table = 0x7f83fe831710, table_private = 0x0}
> !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> (gdb) print *Table_List.table
> Cannot access memory at address 0x7f83fe831710
> !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> (gdb) print *Table_List.next
> $2 = {next = 0x7f8401f78350, table = 0x7f83fe084820, table_private = 0x0}
> (gdb) print *Table_List.next.table
> $3 = {msgs = 0x7f83fe084840, name = 0x7f83fde798f4 "secerrstrings", base = 0xffffe000, n_msgs = 0xb4}
> (gdb) print *Table_List.next.next
> $4 = {next = 0x0, table = 0x7f83fda51600, table_private = 0x0}
> (gdb) print *Table_List.next.next.table
> $5 = {msgs = 0x7f83fda51620, name = 0x7f83fd845bc0 "prerr", base = 0xffffe890, n_msgs = 0x4d}
>
> I.e. there's inaccessible memory, too, although this time it's
> *Table_List.table instead of *Table_List.next.table.
>
> Am I right that this should not happen?

You are right - this should not happen.

> Can you reproduce this test? (See comment #19 for instructions)

(In reply to Karsten Weiss from comment #19)
> Ian, I've just examined the Table_List of the new
> autofs-5.0.5-114.bz1243525.3.el6.x86_64 right after initialization
> and noticed the inaccessible table entry again.
>
> I.e. launch autofs, then attach gdb to the running process and do this:
>
> (gdb) print *Table_List
> $1 = {next = 0x7f3c967a7490, table = 0x7f3c912c8710, table_private = 0x0}
> (gdb) print *Table_List.table
> $2 = {msgs = 0x7f3c912c8740, name = 0x7f3c910bdb2a "sslerr", base = -12288, n_msgs = 134}
> (gdb) print *Table_List.next
> $3 = {next = 0x7f3c96735bc0, table = 0x7f3c92b33710, table_private = 0x0}
> (gdb) print *Table_List.next.table
> !!!!!!!!!!!!!!!!!!!!
> Cannot access memory at address 0x7f3c92b33710
> !!!!!!!!!!!!!!!!!!!!
> (gdb) print *Table_List.next.next
> $4 = {next = 0x7f3c96735100, table = 0x7f3c92386820, table_private = 0x0}
> (gdb) print *Table_List.next.next.table
> $5 = {msgs = 0x7f3c92386840, name = 0x7f3c9217b8f4 "secerrstrings", base = -8192, n_msgs = 180}
> (gdb) print *Table_List.next.next.next
> $6 = {next = 0x0, table = 0x7f3c91d53600, table_private = 0x0}
> (gdb) print *Table_List.next.next.next.table
> $7 = {msgs = 0x7f3c91d53620, name = 0x7f3c91b47bc0 "prerr", base = -6000, n_msgs = 77}
> (gdb) print *Table_List.next.next.next.next
> Cannot access memory at address 0x0
>
> I'm pretty sure this version would crash in PR_ErrorToString() again (as
> initially reported) and we're back to square one. Don't you think so?
>
> Could you please try to reproduce this simple test by yourself?

In this case, is automount unloading and reloading the ldap module, at automount startup? That is, you are not doing anything except starting automount?

(In reply to Karsten Weiss from comment #32)
> I've repeated my test in Comment #19 with this autofs version

Thanks for that.

> # rpm -qa|grep autofs
> autofs-debuginfo-5.0.5-114.bz1243525.3.el6.x86_64
> autofs-5.0.5-114.bz1243525.3.el6.x86_64
>
> and set NSS_STRICT_NOFORK=DISABLED. Result:

Just to be sure, how did you set this? It should be sufficient to set it in /etc/sysconfig/autofs, but the export keyword cannot be added in that file.
So, to be sure, it would be best to add the define with the export prefix
just before the daemon is run in the init script. Or did you export the
variable and run automount from the command line? That should be fine too.

> (gdb) print *Table_List
> $1 = {next = 0x7f8401f78e20, table = 0x7f83fe831710, table_private = 0x0}
> !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> (gdb) print *Table_List.table
> Cannot access memory at address 0x7f83fe831710
> !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> (gdb) print *Table_List.next
> $2 = {next = 0x7f8401f78350, table = 0x7f83fe084820, table_private = 0x0}
> (gdb) print *Table_List.next.table
> $3 = {msgs = 0x7f83fe084840, name = 0x7f83fde798f4 "secerrstrings", base =
> 0xffffe000, n_msgs = 0xb4}
> (gdb) print *Table_List.next.next
> $4 = {next = 0x0, table = 0x7f83fda51600, table_private = 0x0}
> (gdb) print *Table_List.next.next.table
> $5 = {msgs = 0x7f83fda51620, name = 0x7f83fd845bc0 "prerr", base =
> 0xffffe890, n_msgs = 0x4d}
>
> I.e. there's inaccessible memory, too, although this time it's
> *Table_List.table instead of *Table_List.next.table.
>
> Am I right that this should not happen?
>
> Can you reproduce this test? (See comment #19 for instructions.)

Your analysis so clearly shows a problem that occurred at some unknown
previous time that I hadn't attempted to reproduce it.

Reproducing this does involve setting up an ldap server, generating a CA
and certificate, adding test maps, etc. I'm writing a test to do this now
for a related bug (as support were unable to reproduce the problem of the
related bug) and I will use that to check your findings as well. TBH I'm
still not sure what additional information I will get from doing so,
since it's still not clear what's causing it.

This test is good feedback, to be sure, but I was actually asking about
the original rpm I provided where the library re-load was commented out,
i.e. our initial check to try and work out if the problem was related to
the close then open of the lookup modules. Do you still have that rpm,
and could you test with it again please?

Ian

Ian, I did what you wrote in comment 31, i.e. "adding the
"NSS_STRICT_NOFORK=DISABLED" to /etc/sysconfig/autofs and restarting
autofs". I.e. I did *not* export this variable myself.

> That is, you are not doing anything except starting automount?

Exactly, as described in comment 19: launch autofs, then (immediately,
without doing anything special) attach gdb to the running process, print
the Table_List (and follow its pointers), and look for inaccessible
memory regions. (Make sure to install the nspr-debuginfo package to be
able to resolve the symbols in gdb.)

> Reproducing this does involve setting up an ldap server,

I know. However, at this point I think it's crucial that a) you have a
local test environment, too, and b) you are able to reproduce the issue
independently.

> This test is good feedback, to be sure, but I was actually asking
> about the original rpm I provided where the library re-load was
> commented out, i.e. our initial check to try and work out if it was
> related to the close then open of lookup modules. Do you still have
> that rpm, and could you test with it again please?

Isn't the answer already in comment 3 (apart from
NSS_STRICT_NOFORK=DISABLED)?

I believe this problem has been resolved in autofs revision 122. Can you
check if this is the case please?
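Since the thread keeps coming back to crashes in PR_ErrorToString(), a
hedged sketch of the table-selection loop helps explain why one bad list
entry is fatal. This is paraphrased from the behaviour observed in this
bug, not NSPR's actual source:

/* Minimal sketch of PR_ErrorToString()-style table selection. */
struct PRErrorTable { const void *msgs; const char *name; int base; int n_msgs; };
struct PRErrorTableList {
    struct PRErrorTableList *next;
    const struct PRErrorTable *table;
    void *table_private;
};

static const struct PRErrorTable *
find_table(const struct PRErrorTableList *lt, int code)
{
    for (; lt != NULL; lt = lt->next) {
        /* The range check dereferences lt->table for every node, so a
         * single dangling (unmapped) entry faults the lookup even when
         * 'code' belongs to a healthy table further down the list,
         * such as "prerr". */
        if (lt->table->base <= code &&
            code < lt->table->base + lt->table->n_msgs)
            return lt->table;
    }
    return NULL; /* unknown code: no table claimed it */
}

Note that a zeroed-out table (base == 0, n_msgs == 0) simply never
matches in this loop and does not fault, which may be why the all-zero
state dumped in the next comment produces no visible crash, while an
unmapped table segfaults as originally reported.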
I've tested autofs-5.0.5-123:

(gdb) print *Table_List
$1 = {next = 0x7f2d1d40f270, table = 0x7f2d184233d0, table_private = 0x0}
(gdb) print *Table_List.table
$2 = {msgs = 0x0, name = 0x0, base = 0, n_msgs = 0}
??????????
"name" was a pointer to "sslerr" before, but it is 0x0 now.
(gdb) print *Table_List.next
$3 = {next = 0x7f2d1d40e7a0, table = 0x7f2d17c71820, table_private = 0x0}
(gdb) print *Table_List.next.table
$4 = {msgs = 0x7f2d17c71840, name = 0x7f2d17a66e14 "secerrstrings", base = -8192, n_msgs = 180}
(gdb) print *Table_List.next.next
$5 = {next = 0x0, table = 0x7f2d1763e600, table_private = 0x0}
(gdb) print *Table_List.next.next.table
$6 = {msgs = 0x7f2d1763e620, name = 0x7f2d17432d60 "prerr", base = -6000, n_msgs = 77}

I.e. there's no longer an inaccessible memory region, but the first table
consists entirely of zero bytes.

(In reply to Karsten Weiss from comment #44)
> I've tested autofs-5.0.5-123:
>
> (gdb) print *Table_List
> $1 = {next = 0x7f2d1d40f270, table = 0x7f2d184233d0, table_private = 0x0}
> (gdb) print *Table_List.table
> $2 = {msgs = 0x0, name = 0x0, base = 0, n_msgs = 0}
> ??????????
> "name" was a pointer to "sslerr" before, but it is 0x0 now.
> (gdb) print *Table_List.next
> $3 = {next = 0x7f2d1d40e7a0, table = 0x7f2d17c71820, table_private = 0x0}
> (gdb) print *Table_List.next.table
> $4 = {msgs = 0x7f2d17c71840, name = 0x7f2d17a66e14 "secerrstrings", base =
> -8192, n_msgs = 180}
> (gdb) print *Table_List.next.next
> $5 = {next = 0x0, table = 0x7f2d1763e600, table_private = 0x0}
> (gdb) print *Table_List.next.next.table
> $6 = {msgs = 0x7f2d1763e620, name = 0x7f2d17432d60 "prerr", base = -6000,
> n_msgs = 77}
>
> I.e. there's no longer an inaccessible memory region, but the first table
> consists entirely of zero bytes.

I never got to the bottom of the recent problems related to this.

I believe they were due to OpenLDAP changing the libraries it uses. There
were a number of problems, including one very similar to this (also table
corruption in this library). After considering all the evidence I made
assumptions about what the problem was and, after quite extensive changes
to autofs, reports indicated the problems had been resolved.

I don't know if the empty table above is actually a problem, but if
there's no observable failure then there's nothing to go on. The other
thing is that this library isn't used directly by autofs at all; it's
used by OpenLDAP, so it's not strictly an autofs problem.

So are you seeing an actual problem now?

I'm not able to reproduce the problem on 6.8 with autofs-5.0.5-123.x86_64
right now.

(On 7.2 using autofs-5.0.7-54.el7 it's still happening, though.)

(In reply to Karsten Weiss from comment #46)
> I'm not able to reproduce the problem on 6.8 with autofs-5.0.5-123.x86_64
> right now.
>
> (On 7.2 using autofs-5.0.7-54.el7 it's still happening, though.)

Yes, it is. The changes that went into RHEL-6.8 that I hope fixed this
have also gone into RHEL-7.3, and RHEL-7.3 is well into the QA process
now.

I can put the RHEL-7.3 package on people.redhat.com if you would like to
check whether it helps with this problem before 7.3 is actually released.

Red Hat Enterprise Linux 6 is in the Production 3 Phase. During the
Production 3 Phase, Critical impact Security Advisories (RHSAs) and
selected Urgent Priority Bug Fix Advisories (RHBAs) may be released as
they become available.

The official life cycle policy can be reviewed here:

http://redhat.com/rhel/lifecycle

This issue does not meet the inclusion criteria for the Production 3
Phase and will be marked as CLOSED/WONTFIX.
If this remains a critical requirement, please contact Red Hat Customer
Support to request a re-evaluation of the issue, citing a clear business
justification. Note that a strong business justification will be required
for re-evaluation.

Red Hat Customer Support can be contacted via the Red Hat Customer Portal
at the following URL:

https://access.redhat.com/

We still have the issue with recent autofs packages on RHEL 6 on our
servers:
- kernel 2.6.32-754.3.5.el6.x86_64
- autofs 5.0.5-123.el6_8.x86_64 or 5.0.5-133.el6_9.x86_64

A crash dump is attached.

Created attachment 1479406 [details]
crashdump for autofs-5.0.5-123.el6_8.x86_64 crash
Hello,

On my side, too, we continue to see this bug, with the error:

Jul 20 16:03:09 uljvpdrr01 kernel: automount[14859]: segfault at 7f5ad1915fb0 ip 00007f5ad090463c sp 00007f5acf07d1b0 error 4 in libnspr4.so[7f5ad08ea000+39000]

Can you investigate this?

Thanks,
Jean-Sébastien BEVILACQUA

(In reply to Guillaume Hilt from comment #51)
> Created attachment 1479406 [details]
> crashdump for autofs-5.0.5-123.el6_8.x86_64 crash

I finally got some time to have a look at this dump.

Unfortunately the autofs package in use was built against a different
glibc, so I can't get the symbols I need to look closely at it. But I
don't think that matters because, from what I can see, the fault is the
same as described in comment #0.

I made a lot of changes to autofs to try and work around this bug in
nspr, but clearly the assumptions made about when and how it happens must
be wrong.

On the face of it, the nspr library is lacking locking around error table
list modifications. There must be a reason this is not done, but I don't
know what it may be.

I'm not willing to make further changes to autofs; I've made far too many
that I shouldn't have needed to make already. I'm thinking a change as
minimal as possible to add locking around the error table list handling
would be worth trying, and we'll see what our options are if that does
help, given the point RHEL-6 is at in its life cycle.

Ian

Could someone who is seeing this problem try this test build of nspr to
see if it helps with the problem please. It's located at:

http://people.redhat.com/~ikent/nspr-4.19.0-1.bz1197622.el6/

I'm trying it tomorrow, thanks Ian.

Package installed.
As it crashes only once a week, please be patient.

autofs hasn't crashed on the server where I installed your package; looks
like you fixed it :) Nice job!

Regarding the support ticket 02188519 I opened with Red Hat, they said
they won't fix it for RHEL 6, but if your fix works, they may do
something.

Thanks Ian Kent for your work.

Sincerely,
Jean-Sebastien BEVILACQUA

(In reply to Guillaume Hilt from comment #57)
> autofs hasn't crashed on the server where I installed your package; looks
> like you fixed it :) Nice job!

Originally I thought this problem was related to other similar problems I
tried to work around within autofs, but clearly it isn't, and equally
clearly (although we should wait a little longer to be sure) not using
mutual exclusion for the error table list handling is not OK.

Perhaps the library authors think this is the responsibility of the
library users, OpenLDAP in this case.

> Regarding the support ticket 02188519 I opened with Red Hat, they said
> they won't fix it for RHEL 6, but if your fix works, they may do
> something.

Yes, it is hard to get changes into RHEL-6 now, but let's see how we go.
The reality is that what's needed is a properly approved bug and a
package (nspr) maintainer who is happy with the change and willing to
commit it. The bug approvals will be the difficult part of this.

Anyway, first we need to consult the nspr package maintainer and, I
think, see if we can get someone in support interested. I'll see what I
can do while we give the package a bit longer to be sure this fixes the
problem.

Created attachment 1492826 [details]
Patch - nspr - add locking for error table list
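For readers without access to the attachment: the general shape of such a
change would be to guard the list head with an NSPR lock. The following
is a sketch of that idea only, under the assumption that the attached
patch takes a similar approach; it is not the patch itself. (On RHEL 6
the NSPR headers typically live in /usr/include/nspr4, so this would
build with something like "cc -I/usr/include/nspr4 -c sketch.c".)

#include <prlock.h>  /* NSPR: PRLock, PR_NewLock, PR_Lock, PR_Unlock */
#include <prinit.h>  /* NSPR: PR_CallOnce, PRCallOnceType */

struct PRErrorTable;                /* opaque here */

/* Assumed mirror of NSPR's private list node (see the gdb dumps). */
struct PRErrorTableList {
    struct PRErrorTableList *next;
    const struct PRErrorTable *table;
    void *table_private;
};

static struct PRErrorTableList *Table_List;
static PRLock *table_lock;
static PRCallOnceType table_lock_once;

static PRStatus table_lock_init(void)
{
    table_lock = PR_NewLock();
    return table_lock != NULL ? PR_SUCCESS : PR_FAILURE;
}

/* A locked PR_ErrorInstallTable()-style push: the list update becomes
 * atomic with respect to concurrent installs and walks that take the
 * same lock.  Note that no amount of locking helps if a table's
 * backing memory is later unmapped, a possibility raised later in
 * this bug. */
static void install_node(struct PRErrorTableList *node)
{
    PR_CallOnce(&table_lock_once, table_lock_init);
    PR_Lock(table_lock);
    node->next = Table_List;
    Table_List = node;
    PR_Unlock(table_lock);
}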
Sadly, the second server on which I installed your fix (3 days ago) saw
autofs crash this morning. Here's the core dump.

Created attachment 1493703 [details]
crash dump with the proposed autofs fix
(In reply to Guillaume Hilt from comment #62)
> Created attachment 1493703 [details]
> crash dump with the proposed autofs fix

I have had a look at the core.

It appears the packages the core requires don't quite match the packages
listed in the installed-rpms of the included sosreport; I don't know why.
I did however get most of the dependent debuginfo packages installed but,
because of the mismatch, I'm not sure how much I can trust what I saw
looking at the core.

On the face of it, it looks like the error table list is intact but the
tables that the list entries point to are not accessible (it looks like
they have gone away), which naturally causes a SEGV. So that much makes
sense.

At this point the problem here is looking a lot like the similar related
problem, which I mentioned before, that was meant to be resolved. In that
bug SEGVs were being caused by memory going away for no apparent reason,
as it appears to be doing here. It was thought this might be due to
interactions between the nss library and autofs closing and then
re-opening some of its libraries. That led to me changing autofs to not
close and re-open any of its libraries.

The other possibility was that the nss library was re-initializing due to
fork(2)s being done (which apparently it does). And autofs will fork
mount(8) to perform mounts, so that happens quite a bit. In order to
check if this was causing a problem we asked people to set
NSS_STRICT_NOFORK=DISABLED in the autofs environment before starting the
daemon. Looking at the bug I can't see if we got any clear feedback on
the result of that.

At this point though I think this is all we have to check. So can we try
a change like this:

function start() {
        export NSS_STRICT_NOFORK=DISABLED       # <--- add this line
        # Make sure autofs4 module is loaded
        if ! grep -q autofs /proc/filesystems

to check if it is the nss library re-initialisation.

Ian

Thanks Ian, we can try that, no problem.
Have you updated the rpm posted earlier?

(In reply to Guillaume Hilt from comment #64)
> Thanks Ian, we can try that, no problem.
> Have you updated the rpm posted earlier?

The nspr rpm hasn't changed; it's still available on people.redhat.com.
But now it seems that it probably isn't doing anything useful.

I looked at the backtrace posted in the case and I see it shows that the
crash occurs during connection initialisation. I hadn't looked at the
full backtrace recently but, seeing that, I remember now why I assumed
the problem occurred at connection initialisation. And because of that,
one of the other changes I had done in autofs to try and resolve this was
to expand the locking done by the autofs ldap module around connection
initialisation. So the locking change in the nspr package might not be
doing any additional mutual exclusion.

I spent so long trying to work around these problems from autofs that
seeing this come up again is disturbing to me. I really would like to
resolve it, but it's proving very difficult. The bug in which the
discussion of this occurred is very long; I'll need to work through it
again to refresh my memory.
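To make the recurring "table memory has gone away" observation concrete,
here is a standalone C demonstration of the general hazard. This is not
autofs or NSPR code, and all file and symbol names in it are
hypothetical: a process-global keeps a pointer into a shared object's
data segment, and the object is then unloaded.

/*
 * plugin.c - build with: cc -shared -fPIC -o plugin.so plugin.c
 * The string lives in plugin.so's data segment, much as an error
 * table lives in the library that calls PR_ErrorInstallTable().
 */
const char plugin_table_name[] = "sslerr-like table";

/*
 * main.c - build with: cc -o demo main.c -ldl
 */
#include <stdio.h>
#include <dlfcn.h>

int main(void)
{
    void *handle = dlopen("./plugin.so", RTLD_NOW);
    if (handle == NULL) {
        fprintf(stderr, "dlopen: %s\n", dlerror());
        return 1;
    }

    /* "Install" the table: keep a long-lived pointer into the
     * plugin's memory, just as PR_ErrorInstallTable() keeps the
     * caller's table pointer. */
    const char *table_name = dlsym(handle, "plugin_table_name");
    printf("before dlclose: %s\n", table_name);   /* works */

    dlclose(handle);   /* refcount hits zero: plugin.so is unmapped */

    /* The pointer now dangles; dereferencing it typically faults
     * with SIGSEGV - gdb shows such pointers as "Cannot access
     * memory at address ...", matching the dumps in this bug. */
    printf("after dlclose: %s\n", table_name);
    return 0;
}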
Comment on attachment 1492826 [details]
Patch - nspr - add locking for error table list

Thank you for the patch, but I think it is the wrong approach, as the
documentation explicitly says those functions are not thread-safe:
https://searchfox.org/mozilla-central/source/nsprpub/pr/include/prerror.h#268

(In reply to Daiki Ueno from comment #66)
> Comment on attachment 1492826 [details]
> Patch - nspr - add locking for error table list
>
> Thank you for the patch, but I think it is the wrong approach, as the
> documentation explicitly says those functions are not thread-safe:
> https://searchfox.org/mozilla-central/source/nsprpub/pr/include/prerror.h#268

Well, yes, there are a couple of places that are still open to races
after adding the locking, and the list doesn't "own" the tables that are
installed.

But the point of the crash isn't at those locations, and the patch
doesn't help with the problem, so something else is causing this.

The problem could still be list corruption, but it looks more like the
installed table memory has gone away for some unknown reason, perhaps
because of nss library re-initialization on fork(2).

I know that my initial description is rather verbose (sorry for that!);
however, please notice that (in the 2nd half) it shows how to trigger the
problem in a *new* automount session. Start reading at "Let's start a new
automount session", follow the steps, and you'll be able to trigger the
inaccessible memory situation in gdb immediately. IMHO there's really no
need to test this in production and wait (and waste) a couple of days
each time.

(In reply to Karsten Weiss from comment #68)

Hi Karsten,

> I know that my initial description is rather verbose (sorry for that!);
> however, please notice that (in the 2nd half) it shows how to trigger the
> problem in a *new* automount session. Start reading at "Let's start a new
> automount session", follow the steps, and you'll be able to trigger the
> inaccessible memory situation in gdb immediately.

Don't be sorry, your description was very much appreciated.

The thing is, your description was comprehensive enough to make me think
that using your procedure to reproduce it wasn't going to give me the
information I would need to locate the problem. Going through a procedure
that shows everything is OK at one point and then not OK at a later point
doesn't narrow the problem location enough to spot the problem by code
inspection. Obviously it's good to have in order to verify any possible
fix, but finding what needs to be fixed is not so straightforward.

OTOH we know (found out through investigation of what I thought were
similar problems) that nss triggers a library re-initialization at
fork(2), which could possibly lead to data previously pointed to going
away or moving. That's the point of checking if
NSS_STRICT_NOFORK=DISABLED makes a difference here.

I'll try and put together an environment to check this using your
procedure.

Hi Ian,

Any news on this issue? (maybe in January)

Regards,
Guillaume

(In reply to Guillaume Hilt from comment #70)
> Hi Ian,
>
> Any news on this issue? (maybe in January)

No news, but it's still January. ;)

I still don't know for sure why this happens; I only have assumptions
based on the history detailed above.

So based on that let's give this rpm build a try. It can be found at:

http://people.redhat.com/~ikent/autofs-5.0.5-140.nss.el6/

If there are any dependency problems installing it, let me know and I'll
produce a build based on an earlier RHEL-6 release.

Thanks Ian, I'll install it this morning.
Crash occurs at least once a week, so I'll get back to you in 7 days max.

(In reply to Guillaume Hilt from comment #73)
> Thanks Ian, I'll install it this morning.
> Crash occurs at least once a week, so I'll get back to you in 7 days max.

Sorry about this, but I've noticed that the build I posted had the patch
to pin the libraries commented out, so it isn't included in the build.

I've rebuilt this with the patch included and uploaded it to the above
location (same revision). Could you update the installed rpm with this
one please?

Packages updated on 2 servers; they will reboot this week-end, and we'll
see how it goes.

No crash for now on the 2 servers I used.
Good news for now; I'll wait another week before pouring out champagne.

(In reply to Guillaume Hilt from comment #76)
> No crash for now on the 2 servers I used.
> Good news for now; I'll wait another week before pouring out champagne.

We'll need to do one more round of testing if this continues to be OK.

I used a rather broad-brush approach to the libraries I pinned, so I need
to remove the libraries I think don't actually need to be pinned. So a
re-test will be needed.

Ian, I think it is now OK with these packages.
You can try to remove some libraries you added.

I have reduced the number of pinned libraries to just the two that should
be needed to resolve this. There might be one other (that does the same
sort of thing) but the bug history doesn't show this library is part of
the problem.

So based on that please give this rpm build a try. It can be found at:

http://people.redhat.com/~ikent/autofs-5.0.5-140.nss.2.el6/

Ian

Created attachment 1535788 [details]
Patch - nss workaround
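The workaround above pins libraries so they remain mapped for the life of
the process. One common way to do that is to take an extra dlopen()
reference with RTLD_NODELETE; whether the attached patch uses exactly
this mechanism is an assumption, and the library names below are purely
illustrative:

#define _GNU_SOURCE   /* for RTLD_NODELETE / RTLD_NOLOAD (glibc) */
#include <stdio.h>
#include <dlfcn.h>

/* Promote an already-loaded library to "never unload".  RTLD_NOLOAD
 * means: do not map the library if it isn't already resident;
 * RTLD_NODELETE marks the object so dlclose() never unmaps it.  The
 * handle is deliberately leaked - that is the point of pinning. */
static void pin_library(const char *soname)
{
    if (dlopen(soname, RTLD_LAZY | RTLD_NOLOAD | RTLD_NODELETE) == NULL)
        fprintf(stderr, "%s not resident, nothing to pin\n", soname);
}

int main(void)    /* build with: cc -o pin pin.c -ldl */
{
    pin_library("libnspr4.so");   /* illustrative names only; the actual */
    pin_library("libnss3.so");    /* pinned set is in the attached patch */
    return 0;
}

With libraries pinned this way, a later dlclose() of a lookup module can
no longer unmap the error tables those libraries installed, which is
consistent with the crashes stopping in the tests reported here.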
Thanks Ian! As usual, let's wait at least a week to see if it works.

Hello Ian,

No crash on either server where I installed your 2 packages.
I guess we can say it's fixed.

Do you need to compile another package, or can I deploy the last one on
all our RHEL 6 machines?

Thanks again!

I guess the autofs-debuginfo package isn't needed anymore?

(In reply to Guillaume Hilt from comment #82)
> Hello Ian,
>
> No crash on either server where I installed your 2 packages.
> I guess we can say it's fixed.
>
> Do you need to compile another package, or can I deploy the last one on
> all our RHEL 6 machines?

Sadly it's not as simple as that.

The package I provided was not officially signed because it didn't go
through the QA process, and the revision isn't what's required (as it was
for testing). Now we need to try and get this approved so it can go
through the QA and release process.

Since the problem described in this bug report should be resolved in a
recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files,
follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0718