Bug 1644933
Field | Value
---|---
Summary | Segmentation fault in err_string_data_LHASH_COMP
Product | Red Hat Enterprise Linux 7
Component | autofs
Version | 7.6
Status | CLOSED ERRATA
Severity | unspecified
Priority | unspecified
Type | Bug
Keywords | Regression
Hardware | x86_64
OS | Linux
Target Milestone | rc
Fixed In Version | autofs-5.0.7-103
Last Closed | 2019-08-06 13:10:29 UTC
Reporter | Ian Allison <iana>
Assignee | Ian Kent <ikent>
QA Contact | Kun Wang <kunwan>
CC | ffotorel, fsorenso, iana, ikent, jbyrd, knweiss, mjtrangoni, mmielke, renaud.maubon, rharwood, tmraz, tthakur, xifeng, xzhou

Description
Ian Allison
2018-11-01 00:04:55 UTC
(In reply to Ian Allison from comment #0)
> After updating to 7.6 all of our home directory automounts have stopped
> working. The mount attempts are causing a segmentation fault in automount.
>
> Our temporary workaround is to patch openssl to be configured with
> -DOPENSSL_NO_ERR (as suggested for a similar issue here
> https://bugs.gentoo.org/581172). The discussion there indicates that this
> comes from the address of some error strings changing.

In the Gentoo bug report there's a comment about building 5.1.1 resolving the problem. That makes me wonder if the problem is related to the build environment. Has a rebuild of autofs in the target environment been tried?

Our autofs has a lot of the changes included in 5.1.1, particularly those related to LDAP and autofs's library initialisation.

Could someone check that please?

And what release of RHEL were you upgrading from?

I've just tried rebuilding the autofs src.rpm without any modifications (is that what you meant?) on one of the 7.6 machines, but I get the same segfault behaviour.

In most cases these were scheduled updates from 7.5 (working) to 7.6, but I found one machine which had been switched off and was running 7.3 (working). It also gives the same segfault after the update.

(In reply to Ian Allison from comment #5)
> I've just tried rebuilding the autofs src.rpm without any modifications (is
> that what you meant?) on one of the 7.6 machines, but I get the same
> segfault behaviour.
>
> In most cases these were scheduled updates from 7.5 (working) to 7.6, but I
> found one machine which had been switched off and was running 7.3 (working).
> It also gives the same segfault after the update.

OK, thanks for that.

The other thing that is suspicious is downgrading autofs and OpenSSL not resolving the problem. But there were no changes to the autofs LDAP code between 7.5 and 7.6, so this has to be some other library change.

Can you try downgrading the nss and also (perhaps) the nspr libraries please?

(In reply to Ian Kent from comment #6)
> Can you try downgrading the nss and also (perhaps) the nspr libraries
> please?

I tried downgrading nss but no luck, but I followed your general hint and checked `ldd /usr/sbin/mount`. It looks like the problem might be with kerberos (the mount is being made with sec=krb5). If I downgrade the kerberos packages things start working again

    yum downgrade krb5-libs-1.15.1-19.el7.x86_64 \
        krb5-devel-1.15.1-19.el7.x86_64 \
        krb5-workstation-1.15.1-19.el7.x86_64 \
        libkadm5-1.15.1-19.el7.x86_64

Working back from that, I tried removing the flag `--with-crypto-impl=openssl` and rebuilding krb5-1.15.1-34.el7.src.rpm and the mounts start working again.

Sorry, that should have been `ldd /usr/sbin/automount` above, and here is a backtrace in case it is helpful

    (gdb) bt
    #0  0x00007ffff42e56d0 in err_string_data_LHASH_COMP () from /lib64/libcrypto.so.10
    #1  0x00007ffff42e2f09 in getrn () from /lib64/libcrypto.so.10
    #2  0x00007ffff42e352a in lh_retrieve () from /lib64/libcrypto.so.10
    #3  0x00007ffff42e5ec7 in int_err_get_item () from /lib64/libcrypto.so.10
    #4  0x00007ffff42e6393 in ERR_func_error_string () from /lib64/libcrypto.so.10
    #5  0x00007ffff19bf2a0 in ERR_load_SSL_strings () from /lib64/libssl.so.10
    #6  0x00007ffff2acfd49 in tlso_init () from /lib64/libldap-2.4.so.2
    #7  0x00007ffff2acdbf9 in ldap_int_tls_start () from /lib64/libldap-2.4.so.2
    #8  0x00007ffff2ace051 in ldap_start_tls_s () from /lib64/libldap-2.4.so.2
    #9  0x00007ffff2cf8900 in init_ldap_connection () from /usr/lib64/autofs/lookup_ldap.so
    #10 0x00007ffff2cf8b0d in do_connect () from /usr/lib64/autofs/lookup_ldap.so
    #11 0x00007ffff2cf920f in connect_to_server () from /usr/lib64/autofs/lookup_ldap.so
    #12 0x00007ffff2cf96db in do_reconnect () from /usr/lib64/autofs/lookup_ldap.so
    #13 0x00007ffff2cfcf27 in lookup_mount () from /usr/lib64/autofs/lookup_ldap.so
    #14 0x000055555556c01d in do_lookup_mount ()
    #15 0x000055555556cd31 in lookup_nss_mount ()
    #16 0x00005555555636d0 in do_mount_indirect ()
    #17 0x00007ffff7bc6dd5 in start_thread () from /lib64/libpthread.so.0
    #18 0x00007ffff67f7ead in clone () from /lib64/libc.so.6

(In reply to Ian Allison from comment #8)
> Sorry, that should have been `ldd /usr/sbin/automount` above, and here is a
> backtrace in case it is helpful

It might be useful to have a core and sosreport so we can set up a lab system to look at it.

Created attachment 1502242 [details]
sosreport for segfaulting system
Created attachment 1502243 [details]
coredump from segfaulting system
If I understand correctly what is written in previous comments, when krb5 libraries are downgraded to some older version the crash disappears.

I see that the older krb5 libraries were not linked to openssl but the current ones are. A possible cause could be loading and unloading the openssl library from krb5 libraries when the ldap library uses the openssl library later again. This scenario does not really work well and can cause such issues.

A possible fix could be to force krb5 libraries to not unload from the automount process but maybe the fixing would have to be done in krb5 libraries themselves.

(In reply to Tomas Mraz from comment #13)
> If I understand correctly what is written in previous comments when krb5
> libraries are downgraded to some older version the crash disappears.
>
> I see that the older krb5 libraries were not linked to openssl but the
> current ones are. Possible cause could be loading and unloading the openssl
> library from krb5 libraries when the ldap library uses the openssl library
> later again. This scenario does not really work well and can cause such
> issues.

I'm having a bit of trouble understanding the usage sequence you're describing. Could you explain the possible sequence of operations a little more please?

> A possible fix could be to force krb5 libraries to not unload from the
> automount process but maybe the fixing would have to be done in krb5
> libraries themselves.

I can dlopen() (and dlclose() at exit, in the automount daemon) any libraries that are needed to prevent this from happening, but I'd like to know a little more about which ones I should be doing this for so I can be sure that a test build has a chance of resolving the problem, if this is in fact the problem.

I see the ldap library depends on the nss library. Doesn't the nss library have a nasty feature of re-initialising libraries (not sure which ones actually) on fork(2)? That is possibly not too good for threaded applications with several indirect dependencies ...
I can't do anything about that if that's what's happening.

The krb5 library loads and uses OpenSSL and then it unloads it, i.e. calls ERR_free_strings(), EVP_cleanup() or other cleanup functions. Then later the ldap library loads OpenSSL and tries to use it. This won't work if OpenSSL was cleaned up before.

Also the LDAP library, as is apparent from the backtrace in comment 8 above, uses OpenSSL and not NSS.

The character of this bug (autofs crashes while looking up/printing error strings because some of their data structures are no longer mapped) reminds me of my old autofs bugs [bz1197622](https://bugzilla.redhat.com/show_bug.cgi?id=1197622) / [bz1381924](https://bugzilla.redhat.com/show_bug.cgi?id=1381924)...

(In reply to Karsten Weiss from comment #16)
> The character of this bug (autofs crashes while looking up/printing error
> strings because some of their data structures are no longer mapped) reminds
> me of my old autofs bugs
> [bz1197622](https://bugzilla.redhat.com/show_bug.cgi?id=1197622) /
> [bz1381924](https://bugzilla.redhat.com/show_bug.cgi?id=1381924)...

Yes it is similar in that it appears that some library data has gone missing.

The core doesn't give much information, and the way the OpenSSL code is written makes it much harder to work out what's going on. AFAICT there's no list structure involved at all.

Just because the stack trace doesn't show nss is being used I don't think we can assume it isn't involved here.

(In reply to Ian Kent from comment #17)
> (In reply to Karsten Weiss from comment #16)
> > The character of this bug (autofs crashes while looking up/printing error
> > strings because some of their data structures are no longer mapped) reminds
> > me of my old autofs bugs
> > [bz1197622](https://bugzilla.redhat.com/show_bug.cgi?id=1197622) /
> > [bz1381924](https://bugzilla.redhat.com/show_bug.cgi?id=1381924)...
>
> Yes it is similar in that it appears that some library data
> has gone missing.
>
> The core doesn't give much information, and the way the OpenSSL
> code is written makes it much harder to work out what's going on.
> AFAICT there's no list structure involved at all.
>
> Just because the stack trace doesn't show nss is being used I
> don't think we can assume it isn't involved here.

Also there's no instance of either ERR_free_strings() or EVP_cleanup() anywhere in the source of OpenLDAP or krb5.

Per comment 7, the problem disappears if krb5 is rebuilt without the '--with-crypto-impl=openssl' flag.

This change was made during bz1570600 - krb5-libs uses slow crypto implementation

so this segfault would be a regression caused by that bz.

backtrace from customer case 2261106:

    #0  err_string_data_cmp (a=0x7fc4a54a0320, b=0x7fc4a0f45320) at err.c:354
    #1  err_string_data_LHASH_COMP (arg1=0x7fc4a54a0320, arg2=0x7fc4a0f45320) at err.c:354
    #2  0x00007fc4a6919f09 in getrn (lh=lh@entry=0x55d3c783dc10, data=data@entry=0x7fc4a0f45320, rhash=rhash@entry=0x7fc4a0f452d0) at lhash.c:415
    #3  0x00007fc4a691a52a in lh_retrieve (lh=lh@entry=0x55d3c783dc10, data=data@entry=0x7fc4a0f45320) at lhash.c:248
    #4  0x00007fc4a691cec7 in int_err_get_item (d=0x7fc4a0f45320) at err.c:394
    #5  0x00007fc4a691d393 in ERR_func_error_string (e=<optimized out>) at err.c:972
    #6  0x00007fc4a3ff62a0 in ERR_load_SSL_strings () at ssl_err.c:835
    #7  0x00007fc4a3fe8832 in SSL_load_error_strings () at ssl_err2.c:67
    #8  0x00007fc4a5106d49 in tlso_init () at tls_o.c:148
    #9  0x00007fc4a5104bf9 in ldap_int_tls_start (ld=ld@entry=0x7fc494001c80, conn=conn@entry=0x7fc49400afd0, srv=srv@entry=0x7fc494000ca0) at tls2.c:902
    #10 0x00007fc4a50ddd01 in ldap_int_open_connection (ld=ld@entry=0x7fc494001c80, conn=conn@entry=0x7fc49400afd0, srv=0x7fc494000ca0, async=async@entry=0) at open.c:448
    #11 0x00007fc4a50f107d in ldap_new_connection (ld=ld@entry=0x7fc494001c80, srvlist=srvlist@entry=0x7fc494000988, use_ldsb=use_ldsb@entry=1, connect=connect@entry=1, bind=bind@entry=0x0, m_req=m_req@entry=0, m_res=m_res@entry=0) at request.c:487
    #12 0x00007fc4a50dd19f in ldap_open_defconn (ld=ld@entry=0x7fc494001c80) at open.c:41
    #13 0x00007fc4a50f2388 in ldap_send_initial_request (ld=ld@entry=0x7fc494001c80, msgtype=msgtype@entry=96, dn=dn@entry=0x0, ber=0x7fc494009f80, msgid=1) at request.c:130
    #14 0x00007fc4a50e73c9 in ldap_sasl_bind (ld=ld@entry=0x7fc494001c80, dn=dn@entry=0x0, mechanism=mechanism@entry=0x0, cred=cred@entry=0x7fc4a0f45690, sctrls=sctrls@entry=0x0, cctrls=0x0, msgidp=msgidp@entry=0x7fc4a0f45624) at sasl.c:164
    #15 0x00007fc4a50e77f9 in ldap_sasl_bind_s (ld=ld@entry=0x7fc494001c80, dn=dn@entry=0x0, mechanism=mechanism@entry=0x0, cred=cred@entry=0x7fc4a0f45690, sctrls=sctrls@entry=0x0, cctrls=cctrls@entry=0x0, servercredp=servercredp@entry=0x0) at sasl.c:198
    #16 0x00007fc4a50e8095 in ldap_simple_bind_s (ld=0x7fc494001c80, dn=dn@entry=0x0, passwd=passwd@entry=0x0) at sbind.c:113
    #17 0x00007fc4a532f687 in bind_ldap_simple (logopt=logopt@entry=0, ldap=<optimized out>, uri=uri@entry=0x0, ctxt=ctxt@entry=0x7fc49c0058d0) at lookup_ldap.c:199
    #18 0x00007fc4a532fb7f in do_bind (ctxt=0x7fc49c0058d0, uri=0x0, conn=0x7fc4a0f458f0, logopt=0) at lookup_ldap.c:587
    #19 do_connect (logopt=0, conn=0x7fc4a0f458f0, uri=0x0, ctxt=0x7fc49c0058d0) at lookup_ldap.c:656
    #20 0x00007fc4a5330407 in do_reconnect (logopt=0, conn=0x7fc4a0f458f0, ctxt=0x7fc49c0058d0) at lookup_ldap.c:969
    #21 0x00007fc4a5333f27 in lookup_one (ap=<optimized out>, ap=<optimized out>, ctxt=0x7fc49c0058d0, qKey_len=2, qKey=0x7fc494000b90 "ad", source=0x7fc49c000910) at lookup_ldap.c:2986
    #22 match_key (ctxt=0x7fc49c0058d0, key_len=2, key=0x7fc494000b90 "ad", source=0x7fc49c000910, ap=0x55d3c795c8c0) at lookup_ldap.c:3485
    #23 check_map_indirect (ctxt=0x7fc49c0058d0, key_len=2, key=0x7fc494000b90 "ad", source=0x7fc49c000910, ap=0x55d3c795c8c0) at lookup_ldap.c:3572
    #24 lookup_mount (ap=0x55d3c795c8c0, name=<optimized out>, name_len=<optimized out>, context=0x7fc49c0058d0) at lookup_ldap.c:3725
    #25 0x000055d3c637e01d in do_lookup_mount (ap=ap@entry=0x55d3c795c8c0, map=0x7fc49c000910, name=name@entry=0x7fc4a0f49d80 "ad", name_len=name_len@entry=2) at lookup.c:850
    #26 0x000055d3c637ed31 in lookup_name_source_instance (name_len=2, name=0x7fc4a0f49d80 "ad", type=0x7fc494000b70 "ldap", map=0x55d3c795c9e0, ap=0x55d3c795c8c0) at lookup.c:986
    #27 lookup_map_name (this=0x7fc494000b30, name_len=2, name=0x7fc4a0f49d80 "ad", map=0x55d3c795c9e0, ap=0x55d3c795c8c0) at lookup.c:1041
    #28 lookup_nss_mount (ap=ap@entry=0x55d3c795c8c0, source=source@entry=0x0, name=name@entry=0x7fc4a0f49d80 "ad", name_len=2) at lookup.c:1276
    #29 0x000055d3c63756d0 in do_mount_indirect (arg=<optimized out>) at indirect.c:776
    #30 0x00007fc4aa1fddd5 in start_thread (arg=0x7fc4a0f4c700) at pthread_create.c:307
    #31 0x00007fc4a8e2eead in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

I found that when upgrading from 7.5 to 7.6, sss disappears from the /etc/nsswitch.conf automount services. See,

    -automount: files sss ldap
    +automount: files ldap

After I reinserted it, autofs started working again.

Tomáš, can you provide more information on #c15 for me? We don't call those functions anywhere in krb5.

In openssl-1.0.2 it is possible that the unload happens on libcrypto library unload regardless of whether krb5 is calling the unload functions.

Perhaps using RTLD_NODELETE or keeping libkrb5 loaded in autofs could help as a workaround?

(In reply to Tomas Mraz from comment #26)
> In openssl-1.0.2 it is possible that the unload happens on libcrypto library
> unload regardless of whether krb5 is calling the unload functions.
>
> Perhaps using RTLD_NODELETE or keeping libkrb5 loaded in autofs could help
> as a workaround?

Yes, the question is which library (or libraries) need to be pinned.
The problem looks like it is caused by a shared library defining static data and then passing the address (or addresses) of this to some other shared library without regard to the possibility that shared libraries might be unloaded and reloaded later.

In this case (and I'm still not sure about this) it looks like libssl defines static error table strings and libcrypto is trying to use them and segfaulting because the previous static address is no longer valid.

So perhaps pinning libssl and libcrypto would be sufficient to work around this.

Tomas, I'd appreciate it if you could check whether what I think I see is correct wrt. libssl and libcrypto and whether you agree this would be sufficient to resolve this.

Ian

Ian

Yes, pinning libssl and libcrypto or libkrb5 in memory by RTLD_NODELETE should help.

Let's try pinning libssl and libcrypto at the start up of automount to see if that helps with this problem.

I have made an autofs build that does this, it's located at:
http://people.redhat.com/~ikent/autofs-5.0.7-102.ossl.1.el7/

Please give this a try and report back.

It looks like krb5 has been updated (krb5-devel,krb5-libs,krb5-workstation,libkadm5 all go from 1.15.1-34 to 1.15.1-37.el7) and there's a message on the ChangeLog that...

    2018-12-18 - Robbie Harwood <rharwood> - 1.15.1-37
    - Bring back builtin crypto (openssl broke too many FIPS setups)
    - Resolves: #1657890

Installing that update, then restarting autofs seems to fix my problem, the mount completes without the segfault.

Installing your autofs package also works. If I downgrade to 1.15.1-34 (which brings back the segfault) then install your autofs package the mount completes successfully.

I don't know what the reasoning for the crypto change in krb5 was, but your fix would allow them to use openssl without breaking autofs and something similar might work for other applications.

Thank you!

Well, as I said in the changelog, using openssl there broke too many existing FIPS setups :)

I'm glad that it's accidentally resolved for RHEL-7, but please note that this issue probably still occurs in RHEL-8 because RHEL-8 krb5 uses openssl for everything (except curve25519).

(In reply to Ian Allison from comment #30)
> It looks like krb5 has been updated
> (krb5-devel,krb5-libs,krb5-workstation,libkadm5 all go from 1.15.1-34 to
> 1.15.1-37.el7) and there's a message on the ChangeLog that...
>
> 2018-12-18 - Robbie Harwood <rharwood> - 1.15.1-37
> - Bring back builtin crypto (openssl broke too many FIPS setups)
> - Resolves: #1657890
>
> Installing that update, then restarting autofs seems to fix my problem, the
> mount completes without the segfault.
>
> Installing your autofs package also works. If I downgrade to 1.15.1-34
> (which brings back the segfault) then install your autofs package the mount
> completes successfully.
>
> I don't know what the reasoning for the crypto change in krb5 was, but your
> fix would allow them to use openssl without breaking autofs and something
> similar might work for other applications.

The idea is simple enough to implement, it's just dlopen()ing the two shared libraries that share static data of one with the other at application start up so they aren't unloaded while the static data is in use (and dlclose()ed at application exit).

It's also understandable why this is done, although it shouldn't ever be done between shared libraries. Shared library static data should only ever be used within the same library, pointers to it should never be passed to another shared library.

Changing the way this is done is a non-trivial task because using the data in this way means it would need to be allocated from the application's heap, resulting in all the difficulties of cleanup and consistency that come with it, particularly with library unload/load behaviours.

I have had a couple of other workarounds in autofs for quite a while for conceptually similar shared library implementation shortcomings, not sure I want to also add workarounds for nss/nspr and libssl/libcrypto upstream ... but I probably have no choice since they don't appear to be easily fixable.

And the dlopen()/dlclose() done here doesn't help applications other than autofs either, each application would need to be updated, and probably have to carry RHEL only patches for quite some time, which should be avoided if at all possible.

Point being I'm not sure if upstream Kerberos would be willing to do this since it's actually an implementation problem with another package. I expect they will say "fix it in the other package" and be done with it. And in principle they are justified in saying that.

Or maybe there is a simpler way to fix this (within the library), perhaps there is a way to bump the reference count on the instances of shared libraries that do this so they aren't unloaded while the static data is in use. I don't know what's possible on this myself, consulting a specialist in this might help.

Ian

Created attachment 1535792 [details]
Patch - openssl workaround
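
For readers following the thread, the pinning idea discussed above can be sketched roughly as follows. This is not the code from the attached patch, which is not reproduced here; it is a minimal illustration that assumes a glibc system. The helper names `pin_libraries` and `unpin_libraries` are hypothetical. It combines the two suggestions from the comments: holding a dlopen() reference from application start up until exit, and passing RTLD_NODELETE so the loader keeps the library mapped even if every reference is dropped.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper, not taken from the autofs patch.
 * Opens each named library with RTLD_NODELETE so the dynamic loader keeps
 * it mapped for the life of the process, even after later dlclose() calls.
 * handles[i] is set to NULL for any library that could not be opened.
 * Returns the number of libraries that were pinned.
 */
size_t pin_libraries(const char *const *sonames, void **handles, size_t count)
{
    size_t pinned = 0;

    assert(sonames != NULL && handles != NULL);

    for (size_t i = 0; i < count; i++) {
        handles[i] = dlopen(sonames[i], RTLD_NOW | RTLD_NODELETE);
        if (handles[i] != NULL)
            pinned++;
        else
            fprintf(stderr, "cannot pin library: %s\n", dlerror());
    }
    return pinned;
}

/*
 * Drops the references taken by pin_libraries(), for use at exit.
 * Entries that were never opened (NULL) are skipped.
 */
void unpin_libraries(void **handles, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (handles[i] != NULL) {
            dlclose(handles[i]);
            handles[i] = NULL;
        }
    }
}
```

In a daemon such as automount, the call would presumably go early in main(), before any lookup module is loaded, with the sonames seen in the backtraces (`libssl.so.10` and `libcrypto.so.10`), and `unpin_libraries` would run just before exit. On the glibc shipped with RHEL 7 the program needs to be linked with `-ldl`.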
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2250