Bug 1644933

Summary: Segmentation fault in err_string_data_LHASH_COMP
Product: Red Hat Enterprise Linux 7
Reporter: Ian Allison <iana>
Component: autofs
Assignee: Ian Kent <ikent>
Status: CLOSED ERRATA
QA Contact: Kun Wang <kunwan>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 7.6
CC: ffotorel, fsorenso, iana, ikent, jbyrd, knweiss, mjtrangoni, mmielke, renaud.maubon, rharwood, tmraz, tthakur, xifeng, xzhou
Target Milestone: rc
Keywords: Regression
Target Release: ---
Hardware: x86_64
OS: Linux
Fixed In Version: autofs-5.0.7-103
Last Closed: 2019-08-06 13:10:29 UTC
Type: Bug
Attachments:
 * sosreport for segfaulting system (flags: none)
 * coredump from segfaulting system (flags: none)
 * Patch - openssl workaround (flags: none)

Description Ian Allison 2018-11-01 00:04:55 UTC
After updating to 7.6 all of our home directory automounts have stopped working. The mount attempts are causing a segmentation fault in automount.

Our temporary workaround is to patch openssl to be configured with -DOPENSSL_NO_ERR (as suggested for a similar issue here https://bugs.gentoo.org/581172). The discussion there indicates that this comes from the address of some error strings changing.


Version-Release number of selected component (if applicable): 
 * autofs-5.0.7-99.el7.x86_64 
 * openssl-1.0.2k-16.el7.x86_64
 * openssl-libs-1.0.2k-16.el7.x86_64


How reproducible: Always on the systems I've tried (6 so far). I'll try to spin up a minimal test case.


Steps to Reproduce: Try to automount an NFS mount point with the mounts stored in LDAP and accessed over TLS.

Actual results: Login hangs; running automount in the foreground shows a segfault (see below).

Expected results: Mount succeeds

Additional info:
I tried downgrading autofs, openssl and openldap-clients but this didn't help. The only thing we've found that works is to patch openssl to skip examining the error strings.


Here is a backtrace of the automount process

GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-114.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/automount...Reading symbols from /usr/sbin/automount...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Missing separate debuginfos, use: debuginfo-install autofs-5.0.7-99.el7.x86_64
(gdb) run
Starting program: /usr/sbin/automount -v -f -d
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
warning: the debug information found in "/usr/lib/debug//lib64/libcrypto.so.1.0.2k.debug" does not match "/lib64/libcrypto.so.10" (CRC mismatch).

warning: the debug information found in "/usr/lib/debug/usr/lib64/libcrypto.so.1.0.2k.debug" does not match "/lib64/libcrypto.so.10" (CRC mismatch).

warning: the debug information found in "/usr/lib/debug//usr/lib64/libcrypto.so.1.0.2k.debug" does not match "/lib64/libcrypto.so.10" (CRC mismatch).

warning: the debug information found in "/usr/lib/debug/usr/lib64//libcrypto.so.1.0.2k.debug" does not match "/lib64/libcrypto.so.10" (CRC mismatch).

Detaching after fork from child process 6533.
Starting automounter version 5.0.7-99.el7, master map auto.master
using kernel protocol version 5.02
[New Thread 0x7ffff7ff8700 (LWP 6534)]
[New Thread 0x7ffff7fce700 (LWP 6535)]
lookup_nss_read_master: reading master ldap auto.master
warning: the debug information found in "/usr/lib/debug//lib64/libssl.so.1.0.2k.debug" does not match "/lib64/libssl.so.10" (CRC mismatch).

warning: the debug information found in "/usr/lib/debug/usr/lib64/libssl.so.1.0.2k.debug" does not match "/lib64/libssl.so.10" (CRC mismatch).

warning: the debug information found in "/usr/lib/debug//usr/lib64/libssl.so.1.0.2k.debug" does not match "/lib64/libssl.so.10" (CRC mismatch).

warning: the debug information found in "/usr/lib/debug/usr/lib64//libssl.so.1.0.2k.debug" does not match "/lib64/libssl.so.10" (CRC mismatch).

parse_server_string: lookup(ldap): Attempting to parse LDAP information from string "auto.master".
parse_server_string: lookup(ldap): mapname auto.master
parse_ldap_config: lookup(ldap): ldap authentication configured with the following options:
parse_ldap_config: lookup(ldap): use_tls: 1, tls_required: 1, auth_required: 1, sasl_mech: (null)
parse_ldap_config: lookup(ldap): user: (null), secret: unspecified, client principal: (null) credential cache: (null)
do_init: parse(sun): init gathered global options: (null)
spawn_mount: mtab link detected, passing -n to mount
Detaching after fork from child process 6536.
spawn_umount: mtab link detected, passing -n to mount
Detaching after fork from child process 6537.
find_server: trying server uri ldap://kvm11.pims.math.ca
do_bind: lookup(ldap): auth_required: 1, sasl_mech (null)
do_bind: lookup(ldap): ldap simple bind returned 0
get_query_dn: lookup(ldap): check search base list
get_query_dn: lookup(ldap): found search base under dc=pims,dc=math,dc=ca
get_query_dn: lookup(ldap): found query dn ou=auto.master,dc=pims,dc=math,dc=ca
connected to uri ldap://kvm11.pims.math.ca
lookup_read_master: lookup(ldap): searching for "(objectclass=automount)" under "ou=auto.master,dc=pims,dc=math,dc=ca"
lookup_read_master: lookup(ldap): examining entries
master_do_mount: mounting /home
[New Thread 0x7ffff372a700 (LWP 6538)]
automount_path_to_fifo: fifo name /run/autofs.fifo-home
lookup_nss_read_map: reading map ldap auto.home
warning: the debug information found in "/usr/lib/debug//lib64/libssl.so.1.0.2k.debug" does not match "/lib64/libssl.so.10" (CRC mismatch).

warning: the debug information found in "/usr/lib/debug/usr/lib64/libssl.so.1.0.2k.debug" does not match "/lib64/libssl.so.10" (CRC mismatch).

warning: the debug information found in "/usr/lib/debug//usr/lib64/libssl.so.1.0.2k.debug" does not match "/lib64/libssl.so.10" (CRC mismatch).

warning: the debug information found in "/usr/lib/debug/usr/lib64//libssl.so.1.0.2k.debug" does not match "/lib64/libssl.so.10" (CRC mismatch).

parse_server_string: lookup(ldap): Attempting to parse LDAP information from string "auto.home".
parse_server_string: lookup(ldap): mapname auto.home
parse_ldap_config: lookup(ldap): ldap authentication configured with the following options:
parse_ldap_config: lookup(ldap): use_tls: 1, tls_required: 1, auth_required: 1, sasl_mech: (null)
parse_ldap_config: lookup(ldap): user: (null), secret: unspecified, client principal: (null) credential cache: (null)
do_init: parse(sun): init gathered global options: (null)
spawn_mount: mtab link detected, passing -n to mount
Detaching after fork from child process 6539.
spawn_umount: mtab link detected, passing -n to mount
Detaching after fork from child process 6540.
read_one_map: map read not needed, so not done
remount_active_mount: trying to re-connect to mount /home
mounted indirect on /home with timeout 300, freq 75 seconds
remount_active_mount: re-connected to mount /home
st_ready: st_ready(): state = 0 path /home
handle_packet: type = 3
handle_packet_missing_indirect: token 2, name iana, request pid 6549
[New Thread 0x7fffee70b700 (LWP 6551)]
attempting to mount entry /home/iana
lookup_mount: lookup(ldap): looking up iana
find_server: trying server uri ldap://kvm11.pims.math.ca

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffee70b700 (LWP 6551)]
0x00007ffff42e56d0 in err_string_data_LHASH_COMP () from /lib64/libcrypto.so.10

Comment 3 Ian Kent 2018-11-01 02:35:43 UTC
(In reply to Ian Allison from comment #0)
> After updating to 7.6 all of our home directory automounts have stopped
> working. The mount attempts are causing a segmentation fault in automount.
> 
> Our temporary workaround is to patch openssl to be configured with
> -DOPENSSL_NO_ERR (as suggested for a similar issue here
> https://bugs.gentoo.org/581172). The discussion there indicates that this
> comes from the address of some error strings changing.

In the Gentoo bug report there's a comment about building 5.1.1
resolving the problem.

That makes me wonder if the problem is related to the build
environment, has a rebuild of autofs in the target environment
been tried?

Our autofs has a lot of the changes included in 5.1.1, particularly
those related to LDAP and autofs's library initialisation.

Could someone check that please?

Comment 4 Ian Kent 2018-11-01 09:39:06 UTC
And what release of RHEL were you upgrading from?

Comment 5 Ian Allison 2018-11-01 16:59:06 UTC
I've just tried rebuilding the autofs src.rpm without any modifications (is that what you meant?) on one of the 7.6 machines, but I get the same segfault behaviour. 

In most cases these were scheduled updates from 7.5 (working) to 7.6, but I found one machine which had been switched off and was running 7.3 (working). It also gives the same segfault after the update.

Comment 6 Ian Kent 2018-11-01 23:52:11 UTC
(In reply to Ian Allison from comment #5)
> I've just tried rebuilding the autofs src.rpm without any modifications (is
> that what you meant?) on one of the 7.6 machines, but I get the same
> segfault behaviour. 
> 
> In most cases these were scheduled updates from 7.5 (working) to 7.6, but I
> found one machine which had been switched off and was running 7.3 (working).
> It also gives the same segfault after the update.

OK, thanks for that.

The other thing that is suspicious is downgrading autofs and OpenSSL
not resolving the problem.

But there were no changes to the autofs LDAP code between 7.5 and 7.6,
so this has to be some other library change.

Can you try downgrading the nss and also (perhaps) the nspr libraries
please?

Comment 7 Ian Allison 2018-11-02 22:13:19 UTC
(In reply to Ian Kent from comment #6)
> Can you try downgrading the nss and also (perhaps) the nspr libraries
> please?

I tried downgrading nss with no luck, but I followed your general hint and checked `ldd /usr/sbin/mount`. It looks like the problem might be with Kerberos (the mount is being made with sec=krb5). If I downgrade the Kerberos packages, things start working again:

yum downgrade krb5-libs-1.15.1-19.el7.x86_64 \
 krb5-devel-1.15.1-19.el7.x86_64 \
 krb5-workstation-1.15.1-19.el7.x86_64 \
 libkadm5-1.15.1-19.el7.x86_64

Working back from that, I tried removing the flag `--with-crypto-impl=openssl` and rebuilding krb5-1.15.1-34.el7.src.rpm and the mounts start working again.

Comment 8 Ian Allison 2018-11-05 20:30:59 UTC
Sorry, that should have been `ldd /usr/sbin/automount` above, and here is a backtrace in case it is helpful


(gdb) bt
#0  0x00007ffff42e56d0 in err_string_data_LHASH_COMP () from /lib64/libcrypto.so.10
#1  0x00007ffff42e2f09 in getrn () from /lib64/libcrypto.so.10
#2  0x00007ffff42e352a in lh_retrieve () from /lib64/libcrypto.so.10
#3  0x00007ffff42e5ec7 in int_err_get_item () from /lib64/libcrypto.so.10
#4  0x00007ffff42e6393 in ERR_func_error_string () from /lib64/libcrypto.so.10
#5  0x00007ffff19bf2a0 in ERR_load_SSL_strings () from /lib64/libssl.so.10
#6  0x00007ffff2acfd49 in tlso_init () from /lib64/libldap-2.4.so.2
#7  0x00007ffff2acdbf9 in ldap_int_tls_start () from /lib64/libldap-2.4.so.2
#8  0x00007ffff2ace051 in ldap_start_tls_s () from /lib64/libldap-2.4.so.2
#9  0x00007ffff2cf8900 in init_ldap_connection () from /usr/lib64/autofs/lookup_ldap.so
#10 0x00007ffff2cf8b0d in do_connect () from /usr/lib64/autofs/lookup_ldap.so
#11 0x00007ffff2cf920f in connect_to_server () from /usr/lib64/autofs/lookup_ldap.so
#12 0x00007ffff2cf96db in do_reconnect () from /usr/lib64/autofs/lookup_ldap.so
#13 0x00007ffff2cfcf27 in lookup_mount () from /usr/lib64/autofs/lookup_ldap.so
#14 0x000055555556c01d in do_lookup_mount ()
#15 0x000055555556cd31 in lookup_nss_mount ()
#16 0x00005555555636d0 in do_mount_indirect ()
#17 0x00007ffff7bc6dd5 in start_thread () from /lib64/libpthread.so.0
#18 0x00007ffff67f7ead in clone () from /lib64/libc.so.6

Comment 10 Ian Kent 2018-11-05 23:29:01 UTC
(In reply to Ian Allison from comment #8)
> Sorry, that should have been `ldd /usr/sbin/automount` above, and here is a
> backtrace in case it is helpful

It might be useful to have a core and sosreport so we can setup
a lab system to look at it.

Comment 11 Ian Allison 2018-11-06 00:17:49 UTC
Created attachment 1502242 [details]
sosreport for segfaulting system

Comment 12 Ian Allison 2018-11-06 00:18:46 UTC
Created attachment 1502243 [details]
coredump from segfaulting system

Comment 13 Tomas Mraz 2018-11-06 07:55:11 UTC
If I understand correctly what is written in previous comments when krb5 libraries are downgraded to some older version the crash disappears. 

I see that the older krb5 libraries were not linked to openssl but the current ones are. Possible cause could be loading and unloading the openssl library from krb5 libraries when the ldap library uses the openssl library later again. This scenario does not really work well and can cause such issues.

A possible fix could be to force krb5 libraries to not unload from the automount process but maybe the fixing would have to be done in krb5 libraries themselves.

Comment 14 Ian Kent 2018-11-06 10:38:41 UTC
(In reply to Tomas Mraz from comment #13)
> If I understand correctly what is written in previous comments when krb5
> libraries are downgraded to some older version the crash disappears. 
> 
> I see that the older krb5 libraries were not linked to openssl but the
> current ones are. Possible cause could be loading and unloading the openssl
> library from krb5 libraries when the ldap library uses the openssl library
> later again. This scenario does not really work well and can cause such
> issues.

I'm having a bit of trouble understanding the usage sequence
you're describing. Could you explain the possible sequence of
operations a little more, please?

> 
> A possible fix could be to force krb5 libraries to not unload from the
> automount process but maybe the fixing would have to be done in krb5
> libraries themselves.

I can dlopen() (and dlclose() at exit, in the automount daemon)
any libraries that are needed to prevent this from happening
but I'd like to know a little more about which ones I should be
doing this for so I can be sure that a test build has a chance of
resolving the problem, if this is in fact the problem.

I see the ldap library depends on the nss library, doesn't the
nss library have a nasty feature of re-initialising libraries
(not sure which ones actually) on fork(2), possibly not too good
for threaded applications with several indirect dependencies ...
I can't do anything about that if that's what's happening.

Comment 15 Tomas Mraz 2018-11-06 11:01:59 UTC
krb5 library loads and uses OpenSSL and then it unloads it - i.e. calls 
ERR_free_strings()
EVP_cleanup()
or other cleanup functions

Then later ldap library loads OpenSSL and tries to use it - this won't work if OpenSSL was cleaned up before.

Also the LDAP library as it is apparent from the backtrace in comment 8 above uses OpenSSL and not NSS.
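
One way to check the load, unload, reload sequence suggested here from inside a process is to ask the dynamic loader whether a library is still mapped. The following is a minimal sketch, assuming a glibc-style `dlopen()` that supports `RTLD_NOLOAD`. The helper names `is_resident` and `resident_after_load_and_unload` are hypothetical and are not part of autofs, krb5 or OpenLDAP.

```c
#include <dlfcn.h>
#include <stddef.h>

/* Return 1 if `soname` is currently mapped into this process, 0 otherwise.
 * RTLD_NOLOAD only hands back a handle when the object is already
 * resident; it never loads anything new. */
int is_resident(const char *soname)
{
    void *h = dlopen(soname, RTLD_LAZY | RTLD_NOLOAD);

    if (h == NULL)
        return 0;
    dlclose(h);   /* drop the extra reference taken by the probe */
    return 1;
}

/* Model of the suspected sequence: a first user loads the library and
 * then drops it again. Returns 1 if the library is still resident
 * afterwards, 0 if it was unmapped, -1 if it could not be loaded. */
int resident_after_load_and_unload(const char *soname)
{
    void *h = dlopen(soname, RTLD_NOW);

    if (h == NULL)
        return -1;
    dlclose(h);
    return is_resident(soname);
}
```

In a process where nothing else holds a reference, `resident_after_load_and_unload("libcrypto.so.10")` would normally return 0, meaning the library was unmapped and any pointers into its data are stale. If another loaded object still depends on it, the result is 1. The soname is the one shown in the backtrace in comment 8.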

Comment 16 Karsten Weiss 2018-11-28 12:34:04 UTC
The character of this bug (autofs crashes while looking up/printing error strings because some of their data structures are no longer mapped) reminds me of my old autofs bugs [bz1197622](https://bugzilla.redhat.com/show_bug.cgi?id=1197622) / [bz1381924](https://bugzilla.redhat.com/show_bug.cgi?id=1381924)...

Comment 17 Ian Kent 2018-11-29 00:24:17 UTC
(In reply to Karsten Weiss from comment #16)
> The character of this bug (autofs crashes while looking up/printing error
> strings because some of their data structures are no longer mapped) reminds
> me of my old autofs bugs
> [bz1197622](https://bugzilla.redhat.com/show_bug.cgi?id=1197622) /
> [bz1381924](https://bugzilla.redhat.com/show_bug.cgi?id=1381924)...

Yes it is similar in that it appears that some library data
has gone missing.

The core doesn't give much information, and the way the OpenSSL
code is written makes it much harder to work out what's going on.
AFAICT there's no list structure involved at all.

Just because the stack trace doesn't show nss is being used I
don't think we can assume it isn't involved here.

Comment 18 Ian Kent 2018-11-29 00:33:37 UTC
(In reply to Ian Kent from comment #17)
> (In reply to Karsten Weiss from comment #16)
> > The character of this bug (autofs crashes while looking up/printing error
> > strings because some of their data structures are no longer mapped) reminds
> > me of my old autofs bugs
> > [bz1197622](https://bugzilla.redhat.com/show_bug.cgi?id=1197622) /
> > [bz1381924](https://bugzilla.redhat.com/show_bug.cgi?id=1381924)...
> 
> Yes it is similar in that it appears that some library data
> has gone missing.
> 
> The core doesn't give much information, and the way the OpenSSL
> code is written makes it much harder to work out what's going on.
> AFAICT there's no list structure involved at all.
> 
> Just because the stack trace doesn't show nss is being used I
> don't think we can assume it isn't involved here.

Also there's no instance of either ERR_free_strings() or
EVP_cleanup() anywhere in the source of OpenLDAP or krb5.

Comment 19 Frank Sorenson 2018-11-29 13:40:06 UTC
per comment 7, the problem disappears if krb5 is rebuilt without the '--with-crypto-impl=openssl' flag.

This change was made in bz1570600 ("krb5-libs uses slow crypto implementation"),
so this segfault would be a regression caused by that bz.

Comment 22 Frank Sorenson 2018-11-29 16:31:50 UTC
backtrace from customer case 2261106:

#0  err_string_data_cmp (a=0x7fc4a54a0320, b=0x7fc4a0f45320) at err.c:354
#1  err_string_data_LHASH_COMP (arg1=0x7fc4a54a0320, arg2=0x7fc4a0f45320) at err.c:354
#2  0x00007fc4a6919f09 in getrn (lh=lh@entry=0x55d3c783dc10, data=data@entry=0x7fc4a0f45320, rhash=rhash@entry=0x7fc4a0f452d0) at lhash.c:415
#3  0x00007fc4a691a52a in lh_retrieve (lh=lh@entry=0x55d3c783dc10, data=data@entry=0x7fc4a0f45320) at lhash.c:248
#4  0x00007fc4a691cec7 in int_err_get_item (d=0x7fc4a0f45320) at err.c:394
#5  0x00007fc4a691d393 in ERR_func_error_string (e=<optimized out>) at err.c:972
#6  0x00007fc4a3ff62a0 in ERR_load_SSL_strings () at ssl_err.c:835
#7  0x00007fc4a3fe8832 in SSL_load_error_strings () at ssl_err2.c:67
#8  0x00007fc4a5106d49 in tlso_init () at tls_o.c:148
#9  0x00007fc4a5104bf9 in ldap_int_tls_start (ld=ld@entry=0x7fc494001c80, conn=conn@entry=0x7fc49400afd0, srv=srv@entry=0x7fc494000ca0) at tls2.c:902
#10 0x00007fc4a50ddd01 in ldap_int_open_connection (ld=ld@entry=0x7fc494001c80, conn=conn@entry=0x7fc49400afd0, srv=0x7fc494000ca0, async=async@entry=0) at open.c:448
#11 0x00007fc4a50f107d in ldap_new_connection (ld=ld@entry=0x7fc494001c80, srvlist=srvlist@entry=0x7fc494000988, use_ldsb=use_ldsb@entry=1, connect=connect@entry=1, bind=bind@entry=0x0, 
    m_req=m_req@entry=0, m_res=m_res@entry=0) at request.c:487
#12 0x00007fc4a50dd19f in ldap_open_defconn (ld=ld@entry=0x7fc494001c80) at open.c:41
#13 0x00007fc4a50f2388 in ldap_send_initial_request (ld=ld@entry=0x7fc494001c80, msgtype=msgtype@entry=96, dn=dn@entry=0x0, ber=0x7fc494009f80, msgid=1) at request.c:130
#14 0x00007fc4a50e73c9 in ldap_sasl_bind (ld=ld@entry=0x7fc494001c80, dn=dn@entry=0x0, mechanism=mechanism@entry=0x0, cred=cred@entry=0x7fc4a0f45690, sctrls=sctrls@entry=0x0, cctrls=0x0, 
    msgidp=msgidp@entry=0x7fc4a0f45624) at sasl.c:164
#15 0x00007fc4a50e77f9 in ldap_sasl_bind_s (ld=ld@entry=0x7fc494001c80, dn=dn@entry=0x0, mechanism=mechanism@entry=0x0, cred=cred@entry=0x7fc4a0f45690, sctrls=sctrls@entry=0x0, 
    cctrls=cctrls@entry=0x0, servercredp=servercredp@entry=0x0) at sasl.c:198
#16 0x00007fc4a50e8095 in ldap_simple_bind_s (ld=0x7fc494001c80, dn=dn@entry=0x0, passwd=passwd@entry=0x0) at sbind.c:113
#17 0x00007fc4a532f687 in bind_ldap_simple (logopt=logopt@entry=0, ldap=<optimized out>, uri=uri@entry=0x0, ctxt=ctxt@entry=0x7fc49c0058d0) at lookup_ldap.c:199
#18 0x00007fc4a532fb7f in do_bind (ctxt=0x7fc49c0058d0, uri=0x0, conn=0x7fc4a0f458f0, logopt=0) at lookup_ldap.c:587
#19 do_connect (logopt=0, conn=0x7fc4a0f458f0, uri=0x0, ctxt=0x7fc49c0058d0) at lookup_ldap.c:656
#20 0x00007fc4a5330407 in do_reconnect (logopt=0, conn=0x7fc4a0f458f0, ctxt=0x7fc49c0058d0) at lookup_ldap.c:969
#21 0x00007fc4a5333f27 in lookup_one (ap=<optimized out>, ap=<optimized out>, ctxt=0x7fc49c0058d0, qKey_len=2, qKey=0x7fc494000b90 "ad", source=0x7fc49c000910) at lookup_ldap.c:2986
#22 match_key (ctxt=0x7fc49c0058d0, key_len=2, key=0x7fc494000b90 "ad", source=0x7fc49c000910, ap=0x55d3c795c8c0) at lookup_ldap.c:3485
#23 check_map_indirect (ctxt=0x7fc49c0058d0, key_len=2, key=0x7fc494000b90 "ad", source=0x7fc49c000910, ap=0x55d3c795c8c0) at lookup_ldap.c:3572
#24 lookup_mount (ap=0x55d3c795c8c0, name=<optimized out>, name_len=<optimized out>, context=0x7fc49c0058d0) at lookup_ldap.c:3725
#25 0x000055d3c637e01d in do_lookup_mount (ap=ap@entry=0x55d3c795c8c0, map=0x7fc49c000910, name=name@entry=0x7fc4a0f49d80 "ad", name_len=name_len@entry=2) at lookup.c:850
#26 0x000055d3c637ed31 in lookup_name_source_instance (name_len=2, name=0x7fc4a0f49d80 "ad", type=0x7fc494000b70 "ldap", map=0x55d3c795c9e0, ap=0x55d3c795c8c0) at lookup.c:986
#27 lookup_map_name (this=0x7fc494000b30, name_len=2, name=0x7fc4a0f49d80 "ad", map=0x55d3c795c9e0, ap=0x55d3c795c8c0) at lookup.c:1041
#28 lookup_nss_mount (ap=ap@entry=0x55d3c795c8c0, source=source@entry=0x0, name=name@entry=0x7fc4a0f49d80 "ad", name_len=2) at lookup.c:1276
#29 0x000055d3c63756d0 in do_mount_indirect (arg=<optimized out>) at indirect.c:776
#30 0x00007fc4aa1fddd5 in start_thread (arg=0x7fc4a0f4c700) at pthread_create.c:307
#31 0x00007fc4a8e2eead in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Comment 24 Mario Trangoni 2018-11-29 18:28:05 UTC
I found that when upgrading from 7.5 to 7.6, sss disappears from the /etc/nsswitch.conf automount services.

See,

-automount:  files sss ldap
+automount:  files ldap

After I reinserted it, autofs started working again.

Comment 25 Robbie Harwood 2018-11-29 19:02:49 UTC
Tomáš, can you provide more information on #c15 for me?  We don't call those functions anywhere in krb5.

Comment 26 Tomas Mraz 2018-11-30 08:31:56 UTC
In openssl-1.0.2 it is possible that the unload happens on libcrypto library unload regardless of whether krb5 is calling the unload functions.

Perhaps using RTLD_NODELETE or keeping libkrb5 loaded in autofs could help as a workaround?
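
The RTLD_NODELETE idea could look roughly like the sketch below. It assumes a glibc-style `dlopen()`. The function names `pin_library` and `pin_libraries` are hypothetical, and this is an illustration of the suggestion rather than code taken from autofs or krb5.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Ask the loader to keep `soname` mapped for the life of the process.
 * Returns 0 on success, -1 if the library could not be opened. */
int pin_library(const char *soname)
{
    void *h = dlopen(soname, RTLD_NOW | RTLD_NODELETE);

    if (h == NULL) {
        fprintf(stderr, "cannot pin %s: %s\n", soname, dlerror());
        return -1;
    }
    /* With RTLD_NODELETE set, dlclose() releases the handle but the
     * object itself is not unloaded. */
    dlclose(h);
    return 0;
}

/* Pin every library in a NULL-terminated list.
 * Returns the number of libraries that could not be pinned. */
int pin_libraries(const char *const *sonames)
{
    int failures = 0;

    for (; *sonames != NULL; sonames++)
        if (pin_library(*sonames) != 0)
            failures++;
    return failures;
}
```

For the case in this bug, the list passed to `pin_libraries()` would presumably be `"libssl.so.10"` and `"libcrypto.so.10"` (the sonames in the backtrace), called once early in daemon startup before any lookup module is loaded.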

Comment 27 Ian Kent 2019-01-30 03:40:57 UTC
(In reply to Tomas Mraz from comment #26)
> In openssl-1.0.2 it is possible that the unload happens on libcrypto library
> unload regardless of whether krb5 is calling the unload functions.
> 
> Perhaps using RTLD_NODELETE or keeping libkrb5 loaded in autofs could help
> as a workaround?

Yes, the question is which library (or libraries) need to be pinned.

The problem looks like it is caused by a shared library defining
static data and then passing the address (or addresses) of this
to some other shared library without regard to the possibility
that shared libraries might be unloaded and reloaded later.

In this case (and I'm still not sure about this) it looks like
libssl defines static error table strings and libcrypto is trying
to use them and segfaulting because the previous static address
is no longer valid.

So perhaps pinning libssl and libcrypto would be sufficient to
work around this.

Tomas I'd appreciate it if you could check if what I think I
see is correct wrt. libssl and libcrypto and whether you agree
this would be sufficient to resolve this.

Ian


Comment 28 Tomas Mraz 2019-01-30 09:01:54 UTC
Yes, pinning libssl and libcrypto or libkrb5 in memory by RTLD_NODELETE should help.

Comment 29 Ian Kent 2019-02-06 02:48:11 UTC
Let's try pinning libssl and libcrypto at automount startup
to see if that helps with this problem.

I have made an autofs build that does this, it's located at:
http://people.redhat.com/~ikent/autofs-5.0.7-102.ossl.1.el7/

Please give this a try and report back.

Comment 30 Ian Allison 2019-02-06 21:17:30 UTC
It looks like krb5 has been updated (krb5-devel,krb5-libs,krb5-workstation,libkadm5 all go from 1.15.1-34 to 1.15.1-37.el7) and there's a message on the ChangeLog that...

2018-12-18 - Robbie Harwood <rharwood> - 1.15.1-37
- Bring back builtin crypto (openssl broke too many FIPS setups)
- Resolves: #1657890

Installing that update, then restarting autofs seems to fix my problem, the mount completes without the segfault.


Installing your autofs package also works. If I downgrade to 1.15.1-34 (which brings back the segfault) then install your autofs package the mount completes successfully.

I don't know what the reasoning for the crypto change in krb5 was, but your fix would allow them to use openssl without breaking autofs and something similar might work for other applications. 

Thank you!

Comment 31 Robbie Harwood 2019-02-06 22:38:12 UTC
Well, as I said in the changelog, using openssl there broke too many existing FIPS setups :)

I'm glad that it's accidentally resolved for RHEL-7, but please note that this issue probably still occurs in RHEL-8 because RHEL-8 krb5 uses openssl for everything (except curve25519).

Comment 32 Ian Kent 2019-02-07 00:59:28 UTC
(In reply to Ian Allison from comment #30)
> It looks like krb5 has been updated
> (krb5-devel,krb5-libs,krb5-workstation,libkadm5 all go from 1.15.1-34 to
> 1.15.1-37.el7) and there's a message on the ChangeLog that...
> 
> 2018-12-18 - Robbie Harwood <rharwood> - 1.15.1-37
> - Bring back builtin crypto (openssl broke too many FIPS setups)
> - Resolves: #1657890
> 
> Installing that update, then restarting autofs seems to fix my problem, the
> mount completes without the segfault.
> 
> 
> Installing your autofs package also works. If I downgrade to 1.15.1-34
> (which brings back the segfault) then install your autofs package the mount
> completes successfully.
> 
> I don't know what the reasoning for the crypto change in krb5 was, but your
> fix would allow them to use openssl without breaking autofs and something
> similar might work for other applications. 

The idea is simple enough to implement: at application startup, dlopen() the
two shared libraries, one of which hands its static data to the other, so
they aren't unloaded while the static data is in use (and dlclose() them
at application exit).
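
As a rough illustration of that pattern (not the patch attached to this bug), the sketch below takes a reference on each library at startup and releases it from an `atexit()` handler. The names `hold_library`, `held_library_count` and `MAX_HELD` are hypothetical, and the code assumes it is called before any worker threads are started.

```c
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_HELD 8   /* arbitrary limit for this sketch */

static void *held[MAX_HELD];
static int held_count;

/* Runs at normal process exit: release the references taken at startup. */
static void release_held_libraries(void)
{
    while (held_count > 0)
        dlclose(held[--held_count]);
}

/* Take and keep a reference on `soname`, so that dlclose() calls made by
 * other users cannot drop its reference count to zero.
 * Returns 0 on success, -1 on failure. */
int hold_library(const char *soname)
{
    static int exit_hook_registered;
    void *h;

    if (held_count >= MAX_HELD)
        return -1;
    h = dlopen(soname, RTLD_NOW);
    if (h == NULL) {
        fprintf(stderr, "cannot hold %s: %s\n", soname, dlerror());
        return -1;
    }
    if (!exit_hook_registered) {
        if (atexit(release_held_libraries) != 0) {
            dlclose(h);
            return -1;
        }
        exit_hook_registered = 1;
    }
    held[held_count++] = h;
    return 0;
}

/* Number of libraries currently held open by hold_library(). */
int held_library_count(void)
{
    return held_count;
}
```

Unlike RTLD_NODELETE, this keeps the libraries mapped only because the application itself still holds a handle, so it has to run before the first user of the library and the handles must not be closed early.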

It's also understandable why this is done, although it shouldn't ever be done
between shared libraries. Shared library static data should only ever be used
within the same library, pointers to it should never be passed to another
shared library.

Changing the way this is done is a non-trivial task, because sharing the
data this way would mean allocating it from the application's heap, with
all the difficulties of cleanup and consistency that come with it,
particularly with library unload/load behaviours.
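
To make that trade-off concrete, here is a deliberately simplified model of a string registry that copies what it is given instead of keeping the caller's pointer. It is not OpenSSL's err.c. The names `registry_add`, `registry_lookup`, `registry_clear` and `MAX_ENTRIES` are hypothetical and only illustrate the ownership problem.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_ENTRIES 32   /* arbitrary limit for this sketch */

struct err_entry {
    unsigned long code;
    char *text;          /* heap copy owned by the registry */
};

static struct err_entry entries[MAX_ENTRIES];
static size_t entry_count;

/* Register an error string by copying it, so the registry no longer
 * depends on the caller's storage staying mapped.
 * Returns 0 on success, -1 on failure. */
int registry_add(unsigned long code, const char *text)
{
    char *copy;

    if (entry_count >= MAX_ENTRIES)
        return -1;
    copy = malloc(strlen(text) + 1);
    if (copy == NULL)
        return -1;
    strcpy(copy, text);
    entries[entry_count].code = code;
    entries[entry_count].text = copy;
    entry_count++;
    return 0;
}

/* Return the registered text for `code`, or NULL if there is none. */
const char *registry_lookup(unsigned long code)
{
    size_t i;

    for (i = 0; i < entry_count; i++)
        if (entries[i].code == code)
            return entries[i].text;
    return NULL;
}

/* The cost of copying: someone has to free the copies, exactly once,
 * and no caller may keep using a looked-up string afterwards. */
void registry_clear(void)
{
    while (entry_count > 0)
        free(entries[--entry_count].text);
}
```

The copies survive the provider going away, but `registry_clear()` now has to be called at the right time by the right party, which is the cleanup and consistency burden mentioned above.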

I have had a couple of other workarounds in autofs for quite a while for
conceptually similar shared library implementation shortcomings, not sure
I want to also add workarounds for nss/nspr and libssl/libcrypto upstream
... but I probably have no choice since they don't appear to be easily
fixable.

And the dlopen()/dlclose() done here doesn't help applications other than
autofs either, each application would need to be updated, and probably
have to carry RHEL only patches for quite some time which should be
avoided if at all possible. Point being I'm not sure if upstream Kerberos
would be willing to do this since it's actually an implementation problem
with another package. I expect they will say "fix it in the other package"
and be done with it. And in principle they are justified in saying that.

Or maybe there is a simpler way to fix this (within the library): perhaps
there is a way to bump the reference count on the shared libraries that do
this, so they aren't unloaded while the static data is in use. I don't know
what's possible here myself; consulting a specialist in this area might
help.

Ian

Comment 33 Ian Kent 2019-02-18 02:39:30 UTC
Created attachment 1535792 [details]
Patch - openssl workaround

Comment 40 errata-xmlrpc 2019-08-06 13:10:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2250