Bug 1197622 - automount segfaults in error code path (reproducible PRErrorTableList corruption)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: autofs
Version: 6.6
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Ian Kent
QA Contact: Kun Wang
URL:
Whiteboard:
Depends On:
Blocks: 1268411
 
Reported: 2015-03-02 09:04 UTC by Karsten Weiss
Modified: 2019-04-09 13:35 UTC
CC: 17 users

Fixed In Version: autofs-5.0.5-140.el6_10.1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1381924
Environment:
Last Closed: 2019-04-09 13:35:02 UTC
Target Upstream Version:
Embargoed:


Attachments
- crashdump for autofs-5.0.5-123.el6_8.x86_64 crash (2.03 MB, application/x-gzip), 2018-08-29 08:19 UTC, Guillaume Hilt
- Patch - nspr - add locking for error table list (3.52 KB, patch), 2018-10-11 10:14 UTC, Ian Kent (dueno: review-)
- crash dump with autofs proposed fix (2.11 MB, application/x-gzip), 2018-10-14 13:10 UTC, Guillaume Hilt
- Patch - nss workaround (876 bytes, patch), 2019-02-18 00:55 UTC, Ian Kent


Links
Red Hat Product Errata RHBA-2019:0718 (last updated 2019-04-09 13:35:04 UTC)

Description Karsten Weiss 2015-03-02 09:04:35 UTC
tl;dr I think the initialization of automount (autofs) is broken when using
ldap automount maps and causes segfaults in some error code paths. The
reason is (reproducible) corruption of the global PRErrorTableList.
I've seen it both in 6.6 and 7.0! (Full disclosure: My 6.6 test machine
runs CentOS; I've disclosed this problem in a private mail to Ian Kent and he
encouraged me to open this bug.)

In detail:

I'm currently examining an automount segmentation fault that triggered
both on 6.6 and 7.0 systems. It seems to be pretty serious unless I'm
missing something. I've already debugged it a little bit and made some
interesting discoveries but now I'm stuck. Please let me explain from
the beginning:

The following automount oops was caught by abrtd on a 6.6 system
with all the latest updates. The segfault looks like this (7.0 is
exactly the same!):

# abrt-cli list --full
@0
Directory:      /var/spool/abrt/ccpp-2015-02-18-16:58:47-32702
count:          1
executable:     /usr/sbin/automount
package:        autofs-5.0.5-109.el6_6.1
time:           Wed 18 Feb 2015 04:58:47 PM CET
uid:            0

The system is an NFS client which uses "files ldap" to access the automount
map:

# grep automount /etc/nsswitch.conf
automount:      files  ldap

Let's examine the coredump captured by abrtd with gdb:

gdb:
Core was generated by `automount --pid-file /var/run/autofs.pid'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007fd8b54126ac in PR_ErrorToString (code=0xffffe8b7, language=0x0) at ../../../nspr/pr/src/misc/prerrortable.c:102
102             if (et->table->base <= code &&
(gdb) bt
#0  0x00007fd8b54126ac in PR_ErrorToString (code=0xffffe8b7, language=0x0) at ../../../nspr/pr/src/misc/prerrortable.c:102
#1  0x00007fd8b67a32ea in tlsm_session_errmsg (sess=<value optimized out>, rc=<value optimized out>, buf=0x7fd8a01ea270 "", len=0x100) at ../../../libraries/libldap/tls_m.c:2732
#2  0x00007fd8b67a2007 in ldap_int_tls_connect (ld=0x7fd85c01dc20, conn=<value optimized out>) at ../../../libraries/libldap/tls2.c:375
#3  0x00007fd8b67a211e in ldap_int_tls_start (ld=0x7fd85c01dc20, conn=<value optimized out>, srv=<value optimized out>) at ../../../libraries/libldap/tls2.c:860
#4  0x00007fd8b677c50f in ldap_int_open_connection (ld=0x7fd85c01dc20, conn=0x7fd85c021130, srv=0x7fd85c01dca0, async=<value optimized out>) at ../../../libraries/libldap/open.c:448
#5  0x00007fd8b678feef in ldap_new_connection (ld=0x7fd85c01dc20, srvlist=0x7fd85c01d9f8, use_ldsb=0x1, connect=<value optimized out>, bind=0x0, m_req=<value optimized out>, m_res=0x0) at ../../../libraries/libldap/request.c:484
#6  0x00007fd8b677c25f in ldap_open_defconn (ld=0x7fd85c01dc20) at ../../../libraries/libldap/open.c:41
#7  0x00007fd8b6790ff8 in ldap_send_initial_request (ld=0x7fd85c01dc20, msgtype=0x60, dn=0x7fd8b67b1227 "", ber=0x7fd85c0210d0, msgid=0x1) at ../../../libraries/libldap/request.c:130
#8  0x00007fd8b6786654 in ldap_sasl_bind (ld=0x7fd85c01dc20, dn=0x7fd8b67b1227 "", mechanism=0x0, cred=0x7fd8a01ea650, sctrls=0x0, cctrls=<value optimized out>, msgidp=0x7fd8a01ea5fc) at ../../../libraries/libldap/sasl.c:148
#9  0x00007fd8b67868eb in ldap_sasl_bind_s (ld=0x7fd85c01dc20, dn=0x0, mechanism=0x0, cred=0x7fd8a01ea650, sctrls=0x0, cctrls=0x0, servercredp=0x0) at ../../../libraries/libldap/sasl.c:182
#10 0x00007fd8b6786ab0 in ldap_simple_bind_s (ld=0x7fd85c01dc20, dn=0x0, passwd=0x0) at ../../../libraries/libldap/sbind.c:113
#11 0x00007fd8b69c9a16 in bind_ldap_simple (logopt=0x0, ldap=<value optimized out>, uri=0x0, ctxt=0x7fd894000c50) at lookup_ldap.c:199
#12 0x00007fd8b69c9b9c in do_bind (logopt=0x0, uri=0x0, ctxt=0x7fd894000c50) at lookup_ldap.c:587
#13 do_connect (logopt=0x0, uri=0x0, ctxt=0x7fd894000c50) at lookup_ldap.c:667
#14 0x00007fd8b69ca3fc in do_reconnect (logopt=0x0, ctxt=0x7fd894000c50) at lookup_ldap.c:924
#15 0x00007fd8b69cad93 in lookup_one (ap=0x7fd8bad56a30, source=0x7fd894000950, key=0x7fd85c01f570 "INCOMING", key_len=0x8, ctxt=0x7fd894000c50) at lookup_ldap.c:2875
#16 match_key (ap=0x7fd8bad56a30, source=0x7fd894000950, key=0x7fd85c01f570 "INCOMING", key_len=0x8, ctxt=0x7fd894000c50) at lookup_ldap.c:3366
#17 0x00007fd8b69cc057 in check_map_indirect (ap=0x7fd8bad56a30, name=<value optimized out>, name_len=<value optimized out>, context=0x7fd894000c50) at lookup_ldap.c:3453
#18 lookup_mount (ap=0x7fd8bad56a30, name=<value optimized out>, name_len=<value optimized out>, context=0x7fd894000c50) at lookup_ldap.c:3606
#19 0x00007fd8b90364c3 in lookup_name_source_instance (ap=0x7fd8bad56a30, source=0x0, name=0x7fd8a01eee50 "INCOMING", name_len=0x8) at lookup.c:934
#20 lookup_map_name (ap=0x7fd8bad56a30, source=0x0, name=0x7fd8a01eee50 "INCOMING", name_len=0x8) at lookup.c:989
#21 lookup_nss_mount (ap=0x7fd8bad56a30, source=0x0, name=0x7fd8a01eee50 "INCOMING", name_len=0x8) at lookup.c:1183
#22 0x00007fd8b902e758 in do_mount_indirect (arg=<value optimized out>) at indirect.c:768
#23 0x00007fd8b8be79d1 in start_thread (arg=0x7fd8a01f1700) at pthread_create.c:301
#24 0x00007fd8b7b059dd in readahead () at ../sysdeps/unix/syscall-template.S:84
#25 0x0000000000000000 in ?? ()
(gdb) list 86,105
86      PR_IMPLEMENT(const char *)
87      PR_ErrorToString(PRErrorCode code, PRLanguageCode language)
88      {
89          /* static buffer only used if code is using inconsistent error message
90           * numbers, so just ignore the possible thread contention
91           */
92          static char buffer[25];
93
94          const char *msg;
95          int offset;
96          PRErrorCode table_num;
97          struct PRErrorTableList *et;
98          int started = 0;
99          char *cp;
100
101         for (et = Table_List; et; et = et->next) {
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
102             if (et->table->base <= code &&
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
103                 et->table->base + et->table->n_msgs > code) {
104                 /* This is the right table */
105                 if (callback_lookup) {
(gdb) info registers
rax            0xffffd085       0xffffd085
rbx            0x7fd8baccc040   0x7fd8baccc040
rcx            0xffffd000       0xffffd000
rdx            0x7fd8b6410750   0x7fd8b6410750
rsi            0x0      0x0
rdi            0xffffe8b7       0xffffe8b7
rbp            0xffffe8b7       0xffffe8b7
rsp            0x7fd8a01ea200   0x7fd8a01ea200
r8             0x1      0x1
r9             0xa49    0xa49
r10            0x9d     0x9d
r11            0x1      0x1
r12            0x0      0x0
r13            0xffffe8b7       0xffffe8b7
r14            0x7fd8b69bbfc0   0x7fd8b69bbfc0
r15            0x7fd8b69bbfc0   0x7fd8b69bbfc0
rip            0x7fd8b54126ac   0x7fd8b54126ac <PR_ErrorToString+28>
eflags         0x10202  [ IF RF ]
cs             0x33     0x33
ss             0x2b     0x2b
ds             0x0      0x0
es             0x0      0x0
fs             0x0      0x0
gs             0x0      0x0
(gdb) print code
$2 = -5961 
(gdb) print et
$7 = (struct PRErrorTableList *) 0x7fd8baccc040
(gdb) print et->table
$8 = (const struct PRErrorTable *) 0x7fd8b6410750
(gdb) print et->table->base
Cannot access memory at address 0x7fd8b6410760

Explanation: At the time of the crash we had a problem with one of our
LDAP servers. This then caused the automount segfault on the client!
PR_ErrorToString() was called with code -5961. It's the code of the
error string "TCP connection reset by peer" (you'll find it nspr's own
"prerr" PRErrorTable; see nspr/pr/src/misc/prerr.et).

However, the automount process crashes in this case because there is
corruption in nspr's global Table_List (where all the various
PRErrorTables are linked) and PR_ErrorToString() runs into the segfault
before the iteration of Table_List's PRErrorTables finally reaches the
"prerr" PRErrorTable (which is the last in the list) that contains code
-5961.

Here's the state of Table_List when the segfault happens:

(gdb) print *Table_List
$9 = {next = 0x7fd8baccc040, table = 0x7fd8b49a3750, table_private = 0x0}
(gdb) print *Table_List.table
$10 = {msgs = 0x7fd8b49a3780, name = 0x7fd8b479934a "sslerr", base = -12288, n_msgs = 133}
(gdb) print *Table_List.next
$11 = {next = 0x7fd8bac73b30, table = 0x7fd8b6410750, table_private = 0x0}
(gdb) print *Table_List.next.table
Cannot access memory at address 0x7fd8b6410750      !!!!!!!!!!!!!!!!!!!!!!
(gdb) print *Table_List.next.next
$12 = {next = 0x7fd8bac730a0, table = 0x7fd8b5c64820, table_private = 0x0}
(gdb) print *Table_List.next.next.table
$13 = {msgs = 0x7fd8b5c64840, name = 0x7fd8b5a59874 "secerrstrings", base = -8192, n_msgs = 180}
(gdb) print *Table_List.next.next.next
$14 = {next = 0x0, table = 0x7fd8b5631600, table_private = 0x0}
(gdb) print *Table_List.next.next.next.table
$15 = {msgs = 0x7fd8b5631620, name = 0x7fd8b5425de0 "prerr", base = -6000, n_msgs = 77}

I.e. there are four PRErrorTables registered in the singly-linked
Table_List but the second entry is inaccessible: The et->table->base
access causes the PR_ErrorToString() segfault.

Now you may ask: Where do these PRErrorTables registered in Table_List
come from?

The first PRErrorTable "sslerr" is from nss:

root@sana2 nss-3.16.2.3 # grep -r -A1 'struct PRErrorTable' .
./nss/lib/ssl/sslerrstrs.c:static const struct PRErrorTable ssl_et = {
./nss/lib/ssl/sslerrstrs.c-    ssltext, "sslerr", SSL_ERROR_BASE,

The 3rd PRErrorTable "secerrstrings" is from nss-util:

root@sana2 nss-util-3.16.2.3 # grep -r "struct PRErrorTable" -A2 .
./nss/lib/util/errstrs.c:static const struct PRErrorTable sec_et = {
./nss/lib/util/errstrs.c-    sectext, "secerrstrings", SEC_ERROR_BASE, 
./nss/lib/util/errstrs.c-        (sizeof sectext)/(sizeof sectext[0])

The 4th PRErrorTable "prerr" is from nspr itself.

Next questions:

* Where does the 2nd PRErrorTable come from?
* What's the name of the corrupt table?
* How does this Table_List corruption happen?

Let's start a new automount session (=> addresses differ from above) in gdb
and watch the Table_List grow:

# gdb /usr/sbin/automount
(gdb) set args -d -f --pid-file /var/run/autofs.pid
(gdb) break PR_ErrorInstallTable
Function "PR_ErrorInstallTable" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (PR_ErrorInstallTable) pending.
(gdb) info breakpoints 
Num     Type           Disp Enb Address    What
1       breakpoint     keep y   <PENDING>  PR_ErrorInstallTable
(gdb) ignore 1 2 
Will ignore next 2 crossings of breakpoint 1.
(gdb) run

I.e. we ignore the first two breakpoint hits in PR_ErrorInstallTable()
which is used to register the PRErrorTable (the first two calls register
the"prerr" and "secerrstrings" ErrorTables) and run the program:

Now the third time we reach the breakpoint it triggers for the first time:

(gdb) print table
$1 = (const struct PRErrorTable *) 0x7ffff53aa750
(gdb) print table.name
$2 = 0x7ffff51a034a "sslerr"

=> PR_ErrorInstallTable() breakpoint triggers when the third(!) PRErrorTable
"sslerr" wants to register.

Here's the Table_List (struct PRErrorTableList *) status before we let the registration happen:

(gdb) print *Table_List
$3 = {next = 0x7ffff8218260, table = 0x7ffff4bfe820, table_private =
0x0}
(gdb) print *Table_List.table
$4 = {msgs = 0x7ffff4bfe840, name = 0x7ffff49f3874 "secerrstrings", base = -8192, n_msgs = 180}
(gdb) print *Table_List.next
$5 = {next = 0x0, table = 0x7ffff45cb600, table_private = 0x0}
(gdb) print *Table_List.next.table
$6 = {msgs = 0x7ffff45cb620, name = 0x7ffff43bfde0 "prerr", base = -6000, n_msgs = 77}

And here's a backtrace from this point in time (ignore the addresses in
this and the next bt because I captured them from a different session -
it's the call trace that's important and interesting!):

(gdb) bt
#0  PR_ErrorInstallTable (table=0x7ffff53aa750) at ../../../nspr/pr/src/misc/prerrortable.c:166
#1  0x00007ffff43ad16a in PR_CallOnce (once=0x7ffff53adca8, func=<value optimized out>) at ../../../nspr/pr/src/misc/prinit.c:775
#2  0x00007ffff518f3c7 in ssl_InitializePRErrorTable () at sslerrstrs.c:32
#3  0x00007ffff518f40d in ssl_Init () at sslinit.c:21
#4  0x00007ffff51999f7 in SSL_CipherPolicySet (which=49195, policy=1) at sslsock.c:1154
#5  0x00007ffff5199ae1 in NSS_SetDomesticPolicy () at sslsock.c:1310
#6  0x00007ffff6141078 in tlsm_deferred_init (arg=0x7ffff8203300) at ../../../libraries/libldap/tls_m.c:1914
#7  tlsm_deferred_ctx_init (arg=0x7ffff8203300) at ../../../libraries/libldap/tls_m.c:2225
#8  0x00007ffff43ad075 in PR_CallOnceWithArg (once=0x7ffff8203350, func=<value optimized out>, arg=<value optimized out>) at ../../../nspr/pr/src/misc/prinit.c:804
#9  0x00007ffff613e77f in tlsm_session_new (ctx=0x7ffff8203300, is_server=0) at ../../../libraries/libldap/tls_m.c:2612
#10 0x00007ffff613cd94 in alloc_handle (ctx_arg=<value optimized out>, is_server=<value optimized out>) at ../../../libraries/libldap/tls2.c:292
#11 0x00007ffff613cf15 in ldap_int_tls_connect (ld=0x7ffff8214b30, conn=<value optimized out>) at ../../../libraries/libldap/tls2.c:337
#12 0x00007ffff613d11e in ldap_int_tls_start (ld=0x7ffff8214b30, conn=<value optimized out>, srv=<value optimized out>) at ../../../libraries/libldap/tls2.c:860
#13 0x00007ffff611750f in ldap_int_open_connection (ld=0x7ffff8214b30, conn=0x7ffff8216350, srv=0x7ffff8214a70, async=<value optimized out>) at ../../../libraries/libldap/open.c:448
#14 0x00007ffff612aeef in ldap_new_connection (ld=0x7ffff8214b30, srvlist=0x7ffff8214bf8, use_ldsb=1, connect=<value optimized out>, bind=0x0, m_req=<value optimized out>, m_res=0) at ../../../libraries/libldap/request.c:484
#15 0x00007ffff611725f in ldap_open_defconn (ld=0x7ffff8214b30) at ../../../libraries/libldap/open.c:41
#16 0x00007ffff612bff8 in ldap_send_initial_request (ld=0x7ffff8214b30, msgtype=96, dn=0x7ffff614c227 "", ber=0x7ffff8215300, msgid=1) at ../../../libraries/libldap/request.c:130
#17 0x00007ffff6121654 in ldap_sasl_bind (ld=0x7ffff8214b30, dn=0x7ffff614c227 "", mechanism=0x0, cred=0x7fffffff9b70, sctrls=0x0, cctrls=<value optimized out>, msgidp=0x7fffffff9b1c) at ../../../libraries/libldap/sasl.c:148
#18 0x00007ffff61218eb in ldap_sasl_bind_s (ld=0x7ffff8214b30, dn=0x0, mechanism=0x0, cred=0x7fffffff9b70, sctrls=0x0, cctrls=0x0, servercredp=0x0) at ../../../libraries/libldap/sasl.c:182
#19 0x00007ffff6121ab0 in ldap_simple_bind_s (ld=0x7ffff8214b30, dn=0x0, passwd=0x0) at ../../../libraries/libldap/sbind.c:113
#20 0x00007ffff6364a16 in bind_ldap_simple (logopt=0, ldap=<value optimized out>, uri=0x0, ctxt=0x7ffff8212030) at lookup_ldap.c:199
#21 0x00007ffff6364b9c in do_bind (logopt=0, uri=0x0, ctxt=0x7ffff8212030) at lookup_ldap.c:587
#22 do_connect (logopt=0, uri=0x0, ctxt=0x7ffff8212030) at lookup_ldap.c:667
#23 0x00007ffff63653fc in do_reconnect (logopt=0, ctxt=0x7ffff8212030) at lookup_ldap.c:924
#24 0x00007ffff6368945 in lookup_read_master (master=<value optimized out>, age=1424949397, context=0x7ffff8212030) at lookup_ldap.c:1798
#25 0x00007ffff7fcfb32 in do_read_master (master=0x7ffff8202540, type=<value optimized out>, age=1424949397) at lookup.c:96
#26 0x00007ffff7fd1939 in lookup_nss_read_master (master=0x7ffff8202540, age=1424949397) at lookup.c:297
#27 0x00007ffff7fe29ec in master_read_master (master=0x7ffff8202540, age=1424949397, readall=0) at master.c:917
#28 0x00007ffff7fc6be8 in main (argc=0, argv=<value optimized out>) at automount.c:2359

Now let the registration happen with "finish" i.e. code runs until the
end of the current frame:

(gdb) finish
Run till exit from #0  PR_ErrorInstallTable (table=0x7ffff53aa750) at
../../../nspr/pr/src/misc/prerrortable.c:166
PR_CallOnce (once=0x7ffff53adca8, func=<value optimized out>) at
../../../nspr/pr/src/misc/prinit.c:776
776                 PR_Lock(mod_init.ml);
Value returned is $7 = 0

Let's check the Table_List data structure another time:

(gdb) print *Table_List
$8 = {next = 0x7ffff8218d00, table = 0x7ffff53aa750, table_private = 0x0}
(gdb) print *Table_List.table
$9 = {msgs = 0x7ffff53aa780, name = 0x7ffff51a034a "sslerr", base = -12288, n_msgs = 133}
(gdb) print *Table_List.next
$10 = {next = 0x7ffff8218260, table = 0x7ffff4bfe820, table_private = 0x0}
(gdb) print *Table_List.next.table
$11 = {msgs = 0x7ffff4bfe840, name = 0x7ffff49f3874 "secerrstrings", base = -8192, n_msgs = 180}
(gdb) print *Table_List.next.next
$12 = {next = 0x0, table = 0x7ffff45cb600, table_private = 0x0}
(gdb) print *Table_List.next.next.table
$13 = {msgs = 0x7ffff45cb620, name = 0x7ffff43bfde0 "prerr", base = -6000, n_msgs = 77}
(gdb)

As you can see until this point in time the global Table_List is fine. It contains three PRErrorTable entries: "sslerr", "secerrstrings" and "prerr".

However, make sure to remember the address 0x7ffff53aa750 of the
"sslerr" table! This will become important very soon.

Now let's resume the execution:

(gdb) continue
[...]
mounted indirect on /soft with timeout 300, freq 75 seconds
st_ready: st_ready(): state = 0 path /soft

=> The startup phase of automount is finished.

Next we access an automounter path from another shell
("ls -l /home/knweiss") and then this happens:

[...]
handle_packet: type = 3
handle_packet_missing_indirect: token 92, name knweiss, request pid
28165
[New Thread 0x7ffff7f48700 (LWP 28166)]
attempting to mount entry /home/knweiss
lookup_mount: lookup(ldap): looking up knweiss
do_bind: lookup(ldap): auth_required: 1, sasl_mech (null)
[Switching to Thread 0x7ffff7f48700 (LWP 28166)]

Breakpoint 1, PR_ErrorInstallTable (table=0x7ffff393d750) at
../../../nspr/pr/src/misc/prerrortable.c:166
166     {

=> We hit the breakpoint on PR_ErrorInstallTable() a 4th time. This
time, however, from an automount worker thread. It wants to register
another PRErrorTable in the global PRErrorTableList.
Let's examine which one:

(gdb) print table
$14 = (const struct PRErrorTable *) 0x7ffff393d750
(gdb) print table.name
$15 = 0x7ffff373334a "sslerr"
(gdb) 

What is happening here? "sslerr" wants to register *again* although it
is already registered (in the main thread).

Here's another backtrace from this point in time (again, please ignore
the addresses and the different filesystem name...):

(gdb) bt
#0  PR_ErrorInstallTable (table=0x7ffff393d750) at ../../../nspr/pr/src/misc/prerrortable.c:166
#1  0x00007ffff43ad16a in PR_CallOnce (once=0x7ffff3940ca8, func=<value optimized out>) at ../../../nspr/pr/src/misc/prinit.c:775
#2  0x00007ffff37223c7 in ssl_InitializePRErrorTable () at sslerrstrs.c:32
#3  0x00007ffff372240d in ssl_Init () at sslinit.c:21
#4  0x00007ffff372c9f7 in SSL_CipherPolicySet (which=49195, policy=1) at sslsock.c:1154
#5  0x00007ffff372cae1 in NSS_SetDomesticPolicy () at sslsock.c:1310
#6  0x00007ffff5740078 in tlsm_deferred_init (arg=0x7fffac00b5d0) at ../../../libraries/libldap/tls_m.c:1914
#7  tlsm_deferred_ctx_init (arg=0x7fffac00b5d0) at ../../../libraries/libldap/tls_m.c:2225
#8  0x00007ffff43ad075 in PR_CallOnceWithArg (once=0x7fffac00b620, func=<value optimized out>, arg=<value optimized out>) at ../../../nspr/pr/src/misc/prinit.c:804
#9  0x00007ffff573d77f in tlsm_session_new (ctx=0x7fffac00b5d0, is_server=0) at ../../../libraries/libldap/tls_m.c:2612
#10 0x00007ffff573bd94 in alloc_handle (ctx_arg=<value optimized out>, is_server=<value optimized out>) at ../../../libraries/libldap/tls2.c:292
#11 0x00007ffff573bf15 in ldap_int_tls_connect (ld=0x7fffac000f10, conn=<value optimized out>) at ../../../libraries/libldap/tls2.c:337
#12 0x00007ffff573c11e in ldap_int_tls_start (ld=0x7fffac000f10, conn=<value optimized out>, srv=<value optimized out>) at ../../../libraries/libldap/tls2.c:860
#13 0x00007ffff571650f in ldap_int_open_connection (ld=0x7fffac000f10, conn=0x7fffac00b420, srv=0x7fffac000c70, async=<value optimized out>) at ../../../libraries/libldap/open.c:448
#14 0x00007ffff5729eef in ldap_new_connection (ld=0x7fffac000f10, srvlist=0x7fffac001ee8, use_ldsb=1, connect=<value optimized out>, bind=0x0, m_req=<value optimized out>, m_res=0) at ../../../libraries/libldap/request.c:484
#15 0x00007ffff571625f in ldap_open_defconn (ld=0x7fffac000f10) at ../../../libraries/libldap/open.c:41
#16 0x00007ffff572aff8 in ldap_send_initial_request (ld=0x7fffac000f10, msgtype=96, dn=0x7ffff574b227 "", ber=0x7fffac00a3d0, msgid=1) at ../../../libraries/libldap/request.c:130
#17 0x00007ffff5720654 in ldap_sasl_bind (ld=0x7fffac000f10, dn=0x7ffff574b227 "", mechanism=0x0, cred=0x7ffff7f41650, sctrls=0x0, cctrls=<value optimized out>, msgidp=0x7ffff7f415fc) at ../../../libraries/libldap/sasl.c:148
#18 0x00007ffff57208eb in ldap_sasl_bind_s (ld=0x7fffac000f10, dn=0x0, mechanism=0x0, cred=0x7ffff7f41650, sctrls=0x0, cctrls=0x0, servercredp=0x0) at ../../../libraries/libldap/sasl.c:182
#19 0x00007ffff5720ab0 in ldap_simple_bind_s (ld=0x7fffac000f10, dn=0x0, passwd=0x0) at ../../../libraries/libldap/sbind.c:113
#20 0x00007ffff5963a16 in bind_ldap_simple (logopt=0, ldap=<value optimized out>, uri=0x0, ctxt=0x7fffec007530) at lookup_ldap.c:199
#21 0x00007ffff5963b9c in do_bind (logopt=0, uri=0x0, ctxt=0x7fffec007530) at lookup_ldap.c:587
#22 do_connect (logopt=0, uri=0x0, ctxt=0x7fffec007530) at lookup_ldap.c:667
#23 0x00007ffff59643fc in do_reconnect (logopt=0, ctxt=0x7fffec007530) at lookup_ldap.c:924
#24 0x00007ffff5964d93 in lookup_one (ap=0x7ffff8215750, source=0x7fffec000950, key=0x7fffac000ef0 "NICHTDA", key_len=7, ctxt=0x7fffec007530) at lookup_ldap.c:2875
#25 match_key (ap=0x7ffff8215750, source=0x7fffec000950, key=0x7fffac000ef0 "NICHTDA", key_len=7, ctxt=0x7fffec007530) at lookup_ldap.c:3366
#26 0x00007ffff5966057 in check_map_indirect (ap=0x7ffff8215750, name=<value optimized out>, name_len=<value optimized out>, context=0x7fffec007530) at lookup_ldap.c:3453
#27 lookup_mount (ap=0x7ffff8215750, name=<value optimized out>, name_len=<value optimized out>, context=0x7fffec007530) at lookup_ldap.c:3606
#28 0x00007ffff7fd04c3 in lookup_name_source_instance (ap=0x7ffff8215750, source=0x0, name=0x7ffff7f45e50 "NICHTDA", name_len=7) at lookup.c:934
#29 lookup_map_name (ap=0x7ffff8215750, source=0x0, name=0x7ffff7f45e50 "NICHTDA", name_len=7) at lookup.c:989
#30 lookup_nss_mount (ap=0x7ffff8215750, source=0x0, name=0x7ffff7f45e50 "NICHTDA", name_len=7) at lookup.c:1183
#31 0x00007ffff7fc8758 in do_mount_indirect (arg=<value optimized out>) at indirect.c:768
#32 0x00007ffff7b819d1 in start_thread (arg=0x7ffff7f48700) at pthread_create.c:301
#33 0x00007ffff6a9f8fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Let's examine the global Table_List once more before we let this second registration of the "sslerr" PRErrorTable happen:

(gdb) print *Table_List
$16 = {next = 0x7ffff8218d00, table = 0x7ffff53aa750, table_private = 0x0}
(gdb) print *Table_List.table
Cannot access memory at address 0x7ffff53aa750 !!!!!!!!!!!
(gdb) print *Table_List.next
$17 = {next = 0x7ffff8218260, table = 0x7ffff4bfe820, table_private = 0x0}
(gdb) print *Table_List.next.table
$18 = {msgs = 0x7ffff4bfe840, name = 0x7ffff49f3874 "secerrstrings", base = -8192, n_msgs = 180}
(gdb) print *Table_List.next.next
$19 = {next = 0x0, table = 0x7ffff45cb600, table_private = 0x0}
(gdb) print *Table_List.next.next.table
$20 = {msgs = 0x7ffff45cb620, name = 0x7ffff43bfde0 "prerr", base = -6000, n_msgs = 77}
(gdb) print *Table_List.next.next.next
Cannot access memory at address 0x0
(gdb)

As you can see there is already corruption because the automount worker
thread can't access memory at 0x7ffff53aa750! (Remember?)

Let the second registration of the "sslerr" (the 4th in total) PRErrorTable happen anyway:

(gdb) finish
Run till exit from #0  PR_ErrorInstallTable (table=0x7ffff393d750) at
../../../nspr/pr/src/misc/prerrortable.c:166
PR_CallOnce (once=0x7ffff3940ca8, func=<value optimized out>) at
../../../nspr/pr/src/misc/prinit.c:776
776                 PR_Lock(mod_init.ml);
Value returned is $21 = 0
(gdb) print *Table_List
$22 = {next = 0x7ffff8271220, table = 0x7ffff393d750, table_private = 0x0}
(gdb) print *Table_List.table
$23 = {msgs = 0x7ffff393d780, name = 0x7ffff373334a "sslerr", base = -12288, n_msgs = 133}
(gdb) print *Table_List.next
$24 = {next = 0x7ffff8218d00, table = 0x7ffff53aa750, table_private = 0x0}
(gdb) print *Table_List.next.table
Cannot access memory at address 0x7ffff53aa750
(gdb) print *Table_List.next.next
$25 = {next = 0x7ffff8218260, table = 0x7ffff4bfe820, table_private = 0x0}
(gdb) print *Table_List.next.next.table
$26 = {msgs = 0x7ffff4bfe840, name = 0x7ffff49f3874 "secerrstrings", base = -8192, n_msgs = 180}
(gdb) 

=> This is exactly the Table_List state we saw in the automount
coredump. I.e. there are four entries and the 2nd is corrupt.

My preliminary analysis: It's strange that the worker thread even
*tries* to register the "sslerr" PRErrorTable a second time because the
code is full of flags that try to prevent this. However, the worker
thread doesn't see the correct contents of these flags and therefore
thinks that the "sslerr" PRErrorTable still isn't registered!

If we take a closer look at the backtrace of the 4th registration (in
the worker thread) the frames 0-5 look suspicious because even
ssl_Init() shouldn't run a 2nd time! It uses the 'ssl_inited' static
variable to prevent a 2nd initialization. The same in nspr which uses
the "once" data structure to prevent a second initialization:

(gdb) bt
#0  PR_ErrorInstallTable (table=0x7ffff393d750) at ../../../nspr/pr/src/misc/prerrortable.c:166
#1  0x00007ffff43ad16a in PR_CallOnce (once=0x7ffff3940ca8, func=<value optimized out>) at ../../../nspr/pr/src/misc/prinit.c:775
#2  0x00007ffff37223c7 in ssl_InitializePRErrorTable () at sslerrstrs.c:32
#3  0x00007ffff372240d in ssl_Init () at sslinit.c:21
#4  0x00007ffff372c9f7 in SSL_CipherPolicySet (which=49195, policy=1) at sslsock.c:1154
#5  0x00007ffff372cae1 in NSS_SetDomesticPolicy () at sslsock.c:1310
#6  0x00007ffff5740078 in tlsm_deferred_init (arg=0x7fffac00b5d0) at ../../../libraries/libldap/tls_m.c:1914

Unfortunately, from the worker thread's point of view these variables
all say "uninitialized" and it executes the initialization code once
again and the global Table_List ends up with the broken entry.

Also, something (I don't know what) releases the data structure / memory
of the first "sslerr" PRErrorTable and makes its fields
inaccessible without de-registering it from the global Table_List
(PRErrorTableList).

I'm not familiar with all the implementation details of all the
components involved. But from my limited understanding I came to the
conclusion that this could be a serious problem.

Fortunately, the issue only triggers in automount error code paths ("TCP
connection reset by peer" in my case).

Unfortunately, the automount process segfaults before it can fail over to
an alternative LDAP server. I.e. when we have issues with an LDAP
server it can cause a 2nd problem on the NFS client (a dead automount
process).

Comment 1 Ian Kent 2015-03-05 02:24:11 UTC
As I mentioned in email I'm wondering if this is a side effect of
closing the autofs shared library module on HUP signal or when
autofs initiates a map re-read because it thinks the map has
changed.

Before I dive into the code of related packages we really need
to know if this library re-open is causing the problem somehow.

We can check if this is the case fairly easily by just not doing
the close then open on map re-read and then working through the
failure case you have.

Beware that some of the configuration entries relating to things
like the LDAP server etc. won't be updated if they have changed.
So if you're using this test rpm you'll need to restart autofs
to apply configuration changes.

Can you work through your procedure using the rpms here please:
http://people.redhat.com/~ikent/autofs-5.0.5-109.debug.1.el6.1

Ian

Comment 2 Ian Kent 2015-03-05 02:42:41 UTC
(In reply to Karsten Weiss from comment #0)
> tl;dr I think the initialization of automount (autofs) is broken when using
> ldap automount maps and causes segfaults in some error code paths. The
> reason is (reproducible) corruption of the global PRErrorTableList.
> I've seen it both in 6.6 and 7.0! (Full disclosure: My 6.6 test machine
> runs CentOS; I've disclosed this problem in a private mail to Ian Kent and he
> encouraged me to open this bug.)

I don't think it matters if you're running CentOS, at least not
to me.

Logging a bug directly in Red Hat Bugzilla, you face the same
problem as anyone else doing so, including RHEL subscribers.

If there are tasks that need my attention that come in through
support then they must get priority so there's a risk this bug
won't get the time it needs.

But I think that's fair since, apart from the obvious business
sense, for these bugzillas there's also a risk I will be stuck
burning time gathering information that should be done by front
line support.

Comment 3 Karsten Weiss 2015-03-05 12:26:59 UTC
I've tested your autofs test rpms in gdb:

When the code enters PR_ErrorInstallTable() for the 4th time we now see this
situation:

(gdb) print *Table_List
$30 = {next = 0x7ffff8239f30, table = 0x7ffff4b12750, table_private = 0x0}
(gdb) print *Table_List.table
$31 = {msgs = 0x0, name = 0x0, base = 0x0, n_msgs = 0x0}      !!!!!!!!!
(gdb) print *Table_List.next
$32 = {next = 0x7ffff8239490, table = 0x7ffff4366820, table_private = 0x0}
(gdb) print *Table_List.next.table
$33 = {msgs = 0x7ffff4366840, name = 0x7ffff415b874 "secerrstrings", base = 0xffffe000, n_msgs = 0xb4}
(gdb) print *Table_List.next.next
$34 = {next = 0x0, table = 0x7ffff3d33600, table_private = 0x0}
(gdb) print *Table_List.next.next.table
$35 = {msgs = 0x7ffff3d33620, name = 0x7ffff3b27de0 "prerr", base = 0xffffe890, n_msgs = 0x4d}

The first table entry now contains a nullified/empty table (gdb output $31)
(which was a "sslerr" table before).

[
With the *official* autofs rpm the same "print" would have shown this output instead:

(gdb) print *Table_List.table
Cannot access memory at address 0x7ffff4b12750
]

When we finish this 4th call to PR_ErrorInstallTable() we get this result:

(gdb) finish
Run till exit from #0  PR_CallOnce (once=0x7ffff176cca8, func=<value optimized out>) at ../../../nspr/pr/src/misc/prinit.c:776
st_expire: state 1 path /soft
handle_packet: type = 3
handle_packet_missing_indirect: token 906, name sana2, request pid 32684
0x00007ffff154e3c7 in ssl_InitializePRErrorTable () at sslerrstrs.c:32
32          return (PR_SUCCESS == PR_CallOnce(&once, ssl_InitializePRErrorTableOnce))
Value returned is $45 = PR_SUCCESS

(gdb) print *Table_List
$46 = {next = 0x7ffff8292450, table = 0x7ffff1769750, table_private = 0x0}
(gdb) print *Table_List.table
$47 = {msgs = 0x7ffff1769780, name = 0x7ffff155f34a "sslerr", base = 0xffffd000, n_msgs = 0x85}
(gdb) print *Table_List.next
$48 = {next = 0x7ffff8239f30, table = 0x7ffff4b12750, table_private = 0x0}
(gdb) print *Table_List.next.table
$49 = {msgs = 0x0, name = 0x0, base = 0x0, n_msgs = 0x0}
(gdb) print *Table_List.next.next
$50 = {next = 0x7ffff8239490, table = 0x7ffff4366820, table_private = 0x0}
(gdb) print *Table_List.next.next.table
$51 = {msgs = 0x7ffff4366840, name = 0x7ffff415b874 "secerrstrings", base = 0xffffe000, n_msgs = 0xb4}
(gdb) print *Table_List.next.next.next
$52 = {next = 0x0, table = 0x7ffff3d33600, table_private = 0x0}
(gdb) print *Table_List.next.next.next.table
$53 = {msgs = 0x7ffff3d33620, name = 0x7ffff3b27de0 "prerr", base = 0xffffe890, n_msgs = 0x4d}

If we take another look at the (segfaulting) PR_ErrorToString() code in nspr things look better now:

PR_IMPLEMENT(const char *)
PR_ErrorToString(PRErrorCode code, PRLanguageCode language)
{
[...]
    struct PRErrorTableList *et;
[...]
    for (et = Table_List; et; et = et->next) {
!!!
        if (et->table->base <= code &&
!!!
            et->table->base + et->table->n_msgs > code) {
            /* This is the right table */
[...]
            if (callback_lookup) {
                msg = callback_lookup(code, language, et->table,
                    callback_private, et->table_private);
                if (msg) return msg;
            }
    
            return(et->table->msgs[code - et->table->base].en_text);
        }
    }
[...]

As far as I can see the marked "if" now can't segfault anymore:

a) access of et->table is valid for all four PRErrorTableList table entries
   (no more invalid memory accesses)

b) et->table->base (0x0) <= code (-5961) (=> "TCP connection reset by peer")
   is false => we don't enter the if statement's body.

Thus the nullified table should be skipped now. This looks good to me!

Having said that, I still wonder if it really makes sense that we see the 4th
PR_ErrorInstallTable() in the worker thread at all?

I.e. why does the first registration of the "sslerr" table get nullified and
thrown away? Why does the code register the "sslerr" table a 2nd time in the
worker thread - thereby re-doing all the initial (nspr) ssl_Init() work of
the main thread?

Shouldn't there only be a single "sslerr" table registration (probably from the main thread)?

But then, once again: I'm not really familiar with this code. I'm just wondering.

I will keep this autofs debug rpm running on my test system and keep you
updated if I see anything unexpected.

Comment 4 Ian Kent 2015-03-05 12:54:19 UTC
(In reply to Karsten Weiss from comment #3)
> 
> Having said that, I still wonder if it really makes sense that we see the 4th
> PR_ErrorInstallTable() in the worker thread at all?
> 
> I.e. why does the first registration of the "sslerr" table get nullified and
> thrown away? Why does the code register the "sslerr" table a 2nd time in the
> worker thread - thereby re-doing all the initial (nspr) ssl_Init() work of
> the main thread?
> 
> Shouldn't there only be a single "sslerr" table registration (probably from
> the main thread)?

These are questions I can't answer, as I've said I have deliberately
avoided using anything but what I have to, namely the LDAP libraries
in this case, and leave the handling of fragile libraries to those
libraries.

> 
> But then, once again: I'm not really familiar with this code. I'm just
> wondering.

I'm not familiar with the code we're looking at either.
And after the last time I needed to look at it I'm not
keen at all to do it again if I can avoid it.

The only change made to autofs in that rpm is to comment
out the close and then open of autofs's own libraries,
nothing more. That's why a HUP signal won't now pick up
some configuration changes.

The point of the exercise is to work out if the dlclose()
and then dlopen() (of autofs's own libraries only) when a
HUP signal is received or a map re-read is triggered, is
causing a library data segment that contains things this
library assumes won't go away, to go away, and then
come back clean.

If that is the case then the library probably assumes global
variables in data segments it doesn't "own" won't go away,
which isn't good.

If we're sure this is the case or at least not doing the
close then open will resolve the problem I'll need to change
autofs.

That's not straightforward but is better than facing this
problem with who knows how many libraries behaving this way
over time.

Looking forward to your feedback and thanks for the effort.

Ian

Comment 5 Ian Kent 2015-07-09 08:06:07 UTC
I've been working on eliminating the autofs library modules
close then open on receiving a HUP signal.

Doing that was based on the feedback of comment#3.

Now I need to test the changes (particularly so because one
of the modules was quite difficult and fairly complicated
to change) which will be tedious and time consuming so I
thought I'd check whether you have been using the package
of comment#1 and how that went. If I'm wrong about the cause
I'd rather not spend even more time on this change.

Comment 6 Karsten Weiss 2015-07-10 11:59:53 UTC
Unfortunately, the debug autofs rpm was accidentally replaced by the "official" one on this test system after a couple of days.

However, we installed the debug rpm again on another test system that recently ran into several autofs crashes. I'll come back to you on Monday.

(Side note: Did you try to call PR_ErrorToString (-5961, 0x0) unconditionally after the autofs error table initializations on a debug build? Maybe this would result in a simple testcase that can be tested without waiting for TCP problems. It may be worth a try.)

Comment 7 Ian Kent 2015-07-11 02:15:40 UTC
(In reply to Karsten Weiss from comment #6)
> Unfortunately, the debug autofs rpm was accidentally replaced by the
> "official" one on this test system after a couple of days.

LOL, it happens.

> 
> However, we installed the debug rpm again on another test system that
> recently ran into several autofs crashes. I'll come back to you on Monday.

That doesn't sound good, I'll be interested to see what happened.
There is only very little difference in the rpm.

> 
> (Side note: Did you try to call PR_ErrorToString (-5961, 0x0)
> unconditionally after the autofs error table initializations on a debug
> build? Maybe this would result in a simple testcase that can be tested
> without waiting for TCP problems. It may be worth a try.)

There's a fair bit of procedure in initializing ssl none of
which is done in autofs and there really isn't any need for
autofs to call the ssl layer directly.

As I've said before I leave that to the libraries that need
to do it.

Thanks for getting back to me.
Ian

Comment 8 Karsten Weiss 2015-07-13 09:10:44 UTC
On the 2nd test machine we saw two autofs-5.0.5-109.el6_6.1 crashes
(2015-06-30 and 2015-07-03).

We then decided to update autofs to your autofs-5.0.5-109.debug.1.el6.1.x86_64 (2015-07-07)
and so far it is running fine and has not crashed anymore.

However, I just took a closer look at the two existing backtraces of this 2nd test
system and noticed that they are a little bit different than on the first system.
This time the crash happens because of a "double free or corruption (fasttop)"
during a ldap reconnect:

Here's the 2015-06-30 backtrace (the other one is identical) - please notice that this
is very similar to Red Hat solution #717803 "automount core dumps on RHEL 6.5"
(https://access.redhat.com/solutions/717803):

Core was generated by `automount --pid-file /var/run/autofs.pid'.
Program terminated with signal 6, Aborted.
#0  0x00007fc852e09625 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64        return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
Missing separate debuginfos, use: debuginfo-install libcom_err-1.41.12-21.el6.x86_64
(gdb) bt
#0  0x00007fc852e09625 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007fc852e0ae05 in abort () at abort.c:92
#2  0x00007fc852e47537 in __libc_message (do_abort=2, fmt=0x7fc852f2f840 "*** glibc detected *** %s: %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:198
#3  0x00007fc852e4ce66 in malloc_printerr (action=3, str=0x7fc852f2fb18 "double free or corruption (fasttop)", ptr=<value optimized out>) at malloc.c:6336
#4  0x00007fc85255ba35 in ldap_int_tls_destroy (lo=0x7fc7ec020860) at ../../../libraries/libldap/tls2.c:105
#5  0x00007fc85254204f in ldap_ld_free (ld=0x7fc7ec020a70, close=<value optimized out>, sctrls=<value optimized out>, cctrls=<value optimized out>) at ../../../libraries/libldap/unbind.c:209
#6  0x00007fc852784660 in __unbind_ldap_connection (logopt=0, ldap=0x7fc7ec020a70, ctxt=<value optimized out>) at lookup_ldap.c:227
#7  0x00007fc8527849bd in unbind_ldap_connection (logopt=0, ldap=0x7fc7ec020a70, ctxt=0x7fc7f8023ba0) at lookup_ldap.c:239
#8  0x00007fc852784bcd in do_connect (logopt=0, uri=<value optimized out>, ctxt=0x7fc7f8023ba0) at lookup_ldap.c:668
#9  0x00007fc8527853fc in do_reconnect (logopt=0, ctxt=0x7fc7f8023ba0) at lookup_ldap.c:924
#10 0x00007fc852785d93 in lookup_one (ap=0x7fc855e73100, source=0x7fc838000950, key=0x7fc7ec02c0f0 "session.zip", key_len=11, ctxt=0x7fc7f8023ba0) at lookup_ldap.c:2875
#11 match_key (ap=0x7fc855e73100, source=0x7fc838000950, key=0x7fc7ec02c0f0 "session.zip", key_len=11, ctxt=0x7fc7f8023ba0) at lookup_ldap.c:3366
#12 0x00007fc852787057 in check_map_indirect (ap=0x7fc855e73100, name=<value optimized out>, name_len=<value optimized out>, context=0x7fc7f8023ba0) at lookup_ldap.c:3453
#13 lookup_mount (ap=0x7fc855e73100, name=<value optimized out>, name_len=<value optimized out>, context=0x7fc7f8023ba0) at lookup_ldap.c:3606
#14 0x00007fc8543f04c3 in lookup_nss_mount ()
#15 0x00007fc8543e8758 in ?? ()
#16 0x00007fc853fa19d1 in start_thread (arg=0x7fc7fffff700) at pthread_create.c:301
#17 0x00007fc852ebf8fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
(gdb) up 4
#4  0x00007fc85255ba35 in ldap_int_tls_destroy (lo=0x7fc7ec020860) at ../../../libraries/libldap/tls2.c:105
105                     ldap_pvt_tls_ctx_free( lo->ldo_tls_ctx );
(gdb) list
100
101     void
102     ldap_int_tls_destroy( struct ldapoptions *lo )
103     {
104             if ( lo->ldo_tls_ctx ) {
105                     ldap_pvt_tls_ctx_free( lo->ldo_tls_ctx );
106                     lo->ldo_tls_ctx = NULL;
107             }
108
109             if ( lo->ldo_tls_certfile ) {
(gdb) print *lo
$1 = {ldo_valid = 2, ldo_debug = 0, ldo_version = 3, ldo_deref = 0, ldo_timelimit = 10, ldo_sizelimit = 0, ldo_tm_api = {tv_sec = -1, tv_usec = 0}, ldo_tm_net = {tv_sec = 8, tv_usec = 0}, ldo_defludp = 0x0, ldo_defport = 389, 
  ldo_defbase = 0x7fc808000a10 "dc=<redacted>", ldo_defbinddn = 0x0, ldo_keepalive_idle = 0, ldo_keepalive_probes = 0, ldo_keepalive_interval = 0, ldo_refhoplimit = 5, ldo_sctrls = 0x0, ldo_cctrls = 0x0, 
  ldo_rebind_proc = 0, ldo_rebind_params = 0x0, ldo_nextref_proc = 0, ldo_nextref_params = 0x0, ldo_urllist_proc = 0, ldo_urllist_params = 0x0, ldo_conn_cbs = 0x0, ldo_booleans = 1, ldo_peer = 0x0, ldo_cldapdn = 0x0, 
  ldo_is_udp = 0, ldo_tls_ctx = 0x7fc80800ad30, ldo_tls_connect_cb = 0, ldo_tls_connect_arg = 0x0, ldo_tls_info = {lt_certfile = 0x0, lt_keyfile = 0x0, lt_dhfile = 0x0, lt_cacertfile = 0x0, lt_cacertdir = 0x0, 
    lt_ciphersuite = 0x0, lt_crlfile = 0x0, lt_randfile = 0x0, lt_protocol_min = 0}, ldo_tls_mode = 0, ldo_tls_require_cert = 2, ldo_tls_impl = 0, ldo_tls_crlcheck = 0, ldo_def_sasl_mech = 0x0, ldo_def_sasl_realm = 0x0, 
  ldo_def_sasl_authcid = 0x0, ldo_def_sasl_authzid = 0x0, ldo_sasl_secprops = {min_ssf = 0, max_ssf = 2147483647, maxbufsize = 16777215, security_flags = 17, property_names = 0x0, property_values = 0x0}}

(Notice that there are many automount messages like this in the system log:
 automount[3166]: key "session.zip" not found in map source(s).
This is caused by software that is not under our control which performs some strange map lookups
like that. automount still shouldn't crash, though.)

Comment 9 Ian Kent 2015-07-13 10:54:06 UTC
(In reply to Karsten Weiss from comment #8)
> On the 2nd test machine we saw two autofs-5.0.5-109.el6_6.1 crashes
> (2015-06-30 and 2015-07-03).
> 
> We then decided to update autofs to your
> autofs-5.0.5-109.debug.1.el6.1.x86_64 (2015-07-07)
> and so far it is running fine and has not crashed anymore.
> 
> However, I just took a closer look at the two existing backtraces of this
> 2nd test
> system and noticed that they are a little bit different than on the first
> system.
> This time the crash happens because of a "double free or corruption
> (fasttop)"
> during a ldap reconnect:

I'm aware of this (at least this looks like the same problem) and
there were a couple of corrections that came out of that investigation
although they are not included in the current RHEL release because
the customer hasn't tested the change.

I thought I included them in the test rpm I provided but I will
check.

Problem is that I thought the customer with this issue (and another
similar one) had reported the corrections didn't fix the problem
but now that I look I see they haven't given any feedback from testing.

That's a problem because I had started thinking the shared library
close to open on HUP signal (or certain types of map modification) was also
a factor with this problem.

So, assuming I have included all the changes as well as disabling
the close to open on HUP signal in the provided rpm, we should be
good to test.

Will confirm what I've included in the rpm.

Ian

Comment 10 Ian Kent 2015-07-13 11:33:57 UTC
(In reply to Ian Kent from comment #9)
> 
> I thought I included them in the test rpm I provided but I will
> check.
> 

snip ...

> 
> Will confirm what I've included in the rpm.
> 

I see that one change was included in rev 104 but there's
another that I should have included.

Let me add that patch and provide a new test rpm tomorrow.

The backtrace above could be covered by the missing change
or it could be due to the close to open behaviour. But if it
was due to the missing change I'd have expected to see
get_query_dn() in the backtrace.

If it occurs with the updated rpm I'll need to dig deeper.

Ian

Comment 11 Karsten Weiss 2015-07-17 12:03:47 UTC
autofs-5.0.5-109.debug.1.el6.1.x86_64 is still up and running on the 2nd test machine.

Comment 12 Ian Kent 2015-07-22 02:53:54 UTC
(In reply to Karsten Weiss from comment #11)
> autofs-5.0.5-109.debug.1.el6.1.x86_64 is still up and running on the 2nd
> test machine.

That's good to know.

After thinking about this it doesn't give us conclusive
evidence that my assumption of the problem cause is correct.

The problem is that the change in the rpm you have prevents
that code from being executed, and while some of that code is
still well exercised by lookups, the problem could still be
in code that is not used.

I think the better way to go now is to test the actual
change that I've done.

It's certain to have mistakes in it so it will be a challenge
for us both, if you're willing to help with testing it.

So that's the question, are you willing to set aside a machine
or two to do this?

If so I'll get onto back porting the patches and produce a
test rpm.

Ian

Comment 13 Karsten Weiss 2015-07-22 06:58:57 UTC
Yes, I can test a new debug version.

Comment 14 Ian Kent 2015-07-22 11:03:09 UTC
(In reply to Karsten Weiss from comment #13)
> Yes, I can test a new debug version.

Thanks, I'll get onto the back port, I've written it against
the upstream source, and do some initial testing.

Comment 15 Ian Kent 2015-07-30 10:32:35 UTC
I've back ported the changes that I hope will help with
this problem.

I've done some general testing and that worked out OK.

I think it's worth trying this out to see if it does in
fact help with the problem. If it doesn't I don't really
want to keep spending time on this and I'll need to start
looking deeper into dependent library code used by autofs.

The packages can be found at:
http://people.redhat.com/~ikent/autofs-5.0.5-114.bz1243525.1.el6/

These packages have been built against rhel-6.7 but should
work OK with earlier rhel revisions. If not there is a
source rpm included at the location above that can be used
to re-build the package on the target system.

Please be conservative when selecting systems to use for
testing as we don't know what mistakes I've made and there
are quite a few changes included in the build.

Please test and let me know how it goes.
Ian

Comment 16 Karsten Weiss 2015-08-03 07:22:58 UTC
Thanks. I've just updated the test system from autofs-5.0.5-109.debug.1.el6.1.x86_64 to this new version.

Comment 17 Ian Kent 2015-08-26 12:37:35 UTC
There are updated test packages which can be found at:
http://people.redhat.com/~ikent/autofs-5.0.5-114.bz1243525.3.el6/

Changes since revision 1:
- fixed incorrect use of the new lookup module reinit interface.
- fixed a couple of problems resulting from the above change.
- added patch to prevent calling ldap unbind on an ldap connection
  that had never been bound.

Please use these packages for testing.

Comment 18 Karsten Weiss 2015-08-27 10:52:10 UTC
My test system is now using autofs-5.0.5-114.bz1243525.3.el6.x86_64.

Comment 19 Karsten Weiss 2015-08-27 11:31:39 UTC
Ian, I've just examined the Table_List of the new
autofs-5.0.5-114.bz1243525.3.el6.x86_64 right after initialization
and noticed the inaccessible table entry again.

I.e. launch autofs, then attach gdb to the running process and do this:

(gdb) print *Table_List
$1 = {next = 0x7f3c967a7490, table = 0x7f3c912c8710, table_private = 0x0}
(gdb)  print *Table_List.table
$2 = {msgs = 0x7f3c912c8740, name = 0x7f3c910bdb2a "sslerr", base = -12288, n_msgs = 134}
(gdb)  print *Table_List.next
$3 = {next = 0x7f3c96735bc0, table = 0x7f3c92b33710, table_private = 0x0}
(gdb)  print *Table_List.next.table
!!!!!!!!!!!!!!!!!!!!
Cannot access memory at address 0x7f3c92b33710
!!!!!!!!!!!!!!!!!!!!
(gdb)  print *Table_List.next.next
$4 = {next = 0x7f3c96735100, table = 0x7f3c92386820, table_private = 0x0}
(gdb)  print *Table_List.next.next.table
$5 = {msgs = 0x7f3c92386840, name = 0x7f3c9217b8f4 "secerrstrings", base = -8192, n_msgs = 180}
(gdb)  print *Table_List.next.next.next
$6 = {next = 0x0, table = 0x7f3c91d53600, table_private = 0x0}
(gdb)  print *Table_List.next.next.next.table
$7 = {msgs = 0x7f3c91d53620, name = 0x7f3c91b47bc0 "prerr", base = -6000, n_msgs = 77}
(gdb)  print *Table_List.next.next.next.next
Cannot access memory at address 0x0

I'm pretty sure this version would crash in PR_ErrorToString() again (as
initially reported) and we're back to square one. Don't you think so?

Could you please try to reproduce this simple test by yourself?

Comment 20 Karsten Weiss 2015-08-27 11:48:13 UTC
(You'll need the matching nspr-debuginfo package to resolve the Table_List symbol.)

Comment 21 Ian Kent 2015-09-14 05:54:19 UTC
(In reply to Karsten Weiss from comment #0)
> 
> If we take a closer look at the backtrace of the 4th registration (in
> the worker thread) the frames 0-5 look suspicious because even
> ssl_Init() shouldn't run a 2nd time! It uses the 'ssl_inited' static
> variable to prevent a 2nd initialization. The same in nspr which uses
> the "once" data structure to prevent a second initialization:
> 
> (gdb) bt
> #0  PR_ErrorInstallTable (table=0x7ffff393d750) at
> ../../../nspr/pr/src/misc/prerrortable.c:166
> #1  0x00007ffff43ad16a in PR_CallOnce (once=0x7ffff3940ca8, func=<value
> optimized out>) at ../../../nspr/pr/src/misc/prinit.c:775
> #2  0x00007ffff37223c7 in ssl_InitializePRErrorTable () at sslerrstrs.c:32
> #3  0x00007ffff372240d in ssl_Init () at sslinit.c:21
> #4  0x00007ffff372c9f7 in SSL_CipherPolicySet (which=49195, policy=1) at
> sslsock.c:1154
> #5  0x00007ffff372cae1 in NSS_SetDomesticPolicy () at sslsock.c:1310
> #6  0x00007ffff5740078 in tlsm_deferred_init (arg=0x7fffac00b5d0) at
> ../../../libraries/libldap/tls_m.c:1914
> 
> Unfortunately, from the worker thread's point of view these variables
> all say "uninitialized" and it executes the initialization code once
> again and the global Table_List ends up with the broken entry.

And that's what made me think it could be the autofs library
close to open causing library data segments to be discarded
on some libraries but not on others.

Changing that hasn't helped.

> 
> Also, something (I don't know what) releases the data structure / memory
> of the first "sslerr" PRErrorTable and makes its fields
> inaccessible without de-registering it from the global Table_List
> (PRErrorTableList).

Possibly something is keeping track of what it has allocated
and is freeing it on connection shutdown, but I can't find where
that is happening either.

But the nspr code doesn't provide any way to remove an error
table once registered. It appears to assume they will always
be available once registered.

So even the work done on ensuring the nss shutdown function
is called at library unload (in rhel-7.2 I believe) if it
hasn't already been called probably won't help with this
either.

In nspr there's no consideration given to the possibility
there could be multiple concurrent threads modifying the
list of tables either but then the openldap library is said
to not be thread safe for initialization so perhaps that's
not a problem in itself since callers need to serialize
their calls anyway.

I've been working on expanding the autofs ldap serialization
even further than it is now, shouldn't be long before that's
done.

I've also been considering dropping the re-init patch series
and adding code to dlopen() the openldap, nss and nspr
libraries at automount program start to see if that will help
with retaining the expected library state.

Is there anything you care to add regarding this before I
dive into it?

Ian

Comment 22 Karsten Weiss 2015-09-17 02:54:15 UTC
No.

Comment 23 Rich Megginson 2015-09-17 20:59:05 UTC
> I.e. why does the first registration of the "sslerr" table get nullified and
> thrown away? Why does the code register the "sslerr" table a 2nd time in the
> worker thread - thereby re-doing all the initial (nspr) ssl_Init() work of
> the main thread?

This is really mysterious to me.  Here is the code from sslerrstrs.c:

  static const struct PRErrorTable ssl_et = {
    ssltext, "sslerr", SSL_ERROR_BASE,
        (sizeof ssltext)/(sizeof ssltext[0])
  };

  static PRStatus
  ssl_InitializePRErrorTableOnce(void) {
      return PR_ErrorInstallTable(&ssl_et);
  }

  static PRCallOnceType once;

  SECStatus
  ssl_InitializePRErrorTable(void)
  {
      return (PR_SUCCESS == PR_CallOnce(&once, ssl_InitializePRErrorTableOnce))
  		? SECSuccess : SECFailure; 
  }

If I understand this correctly, there should be only 1 copy of `once`, ssl_et, and the functions in the libssl3.so library, and there should be only 1 copy of that library in the process.  Each one of these symbols should always have the same address.  But this is not the case - the first call stack:

#0  PR_ErrorInstallTable (table=0x7ffff53aa750) at ../../../nspr/pr/src/misc/prerrortable.c:166
#1  0x00007ffff43ad16a in PR_CallOnce (once=0x7ffff53adca8, func=<value optimized out>) at ../../../nspr/pr/src/misc/prinit.c:775
#2  0x00007ffff518f3c7 in ssl_InitializePRErrorTable () at sslerrstrs.c:32
#3  0x00007ffff518f40d in ssl_Init () at sslinit.c:21
#4  0x00007ffff51999f7 in SSL_CipherPolicySet (which=49195, policy=1) at sslsock.c:1154
#5  0x00007ffff5199ae1 in NSS_SetDomesticPolicy () at sslsock.c:1310
#6  0x00007ffff6141078 in tlsm_deferred_init (arg=0x7ffff8203300) at ../../../libraries/libldap/tls_m.c:1914

You'll notice that the address of once=0x7ffff53adca8.

Here is the second call stack:

#0  PR_ErrorInstallTable (table=0x7ffff393d750) at ../../../nspr/pr/src/misc/prerrortable.c:166
#1  0x00007ffff43ad16a in PR_CallOnce (once=0x7ffff3940ca8, func=<value optimized out>) at ../../../nspr/pr/src/misc/prinit.c:775
#2  0x00007ffff37223c7 in ssl_InitializePRErrorTable () at sslerrstrs.c:32
#3  0x00007ffff372240d in ssl_Init () at sslinit.c:21
#4  0x00007ffff372c9f7 in SSL_CipherPolicySet (which=49195, policy=1) at sslsock.c:1154
#5  0x00007ffff372cae1 in NSS_SetDomesticPolicy () at sslsock.c:1310
#6  0x00007ffff5740078 in tlsm_deferred_init (arg=0x7fffac00b5d0) at ../../../libraries/libldap/tls_m.c:1914

The addresses of once, table, ssl_InitializePRErrorTable, ssl_Init, SSL_CipherPolicySet, tlsm_deferred_init, etc. are all different, which suggests that the NSS libraries and the openldap libraries were loaded somehow at different locations.

But, the address of PR_CallOnce is the same - 0x00007ffff43ad16a.  So why would there be some sort of per-process shared library symbol address randomization for some of the libraries but not others?

It's almost as if there are two different openldap and NSS libraries loaded in the same process.
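
One way to check for that (standard gdb and /proc techniques, not
something taken from this bug's logs; <pid> stands for the automount
pid) would be to compare the loaded images directly:

  (gdb) info sharedlibrary ssl
  (gdb) shell grep -E 'libssl3|libldap' /proc/<pid>/maps

If a stale, already-unmapped copy of libssl3.so were involved, the gdb
library list and the live process mappings should disagree.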

Comment 24 Rich Megginson 2015-09-18 02:30:26 UTC
> And here's a backtrace from this point in time (ignore the addresses in
> this and the next bt because I captured them from a different session -
> it's the call trace that's important and interesting!):

So, never mind my analysis, back to the drawing board.

Does the automount code call fork() anywhere?  That would cause NSS to unload its modules and reload them again: from the openldap tls_m.c:

        /* NSS enforces the pkcs11 requirement that modules should be unloaded after
           a fork() - since there is no portable way to determine if NSS has been
           already initialized in a parent process, we just call SECMOD_RestartModules
           with force == FALSE - if the module has been unloaded due to a fork, it will
           be reloaded, otherwise, it is a no-op */
        if ( SECFailure == ( rc = SECMOD_RestartModules(PR_FALSE /* do not force */) ) ) {

That would explain why

> What is happening here? "sslerr" wants to register *again* although it
> is already registered (in the main thread).

Both the `once` and the `ssl_et` are addresses in the library.  If the library is unloaded and reloaded, those addresses will be assigned different values.  So now the Table_List is pointing to memory that is no longer used by the process, or possibly to a completely different memory region of the program.

It would seem that the correct thing for NSS to do would be to remove its error tables from NSPR when unloaded, but unfortunately there appears to be no way to do that - there is nothing like a PR_ErrorRemoveTable() call.  There definitely needs to be one to allow libraries that are unloaded to remove their tables.
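
A hypothetical sketch of what such a call could look like, assuming the
singly linked Table_List nodes seen in the gdb sessions above (NSPR has
no such function; everything below is illustrative):

  /* Hypothetical - no such call exists in nspr today. */
  PRStatus PR_ErrorRemoveTable(const struct PRErrorTable *table)
  {
      struct PRErrorTableList **curr = &Table_List;

      while (*curr != NULL) {
          if ((*curr)->table == table) {
              struct PRErrorTableList *victim = *curr;
              *curr = victim->next;   /* unlink the node */
              PR_Free(victim);        /* free the list node only */
              return PR_SUCCESS;
          }
          curr = &(*curr)->next;
      }
      return PR_FAILURE;              /* table was never installed */
  }

A library would then call something like this from its unload path,
before its data segment goes away.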

Another way would be for NSS to use malloc() memory for the `once` object and for the `ssl_et` table.  That would avoid having a corrupted pointer, since the malloc() memory would live forever, but that would leak every time fork() is called - NSS would re-malloc() `once` and `ssl_et`.

An insecure (violates PKCS11 standards) workaround might be to set NSS_STRICT_NOFORK=DISABLED in the automount daemon environment.  openldap is already doing this:

http://www.openldap.org/devel/gitweb.cgi?p=openldap.git;a=blob;f=libraries/libldap/tls_m.c;h=9b101f954d38c8f804aadbddb5de1282e4c36e60;hb=refs/heads/OPENLDAP_REL_ENG_2_4#l3082

but perhaps there is another call somewhere to NSS before this is called, that is controlling the fork behavior.

Comment 25 Ian Kent 2015-09-18 03:31:56 UTC
(In reply to Rich Megginson from comment #24)

Ha, I wrote a response to the previous comment before
seeing this.

It's still mostly relevant and I'll add a slightly modified
version of it below.

But there's also a couple of things to comment on here now too.

> > And here's a backtrace from this point in time (ignore the addresses in
> > this and the next bt because I captured them from a different session -
> > it's the call trace that's important and interesting!):
> 
> So, never mind my analysis, back to the drawing board.
> 
> Does the automount code call fork() anywhere?  That would cause NSS to
> unload its modules and reload them again: from the openldap tls_m.c:

That depends on your definition of fork; certainly autofs does
fork mount(8) to perform mounts, but it does no ldap interaction
in that code.

But it also does most of its actions in separate threads (see
below for a procedural description).

> 
>         /* NSS enforces the pkcs11 requirement that modules should be unloaded after
>            a fork() - since there is no portable way to determine if NSS has been
>            already initialized in a parent process, we just call SECMOD_RestartModules
>            with force == FALSE - if the module has been unloaded due to a fork, it will
>            be reloaded, otherwise, it is a no-op */
>         if ( SECFailure == ( rc = SECMOD_RestartModules(PR_FALSE /* do not force */) ) ) {
> 
> That would explain why
> 
> > What is happening here? "sslerr" wants to register *again* although it
> > is already registered (in the main thread).
> 
> Both the `once` and the `ssl_et` are addresses in the library.  If the
> library is unloaded and reloaded, those address will be assigned different
> values.  So now the Table_List is pointing to memory that is no longer used
> by the process, or possibly in a completely different memory region of the
> program.
> 
> It would seem that the correct thing for NSS to do would be to remove its
> error tables from NSPR when unloaded, but unfortunately there appears to be
> no way to do that - there is nothing like a PR_ErrorRemoveTable() call. 
> There definitely needs to be one to allow libraries that are unloaded to
> remove their tables.

That was, more or less, my conclusion after looking at the nspr
code. And there probably needs to be some serialization on the
list modifications too.

> 
> Another way would be for NSS to use malloc() memory for the `once` object
> and for the `ssl_et` table.  That would avoid having a corrupted pointer,
> since the malloc() memory would live forever, but that would leak every time
> fork() is called - NSS would re-malloc() `once` and `ssl_et`.

Sure, the structures need to have clear ownership, either all nss
or all nspr with associated cleanup for each. But in this case
there appears to be a mix of data ownership between the two
libraries.

> 
> An insecure (violates PKCS11 standards) workaround might be to set
> NSS_STRICT_NOFORK=DISABLED in the automount daemon environment.  openldap is
> already doing this:
> 
> http://www.openldap.org/devel/gitweb.cgi?p=openldap.git;a=blob;f=libraries/libldap/tls_m.c;h=9b101f954d38c8f804aadbddb5de1282e4c36e60;hb=refs/heads/OPENLDAP_REL_ENG_2_4#l3082
> 
> but perhaps there is another call somewhere to NSS before this is called,
> that is controlling the fork behavior.

From a procedural point of view here's how autofs does things.

First autofs does almost all of its actions in distinct threads,
so they're always going to be running from different threads of
the same process. Clearly that implies clone(2) will be called
but it isn't a full fork(2).

I'm not sure if that meets the criteria above for an nss
library re-load.

It gets a little more complicated from this point.

autofs will dlopen() its lookup module shared library for
a given map source, ldap in this case, once for each map that
is being used, each with its own private data segment that
holds map specific data.

On every lookup a connection to the ldap server is opened,
the map queried, and the connection closed.

Now, autofs tries to work out if a map has changed since
last access, based on what it gets back from the map entry
query. And if it thinks it has changed it will initiate a
map re-read.

This essentially results in the equivalent of sending a
HUP signal to the application.

We know that there was a recent regression where autofs
wasn't quite getting this right and excessive map re-reads
were being triggered so we know re-reads were happening.

On receiving a HUP signal autofs will take a write lock
(against a read-write mutex) of the given map to ensure all
lookups for this particular map are completed then dlclose()
the lookup module and then re-open it to effect the
required per map initialization, since the map arguments
may have changed.

So there is plenty of opportunity for dependent demand
loaded shared libraries to think they are no longer in
use and to be unloaded.
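
A minimal sketch of that close-then-open cycle (the identifiers here
are illustrative and do not match the actual autofs source):

  #include <dlfcn.h>
  #include <pthread.h>

  /* Sketch only: per-map lookup module with its own dl handle. */
  struct lookup_mod {
      void *dlhandle;
      /* ... per-map private data ... */
  };

  static pthread_rwlock_t map_lock = PTHREAD_RWLOCK_INITIALIZER;

  static void reload_lookup_module(struct lookup_mod *mod, const char *path)
  {
      pthread_rwlock_wrlock(&map_lock);  /* wait for in-flight lookups */
      dlclose(mod->dlhandle);            /* dependent libraries may unload too */
      mod->dlhandle = dlopen(path, RTLD_NOW);
      /* ... redo per-map initialization, the map arguments may have changed ... */
      pthread_rwlock_unlock(&map_lock);
  }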

One thing is sure though: co-operating libraries should not
make assumptions about the state of another library's data
segment when there is a degree of context shared between the
two. But that does seem to be happening here.

The look of things makes me wonder if there's a library
reference count bug somewhere, causing inconsistent library
unloads between dependent libraries ...

Not sure this is helpful but thought a procedural description
of what happens might help make sense of the above observations.

Ian

Comment 26 Ian Kent 2015-09-18 03:38:18 UTC
(In reply to Rich Megginson from comment #24)
> 
> An insecure (violates PKCS11 standards) workaround might be to set
> NSS_STRICT_NOFORK=DISABLED in the automount daemon environment.  openldap is
> already doing this:
> 
> http://www.openldap.org/devel/gitweb.cgi?p=openldap.git;a=blob;f=libraries/libldap/tls_m.c;h=9b101f954d38c8f804aadbddb5de1282e4c36e60;hb=refs/heads/OPENLDAP_REL_ENG_2_4#l3082
> 
> but perhaps there is another call somewhere to NSS before this is called,
> that is controlling the fork behavior.

I've been careful to not call anything related to libraries
that ldap depends on. Certainly there are no calls to the nss
or nspr libraries in the autofs ldap code.

But there are sasl ldap and Kerberos auth calls going on here
too, with multi-step authentication, so what's called, and when,
within dependent libraries isn't at all straightforward.

Comment 27 Rich Megginson 2015-09-18 15:09:07 UTC
(In reply to Ian Kent from comment #25)
> (In reply to Rich Megginson from comment #24)
> 
> Ha, I wrote a response to the previous comment before
> seeing this.
> 
> It's still mostly relevant and I'll add a slightly modified
> version of it below.
> 
> But there's also a couple of things to comment on here now too.
> 
> > > And here's a backtrace from this point in time (ignore the addresses in
> > > this and the next bt because I captured them from a different session -
> > > it's the call trace that's important and interesting!):
> > 
> > So, never mind my analysis, back to the drawing board.
> > 
> > Does the automount code call fork() anywhere?  That would cause NSS to
> > unload its modules and reload them again: from the openldap tls_m.c:
> 
> That depends on your definition of fork, certainly autofs does
> fork mount(8) to perform mounts but does no ldap interaction
> in that code.

That would trigger the NSS fork detection code, if NSS is used anywhere in the forked process.

> 
> But it also does most of its actions in separate threads (see
> below for a procedural description).
> 
> > 
> >         /* NSS enforces the pkcs11 requirement that modules should be unloaded after
> >            a fork() - since there is no portable way to determine if NSS has been
> >            already initialized in a parent process, we just call SECMOD_RestartModules
> >            with force == FALSE - if the module has been unloaded due to a fork, it will
> >            be reloaded, otherwise, it is a no-op */
> >         if ( SECFailure == ( rc = SECMOD_RestartModules(PR_FALSE /* do not force */) ) ) {
> > 
> > That would explain why
> > 
> > > What is happening here? "sslerr" wants to register *again* although it
> > > is already registered (in the main thread).
> > 
> > Both the `once` and the `ssl_et` are addresses in the library.  If the
> > library is unloaded and reloaded, those address will be assigned different
> > values.  So now the Table_List is pointing to memory that is no longer used
> > by the process, or possibly in a completely different memory region of the
> > program.
> > 
> > It would seem that the correct thing for NSS to do would be to remove its
> > error tables from NSPR when unloaded, but unfortunately there appears to be
> > no way to do that - there is nothing like a PR_ErrorRemoveTable() call. 
> > There definitely needs to be one to allow libraries that are unloaded to
> > remove their tables.
> 
> That was, more or less, my conclusion after looking at the nspr
> code. And there probably needs to be some serialization on the
> list modifications too.
> 
> > 
> > Another way would be for NSS to use malloc() memory for the `once` object
> > and for the `ssl_et` table.  That would avoid having a corrupted pointer,
> > since the malloc() memory would live forever, but that would leak every time
> > fork() is called - NSS would re-malloc() `once` and `ssl_et`.
> 
> Sure, the structures need to have clear ownership, either all nss
> or all nspr with associated cleanup for each. But in this case
> there appears to be a mix of data ownership between the two
> libraries.
> 
> > 
> > An insecure (violates PKCS11 standards) workaround might be to set
> > NSS_STRICT_NOFORK=DISABLED in the automount daemon environment.  openldap is
> > already doing this:
> > 
> > http://www.openldap.org/devel/gitweb.cgi?p=openldap.git;a=blob;f=libraries/libldap/tls_m.c;h=9b101f954d38c8f804aadbddb5de1282e4c36e60;hb=refs/heads/OPENLDAP_REL_ENG_2_4#l3082
> > 
> > but perhaps there is another call somewhere to NSS before this is called,
> > that is controlling the fork behavior.
> 
> From a procedural point of view here's how autofs does things.
> 
> First autofs does almost all of its actions in distinct threads.
> So they're always going to be running from different thread
> processes. Clearly that implies clone(2) will be called but
> it isn't a full fork(2).

It must be a fork(2) to trigger the NSS pkcs11 code.

> 
> I'm not sure if that meets the criteria above for an nss
> library re-load.

No, it doesn't.

> 
> It gets a little more complicated from this point.
> 
> autofs will dlopen() its lookup module shared library for
> a given map source, ldap in this case, once for each map that
> is being used, each with its own private data segment that
> holds map specific data.
> 
> On every lookup a connection to the ldap server is opened,
> the map queried, and the connection closed.
> 
> Now, autofs tries to work out if a map has changed since
> last access, based on what it gets back from the map entry
> query. And if it thinks it has changed it will initiate a
> map re-read.
> 
> This essentially results in the equivalent of sending a
> HUP signal to the application.
> 
> We know that there was a recent regression where autofs
> wasn't quite getting this right and excessive map re-reads
> were being triggered so we know re-reads were happening.
> 
> On receiving a HUP signal autofs will take a write lock
> (against a read-write mutex) of the given map to ensure all
> lookups for this particular map are completed then dlclose()
> the lookup module and then re-open it to effect the
> required per map initialization, since the map arguments
> may have changed.
> 
> So there is plenty of opportunity for dependent demand
> loaded shared libraries to think they are no longer in
> use and to be unloaded.

Do we know if, in this case, automount is being HUP'd?  If it is, that would certainly explain why symbols are being assigned different addresses.  It may also explain why NSPR's symbols are not, since NSPR is used by a wide variety of software, and may have already been loaded into the process address space before openldap and NSS.

If automount is never being HUP'd, and we are getting this problem, then I'm not sure what's going on.

How can we find out if the problem is caused by automount being HUP'd?

> 
> One thing is sure though, all co-operating libraries should
> not assume state of the data segment of another library when
> there is a degree of context shared between the two. But that
> does seem to be happening here.
> 
> The look of things makes me wonder if there's a library
> reference count bug somewhere, causing inconsistent library
> unloads between dependent libraries ...
> 
> Not sure this is helpful but thought a procedural description
> of what happens might help make sense of the above observations.
> 
> Ian

Comment 28 Rich Megginson 2015-09-18 15:10:57 UTC
(In reply to Ian Kent from comment #26)
> (In reply to Rich Megginson from comment #24)
> > 
> > An insecure (violates PKCS11 standards) workaround might be to set
> > NSS_STRICT_NOFORK=DISABLED in the automount daemon environment.  openldap is
> > already doing this:
> > 
> > http://www.openldap.org/devel/gitweb.cgi?p=openldap.git;a=blob;f=libraries/libldap/tls_m.c;h=9b101f954d38c8f804aadbddb5de1282e4c36e60;hb=refs/heads/OPENLDAP_REL_ENG_2_4#l3082
> > 
> > but perhaps there is another call somewhere to NSS before this is called,
> > that is controlling the fork behavior.
> 
> I've been careful to not call anything related to libraries
> that ldap depends on. Certainly there are no calls to the nss
> or nspr libraries in the autofs ldap code.

Not explicitly, but if you are calling into openldap code to handle TLS/SSL connections, then you are implicitly calling into NSS and NSPR.
 
> But there are sasl ldap and Kerberos auth calls going on here
> too, with multi-step authentication, so what's called and
> when within dependent libraries isn't at all straight forward.

There may be sasl or Kerberos libraries that use NSS for crypto.

Comment 29 Ian Kent 2015-09-19 02:38:59 UTC
(In reply to Rich Megginson from comment #27)
> > That depends on your definition of fork, certainly autofs does
> > fork mount(8) to perform mounts but does no ldap interaction
> > in that code.
> 
> That would trigger the NSS fork detection code, if NSS is used anywhere in
> the forked process.

On second thoughts we have to assume that the fork()ed
process is calling NSS related functions because we don't
know if the autofs map entry being mounted uses, for example,
Kerberos to authenticate to the NFS server.

snip ...

> > 
> > It gets a little more complicated from this point.
> > 
> > autofs will dlopen() its lookup module shared library for
> > a given map source, ldap in this case, once for each map that
> > is being used, each with its own private data segment that
> > holds map specific data.
> > 
> > On every lookup a connection to the ldap server is opened,
> > the map queried, and the connection closed.
> > 
> > Now, autofs tries to work out if a map has changed since
> > last access, based on what it gets back from the map entry
> > query. And if it thinks it has changed it will initiate a
> > map re-read.
> > 
> > This essentially results in the equivalent of sending a
> > HUP signal to the application.
> > 
> > We know that there was a recent regression where autofs
> > wasn't quite getting this right and excessive map re-reads
> > were being triggered so we know re-reads were happening.
> > 
> > On receiving a HUP signal autofs will take a write lock
> > (against a read-write mutex) of the given map to ensure all
> > lookups for this particular map are completed then dlclose()
> > the lookup module and then re-open it to effect the
> > required per map initialization, since the map arguments
> > may have changed.
> > 
> > So there is plenty of opportunity for dependent demand
> > loaded shared libraries to think they are no longer in
> > use and to be unloaded.
> 
> Do we know if, in this case, automount is being HUP'd?  If it is, that would
> certainly explain why symbols are being assigned different addresses.  It
> may also explain why NSPR is not, since NSPR is used by a wide variety of
> software, and may have already been loaded into the process address space
> before openldap and NSS.
> 
> If automount is never being HUP'd, and we are getting this problem, then I'm
> not sure what's going on.
> 
> How can we find out if the problem is caused by automount being HUP'd?

automount(8) catches HUP signals and deals with them.
There's no process restart of any kind.
There is a lot of detached thread creation and most of them
run to completion in a fairly brief amount of time.

So I'm not sure that receiving a HUP signal counts.

Ian

Comment 30 Rich Megginson 2015-09-21 17:13:52 UTC
(In reply to Ian Kent from comment #29)
> (In reply to Rich Megginson from comment #27)
> > > That depends on your definition of fork, certainly autofs does
> > > fork mount(8) to perform mounts but does no ldap interaction
> > > in that code.
> > 
> > That would trigger the NSS fork detection code, if NSS is used anywhere in
> > the forked process.
> 
> On second thoughts we have to assume that the fork()ed
> process is calling NSS related functions because we don't
> know if the autofs map entry being mounted uses, for example,
> Kerberos to authenticate to the NFS server.

We could easily verify that this is the problem by setting NSS_STRICT_NOFORK=DISABLED in the automount daemon environment, and trying to reproduce the problem.

> 
> snip ...
> 
> > > 
> > > It gets a little more complicated from this point.
> > > 
> > > autofs will dlopen() its lookup module shared library for
> > > a given map source, ldap in this case, once for each map that
> > > is being used, each with its own private data segment that
> > > holds map specific data.
> > > 
> > > On every lookup a connection to the ldap server is opened,
> > > the map queried, and the connection closed.
> > > 
> > > Now, autofs tries to work out if a map has changed since
> > > last access, based on what it gets back from the map entry
> > > query. And if it thinks it has changed it will initiate a
> > > map re-read.
> > > 
> > > This essentially results in the equivalent of sending a
> > > HUP signal to the application.
> > > 
> > > We know that there was a recent regression where autofs
> > > wasn't quite getting this right and excessive map re-reads
> > > were being triggered so we know re-reads were happening.
> > > 
> > > On receiving a HUP signal autofs will take a write lock
> > > (against a read-write mutex) of the given map to ensure all
> > > lookups for this particular map are completed then dlclose()
> > > the lookup module and then re-open it to effect the
> > > required per map initialization, since the map arguments
> > > may have changed.
> > > 
> > > So there is plenty of opportunity for dependent demand
> > > loaded shared libraries to think they are no longer in
> > > use and to be unloaded.
> > 
> > Do we know if, in this case, automount is being HUP'd?  If it is, that would
> > certainly explain why symbols are being assigned different addresses.  It
> > may also explain why NSPR is not, since NSPR is used by a wide variety of
> > software, and may have already been loaded into the process address space
> > before openldap and NSS.
> > 
> > If automount is never being HUP'd, and we are getting this problem, then I'm
> > not sure what's going on.
> > 
> > How can we find out if the problem is caused by automount being HUP'd?
> 
> automount(8) catches HUP signals and deals with them.

Right, as I understand it from what you have explained, it catches the HUP to reload configuration.  As part of that process, libraries can be unloaded and reloaded.  Is this not the case?

> There's no process restart of any kind.

Right.  Did I mention process restart?

> There is a lot of detached thread creation and most of them
> run to completion in a fairly brief amount of time.

Ok.

> 
> So I'm not sure that receiving a HUP signal counts.

What I want to know is this:

Does sending the automount daemon a SIGHUP cause the crash?  Not directly, but eventually via the NSS crash detailed above?

Does automount crash if it is never sent a SIGHUP?

> 
> Ian

Comment 31 Ian Kent 2015-09-22 01:12:02 UTC
(In reply to Rich Megginson from comment #30)
> (In reply to Ian Kent from comment #29)
> > (In reply to Rich Megginson from comment #27)
> > > > That depends on your definition of fork, certainly autofs does
> > > > fork mount(8) to perform mounts but does no ldap interaction
> > > > in that code.
> > > 
> > > That would trigger the NSS fork detection code, if NSS is used anywhere in
> > > the forked process.
> > 
> > On second thoughts we have to assume that the fork()ed
> > process is calling NSS related functions because we don't
> > know if the autofs map entry being mounted uses, for example,
> > Kerberos to authenticate to the NFS server.
> 
> We could easily verify that this is the problem by setting
> NSS_STRICT_NOFORK=DISABLED in the automount daemon environment, and trying
> to reproduce the problem.

Karsten, this can be done by adding the "NSS_STRICT_NOFORK=DISABLED"
to /etc/sysconfig/autofs and restarting autofs.

Could you check this for us please.
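
For reference, a sketch of the change (NSS_STRICT_NOFORK is a real NSS
environment variable; the comment line is only illustration):

  # /etc/sysconfig/autofs
  # Disable the NSS PKCS#11 fork-safety behaviour (insecure; testing only)
  NSS_STRICT_NOFORK=DISABLED

then:

  service autofs restart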

> 
> > 
> > snip ...
> > 
> > > > 
> > > > It gets a little more complicated from this point.
> > > > 
> > > > autofs will dlopen() its lookup module shared library for
> > > > a given map source, ldap in this case, once for each map that
> > > > is being used, each with its own private data segment that
> > > > holds map specific data.
> > > > 
> > > > On every lookup a connection to the ldap server is opened,
> > > > the map queried, and the connection closed.
> > > > 
> > > > Now, autofs tries to work out if a map has changed since
> > > > last access, based on what it gets back from the map entry
> > > > query. And if it thinks it has changed it will initiate a
> > > > map re-read.
> > > > 
> > > > This essentially results in the equivalent of sending a
> > > > HUP signal to the application.
> > > > 
> > > > We know that there was a recent regression where autofs
> > > > wasn't quite getting this right and excessive map re-reads
> > > > were being triggered so we know re-reads were happening.
> > > > 
> > > > On receiving a HUP signal autofs will take a write lock
> > > > (against a read-write mutex) of the given map to ensure all
> > > > lookups for this particular map are completed then dlclose()
> > > > the lookup module and then re-open it to effect the
> > > > required per map initialization, since the map arguments
> > > > may have changed.
> > > > 
> > > > So there is plenty of opportunity for dependent demand
> > > > loaded shared libraries to think they are no longer in
> > > > use and to be unloaded.
> > > 
> > > Do we know if, in this case, automount is being HUP'd?  If it is, that would
> > > certainly explain why symbols are being assigned different addresses.  It
> > > may also explain why NSPR is not, since NSPR is used by a wide variety of
> > > software, and may have already been loaded into the process address space
> > > before openldap and NSS.
> > > 
> > > If automount is never being HUP'd, and we are getting this problem, then I'm
> > > not sure what's going on.
> > > 
> > > How can we find out if the problem is caused by automount being HUP'd?
> > 
> > automount(8) catches HUP signals and deals with them.
> 
> Right, as I understand it as you have explained it, it catches the HUP to
> reload configuration.  As part of that process, libraries can be unloaded
> and reloaded.  Is this not the case?

There's a little more to it due to the history in this bug.

Basically, yes, automount(8) will close and reopen the map
lookup shared library on receiving a HUP signal (or a service
reload) and also when it thinks the map has been modified,
although a HUP signal isn't needed for the latter.

The history here is that, way back at the start, I provided
an rpm with the close then open code commented out to test
exactly this and Karsten reported he didn't see the corruption
with this package.

That led to the rpm with changes that are supposed to
reinit the lookup modules rather than do the close then
open, but then Karsten reported he had again seen the
corruption with this change in place.

At this point I was stumped, although my change could be
at fault, of course.

I don't think we verified the corruption wasn't occurring
with the original rpm to the point we could be sure the lack
of corruption was completely reproducible.

Karsten, do you still have that original rpm? If not, I can
recreate it and provide it again.

Could you help us with this by performing that test again
please?

Comment 32 Karsten Weiss 2015-09-23 10:11:24 UTC
I've repeated my test in Comment #19 with this autofs version

# rpm -qa|grep autofs
autofs-debuginfo-5.0.5-114.bz1243525.3.el6.x86_64
autofs-5.0.5-114.bz1243525.3.el6.x86_64

and set NSS_STRICT_NOFORK=DISABLED. Result:

(gdb) print *Table_List
$1 = {next = 0x7f8401f78e20, table = 0x7f83fe831710, table_private = 0x0}
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
(gdb) print *Table_List.table
Cannot access memory at address 0x7f83fe831710
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
(gdb)  print *Table_List.next
$2 = {next = 0x7f8401f78350, table = 0x7f83fe084820, table_private = 0x0}
(gdb) print *Table_List.next.table
$3 = {msgs = 0x7f83fe084840, name = 0x7f83fde798f4 "secerrstrings", base = 0xffffe000, n_msgs = 0xb4}
(gdb) print *Table_List.next.next
$4 = {next = 0x0, table = 0x7f83fda51600, table_private = 0x0}
(gdb) print *Table_List.next.next.table
$5 = {msgs = 0x7f83fda51620, name = 0x7f83fd845bc0 "prerr", base = 0xffffe890, n_msgs = 0x4d}

I.e. there's inaccessible memory, too, although this time it's *Table_List.table
instead of *Table_List.next.table.

Am I right that this should not happen?

Can you reproduce this test? (See comment #19 for instructions)

Comment 33 Rich Megginson 2015-09-23 20:01:36 UTC
(In reply to Karsten Weiss from comment #32)
> I've repeated my test in Comment #19 with this autofs version
> 
> # rpm -qa|grep autofs
> autofs-debuginfo-5.0.5-114.bz1243525.3.el6.x86_64
> autofs-5.0.5-114.bz1243525.3.el6.x86_64
> 
> and set NSS_STRICT_NOFORK=DISABLED. Result:
> 
> (gdb) print *Table_List
> $1 = {next = 0x7f8401f78e20, table = 0x7f83fe831710, table_private = 0x0}
> !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> (gdb) print *Table_List.table
> Cannot access memory at address 0x7f83fe831710
> !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> (gdb)  print *Table_List.next
> $2 = {next = 0x7f8401f78350, table = 0x7f83fe084820, table_private = 0x0}
> (gdb) print *Table_List.next.table
> $3 = {msgs = 0x7f83fe084840, name = 0x7f83fde798f4 "secerrstrings", base =
> 0xffffe000, n_msgs = 0xb4}
> (gdb) print *Table_List.next.next
> $4 = {next = 0x0, table = 0x7f83fda51600, table_private = 0x0}
> (gdb) print *Table_List.next.next.table
> $5 = {msgs = 0x7f83fda51620, name = 0x7f83fd845bc0 "prerr", base =
> 0xffffe890, n_msgs = 0x4d}
> 
> I.e. there's inaccessible memory, too, although this time it's
> *Table_List.table
> instead of *Table_List.next.table.
> 
> Am I right that this should not happen?

You are right - this should not happen.
> 
> Can you reproduce this test? (See comment #19 for instructions)

(In reply to Karsten Weiss from comment #19)
> Ian, I've just examined the Table_List of the new
> autofs-5.0.5-114.bz1243525.3.el6.x86_64 right after initialization
> and noticed the inaccessible table entry again.
> 
> I.e. launch autofs, then attach gdb to the running process and do this:
> 
> (gdb) print *Table_List
> $1 = {next = 0x7f3c967a7490, table = 0x7f3c912c8710, table_private = 0x0}
> (gdb)  print *Table_List.table
> $2 = {msgs = 0x7f3c912c8740, name = 0x7f3c910bdb2a "sslerr", base = -12288,
> n_msgs = 134}
> (gdb)  print *Table_List.next
> $3 = {next = 0x7f3c96735bc0, table = 0x7f3c92b33710, table_private = 0x0}
> (gdb)  print *Table_List.next.table
> !!!!!!!!!!!!!!!!!!!!
> Cannot access memory at address 0x7f3c92b33710
> !!!!!!!!!!!!!!!!!!!!
> (gdb)  print *Table_List.next.next
> $4 = {next = 0x7f3c96735100, table = 0x7f3c92386820, table_private = 0x0}
> (gdb)  print *Table_List.next.next.table
> $5 = {msgs = 0x7f3c92386840, name = 0x7f3c9217b8f4 "secerrstrings", base =
> -8192, n_msgs = 180}
> (gdb)  print *Table_List.next.next.next
> $6 = {next = 0x0, table = 0x7f3c91d53600, table_private = 0x0}
> (gdb)  print *Table_List.next.next.next.table
> $7 = {msgs = 0x7f3c91d53620, name = 0x7f3c91b47bc0 "prerr", base = -6000,
> n_msgs = 77}
> (gdb)  print *Table_List.next.next.next.next
> Cannot access memory at address 0x0
> 
> I'm pretty sure this version would crash in PR_ErrorToString() again (as
> initially reported) and we're back to square one. Don't you think so?
> 
> Could you please try to reproduce this simple test by yourself?

In this case, is automount unloading and reloading the ldap module, at automount startup?  That is, you are not doing anything except starting automount?

Comment 34 Ian Kent 2015-09-23 23:51:24 UTC
(In reply to Karsten Weiss from comment #32)
> I've repeated my test in Comment #19 with this autofs version

Thanks for that.

> 
> # rpm -qa|grep autofs
> autofs-debuginfo-5.0.5-114.bz1243525.3.el6.x86_64
> autofs-5.0.5-114.bz1243525.3.el6.x86_64
> 
> and set NSS_STRICT_NOFORK=DISABLED. Result:

Just to be sure, how did you set this?

It should be sufficient to set it in /etc/sysconfig/autofs but
the export keyword cannot be added in that file.

So, to be sure, it would be best to add the define with the
export prefix just before the daemon is run in the init script.

Or did you export the variable and run automount from the
command line? That should be fine too.

> 
> (gdb) print *Table_List
> $1 = {next = 0x7f8401f78e20, table = 0x7f83fe831710, table_private = 0x0}
> !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> (gdb) print *Table_List.table
> Cannot access memory at address 0x7f83fe831710
> !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> (gdb)  print *Table_List.next
> $2 = {next = 0x7f8401f78350, table = 0x7f83fe084820, table_private = 0x0}
> (gdb) print *Table_List.next.table
> $3 = {msgs = 0x7f83fe084840, name = 0x7f83fde798f4 "secerrstrings", base =
> 0xffffe000, n_msgs = 0xb4}
> (gdb) print *Table_List.next.next
> $4 = {next = 0x0, table = 0x7f83fda51600, table_private = 0x0}
> (gdb) print *Table_List.next.next.table
> $5 = {msgs = 0x7f83fda51620, name = 0x7f83fd845bc0 "prerr", base =
> 0xffffe890, n_msgs = 0x4d}
> 
> I.e. there's inaccessible memory, too, although this time it's
> *Table_List.table
> instead of *Table_List.next.table.
> 
> Am I right that this should not happen?
> 
> Can you reproduce this test? (See comment #19 for instructions)

Your analysis is so clearly showing a problem that has occurred
at some unknown previous time that I hadn't attempted to reproduce
it.

Reproducing this does involve setting up an ldap server, generating
CA and certificate and adding test maps, etc. I'm writing a test to
do this now for a related bug (as support were unable to reproduce
the problem of the related bug) and I will use that to check your
findings as well.

TBH I'm still not sure what additional information I will get
from doing so since it's still not clear what's causing it.

This test is good feedback, to be sure, but I was actually asking
about the original rpm I provided where the library re-load was
commented out, i.e. our initial check to try and work out if it was
related to the close then open of lookup modules. Do you still have
that rpm and could you test with it again please?

Ian

Comment 35 Karsten Weiss 2015-09-24 13:44:01 UTC
Ian, I did what you wrote in comment 31 i.e. "adding the "NSS_STRICT_NOFORK=DISABLED" to /etc/sysconfig/autofs and restarting autofs". I.e. I did *not* export this variable myself.

> That is, you are not doing anything except starting automount?

Exactly, as described in comment 19: Launch autofs, then (immediately without
doing anything special) attach gdb to the running process, print the
Table_List (and follow its pointers), and look whether there are inaccessible
memory regions.

(Make sure to install the nspr-debuginfo package to be able to resolve the
symbols in gdb.)

> Reproducing this does involve setting up an ldap server,

I know. However, at this point I think it's crucial that a) you have a local test environment, too, and b) you are able to reproduce the issue independently.

> This test is good feedback, to be sure, but I was actually asking
> about the original rpm I provided where the library re-load was
> commented out, ie. our initial check to try and work out if it was
> related to the close to open of lookup modules. Do you still have
> that rpm and could you test with it again please?

Isn't the answer already in comment 3 (apart from NSS_STRICT_NOFORK=DISABLED)?

Comment 43 Ian Kent 2016-09-28 03:40:56 UTC
I believe this problem has been resolved in autofs revision 122.
Can you check if this is the case please?

Comment 44 Karsten Weiss 2016-10-04 13:30:58 UTC
I've tested autofs-5.0.5-123:

(gdb) print *Table_List
$1 = {next = 0x7f2d1d40f270, table = 0x7f2d184233d0, table_private = 0x0}
(gdb) print *Table_List.table
$2 = {msgs = 0x0, name = 0x0, base = 0, n_msgs = 0}
                  ??????????
"name" was a ptr to "sslerr" before but it is 0x0 now.
(gdb)  print *Table_List.next
$3 = {next = 0x7f2d1d40e7a0, table = 0x7f2d17c71820, table_private = 0x0}
(gdb) print *Table_List.next.table
$4 = {msgs = 0x7f2d17c71840, name = 0x7f2d17a66e14 "secerrstrings", base = -8192, n_msgs = 180}
(gdb)  print *Table_List.next.next
$5 = {next = 0x0, table = 0x7f2d1763e600, table_private = 0x0}
(gdb) print *Table_List.next.next.table
$6 = {msgs = 0x7f2d1763e620, name = 0x7f2d17432d60 "prerr", base = -6000, n_msgs = 77}

I.e. there's no longer an inaccessible memory region, but the first table consists of 0 bytes.

Comment 45 Ian Kent 2016-10-05 00:41:26 UTC
(In reply to Karsten Weiss from comment #44)
> I've tested autofs-5.0.5-123:
> 
> (gdb) print *Table_List
> $1 = {next = 0x7f2d1d40f270, table = 0x7f2d184233d0, table_private = 0x0}
> (gdb) print *Table_List.table
> $2 = {msgs = 0x0, name = 0x0, base = 0, n_msgs = 0}
>                   ??????????
> "name" was a ptr to "sslerr" before but it is 0x0 now.
> (gdb)  print *Table_List.next
> $3 = {next = 0x7f2d1d40e7a0, table = 0x7f2d17c71820, table_private = 0x0}
> (gdb) print *Table_List.next.table
> $4 = {msgs = 0x7f2d17c71840, name = 0x7f2d17a66e14 "secerrstrings", base =
> -8192, n_msgs = 180}
> (gdb)  print *Table_List.next.next
> $5 = {next = 0x0, table = 0x7f2d1763e600, table_private = 0x0}
> (gdb) print *Table_List.next.next.table
> $6 = {msgs = 0x7f2d1763e620, name = 0x7f2d17432d60 "prerr", base = -6000,
> n_msgs = 77}
> 
> I.e. there's no longer an inaccessible memory region, but the first table
> consists of 0 bytes.

I never got to the bottom of the recent problems related to this.

I believe they were due to OpenLDAP changing the libraries it uses.

There were a number of problems including one very similar to this
(also table corruption in this library).

After considering all the evidence I made assumptions about what
the problem was and after quite extensive changes to autofs reports
indicated the problems had been resolved.

I don't know if the empty table above is actually a problem but if
there's no observable failure then there's nothing to go on.

The other thing is that this library isn't used directly by autofs
at all; it's used by OpenLDAP, so it's not strictly an autofs problem.

So are you seeing an actual problem now?

Comment 46 Karsten Weiss 2016-10-05 11:23:58 UTC
I'm not able to reproduce the problem on 6.8 with autofs-5.0.5-123.x86_64 right now.

(On 7.2 using autofs-5.0.7-54.el7 it's still happening, though.)

Comment 47 Karsten Weiss 2016-10-05 11:31:57 UTC
See https://bugzilla.redhat.com/show_bug.cgi?id=1381924

Comment 48 Ian Kent 2016-10-06 02:40:23 UTC
(In reply to Karsten Weiss from comment #46)
> I'm not able to reproduce the problem on 6.8 with autofs-5.0.5-123.x86_64
> right now.
> 
> (On 7.2 using autofs-5.0.7-54.el7 it's still happening, though.)

Yes, it is.

The changes that went into RHEL-6.8 that I hope fixed this have
also gone into RHEL-7.3 and RHEL-7.3 is well into the QA process
now.

I can put the RHEL-7.3 package on people.redhat.com if you would
like to check if it helps with this problem before 7.3 is actually
released.

Comment 49 Jan Kurik 2017-12-06 11:47:24 UTC
Red Hat Enterprise Linux 6 is in the Production 3 Phase. During the Production 3 Phase, Critical impact Security Advisories (RHSAs) and selected Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available.

The official life cycle policy can be reviewed here:

http://redhat.com/rhel/lifecycle

This issue does not meet the inclusion criteria for the Production 3 Phase and will be marked as CLOSED/WONTFIX. If this remains a critical requirement, please contact Red Hat Customer Support to request a re-evaluation of the issue, citing a clear business justification. Note that a strong business justification will be required for re-evaluation. Red Hat Customer Support can be contacted via the Red Hat Customer Portal at the following URL:

https://access.redhat.com/

Comment 50 Guillaume Hilt 2018-08-29 08:18:37 UTC
We still have the issue with recent autofs packages on RHEL 6 on our servers:
- kernel 2.6.32-754.3.5.el6.x86_64
- autofs 5.0.5-123.el6_8.x86_64 or 5.0.5-133.el6_9.x86_64

A crash dump is attached.

Comment 51 Guillaume Hilt 2018-08-29 08:19:36 UTC
Created attachment 1479406 [details]
crashdump for autofs-5.0.5-123.el6_8.x86_64 crash

Comment 52 Jean-Sébastien 2018-08-29 08:23:12 UTC
Hello,

On our side too, we continue to see this bug with the error:

Jul 20 16:03:09 uljvpdrr01 kernel: automount[14859]: segfault at 7f5ad1915fb0 ip 00007f5ad090463c sp 00007f5acf07d1b0 error 4 in libnspr4.so[7f5ad08ea000+39000]

Can you investigate this?
Thanks,
Jean-Sébastien BEVILACQUA

Comment 53 Ian Kent 2018-09-20 02:03:17 UTC
(In reply to Guillaume Hilt from comment #51)
> Created attachment 1479406 [details]
> crashdump for autofs-5.0.5-123.el6_8.x86_64 crash

I finally got some time to have a look at this dump.

Unfortunately the autofs package in use has been built against
a different glibc so I can't get the symbols I need to look
closely at it.

But I don't think that matters because, from what I can see,
the fault is the same as described in comment #0.

I made a lot of changes to autofs to try and work around this
bug in nspr but clearly the assumptions made about when and
how it happens must be wrong.

On the face of it the nspr library is lacking locking around
error table list modifications. There must be a reason this
is not done but I don't know what it may be.

I'm not willing to make further changes to autofs; I've made
far too many that I shouldn't have needed to make already.

I'm thinking a change as minimal as possible to add locking
around the error table list handling would be worth trying,
and we'll see what our options are if that does help, given
where RHEL-6 is in its life cycle.
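
Something along these lines (a sketch only, using nspr's public lock
API; the lock creation is elided and the names below are taken from
the gdb sessions above, not from an actual patch):

  /* Sketch only: serialize installs into the global error table list. */
  static PRLock *table_lock;   /* assume created once during nspr init */

  PRStatus PR_ErrorInstallTable(const struct PRErrorTable *table)
  {
      struct PRErrorTableList *new_et;

      new_et = PR_Malloc(sizeof(struct PRErrorTableList));
      if (new_et == NULL)
          return PR_FAILURE;
      new_et->table = table;
      new_et->table_private = NULL;

      PR_Lock(table_lock);      /* added: protect the list head update */
      new_et->next = Table_List;
      Table_List = new_et;
      PR_Unlock(table_lock);

      return PR_SUCCESS;
  }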

Ian

Comment 54 Ian Kent 2018-10-04 09:52:14 UTC
Could someone who is seeing this problem try this test build
of nspr to see if it helps with the problem please.

It's located at:
http://people.redhat.com/~ikent/nspr-4.19.0-1.bz1197622.el6/

Comment 55 Guillaume Hilt 2018-10-04 20:24:33 UTC
I'm trying it tomorrow, thanks Ian.

Comment 56 Guillaume Hilt 2018-10-05 10:00:51 UTC
Package installed.
As it crashes only once a week, please be patient.

Comment 57 Guillaume Hilt 2018-10-11 08:20:55 UTC
autofs hasn't crashed on the server I installed your package on; looks like you fixed it :) Nice job!

Regarding the support ticket 02188519 I opened with Red Hat, they said they won't fix it for RHEL 6, but if your fix works, they may do something.

Comment 58 Jean-Sébastien 2018-10-11 08:37:06 UTC
Thanks Ian Kent for your work.

Sincerely,
Jean-Sebastien BEVILACQUA

Comment 59 Ian Kent 2018-10-11 09:06:12 UTC
(In reply to Guillaume Hilt from comment #57)
> autofs hasn't crashed on the server i installed your package, look like you
> fixed it :) Nice job !

Originally I thought this problem was related to other similar
problems I tried to work around within autofs, but clearly it
isn't, and equally clearly (although we should wait a little
longer to be sure) not using mutual exclusion for the error
table list handling is not ok.

Perhaps the library authors think this is the responsibility
of the library users, OpenLDAP in this case.

> 
> Regarding the support ticket 02188519 i opened at rhel, they said they won't
> fix it for RHEL6, but if your fix works, they may do something.

Yes, it is hard to get changes into RHEL-6 now, but let's see
how we go.

The reality is that what's needed is a properly approved bug and
a package (nspr) maintainer happy with the change and willing to
commit it. The bug approvals will be the difficult part of this.

Anyway, first we need to consult the nspr package maintainer
and I think see if we can get someone in support interested.

I'll see what I can do while we give the package a bit longer
to be sure this fixes the problem.

Comment 60 Ian Kent 2018-10-11 10:14:39 UTC
Created attachment 1492826 [details]
Patch - nspr - add locking for error table list

Comment 61 Guillaume Hilt 2018-10-14 13:08:26 UTC
Sadly, the second server on which I installed your fix (3 days ago) saw autofs crash this morning.
Here's the core dump.

Comment 62 Guillaume Hilt 2018-10-14 13:10:34 UTC
Created attachment 1493703 [details]
crash dump with autofs proposed fix

Comment 63 Ian Kent 2018-10-15 11:39:53 UTC
(In reply to Guillaume Hilt from comment #62)
> Created attachment 1493703 [details]
> crash dump with autofs proposed fix

I have had a look at the core.

It appears the packages the core requires don't quite match
the packages listed in the installed-rpms of the included
sosreport; I don't know why.

I did however get most of the dependent debuginfo packages
installed but, because of the mismatch, I'm not sure how
much I can trust what I saw looking at the core.

On the face of it, it looks like the error table list is
intact but the tables that the list entries point to are
not accessible (it looks like they have gone away), which
naturally causes a SEGV.

So that much makes sense.

At this point the problem here is looking a lot like the
similar related problem, which I mentioned before, that
was meant to be resolved.

In that bug SEGVs were being caused by memory going away
for no apparent reason, like it appears to be doing here.

It was thought this might be due to interactions between
the nss library and autofs closing and then re-opening
some of its libraries. That led to me changing autofs to
not close and re-open any of its libraries.

The other possibility was that the nss library was
re-initializing due to fork(2)s being done (which
apparently it does). And autofs will fork mount(8)
to perform mounts so that happens quite a bit.

In order to check if this was causing a problem we asked
people to set NSS_STRICT_NOFORK=DISABLED in the autofs
environment before starting the daemon.

Looking at the bug I can't see if we got any clear feedback
on the result of that.

At this point though I think this is all we have to check.

So can we try a change like this:

      function start() {
        export NSS_STRICT_NOFORK=DISABLED <-------------------- Add this here

        # Make sure autofs4 module is loaded
        if ! grep -q autofs /proc/filesystems

to check if it is the nss library re-initialisation.

Ian

Comment 64 Guillaume Hilt 2018-10-15 11:51:30 UTC
Thanks Ian, we can try that, no problem.
Have you updated the rpm posted earlier?

Comment 65 Ian Kent 2018-10-15 12:22:03 UTC
(In reply to Guillaume Hilt from comment #64)
> Thanks ian, we can try that, no problem.
> Do you have updated the rpm posted earlier ?

The nspr rpm hasn't changed, it's still available on
people.redhat.com.

But now it seems that it probably isn't doing anything
useful.

I looked at the backtrace posted in the case and I see
it shows that the crash occurs during connection
initialisation. I hadn't looked at the full backtrace
recently but seeing that I remember now why I assumed
the problem occurred at connection initialisation.

And because of that one of the other changes I had done
in autofs to try and resolve this was to expand the
locking done by the autofs ldap module around connection
initialisation.

So the locking change in the nspr package might not be
doing any additional mutual exclusion.

I spent so long trying to work around these problems from
autofs seeing this come up again is disturbing to me, I
really would like to resolve it but it's proving very
difficult.

The bug in which the discussion of this occurred is very
long, I'll need to work through it again to refresh my
memory.

Comment 66 Daiki Ueno 2018-10-17 09:01:45 UTC
Comment on attachment 1492826 [details]
Patch - nspr - add locking for error table list

Thank you for the patch, but I think it is a wrong approach, as the documentation explicitly says those functions are not thread safe:
https://searchfox.org/mozilla-central/source/nsprpub/pr/include/prerror.h#268

Comment 67 Ian Kent 2018-10-17 10:48:29 UTC
(In reply to Daiki Ueno from comment #66)
> Comment on attachment 1492826 [details]
> Patch - nspr - add locking for error table list
> 
> Thank you for the patch, but I think it is a wrong approach, as the
> documentation explicitly says those functions are not thread safe:
> https://searchfox.org/mozilla-central/source/nsprpub/pr/include/prerror.h#268

Well, yes, there are a couple of places that are still open to
races after adding the locking and the list doesn't "own" the
tables that are installed.

But the point of the crash isn't at those locations and the
patch doesn't help with the problem so something else is
causing this.

The problem could still be list corruption but it looks more
like the installed table memory has gone away for some unknown
reason, perhaps because of nss library re-initialization on
fork(2).

Comment 68 Karsten Weiss 2018-10-31 21:53:17 UTC
I know that my initial description is rather verbose (sorry for that!), however, please notice that (in the 2nd half) it shows how to trigger the problem in a *new* automount session. Start reading at "Let's start a new automount session" and follow the steps and you'll be able to trigger the inaccessible memory situation in gdb immediately.

IMHO there's really no need to test this in production and waste/wait a couple of days each time.

Comment 69 Ian Kent 2018-11-01 03:15:57 UTC
(In reply to Karsten Weiss from comment #68)

Hi Karsten,

> I know that my initial description is rather verbose (sorry for that!),
> however, please notice that (in the 2nd half) it shows how to trigger the
> problem in a *new* automount session. Start reading at "Let's start a new
> automount session" and follow the steps and you'll be able to trigger the
> inaccessible memory situation in gdb immediately.

Don't be sorry, your description was very much appreciated.

The thing is your description was comprehensive enough to make
me think that using your procedure to reproduce it wasn't going
to give me the information I would need to locate the problem.

Going through a procedure that shows everything is ok at one
point but isn't ok at a later point doesn't help to narrow
the problem location enough to be able to spot the problem
using code inspection.

Obviously it's good to have for verifying any possible fix,
but finding what needs to be fixed is not so straightforward.

OTOH we know (found out through investigation of what I
thought were similar problems) that nss triggers a library
re-initialization at fork(2) which could possibly lead to
data previously pointed to going away or moving.

That's the point of checking if NSS_STRICT_NOFORK=DISABLED
makes a difference here.

I'll try and put together an environment to check this
using your procedure.

Comment 70 Guillaume Hilt 2018-12-18 10:29:30 UTC
Hi Ian,

Any news on this issue? (Maybe in January.)

Regards,

Guillaume

Comment 72 Ian Kent 2019-01-29 02:21:37 UTC
(In reply to Guillaume Hilt from comment #70)
> Hi Ian,
> 
> Any news for this issue ? (maybe in january)

No news, but it's still January ;)

I still don't know for sure why this happens, I only have assumptions
based on the history detailed above.

So based on that let's give this rpm build a try.
It can be found at:
http://people.redhat.com/~ikent/autofs-5.0.5-140.nss.el6/

If there are any dependency problems installing it let me know and
I'll produce a build based on an earlier RHEL-6 release.

Comment 73 Guillaume Hilt 2019-01-29 07:58:14 UTC
Thanks Ian, I'll install it this morning.
The crash occurs at least once a week, so I'll get back to you in 7 days max.

Comment 74 Ian Kent 2019-01-31 01:42:11 UTC
(In reply to Guillaume Hilt from comment #73)
> Thanks Ian, i'll install it this morning.
> Crash occurs at least once a week, so i'll get back to you in 7 days max.

Sorry about this, but I've noticed that the build I posted had the
patch to pin the libraries commented out, so it isn't included in
the build.

I've rebuilt this with the patch included and uploaded it to the above
location (same revision).

Could you update the installed rpm with this one please.

Comment 75 Guillaume Hilt 2019-02-01 13:04:18 UTC
Packages updated on 2 servers; they will reboot this weekend, and we'll see how it goes.

Comment 76 Guillaume Hilt 2019-02-08 09:51:51 UTC
No crash for now on the 2 servers I used.
Good news so far; I'll wait another week before pouring the champagne.

Comment 77 Ian Kent 2019-02-08 22:44:21 UTC
(In reply to Guillaume Hilt from comment #76)
> No crash for now, on the 2 servers i used.
> Good news for now, i'll wait another week before pouring out champagne.

We'll need to do one more round of testing if this continues to
be ok.

I used a rather broad-brush approach to the libraries I pinned,
so I need to remove the libraries that I think don't actually
need to be pinned.

So a re-test will be needed.

Comment 78 Guillaume Hilt 2019-02-14 13:18:12 UTC
Ian, I think it is now OK with these packages. You can try removing some of the libraries you added.

Comment 79 Ian Kent 2019-02-18 00:54:03 UTC
I have reduced the number of pinned libraries to just the two
that should be needed to resolve this. There might be one other
(that does the same sort of thing) but the bug history doesn't
show this library is part of this problem.

So based on that please give this rpm build a try.
It can be found at:
http://people.redhat.com/~ikent/autofs-5.0.5-140.nss.2.el6/

Ian

Comment 80 Ian Kent 2019-02-18 00:55:35 UTC
Created attachment 1535788 [details]
Patch - nss workaround

Comment 81 Guillaume Hilt 2019-02-18 08:50:54 UTC
Thanks Ian!
As usual, let's wait at least a week to see if it works.

Comment 82 Guillaume Hilt 2019-03-01 10:44:47 UTC
Hello Ian,

No crash on both servers where I installed your 2 packages.
I guess we can say it's fixed.

Do you need to compile another package, or can I deploy the last one on all our RHEL 6 servers?

Thanks again!

Comment 83 Guillaume Hilt 2019-03-01 10:45:30 UTC
I guess the autofs-debuginfo package isn't needed anymore?

Comment 84 Ian Kent 2019-03-01 11:02:58 UTC
(In reply to Guillaume Hilt from comment #82)
> Hello Ian,
> 
> No crash on both servers where i installed your 2 packages.
> I guess we can say it's fixed.
> 
> Do you need to compile another package or can i deploy the last one on all
> our rhel 6 ?

Sadly it's not as simple as that.

The package I provided was not officially signed because it didn't go through
the QA process, and the revision isn't what's required (as it was for testing).

Now we need to try and get this approved so it can go through the QA and
release process.

Comment 98 errata-xmlrpc 2019-04-09 13:35:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0718

