Bug 454675

Summary:	nss_ldap 253-12 causing shutdown segfaults and issues
Product:	Red Hat Enterprise Linux 5	Reporter:	Englestrife <auralvance>
Component:	SysVinit	Assignee:	Bill Nottingham <notting>
Status:	CLOSED ERRATA	QA Contact:
Severity:	urgent	Docs Contact:
Priority:	urgent
Version:	5.2	CC:	dmair, jplans, mattdm, nalin, notting, rvokal, sputhenp
Target Milestone:	rc	Keywords:	ZStream
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2009-01-20 21:23:31 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	466889

Description Englestrife 2008-07-09 17:34:55 UTC

Note: I believe this to be related to or a continuation of bug 488014

Since upgrading from 5.1 -> 5.2, we immediately began experiencing issues on our
LDAP enabled systems. (nsswitch.conf: ldap files) Bash and scripting was broken,
plus a range of other symptoms. This turned out to be bug 488014. We've since
implemented the fix in that bug's upstreamed patch, and this solved the
immediate symptom, but now we can't shutdown or reboot without infinite
segfaulting on shutdown. Also, shutdown no longer posts a system message and
kills active ssh connections, they just stay open and hang. See the following
logs for both:

Shutdown with 253-12:
---------------------
INIT: Sending processes the TERM signal
INIT: Hangup
Shutting down smartd: [  OK  ]
Stopping HAL daemon: [  OK  ]
Stopping anacron: [  OK  ]
Stopping atd: [  OK  ]
Stopping cups: [  OK  ]
Shutting down console mouse services: [  OK  ]
Stopping sshd: [  OK  ]
Stopping xinetd: [  OK  ]
Stopping acpi daemon: [  OK  ]
Stopping crond: [  OK  ]
Stopping ipmi drivers: [  OK  ]
Shutting down ntpd: [  OK  ]
Stopping system message bus: [  OK  ]
Stopping RPC idmapd: [  OK  ]
Stopping NFS statd: [  OK  ]
Stopping portmap: [  OK  ]
Stopping auditd: [  OK  ]
audit: *NO* daemon at audit_pid=2583
audit: audit_lost=1 audit_rate_limit=0 audit_backlog_limit=320
audit: auditd dissapeared

audit(1215638796.166:479): auid=101268 uid=0 gid=0 ses=1 pid=3528
comm="shutdown" sig=11
audit(1215638796.356:480): auid=101268 uid=0 gid=0 ses=1 pid=3530
comm="shutdown" sig=11
audit(1215638796.371:481): auid=101268 uid=0 gid=0 ses=1 pid=3517
comm="shutdown" sig=11
audit(1215638796.371:482): auid=101268 uid=0 gid=0 ses=1 pid=3521
comm="shutdown" sig=11
audit(1215638796.372:483): auid=101268 uid=0 gid=0 ses=1 pid=3520
comm="shutdown" sig=11
audit(1215638796.372:484): auid=101268 uid=0 gid=0 ses=1 pid=3516
comm="shutdown" sig=11
audit(1215638796.518:485): auid=101268 uid=0 gid=0 ses=1 pid=3526
comm="shutdown" sig=11
audit(1215638796.518:486): auid=101268 uid=0 gid=0 ses=1 pid=3519
comm="shutdown" sig=11
Stopping PC/SC smart card daemon (pcscd): [  OK  ]
Shutting down kernel logger: shutdown[3637]: segfault at 0000000000000003 rip
00002b375f4fe330 rsp 00007fff4b77ec58 error 4
printk: 345 messages suppressed.
shutdown[3647]: segfault at 0000000000000003 rip 00002b375f4fe330 rsp
00007fff4b77b738 error 4
audit(1215638805.754:602): auid=101268 uid=0 gid=0 ses=1 pid=3637
comm="shutdown" sig=11
shutdown[3648]: segfault at 0000000000000003 rip 00002b375f4fe330 rsp
00007fff4b77b1e8 error 4
[  OK  ]
Shutting down system logger: printk: 6 messages suppressed.
audit(1215638806.382:605): user pid=3015 uid=0 auid=101268
msg='op=PAM:session_close acct=dhalik exe="/usr/sbin/sshd"
(hostname=gunslinger.rutgers.edu, addr=192.168.16.59, terminal=ssh res=success)'
shutdown[3646]: segfault at 0000000000000003 rip 00002b375f4fe330 rsp
00007fff4b77bc88 error 4
shutdown[3659]: segfault at 0000000000000003 rip 00002b375f4fe330 rsp
00007fff4b777778 error 4
shutdown[3633]: segfault at 0000000000000003 rip 00002b375f4fe330 rsp
00007fff4b780198 error 4
shutdown[10879]: segfault at 0000000000000003 rip 00002b375f4fe330 rsp
00007fff4ae2dca8 error 4

(these go on continually forever)

---------------------------------

The terminal difference...

Normal:
-------

[root@test1 ~]# reboot

Broadcast message from root (pts/0) (Wed Jul  9 16:35:47 2008):

The system is going down for reboot NOW!
[root@test1 ~]# Connection to 192.168.x.x closed by remote host.

With nss_ldap 253-5:
--------------------

[root@test1 ~]# reboot
[root@test1 ~]#
(dead window)

------------------------

The only way I have been able to correct these issues has been to downgrade to
5.1's nss_ldap-253-5. Rev 12 has been completely unstable.

Sufficed to say, 253-12 has been wrecking havoc on all our machines with ldap.
While downgrading to 253-5 works fine, I find it unacceptable to have to
downgrade to a 5.1 released package as a solution.

Comment 1 Englestrife 2008-07-09 17:36:19 UTC

Note: I miss typed the bug number. It should have been bug 448014

Comment 2 Englestrife 2008-07-09 20:12:14 UTC

I should also mention that /var/log/audit/audit.log was full of these from the
event:

type=ANOM_ABEND msg=audit(1215629413.998:96): auid=101268 uid=0 gid=0 ses=1
pid=7743 comm="shutdown" sig=11
type=ANOM_ABEND msg=audit(1215629414.003:97): auid=101268 uid=0 gid=0 ses=1
pid=7702 comm="shutdown" sig=11
type=ANOM_ABEND msg=audit(1215629414.004:98): auid=101268 uid=0 gid=0 ses=1
pid=7747 comm="shutdown" sig=11
type=ANOM_ABEND msg=audit(1215629414.008:99): auid=101268 uid=0 gid=0 ses=1
pid=7746 comm="shutdown" sig=11

Comment 3 Matthew Miller 2008-07-10 21:57:03 UTC

We're hitting this too. For added fun, switch your shell to tcsh -- any command
you try to run will fail with Broken Pipe.

Comment 4 Nalin Dahyabhai 2008-07-14 16:41:09 UTC

Matthew, is the problem you're seeing the same as the one in bug #448014?  In
particular, does the patch attached there make this problem stop happening with
tcsh?

David, is your directory server's address information stored in the directory? 
If so, is it available via any other protocols?  Can you attach your
/etc/nsswitch.conf and /etc/hosts files, along with the output of
"getent -s 'files dns' hosts [nameofserver]"?  I'm having trouble getting my 5.2
system to exhibit the behavior you're describing.

Comment 5 Englestrife 2008-07-14 17:27:10 UTC

Our ldap servers are all specified in dns, but as an added measure we use
/etc/hosts to narrow down the correct ldap server per client. This is so that we
may ensure the correct ldap server is contacted based off of which network the
client is in. Because of this, I don't believe we rely on the directory server's
address info being listed in the directory.

For security I'm not going to post our hosts file, but I did test it and I can
assure you that the ldap entry there matches what is returned by your getent
command, so the host is always aware of the ldap server regardless of what is in
the ldap directory itself. 

nsswitch.conf:

passwd:     ldap files
shadow:     ldap [NOTFOUND=return] files
group:      files ldap

#hosts:     db files nisplus nis dns
hosts:      files dns

# Example - obey only what nisplus tells us...
#services:   nisplus [NOTFOUND=return] files
#networks:   nisplus [NOTFOUND=return] files
#protocols:  nisplus [NOTFOUND=return] files
#rpc:        nisplus [NOTFOUND=return] files
#ethers:     nisplus [NOTFOUND=return] files
#netmasks:   nisplus [NOTFOUND=return] files     

bootparams: nisplus [NOTFOUND=return] files

ethers:     files
netmasks:   files
networks:   files
protocols:  files
rpc:        files
services:   files

netgroup:   files ldap

publickey:  nisplus

automount:  files
aliases:    files nisplus

Comment 6 Nalin Dahyabhai 2008-07-14 18:09:22 UTC

Hmm, why are you using "ldap files" rather than "files ldap" as the source for
user and group information?  That explicitly forces processes to consult the
directory server before checking local files for information about users
(including system users, and that's what worries me) which will probably cause
problems after the network interface is taken down (the change has caused my 5.2
box to get stuck, but then it uses dhcp, too, so YMMV).

Comment 7 Englestrife 2008-07-14 18:29:11 UTC

Yep, it's admittedly crazy, but has been the way we're been running for some
time. We run an ldap user database of about 60,000 across a whole university,
which changes on a daily basis. In order to ensure that changes in student
accounting results in accurate info everywhere, we force ldap -> files. files ->
ldap results in too many issues with local account info not being maintained,
and security problems.

When the interface is down, that is a problem... but if that does in fact
happen, the service is down anyways and we have bigger issues. The segfault
never happened with any prior versions of nss_ldap. As soon as the upgrade
happened, those symptoms were worse than any ldap problems by a long shot. By
the way, some tweaks such as the following will ensure that the box does not get
stuck. LDAP will always time out eventually, this speeds the process up.

ldap.conf:
# Timeout adjustments so LDAP fails faster when networking is not present, i.e.
during boot
nss_reconnect_tries 1
nss_reconnect_sleeptime 1
nss_reconnect_maxsleeptime 8
nss_reconnect_maxconntries 2

Comment 8 Nalin Dahyabhai 2008-07-14 20:38:51 UTC

Okay, with the patch for #448014 applied, setting shorter timeouts does seem to
let the system boot correctly (though with timeouts from udev), but the only
errors I'm getting when I then go to reboot the box are from setroubleshoot.

Without the patch, I don't actually get as far as you do.

Comment 9 Englestrife 2008-07-14 23:20:48 UTC

I just realized that since we run with cores unlimited, I have a couple thousand
shutdown core dumps. Unfortunately, there aren't any debug symbols, so this
won't be of much use unless some more of the libraries have them. I'll try and
rebuild shutdown with them. Just as a start though, here is the output:

eading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols
found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libnss_ldap.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_ldap.so.2
Reading symbols from /lib64/libcom_err.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libcom_err.so.2
Reading symbols from /lib64/libkeyutils.so.1...
(no debugging symbols found)...done.
Loaded symbols for /lib64/libkeyutils.so.1
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libselinux.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libselinux.so.1
Reading symbols from /lib64/libresolv.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libresolv.so.2
Reading symbols from /lib64/libsepol.so.1...
(no debugging symbols found)...done.
Loaded symbols for /lib64/libsepol.so.1
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
Core was generated by `shutdown -h 0 w'.
Program terminated with signal 11, Segmentation fault.
#0  0x00002b375f4fe330 in ?? ()
(gdb) where
#0  0x00002b375f4fe330 in ?? ()
#1  0x00000000004027c5 in sync@plt ()
#2  0x0000000000402aca in shutdown ()
#3  0x00002b375f8509c0 in _nss_ldap_mergeconfigfromdns () from
/lib64/libnss_ldap.so.2
#4  0x00002b375f850744 in _nss_ldap_mergeconfigfromdns () from
/lib64/libnss_ldap.so.2
#5  0x00002b375f8311d8 in _nss_ldap_mergeconfigfromdns () from
/lib64/libnss_ldap.so.2
#6  0x0000000000000000 in ?? ()

Comment 10 Englestrife 2008-07-14 23:48:17 UTC

It's amazing what you can find with a bit of searching ;) I just learned how to
use the debuginfo packages, and the core dump looks a bit better now. Note that
all the core dumps have the same segfault:

(gdb) where
#0  0x00002b72d985f330 in ?? ()
#1  0x00000000004027c5 in init_setenv (name=0x4041c5 "INIT_HALT", value=0x3
<Address 0x3 out of bounds>) at shutdown.c:142
#2  0x0000000000402aca in shutdown (halttype=0x3 <Address 0x3 out of bounds>) at
shutdown.c:396
#3  0x00002b72d9bb19c0 in _nss_ldap_mergeconfigfromdns () from
/lib64/libnss_ldap.so.2
#4  0x00002b72d9bb12eb in _nss_ldap_mergeconfigfromdns () from
/lib64/libnss_ldap.so.2
#5  0x00002b72d9bb1744 in _nss_ldap_mergeconfigfromdns () from
/lib64/libnss_ldap.so.2
#6  0x00002b72d9b921d8 in _nss_ldap_mergeconfigfromdns () from
/lib64/libnss_ldap.so.2
#7  0x00002b72d9b811ef in _nss_ldap_test_initgroups_ignoreuser () from
/lib64/libnss_ldap.so.2
#8  0x00002b72d9b8425a in _nss_ldap_leave () from /lib64/libnss_ldap.so.2
#9  0x00002b72d987f5c3 in ?? ()
#10 0x00002b72d9b37148 in ?? ()
#11 0x00007fffd13fd800 in ?? ()
#12 0x00007fffd13fd800 in ?? ()
#13 0x00002b72d987f43a in ?? ()
#14 0x00002b72d9b37178 in ?? ()
#15 0x0000000000000000 in ?? ()

Comment 11 Nalin Dahyabhai 2008-07-15 17:09:54 UTC

Hmm, do you have the parts of ldap.conf which deal with how to connect to the
server (absent any login information)?  I can get something like this to happen
with a combination of an ldaps:// "uri" setting and "ssl off", which makes me
think that the fix to recognize the "port" option might have something to do
with it.

Comment 12 Englestrife 2008-07-15 17:17:42 UTC

I'm emailing you our ldap.conf minus the commented out bits.

Comment 13 Nalin Dahyabhai 2008-07-15 22:33:33 UTC

Okay, I couldn't get the same type of error by switching from ldaps to
ldap-with-starttls, but something odd is definitely happening when shutdown hits
nss_ldap's atfork handler.  The handler wasn't always being correctly installed
before, which I think explains why this is cropping up now.

Comment 15 RHEL Program Management 2008-07-16 15:31:10 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 16 Nalin Dahyabhai 2008-07-16 20:07:31 UTC

Turns out that shutdown's problem is that it defines a function named shutdown()
(which, among other things, calls fork() to warn users that things are about to
happen), and when nss_ldap's child atfork handler goes to close the connection
to the server, it calls a libldap function which in turn calls shutdown() (which
is supposed to close the socket connection), and ends up calling the
locally-defined shutdown() instead, causing the child to immediately fork a new
child.  And the process repeats itself.

Comment 18 Englestrife 2008-07-16 20:21:56 UTC

That would explain the thousands of core dumps and infinite segfaulting I was
seeing on shutdown. :) Thanks for the hard work Nalin, we appreciate it.

Comment 21 Bill Nottingham 2008-09-17 20:51:02 UTC

Building as 2.86-15.el5.

Comment 23 Matthew Miller 2008-10-13 15:44:08 UTC

Sorry for the long-delayed response -- I missed your question earlier, Nalin. Yes, the patch in the other bug fixed the tcsh symptom.

Comment 27 errata-xmlrpc 2009-01-20 21:23:31 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0149.html