Note: I believe this to be related to or a continuation of bug 488014 Since upgrading from 5.1 -> 5.2, we immediately began experiencing issues on our LDAP enabled systems. (nsswitch.conf: ldap files) Bash and scripting was broken, plus a range of other symptoms. This turned out to be bug 488014. We've since implemented the fix in that bug's upstreamed patch, and this solved the immediate symptom, but now we can't shutdown or reboot without infinite segfaulting on shutdown. Also, shutdown no longer posts a system message and kills active ssh connections, they just stay open and hang. See the following logs for both: Shutdown with 253-12: --------------------- INIT: Sending processes the TERM signal INIT: Hangup Shutting down smartd: [ OK ] Stopping HAL daemon: [ OK ] Stopping anacron: [ OK ] Stopping atd: [ OK ] Stopping cups: [ OK ] Shutting down console mouse services: [ OK ] Stopping sshd: [ OK ] Stopping xinetd: [ OK ] Stopping acpi daemon: [ OK ] Stopping crond: [ OK ] Stopping ipmi drivers: [ OK ] Shutting down ntpd: [ OK ] Stopping system message bus: [ OK ] Stopping RPC idmapd: [ OK ] Stopping NFS statd: [ OK ] Stopping portmap: [ OK ] Stopping auditd: [ OK ] audit: *NO* daemon at audit_pid=2583 audit: audit_lost=1 audit_rate_limit=0 audit_backlog_limit=320 audit: auditd dissapeared audit(1215638796.166:479): auid=101268 uid=0 gid=0 ses=1 pid=3528 comm="shutdown" sig=11 audit(1215638796.356:480): auid=101268 uid=0 gid=0 ses=1 pid=3530 comm="shutdown" sig=11 audit(1215638796.371:481): auid=101268 uid=0 gid=0 ses=1 pid=3517 comm="shutdown" sig=11 audit(1215638796.371:482): auid=101268 uid=0 gid=0 ses=1 pid=3521 comm="shutdown" sig=11 audit(1215638796.372:483): auid=101268 uid=0 gid=0 ses=1 pid=3520 comm="shutdown" sig=11 audit(1215638796.372:484): auid=101268 uid=0 gid=0 ses=1 pid=3516 comm="shutdown" sig=11 audit(1215638796.518:485): auid=101268 uid=0 gid=0 ses=1 pid=3526 comm="shutdown" sig=11 audit(1215638796.518:486): auid=101268 uid=0 gid=0 ses=1 pid=3519 comm="shutdown" sig=11 Stopping PC/SC smart card daemon (pcscd): [ OK ] Shutting down kernel logger: shutdown[3637]: segfault at 0000000000000003 rip 00002b375f4fe330 rsp 00007fff4b77ec58 error 4 printk: 345 messages suppressed. shutdown[3647]: segfault at 0000000000000003 rip 00002b375f4fe330 rsp 00007fff4b77b738 error 4 audit(1215638805.754:602): auid=101268 uid=0 gid=0 ses=1 pid=3637 comm="shutdown" sig=11 shutdown[3648]: segfault at 0000000000000003 rip 00002b375f4fe330 rsp 00007fff4b77b1e8 error 4 [ OK ] Shutting down system logger: printk: 6 messages suppressed. audit(1215638806.382:605): user pid=3015 uid=0 auid=101268 msg='op=PAM:session_close acct=dhalik exe="/usr/sbin/sshd" (hostname=gunslinger.rutgers.edu, addr=192.168.16.59, terminal=ssh res=success)' shutdown[3646]: segfault at 0000000000000003 rip 00002b375f4fe330 rsp 00007fff4b77bc88 error 4 shutdown[3659]: segfault at 0000000000000003 rip 00002b375f4fe330 rsp 00007fff4b777778 error 4 shutdown[3633]: segfault at 0000000000000003 rip 00002b375f4fe330 rsp 00007fff4b780198 error 4 shutdown[10879]: segfault at 0000000000000003 rip 00002b375f4fe330 rsp 00007fff4ae2dca8 error 4 (these go on continually forever) --------------------------------- The terminal difference... Normal: ------- [root@test1 ~]# reboot Broadcast message from root (pts/0) (Wed Jul 9 16:35:47 2008): The system is going down for reboot NOW! [root@test1 ~]# Connection to 192.168.x.x closed by remote host. With nss_ldap 253-5: -------------------- [root@test1 ~]# reboot [root@test1 ~]# (dead window) ------------------------ The only way I have been able to correct these issues has been to downgrade to 5.1's nss_ldap-253-5. Rev 12 has been completely unstable. Sufficed to say, 253-12 has been wrecking havoc on all our machines with ldap. While downgrading to 253-5 works fine, I find it unacceptable to have to downgrade to a 5.1 released package as a solution.
Note: I miss typed the bug number. It should have been bug 448014
I should also mention that /var/log/audit/audit.log was full of these from the event: type=ANOM_ABEND msg=audit(1215629413.998:96): auid=101268 uid=0 gid=0 ses=1 pid=7743 comm="shutdown" sig=11 type=ANOM_ABEND msg=audit(1215629414.003:97): auid=101268 uid=0 gid=0 ses=1 pid=7702 comm="shutdown" sig=11 type=ANOM_ABEND msg=audit(1215629414.004:98): auid=101268 uid=0 gid=0 ses=1 pid=7747 comm="shutdown" sig=11 type=ANOM_ABEND msg=audit(1215629414.008:99): auid=101268 uid=0 gid=0 ses=1 pid=7746 comm="shutdown" sig=11
We're hitting this too. For added fun, switch your shell to tcsh -- any command you try to run will fail with Broken Pipe.
Matthew, is the problem you're seeing the same as the one in bug #448014? In particular, does the patch attached there make this problem stop happening with tcsh? David, is your directory server's address information stored in the directory? If so, is it available via any other protocols? Can you attach your /etc/nsswitch.conf and /etc/hosts files, along with the output of "getent -s 'files dns' hosts [nameofserver]"? I'm having trouble getting my 5.2 system to exhibit the behavior you're describing.
Our ldap servers are all specified in dns, but as an added measure we use /etc/hosts to narrow down the correct ldap server per client. This is so that we may ensure the correct ldap server is contacted based off of which network the client is in. Because of this, I don't believe we rely on the directory server's address info being listed in the directory. For security I'm not going to post our hosts file, but I did test it and I can assure you that the ldap entry there matches what is returned by your getent command, so the host is always aware of the ldap server regardless of what is in the ldap directory itself. nsswitch.conf: passwd: ldap files shadow: ldap [NOTFOUND=return] files group: files ldap #hosts: db files nisplus nis dns hosts: files dns # Example - obey only what nisplus tells us... #services: nisplus [NOTFOUND=return] files #networks: nisplus [NOTFOUND=return] files #protocols: nisplus [NOTFOUND=return] files #rpc: nisplus [NOTFOUND=return] files #ethers: nisplus [NOTFOUND=return] files #netmasks: nisplus [NOTFOUND=return] files bootparams: nisplus [NOTFOUND=return] files ethers: files netmasks: files networks: files protocols: files rpc: files services: files netgroup: files ldap publickey: nisplus automount: files aliases: files nisplus
Hmm, why are you using "ldap files" rather than "files ldap" as the source for user and group information? That explicitly forces processes to consult the directory server before checking local files for information about users (including system users, and that's what worries me) which will probably cause problems after the network interface is taken down (the change has caused my 5.2 box to get stuck, but then it uses dhcp, too, so YMMV).
Yep, it's admittedly crazy, but has been the way we're been running for some time. We run an ldap user database of about 60,000 across a whole university, which changes on a daily basis. In order to ensure that changes in student accounting results in accurate info everywhere, we force ldap -> files. files -> ldap results in too many issues with local account info not being maintained, and security problems. When the interface is down, that is a problem... but if that does in fact happen, the service is down anyways and we have bigger issues. The segfault never happened with any prior versions of nss_ldap. As soon as the upgrade happened, those symptoms were worse than any ldap problems by a long shot. By the way, some tweaks such as the following will ensure that the box does not get stuck. LDAP will always time out eventually, this speeds the process up. ldap.conf: # Timeout adjustments so LDAP fails faster when networking is not present, i.e. during boot nss_reconnect_tries 1 nss_reconnect_sleeptime 1 nss_reconnect_maxsleeptime 8 nss_reconnect_maxconntries 2
Okay, with the patch for #448014 applied, setting shorter timeouts does seem to let the system boot correctly (though with timeouts from udev), but the only errors I'm getting when I then go to reboot the box are from setroubleshoot. Without the patch, I don't actually get as far as you do.
I just realized that since we run with cores unlimited, I have a couple thousand shutdown core dumps. Unfortunately, there aren't any debug symbols, so this won't be of much use unless some more of the libraries have them. I'll try and rebuild shutdown with them. Just as a start though, here is the output: eading symbols from /lib64/libc.so.6...(no debugging symbols found)...done. Loaded symbols for /lib64/libc.so.6 Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/ld-linux-x86-64.so.2 Reading symbols from /lib64/libnss_ldap.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/libnss_ldap.so.2 Reading symbols from /lib64/libcom_err.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/libcom_err.so.2 Reading symbols from /lib64/libkeyutils.so.1... (no debugging symbols found)...done. Loaded symbols for /lib64/libkeyutils.so.1 Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/libdl.so.2 Reading symbols from /lib64/libselinux.so.1...(no debugging symbols found)...done. Loaded symbols for /lib64/libselinux.so.1 Reading symbols from /lib64/libresolv.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/libresolv.so.2 Reading symbols from /lib64/libsepol.so.1... (no debugging symbols found)...done. Loaded symbols for /lib64/libsepol.so.1 Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/libnss_files.so.2 Core was generated by `shutdown -h 0 w'. Program terminated with signal 11, Segmentation fault. #0 0x00002b375f4fe330 in ?? () (gdb) where #0 0x00002b375f4fe330 in ?? () #1 0x00000000004027c5 in sync@plt () #2 0x0000000000402aca in shutdown () #3 0x00002b375f8509c0 in _nss_ldap_mergeconfigfromdns () from /lib64/libnss_ldap.so.2 #4 0x00002b375f850744 in _nss_ldap_mergeconfigfromdns () from /lib64/libnss_ldap.so.2 #5 0x00002b375f8311d8 in _nss_ldap_mergeconfigfromdns () from /lib64/libnss_ldap.so.2 #6 0x0000000000000000 in ?? ()
It's amazing what you can find with a bit of searching ;) I just learned how to use the debuginfo packages, and the core dump looks a bit better now. Note that all the core dumps have the same segfault: (gdb) where #0 0x00002b72d985f330 in ?? () #1 0x00000000004027c5 in init_setenv (name=0x4041c5 "INIT_HALT", value=0x3 <Address 0x3 out of bounds>) at shutdown.c:142 #2 0x0000000000402aca in shutdown (halttype=0x3 <Address 0x3 out of bounds>) at shutdown.c:396 #3 0x00002b72d9bb19c0 in _nss_ldap_mergeconfigfromdns () from /lib64/libnss_ldap.so.2 #4 0x00002b72d9bb12eb in _nss_ldap_mergeconfigfromdns () from /lib64/libnss_ldap.so.2 #5 0x00002b72d9bb1744 in _nss_ldap_mergeconfigfromdns () from /lib64/libnss_ldap.so.2 #6 0x00002b72d9b921d8 in _nss_ldap_mergeconfigfromdns () from /lib64/libnss_ldap.so.2 #7 0x00002b72d9b811ef in _nss_ldap_test_initgroups_ignoreuser () from /lib64/libnss_ldap.so.2 #8 0x00002b72d9b8425a in _nss_ldap_leave () from /lib64/libnss_ldap.so.2 #9 0x00002b72d987f5c3 in ?? () #10 0x00002b72d9b37148 in ?? () #11 0x00007fffd13fd800 in ?? () #12 0x00007fffd13fd800 in ?? () #13 0x00002b72d987f43a in ?? () #14 0x00002b72d9b37178 in ?? () #15 0x0000000000000000 in ?? ()
Hmm, do you have the parts of ldap.conf which deal with how to connect to the server (absent any login information)? I can get something like this to happen with a combination of an ldaps:// "uri" setting and "ssl off", which makes me think that the fix to recognize the "port" option might have something to do with it.
I'm emailing you our ldap.conf minus the commented out bits.
Okay, I couldn't get the same type of error by switching from ldaps to ldap-with-starttls, but something odd is definitely happening when shutdown hits nss_ldap's atfork handler. The handler wasn't always being correctly installed before, which I think explains why this is cropping up now.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Turns out that shutdown's problem is that it defines a function named shutdown() (which, among other things, calls fork() to warn users that things are about to happen), and when nss_ldap's child atfork handler goes to close the connection to the server, it calls a libldap function which in turn calls shutdown() (which is supposed to close the socket connection), and ends up calling the locally-defined shutdown() instead, causing the child to immediately fork a new child. And the process repeats itself.
That would explain the thousands of core dumps and infinite segfaulting I was seeing on shutdown. :) Thanks for the hard work Nalin, we appreciate it.
Building as 2.86-15.el5.
Sorry for the long-delayed response -- I missed your question earlier, Nalin. Yes, the patch in the other bug fixed the tcsh symptom.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2009-0149.html