Bug 454675
Summary: | nss_ldap 253-12 causing shutdown segfaults and issues | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 5 | Reporter: | David Halik <auralvance> |
Component: | SysVinit | Assignee: | Bill Nottingham <notting> |
Status: | CLOSED ERRATA | QA Contact: | |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | ||
Version: | 5.2 | CC: | dmair, jplans, mattdm, nalin, notting, rvokal, sputhenp |
Target Milestone: | rc | Keywords: | ZStream |
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2009-01-20 21:23:31 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 466889 |
Description
David Halik
2008-07-09 17:34:55 UTC
Note: I miss typed the bug number. It should have been bug 448014 I should also mention that /var/log/audit/audit.log was full of these from the event: type=ANOM_ABEND msg=audit(1215629413.998:96): auid=101268 uid=0 gid=0 ses=1 pid=7743 comm="shutdown" sig=11 type=ANOM_ABEND msg=audit(1215629414.003:97): auid=101268 uid=0 gid=0 ses=1 pid=7702 comm="shutdown" sig=11 type=ANOM_ABEND msg=audit(1215629414.004:98): auid=101268 uid=0 gid=0 ses=1 pid=7747 comm="shutdown" sig=11 type=ANOM_ABEND msg=audit(1215629414.008:99): auid=101268 uid=0 gid=0 ses=1 pid=7746 comm="shutdown" sig=11 We're hitting this too. For added fun, switch your shell to tcsh -- any command you try to run will fail with Broken Pipe. Matthew, is the problem you're seeing the same as the one in bug #448014? In particular, does the patch attached there make this problem stop happening with tcsh? David, is your directory server's address information stored in the directory? If so, is it available via any other protocols? Can you attach your /etc/nsswitch.conf and /etc/hosts files, along with the output of "getent -s 'files dns' hosts [nameofserver]"? I'm having trouble getting my 5.2 system to exhibit the behavior you're describing. Our ldap servers are all specified in dns, but as an added measure we use /etc/hosts to narrow down the correct ldap server per client. This is so that we may ensure the correct ldap server is contacted based off of which network the client is in. Because of this, I don't believe we rely on the directory server's address info being listed in the directory. For security I'm not going to post our hosts file, but I did test it and I can assure you that the ldap entry there matches what is returned by your getent command, so the host is always aware of the ldap server regardless of what is in the ldap directory itself. nsswitch.conf: passwd: ldap files shadow: ldap [NOTFOUND=return] files group: files ldap #hosts: db files nisplus nis dns hosts: files dns # Example - obey only what nisplus tells us... #services: nisplus [NOTFOUND=return] files #networks: nisplus [NOTFOUND=return] files #protocols: nisplus [NOTFOUND=return] files #rpc: nisplus [NOTFOUND=return] files #ethers: nisplus [NOTFOUND=return] files #netmasks: nisplus [NOTFOUND=return] files bootparams: nisplus [NOTFOUND=return] files ethers: files netmasks: files networks: files protocols: files rpc: files services: files netgroup: files ldap publickey: nisplus automount: files aliases: files nisplus Hmm, why are you using "ldap files" rather than "files ldap" as the source for user and group information? That explicitly forces processes to consult the directory server before checking local files for information about users (including system users, and that's what worries me) which will probably cause problems after the network interface is taken down (the change has caused my 5.2 box to get stuck, but then it uses dhcp, too, so YMMV). Yep, it's admittedly crazy, but has been the way we're been running for some time. We run an ldap user database of about 60,000 across a whole university, which changes on a daily basis. In order to ensure that changes in student accounting results in accurate info everywhere, we force ldap -> files. files -> ldap results in too many issues with local account info not being maintained, and security problems. When the interface is down, that is a problem... but if that does in fact happen, the service is down anyways and we have bigger issues. The segfault never happened with any prior versions of nss_ldap. As soon as the upgrade happened, those symptoms were worse than any ldap problems by a long shot. By the way, some tweaks such as the following will ensure that the box does not get stuck. LDAP will always time out eventually, this speeds the process up. ldap.conf: # Timeout adjustments so LDAP fails faster when networking is not present, i.e. during boot nss_reconnect_tries 1 nss_reconnect_sleeptime 1 nss_reconnect_maxsleeptime 8 nss_reconnect_maxconntries 2 Okay, with the patch for #448014 applied, setting shorter timeouts does seem to let the system boot correctly (though with timeouts from udev), but the only errors I'm getting when I then go to reboot the box are from setroubleshoot. Without the patch, I don't actually get as far as you do. I just realized that since we run with cores unlimited, I have a couple thousand shutdown core dumps. Unfortunately, there aren't any debug symbols, so this won't be of much use unless some more of the libraries have them. I'll try and rebuild shutdown with them. Just as a start though, here is the output: eading symbols from /lib64/libc.so.6...(no debugging symbols found)...done. Loaded symbols for /lib64/libc.so.6 Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/ld-linux-x86-64.so.2 Reading symbols from /lib64/libnss_ldap.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/libnss_ldap.so.2 Reading symbols from /lib64/libcom_err.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/libcom_err.so.2 Reading symbols from /lib64/libkeyutils.so.1... (no debugging symbols found)...done. Loaded symbols for /lib64/libkeyutils.so.1 Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/libdl.so.2 Reading symbols from /lib64/libselinux.so.1...(no debugging symbols found)...done. Loaded symbols for /lib64/libselinux.so.1 Reading symbols from /lib64/libresolv.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/libresolv.so.2 Reading symbols from /lib64/libsepol.so.1... (no debugging symbols found)...done. Loaded symbols for /lib64/libsepol.so.1 Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/libnss_files.so.2 Core was generated by `shutdown -h 0 w'. Program terminated with signal 11, Segmentation fault. #0 0x00002b375f4fe330 in ?? () (gdb) where #0 0x00002b375f4fe330 in ?? () #1 0x00000000004027c5 in sync@plt () #2 0x0000000000402aca in shutdown () #3 0x00002b375f8509c0 in _nss_ldap_mergeconfigfromdns () from /lib64/libnss_ldap.so.2 #4 0x00002b375f850744 in _nss_ldap_mergeconfigfromdns () from /lib64/libnss_ldap.so.2 #5 0x00002b375f8311d8 in _nss_ldap_mergeconfigfromdns () from /lib64/libnss_ldap.so.2 #6 0x0000000000000000 in ?? () It's amazing what you can find with a bit of searching ;) I just learned how to use the debuginfo packages, and the core dump looks a bit better now. Note that all the core dumps have the same segfault: (gdb) where #0 0x00002b72d985f330 in ?? () #1 0x00000000004027c5 in init_setenv (name=0x4041c5 "INIT_HALT", value=0x3 <Address 0x3 out of bounds>) at shutdown.c:142 #2 0x0000000000402aca in shutdown (halttype=0x3 <Address 0x3 out of bounds>) at shutdown.c:396 #3 0x00002b72d9bb19c0 in _nss_ldap_mergeconfigfromdns () from /lib64/libnss_ldap.so.2 #4 0x00002b72d9bb12eb in _nss_ldap_mergeconfigfromdns () from /lib64/libnss_ldap.so.2 #5 0x00002b72d9bb1744 in _nss_ldap_mergeconfigfromdns () from /lib64/libnss_ldap.so.2 #6 0x00002b72d9b921d8 in _nss_ldap_mergeconfigfromdns () from /lib64/libnss_ldap.so.2 #7 0x00002b72d9b811ef in _nss_ldap_test_initgroups_ignoreuser () from /lib64/libnss_ldap.so.2 #8 0x00002b72d9b8425a in _nss_ldap_leave () from /lib64/libnss_ldap.so.2 #9 0x00002b72d987f5c3 in ?? () #10 0x00002b72d9b37148 in ?? () #11 0x00007fffd13fd800 in ?? () #12 0x00007fffd13fd800 in ?? () #13 0x00002b72d987f43a in ?? () #14 0x00002b72d9b37178 in ?? () #15 0x0000000000000000 in ?? () Hmm, do you have the parts of ldap.conf which deal with how to connect to the server (absent any login information)? I can get something like this to happen with a combination of an ldaps:// "uri" setting and "ssl off", which makes me think that the fix to recognize the "port" option might have something to do with it. I'm emailing you our ldap.conf minus the commented out bits. Okay, I couldn't get the same type of error by switching from ldaps to ldap-with-starttls, but something odd is definitely happening when shutdown hits nss_ldap's atfork handler. The handler wasn't always being correctly installed before, which I think explains why this is cropping up now. This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release. Turns out that shutdown's problem is that it defines a function named shutdown() (which, among other things, calls fork() to warn users that things are about to happen), and when nss_ldap's child atfork handler goes to close the connection to the server, it calls a libldap function which in turn calls shutdown() (which is supposed to close the socket connection), and ends up calling the locally-defined shutdown() instead, causing the child to immediately fork a new child. And the process repeats itself. That would explain the thousands of core dumps and infinite segfaulting I was seeing on shutdown. :) Thanks for the hard work Nalin, we appreciate it. Building as 2.86-15.el5. Sorry for the long-delayed response -- I missed your question earlier, Nalin. Yes, the patch in the other bug fixed the tcsh symptom. An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2009-0149.html |