Bug 564108

Summary:	nss_ldap with start_tls and ldapi breaks su
Product:	Red Hat Enterprise Linux 5	Reporter:	Chris Adams <linux>
Component:	openldap	Assignee:	Jan Zeleny <jzeleny>
Status:	CLOSED WONTFIX	QA Contact:	BaseOS QE Security Team <qe-baseos-security>
Severity:	medium	Docs Contact:
Priority:	low
Version:	5.4	CC:	dpal, jplans, jvcelak, jzeleny
Target Milestone:	rc	Keywords:	Reopened
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2010-08-16 07:31:17 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Chris Adams 2010-02-11 20:57:00 UTC

I'm trying to configure a new server with users stored in LDAP. I configured OpenLDAP with certs for TLS and to listen on a Unix socket (LDAPI). I configured nss_ldap to use start_tls and to talk on the ldapi:/// URI.

Everything works fine, except "su" from one user to another (su to root works okay, but root is not in LDAP). It mostly doesn't work, but does occasionally. If I change /etc/ldap.conf from "ssl start_tls" to "ssl off", or to use ldap://hostname/, it works. With ldapi:/// and start_tls, su fails with no error message or log entry.

The last log entry when it fails is the PAM "session opened" message (there is no corresponding "session closed" entry).

I did an strace (as root with "-u user" to properly simulate su), and I see the process died with a SIGPIPE when writing to the LDAPI socket. I turned on slapd logging, and I see a socket closed with "connection lost" (I compared the LDAP log from a TLS config with a non-TLS confg, and that's the only difference other than the STARTTLS command itself).

I was planning to use LDAPI for the local slapd because I thought it would have lower overhead. I planned to use TLS because I plan to have replicated slaves on other servers for redundancy, and I wanted to keep that connection secure (and I don't think you can configure nss_ldap to use TLS for some servers and not others).

It looks like there's either a bug in nss_ldap or the OpenLDAP client libraries; I'm leaning towards nss_ldap not handling something correctly, but why does it only show up with "su"?

Comment 1 Nalin Dahyabhai 2010-07-01 15:17:01 UTC

I'm not able to reproduce this error with nss_ldap-253-17.el5, nss_ldap-253-21.el5, or nss_ldap-253-25.el5, but I can with nss_ldap-253-12.el5, which suffered from bug #448014, which would have caused symptoms like this.

I'm going to mark this bug as a duplicate of #448014, but please reopen it if you find that it happens with newer versions (and please include an indication of which version of the package you're using).

Thanks!

*** This bug has been marked as a duplicate of bug 448014 ***

Comment 2 Chris Adams 2010-07-01 15:30:16 UTC

My initial report was with everything current as of that date, RHEL 5.4+updates.  I still see this with RHEL 5.5+updates, nss_ldap-253-25.el5.x86_64.

I can reproduce this by just changing "uri ldap://host/ ldap://host2/" to "uri ldapi:///" in /etc/ldap.conf.  I also see sudo failing (I wasn't testing that before).  In both cases, the exit code is 141, and no error is logged (with su I get a "session opened" log entry, and with sudo I get the command logged).

Comment 4 Nalin Dahyabhai 2010-07-01 18:31:35 UTC

Aaagh.  I'm not seeing weird when I run sudo as the user from the directory, either.

You mentioned you were using replication -- are you using referrals on your server?  To see if that makes a difference, does setting 'referrals no' in /etc/ldap.conf cause it to stop failing?  (I don't think it would -- the referrals would be connections to a different server, but it's something that might be different between your setup and mine.)

A SIGPIPE while writing to the socket suggests that the process on the other end is hanging up on the client.  If you turn up the logging on the directory server, is it actually doing that?

Comment 5 Chris Adams 2010-07-01 19:23:03 UTC

I always find the weird ones. :)

Turning of referrals didn't make a difference, but this is the master (so if I understand things right wouldn't generate any referrals anyway).

Hmm, this may be some type of race condition.  I had "loglevel none" in slapd.conf (since OpenLDAP is rather verbose by default, and "none" still means "log important stuff"), and su/sudo seems to always fail.  I changed that to add a bunch of logging ("loglevel trace packets conns stats"), and su/sudo work almost every time.

Looking at the log with the above extra loglevel, I see several searches/connections for a sudo attempt:

- Conn 1: search for passwd information for the sudo-to user (I did an nscd -i passwd after restarting ldap); only done the first time I try after nscd -i passwd of course.

- Conn 2: search for shadow information for the sudo-from user; for some reason, the same search is done three times (same filter, same attribs) - this connection always ends with "connection lost" (nss_ldap not shutting down nicely?).  These look the same between success and failure.

- Conn 3: search for passwd information to the sudo-from user; this does not seem to be cached by nscd for some reason.  I do see a small difference in the log info for the success(-) and failure(+) connections:

    conn=X fd=20 ACCEPT from PATH=/var/run/ldapi (PATH=/var/run/ldapi)
    conn=X op=0 STARTTLS
    conn=X op=0 RESULT oid= err=0 text=
    conn=X fd=20 TLS established tls_ssf=256 ssf=256
    conn=X op=1 BIND dn="cn=Manager,dc=hiwaay,dc=net" method=128
    conn=X op=1 BIND dn="cn=Manager,dc=hiwaay,dc=net" mech=SIMPLE ssf=0
    send_ldap_result: conn=X op=1 p=3
    conn=X op=1 RESULT tag=97 err=0 text=
    conn=X op=2 SRCH base="ou=People,dc=hiwaay,dc=net" scope=1 deref=0 filter="(uid=cmadams)"
    => send_search_entry: conn X dn="uid=cmadams,ou=People,dc=hiwaay,dc=net"
    <= send_search_entry: conn X exit.
    send_ldap_result: conn=X op=2 p=3
    conn=X op=2 SEARCH RESULT tag=101 err=0 nentries=1 text=
   -connection_closing: readying conn=X sd=20 for close
   -connection_close: deferring conn=X sd=-1
    conn=X op=3 UNBIND
   +connection_closing: readying conn=X sd=20 for close
    connection_resched: attempting closing conn=X sd=20
    connection_close: conn=X sd=-1
    conn=X fd=20 closed

With the logging raised, I get mostly success with sudo, but if I strace it (running strace as root with -u cmadams), it always fails, ending with a SIGPIPE.  I cannot get a success while tracing.

This only seems to happen with ldapi:/// and "ssl start_tls"; I haven't seen a problem when I either use ldap://host/ or comment out "ssl start_tls".  One thing that seems to work as a work-around is to use "uri ldapi:/// ldaps://host2/" (ldapS instead of ldap) and comment out "ssl start_tls".  I get encryption for network connections and not local (where it isn't really needed and runs a slight bit faster without it) and don't seem to hit this problem (I can even strace the sudo process).

Comment 6 Nalin Dahyabhai 2010-07-01 19:43:09 UTC

Ah, I didn't see that you were using nscd.  I'm wondering if this is a permissions problem, as my test box doesn't allow nscd to contact the ldapi socket in enforcing mode.  If you're in SELinux enforcing mode, does temporarily disabling enforcement make a difference?  If it does, then I can reassign this to policy; if not, I need to open a new bug for policy.

Comment 7 Chris Adams 2010-07-01 20:02:11 UTC

I'm running in permissive mode right now (this setup is still in testing).  I did add the following to my local policy early in my setup though:

# allow NSCD to access LDAP socket
allow nscd_t slapd_t:unix_stream_socket connectto;
allow nscd_t slapd_var_run_t:sock_file write;

Comment 8 Nalin Dahyabhai 2010-07-01 22:14:08 UTC

When you run 'ldapsearch -H ldapi:/// -ZZ -x -b "" -s base' under strace, does it also catch a SIGPIPE?  It does here, so now I'm wondering if we're hitting a race condition in the connection teardown code in libldap, possibly in conjunction with slapd.

Running ldapsearch under gdb, breaking on ssl3_dispatch_alert, and telling it to continue when it hits the breakpoint seems to reach the same position, but it's not completely reliable.  When that happens, my backtrace looks something like this:

#0  0x00000035d1ac6070 in __write_nocancel () from /lib64/libc.so.6
#1  0x00000035d2208ffc in sb_debug_write (sbiod=0x6b2860, buf=0x6c1ce0, len=37)
    at ../../../libraries/liblber/sockbuf.c:820
#2  0x00000035d262a15f in sb_tls_bio_write (b=0x6b33c0,
    buf=0x6c1ce0 "\025\003\001", len=37)
    at ../../../libraries/libldap/tls.c:701
#3  0x00000035dd87666e in BIO_write (b=0x6b33c0, in=0x6c1ce0, inl=37)
    at bio_lib.c:247
#4  0x00000035df42074d in ssl3_write_pending (s=0x67bc30,
    type=<value optimized out>, buf=<value optimized out>,
    len=<value optimized out>) at s3_pkt.c:741
#5  0x00000035df420bea in ssl3_dispatch_alert (s=0x5) at s3_pkt.c:1323
#6  0x00000035df41e7e2 in ssl3_shutdown (s=0x5) at s3_lib.c:2343
#7  0x00000035d262a08a in sb_tls_close (sbiod=<value optimized out>)
    at ../../../libraries/libldap/tls.c:550
#8  0x00000035d220952b in ber_int_sb_close (sb=0x611840)
    at ../../../libraries/liblber/sockbuf.c:369
#9  0x00000035d261d455 in ldap_free_connection (ld=0x6116c0, lc=0x61aaa0,
    force=1, unbind=1) at ../../../libraries/libldap/request.c:638
#10 0x00000035d2614926 in ldap_ld_free (ld=0x6116c0, close=1,
    sctrls=<value optimized out>, cctrls=<value optimized out>)
    at ../../../libraries/libldap/unbind.c:96
#11 0x0000000000405b08 in ldap_search_ext ()
#12 0x00000035d1a1d994 in __libc_start_main (
    main=0x404e20 <ldap_search_ext+6816>, argc=9, ubp_av=0x7fffffffea28,
    init=<value optimized out>, fini=<value optimized out>,
    rtld_fini=<value optimized out>, stack_end=0x7fffffffea18)
    at libc-start.c:231
#13 0x00000000004033b9 in ldap_search_ext ()
    at ../../../libraries/libldap/search.c:68
#14 0x00007fffffffea18 in ?? ()

Comment 9 Chris Adams 2010-07-01 23:06:50 UTC

Wow, I hadn't thought about that.  Yes, it does SIGPIPE.

One note: I am currently running openldap packages from RH support (because of a STARTTLS hang introduced in RHEL 5.5 when trying to fix a different STARTTLS hang), openldap-2.3.43-13.el5.x86_64.  I also have 32 bit versions of the same packages installed on a 32 bit server (old server that is going to be a backup replica), and it does NOT show a SIGPIPE when I run the above ldapsearch.  It is in production for other services, so I can't configure it for LDAP logins to test my original problem.

Comment 11 Jan Zeleny 2010-07-29 07:04:30 UTC

I've been inspecting this issue for quite some time. This seems to be slapd issue. I found that slapd receives EAGAIN at some point when trying to read from the socket. At that point it closes the connection, while client is still writing to the socket.

I managed to reproduce this issue even on the most current systems, except for Fedora rawhide (openldap there uses NSS and doesn't support TLS over ldapi at all). I already filed upstream issue tracker:

http://www.openldap.org/its/index.cgi/Incoming?id=6610;selectid=6610

Comment 12 Jan Zeleny 2010-07-29 12:40:31 UTC

Please see ITS from last comment for upstream suggestion. If it works, I'd like to close the bug as WONTFIX.

Comment 13 Chris Adams 2010-07-29 13:55:18 UTC

I had already switched from ldapi+ldap with STARTTLS to ldapi+ldaps (enabled ldaps on my servers), and that does seem to work just fine.  I wasn't aware of the starttls URL extension (is that documented somewhere?).

Anyway, go ahead and close the bug.

Comment 15 RHEL Program Management 2010-08-09 18:22:56 UTC

This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated in the
current release, Red Hat is unfortunately unable to address this
request at this time. Red Hat invites you to ask your support
representative to propose this request, if appropriate and relevant,
in the next release of Red Hat Enterprise Linux.

Comment 16 Jan Zeleny 2010-08-16 07:31:17 UTC

As a result of previous discussion and available workaround, I'm closing the bug.