Description of problem: The SASL GSSAPI test sometimes fails on shadowfoot. No other platform reports the failure. Shadowfoot is the only host that runs a 32-bit OS (RHEL4), and it is located in the colo.

This is the success case: the bind with mech=GSSAPI finishes with err=0, and adding dn="cn=Sasl TestEntry 3,ou=TestPeople2,o=sasl.com" follows it.
[...] conn=13 fd=64 slot=64 connection from 10.14.0.25 to 10.14.0.25
[...] conn=13 op=0 BIND dn="" method=sasl version=3 mech=GSSAPI
[...] conn=13 op=0 RESULT err=14 tag=97 nentries=0 etime=0, SASL bind in progress
[...] conn=13 op=1 BIND dn="" method=sasl version=3 mech=GSSAPI
[...] conn=13 op=1 RESULT err=14 tag=97 nentries=0 etime=0, SASL bind in progress
[...] conn=13 op=2 BIND dn="" method=sasl version=3 mech=GSSAPI
[...] conn=13 op=2 RESULT err=0 tag=97 nentries=0 etime=0 dn="cn=sasl testentry 1,ou=testpeople,o=sasl.com"
[...] conn=13 op=3 ADD dn="cn=Sasl TestEntry 3,ou=TestPeople2,o=sasl.com"
[...] conn=13 op=3 RESULT err=0 tag=105 nentries=0 etime=0
[...] conn=13 op=4 UNBIND

This is the failure case: the bind completes successfully, but before the add request is received, the connection is closed with B1 == SLAPD_DISCONNECT_BAD_BER_TAG.
[...] conn=13 fd=64 slot=64 connection from 10.14.0.25 to 10.14.0.25
[...] conn=13 op=0 BIND dn="" method=sasl version=3 mech=GSSAPI
[...] conn=13 op=0 RESULT err=14 tag=97 nentries=0 etime=0, SASL bind in progress
[...] conn=13 op=1 BIND dn="" method=sasl version=3 mech=GSSAPI
[...] conn=13 op=1 RESULT err=14 tag=97 nentries=0 etime=0, SASL bind in progress
[...] conn=13 op=2 BIND dn="" method=sasl version=3 mech=GSSAPI
[...] conn=13 op=2 RESULT err=0 tag=97 nentries=0 etime=0 dn="cn=sasl testentry 1,ou=testpeople,o=sasl.com"
[...] conn=13 op=-1 fd=64 closed - B1

On the client side, nsldapi_send_server_request (called via ldap_add_ext) sets LDAP_SERVER_DOWN and returns an error. Since both sides are very terse (if it were a serious error, it would have been reported somehow...), I searched for the most suspicious quiet-failure candidates, which could be one of the PR_Recv calls made with PR_INTERVAL_NO_WAIT (one in connection.c and two in sasl_io.c). E.g.:

    ret = PR_Recv(c->c_prfd, sp->encrypted_buffer + sp->encrypted_buffer_offset,
                  bytes_remaining_to_read, 0, PR_INTERVAL_NO_WAIT);
    if (ret < 0) {
        *err = PR_GetError();
        return -1;
    }

But on both sides PR_Recv does not return -1; it returns 0. That is, both sides conclude "no messages are available to be received and the peer has performed an orderly shutdown."
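For reference, a minimal sketch of the three outcomes a caller has to distinguish when calling PR_Recv with PR_INTERVAL_NO_WAIT on a non-blocking descriptor (the helper name is hypothetical; this is not the actual server code):

    #include <prio.h>
    #include <prerror.h>

    /* Hypothetical helper; illustrates the PR_Recv return-value contract. */
    static PRInt32
    try_read(PRFileDesc *fd, char *buf, PRInt32 len)
    {
        PRInt32 ret = PR_Recv(fd, buf, len, 0, PR_INTERVAL_NO_WAIT);
        if (ret > 0) {
            return ret;     /* got some bytes */
        }
        if (ret == 0) {
            /* NSPR defines 0 as "the peer performed an orderly shutdown",
             * i.e. EOF -- not "no data available yet". */
            return 0;
        }
        /* ret < 0: distinguish a hard error from "would block". */
        if (PR_GetError() == PR_WOULD_BLOCK_ERROR) {
            return -2;      /* no data yet; caller should poll and retry */
        }
        return -1;          /* hard error */
    }

The failure above hinges on which of these branches actually fires: the server saw 0 and concluded the peer had closed the connection.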
Created attachment 153394 [details]
cvs diff sasl_io.c and the fix description
(In reply to comment #1)
> Created an attachment (id=153394)
> cvs diff sasl_io.c and the fix description

It looks ok, but how do we handle this condition in other code? SSL code or even regular code may have the same problem, or do they solve it in a different way?
(In reply to comment #2)
> It looks ok, but how do we handle this condition in other code? SSL code or
> even regular code may have the same problem, or do they solve it in a different way?

Regular code does not have this problem. When connection_read_ldap_data returns 0 (returned from PR_Recv), it really means the connection was closed by the peer. That is unlike this SASL case: the connection is okay and a packet was received, but more packets are expected before the entire message can be decrypted. I'll double-check the SSL code next (but I think the chance that it shares the problem is very slim...).
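To illustrate the distinction, here is a hedged sketch of the partial-packet condition (the struct fields are modeled on the snippet quoted in the description; the function and its return convention are hypothetical, not the actual sasl_io.c fix):

    #include <stddef.h>
    #include <prio.h>
    #include <prerror.h>

    /* Hypothetical per-connection SASL read state, modeled on the
     * encrypted_buffer fields quoted in the description. */
    typedef struct sasl_read_state {
        char   *encrypted_buffer;
        size_t  encrypted_buffer_offset;  /* bytes buffered so far */
        size_t  packet_length;            /* full length of this SASL packet */
    } sasl_read_state;

    /* Returns 1 when a full packet is buffered, 0 on genuine EOF, and
     * -1 with *err set when the caller should poll and retry. */
    static int
    fill_sasl_packet(PRFileDesc *fd, sasl_read_state *sp, PRErrorCode *err)
    {
        while (sp->encrypted_buffer_offset < sp->packet_length) {
            PRInt32 remaining = (PRInt32)(sp->packet_length - sp->encrypted_buffer_offset);
            PRInt32 ret = PR_Recv(fd, sp->encrypted_buffer + sp->encrypted_buffer_offset,
                                  remaining, 0, PR_INTERVAL_NO_WAIT);
            if (ret > 0) {
                sp->encrypted_buffer_offset += ret;
                continue;
            }
            if (ret == 0 && sp->encrypted_buffer_offset > 0) {
                /* Mid-packet: the connection is fine, the rest of the
                 * encrypted packet just has not arrived yet.  Surface
                 * this as "would block" so the caller retries instead
                 * of treating it as an orderly shutdown. */
                *err = PR_WOULD_BLOCK_ERROR;
                return -1;
            }
            if (ret == 0) {
                return 0;   /* genuine EOF between packets */
            }
            *err = PR_GetError();
            return -1;      /* would-block or hard error; caller decides */
        }
        return 1;           /* full packet buffered; ready to decrypt */
    }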
(In reply to comment #3)
> Regular code does not have this problem. When connection_read_ldap_data returns
> 0 (returned from PR_Recv), it really means the connection was closed by the peer.
> That is unlike this SASL case: the connection is okay and a packet was received,
> but more packets are expected before the entire message can be decrypted. I'll
> double-check the SSL code next (but I think the chance that it shares the
> problem is very slim...).

Ok. Then I think your proposal is fine.
I also think your proposal is fine.
Created attachment 153451 [details]
cvs commit message

Thank you, Rich and Nathan, for the comments and the discussion. Checked into HEAD.
> It failed again, but the behavior has changed!
> [27/Apr/2007:05:14:39 -0700] conn=13 fd=64 slot=64 connection from 10.14.0.25 to 10.14.0.25
> [27/Apr/2007:05:14:39 -0700] conn=13 op=0 BIND dn="" method=sasl version=3 mech=GSSAPI
> [27/Apr/2007:05:14:39 -0700] conn=13 op=0 RESULT err=14 tag=97 nentries=0 etime=0, SASL bind in progress
> [27/Apr/2007:05:14:39 -0700] conn=13 op=1 BIND dn="" method=sasl version=3 mech=GSSAPI
> [27/Apr/2007:05:14:39 -0700] conn=13 op=1 RESULT err=14 tag=97 nentries=0 etime=0, SASL bind in progress
> [27/Apr/2007:05:14:39 -0700] conn=13 op=2 BIND dn="" method=sasl version=3 mech=GSSAPI
> [27/Apr/2007:05:14:39 -0700] conn=13 op=2 RESULT err=0 tag=97 nentries=0 etime=0 dn="cn=sasl testentry 1,ou=testpeople,o=sasl.com"
> [27/Apr/2007:05:14:39 -0700] conn=13 op=-1 fd=64 closed error 11 (Resource temporarily unavailable) - unknown
> [27/Apr/2007:05:14:39 -0700] conn=14 fd=64 slot=64 connection from 10.14.0.25 to 10.14.0.25
> [27/Apr/2007:05:14:39 -0700] conn=14 op=0 BIND dn="cn=directory manager" method=128 version=3
>
> Now we have a better view of this problem. PR_Recv in sasl_recv_connection gets
> EAGAIN (== errno 11), which is returned to connection_read_operation via
> connection_read_ldap_data. The error number is then checked with the macro
> SLAPD_PR_WOULD_BLOCK_ERROR to decide whether it is a temporary non-blocking I/O
> error, as follows:
> # define SLAPD_PR_WOULD_BLOCK_ERROR( prerrno ) \
>         ((prerrno) == PR_WOULD_BLOCK_ERROR || (prerrno) == PR_IO_TIMEOUT_ERROR)
> This EAGAIN is another candidate for a retry, but currently that never happens
> and the connection is closed.
>
> We have another macro SLAPD_SYSTEM_WOULD_BLOCK_ERROR in slap.h:
> /*
>  * SLAPD_SYSTEM_WOULD_BLOCK_ERROR() returns non-zero if syserrno is an OS
>  * error code that indicates a temporary non-blocking I/O error,
>  * e.g., EAGAIN.
>  */
> #define SLAPD_SYSTEM_WOULD_BLOCK_ERROR( syserrno ) \
>         ((syserrno)==EAGAIN || (syserrno)==EWOULDBLOCK)
>
> Can we check the error number with that macro, as well? And if PR_Poll agrees,
> can we retry reading from the connection instead of closing it immediately?
> Index: connection.c
> ===================================================================
> RCS file: /cvs/dirsec/ldapserver/ldap/servers/slapd/connection.c,v
> retrieving revision 1.14
> diff -t -w -U4 -r1.14 connection.c
> --- connection.c        27 Feb 2007 02:57:29 -0000      1.14
> +++ connection.c        27 Apr 2007 17:03:00 -0000
> @@ -1744,9 +1744,10 @@
>                  return CONN_DONE;
>              }
>              /* err = PR_GetError(); */
>              /* If we would block, we need to poll for a while */
> -            if ( SLAPD_PR_WOULD_BLOCK_ERROR( err ) ) {
> +            if ( SLAPD_PR_WOULD_BLOCK_ERROR( err ) ||
> +                 SLAPD_SYSTEM_WOULD_BLOCK_ERROR( err ) ) {
>                  struct PRPollDesc pr_pd;
>                  PRIntervalTime timeout = PR_MillisecondsToInterval(CONN_TURBO_TIMEOUT_INTERVAL);
>                  pr_pd.fd = (PRFileDesc *)conn->c_prfd;
>                  pr_pd.in_flags = PR_POLL_READ;

Ok.
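As a side note, the poll-and-retry step the diff relies on looks roughly like this (a sketch only; the helper name and timeout handling are assumptions, not the actual connection.c code):

    #include <prio.h>

    /* Sketch: wait up to timeout_ms for the descriptor to become
     * readable after a would-block error, then let the caller retry
     * the read.  Returns non-zero if data is ready. */
    static int
    wait_until_readable(PRFileDesc *fd, PRUint32 timeout_ms)
    {
        PRPollDesc pd;
        pd.fd = fd;
        pd.in_flags = PR_POLL_READ;
        pd.out_flags = 0;
        /* PR_Poll returns the number of ready descriptors, 0 on
         * timeout, or -1 on failure. */
        return PR_Poll(&pd, 1, PR_MillisecondsToInterval(timeout_ms)) > 0 &&
               (pd.out_flags & PR_POLL_READ);
    }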
Created attachment 153651 [details]
cvs commit message

Reviewed by Rich (thank you!). Checked into HEAD.
Today's acceptance test still shows the error:
[13/May/2007:05:14:30 -0700] conn=13 op=-1 fd=65 closed error 11 (Resource temporarily unavailable) - unknown

The proposal in comment #7 was not good enough to catch EAGAIN: SLAPD_SYSTEM_WOULD_BLOCK_ERROR was being passed the NSPR error code. I should have checked the system (OS) error instead.

Index: connection.c
===================================================================
RCS file: /cvs/dirsec/ldapserver/ldap/servers/slapd/connection.c,v
retrieving revision 1.15
diff -t -w -U4 -r1.15 connection.c
--- connection.c        27 Apr 2007 18:00:21 -0000      1.15
+++ connection.c        14 May 2007 21:09:32 -0000
@@ -1744,10 +1744,11 @@
                 return CONN_DONE;
             }
             /* err = PR_GetError(); */
             /* If we would block, we need to poll for a while */
+            syserr = PR_GetOSError();
             if ( SLAPD_PR_WOULD_BLOCK_ERROR( err ) ||
-                 SLAPD_SYSTEM_WOULD_BLOCK_ERROR( err ) ) {
+                 SLAPD_SYSTEM_WOULD_BLOCK_ERROR( syserr ) ) {
                 struct PRPollDesc pr_pd;
                 PRIntervalTime timeout = PR_MillisecondsToInterval(CONN_TURBO_TIMEOUT_INTERVAL);
                 pr_pd.fd = (PRFileDesc *)conn->c_prfd;
                 pr_pd.in_flags = PR_POLL_READ;
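The root of the mix-up is that NSPR keeps two error values per thread, and the patch above switches the second check to the right one. A minimal sketch of the corrected test (the macros are copied from slap.h as quoted in comment #7; the wrapper function is hypothetical):

    #include <errno.h>
    #include <prerror.h>

    /* Macros as quoted from slap.h earlier in this bug. */
    #define SLAPD_PR_WOULD_BLOCK_ERROR( prerrno ) \
            ((prerrno) == PR_WOULD_BLOCK_ERROR || (prerrno) == PR_IO_TIMEOUT_ERROR)
    #define SLAPD_SYSTEM_WOULD_BLOCK_ERROR( syserrno ) \
            ((syserrno) == EAGAIN || (syserrno) == EWOULDBLOCK)

    /* PR_GetError() returns NSPR's cross-platform code (e.g.
     * PR_WOULD_BLOCK_ERROR); PR_GetOSError() returns the raw OS errno
     * (e.g. EAGAIN == 11 on Linux).  A layered I/O path such as the
     * SASL one may surface EAGAIN only through the OS error, so each
     * macro must be fed the matching value. */
    static int
    read_would_block(void)
    {
        PRErrorCode err    = PR_GetError();
        PRInt32     syserr = PR_GetOSError();
        return SLAPD_PR_WOULD_BLOCK_ERROR(err) ||
               SLAPD_SYSTEM_WOULD_BLOCK_ERROR(syserr);
    }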
(In reply to comment #9)
> Index: connection.c
> ===================================================================
> RCS file: /cvs/dirsec/ldapserver/ldap/servers/slapd/connection.c,v
> retrieving revision 1.15
> diff -t -w -U4 -r1.15 connection.c
> --- connection.c        27 Apr 2007 18:00:21 -0000      1.15
> +++ connection.c        14 May 2007 21:09:32 -0000
> @@ -1744,10 +1744,11 @@
>                  return CONN_DONE;
>              }
>              /* err = PR_GetError(); */
>              /* If we would block, we need to poll for a while */
> +            syserr = PR_GetOSError();
>              if ( SLAPD_PR_WOULD_BLOCK_ERROR( err ) ||
> -                 SLAPD_SYSTEM_WOULD_BLOCK_ERROR( err ) ) {
> +                 SLAPD_SYSTEM_WOULD_BLOCK_ERROR( syserr ) ) {
>                  struct PRPollDesc pr_pd;
>                  PRIntervalTime timeout = PR_MillisecondsToInterval(CONN_TURBO_TIMEOUT_INTERVAL);
>                  pr_pd.fd = (PRFileDesc *)conn->c_prfd;
>                  pr_pd.in_flags = PR_POLL_READ;

Ok.
Created attachment 154692 [details]
cvs commit message

Reviewed by Rich (thank you!). Checked into HEAD.