Description of problem: The SASL GSSAPI test sometimes fails on shadowfoot. No other platform reports the failure. Shadowfoot is the only host that runs a 32-bit OS (RHEL4), and it is located in the colo.

This is the success case: the bind with mech=GSSAPI finishes with err=0, and adding dn="cn=Sasl TestEntry 3,ou=TestPeople2,o=sasl.com" follows it.
[...] conn=13 fd=64 slot=64 connection from 10.14.0.25 to 10.14.0.25
[...] conn=13 op=0 BIND dn="" method=sasl version=3 mech=GSSAPI
[...] conn=13 op=0 RESULT err=14 tag=97 nentries=0 etime=0, SASL bind in progress
[...] conn=13 op=1 BIND dn="" method=sasl version=3 mech=GSSAPI
[...] conn=13 op=1 RESULT err=14 tag=97 nentries=0 etime=0, SASL bind in progress
[...] conn=13 op=2 BIND dn="" method=sasl version=3 mech=GSSAPI
[...] conn=13 op=2 RESULT err=0 tag=97 nentries=0 etime=0 dn="cn=sasl testentry 1,ou=testpeople,o=sasl.com"
[...] conn=13 op=3 ADD dn="cn=Sasl TestEntry 3,ou=TestPeople2,o=sasl.com"
[...] conn=13 op=3 RESULT err=0 tag=105 nentries=0 etime=0
[...] conn=13 op=4 UNBIND

This is the failure case: the bind completes successfully, but before the add request is received, the connection is closed with B1 == SLAPD_DISCONNECT_BAD_BER_TAG.
[...] conn=13 fd=64 slot=64 connection from 10.14.0.25 to 10.14.0.25
[...] conn=13 op=0 BIND dn="" method=sasl version=3 mech=GSSAPI
[...] conn=13 op=0 RESULT err=14 tag=97 nentries=0 etime=0, SASL bind in progress
[...] conn=13 op=1 BIND dn="" method=sasl version=3 mech=GSSAPI
[...] conn=13 op=1 RESULT err=14 tag=97 nentries=0 etime=0, SASL bind in progress
[...] conn=13 op=2 BIND dn="" method=sasl version=3 mech=GSSAPI
[...] conn=13 op=2 RESULT err=0 tag=97 nentries=0 etime=0 dn="cn=sasl testentry 1,ou=testpeople,o=sasl.com"
[...] conn=13 op=-1 fd=64 closed - B1

On the client side, nsldapi_send_server_request (called via ldap_add_ext) sets LDAP_SERVER_DOWN and returns an error. Since both sides are very terse (if it were a serious error, it would have been reported somehow...), I searched for the most suspicious quiet-failure candidates, which could be one of the PR_Recv calls made with PR_INTERVAL_NO_WAIT (one in connection.c and two in sasl_io.c). E.g.:

    ret = PR_Recv(c->c_prfd, sp->encrypted_buffer + sp->encrypted_buffer_offset,
                  bytes_remaining_to_read, 0, PR_INTERVAL_NO_WAIT);
    if (ret < 0) {
        *err = PR_GetError();
        return -1;
    }

But on both sides PR_Recv does not return -1; it returns 0. That is, both sides conclude "no messages are available to be received and the peer has performed an orderly shutdown."
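For reference, a minimal sketch of the three outcomes a caller has to distinguish when calling PR_Recv with PR_INTERVAL_NO_WAIT on a non-blocking descriptor (the helper name is hypothetical; this is not the actual server code):

    #include <prio.h>
    #include <prerror.h>

    /* Hypothetical helper; illustrates the PR_Recv return-value contract. */
    static PRInt32
    try_read(PRFileDesc *fd, char *buf, PRInt32 len)
    {
        PRInt32 ret = PR_Recv(fd, buf, len, 0, PR_INTERVAL_NO_WAIT);
        if (ret > 0) {
            return ret;     /* got some bytes */
        }
        if (ret == 0) {
            /* NSPR defines 0 as "the peer performed an orderly shutdown",
             * i.e. EOF -- not "no data available yet". */
            return 0;
        }
        /* ret < 0: distinguish a hard error from "would block". */
        if (PR_GetError() == PR_WOULD_BLOCK_ERROR) {
            return -2;      /* no data yet; caller should poll and retry */
        }
        return -1;          /* hard error */
    }

The failure above hinges on which of these branches actually fires: the server saw 0 and concluded the peer had closed the connection.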
Created attachment 153394 [details]
cvs diff sasl_io.c and the fix description
(In reply to comment #1)
> Created an attachment (id=153394)
> cvs diff sasl_io.c and the fix description

It looks ok, but how do we handle this condition in other code? SSL code or even regular code may have the same problem, or do they solve it in a different way?
(In reply to comment #2)
> It looks ok, but how do we handle this condition in other code? SSL code or
> even regular code may have the same problem, or do they solve it in a different way?

Regular code does not have this problem. When connection_read_ldap_data returns 0 (returned from PR_Recv), it really means the connection was closed by the peer. That is unlike this SASL case: the connection is okay and a packet was received, but more packets are expected before the entire message can be decrypted. I'll double-check the SSL code next (but I think the chance that it shares the problem is very slim...).
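To illustrate the distinction, here is a hedged sketch of the partial-packet condition (the struct fields are modeled on the snippet quoted in the description; the function and its return convention are hypothetical, not the actual sasl_io.c fix):

    #include <stddef.h>
    #include <prio.h>
    #include <prerror.h>

    /* Hypothetical per-connection SASL read state, modeled on the
     * encrypted_buffer fields quoted in the description. */
    typedef struct sasl_read_state {
        char   *encrypted_buffer;
        size_t  encrypted_buffer_offset;  /* bytes buffered so far */
        size_t  packet_length;            /* full length of this SASL packet */
    } sasl_read_state;

    /* Returns 1 when a full packet is buffered, 0 on genuine EOF, and
     * -1 with *err set when the caller should poll and retry. */
    static int
    fill_sasl_packet(PRFileDesc *fd, sasl_read_state *sp, PRErrorCode *err)
    {
        while (sp->encrypted_buffer_offset < sp->packet_length) {
            PRInt32 remaining = (PRInt32)(sp->packet_length - sp->encrypted_buffer_offset);
            PRInt32 ret = PR_Recv(fd, sp->encrypted_buffer + sp->encrypted_buffer_offset,
                                  remaining, 0, PR_INTERVAL_NO_WAIT);
            if (ret > 0) {
                sp->encrypted_buffer_offset += ret;
                continue;
            }
            if (ret == 0 && sp->encrypted_buffer_offset > 0) {
                /* Mid-packet: the connection is fine, the rest of the
                 * encrypted packet just has not arrived yet.  Surface
                 * this as "would block" so the caller retries instead
                 * of treating it as an orderly shutdown. */
                *err = PR_WOULD_BLOCK_ERROR;
                return -1;
            }
            if (ret == 0) {
                return 0;   /* genuine EOF between packets */
            }
            *err = PR_GetError();
            return -1;      /* would-block or hard error; caller decides */
        }
        return 1;           /* full packet buffered; ready to decrypt */
    }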
(In reply to comment #3)
> Regular code does not have this problem. When connection_read_ldap_data returns
> 0 (returned from PR_Recv), it really means the connection was closed by the peer.
> That is unlike this SASL case: the connection is okay and a packet was received,
> but more packets are expected before the entire message can be decrypted. I'll
> double-check the SSL code next (but I think the chance that it shares the
> problem is very slim...).

Ok. Then I think your proposal is fine.
I also think your proposal is fine.
Created attachment 153451 [details]
cvs commit message

Thank you, Rich and Nathan, for the comments and the discussion. Checked into HEAD.
> It failed again, but the behavior has changed!
> [27/Apr/2007:05:14:39 -0700] conn=13 fd=64 slot=64 connection from 10.14.0.25 to 10.14.0.25
> [27/Apr/2007:05:14:39 -0700] conn=13 op=0 BIND dn="" method=sasl version=3 mech=GSSAPI
> [27/Apr/2007:05:14:39 -0700] conn=13 op=0 RESULT err=14 tag=97 nentries=0 etime=0, SASL bind in progress
> [27/Apr/2007:05:14:39 -0700] conn=13 op=1 BIND dn="" method=sasl version=3 mech=GSSAPI
> [27/Apr/2007:05:14:39 -0700] conn=13 op=1 RESULT err=14 tag=97 nentries=0 etime=0, SASL bind in progress
> [27/Apr/2007:05:14:39 -0700] conn=13 op=2 BIND dn="" method=sasl version=3 mech=GSSAPI
> [27/Apr/2007:05:14:39 -0700] conn=13 op=2 RESULT err=0 tag=97 nentries=0 etime=0 dn="cn=sasl testentry 1,ou=testpeople,o=sasl.com"
> [27/Apr/2007:05:14:39 -0700] conn=13 op=-1 fd=64 closed error 11 (Resource temporarily unavailable) - unknown
> [27/Apr/2007:05:14:39 -0700] conn=14 fd=64 slot=64 connection from 10.14.0.25 to 10.14.0.25
> [27/Apr/2007:05:14:39 -0700] conn=14 op=0 BIND dn="cn=directory manager" method=128 version=3
>
> Now we have a better view of this problem. PR_Recv in sasl_recv_connection gets
> EAGAIN (== errno 11), which is returned to connection_read_operation via
> connection_read_ldap_data. The error number is then checked with the macro
> SLAPD_PR_WOULD_BLOCK_ERROR to decide whether it is a temporary non-blocking I/O
> error, as follows:
> # define SLAPD_PR_WOULD_BLOCK_ERROR( prerrno ) \
>         ((prerrno) == PR_WOULD_BLOCK_ERROR || (prerrno) == PR_IO_TIMEOUT_ERROR)
> This EAGAIN is another candidate for a retry, but currently that never happens
> and the connection is closed.
>
> We have another macro SLAPD_SYSTEM_WOULD_BLOCK_ERROR in slap.h:
> /*
>  * SLAPD_SYSTEM_WOULD_BLOCK_ERROR() returns non-zero if syserrno is an OS
>  * error code that indicates a temporary non-blocking I/O error,
>  * e.g., EAGAIN.
>  */
> #define SLAPD_SYSTEM_WOULD_BLOCK_ERROR( syserrno ) \
>         ((syserrno)==EAGAIN || (syserrno)==EWOULDBLOCK)
>
> Can we check the error number with that macro, as well? And if PR_Poll agrees,
> can we retry reading from the connection instead of closing it immediately?
> Index: connection.c
> ===================================================================
> RCS file: /cvs/dirsec/ldapserver/ldap/servers/slapd/connection.c,v
> retrieving revision 1.14
> diff -t -w -U4 -r1.14 connection.c
> --- connection.c        27 Feb 2007 02:57:29 -0000      1.14
> +++ connection.c        27 Apr 2007 17:03:00 -0000
> @@ -1744,9 +1744,10 @@
>                  return CONN_DONE;
>              }
>              /* err = PR_GetError(); */
>              /* If we would block, we need to poll for a while */
> -            if ( SLAPD_PR_WOULD_BLOCK_ERROR( err ) ) {
> +            if ( SLAPD_PR_WOULD_BLOCK_ERROR( err ) ||
> +                 SLAPD_SYSTEM_WOULD_BLOCK_ERROR( err ) ) {
>                  struct PRPollDesc pr_pd;
>                  PRIntervalTime timeout = PR_MillisecondsToInterval(CONN_TURBO_TIMEOUT_INTERVAL);
>                  pr_pd.fd = (PRFileDesc *)conn->c_prfd;
>                  pr_pd.in_flags = PR_POLL_READ;

Ok.
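As a side note, the poll-and-retry step the diff relies on looks roughly like this (a sketch only; the helper name and timeout handling are assumptions, not the actual connection.c code):

    #include <prio.h>

    /* Sketch: wait up to timeout_ms for the descriptor to become
     * readable after a would-block error, then let the caller retry
     * the read.  Returns non-zero if data is ready. */
    static int
    wait_until_readable(PRFileDesc *fd, PRUint32 timeout_ms)
    {
        PRPollDesc pd;
        pd.fd = fd;
        pd.in_flags = PR_POLL_READ;
        pd.out_flags = 0;
        /* PR_Poll returns the number of ready descriptors, 0 on
         * timeout, or -1 on failure. */
        return PR_Poll(&pd, 1, PR_MillisecondsToInterval(timeout_ms)) > 0 &&
               (pd.out_flags & PR_POLL_READ);
    }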
Created attachment 153651 [details]
cvs commit message

Reviewed by Rich (thank you!). Checked into HEAD.
Today's acceptance test still shows the error:
[13/May/2007:05:14:30 -0700] conn=13 op=-1 fd=65 closed error 11 (Resource temporarily unavailable) - unknown

The proposal in comment #7 was not good enough to catch EAGAIN: SLAPD_SYSTEM_WOULD_BLOCK_ERROR was being passed the NSPR error code. I should have checked the system (OS) error instead.

Index: connection.c
===================================================================
RCS file: /cvs/dirsec/ldapserver/ldap/servers/slapd/connection.c,v
retrieving revision 1.15
diff -t -w -U4 -r1.15 connection.c
--- connection.c        27 Apr 2007 18:00:21 -0000      1.15
+++ connection.c        14 May 2007 21:09:32 -0000
@@ -1744,10 +1744,11 @@
                 return CONN_DONE;
             }
             /* err = PR_GetError(); */
             /* If we would block, we need to poll for a while */
+            syserr = PR_GetOSError();
             if ( SLAPD_PR_WOULD_BLOCK_ERROR( err ) ||
-                 SLAPD_SYSTEM_WOULD_BLOCK_ERROR( err ) ) {
+                 SLAPD_SYSTEM_WOULD_BLOCK_ERROR( syserr ) ) {
                 struct PRPollDesc pr_pd;
                 PRIntervalTime timeout = PR_MillisecondsToInterval(CONN_TURBO_TIMEOUT_INTERVAL);
                 pr_pd.fd = (PRFileDesc *)conn->c_prfd;
                 pr_pd.in_flags = PR_POLL_READ;
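The root of the mix-up is that NSPR keeps two error values per thread, and the patch above switches the second check to the right one. A minimal sketch of the corrected test (the macros are copied from slap.h as quoted in comment #7; the wrapper function is hypothetical):

    #include <errno.h>
    #include <prerror.h>

    /* Macros as quoted from slap.h earlier in this bug. */
    #define SLAPD_PR_WOULD_BLOCK_ERROR( prerrno ) \
            ((prerrno) == PR_WOULD_BLOCK_ERROR || (prerrno) == PR_IO_TIMEOUT_ERROR)
    #define SLAPD_SYSTEM_WOULD_BLOCK_ERROR( syserrno ) \
            ((syserrno) == EAGAIN || (syserrno) == EWOULDBLOCK)

    /* PR_GetError() returns NSPR's cross-platform code (e.g.
     * PR_WOULD_BLOCK_ERROR); PR_GetOSError() returns the raw OS errno
     * (e.g. EAGAIN == 11 on Linux).  A layered I/O path such as the
     * SASL one may surface EAGAIN only through the OS error, so each
     * macro must be fed the matching value. */
    static int
    read_would_block(void)
    {
        PRErrorCode err    = PR_GetError();
        PRInt32     syserr = PR_GetOSError();
        return SLAPD_PR_WOULD_BLOCK_ERROR(err) ||
               SLAPD_SYSTEM_WOULD_BLOCK_ERROR(syserr);
    }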
(In reply to comment #9)
> Index: connection.c
> ===================================================================
> RCS file: /cvs/dirsec/ldapserver/ldap/servers/slapd/connection.c,v
> retrieving revision 1.15
> diff -t -w -U4 -r1.15 connection.c
> --- connection.c        27 Apr 2007 18:00:21 -0000      1.15
> +++ connection.c        14 May 2007 21:09:32 -0000
> @@ -1744,10 +1744,11 @@
>                  return CONN_DONE;
>              }
>              /* err = PR_GetError(); */
>              /* If we would block, we need to poll for a while */
> +            syserr = PR_GetOSError();
>              if ( SLAPD_PR_WOULD_BLOCK_ERROR( err ) ||
> -                 SLAPD_SYSTEM_WOULD_BLOCK_ERROR( err ) ) {
> +                 SLAPD_SYSTEM_WOULD_BLOCK_ERROR( syserr ) ) {
>                  struct PRPollDesc pr_pd;
>                  PRIntervalTime timeout = PR_MillisecondsToInterval(CONN_TURBO_TIMEOUT_INTERVAL);
>                  pr_pd.fd = (PRFileDesc *)conn->c_prfd;
>                  pr_pd.in_flags = PR_POLL_READ;

Ok.
Created attachment 154692 [details]
cvs commit message

Reviewed by Rich (thank you!). Checked into HEAD.