Bug 237731

Summary: Random SASL GSSAPI test failure on shadowfoot
Product: [Retired] 389 Reporter: Noriko Hosoi <nhosoi>
Component: Security - SASLAssignee: Noriko Hosoi <nhosoi>
Status: CLOSED CURRENTRELEASE QA Contact: Viktor Ashirov <vashirov>
Severity: medium Docs Contact:
Priority: medium    
Version: 1.0.4CC: nkinder, rmeggins
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-12-07 16:51:53 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 152373, 240316, 427409    
Attachments:
Description Flags
cvs diff sasl_io.c and the fix description
none
cvs commit message
none
cvs commit message
none
cvs commit message none

Description Noriko Hosoi 2007-04-25 01:22:11 UTC
Description of problem:
SASL GSSAPI test sometimes fails on shadowfoot.
No other platforms report the failure.
Shadowfoot is only a host which runs 32-bit OS (RHEL4) and is located in colo.

This is the success case result: bind with mech=GSSAPI is finished with err=0
and adding dn="cn=Sasl TestEntry 3,ou=TestPeople2,o=sasl.com" follows to it.

    [...] conn=13 fd=64 slot=64 connection from 10.14.0.25 to 10.14.0.25
    [...] conn=13 op=0 BIND dn="" method=sasl version=3 mech=GSSAPI
    [...] conn=13 op=0 RESULT err=14 tag=97 nentries=0 etime=0, SASL
    bind in progress
    [...] conn=13 op=1 BIND dn="" method=sasl version=3 mech=GSSAPI
    [...] conn=13 op=1 RESULT err=14 tag=97 nentries=0 etime=0, SASL
    bind in progress
    [...] conn=13 op=2 BIND dn="" method=sasl version=3 mech=GSSAPI
    [...] conn=13 op=2 RESULT err=0 tag=97 nentries=0 etime=0
    dn="cn=sasl testentry 1,ou=testpeople,o=sasl.com"
    [...] conn=13 op=3 ADD dn="cn=Sasl TestEntry
    3,ou=TestPeople2,o=sasl.com"
    [...] conn=13 op=3 RESULT err=0 tag=105 nentries=0 etime=0
    [...] conn=13 op=4 UNBIND

This is the failure case result: bind was successfully done, but before
receiving add request, connection is closed with B1 == SLAPD_DISCONNECT_BAD_BER_TAG.

    [...] conn=13 fd=64 slot=64 connection from 10.14.0.25 to 10.14.0.25
    [...] conn=13 op=0 BIND dn="" method=sasl version=3 mech=GSSAPI
    [...] conn=13 op=0 RESULT err=14 tag=97 nentries=0 etime=0, SASL
    bind in progress
    [...] conn=13 op=1 BIND dn="" method=sasl version=3 mech=GSSAPI
    [...] conn=13 op=1 RESULT err=14 tag=97 nentries=0 etime=0, SASL
    bind in progress
    [...] conn=13 op=2 BIND dn="" method=sasl version=3 mech=GSSAPI
    [...] conn=13 op=2 RESULT err=0 tag=97 nentries=0 etime=0
    dn="cn=sasl testentry 1,ou=testpeople,o=sasl.com"
    [...] conn=13 op=-1 fd=64 closed - B1

On the client side, nsldapi_send_server_request (via ldap_add_ext) sets
LDAP_SERVER_DOWN and returns error.  Since both sides are very terse (if it's a
serious error, it'd have been reported somehow...), I searched most suspicous
quiet failure candidates, which could be one of the PR_Recv call with
PR_INTERVAL_NO_WAIT... (one in connection.c and 2 in sasl_io.c)  E.g.,

        ret = PR_Recv(c->c_prfd,sp->encrypted_buffer +
 sp->encrypted_buffer_offset,bytes_remaining_to_read,0,*P**R_INTERVAL_NO_WAIT*);
        if (ret < 0) {
            *err = PR_GetError();
            return -1;
        }

But both side does not return -1, but 0.  That is, both are thinking "no
messages are available to be received and the peer has performed an orderly
shutdown."

Comment 1 Noriko Hosoi 2007-04-25 01:29:03 UTC
Created attachment 153394 [details]
cvs diff sasl_io.c and the fix description

Comment 2 Rich Megginson 2007-04-25 01:58:10 UTC
(In reply to comment #1)
> Created an attachment (id=153394) [edit]
> cvs diff sasl_io.c and the fix description

It looks ok, but how do we handle this condition in other code?  SSL code or
even regular code may have the same problem, or they solve it in a different way?


Comment 3 Noriko Hosoi 2007-04-25 02:08:10 UTC
(In reply to comment #2)
> (In reply to comment #1)
> > Created an attachment (id=153394) [edit] [edit]
> > cvs diff sasl_io.c and the fix description
> 
> It looks ok, but how do we handle this condition in other code?  SSL code or
> even regular code may have the same problem, or they solve it in a different way?

Regular code does not have this problem.  When connection_read_ldap_data returns
0 (returned from PR_Recv), it really means the connection was closed by peer. 
Not like this SASL case: the connection is okay and a packet was received, but
more packet(s) is expected to decrypt the entire message.  I'll double-check the
SSL code next (but I think the chance that is sharing the problem is very slim...).

Comment 4 Rich Megginson 2007-04-25 03:03:14 UTC
(In reply to comment #3)
> Regular code does not have this problem.  When connection_read_ldap_data returns
> 0 (returned from PR_Recv), it really means the connection was closed by peer. 
> Not like this SASL case: the connection is okay and a packet was received, but
> more packet(s) is expected to decrypt the entire message.  I'll double-check the
> SSL code next (but I think the chance that is sharing the problem is very
slim...).

Ok.  Then I think your proposal is fine.

Comment 5 Nathan Kinder 2007-04-25 16:42:18 UTC
I also think your proposal is fine.

Comment 6 Noriko Hosoi 2007-04-25 20:25:52 UTC
Created attachment 153451 [details]
cvs commit message

Thank you, Rich and Nathan, for the comments and the discussion.

Checked in into HEAD.

Comment 7 Noriko Hosoi 2007-04-27 17:55:00 UTC
> It failed again, but it's changed the behavior!
> [27/Apr/2007:05:14:39 -0700] conn=13 fd=64 slot=64 connection from 10.14.0.25
to 10.14.0.25
> [27/Apr/2007:05:14:39 -0700] conn=13 op=0 BIND dn="" method=sasl version=3
mech=GSSAPI
> [27/Apr/2007:05:14:39 -0700] conn=13 op=0 RESULT err=14 tag=97 nentries=0
etime=0, SASL bind in progress
> [27/Apr/2007:05:14:39 -0700] conn=13 op=1 BIND dn="" method=sasl version=3
mech=GSSAPI
> [27/Apr/2007:05:14:39 -0700] conn=13 op=1 RESULT err=14 tag=97 nentries=0
etime=0, SASL bind in progress
> [27/Apr/2007:05:14:39 -0700] conn=13 op=2 BIND dn="" method=sasl version=3
mech=GSSAPI
> [27/Apr/2007:05:14:39 -0700] conn=13 op=2 RESULT err=0 tag=97 nentries=0
etime=0 dn="cn=sasl testentry 1,ou=testpeople,o=sasl.com"
> *[27/Apr/2007:05:14:39 -0700] conn=13 op=-1 fd=64 closed error 11 (Resource
temporarily unavailable) - unknown*
> [27/Apr/2007:05:14:39 -0700] conn=14 fd=64 slot=64 connection from 10.14.0.25
to 10.14.0.25
> [27/Apr/2007:05:14:39 -0700] conn=14 op=0 BIND dn="cn=directory manager"
method=128 version=3
>
> Now, we have the better sight on this problem.  PR_Recv in
sasl_recv_connection gets EAGAIN (== errno 11), which is returned to
connection_read_operation via connection_read_ldap_data.  And the error number
is checked in the macro SLAPD_PR_WOULD_BLOCK_ERROR if it's a temporary
non-blocking I/O error or not as follows.
>     # define SLAPD_PR_WOULD_BLOCK_ERROR( prerrno )   \
>                   ((prerrno) == PR_WOULD_BLOCK_ERROR || (prerrno) ==
PR_IO_TIMEOUT_ERROR)
> This EAGAIN is another candidate to be retried, but for now it's never been
done and the connection is closed.
>
> We have another macro SLAPD_SYSTEM_WOULD_BLOCK_ERROR in slap.h.
> /*
>  * SLAPD_SYSTEM_WOULD_BLOCK_ERROR() returns non-zero if syserrno is an OS
>  *  error code that indicates a temporary non-blocking I/O error,
>  *  e.g., EAGAIN.
>  */
> #define SLAPD_SYSTEM_WOULD_BLOCK_ERROR( syserrno )  \
>     ((syserrno)==EAGAIN || (syserrno)==EWOULDBLOCK)
>
> Can we check the error number with the macro, as well?  And if PR_Poll agrees,
can we retry reading from the connection instead of closing it immediately?
> Index: connection.c
> ===================================================================
> RCS file: /cvs/dirsec/ldapserver/ldap/servers/slapd/connection.c,v
> retrieving revision 1.14
> diff -t -w -U4 -r1.14 connection.c
> --- connection.c        27 Feb 2007 02:57:29 -0000      1.14
> +++ connection.c        27 Apr 2007 17:03:00 -0000
> @@ -1744,9 +1744,10 @@
>                                  return CONN_DONE;
>                          }
>                          /* err = PR_GetError(); */
>                          /* If we would block, we need to poll for a while */
> -                        if ( SLAPD_PR_WOULD_BLOCK_ERROR( err ) ) {
> +                        if ( SLAPD_PR_WOULD_BLOCK_ERROR( err ) ||
> +                             SLAPD_SYSTEM_WOULD_BLOCK_ERROR( err ) ) {
>                                  struct PRPollDesc       pr_pd;
>                                  PRIntervalTime  timeout =
PR_MillisecondsToInterval(CONN_TURBO_TIMEOUT_INTERVAL);
>                                  pr_pd.fd = (PRFileDesc *)conn->c_prfd;
>                                  pr_pd.in_flags = PR_POLL_READ;
Ok. 

Comment 8 Noriko Hosoi 2007-04-27 18:01:13 UTC
Created attachment 153651 [details]
cvs commit message

Reviewed by Rich (Thank you!)

Checked in into HEAD.

Comment 9 Noriko Hosoi 2007-05-14 21:24:41 UTC
Today's acceptance test still shows the error:
[13/May/2007:05:14:30 -0700] conn=13 op=-1 fd=65 closed error 11 (Resource
temporarily unavailable) - unknown

The proposal in the comment #7 was not good enough to catch EAGAIN.  I should
have checked system error.

Index: connection.c
===================================================================
RCS file: /cvs/dirsec/ldapserver/ldap/servers/slapd/connection.c,v
retrieving revision 1.15
diff -t -w -U4 -r1.15 connection.c
--- connection.c        27 Apr 2007 18:00:21 -0000      1.15
+++ connection.c        14 May 2007 21:09:32 -0000
@@ -1744,10 +1744,11 @@
                                 return CONN_DONE;
                         }
                         /* err = PR_GetError(); */
                         /* If we would block, we need to poll for a while */
+                        syserr = PR_GetOSError();
                         if ( SLAPD_PR_WOULD_BLOCK_ERROR( err ) ||
-                             SLAPD_SYSTEM_WOULD_BLOCK_ERROR( err ) ) {
+                             SLAPD_SYSTEM_WOULD_BLOCK_ERROR( syserr ) ) {
                                 struct PRPollDesc       pr_pd;
                                 PRIntervalTime  timeout =
PR_MillisecondsToInterval(CONN_TURBO_TIMEOUT_INTERVAL);
                                 pr_pd.fd = (PRFileDesc *)conn->c_prfd;
                                 pr_pd.in_flags = PR_POLL_READ;


Comment 10 Rich Megginson 2007-05-14 21:43:26 UTC
(In reply to comment #9)
> Index: connection.c
> ===================================================================
> RCS file: /cvs/dirsec/ldapserver/ldap/servers/slapd/connection.c,v
> retrieving revision 1.15
> diff -t -w -U4 -r1.15 connection.c
> --- connection.c        27 Apr 2007 18:00:21 -0000      1.15
> +++ connection.c        14 May 2007 21:09:32 -0000
> @@ -1744,10 +1744,11 @@
>                                  return CONN_DONE;
>                          }
>                          /* err = PR_GetError(); */
>                          /* If we would block, we need to poll for a while */
> +                        syserr = PR_GetOSError();
>                          if ( SLAPD_PR_WOULD_BLOCK_ERROR( err ) ||
> -                             SLAPD_SYSTEM_WOULD_BLOCK_ERROR( err ) ) {
> +                             SLAPD_SYSTEM_WOULD_BLOCK_ERROR( syserr ) ) {
>                                  struct PRPollDesc       pr_pd;
>                                  PRIntervalTime  timeout =
> PR_MillisecondsToInterval(CONN_TURBO_TIMEOUT_INTERVAL);
>                                  pr_pd.fd = (PRFileDesc *)conn->c_prfd;
>                                  pr_pd.in_flags = PR_POLL_READ;
> 

Ok.

Comment 11 Noriko Hosoi 2007-05-14 22:06:08 UTC
Created attachment 154692 [details]
cvs commit message

Reviewed by Rich (Thank you!)
Checked in into HEAD.