Bug 458712

Summary: RH 5.2 - SCTP Messages out of order
Product: Red Hat Enterprise Linux 5
Reporter: William Reich <reich>
Component: kernel
Assignee: Neil Horman <nhorman>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high
Docs Contact:
Priority: medium
Version: 5.2
CC: nhorman
Target Milestone: rc   
Target Release: ---   
Hardware: i386   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2009-06-15 13:13:37 UTC
Attachments (description, flags):
- code that performs BIND operation (flags: none)
- tcpdump output files ( gunzip, then tar xvf ... ) (flags: none)

Description William Reich 2008-08-11 19:36:04 UTC
Description of problem:
Using RH5.2 on a 32bit platform,
I have a user space application trying to use the
SCTP interface.
I have 4 networks within a single association to another machine.
( That machine is using the exact same version of RH & SCTP ).

By default, SCTP uses only a single LAN to carry the traffic,
with the other LANs acting as hot spares.

However, I am trying to load balance the traffic instead of the default behavior. 
I'm allowed to do this by selecting the network that I want
the traffic to go over.

When there is a light traffic load, the messages flow
from one machine to the other with no trouble.

However, when the message load between the two machines increases,
I am getting messages out of sequence.
Since the messages are out of sequence, my application detects a missing sequence number and fails.

SCTP is supposed to guarantee the message sequence.

This problem is also observed on RH4 Update 4.






Version-Release number of selected component (if applicable):
lksctp-tools-devel-1.0.6-1.el5.1
lksctp-tools-1.0.6-1.el5.1
lksctp-tools-doc-1.0.6-1.el5.1
Linux yyy 2.6.18-92.1.10.el5PAE #1 SMP Wed Jul 23 04:10:43 EDT 2008 i686


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:
Messages are out of order.

Expected results:
Messages should be delivered to the user application in order, regardless of the
network each message traveled on.
Solaris 10 has this behavior.

Additional info:
This is some user space code that is used to force the messages to a specific network:
( the value of "peer_index" rotates from 0 thru 3 and then repeats )...

++++++++++++++
process_SCTP_SENDTO_OP(sctp_link_t   *plink,
                       sctp_header_t *hdr,
                       int           ctrlen,
                       void          *data,
                       int           datalen)
{
        sctp_send_t      *send_request = (sctp_send_t *) hdr;
        sctp_ip_addr_t   *addr;
        struct sockaddr  *to = NULL;
        socklen_t        tolen = 0;
        int              error = EINVAL;
        const sctp_send_options_t *options = &gsctp_DefaultSendOptions;

        /* sendmsg() structures */
        struct msghdr          outmsg;
        struct iovec           iov;
        char                   outcmsg[CMSG_SPACE(
                                    sizeof(struct sctp_sndrcvinfo))];
        struct cmsghdr         *cmsg;
        struct sctp_sndrcvinfo *sinfo;

        options = &send_request->options;

        if( (send_request->peer_index > -1) &&
            (send_request->peer_index < plink->peer_addr.ip_addr_count) )
        {
                /*
                 * Peer address was specified
                 */
                addr = &plink->peer_addr.ip_addr[send_request->peer_index];

                if(addr->ip_version == 4)
                {
                        struct sockaddr_in in4;
                        memset(&in4, 0, sizeof(in4) );
                        in4.sin_family = AF_INET;
                        in4.sin_port = plink->peer_addr.port;
                        in4.sin_addr = addr->addr.ipv4;
                        tolen = sizeof(in4);
                        to = (struct sockaddr *) &in4;
                }
                else /* IPv6 */
                {
                        struct sockaddr_in6 in6;
                        memset(&in6, 0, sizeof(in6) );
                        in6.sin6_family = AF_INET6;
                        in6.sin6_port = plink->peer_addr.port;
                        in6.sin6_addr = addr->addr.ipv6;
                        tolen = sizeof(in6);
                        to = (struct sockaddr *) &in6;
                }
        }

        outmsg.msg_name = to;
        outmsg.msg_namelen = tolen;
        outmsg.msg_iov = &iov;
        iov.iov_base = (void *)data;
        iov.iov_len = datalen;
        outmsg.msg_iovlen = 1;

        outmsg.msg_control = outcmsg;
        outmsg.msg_controllen = sizeof(outcmsg);
        outmsg.msg_flags = 0;

        cmsg = CMSG_FIRSTHDR(&outmsg);
        cmsg->cmsg_level = IPPROTO_SCTP;
        cmsg->cmsg_type = SCTP_SNDRCV;
        cmsg->cmsg_len = CMSG_LEN(sizeof(struct sctp_sndrcvinfo));

        outmsg.msg_controllen = cmsg->cmsg_len;
        sinfo = (struct sctp_sndrcvinfo *)CMSG_DATA(cmsg);

        /*
         * Populate sinfo structure based on ULCM SCTP options
         */
        if ( sctp_set_sinfo_options(plink, sinfo, options ) == -1 )
        {
                goto send_failed;
        }

        if (sendmsg(plink->native_sctp_fd, &outmsg, 0) == -1)
        {
                error = errno;
        etc etc etc...
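
Note on the listing above: `to` is assigned the address of `in4` or `in6`, which are declared inside the inner blocks, and is then used after those blocks have closed. The following is only a minimal sketch, under assumed names, of one way to keep the destination address in storage that lives until the sendmsg() call, together with the round-robin index selection. The helpers `next_peer_index` and `fill_peer_v4` are hypothetical and are not part of the application or of lksctp-tools, and nothing here is claimed to be the cause of the reordering.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Hypothetical helper: return the peer index to use for this send and
 * advance the round-robin cursor. The cursor is kept in [0, peer_count),
 * so it cannot overflow the way an ever-growing counter could. */
static unsigned next_peer_index(unsigned *cursor, unsigned peer_count)
{
    assert(cursor != NULL && peer_count > 0);
    unsigned idx = *cursor % peer_count;
    *cursor = (idx + 1) % peer_count;
    return idx;
}

/* Hypothetical helper: build an IPv4 destination in caller-owned storage.
 * port_be must already be in network byte order.
 * Returns the length to place in msg_namelen. */
static socklen_t fill_peer_v4(struct sockaddr_storage *out,
                              struct in_addr addr, uint16_t port_be)
{
    struct sockaddr_in *in4 = (struct sockaddr_in *)out;

    assert(out != NULL);
    memset(out, 0, sizeof(*out));
    in4->sin_family = AF_INET;
    in4->sin_port = port_be;
    in4->sin_addr = addr;
    return (socklen_t)sizeof(*in4);
}
```

In the sending function, a `struct sockaddr_storage` declared at function scope would be passed to `fill_peer_v4`, and `msg_name` would point at that variable, so the address remains valid for the whole call.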

Comment 1 Zdenek Prikryl 2008-08-14 12:00:26 UTC
Hello,
Could you attach the complete source code for a simple server and client? I'd like to see how you initialize the multi-homed connection. Thanks.

Comment 2 Zdenek Prikryl 2008-08-14 12:07:39 UTC
I'm not sure that you can do load balancing the way you do. The documentation states, among other things, that a primary address/path must be chosen and that the other addresses/paths will only be used if the primary fails or is suspect.

Comment 3 William Reich 2008-08-14 12:16:14 UTC

this style of load balancing works fine in Solaris 10
after Sol 10 patch 137111-02 is applied to the machine.

Comment 4 William Reich 2008-08-14 12:19:02 UTC
The Solaris 10 case number was " Sun Case 65871261 "
which became patch 137111-02.

Comment 5 William Reich 2008-08-14 12:25:06 UTC
in reply to #2, the user is allowed to override the default/primary network selection as long as no network failures exist.
If a network failure exists, then I agree that SCTP will ignore the user specified network.

Comment 6 William Reich 2008-08-14 14:02:36 UTC
Created attachment 314318 [details]
code that performs BIND operation

initialization of multi-home connection - same code for listen or connect

Comment 7 Zdenek Prikryl 2008-08-19 11:45:20 UTC
According to the Sun patch, it seems that this is a kernel issue, so I'm reassigning the bug to kernel.

Comment 8 Neil Horman 2008-08-19 13:56:09 UTC
First question, just to make sure we don't miss anything obvious: what are the values in gsctp_DefaultSendOptions, and what does sctp_set_sinfo_options look like?  I'd like to be certain that we aren't inadvertently setting SCTP_UNORDERED in the sinfo_flags field.  That would certainly explain unordered delivery, as would some munging of the stream id.

Comment 9 William Reich 2008-08-19 14:47:42 UTC

notice that gsctp_DefaultSendOptions is not actually used.
The "options" pointer gets a new value of "&send_request->options", which
is data from the incoming message.

This header is initialized to all zeros except for the 
peer_index ( which will have values rotating from 0 thru 3 and over again ).
A message from my application to SCTP never contains SCTP_UNORDERED in the sinfo_flags field.

( requested details below )

++++++++++++++++++


typedef struct
{
 sctp_header_t header;
 sctp_send_options_t options;
 int32_t peer_index;
}
sctp_send_t;



typedef struct sctp_send_options_s
{
 uint8_t version;
 uint16_t stream;
 uint16_t flags;
 uint32_t protocol;
 uint32_t context;
} sctp_send_options_t;


typedef struct sctp_header_s
{
 uint32_t op;
 uint32_t error;
 uint32_t wait_tag;
}
sctp_header_t;

++++++++++++++++++++++++++

application sets the fields as
.
 .
  .

                cntl = ( sctp_send_t * ) cntl_mp->b_rptr;
                cntl->header.op = SCTP_SENDTO_OP;
                cntl->options.version = 0;
                cntl->options.stream = 0;
                cntl->options.flags = 0;  /*   <====      */
                cntl->options.protocol = 0;
                cntl->options.context = 0;

                cntl->peer_index =
                       mr->peer_index++ % mr->peer_addr.ip_addr_count;
.
 .
  .


++++++++++++++++++++++++++

and then

inline static int
sctp_set_sinfo_options(sctp_link_t               *plink,
                       struct sctp_sndrcvinfo    *sinfo,
                       const sctp_send_options_t *options)
{
        int prev_no_delay, err;

        memset(sinfo, 0, sizeof(struct sctp_sndrcvinfo) );
        sinfo->sinfo_ppid    = htonl(options->protocol);
        sinfo->sinfo_stream  = options->stream;
        sinfo->sinfo_context = options->context;

        /*
         * Map ULCM SCTP send options to LKSCTP send options.
         */
        if ( options->flags & SCTP_UNORDERED )
        {
                sinfo->sinfo_flags |= MSG_UNORDERED;
        }

        /*
         * Calling setsockopt() for each data message could
         * consume too much CPU time, therefore this is
         * only done when a new option is different from
         * the previous.
         *
         * WARNING:
         * SCTP_NODELAY, Section 7.1.5 SCTP Sockets API IETF draft,
         * globally disables message bundling, where as ULCM SCTP
         * does it on a per message basis.  This approximation
         * is good enough to disable message bundling.
         */
        prev_no_delay   = plink->no_delay;
        plink->no_delay = (options->flags & SCTP_NOBUNDLE)? 1: 0;
        if ( prev_no_delay != plink->no_delay)
        {
                if ( setsockopt(plink->native_sctp_fd,
                                IPPROTO_SCTP,
                                SCTP_NODELAY,
                                (void *)&plink->no_delay,
                                sizeof(plink->no_delay)) == -1 )
                {
                        err = errno;
                        ULCM_WARN("[id=%d] setsockopt(SCTP_NODELAY) failed, errno = %d",
                                        plink->id, err );
                        errno = err;
                        return -1;
                }
        }
        return 0;


 } /* end sctp_set_sinfo_options */

Comment 10 Neil Horman 2008-08-19 15:13:31 UTC
ok, do you have tcpdumps that show this ordering problem?  That would help  diagnose this problem for me.  Thanks!

Comment 11 William Reich 2008-08-19 15:36:15 UTC
I do not have tcpdumps.
All I have is my application complaining about
getting messages out of order.

I have the same problem whether the number of networks is 4 or 2.

Comment 12 Neil Horman 2008-08-20 11:51:55 UTC
Please take the tcpdump captures as we discussed via email and attach them here if you would.  Thanks!

Comment 13 William Reich 2008-08-20 12:10:11 UTC
bummer...
My first attempts at running tcpdump are not fruitful.
Each time I run tcpdump ( one on sender and one on receiver ), the problem
does not occur.
When I do not use tcpdump, the problem appears.

I am using the command
/usr/sbin/tcpdump -i any -s 0 -w <file>
on each machine.

I'll try varying the number of networks to see if I can get
tcpdump to capture a failure...
( my first attempt was with 2 networks in the association. )

Comment 14 William Reich 2008-08-20 12:31:38 UTC
ok - got lucky...
data files coming...

Comment 15 William Reich 2008-08-20 12:39:11 UTC
Created attachment 314625 [details]
tcpdump output files ( gunzip, then tar xvf ... )

This gzip'd tar file contains the
output of tcpdump on 2 machines
The time of the failure was between 8:30:36 and 8:30:46.
The failure was reported by the receiver.

Comment 16 William Reich 2008-08-20 12:44:46 UTC
additional information for attachment 314625 [details] in comment 15:

/etc/hosts file looks like this:

172.25.2.202	alderaan	alderaan.ulticom.com
172.25.2.200	endor	endor.ulticom.com

10.2.202.1      alderaan alderaan-a
10.2.202.129    alderaan alderaan-b
10.2.201.1      alderaan alderaan-c
10.2.201.129    alderaan alderaan-d

10.2.202.3      endor endor-a 
10.2.202.131    endor endor-b
10.2.201.3      endor  endor-c
10.2.201.131    endor endor-d

++++++++++

networks used are  " -c, -d , -a, and -b" on each machine.

The sender is alderaan
The receiver is endor

Based on my understanding,
alderaan will think that his primary network is "-c".
endor will think that his primary network is "-d"

Traffic was flowing on all 4 ( -c, -d, -a, -b ) networks at the 
time of the error.
endor ( recv1 file ) is the receiver. endor reported the error.

The 172 addresses were used for xterm connections only.

Comment 17 Neil Horman 2008-08-20 15:34:44 UTC
I've checked the receive dump and noted that we capture all sctp chunks in order (according to TSN value), so whatever the problem is, it's not happening on the wire or at the sender.  How is your application determining that frames are not in the proper order?  Is it checking the stream id and stream sequence number?
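
For reference, a minimal sketch of the kind of TSN order check described above, assuming the TSN values have already been extracted from the capture into an array in arrival order. The function names are hypothetical. The comparison uses a signed 32-bit difference so that a wrap from 4294967295 to 0 still counts as "in order"; it relies on the usual two's-complement conversion behavior of common compilers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Serial-number comparison: nonzero if TSN a precedes TSN b,
 * allowing for 32-bit wraparound. */
static int tsn_before(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

/* Scan TSNs in arrival order. Returns the index of the first entry that
 * does not come after its predecessor, or n if the whole list is in order. */
static size_t first_out_of_order(const uint32_t *tsn, size_t n)
{
    assert(tsn != NULL || n == 0);
    for (size_t i = 1; i < n; i++) {
        if (!tsn_before(tsn[i - 1], tsn[i]))
            return i;
    }
    return n;
}
```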

Comment 18 William Reich 2008-08-20 16:21:37 UTC
in answer to comment 17,
I have two user space applications ( one on each machine ).
The first ( sender ) is sending a series of messages to the
second ( receiver ). The applications do not even know that
they are on different machines. The applications have no idea
that SCTP is being used.
Within each message is a sequence count.
When the receiver gets a message out-of-order, it complains.

So, the sender is sending 30 messages numbered 1..30.
The receiver is expecting to get the messages 1,2,3,... 30.
However, in the test corresponding to the tcpdump data,
message 26 arrived at the receiver before message 25.

The application that is complaining has no knowledge of
stream id and stream sequence number.

Comment 19 Neil Horman 2008-08-20 16:52:27 UTC
Then I would augment the sending and receiving applications to check the stream id and stream sequence numbers.  Alternatively, I would try to reproduce the problem with the sctp_darn utility or some other piece of code that I can use as a reproducer here.  I checked the tcpdumps again and both the TSN values and stream sequence numbers are all sequential and in order at the receiver.  Either the application is processing the received frames out of order or something in the stack is re-ordering them (although I think you would be the first to see that).

I would try to run the same sctp connection using sctp_darn.  You can generate input data using seq to simulate the payload sequence numbers you describe above.

Alternatively you can change the recvmsg call in your receiver to be sctp_recvmsg and interrogate the sctp_sndrcvinfo structure to validate that the messages are being read out of the kernel out of order.
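
A minimal sketch of the receiver-side check suggested above, assuming the receiver passes in the `sinfo_stream` and `sinfo_ssn` values that sctp_recvmsg() returns in `struct sctp_sndrcvinfo` (on Linux this typically also requires enabling `sctp_data_io_event` through the SCTP_EVENTS socket option). The tracker type, the function names, and the `MAX_STREAMS` limit are hypothetical; the tracking itself is plain C so it can be exercised without an SCTP socket.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_STREAMS 16   /* assumption: a small, fixed number of streams */

/* Hypothetical per-association tracker; not part of lksctp-tools. */
struct ssn_tracker {
    uint16_t expected[MAX_STREAMS]; /* next SSN expected on each stream */
    uint8_t  seen[MAX_STREAMS];     /* 0 until the first message arrives */
};

static void ssn_tracker_init(struct ssn_tracker *t)
{
    assert(t != NULL);
    memset(t, 0, sizeof(*t));
}

/* Feed the stream id and stream sequence number of each message in the
 * order the receive call returned them.
 * Returns 1 if the message is the next one expected on its stream,
 * 0 if it is not, and -1 if the stream id is out of range. */
static int ssn_tracker_check(struct ssn_tracker *t,
                             uint16_t stream, uint16_t ssn)
{
    assert(t != NULL);
    if (stream >= MAX_STREAMS)
        return -1;

    int ok = !t->seen[stream] || ssn == t->expected[stream];

    t->seen[stream] = 1;
    t->expected[stream] = (uint16_t)(ssn + 1); /* wraps 65535 -> 0 */
    return ok;
}
```

If this check reports a mismatch while the capture shows the same stream sequence numbers arriving in order on the wire, that would point at reordering between the stack and the application rather than on the network.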

Comment 20 William Reich 2008-08-20 17:20:39 UTC
I do not see any options in sctp_darn to load share over all the networks,
which is the key point in the test.
Is there any source code around for sctp_darn so that a modification
could be made in this area?

+++++++
The testing that I am doing mirrors the test that I performed
on Sol 10 sparc. ( comment #3 & #4 ) So, I feel confident that the application is not part of the trouble.

Comment 21 Neil Horman 2008-08-20 17:36:32 UTC
"is there any source code around for sctp_darn"

Yes, sctp_darn is part of the lksctp-tools package, you can download the sources here:
ftp://ftp.redhat.com/pub/redhat/linux

Alternatively, you can just write some throwaway code based on your application to simply do sending and receiving of some sequential data using your already existing load balancing code.  That way, if you can reproduce the problem, you can send me the code to attempt a reproduction here.

"I feel confident that the application
is not part of the trouble."

I understand why you feel that way, yet the evidence that we have in this bug at the moment contradicts that assertion.  There is obviously still room for discrepancy in that the stack might be re-ordering frames (although I really think it's unlikely).  That's what this test is meant to verify, however.  We have a tcpdump showing a stream of sctp data getting received in sequence order.  If we can show that those same sequence numbers in the sctp chunk headers arrive at the receiving application in a different order, we can conclude that something in the stack re-ordered them.

Comment 22 William Reich 2008-08-20 18:41:46 UTC
I'll try to hack sctp_darn first...

Comment 23 Neil Horman 2009-03-23 10:57:49 UTC
ping, any progress here?

Comment 24 William Reich 2009-03-23 12:29:12 UTC
I have nothing new to report.
I am currently in the middle of a major delivery cycle.
Since this issue is not related to that delivery, I will not have time to perform any investigations on this issue until after the middle of May 2009.

Comment 25 Neil Horman 2009-03-23 14:28:24 UTC
Ok, let us know when you get to it.  Thanks.

Comment 26 Neil Horman 2009-05-13 13:46:22 UTC
ping, as per your comment #24, I just wanted to check in to see if you managed to do any further investigation regarding this bug

Comment 27 Neil Horman 2009-06-15 13:13:37 UTC
Ok, it's been an extra month here, so I'm closing due to inactivity.  Please re-open if/when you get time to test.

Comment 28 William Reich 2009-08-14 12:31:28 UTC
New ticket 517504 has been created to reopen this issue. Detailed
programs that can reproduce the problem are attached to that ticket.