Description of problem:
Using RH5.2 on a 32bit platform,
I have a user space application trying to use the
I have 4 networks within a single association to another machine.
( That machine is using the exact same version of RH & SCTP ).
By default, the SCTP will only use a single LAN to carry the traffic,
with the other LANs being hot spares.
However, I am trying to load balance the traffic instead of the default behavior.
I'm allowed to do this by selecting the network that I want
the traffic to go over.
When there is a light traffic load, the messages flow
from one machine to the other with no trouble.
However, when the message load between the two machines increases,
I am getting messages out of sequence.
Since the messages are out of sequence, my application detects a missing sequence number and fails.
SCTP is supposed to guarantee the message sequence.
This problem is also observed on RH4 Update 4.
Version-Release number of selected component (if applicable):
Linux yyy 2.6.18-92.1.10.el5PAE #1 SMP Wed Jul 23 04:10:43 EDT 2008 i686
Steps to Reproduce:
Actual results:
Messages are out of order.
Expected results:
Messages should arrive in order at the user application regardless of the
network the message traveled on.
Solaris 10 has this behavior.
This is some user space code that is used to force the messages to a specific network:
( the value of "peer_index" rotates from 0 thru 3 and then repeats )...
sctp_send_t *send_request = (sctp_send_t *) hdr;
struct sockaddr *to = NULL;
socklen_t tolen = 0;
int error = EINVAL;
const sctp_send_options_t *options = &gsctp_DefaultSendOptions;
/* sendmsg() structures */
struct msghdr outmsg;
struct iovec iov;
struct cmsghdr *cmsg;
struct sctp_sndrcvinfo *sinfo;
options = &send_request->options;
if( (send_request->peer_index > -1) &&
    (send_request->peer_index < plink->peer_addr.ip_addr_count) )
{
    /* Peer address was specified */
    addr = &plink->peer_addr.ip_addr[send_request->peer_index];
    if(addr->ip_version == 4)
    {
        struct sockaddr_in in4;
        memset(&in4, 0, sizeof(in4) );
        in4.sin_family = AF_INET;
        in4.sin_port = plink->peer_addr.port;
        in4.sin_addr = addr->addr.ipv4;
        tolen = sizeof(in4);
        to = (struct sockaddr *) &in4;
    }
    else /* IPv6 */
    {
        struct sockaddr_in6 in6;
        memset(&in6, 0, sizeof(in6) );
        in6.sin6_family = AF_INET6;
        in6.sin6_port = plink->peer_addr.port;
        in6.sin6_addr = addr->addr.ipv6;
        tolen = sizeof(in6);
        to = (struct sockaddr *) &in6;
    }
}
outmsg.msg_name = to;
outmsg.msg_namelen = tolen;
outmsg.msg_iov = &iov;
iov.iov_base = (void *)data;
iov.iov_len = datalen;
outmsg.msg_iovlen = 1;
outmsg.msg_control = outcmsg;
outmsg.msg_controllen = sizeof(outcmsg);
outmsg.msg_flags = 0;
cmsg = CMSG_FIRSTHDR(&outmsg);
cmsg->cmsg_level = IPPROTO_SCTP;
cmsg->cmsg_type = SCTP_SNDRCV;
cmsg->cmsg_len = CMSG_LEN(sizeof(struct sctp_sndrcvinfo));
outmsg.msg_controllen = cmsg->cmsg_len;
sinfo = (struct sctp_sndrcvinfo *)CMSG_DATA(cmsg);
/* Populate sinfo structure based on ULCM SCTP options */
if ( sctp_set_sinfo_options(plink, sinfo, options ) == -1 )
if (sendmsg(plink->native_sctp_fd, &outmsg, 0) == -1)
error = errno;
etc etc etc...
Could you attach the complete source code for a simple server and client? I'd like to see how you initialize the multi-homed connection. Thanks.
I'm not sure that you can do load balancing the way you are doing it. The documentation states, among other things, that a primary address/path must be chosen and that the other addresses/paths will only be used if the primary fails or is suspect.
this style of load balancing works fine in Solaris 10
after Sol 10 patch 137111-02 is applied to the machine.
The Solaris 10 case number was " Sun Case 65871261 "
which became patch 137111-02.
in reply to #2, the user is allowed to override the default/primary network selection as long as no network failures exist.
If a network failure exists, then I agree that SCTP will ignore the user specified network.
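The rotation described in this thread (peer_index cycling through the association's addresses, with the user's choice only honored while no network failure exists) can be sketched at the application level as follows. This is a minimal illustration only: `path_table`, `usable`, and `next_peer_index` are hypothetical names that do not appear in the reporter's code, and how the application learns that a path has failed is assumed to be handled elsewhere.

```c
#include <assert.h>

#define MAX_PATHS 4

/* Hypothetical per-association state kept by the application. */
struct path_table {
    int count;              /* number of peer addresses in the association */
    int usable[MAX_PATHS];  /* 1 = no known failure, 0 = failed or suspect */
    int next;               /* rotating cursor, playing the role of peer_index */
};

/*
 * Return the next usable peer index in round-robin order, or -1 when every
 * path is marked unusable (the caller could then send without an explicit
 * address and leave the choice to SCTP).
 */
int next_peer_index(struct path_table *t)
{
    assert(t != 0 && t->count > 0 && t->count <= MAX_PATHS);
    for (int tries = 0; tries < t->count; tries++) {
        int idx = t->next;
        t->next = (t->next + 1) % t->count;
        if (t->usable[idx])
            return idx;
    }
    return -1;
}
```

With four addresses and the third marked unusable, successive calls yield 0, 1, 3, 0, and so on, so traffic keeps rotating over the remaining networks.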
Created attachment 314318 [details]
code that performs BIND operation
initialization of multi-home connection - same code for listen or connect
According to the Sun patch, it seems that this is kernel stuff. So I'm reassigning the bug to kernel.
First question, just to make sure we don't miss anything obvious: what are the values in gsctp_DefaultSendOptions, and what does sctp_set_sinfo_options look like? I'd like to be certain that we aren't inadvertently setting SCTP_UNORDERED in the sinfo_flags field. That would certainly explain unordered delivery, as would some mungings of the stream id.
notice that gsctp_DefaultSendOptions is not actually used.
The "options" pointer gets a new value of "&send_request->options", which
is data from the incoming message.
This header is initialized to all zeros except for the
peer_index ( which will have values rotating from 0 thru 3 and over again ).
A message from my application to SCTP never contains SCTP_UNORDERED in the sinfo_flags field.
( requested details below )
typedef struct sctp_send_options_s
typedef struct sctp_header_s
application sets the fields as
cntl = ( sctp_send_t * ) cntl_mp->b_rptr;
cntl->header.op = SCTP_SENDTO_OP;
cntl->options.version = 0;
cntl->options.stream = 0;
cntl->options.flags = 0; /* <==== */
cntl->options.protocol = 0;
cntl->options.context = 0;
mr->peer_index++ % mr->peer_addr.ip_addr_count;
inline static int
struct sctp_sndrcvinfo *sinfo,
const sctp_send_options_t *options)
int prev_no_delay, err;
memset(sinfo, 0, sizeof(struct sctp_sndrcvinfo) );
sinfo->sinfo_ppid = htonl(options->protocol);
sinfo->sinfo_stream = options->stream;
sinfo->sinfo_context = options->context;
/* Map ULCM SCTP send options to LKSCTP send options. */
if ( options->flags & SCTP_UNORDERED )
sinfo->sinfo_flags |= MSG_UNORDERED;
/*
 * Calling setsockopt() for each data message could
 * consume too much CPU time, therefore this is
 * only done when a new option is different from
 * the previous.
 * SCTP_NODELAY, Section 7.1.5 SCTP Sockets API IETF draft,
 * globally disables message bundling, whereas ULCM SCTP
 * does it on a per message basis. This approximation
 * is good enough to disable message bundling.
 */
prev_no_delay = plink->no_delay;
plink->no_delay = (options->flags & SCTP_NOBUNDLE)? 1: 0;
if ( prev_no_delay != plink->no_delay)
if ( setsockopt(plink->native_sctp_fd,
sizeof(plink->no_delay)) == -1 )
err = errno;
ULCM_WARN("[id=%d] setsockopt(SCTP_NODELAY) failed, errno = %d",
plink->id, err );
errno = err;
} /* end sctp_set_sinfo_options */
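The "only call setsockopt() when the option changes" idea from the comment in the listing can be isolated into a small helper. The sketch below is an illustration under assumptions, not the reporter's code: `cached_flag`, `apply_if_changed`, and `counting_apply` are hypothetical names, and the socket call is passed in as a callback so the logic can be exercised without an SCTP socket. In real use the callback would presumably wrap setsockopt() with SCTP_NODELAY. Unlike the listing, this sketch updates the cached value only after the callback succeeds, so a failed call is retried on the next message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type: applies the flag to the socket, returns 0 or -1. */
typedef int (*apply_flag_fn)(int fd, int value);

struct cached_flag {
    int current;  /* last value successfully applied to the socket */
};

/*
 * Apply `wanted` only when it differs from the cached value.
 * Returns 1 if the callback ran and succeeded, 0 if nothing had to change,
 * and -1 if the callback failed (the cache is left untouched).
 */
int apply_if_changed(struct cached_flag *c, int fd, int wanted,
                     apply_flag_fn apply)
{
    assert(c != NULL && apply != NULL);
    wanted = wanted ? 1 : 0;
    if (wanted == c->current)
        return 0;
    if (apply(fd, wanted) == -1)
        return -1;
    c->current = wanted;
    return 1;
}

/* Demonstration stub: counts calls instead of touching a real socket. */
int apply_calls;

int counting_apply(int fd, int value)
{
    (void)fd;
    (void)value;
    apply_calls++;
    return 0;
}
```

Sending four messages whose no-bundle flag is 0, 1, 1, 0 triggers the callback only twice, which is the CPU saving the comment is after.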
ok, do you have tcpdumps that show this ordering problem? That would help diagnose this problem for me. Thanks!
I do not have tcpdumps.
All I have is my application complaining about
getting messages out of order.
I have the same problem if the number of networks is 4 or 2.
Please take the tcpdump captures as we discussed via email and attach them here if you would. Thanks!
My first attempts at running tcpdump are not fruitful.
Each time I run tcpdump ( one on sender and one on receiver ), the problem
does not occur.
When I do not use tcpdump, the problem appears.
I am using the command
/usr/sbin/tcpdump -i any -s 0 -w <file>
on each machine.
I'll try varying the number of networks to see if I can get
tcpdump to capture a failure...
( my first attempt was with 2 networks in the association. )
ok - got lucky...
data files coming...
Created attachment 314625 [details]
tcpdump output files ( gunzip, then tar xvf ... )
This gzip'd tar file contains the
output of tcpdump on 2 machines
The time of the failure was between 8:30:36 and 8:30:46.
The failure was reported by the receiver.
additional information for attachment 314625 [details] in comment 15:
/etc/hosts file looks like this:
172.25.2.202 alderaan alderaan.ulticom.com
172.25.2.200 endor endor.ulticom.com
10.2.202.1 alderaan alderaan-a
10.2.202.129 alderaan alderaan-b
10.2.201.1 alderaan alderaan-c
10.2.201.129 alderaan alderaan-d
10.2.202.3 endor endor-a
10.2.202.131 endor endor-b
10.2.201.3 endor endor-c
10.2.201.131 endor endor-d
networks used are " -c, -d , -a, and -b" on each machine.
The sender is alderaan
The receiver is endor
Based on my understanding,
alderaan will think that his primary network is "-c".
endor will think that his primary network is "-d"
Traffic was flowing on all 4 ( -c, -d, -a, -b ) networks at the
time of the error.
endor ( recv1 file ) is the receiver. endor reported the error.
The 172 addresses were used for xterm connections only.
I've checked the receive dump and noted that we capture all sctp chunks in order (according to TSN value), so whatever the problem is, it's not happening on the wire or at the sender. How is your application determining that frames are not in the proper order? Is it checking the stream id and stream sequence number?
in answer to comment 17,
I have two user space applications ( one on each machine ).
The first ( sender ) is sending a series of messages to the
second ( receiver ). The applications do not even know that
they are on different machines. The applications have no idea
that SCTP is being used.
Within each message is a sequence count.
When the receiver gets a message out-of-order, it complains.
So, the sender is sending 30 messages numbered 1..30.
The receiver is expecting to get the messages 1,2,3,... 30.
However, in the test corresponding to the tcpdump data,
message 26 arrived at the receiver before message 25.
The application that is complaining has no knowledge of
stream id and stream sequence number.
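The receiver-side check described above (a sequence count inside each message, with a complaint on the first message that is not the expected one) might look roughly like this. It is a sketch under assumptions: the function name `first_out_of_order` and the array-based interface are hypothetical, and the real application's message format is not shown in this report.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Scan sequence counts in the order the messages were received.
 * `first` is the count expected on the first message.
 * Returns the zero-based position of the first message whose count is not
 * the expected one, or -1 if every message arrived in order.
 */
long first_out_of_order(const uint32_t *seq, size_t n, uint32_t first)
{
    assert(seq != NULL || n == 0);
    uint32_t expected = first;
    for (size_t i = 0; i < n; i++) {
        if (seq[i] != expected)
            return (long)i;
        expected++;
    }
    return -1;
}
```

For the failure described here, where messages 1..30 are sent and 26 arrives before 25, the function reports position 24, the slot where 25 was expected.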
Then I would augment the sending and receiving applications to check the stream id and stream sequence numbers. Alternatively, I would try to reproduce the problem with the sctp_darn utility or some other piece of code that I can use for a reproducer here. I checked the tcpdumps again and both the TSN values and stream sequence numbers are all ordinal and in order at the receiver. Either the application is processing the received frames out of order or something in the stack is re-ordering them (although I think you would be the first to see that).
I would try to run the same sctp connection using sctp_darn. You can generate input data using seq to simulate the payload sequence numbers you describe above.
Alternatively you can change the recvmsg call in your receiver to be sctp_recvmsg and interrogate the sctp_sndrcvinfo structure to validate that the messages are being read out of the kernel out of order.
I do not see any options in sctp_darn to load share over all the networks,
which is the key point in the test.
Is there any source code around for sctp_darn so that a modification
could be made in this area?
The testing that I am doing mirrors the test that I performed
on Sol 10 sparc. ( comment #3 & #4 ) So, I feel confident that the application is not part of the trouble.
"is there any source code around for sctp_darn"
Yes, sctp_darn is part of the lksctp-tools package, you can download the sources here:
Alternatively, you can just write some throwaway code based on your application to simply do sending and receiving of some sequential data using your already existing load balancing code. That way if you can reproduce you can send me the code to attempt a reproduction here.
" feel confident that the application
is not part of the trouble."
I understand why you feel that way, yet the evidence that we have in this bug at the moment contradicts that assertion. There is obviously still room for discrepancy in that the stack might be re-ordering frames (although I really think it's unlikely). That's what this test is meant to verify, however. We have a tcpdump showing a stream of sctp data getting received in sequence order. If we can show that those same sequence numbers in the sctp chunk headers arrive at the receiving application in a different order, we can conclude that something in the stack re-ordered them.
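The verification proposed here can be expressed as a small classifier over what the receiver observes. The sketch below assumes the receiver records, for each message in delivery order, the stream sequence number (which sctp_recvmsg would report in the sinfo_ssn field of struct sctp_sndrcvinfo) together with the sequence count from the payload. The names `delivery`, `verdict`, and `classify` are hypothetical, and the sketch assumes a single stream carrying only ordered messages, as in this report (stream 0, flags 0).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct delivery {
    uint16_t ssn;      /* stream sequence number seen by the receiver */
    uint32_t app_seq;  /* sequence count carried inside the payload */
};

enum verdict { IN_ORDER, REORDERED_BEFORE_SCTP, REORDERED_IN_STACK };

/*
 * Records must be in the order the application read them.
 * If the SSNs themselves are not consecutive, the stack delivered messages
 * out of order. If the SSNs are consecutive but the payload counts are not,
 * the messages were already in that order when SCTP numbered them.
 */
enum verdict classify(const struct delivery *d, size_t n)
{
    assert(d != NULL || n == 0);
    int app_out_of_order = 0;
    for (size_t i = 1; i < n; i++) {
        /* SSNs are 16 bits wide and wrap, so compare modulo 65536. */
        if ((uint16_t)(d[i].ssn - d[i - 1].ssn) != 1)
            return REORDERED_IN_STACK;
        if (d[i].app_seq != d[i - 1].app_seq + 1)
            app_out_of_order = 1;
    }
    return app_out_of_order ? REORDERED_BEFORE_SCTP : IN_ORDER;
}
```

A REORDERED_IN_STACK result would support the reporter's claim, while REORDERED_BEFORE_SCTP would point back at the sending side, consistent with the tcpdump showing TSNs and stream sequence numbers in order on the wire.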
I'll try to hack sctp_darn first...
ping, any progress here?
I have nothing new to report.
I am currently in the middle of a major delivery cycle.
Since this issue is not related to that delivery, I will not have time to perform any investigations on this issue until after the middle of May 2009.
Ok, let us know when you get to it. Thanks.
ping, as per your comment #24, I just wanted to check in to see if you managed to do any further investigation regarding this bug
Ok, it's been an extra month here, I'm closing due to inactivity. Please re-open if/when you get time to test.
new ticket 517504 has been created to reopen this issue. Detailed
programs that can reproduce the problem are attached to that ticket.