Bug 517504 - SCTP Messages out of order
Summary: SCTP Messages out of order
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.3
Hardware: x86_64
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Assignee: Neil Horman
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks: 532045
 
Reported: 2009-08-14 12:12 UTC by William Reich
Modified: 2010-03-30 07:45 UTC (History)
3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 532045 (view as bug list)
Environment:
Last Closed: 2010-03-30 07:45:11 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
tar file containing test programs (90.00 KB, application/octet-stream)
2009-08-14 12:16 UTC, William Reich
no flags Details
output of sctp_native_catcher during tcpdump capture (3.78 KB, text/plain)
2009-08-31 20:16 UTC, William Reich
no flags Details
output of "/usr/sbin/tcpdump any -w gordon.tcpout" in gzip format (10.64 MB, application/x-gzip)
2009-08-31 20:18 UTC, William Reich
no flags Details
output of "/usr/sbin/tcpdump -i any -w jarrett.tcpout" in gzip format (11.46 MB, application/x-gzip)
2009-08-31 20:22 UTC, William Reich
no flags Details
2nd output of sctp_native_catcher during tcpdump capture (6.74 KB, text/plain)
2009-09-01 15:48 UTC, William Reich
no flags Details
2nd run - tcpdump file from gordon in gzip format (9.50 MB, application/x-gzip)
2009-09-01 15:49 UTC, William Reich
no flags Details
2nd tcpdump from jarrett in gzip format (8.04 MB, application/x-gzip)
2009-09-01 15:52 UTC, William Reich
no flags Details
patch to serialize sctp recevies per association (3.51 KB, patch)
2009-09-08 20:51 UTC, Neil Horman
no flags Details | Diff
full output of catcher machine before panic of comment 38 occurred (10.53 KB, text/plain)
2009-09-14 12:53 UTC, William Reich
no flags Details
full output of catcher program on "gordon" ( in reply to comment 56 ) (4.07 KB, text/plain)
2009-09-18 11:38 UTC, William Reich
no flags Details
tcpdump output of catcher in reply to comment 56 (10.95 MB, application/x-gzip)
2009-09-18 11:50 UTC, William Reich
no flags Details
tcpdump from pitcher side ( jarrett ) in reply to comment 56 (11.67 MB, application/x-gzip)
2009-09-18 12:04 UTC, William Reich
no flags Details
messages file from test of v8 kernel (602.75 KB, application/octet-stream)
2009-09-21 12:51 UTC, William Reich
no flags Details
messages file from test of v9 kernel (462.95 KB, application/octet-stream)
2009-09-21 20:24 UTC, William Reich
no flags Details
output of catcher on "gordon" using v10 kernel (3.12 KB, text/plain)
2009-09-22 14:29 UTC, William Reich
no flags Details
tcpdump of catcher using v10 kernel (12.83 MB, application/x-gzip)
2009-09-22 14:30 UTC, William Reich
no flags Details
/var/log/messages from test run of comment 78 (4.24 KB, application/octet-stream)
2009-09-22 14:31 UTC, William Reich
no flags Details
tcpdump of pitcher using v10 kernel (14.55 MB, application/x-gzip)
2009-09-22 14:34 UTC, William Reich
no flags Details
patch to augment catcher reproducer (2.79 KB, patch)
2009-09-30 16:53 UTC, Neil Horman
no flags Details | Diff
corrected reproducer patch (2.78 KB, application/octet-stream)
2009-09-30 18:27 UTC, Neil Horman
no flags Details
patch to record tsn history (5.24 KB, patch)
2009-10-01 19:05 UTC, Neil Horman
no flags Details | Diff
output from program of comment 89 (18.44 KB, text/plain)
2009-10-01 19:38 UTC, William Reich
no flags Details
another run from tool of comment 89 - this time with associated tcpdumps (9.65 KB, text/plain)
2009-10-02 12:19 UTC, William Reich
no flags Details
tcpdump to match comment 92 from "catcher" (17.91 MB, application/x-gzip)
2009-10-02 12:24 UTC, William Reich
no flags Details
tcpdump of pitcher to match comment 92 (17.41 MB, application/x-gzip)
2009-10-02 12:26 UTC, William Reich
no flags Details
new patch to attempt serialization of outgonig transports (2.61 KB, patch)
2009-10-06 19:48 UTC, Neil Horman
no flags Details | Diff
wireshark output file (8.48 MB, application/octet-stream)
2009-10-23 12:52 UTC, William Reich
no flags Details
test 4 of wireshark output - (5.07 MB, application/octet-stream)
2009-10-23 13:27 UTC, William Reich
no flags Details
pitcher side of test 4 - wireshark format (9.00 MB, application/octet-stream)
2009-10-23 13:31 UTC, William Reich
no flags Details
output from kernel from comment 127 on pitcher side (353.52 KB, application/octet-stream)
2009-10-28 12:37 UTC, William Reich
no flags Details
test 2 - /var/log/messages from pitcher side (548.39 KB, application/octet-stream)
2009-10-28 12:49 UTC, William Reich
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2010:0178 0 normal SHIPPED_LIVE Important: Red Hat Enterprise Linux 5.5 kernel security and bug fix update 2010-03-29 12:18:21 UTC

Description William Reich 2009-08-14 12:12:59 UTC
Description of problem:
Using RH 5.2 on a 32-bit platform
(2.6.18-128 on a 64-bit platform),
I have a user space application trying to use the
SCTP interface.
I have 2 networks within a single association to another machine.
(That machine is using the exact same version of RH & SCTP.)

By default, SCTP will only use a single LAN to carry the traffic,
with the other LANs being hot spares.

However, I am trying to load balance the traffic instead of the default
behavior. 
I'm allowed to do this by selecting the network that I want
the traffic to go over.
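
The exact mechanism the test programs use to pick the outgoing network is not shown in the ticket text; one common way to do it with lksctp is the SCTP_PRIMARY_ADDR socket option, sketched below. The socket descriptor, association id, and peer address here are placeholders, not values from the ticket.

/* Hedged sketch (not the reporter's code): ask the stack to use the given
 * peer address as the primary path of an existing association. */
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/sctp.h>

static int set_primary_path(int sd, sctp_assoc_t assoc_id,
                            const struct sockaddr_in *peer)
{
        struct sctp_prim prim;

        memset(&prim, 0, sizeof(prim));
        prim.ssp_assoc_id = assoc_id;
        memcpy(&prim.ssp_addr, peer, sizeof(*peer));

        /* Returns 0 on success, -1 with errno set on failure. */
        return setsockopt(sd, IPPROTO_SCTP, SCTP_PRIMARY_ADDR,
                          &prim, sizeof(prim));
}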

When there is a light traffic load, the messages flow
from one machine to the other with no trouble.

However, when the message load between the two machines increases,
I am getting messages out of sequence.
Since the messages are out of sequence, my application detects a missing
sequence number and fails.

SCTP is supposed to guarantee the message sequence.

This problem is also observed on RH4 Update 4.





Version-Release number of selected component (if applicable):
lksctp-tools-devel-1.0.6-1.el5.1
lksctp-tools-1.0.6-1.el5.1
lksctp-tools-doc-1.0.6-1.el5.1
Linux yyy 2.6.18-92.1.10.el5PAE #1 SMP Wed Jul 23 04:10:43 EDT 2008 i686
also 2.6.18-128


How reproducible:


Steps to Reproduce:
1. setup two machines with multiple networks
2. run attached program
3.
  
Actual results:
messages are reported as being out of order

Expected results:
there should be no out of order messages.

Additional info:

This ticket attempts to re-open ticket 458712.

Comment 1 William Reich 2009-08-14 12:16:14 UTC
Created attachment 357447 [details]
tar file containing test programs

This tar file contains pitcher & catcher programs to recreate the reported problem.

Comment 2 William Reich 2009-08-14 12:25:47 UTC
I have two machines, gordon & jarrett, that are connected
to each other.

The etc_hosts file in the tar file of comment 1 shows that
there is a gordon-x network composed of two networks.
jarrett-x is also composed of two networks.

The -0 network is the corporate lan. The -2 & -3 networks
are direct-connections between the two machines.
We expect that traffic will be slower on the corporate network.

We start the catcher program on gordon and tell it to bind to
the two networks defined by gordon-x:

sctp_native_catcher -report 1000000 -bind gordon-x

We start the pitcher program on jarrett and tell it to bind to the
two networks defined by jarrett-x and connect to the peer defined by gordon-x:

sctp_native_pitcher -size 848 -bind jarrett-x  gordon-x

PLEASE NOTE - I believe that the message size is important here.
With a size of 848, the failure occurs within a few seconds.
However, with a default size of 300, I have not yet seen the error.

Comment 3 William Reich 2009-08-14 12:26:34 UTC
to compile the programs, untar the tar ball, and then
assuming gcc is available...

make -f LI*

Comment 4 William Reich 2009-08-14 13:30:12 UTC
In a test using a message size of 300, a failure
did eventually occur after a couple of hours...

Comment 5 Neil Horman 2009-08-23 23:35:48 UTC
It's going to be a while before I get to this.  In the interim, it would expedite things for me a bit if you had a tcpdump showing this reproducer.  I'd like to see if we're sending out of order, or just receiving out of order.
Thanks!

Comment 6 William Reich 2009-08-24 11:30:22 UTC
I am on vacation - will return Aug 31

Comment 7 Neil Horman 2009-08-24 13:53:42 UTC
Ok. Let me know

Comment 8 William Reich 2009-08-31 20:14:34 UTC
from gordon...
  /usr/sbin/tcpdump -i any  -w gordon.tcpout
tcpdump: WARNING: Promiscuous mode not supported on the "any" device
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 96 bytes
597316 packets captured
644222 packets received by filter
46904 packets dropped by kernel

from jarrett...
 /usr/sbin/tcpdump -i any -w jarrett.tcpout
tcpdump: WARNING: Promiscuous mode not supported on the "any" device
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 96 bytes
642331 packets captured
642609 packets received by filter
274 packets dropped by kernel

Comment 9 William Reich 2009-08-31 20:16:43 UTC
Created attachment 359314 [details]
output of sctp_native_catcher during tcpdump capture

this is the output of the sctp_native_catcher program
while the tcpdump was running on gordon

Comment 10 William Reich 2009-08-31 20:18:35 UTC
Created attachment 359316 [details]
output of "/usr/sbin/tcpdump any -w gordon.tcpout" in gzip format

output of /usr/sbin/tcpdump any -w gordon.tcpout
in gzip format

Comment 11 William Reich 2009-08-31 20:22:57 UTC
Created attachment 359318 [details]
output of "/usr/sbin/tcpdump -i any -w jarrett.tcpout" in gzip format

output of 
/usr/sbin/tcpdump -i any -w jarrett.tcpout
in gzip format

Comment 12 William Reich 2009-08-31 20:27:25 UTC
regarding the comments 8,9,10, & 11,
I believe that the only sctp traffic that you will find
will be from this test.
Note that one of the interfaces used in the test was the corporate LAN, so
you are going to have to sift through a lot of X traffic...

Comment 13 William Reich 2009-08-31 20:29:41 UTC
more notes - gordon and jarrett are 64bit machines.

Command for the pitcher on jarrett  was:

 sctp_native_pitcher -size 848 -bind jarrett-x gordon-x

Comment 14 William Reich 2009-08-31 20:31:17 UTC
current version info of gordon & jarrett...

omnitest@gordon 30% uname -a
Linux gordon 2.6.18-128.2.1.el5 #1 SMP Wed Jul 8 11:54:47 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux



omnitest@jarrett 17% uname -a
Linux jarrett 2.6.18-128.2.1.el5 #1 SMP Wed Jul 8 11:54:47 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

Comment 15 Neil Horman 2009-09-01 14:30:07 UTC
Thank you for the tcpdumps.  Unfortunately the capture length wasn't set high enough for me to see TSN values in the sending stream, so I could not determine whether the frames are getting sent in order or not.  Judging by the text in the frames it appears you are getting lots of retransmits and lots of triple sacks triggering fast retransmits.  Can you confirm that in your sctp stats?  Also, if you could recapture these so that I can see the TSN values in the stream, that would be good.

Comment 16 William Reich 2009-09-01 15:47:28 UTC

2nd attempt at tcpdumps...

omnitest@gordon 35%  /usr/sbin/tcpdump -i any -s 0 -w 2gordon.tcp.out
tcpdump: WARNING: Promiscuous mode not supported on the "any" device
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
437013 packets captured
584763 packets received by filter
147746 packets dropped by kernel


omnitest@jarrett 21% /usr/sbin/tcpdump -i any -s 0 -w 2jarrett.tcpout
tcpdump: WARNING: Promiscuous mode not supported on the "any" device
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
371398 packets captured
584936 packets received by filter
213534 packets dropped by kernel



NAME
       tcpdump - dump traffic on a network

SYNOPSIS
       tcpdump [ -AdDeflLnNOpqRStuUvxX ] [ -c count ]
               [ -C file_size ] [ -F file ]
               [ -i interface ] [ -m module ] [ -M secret ]
               [ -r file ] [ -s snaplen ] [ -T type ] [ -w file ]
               [ -W filecount ]
               [ -E spi@ipaddr algo:secret,...  ]
               [ -y datalinktype ] [ -Z user ]
               [ expression ]

...
       -r     Read packets from file (which was created with the  -w  option).
              Standard input is used if file is ‘‘-’’.

       -S     Print absolute, rather than relative, TCP sequence numbers.

       -s     Snarf  snaplen  bytes  of  data from each packet rather than the
              default of 68 (with SunOS’s NIT, the minimum  is  actually  96).
              68  bytes is adequate for IP, ICMP, TCP and UDP but may truncate
              protocol information from  name  server  and  NFS  packets  (see
              below).   Packets  truncated  because  of a limited snapshot are
              indicated in the output with ‘‘[|proto]’’, where  proto  is  the
              name of the protocol level at which the truncation has occurred.
              Note that taking larger snapshots both increases the  amount  of
              time it takes to process packets and, effectively, decreases the
              amount of packet buffering.  This may cause packets to be  lost.
              You  should  limit snaplen to the smallest number that will cap-
              ture the protocol information  you’re  interested  in.   Setting
              snaplen  to 0 means use the required length to catch whole pack-
              ets.

on jarrett:
sctp_native_pitcher -size 848 -bind jarrett-x gordon-x

Comment 17 William Reich 2009-09-01 15:48:28 UTC
Created attachment 359411 [details]
2nd   output of sctp_native_catcher during tcpdump capture

2nd   output of sctp_native_catcher during tcpdump capture

Comment 18 William Reich 2009-09-01 15:49:49 UTC
Created attachment 359412 [details]
2nd run - tcpdump file from gordon in gzip format

2nd run - tcpdump file from gordon in gzip format

Comment 19 William Reich 2009-09-01 15:52:30 UTC
Created attachment 359413 [details]
2nd tcpdump from jarrett in gzip format

2nd tcpdump from jarrett in gzip format

Comment 20 Neil Horman 2009-09-04 20:11:36 UTC
Note to self: I'm looking over the provided new sets of tcpdumps and I see some interesting results.  First off, the sequence values in the transported data are not just out of order, but missing entirely, along with their corresponding TSN values.  This is rather troubling, as I would think that should be reported somehow.  I'm guessing that they were dropped by the kernel.  Although it's interesting to note that all the missing seq values were dropped, which suggests they were lost after being ACKed by the receiver and not reneged.  I'll look further over the weekend.

Comment 21 Neil Horman 2009-09-08 16:42:53 UTC
So, a few more interesting observations.  I notice that the out-of-order sequence values are missing from both the jarrett and gordon traces in every case, which suggests that they were all lost in the frames that the tcpdump capture dropped.  It's odd that that should happen in both traces at the same time for every re-ordered frame.  It suggests that these re-order events correlate to some odd network traffic spike.  Have you noticed any such spike during these loss periods?


That said, with these traces, even though I can't observe the missing sequence numbers directly, I can correlate the sequence values to TSN values, and see that the sctp stream acks the corresponding tsn values, which means that the values are getting sent and received in an appropriate order.  So there appears to be nothing wrong with the sender or the transport of these frames, which is good.

I'm beginning to formulate a theory about this problem.  I think it may have to do with synchronization of multiple frame receives on multiple network interfaces.  I've tried to reproduce this problem here, and am unable to, but I was using systems with only one interface on them.  Is this problem reproducible for you if you only use one of your ip addresses to send/receive sctp traffic?  Also, is it possible for you to modify your sender so that instead of rotating send addresses every count you rotate addresses every, say, 100 counts (or some other value), and insert a pause of a few milliseconds between rotations?  The idea is to reduce the number of rotations that we go through and provide a quiescent point between them, so as to reduce the likelihood of a potential race on the receiver side.

If either of those mitigates the problem you are seeing, I think I can put together a patch to introduce some locking on the receiver so that any possible races are blocked.  Let me know the results of the above testing and I'll go to work on that.
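
A minimal sketch of the experiment proposed above (it is not the attached pitcher's actual code): rotate the peer address only every 100 messages and pause briefly at each rotation, using lksctp's sctp_sendmsg() with an explicit peer address (link with -lsctp).  The addresses, payload, and counts are placeholders, and whether a per-send peer address steers the chunk onto that transport depends on the stack.

#include <stddef.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/sctp.h>

#define ROTATE_EVERY    100     /* messages sent on one path before switching */
#define ROTATE_PAUSE_US 5000    /* quiescent pause at each switch (5 ms)      */

static void send_with_rotation(int sd, struct sockaddr_in peers[2],
                               const char *payload, size_t len,
                               unsigned long total)
{
        unsigned long i;
        int path = 0;

        for (i = 0; i < total; i++) {
                if (i && (i % ROTATE_EVERY) == 0) {
                        path ^= 1;               /* switch to the other path */
                        usleep(ROTATE_PAUSE_US); /* quiesce before switching */
                }
                /* Send this message toward the currently selected peer
                 * address of the association. */
                sctp_sendmsg(sd, payload, len,
                             (struct sockaddr *)&peers[path],
                             sizeof(peers[path]),
                             0 /* ppid */, 0 /* flags */, 0 /* stream */,
                             0 /* ttl */, 0 /* context */);
        }
}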

Comment 22 Neil Horman 2009-09-08 16:56:17 UTC
As an additional test, it would be interesting if you:
1) turned off the irqbalance daemon
2) manually set the irq affinity for all of the nics in the receiving system to be bound to the same cpu core.

That would force all nic interrupts to be handled in a serial fashion so that no reordering could take place.  It's not a fix, but it would provide support to my theory above.

Comment 23 William Reich 2009-09-08 17:16:01 UTC
regarding your questions...

Recall that the network configuration is one corporate lan
and one private lan. This test can be run at any time of day and
generate the same results.
This sort of takes out the external network spike consideration -
except for the idea that the test applications themselves generate tons of traffic. So, maybe in one sense, the test itself is generating a network spike.

This problem is not reproducible when only one network is used for the test.

Comment 24 William Reich 2009-09-08 18:56:57 UTC

I changed the pitcher so that the network selection changed every 128 messages
( instead of every message ),
and also added a 100ms delay when the network switch occurred.

This drastically changed the behavior of the system.
In the "before" test, the pitcher would use nearly 100% of one CPU.
In the "after" test, the pitcher is only using 6% of the CPU.

No messages were reported as lost. However, the time to process
1 million messages went from a few seconds to several minutes.


++++++++++++
I then changed the pitcher to remove the 100ms delay, but kept the
network switch every 128.
The dropped messages returned.

++++++++++++
I then changed the pitcher to have a 5ms delay each time the network
switched every 128 messages. The dropped messages were observed.

++++++++++++
I then changed the pitcher to have a 20ms delay each time the network
switched every 128 messages. NO dropped messages were observed.
CPU was about 5%.

++++++++++++
I then changed the pitcher to have a 10ms delay each time the network
switched every 128 messages. NO dropped messages were observed.
CPU was about 9%.

++++++++++++
I then changed the pitcher to have a 7ms delay each time the network
switched every 128 messages. The dropped messages were observed.

++++++++++++
I then changed the pitcher to have a 7ms delay each time the network
switched every 10,000 messages. NO dropped messages were observed.

++++++++++++
I then changed the pitcher to have a 7ms delay each time the network
switched every 2,000 messages. NO dropped messages were observed.

++++++++++++
I then changed the pitcher to have a 7ms delay each time the network
switched every 500 messages. NO dropped messages were observed.

++++++++++++
I then changed the pitcher to have a 7ms delay each time the network
switched every 250 messages. NO dropped messages were observed.

++++++++++++
I then changed the pitcher to have a 7ms delay each time the network
switched every 150 messages. Dropped messages did occur.

Comment 25 William Reich 2009-09-08 18:58:47 UTC
hmm....

I should not use the phrase "dropped messages".
The correct phrase is "mis-sequenced messages".

Comment 26 William Reich 2009-09-08 19:19:30 UTC
Note -

- The machines I have, gordon and jarrett, each have only 2 CPUs

Comment 27 Neil Horman 2009-09-08 19:19:46 UTC
Interesting, that's strong enough evidence for me to suggest we have a receive race with multiple nics on the same association.  I'll have a test patch available soon.  Thanks!

Comment 28 Neil Horman 2009-09-08 20:51:56 UTC
Created attachment 360130 [details]
patch to serialize sctp recevies per association

hey, this patch hasn't been tested yet, but you're pretty quick at testing so I thought you might want it asap.  Please let me know if you run into trouble with it.  Thanks!

Comment 31 William Reich 2009-09-10 19:31:24 UTC

I had to edit the patch to get a clean compile.

@@ -1337,6 +1339,9 @@ struct sctp_association {
 	/* Associations on the same socket. */
 	struct list_head asocs;
 
+	/* rx lock */
+	struct spinlock_t rx_lock;  <+++++++++++++++++
+
 	/* association id. */
 	sctp_assoc_t assoc_id;

get rid of the word "struct"...

I loaded this new kernel on both machines, and reran the test.

I still observed mis-sequenced messages.

Comment 32 Neil Horman 2009-09-10 20:05:21 UTC
I've fixed a similar hole in the endpoint code during my other fixups; please wait for my test kernel.

Comment 33 Neil Horman 2009-09-11 00:58:22 UTC
http://people.redhat.com/nhorman/rpms/kernel-2.6.18-165.el5.bz517504.v2.x86_64.rpm

Test rpm is there for you.  Let me know the results, please.

Comment 34 William Reich 2009-09-11 14:16:34 UTC

(In reply to comment #33)
> http://people.redhat.com/nhorman/rpms/kernel-2.6.18-165.el5.bz517504.v2.x86_64.rpm

Using this test kernel, I still get mis-sequenced messages.

Comment 35 Neil Horman 2009-09-11 15:22:06 UTC
Sigh, it just doesn't make sense.  Ok, let's start taking this in smaller steps:
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=1981934
I've got a build going here with debug code in place, to log warning messages from the kernel in the event we note decreasing tsn values when we post them to the socket receive queue.  This should at least tell us whether we're missing some amount of tsn ordering when we post to the user space socket queue, or if something is happening earlier in the process.

Comment 37 Neil Horman 2009-09-12 16:28:43 UTC
http://people.redhat.com/nhorman/rpms/kernel-2.6.18-165.el5.bz517504.v3.x86_64.rpm
New kernel.  This one should warn you whether or not the sctp stack in the kernel queues messages out of order.

Comment 38 William Reich 2009-09-14 12:52:00 UTC
here is a small sample of the output.
This is from gordon where the catcher was running.
The machine panic'd.
The test was executed as in comment 2. ( ie full blast ).

The pitcher machine ( jarrett ) had no output.

I will attempt to gather panic info by repeating the test...

Sep 14 08:33:48 gordon kernel: SCTP DATA CHUNK TSN DECREASE!
Sep 14 08:33:48 gordon kernel: ulpq last seen tsn = 1104208148
Sep 14 08:33:48 gordon kernel: event tsn = 1104208149
Sep 14 08:33:48 gordon kernel: SCTP DATA CHUNK TSN DECREASE!
Sep 14 08:33:48 gordon kernel: ulpq last seen tsn = 1104208149
Sep 14 08:33:48 gordon kernel: event tsn = 1104208150
Sep 14 08:33:48 gordon kernel: SCTP DATA CHUNK TSN DECREASE!
Sep 14 08:33:48 gordon kernel: ulpq last seen tsn = 1104208150
Sep 14 08:33:48 gordon kernel: event tsn = 1104208151
Sep 14 08:33:48 gordon kernel: SCTP DATA CHUNK TSN DECREASE!
Sep 14 08:33:48 gordon kernel: ulpq last seen tsn = 1104208151
Sep 14 08:33:48 gordon kernel: event tsn = 1104208152

Comment 39 William Reich 2009-09-14 12:53:40 UTC
Created attachment 360935 [details]
full output of catcher machine before panic of comment 38 occurred

full output of catcher machine before panic of comment 38 occurred

Comment 40 William Reich 2009-09-14 13:00:54 UTC
correction - the machine did not panic. It locked up.
There was no panic/oops backtrace.

Running the test a second time got the same results:
-the catcher machine produced output similar to comments 37 & 38
-the catcher machine locked up
- AND this time the pitcher machine locked up.

Comment 41 Neil Horman 2009-09-14 13:14:21 UTC
Sorry, I reversed the sense of the check; I'll rebuild shortly.

Comment 43 Neil Horman 2009-09-14 17:03:39 UTC
http://people.redhat.com/nhorman/rpms/kernel-2.6.18-165.el5.bz517504.v4.x86_64.rpm

There you go, that has the check corrected.

Comment 44 William Reich 2009-09-14 18:11:21 UTC

panic on the catcher machine using the kernel from comment 43 on both machines

Unable to handle kernel NULL pointer dereference at 0000000000000014 RIP: 
 [<ffffffff886ab2bf>] :sctp:sctp_ulpq_tail_data+0x36d/0x3cb     
PGD 6e552067 PUD 6e4aa067 PMD 0 
Oops: 0000 [1] SMP 
last sysfs file: /block/ram0/range
CPU 0 
Modules linked in: md5 sctp autofs4 hidp nfs lockd fscache nfs_acl rfcomm l2cap bluetooth sunrpc bondid
Pid: 0, comm: swapper Not tainted 2.6.18-165.el5.bz517504.v4 #1
RIP: 0010:[<ffffffff886ab2bf>]  [<ffffffff886ab2bf>] :sctp:sctp_ulpq_tail_data+0x36d/0x3cb
RSP: 0018:ffffffff8043dac0  EFLAGS: 00010282
RAX: 00000000c9bc9713 RBX: 0000000000000000 RCX: ffff81006e265650
RDX: ffff81006e265650 RSI: ffff810072366580 RDI: ffff810074657e80
RBP: ffff81006e265628 R08: ffff810074657ed8 R09: ffff810074542412
R10: 0000000024120000 R11: ffffffff886a345c R12: ffffffff886b4c80
R13: ffff81006fa49780 R14: 0000000000000000 R15: ffff81006fa49780
FS:  0000000000000000(0000) GS:ffffffff803c1000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000014 CR3: 000000007d7b6000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo ffffffff803f2000, task ffffffff80300ae0)
Stack:  ffff810074657e80 ffff810074657e80 0000000100000001 ffff81006e264000
 ffffffff8043db30 ffff81006e264000 ffffffff886b4c80 ffffffff886a2885
 ffff81006f840600 00000004f2b70d26 0000000100000000 0000000000000001
Call Trace:
 <IRQ>  [<ffffffff886a2885>] :sctp:sctp_do_sm+0x4f0/0xd07
 [<ffffffff886abfb1>] :sctp:sctp_bind_addr_match+0x2d/0x58
 [<ffffffff886a53a0>] :sctp:sctp_assoc_bh_rcv+0xc5/0x109
 [<ffffffff886b2673>] :sctp:sctp_rcv+0x69c/0x75d
 [<ffffffff8006c9a8>] do_IRQ+0xec/0xf5
 [<ffffffff8005d615>] ret_from_intr+0x0/0xa
 [<ffffffff80034844>] ip_local_deliver+0x19d/0x263
 [<ffffffff800359b2>] ip_rcv+0x539/0x57c
 [<ffffffff80020807>] netif_receive_skb+0x3c9/0x3f5
 [<ffffffff881d66ff>] :e1000:e1000_clean_rx_irq+0x42f/0x526
 [<ffffffff881d5562>] :e1000:e1000_clean+0x89/0x26e
 [<ffffffff8002e2f2>] __wake_up+0x38/0x4f
 [<ffffffff8000c845>] net_rx_action+0xac/0x1e0
 [<ffffffff8001235a>] __do_softirq+0x89/0x133
 [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
 [<ffffffff8006cb20>] do_softirq+0x2c/0x85
 [<ffffffff8006c9a8>] do_IRQ+0xec/0xf5
 [<ffffffff800571c3>] mwait_idle+0x0/0x4a
 [<ffffffff8005d615>] ret_from_intr+0x0/0xa
 <EOI>  [<ffffffff800571f9>] mwait_idle+0x36/0x4a
 [<ffffffff800493a3>] cpu_idle+0x95/0xb8
 [<ffffffff803fd7fd>] start_kernel+0x220/0x225
 [<ffffffff803fd22f>] _sinittext+0x22f/0x236


Code: 3b 43 14 76 30 48 c7 c7 ea 6e 6b 88 31 c0 e8 a9 64 9e f7 8b 
RIP  [<ffffffff886ab2bf>] :sctp:sctp_ulpq_tail_data+0x36d/0x3cb
 RSP <ffffffff8043dac0>
CR2: 0000000000000014
 <0>Kernel panic - not syncing: Fatal exception

Comment 45 William Reich 2009-09-14 18:12:45 UTC
regarding comment 44, no other output was found in /var/log/messages

Comment 49 Neil Horman 2009-09-16 14:19:36 UTC
http://people.redhat.com/nhorman/rpms/kernel-2.6.18-165.el5.bz517504.v6.x86_64.rpm

Here, sorry about that, this should avoid that oops.

Comment 50 William Reich 2009-09-16 18:21:07 UTC
testing the kernel from comment 49 ( on both machines ),
the test was executed.
No panic...
However, mis-sequenced messages were still reported on the catcher.
There was nothing output to /var/log/messages.
Nothing in the output of 'dmesg' either...

Comment 51 Neil Horman 2009-09-16 19:42:06 UTC
So that's rather interesting.  TSN values in sctp associations are ordinal and increasing (with the potential to wrap).  This patch should have correlated each sequence number N that you send in the data chunk to a TSN value, and they should increase in step with one another.  We would have missed events in which we skipped a tsn, but we should have had a message in the logs for every time we went backwards in tsns (and sequence numbers in your 2gordon output).  Such as it is, we did not, which tells me that as we pass through sctp_ulpq_tail_data we're seeing all our frames in order.  Some theories:

1) The frames, while getting enqueued in order, are dequeued in order but processed by the application out of order.  I've looked at your receiver, and see that it has a call to fork, but you only have one child process that ever receives on the socket, correct?  Does this occur if you skip the fork with the --wait option to your receiver?

2) While tail_data sees the frames in order, the actual enqueue routine (sctp_ulpq_tail_event) puts them on the socket receive queue out of order.  Looking at the code, I see how that might happen.  If you enter partial delivery mode and then exit it, it is possible to enqueue the pd_lobby frames after the incoming frame, which will result in out-of-sequence tsns that the above patch wouldn't catch.  It appears to be fixed by upstream commit d0cf0d9940ef27b46fcbbd9e0cc8427c30fe05eb.  If you would please confirm that (1) above is not an issue, I'll backport that patch as a test for you.  I'll also be keeping the low-level serialization code in place, as I think that is needed, and will send that upstream shortly.

Comment 52 William Reich 2009-09-16 22:38:11 UTC
I will perform the task you ask.
But please keep in mind that this is only a test application
that reveals the problem.
My real code, which is the real product, also sees this issue.
It contains only one thread per socket.

Comment 53 Neil Horman 2009-09-16 23:54:10 UTC
Ok, that's a sufficient answer to eliminate that theory, I think.  I'll have a test kernel for you in the morning.

Comment 54 Neil Horman 2009-09-17 14:45:25 UTC
http://people.redhat.com/nhorman/rpms/kernel-2.6.18-165.el5.bz517504.v7.x86_64.rpm

There you go, new test build which incorporates a backport of upstream commit d0cf0d9940ef27b46fcbbd9e0cc8427c30fe05eb, which looks like it will fix this issue.

Comment 55 William Reich 2009-09-17 19:04:59 UTC
using the kernel from comment 54 on both machines,
the problem of mis-sequenced messages still exists.
However, it should be noted that it takes longer for the problem to occur
( based on the sequence number that is reported... ).

Comment 56 Neil Horman 2009-09-17 20:15:22 UTC
Please post new tcpdumps and catcher output.  I'd like to see if the mis-sequencing pattern looks the same.

Comment 57 William Reich 2009-09-18 11:37:05 UTC

ok

omnitest@gordon 215% sc -report 1000000 -bind gordon-x
+---------------------+    Native SCTP performance test (Aug 14 2009)
| native sctp_catcher |    Copyright (c) 2000 - 2009 Ulticom, Inc.
+---------------------+    All Rights Reserved.

                native_sctp_bind  - host gordon-x , port 10000 

                native_sctp_addrs count is 2 
Local (bound) address for socket 3
Address count 2, port 10000
        address  0: ::ffff:172.25.4.75
        address  1: ::ffff:10.10.3.2
New catcher: 11145
Peer address for socket 4
Address count 2, port 51740
        address  0: ::ffff:10.10.3.1
        address  1: ::ffff:172.25.4.76
          Packets  Packet   Cumulative Rate       Current Rate    Packets
Endpoint  (x1000)    Size   (bytes/s) (msg/s)  (bytes/s)  (msg/s) Skipped
========  =======  ======  ==================  ================== =======
BUMMER: message received out of sequence... expected 253392, got 253393
BUMMER: message received out of sequence... expected 253394, got 253392
BUMMER: message received out of sequence... expected 253393, got 253394


omnitest@gordon 118% /usr/sbin/tcpdump -i any -s 0 -w 3gordon.tcp.out
tcpdump: WARNING: Promiscuous mode not supported on the "any" device
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
506385 packets captured
713202 packets received by filter
206809 packets dropped by kernel

omnitest@jarrett 99% /usr/sbin/tcpdump -i any -s 0 -w 3jarrett.tcp.out
tcpdump: WARNING: Promiscuous mode not supported on the "any" device
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
541670 packets captured
713308 packets received by filter
171633 packets dropped by kernel

Comment 58 William Reich 2009-09-18 11:38:10 UTC
Created attachment 361628 [details]
full output of catcher program on "gordon" ( in reply to comment 56 )

full output of catcher program on "gordon" ( in reply to comment 56 )

Comment 59 William Reich 2009-09-18 11:50:20 UTC
Created attachment 361632 [details]
tcpdump output of catcher in reply to comment 56

tcpdump output of catcher in reply to comment 56 ( gzip'd )

Comment 60 William Reich 2009-09-18 12:04:41 UTC
Created attachment 361636 [details]
tcpdump from pitcher side ( jarrett ) in reply to comment 56

tcpdump from pitcher side ( jarrett ) in reply to comment 56 ( gzip format )

Comment 61 Neil Horman 2009-09-18 17:54:51 UTC
This is just too bizarre.  Every out-of-sequence group of frames is missing from these tcpdumps as well.  I see the SACKs for them, but the frames with the requisite TSN and SSN values are just not there.  It's completely frustrating.  I don't for the life of me understand why the kernel would decide to drop those frames.  Is there any way you can do this on a private network so that we can be sure that this isn't correlated to some other externally driven network event?

In the interim, I've found another possibility for what might be causing the misordering on occasion.  I've modified my patch to change the locking and queuing ordering slightly.  I've also updated the debug code to dump out more verbose information.  With it we should be able to see the exact ordering of TSN and SSN values that come in to the socket queue and the order in which they are removed.  So if you run a tcpdump in conjunction with this next kernel, we should be able to correlate user-received frames to what's on the wire.

Comment 63 William Reich 2009-09-18 18:37:55 UTC
I changed the configuration of the machines
to use 2 private networks instead of "1 corp and 1 private".

The problem still occurs, but the mis-sequence is not observed until
the 668996th message in one test case, and message 242175 in another,
and in a 3rd test case the test ran for over one minute successfully
before I killed it. A fourth test failed at message 575891.

Given this, what can I do from a command line to keep the size
of the tcpdump files manageable?
( referencing comment 57 for the tcpdump command lines )

Comment 64 Neil Horman 2009-09-18 19:20:46 UTC
Well, the first thing that comes to mind is to restrict your capture size.  I notice that each frame has 848 or so bytes of payload, and you only use about 40-50 bytes; since the sctp header ends 64 bytes in, if you set -s 128 on your tcpdump you should be able to catch the sctp header plus any relevant data, as well as shrink your capture file by a factor of about 4 (given an 8x decrease in the size of the captured data chunk frames, which make up about 50% of the trace).

Also, you can filter on sctp to remove the extraneous arp frames and telnet frames that creep into the trace.

Comment 66 Neil Horman 2009-09-19 00:10:20 UTC
http://people.redhat.com/nhorman/rpms/kernel-2.6.18-165.el5.bz517504.v8.x86_64.rpm
There you go, that's the new kernel I was mentioning.  It should close the locking hole that I found when a bh gets the sock lock while a user space process is running in sctp_backlog_rcv.  It will also dump a stream of data showing the tsn/ctsn/ssn values for every enqueued and dequeued event on the receiver.  It will be a good deal of data, but it should show us if anything gets out of order on the queue, especially when the pd_lobby is cleared.

Comment 67 William Reich 2009-09-21 12:51:32 UTC
Created attachment 361920 [details]
messages file from test of v8 kernel


Here is the messages file from test of v8 kernel ( catcher side ) ...
Well... the test ran. No mis-sequences were reported.
However, the machine was very slow/burdened/unresponsive.
I could not even log into the machine with another xterm
while the test was running.
The /var/log/messages file shows a lot of debug.
This test used 2 private networks.

I'll do another execution of the test, and just let it run a while...

Comment 68 William Reich 2009-09-21 13:23:00 UTC
Just to show the slowness of the machine:
using the newest V8 kernel,
we passed 1.7 million 848-byte messages in 17 minutes.
Using the original 2.6.18-128 kernel,
we passed 1.5 million 848-byte messages in 47 seconds.
Since the machine is so slow, I can not
give an honest pass/fail opinion on the V8 kernel.

Comment 69 Neil Horman 2009-09-21 15:17:05 UTC
No, tcpdumps won't be needed.  The output confirms that messages are getting queued in the right order and that the TSN values are all properly set (no blank tsns or other odd values), so I'll rebuild and filter the messages so we only get them if the TSN value goes backward or skips; that should restore performance.

Comment 72 Neil Horman 2009-09-21 19:24:03 UTC
http://people.redhat.com/nhorman/rpms/kernel-2.6.18-165.el5.bz517504.v9.x86_64.rpm

There you are.  New kernel with reduced debugging.  Should only give you messages when things are enqueued or dequeued out of order to an sctp socket.  Locking fixes are all the same.

Comment 73 William Reich 2009-09-21 20:24:38 UTC
Created attachment 361987 [details]
messages file from test of v9 kernel


Here is the messages file from the catcher while testing the V9 kernel.
There were no mis-sequences from the catcher test tool,
but there seems to be a lot of output.

Let me know what is next to do...

Comment 74 Neil Horman 2009-09-21 22:59:33 UTC
Crud, I think we're getting some weird aliasing from some split skbs being added as a list.  I'll start modifying the patch.  In the interim, if you would like to try this patch again, you should be able to do so:
1) by turning off syslog
and
2) raising the printk level in /proc/sys/kernel/printk to KERN_EMERG

That should suppress the messages, and let you try a full speed test.

I'll provide you another kernel with more restrictive printks shortly.

Comment 75 William Reich 2009-09-22 11:46:06 UTC
Having disabled syslog and raised the printk level in /proc/sys/kernel/printk to KERN_EMERG, I retested the v9 kernel.
Mis-sequenced messages were observed.
A rough observation seems to indicate that performance was restored.

Comment 76 Neil Horman 2009-09-22 13:10:09 UTC
http://people.redhat.com/nhorman/rpms/kernel-2.6.18-165.el5.bz517504.v10.x86_64.rpm

Ok, let's run this kernel.  If you restore the printk and syslog settings, it should dump out messages only when out-of-order messages are enqueued or dequeued.  This whole issue hinges on us figuring out where those messages are getting seen out of order.

Comment 77 William Reich 2009-09-22 13:41:34 UTC
"catcher side"

+---------------------+    Native SCTP performance test (Aug 14 2009)
| native sctp_catcher |    Copyright (c) 2000 - 2009 Ulticom, Inc.
+---------------------+    All Rights Reserved.

                native_sctp_bind  - host gordon-x , port 10000 

                native_sctp_addrs count is 2 
Local (bound) address for socket 3
Address count 2, port 10000
        address  0: ::ffff:10.10.1.2
        address  1: ::ffff:10.10.3.2
New catcher: 11194
Peer address for socket 4
Address count 2, port 48347
        address  0: ::ffff:10.10.3.1
        address  1: ::ffff:10.10.1.1
          Packets  Packet   Cumulative Rate       Current Rate    Packets
Endpoint  (x1000)    Size   (bytes/s) (msg/s)  (bytes/s)  (msg/s) Skipped
========  =======  ======  ==================  ================== =======
BUMMER: message received out of sequence... expected 342080, got 342081
BUMMER: message received out of sequence... expected 342082, got 342080
BUMMER: message received out of sequence... expected 342081, got 342082
BUMMER: message received out of sequence... expected 640070, got 640071
BUMMER: message received out of sequence... expected 640072, got 640070
BUMMER: message received out of sequence... expected 640071, got 640072
BUMMER: message received out of sequence... expected 640898, got 640899
BUMMER: message received out of sequence... expected 640900, got 640898

Only 2 lines in /var/log/messages:
Sep 22 09:40:48 gordon kernel: SCTP_DBG: Process Rx tsn out of order
Sep 22 09:40:48 gordon kernel: SCTP_DBG: ETSN = 2038533981, QTSN = 0 SSN = 0

I will rerun the test & get tcpdumps

Comment 78 William Reich 2009-09-22 14:27:57 UTC
using the v10 kernel on both machines...

+---------------------+    Native SCTP performance test (Aug 14 2009)
| native sctp_catcher |    Copyright (c) 2000 - 2009 Ulticom, Inc.
+---------------------+    All Rights Reserved.

                native_sctp_bind  - host gordon-x , port 10000 

                native_sctp_addrs count is 2 
Local (bound) address for socket 3
Address count 2, port 10000
        address  0: ::ffff:10.10.1.2
        address  1: ::ffff:10.10.3.2
New catcher: 31019
Peer address for socket 4
Address count 2, port 34953
        address  0: ::ffff:10.10.3.1
        address  1: ::ffff:10.10.1.1
          Packets  Packet   Cumulative Rate       Current Rate    Packets
Endpoint  (x1000)    Size   (bytes/s) (msg/s)  (bytes/s)  (msg/s) Skipped
========  =======  ======  ==================  ================== =======
BUMMER: message received out of sequence... expected 263891, got 263892
BUMMER: message received out of sequence... expected 263893, got 263891
BUMMER: message received out of sequence... expected 263892, got 263893
BUMMER: message received out of sequence... expected 282040, got 282041
BUMMER: message received out of sequence... expected 282042, got 282040
BUMMER: message received out of sequence... expected 282041, got 282042
BUMMER: message received out of sequence... expected 282142, got 282143


gordon 266%  /usr/sbin/tcpdump -i any -s 256 -w 4gordon.tcp.out sctp
tcpdump: WARNING: Promiscuous mode not supported on the "any" device
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 256 bytes
663349 packets captured
743096 packets received by filter
79743 packets dropped by kernel

/usr/sbin/tcpdump -i any -s 256 -w 4jarrett.tcp.out sctp   
tcpdump: WARNING: Promiscuous mode not supported on the "any" device
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 256 bytes
^C740700 packets captured
743096 packets received by filter
2392 packets dropped by kernel

Comment 79 William Reich 2009-09-22 14:29:03 UTC
Created attachment 362091 [details]
output of catcher on "gordon" using v10 kernel

output of catcher on "gordon" using v10 kernel from comment 78

Comment 80 William Reich 2009-09-22 14:30:19 UTC
Created attachment 362092 [details]
tcpdump of catcher using v10 kernel


tcpdump of catcher using v10 kernel from comment 78

Comment 81 William Reich 2009-09-22 14:31:59 UTC
Created attachment 362093 [details]
/var/log/messages from test run of comment 78

/var/log/messages from test run of comment 78

Interestingly, only 2 lines of useful output:
Sep 22 10:00:36 gordon kernel: SCTP_DBG: Process Rx tsn out of order
Sep 22 10:00:36 gordon kernel: SCTP_DBG: ETSN = 169222358, QTSN = 0 SSN = 0

Comment 82 William Reich 2009-09-22 14:34:18 UTC
Created attachment 362094 [details]
tcpdump of pitcher using v10 kernel

tcpdump of pitcher using v10 kernel of comment 78

Comment 83 Neil Horman 2009-09-24 18:18:16 UTC
It just doesn't make sense.  If we're getting sequence numbers out of order, the TSNs should be out of order too, and they're not, or these printks would have shown that.

I have an idea.  It should let us see the tsn sequence along with your payload sequence numbers.  I'm going to modify your test program to subscribe to data_io_events by issuing a setsockopt call to set the SCTP_EVENTS option.  Then we should get a cmsg header with each arriving packet that details the sequence number and the tsn value that goes with it, allowing us to directly compare the TSN and your payload sequence value.  Give me a few days and I'll have a patch for you.
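
A minimal sketch of the approach described above, assuming a standard lksctp setup (it is not the patch later attached in comment 84): subscribe to sctp_data_io_event via SCTP_EVENTS, then read the struct sctp_sndrcvinfo ancillary data from each received message so its sinfo_tsn/sinfo_ssn can be compared with the payload counter.  The helper names are placeholders.

#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <netinet/in.h>
#include <netinet/sctp.h>

/* Ask the stack to attach an SCTP_SNDRCV cmsg to every received message. */
static int subscribe_data_io(int sd)
{
        struct sctp_event_subscribe ev;

        memset(&ev, 0, sizeof(ev));
        ev.sctp_data_io_event = 1;
        return setsockopt(sd, IPPROTO_SCTP, SCTP_EVENTS, &ev, sizeof(ev));
}

/* Receive one message and copy out its sndrcv info (stream, SSN, TSN...). */
static ssize_t recv_with_info(int sd, void *buf, size_t len,
                              struct sctp_sndrcvinfo *info)
{
        char cbuf[CMSG_SPACE(sizeof(*info))];
        struct iovec iov = { buf, len };
        struct msghdr msg;
        struct cmsghdr *cmsg;
        ssize_t n;

        memset(&msg, 0, sizeof(msg));
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = cbuf;
        msg.msg_controllen = sizeof(cbuf);

        n = recvmsg(sd, &msg, 0);
        for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg))
                if (cmsg->cmsg_level == IPPROTO_SCTP &&
                    cmsg->cmsg_type == SCTP_SNDRCV)
                        memcpy(info, CMSG_DATA(cmsg), sizeof(*info));
        return n;
}

The catcher's existing sequence check could then print info.sinfo_tsn and info.sinfo_ssn whenever the expected payload sequence is violated.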

Comment 84 Neil Horman 2009-09-30 16:53:56 UTC
Created attachment 363205 [details]
patch to augment catcher reproducer

I've not had a chance to test it yet, but this patch should help us along here.  It augments your catcher to dump out sndrcv info to stderr as packets are received.  This should let us see a misordering in tsn that correlates to the payload misordering.  If you could run this on your reproducer and catch the stderr output to a file, as well as the normal catcher output (no tcpdump needed now :) ), that would be great.  Thanks!

Comment 85 William Reich 2009-09-30 17:42:32 UTC

...no joy...

Wed Sep 30 13:57:07 EDT 2009
omnitest@gordon 143% ./sctp_native_catcher -bind gordon-x
+---------------------+    Native SCTP performance test (Sep 30 2009)
| native sctp_catcher |    Copyright (c) 2000 - 2009 Ulticom, Inc.
+---------------------+    All Rights Reserved.

                native_sctp_bind  - host gordon-x , port 10000 

                native_sctp_addrs count is 2 
Local (bound) address for socket 3
Address count 2, port 10000
        address  0: ::ffff:10.10.1.2
        address  1: ::ffff:10.10.3.2
New catcher: 3524
Peer address for socket 4
ERROR: sctp_getpaddrs:  errno 22: Invalid argument
NEW MESSAGE
sid = 0ssn = 0tsn = -481373512ctsn = 0flags = 0
NEW MESSAGE
sid = 0ssn = 0tsn = -481373512ctsn = 0flags = 0
NEW MESSAGE
sid = 0ssn = 0tsn = -481373512ctsn = 0flags = 0
NEW MESSAGE
sid = 0ssn = 0tsn = -481373512ctsn = 0flags = 0
NEW MESSAGE
sid = 0ssn = 0tsn = -481373512ctsn = 0flags = 0
NEW MESSAGE
sid = 0ssn = 0tsn = -481373512ctsn = 0flags = 0
NEW MESSAGE
sid = 0ssn = 0tsn = -481373512ctsn = 0flags = 0
NEW MESSAGE
sid = 0ssn = 0tsn = -481373512ctsn = 0flags = 0
NEW MESSAGE
sid = 0ssn = 0tsn = -481373512ctsn = 0flags = 0
NEW MESSAGE
sid = 0ssn = 0tsn = -481373512ctsn = 0flags = 0
NEW MESSAGE
sid = 0ssn = 0tsn = -481373512ctsn = 0flags = 0


...sequence repeats infinitely...

I expected to see something like this:./sctp_native_catcher.orig -bind gordon-x
+---------------------+    Native SCTP performance test (Aug 14 2009)
| native sctp_catcher |    Copyright (c) 2000 - 2009 Ulticom, Inc.
+---------------------+    All Rights Reserved.

                native_sctp_bind  - host gordon-x , port 10000 

                native_sctp_addrs count is 2 
Local (bound) address for socket 3
Address count 2, port 10000
        address  0: ::ffff:10.10.1.2
        address  1: ::ffff:10.10.3.2
New catcher: 8561
Peer address for socket 4
Address count 2, port 45637
        address  0: ::ffff:10.10.3.1
        address  1: ::ffff:10.10.1.1
          Packets  Packet   Cumulative Rate       Current Rate    Packets
Endpoint  (x1000)    Size   (bytes/s) (msg/s)  (bytes/s)  (msg/s) Skipped
========  =======  ======  ==================  ================== =======
Catcher 8561: exiting; 5 packets received

Comment 86 Neil Horman 2009-09-30 18:27:32 UTC
Created attachment 363214 [details]
corrected reproducer patch

Sorry, forgot to advance the cmsghdr loop.  Here's a corrected patch.

Comment 87 William Reich 2009-09-30 19:13:25 UTC
Using this newest catcher program, there were no
mis-sequenced messages reported.
The timing of the whole machine changed as this new version
of the catcher takes ~95% of the CPU, while the original version
only uses about 39%. ( lots of printf's )
I ran the test 3 times. Each time over 4 million messages with
no mis-sequences ( total of 12 million for all the tests together )

Then, I went back and ran the original catcher program.
The mis-sequences were observed several times within the first
one million messages.

Comment 88 Neil Horman 2009-09-30 21:29:31 UTC
Sigh.  Then I'll modify it to keep a history buffer and only dump the buffer in the event of mis-sequenced data.

Comment 89 Neil Horman 2009-10-01 19:05:44 UTC
Created attachment 363385 [details]
patch to record tsn history

Ok, here you go, here's a new variant of the patch.  Instead of printing everything out, it only keeps a history of the last 10 packets' sndrcvinfo data, and then it dumps those 10 entries every time you detect an out-of-sequence payload.  Should be much faster.  It's zero copy from the standpoint of the app, and doesn't hit the console until something goes wrong.
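
A minimal sketch of the history-buffer idea described above (not the attached patch itself): remember the sndrcv info of the last 10 messages in a ring and print it, oldest entry first, only when the payload sequence check fails.  The field formatting mirrors the hex-style dumps quoted in comment 90.

#include <stdio.h>
#include <netinet/sctp.h>

#define HIST 10

static struct sctp_sndrcvinfo hist[HIST];
static unsigned int hist_next;

/* Called for every received message; overwrites the oldest slot. */
static void record_info(const struct sctp_sndrcvinfo *info)
{
        hist[hist_next] = *info;
        hist_next = (hist_next + 1) % HIST;
}

/* Called only when an out-of-sequence payload is detected. */
static void dump_history(void)
{
        unsigned int i, idx;

        for (i = 0; i < HIST; i++) {
                idx = (hist_next + i) % HIST;   /* oldest entry first */
                printf("sid=%x ssn=%x flags=%x, tsn=%x ctsn=%x\n",
                       hist[idx].sinfo_stream, hist[idx].sinfo_ssn,
                       hist[idx].sinfo_flags, hist[idx].sinfo_tsn,
                       hist[idx].sinfo_cumtsn);
        }
}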

Comment 90 William Reich 2009-10-01 19:38:37 UTC
Created attachment 363393 [details]
output from program of comment 89


full output of catcher after patch from comment 89 is applied.

do I need tcpdumps ?

short version of output here:

./sctp_native_catcher -report 1000000 -bind gordon-x
+---------------------+    Native SCTP performance test (Oct  1 2009)
| native sctp_catcher |    Copyright (c) 2000 - 2009 Ulticom, Inc.
+---------------------+    All Rights Reserved.

                native_sctp_bind  - host gordon-x , port 10000 

                native_sctp_addrs count is 2 
Local (bound) address for socket 3
Address count 2, port 10000
        address  0: ::ffff:10.10.1.2
        address  1: ::ffff:10.10.3.2
New catcher: 14348
Peer address for socket 4
Address count 2, port 42797
        address  0: ::ffff:10.10.3.1
        address  1: ::ffff:10.10.1.1
          Packets  Packet   Cumulative Rate       Current Rate    Packets
Endpoint  (x1000)    Size   (bytes/s) (msg/s)  (bytes/s)  (msg/s) Skipped
========  =======  ======  ==================  ================== =======
    0        1000    848    41800651    49293   41800651    49293    0
    0        2000    848    40153731    47351   38631666    45556    0
    0        3000    848    39618951    46720   38591018    45508   BUMMER: message received out of sequence... expected 3291450, got 3291451

Detected bad sequence in payload, dumping tsn history
sid=0 ssn=3931 flags=0, tsn=40353b9d ctsn=0
sid=0 ssn=3932 flags=0, tsn=40353b9e ctsn=0
sid=0 ssn=3933 flags=0, tsn=40353b9f ctsn=0
sid=0 ssn=3934 flags=0, tsn=40353ba0 ctsn=0
sid=0 ssn=3935 flags=0, tsn=40353ba1 ctsn=0
sid=0 ssn=3936 flags=0, tsn=40353ba2 ctsn=0
sid=0 ssn=3937 flags=0, tsn=40353ba3 ctsn=0
sid=0 ssn=3938 flags=0, tsn=40353ba4 ctsn=0
sid=0 ssn=3939 flags=0, tsn=40353ba5 ctsn=0
sid=0 ssn=3930 flags=0, tsn=40353b9c ctsn=0
BUMMER: message received out of sequence... expected 3291452, got 3291450

Detected bad sequence in payload, dumping tsn history
sid=0 ssn=3932 flags=0, tsn=40353b9e ctsn=0
sid=0 ssn=3933 flags=0, tsn=40353b9f ctsn=0
sid=0 ssn=3934 flags=0, tsn=40353ba0 ctsn=0
sid=0 ssn=3935 flags=0, tsn=40353ba1 ctsn=0
sid=0 ssn=3936 flags=0, tsn=40353ba2 ctsn=0
sid=0 ssn=3937 flags=0, tsn=40353ba3 ctsn=0
sid=0 ssn=3938 flags=0, tsn=40353ba4 ctsn=0
sid=0 ssn=3939 flags=0, tsn=40353ba5 ctsn=0
sid=0 ssn=393a flags=0, tsn=40353ba6 ctsn=0
sid=0 ssn=3931 flags=0, tsn=40353b9d ctsn=0

Comment 91 Neil Horman 2009-10-01 20:12:43 UTC
Yes, if you have them that would be good.  These are some interesting results, to say the least.  Keep in mind the last entry should actually be the first (that's a minor bug in my changes).  Regardless, despite the out-of-order sequence numbers, the tsn values and ssn values that you see in these dumps are all in order.  That suggests that from a protocol standpoint, we're delivering these messages to your app in the order they should be delivered.  That in turn suggests that the payload is somehow getting mixed up either during the send or in the sending application itself.  We really need to see those frames that go missing in those tcpdumps.  I'll see if we can't do some sender-side augmentation as well to figure out what's going on.

Comment 92 William Reich 2009-10-02 12:19:39 UTC
Created attachment 363466 [details]
another run from tool of comment 89 - this time with associated  tcpdumps


another run from tool of comment 89 - this time with associated  tcpdumps
This attachment is the output of the 'catcher'. 

GORDON
/usr/sbin/tcpdump -i any -s 256 -w 8ordon.tcp.out sctp
tcpdump: WARNING: Promiscuous mode not supported on the "any" device
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 256 bytes
918582 packets captured
949498 packets received by filter
30912 packets dropped by kernel

omnitest@jarrett 
125% /usr/sbin/tcpdump -i any -s 256 -w 8jarrett.tcp.out sct>
tcpdump: WARNING: Promiscuous mode not supported on the "any" device
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 256 bytes
886915 packets captured
949498 packets received by filter
62579 packets dropped by kernel

(short output here... )
./sctp_native_catcher -report 1000000 -bind gordon-x
+---------------------+    Native SCTP performance test (Oct  1 2009)
| native sctp_catcher |    Copyright (c) 2000 - 2009 Ulticom, Inc.
+---------------------+    All Rights Reserved.

                native_sctp_bind  - host gordon-x , port 10000 

                native_sctp_addrs count is 2 
Local (bound) address for socket 3
Address count 2, port 10000
        address  0: ::ffff:10.10.1.2
        address  1: ::ffff:10.10.3.2
New catcher: 3967
Peer address for socket 4
Address count 2, port 50265
        address  0: ::ffff:10.10.3.1
        address  1: ::ffff:10.10.1.1
          Packets  Packet   Cumulative Rate       Current Rate    Packets
Endpoint  (x1000)    Size   (bytes/s) (msg/s)  (bytes/s)  (msg/s) Skipped
========  =======  ======  ==================  ================== =======
BUMMER: message received out of sequence... expected 3992Detected bad sequence in payload, dumping tsn history
sid=0 ssn=177f flags=0, tsn=2aa445d7 ctsn=0
sid=0 ssn=1780 flags=0, tsn=2aa445d8 ctsn=0
sid=0 ssn=1781 flags=0, tsn=2aa445d9 ctsn=0
sid=0 ssn=1782 flags=0, tsn=2aa445da ctsn=0
sid=0 ssn=1783 flags=0, tsn=2aa445db ctsn=0
sid=0 ssn=1784 flags=0, tsn=2aa445dc ctsn=0
sid=0 ssn=1785 flags=0, tsn=2aa445dd ctsn=0
sid=0 ssn=1786 flags=0, tsn=2aa445de ctsn=0
sid=0 ssn=1787 flags=0, tsn=2aa445df ctsn=0
sid=0 ssn=177e flags=0, tsn=2aa445d6 ctsn=0
BUMMER: message received out of sequence... expected 399242, got 399240
40, got 399241

Comment 93 William Reich 2009-10-02 12:24:07 UTC
Created attachment 363469 [details]
tcpdump to match comment 92 from "catcher"


gzip format

Comment 94 William Reich 2009-10-02 12:26:43 UTC
Created attachment 363471 [details]
tcpdump of pitcher to match comment 92


gzip format

Comment 95 Neil Horman 2009-10-06 18:33:49 UTC
When we assign TSN values, we do so in sctp_packet_transmit, which is the last function we have in the sctp code prior to sending a packet to the ip layer for routing/sending.  Unfortunately, this function is called per transport (i.e. called independently based on which interface you are sending out of).  If each of your two paths has a different queueing environment (if one path is corked too long, or has a pmtu that's smaller than the alternate path, etc.), it's possible that these per-transport queues will get serviced out of order.  Sometimes, on those borders where you switch transports, packets can be handled out of order between the two paths.  TSNs will still be assigned sequentially, so everything is orderly from a transport point of view, but from a user space payload viewpoint data will be out of order.

I'll put together a patch for this shortly.
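
To make the described race concrete, here is a small self-contained illustration in ordinary user-space C (it is not kernel code): payload sequence numbers are queued per transport in application order, but TSNs are only assigned when a transport queue is flushed, so if the new path's queue happens to be flushed before the old path's, the wire shows strictly increasing TSNs carrying out-of-order payload.

#include <stdio.h>

#define PER_PATH 3

int main(void)
{
        unsigned int queue[2][PER_PATH];   /* per-transport transmit queues */
        unsigned int seq = 0, tsn = 100, p, i;

        /* Application order: payload seq 0..2 on path 0, 3..5 on path 1. */
        for (p = 0; p < 2; p++)
                for (i = 0; i < PER_PATH; i++)
                        queue[p][i] = seq++;

        /* The race: at the switch, the new path's queue is flushed before
         * the old one's.  TSNs are assigned here, at transmit time, so they
         * still come out strictly increasing on the wire. */
        for (i = 0; i < PER_PATH; i++)
                printf("wire: tsn=%u payload_seq=%u\n", tsn++, queue[1][i]);
        for (i = 0; i < PER_PATH; i++)
                printf("wire: tsn=%u payload_seq=%u\n", tsn++, queue[0][i]);

        return 0;
}

The receiver delivers in TSN/SSN order, so the protocol sees nothing wrong, yet the application's payload counters arrive as 3, 4, 5, 0, 1, 2; this is the same kind of mismatch as the BUMMER swaps next to in-order TSN dumps in comments 90 and 92.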

Comment 96 Neil Horman 2009-10-06 19:48:50 UTC
Created attachment 363887 [details]
new patch to attempt serialization of outgonig transports

Ok, I've not built/tested it yet, but I wanted to get it to you.  Basically, what I try to do here is: during a primary path change, I mark the old active path so that when we go to flush queues we're guaranteed to get the old active path flushed before we transmit from the new active path.  If you could build this patch into a kernel and run it on your sender, I'd appreciate it.

Comment 97 William Reich 2009-10-06 20:51:01 UTC
My test environment is better set up to receive a complete experimental kernel. This also allows you to confirm that the patch was properly applied.

Thanks

Comment 99 Neil Horman 2009-10-07 16:28:40 UTC
http://people.redhat.com/nhorman/rpms/kernel-2.6.18-165.el5.bz517504.v11.x86_64.rpm

New kernel for you to test.

Comment 100 William Reich 2009-10-07 17:46:56 UTC
panic on pitcher machine using kernel from comment 99

sctp_assoc_set_primary+0x39

I will set up the test again to get a full backtrace from a serial console...

Comment 101 Neil Horman 2009-10-07 18:12:05 UTC
No need, I think I know what happened.  active_path can be null there at the start of an association.  I'll fix it.

Comment 104 William Reich 2009-10-08 18:40:06 UTC

Using the kernel from comment 103 on both machines --

better, but still mis-sequences...
( it just took a longer test run to show the mis-sequences )

This is the catcher side...

Peer address for socket 4
Address count 2, port 52598
        address  0: ::ffff:10.10.3.1
        address  1: ::ffff:10.10.1.1
          Packets  Packet   Cumulative Rate       Current Rate    Packets
Endpoint  (x1000)    Size   (bytes/s) (msg/s)  (bytes/s)  (msg/s) Skipped
========  =======  ======  ==================  ================== =======
    0        1000    848    39697649    46813   39697649    46813    0
    0        2000    848    39461476    46534   39228096    46259    0
    0        3000    848    39406640    46470   39297425    46341    0
BUMMER: message received out of sequence... expected 3054125, got 3054126
BUMMER: message received out of sequence... expected 3054127, got 3054125
BUMMER: message received out of sequence... expected 3054126, got 3054127
    0        4000    848    39416636    46481   39446654    46517    0
    0        5000    848    39394865    46456   39308019    46353    0
    0        6000    848    39388595    46448   39357278    46411    0
BUMMER: message received out of sequence... expected 6878181, got 6878182
BUMMER: message received out of sequence... expected 6878183, got 6878181
BUMMER: message received out of sequence... expected 6878182, got 6878183
BUMMER: message received out of sequence... expected 6971977, got 6971978

Comment 105 William Reich 2009-10-08 18:42:23 UTC

Same kernels, but this time the catcher was the tool
that printed out extra information....

Local (bound) address for socket 3
Address count 2, port 10000
        address  0: ::ffff:10.10.1.2
        address  1: ::ffff:10.10.3.2
New catcher: 16126
Peer address for socket 4
Address count 2, port 42953
        address  0: ::ffff:10.10.3.1
        address  1: ::ffff:10.10.1.1
          Packets  Packet   Cumulative Rate       Current Rate    Packets
Endpoint  (x1000)    Size   (bytes/s) (msg/s)  (bytes/s)  (msg/s) Skipped
========  =======  ======  ==================  ================== =======
BUMMER: message received out of sequence... expected 128372, got 128373

Detected bad sequence in payload, dumping tsn history
sid=0 ssn=f56b flags=0, tsn=84f93adf ctsn=0
sid=0 ssn=f56c flags=0, tsn=84f93ae0 ctsn=0
sid=0 ssn=f56d flags=0, tsn=84f93ae1 ctsn=0
sid=0 ssn=f56e flags=0, tsn=84f93ae2 ctsn=0
sid=0 ssn=f56f flags=0, tsn=84f93ae3 ctsn=0
sid=0 ssn=f570 flags=0, tsn=84f93ae4 ctsn=0
sid=0 ssn=f571 flags=0, tsn=84f93ae5 ctsn=0
sid=0 ssn=f572 flags=0, tsn=84f93ae6 ctsn=0
sid=0 ssn=f573 flags=0, tsn=84f93ae7 ctsn=0
sid=0 ssn=f56a flags=0, tsn=84f93ade ctsn=0
BUMMER: message received out of sequence... expected 128374, got 128372

Detected bad sequence in payload, dumping tsn history

Comment 108 Neil Horman 2009-10-09 19:47:02 UTC
http://people.redhat.com/nhorman/rpms/kernel-2.6.18-165.el5.bz517504.v13.x86_64.rpm

Ok, here's yet another kernel for you to try.  I've cleaned up the previous transport code a bit more, and added a sysctl called /proc/sys/net/sctp/early_tsn_assign.  It should further serialize the tsn assignments in the tx path, so we can catch other errors in the same area of code.  If you could try this kernel on your sender, with early_tsn_assign set to both 0 and then 1, that would be great.  Thanks!

Comment 109 William Reich 2009-10-09 20:35:55 UTC

well - something is wrong on the catcher side.

Using the kernel from comment 108 on both the pitcher and catcher machines,
if I send more than 5 packets via the pitcher, the catcher does not see them.
The pitcher thinks it has completed, but the catcher does not report
catching them.

no panic, but no cpu time is being used on the catcher-side either.

This is the same regardless of the value of /proc/sys/net/sctp/early_tsn_assign.

When I tell the catcher to report each packet
(  sctp_native_catcher -report 1   -bind gordon-x ),
I see 6 packets, and then nothing..., even though I have told
the pitcher to send 40.

This occurs using the original catcher as well as the catcher
from comment 89.

Also note that as I am writing this, I see a few reports
spit out after a couple of minutes.
So, this seems to imply some kind of lock problem (???).

Comment 110 Neil Horman 2009-10-14 18:46:22 UTC
http://people.redhat.com/nhorman/rpms/kernel-2.6.18-165.el5.bz517504.v15.x86_64.rpm

New kernel for you (please excuse the missing v14, there was a build system issue).  Anywho, I've tested this version and it works well for me.  It's still got the early_tsn_assign code in place, so if you could test it both with that setting on and off, that would be great.  Thanks!

Comment 111 William Reich 2009-10-16 17:47:45 UTC
using the new kernel from comment 110,
mis-sequenced messages were noted regardless of the value of
/proc/sys/net/sctp/early_tsn_assign.

What data would you like me to present to you ?

Comment 112 William Reich 2009-10-16 17:59:07 UTC
more info - note that when
/proc/sys/net/sctp/early_tsn_assign = 1 on the pitcher, there are far fewer
mis-sequences...
but they still do occur...

Comment 113 Neil Horman 2009-10-16 21:47:11 UTC
That's enough info.  Given our previous tests, this shows that even when we assign tsn values early in sctp_outq_flush, the outq we are enqueuing to is receiving those messages out of order.  I'll make a new kernel that attempts to trap which frame goes out of order.

Comment 115 Neil Horman 2009-10-20 20:17:30 UTC
http://people.redhat.com/nhorman/rpms/kernel-2.6.18-165.el5.bz517504.v17.x86_64.rpm

New kernel for you:

changes:
1) removed the early_tsn_assign bits.  Those showed that even if we assigned
tsns in sctp_outq_flush we had misordering events, which suggests the
misordering is happening in a queue prior to calling sctp_outq_flush

2) added sequencing assignment and checks.
In sctp_make_datafrag_empty we assign a sequence number to each chunk,
derived from a per-association counter.  Each outq (the outq:association
ratio is 1:1) checks in sctp_outq_flush that the sequencing doesn't go
backwards.  This should help us pin down ordering issues there (a rough
sketch of the idea follows).
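
A toy version of that instrumentation (illustrative only; the real change is against the RHEL5 sctp stack, and the names below are made up):

/* Every chunk gets a debug sequence number from a per-association counter
 * when it is created, and the flush path warns if it ever sees that
 * sequence go backwards. */
#include <stdio.h>

struct toy_assoc { unsigned long next_seq; unsigned long last_flushed; };
struct toy_chunk { unsigned long dbg_seq; };

static void toy_make_chunk(struct toy_assoc *a, struct toy_chunk *c)
{
    c->dbg_seq = ++a->next_seq;          /* assigned at creation time */
}

static void toy_flush_chunk(struct toy_assoc *a, struct toy_chunk *c)
{
    if (c->dbg_seq < a->last_flushed)    /* went backwards: misordering */
        printf("Odd seq %lu:%lu seen at flush\n", c->dbg_seq, a->last_flushed);
    a->last_flushed = c->dbg_seq;
}

int main(void)
{
    struct toy_assoc a = { 0, 0 };
    struct toy_chunk c1, c2;

    toy_make_chunk(&a, &c1);
    toy_make_chunk(&a, &c2);
    toy_flush_chunk(&a, &c2);            /* flushed out of order */
    toy_flush_chunk(&a, &c1);            /* triggers the warning */
    return 0;
}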

Comment 116 William Reich 2009-10-21 12:04:51 UTC
Using the pitcher and catcher from comment 1, and the kernel
from comment 115 on both machines,
mis-sequences were observed on the catcher.
+++++++++++

New catcher: 11164
Peer address for socket 4
Address count 2, port 42973
        address  0: ::ffff:10.10.3.1
        address  1: ::ffff:10.10.1.1
          Packets  Packet   Cumulative Rate       Current Rate    Packets
Endpoint  (x1000)    Size   (bytes/s) (msg/s)  (bytes/s)  (msg/s) Skipped
========  =======  ======  ==================  ================== =======
BUMMER: message received out of sequence... expected 569003, got 569004
BUMMER: message received out of sequence... expected 569005, got 569003
BUMMER: message received out of sequence... expected 569004, got 569005
BUMMER: message received out of sequence... expected 649501, got 649502
BUMMER: message received out of sequence... expected 649503, got 649501
BUMMER: message received out of sequence... expected 649502, got 649503
BUMMER: message received out of sequence... expected 649529, got 649530
BUMMER: message received out of sequence... expected 649531, got 649529
BUMMER: message received out of sequence... expected 649530, got 649531

+++++++++
what data do you need from me regarding this test ?

Comment 117 William Reich 2009-10-21 12:09:03 UTC

Same test as comment 116, but this time using the catcher from
comment 89...

omnitest@gordon 338% sc  -report 1000000 -bind gordon-x
+---------------------+    Native SCTP performance test (Oct  1 2009)
| native sctp_catcher |    Copyright (c) 2000 - 2009 Ulticom, Inc.
+---------------------+    All Rights Reserved.

                native_sctp_bind  - host gordon-x , port 10000 

                native_sctp_addrs count is 2 
Local (bound) address for socket 3
Address count 2, port 10000
        address  0: ::ffff:10.10.1.2
        address  1: ::ffff:10.10.3.2
New catcher: 16113
Peer address for socket 4
Address count 2, port 36514
        address  0: ::ffff:10.10.3.1
        address  1: ::ffff:10.10.1.1
          Packets  Packet   Cumulative Rate       Current Rate    Packets
Endpoint  (x1000)    Size   (bytes/s) (msg/s)  (bytes/s)  (msg/s) Skipped
========  =======  ======  ==================  ================== =======
BUMMER: message received out of sequence... expected 144917, got 144918

Detected bad sequence in payload, dumping tsn history
sid=0 ssn=360c flags=0, tsn=afe6fd16 ctsn=0
sid=0 ssn=360d flags=0, tsn=afe6fd17 ctsn=0
sid=0 ssn=360e flags=0, tsn=afe6fd18 ctsn=0
sid=0 ssn=360f flags=0, tsn=afe6fd19 ctsn=0
sid=0 ssn=3610 flags=0, tsn=afe6fd1a ctsn=0
sid=0 ssn=3611 flags=0, tsn=afe6fd1b ctsn=0
sid=0 ssn=3612 flags=0, tsn=afe6fd1c ctsn=0
sid=0 ssn=3613 flags=0, tsn=afe6fd1d ctsn=0
sid=0 ssn=3614 flags=0, tsn=afe6fd1e ctsn=0
sid=0 ssn=360b flags=0, tsn=afe6fd15 ctsn=0
BUMMER: message received out of sequence... expected 144919, got 144917

Detected bad sequence in payload, dumping tsn history
sid=0 ssn=360d flags=0, tsn=afe6fd17 ctsn=0
sid=0 ssn=360e flags=0, tsn=afe6fd18 ctsn=0
sid=0 ssn=360f flags=0, tsn=afe6fd19 ctsn=0
sid=0 ssn=3610 flags=0, tsn=afe6fd1a ctsn=0
sid=0 ssn=3611 flags=0, tsn=afe6fd1b ctsn=0
sid=0 ssn=3612 flags=0, tsn=afe6fd1c ctsn=0
sid=0 ssn=3613 flags=0, tsn=afe6fd1d ctsn=0
sid=0 ssn=3614 flags=0, tsn=afe6fd1e ctsn=0
sid=0 ssn=3615 flags=0, tsn=afe6fd1f ctsn=0
sid=0 ssn=360c flags=0, tsn=afe6fd16 ctsn=0
BUMMER: message received out of sequence... expected 144918, got 144919

Comment 118 Neil Horman 2009-10-21 17:54:37 UTC
I need the output of /var/log/messages from these tests.  That will hold any alerts about mis-sequencing between the time you called sendmsg from userspace and the time we passed those chunks to the ip layer.

Comment 119 William Reich 2009-10-21 20:03:24 UTC
on the catcher-side, there was nothing in /var/log/messages.

on the pitcher-side, here is what was found...


Oct 21 08:45:21 jarrett kernel: ODD INPUT SEQUENCE
Oct 21 08:45:21 jarrett kernel: LAST_SEEN = 0 CHUNK=1 ASSOC=ffff8100aec4e000
Oct 21 08:45:33 jarrett kernel: ODD INPUT SEQUENCE
Oct 21 08:45:33 jarrett kernel: LAST_SEEN = 0 CHUNK=1 ASSOC=ffff8100af446000
Oct 21 08:49:58 jarrett kernel: ODD INPUT SEQUENCE
Oct 21 08:49:58 jarrett kernel: LAST_SEEN = 0 CHUNK=1 ASSOC=ffff8100ae004000


I ran the pitcher on 3 tests this morning...

Comment 120 Neil Horman 2009-10-21 20:17:16 UTC
Ok, I am now officially at a loss.  We have:

1) determined that we are receiving tsns in order (comments #117, #116 and prior)
2) determined that we are not rearranging sequences prior to tsn assignment (comment 119; note that seq 0 is an artifact that comes from starting assignments at 1 but initializing the last-seen value to 0)

3) been completely unable to record a tcpdump in which the affected frames can be captured.

I just don't know what to do anymore.  By all our tests, we're sending everything in order, and we're receiving everything in order, but your app is saying that the data we're getting is out of order.  The only thing I can think is happening is that we're re-ordering data prior to assigning sequencing values to it, but this last test indicates that we're not doing that.

any thoughts?

Comment 121 William Reich 2009-10-23 12:52:53 UTC
Created attachment 365846 [details]
wireshark output file


Here is a wireshark capture file of the catcher-side
of one particular test.

New catcher: 26810
Peer address for socket 4
Address count 2, port 59370
        address  0: ::ffff:10.10.3.1
        address  1: ::ffff:10.10.1.1
          Packets  Packet   Cumulative Rate       Current Rate    Packets
Endpoint  (x1000)    Size   (bytes/s) (msg/s)  (bytes/s)  (msg/s) Skipped
========  =======  ======  ==================  ================== =======
BUMMER: message received out of sequence... expected 182985, got 182986

Detected bad sequence in payload, dumping tsn history
sid=0 ssn=cac0 flags=0, tsn=5dfd24c2 ctsn=0
sid=0 ssn=cac1 flags=0, tsn=5dfd24c3 ctsn=0
sid=0 ssn=cac2 flags=0, tsn=5dfd24c4 ctsn=0
sid=0 ssn=cac3 flags=0, tsn=5dfd24c5 ctsn=0
sid=0 ssn=cac4 flags=0, tsn=5dfd24c6 ctsn=0
sid=0 ssn=cac5 flags=0, tsn=5dfd24c7 ctsn=0
sid=0 ssn=cac6 flags=0, tsn=5dfd24c8 ctsn=0
sid=0 ssn=cac7 flags=0, tsn=5dfd24c9 ctsn=0
sid=0 ssn=cac8 flags=0, tsn=5dfd24ca ctsn=0
sid=0 ssn=cabf flags=0, tsn=5dfd24c1 ctsn=0
BUMMER: message received out of sequence... expected 182987, got 182985

In the wireshark file item 15190 contains sctp message 182974.
Item 15215 contains sctp message 182987
AND item 15221 contains sctp message 182984 ( ?? )
omnitest@gordon 366% cat /etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1               localhost.localdomain localhost
::1             localhost6.localdomain6 localhost6
172.25.4.75     gordon gordon.ulticom.com gordon-0 
10.10.1.2       gordon gordon-1 gordon  gordon-x
10.10.2.2       gordon gordon-2 gordon.ulticom.com gordon gordon-bond gordon-slan
10.10.3.2       gordon gordon-3  gordon.ulticom.com gordon-x

10.10.3.1       jarrett jarrett-3 jarrett.ulticom.com jarrett-3 jarrett-x
10.10.2.1       jarrett jarrett-2 jarrett  jarrett.ulticom.com jarrett-bond jarrett-slan
172.25.4.76     jarrett jarrett.ulticom.com jarrett-0 
10.10.1.1       jarrett jarrett-1 jarrett jarrett-1 jarrett-x


I did not look at the file in any detail other than to find
the location where the trouble was reported by the catcher tool.

+++++++++++
I will attempt to execute another test, but this time get
both pitcher and catcher wireshark files...


Note that 'gordon' is the catcher.
Here is /etc/hosts for reference...

Comment 122 William Reich 2009-10-23 12:57:02 UTC
comment 121 has the /etc/hosts file placed in the middle
of my typing.
... cut & paste issue...

Comment 123 William Reich 2009-10-23 13:27:36 UTC
Created attachment 365851 [details]
test 4 of wireshark output - 


Here is another test
This attachment is the catcher output in wireshark format.

New catcher: 25138
Peer address for socket 4
Address count 2, port 38710
        address  0: ::ffff:10.10.3.1
        address  1: ::ffff:10.10.1.1
          Packets  Packet   Cumulative Rate       Current Rate    Packets
Endpoint  (x1000)    Size   (bytes/s) (msg/s)  (bytes/s)  (msg/s) Skipped
========  =======  ======  ==================  ================== =======
BUMMER: message received out of sequence... expected 94328, got 94329

Detected bad sequence in payload, dumping tsn history
sid=0 ssn=706f flags=0, tsn=c570b907 ctsn=0
sid=0 ssn=7070 flags=0, tsn=c570b908 ctsn=0
sid=0 ssn=7071 flags=0, tsn=c570b909 ctsn=0
sid=0 ssn=7072 flags=0, tsn=c570b90a ctsn=0
sid=0 ssn=7073 flags=0, tsn=c570b90b ctsn=0
sid=0 ssn=7074 flags=0, tsn=c570b90c ctsn=0
sid=0 ssn=7075 flags=0, tsn=c570b90d ctsn=0
sid=0 ssn=7076 flags=0, tsn=c570b90e ctsn=0
sid=0 ssn=7077 flags=0, tsn=c570b90f ctsn=0
sid=0 ssn=706e flags=0, tsn=c570b906 ctsn=0
BUMMER: message received out of sequence... expected 94330, got 94328


On the catcher side, wireshark item 6944 is sctp msg 94326,
item 6948 is sctp msg 94329, and item 6949 is sctp msg 94328.

Comment 124 William Reich 2009-10-23 13:31:37 UTC
Created attachment 365852 [details]
pitcher side of test 4 - wireshark format


Here is the pitcher side ( jarrett ) of test 4.

In the wireshark file, item 16063 is sctp msg 94320.
Item 16076 is sctp msg 94330.
It appears that some output may have been lost, although no errors
were reported by the tool

Comment 125 Neil Horman 2009-10-23 16:56:56 UTC
Thank you, that actually helps a lot.  This is what I extracted from the wireshark capture:

DUP     FRAME#          TSN             SID     SSN             PAYLOAD
        15193           5dfd24c1        0       51903           182976
        15199           5dfd24c2        0       51904           182977
        15200           5dfd24c3        0       51905           182978
        15201           5dfd24c4        0       51906           182979
        15202           5dfd24c5        0       51907           182980
        15203           5dfd24c6        0       51908           182981
        15204           5dfd24c7        0       51909           182982
        15205           5dfd24c8        0       51910           182983

        15221           5dfd24c9        0       51911           182984
X       15231           5dfd24c9        0       51911           182984

        15223           5dfd24ca        0       51912           182986
X       15233           5dfd24ca        0       51912           182986

        15213           5dfd24cb        0       51913           182985

It shows us a few things:

1) We seem to have a loose correlation between duplicate frames and sequence reversals

2) We're definitely seeing in-sequence TSN values with out-of-sequence payloads.  This has to be caused either by a reassignment of tsn values in flight, or a misordering of the frames in queue prior to receiving their TSN values.  I think the latter is far more likely, and as such, I must have just missed something during my debugging.  I'll keep looking and have a new kernel for you soon.  Thanks!

Comment 127 Neil Horman 2009-10-26 19:49:57 UTC
http://people.redhat.com/nhorman/rpms/kernel-2.6.18-165.el5.bz517504.v18.x86_64.rpm

New build for you.  You'll need to run it on the sender system at least (if not both systems).  This debug code is rather sensitive to your user space code (it extracts the sequence number you embed in your payload based on a string offset), but it uses that data to track sequence changes throughout the sctp stack, from the moment the data enters the kernel to the time it's handed off to the ip stack.
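
Something like the following toy extraction logic (the offset and payload layout here are assumptions for illustration, not the values the debug kernel actually uses):

/* Pull the decimal sequence number the test tool embeds in its payload,
 * assuming it starts at a fixed byte offset into the message. */
#include <stdio.h>
#include <stdlib.h>

#define SEQ_OFFSET 8   /* assumed: decimal sequence starts 8 bytes in */

static unsigned long payload_seq(const char *payload)
{
    return strtoul(payload + SEQ_OFFSET, NULL, 10);
}

int main(void)
{
    const char *msg = "MSGDATA 182985 ...rest of 848-byte payload...";
    printf("embedded sequence = %lu\n", payload_seq(msg));
    return 0;
}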

Comment 128 William Reich 2009-10-28 12:37:17 UTC
Created attachment 366423 [details]
output from kernel from comment 127 on pitcher side



Running a test using the kernel from comment 127 on both machines,
there were no mis-sequences reported.

However there was a LARGE amount of debug reported in /var/log/messages 
on the pitcher side ( attached file )
Nothing observed on the catcher side.

Did the large amount of debug change the timing such that no
mis-sequences were observed ?

Comment 129 William Reich 2009-10-28 12:49:40 UTC
Created attachment 366425 [details]
test 2 - /var/log/messages from pitcher side


I re-ran the test, but let it run longer this time.

This time, mis-sequences were reported on the catcher side
when both the pitcher & catcher machines used the kernel from comment 127.
There was nothing in the catcher-side /var/log/messages.
The pitcher-side /var/log/messages is attached.
( This file actually contains the output of both tests that I did today. )

The catcher user space tool reported this on the 2nd test:
Peer address for socket 4
Address count 2, port 52496
        address  0: ::ffff:10.10.3.1
        address  1: ::ffff:10.10.1.1
          Packets  Packet   Cumulative Rate       Current Rate    Packets
Endpoint  (x1000)    Size   (bytes/s) (msg/s)  (bytes/s)  (msg/s) Skipped
========  =======  ======  ==================  ================== =======
    0        1000    848    39710144    46828   39710144    46828    0
    0        2000    848    39205291    46232   38713114    45652    0
    0        3000    848    39098801    46107   38887548    45857    0
    0        4000    848    39746385    46870   41824575    49321    0
BUMMER: message received out of sequence... expected 4026317, got 4026318

Detected bad sequence in payload, dumping tsn history
sid=0 ssn=6fc4 flags=0, tsn=d6a7c3c9 ctsn=0
sid=0 ssn=6fc5 flags=0, tsn=d6a7c3ca ctsn=0
sid=0 ssn=6fc6 flags=0, tsn=d6a7c3cb ctsn=0
sid=0 ssn=6fc7 flags=0, tsn=d6a7c3cc ctsn=0
sid=0 ssn=6fc8 flags=0, tsn=d6a7c3cd ctsn=0
sid=0 ssn=6fc9 flags=0, tsn=d6a7c3ce ctsn=0
sid=0 ssn=6fca flags=0, tsn=d6a7c3cf ctsn=0
sid=0 ssn=6fcb flags=0, tsn=d6a7c3d0 ctsn=0
sid=0 ssn=6fcc flags=0, tsn=d6a7c3d1 ctsn=0
sid=0 ssn=6fc3 flags=0, tsn=d6a7c3c8 ctsn=0
BUMMER: message received out of sequence... expected 4026319, got 4026317

Detected bad sequence in payload, dumping tsn history
sid=0 ssn=6fc5 flags=0, tsn=d6a7c3ca ctsn=0
sid=0 ssn=6fc6 flags=0, tsn=d6a7c3cb ctsn=0
sid=0 ssn=6fc7 flags=0, tsn=d6a7c3cc ctsn=0
sid=0 ssn=6fc8 flags=0, tsn=d6a7c3cd ctsn=0
sid=0 ssn=6fc9 flags=0, tsn=d6a7c3ce ctsn=0
sid=0 ssn=6fca flags=0, tsn=d6a7c3cf ctsn=0
sid=0 ssn=6fcb flags=0, tsn=d6a7c3d0 ctsn=0
sid=0 ssn=6fcc flags=0, tsn=d6a7c3d1 ctsn=0
sid=0 ssn=6fcd flags=0, tsn=d6a7c3d2 ctsn=0
sid=0 ssn=6fc4 flags=0, tsn=d6a7c3c9 ctsn=0
BUMMER: message received out of sequence... expected 4026318, got 4026319

Detected bad sequence in payload, dumping tsn history
sid=0 ssn=6fc6 flags=0, tsn=d6a7c3cb ctsn=0
sid=0 ssn=6fc7 flags=0, tsn=d6a7c3cc ctsn=0
sid=0 ssn=6fc8 flags=0, tsn=d6a7c3cd ctsn=0
sid=0 ssn=6fc9 flags=0, tsn=d6a7c3ce ctsn=0
sid=0 ssn=6fca flags=0, tsn=d6a7c3cf ctsn=0
sid=0 ssn=6fcb flags=0, tsn=d6a7c3d0 ctsn=0
sid=0 ssn=6fcc flags=0, tsn=d6a7c3d1 ctsn=0
sid=0 ssn=6fcd flags=0, tsn=d6a7c3d2 ctsn=0
sid=0 ssn=6fce flags=0, tsn=d6a7c3d3 ctsn=0
sid=0 ssn=6fc5 flags=0, tsn=d6a7c3ca ctsn=0
BUMMER: message received out of sequence... expected 4479263, got 4479264

Detected bad sequence in payload, dumping tsn history
sid=0 ssn=5916 flags=0, tsn=d6aead1b ctsn=0
sid=0 ssn=5917 flags=0, tsn=d6aead1c ctsn=0
sid=0 ssn=5918 flags=0, tsn=d6aead1d ctsn=0
sid=0 ssn=5919 flags=0, tsn=d6aead1e ctsn=0
sid=0 ssn=591a flags=0, tsn=d6aead1f ctsn=0
sid=0 ssn=591b flags=0, tsn=d6aead20 ctsn=0
sid=0 ssn=591c flags=0, tsn=d6aead21 ctsn=0
sid=0 ssn=591d flags=0, tsn=d6aead22 ctsn=0
sid=0 ssn=591e flags=0, tsn=d6aead23 ctsn=0
sid=0 ssn=5915 flags=0, tsn=d6aead1a ctsn=0
BUMMER: message received out of sequence... expected 4479265, got 4479263

Detected bad sequence in payload, dumping tsn history
sid=0 ssn=5917 flags=0, tsn=d6aead1c ctsn=0
sid=0 ssn=5918 flags=0, tsn=d6aead1d ctsn=0
sid=0 ssn=5919 flags=0, tsn=d6aead1e ctsn=0
sid=0 ssn=591a flags=0, tsn=d6aead1f ctsn=0
sid=0 ssn=591b flags=0, tsn=d6aead20 ctsn=0
sid=0 ssn=591c flags=0, tsn=d6aead21 ctsn=0
sid=0 ssn=591d flags=0, tsn=d6aead22 ctsn=0
sid=0 ssn=591e flags=0, tsn=d6aead23 ctsn=0
sid=0 ssn=591f flags=0, tsn=d6aead24 ctsn=0
sid=0 ssn=5916 flags=0, tsn=d6aead1b ctsn=0
BUMMER: message received out of sequence... expected 4479264, got 4479265

Detected bad sequence in payload, dumping tsn history
sid=0 ssn=5918 flags=0, tsn=d6aead1d ctsn=0
sid=0 ssn=5919 flags=0, tsn=d6aead1e ctsn=0
sid=0 ssn=591a flags=0, tsn=d6aead1f ctsn=0
sid=0 ssn=591b flags=0, tsn=d6aead20 ctsn=0
sid=0 ssn=591c flags=0, tsn=d6aead21 ctsn=0
sid=0 ssn=591d flags=0, tsn=d6aead22 ctsn=0
sid=0 ssn=591e flags=0, tsn=d6aead23 ctsn=0
sid=0 ssn=591f flags=0, tsn=d6aead24 ctsn=0
sid=0 ssn=5920 flags=0, tsn=d6aead25 ctsn=0
sid=0 ssn=5917 flags=0, tsn=d6aead1c ctsn=0
    0        5000    848    39470593    46545   38404666    45288    0
    0        6000    848    39368877    46425   38868061    45834    0
Catcher 21047: exiting; 6093400 packets received
(cntl-c)

Comment 131 Neil Horman 2009-10-29 00:57:19 UTC
http://people.redhat.com/nhorman/rpms/kernel-2.6.18-165.el5.bz517504.v19.x86_64.rpm

New build for you.  This one should in fact fix your problem.  It also has the prior debug code (modified to be a bit less verbose), but I think this will fix your problems as per my explanation in comment #30 above.  Please run this on your sender and receiver and let me know the results.  Thanks!

Comment 132 William Reich 2009-10-29 12:04:38 UTC

I can not see comment 30...

Comment 133 William Reich 2009-10-29 12:16:51 UTC
test has started...
So far, so good ( after 3 minutes ). I am going to let it run.
However, there is a lot of debug on the pitcher machine coming out...

( the pitcher machine does not have the correct time.
...must be missing a daylight savings time patch... )

Oct 29 09:05:09 jarrett kernel: Odd seq 3315561:3315559 seen at sctp_outq_flush_rtx
Oct 29 09:05:09 jarrett kernel: Odd seq 3315820:3315561 seen at sctp_outq_flush_rtx
Oct 29 09:05:09 jarrett kernel: Odd seq 3315822:3315820 seen at sctp_outq_flush_rtx
Oct 29 09:05:09 jarrett kernel: Odd seq 3316065:3315822 seen at sctp_outq_flush_rtx
Oct 29 09:05:09 jarrett kernel: Odd seq 3316067:3316065 seen at sctp_outq_flush_rtx
Oct 29 09:05:09 jarrett kernel: Odd seq 3316069:3316067 seen at sctp_outq_flush_rtx
Oct 29 09:05:09 jarrett kernel: Odd seq 3316332:3316069 seen at sctp_outq_flush_rtx
Oct 29 09:05:09 jarrett kernel: Odd seq 3316494:3316332 seen at sctp_outq_flush_rtx
Oct 29 09:05:09 jarrett kernel: Odd seq 3316508:3316494 seen at sctp_outq_flush_rtx
Oct 29 09:05:09 jarrett kernel: Odd seq 3316510:3316508 seen at sctp_outq_flush_rtx
Oct 29 09:05:09 jarrett kernel: Odd seq 3316531:3316510 seen at sctp_outq_flush_rtx
Oct 29 09:05:09 jarrett kernel: Odd seq 3316817:3316531 seen at sctp_outq_flush_rtx

Comment 134 William Reich 2009-10-29 12:17:50 UTC
The test in comment 133 is using the kernel from comment 131

Comment 135 William Reich 2009-10-29 12:47:16 UTC

the test ( from comment 133 ) has been running for 32 minutes - no mis-sequenced messages have been observed yet.

Comment 136 Neil Horman 2009-10-29 13:18:27 UTC
Ok, I'll make you one more kernel with the debug stuff removed to confirm, but I think we've found our problem!

Comment 137 William Reich 2009-10-29 14:26:19 UTC
the test ( from comment 133 ) has been running for 2 hours 15 minutes - no
mis-sequenced messages have been observed yet.

Comment 139 Neil Horman 2009-10-29 17:48:54 UTC
http://people.redhat.com/nhorman/rpms/kernel-2.6.18-165.el5.bz517504.v20.x86_64.rpm

Ok, one last kernel for you to try.  This is great news about the lack of mis-sequenced messages.  This last kernel has just the fix for the problem in place.  I've removed all the debug code so that we can be sure (given that this is so timing sensitive) that the debug code wasn't interfering with the test.  If this passes I'll get this upstream and into RHEL5 ASAP.  Thanks!

Comment 140 William Reich 2009-10-29 17:53:46 UTC
the test ( from comment 133 ) ran for 5 hours 30 minutes with no
mis-sequenced messages observed at the time that the test was terminated.

Comment 141 William Reich 2009-10-29 18:09:57 UTC
the test using the kernel from comment 139 has begun.
So far , so good...

Comment 142 William Reich 2009-10-29 19:42:56 UTC
45 minutes, and still going... ( using the kernel from comment 139 )

Comment 143 William Reich 2009-10-29 20:50:49 UTC
the test using the kernel from comment 139 has been running for over 2 hours now - no issues observed. I am going to let it run overnight...

Comment 144 Neil Horman 2009-10-29 22:59:58 UTC
Sweet!  I think we have a winner.  I'll start prepping this for upstream submission in the morning.  Thanks, William!

Comment 145 William Reich 2009-10-30 12:05:47 UTC
2.147 trillion packets passed with no errors regarding mis-sequences...
It looks like the case is completed.

Comment 146 William Reich 2009-10-30 12:07:20 UTC
Please create a patch for RH4 too.

Comment 147 William Reich 2009-10-30 12:08:01 UTC
Can you share information about the root cause of the problem ?

Comment 148 Neil Horman 2009-10-30 12:43:44 UTC
Yes, of course.

Root cause summary
=====================================
Some initial background for understanding: at the protocol level, SCTP enforces packet ordering in a similar way to TCP.  Each atomic data unit, or 'chunk', is assigned a Transmission Sequence Number (TSN) during sender transmit.  These TSN values are what define ordering in the protocol, i.e. the receiver uses the TSN value to deliver packets to the receiving program in the appropriate order.
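
A toy sketch of that receive-side rule (illustrative only):

/* Chunks may arrive in any order on the wire, but the receiver delivers
 * them to the application in TSN order. */
#include <stdio.h>
#include <stdlib.h>

struct chunk { unsigned int tsn; const char *payload; };

static int by_tsn(const void *a, const void *b)
{
    const struct chunk *ca = a, *cb = b;
    return (ca->tsn > cb->tsn) - (ca->tsn < cb->tsn);
}

int main(void)
{
    struct chunk rx[] = {                 /* arrival order */
        { 102, "third" }, { 100, "first" }, { 101, "second" },
    };

    qsort(rx, 3, sizeof(rx[0]), by_tsn);  /* deliver in TSN order */
    for (int i = 0; i < 3; i++)
        printf("deliver tsn=%u payload=%s\n", rx[i].tsn, rx[i].payload);
    return 0;
}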

This problem was caused by a hole on the transmit side of the sctp stack in the Linux kernel in which packets could become re-ordered prior to the assignment of a TSN value.  The conditions that triggered it were the use of multiple interfaces for transmission, each with a differing path maximum transmission unit (pmtu) value.

For specifics, we need to look at the kernel functions sctp_outq_flush, sctp_packet_transmit_chunk, and sctp_packet_transmit.  sctp_outq_flush is responsible for taking a stream of user data and sending it to the ip layer (indicating in the process which source ip address/interface the ip layer is to use).  The flow of sctp_outq_flush is roughly as follows (a simplified model follows the list):

1) transmit control chunks
2) is the association in a state in which we can send data? if not, return; else go to 3
3) if the data queue is empty, go to 8
4) dequeue a chunk from the data queue
5) append the chunk to the waiting packet
6) place the transport (the source ip address the last chunk will be sent from) at the end of the list of transports that have waiting packets to be sent to the ip layer
7) go to 3
8) for each transport in the list of waiting transports, transmit the waiting packet by calling sctp_packet_transmit
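
A minimal userspace model of that loop (made-up names, not the real sctp_outq_flush, just to show the shape):

/* Chunks are appended to each transport's waiting packet as they are
 * dequeued, and the waiting packets are only transmitted at the end,
 * in step 8. */
#include <stdio.h>

#define NCHUNKS 4

struct pkt { int chunks[NCHUNKS]; int n; };
struct xprt { const char *name; struct pkt waiting; int queued; };

static void transmit(struct xprt *t)                 /* step 8 */
{
    printf("%s: transmit %d chunk(s)\n", t->name, t->waiting.n);
    t->waiting.n = 0;
    t->queued = 0;
}

int main(void)
{
    struct xprt a = { "transport A" }, b = { "transport B" };
    struct xprt *order[2];
    int norder = 0;
    int data[NCHUNKS] = { 1, 2, 3, 4 };
    struct xprt *dest[NCHUNKS] = { &a, &a, &b, &b };

    for (int i = 0; i < NCHUNKS; i++) {              /* steps 3-7 */
        struct xprt *t = dest[i];
        t->waiting.chunks[t->waiting.n++] = data[i]; /* step 5 */
        if (!t->queued) {                            /* step 6 */
            t->queued = 1;
            order[norder++] = t;
        }
    }
    for (int i = 0; i < norder; i++)                 /* step 8 */
        transmit(order[i]);
    return 0;
}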

Note that sctp_packet_transmit is where TSNs are assigned to each chunk in a packet.  This creates an assumption that all data chunks in all packets will arrive at sctp_packet_transmit in the order that the user space application sent them in.

The problem arises in step 5 above.  The append is done by calling sctp_packet_transmit_chunk.  This function attempts to append a chunk to the waiting packet.  Should this appending make the packet larger than the pmtu of the associated transport, the function will immediately call sctp_packet_transmit on the packet, and then create a new packet to append the chunk to.  By calling sctp_packet_transmit here, however, we create a situation in which a transport with a small pmtu might send a packet ahead of other packets that are still waiting in queue in sctp_outq_flush.  This violates the assumption noted above, and creates the situation in which you noticed misordered payloads, despite the fact that TSN values were assigned, sent and delivered in order.
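
A toy model of that step-5 failure (illustrative only):

/* When appending a chunk would push a transport's waiting packet past that
 * transport's pmtu, the packet is transmitted immediately -- ahead of the
 * other transport's packet that is still waiting for step 8. */
#include <stdio.h>

struct pkt { int size; };
struct xprt { const char *name; int pmtu; struct pkt waiting; };

static int next_tsn = 100;

static void transmit(struct xprt *t)
{
    printf("%s: transmit packet, first tsn=%d\n", t->name, next_tsn);
    next_tsn += 10;                     /* pretend the packet held 10 chunks */
    t->waiting.size = 0;
}

/* Append a chunk; if the packet would exceed the pmtu, send it right away. */
static void append_chunk(struct xprt *t, int chunk_size)
{
    if (t->waiting.size + chunk_size > t->pmtu)
        transmit(t);                    /* early transmit: jumps the queue */
    t->waiting.size += chunk_size;
}

int main(void)
{
    struct xprt small = { "small-pmtu path", 1500, { 1400 } };
    struct xprt big   = { "large-pmtu path", 9000, { 8000 } };

    /* Both packets are waiting for step 8, but appending one more chunk to
     * the small-pmtu path overflows its pmtu and transmits it immediately,
     * so it gets lower TSNs than the packet that has been waiting longer. */
    append_chunk(&small, 848);          /* early transmit, tsn 100 */
    transmit(&big);                     /* step 8, tsn 110 */
    transmit(&small);                   /* step 8, remaining chunk, tsn 120 */
    return 0;
}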


I'll clone this bz for RHEL4 shortly.

Comment 149 Neil Horman 2009-10-30 14:11:27 UTC
I've sent this patch upstream and cc'd you.  As soon as its accepted, I'll post it for RHEL5/4

Comment 150 RHEL Program Management 2009-10-30 14:30:00 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 152 Don Zickus 2009-11-23 15:33:10 UTC
in kernel-2.6.18-175.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 154 William Reich 2009-11-23 16:56:43 UTC
test of kernel from comment 152 has begun...

Comment 155 William Reich 2009-11-23 22:37:24 UTC
my test ran successfully for 5.5 hours using the kernel from comment 152

Comment 157 errata-xmlrpc 2010-03-30 07:45:11 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html

