Description of problem:
While using the Linux-provided SCTP on a Red Hat AS4 Update 4 system (dual CPU, 32-bit OS), we configure a "catcher" and 6 "pitchers" to run on the same box. Randomly, one of the pitchers just stops passing data to the catcher. Debugging shows that the pitcher is stuck in a "send" operation, and the corresponding "catcher" thread is stuck in a "recv" operation. This implies that SCTP is stuck in some way.

Version-Release number of selected component (if applicable):
Connected to chaplin.ulticom.com.
Escape character is '^]'.
Red Hat Enterprise Linux AS release 4 (Nahant Update 4)
Kernel 2.6.9-42.0.10.ELsmp on an i686
Last login: Fri Mar 23 08:10:12 from blade1.ulticom.com
chaplin 96% uname -a
Linux chaplin 2.6.9-42.0.10.ELsmp #1 SMP Fri Feb 16 17:17:21 EST 2007 i686 i686 i386 GNU/Linux
chaplin 99% rpm -qa | grep -i sctp
lksctp-tools-1.0.2-6.4E.1
lksctp-tools-doc-1.0.2-6.4E.1
lksctp-tools-devel-1.0.2-6.4E.1

NOTE - This machine was specifically built using the latest available versions as of 3/14/07.

How reproducible:

Steps to Reproduce:
1. Run attached pitcher/catcher programs

Actual results:
At least one of the pitchers will stop passing traffic. This is random. It may work one time, then fail the next time.

Expected results:
Traffic should pass on all pitchers.

Additional info:
1) This problem was recreated on a 2.6.9-34 kernel as well.
2) Native SCTP pitcher/catcher test:
[Diagram: six "pitcher" boxes, each with an arrow leading to a single "catcher" box.]
3) I could not find an appropriate component to file this bug under, so I just guessed.
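For anyone trying to confirm the "stuck in send" symptom from inside a pitcher, here is a minimal sketch of a send that gives up instead of blocking forever. It is not taken from the attached pitcher source; `send_with_timeout` and its timeout parameter are hypothetical names, and the sketch assumes a Linux socket descriptor (it uses `MSG_DONTWAIT` and `MSG_NOSIGNAL`).

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper (not from the attached sources): wait up to
 * timeout_ms for fd to accept more data, then send without blocking.
 * Returns the number of bytes sent, or -1 on error. When the wait
 * times out, errno is set to ETIMEDOUT so that a stalled peer can be
 * told apart from other failures. */
ssize_t send_with_timeout(int fd, const void *buf, size_t len, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLOUT, .revents = 0 };
    int ready;

    assert(buf != NULL || len == 0);

    /* Retry if a signal interrupts the wait. Note that each retry
     * restarts the full timeout, which is acceptable for a diagnostic. */
    do {
        ready = poll(&pfd, 1, timeout_ms);
    } while (ready < 0 && errno == EINTR);

    if (ready < 0)
        return -1;              /* poll itself failed, errno is set */
    if (ready == 0) {
        errno = ETIMEDOUT;      /* socket never became writable */
        return -1;
    }
    /* MSG_DONTWAIT keeps this call from blocking even if the buffer
     * filled up again between poll() and send(). */
    return send(fd, buf, len, MSG_DONTWAIT | MSG_NOSIGNAL);
}
```

A pitcher that logs every `ETIMEDOUT` return would show which association stopped draining and when, rather than hanging silently.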
Created attachment 150748 [details] version information
Created attachment 150749 [details] compile script for pitcher & catcher
Created attachment 150750 [details] catcher source code
Created attachment 150751 [details] pitcher source code
Created attachment 150752 [details] script used to launch 6 pitchers
To recreate the test, use 2 xterm windows on the same machine.

In the first window, launch the catcher:
./catcher &

In the second window, launch the 6 pitchers by using the script 'j':
./j

Observe in the first window that a report is printed for every 1000 messages that the catcher receives from each pitcher. In the failure case, at least one of the reports will simply stop appearing.
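The reporting behaviour described above, and a way to notice that a report has stopped, can be sketched roughly as follows. This is an illustration only and not the attached catcher source: the struct and function names are hypothetical, and the values 6 and 1000 are simply the numbers quoted in this report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define NUM_PITCHERS    6     /* number of pitchers in the test setup */
#define REPORT_INTERVAL 1000  /* catcher reports every 1000 messages */

/* Hypothetical per-pitcher bookkeeping kept by the catcher. */
struct pitcher_stats {
    unsigned long received;      /* total messages seen from this pitcher */
    unsigned long last_checked;  /* value of 'received' at the last stall check */
};

/* Count one received message. Returns true when a report line is due,
 * that is, on every REPORT_INTERVAL-th message. */
bool record_message(struct pitcher_stats *s)
{
    assert(s != NULL);
    s->received++;
    return s->received % REPORT_INTERVAL == 0;
}

/* Print the periodic report for one pitcher. */
void print_report(int pitcher_id, const struct pitcher_stats *s)
{
    printf("pitcher %d: %lu messages received\n", pitcher_id, s->received);
}

/* Call this periodically (for example from a timer). Returns true if
 * no message has arrived from this pitcher since the previous call. */
bool is_stalled(struct pitcher_stats *s)
{
    assert(s != NULL);
    bool stalled = (s->received == s->last_checked);
    s->last_checked = s->received;
    return stalled;
}
```

With a check like `is_stalled` run every few seconds over all `NUM_PITCHERS` entries, the failure would be flagged explicitly instead of being inferred from a missing report line.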
Created attachment 150753 [details] sysreport from a machine where the problem appeared. This is the output of the sysreport tool.
Please note that this problem has only been seen when the pitchers & catcher are on the same box.
Comment on attachment 150748 [details] version information fixed description of attachment 150748
Created attachment 150754 [details] picture of test configuration
Please note that the IP address is hardcoded in the source code of the test programs. Change this address as appropriate for your own testing.
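One way to avoid editing the sources for each test machine would be to build the address from a string supplied at run time. The following is a sketch under that assumption; `make_ipv4_endpoint` is a hypothetical helper, not something in the attached programs, and it handles IPv4 only.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fill *out from a dotted-quad string and a port
 * number in host byte order. Returns 0 on success, or -1 if ip_text is
 * not a valid IPv4 address. */
int make_ipv4_endpoint(const char *ip_text, uint16_t port,
                       struct sockaddr_in *out)
{
    assert(ip_text != NULL && out != NULL);

    memset(out, 0, sizeof(*out));
    out->sin_family = AF_INET;
    out->sin_port = htons(port);

    /* inet_pton returns 1 only when the text parsed as an address. */
    return inet_pton(AF_INET, ip_text, &out->sin_addr) == 1 ? 0 : -1;
}
```

The pitcher and catcher could then take the address from `argv[1]` and pass the resulting `struct sockaddr_in` to `bind()` or `connect()` in place of the hardcoded value.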
Please try this against the latest U5 kernel. I believe this is a problem that you and I resolved in another bugzilla previously. Thanks!
msg received... I'm going on vacation, so I'll get to this upon my return. ( estimated time of work is 6/20-6/27/07 )
Thank you! If U5 doesn't fix it, let me know. I have a patch under review upstream that completely rewrites how our receive buffer management code works, and it should fix any problems that remain in U5 with receive drops/stalls.
The machine I was assigned for this does not contain the correct OS. I gotta get the correct OS installed.
ok, let me know when you do.
Created attachment 158214 [details] sys report from redhat 5 box
This is the sysreport from the machine that I used to test the fix: Red Hat Enterprise Linux 5, kernel 2.6.18-8, sctp packages 1.0.6-1.el5.1.
I ran the test, and it passed. ( Details in comment 17. )
Ok, so it works with a RHEL5 kernel, does it work with a 4.5 kernel as well? (since this is a RHEL4 bug)
I'm off to vacation. I will investigate the week of July 9, 2007.
fine, let me know
I ran this test on a RH 4 Update 5 32 bit system. The test passed. It is curious however that the traffic across the 6 applications is not even. Some applications pass up to twice as much traffic as others. -- case closed -- Thanks.
Created attachment 159542 [details] output of catcher on RH 4 update 5 32 bit system
Output of the "catcher" program that shows uneven traffic. Not a failure, just a curious observation.
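To put a number on "uneven", the per-pitcher totals from the catcher output could be reduced to a single max/min ratio. This is a small sketch with a hypothetical function name; it does not use any figures from the attachment.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: ratio of the busiest pitcher's message count to
 * the quietest pitcher's count. Returns 0.0 when there are no counts
 * or when the quietest pitcher received nothing (ratio undefined). */
double traffic_imbalance(const unsigned long *counts, size_t n)
{
    assert(counts != NULL || n == 0);
    if (n == 0)
        return 0.0;

    unsigned long lo = counts[0];
    unsigned long hi = counts[0];
    for (size_t i = 1; i < n; i++) {
        if (counts[i] < lo)
            lo = counts[i];
        if (counts[i] > hi)
            hi = counts[i];
    }
    if (lo == 0)
        return 0.0;
    return (double)hi / (double)lo;
}
```

A result near 2.0 would correspond to the "up to twice as much traffic" observation, and a result of 0.0 would indicate a pitcher that passed no traffic at all.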
Quite likely this has a good deal to do with receive buffer limitations and the need for retransmits. Hard to say, though, without more data. I'm closing this as currentrelease. If the problem persists after 4.7 is released (I plan to try to backport my receive buffer changes to 4.6 or 4.7), feel free to open a new bz for this issue.
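If more data is wanted on the receive buffer side, the socket-level buffer size can be read and adjusted with the standard `SO_RCVBUF` option. The sketch below uses hypothetical helper names and works on any socket descriptor; whether changing this value affects the uneven traffic seen here is an open question, not something established in this report.

```c
#include <assert.h>
#include <sys/socket.h>

/* Hypothetical helper: return the socket's receive buffer size in
 * bytes as reported by the kernel, or -1 on error. */
int query_rcvbuf(int fd)
{
    int size = 0;
    socklen_t len = sizeof(size);

    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, &len) != 0)
        return -1;
    return size;
}

/* Hypothetical helper: ask the kernel for a receive buffer of 'bytes'.
 * Returns 0 on success, -1 on error. The kernel may clamp the request
 * to a system-wide limit. */
int request_rcvbuf(int fd, int bytes)
{
    assert(bytes > 0);
    return setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes));
}
```

On Linux the value read back is normally double the value requested, because the kernel reserves space for bookkeeping overhead, so compare readings against each other rather than against the requested number.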