Bug 233608 - Native SCTP - traffic randomly stops on RedHat AS4 Update 4
Summary: Native SCTP - traffic randomly stops on RedHat AS4 Update 4
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: lksctp-tools
Version: 4.4
Hardware: i386
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Neil Horman
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2007-03-23 13:08 UTC by William Reich
Modified: 2007-11-17 01:14 UTC
0 users

Fixed In Version: 2.6.9-55.EL5
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-07-18 18:45:35 UTC
Target Upstream Version:
Embargoed:


Attachments
version information (529 bytes, text/plain)
2007-03-23 13:08 UTC, William Reich
compile script for pitcher & catcher (118 bytes, text/plain)
2007-03-23 13:10 UTC, William Reich
catcher source code (2.34 KB, text/plain)
2007-03-23 13:11 UTC, William Reich
pitcher source code (1.80 KB, text/plain)
2007-03-23 13:12 UTC, William Reich
script used to launch 6 pitchers (167 bytes, text/plain)
2007-03-23 13:13 UTC, William Reich
sysreport from a machine where the problem appeared. (882.43 KB, application/octet-stream)
2007-03-23 13:18 UTC, William Reich
picture of test configuration (1.46 KB, text/plain)
2007-03-23 13:23 UTC, William Reich
sys report from redhat 5 box (457.54 KB, application/octet-stream)
2007-06-29 15:31 UTC, William Reich
output of catcher on RH 4 update 5 32 bit system (7.14 KB, text/plain)
2007-07-18 17:56 UTC, William Reich

Description William Reich 2007-03-23 13:08:56 UTC
Description of problem:
While using the native Linux SCTP on a Red Hat AS4 Update 4 system
(dual CPU, 32-bit OS),
we run one "catcher" and 6 "pitchers" on the same box.

Randomly, one of the pitchers simply stops passing data to the catcher.

Debugging shows that the pitcher is stuck in a "send" operation, and
the corresponding "catcher" thread is stuck in a "recv" operation.
This suggests that the SCTP stack itself is stuck in some way.
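
One way to confirm a stall like this from inside the test programs, rather than from a debugger, is to put a timeout on the blocking calls. The following is only a minimal sketch: it assumes the pitcher and catcher use ordinary blocking sockets (the attached sources are not reproduced here), and the helper names `set_io_timeout_ms` and `io_timed_out` are made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>

/* Hypothetical helper: apply the same timeout (in milliseconds) to
 * blocking send and recv calls on fd.
 * Returns 0 on success, -1 on failure (errno is set by setsockopt). */
int set_io_timeout_ms(int fd, long ms)
{
    struct timeval tv;

    assert(ms > 0);              /* an all-zero timeval means "block forever" */
    tv.tv_sec = ms / 1000;
    tv.tv_usec = (ms % 1000) * 1000;

    if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0)
        return -1;
    if (setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv)) < 0)
        return -1;
    return 0;
}

/* Hypothetical helper: returns 1 if a send/recv call that returned rc
 * with errno value err failed because the timeout expired, else 0. */
int io_timed_out(ssize_t rc, int err)
{
    return rc < 0 && (err == EAGAIN || err == EWOULDBLOCK);
}
```

With this in place, a pitcher could log the time and its own identity whenever `io_timed_out(rc, errno)` is true after a send, which would pin down when the stall began instead of leaving the process silently blocked.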

Version-Release number of selected component (if applicable):

 

Connected to chaplin.ulticom.com.
Escape character is '^]'.
Red Hat Enterprise Linux AS release 4 (Nahant Update 4)
Kernel 2.6.9-42.0.10.ELsmp on an i686
Last login: Fri Mar 23 08:10:12 from blade1.ulticom.com


chaplin 96% uname -a
Linux chaplin 2.6.9-42.0.10.ELsmp #1 SMP Fri Feb 16 17:17:21 EST 2007 i686 i686
i386 GNU/Linux

chaplin 99% rpm -qa | grep -i sctp
lksctp-tools-1.0.2-6.4E.1
lksctp-tools-doc-1.0.2-6.4E.1
lksctp-tools-devel-1.0.2-6.4E.1

NOTE - This machine was specifically built using the
latest available versions as of 3/14/07.


How reproducible:


Steps to Reproduce:
1. Run the attached pitcher/catcher programs.

Actual results:
At least one of the pitchers will stop passing traffic.
This is random. It may work one time, then fail the next time.

Expected results:
Traffic should pass on all pitchers.

Additional info:

1) This problem was recreated on a 2.6.9-34 kernel as well.
2)

               native sctp   Pitcher / Catcher  test


   +---------+
   | pitcher | ----------------------------\
   +---------+                              \
                                             \
                                              \
   +---------+                                 \
   | pitcher | -----------------------\         \
   +---------+                         \         \
                                        \         |
                                         -----+   |
   +---------+                                |   |
   | pitcher | ---------------------\         |   |
   +---------+                       \        v   v
                                      -->  +---------+
                                           | catcher |
                                      -->  +---------+
                                     /        ^   ^
   +---------+                      /         |   |
   | pitcher | --------------------/          |   |
   +---------+                                |   |
                                       -------+   |
                                      /          /
   +---------+                       /          /
   | pitcher | ---------------------/          /
   +---------+                                /
                                             /
                                            /
   +---------+                             /
   | pitcher | ---------------------------/
   +---------+

3) I could not find an appropriate component to file this bug under,
so I just guessed.

Comment 1 William Reich 2007-03-23 13:08:56 UTC
Created attachment 150748
version information

Comment 2 William Reich 2007-03-23 13:10:55 UTC
Created attachment 150749
compile script for pitcher & catcher

Comment 3 William Reich 2007-03-23 13:11:41 UTC
Created attachment 150750
catcher source code

Comment 4 William Reich 2007-03-23 13:12:23 UTC
Created attachment 150751
pitcher source code

Comment 5 William Reich 2007-03-23 13:13:16 UTC
Created attachment 150752
script used to launch 6 pitchers

Comment 6 William Reich 2007-03-23 13:15:59 UTC
To recreate the test,
use 2 xterm windows on the same machine.

In the first window, launch the catcher:

./catcher &

In the 2nd window, launch the 6 pitchers by using the script 'j':

./j

Observe in the first window that, for each pitcher, a report is printed
for every 1000 messages received by the catcher.
In the failure case, at least one of the reports will simply stop
appearing.

Comment 7 William Reich 2007-03-23 13:18:08 UTC
Created attachment 150753
sysreport from a machine where the problem appeared.

This is the output of the sysreport tool.

Comment 8 William Reich 2007-03-23 13:20:31 UTC
Please note that this problem has only been seen
when the pitchers & catcher are on the same box.


Comment 9 William Reich 2007-03-23 13:22:53 UTC
Comment on attachment 150748
version information

Fixed the description of attachment 150748.

Comment 10 William Reich 2007-03-23 13:23:57 UTC
Created attachment 150754
picture of test configuration

Comment 11 William Reich 2007-03-23 14:57:44 UTC
Please note that the IP address is hardcoded in the source code
of the test programs.

Change this address as appropriate for your own testing.
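
As an alternative to editing the source for each test machine, the address could be taken from the command line with the hardcoded value kept only as a fallback. This is a sketch under assumptions: the function name `parse_catcher_addr`, the IPv4-only handling, and the addresses and port in the usage note are illustrative and are not taken from the attached programs.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fill *out with an IPv4 address and port.
 * text is the user-supplied address (may be NULL); fallback is used
 * when text is NULL. Returns 0 on success, -1 if the chosen string
 * is missing or is not a valid dotted-quad address. */
int parse_catcher_addr(const char *text, const char *fallback,
                       uint16_t port, struct sockaddr_in *out)
{
    const char *src = (text != NULL) ? text : fallback;

    assert(out != NULL);
    if (src == NULL)
        return -1;

    memset(out, 0, sizeof(*out));
    out->sin_family = AF_INET;
    out->sin_port = htons(port);

    /* inet_pton returns 1 only for a well-formed address. */
    return inet_pton(AF_INET, src, &out->sin_addr) == 1 ? 0 : -1;
}
```

A caller might pass `argc > 1 ? argv[1] : NULL` as `text`, so that running `./pitcher 192.0.2.10` overrides the built-in default and a plain `./pitcher` behaves as before.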

Comment 12 Neil Horman 2007-06-08 14:59:59 UTC
Please try this against the latest U5 kernel.  I believe this is a problem that
you and I resolved in another bugzilla previously.  Thanks!

Comment 13 William Reich 2007-06-12 13:18:32 UTC
msg received...
I'm going on vacation, so I'll get to this
upon my return.
( estimated time of work is 6/20-6/27/07 )

Comment 14 Neil Horman 2007-06-12 14:36:49 UTC
Thank you!  If U5 doesn't fix it, let me know.  I have a patch being
reviewed upstream that completely rewrites how our receive buffer management
code works, and it should fix any problems that remain in U5 with receive drops/stalls.

Comment 15 William Reich 2007-06-21 13:12:53 UTC
The machine I was assigned for this does not
contain the correct OS.
I gotta get the correct OS installed.

Comment 16 Neil Horman 2007-06-21 16:47:31 UTC
ok, let me know when you do.

Comment 17 William Reich 2007-06-29 15:31:43 UTC
Created attachment 158214
sys report from redhat 5 box

This is the sysreport from the machine that I used
to test the fix.
Red Hat Enterprise Linux 5
2.6.18-8
sctp packages 1.0.6-1.el5.1

Comment 18 William Reich 2007-06-29 15:33:54 UTC
I ran the test, and it passed.
( Details in comment 17. )

Comment 19 Neil Horman 2007-06-29 16:08:50 UTC
Ok, so it works with a RHEL5 kernel, does it work with a 4.5 kernel as well?
(since this is a RHEL4 bug)

Comment 20 William Reich 2007-06-29 16:58:32 UTC
I'm off to vacation.
I will investigate the week of July 9, 2007.

Comment 21 Neil Horman 2007-06-29 19:50:10 UTC
fine, let me know

Comment 22 William Reich 2007-07-18 17:54:39 UTC
I ran this test on a RH 4 Update 5 32 bit system.
The test passed.
It is curious however that the traffic across the 6 applications is
not even. Some applications pass up to twice as much traffic as others.
-- case closed -- Thanks.

Comment 23 William Reich 2007-07-18 17:56:35 UTC
Created attachment 159542
output of catcher on RH 4 update 5 32 bit system

output of "catcher" program that shows uneven traffic.
not a failure... just a curious observation...

Comment 24 Neil Horman 2007-07-18 18:45:35 UTC
This quite likely has a good deal to do with receive buffer limitations and the need
for retransmits.  Hard to say, though, without more data.  I'm closing this as
currentrelease.  If the problem persists after 4.7 is released (I plan to try to
backport my receive buffer changes to 4.6 or 4.7), feel free to open a new bz
for this issue.
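
For anyone collecting the "more data" mentioned above, the socket-level receive buffer size is one value a test program can log or adjust. The sketch below uses only the generic `SO_RCVBUF` socket option. The helper names are invented for illustration, and whether this setting explains the uneven traffic seen in this bug has not been established.

```c
#include <assert.h>
#include <sys/socket.h>

/* Hypothetical helper: return the kernel's current receive buffer
 * size for fd in bytes, or -1 on error. */
int get_rcvbuf(int fd)
{
    int size = 0;
    socklen_t len = sizeof(size);

    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, &len) < 0)
        return -1;
    return size;
}

/* Hypothetical helper: request a receive buffer of `bytes` and return
 * the size the kernel actually granted, or -1 on error. On Linux the
 * granted value can differ from the request: the kernel may double it
 * for bookkeeping overhead and limits it by net.core.rmem_max. */
int request_rcvbuf(int fd, int bytes)
{
    assert(bytes > 0);
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) < 0)
        return -1;
    return get_rcvbuf(fd);
}
```

Printing `get_rcvbuf(fd)` for each accepted connection in the catcher, alongside the per-pitcher message counts, would at least show whether the slower pitchers are running with different buffer sizes.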

