169456 – COMM_LOST problem with SCTP stream socket

Summary: COMM_LOST problem with SCTP stream socket

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	4.0
Hardware:	i386
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Neil Horman
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	181409 188593

Reported:	2005-09-28 15:10 UTC by Jere Leppanen
Modified:	2007-11-30 22:07 UTC
CC List:	4 users
Fixed In Version:	RHSA-2006-0575
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2006-08-10 21:22:01 UTC
Target Upstream Version:
Embargoed:

Attachments
Modifies sctp_darn and adds a script for testing this bug (4.10 KB, patch) 2005-09-28 15:10 UTC, Jere Leppanen
patch to ensure that user data is all received from a socket before reporting socket errors (634 bytes, patch) 2005-11-23 21:31 UTC, Neil Horman

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2006:0575	0	normal	SHIPPED_LIVE	Important: Updated kernel packages available for Red Hat Enterprise Linux 4 Update 4	2006-08-10 04:00:00 UTC

Description Jere Leppanen 2005-09-28 15:10:25 UTC

Created attachment 119373
Modifies sctp_darn and adds a script for testing this bug

Comment 1 Jere Leppanen 2005-09-28 15:10:25 UTC

Description of problem:
SCTP_COMM_LOST is incorrectly delivered from an SCTP stream socket _after_ 
recvmsg()=-1 ECONNRESET, if the other side performs an abortive shutdown of the 
association and recvmsg() is called after the association has been aborted.

Version-Release number of selected component (if applicable):
kernel-2.6.9-16.EL

How reproducible:
Every time.

Steps to Reproduce:
A script does all the work. The steps below set it up, and the script is run in 
the final step.
1. Download lksctp-tools-1.0.3, untar or install
2. Patch with lksctp-tools-1.0.3-comm_lost_test.patch (attached)
3. configure, make
4. cd to lksctp-tools-1.0.3/src/apps
5. Run comm_lost_test.sh
  
Actual results:
--- clip ---
$ ./comm_lost_test.sh

Case 1:
NOTIFICATION: ASSOC_CHANGE - COMM_LOST
recvmsg: -1 (Connection reset by peer)
recvmsg: -1 (Transport endpoint is not connected)

Case 2:
recvmsg: -1 (Connection reset by peer)
NOTIFICATION: ASSOC_CHANGE - COMM_LOST
recvmsg: -1 (Transport endpoint is not connected)
--- clip ---

In Case 1, recvmsg() has already been called (blocking) when the association is 
aborted. The COMM_LOST is delivered, then the next recvmsg() returns -1 with 
ECONNRESET. This is the correct behaviour.

In Case 2, the association is aborted, then recvmsg() is called; it returns -1 
with ECONNRESET. The next recvmsg() delivers the COMM_LOST notification. This 
is the incorrect behaviour.

Expected results:
Output of Case 2 should be identical to that of Case 1.

Additional info:
lksctp-tools-1.0.3-comm_lost_test.patch
    - allows client to receive and server to abort in sctp_darn
      (patch is a little messy but it works for this test)
    - adds the comm_lost_test.sh script (very simple)
    - adds input files that the script feeds to sctp_darn's interactive mode

I can reproduce this bug with 100% consistency, but as this is some kind of 
event ordering / timing problem, I can only hope that it can be reproduced by 
others also.

Comment 2 Neil Horman 2005-11-07 21:11:22 UTC

It would appear that this is caused by sk_err being read synchronously, so that
setting it takes effect before the COMM_LOST message is dequeued from the packet
queue.  I'm not sure whether that's wrong, as POSIX does not appear to address
this clearly.  It certainly feels wrong, however, and I think we can probably
get around it by delaying the setting of sk_err to ECONNRESET until after the
COMM_LOST message is read, although this solution may be rejected by the
upstream community.

Comment 3 Neil Horman 2005-11-23 20:50:52 UTC

I've tested this with TCP, and it appears that TCP does indeed allow the
reception of user data that has already been received before returning a
socket-level error to the user.

Comment 4 Neil Horman 2005-11-23 21:31:41 UTC

Created attachment 121417
patch to ensure that user data is all received from a socket before reporting socket errors

This is the patch that I'm going to propose upstream after the holidays.  It
passes the attached test case when applied against the upstream kernel.
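
For readers following the thread, the ordering problem can be illustrated with
a small user-space model. This is only a sketch and is not the attached patch
or any kernel code: the toy_sock structure, the toy_* functions and the
drain_first flag are all hypothetical names invented for the illustration. It
assumes a socket with a receive queue and a latched error field standing in
for sk_err. It compares reporting the latched error before looking at the
queue with delivering queued items first, which is the idea named in the
attachment title.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Toy user-space model, NOT kernel code. All names here are hypothetical. */
#define TOY_QUEUE_MAX 8

struct toy_sock {
    const char *queue[TOY_QUEUE_MAX]; /* queued messages and notifications */
    int head, tail;                   /* read and write positions */
    int err;                          /* stands in for sk_err */
};

void toy_enqueue(struct toy_sock *sk, const char *msg)
{
    assert(sk->tail < TOY_QUEUE_MAX);
    sk->queue[sk->tail++] = msg;
}

/* Peer aborts: queue the COMM_LOST notification and latch ECONNRESET. */
void toy_abort(struct toy_sock *sk)
{
    toy_enqueue(sk, "COMM_LOST");
    sk->err = ECONNRESET;
}

/*
 * Return the next queued item, or NULL with errno set.
 * drain_first == 0: report a latched error before looking at the queue.
 * drain_first == 1: deliver queued items before reporting the error.
 */
const char *toy_recv(struct toy_sock *sk, int drain_first)
{
    int has_data = sk->head < sk->tail;

    if (sk->err && !(drain_first && has_data)) {
        errno = sk->err;
        sk->err = 0;            /* the error is reported only once */
        return NULL;
    }
    if (has_data)
        return sk->queue[sk->head++];
    errno = ENOTCONN;           /* nothing queued and no pending error */
    return NULL;
}

/* Abort the association, then record what three receive calls return. */
void toy_trace(int drain_first, char *out, size_t len)
{
    struct toy_sock sk = {0};
    size_t used = 0;

    assert(len > 0);
    out[0] = '\0';
    toy_abort(&sk);
    for (int i = 0; i < 3 && used < len; i++) {
        const char *msg = toy_recv(&sk, drain_first);
        if (!msg)
            msg = (errno == ECONNRESET) ? "ECONNRESET" : "ENOTCONN";
        int n = snprintf(out + used, len - used, "%s%s", i ? " " : "", msg);
        if (n < 0)
            break;
        used += (size_t)n;
    }
}
```

In this model, toy_trace(0, buf, sizeof buf) yields the order reported for
Case 2 (ECONNRESET, then COMM_LOST, then ENOTCONN), and toy_trace(1, buf,
sizeof buf) yields the Case 1 order (COMM_LOST, then ECONNRESET, then
ENOTCONN). The real kernel change may differ in where and how it makes this
decision.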

Comment 5 Jere Leppanen 2005-11-30 11:02:07 UTC

The patch looks fine, and passes my tests.

Comment 12 Jason Baron 2006-03-17 16:39:42 UTC

Committed in stream U4 build 34.4. A test kernel with this patch is available
from http://people.redhat.com/~jbaron/rhel4/

Comment 13 Jere Leppanen 2006-03-21 10:21:12 UTC

Kernel from http://people.redhat.com/~jbaron/rhel4/ (34.5 actually) passes my 
test.

Comment 19 Red Hat Bugzilla 2006-08-10 21:22:12 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0575.html
