Bug 76785

Summary: Sockets fail to close
Product: [Retired] Red Hat Linux Reporter: Steffen Persvold <sp>
Component: kernel Assignee: Arjan van de Ven <arjanv>
Status: CLOSED CURRENTRELEASE QA Contact: Brian Brock <bbrock>
Severity: medium Docs Contact:
Priority: medium    
Version: 7.3   
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2004-09-30 15:40:07 UTC

Description Steffen Persvold 2002-10-26 16:03:07 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1) Gecko/20020830

Description of problem:
With the 2.4.18-17.7.3 kernel, an application that uses many threads to connect
to other hosts (200) fails to detect a typical "connection closed by peer"
on some of the sockets. The problem does not occur with the 2.4.18-10 kernel.

Version-Release number of selected component (if applicable):
2.4.18-17.7.3


How reproducible:
Always

Steps to Reproduce:
1. Start a threaded parent application which connects to a large number of remote
daemons (one thread per remote machine, each with a separate socket). Each remote
daemon will fork and exec another application.
2. All threads enter the recv() system call to receive data from the remotely
executed application.
3. Terminate the remotely executed application.
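
The steps above can be sketched locally as follows. This is only an illustration of the thread-per-connection pattern, not the reporter's MPI launcher (no source was provided), and it is not expected to trigger the bug: a loopback listener stands in for the remote daemons, and the names serve_and_close, reader and run are hypothetical.

```python
import socket
import threading


def serve_and_close(listener, n_conns):
    """Stand-in for the remote daemons: accept, send some output, close."""
    for _ in range(n_conns):
        conn, _addr = listener.accept()
        with conn:
            conn.sendall(b"output\n")
        # Leaving the with-block closes the socket, like the remote
        # application terminating.


def reader(port, results, idx):
    """One parent thread per connection: block in recv() until end-of-stream."""
    with socket.create_connection(("127.0.0.1", port)) as sock:
        received = b""
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # b"" is Python's form of recv() returning 0
                break
            received += chunk
    results[idx] = received


def run(n_conns=8, timeout=10.0):
    """Start n_conns reader threads and report which ones never saw the close."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(n_conns)
    port = listener.getsockname()[1]

    server = threading.Thread(
        target=serve_and_close, args=(listener, n_conns), daemon=True
    )
    server.start()

    results = [None] * n_conns
    threads = [
        threading.Thread(target=reader, args=(port, results, i), daemon=True)
        for i in range(n_conns)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout)
    server.join(timeout)
    listener.close()

    # A thread still alive here would correspond to the hang described
    # under "Actual Results".
    stuck = [i for i, t in enumerate(threads) if t.is_alive()]
    return results, stuck


if __name__ == "__main__":
    results, stuck = run()
    print("threads still blocked in recv():", stuck)
```

On a single machine with a handful of connections every thread should return promptly. The report suggests the failure only appears with many real remote hosts, so a faithful test would need the same scale.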


Actual Results:  After some time, some of the threads are still in the recv()
system call, and their sockets are still listed as ESTABLISHED (with netstat -a).
However, all remote applications have terminated and no sockets to the parent
machine exist on these remote machines. The application therefore hangs, but a
ctrl-c terminates it successfully.

Expected Results:  The recv() system call should have returned 0 to indicate
that the remote application has closed the connection. The thread then
terminates, and when all threads have exited the parent application should
terminate automatically without needing ctrl-c.
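
A minimal sketch of the expected behaviour, under the assumption that a local socket pair is an acceptable stand-in for one parent/remote connection: once the peer closes its end, recv() reports end-of-stream (0 bytes in C, b"" in Python) and the waiting thread finishes. The names wait_for_peer_close and demo are hypothetical. On Linux socket.socketpair() creates Unix-domain sockets, so this shows only the API contract and does not exercise the TCP path involved in this bug.

```python
import socket
import threading


def wait_for_peer_close(sock, done):
    """Drain the socket until recv() signals that the peer has closed."""
    while sock.recv(4096):
        pass  # discard data; only the end-of-stream matters here
    done.set()


def demo(timeout=5.0):
    """Return True if the reader thread exits after the peer closes."""
    parent_end, remote_end = socket.socketpair()
    done = threading.Event()
    thread = threading.Thread(
        target=wait_for_peer_close, args=(parent_end, done), daemon=True
    )
    thread.start()

    remote_end.sendall(b"last output")
    remote_end.close()  # the "remote application" terminates

    # join() with a timeout turns a hang into a detectable condition
    # instead of blocking forever.
    thread.join(timeout)
    finished = done.is_set() and not thread.is_alive()
    parent_end.close()
    return finished


if __name__ == "__main__":
    print("reader saw the close:", demo())
```

The timeout on join() is the relevant design choice: a thread that is still alive afterwards matches the symptom reported above, and the caller can then log the affected socket rather than wait for ctrl-c.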

Additional info:

The failing kernel is the 2.4.18-17.7.xsmp i686 build. The 2.4.18-10smp i686
build works fine. Glibc is glibc-2.2.5-40 (i686).

This is an MPI implementation, and the described sequence is used to launch the
MPI application on the cluster nodes. Unfortunately, no example source is
available (although one could be created), and the problem is not easy to
reproduce since it requires a relatively high number of cluster nodes (e.g. 32
nodes work fine, 64 do not).

Comment 1 Bugzilla owner 2004-09-30 15:40:07 UTC
Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/