Bug 203122

Summary:	Applications listening on a port stop accepting connections on a XenU kernel
Product:	[Fedora] Fedora	Reporter:	Russell McOrmond <russell>
Component:	xen	Assignee:	Herbert Xu <herbert.xu>
Status:	CLOSED INSUFFICIENT_DATA	QA Contact:	Brian Brock <bbrock>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	5	CC:	bstein, katzj
Target Milestone:	---
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2007-03-16 14:57:58 UTC	Type:	---

Description Russell McOrmond 2006-08-18 14:49:28 UTC

Description of problem:
Applications that are listening on a port for connections sometimes go into a
state where they refuse new connections.  Restarting the application so that
it binds to the port again fixes the problem.

Version-Release number of selected component (if applicable):

I did not notice the problem on kernel-xenU-2.6.17-1.2145_FC5, but I have on more
recent kernels such as kernel-xenU-2.6.17-1.2174_FC5.

How reproducible:

The problem is very intermittent.  I see it mostly on the busiest ports, such
as the SMTP server on my primary mail server, or the HTTP port on the busiest
webservers (which are on different XenU images).

I don't see it on servers that are more infrequently accessed.

Steps to Reproduce:

Unfortunately, this is not something that can be reproduced at will.
  
Additional info:

I am aware that scatter/gather support is being added to xennet, and
those patches may solve this problem:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=189112

It may also relate to the TCP checksum problem observed elsewhere
(https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=186183).  I am not running
NAT on these machines.

These problems may be fixed by the fixes for the other bugs, but I wanted there
to be a bug report that people who have noticed this problem can attach to.
This way we will have more people testing to ensure that it is gone.

Comment 1 mike gifford 2006-08-19 15:13:13 UTC

I want to note that I have seen this problem too.  I've set Apache to restart
every 4 hours so that there isn't too much downtime, but I am finding this very
frustrating.

As far as I have noticed, it only seems to happen on one virtual server in my
install.

It is odd because Apache is running (just not listening), and it doesn't seem to
affect any other ports (I can still ssh in).  Port 80 just isn't responding.

Comment 2 Herbert Xu 2006-08-21 10:33:37 UTC

Thanks for the report.  I'm not aware of any existing bugs that can produce
behaviour like this, so this could be something new.  What kernel version are you
guys using in dom0?

When this problem occurs, I would like to see the output of ss -an (or netstat
-ant if you don't have ss).  Please also attach strace to the daemon process, run
tcpdump on the vifX.0 interface in dom0 as well as on eth0 in domU, and then
attempt a connection to the daemon.
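
Because the failure is intermittent, the socket state is easy to miss before the daemon is restarted. Below is a minimal Python sketch of a watchdog that probes the port and saves the output of ss -an once the port stops answering. The function names, polling interval, failure threshold, and output file name are assumptions for illustration and are not part of this report. Note that the probe only checks whether the TCP handshake completes; it does not prove the daemon itself called accept().

```python
import socket
import subprocess
import time


def port_accepts(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers both an active refusal and a timeout.
        return False


def snapshot_sockets(path):
    """Save the output of `ss -an` to path (assumes ss from iproute is installed)."""
    with open(path, "w") as out:
        subprocess.run(["ss", "-an"], stdout=out, stderr=subprocess.STDOUT, check=False)


def watch(host, port, interval=60.0, failures_needed=3):
    """Poll the port; after several consecutive failures, snapshot once and return the file name."""
    failures = 0
    while True:
        if port_accepts(host, port):
            failures = 0
        else:
            failures += 1
            if failures >= failures_needed:
                # Hypothetical file naming scheme: one timestamped file per incident.
                path = time.strftime("ss-%Y%m%d-%H%M%S.txt")
                snapshot_sockets(path)
                return path
        time.sleep(interval)
```

Calling watch("127.0.0.1", 25) inside the affected domU would block until the SMTP port fails three probes in a row, then leave a timestamped file that can be attached here before the daemon is restarted.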

Comment 3 Russell McOrmond 2006-08-21 14:12:54 UTC

In my case I am running 2.6.17-1.2174_FC5xen0 (and 2.6.17-1.2174_FC5xenU for the
domUs).

I've also run previous versions with similar results.  This is an extremely
intermittent problem, but I will do as you suggest when it next happens.

Note: 'ss' seems to be part of iproute, so it is already installed on my Xen0 and
XenUs.

Comment 4 Herbert Xu 2006-09-27 10:54:03 UTC

Please let me know if this still happens with 2.6.18 (2189) in FC5 testing.  If
it does, please provide the debugging output I requested previously.  Thanks.

Comment 5 Russell McOrmond 2006-10-05 17:38:56 UTC

Once the new 'xen' package is available and tested, I'm going to roll out the
latest kernel to various machines.  My gut feeling is that this specific problem
only applied to older kernels, but it has been hard to verify due to entirely
different problems with newer kernels.

Comment 6 Herbert Xu 2006-11-06 23:08:38 UTC

The xen package is now available in testing.

Comment 7 Russell McOrmond 2006-11-15 21:52:19 UTC

A quick note.  I am still monitoring this.  While I upgraded another server to
the latest kernel last week, I only upgraded my mail server earlier today.  This
afternoon I saw another one of those odd situations where I needed to restart
the mail server.

I didn't do any of the suggested debugging, as I was concentrating on figuring
out why email wasn't flowing.  Only after I restarted and mail was flowing did I
think that this would have been an opportunity for testing.

While doing the 'ss -an' suggested above is easy, I don't see how I'll be able
to diagnose anything with tcpdump.  This is an extremely busy mail server
(mail.flora.ca, which is the primary mail server for a number of domains),
which is why this "race condition", whatever it is, shows up at all.  Any attempt
to attach tcpdump will just flood me with data that I won't be able to do much with.

I also don't understand the suggestion of strace, which I believe is a tool that
has to be used to run the command in the first place.  Is there a way to attach
to and trace a specific process ID once that process is identified?
This bug is too intermittent to just run 'strace' and expect to get any useful
results.

Comment 8 Herbert Xu 2006-11-17 05:02:10 UTC

You can get tcpdump to write the results to a file for later analysis.  Just
call it with -w <filename>.  As for stracing a running process, you can use -p
<pid> to attach to it.  Thanks.
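
As a rough illustration of the two options above, the Python sketch below builds the command lines rather than running them. The helper names, the optional capture filter that limits traffic to one test client, the extra -n, -f and -tt flags, and all values in the example (interface vif1.0, address 192.0.2.10, PID 1234, file names) are assumptions chosen for illustration; only -w <filename> and -p <pid> come from this comment.

```python
import shlex


def tcpdump_argv(interface, outfile, port, client_ip=None):
    """Build a tcpdump command that writes raw packets to outfile with -w."""
    capture_filter = f"tcp port {port}"
    if client_ip:
        # Assumption: capturing only one test client keeps the file small on a busy server.
        capture_filter = f"host {client_ip} and {capture_filter}"
    return ["tcpdump", "-n", "-i", interface, "-w", outfile, capture_filter]


def strace_argv(pid, outfile):
    """Build an strace command that attaches to a running process with -p."""
    # -f follows child processes, -tt adds timestamps, -o writes the trace to a file.
    return ["strace", "-f", "-tt", "-p", str(pid), "-o", outfile]


def as_shell(argv):
    """Render an argument list as a single, safely quoted shell command line."""
    return " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    print(as_shell(tcpdump_argv("vif1.0", "dom0-smtp.pcap", 25, "192.0.2.10")))
    print(as_shell(tcpdump_argv("eth0", "domU-smtp.pcap", 25, "192.0.2.10")))
    print(as_shell(strace_argv(1234, "smtpd.strace")))
```

The printed commands can be pasted into a root shell in dom0 and domU, or the lists can be passed directly to subprocess.Popen. The resulting capture files can then be read back later with tcpdump -r <filename>.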

Comment 9 Stephen Tweedie 2007-03-16 14:57:58 UTC

Closing due to insufficient data.  Please reopen if you are still able to
reproduce and capture the requested information.