Description of problem:
Applications that are listening on a port for connections sometimes go into a state where they refuse new connections. Restarting the application so that it binds to the port again fixes the problem.

Version-Release number of selected component (if applicable):
I did not notice the problem on kernel-xenU-2.6.17-1.2145_FC5, but have on more recent kernels such as kernel-xenU-2.6.17-1.2174_FC5.

How reproducible:
The problem is very intermittent. I see it mostly on the busiest ports, such as the SMTP server on my primary mail server, or the HTTP port on the busiest webservers (which are on different XenU images). I don't see it on servers that are accessed less frequently.

Steps to Reproduce:
Unfortunately, this is not something that can be reproduced at will.

Additional info:
I am aware that scatter/gather support is being added to xennet, and this may be a problem solved by those patches: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=189112
It may also relate to the TCP checksum problem observed elsewhere: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=186183 . I am not running NAT on these machines.

These problems may be fixed by the fixes for the other bugs, but I wanted there to be a bug report that people who have noticed this problem can attach to. This way we will have more people testing to ensure that it is gone.
I want to note that I have seen this problem too. I've set Apache to restart every 4 hours so that there isn't too much downtime, but I am finding this very frustrating. As far as I have noticed, it only happens on one virtual server in my install. It is odd because Apache is running (just not accepting connections), and it doesn't seem to affect any other ports (I can still ssh in). Port 80 just isn't responding.
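For anyone else relying on timed restarts: a connection probe may be a better trigger than a fixed schedule, since it can flag the stuck state when it happens and leave time to collect debugging data before restarting. The following is only a minimal sketch. The function name `port_accepts` and the default timeout are hypothetical choices, not part of any existing tool.

```python
import socket


def port_accepts(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) completes within timeout.

    Hypothetical helper for illustration: it only checks that the TCP
    handshake succeeds, not that the service answers correctly.
    """
    try:
        # create_connection performs the full TCP connect and raises
        # OSError on refusal, timeout, or an unreachable host.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `port_accepts("127.0.0.1", 80)` from cron or a monitoring loop inside the domU would show whether port 80 is still accepting connections. Probing from another host as well would exercise the path through dom0.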
Thanks for the report. I'm not aware of any existing bugs that can produce behaviour like this, so this could be something new. What kernel version are you guys using in dom0? When this problem occurs, I would like to see the output of ss -an (or netstat -ant if you don't have ss). Please also attach strace to the daemon process, run tcpdump on the vifX.0 interface in dom0 as well as on eth0 in domU, and then attempt a connection to the daemon.
In my case I am running 2.6.17-1.2174_FC5xen0 (and 2.6.17-1.2174_FC5xenU for the domUs). I've also run previous versions with similar results. This is an extremely intermittent problem, but I will do as you suggest when it next happens. Note: 'ss' seems to be part of iproute, so it is already installed on my Xen0 and XenUs.
Please let me know if this still happens with 2.6.18 (2189) in FC5 testing. If it does, please provide the debugging output I requested previously. Thanks.
Once the new 'xen' package is available and tested, I'm going to roll out the latest kernel to various machines. My gut feeling is that this specific problem only applied to older kernels, but it has been hard to verify due to entirely different problems with newer kernels.
The xen package is now available in testing.
A quick note. I am still monitoring this. While I upgraded another server to the latest kernel last week, I only upgraded my mail server earlier today. This afternoon I saw another one of those odd situations where I needed to restart the mail server. I didn't do any of the suggested debugging, as I was concentrating on figuring out why email wasn't flowing. Only after I had restarted and mail was flowing did it occur to me that this would have been an opportunity for testing. While running the 'ss -an' suggested above is easy, I don't see how I'll be able to diagnose anything with tcpdump. This is an extremely busy mail server (mail.flora.ca, the primary mail server for a number of domains), which is presumably why this "race condition", whatever it is, shows up at all. Any attempt to attach tcpdump will just flood me with data that I won't be able to do much with. I also don't understand the suggestion of strace, which I believe is a tool that has to be used to run the command in the first place. Is there a way to attach and do a trace on a specific process ID once a specific process is identified? This bug is too intermittent to just run 'strace' and expect to get any useful results.
You can get tcpdump to write the results to a file for later analysis. Just call it with -w <filename>. As for stracing a running process, you can use -p <pid> to attach to it. Thanks.
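To make the two suggestions above concrete, here is a minimal sketch that assembles the command lines. The helper names, the output file names, and the PID are hypothetical. Two details are additions not requested in this thread: the `port` capture filter, which may keep the capture manageable on a busy server, and strace's `-o` option, which writes the trace to a file.

```python
def tcpdump_cmd(interface, outfile, port):
    """Build a tcpdump command line (hypothetical helper).

    -i selects the interface and -w writes raw packets to a file.
    The trailing "port N" filter is an assumption added here to
    limit the capture to a single service.
    """
    return ["tcpdump", "-i", interface, "-w", outfile, "port", str(port)]


def strace_cmd(pid, outfile):
    """Build an strace command line (hypothetical helper).

    -p attaches to an already-running process and -o writes the
    trace to a file instead of stderr.
    """
    return ["strace", "-p", str(pid), "-o", outfile]


# Example for the SMTP case in domU; the file names and PID are placeholders.
print(" ".join(tcpdump_cmd("eth0", "smtp-domU.pcap", 25)))
print(" ".join(strace_cmd(1234, "daemon.strace")))
```

The same `tcpdump_cmd` call with the dom0 vif interface name in place of `eth0` would cover the other side of the capture. Both commands normally need to be run as root.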
Closing due to insufficient data. Please reopen if you are still able to reproduce the problem and capture the requested information.