Description of problem:
We have a cluster of OAS servers running RHEL4 U4 that are built and configured identically. On one of the servers we've been having an issue with a TCP frozen window, but it has not occurred on any of the other servers. When the issue does occur, it lasts only 4-5 seconds, but that is enough to cause application timeouts.
I've searched through the bug lists and the knowledge base and have not been able to find anything applicable to this issue. There is certainly not enough in this ticket to identify a specific new bug, but I would like to know if there is an existing issue/fix that might be known. I can supply additional information if necessary.
Version-Release number of selected component (if applicable):
$ cat /proc/version
Linux version 2.6.9-42.ELsmp (firstname.lastname@example.org) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-2)) #1 SMP Wed Jul 12 23:27:17 EDT 2006
It seems to occur irregularly, with no specific way to reproduce the issue. It occurs when the server is in production, but not in a predictable manner or timeframe. Testing has been unable to reproduce the issue outside of production.
The issue has been identified using the Opnet tool to do packet captures on all devices involved and analyzing after the situation occurs. The report does not include details of the session, but I'm including the output below:
TCP Frozen Window
If a tier pair is identified as having a TCP Frozen Window bottleneck, the advertised TCP Receive Window has dropped to a value smaller than the Maximum Segment Size (MSS). This is affecting your application response time.
The advertised TCP Receive Window has dropped to a value smaller than the MSS. When this occurs, the sender cannot send any data until the receive window is one MSS or larger.
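The condition described above can be sketched as a small check for use when reading a packet capture by hand. This is only an illustration, not the Opnet tool's actual logic: the function names are hypothetical, and it assumes the window scale shift (if any) was read from the SYN packets of the same connection.

```python
def advertised_window(window_field: int, wscale_shift: int = 0) -> int:
    """Effective receive window in bytes.

    window_field is the 16-bit window value from the TCP header;
    wscale_shift is the window scale negotiated in the SYN exchange
    (0 if window scaling is not in use).
    """
    if not 0 <= window_field <= 0xFFFF:
        raise ValueError("window field must fit in 16 bits")
    if not 0 <= wscale_shift <= 14:
        raise ValueError("window scale shift must be between 0 and 14")
    return window_field << wscale_shift


def is_frozen(window_field: int, mss: int, wscale_shift: int = 0) -> bool:
    """True if the advertised window is smaller than one MSS."""
    return advertised_window(window_field, wscale_shift) < mss
```

For example, with an MSS of 1460 bytes and no window scaling, an advertised window of 512 meets the report's definition of frozen, while the same header value with a scale shift of 2 (2048 bytes) does not.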
To determine if the receive window has become larger, the sending side periodically sends one-byte probe packets. The contents of these probe packets depend on the particular implementation, but they are usually sent with an exponential backoff.
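To give a feel for how an exponential backoff spaces out the probes, here is a minimal sketch. The starting interval and the cap are made-up illustrative values, not the timers of any particular kernel, and the function name is hypothetical.

```python
def probe_times(first_interval: float, max_interval: float, count: int) -> list:
    """Elapsed time (seconds since the window froze) at which each probe is sent.

    The gap between probes doubles after every probe, up to max_interval.
    """
    times = []
    elapsed = 0.0
    interval = first_interval
    for _ in range(count):
        elapsed += interval
        times.append(elapsed)
        interval = min(interval * 2, max_interval)
    return times


# Illustrative values only: first probe after 0.5 s, gap capped at 60 s.
print(probe_times(0.5, 60.0, 4))  # [0.5, 1.5, 3.5, 7.5]
```

With doubling gaps, the wait between probes grows quickly, which is one reason a short-lived frozen window can turn into a stall of several seconds if the sender is relying on probes alone.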
The usual cause of a TCP frozen window is that the application on the receiving side is not taking data from the TCP receive buffer quickly enough.
Consider the following solutions:
1. Send less data.
2. Have the receiving application retrieve the data more quickly; if the application cannot process all the data at once, consider storing the data in another buffer.
3. Upgrade the receiving machine.
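Suggestion 2 above (retrieve the data quickly and park it in another buffer) could look roughly like the following sketch. It is not how the OAS application works; the function names are hypothetical, and the unbounded queue is an assumption that trades memory for an open receive window.

```python
import queue
import socket
import threading


def drain_socket(sock: socket.socket, out_queue: queue.Queue,
                 chunk_size: int = 65536) -> None:
    """Read from sock as fast as possible and hand chunks to out_queue.

    A None item is queued when the peer closes the connection.
    """
    while True:
        chunk = sock.recv(chunk_size)
        if not chunk:
            out_queue.put(None)
            return
        out_queue.put(chunk)


def start_drainer(sock: socket.socket):
    """Start a background thread that empties the socket's receive buffer."""
    buffered = queue.Queue()  # unbounded: slow processing no longer blocks recv()
    worker = threading.Thread(target=drain_socket, args=(sock, buffered),
                              daemon=True)
    worker.start()
    return buffered, worker
```

The slow part of the application then consumes chunks from the queue at its own pace, while the kernel receive buffer is emptied promptly so the advertised window stays open.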
Unfortunately RHEL-4 contains a TCP receive window clamping issue that was fixed upstream around versions 2.6.14 and 2.6.15. It can be triggered by applications that send a lot of small packets to a slow reader.
This has received some attention lately, so I'll be looking for the right patch in the next few days. Let's hope we can still fix this for RHEL-4.
The TCP frozen window issue could come from the kernel, the application, or the hardware, so the root cause may be hard to locate. How often does the application timeout occur?
As mentioned in comment #2, the RHEL-4 kernel has a receive window clamping issue; the fix is upstream commit 09e9ec87. I'm not sure whether it will fix the reporter's TCP frozen window issue, but the kernel should handle this situation better with that commit.
*** This bug has been marked as a duplicate of bug 546324 ***