Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Run application
2. Observe in tcpdump that the window size falls to 0 bytes
Description of Problem:
The application runs a large query against an Oracle database. Soon after,
the application goes almost totally idle. An strace of the process shows it
reading 1 byte at a time from the socket on fd 10 (/proc/pid/fd/10 is a socket).
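For anyone repeating the check that a descriptor seen in strace is a socket, here is a minimal Python sketch. It assumes a Linux /proc filesystem, where the link target of a socket descriptor has the form "socket:[inode]". The function names are illustrative and not part of any tool used in this report.

```python
import os


def is_socket_link(target):
    """Return True if a /proc/<pid>/fd/<n> link target names a socket."""
    # On Linux, a socket descriptor resolves to a string like "socket:[12345]".
    return target.startswith("socket:[") and target.endswith("]")


def fd_is_socket(pid, fd):
    """Check a descriptor of a live process; pid and fd come from the caller."""
    return is_socket_link(os.readlink(f"/proc/{pid}/fd/{fd}"))
```

Calling fd_is_socket(pid, 10) with the pid of the affected process would correspond to the check described above; reading another user's /proc entries normally requires root.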
Looking further into this, we ran tcpdump on the communications between
the box and the oracle server. The dump starts off normal, with the
solaris tcp receive window constant at 24616. The linux window starts at
about 8k and quickly rises to ~50-64k.
At a seemingly random point, the tcp receive window on the linux box
starts to fall quickly, eventually reaching 0. The app then starts reading
the data byte by byte. Once the particular query is completed, a new
socket is opened and the receive window returns to normal behaviour. The
problem can then recur on a later data connection.
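To find the point in a long capture where the window collapses, a small script can pull the advertised window out of tcpdump's one-line text output. The sketch below is only an illustration: it assumes each packet line contains "src > dst:" followed by "win N", and the addresses, ports and sample lines are hypothetical, not taken from the actual capture. The values are the windows as tcpdump prints them.

```python
import re

# Matches "<src> > <dst>: ... win <n>" in tcpdump's one-line text output.
_PKT = re.compile(r"(\S+) > (\S+): .*?\bwin (\d+)")


def windows_from(lines, host):
    """Return the advertised windows, in order, for packets sent by host."""
    result = []
    for line in lines:
        m = _PKT.search(line)
        # tcpdump prints the sender as "<host>.<port>".
        if m and m.group(1).startswith(host + "."):
            result.append(int(m.group(3)))
    return result


def first_zero_window(lines, host):
    """Return the index of the first zero-window packet from host, or None."""
    for i, win in enumerate(windows_from(lines, host)):
        if win == 0:
            return i
    return None


# Hypothetical capture lines, for illustration only.
sample = [
    "10.0.0.1.32768 > 10.0.0.2.1521: . ack 1 win 8192 (DF)",
    "10.0.0.2.1521 > 10.0.0.1.32768: P 1:1461(1460) ack 1 win 24616 (DF)",
    "10.0.0.1.32768 > 10.0.0.2.1521: . ack 1461 win 0 (DF)",
]
```

With the sample above, windows_from(sample, "10.0.0.1") collects the windows advertised by the first host, and first_zero_window reports where in that sequence the window first reaches 0.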
All other network traffic to and from the box seems normal.
This is using Oracle client 126.96.36.199. The network card is an intel
etherexpress 100. The problem vanishes if we use a custom 2.4.15aa1 kernel.
Other things we tried in diagnosing the problem:
Replaced the eepro100 driver with e100: same problem.
Added an additional Intel ethernet card to the box on the same vlan as the
oracle server: same problem. This eliminated the network card, network
cables, and cat port as causes.
There were no routing problems. The linux box talked directly to the
oracle box without traversing any routers, and the oracle box replied the
same way.
The duplex settings are correct, set at 100-FD.
There are no errors on any of the cisco counters for either the linux box
or the solaris box.
Using 2.4.15aa1 solves our problem. The app was run 3 times as a test and
completed successfully last night.
Fix added to the kernel; kernels 2.4.9-17.6 and later have this fix.
Fixed a year ago; new kernels include the fix. Closing.