Version-Release number of selected component (if applicable):
2.4.9-6enterprise

How Reproducible:
Somewhat regularly.

Steps to Reproduce:
1. Run application
2. Observe tcpdump show the window size fall to 0 bytes

Description of Problem:
The application does a large query on an Oracle database. Soon afterwards, the application goes almost totally idle. An strace of the process shows the application doing

read(10,"U",2000)=1
read(10,"N",1999)=1

etc., getting 1 byte at a time from the socket. /proc/pid/fd/10 is a socket.

Looking further into this, we ran tcpdump on the communications between the box and the Oracle server. The dump starts off normal, with the Solaris TCP receive window constant at 24616. The Linux window starts at about 8k and quickly rises to ~50-64k. At a somewhat random point, the TCP receive window on the Linux box starts to fall quickly, eventually reaching 0. The app then starts reading the data byte by byte.

Once the particular query is completed, a new socket is opened, and the receive window returns to normal behaviour. The problem can then recur during a later data connection. All other network traffic to and from the box seems normal.

This is using Oracle client 8.1.7.2. The network card is an Intel EtherExpress 100.

The problem vanishes if we use a custom 2.4.15aa1 kernel.

Other things we tried in diagnosing the problem:
- Replaced the eepro100 driver with e100: same problem.
- Added an additional Intel ethernet card to the box on the same VLAN as the Oracle server: same problem. This eliminated the network card, network cables, and cat port as causes.
- There were no routing problems. The Linux box talked directly to the Oracle box without traversing any routers, and the Oracle box replied the same way. There are no router hops between the boxes.
- The duplex settings are correct, set at 100-FD.
- There are no errors on any of the Cisco counters for either the Linux box or the Solaris box.

Using 2.4.15aa1 solves our problem.
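For anyone trying to confirm the same symptom, here is a rough Python sketch of how the window collapse could be picked out of a saved tcpdump text log. It is not part of the original diagnosis. It assumes the classic tcpdump line layout ("src.port > dst.port: ... win N"); the host names linuxbox and orasrv, the ports, and the sample lines are all made up for illustration and are not taken from the actual capture.

```python
import re
from typing import Iterable, List, Optional

# Matches the advertised receive window in tcpdump text output, e.g. "win 24616".
WIN_RE = re.compile(r"\bwin (\d+)")


def advertised_windows(lines: Iterable[str], sender: str) -> List[int]:
    """Return the window sizes advertised by packets sent from `sender`."""
    windows = []
    for line in lines:
        # tcpdump prints "src.port > dst.port:"; the source is the last
        # token before the " > " separator.
        head, sep, _ = line.partition(" > ")
        parts = head.split()
        if not sep or not parts:
            continue
        if not parts[-1].startswith(sender + "."):
            continue
        match = WIN_RE.search(line)
        if match:
            windows.append(int(match.group(1)))
    return windows


def first_zero_window(windows: List[int]) -> Optional[int]:
    """Return the index of the first zero-window advertisement, or None."""
    for index, window in enumerate(windows):
        if window == 0:
            return index
    return None


# Hypothetical sample lines (NOT from the real dump in this report).
SAMPLE = [
    "10:00:00.000001 linuxbox.32768 > orasrv.1521: . ack 1 win 8760 (DF)",
    "10:00:00.000002 orasrv.1521 > linuxbox.32768: P 1:1461(1460) ack 1 win 24616 (DF)",
    "10:00:00.000003 linuxbox.32768 > orasrv.1521: . ack 1461 win 64240 (DF)",
    "10:00:00.000004 linuxbox.32768 > orasrv.1521: . ack 2921 win 32120 (DF)",
    "10:00:00.000005 orasrv.1521 > linuxbox.32768: P 2921:4381(1460) ack 1 win 24616 (DF)",
    "10:00:00.000006 linuxbox.32768 > orasrv.1521: . ack 4381 win 4096 (DF)",
    "10:00:00.000007 linuxbox.32768 > orasrv.1521: . ack 4382 win 0 (DF)",
]

if __name__ == "__main__":
    linux_windows = advertised_windows(SAMPLE, "linuxbox")
    print("linux windows:", linux_windows)
    print("first zero window at index:", first_zero_window(linux_windows))
```

Run against a real log (for example one line per packet read from a file), the same two functions would show whether the Linux side's advertised window shrinks to 0 while the Solaris side stays constant.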
The app was run three times last night as a test, and it completed successfully.
Fix added to the kernel; kernels 2.4.9-17.6 and later have this fix.
Fixed a year ago; new kernels include the fix. Closing.