Description of problem:
One of our network stress tests creates multiple asynchronous FTP sessions to upload files to an FTP server running vsftpd. After running the test for 5369 seconds, the following messages were seen:

TCP: Treason uncloaked! Peer 10.16.65.2:30929/43133 shrinks window 1454493069:1454493181. Repaired.
TCP: Treason uncloaked! Peer 10.16.65.2:11412/32825 shrinks window 1455223983:1455223999. Repaired.
TCP: Treason uncloaked! Peer 10.16.65.2:60762/47808 shrinks window 1540525259:1540525283. Repaired.
TCP: Treason uncloaked! Peer 10.16.65.2:60762/47808 shrinks window 1540525259:1540525283. Repaired.
TCP: Treason uncloaked! Peer 10.16.65.2:29102/34032 shrinks window 1537279726:1537279782. Repaired.
TCP: Treason uncloaked! Peer 10.16.65.2:48134/40988 shrinks window 2116907755:2116907827. Repaired.
TCP: Treason uncloaked! Peer 10.16.65.2:17795/40071 shrinks window 2507860403:2507860411. Repaired.
TCP: Treason uncloaked! Peer 10.16.65.2:63525/38264 shrinks window 2530768102:2530768214. Repaired.
TCP: Treason uncloaked! Peer 10.16.65.2:6970/35744 shrinks window 2527092526:2527092638. Repaired.
TCP: Treason uncloaked! Peer 10.16.65.2:58573/38937 shrinks window 2725737921:2725737953. Repaired.
TCP: Treason uncloaked! Peer 10.16.65.2:41026/54177 shrinks window 2721859297:2721859305. Repaired.
TCP: Treason uncloaked! Peer 10.16.65.2:45678/41090 shrinks window 2678731282:2678731394. Repaired.
TCP: Treason uncloaked! Peer 10.16.65.2:61704/36478 shrinks window 2769746775:2769746791. Repaired.
TCP: Treason uncloaked! Peer 10.16.65.2:33882/60420 shrinks window 2786137782:2786137790. Repaired.
TCP: Treason uncloaked! Peer 10.16.65.2:12864/56566 shrinks window 2780615668:2780615724. Repaired.
TCP: Treason uncloaked! Peer 10.16.65.2:63337/54108 shrinks window 2768169071:2768169159. Repaired.
TCP: Treason uncloaked! Peer 10.16.65.2:23917/46247 shrinks window 2827003487:2827003599. Repaired.
TCP: Treason uncloaked! Peer 10.16.65.2:21245/43466 shrinks window 2777907290:2777907298. Repaired.
TCP: Treason uncloaked! Peer 10.16.65.2:56791/35263 shrinks window 2838361343:2838361399. Repaired.
TCP: Treason uncloaked! Peer 10.16.65.2:33882/60420 shrinks window 2786137782:2786137790. Repaired.
TCP: Treason uncloaked! Peer 10.16.65.2:37380/55687 shrinks window 2677457939:2677457947. Repaired.
TCP: Treason uncloaked! Peer 10.16.65.2:18852/44810 shrinks window 2770259308:2770259356. Repaired.
TCP: Treason uncloaked! Peer 10.16.65.2:23917/46247 shrinks window 2827003487:2827003599. Repaired.
printk: 10 messages suppressed.
TCP: Treason uncloaked! Peer 10.16.65.2:42493/51720 shrinks window 2957366109:2957366125. Repaired.
TCP: Treason uncloaked! Peer 10.16.65.2:11898/41197 shrinks window 2993448802:2993448874. Repaired.
TCP: Treason uncloaked! Peer 10.16.65.2:11898/41197 shrinks window 2993448802:2993448874. Repaired.
TCP: Treason uncloaked! Peer 10.16.65.2:63715/43089 shrinks window 2993490359:2993490407. Repaired.
TCP: Treason uncloaked! Peer 10.16.65.2:11898/41197 shrinks window 2993448802:2993448874. Repaired.
TCP: Treason uncloaked! Peer 10.16.65.2:63715/43089 shrinks window 2993490359:2993490407. Repaired.
TCP: Treason uncloaked! Peer 10.16.65.2:63715/43089 shrinks window 2993490359:2993490407. Repaired.
TCP: Treason uncloaked! Peer 10.16.65.2:23910/42185 shrinks window 3591659058:3591659130. Repaired.
TCP: Treason uncloaked! Peer 10.16.65.2:14029/54342 shrinks window 3589837957:3589837981. Repaired.
TCP: Treason uncloaked! Peer 10.16.65.2:19314/52612 shrinks window 3579204604:3579204652. Repaired.
TCP: Treason uncloaked! Peer 10.16.65.2:62788/44666 shrinks window 3749790764:3749790836. Repaired.
TCP: Treason uncloaked! Peer 10.16.65.2:20801/37999 shrinks window 3733524289:3733524401. Repaired.
TCP: Treason uncloaked! Peer 10.16.65.2:12261/51059 shrinks window 3748299481:3748299553. Repaired.
TCP: Treason uncloaked! Peer 10.16.65.2:49105/43639 shrinks window 96390206:96390254. Repaired.

Version-Release number of selected component (if applicable):
kernel-2.6.18-128.el5.i686
vsftpd-2.0.5-12.el5.i386
curl-7.15.5-2.el5.i386

How reproducible:
unknown

Steps to Reproduce:
1. Set up the vsftpd FTP client (hp-dl360g5-02.rhts.bos.redhat.com, i386) and server (hp-xw6400-01.rhts.bos.redhat.com, i386).
2. Create the file /var/ftp/regularFile, 1048576 bytes in size, on the client.
3. Upload the file to the server asynchronously for 5369 seconds (the server disk might run out of space).
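Several of the messages in the log above repeat for the same port pair, which is easier to see once the lines are tallied. The following is a minimal sketch of such a tally, not a tool used in this report. The field names (peer_port, local_port, span) are my own labels, and reading the two ports as peer port and local port, and the two trailing numbers as 32-bit sequence numbers, is an assumption about the kernel's message format.

```python
import re
from collections import Counter

# Pattern for the message format quoted in the report. The group names are
# hypothetical labels; the port order (peer/local) is an assumption.
TREASON_RE = re.compile(
    r"TCP: Treason uncloaked! Peer (?P<peer>\d+\.\d+\.\d+\.\d+):"
    r"(?P<peer_port>\d+)/(?P<local_port>\d+) shrinks window "
    r"(?P<first>\d+):(?P<second>\d+)\. Repaired\."
)


def parse_treason(log_text):
    """Return one dict per 'Treason uncloaked' message found in log_text."""
    events = []
    for m in TREASON_RE.finditer(log_text):
        first, second = int(m["first"]), int(m["second"])
        events.append({
            "peer": m["peer"],
            "peer_port": int(m["peer_port"]),
            "local_port": int(m["local_port"]),
            # Assumed to be 32-bit sequence numbers, so the difference is
            # taken modulo 2**32 to survive wraparound.
            "span": (second - first) % 2**32,
        })
    return events


def repeats(events):
    """Count how often each (peer_port, local_port) pair was reported."""
    return Counter((e["peer_port"], e["local_port"]) for e in events)


# Three lines copied from the log in this report.
sample = (
    "TCP: Treason uncloaked! Peer 10.16.65.2:30929/43133 shrinks window "
    "1454493069:1454493181. Repaired. "
    "TCP: Treason uncloaked! Peer 10.16.65.2:60762/47808 shrinks window "
    "1540525259:1540525283. Repaired. "
    "TCP: Treason uncloaked! Peer 10.16.65.2:60762/47808 shrinks window "
    "1540525259:1540525283. Repaired."
)

if __name__ == "__main__":
    for pair, count in repeats(parse_treason(sample)).most_common():
        print(pair, count)
```

The pattern uses finditer rather than splitting on newlines, so it also works on dmesg output that has been pasted as one long line, as it was here.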
We (JBoss) are also hitting what appears to be the same bug during performance testing of JBoss HornetQ, the messaging component of JBoss Enterprise Application Platform (a Red Hat product). It causes TCP connections to hang intermittently at very high load. I'd therefore consider this to be a higher priority/severity than is currently set, since it means one of Red Hat's main products can hang at high load. From some googling, this bug appears to be already fixed in Linux kernel version 2.6.25 (?). Please see this link: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=5ea3a7480606cef06321cd85bc5113c72d2c7c68 Thanks!
Right, the bug is fixed upstream and in RHEL6. We should backport this.
I think I have a fix for you. The attached patch has been tested for three days so far on 350 servers running RHEL5.4 (2.6.18-164.15.1). With the patch, the servers no longer suffer pauses in network traffic, nor do they log the erroneous "treason uncloaked" messages.

The function in question, tcp_window_allows(), was changed by several commits, one of which renamed it to tcp_mss_split_point(). The function and how it is invoked have not changed between RHEL6 Beta2 (2.6.32-44.2) and the current kernel.org source.

The commits touching the function that were evaluated in creating this patch are:

0e3a4803 Mon Dec 24 21:33:45 2007 -0800
This is the big one that changes the name of the function and rewrites where it is invoked from. It is completely contained within tcp_output.c. However, it is just a dependent change and does not address the problem. I took the commit wholesale, and it applies cleanly except for the last delta to tcp_push_one(), which is rejected. I did not use that delta since it has nothing to do with this change; it is an unused-variable cleanup left unaddressed from the earlier commit 66f5fe62 that slipped into this commit, and I didn't wish to pull it in.

90840def Mon Dec 31 04:48:41 2007 -0800
Rejected this commit, leaving the one line it changes in tcp_mss_split_point() unmodified. It is a cleanup-only commit and does not change functionality. The other changes in the commit are unnecessary. The commit can be applied if desired, but it will need more dependent commits.

056834d9 Mon Dec 31 14:57:14 2007 -0800
Just some whitespace cleanup. I did not apply it in my patch. The commit may be applied if desired, but I skipped it because it pulls in other dependencies.

5ea3a748 Tue Mar 11 17:55:27 2008 -0700
This is the commit that fixes the bug in question and is mentioned in Comment 1. It has to be logically applied by hand because the above two dependent commits are missing. The inline function tcp_write_queue_tail() has to be expanded manually to its skb_peek_tail() call.

17515408 Tue Apr 15 20:36:55 2008 -0700
This commit is not necessary to fix the bug, but makes the code more efficient. It may be skipped if desired, but it is an obvious cleanup and brings the tcp_mss_split_point() function logically in sync with RHEL6 Beta2 (2.6.32-44.2) and the latest kernel.org version.

Any feedback on the patch is appreciated.
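To make the role of the split point concrete, here is a toy model in Python of the general idea as I understand it: when a large buffer is split for sending, the chunk handed to the network should fit within both the congestion window and the receiver's advertised window. This is a sketch under stated assumptions, not the attached patch and not the kernel's tcp_mss_split_point(). The function name split_point and its plain-integer parameters are hypothetical, and the real code operates on kernel socket and skb structures.

```python
def split_point(skb_len, window, mss, cwnd_segments):
    """Toy model: how many bytes of one buffer to send now.

    skb_len       -- bytes queued in the buffer (hypothetical plain integer)
    window        -- bytes the receiver's advertised window still allows
    mss           -- maximum segment size in bytes
    cwnd_segments -- segments the congestion window still allows
    """
    cwnd_len = mss * cwnd_segments
    # Never plan to send more than the buffer holds or the receiver allows.
    needed = min(skb_len, window)
    if cwnd_len <= needed:
        # The congestion window is the tighter limit.
        return cwnd_len
    # Otherwise round down to a whole number of segments.
    return needed - needed % mss


if __name__ == "__main__":
    # 10000 bytes queued, receiver allows 4000, MSS 1448, cwnd allows 10 segments.
    print(split_point(10000, 4000, 1448, 10))
```

In this model the returned length can never exceed the receiver's window, because the result is either cwnd_len (taken only when it is no larger than min(skb_len, window)) or a rounded-down value of that minimum.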
Created attachment 440169: Proposed fix for problem 494400.
Patch looks good, will test it.
Created attachment 440868: proposed patch
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
The patch is in kernel-2.6.18-219.el5.
You can download this test kernel from http://people.redhat.com/jwilson/el5
Detailed testing feedback is always welcome.
Unfortunately, the group in my company that hit the problem won't want to go through the work of installing the -219 kernel above, since their systems are already fully working. The 2.6.18-164.15.1 version of the patch I submitted earlier has now worked flawlessly on those 350 servers for four weeks. I also checked your RHEL5u5 version of tcp_output.c in -219 against my 5u5 version of the file, and it matched exactly (even down to the whitespace). That kernel build has already been rolled out internally in limited release, so it looks like we're testing the same code change in parallel.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2011-0017.html