Bug 508914
| Field | Value |
|---|---|
| Summary | Severe problems with scp from RHEL 5.3 to RHEL 5.3 boxes |
| Product | Red Hat Enterprise Linux 5 |
| Component | openssh |
| Version | 5.3 |
| Hardware | All |
| OS | Linux |
| Status | CLOSED CURRENTRELEASE |
| Severity | urgent |
| Priority | urgent |
| Target Milestone | rc |
| Keywords | Regression, ZStream |
| Reporter | Issue Tracker <tao> |
| Assignee | Jan F. Chadima <jchadima> |
| QA Contact | BaseOS QE <qe-baseos-auto> |
| CC | aparsons, cward, ebenes, mpoole, mvadkert, pasteur, rmunilla, rodney.mckee, sgrubb, tao, tmraz, vincew |
| Doc Type | Bug Fix |
| Last Closed | 2010-03-30 13:37:27 UTC |
| Bug Blocks | 499522, 533862 |
Description
Issue Tracker, 2009-06-30 14:03:53 UTC
Event posted on 2009-06-17 20:07 BST by JoeWilson

Description of problem: See the attached file named "Rhel-5 Network Problem". When we scp a file from a RHEL 5.3 box (source) to another RHEL 5.3 box (target), scp keeps stalling and/or terminating. When we scp a file from a RHEL 5.3 box to a RHEL 5.2 box, scp works fine.

How reproducible: Just scp a file.

Steps to Reproduce: scp from a RHEL 5.3 source box to a RHEL 5.3 target box.

Actual results: The transfer either stalls, restarts and stalls again, or terminates.

Expected results: scp should transfer smoothly.

Additional info:

This event sent from IssueTracker by mpoole [Support Engineering Group] issue 308455

Event posted on 2009-06-18 19:44 BST by jwilleford

"From Rhel 5 3 to Rhel 5 2 it works fine." I'm assuming this is from a 5.2 host in the same datacenter as the 5.3 host. If so, can we get a tcpdump from a 5.2 host in the same datacenter (Salt Lake City, UT) as sc1b5? We need to keep as many variables as possible the same between the 5.2 box and the 5.3 box. The WAN between sdsd and UT is a big variable.

Internal Status set to 'Waiting on Customer'
Status set to: Waiting on Client

This event sent from IssueTracker by mpoole [Support Engineering Group] issue 308455

Event posted on 2009-06-18 20:55 BST by JoeWilson

Inside one of our data centers, transfers between the boxes complete. The problem is that from our local data center to the remote one, we cannot complete a transfer originating from one of the local RHEL 5.3 boxes to a remote RHEL 5.3 box. Every other permutation seems to work. We've tested the following and they work:

All local to local
All remote to remote
Local 3u7 to remote 3u7
Local 3u7 to remote 5.2
Local 3u7 to remote 5.3
Local 4 to remote 5.3
Local 5.2 to remote 5.2
Local 5.3 to remote 5.2
Remote 3u7 to local 3u7
Remote 5.3 to local 5.3

What fails is:

Local 5.3 to remote 5.3

Now, where would you like the tcpdump from, based on that?
Sent from my Blackberry. Please excuse any errors.

----- Original Message -----
From: Red Hat Issue Tracker [tao]
Sent: 06/18/2009 02:44 PM AST
To: Joe Wilson
Subject: (UPDATED) Issue #308455 (Severe Problems scp from Rhel-5.3 to Rhel-5.3 Boxes) [US Courts]

Update to issue 308455 by jwilleford

Action: "From Rhel 5 3 to Rhel 5 2 it works fine." I'm assuming this is from a 5.2 host in the same datacenter as the 5.3 host. If so, can we get a tcpdump from a 5.2 host in the same datacenter (Salt Lake City, UT) as sc1b5? We need to keep as many variables as possible the same between the 5.2 box and the 5.3 box. The WAN between sdsd and UT is a big variable.

Status set to: Waiting on Client
https://enterprise.redhat.com/issue-tracker/308455

Previous Events:
----------------

Posted 06-18-2009 10:34am by JoeWilson

Jason, from Rhel 5 3 to Rhel 5 2 it works fine. Sent from my Blackberry. Please excuse any errors.

Status set to: Waiting on Tech

Posted 06-18-2009 10:19am by jwilleford

rc2b3 is located in Reston, VA. sc1b5 is located in Salt Lake City, UT. rep-bl13 is located here at SDSD. This begs the question: what is the performance of a 5.2 box in the same geography as the 5.3 box, i.e. Utah?

Status set to: Waiting on Client
Internal Status set to 'Waiting on Support'
Status set to: Waiting on Tech

This event sent from IssueTracker by mpoole [Support Engineering Group] issue 308455

Event posted on 2009-06-19 20:24 BST by vincew

Jason, you may also want to take a peek at BZ 461685. In that case there is discussion of packets getting dropped on the transmit side (i.e. never getting transmitted). It is possible, though I'm not sure how likely, that some packets are getting dropped before they are transmitted. Note that with the 5.3 bnx2 driver, good results were found by increasing the transmit queue length (txqueuelen) to 10000 (the default, and what I see them running at, is 1000). So this is something else I'd recommend they try.
# ip link set eth0 txqueuelen 10000

It will change immediately, and they can also change things back to the default simply by:

# ip link set eth0 txqueuelen 1000

--vince

This event sent from IssueTracker by mpoole [Support Engineering Group] issue 308455

Event posted on 2009-06-23 18:11 BST by vincew

Jason, Greg's details of the testing posted at 12:24pm are good information. It is not just testing with ftp vs. testing with ssh; with the summary info included and the steps performed detailed, we have a solid basis to begin looking at openssh as the culprit (unless I missed it, no methodical detail like this was presented previously). From this, I feel good about advising the customer to run 5.3 with pre-5.3 openssh packages as an interim workaround until we can isolate the problem in openssh. We should then be able to downgrade the case to Sev2 per our severity practices, unless the customer has some issue with that.

It is good that we maintained some skepticism about packet loss being the root cause of the problem. Note that their intermittent loss of link should not be ignored. I would, however, point out that this is a separate problem and in any case could not be blamed on openssh, or likely even on the bnx2 driver (since the driver only reads an MII status register in hardware to know whether we have carrier). Jason, if the customer finds any cause to believe this is no longer specific to openssh (such as seeing the same problem when ftp'ing, which I would still recommend testing, though not quite as urgently now), please let me know. Otherwise I will most likely be reassigning this case to an openssh specialist in short order. Thanks, Vince

This event sent from IssueTracker by mpoole [Support Engineering Group] issue 308455

Event posted on 2009-06-23 19:00 BST by vincew

Changes (from the changelog)
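Vince's txqueuelen suggestion can be collected into a small sketch. The interface name `eth0` is taken from his example; substitute the NIC actually carrying the transfer on the affected host. Reading the value needs no privileges, but changing it requires root.

```shell
#!/bin/sh
# Sketch of the txqueuelen workaround from BZ 461685 (assumes the
# bnx2 NIC is eth0; adjust IFACE for the host under test).
IFACE=eth0

# Show the current settings; look for "qlen NNN" in the output.
ip link show "$IFACE"

# As root: raise the transmit queue length from the default 1000
# to 10000, effective immediately.
ip link set "$IFACE" txqueuelen 10000

# To revert to the default:
#   ip link set "$IFACE" txqueuelen 1000
```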
in openssh between 4.3p2-26.el5_2.1 and 4.3p2-29.el5:

* Thu Oct 02 2008 Tomas Mraz <tmraz> - 4.3p2-29
- allow options for the sshd subsystem for the sftp logging support (#452619)

* Thu Sep 11 2008 Tomas Mraz <tmraz> - 4.3p2-28
- use OpenSSL RNG directly (needed for FIPS-140-2 compliance)
- when in FIPS mode do not try to use algorithms which are not allowed (#447936)
- improve transfer speed on high latency connections (#227722)
- more correct 'service sshd status' reporting (#430877)
- add logging support to the sftp server (#452619)
- small scp manual page improvement (#433381)

Given the symptoms (especially since this apparently doesn't happen on local network transfers to a 5.3 host, only to remote servers), I am going to guess this is related to the patch for "improve transfer speed on high latency connections (#227722)". Checking the BZ for details. --vince

This event sent from IssueTracker by mpoole [Support Engineering Group] issue 308455

Event posted on 2009-06-23 21:41 BST by vincew

Excellent (not reproduced with other protocols) - now we know for sure. Still looking for patches to try backing out. Hopefully it's not BZ 227722, because it looks like four different customers' problems were solved by getting that patch in. But let's see what happens. Would the customer be willing/able to test one or more test builds of openssh packages to help nail this down? Given the parameters involved in reproducing the problem, it will probably be important that they test in their own environment as we work to isolate the patch(es) responsible for the problem they are seeing. --vince

This event sent from IssueTracker by mpoole [Support Engineering Group] issue 308455

Event posted on 2009-06-25 16:29 BST by vincew

Jason, that's great news! In a way... You already know why it's not such great news. :) The patch will have to be revisited, no doubt, unless there is some kind of buffer tuning that could be done to work around it.
There seems to be a definite problem with openssh emptying its receive buffers in this version (note that in the original tcpdumps there were even some ACK packets from the receiver indicating Receive Window Full). Now that we have this narrowed down to a specific patch, I am going to reassign this case to an openssh specialist who will either work on the patch themselves to correct it or engage Engineering. As far as I'm concerned, this is a regression. --vince

SUMMARY OF CASE

1. With 5.3 servers on the receiving end of an scp connection (running the released 5.3 openssh), transfers stall over slower WAN links. This is consistently reproducible for the customer. Problems are NOT seen when the 5.3 receiver is on the local network.

2. The problem does not reproduce with other file transfer protocols/applications; those tested include vsftpd/ftp and wget/http.

3. With rebuilt 4.3p2-29.el5 (5.3 stock) openssh packages that do NOT contain Patch 60 (openssh-4.3p2-latency.patch), the customer can no longer reproduce the problem. I am pretty sure this patch is for BZ 227722 ("improve transfer speed on high latency connections"), but no patch was ever attached to that BZ that I can tell. Unfortunately, 4 ITs were linked to that BZ (including at least 1 FSI customer), so I am pretty sure Engineering will not simply back out this patch.

4. The patch either needs to be revisited to eliminate this problem, or openssh or kernel TCP stack tuning may be necessary to work around it.

This event sent from IssueTracker by mpoole [Support Engineering Group] issue 308455

Would you be so kind as to describe the parameters of the problematic WAN line, e.g. speed, round-trip time, packet loss, MTU? What kind of connection does it use (e.g. satellite, DSL, a backed line of two or more different lines)? Are there any specific characteristics of the line (blackouts, congestion)?
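As an aside on item 3 of the summary above, rebuilding a stock package without a single patch is standard RPM practice; a hypothetical sketch follows. The spec-file edit via sed and the build directory are assumptions (on RHEL 5 the default tree is /usr/src/redhat), and the SRPM name is taken from the version cited above.

```shell
#!/bin/sh
# Hypothetical sketch: rebuild stock openssh without Patch 60
# (openssh-4.3p2-latency.patch). Requires the rpm-build toolchain
# and the source RPM.
rpm -ivh openssh-4.3p2-29.el5.src.rpm
cd /usr/src/redhat/SPECS    # default SRPM build tree on RHEL 5

# Comment out both the patch declaration and its application.
sed -i -e 's/^Patch60:/#&/' -e 's/^%patch60/#&/' openssh.spec

# Rebuild the binary packages without the latency patch.
rpmbuild -ba openssh.spec
```

The resulting packages in RPMS/ can then be installed on a test receiver to confirm the stalls disappear.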
Can you run

ping -s <some_big_value_less_than_mtu>

for about as long as the file transfer usually takes, and provide its output along with the output of a tcpdump run in parallel:

tcpdump -i <if> -w <output_file> icmp

We will review this issue again once you've had a chance to attach this information. Thank you very much in advance.

Would you be so kind as to get the system log, especially the kernel log, from the problematic machine? There may be messages from the network layer or the network device driver.

Modified the latency patch. Changed the target buffer size from 2 MB to 1/2 MB after experiments on how transfer speed depends on buffer size. Tested only on i686. Speed rises rapidly as the buffer grows up to 1/2 MB; beyond that, it decreases slightly as the buffer increases further. The new package is built as openssh-4.3p2-38.el5.

The average packet loss in this transfer is 21%. I've tested transfers with up to 33% loss, and with 3 s blackouts separated by 5 s transfer windows, without hangs.

Created attachment 376704 [details]
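The two diagnostics requested above can be run together. A sketch, with assumed placeholder values (interface eth0, a hypothetical target host, a 1400-byte payload under a typical 1500-byte MTU, and a 300-second run); tcpdump requires root:

```shell
#!/bin/sh
# Capture ICMP traffic while pinging with a large payload, roughly
# for the duration of a typical transfer. All values are examples;
# substitute the real interface, host, and duration.
TARGET=remote.example.com    # hypothetical remote host

# Start the ICMP capture in the background.
tcpdump -i eth0 -w icmp_capture.pcap icmp &
TCPDUMP_PID=$!

# One ping per second for 300 s, payload just below the MTU.
ping -s 1400 -c 300 "$TARGET" > ping_output.txt

# Stop the capture; attach both files to the case.
kill "$TCPDUMP_PID"
```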
a diff between the sequences of TCP window numbers as seen on each side.
It shows which window numbers were transmitted/received on both sides.
The problem may be caused by a network device (perhaps a firewall or a traffic encrypter/decrypter) which does not handle large TCP windows or window scaling. In that case, the workaround is to run "echo 0 > /proc/sys/net/ipv4/tcp_window_scaling" on both endpoints before the transfer begins.

Event posted on 02-03-2010 12:00pm EST by vincew

The lwn.net "TCP Window Scaling and Broken Routers" article Rob posted the link to ( http://lwn.net/Articles/92727/ ) is an interesting read indeed. One comment in particular suggests a workaround for known-problematic sites/subnets, wherein a fixed window size can be defined for the route. This allows you to leave TCP window scaling enabled on the system, so that TCP connections to hosts without broken routers in between can still benefit from window scaling:

---------
Posted Dec 14, 2006 0:15 UTC (Thu) by pcharlan (guest, #29128) [Link]

With kernel 2.6.17.13 or higher, you can also do:

THEIR_IP=1.2.3.4
MY_GATEWAY=5.6.7.8
ip route add $THEIR_IP/32 via $MY_GATEWAY window 65535

which only limits window scaling for that destination, without interfering with your other connections. [It has been a while since the original article, but this still shows up first in Google when searching for "linux tcp window scaling broken router", so perhaps this will help someone.]
---------

Note that the example above sets a "host" route, i.e. for one particular host. The same syntax can be used to set a "net" route if there is an entire subnet you know there are problems with, by using a smaller netmask. For example, to apply a fixed TCP window size to the entire 10.1.2.0 class C subnet:

ip route add 10.1.2.0/24 via 10.2.2.254 window 65535

if, in our example, 10.2.2.254 is our gateway to the 10.1.2.0/24 subnet. I would recommend the customer try applying route statements like this to test hosts on each side of the route and re-running the tests.
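The workarounds discussed above can be sketched as one script: check the current window-scaling setting (readable without privileges), then either disable scaling globally or clamp the window per route. The subnet and gateway are the example values from the lwn.net discussion above.

```shell
#!/bin/sh
# Check whether TCP window scaling is enabled; reading needs no
# privileges. The workarounds in the comments require root.
val=$(cat /proc/sys/net/ipv4/tcp_window_scaling)
echo "tcp_window_scaling=$val"    # 1 = enabled (kernel default)

# As root, to disable globally for a test (re-enable with echo 1):
#   echo 0 > /proc/sys/net/ipv4/tcp_window_scaling

# Or, less invasively, clamp the window only toward the problem
# subnet (example subnet/gateway from the lwn.net comment):
#   ip route add 10.1.2.0/24 via 10.2.2.254 window 65535
```

The per-route clamp is preferable where it works, since connections to all other hosts keep the benefit of window scaling.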
If it works (and I feel pretty good about it), then the customer has a way to leave TCP window scaling enabled system-wide in the kernel while limiting the window size used between the hosts they have problems with (most likely because of deficiencies in the routers between them). The fact that disabling TCP window scaling temporarily alleviated the problem indicates very strongly that this is what the problem is. The customer should also be aware that, if there is a problem like this with their routers, the route options or disabling TCP window scaling entirely are probably about the best we can do. I don't think Engineering will be willing to remove the latency patch entirely, since it does help other customers, and the reason it isn't helping US Courts is very, very likely a problem with their routers. Thanks, Vince

This event sent from IssueTracker by vincew issue 308455

Hello, I have seen a similar issue with RHEL 5.2 servers, but after updating to the current openssh version on the clients, the improvement was only partial. The basic network path is:

rhel5.2 client | rhel5.4 firewall (patched) | internet to remote locations (Melbourne, Sydney, London, Dubai, Hong Kong...) | rhel5.4 firewall (patched) | rhel5.2 client

The scp throughput from the firewalls to the REMOTE client improved 3-4x after applying the following:

openssh-server-4.3p2-36.el5_4.3.x86_64.rpm
openssh-clients-4.3p2-36.el5_4.3.x86_64.rpm
openssh-4.3p2-36.el5_4.3.x86_64.rpm
fipscheck-1.2.0-1.el5.x86_64.rpm
fipscheck-lib-1.2.0-1.el5.x86_64.rpm
openssl-0.9.8e-12.el5.x86_64.rpm
openssl-0.9.8e-12.el5.i686.rpm

but connections between the clients have not changed.

~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~

RHEL 5.5 Beta has been released! There should be a fix present in this release that addresses your request. Please test and report back results here, by March 3rd 2010 (2010-03-03) or sooner.
Upon successful verification of this request, post your results and update the Verified field in Bugzilla with the appropriate value. If you encounter any issues while testing, please describe them and set this bug to NEED_INFO. If you encounter new defects or have additional patch(es) to request for inclusion, please clone this bug per request and escalate through your support representative.

Hey Chris, can I ask what the fix was/is? We tend not to run beta releases on our production equipment, and I don't have any remote test systems available for testing this at this stage.

(In reply to comment #79)
> I am largely satisfied that the problem *IS* due to their routers since
> disabling tcp_window_scaling made the problem go away.

Ok, so we are closing this bug. If the client disagrees with our conclusion that the issue is caused by hardware/networking problems on their side, they are more than welcome to reopen this bug, adding all the information about testing with the changed TCP settings that we asked them for.