Bug 164942

Summary: Bad packet length error in SSH
Product: [Fedora] Fedora
Reporter: Phil Oester <bugzilla>
Component: openssh
Assignee: Tomas Mraz <tmraz>
Status: CLOSED DUPLICATE
QA Contact: Brian Brock <bbrock>
Severity: medium
Priority: medium    
Version: 4   
Target Milestone: ---   
Target Release: ---   
Hardware: i386   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2005-08-17 04:03:51 EDT
Attachments:
Description      Flags
tcpdump session  none

Description Phil Oester 2005-08-02 18:18:44 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.10) Gecko/20050716 Firefox/1.0.6

Description of problem:
SSH sessions (both interactive and rsync) regularly disconnect with error 'Bad packet length X'

Sample error (using ssh -vvv):

Received disconnect from x.x.x.x: 2: Bad packet length 1951400441.
rsync: connection unexpectedly closed (481545707 bytes received so far) [receiver]
rsync error: error in rsync protocol data stream (code 12) at io.c(420)
rsync: connection unexpectedly closed (77800 bytes received so far) [generator]
rsync error: error in rsync protocol data stream (code 12) at io.c(420)


Version-Release number of selected component (if applicable):
openssh-4.1p1-3.1

How reproducible:
Sometimes

Steps to Reproduce:
1. Start an SSH transfer.
2. Wait for it to fail, which takes anywhere from 5 to 20 minutes.

Additional info:
Comment 1 Tomas Mraz 2005-08-03 03:46:25 EDT
This can happen if there is a broken router between the server and client
machines.

What exact versions of openssh are on the server and client?

What happens if you downgrade them, for example to the openssh version from FC3?
Comment 2 Phil Oester 2005-08-03 13:54:05 EDT
Version-Release number of selected component (if applicable):
openssh-4.1p1-3.1

Could you elaborate on how my router could be broken?  The two boxes are on an
internal WAN, separated by a few cisco routers.

Downgrading to FC3 openssh isn't really an option for me -- but what specific
change between FC3 and FC4 openssh do you suspect?
Comment 3 Tomas Mraz 2005-08-03 14:18:45 EDT
So both the server and client machines run the same openssh version. Both
machines are the same arch (i386), and they are connected by a WAN with a few
Ciscos on the route and nothing else. Then (assuming ssh localhost works fine on
both server and client, which I suppose it does) the problem must be in the WAN,
most probably in the routers. The packets are somehow mangled, and ssh, being a
secure encrypted protocol, easily detects it.

This is my theory.

I cannot reproduce your problem, so this is only a guess. You will have to do
some experiments if you want the problem resolved.
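
As a hedged illustration of the theory in comment 3: an SSH receiver decrypts the start of each packet and reads the first four bytes as a big-endian packet length. If the ciphertext was altered in transit, the decrypted bytes are effectively random, so the length almost always fails a sanity check. The sketch below is not OpenSSH code. The function name is hypothetical, and the bounds (5 bytes and 256 KiB) are assumptions chosen for illustration.

```python
import struct

# Assumed sanity bounds for illustration; they are not taken from this report.
MIN_PACKET_LENGTH = 1 + 4
MAX_PACKET_LENGTH = 256 * 1024


def read_packet_length(decrypted_block: bytes) -> int:
    """Parse the 4-byte big-endian length field and sanity-check it."""
    if len(decrypted_block) < 4:
        raise ValueError("need at least 4 bytes to read the length field")
    (length,) = struct.unpack(">I", decrypted_block[:4])
    if not MIN_PACKET_LENGTH <= length <= MAX_PACKET_LENGTH:
        raise ValueError(f"Bad packet length {length}.")
    return length


if __name__ == "__main__":
    # The length from the report, expressed as the four bytes the receiver saw.
    garbage = struct.pack(">I", 1951400441)
    print(garbage.hex())
    try:
        read_packet_length(garbage)
    except ValueError as err:
        print(err)
```

The length 1951400441 from the report corresponds to the bytes 74 50 01 f9, which look like arbitrary data rather than a plausible packet size.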
Comment 4 Phil Oester 2005-08-03 21:22:52 EDT
Yes, on the problematic transfers, both client and server have the same SSH
version and same arch.  However, I have experimented with some different
versions (FC1 and FC4 boxes):

openssh-3.6.1p2-19 -> openssh-4.1p1-3.1 = FAIL
openssh-4.1p1-3.1 -> openssh-3.6.1p2-19 = OK
openssh-4.1p1-3.1 -> openssh-4.1p1-3.1  = FAIL

So the problem appears to be server based.
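
The reasoning behind "server based" can be sketched by grouping the three results by each side of the connection. This is only an illustration: reading "A -> B" as "client A connecting to server B" is an assumption, and the helper names are hypothetical.

```python
from collections import defaultdict

# (client version, server version, outcome), copied from the results above.
# Treating the left side as the client and the right side as the server
# is an assumption about the notation.
RESULTS = [
    ("openssh-3.6.1p2-19", "openssh-4.1p1-3.1", "FAIL"),
    ("openssh-4.1p1-3.1", "openssh-3.6.1p2-19", "OK"),
    ("openssh-4.1p1-3.1", "openssh-4.1p1-3.1", "FAIL"),
]


def outcomes_by(index, results):
    """Group the set of outcomes by column `index` (0 = client, 1 = server)."""
    grouped = defaultdict(set)
    for row in results:
        grouped[row[index]].add(row[2])
    return dict(grouped)


def is_consistent(grouped):
    """True if every group saw exactly one kind of outcome."""
    return all(len(outcomes) == 1 for outcomes in grouped.values())


if __name__ == "__main__":
    print("by client:", outcomes_by(0, RESULTS), is_consistent(outcomes_by(0, RESULTS)))
    print("by server:", outcomes_by(1, RESULTS), is_consistent(outcomes_by(1, RESULTS)))
```

Only the grouping by server gives one outcome per group. Since each version here also maps to a different machine, this points at the server box as a whole and not necessarily at the openssh version running on it.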

I've collected a tcpdump, which is interesting -- it was taken on an intermediate
firewall between an east and a west coast box.  It shows the disconnect occurring,
seemingly just for a single lost packet.

<attachment in next email>
Comment 5 Phil Oester 2005-08-03 21:24:20 EDT
Created attachment 117427 [details]
tcpdump session

annotated tcpdump
Comment 6 Tomas Mraz 2005-08-04 04:24:37 EDT
Are the FC1 and FC4 server boxes on the same LAN, so that when you connect from
a box on another LAN only the connection to the FC4 server fails?

One dropped packet cannot make a TCP connection fail, because TCP is a reliable
protocol that retransmits lost data. Of course there is a theoretical possibility
of a bug in the TCP implementation in the kernel on the FC4 box (you could try
upgrading the kernel to a newer one from testing updates), but I don't think it's
probable. At the least, you could try upgrading openssh on the FC1 box, although
you'd have to rebuild it from the SRPM, and see whether the failures start to
appear there.
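
One further experiment in the same spirit is to take ssh out of the picture and check whether data sent over a plain TCP connection arrives unchanged. The sketch below is a minimal version of that idea, not a tool mentioned in this report: the function names, the byte pattern, and the chunk size are all hypothetical choices. It sends a deterministic pattern and reports the stream offset of the first byte that differs.

```python
import socket
import threading

CHUNK = 64 * 1024  # arbitrary chunk size chosen for this sketch


def pattern(offset: int, size: int) -> bytes:
    """Deterministic test data: each byte depends only on its stream position."""
    return bytes((p * 31 + (p >> 8)) & 0xFF for p in range(offset, offset + size))


def send_pattern(sock: socket.socket, total: int) -> None:
    """Send `total` bytes of the pattern, then signal end of stream."""
    sent = 0
    while sent < total:
        size = min(CHUNK, total - sent)
        sock.sendall(pattern(sent, size))
        sent += size
    sock.shutdown(socket.SHUT_WR)


def first_mismatch(sock: socket.socket):
    """Read until EOF; return (offset of first bad byte or None, bytes read)."""
    offset = 0
    bad = None
    while True:
        data = sock.recv(CHUNK)
        if not data:
            return bad, offset
        if bad is None:
            expected = pattern(offset, len(data))
            if data != expected:
                pairs = enumerate(zip(data, expected))
                bad = offset + next(i for i, (a, b) in pairs if a != b)
        offset += len(data)


def loopback_check(total: int):
    """Run sender and receiver over a local socket pair (no real network)."""
    left, right = socket.socketpair()
    with left, right:
        sender = threading.Thread(target=send_pattern, args=(left, total))
        sender.start()
        result = first_mismatch(right)
        sender.join()
    return result


if __name__ == "__main__":
    print(loopback_check(200_000))
```

The loopback run only demonstrates that the sketch works. For a real test, `send_pattern` would run on one box and `first_mismatch` on the other, over an ordinary TCP connection between them, for long enough to cover the 5 to 20 minute failure window.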
Comment 7 Phil Oester 2005-08-05 13:12:51 EDT
Yes, the FC1 and FC4 boxes are on the same LAN on the west coast, and the client
box is on the east coast.  I ran additional tests from the FC4 to the FC1 box just
to verify, and it succeeds every time.

BTW - the tcpdump session is likely garbage, since it just couldn't capture all
the packets as quickly as they were going through.  It can be ignored.

Conducted some more tests -- downloaded the FC1 openssh srpm, built it on FC4
box, and downgraded to it -- it still fails.  The error is either 'Bad packet
length' or 'Corrupted MAC on input' -- varies.  (also built the FC3 SRPM with
same results)

So perhaps the problem is not in openssh, but in one of the libraries it depends
upon?  Zlib?  crypto?
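
The "Corrupted MAC on input" variant fits the same picture as "Bad packet length": both are integrity checks tripping on altered data. In the SSH transport protocol (RFC 4253), the MAC is computed over the packet sequence number followed by the unencrypted packet, and the receiver recomputes and compares it. The sketch below assumes HMAC-SHA1 and a made-up key; the function names are hypothetical and this is not OpenSSH code.

```python
import hashlib
import hmac
import struct


def packet_mac(key: bytes, seqnr: int, packet: bytes) -> bytes:
    """MAC over the 32-bit sequence number followed by the packet bytes."""
    return hmac.new(key, struct.pack(">I", seqnr) + packet, hashlib.sha1).digest()


def mac_is_valid(key: bytes, seqnr: int, packet: bytes, received_mac: bytes) -> bool:
    """Recompute the MAC and compare it in constant time."""
    return hmac.compare_digest(packet_mac(key, seqnr, packet), received_mac)


if __name__ == "__main__":
    key = b"illustrative-key"  # made-up key for the demo
    packet = b"example payload"
    mac = packet_mac(key, 7, packet)

    # Flip a single bit, as a stand-in for corruption somewhere on the path.
    damaged = bytes([packet[0] ^ 0x01]) + packet[1:]

    print(mac_is_valid(key, 7, packet, mac))
    print(mac_is_valid(key, 7, damaged, mac))
```

Which of the two errors appears would depend on where in the packet the damage lands, which is consistent with the error varying between runs.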
Comment 8 Phil Oester 2005-08-07 14:47:28 EDT
This seems similar to bug #110101, which was closed as NOTABUG long ago -- but
does smell like a bug somewhere...
Comment 9 Phil Oester 2005-08-08 20:02:08 EDT
Some other things I've tried today...

* downgraded FC4 box kernel to 2.6.10: FAILS
* tried FC4 default kernel (I was using custom compiled kernel.org): FAILS

Attempted to compile SSH rpm with static libraries, but couldn't get it to work
properly -- my hope was to be able to upgrade/downgrade various libraries for
testing, but that may be fruitless.
Comment 10 Tomas Mraz 2005-08-09 04:00:42 EDT
In my opinion the problem must lie somewhere in the network between the FC4
server box and the client box. It is triggered by some change in the TCP
implementation in kernel 2.6.x compared with kernel 2.4.x in FC1 (window scaling
and so on). For example, see http://lwn.net/Articles/92727/ which may or may not help.

I honestly don't think that the problem lies in any other software/libraries of
the FC4 box.
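
For background on the window scaling remark: with the TCP window scale option (RFC 7323, originally RFC 1323), the 16-bit window field in each segment is shifted left by a scale factor that the advertising side announced in its SYN. If a device on the path rewrites or ignores that option, the two views of the window can differ widely. The sketch below only shows the arithmetic; the function name is hypothetical, and the thread does not establish that this mechanism applies to this bug.

```python
MAX_SHIFT = 14  # largest shift count permitted by RFC 7323


def effective_window(raw_window: int, shift: int) -> int:
    """Return the window in bytes implied by a 16-bit field and a scale shift."""
    if not 0 <= raw_window <= 0xFFFF:
        raise ValueError("window field must fit in 16 bits")
    if not 0 <= shift <= MAX_SHIFT:
        raise ValueError("shift count must be between 0 and 14")
    return raw_window << shift


if __name__ == "__main__":
    raw = 65535
    # What the endpoint means when it announced a shift of 7 ...
    print(effective_window(raw, 7))
    # ... versus what a device that ignores the option would assume.
    print(effective_window(raw, 0))
```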
Comment 11 Tomas Mraz 2005-08-09 04:02:20 EDT
It could also be a hardware fault (memory, motherboard...) in the FC4 box.
Comment 12 Phil Oester 2005-08-16 20:04:05 EDT
I have found that the solution in bug 149887 solves this for me, so it appears 
to be an e1000 driver issue instead of an ssh bug.  I'll leave the duplicate 
marking to you...
Comment 13 Tomas Mraz 2005-08-17 04:03:51 EDT

*** This bug has been marked as a duplicate of 149887 ***