Bug 202917 - lost tcp segment for r8169 network card with kernel-2.6.17-1.2174_FC5
Summary: lost tcp segment for r8169 network card with kernel-2.6.17-1.2174_FC5
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 5
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Neil Horman
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2006-08-17 05:03 UTC by Alex Kruchkoff
Modified: 2009-11-13 00:47 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-01-19 13:11:27 UTC
Type: ---
Embargoed:


Attachments
ssh -v output; 9 tcp packets from ethereal (9.36 KB, text/plain)
2006-08-17 05:03 UTC, Alex Kruchkoff
binary output from whiteshark for 129.94.61.53 (144.56 KB, application/octet-stream)
2006-10-30 00:00 UTC, Alex Kruchkoff
ethereal capture for 2.6.16-1.2111_FC4smp (220.88 KB, application/octet-stream)
2006-10-30 00:10 UTC, Alex Kruchkoff
solaris 5.9 packets capture: (9.68 KB, application/octet-stream)
2006-10-31 06:52 UTC, Alex Kruchkoff
linux 2.6.18-1.2200.2.10fc5.netdev.13.1.53 packets capture (10.03 KB, application/octet-stream)
2006-10-31 06:56 UTC, Alex Kruchkoff

Description Alex Kruchkoff 2006-08-17 05:03:19 UTC
Description of problem:

Cannot ssh to Sun SSH [Solaris 9] from an x86_64 Asus A6K laptop with an r8169
Gigabit Ethernet network card under Red Hat kernel-2.6.17-1.2157_FC5 and
2.6.17-1.2174_FC5. This works with 2.6.16-1.2133_1.FC5 and earlier versions.

Ethereal shows "TCP previous segment lost" [please see the attachment].

I first reported this bug at

http://bugzilla.atrpms.net/show_bug.cgi?id=860

for kernel 2.6.17-1.2157_1.rhfc5.cubbi_suspend2, which is based on the Red Hat
kernel above.

Version-Release number of selected component (if applicable):
kernel-2.6.17-1.2157_FC5
kernel-2.6.17-1.2174_FC5

How reproducible:

On the computer with "r8169 Gigabit Ethernet driver 2.2LK-NAPI loaded" and
kernel-2.6.17-1.2157_FC5@x86_64 or 
kernel-2.6.17-1.2174_FC5@x86_64

try to ssh to Solaris 9 sparc box with Sun SSH Version Sun_SSH_1.0.1, protocol
versions 1.5/2.0.

It will hang.

Steps to Reproduce:
1.
2.
3.
  
Actual results:
please see the attachment

Expected results:


Additional info:

Comment 1 Alex Kruchkoff 2006-08-17 05:03:24 UTC
Created attachment 134365 [details]
ssh -v output; 9 tcp packets from ethereal

Comment 2 Dave Jones 2006-10-16 17:25:07 UTC
A new kernel update has been released (Version: 2.6.18-1.2200.fc5)
based upon a new upstream kernel release.

Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.

This bug has been placed in NEEDINFO state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

In the last few updates, some users upgrading from FC4->FC5
have reported that installing a kernel update has left their
systems unbootable. If you have been affected by this problem
please check you only have one version of device-mapper & lvm2
installed.  See bug 207474 for further details.

If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.

If this bug has been fixed, but you are now experiencing a different
problem, please file a separate bug for the new problem.

Thank you.

Comment 3 Alex Kruchkoff 2006-10-24 00:44:39 UTC
Dave, I was waiting for that kernel, but still no joy!
But the problem is easy to reproduce now: we recently migrated all our web
sites from Solaris 8 to Solaris 9, and I get no response in Firefox when I
point it to

http://www.unsw.edu.au

So if a computer with 

00:0b.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8169 Gigabit
Ethernet (rev 10)

is available [I've got an Asus A6000KT laptop], point it to the above-mentioned URL and
you will not see the page. With Wireshark I grabbed the TCP packets:

No.     Time        Source                Destination           Protocol Info
    207 14.057544   129.94.61.35          149.171.96.95         TCP      34117 >
http [SYN] Seq=0 Len=0 MSS=1460 TSV=117032 TSER=0 WS=7

Frame 207 (74 bytes on wire, 74 bytes captured)
Ethernet II, Src: AsustekC_01:61:71 (00:17:31:01:61:71), Dst:
All-HSRP-routers_00 (00:00:0c:07:ac:00)
Internet Protocol, Src: 129.94.61.35 (129.94.61.35), Dst: 149.171.96.95
(149.171.96.95)
Transmission Control Protocol, Src Port: 34117 (34117), Dst Port: http (80),
Seq: 0, Len: 0

No.     Time        Source                Destination           Protocol Info
    208 14.057877   149.171.96.95         129.94.61.35          TCP      http >
34117 [SYN, ACK] Seq=0 Ack=1 Win=49232 Len=0 TSV=58540319 TSER=117032 MSS=1460 WS=0

Frame 208 (78 bytes on wire, 78 bytes captured)
Ethernet II, Src: Cisco_cd:ea:c0 (00:09:11:cd:ea:c0), Dst: AsustekC_01:61:71
(00:17:31:01:61:71)
Internet Protocol, Src: 149.171.96.95 (149.171.96.95), Dst: 129.94.61.35
(129.94.61.35)
Transmission Control Protocol, Src Port: http (80), Dst Port: 34117 (34117),
Seq: 0, Ack: 1, Len: 0

No.     Time        Source                Destination           Protocol Info
    209 14.057925   129.94.61.35          149.171.96.95         TCP      34117 >
http [ACK] Seq=1 Ack=1 Win=5888 Len=0 TSV=117032 TSER=58540319

Frame 209 (66 bytes on wire, 66 bytes captured)
Ethernet II, Src: AsustekC_01:61:71 (00:17:31:01:61:71), Dst:
All-HSRP-routers_00 (00:00:0c:07:ac:00)
Internet Protocol, Src: 129.94.61.35 (129.94.61.35), Dst: 149.171.96.95
(149.171.96.95)
Transmission Control Protocol, Src Port: 34117 (34117), Dst Port: http (80),
Seq: 1, Ack: 1, Len: 0

No.     Time        Source                Destination           Protocol Info
    210 14.058004   129.94.61.35          149.171.96.95         HTTP     GET /
HTTP/1.1

Frame 210 (564 bytes on wire, 564 bytes captured)
Ethernet II, Src: AsustekC_01:61:71 (00:17:31:01:61:71), Dst:
All-HSRP-routers_00 (00:00:0c:07:ac:00)
Internet Protocol, Src: 129.94.61.35 (129.94.61.35), Dst: 149.171.96.95
(149.171.96.95)
Transmission Control Protocol, Src Port: 34117 (34117), Dst Port: http (80),
Seq: 1, Ack: 1, Len: 498
Hypertext Transfer Protocol

No.     Time        Source                Destination           Protocol Info
    211 14.058457   149.171.96.95         129.94.61.35          TCP      http >
34117 [ACK] Seq=1 Ack=499 Win=48734 Len=0 TSV=58540319 TSER=117032

Frame 211 (66 bytes on wire, 66 bytes captured)
Ethernet II, Src: Cisco_cd:ea:c0 (00:09:11:cd:ea:c0), Dst: AsustekC_01:61:71
(00:17:31:01:61:71)
Internet Protocol, Src: 149.171.96.95 (149.171.96.95), Dst: 129.94.61.35
(129.94.61.35)
Transmission Control Protocol, Src Port: http (80), Dst Port: 34117 (34117),
Seq: 1, Ack: 499, Len: 0

Please let me know if you need any additional information.

Thanks,
Alex

Comment 5 John W. Linville 2006-10-24 12:15:16 UTC
Alex, have you tried the Fedora-netdev kernels?

   http://people.redhat.com/linville/kernels/fedora-netdev/

Please give those a try and post the results here...thanks!

Comment 6 Alex Kruchkoff 2006-10-26 00:12:07 UTC
John, I installed 2.6.18-1.2200.2.10.fc5.netdev.13.1.
Still the same. I captured my attempt to grab http://www.unsw.edu.au:
10:01:24 65247 $ cat http.txt
No.     Time        Source                Destination           Protocol Info
    243 43.352020   129.94.61.35          149.171.96.95         TCP      51402 >
ssh [SYN] Seq=0 Len=0 MSS=1460 TSV=149002 TSER=0 WS=7

Frame 243 (74 bytes on wire, 74 bytes captured)
Ethernet II, Src: AsustekC_01:61:71 (00:17:31:01:61:71), Dst:
All-HSRP-routers_00 (00:00:0c:07:ac:00)
Internet Protocol, Src: 129.94.61.35 (129.94.61.35), Dst: 149.171.96.95
(149.171.96.95)
Transmission Control Protocol, Src Port: 51402 (51402), Dst Port: ssh (22), Seq:
0, Len: 0

No.     Time        Source                Destination           Protocol Info
    244 43.352284   149.171.96.95         129.94.61.35          TCP      ssh >
51402 [RST, ACK] Seq=0 Ack=1 Win=0 Len=0

Frame 244 (60 bytes on wire, 60 bytes captured)
Ethernet II, Src: Cisco_cd:ea:c0 (00:09:11:cd:ea:c0), Dst: AsustekC_01:61:71
(00:17:31:01:61:71)
Internet Protocol, Src: 149.171.96.95 (149.171.96.95), Dst: 129.94.61.35
(129.94.61.35)
Transmission Control Protocol, Src Port: ssh (22), Dst Port: 51402 (51402), Seq:
0, Ack: 1, Len: 0

No.     Time        Source                Destination           Protocol Info
    490 93.638271   129.94.61.35          149.171.96.95         TCP      36829 >
http [SYN] Seq=0 Len=0 MSS=1460 TSV=161573 TSER=0 WS=7

Frame 490 (74 bytes on wire, 74 bytes captured)
Ethernet II, Src: AsustekC_01:61:71 (00:17:31:01:61:71), Dst:
All-HSRP-routers_00 (00:00:0c:07:ac:00)
Internet Protocol, Src: 129.94.61.35 (129.94.61.35), Dst: 149.171.96.95
(149.171.96.95)
Transmission Control Protocol, Src Port: 36829 (36829), Dst Port: http (80),
Seq: 0, Len: 0

No.     Time        Source                Destination           Protocol Info
    491 93.638599   149.171.96.95         129.94.61.35          TCP      http >
36829 [SYN, ACK] Seq=0 Ack=1 Win=49232 Len=0 TSV=75646982 TSER=161573 MSS=1460 WS=0

Frame 491 (78 bytes on wire, 78 bytes captured)
Ethernet II, Src: Cisco_cd:ea:c0 (00:09:11:cd:ea:c0), Dst: AsustekC_01:61:71
(00:17:31:01:61:71)
Internet Protocol, Src: 149.171.96.95 (149.171.96.95), Dst: 129.94.61.35
(129.94.61.35)
Transmission Control Protocol, Src Port: http (80), Dst Port: 36829 (36829),
Seq: 0, Ack: 1, Len: 0

No.     Time        Source                Destination           Protocol Info
    492 93.638652   129.94.61.35          149.171.96.95         TCP      36829 >
http [ACK] Seq=1 Ack=1 Win=5888 Len=0 TSV=161573 TSER=75646982

Frame 492 (66 bytes on wire, 66 bytes captured)
Ethernet II, Src: AsustekC_01:61:71 (00:17:31:01:61:71), Dst:
All-HSRP-routers_00 (00:00:0c:07:ac:00)
Internet Protocol, Src: 129.94.61.35 (129.94.61.35), Dst: 149.171.96.95
(149.171.96.95)
Transmission Control Protocol, Src Port: 36829 (36829), Dst Port: http (80),
Seq: 1, Ack: 1, Len: 0

No.     Time        Source                Destination           Protocol Info
    493 93.638775   129.94.61.35          149.171.96.95         HTTP     GET /
HTTP/1.1

Frame 493 (564 bytes on wire, 564 bytes captured)
Ethernet II, Src: AsustekC_01:61:71 (00:17:31:01:61:71), Dst:
All-HSRP-routers_00 (00:00:0c:07:ac:00)
Internet Protocol, Src: 129.94.61.35 (129.94.61.35), Dst: 149.171.96.95
(149.171.96.95)
Transmission Control Protocol, Src Port: 36829 (36829), Dst Port: http (80),
Seq: 1, Ack: 1, Len: 498
Hypertext Transfer Protocol

No.     Time        Source                Destination           Protocol Info
    494 93.639336   149.171.96.95         129.94.61.35          TCP      http >
36829 [ACK] Seq=1 Ack=499 Win=48734 Len=0 TSV=75646982 TSER=161573

Frame 494 (66 bytes on wire, 66 bytes captured)
Ethernet II, Src: Cisco_cd:ea:c0 (00:09:11:cd:ea:c0), Dst: AsustekC_01:61:71
(00:17:31:01:61:71)
Internet Protocol, Src: 149.171.96.95 (149.171.96.95), Dst: 129.94.61.35
(129.94.61.35)
Transmission Control Protocol, Src Port: http (80), Dst Port: 36829 (36829),
Seq: 1, Ack: 499, Len: 0

Content:
GET / HTTP/1.1
Host: www.unsw.edu.au
User-Agent: Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.0.4)
Gecko/20060614 Fedora/1.5.0.4-1.2.fc5 Firefox/1.5.0.4 pango-text
Accept:
text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Cookie: SITESERVER=ID=51b9af88ee9d405d4aa0aabc2e31af74


And Firefox is still trying to load the page while I'm typing...

Thanks,
Alex


Comment 7 John W. Linville 2006-10-26 14:55:22 UTC
Neil, I think you may be better equipped to look at TCP stuff than I am.

Comment 8 Neil Horman 2006-10-26 17:14:42 UTC
Alex, it would be very helpful to me to look at the binary tcpdumps that you've
captured of this problem.  Also, I assume you are capturing on the Fedora Host.
 Would it be possible for you to reproduce the problem and capture parallel
binary dumps at the same time, one on the solaris server you are ssh-ing into
and one on the Fedora machine you are ssh-ing in from?  That would help me
analyze this much more quickly.  Thank you!

Comment 9 Alex Kruchkoff 2006-10-30 00:00:55 UTC
Created attachment 139685 [details]
binary output from whiteshark for 129.94.61.53

Neil, this is the eth0 capture from my ASUS 6000KT with the
2.6.18-1.2200.2.10.fc5.netdev.13.1 kernel.

I tried two things: http://www.unsw.edu.au [149.171.96.95, running on Solaris 9,
public access]

and ssh -v 149.171.96.98 [Solaris 9 with SunSSH, not publicly accessible].

Both were unsuccessful. I'll attach the capture from a different box.

Comment 10 Alex Kruchkoff 2006-10-30 00:10:40 UTC
Created attachment 139686 [details]
ethereal capture for 2.6.16-1.2111_FC4smp

Neil, for comparison, this is the Ethereal capture from a different computer,
129.94.61.18, running FC4. I did the same there: http://www.unsw.edu.au and ssh -v
149.171.96.98.
That box does not have a monitor, so I did it through 129.94.61.123 [you
can ignore packets to/from that box].

Thanks,
Alex

Comment 11 Alex Kruchkoff 2006-10-30 00:24:52 UTC
Neil,

I attached 2 captures: one from my box 129.94.61.53 and one from another box,
129.94.61.18, running FC4 [I connected to that box from WinXP 129.94.61.123;
please ignore packets from that box].

Two actions were performed from both boxes: 1) from Firefox I tried to load the
page from http://www.unsw.edu.au; 2) ssh -v 149.171.96.98

ssh from .53 box:

ssh -v 149.171.96.98
OpenSSH_4.3p2, OpenSSL 0.9.8a 11 Oct 2005
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Applying options for *
debug1: Connecting to 149.171.96.98 [149.171.96.98] port 22.
debug1: Connection established.
debug1: identity file /home/alexk/.ssh/identity type -1
debug1: identity file /home/alexk/.ssh/id_rsa type -1
debug1: identity file /home/alexk/.ssh/id_dsa type 2
debug1: Remote protocol version 2.0, remote software version Sun_SSH_1.0.1
debug1: match: Sun_SSH_1.0.1 pat Sun_SSH_1.0*
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_4.3
debug1: SSH2_MSG_KEXINIT sent

killed with Ctrl+C

ssh from .18
ssh -v 149.171.96.98
OpenSSH_4.2p1, OpenSSL 0.9.7f 22 Mar 2005
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Applying options for *
debug1: Connecting to 149.171.96.98 [149.171.96.98] port 22.
debug1: Connection established.
debug1: identity file /home/alexk/.ssh/identity type -1
debug1: identity file /home/alexk/.ssh/id_rsa type 1
debug1: identity file /home/alexk/.ssh/id_dsa type -1
debug1: Remote protocol version 2.0, remote software version Sun_SSH_1.0.1
debug1: match: Sun_SSH_1.0.1 pat Sun_SSH_1.0*
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_4.2
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: server->client aes128-cbc hmac-md5 none
debug1: kex: client->server aes128-cbc hmac-md5 none
debug1: sending SSH2_MSG_KEXDH_INIT
debug1: expecting SSH2_MSG_KEXDH_REPLY
debug1: Host '149.171.96.98' is known and matches the RSA host key.
debug1: Found key in /home/alexk/.ssh/known_hosts:7
debug1: ssh_rsa_verify: signature correct
debug1: SSH2_MSG_NEWKEYS sent
debug1: expecting SSH2_MSG_NEWKEYS
debug1: SSH2_MSG_NEWKEYS received
debug1: SSH2_MSG_SERVICE_REQUEST sent
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug1: Authentications that can continue: publickey,password
debug1: Next authentication method: publickey
debug1: Trying private key: /home/alexk/.ssh/identity
debug1: Offering public key: /home/alexk/.ssh/id_rsa
debug1: Server accepts key: pkalg ssh-rsa blen 149
debug1: read PEM private key done: type RSA
debug1: Authentication succeeded (publickey).
debug1: channel 0: new [client-session]
debug1: Entering interactive session.
Last login: Mon Oct 30 10:49:15 2006 from wkst018.bsdsu.u
Sun Microsystems Inc.   SunOS 5.9       Generic May 2002
10:49:15 12047 $ logout
debug1: client_input_channel_req: channel 0 rtype exit-status reply 0
debug1: channel 0: free: client-session, nchannels 1
Connection to 149.171.96.98 closed.
debug1: Transferred: stdin 0, stdout 0, stderr 37 bytes in 2.5 seconds
debug1: Bytes per second: stdin 0.0, stdout 0.0, stderr 14.8
debug1: Exit status 0

Please let me know if you need anything else.

Thanks,
Alex

Comment 12 Alex Kruchkoff 2006-10-30 08:19:12 UTC
Today I burned an FC6 DVD and booted into linux rescue. I tried to ssh and to
telnet to port 80 on the above-mentioned Solaris 9 boxes: still the same problem.


Comment 13 Neil Horman 2006-10-30 18:11:08 UTC
I'd actually asked for two parallel traces of the same connection, not a working
and a non-working trace from different machines.  Setting aside for the moment the
.18 capture (since everything seems to work there), I'd like to focus on the .53
capture.

I note that in the http connection, we get the three-way TCP handshake in frames
25, 26, and 27, after which the .53 client sends an HTTP GET request, to which
the server responds appropriately with an ACK.  At this point we should expect
to see an HTTP response from the server with either an error code or an OK code
(as we do in frame 19 of the .18 capture).  However, for some reason we don't see
that.  We need to determine whether the server sent that response and the client
just never received it, or whether it never sent it in the first place, which is
why I was looking for a parallel dump on both the failing client and the server
we were connecting to.

As a side note, I notice a discrepancy in the GET request between the .53 and
the .18 machines: the .53 machine is sending a cookie to the server in the GET
request that is not present in the .18 trace.  Could that cookie somehow be
causing a server-side error that is preventing any web server response?

The same applies to frame 1208 in 53.capture.bin.  It's quite clear
that after a good TCP handshake in the trace, we lose a segment somewhere in
the ssh negotiation, but the question remains: did the server actually send a
malformed TCP segment with the wrong sequence number in place, or did it send
the missing segment, which simply got lost, either on the network or at the
receiving end?  Given the break in the IP identification sequence, I'm inclined
to believe that we lost it on receive, but I need the parallel dumps from the
client and server to be sure.  It would also be good to know what happens after
that ACK is received, since the ACK after the lost segment should have been
followed by a server key exchange init frame (as it was in the .18 trace) that
never occurred.

Comment 14 Alex Kruchkoff 2006-10-31 00:45:16 UTC
Neil, good catch: after clearing cookies from the firefox I've got a web page
loaded to my browser!

I'd like to clarify what you mean by two parallel traces. Do you mean
sniffing on both the client and the server at the same time while I run ssh to the
server?

Thanks,
Alex

Comment 15 Neil Horman 2006-10-31 01:05:21 UTC
Glad to hear that the cookie was killing your web session.  1 problem down, 1 to go.

As for the traces, you are right on the money, yes.  What I need is for sniff or
tcpdump to be run on the client and server at the same time when you try to ssh
from the client to the server.  With both traces, I can tell conclusively if the
missing frames I mentioned in my comment number 13 are missing because the
server never sent them, or because we never received them, and that will help me
narrow down where to look for this particular problem.

Thanks!
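
The comparison described above can be sketched in a few lines of Python. This is only an illustration, not the tooling used in this bug: it assumes each capture has already been reduced to (IP ID, TCP sequence number, payload length) tuples for the server-to-client direction. The function name and the sample values are hypothetical.

```python
from typing import List, Set, Tuple

# One observed segment: (ip_id, tcp_seq, payload_len).
# This tuple layout is an assumption made for the sketch.
Segment = Tuple[int, int, int]


def missing_on_receiver(sent: List[Segment], received: List[Segment]) -> List[Segment]:
    """Return segments present in the sender-side capture but absent
    from the receiver-side capture, in the order they were sent."""
    seen: Set[Segment] = set(received)
    return [seg for seg in sent if seg not in seen]


# Made-up sample data, for illustration only.
server_side = [(100, 1, 23), (101, 24, 360), (102, 384, 8)]
client_side = [(100, 1, 23), (102, 384, 8)]

# Segments the server put on the wire that never reached the client.
print(missing_on_receiver(server_side, client_side))
```

If the result is empty while both traces still show a gap in the sequence numbers, the sender never put the segment on the wire in the first place.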

Comment 16 Alex Kruchkoff 2006-10-31 06:27:49 UTC
Neil, sorry: removing the cookie did not solve the problem; it was a change to
the firewall rules that fixed the problem.

I found tcpdump on the Solaris box and captured some packets.
Now I will repeat that for John's 2.6.18-1.2200.2.10.fc5.netdev.13.1 kernel and
upload the capture.

Thanks,
Alex

Comment 17 Alex Kruchkoff 2006-10-31 06:52:14 UTC
Created attachment 139811 [details]
solaris 5.9 packets capture:

pls check packets 42-50,77-78

Comment 18 Alex Kruchkoff 2006-10-31 06:56:23 UTC
Created attachment 139813 [details]
linux 2.6.18-1.2200.2.10fc5.netdev.13.1.53 packets capture

pls check packets 35-43,98-99

Comment 19 Neil Horman 2006-10-31 12:27:39 UTC
Thank you for the new traces.  Well, it turns out I was wrong regarding what I
expected to see in these traces.  I had expected to see each frame of the
ssh session appear in both traces right up to the point where we got the lost
segment indication.  I had expected the immediately preceding frame, which
should have carried sequence bytes 24 through 383 from the Solaris box to the
Fedora box, to appear only in the Solaris trace and not in the Fedora trace.  That
would have indicated that the missing packet was getting lost somewhere on the
network, or in the receive queue on the Fedora machine.  What I see, however, is
that all frames are in all traces.  The fact that the Solaris trace
(capture-5-9-98) records a missing sequence number (see frame 50) in the TCP
stream that it itself is generating indicates that the Solaris box never put a
packet on the wire containing sequence bytes 24 through 383, as it should have.
This is going to have to be debugged on the Solaris machine.

I know that you indicated that this only happens with later Fedora kernels.  It
may be that some minor ssh discrepancy (for example, the types of key exchange
algorithms supported by various ssh implementations) is triggering some bug
or overflow in the Solaris box, leading to the TCP error.  I've not yet found
any discrepancy, but I'll keep looking.

You also mentioned that a firewall rule change led to your ability to get web
sessions going again on the newer Fedora boxes.  The same type of problem may be
happening here.  Is the firewall that you were referring to running directly on
the Solaris box?  Or is there a firewall on the Solaris box?  If so, a bad rule
would certainly explain how a TCP segment went missing on the system that was
supposed to generate it (the firewall filter dropped the frame prior to the
point in the code where the sniff capture observes the TCP stream).

I'd also check your SNMP statistics for network frame droppage on the Solaris
box; that would indicate problematic bottlenecks that could be responsible for
lost frames.
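
The missing byte range mentioned above (24 through 383) is the kind of gap that can be found mechanically from one direction of a trace. The following Python sketch is a simplified illustration: it assumes relative sequence numbers as Ethereal displays them, ignores sequence-number wraparound and the sequence space consumed by SYN/FIN flags, and uses a hypothetical function name and made-up sample segments.

```python
from typing import List, Tuple


def find_sequence_gaps(
    segments: List[Tuple[int, int]], start_seq: int = 1
) -> List[Tuple[int, int]]:
    """Given (seq, payload_len) pairs for one direction of a TCP stream,
    return the missing byte ranges as inclusive (first, last) pairs."""
    gaps: List[Tuple[int, int]] = []
    expected = start_seq  # next sequence byte we expect to see
    for seq, length in sorted(segments):
        if length == 0:
            continue  # pure ACKs carry no payload bytes
        if seq > expected:
            gaps.append((expected, seq - 1))
        # max() keeps retransmitted or overlapping data from moving us backwards
        expected = max(expected, seq + length)
    return gaps


# Made-up sample: 23 bytes at seq 1, then data resuming at seq 384.
print(find_sequence_gaps([(1, 23), (384, 8)]))
```

Running the same check on the sender-side and the receiver-side trace shows whether the gap already exists where the stream is generated.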

Comment 20 Neil Horman 2007-01-18 14:30:08 UTC
ping?  any update here?

Comment 21 Alex Kruchkoff 2007-01-18 22:27:49 UTC
(In reply to comment #20)
> ping?  any update here?

Neil, I gave up: I installed Scientific Linux 4.4 and I can ssh to SunSSH again.

Thanks,
Alex

Comment 22 Alex Kruchkoff 2009-11-13 00:45:51 UTC
Solution:

Change the value of

/sbin/sysctl net.ipv4.tcp_rmem
net.ipv4.tcp_rmem = 4096        87380   4194304

into

sudo /sbin/sysctl -w net.ipv4.tcp_rmem="4096 1048576 207520"

and everything is working again!
This is for Scientific Linux 5.0 [RHEL 5.0].

Cheers
Alex
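
For readers comparing the two settings above: the three fields of net.ipv4.tcp_rmem are the minimum, default, and maximum TCP receive buffer sizes in bytes. The Python sketch below only parses the two strings quoted in this comment and reports whether the fields are in ascending order. The function names are hypothetical, and the sketch makes no claim about how the kernel treats a default that is larger than the maximum.

```python
from typing import Tuple


def parse_tcp_rmem(text: str) -> Tuple[int, int, int]:
    """Parse the three whitespace-separated tcp_rmem fields:
    minimum, default, and maximum receive buffer size in bytes."""
    fields = text.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}")
    minimum, default, maximum = (int(f) for f in fields)
    return minimum, default, maximum


def is_ordered(values: Tuple[int, int, int]) -> bool:
    """True when minimum <= default <= maximum."""
    minimum, default, maximum = values
    return minimum <= default <= maximum


before = parse_tcp_rmem("4096        87380   4194304")
after = parse_tcp_rmem("4096 1048576 207520")
print(before, is_ordered(before))
print(after, is_ordered(after))
```

On a live Linux system the same three fields can be read from /proc/sys/net/ipv4/tcp_rmem and passed to parse_tcp_rmem.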

