Description of problem:
Cannot ssh to Sun SSH [Solaris 9] from an x86_64 Asus A6K laptop (network card: r8169 Gigabit Ethernet driver) with Red Hat kernels kernel-2.6.17-1.2157_FC5 and 2.6.17-1.2174_FC5. This works with 2.6.16-1.2133_1.FC5 and earlier versions. Ethereal shows "TCP previous segment lost" [please see the attachment]. I first reported this bug at http://bugzilla.atrpms.net/show_bug.cgi?id=860 against kernel 2.6.17-1.2157_1.rhfc5.cubbi_suspend2, which is based on the Red Hat kernel above.

Version-Release number of selected component (if applicable):
kernel-2.6.17-1.2157_FC5
kernel-2.6.17-1.2174_FC5

How reproducible:
On a computer with "r8169 Gigabit Ethernet driver 2.2LK-NAPI loaded" and kernel-2.6.17-1.2157_FC5@x86_64 or kernel-2.6.17-1.2174_FC5@x86_64, try to ssh to a Solaris 9 SPARC box running Sun SSH (version Sun_SSH_1.0.1, protocol versions 1.5/2.0). The connection hangs.

Steps to Reproduce:
1.
2.
3.

Actual results:
Please see the attachment.

Expected results:

Additional info:
Created attachment 134365 [details] ssh -v output; 9 tcp packets from ethereal
A new kernel update has been released (Version: 2.6.18-1.2200.fc5) based upon a new upstream kernel release. Please retest against this new kernel, as a large number of patches go into each upstream release, possibly including changes that may address this problem. This bug has been placed in NEEDINFO state. Due to the large volume of inactive bugs in bugzilla, if this bug is still in this state in two weeks time, it will be closed. Should this bug still be relevant after this period, the reporter can reopen the bug at any time. Any other users on the Cc: list of this bug can request that the bug be reopened by adding a comment to the bug. In the last few updates, some users upgrading from FC4->FC5 have reported that installing a kernel update has left their systems unbootable. If you have been affected by this problem please check you only have one version of device-mapper & lvm2 installed. See bug 207474 for further details. If this bug is a problem preventing you from installing the release this version is filed against, please see bug 169613. If this bug has been fixed, but you are now experiencing a different problem, please file a separate bug for the new problem. Thank you.
Dave,

I was waiting for that kernel, but still no joy! The problem is easy to reproduce now, though: we recently migrated all our web sites from Solaris 8 to Solaris 9, and I get no response in Firefox when I point it to http://www.unsw.edu.au

So if a computer with

00:0b.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8169 Gigabit Ethernet (rev 10)

is available [I've got an Asus A6000KT laptop], point it to the above-mentioned URL and you will not see the page. With Wireshark I captured the TCP packets:

No. Time Source Destination Protocol Info
207 14.057544 129.94.61.35 149.171.96.95 TCP 34117 > http [SYN] Seq=0 Len=0 MSS=1460 TSV=117032 TSER=0 WS=7

Frame 207 (74 bytes on wire, 74 bytes captured)
Ethernet II, Src: AsustekC_01:61:71 (00:17:31:01:61:71), Dst: All-HSRP-routers_00 (00:00:0c:07:ac:00)
Internet Protocol, Src: 129.94.61.35 (129.94.61.35), Dst: 149.171.96.95 (149.171.96.95)
Transmission Control Protocol, Src Port: 34117 (34117), Dst Port: http (80), Seq: 0, Len: 0

No. Time Source Destination Protocol Info
208 14.057877 149.171.96.95 129.94.61.35 TCP http > 34117 [SYN, ACK] Seq=0 Ack=1 Win=49232 Len=0 TSV=58540319 TSER=117032 MSS=1460 WS=0

Frame 208 (78 bytes on wire, 78 bytes captured)
Ethernet II, Src: Cisco_cd:ea:c0 (00:09:11:cd:ea:c0), Dst: AsustekC_01:61:71 (00:17:31:01:61:71)
Internet Protocol, Src: 149.171.96.95 (149.171.96.95), Dst: 129.94.61.35 (129.94.61.35)
Transmission Control Protocol, Src Port: http (80), Dst Port: 34117 (34117), Seq: 0, Ack: 1, Len: 0

No. Time Source Destination Protocol Info
209 14.057925 129.94.61.35 149.171.96.95 TCP 34117 > http [ACK] Seq=1 Ack=1 Win=5888 Len=0 TSV=117032 TSER=58540319

Frame 209 (66 bytes on wire, 66 bytes captured)
Ethernet II, Src: AsustekC_01:61:71 (00:17:31:01:61:71), Dst: All-HSRP-routers_00 (00:00:0c:07:ac:00)
Internet Protocol, Src: 129.94.61.35 (129.94.61.35), Dst: 149.171.96.95 (149.171.96.95)
Transmission Control Protocol, Src Port: 34117 (34117), Dst Port: http (80), Seq: 1, Ack: 1, Len: 0

No. Time Source Destination Protocol Info
210 14.058004 129.94.61.35 149.171.96.95 HTTP GET / HTTP/1.1

Frame 210 (564 bytes on wire, 564 bytes captured)
Ethernet II, Src: AsustekC_01:61:71 (00:17:31:01:61:71), Dst: All-HSRP-routers_00 (00:00:0c:07:ac:00)
Internet Protocol, Src: 129.94.61.35 (129.94.61.35), Dst: 149.171.96.95 (149.171.96.95)
Transmission Control Protocol, Src Port: 34117 (34117), Dst Port: http (80), Seq: 1, Ack: 1, Len: 498
Hypertext Transfer Protocol

No. Time Source Destination Protocol Info
211 14.058457 149.171.96.95 129.94.61.35 TCP http > 34117 [ACK] Seq=1 Ack=499 Win=48734 Len=0 TSV=58540319 TSER=117032

Frame 211 (66 bytes on wire, 66 bytes captured)
Ethernet II, Src: Cisco_cd:ea:c0 (00:09:11:cd:ea:c0), Dst: AsustekC_01:61:71 (00:17:31:01:61:71)
Internet Protocol, Src: 149.171.96.95 (149.171.96.95), Dst: 129.94.61.35 (129.94.61.35)
Transmission Control Protocol, Src Port: http (80), Dst Port: 34117 (34117), Seq: 1, Ack: 499, Len: 0

Please let me know if you need any additional information.

Thanks,
Alex
Alex, have you tried the Fedora-netdev kernels? http://people.redhat.com/linville/kernels/fedora-netdev/ Please give those a try and post the results here...thanks!
John,

I installed 2.6.18-1.2200.2.10.fc5.netdev.13.1. Still the same. I captured my attempt to fetch http://www.unsw.edu.au:

10:01:24 65247 $ cat http.txt

No. Time Source Destination Protocol Info
243 43.352020 129.94.61.35 149.171.96.95 TCP 51402 > ssh [SYN] Seq=0 Len=0 MSS=1460 TSV=149002 TSER=0 WS=7

Frame 243 (74 bytes on wire, 74 bytes captured)
Ethernet II, Src: AsustekC_01:61:71 (00:17:31:01:61:71), Dst: All-HSRP-routers_00 (00:00:0c:07:ac:00)
Internet Protocol, Src: 129.94.61.35 (129.94.61.35), Dst: 149.171.96.95 (149.171.96.95)
Transmission Control Protocol, Src Port: 51402 (51402), Dst Port: ssh (22), Seq: 0, Len: 0

No. Time Source Destination Protocol Info
244 43.352284 149.171.96.95 129.94.61.35 TCP ssh > 51402 [RST, ACK] Seq=0 Ack=1 Win=0 Len=0

Frame 244 (60 bytes on wire, 60 bytes captured)
Ethernet II, Src: Cisco_cd:ea:c0 (00:09:11:cd:ea:c0), Dst: AsustekC_01:61:71 (00:17:31:01:61:71)
Internet Protocol, Src: 149.171.96.95 (149.171.96.95), Dst: 129.94.61.35 (129.94.61.35)
Transmission Control Protocol, Src Port: ssh (22), Dst Port: 51402 (51402), Seq: 0, Ack: 1, Len: 0

No. Time Source Destination Protocol Info
490 93.638271 129.94.61.35 149.171.96.95 TCP 36829 > http [SYN] Seq=0 Len=0 MSS=1460 TSV=161573 TSER=0 WS=7

Frame 490 (74 bytes on wire, 74 bytes captured)
Ethernet II, Src: AsustekC_01:61:71 (00:17:31:01:61:71), Dst: All-HSRP-routers_00 (00:00:0c:07:ac:00)
Internet Protocol, Src: 129.94.61.35 (129.94.61.35), Dst: 149.171.96.95 (149.171.96.95)
Transmission Control Protocol, Src Port: 36829 (36829), Dst Port: http (80), Seq: 0, Len: 0

No. Time Source Destination Protocol Info
491 93.638599 149.171.96.95 129.94.61.35 TCP http > 36829 [SYN, ACK] Seq=0 Ack=1 Win=49232 Len=0 TSV=75646982 TSER=161573 MSS=1460 WS=0

Frame 491 (78 bytes on wire, 78 bytes captured)
Ethernet II, Src: Cisco_cd:ea:c0 (00:09:11:cd:ea:c0), Dst: AsustekC_01:61:71 (00:17:31:01:61:71)
Internet Protocol, Src: 149.171.96.95 (149.171.96.95), Dst: 129.94.61.35 (129.94.61.35)
Transmission Control Protocol, Src Port: http (80), Dst Port: 36829 (36829), Seq: 0, Ack: 1, Len: 0

No. Time Source Destination Protocol Info
492 93.638652 129.94.61.35 149.171.96.95 TCP 36829 > http [ACK] Seq=1 Ack=1 Win=5888 Len=0 TSV=161573 TSER=75646982

Frame 492 (66 bytes on wire, 66 bytes captured)
Ethernet II, Src: AsustekC_01:61:71 (00:17:31:01:61:71), Dst: All-HSRP-routers_00 (00:00:0c:07:ac:00)
Internet Protocol, Src: 129.94.61.35 (129.94.61.35), Dst: 149.171.96.95 (149.171.96.95)
Transmission Control Protocol, Src Port: 36829 (36829), Dst Port: http (80), Seq: 1, Ack: 1, Len: 0

No. Time Source Destination Protocol Info
493 93.638775 129.94.61.35 149.171.96.95 HTTP GET / HTTP/1.1

Frame 493 (564 bytes on wire, 564 bytes captured)
Ethernet II, Src: AsustekC_01:61:71 (00:17:31:01:61:71), Dst: All-HSRP-routers_00 (00:00:0c:07:ac:00)
Internet Protocol, Src: 129.94.61.35 (129.94.61.35), Dst: 149.171.96.95 (149.171.96.95)
Transmission Control Protocol, Src Port: 36829 (36829), Dst Port: http (80), Seq: 1, Ack: 1, Len: 498
Hypertext Transfer Protocol

No. Time Source Destination Protocol Info
494 93.639336 149.171.96.95 129.94.61.35 TCP http > 36829 [ACK] Seq=1 Ack=499 Win=48734 Len=0 TSV=75646982 TSER=161573

Frame 494 (66 bytes on wire, 66 bytes captured)
Ethernet II, Src: Cisco_cd:ea:c0 (00:09:11:cd:ea:c0), Dst: AsustekC_01:61:71 (00:17:31:01:61:71)
Internet Protocol, Src: 149.171.96.95 (149.171.96.95), Dst: 129.94.61.35 (129.94.61.35)
Transmission Control Protocol, Src Port: http (80), Dst Port: 36829 (36829), Seq: 1, Ack: 499, Len: 0

Content:

GET / HTTP/1.1
Host: www.unsw.edu.au
User-Agent: Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.0.4) Gecko/20060614 Fedora/1.5.0.4-1.2.fc5 Firefox/1.5.0.4 pango-text
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Cookie: SITESERVER=ID=51b9af88ee9d405d4aa0aabc2e31af74

And Firefox is still trying to load the page as I type this...

Thanks,
Alex
Neil, I think you may be better equipped to look at TCP stuff than I am.
Alex, it would be very helpful to me to look at the binary tcpdumps that you've captured of this problem. Also, I assume you are capturing on the Fedora Host. Would it be possible for you to reproduce the problem and capture parallel binary dumps at the same time, one on the solaris server you are ssh-ing into and one on the Fedora machine you are ssh-ing in from? That would help me analyze this much more quickly. Thank you!
Created attachment 139685 [details] binary output from whiteshark for 129.94.61.53

Neil,

this is the eth0 capture from my ASUS 6000KT with the 2.6.18-1.2200.2.10.fc5.netdev.13.1 kernel. I tried two things:

http://www.unsw.edu.au [149.171.96.95, running on Solaris 9, publicly accessible]
ssh -v 149.171.96.98 [Solaris 9 with SunSSH, not publicly accessible]

Both were unsuccessful. I'll attach the capture from a different box.
Created attachment 139686 [details] ethereal capture for 2.6.16-1.2111_FC4smp

Neil,

for comparison, this is the Ethereal capture from a different computer, 129.94.61.18, running FC4. I did the same there: http://www.unsw.edu.au and ssh -v 149.171.96.98. That box does not have a monitor, so I did it through 129.94.61.123 [you can ignore packets to/from that box].

Thanks,
Alex
Neil,

I attached 2 captures: one from my box, 129.94.61.53, and one from another FC4 box, 129.94.61.18 [I connected to that box from WinXP 129.94.61.123; please ignore packets from that box]. Two actions were performed from both boxes:

1) from Firefox I tried to load the page from http://www.unsw.edu.au
2) ssh -v 149.171.96.98

ssh from the .53 box:

ssh -v 149.171.96.98
OpenSSH_4.3p2, OpenSSL 0.9.8a 11 Oct 2005
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Applying options for *
debug1: Connecting to 149.171.96.98 [149.171.96.98] port 22.
debug1: Connection established.
debug1: identity file /home/alexk/.ssh/identity type -1
debug1: identity file /home/alexk/.ssh/id_rsa type -1
debug1: identity file /home/alexk/.ssh/id_dsa type 2
debug1: Remote protocol version 2.0, remote software version Sun_SSH_1.0.1
debug1: match: Sun_SSH_1.0.1 pat Sun_SSH_1.0*
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_4.3
debug1: SSH2_MSG_KEXINIT sent
killed with Ctrl+C

ssh from the .18 box:

ssh -v 149.171.96.98
OpenSSH_4.2p1, OpenSSL 0.9.7f 22 Mar 2005
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Applying options for *
debug1: Connecting to 149.171.96.98 [149.171.96.98] port 22.
debug1: Connection established.
debug1: identity file /home/alexk/.ssh/identity type -1
debug1: identity file /home/alexk/.ssh/id_rsa type 1
debug1: identity file /home/alexk/.ssh/id_dsa type -1
debug1: Remote protocol version 2.0, remote software version Sun_SSH_1.0.1
debug1: match: Sun_SSH_1.0.1 pat Sun_SSH_1.0*
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_4.2
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: server->client aes128-cbc hmac-md5 none
debug1: kex: client->server aes128-cbc hmac-md5 none
debug1: sending SSH2_MSG_KEXDH_INIT
debug1: expecting SSH2_MSG_KEXDH_REPLY
debug1: Host '149.171.96.98' is known and matches the RSA host key.
debug1: Found key in /home/alexk/.ssh/known_hosts:7
debug1: ssh_rsa_verify: signature correct
debug1: SSH2_MSG_NEWKEYS sent
debug1: expecting SSH2_MSG_NEWKEYS
debug1: SSH2_MSG_NEWKEYS received
debug1: SSH2_MSG_SERVICE_REQUEST sent
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug1: Authentications that can continue: publickey,password
debug1: Next authentication method: publickey
debug1: Trying private key: /home/alexk/.ssh/identity
debug1: Offering public key: /home/alexk/.ssh/id_rsa
debug1: Server accepts key: pkalg ssh-rsa blen 149
debug1: read PEM private key done: type RSA
debug1: Authentication succeeded (publickey).
debug1: channel 0: new [client-session]
debug1: Entering interactive session.
Last login: Mon Oct 30 10:49:15 2006 from wkst018.bsdsu.u
Sun Microsystems Inc. SunOS 5.9 Generic May 2002
10:49:15 12047 $ logout
debug1: client_input_channel_req: channel 0 rtype exit-status reply 0
debug1: channel 0: free: client-session, nchannels 1
Connection to 149.171.96.98 closed.
debug1: Transferred: stdin 0, stdout 0, stderr 37 bytes in 2.5 seconds
debug1: Bytes per second: stdin 0.0, stdout 0.0, stderr 14.8
debug1: Exit status 0

Please let me know if you need anything else.

Thanks,
Alex
Today I burned an FC6 DVD and booted into linux rescue mode. I tried to ssh and to telnet to port 80 on the above-mentioned Solaris 9 boxes: still the same problem.
I'd actually asked for two parallel traces of the same connection, not a working and a non-working trace from different machines. Setting aside the .18 capture for the moment (since everything seems to work there), I'd like to focus on the .53 capture.

I note that in the HTTP connection we get the three-way TCP handshake in frames 25, 26, and 27, after which the .53 client sends an HTTP GET request, to which the server responds appropriately with an ACK. At this point we should expect to see an HTTP response from the server with either an error code or an OK code (as we do in frame 19 of the .18 capture). However, for some reason we don't see that. We need to determine whether the server sent that response and the client just never received it, or whether the server never sent it in the first place, which is why I was looking for a parallel dump on both the failing client and the server we were connecting to.

As a side note, I notice a discrepancy in the GET request between the .53 and the .18 machines: the .53 machine is sending a cookie to the server in the GET request that is not present in the .18 trace. Could that cookie somehow be causing a server-side error that is preventing any web server response?

The same applies to frame 1208 in 53.capture.bin. It's quite clear that after a good TCP handshake in the trace, we lose a segment somewhere in the ssh negotiation, but the question remains: did the server actually send a malformed TCP segment with the wrong sequence number in place, or did it send the missing segment, which simply got lost, either on the network or at the receiving end? Given the break in the IP identification sequence, I'm inclined to believe that we lost it on receive, but I need the parallel dumps from the client and server to be sure. It would also be good to know what happens after that ACK is received, since the ACK after the lost segment should have been followed by a server key exchange init frame (as it was in the .18 trace), which never occurred.
Neil,

good catch: after clearing the cookies in Firefox, the web page loaded in my browser! I'd like to clarify what you mean by two parallel traces. Do you mean sniffing on both the client and the server at the same time while I run ssh to the server?

Thanks,
Alex
Glad to hear that the cookie was killing your web session. 1 problem down, 1 to go. As for the traces, you are right on the money, yes. What I need is for sniff or tcpdump to be run on the client and server at the same time when you try to ssh from the client to the server. With both traces, I can tell conclusively if the missing frames I mentioned in my comment number 13 are missing because the server never sent them, or because we never received them, and that will help me narrow down where to look for this particular problem. Thanks!
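The comparison described above can be sketched roughly as follows. This is only a minimal illustration in Python, not a tool that was used on this bug: the `Frame` tuple, the `missing_on_receiver` helper, and every sample value are hypothetical, and it assumes each capture has already been reduced to (IP ID, relative TCP sequence number, payload length) tuples for the server-to-client direction.

```python
from typing import List, NamedTuple


class Frame(NamedTuple):
    ip_id: int   # IP identification field of the packet
    seq: int     # relative TCP sequence number of the first payload byte
    length: int  # TCP payload length in bytes


def missing_on_receiver(sent: List[Frame], received: List[Frame]) -> List[Frame]:
    """Return frames seen in the sender's trace but not in the receiver's trace."""
    seen = set(received)
    return [frame for frame in sent if frame not in seen]


# Hypothetical sample data, not taken from the attached captures.
server_trace = [Frame(100, 1, 23), Frame(101, 24, 360), Frame(102, 384, 160)]
client_trace = [Frame(100, 1, 23), Frame(102, 384, 160)]

lost = missing_on_receiver(server_trace, client_trace)
print(lost)
```

If the resulting list is empty, every frame the server put on the wire also shows up at the client, so any gap in the stream would have to originate on the sending side. If it is not empty, the listed frames were lost on the network or in the client's receive path.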
Neil,

sorry: removing the cookie did not solve the problem. It was a change to the firewall rules that fixed it. I found tcpdump on the Solaris box and captured some packets. Now I will repeat that for John's 2.6.18-1.2200.2.10.fc5.netdev.13.1 kernel and upload the capture.

Thanks,
Alex
Created attachment 139811 [details] solaris 5.9 packets capture: pls check packets 42-50,77-78
Created attachment 139813 [details] linux 2.6.18-1.2200.2.10fc5.netdev.13.1.53 packets capture pls check packets 35-43,98-99
Thank you for the new traces. Well, it turns out I was wrong regarding what I expected to see in these traces. I had expected to see each frame during the ssh session appear in both traces right up to the point where we got the lost segment indication. I had expected the immediately preceding frame, which should have carried sequence bytes 24 through 383 from the Solaris box to the Fedora box, to appear only in the Solaris trace and not in the Fedora trace. That would have indicated that the missing packet was getting lost somewhere on the network, or in the receive queue on the Fedora machine.

What I see, however, is that all frames are in all traces. The fact that the Solaris trace (capture-5-9-98) records a missing sequence number (see frame 50) in the TCP stream that it is itself generating indicates that the Solaris box never put a packet on the wire containing sequence bytes 24 through 383, as it should have. This is going to have to be debugged on the Solaris machine.

I know that you indicated that this only happens with later Fedora kernels. It may be that some minor ssh discrepancy (for example, the types of key exchange algorithms supported by various ssh implementations) is triggering some bug or overflow in the Solaris box, leading to the TCP error. I've not yet found any discrepancy, but I'll keep looking.

You also mentioned that a firewall rule change led to your ability to get web sessions going again on the newer Fedora boxes. The same type of problem may be happening here. Is the firewall that you were referring to running directly on the Solaris box? Or is there a firewall on the Solaris box? If so, a bad rule would certainly explain how a TCP segment went missing on the system that was supposed to generate it (the firewall filter dropped the frame prior to the point in the code where the sniff capture observes the TCP stream).

I'd also check your SNMP statistics for network frame droppage on the Solaris box; that would indicate problematic bottlenecks that could be responsible for lost frames.
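The "missing sequence bytes" reasoning above can be illustrated with a short sketch. This is a hedged example in Python rather than anything run against the attached captures: the `find_gaps` helper is hypothetical, and the segment lengths are made up so that the gap matches the bytes 24 through 383 mentioned above. It assumes one direction of the stream has been reduced to (relative sequence number, payload length) pairs.

```python
from typing import Iterable, List, Tuple


def find_gaps(segments: Iterable[Tuple[int, int]], start: int = 1) -> List[Tuple[int, int]]:
    """Return inclusive (first, last) byte ranges that no segment in the trace covers."""
    gaps = []
    expected = start  # next relative sequence number we expect to see
    for seq, length in sorted(segments):
        if length == 0:
            continue  # pure ACKs carry no payload and cover no bytes
        if seq > expected:
            gaps.append((expected, seq - 1))
        expected = max(expected, seq + length)
    return gaps


# Hypothetical segments: bytes 1-23 arrive, then the stream resumes at byte 384.
segments = [(1, 23), (384, 160)]
print(find_gaps(segments))
```

Running the same check on the sender's own trace and on the receiver's trace is what distinguishes the two cases: a gap that already exists in the sender's trace means the bytes were never put on the wire.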
ping? any update here?
(In reply to comment #20)
> ping? any update here?

Neil,

I gave up: I installed Scientific Linux 4.4 and I can ssh to SunSSH again.

Thanks,
Alex
Solution: change the value of net.ipv4.tcp_rmem. Running

/sbin/sysctl net.ipv4.tcp_rmem

showed

net.ipv4.tcp_rmem = 4096 87380 4194304

and after changing it with

sudo /sbin/sysctl -w net.ipv4.tcp_rmem="4096 1048576 207520"

everything is working again! This is for Scientific Linux 5.0 [RHEL 5.0].

Cheers
Alex
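For readers applying this setting: net.ipv4.tcp_rmem holds three byte counts, documented as minimum, default, and maximum receive buffer size. The sketch below is a minimal Python illustration of parsing such a triple and checking whether it is in ascending order; the `TcpRmem` type and both helper functions are hypothetical names introduced for this example. It only reports the ordering and makes no claim about which values are correct for a given system.

```python
from typing import NamedTuple


class TcpRmem(NamedTuple):
    minimum: int  # bytes
    default: int  # bytes
    maximum: int  # bytes


def parse_tcp_rmem(text: str) -> TcpRmem:
    """Parse the three whitespace-separated byte counts of net.ipv4.tcp_rmem."""
    parts = text.split()
    if len(parts) != 3:
        raise ValueError(f"expected 3 values, got {len(parts)}")
    return TcpRmem(*(int(part) for part in parts))


def is_ordered(value: TcpRmem) -> bool:
    """True when minimum <= default <= maximum."""
    return value.minimum <= value.default <= value.maximum


# The two settings quoted in the comment above.
old = parse_tcp_rmem("4096 87380 4194304")
new = parse_tcp_rmem("4096 1048576 207520")
print(old, is_ordered(old))
print(new, is_ordered(new))
```

In the new setting as posted, the second value (1048576) is larger than the third (207520), so it may be worth double-checking that the numbers were not transposed before copying them. Note also that a value set with sysctl -w does not survive a reboot unless it is also added to /etc/sysctl.conf.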