242870 – openssh becomes non-responsive after a few minutes when logged into a remote host


Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	openssh
Sub Component:
Version:	11
Hardware:	i686
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	---
Assignee:	Jan F. Chadima
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	453752
Depends On:
Blocks:

Reported:	2007-06-06 06:02 UTC by John Pomeroy
Modified:	2010-06-28 10:25 UTC
CC List:	9 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2010-06-28 10:25:20 UTC
Type:	---
Embargoed:
Dependent Products:
Flags:	jchadima: needinfo?


Description John Pomeroy 2007-06-06 06:02:12 UTC

Description of problem:
openssh stops functioning after a few minutes logged into a remote host

Version-Release number of selected component (if applicable):
OpenSSH_4.5p1, OpenSSL 0.9.8b 04 May 2006


How reproducible:
ssh into remote host, perform misc tasks, after about 3 - 5 minutes ssh session
becomes non responsive.  


Comment 1 Tomas Mraz 2007-06-06 10:58:56 UTC

I cannot reproduce this. Does it happen on every host you connect to? What is
the remote host sshd version?

Comment 2 Matthew Miller 2007-08-31 02:48:01 UTC

I've been seeing a similar thing connecting from Fedora 7, Fedora Rawhide, or
CentOS 5 machines to a CentOS 4 server. For me, it isn't a matter of minutes
after connecting -- it seems to be when something dumps a few K of text to the
screen at once. Maybe something with terminal control codes, even (although it
is ssh that's hung, not the terminal). John, does that match what you're seeing?

Tried upgrading openssh on the server to RHEL5 version -- didn't help.

Comment 3 Tomas Mraz 2007-08-31 07:17:30 UTC

Matt, I've tried to reproduce this and wasn't successful. What application triggers
that for you?

Comment 4 Matthew Miller 2007-08-31 12:09:14 UTC

It's frustratingly hard to reproduce on purpose. It'll happen in mutt, in ls, in
less, in joe, in vi. But doing the same thing again won't necessarily recreate it.
I need to do more diagnosing next time it happens -- at this point, I was
looking for similar bugs that might offer clues.

Comment 5 Tomas Mraz 2007-08-31 12:41:40 UTC

At least it would be interesting to see full stack backtraces on both server and
client.

Comment 6 Matthew Miller 2007-09-06 04:55:56 UTC

I don't have a backtrace with debuginfo, but I do have an strace of the running
process. client is responsive to keystrokes, with the appropriate-looking
selects, reads, and writes. On the server, though, it's just doing this:

  $ sudo strace -p 25726
  Process 25726 attached - interrupt to quit
  select(10, [3 6 9], [], NULL, NULL

and the non-debuginfo (um, with RHEL5 sshd rebuilt on CentOS 4 to see if that
would help; it doesn't):

#0  0xb7fe87a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x0042a64d in ___newselect_nocancel () from /lib/tls/libc.so.6
#2  0x0011f847 in main () from /usr/sbin/sshd

Hmmm. That's interesting....
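
The strace above shows sshd blocked in select() on descriptors 3, 6 and 9. A minimal sketch for checking what those descriptors refer to, assuming a Linux /proc filesystem. The function name `show_fds` is made up for illustration, and the PID and descriptor numbers in the usage line are simply the ones from the trace:

```shell
# List what the given file descriptors of a process refer to.
# Usage: show_fds PID FD [FD...]
show_fds() {
    pid=$1
    shift
    for fd in "$@"; do
        # readlink prints the link target, e.g. a socket or a pty device.
        target=$(readlink "/proc/$pid/fd/$fd" 2>/dev/null) || target="(not readable)"
        printf '%s -> %s\n' "$fd" "$target"
    done
}

# Example with the values from the trace (run as root to inspect sshd):
#   show_fds 25726 3 6 9
```

Knowing which descriptor is the network socket and which ones belong to the pty would help tell whether sshd is waiting on the client or on the shell.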

Comment 7 Tomas Mraz 2007-09-06 06:20:45 UTC

Are there any writes to the network socket on the client? Can you do tcpdump on
server and client to see that there are really packets sent from the client and
received on the server?

Comment 8 Matthew Miller 2007-09-06 12:52:09 UTC

I'll do that next time I see the problem.

Comment 9 Matthew Miller 2007-09-06 19:55:16 UTC

So, I should have done that before. Nothing from tcpdump on server or client
when I type into the frozen window.

However, there's some periodic (keepalive?) stuff going back and forth every
(exactly) two minutes:

15:48:26.533029 IP server.bu.edu.ssh > client.bu.edu.40941: P 6143502:6144510(1008) ack 3046252168 win 3228 <nop,nop,timestamp 1418955725 84682970>
15:48:26.533926 IP client.bu.edu.40941 > server.bu.edu.ssh: . ack 1008 win 501 <nop,nop,timestamp 84888549 1418955725,nop,nop,sack 1 {0:1008}>
15:48:28.301874 IP client.bu.edu.40941 > server.bu.edu.ssh: P 1:49(48) ack 1008 win 501 <nop,nop,timestamp 84890318 1418955725>
15:50:26.506466 IP server.bu.edu.ssh > client.bu.edu.40941: P 0:1008(1008) ack 1 win 3228 <nop,nop,timestamp 1419075725 84682970>
15:50:26.506516 IP client.bu.edu.40941 > server.bu.edu.ssh: . ack 1008 win 501 <nop,nop,timestamp 85008521 1419075725,nop,nop,sack 1 {0:1008}>
15:50:28.303192 IP client.bu.edu.40941 > server.bu.edu.ssh: P 1:49(48) ack 1008 win 501 <nop,nop,timestamp 85010318 1419075725>


Simultaneously, on the server side:


15:48:26.537106 IP server.bu.edu.ssh > client.bu.edu.40941: P 6143502:6144510(1008) ack 3046252168 win 3228 <nop,nop,timestamp 1418955725 84682970>
15:48:26.537590 IP client.bu.edu.40941 > server.bu.edu.ssh: . ack 1008 win 501 <nop,nop,timestamp 84888549 1418955725,nop,nop,sack sack 1 {0:1008} >
15:48:28.306478 IP client.bu.edu.40941 > server.bu.edu.ssh: P 1:49(48) ack 1008 win 501 <nop,nop,timestamp 84890318 1418955725>
15:50:26.510830 IP server.bu.edu.ssh > client.bu.edu.40941: P 0:1008(1008) ack 1 win 3228 <nop,nop,timestamp 1419075725 84682970>
15:50:26.511332 IP client.bu.edu.40941 > server.bu.edu.ssh: . ack 1008 win 501 <nop,nop,timestamp 85008521 1419075725,nop,nop,sack sack 1 {0:1008} >
15:50:28.308029 IP client.bu.edu.40941 > server.bu.edu.ssh: P 1:49(48) ack 1008 win 501 <nop,nop,timestamp 85010318 1419075725>
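
A capture like the one above can be scanned for data segments that are sent more than once, which may point at retransmissions rather than keepalives. This is only a sketch: it assumes tcpdump's older text format as shown here, with each packet on a single line, and the function name `find_retransmits` is made up for illustration.

```shell
# Print each (sender, sequence range) pair that appears more than once in
# tcpdump text output read from stdin, one packet per line.
find_retransmits() {
    awk '
        {
            # Field 3 is the sender; data segments carry a range like 1:49(48).
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^[0-9]+:[0-9]+\([0-9]+\)$/) {
                    seen[$3 " " $i]++
                }
            }
        }
        END {
            # Report only ranges that were seen at least twice.
            for (k in seen) if (seen[k] > 1) print seen[k], k
        }
    '
}

# Example: find_retransmits < capture.txt
```

Fed the six client-side lines above, this reports the client's 48-byte segment `1:49(48)` twice, which would be consistent with the client retransmitting data that the server never acknowledges.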

Comment 10 Matthew Miller 2007-09-06 19:59:47 UTC

Just tried typing <enter>~<ctrl-z> to see what would happen. Got the "^Z
[suspend ssh]" output, but didn't seem to actually get a prompt. But then I
accidentally closed the window so I'm not 100% sure. Helpful, I know!

Comment 11 Tomas Mraz 2007-09-07 06:59:49 UTC

I think the problem is rather on the client side - can you try to upgrade to the
openssh-4.7p1-1.fc8 from here:
http://koji.fedoraproject.org/koji/buildinfo?buildID=17988

Comment 12 Matthew Miller 2007-09-07 11:22:09 UTC

Yeah, after seeing the client not send out any packets I gotta agree with your
suspicion. :) I'll try the new packages and see what happens.

Comment 13 Matthew Miller 2007-09-07 12:41:11 UTC

No luck; just happened with openssh-4.7p1-1.fc8 (client).

Comment 14 Tomas Mraz 2007-09-07 12:53:17 UTC

(In reply to comment #13)
> No luck; just happened with openssh-4.7p1-1.fc8 (client).

Damn, so now you can try to bisect versions of the client from various old
Fedoras to see when it broke. Because I think the problem wasn't there in the
3.9p1 version - at least I don't see any such reports on RHEL-4.

But that won't be fun :(.

Note that the old clients should be more or less usable on latest Fedoras. The
servers not so much because of selinux changes.

Comment 15 Matthew Miller 2007-09-07 12:57:54 UTC

Yeah, the problem definitely started for me when I upgraded my home system from
a Fedora Core 4-based distribution, and for Paul Stauffer (who I added to the cc
list) on upgrading to a RHEL5-based one.

I'll see what more I can discover. Good times. :)

Comment 16 Paul Stauffer 2007-09-07 14:07:35 UTC

Actually, the system I'm seeing this problem on is Fedora 7.  I have never seen
this problem on any of several FC6 systems I've used regularly, so I'm guessing
it's something that changed between 6 and 7.

Comment 17 Matthew Miller 2007-09-07 14:24:50 UTC

I definitely see it on my BU Linux 5.0 ( = CentOS 5 / RHEL5) system too, with
openssh-4.3p2-16.el5.centos.bu50.13. (BU changes are to the config file only, so
it's unlikely that we've done anything that causes the problem. Plus my home
system is currently unmodified rawhide.)

Fedora Core 6 had openssh-4.3p2-10.src.rpm, but then that got updated to
openssh-4.3p2-19.fc6.src.rpm. So it seems a bit odd that that would work and
RHEL5 not.

Maybe the problem is in some supporting library. Wouldn't that be fun.

Paul, can you roll your home system back to 4.3p2-19.fc6?

Comment 18 Milan Slanař 2007-10-12 14:27:12 UTC

Same problem as in #2
server CentOS 4 openssh-3.9p1-8.RHEL4.20.x86_64
clients from FC5, FC6, F7

no problem from FC2

Comment 19 Robin Norwood 2008-03-26 02:02:20 UTC

I'm seeing what appears to be the same bug on a brand new F9-beta install -
openssh-clients-4.7p1-9.fc9.i386.

Comment 20 Robin Norwood 2008-03-26 02:59:16 UTC

...and downgrading all the way to openssh-4.5p1-6.fc7.i386.rpm and
openssl-0.9.8b-15.fc7.i686.rpm doesn't help.

Comment 21 Matthew Miller 2008-03-26 12:47:41 UTC

Moving to Rawhide as per comment #19. Because I think we all know this isn't
getting fixed in Fedora 7. :)

Comment 22 Bug Zapper 2008-05-14 02:58:21 UTC

Changing version to '9' as part of upcoming Fedora 9 GA.
More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 23 Tomas Mraz 2008-10-03 07:29:27 UTC

Do you still see the problem with the openssh-5.1p1 which is now in F9 updates?

Comment 24 Matthew Miller 2008-10-03 12:05:41 UTC

I don't have any more CentOS4 / RHEL4 servers to test against. (Some of them I upgraded specifically to avoid this bug.) The problem definitely does not appear when connecting to a CentOS 5 or Fedora 9+ system. But I'm suspicious that the problem is still there.

Comment 25 Matthew Miller 2008-10-03 20:07:02 UTC

Oh, weird. So, at my new job, we have a lot of machines running under VMware ESX 3.5. I still don't have a RHEL4 server, but the CentOS 5 (w/ latest patches) vm I am working on now exhibits the exact same problem when connecting with openssh-5.1p1-2.fc10.x86_64.

Could be a red herring, but the symptoms are exactly the same (including similar network behavior when I watch with tcpdump).

Even if the VMware issue I'm seeing now happens to be unrelated, I think this is serious enough that we want to *know* the problem is solved, not just hope that it happens to go away with a new release.

This bug probably needs to get the attention of someone on the RHEL side of things before RHEL 6, because it's extremely likely that after that release you'll see a lot of this in shops with RHEL 4 and the new 6 release.

Comment 26 Matthew Miller 2008-10-03 20:15:16 UTC

Ohhhhhh.

Hmmm. For the connect-to-centos5-under-vmware problem, I can reproduce frequently *when iptables is enabled*. If I drop the local host firewall, there are no problems. I had tested with the firewall off on the server before but I don't think it had occurred to me to test the client. I have an entirely stock iptables configuration:

$ sudo service iptables status
Table: filter
Chain INPUT (policy ACCEPT)
num  target     prot opt source               destination         
1    ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0           state RELATED,ESTABLISHED 
2    ACCEPT     icmp --  0.0.0.0/0            0.0.0.0/0           
3    ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0           
4    ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0           state NEW tcp dpt:22 
5    ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0           state NEW tcp dpt:22 
6    REJECT     all  --  0.0.0.0/0            0.0.0.0/0           reject-with icmp-host-prohibited 

Chain FORWARD (policy ACCEPT)
num  target     prot opt source               destination         
1    REJECT     all  --  0.0.0.0/0            0.0.0.0/0           reject-with icmp-host-prohibited 

Chain OUTPUT (policy ACCEPT)
num  target     prot opt source               destination         


Can someone seeing this problem connecting to a stock RHEL4 host test if the issue goes away if you stop iptables on your client machine?

Comment 27 Tomas Mraz 2008-10-03 20:39:14 UTC

I am really curious what triggers it, because it seems apparent that the problem is not just some kind of version incompatibility. To me it rather seems that certain conditions on the network connection or on the client or server machines trigger some kind of race condition between the server and client, and the client gets into a deadlock.

Comment 28 Matthew Miller 2008-10-04 00:55:09 UTC

Previously, it mostly triggered when I was paging through e-mail in mutt. This time, it came up when I was running system-config-*-tui commands, and I was able to pretty reliably reproduce it using that. (I can't guarantee that this latest example is really the same bug, though.) However, it wasn't limited to full-screen apps -- it'd sometimes happen when doing an ls. It seems to mostly happen when there's a bunch of data dumped to the screen all at once — although it never appeared to happen when doing scp.

Comment 29 Jim 2008-10-08 20:06:33 UTC

It seems to me that this could be an iptables connection tracking issue. I caught it while logging all dropped packets while I was having problems with NFS.

Although this seems to happen on its own at times, I can force the same result by restarting iptables on the server with some running ssh connections. Iptables then starts to drop seemingly random packets from what should be established connections. Sometimes it drops only a percentage of packets, making the connection seem slow (this was the problem I had with NFS), but often it drops so many that the connection is effectively frozen.

My workaround has been to remove the NEW requirement for ssh connections, and accept all tcp on 22.

-A RH-Firewall-1-INPUT -m tcp -p tcp --dport 22 -j ACCEPT

I haven't ever had ssh freeze after this change.
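
One way to test the connection-tracking theory before loosening the rule is to log ssh packets that conntrack classifies as INVALID. A sketch, assuming the stock chain name `RH-Firewall-1-INPUT` from the rule above and an arbitrary log prefix. `DRY_RUN` and the function names are illustrative only. With `DRY_RUN=1` the commands are printed instead of executed:

```shell
# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Insert a LOG rule for INVALID packets to the ssh port at the top of the chain.
log_invalid_ssh() {
    chain=${1:-RH-Firewall-1-INPUT}
    run iptables -I "$chain" 1 -p tcp --dport 22 \
        -m state --state INVALID -j LOG --log-prefix "ssh-invalid: "
}

# Remove the same rule again once testing is done.
unlog_invalid_ssh() {
    chain=${1:-RH-Firewall-1-INPUT}
    run iptables -D "$chain" -p tcp --dport 22 \
        -m state --state INVALID -j LOG --log-prefix "ssh-invalid: "
}

# Example (as root): log_invalid_ssh; reproduce the freeze; unlog_invalid_ssh
```

If the freeze coincides with "ssh-invalid:" entries in the kernel log, that would support the idea that established-connection packets are being misclassified and then rejected.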

Comment 30 Bug Zapper 2008-11-26 01:54:33 UTC

This bug appears to have been reported against 'rawhide' during the Fedora 10 development cycle.
Changing version to '10'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 31 Fedora Admin XMLRPC Client 2009-03-10 10:15:36 UTC

This package has changed ownership in the Fedora Package Database.  Reassigning to the new owner of this component.

Comment 32 Jonathan Underwood 2009-03-11 21:52:10 UTC

I used to see this problem a lot when configuring my firewall using shorewall. The problem was that a router somewhere was making out of frame packets, which by default get marked as INVALID and dropped. The fix in this case was:

echo 1 > /proc/sys/net/netfilter/nf_conntrack_tcp_be_liberal

It's a long shot, but I thought it might be helpful to mention this.

Comment 33 Tomas Mraz 2009-05-04 12:04:39 UTC

*** Bug 453752 has been marked as a duplicate of this bug. ***

Comment 34 rokoge 2009-05-21 11:56:33 UTC

We just experienced a similar problem as Matt Miller (Comment #9 From  Matthew Miller 2007-09-06 15:55:16 EDT).  The only difference is we saw the problem performing fast/large SCP and HTTPS transfers.

To correct the problem, we changed the "client" (the one doing the acks) setting for SACK (aka TCP selective acknowledgements, aka selective acks).  SACK is an option in the TCP header and is set when a connection uses it.

On the "client" side, you could run the following command:
echo 0 >/proc/sys/net/ipv4/tcp_sack

On the network side, you would want to look for dropped packets at the firewall (maybe related to an unset SACK flag if you are using SACK), or if SACK is set in the network equipment you could disable it.

I'm not sure what the developers of OpenSSH can do to fix this particular issue.  The sessions appear to time out on the "server" (the one NOT doing the acks) side, so maybe if that could be explored, it might provide some insight.

Your mileage may vary. Good luck.
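
Since the SACK setting is a runtime value under /proc/sys, it can be changed for a test and put back afterwards. A sketch, assuming a Linux host. `PROC_SYS` and the function names are hypothetical helpers, and `PROC_SYS` is overridable only so the logic can be exercised without root:

```shell
PROC_SYS=${PROC_SYS:-/proc/sys}

# Print the current value of a setting, e.g. net/ipv4/tcp_sack.
get_knob() {
    cat "$PROC_SYS/$1"
}

# Set a knob to a new value and print the old value so it can be restored.
set_knob() {
    old=$(get_knob "$1") || return 1
    printf '%s\n' "$2" > "$PROC_SYS/$1" || return 1
    printf '%s\n' "$old"
}

# Example (as root):
#   old=$(set_knob net/ipv4/tcp_sack 0)
#   ... open a new ssh session and try to reproduce the hang ...
#   set_knob net/ipv4/tcp_sack "$old" >/dev/null
```

SACK is negotiated when a connection is set up, so the change only affects ssh sessions opened after the value is written.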

Comment 35 Bug Zapper 2009-06-09 09:15:07 UTC

This bug appears to have been reported against 'rawhide' during the Fedora 11 development cycle.
Changing version to '11'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 36 Bug Zapper 2010-04-27 11:43:31 UTC

This message is a reminder that Fedora 11 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 11.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '11'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 11's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 11 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 37 Jan F. Chadima 2010-06-22 08:53:11 UTC

(In reply to comment #34)
> We just experienced a similar problem as Matt Miller (Comment #9 From  Matthew
> Miller 2007-09-06 15:55:16 EDT).  The only difference is we saw the problem
> performing fast/large SCP and HTTPS transfers.
> 
> To correct the problem, we changed the "client" (the one doing the acks)
> setting for SACK (aka. tcp selective acknowledgements, aka. selective acks). 
> SACK is an option in the TCP Header and is set when a connection uses it.
> 
> On the "client" side, you could run the following command:
> echo 0 >/proc/sys/net/ipv4/tcp_sack
> 
> On the network side, you would want to look for dropped packets at the firewall
> (maybe related to an unset SACK flag if you are using SACK), or if SACK is set
> in the network equipment you could disable it.
> 
> I'm not sure what the developers of OpenSSH can do to fix this particular
> issue.  The sessions appear to timeout on the "server" (the one NOT doing the
> acks) side, so maybe if that could be explored, it might provide some insight.
> 
> Your mileage may vary. Good luck.    

Please try turning off window scaling:
"echo 0 >/proc/sys/net/ipv4/tcp_window_scaling"

Comment 38 Bug Zapper 2010-06-28 10:25:20 UTC

Fedora 11 changed to end-of-life (EOL) status on 2010-06-25. Fedora 11 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.
