Bug 1567664

Summary: Exec session terminates after ~2m while still receiving traffic on 3.10
Product: OpenShift Container Platform
Reporter: Clayton Coleman <ccoleman>
Component: Networking
Assignee: Ram Ranganathan <ramr>
Networking sub component: router
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED WORKSFORME
Docs Contact:
Severity: medium
Priority: unspecified
CC: aos-bugs, ccoleman, dmace, jliggitt, jokerman, maszulik, mfojtik, mmccomas
Version: 3.10.0
Target Milestone: ---
Target Release: 3.11.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-06-15 17:54:21 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Clayton Coleman 2018-04-15 22:24:46 UTC
I noticed this while debugging a pod on 3.10.  Using `oc rsh` I was tailing the output of a file, and after almost exactly 2 minutes I was disconnected.  When I reconnected, my other shell was still running, so it looked like Docker still thought the exec session was running (which is itself a problem, because a dead stream should result in the session being terminated and cleaned up on the node).

Scenario:

1. From your laptop, run `time oc rsh RUNNING_POD`
2. Run `top` (so the client is receiving continuous writes from the server; a timed sketch of this follows below)

Actual:

After 2 minutes the session is disconnected, even though top is sending continuous traffic.
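
A minimal timed version of the scenario above, as a sketch (assumes the pod image ships `top`; RUNNING_POD is a placeholder):

  # Time how long the exec stream survives while the server writes continuously.
  # top -b (batch mode) keeps output flowing without needing interactive input.
  time oc rsh RUNNING_POD top -b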

This was on a GCP 3.10 cluster built from a recent post-rebase master.  The GCP master load balancer has a 2-minute timeout, but that is a per-request timeout to a backend, not a one-way idle-connection timeout (and watches aren't being detached after 2 minutes either).
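
To confirm or rule out the load balancer, its backend timeout can be checked directly; a sketch, assuming access to the cluster's GCP project (BACKEND_SERVICE is a placeholder for the master's backend service name):

  # Print the configured per-request timeout (in seconds) on the backend service.
  gcloud compute backend-services describe BACKEND_SERVICE --global --format="value(timeoutSec)"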

Does not occur against 3.9 AWS clusters like us-east-1; the session stayed open indefinitely.

Suspect rebase

Comment 1 Jordan Liggitt 2018-04-15 22:36:44 UTC
The docker client didn't change in the rebase. I don't recall apiserver or kubelet handling of exec changing upstream either, but I can check.

Comment 3 Clayton Coleman 2018-04-30 18:30:39 UTC
I still see logs failing on the one cluster.  Can you verify whether `oc logs` behaves the same way?

Comment 4 Clayton Coleman 2018-05-01 15:20:05 UTC
Were you accessing it from within the instance or from outside?  If from outside, what network were you on?

Comment 7 Michal Fojtik 2018-05-23 11:16:31 UTC
I can't reproduce this locally via `cluster up`, so I assume this has to be a provider-specific problem where the GCP load balancer must somehow break the connection after 2 minutes.

Moving to the networking team for further investigation. I don't think this is a 3.10 blocker.

Comment 8 Dan Mace 2018-06-15 17:54:21 UTC
Closing unless somebody can provide a reproducer. No issues with the `top` scenario in a 3.10 GCP cluster running for 100 minutes, nor with an `oc logs` tail for 30 minutes in the same cluster (accessed from the public internet).
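
The `oc logs` side of that check can be repeated the same way as the rsh scenario; a sketch (RUNNING_POD is a placeholder):

  # Follow the pod's logs from outside the cluster and watch for a premature drop.
  time oc logs -f RUNNING_POD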