Bug 1567664

Summary: Exec session terminates after ~2m while still receiving traffic on 3.10
Product: OpenShift Container Platform
Reporter: Clayton Coleman <ccoleman>
Component: Networking
Assignee: Ram Ranganathan <ramr>
Networking sub component: router
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED WORKSFORME
Docs Contact:
Severity: medium
Priority: unspecified
CC: aos-bugs, ccoleman, dmace, jliggitt, jokerman, maszulik, mfojtik, mmccomas
Version: 3.10.0
Target Milestone: ---
Target Release: 3.11.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-06-15 17:54:21 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Clayton Coleman 2018-04-15 22:24:46 UTC
I noticed this while debugging a pod on 3.10.  Using `oc rsh` I was tailing the output of a file, and after almost exactly 2 minutes I was disconnected.  When I reconnected, my other shell was still running, so it looked like Docker still thought the exec session was running (which is itself a problem, because a dead stream should result in the session being terminated and cleaned up on the node).

Scenario:

1. From your laptop, run `time oc rsh RUNNING_POD`
2. Run `top` (so the client is receiving continuous writes from the server; a timed sketch of this follows below)

Actual:

After 2 minutes the session is disconnected, even though top is sending continuous traffic.
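
A minimal timed version of the scenario above, as a sketch (assumes the pod image ships `top`; RUNNING_POD is a placeholder):

  # Time how long the exec stream survives while the server writes continuously.
  # top -b (batch mode) keeps output flowing without needing interactive input.
  time oc rsh RUNNING_POD top -b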

This was on a GCP 3.10 cluster built from a recent post-rebase master.  The GCP master load balancer has a 2-minute timeout, but that is a per-request timeout to a backend, not a one-way idle-connection timeout (and watches aren't being detached after 2 minutes either).
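
To confirm or rule out the load balancer, its backend timeout can be checked directly; a sketch, assuming access to the cluster's GCP project (BACKEND_SERVICE is a placeholder for the master's backend service name):

  # Print the configured per-request timeout (in seconds) on the backend service.
  gcloud compute backend-services describe BACKEND_SERVICE --global --format="value(timeoutSec)"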

Does not occur against 3.9 AWS clusters like us-east-1; the session stayed open indefinitely.

Suspect rebase

Comment 1 Jordan Liggitt 2018-04-15 22:36:44 UTC
The docker client didn't change in the rebase. I don't recall apiserver or kubelet handling of exec changing upstream either, but I can check.

Comment 3 Clayton Coleman 2018-04-30 18:30:39 UTC
I still see logs failing on the one cluster.  Can you verify whether `oc logs` behaves the same way?

Comment 4 Clayton Coleman 2018-05-01 15:20:05 UTC
Were you accessing it from within the instance or from outside?  If from outside, what network were you on?

Comment 7 Michal Fojtik 2018-05-23 11:16:31 UTC
I can't reproduce this locally via `cluster up`, so I assume this has to be a provider-specific problem where the GCP load balancer must somehow break the connection after 2 minutes.

Moving to the networking team for further investigation. I don't think this is a 3.10 blocker.

Comment 8 Dan Mace 2018-06-15 17:54:21 UTC
Closing unless somebody can provide a reproducer. No issues with the `top` scenario in a 3.10 GCP cluster running for 100 minutes, nor with an `oc logs` tail for 30 minutes in the same cluster (accessed from the public internet).
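
The `oc logs` side of that check can be repeated the same way as the rsh scenario; a sketch (RUNNING_POD is a placeholder):

  # Follow the pod's logs from outside the cluster and watch for a premature drop.
  time oc logs -f RUNNING_POD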