Bug 1567664 - Exec session terminates after ~2m while still receiving traffic on 3.10
Summary: Exec session terminates after ~2m while still receiving traffic on 3.10
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Routing
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 3.11.0
Assignee: Ram Ranganathan
QA Contact: zhaozhanqi
Depends On:
Reported: 2018-04-15 22:24 UTC by Clayton Coleman
Modified: 2018-06-15 17:54 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2018-06-15 17:54:21 UTC
Target Upstream Version:


Description Clayton Coleman 2018-04-15 22:24:46 UTC
I noticed this while debugging a pod on 3.10. Using `oc rsh` I was tailing the output of a file, and after almost exactly 2 minutes I was disconnected. When I reconnected, my other shell was still running, so it looked like Docker still thought the exec session was running (which is itself a problem, because a dead stream should result in the session being terminated and cleaned up on the node).


1. From your laptop, run `time oc rsh RUNNING_POD`
2. Run `top` (so the client is receiving continuous writes from the server)


After 2 minutes the session is disconnected, even though `top` is sending continuous traffic.

This was on GCP 3.10 from a recent master post-rebase. The GCP master load balancer has a 2 minute timeout, but that is a timeout on a request to a backend, not a one-way connection idle timeout (and watches aren't being detached after 2m either).
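For illustration only (this simulation is not from the bug report): the distinction above, an absolute per-request deadline as opposed to an idle timeout, can be modelled locally. In the sketch below, a backend writes continuously while the client enforces a fixed wall-clock deadline, mimicking a load balancer that caps total request duration; the connection is cut even though bytes never stop flowing, which matches the observed `oc rsh` behaviour. All names and the port number are hypothetical.

```python
import socket
import threading
import time

def run_server(port, stop):
    # Backend: writes a byte every 50 ms until told to stop or the peer drops.
    srv = socket.socket()
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", port))
    srv.listen(1)
    conn, _ = srv.accept()
    try:
        while not stop.is_set():
            conn.sendall(b"x")
            time.sleep(0.05)
    except OSError:
        pass  # peer closed the connection
    finally:
        conn.close()
        srv.close()

def client_session(port, deadline=None):
    # Client with an optional absolute wall-clock deadline, mimicking a load
    # balancer that caps total request duration regardless of traffic.
    # settimeout() below is a true idle timeout: it only trips if no bytes
    # arrive for 0.5 s, which never happens here because the server streams.
    start = time.monotonic()
    received = 0
    s = socket.create_connection(("127.0.0.1", port))
    s.settimeout(0.5)
    try:
        while True:
            if deadline is not None and time.monotonic() - start >= deadline:
                break  # absolute deadline fired despite continuous traffic
            received += len(s.recv(64))
    finally:
        s.close()
    return time.monotonic() - start, received
```

Running a server thread and a client with `deadline=0.5` shows the session ending after roughly half a second even though data was arriving the whole time; with only the idle timeout, the session would stay open as long as the stream keeps flowing, which is the behaviour the reporter expected from `oc rsh`.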

Does not occur against 3.9 AWS clusters like us-east-1 - sessions stayed open forever.

Suspect rebase

Comment 1 Jordan Liggitt 2018-04-15 22:36:44 UTC
The docker client didn't change in the rebase. I don't recall apiserver or kubelet handling of exec changing upstream either, but I can check.

Comment 3 Clayton Coleman 2018-04-30 18:30:39 UTC
I still see `oc logs` failing on the one cluster. Can you verify whether `logs` behaves the same way?

Comment 4 Clayton Coleman 2018-05-01 15:20:05 UTC
Were you accessing from within the instance or outside?  When you were outside, what network were you on?

Comment 7 Michal Fojtik 2018-05-23 11:16:31 UTC
I can't reproduce this locally via `oc cluster up`, so I assume this has to be a provider-specific problem where the GCP load balancer somehow breaks the connection after 2 minutes.

Moving to the networking team for further investigation. I don't think this is a 3.10 blocker.
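As background on the load balancer hypothesis: if a GCP backend service is cutting the stream, the relevant setting is the backend service's `timeoutSec`, which bounds total request duration (not idle time). A hedged sketch, assuming a backend service named `master-backend` (a placeholder, not taken from this cluster), of inspecting and raising it:

```shell
# Inspect the per-request timeout on the (hypothetical) backend service.
gcloud compute backend-services describe master-backend --global \
    --format='value(timeoutSec)'

# Raise it so long-lived streams (exec, logs, watch) are not cut off mid-flight.
gcloud compute backend-services update master-backend --global --timeout=3600
```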

Comment 8 Dan Mace 2018-06-15 17:54:21 UTC
Closing unless somebody can provide a reproducer. No issues with the `top` scenario in a 3.10 GCP cluster running for 100 minutes, nor with an `oc logs` tail for 30 minutes in the same cluster (accessed from the public internet).
