1567664 – Exec session terminates after ~2m while still receiving traffic on 3.10

Summary: Exec session terminates after ~2m while still receiving traffic on 3.10

Keywords:
Status:	CLOSED WORKSFORME
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	3.10.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	3.11.0
Assignee:	Ram Ranganathan
QA Contact:	zhaozhanqi
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2018-04-15 22:24 UTC by Clayton Coleman
Modified:	2022-08-04 22:20 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2018-06-15 17:54:21 UTC
Target Upstream Version:
Embargoed:

Description Clayton Coleman 2018-04-15 22:24:46 UTC

I noticed this while debugging a pod on 3.10. Using `oc rsh`, I was tailing the output of a file and was disconnected after almost exactly 2 minutes. When I reconnected, my earlier `sh` was still running, so Docker apparently still thought the exec session was alive. That is a problem in itself: a dead stream should cause the exec session to be terminated and cleaned up on the node.

Scenario:

1. From your laptop, run `time oc rsh RUNNING_POD`
2. Run `top` (so the client receives continuous writes from the server)
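The scenario above can be scripted so that each run records how long the stream actually stayed open. This is a minimal Python sketch, not an OpenShift tool: `soak_stream`, `near_cutoff`, and the 5-second tolerance are hypothetical, and it assumes `top` is run in batch mode (`top -b`) so it prints line-oriented output without a terminal. It reads the command's stdout until EOF or until `max_seconds` pass, and reports whether the stream ended on its own (a disconnect) before the deadline.

```python
import subprocess
import time

# The symptom in this bug was a disconnect at roughly 120 seconds.
SUSPECT_CUTOFF_S = 120.0


def soak_stream(cmd, max_seconds):
    """Read cmd's stdout until EOF or until max_seconds have elapsed.

    Returns (elapsed_seconds, lines_read, ended_early). ended_early is True
    when the stream closed by itself before the deadline, i.e. a disconnect.
    The deadline is checked only when a line arrives, so cmd must keep
    printing (as `top -b` does) for the soak to stop on time.
    """
    start = time.monotonic()
    lines = 0
    ended_early = True
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL)
    try:
        for _ in proc.stdout:
            lines += 1
            if time.monotonic() - start >= max_seconds:
                ended_early = False  # we stopped it; the stream was still alive
                break
    finally:
        if proc.poll() is None:
            proc.terminate()  # end the soak (or clean up after an error)
        proc.stdout.close()
        proc.wait()
    return time.monotonic() - start, lines, ended_early


def near_cutoff(elapsed, cutoff=SUSPECT_CUTOFF_S, tolerance=5.0):
    """True if a disconnect happened within `tolerance` seconds of `cutoff`."""
    return abs(elapsed - cutoff) <= tolerance
```

For example, `soak_stream(["oc", "rsh", "RUNNING_POD", "top", "-b", "-d", "1"], max_seconds=600)` would hold the session for up to 10 minutes; a result with `ended_early` set and an elapsed time near 120 s would match the symptom reported here. This assumes `oc` is on `PATH`, that `oc rsh` passes arguments after the pod name through to the command, and that the pod's image ships `top`.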

Actual:

After 2 minutes the session is disconnected, even though `top` is sending continuous traffic.

This was on a 3.10 cluster on GCP, built from a recent master after the rebase. The GCP master load balancer has a 2-minute timeout, but that is a per-request timeout to the backend, not an idle timeout for a one-way connection (and watches aren't being detached after 2m either).
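The distinction drawn here between a per-request deadline and an idle timeout can be made concrete with a toy model. The Python sketch below is an illustration only: `cutoff_time` and the policy names `"request"` and `"idle"` are hypothetical, and it does not describe how the GCP load balancer is actually implemented. It computes when a proxy applying each policy would cut a connection that carries `top`-style traffic.

```python
from typing import Optional, Sequence


def cutoff_time(traffic_times: Sequence[float], session_end: float,
                policy: str, timeout: float) -> Optional[float]:
    """Return when a proxy would cut the connection, or None if it survives.

    traffic_times: sorted times (seconds since the connection opened) at which
    data flows. session_end: when the client would have disconnected on its own.
    policy "request": close at a fixed deadline, regardless of traffic.
    policy "idle": close after a silent gap longer than `timeout`; every write
    resets the timer.
    """
    if policy == "request":
        return timeout if timeout < session_end else None
    if policy == "idle":
        last = 0.0
        for t in traffic_times:
            if t - last > timeout:
                return last + timeout  # timer expired before this write arrived
            last = t
        if session_end - last > timeout:
            return last + timeout  # went quiet before the session ended
        return None
    raise ValueError(f"unknown policy: {policy!r}")
```

In this model, a once-per-second writer like `top` is never cut under the idle policy, while a fixed deadline fires at 120 s no matter how much traffic flows. A disconnect at almost exactly 2 minutes with traffic still flowing therefore fits a fixed deadline better than an idle timeout.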

This does not occur against 3.9 AWS clusters such as us-east-1, where the session stayed open indefinitely.

Suspect the rebase.

Comment 1 Jordan Liggitt 2018-04-15 22:36:44 UTC

The Docker client didn't change in the rebase. I don't recall the apiserver's or kubelet's handling of exec changing upstream either, but I can check.

Comment 3 Clayton Coleman 2018-04-30 18:30:39 UTC

I still see logs failing on the one cluster. Can you verify whether `oc logs` behaves the same way?

Comment 4 Clayton Coleman 2018-05-01 15:20:05 UTC

Were you accessing from within the instance or outside?  When you were outside, what network were you on?

Comment 7 Michal Fojtik 2018-05-23 11:16:31 UTC

I can't reproduce this locally via `oc cluster up`, so I assume this is a provider-specific problem in which the GCP load balancer somehow breaks the connection after 2 minutes.

Moving to the networking team for further investigation. I don't think this is a 3.10 blocker.

Comment 8 Dan Mace 2018-06-15 17:54:21 UTC

Closing unless somebody can provide a reproducer. I saw no issues with the `top` scenario running for 100 minutes in a 3.10 GCP cluster, nor with an `oc logs` tail running for 30 minutes in the same cluster (accessed from the public internet).
