Bug 1879232
| Summary: | API request timed out | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Santiago Maudet <smaudet> |
| Component: | openshift-apiserver | Assignee: | Lukasz Szaszkiewicz <lszaszki> |
| Status: | CLOSED NOTABUG | QA Contact: | Xingxing Xia <xxia> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 4.5 | CC: | aconstan, akuriyan, alchan, amsingh, aos-bugs, braander, bshirren, cpassare, dyocum, jcoscia, jcrumple, jseunghw, ktadimar, lszaszki, malonso, mfojtik, morgan.peterman, ngirard, oarribas, openshift-bugs-escalate, rdiscala, rgregory, rvanderp, shsaxena, sttts, surya |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | 4.7.0 | | |
| Hardware: | All | | |
| OS: | All | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-12-04 10:33:40 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Santiago Maudet 2020-09-15 17:59:48 UTC
We need the must-gather logs shortly after the incident. Looking into the customer case, I see a crash-looping pod and failing SubjectAccessReviews that time out or show messages like:

```
2020-09-14T19:16:04.140061553Z E0914 19:16:04.140000 1 upgradeaware.go:327] Error proxying data from backend to client: write tcp 10.26.17.45:6443->10.26.17.55:33292: write: broken pipe
```

suggesting that the network connection is cut off. Without further logs, this pretty much looks like a problem of the service network (SDN) on the masters. Please provide:

- kubelet logs
- logs of failing pods
- logs of everything around SDN on the masters

Moving to SDN to investigate network health on the masters.

Hi team, Workaround 2 from KCS https://access.redhat.com/solutions/5448851 solved the issue. My case is closed now.

Comment #49 (Brendan Shirren): Issue still affecting v4.5.16 and crippling master nodes after hours/days:

```
155360 I1112 02:04:12.571661 1 log.go:172] http2: panic serving 10.6.14.84:51786: killing connection/stream because serving request timed out and response had been started
```

Comment #50 (Lukasz Szaszkiewicz):

> (In reply to Brendan Shirren from comment #49)
>
> Issue still affecting v4.5.16 and crippling master nodes after hours/days
>
> 155360 I1112 02:04:12.571661 1 log.go:172] http2: panic serving 10.6.14.84:51786: killing connection/stream because serving request timed out and response had been started

Hi, could you gather must-gather for analysis?

> (In reply to Lukasz Szaszkiewicz from comment #50)
>
> Hi, could you gather must-gather for analysis?

CU just provided the must-gather in the case. It can be pulled on supportshell now and is at the following Hydra link: https://attachments.access.redhat.com/hydra/rest/cases/02794589/attachments/8612c8b6-22cb-48f5-8029-e2551d196b3d?usePresignedUrl=true
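Since must-gather data is requested twice in this thread, a minimal sketch of collecting it and checking for the timeout panic may be helpful. The standard collection command is `oc adm must-gather`, which needs cluster-admin access and is therefore only shown as a comment; the `apiserver.log` file below is a fabricated stand-in reproducing the panic line quoted in comment #49.

```shell
# On a live cluster, the requested bundle would be collected with
# (commented out here because it requires cluster-admin access):
#   oc adm must-gather --dest-dir=./must-gather

# Fabricated stand-in for a collected apiserver log, containing the
# exact panic line reported in comment #49.
cat > apiserver.log <<'EOF'
155360 I1112 02:04:12.571661 1 log.go:172] http2: panic serving 10.6.14.84:51786: killing connection/stream because serving request timed out and response had been started
EOF

# Count occurrences of the timeout-panic signature.
grep -c "serving request timed out" apiserver.log  # prints: 1
```

Against a real must-gather bundle, the same `grep` would be run recursively (`grep -r`) over the extracted directory to confirm whether the panic is still occurring.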