Bug 1885644 - Panic output due to timeouts in openshift-apiserver
Summary: Panic output due to timeouts in openshift-apiserver
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: openshift-apiserver
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Lukasz Szaszkiewicz
QA Contact: Xingxing Xia
URL:
Whiteboard:
Depends On:
Blocks: 1894916
 
Reported: 2020-10-06 15:20 UTC by Sinny Kumari
Modified: 2021-03-16 18:53 UTC
CC List: 6 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1894916 (view as bug list)
Environment:
Undiagnosed panic detected in pod
Last Closed: 2021-02-24 15:23:22 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github openshift/kubernetes | pull 456 | 0 | None | closed | UPSTREAM 95002: no stack on panic timeout | 2021-02-01 06:34:42 UTC
Github openshift/oauth-apiserver | pull 30 | 0 | None | closed | bumps kubernetes-apiserver | 2021-02-01 06:33:57 UTC
Github openshift/openshift-apiserver | pull 160 | 0 | None | closed | bumps kubernetes-apiserver | 2021-02-01 06:33:57 UTC
Red Hat Knowledge Base (Solution) | 5880571 | 0 | None | None | None | 2021-03-16 18:53:06 UTC
Red Hat Product Errata | RHSA-2020:5633 | 0 | None | None | None | 2021-02-24 15:24:10 UTC

Description Sinny Kumari 2020-10-06 15:20:06 UTC
test:
Undiagnosed panic detected in pod 

is failing frequently in CI, see search results:
https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=Undiagnosed+panic+detected+in+pod


Several OVN upgrade jobs from 4.5 -> 4.6 failed.

One of the failing job runs: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-ovn-upgrade-4.5-stable-to-4.6-ci/1313309253259235328

Error message:
pods/openshift-apiserver_apiserver-7d87777d99-lx9f4_openshift-apiserver.log.gz:E1006 04:09:10.949495       1 runtime.go:78] Observed a panic: &errors.errorString{s:"killing connection/stream because serving request timed out and response had been started"} (killing connection/stream because serving request timed out and response had been started)
pods/openshift-apiserver_apiserver-7d87777d99-lx9f4_openshift-apiserver.log.gz:E1006 04:09:11.910858       1 runtime.go:78] Observed a panic: &errors.errorString{s:"killing connection/stream because serving request timed out and response had been started"} (killing connection/stream because serving request timed out and response had been started)

Comment 1 Stefan Schimanski 2020-10-07 15:09:02 UTC
This is not a classical panic. We have a fix in kube-apiserver to make it prettier.
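
For context on why the log above says "Observed a panic" even though nothing crashed: the timeout filter deliberately panics to kill a connection whose response has already started, and the fix only changes which value it panics with. The snippet below is a minimal, standalone sketch using plain net/http, not the apiserver's actual filter chain (the "Observed a panic" lines above come from the apiserver's own crash handler at runtime.go:78 rather than from net/http, but the fix relies on the same sentinel convention): panicking with an ordinary error value, as the old errConnKilled did, is logged with a full stack trace, while panicking with the http.ErrAbortHandler sentinel still aborts the connection but suppresses the trace.

package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

func main() {
	// Old behaviour: panicking with an ordinary error value (what
	// errConnKilled amounts to) makes the server log the panic together
	// with a full goroutine stack trace.
	noisy := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		panic(errors.New("killing connection/stream because serving request timed out and response had been started"))
	})

	// Fixed behaviour: panicking with the http.ErrAbortHandler sentinel
	// still tears the connection down, but the stack trace is suppressed.
	quiet := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		panic(http.ErrAbortHandler)
	})

	for _, tc := range []struct {
		name    string
		handler http.Handler
	}{{"noisy", noisy}, {"quiet", quiet}} {
		srv := httptest.NewServer(tc.handler)
		_, err := http.Get(srv.URL) // the aborted stream surfaces as a client-side error
		fmt.Printf("%s handler: client saw: %v\n", tc.name, err)
		srv.Close()
	}
}

Running this prints an "http: panic serving ..." line plus a stack trace for the noisy handler only; the quiet handler aborts silently, which is the behaviour change the "no stack on panic timeout" PR introduces for the timeout filter.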

Comment 2 Lukasz Szaszkiewicz 2020-10-23 08:08:57 UTC
I have opened a WIP PR, https://github.com/kubernetes/kubernetes/pull/95002, but haven't had time to finish it yet.

Comment 4 Xingxing Xia 2020-11-26 10:08:09 UTC
Checked the code of the several PRs above. My understanding:
The PRs add a new metric, apiserver_request_aborts_total. When the server calls ServeHTTP and hits the timeout, it now panics with ErrAbortHandler instead of errConnKilled, because the former suppresses the stack trace this bug complains about. The handler's "defer" function also gains a condition check: if the recovered error is ErrAbortHandler, RecordRequestAbort is called to record the "apiserver_request_aborts_total" metric (a minimal sketch of this pattern follows below).
I'm thinking about how to craft test conditions that make the openshift-apiserver / oauth-apiserver hit the timeout, to verify the above understanding.
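
A minimal sketch of the defer-and-record pattern described above, assuming a simplified label set. The withAbortMetric wrapper and its labels are illustrative stand-ins, not the actual wrap.go / RecordRequestAbort code from the PRs; only the metric name follows the one mentioned above.

package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
)

// requestAborts is an illustrative stand-in for the upstream
// apiserver_request_aborts_total metric; the label set here is simplified.
var requestAborts = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "apiserver_request_aborts_total",
		Help: "Number of requests which the apiserver aborted, possibly due to a timeout.",
	},
	[]string{"verb", "resource"},
)

// withAbortMetric mimics the pattern described above: a deferred function
// recovers the panic value, records the abort when it is the
// http.ErrAbortHandler sentinel, and re-panics so the connection is still
// torn down (without a stack trace being printed).
func withAbortMetric(verb, resource string, handler http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer func() {
			if err := recover(); err != nil {
				if err == http.ErrAbortHandler {
					requestAborts.WithLabelValues(verb, resource).Inc()
				}
				panic(err) // propagate so the server still aborts the stream
			}
		}()
		handler.ServeHTTP(w, r)
	})
}

func main() {
	prometheus.MustRegister(requestAborts)

	// A handler that "times out": it panics with the sentinel, as the fixed
	// timeout filter does once the response has already been started.
	timedOut := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		panic(http.ErrAbortHandler)
	})

	srv := httptest.NewServer(withAbortMetric("GET", "imagestreams", timedOut))
	defer srv.Close()

	_, err := http.Get(srv.URL) // the client sees the aborted connection as an error
	fmt.Println("client error:", err)
	fmt.Println("aborts recorded:",
		testutil.ToFloat64(requestAborts.WithLabelValues("GET", "imagestreams")))
}

The real metric carries the richer label set shown in comment 8 (apiserver, endpoint, group, resource, scope, verb, version, and so on).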

Comment 5 Xingxing Xia 2020-11-26 11:48:52 UTC
Hi Lukasz, could you provide some hints on how to make the timeout happen, so the bug can be verified through e2e-style tests rather than the PR's unit tests?

Comment 8 Xingxing Xia 2020-12-07 11:17:31 UTC
Tested in a 4.7.0-0.nightly-2020-12-04-013308 env:
In the fresh env, the KAS (kube-apiserver) logs contain no "timeout or abort while handling" messages, and on the Prometheus page apiserver_request_aborts_total has a count of 0.
Then run:
while true; do
  oc get projects -o json -A > /dev/null
done &
And run:
oc delete pod --all -n openshift-apiserver --grace-period=1

Check the KAS logs again; "wrap.go:54] timeout or abort while handling" messages appear:
...
2020-12-07T11:07:33.307004669Z 2020/12/07 11:07:33 httputil: ReverseProxy read error during body copy: http2: server sent GOAWAY and closed the connection; LastStreamID=118463, ErrCode=NO_ERROR, debug=""
2020-12-07T11:07:33.307004669Z E1207 11:07:33.306324      19 wrap.go:54] timeout or abort while handling: GET "/apis/image.openshift.io/v1/imagestreams"
...
Check the Prometheus page again; there are apiserver_request_aborts_total series with apiserver="kube-apiserver":
apiserver_request_aborts_total{apiserver="kube-apiserver", endpoint="https", group="apps.openshift.io", instance="10.0.64.62:6443", job="apiserver", namespace="default", resource="deploymentconfigs", scope="cluster", service="kubernetes", verb="WATCH", version="v1"}
...

The result matches the kube-apiserver PR's expectation. Discussed with Dev: the similar PRs for the aggregated apiservers are hard to craft test steps for, and they're not critical, so those tests can be skipped. Moving to VERIFIED.

Comment 11 errata-xmlrpc 2021-02-24 15:23:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

