Bug 1885644 - Panic output due to timeouts in openshift-apiserver
Summary: Panic output due to timeouts in openshift-apiserver
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: openshift-apiserver
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Lukasz Szaszkiewicz
QA Contact: Xingxing Xia
URL:
Whiteboard:
Depends On:
Blocks: 1894916
 
Reported: 2020-10-06 15:20 UTC by Sinny Kumari
Modified: 2021-03-16 18:53 UTC
CC List: 6 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1894916 (view as bug list)
Environment:
Undiagnosed panic detected in pod
Last Closed: 2021-02-24 15:23:22 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github openshift/kubernetes | pull 456 | 0 | None | closed | UPSTREAM 95002: no stack on panic timeout | 2021-02-01 06:34:42 UTC
Github openshift/oauth-apiserver | pull 30 | 0 | None | closed | bumps kubernetes-apiserver | 2021-02-01 06:33:57 UTC
Github openshift/openshift-apiserver | pull 160 | 0 | None | closed | bumps kubernetes-apiserver | 2021-02-01 06:33:57 UTC
Red Hat Knowledge Base (Solution) | 5880571 | 0 | None | None | None | 2021-03-16 18:53:06 UTC
Red Hat Product Errata | RHSA-2020:5633 | 0 | None | None | None | 2021-02-24 15:24:10 UTC

Description Sinny Kumari 2020-10-06 15:20:06 UTC
test:
Undiagnosed panic detected in pod 

is failing frequently in CI, see search results:
https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=Undiagnosed+panic+detected+in+pod


Several OVN upgrade jobs from 4.5 -> 4.6 failed.

One of the failing job runs: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-ovn-upgrade-4.5-stable-to-4.6-ci/1313309253259235328

Error message:
pods/openshift-apiserver_apiserver-7d87777d99-lx9f4_openshift-apiserver.log.gz:E1006 04:09:10.949495       1 runtime.go:78] Observed a panic: &errors.errorString{s:"killing connection/stream because serving request timed out and response had been started"} (killing connection/stream because serving request timed out and response had been started)
pods/openshift-apiserver_apiserver-7d87777d99-lx9f4_openshift-apiserver.log.gz:E1006 04:09:11.910858       1 runtime.go:78] Observed a panic: &errors.errorString{s:"killing connection/stream because serving request timed out and response had been started"} (killing connection/stream because serving request timed out and response had been started)

Comment 1 Stefan Schimanski 2020-10-07 15:09:02 UTC
This is not a classical panic. We have a fix in kube-apiserver to make it prettier.
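
For context on why the log above says "Observed a panic" even though nothing crashed: the timeout filter deliberately panics to kill a connection whose response has already started, and the fix only changes which value it panics with. The snippet below is a minimal, standalone sketch using plain net/http, not the apiserver's actual filter chain (the "Observed a panic" lines above come from the apiserver's own crash handler at runtime.go:78 rather than from net/http, but the fix relies on the same sentinel convention): panicking with an ordinary error value, as the old errConnKilled did, is logged with a full stack trace, while panicking with the http.ErrAbortHandler sentinel still aborts the connection but suppresses the trace.

package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

func main() {
	// Old behaviour: panicking with an ordinary error value (what
	// errConnKilled amounts to) makes the server log the panic together
	// with a full goroutine stack trace.
	noisy := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		panic(errors.New("killing connection/stream because serving request timed out and response had been started"))
	})

	// Fixed behaviour: panicking with the http.ErrAbortHandler sentinel
	// still tears the connection down, but the stack trace is suppressed.
	quiet := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		panic(http.ErrAbortHandler)
	})

	for _, tc := range []struct {
		name    string
		handler http.Handler
	}{{"noisy", noisy}, {"quiet", quiet}} {
		srv := httptest.NewServer(tc.handler)
		_, err := http.Get(srv.URL) // the aborted stream surfaces as a client-side error
		fmt.Printf("%s handler: client saw: %v\n", tc.name, err)
		srv.Close()
	}
}

Running this prints an "http: panic serving ..." line plus a stack trace for the noisy handler only; the quiet handler aborts silently, which is the behaviour change the "no stack on panic timeout" PR introduces for the timeout filter.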

Comment 2 Lukasz Szaszkiewicz 2020-10-23 08:08:57 UTC
I have opened a WIP PR, https://github.com/kubernetes/kubernetes/pull/95002, but haven't had time to finish it yet.

Comment 4 Xingxing Xia 2020-11-26 10:08:09 UTC
Checked the code of the several PRs above. My understanding:
The PRs add a new metric, apiserver_request_aborts_total. When the server calls ServeHTTP and hits the timeout, it now panics with ErrAbortHandler instead of errConnKilled, because the former suppresses the stack trace this bug complains about. The handler's "defer" function also gains a condition check: if the recovered error is ErrAbortHandler, RecordRequestAbort is called to record the "apiserver_request_aborts_total" metric (a minimal sketch of this pattern follows below).
I'm thinking about how to craft test conditions that make the openshift-apiserver / oauth-apiserver hit the timeout, to verify the above understanding.
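
A minimal sketch of the defer-and-record pattern described above, assuming a simplified label set. The withAbortMetric wrapper and its labels are illustrative stand-ins, not the actual wrap.go / RecordRequestAbort code from the PRs; only the metric name follows the one mentioned above.

package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
)

// requestAborts is an illustrative stand-in for the upstream
// apiserver_request_aborts_total metric; the label set here is simplified.
var requestAborts = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "apiserver_request_aborts_total",
		Help: "Number of requests which the apiserver aborted, possibly due to a timeout.",
	},
	[]string{"verb", "resource"},
)

// withAbortMetric mimics the pattern described above: a deferred function
// recovers the panic value, records the abort when it is the
// http.ErrAbortHandler sentinel, and re-panics so the connection is still
// torn down (without a stack trace being printed).
func withAbortMetric(verb, resource string, handler http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer func() {
			if err := recover(); err != nil {
				if err == http.ErrAbortHandler {
					requestAborts.WithLabelValues(verb, resource).Inc()
				}
				panic(err) // propagate so the server still aborts the stream
			}
		}()
		handler.ServeHTTP(w, r)
	})
}

func main() {
	prometheus.MustRegister(requestAborts)

	// A handler that "times out": it panics with the sentinel, as the fixed
	// timeout filter does once the response has already been started.
	timedOut := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		panic(http.ErrAbortHandler)
	})

	srv := httptest.NewServer(withAbortMetric("GET", "imagestreams", timedOut))
	defer srv.Close()

	_, err := http.Get(srv.URL) // the client sees the aborted connection as an error
	fmt.Println("client error:", err)
	fmt.Println("aborts recorded:",
		testutil.ToFloat64(requestAborts.WithLabelValues("GET", "imagestreams")))
}

The real metric carries the richer label set shown in comment 8 (apiserver, endpoint, group, resource, scope, verb, version, and so on).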

Comment 5 Xingxing Xia 2020-11-26 11:48:52 UTC
Hi Lukasz, could you provide some hints on how to make the timeout happen, so the bug can be verified through e2e-style tests rather than the PR's unit tests?

Comment 8 Xingxing Xia 2020-12-07 11:17:31 UTC
Tested in a 4.7.0-0.nightly-2020-12-04-013308 env:
In the fresh env, the KAS (kube-apiserver) logs contain no "timeout or abort while handling" messages, and on the Prometheus page apiserver_request_aborts_total has a count of 0.
Then run:
while true; do
  oc get projects -o json -A > /dev/null
done &
And run:
oc delete pod --all -n openshift-apiserver --grace-period=1

Check the KAS logs again; "wrap.go:54] timeout or abort while handling" messages appear:
...
2020-12-07T11:07:33.307004669Z 2020/12/07 11:07:33 httputil: ReverseProxy read error during body copy: http2: server sent GOAWAY and closed the connection; LastStreamID=118463, ErrCode=NO_ERROR, debug=""
2020-12-07T11:07:33.307004669Z E1207 11:07:33.306324      19 wrap.go:54] timeout or abort while handling: GET "/apis/image.openshift.io/v1/imagestreams"
...
Check the Prometheus page again; there are apiserver_request_aborts_total series with apiserver="kube-apiserver":
apiserver_request_aborts_total{apiserver="kube-apiserver", endpoint="https", group="apps.openshift.io", instance="10.0.64.62:6443", job="apiserver", namespace="default", resource="deploymentconfigs", scope="cluster", service="kubernetes", verb="WATCH", version="v1"}
...

The result matches the kube-apiserver PR's expectation. Discussed with Dev: the similar PRs for the aggregated apiservers are hard to craft test steps for, and they're not critical, so those tests can be skipped. Moving to VERIFIED.

Comment 11 errata-xmlrpc 2021-02-24 15:23:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

