Bug 1885644
Summary: | Panic output due to timeouts in openshift-apiserver | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Sinny Kumari <skumari> | |
Component: | openshift-apiserver | Assignee: | Lukasz Szaszkiewicz <lszaszki> | |
Status: | CLOSED ERRATA | QA Contact: | Xingxing Xia <xxia> | |
Severity: | medium | Docs Contact: | ||
Priority: | medium | |||
Version: | 4.5 | CC: | aos-bugs, lszaszki, mfojtik, oarribas, slaznick, sttts | |
Target Milestone: | --- | |||
Target Release: | 4.7.0 | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | No Doc Update | ||
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 1894916 (view as bug list) | Environment: |
Undiagnosed panic detected in pod
|
|
Last Closed: | 2021-02-24 15:23:22 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1894916 |
Description
Sinny Kumari
2020-10-06 15:20:06 UTC
This is not a classical panic. We have a fix in kube-apiserver to make it prettier. I have opened a WIP PR https://github.com/kubernetes/kubernetes/pull/95002, haven't got time to finish it Checked above several PRs code. Understood: The PRs add a new metrics apiserver_request_aborts_total. When the server calls ServeHTTP, if it hits timeout, it will call panic with ErrAbortHandler instead of errConnKilled, because the former will suppress the trace as this bug complained. And the handler's "defer" function adds a condition checking: if the error is ErrAbortHandler, then RecordRequestAbort will be called to record "apiserver_request_aborts_total" metrics. I'm thinking about how to craft test conditions to make the openshift-apiserver / oauth-apiserver hit timeout to verify above understanding. Hi, Lukasz, could you provide some hints how to make the timeout happen to verify the bug if via e2e perspective tests instead of by the PR's unit tests? Tested in 4.7.0-0.nightly-2020-12-04-013308 env: In the fresh env, check KAS logs, no any "timeout or abort while handling" messages, check Prometheus page, apiserver_request_aborts_total gets 0 count. Then run: while true; do oc get projects -o json -A > /dev/null done & And run: oc delete pod --all -n openshift-apiserver --grace-period=1 Check KAS again, "wrap.go:54] timeout or abort while handling" messages appear: ... 2020-12-07T11:07:33.307004669Z 2020/12/07 11:07:33 httputil: ReverseProxy read error during body copy: http2: server sent GOAWAY and closed the connection; LastStreamID=118463, ErrCode=NO_ERROR, debug="" 2020-12-07T11:07:33.307004669Z E1207 11:07:33.306324 19 wrap.go:54] timeout or abort while handling: GET "/apis/image.openshift.io/v1/imagestreams" ... Check Prometheus page again, there are apiserver_request_aborts_total lines with apiserver="kube-apiserver": apiserver_request_aborts_total{apiserver="kube-apiserver", endpoint="https", group="apps.openshift.io", instance="10.0.64.62:6443", job="apiserver", namespace="default", resource="deploymentconfigs", scope="cluster", service="kubernetes", verb="WATCH", version="v1"} ... The result is the kube-apiserver PR's expectation. Discussed with Dev, the similar aggregated apiservers' PRs are hard to craft steps to test, they're not critical, could ignore the tests. So moving to VERIFIED Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633 |