Bug 1885644

Summary: Panic output due to timeouts in openshift-apiserver
Product: OpenShift Container Platform
Reporter: Sinny Kumari <skumari>
Component: openshift-apiserver
Assignee: Lukasz Szaszkiewicz <lszaszki>
Status: CLOSED ERRATA
QA Contact: Xingxing Xia <xxia>
Severity: medium
Priority: medium
Version: 4.5
CC: aos-bugs, lszaszki, mfojtik, oarribas, slaznick, sttts
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Environment: Undiagnosed panic detected in pod
Last Closed: 2021-02-24 15:23:22 UTC
Type: Bug
Bug Blocks: 1894916

Description Sinny Kumari 2020-10-06 15:20:06 UTC
"Undiagnosed panic detected in pod" is failing frequently in CI; see search results.

Several OVN upgrade jobs from 4.5 -> 4.6 failed.

One of the job links: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-ovn-upgrade-4.5-stable-to-4.6-ci/1313309253259235328

Error message:
pods/openshift-apiserver_apiserver-7d87777d99-lx9f4_openshift-apiserver.log.gz:E1006 04:09:10.949495       1 runtime.go:78] Observed a panic: &errors.errorString{s:"killing connection/stream because serving request timed out and response had been started"} (killing connection/stream because serving request timed out and response had been started)
pods/openshift-apiserver_apiserver-7d87777d99-lx9f4_openshift-apiserver.log.gz:E1006 04:09:11.910858       1 runtime.go:78] Observed a panic: &errors.errorString{s:"killing connection/stream because serving request timed out and response had been started"} (killing connection/stream because serving request timed out and response had been started)

Comment 1 Stefan Schimanski 2020-10-07 15:09:02 UTC
This is not a classical panic. We have a fix in kube-apiserver to make the output prettier.

Comment 2 Lukasz Szaszkiewicz 2020-10-23 08:08:57 UTC
I have opened a WIP PR, https://github.com/kubernetes/kubernetes/pull/95002, but haven't had time to finish it yet.

Comment 4 Xingxing Xia 2020-11-26 10:08:09 UTC
Checked the code of the several PRs above. My understanding:
The PRs add a new metric, apiserver_request_aborts_total. When the server calls ServeHTTP and hits a timeout, it now panics with ErrAbortHandler instead of errConnKilled, because the former suppresses the stack trace this bug complained about. The handler's deferred recovery function adds a condition check: if the recovered error is ErrAbortHandler, RecordRequestAbort is called to record the apiserver_request_aborts_total metric.
I'm thinking about how to craft test conditions that make the openshift-apiserver / oauth-apiserver hit a timeout, to verify the above understanding.

Comment 5 Xingxing Xia 2020-11-26 11:48:52 UTC
Hi Lukasz, could you provide some hints on how to make the timeout happen, so I can verify this bug via e2e tests rather than through the PR's unit tests?

Comment 8 Xingxing Xia 2020-12-07 11:17:31 UTC
Tested in a 4.7.0-0.nightly-2020-12-04-013308 env:
In the fresh env, checking the KAS logs shows no "timeout or abort while handling" messages, and on the Prometheus page apiserver_request_aborts_total has a count of 0.
Then run:
while true; do
  oc get projects -o json -A > /dev/null
done &
And run:
oc delete pod --all -n openshift-apiserver --grace-period=1

Check the KAS logs again; "wrap.go:54] timeout or abort while handling" messages appear:
2020-12-07T11:07:33.307004669Z 2020/12/07 11:07:33 httputil: ReverseProxy read error during body copy: http2: server sent GOAWAY and closed the connection; LastStreamID=118463, ErrCode=NO_ERROR, debug=""
2020-12-07T11:07:33.307004669Z E1207 11:07:33.306324      19 wrap.go:54] timeout or abort while handling: GET "/apis/image.openshift.io/v1/imagestreams"
Check the Prometheus page again; there are apiserver_request_aborts_total series with apiserver="kube-apiserver":
apiserver_request_aborts_total{apiserver="kube-apiserver", endpoint="https", group="apps.openshift.io", instance="", job="apiserver", namespace="default", resource="deploymentconfigs", scope="cluster", service="kubernetes", verb="WATCH", version="v1"}

The result matches the kube-apiserver PR's expectation. Discussed with Dev: the analogous PRs for the aggregated apiservers are hard to craft test steps for, and they're not critical, so those tests can be skipped. Moving to VERIFIED.

Comment 11 errata-xmlrpc 2021-02-24 15:23:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.