test: Undiagnosed panic detected in pod is failing frequently in CI, see search results:
https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=Undiagnosed+panic+detected+in+pod

Several ovn upgrade jobs from 4.5 -> 4.6 failed. One of the job links:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-ovn-upgrade-4.5-stable-to-4.6-ci/1313309253259235328

Error message:
pods/openshift-apiserver_apiserver-7d87777d99-lx9f4_openshift-apiserver.log.gz:E1006 04:09:10.949495 1 runtime.go:78] Observed a panic: &errors.errorString{s:"killing connection/stream because serving request timed out and response had been started"} (killing connection/stream because serving request timed out and response had been started)
pods/openshift-apiserver_apiserver-7d87777d99-lx9f4_openshift-apiserver.log.gz:E1006 04:09:11.910858 1 runtime.go:78] Observed a panic: &errors.errorString{s:"killing connection/stream because serving request timed out and response had been started"} (killing connection/stream because serving request timed out and response had been started)
This is not a classical panic. We have a fix in kube-apiserver to make it prettier.
I have opened a WIP PR (https://github.com/kubernetes/kubernetes/pull/95002) but haven't had time to finish it yet.
I checked the code of the PRs above. My understanding: the PRs add a new metric, apiserver_request_aborts_total. When the server calls ServeHTTP and hits the timeout, it now panics with ErrAbortHandler instead of errConnKilled, because the former suppresses the stack trace this bug complains about. The handler's "defer" function adds a condition check: if the recovered error is ErrAbortHandler, RecordRequestAbort is called to record the apiserver_request_aborts_total metric. I'm thinking about how to craft test conditions that make the openshift-apiserver / oauth-apiserver hit the timeout, to verify the above understanding.
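To make sure I read the mechanism correctly, here is a minimal, self-contained sketch of the pattern as I understand it (this is NOT the actual kube-apiserver code; the counter and the withRequestAbortMetric wrapper are simplified stand-ins for RecordRequestAbort and the real filter chain):

package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Simplified stand-in for apiserver_request_aborts_total; the real metric
// carries labels such as verb, group, version, resource and scope.
var requestAborts = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "apiserver_request_aborts_total",
	Help: "Number of requests the apiserver aborted, e.g. due to a timeout.",
})

// withRequestAbortMetric plays the role of the "defer" described above:
// if the inner handler panicked with http.ErrAbortHandler, record the abort
// and re-panic; net/http suppresses the stack trace for ErrAbortHandler.
func withRequestAbortMetric(inner http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer func() {
			if err := recover(); err != nil {
				if err == http.ErrAbortHandler {
					requestAborts.Inc()
				}
				panic(err)
			}
		}()
		inner.ServeHTTP(w, r)
	})
}

// slowHandler mimics the timeout path after the response has already been
// started: it can no longer write an error response, so it aborts. Previously
// this panicked with errConnKilled, which produced the noisy trace in this
// bug; panicking with ErrAbortHandler keeps the abort quiet.
func slowHandler(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK) // response has been started
	time.Sleep(50 * time.Millisecond)
	panic(http.ErrAbortHandler)
}

func main() {
	prometheus.MustRegister(requestAborts)
	http.ListenAndServe(":8080", withRequestAbortMetric(http.HandlerFunc(slowHandler)))
}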
Hi Lukasz, could you provide some hints on how to make the timeout happen, so that the bug can be verified from an e2e perspective rather than only through the PR's unit tests?
Tested in a 4.7.0-0.nightly-2020-12-04-013308 env:

In the fresh env, the KAS logs contain no "timeout or abort while handling" messages, and on the Prometheus page apiserver_request_aborts_total has a count of 0.

Then run:
while true; do oc get projects -o json -A > /dev/null; done &

And run:
oc delete pod --all -n openshift-apiserver --grace-period=1

Check KAS again; "wrap.go:54] timeout or abort while handling" messages appear:
...
2020-12-07T11:07:33.307004669Z 2020/12/07 11:07:33 httputil: ReverseProxy read error during body copy: http2: server sent GOAWAY and closed the connection; LastStreamID=118463, ErrCode=NO_ERROR, debug=""
2020-12-07T11:07:33.307004669Z E1207 11:07:33.306324 19 wrap.go:54] timeout or abort while handling: GET "/apis/image.openshift.io/v1/imagestreams"
...

Check the Prometheus page again; there are now apiserver_request_aborts_total series with apiserver="kube-apiserver":
apiserver_request_aborts_total{apiserver="kube-apiserver", endpoint="https", group="apps.openshift.io", instance="10.0.64.62:6443", job="apiserver", namespace="default", resource="deploymentconfigs", scope="cluster", service="kubernetes", verb="WATCH", version="v1"}
...

This matches the kube-apiserver PR's expectation. Discussed with Dev: for the similar aggregated apiservers' PRs it is hard to craft steps to test them, and they are not critical, so those tests can be skipped. Moving to VERIFIED.
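For completeness, a small sketch of how the same metric check could be done programmatically against the cluster's Prometheus (this is only an illustration; the route address is a placeholder and authentication, e.g. passing a bearer token via a custom RoundTripper, is omitted):

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Placeholder address; in a real cluster this would be the
	// openshift-monitoring Prometheus route, and the client would need a
	// bearer token (omitted here for brevity).
	client, err := api.NewClient(api.Config{
		Address: "https://prometheus-k8s-openshift-monitoring.apps.example.com",
	})
	if err != nil {
		panic(err)
	}
	promAPI := promv1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// A non-zero kube-apiserver series corresponds to what the manual
	// verification above observed on the Prometheus page.
	result, warnings, err := promAPI.Query(ctx,
		`sum by (apiserver) (apiserver_request_aborts_total)`, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}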
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633