pods/openshift-operator-lifecycle-manager_packageserver-8c6b98c9c-6shtx_packageserver.log.gz:E0106 03:34:18.930878 1 runtime.go:76] Observed a panic: runtime error: invalid memory address or nil pointer dereference

First noticed here: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/14626/rehearse-14626-pull-ci-openshift-installer-master-e2e-vsphere-upi/1346645775966277632

I0106 03:34:18.930563 1 httplog.go:89] "HTTP" verb="GET" URI="/healthz" latency="587.019743ms" userAgent="kube-probe/1.20" srcIP="10.130.0.1:43944" resp=0
E0106 03:34:18.930878 1 runtime.go:76] Observed a panic: runtime error: invalid memory address or nil pointer dereference
goroutine 17500 [running]:
k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP.func1.1(0xc001ee90e0)
	/build/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:106 +0x113
panic(0x1bb2840, 0x2d38cc0)
	/usr/lib/golang/src/runtime/panic.go:969 +0x175
k8s.io/apiserver/plugin/pkg/authorizer/webhook.(*WebhookAuthorizer).Authorize(0xc0001f1aa0, 0x20dd0c0, 0xc000d28d50, 0x20f92c0, 0xc0024aee60, 0xc001eaad08, 0xc2954f, 0x20dd740, 0xc003039040, 0xc00234f5eb)
	/build/vendor/k8s.io/apiserver/plugin/pkg/authorizer/webhook/webhook.go:208 +0x8b9
k8s.io/apiserver/pkg/authorization/union.unionAuthzHandler.Authorize(0xc000788d80, 0x1, 0x1, 0x20dd0c0, 0xc000d28d50, 0x20f92c0, 0xc0024aee60, 0x1, 0x1, 0x1e477df, ...)
	/build/vendor/k8s.io/apiserver/pkg/authorization/union/union.go:52 +0xfe
k8s.io/apiserver/pkg/authorization/union.unionAuthzHandler.Authorize(0xc000364880, 0x2, 0x2, 0x20dd0c0, 0xc000d28d50, 0x20f92c0, 0xc0024aee60, 0x2066140, 0x1aadde0, 0xc000849810, ...)
	/build/vendor/k8s.io/apiserver/pkg/authorization/union/union.go:52 +0xfe
k8s.io/apiserver/pkg/endpoints/filters.WithAuthorization.func1(0x7f7a5c0c29c0, 0xc000b165d8, 0xc000f1c700)
	/build/vendor/k8s.io/apiserver/pkg/endpoints/filters/authorization.go:59 +0x165
net/http.HandlerFunc.ServeHTTP(0xc0003f4140, 0x7f7a5c0c29c0, 0xc000b165d8, 0xc000f1c700)
	/usr/lib/golang/src/net/http/server.go:2054 +0x44
k8s.io/apiserver/pkg/endpoints/filterlatency.trackStarted.func1(0x7f7a5c0c29c0, 0xc000b165d8, 0xc000f1c700)
	/build/vendor/k8s.io/apiserver/pkg/endpoints/filterlatency/filterlatency.go:71 +0x186
net/http.HandlerFunc.ServeHTTP(0xc0003f4180, 0x7f7a5c0c29c0, 0xc000b165d8, 0xc000f1c700)
	/usr/lib/golang/src/net/http/server.go:2054 +0x44
k8s.io/apiserver/pkg/server/filters.WithMaxInFlightLimit.func1(0x7f7a5c0c29c0, 0xc000b165d8, 0xc000f1c700)
	/build/vendor/k8s.io/apiserver/pkg/server/filters/maxinflight.go:184 +0x4cf
net/http.HandlerFunc.ServeHTTP(0xc00084dce0, 0x7f7a5c0c29c0, 0xc000b165d8, 0xc000f1c700)
	/usr/lib/golang/src/net/http/server.go:2054 +0x44

https://search.ci.openshift.org/?search=lifecycle-manager.*Observed+a+panic&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

Showing up in 1% of jobs over the last day; may be a race or initialization failure, or simply an uncaught and rare error. Appears to be in core Kube code (apiserver team), but OLM needs to verify it invokes the auth wrapper appropriately.
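For context on why this shows up as a log line rather than a crashed pod: the "Observed a panic" message is emitted by a recover-based wrapper around the handler chain (the timeout filter at the top of the trace). The sketch below is NOT the actual apiserver filter code, just a minimal illustration of that recover-and-log pattern; withPanicLogging and the /healthz handler are made up for this example.

package main

import (
	"log"
	"net/http"
)

// withPanicLogging wraps a handler and logs any panic it raises, similar in
// spirit to the timeout filter seen at the top of the stack trace above.
func withPanicLogging(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer func() {
			if rec := recover(); rec != nil {
				log.Printf("Observed a panic: %v", rec)
				http.Error(w, "internal error", http.StatusInternalServerError)
			}
		}()
		next.ServeHTTP(w, r)
	})
}

func main() {
	// A handler that dereferences a nil pointer, standing in for the
	// authorizer bug: the wrapper logs the panic and the server keeps serving.
	var nilStatus *struct{ Allowed bool }
	http.Handle("/healthz", withPanicLogging(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		_ = nilStatus.Allowed // nil pointer dereference, recovered and logged above
	})))
	log.Fatal(http.ListenAndServe(":8080", nil))
}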
Assigned it to the API team; going to triage it.
dsover, There is an issue where webhook.go does not handle the error returned from 'webhook.WithExponentialBackoff'. I opened https://github.com/kubernetes/kubernetes/pull/97820. I am hoping this will resolve the panic issue.

> r.Status = result.Status

This is where it's panicking, from https://github.com/kubernetes/kubernetes/blob/v1.20.0/staging/src/k8s.io/apiserver/plugin/pkg/authorizer/webhook/webhook.go#L208

As to why 'result' is nil, I see the following possibilities:
- A: the context associated with the request has already expired and the SAR create was never called.
- B: the retry backoff parameters are not initialized, 'Steps' is zero, and the SAR create was never called.

B is not likely; otherwise we would see more of this issue in the package server logs.

Is it possible for you (or anyone on the OLM team) to add the above patch to package server and run it on CI? I expect to see the underlying error instead of the panic, which should give us more insight. In the meantime I will keep digging.
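To make the failure mode concrete, here is a self-contained toy model of possibility A. This is NOT the actual webhook.go code; withBackoff, sarResult, and authorize are invented for illustration only. If the backoff helper returns an error without ever calling the SAR create, 'result' stays nil, and dereferencing result.Status is exactly the panic in the stack trace. Checking the error before the dereference, which is what the patch does, surfaces a normal error instead.

package main

import (
	"context"
	"fmt"
)

// sarResult stands in for the SubjectAccessReview returned by the API call.
type sarResult struct{ Status string }

// withBackoff stands in for the retry wrapper: it bails out before calling fn
// if the request context has already expired, so fn may never run.
func withBackoff(ctx context.Context, fn func() error) error {
	if err := ctx.Err(); err != nil {
		return err // fn never ran; the caller's result is still nil
	}
	return fn()
}

func authorize(ctx context.Context) (string, error) {
	var result *sarResult
	err := withBackoff(ctx, func() error {
		result = &sarResult{Status: "allowed"}
		return nil
	})
	if err != nil {
		// Without this check, result.Status below would be the nil pointer
		// dereference seen in the package server logs.
		return "", fmt.Errorf("SAR create failed: %w", err)
	}
	return result.Status, nil
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // simulate a request whose context has already expired
	if _, err := authorize(ctx); err != nil {
		fmt.Println("error instead of panic:", err)
	}
}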
*** Bug 1915300 has been marked as a duplicate of this bug. ***
Doesn't this need a PR to OLM to bump dependencies before it goes to ON_QA? The bug is in the vendored webhook code on the client side, not on the openshift/kubernetes side, right?
dsover, OLM needs to bump the dependencies to include the patch https://github.com/kubernetes/kubernetes/pull/97820. I am assigning the BZ back to OLM team so they can follow up. Please feel free to assign it back to the api team if you feel otherwise.
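For reference, the bump on the OLM side is essentially a go.mod change that pulls in a k8s.io/apiserver level containing that patch, followed by re-vendoring. The fragment below is purely illustrative; the module versions are assumptions, and the actual package-server PR may look different.

// Illustrative go.mod fragment only (versions are assumed, not the exact
// ones used by package server):
require (
	k8s.io/apiserver v0.21.0 // a level that includes kubernetes#97820
)
// then refresh the vendored copy: go mod tidy && go mod vendor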
I have a PR open for the dependency bump.
Can we get the OLM PR over the finish line? Still happening.
Shows up in 4% of all CI runs.
With this fix in place, instead of the panics we should see errors now. I am interested in seeing what these errors are.
Searched from https://search.ci.openshift.org/?search=Undiagnosed+panic+detected+in+pod&maxAge=168h&context=1&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

But I still see OLM package-server pods panicking on the master branch, where the fix PR has merged, for example:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/15992/rehearse-15992-pull-ci-cri-o-cri-o-master-e2e-gcp/1366635982232752128
The fix is in 4.8, and from https://search.ci.openshift.org/?search=lifecycle-manager.*Observed+a+panic&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job, all the jobs with the panic are 4.7 jobs. Can you take another look?
Hi Ankita,

Sure, I guess the master branch aligns with 4.8 now. As you can see from https://search.ci.openshift.org/?search=Undiagnosed+panic+detected+in+pod&maxAge=168h&context=1&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=job there are many failures on the master branch, for example:
https://prow.ci.openshift.org/job-history/origin-ci-test/pr-logs/directory/pull-ci-cri-o-cri-o-master-e2e-gcp
A screenshot: https://user-images.githubusercontent.com/15416633/110405607-cbb7eb80-80bb-11eb-8dad-eaacf5fca054.png
https://prow.ci.openshift.org/job-history/origin-ci-test/pr-logs/directory/pull-ci-openshift-kni-cnf-features-deploy-master-e2e-gcp-ovn
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift-kni_cnf-features-deploy/444/pull-ci-openshift-kni-cnf-features-deploy-master-e2e-gcp-ovn/1368931986437050368

I am changing the status to ASSIGNED first; please let me know if I missed something, thanks!
jiazha, I briefly looked at the package server logs for https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-okd-installer-e2e-aws-upgrade/1369286393666211840/artifacts/ and I did not find the panics. Can you please recheck your search?
Hi, I took a look at the failing jobs, and those still use 4.7:

2021/03/08 14:30:00 Resolved ocp/4.7:base to sha256:a7218e69175bb91140cde07a03cea173c041fe44cc13d5bd317ddee9c9ed7957
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift-kni_cnf-features-deploy/444/pull-ci-openshift-kni-cnf-features-deploy-master-e2e-gcp-ovn/1368931986437050368/artifacts/e2e-gcp-ovn/gather-extra/artifacts/clusterversion.json

I checked most of the failing jobs from https://search.ci.openshift.org/?search=Undiagnosed+panic+detected+in+pod&maxAge=168h&context=1&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=job and they all have 4.7 clusters in their artifacts. Can you take a look?
Hi Ankita,

Thanks for your updates! Most failures are for 4.7:
https://search.ci.openshift.org/?search=Undiagnosed+panic+detected+in+pod&maxAge=168h&context=1&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

Verify it.
Still seeing this - from today: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1379561668190670848 and the stack trace was in the webhook.
https://search.ci.openshift.org/?search=panic&maxAge=48h&context=1&type=junit&name=4%5C.8&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Was this accidentally reverted?
Still getting the error for 4.8, for example:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade/1380373905826385920

pods/openshift-operator-lifecycle-manager_packageserver-79d5c587c7-8j5cr_packageserver.log.gz:E0409 06:06:05.783918 1 runtime.go:76] Observed a panic: runtime error: invalid memory address or nil pointer dereference

Searched from https://search.ci.openshift.org/?search=Undiagnosed+panic+detected+in+pod&maxAge=168h&context=1&type=junit&name=4%5C.8&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
https://search.ci.openshift.org/?search=Undiagnosed+panic+detected+in+pod&maxAge=168h&context=1&type=junit&name=4\.8&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Looks like the fix wasn't in at the time; the panic isn't showing up when I rerun the search. Can you please check again?
Just echoing Ankita's comment that we haven't seen that panic being produced in the CI search logs in the last three days, but I am moving this from ON_QA -> MODIFIED so it can be properly picked up.
Hi Ankita, Tim,

Yes, this panic has not been reported in the last 3 days; I can only find it reported 5 days ago. Verify it.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438