Bug 1913525

Summary:	Panic in OLM packageserver when invoking webhook authorization endpoint
Product:	OpenShift Container Platform	Reporter:	Clayton Coleman <ccoleman>
Component:	OLM	Assignee:	Ankita Thomas <ankithom>
OLM sub component:	OLM	QA Contact:	Jian Zhang <jiazha>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	high
Priority:	medium	CC:	akashem, ankithom, aos-bugs, bluddy, dsover, htariq, jiazha, krizza, maszulik, mfojtik, tflannag, xxia
Version:	4.7
Target Milestone:	---
Target Release:	4.8.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Cause: k8s.io/apiserver's was not handling context errors for the webhook authorizer Consequence: context errors like timeouts caused the authorizer to panic Fix: bump the apiserver version to include upstream fix for the issue Result: authorizer can gracefully handle context errors.	Story Points:	---
Clone Of:		Environment:	Undiagnosed panic detected in pod
Last Closed:	2021-07-27 22:35:38 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1933839

Description Clayton Coleman 2021-01-07 01:37:35 UTC

pods/openshift-operator-lifecycle-manager_packageserver-8c6b98c9c-6shtx_packageserver.log.gz:E0106 03:34:18.930878       1 runtime.go:76] Observed a panic: runtime error: invalid memory address or nil pointer dereference

First noticed here: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/14626/rehearse-14626-pull-ci-openshift-installer-master-e2e-vsphere-upi/1346645775966277632

I0106 03:34:18.930563       1 httplog.go:89] "HTTP" verb="GET" URI="/healthz" latency="587.019743ms" userAgent="kube-probe/1.20" srcIP="10.130.0.1:43944" resp=0
E0106 03:34:18.930878       1 runtime.go:76] Observed a panic: runtime error: invalid memory address or nil pointer dereference
goroutine 17500 [running]:
k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP.func1.1(0xc001ee90e0)
	/build/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:106 +0x113
panic(0x1bb2840, 0x2d38cc0)
	/usr/lib/golang/src/runtime/panic.go:969 +0x175
k8s.io/apiserver/plugin/pkg/authorizer/webhook.(*WebhookAuthorizer).Authorize(0xc0001f1aa0, 0x20dd0c0, 0xc000d28d50, 0x20f92c0, 0xc0024aee60, 0xc001eaad08, 0xc2954f, 0x20dd740, 0xc003039040, 0xc00234f5eb)
	/build/vendor/k8s.io/apiserver/plugin/pkg/authorizer/webhook/webhook.go:208 +0x8b9
k8s.io/apiserver/pkg/authorization/union.unionAuthzHandler.Authorize(0xc000788d80, 0x1, 0x1, 0x20dd0c0, 0xc000d28d50, 0x20f92c0, 0xc0024aee60, 0x1, 0x1, 0x1e477df, ...)
	/build/vendor/k8s.io/apiserver/pkg/authorization/union/union.go:52 +0xfe
k8s.io/apiserver/pkg/authorization/union.unionAuthzHandler.Authorize(0xc000364880, 0x2, 0x2, 0x20dd0c0, 0xc000d28d50, 0x20f92c0, 0xc0024aee60, 0x2066140, 0x1aadde0, 0xc000849810, ...)
	/build/vendor/k8s.io/apiserver/pkg/authorization/union/union.go:52 +0xfe
k8s.io/apiserver/pkg/endpoints/filters.WithAuthorization.func1(0x7f7a5c0c29c0, 0xc000b165d8, 0xc000f1c700)
	/build/vendor/k8s.io/apiserver/pkg/endpoints/filters/authorization.go:59 +0x165
net/http.HandlerFunc.ServeHTTP(0xc0003f4140, 0x7f7a5c0c29c0, 0xc000b165d8, 0xc000f1c700)
	/usr/lib/golang/src/net/http/server.go:2054 +0x44
k8s.io/apiserver/pkg/endpoints/filterlatency.trackStarted.func1(0x7f7a5c0c29c0, 0xc000b165d8, 0xc000f1c700)
	/build/vendor/k8s.io/apiserver/pkg/endpoints/filterlatency/filterlatency.go:71 +0x186
net/http.HandlerFunc.ServeHTTP(0xc0003f4180, 0x7f7a5c0c29c0, 0xc000b165d8, 0xc000f1c700)
	/usr/lib/golang/src/net/http/server.go:2054 +0x44
k8s.io/apiserver/pkg/server/filters.WithMaxInFlightLimit.func1(0x7f7a5c0c29c0, 0xc000b165d8, 0xc000f1c700)
	/build/vendor/k8s.io/apiserver/pkg/server/filters/maxinflight.go:184 +0x4cf
net/http.HandlerFunc.ServeHTTP(0xc00084dce0, 0x7f7a5c0c29c0, 0xc000b165d8, 0xc000f1c700)
	/usr/lib/golang/src/net/http/server.go:2054 +0x44

https://search.ci.openshift.org/?search=lifecycle-manager.*Observed+a+panic&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

1% of jobs over the last day, may be a race or initialization failure, or simply an uncaught and rare error.  Appears to be in core Kube code (apiserver team) but OLM needs to verify it invokes the auth wrapper appropriately.

Comment 1 Abu Kashem 2021-01-07 19:01:49 UTC

assigned it to the api team, going to triage it.

Comment 3 Abu Kashem 2021-01-07 21:57:59 UTC

dsover,

There is an issue where webhook.go does not handle the error returned from 'webhook.WithExponentialBackoff'. I opened https://github.com/kubernetes/kubernetes/pull/97820. I am hoping this will resolve the panic issue.


> r.Status = result.Status (this is where it's panicing)
from https://github.com/kubernetes/kubernetes/blob/v1.20.0/staging/src/k8s.io/apiserver/plugin/pkg/authorizer/webhook/webhook.go#L208 

As to why 'result' is nil, I see the following possibilities:
- A: the context associated with the request has already expired and the SAR create was never called.
- B: the retry backoff parameters are not initialized and 'Steps' is zero and the SAR create was never called.

B is not likely, otherwise we should see more of this issue in the package server logs.

Is it possible for you (or anyone in OLM team) to add the above patch to package server and run it on CI. I expect to see the underlying error instead of the panic. This should give us more insights. In the meantime I will keep digging.

Comment 6 Evan Cordell 2021-01-12 14:55:51 UTC

*** Bug 1915300 has been marked as a duplicate of this bug. ***

Comment 7 Clayton Coleman 2021-01-13 18:57:34 UTC

Doesn't this need a PR to OLM to bump dependencies before it goes to ON_QA?  The bug is in the vendored webhook code on the client side, not openshift/kubernetes side?

Comment 8 Abu Kashem 2021-01-13 20:06:12 UTC

dsover,
OLM needs to bump the dependencies to include the patch https://github.com/kubernetes/kubernetes/pull/97820. I am assigning the BZ back to OLM team so they can follow up. Please feel free to assign it back to the api team if you feel otherwise.

Comment 9 Ankita Thomas 2021-01-14 19:54:43 UTC

I have a PR open for the dependency bump

Comment 11 Clayton Coleman 2021-01-28 19:16:03 UTC

Can we get the OLM PR over the finish line?  Still happening.

Comment 12 Clayton Coleman 2021-01-28 19:16:44 UTC

Shows up in 4% of all CI runs.

Comment 13 Abu Kashem 2021-02-10 16:08:34 UTC

With this fix in place, instead of the panics we should see errors now. I am interested in seeing what these errors are.

Comment 15 Jian Zhang 2021-03-02 09:17:54 UTC

Searched from https://search.ci.openshift.org/?search=Undiagnosed+panic+detected+in+pod&maxAge=168h&context=1&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=job 
But, still find OLM package-server pods panic on the master branch that fixed PR merged, for example, https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/15992/rehearse-15992-pull-ci-cri-o-cri-o-master-e2e-gcp/1366635982232752128

Comment 16 Ankita Thomas 2021-03-08 21:39:20 UTC

The fix is in 4.8, and from https://search.ci.openshift.org/?search=lifecycle-manager.*Observed+a+panic&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job, all the jobs with the panic are 4.7 jobs. 

Can you take another look?

Comment 18 Jian Zhang 2021-03-09 01:45:36 UTC

Hi Ankita,

Sure, I guess the master branch aligns with 4.8 now. As you can see from the https://search.ci.openshift.org/?search=Undiagnosed+panic+detected+in+pod&maxAge=168h&context=1&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=job 
There are many failures on the master branch, for example,
https://prow.ci.openshift.org/job-history/origin-ci-test/pr-logs/directory/pull-ci-cri-o-cri-o-master-e2e-gcp
A screenshot: https://user-images.githubusercontent.com/15416633/110405607-cbb7eb80-80bb-11eb-8dad-eaacf5fca054.png

https://prow.ci.openshift.org/job-history/origin-ci-test/pr-logs/directory/pull-ci-openshift-kni-cnf-features-deploy-master-e2e-gcp-ovn
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift-kni_cnf-features-deploy/444/pull-ci-openshift-kni-cnf-features-deploy-master-e2e-gcp-ovn/1368931986437050368

I change the Status to ASSIGNED first, please let me know if I missed something, thanks!

Comment 19 Abu Kashem 2021-03-09 16:25:12 UTC

jiazha,

I briefly looked at the package server logs for https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-okd-installer-e2e-aws-upgrade/1369286393666211840/artifacts/ and I did not find the panics. 

Can you please recheck your search?

Comment 20 Ankita Thomas 2021-03-10 01:26:51 UTC

Hi,

I took a look at the failing jobs, and those still use 4.7

2021/03/08 14:30:00 Resolved ocp/4.7:base to sha256:a7218e69175bb91140cde07a03cea173c041fe44cc13d5bd317ddee9c9ed7957
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift-kni_cnf-features-deploy/444/pull-ci-openshift-kni-cnf-features-deploy-master-e2e-gcp-ovn/1368931986437050368/artifacts/e2e-gcp-ovn/gather-extra/artifacts/clusterversion.json

I checked most of the failing jobs from https://search.ci.openshift.org/?search=Undiagnosed+panic+detected+in+pod&maxAge=168h&context=1&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=job , they all have 4.7 clusters from their artifacts. Can you take a look?

Comment 22 Jian Zhang 2021-03-11 03:59:16 UTC

Hi Ankita,

Thanks for your updates! Most failures are for 4.7: https://search.ci.openshift.org/?search=Undiagnosed+panic+detected+in+pod&maxAge=168h&context=1&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=job
Verifiy it.

Comment 23 Clayton Coleman 2021-04-07 02:19:30 UTC

still seeing this - from today: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1379561668190670848 and the stack trace was in the webhook.

https://search.ci.openshift.org/?search=panic&maxAge=48h&context=1&type=junit&name=4%5C.8&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Was this accidentally reverted?

Comment 25 Jian Zhang 2021-04-09 07:36:43 UTC

Still get the error for 4.8, for exampl, https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade/1380373905826385920
pods/openshift-operator-lifecycle-manager_packageserver-79d5c587c7-8j5cr_packageserver.log.gz:E0409 06:06:05.783918       1 runtime.go:76] Observed a panic: runtime error: invalid memory address or nil pointer dereference

Searched from https://search.ci.openshift.org/?search=Undiagnosed+panic+detected+in+pod&maxAge=168h&context=1&type=junit&name=4%5C.8&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 26 Ankita Thomas 2021-04-12 09:13:01 UTC

https://search.ci.openshift.org/?search=Undiagnosed+panic+detected+in+pod&maxAge=168h&context=1&type=junit&name=4\.8&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Looks like the fix wasn't in at the time, the panic isn't showing up when I rerun the search. Can you please check again?

Comment 27 tflannag 2021-04-12 13:18:20 UTC

Just echo-ing Ankita's comment that we haven't seen that panic being produced in the CI search logs since three days ago, but moving this from ON_QA -> MODIFIED so it can be properly picked up.

Comment 29 Jian Zhang 2021-04-13 07:07:40 UTC

Hi Ankita, Tim

Yes, this panic is not reported since 3 days ago. I can only find the panic reported 5 days ago. Verify it.

Comment 32 errata-xmlrpc 2021-07-27 22:35:38 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438