Bug 1913525
Summary: Panic in OLM packageserver when invoking webhook authorization endpoint

Product: OpenShift Container Platform
Component: OLM
OLM sub component: OLM
Status: CLOSED ERRATA
Severity: high
Priority: medium
Version: 4.7
Target Milestone: ---
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Reporter: Clayton Coleman <ccoleman>
Assignee: Ankita Thomas <ankithom>
QA Contact: Jian Zhang <jiazha>
CC: akashem, ankithom, aos-bugs, bluddy, dsover, htariq, jiazha, krizza, maszulik, mfojtik, tflannag, xxia
Doc Type: Bug Fix
Doc Text:
Cause: k8s.io/apiserver was not handling context errors for the webhook authorizer.
Consequence: Context errors such as timeouts caused the authorizer to panic.
Fix: Bump the apiserver version to include the upstream fix for the issue.
Result: The authorizer can gracefully handle context errors.
Story Points: ---
Environment: Undiagnosed panic detected in pod
Last Closed: 2021-07-27 22:35:38 UTC
Type: Bug
Regression: ---
Bug Blocks: 1933839
Description
Clayton Coleman
2021-01-07 01:37:35 UTC
assigned it to the api team, going to triage it.

dsover, there is an issue where webhook.go does not handle the error returned from `webhook.WithExponentialBackoff`. I opened https://github.com/kubernetes/kubernetes/pull/97820. I am hoping this will resolve the panic issue.

> r.Status = result.Status

(this is where it's panicking) from https://github.com/kubernetes/kubernetes/blob/v1.20.0/staging/src/k8s.io/apiserver/plugin/pkg/authorizer/webhook/webhook.go#L208

As to why `result` is nil, I see the following possibilities:
- A: the context associated with the request has already expired and the SAR create was never called.
- B: the retry backoff parameters are not initialized, `Steps` is zero, and the SAR create was never called.

B is not likely; otherwise we would see more of this issue in the package server logs.

Is it possible for you (or anyone on the OLM team) to add the above patch to package server and run it on CI? I expect to see the underlying error instead of the panic, which should give us more insight. In the meantime I will keep digging.

*** Bug 1915300 has been marked as a duplicate of this bug. ***

Doesn't this need a PR to OLM to bump dependencies before it goes to ON_QA? The bug is in the vendored webhook code on the client side, not the openshift/kubernetes side?

dsover, OLM needs to bump the dependencies to include the patch https://github.com/kubernetes/kubernetes/pull/97820. I am assigning the BZ back to the OLM team so they can follow up. Please feel free to assign it back to the api team if you feel otherwise.

I have a PR open for the dependency bump.

Can we get the OLM PR over the finish line? Still happening; it shows up in 4% of all CI runs.

With this fix in place, we should see errors now instead of the panics. I am interested in seeing what these errors are.
Searched from https://search.ci.openshift.org/?search=Undiagnosed+panic+detected+in+pod&maxAge=168h&context=1&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

But I still find OLM package-server pods panicking on the master branch that the fix PR merged into, for example: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/15992/rehearse-15992-pull-ci-cri-o-cri-o-master-e2e-gcp/1366635982232752128

The fix is in 4.8, and from https://search.ci.openshift.org/?search=lifecycle-manager.*Observed+a+panic&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job, all the jobs with the panic are 4.7 jobs. Can you take another look?

Hi Ankita,

Sure, I guess the master branch aligns with 4.8 now. As you can see from https://search.ci.openshift.org/?search=Undiagnosed+panic+detected+in+pod&maxAge=168h&context=1&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=job, there are many failures on the master branch, for example:

https://prow.ci.openshift.org/job-history/origin-ci-test/pr-logs/directory/pull-ci-cri-o-cri-o-master-e2e-gcp

A screenshot: https://user-images.githubusercontent.com/15416633/110405607-cbb7eb80-80bb-11eb-8dad-eaacf5fca054.png

https://prow.ci.openshift.org/job-history/origin-ci-test/pr-logs/directory/pull-ci-openshift-kni-cnf-features-deploy-master-e2e-gcp-ovn
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift-kni_cnf-features-deploy/444/pull-ci-openshift-kni-cnf-features-deploy-master-e2e-gcp-ovn/1368931986437050368

I am changing the Status to ASSIGNED first; please let me know if I missed something, thanks!

jiazha, I briefly looked at the package server logs for https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-okd-installer-e2e-aws-upgrade/1369286393666211840/artifacts/ and I did not find the panics. Can you please recheck your search?
Hi, I took a look at the failing jobs, and those still use 4.7:

2021/03/08 14:30:00 Resolved ocp/4.7:base to sha256:a7218e69175bb91140cde07a03cea173c041fe44cc13d5bd317ddee9c9ed7957
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift-kni_cnf-features-deploy/444/pull-ci-openshift-kni-cnf-features-deploy-master-e2e-gcp-ovn/1368931986437050368/artifacts/e2e-gcp-ovn/gather-extra/artifacts/clusterversion.json

I checked most of the failing jobs from https://search.ci.openshift.org/?search=Undiagnosed+panic+detected+in+pod&maxAge=168h&context=1&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=job; they all have 4.7 clusters in their artifacts. Can you take a look?

Hi Ankita,

Thanks for your updates! Most failures are for 4.7: https://search.ci.openshift.org/?search=Undiagnosed+panic+detected+in+pod&maxAge=168h&context=1&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

Verify it.

Still seeing this, from today: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1379561668190670848 and the stack trace was in the webhook. https://search.ci.openshift.org/?search=panic&maxAge=48h&context=1&type=junit&name=4%5C.8&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Was this accidentally reverted?
Still getting the error for 4.8, for example: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade/1380373905826385920

pods/openshift-operator-lifecycle-manager_packageserver-79d5c587c7-8j5cr_packageserver.log.gz:E0409 06:06:05.783918 1 runtime.go:76] Observed a panic: runtime error: invalid memory address or nil pointer dereference

Searched from https://search.ci.openshift.org/?search=Undiagnosed+panic+detected+in+pod&maxAge=168h&context=1&type=junit&name=4%5C.8&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Looks like the fix wasn't in at the time; the panic isn't showing up when I rerun the search: https://search.ci.openshift.org/?search=Undiagnosed+panic+detected+in+pod&maxAge=168h&context=1&type=junit&name=4\.8&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job Can you please check again?

Just echoing Ankita's comment that we haven't seen that panic being produced in the CI search logs since three days ago, but moving this from ON_QA -> MODIFIED so it can be properly picked up.

Hi Ankita, Tim

Yes, this panic has not been reported since 3 days ago. I can only find the panic reported 5 days ago. Verify it.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438