Bug 1914994 - Panic observed in k8s-prometheus-adapter since k8s 1.20
Summary: Panic observed in k8s-prometheus-adapter since k8s 1.20
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.8.0
Assignee: Damien Grisonnet
QA Contact: Junqi Zhao
Depends On:
Reported: 2021-01-11 17:30 UTC by Damien Grisonnet
Modified: 2021-07-27 22:36 UTC
CC: 6 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Last Closed: 2021-07-27 22:36:03 UTC
Target Upstream Version:

Attachments (Terms of Use)

System ID Private Priority Status Summary Last Updated
Github openshift k8s-prometheus-adapter pull 45 0 None closed Bug 1914994: Bump k8s-prometheus-adapter to v0.8.3 2021-02-21 08:40:54 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 22:36:27 UTC

Description Damien Grisonnet 2021-01-11 17:30:02 UTC
Description of problem:

Several CI failures of the `Undiagnosed panic detected in pod` origin test have reportedly been caused by prometheus-adapter. The latest report can be found in [1].
This is already the second time this has been reported since the rebase onto Kubernetes 1.20. It was first thought that these panics would be fixed by bumping all Kubernetes dependencies in prometheus-adapter to 1.20.0, but it seems that they are still occurring.
However, the panics first reported in [2] are quite different and appear to have been fixed by https://github.com/openshift/k8s-prometheus-adapter/pull/41, as they no longer occur.

Also, it's worth noting that this is not just a one-time flake in a thousand runs. For the past week, this CI job has been responsible for 6% of all failures, and among that job's failures, prometheus-adapter appears to be responsible for 71 out of 442 (~16%) according to [3].

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-kube-controller-manager-operator/491/pull-ci-openshift-cluster-kube-controller-manager-operator-master-e2e-upgrade/1347197449646641152/artifacts/e2e-upgrade/gather-extra/pods/openshift-monitoring_prometheus-adapter-7956dd46cf-h4d5t_prometheus-adapter.log
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_kubernetes/471/pull-ci-openshift-kubernetes-master-e2e-aws-selfupgrade/1338383564957290496/artifacts/e2e-aws-selfupgrade/gather-extra/pods/openshift-monitoring_prometheus-adapter-66d5b468c5-f6kmf_prometheus-adapter.log
[3] https://search.ci.openshift.org/?search=Undiagnosed+panic+detected+in+pod&maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job
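For context, origin's `Undiagnosed panic detected in pod` check works by scanning gathered pod logs for panic stack-trace markers. A minimal shell sketch of the same idea, useful for triaging downloaded artifacts like [1] and [2] locally (the `pods` directory name and the regex patterns are assumptions for illustration, not origin's exact implementation):

```shell
#!/bin/sh
# Sketch: scan downloaded CI pod logs for Go panic signatures.
# LOG_DIR is assumed to point at a gather-extra/pods directory fetched
# from a CI run; the patterns approximate what a Go panic stack trace
# looks like, not origin's exact matching logic.
LOG_DIR="${LOG_DIR:-pods}"
grep -rlE 'panic: |runtime error:|goroutine [0-9]+ \[running\]' "$LOG_DIR" \
  || echo "no panic signatures found in $LOG_DIR"
```

Each file listed is a candidate pod log to inspect by hand for the full stack trace.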


Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

Actual results:

Expected results:

Additional info:

Comment 1 Damien Grisonnet 2021-01-18 15:40:02 UTC
The panic seems to have been fixed in upstream Kubernetes as part of this PR: https://github.com/kubernetes/kubernetes/pull/97820

There is currently a backport to 1.20 open https://github.com/kubernetes/kubernetes/pull/97862. Once merged, we will need to upgrade client-go in upstream k8s-prometheus-adapter to bring the fix downstream.
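Bringing the fix downstream once the backport merges would amount to bumping the Kubernetes dependencies in k8s-prometheus-adapter's go.mod to the 1.20 patch release that carries it. A sketch (the exact v0.20.z patch version shown is an assumption, since it depends on which release includes the backport):

```go
// go.mod fragment (sketch): bump client-go and related modules to the
// 1.20.z patch release carrying the backport of kubernetes/kubernetes#97862.
// v0.20.2 here is illustrative, not the confirmed fixed version.
require (
	k8s.io/api v0.20.2
	k8s.io/apimachinery v0.20.2
	k8s.io/client-go v0.20.2
)
```

followed by `go mod tidy` to propagate the bump through the dependency graph.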

Comment 5 Junqi Zhao 2021-02-10 06:32:30 UTC
tested with 4.8.0-0.nightly-2021-02-09-221546, prometheus-adapter version is v0.8.3 now

Comment 8 errata-xmlrpc 2021-07-27 22:36:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

