Bug 1957634 - prometheus-adapter panics on GetNodeMetrics
Summary: prometheus-adapter panics on GetNodeMetrics
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.9.0
Assignee: Damien Grisonnet
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-05-06 07:50 UTC by Damien Grisonnet
Modified: 2021-10-18 17:31 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-18 17:31:03 UTC
Target Upstream Version:
Embargoed:


Links:
Github kubernetes-sigs/prometheus-adapter pull 395 (open): Prevent metrics-server panics on GetContainerMetrics and GetNodeMetrics (last updated 2021-05-06 07:50:47 UTC)
Github openshift/cluster-monitoring-operator pull 1325 (last updated 2021-08-18 09:14:36 UTC)
Github openshift/k8s-prometheus-adapter pull 53 (last updated 2021-08-18 09:14:36 UTC)
Red Hat Product Errata RHSA-2021:3759 (last updated 2021-10-18 17:31:28 UTC)

Description Damien Grisonnet 2021-05-06 07:50:47 UTC
Description of problem:

In 4.7 CI, we've noticed a panic from prometheus-adapter: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1134/pull-ci-openshift-cluster-monitoring-operator-release-4.7-e2e-agnostic-operator/1389209238441562112

The panic occurs in code that prometheus-adapter reuses (vendors) from metrics-server, as shown in the logs:

```
I0503 13:58:46.056646       1 adapter.go:98] successfully using in-cluster auth
I0503 13:58:47.390698       1 secure_serving.go:197] Serving securely on [::]:6443
I0503 13:58:47.390944       1 dynamic_cafile_content.go:167] Starting request-header::/etc/tls/private/requestheader-client-ca-file
I0503 13:58:47.390970       1 dynamic_serving_content.go:130] Starting serving-cert::/etc/tls/private/tls.crt::/etc/tls/private/tls.key
I0503 13:58:47.390990       1 tlsconfig.go:240] Starting DynamicServingCertificateController
I0503 13:58:47.391441       1 dynamic_cafile_content.go:167] Starting client-ca-bundle::/etc/tls/private/client-ca-file
E0503 14:34:48.270032       1 provider.go:265] failed querying node metrics: unable to fetch node CPU metrics: unable to execute query: Get "https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=sum%281+-+irate%28node_cpu_seconds_total%7Bmode%3D%22idle%22%7D%5B5m%5D%29+%2A+on%28namespace%2C+pod%29+group_left%28node%29+node_namespace_pod%3Akube_pod_info%3A%7Bnode%3D~%22ci-op-d5jk33fs-0330d-hkmhs-master-0%7Cci-op-d5jk33fs-0330d-hkmhs-master-1%7Cci-op-d5jk33fs-0330d-hkmhs-master-2%7Cci-op-d5jk33fs-0330d-hkmhs-worker-centralus1-dw2tw%7Cci-op-d5jk33fs-0330d-hkmhs-worker-centralus2-vmngh%7Cci-op-d5jk33fs-0330d-hkmhs-worker-centralus3-wtbx8%22%7D%29+by+%28node%29&time=1620052458.269": dial tcp 172.30.163.101:9091: i/o timeout
I0503 14:34:48.270117       1 trace.go:205] Trace[1458323237]: "List" url:/apis/metrics.k8s.io/v1beta1/nodes,user-agent:e2e.test/v0.0.0 (linux/amd64) kubernetes/$Format,client:52.45.10.115 (03-May-2021 14:34:18.269) (total time: 30000ms):
Trace[1458323237]: [30.000655916s] [30.000655916s] END
E0503 14:34:48.270484       1 runtime.go:76] Observed a panic: runtime error: index out of range [0] with length 0
goroutine 6081 [running]:
k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP.func1.1(0xc0004fa720)
	/go/src/github.com/directxman12/k8s-prometheus-adapter/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:106 +0x113
panic(0x1c78fe0, 0xc0011fe7c0)
	/usr/lib/golang/src/runtime/panic.go:969 +0x1b9
sigs.k8s.io/metrics-server/pkg/api.(*nodeMetrics).getNodeMetrics(0xc000453f80, 0xc0006e9cc0, 0x6, 0x8, 0x6, 0x8, 0x0, 0x0, 0x0)
	/go/src/github.com/directxman12/k8s-prometheus-adapter/vendor/sigs.k8s.io/metrics-server/pkg/api/node.go:212 +0x51c
sigs.k8s.io/metrics-server/pkg/api.(*nodeMetrics).List(0xc000453f80, 0x204aa60, 0xc00065af90, 0xc000f003f0, 0x0, 0x0, 0x200ee60, 0xc000f003f0)
	/go/src/github.com/directxman12/k8s-prometheus-adapter/vendor/sigs.k8s.io/metrics-server/pkg/api/node.go:93 +0x38a
...
```

It seems that, when the Prometheus query fails, prometheus-adapter returns data in a shape the vendored metrics-server code doesn't expect (an empty result where it assumes one entry per requested node), which triggers the index-out-of-range panic in getNodeMetrics.
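
For illustration only (this is not the upstream patch, and all names below are made up): the class of guard that avoids an "index out of range [0] with length 0" panic is a length check before indexing the per-node results, skipping nodes for which the provider returned nothing instead of assuming one entry per node.

```
// Hypothetical sketch, not the actual fix: tolerate a short or empty result
// slice from the backing provider instead of indexing it blindly.
package main

import "fmt"

// nodeUsage stands in for the per-node CPU/memory values returned by the provider.
type nodeUsage struct {
	cpuMillis int64
	memBytes  int64
}

// buildNodeMetrics pairs requested node names with provider results.
func buildNodeMetrics(nodes []string, usages []nodeUsage) []string {
	var out []string
	for i, name := range nodes {
		if i >= len(usages) { // guard: provider returned fewer entries than requested nodes
			fmt.Printf("no metrics for node %q, skipping\n", name)
			continue
		}
		out = append(out, fmt.Sprintf("%s: cpu=%dm mem=%dB", name, usages[i].cpuMillis, usages[i].memBytes))
	}
	return out
}

func main() {
	nodes := []string{"master-0", "worker-1"}
	// Simulate the failure mode from the log above: the Prometheus query timed
	// out, so no usage data came back at all.
	fmt.Println(buildNodeMetrics(nodes, nil))
}
```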

Version-Release number of selected component (if applicable):

prometheus-adapter v0.8.4

How reproducible:

This is pretty hard to reproduce as it requires prometheus-adapter to fail querying node/container metrics from Prometheus.
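
Purely as an assumption about how one might exercise the failing code path (not a documented reproduction procedure): listing NodeMetrics through the metrics.k8s.io API, the same endpoint as in the trace above, while prometheus-adapter cannot reach Prometheus should, with the fix, surface an API error instead of a panic inside the adapter. The kubeconfig handling below is an assumption, not part of this report.

```
// Hypothetical sketch: hit /apis/metrics.k8s.io/v1beta1/nodes via the metrics
// clientset, as the e2e test does.
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
	metricsclient "k8s.io/metrics/pkg/client/clientset/versioned"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	mc, err := metricsclient.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	// When the adapter's Prometheus queries fail, this should return an error
	// rather than crash the request with an index-out-of-range panic.
	nodeMetrics, err := mc.MetricsV1beta1().NodeMetricses().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		log.Fatalf("listing node metrics failed: %v", err)
	}
	for _, nm := range nodeMetrics.Items {
		fmt.Println(nm.Name, nm.Usage.Cpu(), nm.Usage.Memory())
	}
}
```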

Comment 1 Damien Grisonnet 2021-05-25 17:29:47 UTC
Upstream PR is waiting for reviews.

Comment 2 Damien Grisonnet 2021-06-01 15:02:34 UTC
The PR has merged; the fix will be included in the next release of prometheus-adapter.

Comment 6 Damien Grisonnet 2021-09-03 07:38:49 UTC
Moving to MODIFIED state since https://github.com/openshift/cluster-monitoring-operator/pull/1325 and https://github.com/openshift/k8s-prometheus-adapter/pull/53 have been merged.

Comment 7 Junqi Zhao 2021-09-09 02:39:47 UTC
Is the Target Release 4.9.0 or 4.10.0? I see the fix is in both 4.9 and 4.10.

Comment 8 Damien Grisonnet 2021-09-09 08:42:44 UTC
Target release is 4.9.0. For some reason, the bug wasn't moved from MODIFIED to ON_QA automatically as I would have expected.

Comment 9 Junqi Zhao 2021-09-09 10:09:04 UTC
Tested with 4.9.0-0.nightly-2021-09-08-233235 (prometheus-adapter version 0.9.0); the panic no longer appears in CI jobs:
https://search.ci.openshift.org/?search=Observed+a+panic%3A+runtime+error%3A+index+out+of+range+%5B0%5D+with+length+0&maxAge=336h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 14 errata-xmlrpc 2021-10-18 17:31:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

