Bug 1957634 - prometheus-adapter panics on GetNodeMetrics
Summary: prometheus-adapter panics on GetNodeMetrics
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.9.0
Assignee: Damien Grisonnet
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-05-06 07:50 UTC by Damien Grisonnet
Modified: 2021-10-18 17:31 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-18 17:31:03 UTC
Target Upstream Version:
Embargoed:


Links:
Github kubernetes-sigs/prometheus-adapter pull 395 (open): Prevent metrics-server panics on GetContainerMetrics and GetNodeMetrics (last updated 2021-05-06 07:50:47 UTC)
Github openshift/cluster-monitoring-operator pull 1325 (last updated 2021-08-18 09:14:36 UTC)
Github openshift/k8s-prometheus-adapter pull 53 (last updated 2021-08-18 09:14:36 UTC)
Red Hat Product Errata RHSA-2021:3759 (last updated 2021-10-18 17:31:28 UTC)

Description Damien Grisonnet 2021-05-06 07:50:47 UTC
Description of problem:

In 4.7 CI, we've noticed a panic from prometheus-adapter: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1134/pull-ci-openshift-cluster-monitoring-operator-release-4.7-e2e-agnostic-operator/1389209238441562112

The panic occurs in code that prometheus-adapter reuses (vendors) from metrics-server, as shown in the logs:

```
I0503 13:58:46.056646       1 adapter.go:98] successfully using in-cluster auth
I0503 13:58:47.390698       1 secure_serving.go:197] Serving securely on [::]:6443
I0503 13:58:47.390944       1 dynamic_cafile_content.go:167] Starting request-header::/etc/tls/private/requestheader-client-ca-file
I0503 13:58:47.390970       1 dynamic_serving_content.go:130] Starting serving-cert::/etc/tls/private/tls.crt::/etc/tls/private/tls.key
I0503 13:58:47.390990       1 tlsconfig.go:240] Starting DynamicServingCertificateController
I0503 13:58:47.391441       1 dynamic_cafile_content.go:167] Starting client-ca-bundle::/etc/tls/private/client-ca-file
E0503 14:34:48.270032       1 provider.go:265] failed querying node metrics: unable to fetch node CPU metrics: unable to execute query: Get "https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=sum%281+-+irate%28node_cpu_seconds_total%7Bmode%3D%22idle%22%7D%5B5m%5D%29+%2A+on%28namespace%2C+pod%29+group_left%28node%29+node_namespace_pod%3Akube_pod_info%3A%7Bnode%3D~%22ci-op-d5jk33fs-0330d-hkmhs-master-0%7Cci-op-d5jk33fs-0330d-hkmhs-master-1%7Cci-op-d5jk33fs-0330d-hkmhs-master-2%7Cci-op-d5jk33fs-0330d-hkmhs-worker-centralus1-dw2tw%7Cci-op-d5jk33fs-0330d-hkmhs-worker-centralus2-vmngh%7Cci-op-d5jk33fs-0330d-hkmhs-worker-centralus3-wtbx8%22%7D%29+by+%28node%29&time=1620052458.269": dial tcp 172.30.163.101:9091: i/o timeout
I0503 14:34:48.270117       1 trace.go:205] Trace[1458323237]: "List" url:/apis/metrics.k8s.io/v1beta1/nodes,user-agent:e2e.test/v0.0.0 (linux/amd64) kubernetes/$Format,client:52.45.10.115 (03-May-2021 14:34:18.269) (total time: 30000ms):
Trace[1458323237]: [30.000655916s] [30.000655916s] END
E0503 14:34:48.270484       1 runtime.go:76] Observed a panic: runtime error: index out of range [0] with length 0
goroutine 6081 [running]:
k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP.func1.1(0xc0004fa720)
	/go/src/github.com/directxman12/k8s-prometheus-adapter/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:106 +0x113
panic(0x1c78fe0, 0xc0011fe7c0)
	/usr/lib/golang/src/runtime/panic.go:969 +0x1b9
sigs.k8s.io/metrics-server/pkg/api.(*nodeMetrics).getNodeMetrics(0xc000453f80, 0xc0006e9cc0, 0x6, 0x8, 0x6, 0x8, 0x0, 0x0, 0x0)
	/go/src/github.com/directxman12/k8s-prometheus-adapter/vendor/sigs.k8s.io/metrics-server/pkg/api/node.go:212 +0x51c
sigs.k8s.io/metrics-server/pkg/api.(*nodeMetrics).List(0xc000453f80, 0x204aa60, 0xc00065af90, 0xc000f003f0, 0x0, 0x0, 0x200ee60, 0xc000f003f0)
	/go/src/github.com/directxman12/k8s-prometheus-adapter/vendor/sigs.k8s.io/metrics-server/pkg/api/node.go:93 +0x38a
...
```

It seems that, when the Prometheus query fails, prometheus-adapter returns data in a shape the vendored metrics-server code doesn't expect (an empty result where it assumes one entry per requested node), which triggers the index-out-of-range panic in getNodeMetrics.
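
For illustration only (this is not the upstream patch, and all names below are made up): the class of guard that avoids an "index out of range [0] with length 0" panic is a length check before indexing the per-node results, skipping nodes for which the provider returned nothing instead of assuming one entry per node.

```
// Hypothetical sketch, not the actual fix: tolerate a short or empty result
// slice from the backing provider instead of indexing it blindly.
package main

import "fmt"

// nodeUsage stands in for the per-node CPU/memory values returned by the provider.
type nodeUsage struct {
	cpuMillis int64
	memBytes  int64
}

// buildNodeMetrics pairs requested node names with provider results.
func buildNodeMetrics(nodes []string, usages []nodeUsage) []string {
	var out []string
	for i, name := range nodes {
		if i >= len(usages) { // guard: provider returned fewer entries than requested nodes
			fmt.Printf("no metrics for node %q, skipping\n", name)
			continue
		}
		out = append(out, fmt.Sprintf("%s: cpu=%dm mem=%dB", name, usages[i].cpuMillis, usages[i].memBytes))
	}
	return out
}

func main() {
	nodes := []string{"master-0", "worker-1"}
	// Simulate the failure mode from the log above: the Prometheus query timed
	// out, so no usage data came back at all.
	fmt.Println(buildNodeMetrics(nodes, nil))
}
```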

Version-Release number of selected component (if applicable):

prometheus-adapter v0.8.4

How reproducible:

This is pretty hard to reproduce as it requires prometheus-adapter to fail querying node/container metrics from Prometheus.
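
Purely as an assumption about how one might exercise the failing code path (not a documented reproduction procedure): listing NodeMetrics through the metrics.k8s.io API, the same endpoint as in the trace above, while prometheus-adapter cannot reach Prometheus should, with the fix, surface an API error instead of a panic inside the adapter. The kubeconfig handling below is an assumption, not part of this report.

```
// Hypothetical sketch: hit /apis/metrics.k8s.io/v1beta1/nodes via the metrics
// clientset, as the e2e test does.
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
	metricsclient "k8s.io/metrics/pkg/client/clientset/versioned"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	mc, err := metricsclient.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	// When the adapter's Prometheus queries fail, this should return an error
	// rather than crash the request with an index-out-of-range panic.
	nodeMetrics, err := mc.MetricsV1beta1().NodeMetricses().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		log.Fatalf("listing node metrics failed: %v", err)
	}
	for _, nm := range nodeMetrics.Items {
		fmt.Println(nm.Name, nm.Usage.Cpu(), nm.Usage.Memory())
	}
}
```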

Comment 1 Damien Grisonnet 2021-05-25 17:29:47 UTC
Upstream PR is waiting for reviews.

Comment 2 Damien Grisonnet 2021-06-01 15:02:34 UTC
The PR has merged; the fix will be included in the next release of prometheus-adapter.

Comment 6 Damien Grisonnet 2021-09-03 07:38:49 UTC
Moving to MODIFIED state since https://github.com/openshift/cluster-monitoring-operator/pull/1325 and https://github.com/openshift/k8s-prometheus-adapter/pull/53 have been merged.

Comment 7 Junqi Zhao 2021-09-09 02:39:47 UTC
Is the Target Release 4.9.0 or 4.10.0? I see the fix is in both 4.9 and 4.10.

Comment 8 Damien Grisonnet 2021-09-09 08:42:44 UTC
Target release is 4.9.0. For some reason, the bug wasn't moved from MODIFIED to ON_QA automatically as I would have expected.

Comment 9 Junqi Zhao 2021-09-09 10:09:04 UTC
Tested with 4.9.0-0.nightly-2021-09-08-233235 (prometheus-adapter version 0.9.0); the panic no longer appears in CI jobs:
https://search.ci.openshift.org/?search=Observed+a+panic%3A+runtime+error%3A+index+out+of+range+%5B0%5D+with+length+0&maxAge=336h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 14 errata-xmlrpc 2021-10-18 17:31:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

