Bug 2078835 - Symptom Detection.Undiagnosed panic detected in pod
Summary: Symptom Detection.Undiagnosed panic detected in pod
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.10.z
Assignee: Jan Fajerski
QA Contact: hongyan li
URL:
Whiteboard:
Depends On: 2075091
Blocks:
 
Reported: 2022-04-26 10:37 UTC by OpenShift BugZilla Robot
Modified: 2022-10-25 18:03 UTC
CC: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-10-25 18:03:23 UTC
Target Upstream Version:
Embargoed:




Links
System / ID | Private | Priority | Status | Summary | Last Updated
Github openshift kube-state-metrics pull 70 | 0 | None | open | [release-4.10] Bug 2078835: internal/store: fix potential panic in pod store | 2022-08-11 08:10:51 UTC
Red Hat Product Errata RHBA-2022:7035 | 0 | None | None | None | 2022-10-25 18:03:34 UTC

Description OpenShift BugZilla Robot 2022-04-26 10:37:17 UTC
+++ This bug was initially created as a clone of Bug #2075091 +++

The test "Symptom Detection.Undiagnosed panic detected in pod" is failing frequently in CI; see:

https://sippy.ci.openshift.org/sippy-ng/tests/4.11/analysis?test=Symptom%20Detection.Undiagnosed%20panic%20detected%20in%20pod

This problem seemed to exist before, but the number of cases surged and caused two nightly payloads to be rejected:


https://amd64.ocp.releases.ci.openshift.org/releasestream/4.11.0-0.nightly/release/4.11.0-0.nightly-2022-04-12-150057
https://amd64.ocp.releases.ci.openshift.org/releasestream/4.11.0-0.nightly/release/4.11.0-0.nightly-2022-04-12-185124

After that, it mysteriously disappeared. 

Here is a specific case:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-azure-ovn-upgrade/1513895844351315968

Message from the test case:

{  pods/openshift-monitoring_kube-state-metrics-67c5b7c7c6-88vxn_kube-state-metrics_previous.log.gz:E0412 15:52:33.358619       1 runtime.go:78] Observed a panic: runtime.boundsError{x:4, y:4, signed:true, code:0x0} (runtime error: index out of range [4] with length 4)}


Panic trace from https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-azure-ovn-upgrade/1513895844351315968/artifacts/e2e-azure-ovn-upgrade/gather-extra/artifacts/pods/openshift-monitoring_kube-state-metrics-67c5b7c7c6-88vxn_kube-state-metrics_previous.log:

E0412 15:52:33.358619       1 runtime.go:78] Observed a panic: runtime.boundsError{x:4, y:4, signed:true, code:0x0} (runtime error: index out of range [4] with length 4)
goroutine 77 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1741840, 0xc000b635f0})
	/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0x7d
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc000ac9740})
	/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x75
panic({0x1741840, 0xc000b635f0})
	/usr/lib/golang/src/runtime/panic.go:1038 +0x215
k8s.io/kube-state-metrics/v2/internal/store.createPodContainerInfoFamilyGenerator.func1(0xc003422c00)
	/go/src/k8s.io/kube-state-metrics/internal/store/pod.go:134 +0x375
k8s.io/kube-state-metrics/v2/internal/store.wrapPodFunc.func1({0x1804880, 0xc003422c00})
	/go/src/k8s.io/kube-state-metrics/internal/store/pod.go:1386 +0x5a
k8s.io/kube-state-metrics/v2/pkg/metric_generator.(*FamilyGenerator).Generate(...)
	/go/src/k8s.io/kube-state-metrics/pkg/metric_generator/generator.go:67
k8s.io/kube-state-metrics/v2/pkg/metric_generator.ComposeMetricGenFuncs.func1({0x1804880, 0xc003422c00})
	/go/src/k8s.io/kube-state-metrics/pkg/metric_generator/generator.go:107 +0xd8
k8s.io/kube-state-metrics/v2/pkg/metrics_store.(*MetricsStore).Add(0xc0000c13c0, {0x1804880, 0xc003422c00})
	/go/src/k8s.io/kube-state-metrics/pkg/metrics_store/metrics_store.go:72 +0xd4
k8s.io/kube-state-metrics/v2/pkg/metrics_store.(*MetricsStore).Update(0xc003422c00, {0x1804880, 0xc003422c00})
	/go/src/k8s.io/kube-state-metrics/pkg/metrics_store/metrics_store.go:87 +0x25
k8s.io/client-go/tools/cache.(*Reflector).watchHandler(0xc000192fc0, {0x0, 0x0, 0x26cdee0}, {0x1a373f8, 0xc0011c24c0}, 0xc000623d60, 0xc0005ff380, 0xc0002cc480)
	/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/client-go/tools/cache/reflector.go:506 +0xa55
k8s.io/client-go/tools/cache.(*Reflector).ListAndWatch(0xc000192fc0, 0xc0002cc480)
	/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/client-go/tools/cache/reflector.go:429 +0x696
k8s.io/client-go/tools/cache.(*Reflector).Run.func1()
	/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/client-go/tools/cache/reflector.go:221 +0x26
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x7f02ffada1d0)
	/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x67
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00036a2c0, {0x1a1daa0, 0xc000386e60}, 0x1, 0xc0002cc480)
	/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xb6
k8s.io/client-go/tools/cache.(*Reflector).Run(0xc000192fc0, 0xc0002cc480)
	/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/client-go/tools/cache/reflector.go:220 +0x1f8
created by k8s.io/kube-state-metrics/v2/internal/store.(*Builder).startReflector
	/go/src/k8s.io/kube-state-metrics/internal/store/builder.go:508 +0x2c8
panic: runtime error: index out of range [4] with length 4 [recovered]
	panic: runtime error: index out of range [4] with length 4


It points to https://github.com/openshift/kube-state-metrics/blob/6efa87f858ee53028fd2de40941b61c09e9ee049/internal/store/pod.go#L134, where the lengths of p.Status.ContainerStatuses and p.Spec.Containers appear to diverge.

Unfortunately the condition is ephemeral, and the pod state that caused the panic is no longer present in the must-gather data.

The ask is to safeguard the code to avoid the panic and to log useful debugging info to track down offenders.
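
For illustration only, below is a minimal sketch of the kind of guard being asked for, written against the corev1.Pod types: check the spec index against len(p.Status.ContainerStatuses) before dereferencing, and log the pod when the two slices diverge instead of panicking. This is not the actual kube-state-metrics code; the containerStatusFor helper and its log message are hypothetical.

// Hypothetical safeguard sketch; not the upstream kube-state-metrics
// implementation. It guards the Spec.Containers -> Status.ContainerStatuses
// index lookup that the panic trace points at (internal/store/pod.go:134).
package store

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/klog/v2"
)

// containerStatusFor returns the status for the i-th spec container, or
// false if the status slice has not (yet) caught up with the spec.
func containerStatusFor(p *corev1.Pod, i int) (corev1.ContainerStatus, bool) {
	if i >= len(p.Status.ContainerStatuses) {
		// Lengths diverged: log enough context to track down offenders
		// instead of failing with "index out of range".
		klog.Warningf("pod %s/%s: %d containers in spec but only %d container statuses",
			p.Namespace, p.Name, len(p.Spec.Containers), len(p.Status.ContainerStatuses))
		return corev1.ContainerStatus{}, false
	}
	return p.Status.ContainerStatuses[i], true
}

Matching statuses by container name rather than by index would be another option, since nothing guarantees the two slices stay the same length while a pod is transitioning.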

--- Additional comment from spasquie on 2022-04-15 14:32:39 UTC ---

Decreasing severity to medium since kube-state-metrics restarts automatically, but increasing priority to high.
Setting blocker- as it appears to happen randomly and does not hinder core monitoring functions (failed scrapes are expected and alerting rules should already account for that).

Comment 2 Jan Fajerski 2022-06-13 14:37:05 UTC
There is one additional fix contained in https://github.com/kubernetes/kube-state-metrics/releases/tag/v2.5.0, namely https://github.com/kubernetes/kube-state-metrics/pull/1734.

I can't prove that this is causing the panic we're seeing, but let's update our payload and test again.

Comment 8 hongyan li 2022-10-10 10:43:41 UTC
Verified with the PR; searched on Prow and did not see the panic.

Comment 13 errata-xmlrpc 2022-10-25 18:03:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.10.38 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:7035

