Bug 2075091 - Symptom Detection.Undiagnosed panic detected in pod
Summary: Symptom Detection.Undiagnosed panic detected in pod
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.11.0
Assignee: Jan Fajerski
QA Contact: hongyan li
URL:
Whiteboard:
Depends On:
Blocks: 2078835
 
Reported: 2022-04-13 15:17 UTC by Ken Zhang
Modified: 2022-08-10 11:07 UTC (History)
8 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-10 11:07:02 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift kube-state-metrics pull 69 0 None Merged Bug 2075091: internal/store: fix potential panic in pod store 2022-05-09 11:55:52 UTC
Github openshift kube-state-metrics pull 71 0 None Merged Bug 2075091: internal/store: fix metrics slice length 2022-06-15 07:51:07 UTC
Github openshift kube-state-metrics pull 74 0 None Merged Bug 2075091: Bump to KSM 2.5.0 2022-06-15 07:51:08 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-10 11:07:26 UTC

Description Ken Zhang 2022-04-13 15:17:59 UTC
Symptom Detection.Undiagnosed panic detected in pod

is failing frequently in CI, see:
https://sippy.ci.openshift.org/sippy-ng/tests/4.11/analysis?test=Symptom%20Detection.Undiagnosed%20panic%20detected%20in%20pod

This problem seems to have existed before, but the number of cases surged and caused two nightly payloads to be rejected:


https://amd64.ocp.releases.ci.openshift.org/releasestream/4.11.0-0.nightly/release/4.11.0-0.nightly-2022-04-12-150057
https://amd64.ocp.releases.ci.openshift.org/releasestream/4.11.0-0.nightly/release/4.11.0-0.nightly-2022-04-12-185124

After that, it mysteriously disappeared. 

Here is a specific case:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-azure-ovn-upgrade/1513895844351315968

Message from the test case:

{  pods/openshift-monitoring_kube-state-metrics-67c5b7c7c6-88vxn_kube-state-metrics_previous.log.gz:E0412 15:52:33.358619       1 runtime.go:78] Observed a panic: runtime.boundsError{x:4, y:4, signed:true, code:0x0} (runtime error: index out of range [4] with length 4)}


Panic trace from https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-azure-ovn-upgrade/1513895844351315968/artifacts/e2e-azure-ovn-upgrade/gather-extra/artifacts/pods/openshift-monitoring_kube-state-metrics-67c5b7c7c6-88vxn_kube-state-metrics_previous.log:

E0412 15:52:33.358619       1 runtime.go:78] Observed a panic: runtime.boundsError{x:4, y:4, signed:true, code:0x0} (runtime error: index out of range [4] with length 4)
goroutine 77 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1741840, 0xc000b635f0})
	/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0x7d
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc000ac9740})
	/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x75
panic({0x1741840, 0xc000b635f0})
	/usr/lib/golang/src/runtime/panic.go:1038 +0x215
k8s.io/kube-state-metrics/v2/internal/store.createPodContainerInfoFamilyGenerator.func1(0xc003422c00)
	/go/src/k8s.io/kube-state-metrics/internal/store/pod.go:134 +0x375
k8s.io/kube-state-metrics/v2/internal/store.wrapPodFunc.func1({0x1804880, 0xc003422c00})
	/go/src/k8s.io/kube-state-metrics/internal/store/pod.go:1386 +0x5a
k8s.io/kube-state-metrics/v2/pkg/metric_generator.(*FamilyGenerator).Generate(...)
	/go/src/k8s.io/kube-state-metrics/pkg/metric_generator/generator.go:67
k8s.io/kube-state-metrics/v2/pkg/metric_generator.ComposeMetricGenFuncs.func1({0x1804880, 0xc003422c00})
	/go/src/k8s.io/kube-state-metrics/pkg/metric_generator/generator.go:107 +0xd8
k8s.io/kube-state-metrics/v2/pkg/metrics_store.(*MetricsStore).Add(0xc0000c13c0, {0x1804880, 0xc003422c00})
	/go/src/k8s.io/kube-state-metrics/pkg/metrics_store/metrics_store.go:72 +0xd4
k8s.io/kube-state-metrics/v2/pkg/metrics_store.(*MetricsStore).Update(0xc003422c00, {0x1804880, 0xc003422c00})
	/go/src/k8s.io/kube-state-metrics/pkg/metrics_store/metrics_store.go:87 +0x25
k8s.io/client-go/tools/cache.(*Reflector).watchHandler(0xc000192fc0, {0x0, 0x0, 0x26cdee0}, {0x1a373f8, 0xc0011c24c0}, 0xc000623d60, 0xc0005ff380, 0xc0002cc480)
	/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/client-go/tools/cache/reflector.go:506 +0xa55
k8s.io/client-go/tools/cache.(*Reflector).ListAndWatch(0xc000192fc0, 0xc0002cc480)
	/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/client-go/tools/cache/reflector.go:429 +0x696
k8s.io/client-go/tools/cache.(*Reflector).Run.func1()
	/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/client-go/tools/cache/reflector.go:221 +0x26
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x7f02ffada1d0)
	/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x67
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00036a2c0, {0x1a1daa0, 0xc000386e60}, 0x1, 0xc0002cc480)
	/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xb6
k8s.io/client-go/tools/cache.(*Reflector).Run(0xc000192fc0, 0xc0002cc480)
	/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/client-go/tools/cache/reflector.go:220 +0x1f8
created by k8s.io/kube-state-metrics/v2/internal/store.(*Builder).startReflector
	/go/src/k8s.io/kube-state-metrics/internal/store/builder.go:508 +0x2c8
panic: runtime error: index out of range [4] with length 4 [recovered]
	panic: runtime error: index out of range [4] with length 4


It points to https://github.com/openshift/kube-state-metrics/blob/6efa87f858ee53028fd2de40941b61c09e9ee049/internal/store/pod.go#L134, where the lengths of p.Status.ContainerStatuses and p.Spec.Containers appear to diverge.
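
For context, runtime.boundsError{x:4, y:4} is the runtime's representation of using index 4 on a slice of length 4. A minimal standalone Go program (unrelated to the kube-state-metrics code, purely an illustration of the error class) reproduces the same message:

package main

func main() {
	statuses := make([]string, 4) // length 4: valid indices are 0..3
	i := len(statuses)            // i == 4
	_ = statuses[i]               // panic: runtime error: index out of range [4] with length 4
}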

Unfortunately the condition is ephemeral, and the state that caused the panic is no longer present in the must-gather data.

The ask is to safeguard the code to avoid the panic and to log useful debugging info to track down offenders.
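
A minimal sketch of the kind of guard being asked for (illustrative only; the function and helper names below are hypothetical, and the actual fix is in the linked kube-state-metrics PRs): iterate over p.Spec.Containers and look up the matching entry in p.Status.ContainerStatuses by name instead of by index, logging when no status exists yet.

package store

import (
	"log"

	v1 "k8s.io/api/core/v1"
)

// containerStatusByName is a hypothetical helper: it returns the status whose
// Name matches the given spec container, or false if none is reported yet.
func containerStatusByName(p *v1.Pod, name string) (v1.ContainerStatus, bool) {
	for _, cs := range p.Status.ContainerStatuses {
		if cs.Name == name {
			return cs, true
		}
	}
	return v1.ContainerStatus{}, false
}

// generateContainerInfo is a hypothetical stand-in for the generator around
// pod.go#L134: it never indexes ContainerStatuses by the spec index, so a
// divergence between the two slices cannot cause an out-of-range panic.
func generateContainerInfo(p *v1.Pod) {
	for _, c := range p.Spec.Containers {
		cs, ok := containerStatusByName(p, c.Name)
		if !ok {
			// Log useful debugging info instead of panicking.
			log.Printf("pod %s/%s: no container status for %q yet (%d statuses, %d containers)",
				p.Namespace, p.Name, c.Name, len(p.Status.ContainerStatuses), len(p.Spec.Containers))
			continue
		}
		_ = cs.ImageID // emit metrics from cs here
	}
}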

Comment 1 Simon Pasquier 2022-04-15 14:32:39 UTC
Decreasing severity to medium since kube-state-metrics restarts automatically, but increasing priority to high.
Setting blocker- as it appears to happen randomly and it doesn't hinder the core monitoring functions (failed scrapes are expected and alerting rules should account for that already).

Comment 12 hongyan li 2022-06-20 02:52:38 UTC
Didn't see the issue in 4.11 CI jobs in the last 14 days, so closing the bug.

The issue still appears in 4.10 jobs, such as the run below; not sure if the bug should be backported:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.10/1538269043335630848

Comment 15 errata-xmlrpc 2022-08-10 11:07:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

