Bug 2075091

Summary: Symptom Detection.Undiagnosed panic detected in pod
Product: OpenShift Container Platform
Reporter: Ken Zhang <kenzhang>
Component: Monitoring
Assignee: Jan Fajerski <jfajersk>
Status: CLOSED ERRATA
QA Contact: hongyan li <hongyli>
Severity: medium
Docs Contact:
Priority: high
Version: 4.11
CC: amuller, anpicker, hongyli, janantha, jfajersk, sippy, spasquie, wking
Target Milestone: ---
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-08-10 11:07:02 UTC
Type: Bug
Bug Depends On:    
Bug Blocks: 2078835    

Description Ken Zhang 2022-04-13 15:17:59 UTC
Symptom Detection.Undiagnosed panic detected in pod

is failing frequently in CI, see:
https://sippy.ci.openshift.org/sippy-ng/tests/4.11/analysis?test=Symptom%20Detection.Undiagnosed%20panic%20detected%20in%20pod

This problem seemed to exist before, but the number of cases surged and caused two nightly payloads to be rejected:


https://amd64.ocp.releases.ci.openshift.org/releasestream/4.11.0-0.nightly/release/4.11.0-0.nightly-2022-04-12-150057
https://amd64.ocp.releases.ci.openshift.org/releasestream/4.11.0-0.nightly/release/4.11.0-0.nightly-2022-04-12-185124

After that, it mysteriously disappeared. 

Here is a specific case:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-azure-ovn-upgrade/1513895844351315968

Message from the test case:

{  pods/openshift-monitoring_kube-state-metrics-67c5b7c7c6-88vxn_kube-state-metrics_previous.log.gz:E0412 15:52:33.358619       1 runtime.go:78] Observed a panic: runtime.boundsError{x:4, y:4, signed:true, code:0x0} (runtime error: index out of range [4] with length 4)}


Panic trace from https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-azure-ovn-upgrade/1513895844351315968/artifacts/e2e-azure-ovn-upgrade/gather-extra/artifacts/pods/openshift-monitoring_kube-state-metrics-67c5b7c7c6-88vxn_kube-state-metrics_previous.log:

E0412 15:52:33.358619       1 runtime.go:78] Observed a panic: runtime.boundsError{x:4, y:4, signed:true, code:0x0} (runtime error: index out of range [4] with length 4)
goroutine 77 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1741840, 0xc000b635f0})
	/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0x7d
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc000ac9740})
	/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x75
panic({0x1741840, 0xc000b635f0})
	/usr/lib/golang/src/runtime/panic.go:1038 +0x215
k8s.io/kube-state-metrics/v2/internal/store.createPodContainerInfoFamilyGenerator.func1(0xc003422c00)
	/go/src/k8s.io/kube-state-metrics/internal/store/pod.go:134 +0x375
k8s.io/kube-state-metrics/v2/internal/store.wrapPodFunc.func1({0x1804880, 0xc003422c00})
	/go/src/k8s.io/kube-state-metrics/internal/store/pod.go:1386 +0x5a
k8s.io/kube-state-metrics/v2/pkg/metric_generator.(*FamilyGenerator).Generate(...)
	/go/src/k8s.io/kube-state-metrics/pkg/metric_generator/generator.go:67
k8s.io/kube-state-metrics/v2/pkg/metric_generator.ComposeMetricGenFuncs.func1({0x1804880, 0xc003422c00})
	/go/src/k8s.io/kube-state-metrics/pkg/metric_generator/generator.go:107 +0xd8
k8s.io/kube-state-metrics/v2/pkg/metrics_store.(*MetricsStore).Add(0xc0000c13c0, {0x1804880, 0xc003422c00})
	/go/src/k8s.io/kube-state-metrics/pkg/metrics_store/metrics_store.go:72 +0xd4
k8s.io/kube-state-metrics/v2/pkg/metrics_store.(*MetricsStore).Update(0xc003422c00, {0x1804880, 0xc003422c00})
	/go/src/k8s.io/kube-state-metrics/pkg/metrics_store/metrics_store.go:87 +0x25
k8s.io/client-go/tools/cache.(*Reflector).watchHandler(0xc000192fc0, {0x0, 0x0, 0x26cdee0}, {0x1a373f8, 0xc0011c24c0}, 0xc000623d60, 0xc0005ff380, 0xc0002cc480)
	/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/client-go/tools/cache/reflector.go:506 +0xa55
k8s.io/client-go/tools/cache.(*Reflector).ListAndWatch(0xc000192fc0, 0xc0002cc480)
	/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/client-go/tools/cache/reflector.go:429 +0x696
k8s.io/client-go/tools/cache.(*Reflector).Run.func1()
	/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/client-go/tools/cache/reflector.go:221 +0x26
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x7f02ffada1d0)
	/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x67
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00036a2c0, {0x1a1daa0, 0xc000386e60}, 0x1, 0xc0002cc480)
	/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xb6
k8s.io/client-go/tools/cache.(*Reflector).Run(0xc000192fc0, 0xc0002cc480)
	/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/client-go/tools/cache/reflector.go:220 +0x1f8
created by k8s.io/kube-state-metrics/v2/internal/store.(*Builder).startReflector
	/go/src/k8s.io/kube-state-metrics/internal/store/builder.go:508 +0x2c8
panic: runtime error: index out of range [4] with length 4 [recovered]
	panic: runtime error: index out of range [4] with length 4


It points to https://github.com/openshift/kube-state-metrics/blob/6efa87f858ee53028fd2de40941b61c09e9ee049/internal/store/pod.go#L134, where the lengths of p.Status.ContainerStatuses and p.Spec.Containers appear to diverge.
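
For context, here is a minimal sketch of the kind of indexing that would produce exactly this error, assuming the generator walks p.Status.ContainerStatuses and indexes p.Spec.Containers with the same position. This is an illustration only, not the upstream source at pod.go:134:

// Hypothetical illustration, not the actual kube-state-metrics code:
// indexing Spec.Containers with the position of a ContainerStatuses entry
// panics as soon as the two slices momentarily differ in length.
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

func containerInfoLabels(p *v1.Pod) [][]string {
	var rows [][]string
	for i, cs := range p.Status.ContainerStatuses {
		// If ContainerStatuses has 5 entries while Spec.Containers has 4,
		// this line is "runtime error: index out of range [4] with length 4".
		imageSpec := p.Spec.Containers[i].Image
		rows = append(rows, []string{cs.Name, imageSpec, cs.Image, cs.ImageID, cs.ContainerID})
	}
	return rows
}

func main() {
	pod := &v1.Pod{
		Spec:   v1.PodSpec{Containers: make([]v1.Container, 4)},
		Status: v1.PodStatus{ContainerStatuses: make([]v1.ContainerStatus, 5)},
	}
	fmt.Println(containerInfoLabels(pod)) // panics: index out of range [4] with length 4
}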

Unfortunately the condition is ephemeral and no longer exists in the must-gather data.

The ask is to safeguard the code so it avoids the panic, and to log useful debugging information to track down offenders.
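
One possible shape for that safeguard, as a sketch only (the name-based matching and the klog call are assumptions here, not necessarily the fix that eventually merged upstream):

package store

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/klog/v2"
)

// containerInfoLabelsSafe is a hypothetical guarded variant of the metric
// generation loop: it matches statuses to spec containers by name instead of
// by index, and logs the pod plus both slice lengths when they disagree so
// the offending object can be tracked down.
func containerInfoLabelsSafe(p *v1.Pod) [][]string {
	// Name -> spec image lookup, so a transient length or ordering mismatch
	// between Spec.Containers and Status.ContainerStatuses cannot panic.
	specImages := make(map[string]string, len(p.Spec.Containers))
	for _, c := range p.Spec.Containers {
		specImages[c.Name] = c.Image
	}

	if len(p.Status.ContainerStatuses) != len(p.Spec.Containers) {
		klog.Warningf("pod %s/%s: %d container statuses vs %d spec containers",
			p.Namespace, p.Name, len(p.Status.ContainerStatuses), len(p.Spec.Containers))
	}

	rows := make([][]string, 0, len(p.Status.ContainerStatuses))
	for _, cs := range p.Status.ContainerStatuses {
		imageSpec, ok := specImages[cs.Name]
		if !ok {
			// No matching spec container: skip the metric instead of panicking;
			// the warning above already points at the pod to investigate.
			continue
		}
		rows = append(rows, []string{cs.Name, imageSpec, cs.Image, cs.ImageID, cs.ContainerID})
	}
	return rows
}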

Comment 1 Simon Pasquier 2022-04-15 14:32:39 UTC
Decreasing severity to medium since kube-state-metrics restarts automatically, but increasing priority to high.
Setting blocker- as it appears to happen randomly and it doesn't hinder the core monitoring functions (failed scrapes are expected and alerting rules should account for that already).

Comment 12 hongyan li 2022-06-20 02:52:38 UTC
Didn't see the issue in 4.11 CI jobs in the last 14 days, so closing the bug.

The issue still appears in 4.10 jobs, for example the run below; not sure if the fix should be backported:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.10/1538269043335630848

Comment 15 errata-xmlrpc 2022-08-10 11:07:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069