+++ This bug was initially created as a clone of Bug #2075091 +++

Symptom Detection.Undiagnosed panic detected in pod is failing frequently in CI, see:
https://sippy.ci.openshift.org/sippy-ng/tests/4.11/analysis?test=Symptom%20Detection.Undiagnosed%20panic%20detected%20in%20pod

This problem seemed to exist before, but the number of cases surged and caused two nightly payloads to be rejected:
https://amd64.ocp.releases.ci.openshift.org/releasestream/4.11.0-0.nightly/release/4.11.0-0.nightly-2022-04-12-150057
https://amd64.ocp.releases.ci.openshift.org/releasestream/4.11.0-0.nightly/release/4.11.0-0.nightly-2022-04-12-185124

After that, it mysteriously disappeared. Here is a specific case:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-azure-ovn-upgrade/1513895844351315968

Message from the test case:

{ pods/openshift-monitoring_kube-state-metrics-67c5b7c7c6-88vxn_kube-state-metrics_previous.log.gz:E0412 15:52:33.358619 1 runtime.go:78] Observed a panic: runtime.boundsError{x:4, y:4, signed:true, code:0x0} (runtime error: index out of range [4] with length 4)}

Panic trace from https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-azure-ovn-upgrade/1513895844351315968/artifacts/e2e-azure-ovn-upgrade/gather-extra/artifacts/pods/openshift-monitoring_kube-state-metrics-67c5b7c7c6-88vxn_kube-state-metrics_previous.log:

E0412 15:52:33.358619       1 runtime.go:78] Observed a panic: runtime.boundsError{x:4, y:4, signed:true, code:0x0} (runtime error: index out of range [4] with length 4)
goroutine 77 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1741840, 0xc000b635f0})
	/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0x7d
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc000ac9740})
	/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x75
panic({0x1741840, 0xc000b635f0})
	/usr/lib/golang/src/runtime/panic.go:1038 +0x215
k8s.io/kube-state-metrics/v2/internal/store.createPodContainerInfoFamilyGenerator.func1(0xc003422c00)
	/go/src/k8s.io/kube-state-metrics/internal/store/pod.go:134 +0x375
k8s.io/kube-state-metrics/v2/internal/store.wrapPodFunc.func1({0x1804880, 0xc003422c00})
	/go/src/k8s.io/kube-state-metrics/internal/store/pod.go:1386 +0x5a
k8s.io/kube-state-metrics/v2/pkg/metric_generator.(*FamilyGenerator).Generate(...)
	/go/src/k8s.io/kube-state-metrics/pkg/metric_generator/generator.go:67
k8s.io/kube-state-metrics/v2/pkg/metric_generator.ComposeMetricGenFuncs.func1({0x1804880, 0xc003422c00})
	/go/src/k8s.io/kube-state-metrics/pkg/metric_generator/generator.go:107 +0xd8
k8s.io/kube-state-metrics/v2/pkg/metrics_store.(*MetricsStore).Add(0xc0000c13c0, {0x1804880, 0xc003422c00})
	/go/src/k8s.io/kube-state-metrics/pkg/metrics_store/metrics_store.go:72 +0xd4
k8s.io/kube-state-metrics/v2/pkg/metrics_store.(*MetricsStore).Update(0xc003422c00, {0x1804880, 0xc003422c00})
	/go/src/k8s.io/kube-state-metrics/pkg/metrics_store/metrics_store.go:87 +0x25
k8s.io/client-go/tools/cache.(*Reflector).watchHandler(0xc000192fc0, {0x0, 0x0, 0x26cdee0}, {0x1a373f8, 0xc0011c24c0}, 0xc000623d60, 0xc0005ff380, 0xc0002cc480)
	/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/client-go/tools/cache/reflector.go:506 +0xa55
k8s.io/client-go/tools/cache.(*Reflector).ListAndWatch(0xc000192fc0, 0xc0002cc480)
	/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/client-go/tools/cache/reflector.go:429 +0x696
k8s.io/client-go/tools/cache.(*Reflector).Run.func1()
	/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/client-go/tools/cache/reflector.go:221 +0x26
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x7f02ffada1d0)
	/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x67
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00036a2c0, {0x1a1daa0, 0xc000386e60}, 0x1, 0xc0002cc480)
	/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xb6
k8s.io/client-go/tools/cache.(*Reflector).Run(0xc000192fc0, 0xc0002cc480)
	/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/client-go/tools/cache/reflector.go:220 +0x1f8
created by k8s.io/kube-state-metrics/v2/internal/store.(*Builder).startReflector
	/go/src/k8s.io/kube-state-metrics/internal/store/builder.go:508 +0x2c8
panic: runtime error: index out of range [4] with length 4 [recovered]
	panic: runtime error: index out of range [4] with length 4

It points to https://github.com/openshift/kube-state-metrics/blob/6efa87f858ee53028fd2de40941b61c09e9ee049/internal/store/pod.go#L134, where the lengths of p.Status.ContainerStatuses and p.Spec.Containers appear to diverge. Unfortunately the condition is ephemeral, and the state that caused the panic no longer exists in the must-gather data.

The ask is to safeguard the code to avoid the panic and log useful debugging info to track down offenders (see the sketch after this comment).

--- Additional comment from spasquie on 2022-04-15 14:32:39 UTC ---

Decreasing severity to medium since kube-state-metrics restarts automatically, but increasing priority to high. Setting blocker- as it appears to happen randomly and it doesn't hinder the core monitoring functions (failed scrapes are expected and alerting rules should account for that already).
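For illustration only, here is a minimal Go sketch of the kind of safeguard requested above, assuming the panic comes from indexing p.Spec.Containers with a position taken from p.Status.ContainerStatuses. The function specImageForStatus and the example pod are hypothetical; this is not the actual kube-state-metrics code at pod.go:134.

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/klog/v2"
)

// specImageForStatus returns the spec image for the i-th container status,
// guarding against the spec and status lists diverging in length.
func specImageForStatus(p *v1.Pod, i int) (string, bool) {
	if i >= len(p.Spec.Containers) {
		// Log enough context to track down the offending pod instead of panicking.
		klog.Warningf("pod %s/%s: container status index %d out of range (statuses=%d, spec containers=%d)",
			p.Namespace, p.Name, i, len(p.Status.ContainerStatuses), len(p.Spec.Containers))
		return "", false
	}
	return p.Spec.Containers[i].Image, true
}

func main() {
	// A hypothetical pod whose status reports more containers than its spec
	// declares, reproducing the length mismatch suspected in this bug.
	p := &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Namespace: "openshift-monitoring", Name: "kube-state-metrics-67c5b7c7c6-88vxn"},
		Spec:       v1.PodSpec{Containers: []v1.Container{{Name: "kube-state-metrics", Image: "ksm:example"}}},
		Status: v1.PodStatus{ContainerStatuses: []v1.ContainerStatus{
			{Name: "kube-state-metrics"}, {Name: "extra"},
		}},
	}
	for i := range p.Status.ContainerStatuses {
		if img, ok := specImageForStatus(p, i); ok {
			fmt.Println(img)
		}
	}
}

With a mismatched pod like the one above, the guard logs the pod's namespace/name and both slice lengths and skips the entry rather than letting the out-of-range access crash the process.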
There is one additional fix contained in https://github.com/kubernetes/kube-state-metrics/releases/tag/v2.5.0, namely https://github.com/kubernetes/kube-state-metrics/pull/1734. I can't prove that this is what is causing the panic we're seeing, but let's update our payload and test again.
Verified with the PR: searched on Prow and didn't see the panic.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.10.38 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:7035