panic in cluster-storage-operator while updating status - https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_kubernetes/1163/pull-ci-openshift-kubernetes-master-k8s-e2e-conformance-aws/1489076275103404032
The panic:

panic({0x1ff9140, 0x27bc000})
        runtime/panic.go:1047 +0x266
reflect.Value.Index({0x1f77aa0, 0xc00241cb78, 0xc00068d080}, 0x199)
        reflect/value.go:1297 +0x178
k8s.io/apimachinery/third_party/forked/golang/reflect.Equalities.deepValueEqual(0x22e3b60, {0x1f77aa0, 0xc00292d3d0, 0x16}, {0x1f77aa0, 0xc00241cb78, 0xc001d070b8}, 0x43a191, 0x2)
        k8s.io/apimachinery.1/third_party/forked/golang/reflect/deep_equal.go:172 +0x1185
k8s.io/apimachinery/third_party/forked/golang/reflect.Equalities.deepValueEqual(0x2160740, {0x22e3b60, 0xc00292d398, 0xc0014c8480}, {0x22e3b60, 0xc00241cb40, 0x2}, 0xc000bf4360, 0x1)
        k8s.io/apimachinery.1/third_party/forked/golang/reflect/deep_equal.go:186 +0x1325
k8s.io/apimachinery/third_party/forked/golang/reflect.Equalities.deepValueEqual(0x2160740, {0x2160740, 0xc00292d398, 0x578}, {0x2160740, 0xc00241cb40, 0x580}, 0xc000748800, 0x0)
        k8s.io/apimachinery.1/third_party/forked/golang/reflect/deep_equal.go:183 +0xe11
k8s.io/apimachinery/third_party/forked/golang/reflect.Equalities.DeepEqual(0xc00292d398, {0x2160740, 0xc00292d398}, {0x2160740, 0xc00241cb40})
        k8s.io/apimachinery.1/third_party/forked/golang/reflect/deep_equal.go:243 +0x285
github.com/openshift/library-go/pkg/operator/v1helpers.UpdateStatus.func1()
        github.com/openshift/library-go.0-20220114151217-4362aa519714/pkg/operator/v1helpers/helpers.go:175 +0x169
k8s.io/client-go/util/retry.OnError.func1()
        k8s.io/client-go.0+incompatible/util/retry/util.go:51 +0x33
k8s.io/apimachinery/pkg/util/wait.ConditionFunc.WithContext.func1({0x4131f4, 0xc001d07798})
        k8s.io/apimachinery.1/pkg/util/wait/wait.go:220 +0x1b
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext({0x2819048, 0xc0001ae008}, 0x203000)
        k8s.io/apimachinery.1/pkg/util/wait/wait.go:233 +0x7c
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtection(0x2262d80)
        k8s.io/apimachinery.1/pkg/util/wait/wait.go:226 +0x39
k8s.io/apimachinery/pkg/util/wait.ExponentialBackoff({0x989680, 0x4014000000000000, 0x3fb999999999999a, 0x4, 0x0}, 0x413547)
        k8s.io/apimachinery.1/pkg/util/wait/wait.go:421 +0x5f
k8s.io/client-go/util/retry.OnError({0x989680, 0x4014000000000000, 0x3fb999999999999a, 0x4, 0x0}, 0x25dcc28, 0xc00241caf0)
        k8s.io/client-go.0+incompatible/util/retry/util.go:50 +0xf1
k8s.io/client-go/util/retry.RetryOnConflict(...)
        k8s.io/client-go.0+incompatible/util/retry/util.go:104
github.com/openshift/library-go/pkg/operator/v1helpers.UpdateStatus({0x2819010, 0xc0001cfcc0}, {0x283cd38, 0xc0005776c0}, {0xc0029c6840, 0x3, 0x3})
        github.com/openshift/library-go.0-20220114151217-4362aa519714/pkg/operator/v1helpers/helpers.go:162 +0x1c5
github.com/openshift/cluster-storage-operator/pkg/operator/defaultstorageclass.(*Controller).sync(0xc000ce8230, {0x2819010, 0xc0001cfcc0}, {0x30, 0xc000077800})
        github.com/openshift/cluster-storage-operator/pkg/operator/defaultstorageclass/controller.go:134 +0x8c5

The storage class controller updates its own conditions, and that panics somewhere in reflect.Equalities.DeepEqual.
The panic is happening because of an out-of-range index access when the status is updated:

> E0203 09:27:18.496815 1 runtime.go:76] Observed a panic: reflect: slice index out of range

This seems to be caused here (see https://github.com/openshift/cluster-storage-operator/blob/2eb1fd83439da37e4f54231d627c3120735b20ee/vendor/k8s.io/apimachinery/third_party/forked/golang/reflect/deep_equal.go#L152-L153):

> for i := 0; i < v1.Len(); i++ {
>     if !e.deepValueEqual(v1.Index(i), v2.Index(i), visited, depth+1) {

In the code above, the index range is determined by the length of v1 (which is the "old" status in our case). However, the "new" status contains fewer elements (conditions), so an out-of-range error occurs.

For completeness, the "new" status contains fewer conditions because we explicitly remove a "Disabled" condition here:
https://github.com/openshift/cluster-storage-operator/blob/2eb1fd83439da37e4f54231d627c3120735b20ee/pkg/operator/defaultstorageclass/controller.go#L137
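To make the failure mode concrete, here is a minimal standalone sketch of the mechanism (illustrative slice names, not the operator's actual data): reflect.Value.Index panics with "reflect: slice index out of range" as soon as the loop index exceeds the length of the shorter slice.

// Minimal sketch of the panic mechanism (illustrative data, not the operator's).
package main

import (
	"fmt"
	"reflect"
)

func main() {
	oldConditions := []string{"Available", "Progressing", "Disabled"} // "old" status: 3 conditions
	newConditions := []string{"Available", "Progressing"}             // "new" status: "Disabled" removed

	v1 := reflect.ValueOf(oldConditions)
	v2 := reflect.ValueOf(newConditions)

	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered:", r) // recovered: reflect: slice index out of range
		}
	}()

	// Iterate using v1's length but index into the shorter v2,
	// mirroring the loop quoted above.
	for i := 0; i < v1.Len(); i++ {
		_ = v1.Index(i)
		_ = v2.Index(i) // panics when i == 2
	}
}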
From the stack trace, we're hitting this line:

k8s.io/apimachinery/third_party/forked/golang/reflect.Equalities.deepValueEqual(0x22e3b60, {0x1f77aa0, 0xc00292d3d0, 0x16}, {0x1f77aa0, 0xc00241cb78, 0xc001d070b8}, 0x43a191, 0x2)
        k8s.io/apimachinery.1/third_party/forked/golang/reflect/deep_equal.go:172 +0x1185

https://github.com/openshift/cluster-storage-operator/blob/2eb1fd83439da37e4f54231d627c3120735b20ee/vendor/k8s.io/apimachinery/third_party/forked/golang/reflect/deep_equal.go#L172

So it's comparing slices instead of arrays, if the line numbers are correct. And there is a check for slices to see if the lengths are the same before that line, so why doesn't that prevent the panic?

if v1.Len() != v2.Len() {
    return false
}
In this call:

k8s.io/apimachinery/third_party/forked/golang/reflect.Equalities.deepValueEqual(0x22e3b60, {0x1f77aa0, 0xc00292d3d0, 0x16}, {0x1f77aa0, 0xc00241cb78, 0xc001d070b8}, 0x43a191, 0x2)
        k8s.io/apimachinery.1/third_party/forked/golang/reflect/deep_equal.go:172 +0x1185

deepValueEqual compares these 2 slices:

v1 = {0x1f77aa0, 0xc00292d3d0, 0x16}
v2 = {0x1f77aa0, 0xc00241cb78, 0xc001d070b8}

and then calls v2.Index(i):

reflect.Value.Index({0x1f77aa0, 0xc00241cb78, 0xc00068d080}, 0x199)
        reflect/value.go:1297 +0x178

Somehow, the third element of v2 appears to be a different value, and i = 0x199. It would help to see the objects being compared when this panic triggers. I think Fabio is right that it's due to a difference in length between the two slices, but I'm puzzled why the length check didn't prevent the panic.
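One way a passing length check can still be followed by an index panic, anticipating the race-condition question in the next comment: when the reflect.Value aliases a struct field, Len() and Index() re-read the live slice header, so a write that lands between the check and the indexing defeats the check. A minimal sketch with illustrative types (not the operator's); the mutation is done inline here for determinism, but a concurrent writer would have the same effect.

package main

import (
	"fmt"
	"reflect"
)

// Status is a stand-in for an operator status with a Conditions slice.
type Status struct {
	Conditions []string
}

func main() {
	s := &Status{Conditions: []string{"Available", "Progressing", "Disabled"}}

	// The reflect.Value obtained through the pointer aliases the live field.
	v := reflect.ValueOf(s).Elem().FieldByName("Conditions")

	n := v.Len() // length check passes: n == 3

	// A write landing between the check and the indexing; in the operator
	// this would be another goroutine mutating the shared object.
	s.Conditions = s.Conditions[:1]

	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered:", r) // recovered: reflect: slice index out of range
		}
	}()

	for i := 0; i < n; i++ {
		_ = v.Index(i) // panics at i == 1, despite the earlier check
	}
}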
Can we reproduce the issue easily?

> And there is a check for slices to see if the lengths are the same before that line, so why doesn't that prevent the panic?

I did not check the code, but this smells like a race condition. Can any other goroutine modify any of the objects during the checks? We may be missing a DeepCopy() in our operator / library-go.
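If that hypothesis holds, the usual remedy is to deep-copy the shared object before comparing or mutating it, so the comparison never aliases memory another goroutine can write to. A minimal sketch of that pattern with hypothetical types (Status and copyStatus are illustrative, not the operator's or library-go's API; in the real code this would correspond to calling the generated DeepCopy() on the cached status before handing it to the compare/update path).

package main

import "fmt"

// Status is a hypothetical stand-in for an operator status.
type Status struct {
	Conditions []string
}

// copyStatus returns a deep copy so later comparisons and updates
// do not alias the slice owned by the shared/cached object.
func copyStatus(in *Status) *Status {
	out := &Status{Conditions: make([]string, len(in.Conditions))}
	copy(out.Conditions, in.Conditions)
	return out
}

func main() {
	shared := &Status{Conditions: []string{"Available", "Progressing", "Disabled"}}

	// Work on private copies; mutations of `shared` by other goroutines
	// can no longer change what we compare against.
	old := copyStatus(shared)
	updated := copyStatus(shared)
	updated.Conditions = updated.Conditions[:2] // e.g. drop the "Disabled" condition

	fmt.Println("old:", old.Conditions, "updated:", updated.Conditions)
}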
It appears that the CR controller is mistakenly overwriting the status while trying to set the generation. I've filed a PR to fix that.
I have been monitoring for two days and have not found a 4.11 failure; I will keep monitoring for a few more days.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069