Bug 2050300

Summary: panic in cluster-storage-operator while updating status
Product: OpenShift Container Platform
Reporter: Hemant Kumar <hekumar>
Component: Storage
Storage sub component: Operators
Assignee: Fabio Bertinatto <fbertina>
QA Contact: Wei Duan <wduan>
Docs Contact:
Status: CLOSED ERRATA
Severity: medium
Priority: medium
CC: aos-bugs, fbertina, jdobson, jsafrane
Version: 4.10
Target Milestone: ---
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-08-10 10:47:09 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Comment 1 Jan Safranek 2022-02-03 16:37:46 UTC
The panic:

panic({0x1ff9140, 0x27bc000})
	runtime/panic.go:1047 +0x266
reflect.Value.Index({0x1f77aa0, 0xc00241cb78, 0xc00068d080}, 0x199)
	reflect/value.go:1297 +0x178
k8s.io/apimachinery/third_party/forked/golang/reflect.Equalities.deepValueEqual(0x22e3b60, {0x1f77aa0, 0xc00292d3d0, 0x16}, {0x1f77aa0, 0xc00241cb78, 0xc001d070b8}, 0x43a191, 0x2)
	k8s.io/apimachinery.1/third_party/forked/golang/reflect/deep_equal.go:172 +0x1185
k8s.io/apimachinery/third_party/forked/golang/reflect.Equalities.deepValueEqual(0x2160740, {0x22e3b60, 0xc00292d398, 0xc0014c8480}, {0x22e3b60, 0xc00241cb40, 0x2}, 0xc000bf4360, 0x1)
	k8s.io/apimachinery.1/third_party/forked/golang/reflect/deep_equal.go:186 +0x1325
k8s.io/apimachinery/third_party/forked/golang/reflect.Equalities.deepValueEqual(0x2160740, {0x2160740, 0xc00292d398, 0x578}, {0x2160740, 0xc00241cb40, 0x580}, 0xc000748800, 0x0)
	k8s.io/apimachinery.1/third_party/forked/golang/reflect/deep_equal.go:183 +0xe11
k8s.io/apimachinery/third_party/forked/golang/reflect.Equalities.DeepEqual(0xc00292d398, {0x2160740, 0xc00292d398}, {0x2160740, 0xc00241cb40})
	k8s.io/apimachinery.1/third_party/forked/golang/reflect/deep_equal.go:243 +0x285
github.com/openshift/library-go/pkg/operator/v1helpers.UpdateStatus.func1()
	github.com/openshift/library-go.0-20220114151217-4362aa519714/pkg/operator/v1helpers/helpers.go:175 +0x169
k8s.io/client-go/util/retry.OnError.func1()
	k8s.io/client-go.0+incompatible/util/retry/util.go:51 +0x33
k8s.io/apimachinery/pkg/util/wait.ConditionFunc.WithContext.func1({0x4131f4, 0xc001d07798})
	k8s.io/apimachinery.1/pkg/util/wait/wait.go:220 +0x1b
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext({0x2819048, 0xc0001ae008}, 0x203000)
	k8s.io/apimachinery.1/pkg/util/wait/wait.go:233 +0x7c
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtection(0x2262d80)
	k8s.io/apimachinery.1/pkg/util/wait/wait.go:226 +0x39
k8s.io/apimachinery/pkg/util/wait.ExponentialBackoff({0x989680, 0x4014000000000000, 0x3fb999999999999a, 0x4, 0x0}, 0x413547)
	k8s.io/apimachinery.1/pkg/util/wait/wait.go:421 +0x5f
k8s.io/client-go/util/retry.OnError({0x989680, 0x4014000000000000, 0x3fb999999999999a, 0x4, 0x0}, 0x25dcc28, 0xc00241caf0)
	k8s.io/client-go.0+incompatible/util/retry/util.go:50 +0xf1
k8s.io/client-go/util/retry.RetryOnConflict(...)
	k8s.io/client-go.0+incompatible/util/retry/util.go:104
github.com/openshift/library-go/pkg/operator/v1helpers.UpdateStatus({0x2819010, 0xc0001cfcc0}, {0x283cd38, 0xc0005776c0}, {0xc0029c6840, 0x3, 0x3})
	github.com/openshift/library-go.0-20220114151217-4362aa519714/pkg/operator/v1helpers/helpers.go:162 +0x1c5
github.com/openshift/cluster-storage-operator/pkg/operator/defaultstorageclass.(*Controller).sync(0xc000ce8230, {0x2819010, 0xc0001cfcc0}, {0x30, 0xc000077800})
	github.com/openshift/cluster-storage-operator/pkg/operator/defaultstorageclass/controller.go:134 +0x8c5


The storage class controller updates its own conditions, and the panic happens somewhere in reflect.Equalities.DeepEqual.

Comment 2 Fabio Bertinatto 2022-02-03 18:26:52 UTC
The panic is happening because of an out-of-range index access when the status is updated:

> E0203 09:27:18.496815       1 runtime.go:76] Observed a panic: reflect: slice index out of range

Which seems to be caused here (see https://github.com/openshift/cluster-storage-operator/blob/2eb1fd83439da37e4f54231d627c3120735b20ee/vendor/k8s.io/apimachinery/third_party/forked/golang/reflect/deep_equal.go#L152-L153):

> for i := 0; i < v1.Len(); i++ {
> 	if !e.deepValueEqual(v1.Index(i), v2.Index(i), visited, depth+1) {


In the code above, the index range is determined by the length of v1 (the "old" status in our case). However, the "new" status (v2) contains fewer elements (conditions), so an out-of-range access occurs.

For completeness, the "new" status contains fewer conditions because we explicitly remove a "Disabled" condition here:

https://github.com/openshift/cluster-storage-operator/blob/2eb1fd83439da37e4f54231d627c3120735b20ee/pkg/operator/defaultstorageclass/controller.go#L137
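
For illustration, here is a minimal, self-contained sketch of that failure mode (this is not the operator's code, and the condition names are made up): a loop that takes its index range from v1 but indexes into a shorter v2 panics with "reflect: slice index out of range", which matches the message logged above. The length check that the real deep_equal.go performs before the loop is deliberately left out here; the comments below discuss why that check did not help.

    package main

    import (
        "fmt"
        "reflect"
    )

    func main() {
        // "Old" status with three conditions vs. "new" status with two,
        // mirroring the removed "Disabled" condition described above.
        oldConditions := []string{"Available", "Progressing", "Disabled"}
        newConditions := []string{"Available", "Progressing"}

        v1 := reflect.ValueOf(oldConditions)
        v2 := reflect.ValueOf(newConditions)

        defer func() {
            // Prints: recovered: reflect: slice index out of range
            fmt.Println("recovered:", recover())
        }()

        // The index range comes from v1 only, as in the loop quoted above,
        // so v2.Index(2) panics on the 2-element slice.
        for i := 0; i < v1.Len(); i++ {
            _ = reflect.DeepEqual(v1.Index(i).Interface(), v2.Index(i).Interface())
        }
    }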

Comment 3 Jonathan Dobson 2022-02-03 18:40:06 UTC
From the stack trace, we're hitting this line:

k8s.io/apimachinery/third_party/forked/golang/reflect.Equalities.deepValueEqual(0x22e3b60, {0x1f77aa0, 0xc00292d3d0, 0x16}, {0x1f77aa0, 0xc00241cb78, 0xc001d070b8}, 0x43a191, 0x2)
	k8s.io/apimachinery.1/third_party/forked/golang/reflect/deep_equal.go:172 +0x1185

https://github.com/openshift/cluster-storage-operator/blob/2eb1fd83439da37e4f54231d627c3120735b20ee/vendor/k8s.io/apimachinery/third_party/forked/golang/reflect/deep_equal.go#L172

So it's comparing slices instead of arrays if the line numbers are correct.
And there is a check for slices to see if the lengths are the same before that line, so why doesn't that prevent the panic?

		if v1.Len() != v2.Len() {
			return false
		}

Comment 4 Jonathan Dobson 2022-02-03 19:06:40 UTC
In this call:

k8s.io/apimachinery/third_party/forked/golang/reflect.Equalities.deepValueEqual(0x22e3b60, {0x1f77aa0, 0xc00292d3d0, 0x16}, {0x1f77aa0, 0xc00241cb78, 0xc001d070b8}, 0x43a191, 0x2)
	k8s.io/apimachinery.1/third_party/forked/golang/reflect/deep_equal.go:172 +0x1185

deepValueEqual compares these 2 slices:

  v1 = {0x1f77aa0, 0xc00292d3d0, 0x16}
  v2 = {0x1f77aa0, 0xc00241cb78, 0xc001d070b8}

And then calls v2.Index(i):

reflect.Value.Index({0x1f77aa0, 0xc00241cb78, 0xc00068d080}, 0x199)
	reflect/value.go:1297 +0x178

Somehow, the third element of v2 appears to be a different value, and i = 0x199.
It would help to see the objects being compared when this panic triggers. I think Fabio is right that it's due to a difference in length between the two slices, but I'm puzzled why the length check didn't prevent the panic.

Comment 5 Jan Safranek 2022-02-04 11:24:09 UTC
Can we reproduce the issue easily?

> And there is a check for slices to see if the lengths are the same before that line, so why doesn't that prevent the panic?

I did not check the code, but this smells like a race condition. Can any other goroutine modify any of the objects during the checks? We may be missing a DeepCopy() somewhere in our operator / library-go.
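
As a hypothetical illustration of that kind of race (not the operator's code; the type and condition names are stand-ins): if two goroutines share a status object without a DeepCopy(), one of them can change the length of the Conditions slice between DeepEqual's length check and its Index(i) calls. Running this sketch with the race detector (go run -race) flags the data race, and with unlucky timing it can panic with the same "reflect: slice index out of range" message as in comment 1.

    package main

    import (
        "reflect"
        "sync"
    )

    // operatorStatus is a stand-in for the real status type; the field and
    // condition names are illustrative only.
    type operatorStatus struct {
        Conditions []string
    }

    func main() {
        oldStatus := &operatorStatus{Conditions: []string{"Available", "Progressing", "Disabled"}}
        // Object that another goroutine keeps mutating, standing in for a
        // shared/cached object used without DeepCopy().
        newStatus := &operatorStatus{Conditions: []string{"Available", "Progressing", "Disabled"}}

        var wg sync.WaitGroup
        wg.Add(2)

        // Reader: repeatedly compares the two statuses, like the DeepEqual
        // call in the trace above.
        go func() {
            defer wg.Done()
            for i := 0; i < 100000; i++ {
                _ = reflect.DeepEqual(oldStatus, newStatus)
            }
        }()

        // Writer: repeatedly removes and re-adds the last condition, changing
        // the slice length while the reader may be iterating over it.
        go func() {
            defer wg.Done()
            for i := 0; i < 100000; i++ {
                newStatus.Conditions = newStatus.Conditions[:2]
                newStatus.Conditions = append(newStatus.Conditions, "Disabled")
            }
        }()

        wg.Wait()
    }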

Comment 6 Fabio Bertinatto 2022-02-04 14:18:51 UTC
It appears that the CR controller is mistakenly overriding the status while trying to set the generation. I've filed a PR to fix that.
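
To make that concrete, here is a heavily hedged sketch of the general shape of that kind of bug and fix; the type and function names are invented for illustration and are not the actual cluster-storage-operator or library-go code.

    package sketch

    // driverStatus is an illustrative stand-in; not the real CR type.
    type driverStatus struct {
        ObservedGeneration int64
        Conditions         []string
    }

    // Racy shape: carrying the generation update on a freshly built status
    // overwrites (and shrinks) the Conditions that other controllers just
    // set or are still comparing.
    func updateGenerationRacy(shared *driverStatus, generation int64) {
        *shared = driverStatus{ObservedGeneration: generation} // Conditions reset to nil
    }

    // Safer shape: work on a copy (including a copy of the slice) and only
    // change the field this controller owns.
    func updateGenerationSafe(current *driverStatus, generation int64) *driverStatus {
        updated := *current
        updated.Conditions = append([]string(nil), current.Conditions...)
        updated.ObservedGeneration = generation
        return &updated
    }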

Comment 9 Wei Duan 2022-02-09 04:08:04 UTC
I have been monitoring for two days and have not seen the failure on 4.11. I will keep monitoring for a few more days.

Comment 11 errata-xmlrpc 2022-08-10 10:47:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069