Bug 1903382

Summary: Panic when task-graph is canceled with a TaskNode with no tasks
Product: OpenShift Container Platform
Reporter: W. Trevor King <wking>
Component: Cluster Version Operator
Assignee: W. Trevor King <wking>
Status: CLOSED ERRATA
QA Contact: liujia <jiajliu>
Severity: medium
Priority: medium
Version: 4.6
CC: aos-bugs, jiajliu, jokerman, yanyang
Target Release: 4.7.0
Doc Type: No Doc Update
Type: Bug
Last Closed: 2021-02-24 15:37:21 UTC
Bug Blocks: 1924194

Description W. Trevor King 2020-12-01 23:16:09 UTC
David Eads pointed out a panic in 4.6 CI [1,2]:

E1201 14:19:10.222361       1 runtime.go:78] Observed a panic: runtime.boundsError{x:0, y:0, signed:true, code:0x0} (runtime error: index out of range [0] with length 0)
goroutine 258 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x18f5be0, 0xc00084a000)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0xa6
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x89
panic(0x18f5be0, 0xc00084a000)
	/usr/lib/golang/src/runtime/panic.go:969 +0x175
github.com/openshift/cluster-version-operator/pkg/payload.RunGraph(0x1c1ea60, 0xc000368240, 0xc00025f350, 0x42, 0xc0017b05d0, 0x0, 0x0, 0x0)
	/go/src/github.com/openshift/cluster-version-operator/pkg/payload/task_graph.go:549 +0xf6c
...

The referenced line [3] is from [4], which is new in 4.6:

$ git log --oneline origin/master | grep 'Handle node pushing and result collection without a goroutine'
55ef3d30 pkg/payload/task_graph: Handle node pushing and result collection without a goroutine
$ git log --oneline origin/release-4.6 | grep 'Handle node pushing and result collection without a goroutine'
55ef3d30 pkg/payload/task_graph: Handle node pushing and result collection without a goroutine
$ git log --oneline origin/release-4.5 | grep -c 'Handle node pushing and result collection without a goroutine'
0

We should be able to name TaskNodes regardless of the presence of tasks within the node; some more on that in [5].  But for this particular case, we can probably just exclude task-less nodes from the logging.

[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere/1333772997986619392
[2]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere/1333772997986619392/artifacts/e2e-vsphere/gather-extra/pods/openshift-cluster-version_cluster-version-operator-7cd49d5b57-s82jz_cluster-version-operator_previous.log
[3]: https://github.com/openshift/cluster-version-operator/blob/39a42566bfcca5970f3c8805ce4726d19b19417d/pkg/payload/task_graph.go#L549
[4]: https://github.com/openshift/cluster-version-operator/pull/264
[5]: https://github.com/openshift/cluster-version-operator/pull/435
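As a rough illustration of the guard described above, here is a minimal sketch that skips task-less nodes before naming them. The types and the `describeIncomplete` helper are hypothetical simplifications; the real code in pkg/payload/task_graph.go differs:

```go
package main

import "fmt"

// TaskNode is a hypothetical, simplified stand-in for the CVO's payload.TaskNode.
type TaskNode struct {
	Tasks []string
}

// describeIncomplete names the nodes left unprocessed when the graph is
// canceled. Indexing node.Tasks[0] unconditionally is what panicked with
// "index out of range [0] with length 0"; the len check avoids that.
func describeIncomplete(nodes []TaskNode) []string {
	var names []string
	for _, node := range nodes {
		if len(node.Tasks) == 0 {
			continue // a node with no tasks has no task to name
		}
		names = append(names, node.Tasks[0])
	}
	return names
}

func main() {
	nodes := []TaskNode{{Tasks: nil}, {Tasks: []string{"deployment/foo"}}}
	fmt.Println(describeIncomplete(nodes)) // prints: [deployment/foo]
}
```

Without the `len` guard, the first (task-less) node would trigger the same runtime bounds panic seen in the stack trace above.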

Comment 1 W. Trevor King 2020-12-04 22:45:26 UTC
PR still needs a bit of work.

Comment 3 W. Trevor King 2021-02-02 18:57:25 UTC
Hard to trigger this reliably in a one-off cluster.  But we can let it sit for a day, and then check:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=cluster-version-operator.*Observed+a+panic.*runtime+error:+index+out+of+range' | grep -- '-4\.[78].*failures match' | sort
branch-ci-openshift-cnv-cnv-ci-release-4.7-e2e-upgrade - 11 runs, 55% failed, 17% of failures match
periodic-ci-openshift-release-master-ocp-4.7-e2e-ovirt - 11 runs, 36% failed, 25% of failures match
periodic-ci-openshift-release-master-ocp-4.7-e2e-vsphere-upi-serial - 10 runs, 100% failed, 10% of failures match
release-openshift-ocp-installer-e2e-gcp-ovn-4.7 - 9 runs, 22% failed, 50% of failures match

to confirm it has drained down to zero.

Comment 4 liujia 2021-02-04 08:53:29 UTC
Going through the bug, we could not reproduce the issue in QE's cluster after several attempts, so I checked the CI logs, in the same way as above, to verify the bug.

# w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=cluster-version-operator.*Observed+a+panic.*runtime+error:+index+out+of+range' | grep -- '-4\.[78].*failures match' | sort
release-openshift-ocp-installer-e2e-azure-4.7 - 8 runs, 38% failed, 33% of failures match  --[1]
release-openshift-origin-installer-e2e-azure-upgrade-4.6-stable-to-4.7-ci - 4 runs, 25% failed, 100% of failures match  --[2]
release-openshift-origin-installer-e2e-gcp-upgrade-4.7-stable-to-4.8-ci - 4 runs, 100% failed, 25% of failures match   --[3]

[1] was an old build (4.7.0-0.nightly-2021-01-29-162805) from before the PR merged.
[2] was an upgrade from v4.6.16 (needs the backport tracked in bug 1924194).
[3] was an upgrade from 4.7.0-fc.5, from before the PR merged.

All three jobs above can be excluded, so the bug is verified.

Comment 7 errata-xmlrpc 2021-02-24 15:37:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633