Bug 1903382 - Panic when task-graph is canceled with a TaskNode with no tasks
Summary: Panic when task-graph is canceled with a TaskNode with no tasks
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.7.0
Assignee: W. Trevor King
QA Contact: liujia
Depends On:
Blocks: 1924194
Reported: 2020-12-01 23:16 UTC by W. Trevor King
Modified: 2021-02-26 22:51 UTC (History)
4 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Last Closed: 2021-02-24 15:37:21 UTC
Target Upstream Version:

Attachments (Terms of Use)

System ID Private Priority Status Summary Last Updated
Github openshift cluster-version-operator pull 484 0 None open Bug 1903382: pkg/payload/task_graph: Require firstIncompleteNode to have tasks 2021-02-01 03:47:00 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:37:55 UTC

Description W. Trevor King 2020-12-01 23:16:09 UTC
David Eads pointed out a panic in 4.6 CI [1,2]:

E1201 14:19:10.222361       1 runtime.go:78] Observed a panic: runtime.boundsError{x:0, y:0, signed:true, code:0x0} (runtime error: index out of range [0] with length 0)
goroutine 258 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x18f5be0, 0xc00084a000)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0xa6
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x89
panic(0x18f5be0, 0xc00084a000)
	/usr/lib/golang/src/runtime/panic.go:969 +0x175
github.com/openshift/cluster-version-operator/pkg/payload.RunGraph(0x1c1ea60, 0xc000368240, 0xc00025f350, 0x42, 0xc0017b05d0, 0x0, 0x0, 0x0)
	/go/src/github.com/openshift/cluster-version-operator/pkg/payload/task_graph.go:549 +0xf6c

The referenced line [3] is from [4], which is new in 4.6:

$ git log --oneline origin/master | grep 'Handle node pushing and result collection without a goroutine'
55ef3d30 pkg/payload/task_graph: Handle node pushing and result collection without a goroutine
$ git log --oneline origin/release-4.6 | grep 'Handle node pushing and result collection without a goroutine'
55ef3d30 pkg/payload/task_graph: Handle node pushing and result collection without a goroutine
$ git log --oneline origin/release-4.5 | grep -c 'Handle node pushing and result collection without a goroutine'

We should be able to name TaskNodes regardless of the presence of tasks within the node; some more on that in [5].  But for this particular case, we can probably just exclude task-less nodes from the logging.

[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere/1333772997986619392
[2]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere/1333772997986619392/artifacts/e2e-vsphere/gather-extra/pods/openshift-cluster-version_cluster-version-operator-7cd49d5b57-s82jz_cluster-version-operator_previous.log
[3]: https://github.com/openshift/cluster-version-operator/blob/39a42566bfcca5970f3c8805ce4726d19b19417d/pkg/payload/task_graph.go#L549
[4]: https://github.com/openshift/cluster-version-operator/pull/264
[5]: https://github.com/openshift/cluster-version-operator/pull/435
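The panic comes from logging code indexing into a TaskNode's task slice when that slice is empty. A minimal Go sketch of the failing pattern and the guard the fix adds; the `TaskNode` type and `firstTaskName` helper here are illustrative stand-ins, not the actual cluster-version-operator types:

```go
package main

import "fmt"

// TaskNode is a simplified stand-in for the CVO's payload TaskNode;
// the real type carries update tasks rather than strings.
type TaskNode struct {
	Tasks []string
}

// firstTaskName mimics the failing pattern: code that unconditionally
// indexes node.Tasks[0] panics with "index out of range [0] with
// length 0" when the node has no tasks. The length check is the kind
// of guard the fix adds, excluding task-less nodes from the logging.
func firstTaskName(node TaskNode) string {
	if len(node.Tasks) == 0 {
		return "<no tasks>"
	}
	return node.Tasks[0]
}

func main() {
	fmt.Println(firstTaskName(TaskNode{Tasks: []string{"task-a"}}))
	// Without the guard, this call would be the boundsError seen in the trace.
	fmt.Println(firstTaskName(TaskNode{}))
}
```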

Comment 1 W. Trevor King 2020-12-04 22:45:26 UTC
PR still needs a bit of work.

Comment 3 W. Trevor King 2021-02-02 18:57:25 UTC
Hard to trigger this reliably in a one-off cluster.  But we can let it sit for a day, and then check:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=cluster-version-operator.*Observed+a+panic.*runtime+error:+index+out+of+range' | grep -- '-4\.[78].*failures match' | sort
branch-ci-openshift-cnv-cnv-ci-release-4.7-e2e-upgrade - 11 runs, 55% failed, 17% of failures match
periodic-ci-openshift-release-master-ocp-4.7-e2e-ovirt - 11 runs, 36% failed, 25% of failures match
periodic-ci-openshift-release-master-ocp-4.7-e2e-vsphere-upi-serial - 10 runs, 100% failed, 10% of failures match
release-openshift-ocp-installer-e2e-gcp-ovn-4.7 - 9 runs, 22% failed, 50% of failures match

to confirm it has drained down to zero.

Comment 4 liujia 2021-02-04 08:53:29 UTC
Going through the bug, we could not reproduce and verify the issue in QE's cluster after several attempts, so I checked the CI logs, in the way described above, to verify the bug.

# w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=cluster-version-operator.*Observed+a+panic.*runtime+error:+index+out+of+range' | grep -- '-4\.[78].*failures match' | sort
release-openshift-ocp-installer-e2e-azure-4.7 - 8 runs, 38% failed, 33% of failures match  --[1]
release-openshift-origin-installer-e2e-azure-upgrade-4.6-stable-to-4.7-ci - 4 runs, 25% failed, 100% of failures match  --[2]
release-openshift-origin-installer-e2e-gcp-upgrade-4.7-stable-to-4.8-ci - 4 runs, 100% failed, 25% of failures match   --[3]

[1] was an old build (4.7.0-0.nightly-2021-01-29-162805) from before the PR merged.
[2] was an upgrade from v4.6.16 (needs the backport tracked in bug 1924194).
[3] was an upgrade from 4.7.0-fc.5, from before the PR merged.

All three jobs above can be excluded, so the bug is verified.

Comment 7 errata-xmlrpc 2021-02-24 15:37:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

