David Eads pointed out a panic in 4.6 CI [1,2]:

E1201 14:19:10.222361       1 runtime.go:78] Observed a panic: runtime.boundsError{x:0, y:0, signed:true, code:0x0} (runtime error: index out of range [0] with length 0)
goroutine 258 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x18f5be0, 0xc00084a000)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0xa6
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x89
panic(0x18f5be0, 0xc00084a000)
	/usr/lib/golang/src/runtime/panic.go:969 +0x175
github.com/openshift/cluster-version-operator/pkg/payload.RunGraph(0x1c1ea60, 0xc000368240, 0xc00025f350, 0x42, 0xc0017b05d0, 0x0, 0x0, 0x0)
	/go/src/github.com/openshift/cluster-version-operator/pkg/payload/task_graph.go:549 +0xf6c
...

The referenced line [3] is from [4], which is new in 4.6:

$ git log --oneline origin/master | grep 'Handle node pushing and result collection without a goroutine'
55ef3d30 pkg/payload/task_graph: Handle node pushing and result collection without a goroutine
$ git log --oneline origin/release-4.6 | grep 'Handle node pushing and result collection without a goroutine'
55ef3d30 pkg/payload/task_graph: Handle node pushing and result collection without a goroutine
$ git log --oneline origin/release-4.5 | grep -c 'Handle node pushing and result collection without a goroutine'
0

We should be able to name TaskNodes regardless of the presence of tasks within the node; more on that in [5]. But for this particular case, we can probably just exclude task-less nodes from the logging (a rough sketch of that guard follows the references below).

[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere/1333772997986619392
[2]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere/1333772997986619392/artifacts/e2e-vsphere/gather-extra/pods/openshift-cluster-version_cluster-version-operator-7cd49d5b57-s82jz_cluster-version-operator_previous.log
[3]: https://github.com/openshift/cluster-version-operator/blob/39a42566bfcca5970f3c8805ce4726d19b19417d/pkg/payload/task_graph.go#L549
[4]: https://github.com/openshift/cluster-version-operator/pull/264
[5]: https://github.com/openshift/cluster-version-operator/pull/435
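For illustration only, here is a minimal, self-contained sketch of the kind of guard discussed above: instead of indexing Tasks[0] unconditionally (the index-out-of-range panic at task_graph.go:549 when a node has no tasks), return a placeholder name for task-less nodes. The Task/TaskNode types below are simplified stand-ins and nodeName is a hypothetical helper, not the actual change in [5].

package main

import "fmt"

// Simplified stand-ins for the payload package's Task and TaskNode types;
// only the Tasks slice matters for this sketch.
type Task struct {
	Name string
}

type TaskNode struct {
	Tasks []*Task
}

// nodeName returns a placeholder for task-less nodes instead of indexing
// Tasks[0] on an empty slice, which is what panicked in RunGraph.
func nodeName(node TaskNode) string {
	if len(node.Tasks) == 0 {
		return "<no tasks>"
	}
	return node.Tasks[0].Name
}

func main() {
	empty := TaskNode{}
	full := TaskNode{Tasks: []*Task{{Name: "deployment \"openshift-apiserver/apiserver\""}}}
	fmt.Println(nodeName(empty)) // prints "<no tasks>" rather than panicking
	fmt.Println(nodeName(full))
}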
PR still needs a bit of work.
Hard to trigger this reliably in a one-off cluster. But we can let it sit for a day, and then check:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=cluster-version-operator.*Observed+a+panic.*runtime+error:+index+out+of+range' | grep -- '-4\.[78].*failures match' | sort
branch-ci-openshift-cnv-cnv-ci-release-4.7-e2e-upgrade - 11 runs, 55% failed, 17% of failures match
periodic-ci-openshift-release-master-ocp-4.7-e2e-ovirt - 11 runs, 36% failed, 25% of failures match
periodic-ci-openshift-release-master-ocp-4.7-e2e-vsphere-upi-serial - 10 runs, 100% failed, 10% of failures match
release-openshift-ocp-installer-e2e-gcp-ovn-4.7 - 9 runs, 22% failed, 50% of failures match

to confirm it has drained down to zero.
Going through the bug, we could not reproduce the issue in QE's cluster after several attempts, so I checked the CI logs in the same way as above to verify the bug.

# w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=cluster-version-operator.*Observed+a+panic.*runtime+error:+index+out+of+range' | grep -- '-4\.[78].*failures match' | sort
release-openshift-ocp-installer-e2e-azure-4.7 - 8 runs, 38% failed, 33% of failures match --[1]
release-openshift-origin-installer-e2e-azure-upgrade-4.6-stable-to-4.7-ci - 4 runs, 25% failed, 100% of failures match --[2]
release-openshift-origin-installer-e2e-gcp-upgrade-4.7-stable-to-4.8-ci - 4 runs, 100% failed, 25% of failures match --[3]

[1] was an old build (4.7.0-0.nightly-2021-01-29-162805) from before the PR merged.
[2] was an upgrade from v4.6.16 (needs the backport tracked in 1924194).
[3] was an upgrade from 4.7.0-fc.5, from before the PR merged.

All three jobs above can be excluded, so marking the bug verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633