Description of problem:

Panic in test:

STEP: /tmp/test.oc-adm-must-gather.481567317/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-33485bb4fd1f7c442f3d95f7326d8a84ea0f450d10c4268f488b0a44fedd3a72/audit_logs/openshift-apiserver/ci-op-li8k2qfg-0aec4-6gzfn-master-2-audit.log.gz
E0929 03:01:08.683956   20595 runtime.go:76] Observed a panic: Your test failed. Ginkgo panics to prevent subsequent assertions from running. Normally Ginkgo rescues this panic so you shouldn't see it. But, if you make an assertion in a goroutine, Ginkgo can't capture the panic. To circumvent this, you should call defer GinkgoRecover() at the top of the goroutine that caused this panic.
goroutine 1 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x4c49d40, 0x633b320)

Example of failure in vsphere:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere/1310755797709361152

Search results:
https://search.ci.openshift.org/?search=+oc+adm+must-gather+runs+successfully+for+audit+&maxAge=336h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
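For reference, the pattern the Ginkgo message is asking for looks roughly like the following. This is a minimal sketch of the GinkgoRecover idiom, not the actual failing test; the spec name and assertion are made up for illustration:

package cli_test

import (
	. "github.com/onsi/ginkgo"
	. "github.com/onsi/gomega"
)

var _ = Describe("must-gather audit logs", func() {
	It("checks files concurrently", func() {
		done := make(chan struct{})
		go func() {
			// Without this, a failed assertion inside the goroutine
			// panics the whole process instead of failing just this spec.
			defer GinkgoRecover()
			defer close(done)
			Expect(`{"kind":"Event","level":"Metadata"}`).To(HavePrefix(`{"kind":"Event",`))
		}()
		<-done
	})
})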
The failures are legit, but the panic itself is not a problem here. From what I've seen in recent failures, the problem is that events are not matching this prefix:

<string>: {"kind":"Event",

and thus fail the check. I'll continue to investigate, but it's not a blocker for 4.6.
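For context, what fails is essentially a prefix match on each line of the gathered audit log. A rough sketch of that kind of check follows; the file handling and function name here are my own illustration, not the actual origin test code:

package main

import (
	"bufio"
	"compress/gzip"
	"fmt"
	"os"
	"strings"
)

// checkAuditLog verifies every line of a gzipped audit log is a
// JSON-serialized audit event, i.e. starts with the expected prefix.
func checkAuditLog(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	gz, err := gzip.NewReader(f)
	if err != nil {
		return err
	}
	defer gz.Close()

	scanner := bufio.NewScanner(gz)
	// Audit events can be large; raise the default 64KiB line limit.
	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024)
	for scanner.Scan() {
		line := scanner.Text()
		// A truncated flush at collection time leaves partial lines
		// that fail exactly this kind of check.
		if !strings.HasPrefix(line, `{"kind":"Event",`) {
			return fmt.Errorf("unexpected line in %s: %s", path, line)
		}
	}
	return scanner.Err()
}

func main() {
	if err := checkAuditLog(os.Args[1]); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}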
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
Hey Maciej, I've been getting some fairly consistent sig-cli e2e failures in one of my 4.7 openshift/origin PRs which I believe are unrelated to my changes. In the failure I'm seeing most often, in timeout.sh, the diagnostics show:

FAILURE after 1.000s: test/cmd/timeout.sh:16: executing 'oc get dc/testdc -w -v=5 --request-timeout=1s 2>&1' expecting success and text 'request canceled': the output content test failed
Standard output from the command:
NAME     REVISION   DESIRED   CURRENT   TRIGGERED BY
testdc   0          1         0         config
testdc   1          1         0         config
testdc   1          1         0         config
testdc   1          1         0         config
I1020 21:52:31.313507     135 streamwatcher.go:117] Unable to decode an event from the watch stream: context deadline exceeded (Client.Timeout or context cancellation while reading body)
There was no error output from the command.
[ERROR] hack/lib/cmd.sh:30: `return "${return_code}"` exited with status 1.

Any chance that "unable to decode an event" lines up with your analysis in comment 1? The latest run in my PR with this is https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25595/pull-ci-openshift-origin-master-e2e-cmd/1318655720769458176

Thanks
Gabe, test-cmd is a different beast: it has no retries and the tests are very dumb. That's why we're currently rewriting all of test-cmd into proper e2e tests in Go; see https://github.com/openshift/origin/blob/master/test/extended/cli/admin.go, for example. That unreliability is also why we're not enforcing test-cmd on the origin repo: it runs to provide a signal, but it's not a blocking test. It's blocking only in oc, where it actually matters. For your failure, you've probably hit a small window between apiserver restarts, which made the test fail; that is completely different from the audit events mentioned in comment 1. That event is actually an audit event, not a regular event like the ones you're probably used to working with on a daily basis. Hope that clears your doubts :-)
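The practical difference is one-shot assertions versus polling. Below is a hedged sketch of the retrying style the Go e2e rewrites tend to use; wait.PollImmediate is the real apimachinery helper, while runOC is a hypothetical stand-in for the suite's CLI wrapper, not actual origin code:

package main

import (
	"fmt"
	"os/exec"
	"strings"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// runOC is a hypothetical stand-in for the e2e suite's CLI wrapper.
func runOC(args ...string) string {
	out, _ := exec.Command("oc", args...).CombinedOutput()
	return string(out)
}

func main() {
	// Instead of failing on the first unexpected output (as timeout.sh does),
	// poll for up to a minute so a brief apiserver restart doesn't sink the test.
	err := wait.PollImmediate(2*time.Second, time.Minute, func() (bool, error) {
		out := runOC("get", "dc/testdc", "-w", "-v=5", "--request-timeout=1s")
		// Succeed once the expected cancellation message shows up; otherwise retry.
		return strings.Contains(out, "request canceled"), nil
	})
	if err != nil {
		fmt.Println("test failed:", err)
	}
}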
Thanks for the info, Maciej. FWIW, we eventually got around the existing timing windows and got a clean e2e-cmd run in the PR I referenced.
Great to hear that!
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.
This bug is still a thing https://search.ci.openshift.org/?search=oc+adm+must-gather+runs+successfully+for+audit+logs&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job Failures seem to be concentrated in 4.6 jobs. Is there a fix that maybe didn't get backported?
The LifecycleStale keyword was removed because the bug got commented on recently. The bug assignee was notified.
This is still happening, although it's scoped to 4.6, and to OVN jobs specifically. All of those are failing to match "{"kind":"Event",", which was fixed in both oc and must-gather by adding an extra sync after collection, see:
- https://github.com/openshift/must-gather/pull/176
- https://github.com/openshift/oc/pull/574
I'm not seeing this error in any of the master jobs, so I'm lowering the priority to low. I'm not sure what else can be done, since the above changes are present in 4.6 as well.
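For anyone reading along, the fix in those PRs amounts to forcing buffered log data to disk before anything reads the gathered files back, so no partial JSON lines are observed. A minimal sketch of that idea, assuming a simple copy helper (this is illustrative, not the actual contents of the linked PRs):

package gather

import (
	"io"
	"os"
)

// copyAndSync writes collected log data to dst and fsyncs it before
// returning, so a reader that opens the file immediately afterwards
// sees complete lines rather than a partially flushed tail.
func copyAndSync(dst string, src io.Reader) error {
	f, err := os.Create(dst)
	if err != nil {
		return err
	}
	defer f.Close()

	if _, err := io.Copy(f, src); err != nil {
		return err
	}
	// The "extra sync after collection": flush OS buffers to disk.
	return f.Sync()
}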
I'm moving this to verification, with one caveat: this is fixed only for 4.7 and above.
> ...one caveat this is fixed only for 4.7 and above. This bug report is for 4.6, and may be keeping Sippy from complaining about a test-case failure that is not associated with a bug. If it is a common enough failure mode in 4.6 CI, you might need to either backport the 4.7 fix or just leave the bug open on low priority until 4.6 is end-of-lifed (which will be quite a while) to keep build-monitors from continually opening new 4.6 bugs for this test-case.
Yeah, that's fair. Opening a clone for 4.6.
Maciej Szulik, I can still see the error: to match "{"kind":"Event","
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.6-stable-to-4.7-ci/1358289736799621120
Could you please take a look? Thanks.
(In reply to zhou ying from comment #16)
> Maciej Szulik
>
> I still could see the error: to match "{"kind":"Event","
> https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.6-stable-to-4.7-ci/1358289736799621120

Yeah, it's possible. I've opened a clone of this to backport those fixes to 4.6, so we can silence these bugs entirely.
No reproduction on 4.7; moving to verified status.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633