Bug 1926180

Summary: [sig-cli] oc adm must-gather runs successfully for audit logs [Suite:openshift/conformance/parallel]
Product: OpenShift Container Platform
Component: oc
Version: 4.6
Target Release: 4.6.z
Hardware: Unspecified
OS: Unspecified
Status: CLOSED DEFERRED
Severity: medium
Priority: low
Reporter: Maciej Szulik <maszulik>
Assignee: Maciej Szulik <maszulik>
QA Contact: zhou ying <yinzhou>
CC: aos-bugs, jcallen, maszulik, mfojtik, sjenning, wking, yinzhou
Flags: mfojtik: needinfo?
Whiteboard: tag-ci LifecycleReset
Clone Of: 1883523
Bug Depends On: 1883523
Last Closed: 2022-01-17 16:25:22 UTC

Description Maciej Szulik 2021-02-08 12:16:48 UTC
+++ This bug was initially created as a clone of Bug #1883523 +++

Description of problem:

Panic in test:

STEP: /tmp/test.oc-adm-must-gather.481567317/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-33485bb4fd1f7c442f3d95f7326d8a84ea0f450d10c4268f488b0a44fedd3a72/audit_logs/openshift-apiserver/ci-op-li8k2qfg-0aec4-6gzfn-master-2-audit.log.gz
E0929 03:01:08.683956   20595 runtime.go:76] Observed a panic: 
Your test failed.
Ginkgo panics to prevent subsequent assertions from running.
Normally Ginkgo rescues this panic so you shouldn't see it.

But, if you make an assertion in a goroutine, Ginkgo can't capture the panic.
To circumvent this, you should call

	defer GinkgoRecover()

at the top of the goroutine that caused this panic.

goroutine 1 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x4c49d40, 0x633b320)
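
As the Ginkgo message above says, an assertion made inside a spawned goroutine needs GinkgoRecover deferred at the top of that goroutine. A minimal sketch, assuming origin's usual g/o aliases for ginkgo and gomega (gatherAuditLogs is a hypothetical stand-in for the real work):

	go func() {
		// Re-route a panic from this goroutine back into the running spec
		// instead of crashing the whole test process.
		defer g.GinkgoRecover()
		err := gatherAuditLogs() // illustrative helper, not the actual test code
		o.Expect(err).NotTo(o.HaveOccurred())
	}()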


Example of failure in vsphere:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere/1310755797709361152


Search results:

https://search.ci.openshift.org/?search=+oc+adm+must-gather+runs+successfully+for+audit+&maxAge=336h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

--- Additional comment from Maciej Szulik on 2020-09-30 12:25:14 CEST ---

The failures are legit, but the panic is not the problem here. From looking at recent failures, the problem is
that events are not matching this prefix:

<string>: {"kind":"Event",

and thus fail.
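
A minimal sketch of the failing check, assuming a line scanner over the decompressed audit log (the reader and names are illustrative, not the actual test code; imports bufio, fmt, and strings assumed):

	// Every line of the gathered audit log is expected to be a
	// JSON-serialized audit Event.
	scanner := bufio.NewScanner(auditLogReader)
	for scanner.Scan() {
		if !strings.HasPrefix(scanner.Text(), `{"kind":"Event",`) {
			return fmt.Errorf("not an audit event line: %q", scanner.Text())
		}
	}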

I'll continue to investigate, but it's not a blocker for 4.6.

--- Additional comment from Maciej Szulik on 2020-10-01 16:50:30 CEST ---

I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

--- Additional comment from Gabe Montero on 2020-10-21 15:06:56 CEST ---

Hey Maciej

I've been getting some fairly consistent sig-cli e2e failures in one of my 4.7 openshift/origin PRs, which I believe are unrelated to my changes.

In the failure I'm seeing most often, in timeout.sh, the diagnostics show:

FAILURE after 1.000s: test/cmd/timeout.sh:16: executing 'oc get dc/testdc -w -v=5 --request-timeout=1s 2>&1' expecting success and text 'request canceled': the output content test failed
Standard output from the command:
NAME     REVISION   DESIRED   CURRENT   TRIGGERED BY
testdc   0          1         0         config
testdc   1          1         0         config
testdc   1          1         0         config
testdc   1          1         0         config
I1020 21:52:31.313507     135 streamwatcher.go:117] Unable to decode an event from the watch stream: context deadline exceeded (Client.Timeout or context cancellation while reading body)

There was no error output from the command.
[ERROR] hack/lib/cmd.sh:30: `return "${return_code}"` exited with status 1.

Any chance that "Unable to decode an event" lines up with your analysis in comment 1?

The latest run in my PR with this is https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25595/pull-ci-openshift-origin-master-e2e-cmd/1318655720769458176

thanks

--- Additional comment from Maciej Szulik on 2020-10-22 13:56:52 CEST ---

Gabe, test-cmd is a different beast: those tests don't have any retries and are very dumb. That's why we're currently rewriting
all of test-cmd into proper e2e tests in Go. See https://github.com/openshift/origin/blob/master/test/extended/cli/admin.go, for example.
That unreliability is also why we're not enforcing test-cmd on the origin repo: it runs to provide a signal, but it's not a blocking test.
It blocks only in oc, where it actually matters.

For your failure, you've probably hit a small window between API restarts, which makes the test fail, and that is completely different
from the audit events mentioned in comment 1. That event is actually an audit event, not a regular event like the ones you're probably
used to working with on a daily basis.
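
For contrast, a minimal sketch of the poll-and-retry shape the Go e2e tests use, assuming k8s.io/apimachinery's wait helpers and origin's oc CLI test fixture (the condition body is illustrative; imports strings and time assumed):

	// Retry the check for up to a minute instead of failing on the first
	// miss, which rides out small windows such as API server restarts.
	err := wait.PollImmediate(2*time.Second, time.Minute, func() (bool, error) {
		out, err := oc.Run("get").Args("dc/testdc").Output()
		if err != nil {
			return false, nil // transient error, try again
		}
		return strings.Contains(out, "testdc"), nil
	})
	o.Expect(err).NotTo(o.HaveOccurred())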

Hope that clears your doubts :-)

--- Additional comment from Gabe Montero on 2020-10-22 15:08:36 CEST ---

Thanks for the info Maciej.

FWIW, we eventually worked around the existing timing windows and got a clean e2e-cmd run in the PR I referenced.

--- Additional comment from Maciej Szulik on 2020-10-22 15:19:04 CEST ---

Great to hear that!

--- Additional comment from Maciej Szulik on 2020-11-13 12:07:29 CET ---

I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

--- Additional comment from Michal Fojtik on 2020-11-21 15:12:06 CET ---

This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

--- Additional comment from Seth Jennings on 2021-01-19 22:23:06 CET ---

This bug is still a thing
https://search.ci.openshift.org/?search=oc+adm+must-gather+runs+successfully+for+audit+logs&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

Failures seem to be concentrated in 4.6 jobs.  Is there a fix that maybe didn't get backported?

--- Additional comment from Michal Fojtik on 2021-01-19 23:20:26 CET ---

The LifecycleStale keyword was removed because the bug got commented on recently.
The bug assignee was notified.

--- Additional comment from Maciej Szulik on 2021-01-20 12:24:21 CET ---

This is still happening, although it's scoped to 4.6 and OVN specifically. All of those runs are failing
to match "{"kind":"Event","; this was fixed in both oc and must-gather by adding an extra sync after
collection, see:
- https://github.com/openshift/must-gather/pull/176
- https://github.com/openshift/oc/pull/574

I'm not seeing this error in any of the master jobs, so I'm lowering the priority to low.
I'm not sure what else can be done; the above changes are present in 4.6 as well.
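
For reference, the shape of that fix, as a hypothetical sketch using only the standard library's os package (writeAndSync is illustrative, not the actual PR code):

	// Write the collected bytes, then fsync so the data is durably on disk
	// before the gather pod signals completion and files are copied out.
	func writeAndSync(path string, data []byte) error {
		f, err := os.Create(path)
		if err != nil {
			return err
		}
		defer f.Close()
		if _, err := f.Write(data); err != nil {
			return err
		}
		// Without the Sync, the copy can race with buffered writes and
		// produce a truncated log that fails the prefix match above.
		return f.Sync()
	}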

--- Additional comment from Maciej Szulik on 2021-02-05 15:44:37 CET ---

I'm moving this to verification, with one caveat: this is fixed only for 4.7 and above.

--- Additional comment from W. Trevor King on 2021-02-05 18:56:38 CET ---

> ...one caveat this is fixed only for 4.7 and above.

This bug report is for 4.6, and may be keeping Sippy from complaining about a test-case failure that is not associated with a bug.  If it is a common enough failure mode in 4.6 CI, you might need to either backport the 4.7 fix or just leave the bug open on low priority until 4.6 is end-of-lifed (which will be quite a while) to keep build-monitors from continually opening new 4.6 bugs for this test-case.

--- Additional comment from Maciej Szulik on 2021-02-08 13:16:23 CET ---

Yeah, that's fair. Opening a clone for 4.6.

Comment 1 Maciej Szulik 2021-02-08 12:17:30 UTC
Backport these two to 4.6:
- https://github.com/openshift/must-gather/pull/176
- https://github.com/openshift/oc/pull/574

Comment 2 Michal Fojtik 2021-03-10 12:26:58 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 4 Michal Fojtik 2022-01-15 07:11:49 UTC
The LifecycleStale keyword was removed because the bug got commented on recently.
The bug assignee was notified.

Comment 5 Maciej Szulik 2022-01-17 16:25:22 UTC
Fix in 4.7+