Bug 1883523 - [sig-cli] oc adm must-gather runs successfully for audit logs [Suite:openshift/conformance/parallel]
Summary: [sig-cli] oc adm must-gather runs successfully for audit logs [Suite:openshift/conformance/parallel]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: oc
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Severity: low
Priority: low
Target Milestone: ---
Target Release: 4.7.0
Assignee: Maciej Szulik
QA Contact: zhou ying
URL:
Whiteboard: LifecycleReset
Depends On:
Blocks: 1926180
Reported: 2020-09-29 13:36 UTC by Joseph Callen
Modified: 2021-02-24 15:21 UTC
CC: 6 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1926180
Environment:
Last Closed: 2021-02-24 15:21:17 UTC
Target Upstream Version:
Embargoed:


Attachments:


Links:
Red Hat Product Errata RHSA-2020:5633 (Last Updated: 2021-02-24 15:21:38 UTC)

Description Joseph Callen 2020-09-29 13:36:29 UTC
Description of problem:

Panic in test:

STEP: /tmp/test.oc-adm-must-gather.481567317/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-33485bb4fd1f7c442f3d95f7326d8a84ea0f450d10c4268f488b0a44fedd3a72/audit_logs/openshift-apiserver/ci-op-li8k2qfg-0aec4-6gzfn-master-2-audit.log.gz
E0929 03:01:08.683956   20595 runtime.go:76] Observed a panic: 
Your test failed.
Ginkgo panics to prevent subsequent assertions from running.
Normally Ginkgo rescues this panic so you shouldn't see it.

But, if you make an assertion in a goroutine, Ginkgo can't capture the panic.
To circumvent this, you should call

	defer GinkgoRecover()

at the top of the goroutine that caused this panic.

goroutine 1 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x4c49d40, 0x633b320)
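The Ginkgo message above describes a general Go mechanic: recover() only catches a panic raised on the same goroutine's stack, so a panic in a spawned goroutine must be rescued by a handler deferred inside that goroutine, which is exactly what `defer GinkgoRecover()` does in a spec. A minimal stdlib-only sketch of that mechanic (illustrative only, not the test code from this bug):

```go
package main

import "fmt"

// runGoroutineWithRecover shows why a caller's own recover() can never
// catch a panic from a spawned goroutine: the deferred recover must live
// inside the goroutine that panics. Ginkgo's GinkgoRecover() follows the
// same pattern, routing the rescued panic back to the test framework.
func runGoroutineWithRecover() string {
	result := make(chan string, 1)
	go func() {
		// Without this deferred handler, the panic would crash the
		// whole process before the caller could react.
		defer func() {
			if r := recover(); r != nil {
				result <- fmt.Sprintf("recovered: %v", r)
			}
		}()
		panic("assertion failed in goroutine")
	}()
	return <-result
}

func main() {
	fmt.Println(runGoroutineWithRecover())
}
```

Removing the deferred func here reproduces the "Ginkgo can't capture the panic" situation the message warns about.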


Example of failure in vsphere:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere/1310755797709361152


Search results:

https://search.ci.openshift.org/?search=+oc+adm+must-gather+runs+successfully+for+audit+&maxAge=336h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Maciej Szulik 2020-09-30 10:25:14 UTC
The failures are legit, but the panic itself is not the problem here. From what I saw in the recent failures, the problem is that events are not matching this prefix:

<string>: {"kind":"Event",

and thus the test fails.

I'll continue to investigate, but it's not a blocker for 4.6.

Comment 2 Maciej Szulik 2020-10-01 14:50:30 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 3 Gabe Montero 2020-10-21 13:06:56 UTC
Hey Maciej

I've been getting some fairly consistent sig-cli e2e failures in one of my 4.7 openshift/origin PRs which I believe are unrelated to my changes.

In the failure I'm seeing most often, in timeout.sh, the diagnostics show:

FAILURE after 1.000s: test/cmd/timeout.sh:16: executing 'oc get dc/testdc -w -v=5 --request-timeout=1s 2>&1' expecting success and text 'request canceled': the output content test failed
Standard output from the command:
NAME     REVISION   DESIRED   CURRENT   TRIGGERED BY
testdc   0          1         0         config
testdc   1          1         0         config
testdc   1          1         0         config
testdc   1          1         0         config
I1020 21:52:31.313507     135 streamwatcher.go:117] Unable to decode an event from the watch stream: context deadline exceeded (Client.Timeout or context cancellation while reading body)

There was no error output from the command.
[ERROR] hack/lib/cmd.sh:30: `return "${return_code}"` exited with status 1.

Any chance that the "Unable to decode an event" line lines up with your analysis in comment 1?

The latest run in my PR with this is https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25595/pull-ci-openshift-origin-master-e2e-cmd/1318655720769458176

thanks

Comment 4 Maciej Szulik 2020-10-22 11:56:52 UTC
Gabe, test-cmd is a different beast: those tests don't have any retries and are very dumb. That's why we're currently
re-writing all of test-cmd into proper e2e tests in Go. See https://github.com/openshift/origin/blob/master/test/extended/cli/admin.go, for example.
That unreliability is also why we're not enforcing test-cmd on the origin repo; it runs to get the signal, but it's not a blocking test.
It's blocking only in oc, where it actually matters.

For your failure, you've probably hit a small window between API restarts during which the test fails, and that is completely
different from the audit events mentioned in comment 1. That event is actually an audit event, not a regular event like the
ones you're probably used to working with on a daily basis.

Hope that clears your doubts :-)

Comment 5 Gabe Montero 2020-10-22 13:08:36 UTC
Thanks for the info Maciej.

FWIW, we eventually worked around the existing timing windows and got a clean e2e-cmd run in the PR I referenced.

Comment 6 Maciej Szulik 2020-10-22 13:19:04 UTC
Great to hear that!

Comment 7 Maciej Szulik 2020-11-13 11:07:29 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 8 Michal Fojtik 2020-11-21 14:12:06 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 9 Seth Jennings 2021-01-19 21:23:06 UTC
This bug is still occurring:
https://search.ci.openshift.org/?search=oc+adm+must-gather+runs+successfully+for+audit+logs&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

Failures seem to be concentrated in 4.6 jobs.  Is there a fix that maybe didn't get backported?

Comment 10 Michal Fojtik 2021-01-19 22:20:26 UTC
The LifecycleStale keyword was removed because the bug got commented on recently.
The bug assignee was notified.

Comment 11 Maciej Szulik 2021-01-20 11:24:21 UTC
This is still happening, although it's scoped specifically to 4.6 and OVN. All of those failures are failing
to match "{"kind":"Event"," which was fixed in both oc and must-gather by adding an extra sync after
collection, see:
- https://github.com/openshift/must-gather/pull/176
- https://github.com/openshift/oc/pull/574

I'm not seeing this error in any of the master jobs, so I'm lowering its priority to low.
I'm not sure what else can be done; the above changes are present in 4.6 as well.

Comment 12 Maciej Szulik 2021-02-05 14:44:37 UTC
I'm moving this to verification, with one caveat: this is fixed only for 4.7 and above.

Comment 13 W. Trevor King 2021-02-05 17:56:38 UTC
> ...one caveat this is fixed only for 4.7 and above.

This bug report is for 4.6, and may be keeping Sippy from complaining about a test-case failure that is not associated with a bug.  If it is a common enough failure mode in 4.6 CI, you might need to either backport the 4.7 fix or just leave the bug open on low priority until 4.6 is end-of-lifed (which will be quite a while) to keep build-monitors from continually opening new 4.6 bugs for this test-case.

Comment 14 Maciej Szulik 2021-02-08 12:16:23 UTC
Yeah, that's fair. Opening a clone for 4.6.

Comment 16 zhou ying 2021-02-09 03:37:39 UTC
Maciej Szulik,

I can still see the error failing to match "{"kind":"Event",":
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.6-stable-to-4.7-ci/1358289736799621120

Could you please help take a look? Thanks.

Comment 17 Maciej Szulik 2021-02-09 09:42:32 UTC
(In reply to zhou ying from comment #16)
> Maciej Szulik 
> 
> I still could see the error: to match "{"kind":"Event"," 
> https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-
> origin-installer-e2e-aws-upgrade-4.6-stable-to-4.7-ci/1358289736799621120

Yeah, that's possible. I've opened a clone of this bug to backport these fixes to 4.6 so
we silence the bugs entirely.

Comment 18 zhou ying 2021-02-09 11:56:42 UTC
No reproduction on 4.7; moving to verified status.

Comment 21 errata-xmlrpc 2021-02-24 15:21:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

