Bug 1861201

Summary: [sig-cli] oc adm must-gather runs successfully for audit logs
Product: OpenShift Container Platform Reporter: W. Trevor King <wking>
Component: oc Assignee: Jan Chaloupka <jchaloup>
Status: CLOSED ERRATA QA Contact: zhou ying <yinzhou>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.6 CC: amcdermo, aos-bugs, jokerman, mfojtik, tnozicka
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-10-27 16:17:40 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description W. Trevor King 2020-07-28 04:06:42 UTC
test:
[sig-cli] oc adm must-gather runs successfully for audit logs

is failing frequently in CI, see search results:
https://search.svc.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&search=%5C%5Bsig-cli%5C%5D+oc+adm+must-gather+runs+successfully+for+audit+logs

$ w3m -dump -cols 200 'https://search.svc.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&search=%5C%5Bsig-cli%5C%5D+oc+adm+must-gather+runs+successfully+for+audit+logs' | grep 'failures match' | sort
promote-release-openshift-machine-os-content-e2e-aws-4.6 - 546 runs, 56% failed, 1% of failures match
pull-ci-cri-o-cri-o-master-e2e-aws - 241 runs, 62% failed, 17% of failures match
pull-ci-cri-o-cri-o-release-1.19-e2e-aws - 6 runs, 50% failed, 67% of failures match
pull-ci-openshift-cloud-credential-operator-master-e2e-aws - 18 runs, 39% failed, 43% of failures match
pull-ci-openshift-cluster-api-provider-aws-master-e2e-aws - 6 runs, 50% failed, 33% of failures match
...
pull-ci-operator-framework-operator-lifecycle-manager-master-e2e-gcp - 94 runs, 57% failed, 24% of failures match
pull-ci-operator-framework-operator-registry-master-e2e-aws - 33 runs, 58% failed, 26% of failures match
rehearse-10454-pull-ci-cri-o-cri-o-master-e2e-aws - 3 runs, 33% failed, 100% of failures match
rehearse-10454-pull-ci-openshift-cloud-credential-operator-master-e2e-azure - 3 runs, 67% failed, 50% of failures match
rehearse-10454-pull-ci-openshift-cluster-network-operator-master-e2e-aws-sdn-multi - 3 runs, 33% failed, 100% of failures match
rehearse-10454-pull-ci-openshift-cluster-network-operator-master-e2e-azure - 3 runs, 33% failed, 100% of failures match
rehearse-10454-pull-ci-openshift-installer-master-e2e-gcp-shared-vpc - 3 runs, 33% failed, 100% of failures match
release-openshift-ocp-e2e-aws-scaleup-rhel7-4.6 - 28 runs, 61% failed, 24% of failures match
release-openshift-ocp-installer-e2e-azure-4.6 - 71 runs, 66% failed, 28% of failures match
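
The percentages above are relative to different bases (runs, then failures), so the absolute impact per job is not obvious at a glance. A minimal sketch for turning one 'failures match' line into an estimated count of matching failures, assuming the line format shown above stays stable; the names LINE_RE and estimate_matches are illustrative and not part of any CI tooling:

```python
import re

# Matches lines such as:
#   "<job> - 71 runs, 66% failed, 28% of failures match"
LINE_RE = re.compile(
    r"^(?P<job>\S+) - (?P<runs>\d+) runs, (?P<failed>\d+)% failed, "
    r"(?P<match>\d+)% of failures match"
)


def estimate_matches(line):
    """Return (job, estimated matching failures), or None if the line does not parse.

    The estimate is runs * failed% * match%. The search page rounds its
    percentages, so the result is approximate.
    """
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    runs = int(m.group("runs"))
    failed = int(m.group("failed")) / 100
    match = int(m.group("match")) / 100
    return m.group("job"), runs * failed * match
```

For the azure-4.6 line this gives roughly 13 matching failures out of 71 runs. Lines that do not follow the format, such as the elided "...", return None.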

Picking [1] as a release-informing example, the test case flaked: it failed once and passed on retry.  The failure included:

STEP: Found 0 events.
Jul 27 21:22:40.551: INFO: POD  NODE  PHASE  GRACE  CONDITIONS
Jul 27 21:22:40.551: INFO: 
Jul 27 21:22:40.675: INFO: skipping dumping cluster info - cluster too large
Jul 27 21:22:40.724: INFO: Deleted {user.openshift.io/v1, Resource=users  e2e-test-oc-adm-must-gather-gb28p-user}, err: <nil>
Jul 27 21:22:40.774: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthclients  e2e-client-e2e-test-oc-adm-must-gather-gb28p}, err: <nil>
Jul 27 21:22:40.825: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthaccesstokens  QWtOCZRnTwiQ1Op-xQtGeAAAAAAAAAAA}, err: <nil>
[AfterEach] [sig-cli] oc adm must-gather
  github.com/openshift/origin@/test/extended/util/client.go:134
Jul 27 21:22:40.825: INFO: Waiting up to 7m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-test-oc-adm-must-gather-gb28p" for this suite.
Jul 27 21:22:40.914: INFO: Running AfterSuite actions on all nodes
Jul 27 21:22:40.915: INFO: Running AfterSuite actions on node 1
fail [github.com/openshift/origin@/test/extended/cli/mustgather.go:248]: Expected
    <int>: 0
to be >
    <int>: 1000
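
The assertion shows a counter that was 0 where more than 1000 was expected. This report does not say what the counter measures. Assuming it tallies audit events found in the gathered, gzip-compressed audit logs (an assumption, not confirmed here), a rough local equivalent of that count could look like the sketch below. The function name and the "auditID" key check are illustrative choices, not taken from mustgather.go:

```python
import gzip
import json


def count_audit_events(paths):
    """Count lines that parse as JSON objects carrying an "auditID" key.

    Assumption: each gathered audit log is a gzip file with one JSON event
    per line. Lines that are not valid JSON are skipped, not treated as errors.
    """
    count = 0
    for path in paths:
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            for line in fh:
                try:
                    event = json.loads(line)
                except json.JSONDecodeError:
                    continue
                if isinstance(event, dict) and "auditID" in event:
                    count += 1
    return count
```

With no files gathered at all, this returns 0, which would be consistent with the value in the failure above, although other causes (for example unparseable files) would give the same result.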

An example PR presubmit where the test-case failed both times and was the only failure is [2].  Possibly related to the PR which landed for bug 1859916.

[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.6/1287846471231606784
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/406/pull-ci-openshift-cluster-version-operator-master-e2e/1287918162578247680

Comment 3 zhou ying 2020-07-29 06:46:17 UTC
Since verifying this bug requires checking the failure ratio over time, I will check again in a few days.

Comment 4 zhou ying 2020-08-06 01:28:23 UTC
w3m -dump -cols 200  'https://search.ci.openshift.org/?search=oc+adm+must-gather+runs+successfully+for+audit+logs&maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job' | grep 'failures match' | sort

pull-ci-openshift-cloud-credential-operator-master-e2e-aws - 14 runs, 71% failed, 10% of failures match
pull-ci-openshift-cluster-etcd-operator-master-e2e-azure - 50 runs, 96% failed, 2% of failures match
pull-ci-openshift-cluster-network-operator-master-e2e-azure - 94 runs, 94% failed, 6% of failures match
pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn - 121 runs, 90% failed, 1% of failures match
pull-ci-openshift-cluster-network-operator-master-e2e-openstack - 138 runs, 99% failed, 1% of failures match
pull-ci-openshift-cluster-network-operator-master-e2e-ovn-step-registry - 103 runs, 93% failed, 1% of failures match
pull-ci-openshift-cluster-network-operator-master-e2e-vsphere - 119 runs, 98% failed, 1% of failures match
pull-ci-openshift-machine-api-operator-master-e2e-azure - 49 runs, 67% failed, 3% of failures match
pull-ci-openshift-machine-config-operator-master-e2e-ovn-step-registry - 171 runs, 95% failed, 1% of failures match
release-openshift-ocp-installer-e2e-azure-ovn-4.6 - 59 runs, 83% failed, 2% of failures match
release-openshift-ocp-installer-e2e-openstack-4.6 - 74 runs, 88% failed, 2% of failures match
release-openshift-origin-installer-e2e-azure-4.6 - 221 runs, 64% failed, 3% of failures match


The failure ratio has gone down. I checked some of the failed logs and can't reproduce the issue now, so I will mark this as verified.

Comment 6 errata-xmlrpc 2020-10-27 16:17:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
