1893691 – OCS4.6 must_gather fails to complete in 600sec

Summary: OCS4.6 must_gather fails to complete in 600sec

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenShift Container Storage
Classification:	Red Hat Storage
Component:	must-gather
Sub Component:
Version:	4.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	OCS 4.6.0
Assignee:	Pulkit Kundra
QA Contact:	Oded
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-11-02 11:28 UTC by Oded
Modified:	2021-06-01 08:47 UTC
CC List:	6 users
Fixed In Version:	4.6.0-154.ci
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-12-17 06:25:20 UTC
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift ocs-operator pull 884	0	None	closed	must-gather: add ceph crash collection func	2020-12-09 02:49:22 UTC
Red Hat Product Errata	RHSA-2020:5605	0	None	None	None	2020-12-17 06:25:41 UTC

Description Oded 2020-11-02 11:28:18 UTC

Description of problem (please be as detailed as possible and provide log
snippets):
OCS 4.6 must-gather fails to complete within 600 seconds.

Version of all relevant components (if applicable):
OCS4.6

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?
no

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
yes

Can this issue be reproduced from the UI?
no

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Collect must-gather:
oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.6

2. Collection fails to complete within 600 seconds:
image=quay.io/rhceph-dev/ocs-must-gather:latest-4.6 --dest-dir=/home/jenkins/current-cluster-dir/logs/deployment_1603639721/ocs_must_gather
17:07:50 - MainThread - ocs_ci.ocs.utils - ERROR - Timeout 600s for must-gather reached, command exited with error: Command '['oc', '--kubeconfig', '/home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig', 'adm', 'must-gather', '--image=quay.io/rhceph-dev/ocs-must-gather:latest-4.6', '--dest-dir=/home/jenkins/current-cluster-dir/logs/deployment_1603639721/ocs_must_gather']' timed out after 600 seconds
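
The 600-second limit in the log above appears to be enforced by the calling automation rather than by must-gather itself. A minimal sketch of the same kind of client-side deadline, assuming GNU coreutils `timeout` is available; `run_with_deadline` is a hypothetical helper name, not part of ocs-ci or `oc`:

```shell
#!/bin/bash
# Hypothetical helper: run a command under a client-side deadline.
# Assumes GNU coreutils `timeout`, which exits with status 124 when the
# deadline expires before the command finishes.
run_with_deadline() {
    local limit="$1"   # deadline in seconds
    shift
    local status=0
    timeout "$limit" "$@" || status=$?
    if [ "$status" -eq 124 ]; then
        echo "Timeout ${limit}s reached for: $*" >&2
    fi
    return "$status"
}

# Example (not executed here):
# run_with_deadline 600 oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.6
```

A non-zero return of 124 then distinguishes "ran out of time" from a failure reported by the command itself.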


Actual results:
Collection takes longer on OCS 4.6 than on OCS 4.5.

Expected results:
Collection time on OCS 4.6 should be the same as on OCS 4.5.

Additional info:

Comment 2 Elad 2020-11-02 12:02:05 UTC

Proposing as a blocker because log collection takes significantly longer on 4.6 than on 4.5.

Comment 3 Mudit Agarwal 2020-11-02 12:09:52 UTC

AFAIK, must-gather has many changes in 4.6 which add more time to the collection. I don't think it is mandatory for must-gather to finish collection in 600 seconds.
Do we publish that somewhere? If not I don't think that this is a blocker.

Pulkit, please correct me if I am wrong

Comment 4 Pulkit Kundra 2020-11-02 12:28:51 UTC

(In reply to Mudit Agarwal from comment #3)
> AFAIK, must-gather has many changes in 4.6 which add more time to the
> collection. I don't think it is mandatory for must-gather to finish
> collection in 600 seconds.
> Do we publish that somewhere? If not I don't think that this is a blocker.
> 
> Pulkit, please correct me if I am wrong

Yes, it is not a blocker. must-gather is not required to finish within 600 seconds, and collection time can differ from one setup to another.
If must-gather fails with the message `timed out waiting for condition`, the --timeout flag should be used to increase the collection time.

Nowhere is it stated that must-gather should finish within 10 minutes.
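
A minimal sketch of passing `--timeout` as described above. The accepted value format is an assumption here: it may differ between `oc` releases (plain seconds in some, a duration such as `15m` in others), so check `oc adm must-gather --help` on the client in use. `run_must_gather` and the `OC` override variable are hypothetical names introduced for illustration:

```shell
#!/bin/bash
# Hypothetical wrapper around `oc adm must-gather` with an explicit timeout.
# $1 = must-gather image, $2 = timeout value, $3 = destination directory.
# OC is an assumed override for the client binary (defaults to `oc`).
run_must_gather() {
    "${OC:-oc}" adm must-gather "--image=$1" "--timeout=$2" "--dest-dir=$3"
}

# Example (not executed here); the value 900 assumes a client that takes seconds:
# run_must_gather quay.io/rhceph-dev/ocs-must-gather:latest-4.6 900 ./ocs_must_gather
```

Setting `OC=echo` prints the command line instead of running it, which is a cheap way to check the arguments before a long collection.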

Comment 6 Mudit Agarwal 2020-11-02 13:00:12 UTC

Thanks Pulkit, this is not even a bug then. Will close it if QE doesn't have something to add.

Comment 7 Elad 2020-11-02 14:07:04 UTC

I would also like to add that this affects our automation runs: we collect OCS and OCP must-gather logs on each test failure. Since must-gather started taking longer to complete, we have been failing to collect those logs.
If we raise the timeout in our automation to match, our automation runs will take significantly longer.

Comment 8 Mudit Agarwal 2020-11-03 07:28:06 UTC

Providing dev_ack to fix the extra sleep time that was added as part of crash info collection.

Please note that this still will not guarantee that must-gather completes within 10 minutes, which, as stated earlier, is not a valid requirement anyway.

Comment 12 Jose A. Rivera 2020-11-04 22:34:50 UTC

Backport PR: https://github.com/openshift/ocs-operator/pull/890

Comment 13 Oded 2020-11-11 21:40:18 UTC

Must-gather collection takes 3 minutes and 35 seconds.

Setup:
Provider: VMware
OCP Version: 4.6.0-0.nightly-2020-11-07-035509
OCS Version: ocs-operator.v4.6.0-156.ci

Test Process:
1. Run the Bash script:

#!/bin/bash
SECONDS=0
oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.6
duration=$SECONDS
echo "$(($duration / 60)) minutes and $(($duration % 60)) seconds elapsed."

Output:
3 minutes and 35 seconds elapsed.
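
For automation, the measured duration can also be checked against the 600-second limit from the original report. A sketch under that assumption; `format_elapsed` and `within_budget` are hypothetical helper names, and the 215-second value is simply the 3 minutes and 35 seconds measured above:

```shell
#!/bin/bash
# Format a duration given in whole seconds, matching the script's output line.
format_elapsed() {
    echo "$(( $1 / 60 )) minutes and $(( $1 % 60 )) seconds elapsed."
}

# Succeed (exit status 0) when duration $1 does not exceed limit $2, both in seconds.
within_budget() {
    [ "$1" -le "$2" ]
}

duration=$(( 3 * 60 + 35 ))   # 3 min 35 s = 215 seconds
format_elapsed "$duration"
if within_budget "$duration" 600; then
    echo "within the 600s limit"
fi
```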

Comment 16 errata-xmlrpc 2020-12-17 06:25:20 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5605
