Bug 2255240 - Must gather in 4.15 is not collected in 35 mins
Summary: Must gather in 4.15 is not collected in 35 mins
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: must-gather
Version: 4.15
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ODF 4.15.0
Assignee: Niraj Yadav
QA Contact: Oded
URL:
Whiteboard:
Depends On:
Blocks: 2246375
 
Reported: 2023-12-19 13:22 UTC by Petr Balogh
Modified: 2024-03-19 15:26 UTC
CC List: 7 users

Fixed In Version: 4.15.0-112
Doc Type: If docs needed, set a value
Doc Text:
.Must-gather logs not collected after upgrade
Previously, the `must-gather` tool failed to collect logs after an upgrade because the `Collection started <time>` message was seen twice. With this fix, the `must-gather` tool was updated to run the pre-install script only once. As a result, the tool is able to collect the logs successfully after an upgrade.
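The fix described above amounts to a run-once guard around the pre-install step. A minimal sketch of that pattern, assuming a hypothetical pre-install.sh and marker file (this is not the actual odf-must-gather code):

#!/bin/bash
# Illustrative run-once guard; MARKER and pre-install.sh are hypothetical names.
MARKER="/tmp/odf-mg-pre-install.done"

if [ ! -f "$MARKER" ]; then
    echo "Collection started $(date --rfc-3339=seconds)"
    ./pre-install.sh    # hypothetical pre-install step; runs only on the first pass
    touch "$MARKER"     # later passes skip this block, so the message appears only once
fi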
Clone Of:
Environment:
Last Closed: 2024-03-19 15:25:55 UTC
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github red-hat-storage odf-must-gather pull 100 0 None Merged Support parallel collection in default op mode 2023-12-19 13:34:37 UTC
Github red-hat-storage odf-must-gather pull 101 0 None open Bug 2255240:[release-4.15] Support parallel collection in default op mode 2023-12-21 12:15:18 UTC
Red Hat Product Errata RHSA-2024:1383 0 None None None 2024-03-19 15:26:02 UTC

Description Petr Balogh 2023-12-19 13:22:26 UTC
Description of problem (please be as detailed as possible and provide log snippets):
I am looking at some of the upgrade executions from 4.14 to 4.15 and all of them failed. In all of those runs it looks like the must-gather was not collected.

Error: Command '['oc', '--kubeconfig', '/home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig', 'adm', 'must-gather', '--image=quay.io/rhceph-dev/ocs-must-gather:latest-4.15', '--dest-dir=/home/jenkins/current-cluster-dir/logs/failed_testcase_ocs_logs_1702896210/test_upgrade_ocs_logs/j-113vi1cs33-uba/ocs_must_gather']' timed out after 2100 seconds
Must-Gather Output:

This is the command I see we run, taken from the console output of the job.

So it looks like it did not collect the logs within 35 minutes, and our timeout killed the command.
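For reference, a rough shell equivalent of what the harness does here (the 2100-second limit, image, and paths are taken from the error above; the wrapper itself is only an illustration, not the ocs-ci code):

timeout 2100 oc --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig \
    adm must-gather \
    --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.15 \
    --dest-dir=/home/jenkins/current-cluster-dir/logs/failed_testcase_ocs_logs_1702896210/test_upgrade_ocs_logs/j-113vi1cs33-uba/ocs_must_gather \
    || echo "must-gather did not finish within 35 minutes"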

pod "must-gather-jmwfl-helper" deleted
error: resource name may not be empty
error: resource name may not be empty
error: resource name may not be empty
error: resource name may not be empty
error: resource name may not be empty
error: resource name may not be empty
Error from server (NotFound): error when deleting "pod_helper.yaml": pods "must-gather-jmwfl-helper" not found
error: resource name may not be empty
error: resource name may not be empty
error: resource name may not be empty
error: resource name may not be empty
error: resource name may not be empty
error: resource name may not be empty

So it seems the helper pod was deleted before the collection completed and before the gather pod could copy the collected data.
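The repeated "error: resource name may not be empty" lines suggest the cleanup step calls "oc delete" with an empty pod name. A defensive sketch of such a cleanup, with hypothetical variable names (not the actual odf-must-gather script):

HELPER_POD="${HELPER_POD:-}"    # would be populated earlier by the gather script
if [ -n "$HELPER_POD" ]; then
    oc delete pod "$HELPER_POD" --ignore-not-found
else
    echo "helper pod name is empty; skipping delete" >&2
fi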

Discussed here:
https://chat.google.com/room/AAAAREGEba8/sSNlEKi4Gmk

Yati said it's a bug on MG side and should be fixed here:
https://github.com/red-hat-storage/odf-must-gather/pull/100

Version of all relevant components (if applicable):
4.15.0-89

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Run must-gather on a 4.15 cluster. In this execution it was a cluster upgraded from 4.14, which I don't think is relevant.
2.
3.


Actual results:


Expected results:
Must-gather logs are collected successfully.

Comment 9 Oded 2024-01-02 13:41:11 UTC
I tested must-gather on ODF 4.15 [odf-operator.v4.15.0-98.stable].

$ time oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.15
....
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-s5j5q deleted


Reprinting Cluster State:
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information:
ClusterID: fce1583c-d2b1-44a7-9345-ee787616b50b
ClusterVersion: Stable at "4.15.0-0.nightly-2023-12-25-100326"
ClusterOperators:
	All healthy and stable



real	7m3.801s
user	0m1.896s
sys	0m0.672s

Niraj, can I test it with the "latest-4.15" image?

Comment 12 Vijay Avuthu 2024-01-03 06:48:29 UTC
ocs-operator.v4.15.0-100.stable

Must-gather collection is still failing:

2024-01-03 08:27:57  02:57:56 - ThreadPoolExecutor-5_0 - ocs_ci.ocs.utils - INFO  - Must gather image: quay.io/rhceph-dev/ocs-must-gather:latest-4.15 will be used.
2024-01-03 08:27:57  02:57:56 - ThreadPoolExecutor-5_0 - ocs_ci.ocs.utils - INFO  - OCS logs will be placed in location /home/jenkins/current-cluster-dir/logs/testcases_1704249707/vavuthuupq1/ocs_must_gather
2024-01-03 08:27:57  02:57:56 - ThreadPoolExecutor-5_0 - ocs_ci.utility.utils - INFO  - Executing command: oc --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.15 --dest-dir=/home/jenkins/current-cluster-dir/logs/testcases_1704249707/vavuthuupq1/ocs_must_gather
2024-01-03 09:03:04  03:32:56 - ThreadPoolExecutor-5_0 - ocs_ci.ocs.utils - ERROR  - Failed during must gather logs! Error: Command '['oc', '--kubeconfig', '/home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig', 'adm', 'must-gather', '--image=quay.io/rhceph-dev/ocs-must-gather:latest-4.15', '--dest-dir=/home/jenkins/current-cluster-dir/logs/testcases_1704249707/vavuthuupq1/ocs_must_gather']' timed out after 2100 secondsMust-Gather Output: 

job ( upgrade from 4.14 to 4.15 ): https://url.corp.redhat.com/024bded

must gather ( not collected ): https://url.corp.redhat.com/54ddfd5

Comment 15 Mudit Agarwal 2024-01-03 08:55:49 UTC
We don't need to wait for a cluster or for a job. Don't we have any cluster with the latest 4.15 image installed?
This is just must-gather collection, which we can verify easily; why make it so complicated?

Comment 20 Oded 2024-01-14 22:32:32 UTC
Can you check my test procedure? Do I need to test it after tier1?
Test process:
1. Upgrade ODF 4.14 -> 4.15
$ oc get csv -A
NAMESPACE                              NAME                                         DISPLAY                       VERSION             REPLACES                                PHASE
openshift-operator-lifecycle-manager   packageserver                                Package Server                0.0.1-snapshot                                              Succeeded
openshift-storage                      mcg-operator.v4.15.0-113.stable              NooBaa Operator               4.15.0-113.stable   mcg-operator.v4.14.4-rhodf              Succeeded
openshift-storage                      ocs-operator.v4.15.0-113.stable              OpenShift Container Storage   4.15.0-113.stable   ocs-operator.v4.14.4-rhodf              Succeeded
openshift-storage                      odf-csi-addons-operator.v4.15.0-113.stable   CSI Addons                    4.15.0-113.stable   odf-csi-addons-operator.v4.14.4-rhodf   Succeeded
openshift-storage                      odf-operator.v4.15.0-113.stable              OpenShift Data Foundation     4.15.0-113.stable   odf-operator.v4.14.4-rhodf              Succeeded

2. Collect must-gather
$ time oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.15

real	3m29.078s
user	0m0.819s
sys	0m0.373s

Comment 22 Oded 2024-01-23 16:40:44 UTC
Tier1 is a suite with many tests.
After the tier1 suite there is a lot of data on the cluster, and that's how we can simulate a "customer cluster".

Comment 24 Oded 2024-01-28 13:04:47 UTC
I will move this BZ to the verified state based on this test: https://bugzilla.redhat.com/show_bug.cgi?id=2255240#c20

I will try to find a method to create a customer-like cluster (with a lot of data and operators).

Comment 31 errata-xmlrpc 2024-03-19 15:25:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383

