Description of problem (please be as detailed as possible and provide log snippets):

The ceph collection is skipped when there are two must-gather-helper pods in the openshift-storage project (two pods in Running state).

Version of all relevant components (if applicable):
OCP Version: 4.8
OCS Version: 4.8

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Run the must-gather command a first time:
   $ oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.8
2. Run the must-gather command a second time:
   $ oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.8
3. Check the pod status [there are two must-gather-helper pods in the openshift-storage project]:
   must-gather-6ppsn-helper   1/1   Running   0   83s    10.131.2.157   ip-10-0-183-4.us-east-2.compute.internal   <none>   <none>
   must-gather-hsh8d-helper   1/1   Running   0   157m   10.131.2.100   ip-10-0-183-4.us-east-2.compute.internal   <none>   <none>
   http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j051ai3c33-ua/j051ai3c33-ua_20210720T222651/logs/failed_testcase_ocs_logs_1626847605/test_check_mds_cache_memory_limit_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-180ca4c2ca1f8bfd59251ef37dc6f0b0c6f6b651383dad7a34ef67c0374617f5/namespaces/openshift-storage/oc_output/pods_-owide
4. Check the content of the must-gather directory:
   On the first must-gather, the ceph files are collected.
   On the second must-gather, the ceph files are not collected:
```
ceph core dump collection completed
skipping the ceph collection
```
   http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j051ai3c33-ua/j051ai3c33-ua_20210720T222651/logs/failed_testcase_ocs_logs_1626847605/test_check_mds_cache_memory_limit_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-180ca4c2ca1f8bfd59251ef37dc6f0b0c6f6b651383dad7a34ef67c0374617f5/gather-debug.log

Actual results:
On the second must-gather, the ceph files are not collected.

Expected results:
On the second must-gather, the ceph files are collected.

Additional info:
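A possible manual check before re-running must-gather (not part of the report; the pod name below is taken from the output above, and the delete step assumes the earlier run has already finished):
```
# list any leftover helper pods in the openshift-storage namespace
$ oc get pods -n openshift-storage -o name | grep 'must-gather-.*-helper'

# delete a stale helper pod only if its must-gather run has already completed
$ oc delete pod -n openshift-storage must-gather-hsh8d-helper
```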
Agreed, the cleanup part needs improvement. This is not a 4.8 blocker; it's a day-one issue.
Bug reproduced on OCS 4.9.

SetUp:
OCP Version: 4.8.0-0.nightly-2021-08-12-174317
OCS Version: ocs-operator.v4.9.0-105.ci
LSO Version: 4.8

Test Process:
1. Run the must-gather command a first time:
   $ oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.9
2. Run the must-gather command a second time:
   $ oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.9
3. Check the pod status [there are two must-gather-helper pods in the openshift-storage project]:
   $ oc get pods | grep must
   must-gather-9j56r-helper   1/1   Running   0   2m7s
4. Check the content of the must-gather directory:
   a. On the first must-gather, the ceph files were not collected. Exception: Files don't exist: ['ceph_auth_list', 'ceph_balancer_status', 'ceph_config-key_ls', 'ceph_config_dump', 'ceph_crash_stat', 'ceph_device_ls', 'ceph_fs_dump', 'ceph_fs_ls', 'ceph_fs_status', 'ceph_fs_subvolumegroup_ls_ocs-storagecluster-cephfilesystem', 'ceph_health_detail', 'ceph_mds_stat', 'ceph_mgr_dump', 'ceph_mgr_module_ls', 'ceph_mgr_services', 'ceph_mon_dump', 'ceph_mon_stat', 'ceph_osd_blocked-by', 'ceph_osd_crush_class_ls', 'ceph_osd_crush_dump', 'ceph_osd_crush_rule_dump', 'ceph_osd_crush_rule_ls', 'ceph_osd_crush_show-tunables', 'ceph_osd_crush_weight-set_dump', 'ceph_osd_df', 'ceph_osd_df_tree', 'ceph_osd_dump', 'ceph_osd_getmaxosd', 'ceph_osd_lspools', 'ceph_osd_numa-status', 'ceph_osd_perf', 'ceph_osd_pool_ls_detail', 'ceph_osd_stat', 'ceph_osd_tree', 'ceph_osd_utilization', 'ceph_pg_dump', 'ceph_pg_stat', 'ceph_quorum_status', 'ceph_report', 'ceph_service_dump', 'ceph_status', 'ceph_time-sync-status', 'ceph_versions', 'ceph_df_detail']
   b. On the second must-gather, the ceph files were collected.
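A quick way to verify whether the ceph_* command outputs were collected after a run (the directory layout under the default must-gather.local.* output directory is an assumption here):
```
# count the ceph_* command output files under the must-gather output directory
$ find must-gather.local.* -type f -name 'ceph_*' | wc -l
```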
This is what we experienced: when a must-gather helper pod is already running (somehow it was not cleaned up, or it was run with the "keep" flag) and must-gather is run again (so there are two helper pods in total), the ceph command execution hits errors. So we need to implement the check I mentioned in Comment#4. I don't think we need two must-gather instances running at a single point in time.
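A minimal sketch of the kind of guard being proposed, in shell; the function name and placement are illustrative and not the actual ocs-must-gather implementation:
```
#!/usr/bin/env bash
# Illustrative only: refuse to start a new collection while another
# must-gather helper pod is still Running in openshift-storage.
ensure_single_helper() {
    local running
    running=$(oc get pods -n openshift-storage --no-headers 2>/dev/null \
              | grep -c 'must-gather-.*-helper.*Running')
    if [ "${running}" -gt 0 ]; then
        echo "another must-gather helper pod is already running; aborting" >&2
        return 1
    fi
}

ensure_single_helper || exit 1
```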
So somehow the older helper pod is not being deleted as per the fix in PR https://github.com/openshift/ocs-operator/pull/1280. @Rewant can you please take a look?
As per our discussion with Oded over gchat, the 2nd run happened within a minute after the 1st run, so the 1st helper pod was terminated before it reached ceph command execution. That's why the 1st run is missing the ceph output. This is the intention of the fix, so if the 2nd run is able to collect the ceph command output then it's working as expected.
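To confirm which run skipped the ceph collection, the gather-debug.log of each run can be checked for the messages quoted in the description (the path below assumes the default must-gather.local.* output directory layout):
```
$ grep -E 'ceph core dump collection completed|skipping the ceph collection' \
    must-gather.local.*/*/gather-debug.log
```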
Bug moved to verified state based on https://bugzilla.redhat.com/show_bug.cgi?id=1985074#c11
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Data Foundation 4.9.0 enhancement, security, and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:5086