Bug 1985074

Summary: must-gather is skipping the ceph collection when there are two must-gather-helper pods
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Oded <oviner>
Component: must-gather
Assignee: Rewant <resoni>
Status: CLOSED ERRATA
QA Contact: Oded <oviner>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 4.8
CC: ebenahar, godas, kramdoss, muagarwa, nberry, ocs-bugs, odf-bz-bot, resoni, sabose
Target Milestone: ---
Keywords: Automation
Target Release: ODF 4.9.0
Flags: resoni: needinfo-
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-12-13 17:44:55 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Oded 2021-07-22 18:48:30 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
The ceph collection is skipped when there are two must-gather-helper pods in the openshift-storage project (both pods in Running state).

Version of all relevant components (if applicable):
OCP Version: 4.8
OCS Version: 4.8

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Run the must-gather command a first time:
$ oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.8

2. Run the must-gather command a second time:
$ oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.8

3. Check the pod status (there are now two must-gather-helper pods in the openshift-storage project):
must-gather-6ppsn-helper                                          1/1     Running     0          83s     10.131.2.157   ip-10-0-183-4.us-east-2.compute.internal     <none>           <none>
must-gather-hsh8d-helper                                          1/1     Running     0          157m    10.131.2.100   ip-10-0-183-4.us-east-2.compute.internal     <none>           <none>

http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j051ai3c33-ua/j051ai3c33-ua_20210720T222651/logs/failed_testcase_ocs_logs_1626847605/test_check_mds_cache_memory_limit_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-180ca4c2ca1f8bfd59251ef37dc6f0b0c6f6b651383dad7a34ef67c0374617f5/namespaces/openshift-storage/oc_output/pods_-owide

4. Check the content of the must-gather directory:
The first must-gather run collects the ceph files.
The second must-gather run does not collect the ceph files; gather-debug.log shows the message below (a sketch of the kind of guard that could produce it follows the log link):

```
ceph core dump collection completed
skipping the ceph collection
```
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j051ai3c33-ua/j051ai3c33-ua_20210720T222651/logs/failed_testcase_ocs_logs_1626847605/test_check_mds_cache_memory_limit_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-180ca4c2ca1f8bfd59251ef37dc6f0b0c6f6b651383dad7a34ef67c0374617f5/gather-debug.log
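
For context, a minimal sketch of the kind of guard that could produce the "skipping the ceph collection" message above. This is hypothetical, not the actual ocs-must-gather gather script; it only assumes the script bails out of ceph collection when it finds a pre-existing helper pod in openshift-storage (the pod name pattern is taken from the listing in step 3):

```
# Hypothetical sketch -- not the actual ocs-must-gather gather script.
# Assumption: ceph collection is skipped if a helper pod from an earlier
# run is still present in the openshift-storage namespace.
helper_count=$(oc get pods -n openshift-storage --no-headers 2>/dev/null \
  | grep -c 'must-gather-.*-helper')

if [ "${helper_count}" -gt 1 ]; then
    # A leftover helper pod from a previous run is still Running,
    # so the ceph command outputs are not collected.
    echo "skipping the ceph collection"
else
    echo "collecting ceph command outputs via the helper pod"
    # ... run the ceph_* commands inside the helper pod here ...
fi
```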


Actual results:
The second must-gather run does not collect the ceph files.

Expected results:
The second must-gather run collects the ceph files.

Additional info:
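A possible manual workaround (an assumption based on the behaviour described above, not something verified in this bug): delete any leftover must-gather-helper pod before starting the second run, so that only one helper pod exists in the openshift-storage project:

```
# Assumed workaround, not verified in this bug report:
# remove any stale helper pod left behind by an earlier must-gather run.
oc get pods -n openshift-storage -o name | grep 'must-gather-.*-helper' \
  | xargs -r oc delete -n openshift-storage

# Then re-run must-gather as in the steps above.
oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.8
```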

Comment 5 Mudit Agarwal 2021-07-23 09:09:16 UTC
Agreed, the cleanup part needs improvement.
Not a 4.8 blocker; it's a day-one issue.

Comment 8 Oded 2021-08-24 14:05:37 UTC
Bug retested on OCS 4.9

SetUp:
OCP Version: 4.8.0-0.nightly-2021-08-12-174317
OCS Version: ocs-operator.v4.9.0-105.ci
LSO Version: 4.8

Test Process:
1. Run the must-gather command a first time:
$ oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.9

2. Run the must-gather command a second time:
$ oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.9

3. Check the pod status in the openshift-storage project (only one must-gather-helper pod is listed):
$ oc get pods | grep must
must-gather-9j56r-helper                                          1/1     Running     0          2m7s

4. Check the content of the must-gather directory:
a. The first must-gather run did not collect the ceph files.
Exception: files don't exist:
['ceph_auth_list', 'ceph_balancer_status', 'ceph_config-key_ls', 'ceph_config_dump', 'ceph_crash_stat', 'ceph_device_ls', 'ceph_fs_dump', 'ceph_fs_ls', 'ceph_fs_status', 'ceph_fs_subvolumegroup_ls_ocs-storagecluster-cephfilesystem', 'ceph_health_detail', 'ceph_mds_stat', 'ceph_mgr_dump', 'ceph_mgr_module_ls', 'ceph_mgr_services', 'ceph_mon_dump', 'ceph_mon_stat', 'ceph_osd_blocked-by', 'ceph_osd_crush_class_ls', 'ceph_osd_crush_dump', 'ceph_osd_crush_rule_dump', 'ceph_osd_crush_rule_ls', 'ceph_osd_crush_show-tunables', 'ceph_osd_crush_weight-set_dump', 'ceph_osd_df', 'ceph_osd_df_tree', 'ceph_osd_dump', 'ceph_osd_getmaxosd', 'ceph_osd_lspools', 'ceph_osd_numa-status', 'ceph_osd_perf', 'ceph_osd_pool_ls_detail', 'ceph_osd_stat', 'ceph_osd_tree', 'ceph_osd_utilization', 'ceph_pg_dump', 'ceph_pg_stat', 'ceph_quorum_status', 'ceph_report', 'ceph_service_dump', 'ceph_status', 'ceph_time-sync-status', 'ceph_versions', 'ceph_df_detail']

b. The second must-gather run collects the ceph files.
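
For reference, a small sketch of how the presence of the expected ceph_* output files can be checked in the collected must-gather directory. The directory path is a placeholder and the file list is shortened; the full list is the one in the exception above:

```
# Sketch of a quick check for the expected ceph_* output files.
# MG_CEPH_DIR is a placeholder for the ceph output directory inside the
# collected must-gather archive; adjust it to the actual path.
MG_CEPH_DIR="./must-gather.local.XXXX/ceph"
for f in ceph_status ceph_health_detail ceph_osd_tree ceph_versions; do
    if [ ! -f "${MG_CEPH_DIR}/${f}" ]; then
        echo "missing: ${f}"
    fi
done
```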

Comment 9 Gobinda Das 2021-08-25 11:32:26 UTC
This is what we experienced: when a must-gather helper pod is already running (somehow it was not cleaned up, or it was run with the "keep" flag) and must-gather is run again (so there are two helper pods in total), the ceph command execution hits errors. So we need to implement what I mentioned in Comment#4. I don't think we ever need two must-gather instances running at the same time.

Comment 10 Gobinda Das 2021-08-25 11:51:34 UTC
So somehow the older helper pod is not being deleted as per the fix, PR: https://github.com/openshift/ocs-operator/pull/1280
@Rewant, can you please take a look?

Comment 11 Gobinda Das 2021-08-25 12:04:11 UTC
As per our discussion with Oded over gchat, the 2nd run happened within a minute after the 1st run, so the 1st helper pod was terminated before reaching the ceph command execution. That's why the 1st run is missing the ceph output. This is the intention of the fix, so if the 2nd run is able to collect the ceph command output then it's perfect.
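
To illustrate the behaviour described in comments 10 and 11, a rough sketch of the cleanup approach (illustrative only, not the actual change from https://github.com/openshift/ocs-operator/pull/1280): a new run removes any pre-existing helper pod and waits for it to terminate before creating its own, so the ceph collection in the new run can proceed:

```
# Rough illustration of the cleanup behaviour described above;
# not the actual code from the referenced PR.
old_helper=$(oc get pods -n openshift-storage -o name \
  | grep 'must-gather-.*-helper' | head -n1)
if [ -n "${old_helper}" ]; then
    # Delete the helper pod left behind by the previous run and wait
    # for it to terminate before starting a new collection.
    oc delete -n openshift-storage "${old_helper}" --wait=true
fi
# ... create the new helper pod and collect the ceph command outputs ...
```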

Comment 12 Oded 2021-08-25 12:10:18 UTC
Bug moved to verified state based on https://bugzilla.redhat.com/show_bug.cgi?id=1985074#c11

Comment 18 errata-xmlrpc 2021-12-13 17:44:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Data Foundation 4.9.0 enhancement, security, and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:5086