Description of problem (please be as detailed as possible and provide log snippets):

The ceph collection is skipped when there are two must-gather-helper pods in the openshift-storage project (two pods in Running state).

Version of all relevant components (if applicable):
OCP Version: 4.8
OCS Version: 4.8

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Run the must-gather command a first time:
   $ oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.8
2. Run the must-gather command a second time:
   $ oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.8
3. Check the pod status [there are two must-gather-helper pods in the openshift-storage project]:
   must-gather-6ppsn-helper   1/1   Running   0   83s    10.131.2.157   ip-10-0-183-4.us-east-2.compute.internal   <none>   <none>
   must-gather-hsh8d-helper   1/1   Running   0   157m   10.131.2.100   ip-10-0-183-4.us-east-2.compute.internal   <none>   <none>
   http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j051ai3c33-ua/j051ai3c33-ua_20210720T222651/logs/failed_testcase_ocs_logs_1626847605/test_check_mds_cache_memory_limit_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-180ca4c2ca1f8bfd59251ef37dc6f0b0c6f6b651383dad7a34ef67c0374617f5/namespaces/openshift-storage/oc_output/pods_-owide
4. Check the content of the must-gather directory:
   On the first must-gather, the ceph files are collected.
   On the second must-gather, the ceph files are not collected:
```
ceph core dump collection completed
skipping the ceph collection
```
   http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j051ai3c33-ua/j051ai3c33-ua_20210720T222651/logs/failed_testcase_ocs_logs_1626847605/test_check_mds_cache_memory_limit_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-180ca4c2ca1f8bfd59251ef37dc6f0b0c6f6b651383dad7a34ef67c0374617f5/gather-debug.log

Actual results:
On the second must-gather, the ceph files are not collected.

Expected results:
On the second must-gather, the ceph files are collected.

Additional info:
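A possible manual check before re-running must-gather (not part of the report; the pod name below is taken from the output above, and the delete step assumes the earlier run has already finished):
```
# list any leftover helper pods in the openshift-storage namespace
$ oc get pods -n openshift-storage -o name | grep 'must-gather-.*-helper'

# delete a stale helper pod only if its must-gather run has already completed
$ oc delete pod -n openshift-storage must-gather-hsh8d-helper
```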
Agreed, the cleanup part needs improvement. This is not a 4.8 blocker; it's a day-one issue.
Bug reproduced on OCS 4.9.

SetUp:
OCP Version: 4.8.0-0.nightly-2021-08-12-174317
OCS Version: ocs-operator.v4.9.0-105.ci
LSO Version: 4.8

Test Process:
1. Run the must-gather command a first time:
   $ oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.9
2. Run the must-gather command a second time:
   $ oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.9
3. Check the pod status [there are two must-gather-helper pods in the openshift-storage project]:
   $ oc get pods | grep must
   must-gather-9j56r-helper   1/1   Running   0   2m7s
4. Check the content of the must-gather directory:
   a. On the first must-gather, the ceph files were not collected. Exception: Files don't exist: ['ceph_auth_list', 'ceph_balancer_status', 'ceph_config-key_ls', 'ceph_config_dump', 'ceph_crash_stat', 'ceph_device_ls', 'ceph_fs_dump', 'ceph_fs_ls', 'ceph_fs_status', 'ceph_fs_subvolumegroup_ls_ocs-storagecluster-cephfilesystem', 'ceph_health_detail', 'ceph_mds_stat', 'ceph_mgr_dump', 'ceph_mgr_module_ls', 'ceph_mgr_services', 'ceph_mon_dump', 'ceph_mon_stat', 'ceph_osd_blocked-by', 'ceph_osd_crush_class_ls', 'ceph_osd_crush_dump', 'ceph_osd_crush_rule_dump', 'ceph_osd_crush_rule_ls', 'ceph_osd_crush_show-tunables', 'ceph_osd_crush_weight-set_dump', 'ceph_osd_df', 'ceph_osd_df_tree', 'ceph_osd_dump', 'ceph_osd_getmaxosd', 'ceph_osd_lspools', 'ceph_osd_numa-status', 'ceph_osd_perf', 'ceph_osd_pool_ls_detail', 'ceph_osd_stat', 'ceph_osd_tree', 'ceph_osd_utilization', 'ceph_pg_dump', 'ceph_pg_stat', 'ceph_quorum_status', 'ceph_report', 'ceph_service_dump', 'ceph_status', 'ceph_time-sync-status', 'ceph_versions', 'ceph_df_detail']
   b. On the second must-gather, the ceph files were collected.
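A quick way to verify whether the ceph_* command outputs were collected after a run (the directory layout under the default must-gather.local.* output directory is an assumption here):
```
# count the ceph_* command output files under the must-gather output directory
$ find must-gather.local.* -type f -name 'ceph_*' | wc -l
```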
This is what we experienced: when a must-gather helper pod is already running (somehow it was not cleaned up, or it was run with the "keep" flag) and must-gather is run again (so there are two helper pods in total), the ceph command execution hits errors. So we need to implement the check I mentioned in Comment#4. I don't think we need two must-gather instances running at a single point in time.
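A minimal sketch of the kind of guard being proposed, in shell; the function name and placement are illustrative and not the actual ocs-must-gather implementation:
```
#!/usr/bin/env bash
# Illustrative only: refuse to start a new collection while another
# must-gather helper pod is still Running in openshift-storage.
ensure_single_helper() {
    local running
    running=$(oc get pods -n openshift-storage --no-headers 2>/dev/null \
              | grep -c 'must-gather-.*-helper.*Running')
    if [ "${running}" -gt 0 ]; then
        echo "another must-gather helper pod is already running; aborting" >&2
        return 1
    fi
}

ensure_single_helper || exit 1
```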
So somehow the older helper pod is not being deleted as per the fix in PR https://github.com/openshift/ocs-operator/pull/1280. @Rewant can you please take a look?
As per our discussion with Oded over gchat, the 2nd run happened within a minute after the 1st run, so the 1st helper pod was terminated before it reached ceph command execution. That's why the 1st run is missing the ceph output. This is the intention of the fix, so if the 2nd run is able to collect the ceph command output then it's working as expected.
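To confirm which run skipped the ceph collection, the gather-debug.log of each run can be checked for the messages quoted in the description (the path below assumes the default must-gather.local.* output directory layout):
```
$ grep -E 'ceph core dump collection completed|skipping the ceph collection' \
    must-gather.local.*/*/gather-debug.log
```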
Bug moved to verified state based on https://bugzilla.redhat.com/show_bug.cgi?id=1985074#c11
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Data Foundation 4.9.0 enhancement, security, and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:5086