Bug 1893613 - must-gather tries to collect ceph commands in external mode when storagecluster already deleted
Summary: must-gather tries to collect ceph commands in external mode when storagecluster already deleted
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: must-gather
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: OCS 4.7.0
Assignee: RAJAT SINGH
QA Contact: Sidhant Agrawal
URL:
Whiteboard:
Depends On: 1921864
Blocks:
 
Reported: 2020-11-02 07:18 UTC by Neha Berry
Modified: 2021-06-01 08:47 UTC (History)
CC List: 6 users

Fixed In Version: 4.7.0-696.ci
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-19 09:16:13 UTC
Embargoed:


Attachments (Terms of Use)
terminal-log (117.93 KB, text/plain)
2020-11-02 07:18 UTC, Neha Berry


Links
System ID Private Priority Status Summary Last Updated
Github openshift ocs-operator pull 968 0 None closed must-gather: gather ceph logs only when storageclusters is present 2021-02-16 12:17:15 UTC
Github openshift ocs-operator pull 982 0 None closed bug 1893613: [release-4.7] must-gather: gather ceph logs only when storageclusters is present 2021-02-16 12:17:14 UTC
Red Hat Product Errata RHSA-2021:2041 0 None None None 2021-05-19 09:16:51 UTC

Description Neha Berry 2020-11-02 07:18:48 UTC
Created attachment 1725682 [details]
terminal-log

Description of problem (please be as detailed as possible and provide log
snippets):
---------------------------------------------------------------------------
Even when the storage cluster is already deleted, must-gather tries to collect ceph outputs, in both internal and external mode (not supposed to be attempted at all after the fix of Bug 1845976).

Related BZ for the helper pod retry and the attempt at collecting ceph outputs - Bug 1893611

Special mention
======================
As per the Bug 1845976 fix, in external mode must-gather skips the attempt to collect ceph outputs via the toolbox and doesn't bring up a must-gather-helper pod for that purpose.

But in corner cases, when the storage cluster (and hence also the cephcluster) is already deleted (e.g. when an uninstall has been initiated and is half-way through) and one tries to collect must-gather, the helper pod tries to come up (but fails) to collect ceph outputs, which is not possible for external mode.



>> Also, when the storagecluster is already deleted (even in internal mode), the following errors are seen in the must-gather collection output - https://bugzilla.redhat.com/show_bug.cgi?id=1893611#c3


Version of all relevant components (if applicable):
---------------------------------------
Tested in OCS 4.6; ocs-operator.v4.6.0-144.ci and ocs-operator.v4.6.0-147.ci

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
--------------------------------------------------
No, but it throws some errors while collecting must-gather.


Is there any workaround available to the best of your knowledge?
-------------------------------------------------------
Not sure


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
-----------------------------------------------
3

Can this issue reproducible?
-------------------------------
yes.

Can this issue reproduce from the UI?
---------------------------------------
NA

If this is a regression, please provide more details to justify this:
-------------------------------------------------
No

Steps to Reproduce:
-----------------------
1. Delete the storagecluster, especially in an external mode cluster (see the commands sketched after these steps)
2. Start must-gather
oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.6 |tee terminal-must-gather

3. Check that, in the absence of a storagecluster, must-gather tries to bring up a helper pod and collect ceph outputs (not possible)

4. The same behavior is seen in an internal mode cluster too (logs: https://bugzilla.redhat.com/show_bug.cgi?id=1893611#c2)
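For reference, a rough sketch of step 1 and a quick verification (the storagecluster name is taken from the cluster state captured below; the openshift-storage namespace is an assumption):

# Delete the storagecluster (external mode example)
oc delete storagecluster ocs-external-storagecluster -n openshift-storage

# Confirm the storagecluster/cephcluster are gone (or stuck in Deleting)
# before starting must-gather
oc get storagecluster,cephcluster -n openshift-storage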



Actual results:
-------------------
With the storage cluster deleted, must-gather has no way to know whether it was an external cluster, and hence tries to collect ceph outputs via the helper toolbox pod (which obviously fails to come up).

Expected results:
-----------------------
Internal mode: if the storage cluster is deleted, no attempt should be made to collect ceph outputs, as the cephcluster is already deleted.

External mode: the behavior should be the same as above, plus external mode should skip creating the helper pod (as it already does when the storagecluster exists and the "external" status of the cluster is known).
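A minimal sketch of the kind of guard that would give this behavior, assuming a bash gather script, the openshift-storage namespace, and the externalStorage field of the StorageCluster CR (this is an illustration, not the actual must-gather code):

# Skip all ceph command collection if no storagecluster exists
if [ -z "$(oc get storagecluster -n openshift-storage --no-headers 2>/dev/null)" ]; then
    echo "No storagecluster found, skipping ceph command collection"
    exit 0
fi

# Skip the must-gather-helper pod if the cluster is in external mode
external=$(oc get storagecluster -n openshift-storage -o jsonpath='{.items[0].spec.externalStorage.enable}')
if [ "${external}" = "true" ]; then
    echo "External mode detected, skipping must-gather-helper pod and ceph command collection"
    exit 0
fi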

Additional info:

-------------------------

Before deletion of storagecluster
Wed Oct 28 13:21:08 UTC 2020
--------------
========CSV ======
NAME                         DISPLAY                       VERSION        REPLACES   PHASE
ocs-operator.v4.6.0-144.ci   OpenShift Container Storage   4.6.0-144.ci              Succeeded
--------------
=======PODS ======
NAME                                            READY   STATUS        RESTARTS   AGE   IP             NODE        NOMINATED NODE   READINESS GATES
csi-cephfsplugin-nbqmv                          3/3     Running       0          21m   10.1.160.161   compute-1   <none>           <none>
csi-cephfsplugin-provisioner-56455449bd-4ck6g   6/6     Running       0          21m   10.129.2.66    compute-2   <none>           <none>
csi-cephfsplugin-provisioner-56455449bd-6kbq4   6/6     Running       0          21m   10.131.0.185   compute-1   <none>           <none>
csi-cephfsplugin-s4v9w                          3/3     Running       0          21m   10.1.160.165   compute-0   <none>           <none>
csi-cephfsplugin-zmm4l                          3/3     Running       0          21m   10.1.160.180   compute-2   <none>           <none>
csi-rbdplugin-8dxmp                             3/3     Running       0          21m   10.1.160.165   compute-0   <none>           <none>
csi-rbdplugin-8jw6k                             3/3     Running       0          21m   10.1.160.161   compute-1   <none>           <none>
csi-rbdplugin-mtdwc                             3/3     Running       0          21m   10.1.160.180   compute-2   <none>           <none>
csi-rbdplugin-provisioner-586fc6cfc-6bzxb       6/6     Running       0          21m   10.131.0.184   compute-1   <none>           <none>
csi-rbdplugin-provisioner-586fc6cfc-8b2xl       6/6     Running       0          21m   10.128.2.46    compute-0   <none>           <none>
noobaa-core-0                                   1/1     Terminating   0          21m   10.128.2.47    compute-0   <none>           <none>
noobaa-endpoint-6799cdd795-stzwq                1/1     Terminating   0          20m   10.128.2.48    compute-0   <none>           <none>
noobaa-operator-f7789cf94-wp74l                 1/1     Running       0          23h   10.131.1.213   compute-1   <none>           <none>
ocs-metrics-exporter-576f474c87-9r7bv           1/1     Running       0          23h   10.129.3.104   compute-2   <none>           <none>
ocs-operator-686fd84dd7-6l45s                   1/1     Running       0          23h   10.129.3.102   compute-2   <none>           <none>
rook-ceph-operator-7558fcf89c-wmjr4             1/1     Running       0          23h   10.129.3.103   compute-2   <none>           <none>
--------------
======= PVC ==========
--------------
======= storagecluster ==========
NAME                          AGE   PHASE      EXTERNAL   CREATED AT             VERSION
ocs-external-storagecluster   21m   Deleting   true       2020-10-28T12:59:46Z   4.6.0
--------------
======= cephcluster ==========
NAME                                      DATADIRHOSTPATH   MONCOUNT   AGE   PHASE      MESSAGE               HEALTH
ocs-external-storagecluster-cephcluster                                21m   Deleting   Cluster is deleting   HEALTH_OK

Comment 3 Neha Berry 2020-11-02 07:22:03 UTC
@pulkit could you also check why we get the errors listed here - https://bugzilla.redhat.com/show_bug.cgi?id=1893611#c3

especially the one about cephobjectstoreusers when the storagecluster does not exist (not dependent on the helper pod)



[must-gather-gn4xm] POD collecting dump cephobjectstoreusers
[must-gather-gn4xm] POD error: error executing jsonpath "{range .items[*]}{@.metadata.name}{'\\n'}{end}": Error executing template: not in range, nothing to end. Printing more information for debugging the template:
[must-gather-gn4xm] POD         template was:
[must-gather-gn4xm] POD                 {range .items[*]}{@.metadata.name}{'\n'}{end}
[must-gather-gn4xm] POD         object given to jsonpath engine was:
[must-gather-gn4xm] POD                 map[string]interface {}{"apiVersion":"v1", "items":[]interface {}{}, "kind":"List", "metadata":map[string]interface {}{"resourceVersion":"", "selfLink":""}}
[must-gather-gn4xm] POD
[must-gather-gn4xm] POD
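For context, the error above is what oc/kubectl prints when a '{range ...}' jsonpath template is run against an empty List; a hypothetical reconstruction of the failing call (the exact command in the gather script may differ):

# With no cephobjectstoreusers left, .items is empty, the {range} template has
# nothing to iterate, and oc fails with "not in range, nothing to end"
oc get cephobjectstoreusers -n openshift-storage -o jsonpath="{range .items[*]}{@.metadata.name}{'\n'}{end}"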

Comment 4 Neha Berry 2020-11-02 07:54:46 UTC
(In reply to Neha Berry from comment #3)
> @pulkit could you also check why we get errors listed here -
> https://bugzilla.redhat.com/show_bug.cgi?id=1893611#c3
> 
> esp. about cephobjectstoreUsers when the storagecluster does not exist (not
> dependent on helper pod)
> 
> 
> 
> [must-gather-gn4xm] POD collecting dump cephobjectstoreusers
> [must-gather-gn4xm] POD error: error executing jsonpath "{range
> .items[*]}{@.metadata.name}{'\\n'}{end}": Error executing template: not in
> range, nothing to end. Printing more information for debugging the template:
> [must-gather-gn4xm] POD         template was:
> [must-gather-gn4xm] POD                 {range
> .items[*]}{@.metadata.name}{'\n'}{end}
> [must-gather-gn4xm] POD         object given to jsonpath engine was:
> [must-gather-gn4xm] POD                 map[string]interface
> {}{"apiVersion":"v1", "items":[]interface {}{}, "kind":"List",
> "metadata":map[string]interface {}{"resourceVersion":"", "selfLink":""}}
> [must-gather-gn4xm] POD
> [must-gather-gn4xm] POD

Bug raised - https://bugzilla.redhat.com/show_bug.cgi?id=1893619

Comment 6 RAJAT SINGH 2021-01-04 11:13:31 UTC
PR Link : https://github.com/openshift/ocs-operator/pull/968

Comment 7 RAJAT SINGH 2021-01-04 11:15:03 UTC
@nberry I have added better logging, and now must-gather will not try to collect ceph logs when the storagecluster is not present. It would be great if you could test this PR from your end as well.
Thanks

Comment 8 Neha Berry 2021-01-04 13:58:50 UTC
Hi Rajat,

Is it OK if we test this fix once the bug is ON_QA?

If the PR looks good, feel free to move the bug to MODIFIED and then ON_QA once it is part of a build.

Adding qa_ack.

Comment 9 RAJAT SINGH 2021-01-06 11:54:37 UTC
Sure!

Comment 16 errata-xmlrpc 2021-05-19 09:16:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041

