Bug 1934625 - [must-gather]improve logging and mention all instances in MG terminal log
Summary: [must-gather]improve logging and mention all instances in MG terminal log
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: must-gather
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ODF 4.9.0
Assignee: Rewant
QA Contact: Neha Berry
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-03-03 15:13 UTC by Neha Berry
Modified: 2023-08-09 16:35 UTC
CC List: 6 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-12-13 17:44:30 UTC
Embargoed:


Attachments
Terminal log (83.77 KB, text/plain), 2021-03-03 15:13 UTC, Neha Berry
terminal output (9.12 KB, application/zip), 2021-06-11 08:10 UTC, Neha Berry


Links
System ID Private Priority Status Summary Last Updated
Github openshift ocs-operator pull 1109 0 None closed Must-gather: refactor the crash and volume collection 2021-06-01 10:59:46 UTC
Github openshift ocs-operator pull 1132 0 None closed must-gather: skip noobaa collection when storagecluster is not present 2021-06-01 10:59:46 UTC
Red Hat Product Errata RHSA-2021:5086 0 None None None 2021-12-13 17:44:43 UTC

Description Neha Berry 2021-03-03 15:13:03 UTC
Created attachment 1760405 [details]
Terminal log

Description of problem (please be as detailed as possible and provide log
snippets):
=====================================================================
This bug is raised to report the issues discussed in chat thread [1]. @pulkit, please add any further enhancements you deem fit for this bug (as discussed in the chats).

Issues
---------

1.  If the MG is collected when the StorageCluster is not yet created or has already been deleted (node labels also removed), debug pod and helper pod creation is skipped. Hence these processes do not run in the background and no PIDs are generated. But we still see the following incomplete log message where, of course, the PIDs are missing:

[must-gather-zr45d] POD not creating helper pod since storagecluster is not present
>>[must-gather-zr45d] POD waiting for  to terminate 

Since no instance name/PID exists, we should skip printing this message altogether (a rough guard is sketched after the issues list).

In a normal OCS cluster, it looks like this:

[must-gather-vbvzz] POD pod/must-gather-vbvzz-helper labeled
[must-gather-vbvzz] POD waiting for 103 104 106 107 to terminate

2. If no StorageCluster is created, do we really need attempts at collecting the following NooBaa-related resources? (If yes, ignore this comment.)

collecting dump of noobaa
Wrote inspect data to must-gather/noobaa.
collecting dump of backingstore
Wrote inspect data to must-gather/noobaa.
collecting dump of bucketclass
Wrote inspect data to must-gather/noobaa.
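
To illustrate issue 1, a guard along these lines would avoid the empty message. This is only a bash sketch that assumes the gather script keeps the background collection PIDs in a hypothetical "pids" array; it is not the actual ocs-must-gather code:

# Sketch only: mention and wait on PIDs only when background collection was actually started.
if [ "${#pids[@]}" -gt 0 ]; then
    echo "waiting for ${pids[*]} to terminate"
    wait "${pids[@]}"
fi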


[1] - https://chat.google.com/room/AAAAREGEba8/B5FNcAjENMY

Version of all relevant components (if applicable):
======================================================
OCS 4.7 all versions


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
================================================================
No

Is there any workaround available to the best of your knowledge?
=============================================
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
===================================================
1

Is this issue reproducible?
===============================
Always

Can this issue be reproduced from the UI?
======================================
NA

If this is a regression, please provide more details to justify this:
==============================================================
Not sure. Pulkit can confirm.

Steps to Reproduce:
======================

1. Install the OCS operator. Do not create a StorageCluster.
2. Initiate a must-gather collection:
oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.7 |tee terminal-must-gather

A similar observation was seen in another reproducer:

1. Install the OCS operator and create a StorageCluster.
2. Delete the StorageCluster and follow the uninstall steps (remove OCS completely, along with the OCS node label).



Expected results:
======================
If no PIDs are created, the message should not be printed.

Additional info:
=======================
--------------
========CSV ======
NAME                         DISPLAY                       VERSION        REPLACES   PHASE
ocs-operator.v4.7.0-280.ci   OpenShift Container Storage   4.7.0-280.ci              Succeeded
--------------
=======PODS ======
NAME                                  READY   STATUS    RESTARTS   AGE   IP             NODE        NOMINATED NODE   READINESS GATES
noobaa-operator-5f6c776566-2tdfs      1/1     Running   0          47s   10.131.0.54    compute-2   <none>           <none>
ocs-metrics-exporter-79db8f64-vr97x   1/1     Running   0          47s   10.131.2.244   compute-1   <none>           <none>
ocs-operator-6dbf6f8c97-75x6k         1/1     Running   0          47s   10.128.4.214   compute-4   <none>           <none>
rook-ceph-operator-79dfd4d7d6-vlznh   1/1     Running   0          47s   10.130.2.146   compute-5   <none>           <none>

Comment 3 RAJAT SINGH 2021-03-12 10:58:16 UTC
(In reply to Neha Berry from comment #0)
> Created attachment 1760405 [details]
> Terminal log
> 
> Description of problem (please be as detailed as possible and provide log
> snippets):
> =====================================================================
> This bug is raised to report the issues discussed in chat thread [1].
> @pulkit, please add any further enhancements you deem fit for this bug (as
> discussed in the chats).
> 
> Issues
> ---------
> 
> 1.  If the MG is collected when the StorageCluster is not yet created or has
> already been deleted (node labels also removed), debug pod and helper pod
> creation is skipped. Hence these processes do not run in the background and
> no PIDs are generated. But we still see the following incomplete log message
> where, of course, the PIDs are missing:
> 
> [must-gather-zr45d] POD not creating helper pod since storagecluster is not
> present
> >>[must-gather-zr45d] POD waiting for  to terminate 
> 
> Since no instance name/PID exists, we should skip printing this message
> altogether.
OK, so the "waiting for ... to terminate" message comes from the debug pods, which get created irrespective of whether the StorageCluster is present. We can skip creating the debug pods when the StorageCluster is not present.
> 
> In a normal OCS cluster, it looks like this
> 
> [must-gather-vbvzz] POD pod/must-gather-vbvzz-helper labeled
> [must-gather-vbvzz] POD waiting for 103 104 106 107 to terminate
> 
> 2. If no StorageCluster is created, do we really need attempts at collecting
> the following NooBaa-related resources? (If yes, ignore this comment.)

If I understand it correctly, we still want to collect the namespace resources and NooBaa resources irrespective of the Ceph collection.
Correct me if I am wrong here @
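
For reference, the kind of gate involved could look something like this. It is purely an illustrative bash sketch (create_debug_pods is a hypothetical helper name), not the exact must-gather script:

# Sketch only: create the debug/helper pods only when a StorageCluster CR exists.
if oc get storagecluster -n openshift-storage --no-headers 2>/dev/null | grep -q .; then
    create_debug_pods   # hypothetical helper that spawns the debug and helper pods
else
    echo "not creating debug pods since storagecluster is not present"
fi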

pkundra

Comment 5 RAJAT SINGH 2021-03-25 08:46:55 UTC
This is the PR:
https://github.com/openshift/ocs-operator/pull/1109

Comment 6 RAJAT SINGH 2021-03-25 08:55:30 UTC
PR for skipping the NooBaa collection when the StorageCluster is not present:
https://github.com/openshift/ocs-operator/pull/1132

Comment 10 Neha Berry 2021-06-11 08:10:57 UTC
Created attachment 1790126 [details]
terminal output

@rajasing, sorry, I missed the needinfo.

Tested with ocs-4.8.0-416.ci, and it seems that both issues reported in comment #0 are still not fixed.

1.  If the MG is collected when the StorageCluster is not yet created or has already been deleted (node labels also removed), debug pod and helper pod creation is skipped. Hence these processes do not run in the background and no PIDs are generated. But we still see the following incomplete log message where, of course, the PIDs are missing:

[must-gather-zr45d] POD not creating helper pod since storagecluster is not present
>>[must-gather-zr45d] POD waiting for  to terminate 

Since no instance name/PID exists, we should skip printing this message altogether.


2. If no StorageCluster is created, do we really need attempts at collecting the following NooBaa-related resources?


collecting dump of noobaa
Wrote inspect data to must-gather/noobaa.
collecting dump of backingstore
Wrote inspect data to must-gather/noobaa.
collecting dump of bucketclass
Wrote inspect data to must-gather/noobaa.

The NooBaa operator pod is the only pod available until we create a StorageCluster, so only this collection makes sense:

[must-gather-5vbjl] POD 2021-06-11T07:46:23.820021735Z collecting dump of noobaa-operator-866c7c65d4-st76g pod from openshift-storage

If NooBaa is not yet installed, I am not sure whether collecting the backingstore, bucketclass, NooBaa dump, NooBaa status, and OBC list makes sense.
Even RGW is not yet up, so we won't have RGW-based OBCs and bucketclasses either.
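
For illustration only, and assuming the dumps are taken with oc adm inspect (the destination path here is hypothetical), the NooBaa-specific collection could be gated on a NooBaa CR actually existing:

# Sketch only: dump NooBaa-related resources only when a NooBaa CR exists in the namespace.
if oc get noobaa -n openshift-storage --no-headers 2>/dev/null | grep -q .; then
    for resource in noobaa backingstore bucketclass; do
        echo "collecting dump of ${resource}"
        oc adm inspect "${resource}" -n openshift-storage --dest-dir=must-gather/noobaa
    done
fi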





Fri Jun 11 08:09:34 AM UTC 2021
--------------
========CSV ======
NAME                         DISPLAY                       VERSION        REPLACES   PHASE
ocs-operator.v4.8.0-416.ci   OpenShift Container Storage   4.8.0-416.ci              Succeeded
--------------
=======PODS ======
NAME                                    READY   STATUS    RESTARTS   AGE   IP             NODE        NOMINATED NODE   READINESS GATES
noobaa-operator-866c7c65d4-st76g        1/1     Running   0          31m   10.131.1.139   compute-0   <none>           <none>
ocs-metrics-exporter-6dffc4d6bb-pwzd9   1/1     Running   0          31m   10.129.2.16    compute-2   <none>           <none>
ocs-operator-768678d7ff-86w2h           1/1     Running   0          31m   10.128.2.14    compute-1   <none>           <none>
rook-ceph-operator-7c655dfbdb-6tthb     1/1     Running   0          31m   10.129.2.15    compute-2   <none>           <none>
--------------
======= PVC ==========
No resources found in openshift-storage namespace.
--------------
======= storagecluster ==========
No resources found in openshift-storage namespace.
--------------
======= cephcluster ==========
No resources found in openshift-storage namespace.
======= backingstore ==========
No resources found in openshift-storage namespace.
======= PV ====
No resources found
======= bucketclass ==========
No resources found in openshift-storage namespace.
======= obc ==========
No resources found

Comment 12 Mudit Agarwal 2021-06-15 15:35:42 UTC
Discussed offline, not a blocker for 4.8

Comment 19 Mudit Agarwal 2021-09-16 09:50:56 UTC
Rewant, please sync up with Neha once and check if something is pending here.

Comment 27 Rewant 2021-11-08 05:28:50 UTC
As discussed, a new bug has already been created for issue #2: https://bugzilla.redhat.com/show_bug.cgi?id=2015408

Comment 29 errata-xmlrpc 2021-12-13 17:44:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Data Foundation 4.9.0 enhancement, security, and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:5086

