Description of problem (please be as detailed as possible and provide log snippets):
----------------------------------------------------------------------
Logs attached here - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-1845898/ocs-must-gather/must-gather.local.991649001401754259/ceph/namespaces/openshift-storage/must_gather_commands/

Logs in zipped format - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-1845898.zip

Must-gather for an Independent mode cluster needs some changes to collect the ceph command outputs from an external cluster.

As seen in "gather-debug.log", the commands are run but they fail to connect to the cluster.

------------------------------------
collecting command output for: ceph status
unable to parse addrs in 'dell-r730-044=10.1.8.54:6789,dell-r730-031=10.1.8.41:6789,dell-r730-037=10.1.8.47:6789'
[errno 22] error connecting to the cluster
command terminated with exit code 1
unable to parse addrs in 'dell-r730-044=10.1.8.54:6789,dell-r730-031=10.1.8.41:6789,dell-r730-037=10.1.8.47:6789'
------------------------------------

Note: the MON IPs are correct and we are able to create PVCs etc., so the connection with the external cluster is intact.

Version of all relevant components (if applicable):
----------------------------------------------------------------------
4.5.0-0.nightly-2020-06-03-215545
ocs-operator.v4.5.0-446.ci
ceph version 14.2.8-59.el8cp (53387608e81e6aa2487c952a604db06faa5b2cd0) nautilus (stable)

Does this issue impact your ability to continue to work with the product (please explain in detail what the user impact is)?
----------------------------------------------------------------------

Is there any workaround available to the best of your knowledge?
----------------------------------------------------------------------

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex):
----------------------------------------------------------------------

Is this issue reproducible?
----------------------------------------------------------------------

Can this issue be reproduced from the UI?
----------------------------------------------------------------------

If this is a regression, please provide more details to justify this:
----------------------------------------------------------------------

Steps to Reproduce:
----------------------------------------------------------------------
1. Create an OCP 4.5 cluster and install OCS in Independent mode (an RHCS 4.1 cluster was pre-configured and added to the OCS StorageCluster).
2. Run the command to collect must-gather:
   oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.5
3. Check for error messages during the must-gather collection, or check the same in gather-debug.log inside the must-gather folder.
4. Check the content of the ceph commands under the "ceph/namespaces/openshift-storage/must_gather_commands" folder of the collected must-gather. The files are present, but all commands failed and there are no outputs.
Actual results:
----------------------------------------------------------------------
As seen during the must-gather run, the ceph command collection fails.

Expected results:
----------------------------------------------------------------------
Must-gather should be able to collect ceph outputs from the external Ceph cluster (dashboards gather some of this information anyway).

Additional info:
----------------------------------------------------------------------
The openshift-storage.config file has the details of all 3 MONs from the external cluster, yet it is still unable to get the ceph command output. Hence, we need to confirm what additional changes are required in the commands to collect this information.

$ oc rsh -n openshift-storage rook-ceph-operator-8659bd856-tw65h
sh-4.4$ ceph --conf=/var/lib/rook/openshift-storage/openshift-storage.config osd versions
[errno 5] error connecting to the cluster
sh-4.4$ cat /var/lib/rook/openshift-storage/openshift-storage.config
[global]
fsid = fe01cf06-8c2b-4e5b-9fea-8a6a8e402b88
mon initial members = dell-r730-044 dell-r730-031 dell-r730-037
mon host = [v2:10.1.8.54:3300,v1:10.1.8.54:6789],[v2:10.1.8.41:3300,v1:10.1.8.41:6789],[v2:10.1.8.47:3300,v1:10.1.8.47:6789]

[client.admin]
keyring = /var/lib/rook/openshift-storage/client.admin.keyring

[client.healthchecker]
keyring = /var/lib/rook/openshift-storage/client.healthchecker.keyring
sh-4.4$
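One way to tell the address-parsing failure apart from a genuine connectivity problem is to point the client at a single MON explicitly. This is only a debugging sketch reusing the MON IP and config path from above; -m (--mon-host) is the standard ceph CLI option and accepts a plain ip:port:

$ oc rsh -n openshift-storage rook-ceph-operator-8659bd856-tw65h
# Bypass the generated mon address list and target one MON directly
sh-4.4$ ceph --conf=/var/lib/rook/openshift-storage/openshift-storage.config -m 10.1.8.54:6789 status

If this succeeds while the must-gather invocation fails, the problem is in how must-gather builds the 'name=ip:port' address list rather than in reachability.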
I'm not sure we'd collect logs from an external RHCS cluster via ocs-must-gather. We have tools to collect logs on the RHCS side.
(In reply to Yaniv Kaul from comment #2)
> I'm not sure we'd collect logs from an external RHCS cluster via
> ocs-must-gather. We have tools to collect logs on the RHCS side.

Not the logs, only the output of ceph commands.

Pulkit, can you check if this is possible - i.e. running ceph commands on the external cluster from the toolbox?
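For reference, a rough sketch of what that check could look like (the app=rook-ceph-tools label is the one Rook normally puts on the toolbox deployment; adjust if it differs on the cluster):

# Find the toolbox pod and open a shell in it
$ TOOLS_POD=$(oc get pod -n openshift-storage -l app=rook-ceph-tools -o name)
$ oc rsh -n openshift-storage $TOOLS_POD
# Inside the pod, try read-only commands against the external cluster
sh-4.4$ ceph status
sh-4.4$ ceph osd tree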
What's the decision - is it going to be worked on for 4.5 or not?
You forgot to also remove the ocs-4.5.0? flag. :) Doing so now.
If not for OCS 4.5, can we plan to consider the fix for a z-stream of OCS 4.5?
(In reply to Neha Berry from comment #10)
> If not for OCS 4.5, can we plan to consider the fix for a z-stream of OCS 4.5?

Yes, that should be possible as there's a patch available already.
Sahina, no, the change is too big, it's too risky.
(In reply to leseb from comment #14)
> Sahina, no, the change is too big, it's too risky.

I think this means we should move it from 4.5.z to 4.6.0?
(In reply to Michael Adam from comment #15)
> (In reply to leseb from comment #14)
> > Sahina, no, the change is too big, it's too risky.
>
> I think this means we should move it from 4.5.z to 4.6.0?

Done
Not setting "Fixed in version" because the fix has been in 4.6 for a long time now.
@Neha Can you share the /etc/ceph/ceph.conf from the toolbox pod? Also does the /etc/ceph/keyring match what you expect? Something must be wrong in the toolbox config that is preventing the ceph connection.
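For anyone collecting these, both files can be pulled without an interactive shell, e.g. (the pod name is a placeholder for the actual toolbox pod):

$ oc rsh -n openshift-storage <toolbox-pod> cat /etc/ceph/ceph.conf
$ oc rsh -n openshift-storage <toolbox-pod> cat /etc/ceph/keyring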
Neha, can you please give it a try now?
(In reply to Mudit Agarwal from comment #25)
> Neha, can you please give it a try now?

Hi Mudit,

What Sidhant tested was just to confirm that we can run ceph commands in the manually created toolbox, in case the toolbox has the proper ceph admin key. It is not the solution to OCS must-gather. There is no fix yet to try again.

Sidhant and I got into the Operators meeting and discussed the scenario. The toolbox created during must-gather, or the one created via [1], lacks this admin key; hence it is unable to connect to the RHCS cluster to run ceph commands. The error message we get is:

$ oc rsh rook-ceph-tools-9858c9845-6z5q8
sh-4.4$ ceph -s
[errno 5] RADOS I/O error (error connecting to the cluster)
sh-4.4$

The key in the secret "rook-ceph-mon", which is part of the toolbox pod created during must-gather, does not have admin rights.

@Travis confirmed that he will look into how to get the proper key into the toolbox.

[1] - oc patch ocsinitialization ocsinit -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/enableCephTools", "value": true }]'
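To see which key the toolbox actually ends up with, the secret can be inspected directly. A minimal check, assuming access to the openshift-storage namespace (the field names inside the secret vary between Rook versions, so dump the whole object rather than relying on a fixed jsonpath):

# Dump the secret the toolbox is built from and compare its key
# with the external cluster's client.admin keyring
$ oc get secret rook-ceph-mon -n openshift-storage -o yaml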
Ok, we see now that the expected must-gather commands all require the admin keyring and do not work with the lower-privileged keyring that was provided to the cluster. The fix to the toolbox that was included in 4.6 only made it properly use whatever keyring was provided for the external cluster; it didn't mean that the toolbox was expected to have privileges to run every Ceph command.

By design, the external cluster provides a lower-privileged key to connect with, and the must-gather commands will fail as long as no admin key is provided.

As Yaniv originally indicated in the bug, we don't expect to gather the Ceph status of the external cluster. We will need to rely on the RHCS admin to provide information about the external cluster. OCS isn't the admin of the external cluster, so we can't expect to gather admin-privileged info.

@Pulkit Either must-gather shouldn't call the admin ceph commands on the external cluster, or we need to ignore the errors.
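A minimal sketch of the "skip" option, assuming the gather script can query the StorageCluster CR (spec.externalStorage.enable is the field ocs-operator uses to flag external mode; the exact path should be double-checked against the CRD):

# Skip ceph command collection when the StorageCluster is external
external=$(oc get storagecluster -n openshift-storage -o jsonpath='{.items[0].spec.externalStorage.enable}')
if [ "$external" = "true" ]; then
    echo "Skipping the ceph collection as External Storage is enabled"
else
    # placeholder for the existing ceph command collection logic
    collect_ceph_commands
fi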
(In reply to Travis Nielsen from comment #27)
> Ok, we see now that the expected must-gather commands all require the admin
> keyring and do not work with the lower-privileged keyring that was provided
> to the cluster.
> [...]
> @Pulkit Either must-gather shouldn't call the admin ceph commands on the
> external cluster, or we need to ignore the errors.

Hi Travis,

After trying it manually, it does seem difficult to gain access to the RHCS admin key, as the uploaded JSON doesn't contain the key (as you said).

But in case some of our PVCs are pending or we are facing OCS-related issues, we would still want some information from the RHCS side.

@bipin Should we have a KCS article in place on how to collect ceph command output after adding the ceph admin key to the toolbox (provided the RHCS admin gives the key to the support team)? Just thinking out loud. Let me know if this doesn't make sense at all.
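If such a KCS article gets written, the manual steps inside the toolbox would presumably look something like this (the keyring content is whatever the RHCS admin hands over, and the /tmp path is just a placeholder):

# Write the admin keyring obtained from the RHCS admin
sh-4.4$ cat > /tmp/client.admin.keyring <<'EOF'
[client.admin]
    key = <admin key supplied by the RHCS admin>
EOF
# Run ceph as client.admin with the supplied keyring
sh-4.4$ ceph -n client.admin --keyring /tmp/client.admin.keyring status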
Neha,

Let's gather must-gather and a sosreport from the RHCS node for External Mode.

-Bipin Kunal
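For reference, collecting that on an RHCS node is just the standard RHEL tooling (the command name varies by RHEL release):

# On the RHCS node (RHEL 7: sosreport; RHEL 8: sos report)
$ sudo sosreport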
Created attachment 1723253 [details]
terminal output from must-gather

Ack. I will raise a new troubleshooting doc BZ to collect a sosreport from the RHCS side.

Verified the fix in OCS version 4.6.0-137.ci and OCP 4.6.0-0.nightly-2020-10-17-040148.

must-gather is skipping the collection of ceph commands and the creation of a toolbox (must-gather-helper) pod in the openshift-storage namespace.

Snip from terminal
=========================
[must-gather-hck57] POD collecting dump of noobaa-db-0 pod from openshift-storage
[must-gather-hck57] POD collecting dump of noobaa-operator-6499b55c9b-x6hrg pod from openshift-storage
>> [must-gather-hck57] POD Skipping the ceph collection as External Storage is enabled
[must-gather-hck57] OUT waiting for gather to complete
[must-gather-hck57] OUT downloading gather output

Based on the fix, moving the BZ to verified state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5605