Description of problem (please be as detailed as possible and provide log snippets):
----------------------------------------------------------------------
Logs attached here - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-1845898/ocs-must-gather/must-gather.local.991649001401754259/ceph/namespaces/openshift-storage/must_gather_commands/

Logs in zipped format - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-1845898.zip

Must-gather for an Independent mode cluster needs some changes to collect the ceph command outputs from an external cluster.

As seen in "gather-debug.log", the commands are run but they fail to connect to the cluster.

------------------------------------
collecting command output for: ceph status
unable to parse addrs in 'dell-r730-044=10.1.8.54:6789,dell-r730-031=10.1.8.41:6789,dell-r730-037=10.1.8.47:6789'
[errno 22] error connecting to the cluster
command terminated with exit code 1
unable to parse addrs in 'dell-r730-044=10.1.8.54:6789,dell-r730-031=10.1.8.41:6789,dell-r730-037=10.1.8.47:6789'
------------------------------------

Note: the MON IPs are correct and we are able to create PVCs etc., so the connection with the external cluster is intact.

Version of all relevant components (if applicable):
----------------------------------------------------------------------
4.5.0-0.nightly-2020-06-03-215545
ocs-operator.v4.5.0-446.ci
ceph version 14.2.8-59.el8cp (53387608e81e6aa2487c952a604db06faa5b2cd0) nautilus (stable)

Does this issue impact your ability to continue to work with the product (please explain in detail what the user impact is)?
----------------------------------------------------------------------

Is there any workaround available to the best of your knowledge?
----------------------------------------------------------------------

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex):
----------------------------------------------------------------------

Is this issue reproducible?
----------------------------------------------------------------------

Can this issue be reproduced from the UI?
----------------------------------------------------------------------

If this is a regression, please provide more details to justify this:
----------------------------------------------------------------------

Steps to Reproduce:
----------------------------------------------------------------------
1. Create an OCP 4.5 cluster and install OCS in Independent mode (an RHCS 4.1 cluster was pre-configured and added to the OCS StorageCluster).
2. Run the command to collect must-gather:
   oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.5
3. Check for error messages during the must-gather collection, or check the same in gather-debug.log inside the must-gather folder.
4. Check the content of the ceph commands under the "ceph/namespaces/openshift-storage/must_gather_commands" folder of the collected must-gather. The files are present, but all commands failed and there are no outputs.
Actual results:
----------------------------------------------------------------------
As seen during the must-gather run, the ceph command collection fails.

Expected results:
----------------------------------------------------------------------
Must-gather should be able to collect ceph outputs from the external Ceph cluster (dashboards gather some of this information anyway).

Additional info:
----------------------------------------------------------------------
The openshift-storage.config file has the details of all 3 MONs from the external cluster, yet it is still unable to get the ceph command output. Hence, we need to confirm what additional changes are required in the commands to collect this information.

$ oc rsh -n openshift-storage rook-ceph-operator-8659bd856-tw65h
sh-4.4$ ceph --conf=/var/lib/rook/openshift-storage/openshift-storage.config osd versions
[errno 5] error connecting to the cluster
sh-4.4$ cat /var/lib/rook/openshift-storage/openshift-storage.config
[global]
fsid = fe01cf06-8c2b-4e5b-9fea-8a6a8e402b88
mon initial members = dell-r730-044 dell-r730-031 dell-r730-037
mon host = [v2:10.1.8.54:3300,v1:10.1.8.54:6789],[v2:10.1.8.41:3300,v1:10.1.8.41:6789],[v2:10.1.8.47:3300,v1:10.1.8.47:6789]

[client.admin]
keyring = /var/lib/rook/openshift-storage/client.admin.keyring

[client.healthchecker]
keyring = /var/lib/rook/openshift-storage/client.healthchecker.keyring
sh-4.4$
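One way to tell the address-parsing failure apart from a genuine connectivity problem is to point the client at a single MON explicitly. This is only a debugging sketch reusing the MON IP and config path from above; -m (--mon-host) is the standard ceph CLI option and accepts a plain ip:port:

$ oc rsh -n openshift-storage rook-ceph-operator-8659bd856-tw65h
# Bypass the generated mon address list and target one MON directly
sh-4.4$ ceph --conf=/var/lib/rook/openshift-storage/openshift-storage.config -m 10.1.8.54:6789 status

If this succeeds while the must-gather invocation fails, the problem is in how must-gather builds the 'name=ip:port' address list rather than in reachability.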
I'm not sure we'd collect logs from an external RHCS cluster via ocs-must-gather. We have tools to collect logs on the RHCS side.
(In reply to Yaniv Kaul from comment #2)
> I'm not sure we'd collect logs from an external RHCS cluster via
> ocs-must-gather. We have tools to collect logs on the RHCS side.

Not the logs, only the output of ceph commands.

Pulkit, can you check if this is possible - i.e. running ceph commands on the external cluster from the toolbox?
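For reference, a rough sketch of what that check could look like (the app=rook-ceph-tools label is the one Rook normally puts on the toolbox deployment; adjust if it differs on the cluster):

# Find the toolbox pod and open a shell in it
$ TOOLS_POD=$(oc get pod -n openshift-storage -l app=rook-ceph-tools -o name)
$ oc rsh -n openshift-storage $TOOLS_POD
# Inside the pod, try read-only commands against the external cluster
sh-4.4$ ceph status
sh-4.4$ ceph osd tree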
What's the decision - is it going to be worked on for 4.5 or not?
You forgot to also remove the ocs-4.5.0? flag. :) Doing so now.
If not for OCS 4.5, can we plan to consider the fix for a z-stream of OCS 4.5?
(In reply to Neha Berry from comment #10)
> If not for OCS 4.5, can we plan to consider the fix for a z-stream of OCS 4.5?

Yes, that should be possible as there's a patch available already.
Sahina, no, the change is too big, it's too risky.
(In reply to leseb from comment #14)
> Sahina, no, the change is too big, it's too risky.

I think this means we should move it from 4.5.z to 4.6.0?
(In reply to Michael Adam from comment #15)
> (In reply to leseb from comment #14)
> > Sahina, no, the change is too big, it's too risky.
>
> I think this means we should move it from 4.5.z to 4.6.0?

Done
Not setting "Fixed in version" because the fix has been in 4.6 for a long time now.
@Neha Can you share the /etc/ceph/ceph.conf from the toolbox pod? Also does the /etc/ceph/keyring match what you expect? Something must be wrong in the toolbox config that is preventing the ceph connection.
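For anyone collecting these, both files can be pulled without an interactive shell, e.g. (the pod name is a placeholder for the actual toolbox pod):

$ oc rsh -n openshift-storage <toolbox-pod> cat /etc/ceph/ceph.conf
$ oc rsh -n openshift-storage <toolbox-pod> cat /etc/ceph/keyring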
Neha, can you please give it a try now?
(In reply to Mudit Agarwal from comment #25)
> Neha, can you please give it a try now?

Hi Mudit,

What Sidhant tested was just to confirm that we can run ceph commands in the manually created toolbox, in case the toolbox has the proper ceph admin key. It is not the solution to OCS must-gather. There is no fix yet to try again.

Sidhant and I got into the Operators meeting and discussed the scenario. The toolbox created during must-gather, or the one created via [1], lacks this admin key; hence it is unable to connect to the RHCS cluster to run ceph commands. The error message we get is:

$ oc rsh rook-ceph-tools-9858c9845-6z5q8
sh-4.4$ ceph -s
[errno 5] RADOS I/O error (error connecting to the cluster)
sh-4.4$

The key in the secret "rook-ceph-mon", which is part of the toolbox pod created during must-gather, does not have admin rights.

@Travis confirmed that he will look into how to get the proper key into the toolbox.

[1] - oc patch ocsinitialization ocsinit -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/enableCephTools", "value": true }]'
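To see which key the toolbox actually ends up with, the secret can be inspected directly. A minimal check, assuming access to the openshift-storage namespace (the field names inside the secret vary between Rook versions, so dump the whole object rather than relying on a fixed jsonpath):

# Dump the secret the toolbox is built from and compare its key
# with the external cluster's client.admin keyring
$ oc get secret rook-ceph-mon -n openshift-storage -o yaml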
Ok, we see now that the expected must-gather commands all require the admin keyring and do not work with the lower-privileged keyring that was provided to the cluster. The fix to the toolbox that was included in 4.6 only made it properly use whatever keyring was provided for the external cluster; it didn't mean that the toolbox was expected to have privileges to run every Ceph command.

By design, the external cluster provides a lower-privileged key to connect with, and the must-gather commands will fail as long as no admin key is provided.

As Yaniv originally indicated in the bug, we don't expect to gather the Ceph status of the external cluster. We will need to rely on the RHCS admin to provide information about the external cluster. OCS isn't the admin of the external cluster, so we can't expect to gather admin-privileged info.

@Pulkit Either must-gather shouldn't call the admin ceph commands on the external cluster, or we need to ignore the errors.
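A minimal sketch of the "skip" option, assuming the gather script can query the StorageCluster CR (spec.externalStorage.enable is the field ocs-operator uses to flag external mode; the exact path should be double-checked against the CRD):

# Skip ceph command collection when the StorageCluster is external
external=$(oc get storagecluster -n openshift-storage -o jsonpath='{.items[0].spec.externalStorage.enable}')
if [ "$external" = "true" ]; then
    echo "Skipping the ceph collection as External Storage is enabled"
else
    # placeholder for the existing ceph command collection logic
    collect_ceph_commands
fi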
(In reply to Travis Nielsen from comment #27)
> Ok, we see now that the expected must-gather commands all require the admin
> keyring and do not work with the lower-privileged keyring that was provided
> to the cluster.
> [...]
> @Pulkit Either must-gather shouldn't call the admin ceph commands on the
> external cluster, or we need to ignore the errors.

Hi Travis,

After trying it manually, it does seem difficult to gain access to the RHCS admin key, as the uploaded JSON doesn't contain the key (as you said).

But in case some of our PVCs are pending or we are facing OCS-related issues, we would still want some information from the RHCS side.

@bipin Should we have a KCS article in place on how to collect ceph command output after adding the ceph admin key to the toolbox (provided the RHCS admin gives the key to the support team)? Just thinking out loud. Let me know if this doesn't make sense at all.
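If such a KCS article gets written, the manual steps inside the toolbox would presumably look something like this (the keyring content is whatever the RHCS admin hands over, and the /tmp path is just a placeholder):

# Write the admin keyring obtained from the RHCS admin
sh-4.4$ cat > /tmp/client.admin.keyring <<'EOF'
[client.admin]
    key = <admin key supplied by the RHCS admin>
EOF
# Run ceph as client.admin with the supplied keyring
sh-4.4$ ceph -n client.admin --keyring /tmp/client.admin.keyring status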
Neha,

Let's gather must-gather and a sosreport from the RHCS node for External Mode.

-Bipin Kunal
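For reference, collecting that on an RHCS node is just the standard RHEL tooling (the command name varies by RHEL release):

# On the RHCS node (RHEL 7: sosreport; RHEL 8: sos report)
$ sudo sosreport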
Created attachment 1723253 [details]
terminal output from must-gather

Ack. I will raise a new troubleshooting doc BZ to collect a sosreport from the RHCS side.

Verified the fix in OCS version 4.6.0-137.ci and OCP 4.6.0-0.nightly-2020-10-17-040148.

must-gather is skipping the collection of ceph commands and the creation of a toolbox (must-gather-helper) pod in the openshift-storage namespace.

Snip from terminal
=========================
[must-gather-hck57] POD collecting dump of noobaa-db-0 pod from openshift-storage
[must-gather-hck57] POD collecting dump of noobaa-operator-6499b55c9b-x6hrg pod from openshift-storage
>> [must-gather-hck57] POD Skipping the ceph collection as External Storage is enabled
[must-gather-hck57] OUT waiting for gather to complete
[must-gather-hck57] OUT downloading gather output

Based on the fix, moving the BZ to verified state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5605