Bug 1980202 - OCS/ODF Operator on OpenShift could not connect to External Ceph with "[errno 13] error connecting to the cluster" in rook container
Summary: OCS/ODF Operator on OpenShift could not connect to External Ceph with "[errno...
Keywords:
Status: CLOSED DUPLICATE of bug 1974476
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: ceph
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Scott Ostapovicz
QA Contact: Raz Tamir
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-07-08 05:13 UTC by Mohammed Salih
Modified: 2021-07-08 06:14 UTC
CC: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-08 06:00:43 UTC
Embargoed:


Attachments: none


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1931811 1 None None None 2022-10-20 05:53:21 UTC

Description Mohammed Salih 2021-07-08 05:13:54 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
OCS/ODF Operator on OpenShift could not connect to External Ceph, failing with "[errno 13] error connecting to the cluster" in the rook container. I have checked access from the OCP cluster to the Ceph cluster, and all of the servers and their ports are reachable from OpenShift. I also validated that the keyring and user have sufficient access to check cluster health from the bastion host, which is part of the OpenShift cluster, and it works fine.
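The connectivity and keyring checks described above can be sketched roughly as follows; the monitor hostname, keyring path, and client ID shown are illustrative placeholders, not values taken from this report:

```shell
# Check that the Ceph monitor ports (msgr v2 and legacy v1) are
# reachable from an OpenShift node or the bastion host.
# ceph-mon1.example.com is a placeholder monitor hostname.
nc -zv ceph-mon1.example.com 3300   # msgr v2
nc -zv ceph-mon1.example.com 6789   # msgr v1 (legacy)

# Verify that the healthchecker user and its keyring can query
# cluster health directly, bypassing rook. Paths are illustrative.
ceph -s \
  --id healthchecker \
  --keyring /etc/ceph/ceph.client.healthchecker.keyring \
  -m ceph-mon1.example.com
```

If `ceph -s` succeeds here but rook still fails with errno 13, the problem is more likely an authentication-policy mismatch than basic network reachability.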

Here are the versions 

OpenShift Disconnected UPI deployment :

Client Version: 4.6.36
Server Version: 4.6.35
Kubernetes Version: v1.19.0+b00ba52

OCS/ODF version: 4.6.5 

Ceph Version: 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

Yes it impacts the PoC I am working on. It is a blocker for moving forward.

Is there any workaround available to the best of your knowledge? 

Yes. While triaging the issue, I found that the Ceph monitor logs were complaining: "cephx server client.healthchecker: attempt to reclaim global_id 64287 without presenting ticket".

Based on a quick search, this can be caused by a client version mismatch. So I set "auth_allow_insecure_global_id_reclaim" to true on the Ceph side using the command `ceph config set mon auth_allow_insecure_global_id_reclaim true`, and also ran `ceph config set mon mon_warn_on_insecure_global_id_reclaim_allowed false`. Once these were set, the rook container in OCS/ODF started connecting and the whole deployment went ahead.

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2 - It is not a complex deployment. OpenShift and External Ceph were newly installed with nothing on them, OCS/ODF was installed using the operator, and then the storage was connected as per the doc.

Is this issue reproducible?
Yes.
Can this issue be reproduced from the UI?
Yes.

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install OpenShift 4.6.35 using disconnected UPI
2. Install the latest Ceph version, 14.2.11-181.el8cp
3. Install OCS/ODF on OpenShift
4. Follow the official document to connect OCS/ODF to the External cluster with the JSON generated from the External cluster
5. Observe the rook container log. After a few minutes you should get "[errno 13] error connecting to the cluster"
6. As a workaround, run the following commands on the Ceph cluster:
ceph config set mon auth_allow_insecure_global_id_reclaim true
ceph config set mon mon_warn_on_insecure_global_id_reclaim_allowed false
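The diagnosis and workaround from the steps above can be collected into one sequence. This is a sketch: the monitor log path is the common default and may differ on your deployment, and the workaround should be reverted once all clients are upgraded past the global_id reclaim fix:

```shell
# On a Ceph monitor node: confirm the cephx global_id reclaim
# rejection that accompanies the "[errno 13]" failure on the rook side.
# The log path below is the typical default and may vary.
grep "attempt to reclaim global_id" /var/log/ceph/ceph-mon.*.log

# Workaround: allow insecure global_id reclaim and silence the
# associated health warning (re-tighten once clients are upgraded).
ceph config set mon auth_allow_insecure_global_id_reclaim true
ceph config set mon mon_warn_on_insecure_global_id_reclaim_allowed false

# Verify the setting took effect (should print: true).
ceph config get mon auth_allow_insecure_global_id_reclaim
```

Note that this loosens an authentication hardening introduced for CVE-2021-20288; the proper fix is upgrading the client side (here, OCS/ODF 4.6.6) rather than leaving insecure reclaim enabled.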


Actual results:
OCS/ODF could not connect to External Ceph cluster

Expected results:
OCS/ODF connects to the External Ceph cluster and the storage service is available in OpenShift.

Additional info:
A possibly similar bug is linked to this report.

Comment 2 Mudit Agarwal 2021-07-08 06:00:43 UTC
Please install 4.6.6, which has the fix for this issue.

*** This bug has been marked as a duplicate of bug 1974476 ***

Comment 3 Mohammed Salih 2021-07-08 06:13:59 UTC
4.6.5 is the version that I got when I pulled the Operator to the local registry on 5th July. Is it a new release that came out after that?

Comment 4 Mudit Agarwal 2021-07-08 06:14:57 UTC
Yes, it was released just yesterday.

