Bug 2186225

Summary: [RDR] when running any ceph cmd we see error 2023-03-31T08:25:31.844+0000 7f8deaffd640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: umanga <uchapaga>
Component: ocs-operator Assignee: umanga <uchapaga>
Status: CLOSED ERRATA QA Contact: Sidhant Agrawal <sagrawal>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 4.13 CC: kramdoss, muagarwa, nberry, ocs-bugs, odf-bz-bot, prsurve, sapillai, sgaddam
Target Milestone: --- Keywords: AutomationBlocker
Target Release: ODF 4.13.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: 2183457 Environment:
Last Closed: 2023-06-21 15:25:03 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 2183457    
Bug Blocks:    

Description umanga 2023-04-12 13:16:16 UTC
+++ This bug was initially created as a clone of Bug #2183457 +++

Description of problem (please be detailed as possible and provide log
snippets):

[RDR] When running the ceph status command, we see the error: 2023-03-31T08:25:31.844+0000 7f8deaffd640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]


Version of all relevant components (if applicable):

OCP version:- 4.13.0-0.nightly-2023-03-29-235439
ODF version:- 4.13.0-121
CEPH version:- ceph version 17.2.5-1342.el9cp (ed07851f2c5b8d3dccadf079402f86a67cb7d3e5) quincy (stable)
ACM version:- v2.7.2

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes

Can this issue reproduce from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy an RDR cluster with Globalnet.
2. After the ODF deployment, add spec.network.multiClusterService.Enabled: true to the storagecluster.
3. Check ceph status via the toolbox.


Actual results:
$ ceph status
2023-03-31T08:25:31.844+0000 7f8deaffd640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
2023-03-31T08:25:31.844+0000 7f8deb7fe640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
2023-03-31T08:25:37.844+0000 7f8deaffd640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
2023-03-31T08:25:40.843+0000 7f8deaffd640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
[errno 13] RADOS permission denied (error connecting to the cluster)
command terminated with exit code 13


Expected results:


Additional info:
We have seen this on one of the managed clusters in the RDR setup, but not on the second managed cluster.

--- Additional comment from RHEL Program Management on 2023-03-31 14:02:03 IST ---

This bug having no release flag set previously, is now set with release flag 'odf‑4.13.0' to '?', and so is being proposed to be fixed at the ODF 4.13.0 release. Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any previously set while release flag was missing, have now been reset since the Acks are to be set against a release flag.

--- Additional comment from RHEL Program Management on 2023-03-31 14:02:03 IST ---

Since this bug has severity set to 'urgent', it is being proposed as a blocker for the currently set release flag. Please resolve ASAP.

--- Additional comment from Pratik Surve on 2023-03-31 14:19:26 IST ---

Logs:- http://rhsqe-repo.lab.eng.blr.redhat.com/ocs4qe/pratik/bz/2183457/mar31/31-03-2023_14-02-09

--- Additional comment from Santosh Pillai on 2023-03-31 21:44:45 IST ---

mon logs:

debug 2023-03-31T09:40:18.696+0000 7f33c486b640  1 mon.b@0(probing) e8 handle_auth_request failed to assign global_id
debug 2023-03-31T09:40:18.755+0000 7f33c486b640  1 mon.b@0(probing) e8 handle_auth_request failed to assign global_id
debug 2023-03-31T09:40:18.853+0000 7f33c486b640  1 mon.b@0(probing) e8 handle_auth_request failed to assign global_id
debug 2023-03-31T09:40:18.897+0000 7f33c486b640  1 mon.b@0(probing) e8 handle_auth_request failed to assign global_id
debug 2023-03-31T09:40:18.956+0000 7f33c486b640  1 mon.b@0(probing) e8 handle_auth_request failed to assign global_id
debug 2023-03-31T09:40:19.055+0000 7f33c486b640  1 mon.b@0(probing) e8 handle_auth_request failed to assign global_id
debug 2023-03-31T09:40:19.298+0000 7f33c486b640  1 mon.b@0(probing) e8 handle_auth_request failed to assign global_id
debug 2023-03-31T09:40:19.357+0000 7f33c486b640  1 mon.b@0(probing) e8 handle_auth_request failed to assign global_id
debug 2023-03-31T09:40:19.449+0000 7f33ca877640 -1 mon.b@0(probing) e8 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
debug 2023-03-31T09:40:19.456+0000 7f33c486b640  1 mon.b@0(probing) e8 handle_auth_request failed to assign global_id
debug 2023-03-31T09:40:20.100+0000 7f33c486b640  1 mon.b@0(probing) e8 handle_auth_request failed to assign global_id
debug 2023-03-31T09:40:20.159+0000 7f33c486b640  1 mon.b@0(probing) e8 handle_auth_request failed to assign global_id
debug 2023-03-31T09:40:20.258+0000 7f33c486b640  1 mon.b@0(probing) e8 handle_auth_request failed to assign global_id
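
Each mon log line above carries the monitor name, rank, and state in the form mon.<name>@<rank>(<state>); a monitor in the "probing" state is still looking for its peers and is not part of a quorum. A minimal Python sketch for tallying such lines by monitor, state, and handler follows; the regular expression and the summarize helper are assumptions made for illustration, based only on the lines quoted here.

```python
import re
from collections import Counter

# Assumed layout: mon.<name>@<rank>(<state>) e<epoch> <message>
MON_LINE = re.compile(
    r"mon\.(?P<name>\S+?)@(?P<rank>-?\d+)\((?P<state>[^)]+)\) "
    r"e(?P<epoch>\d+) (?P<msg>.*)"
)


def summarize(lines):
    """Count log lines per (monitor name, state, first word of the message)."""
    counts = Counter()
    for line in lines:
        m = MON_LINE.search(line)
        if m:
            handler = m["msg"].split()[0]
            counts[(m["name"], m["state"], handler)] += 1
    return counts


sample = [
    "debug 2023-03-31T09:40:19.357+0000 7f33c486b640  1 mon.b@0(probing) e8 "
    "handle_auth_request failed to assign global_id",
    "debug 2023-03-31T09:40:19.449+0000 7f33ca877640 -1 mon.b@0(probing) e8 "
    "handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied",
    "debug 2023-03-31T09:40:19.456+0000 7f33c486b640  1 mon.b@0(probing) e8 "
    "handle_auth_request failed to assign global_id",
]
counts = summarize(sample)
for key, n in counts.most_common():
    print(key, n)
```

Run over a full mon log, a tally like this makes it easy to see whether a monitor ever leaves the probing state.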

--- Additional comment from Santosh Pillai on 2023-04-04 13:36:19 IST ---

Reinstalling the ODF cluster is the workaround for that. While I investigate what's happening to the mon quorum, this workaround can be used.

--- Additional comment from Santosh Pillai on 2023-04-07 13:22:58 IST ---

Still investigating.

--- Additional comment from Santosh Pillai on 2023-04-07 15:48:02 IST ---

The OSD pod is missing the socket file:

```
  Normal   Started                3h                   kubelet            Started container osd
  Normal   Pulled                 3h                   kubelet            Container image "quay.io/rhceph-dev/rhceph@sha256:f916da02f59b8f73ad18eb65310333d1e3cbd1a54678ff50bf27ed9618719b63" already present on machine
  Normal   Created                3h                   kubelet            Created container log-collector
  Normal   Started                3h                   kubelet            Started container log-collector
  Warning  Unhealthy              31s (x1076 over 3h)  kubelet            Startup probe failed: ceph daemon health check failed with the following output:
> admin_socket: exception getting command descriptions: [Errno 2] No such file or directory

```

--- Additional comment from krishnaram Karthick on 2023-04-11 10:18:36 IST ---

Removing testblocker and adding automation blocker for the following reasons. 

1) With the workaround of reinstalling ODF on the affected cluster, QE should be able to proceed with the deployment. 
2) However, this could be a challenge for automated deployments + testing.

Comment 10 errata-xmlrpc 2023-06-21 15:25:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:3742