Bug 2183457 - [RDR] when running any ceph cmd we see error 2023-03-31T08:25:31.844+0000 7f8deaffd640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.13
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ODF 4.13.0
Assignee: Santosh Pillai
QA Contact: Sidhant Agrawal
URL:
Whiteboard:
Depends On:
Blocks: 2186225
Reported: 2023-03-31 08:31 UTC by Pratik Surve
Modified: 2023-08-09 17:03 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 2186225
Environment:
Last Closed: 2023-06-21 15:25:01 UTC
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | red-hat-storage rook pull 470 | 0 | None | open | Bug 2183457: Use clusterID for nslookup of exported IPs | 2023-04-12 13:13:52 UTC
Github | rook rook pull 12064/files | 0 | None | None | None | 2023-04-11 11:50:26 UTC
Red Hat Product Errata | RHBA-2023:3742 | 0 | None | None | None | 2023-06-21 15:25:29 UTC

Description Pratik Surve 2023-03-31 08:31:54 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

[RDR] When running the ceph status command, we see: 2023-03-31T08:25:31.844+0000 7f8deaffd640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]


Version of all relevant components (if applicable):

OCP version:- 4.13.0-0.nightly-2023-03-29-235439
ODF version:- 4.13.0-121
CEPH version:- ceph version 17.2.5-1342.el9cp (ed07851f2c5b8d3dccadf079402f86a67cb7d3e5) quincy (stable)
ACM version:- v2.7.2

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes

Can this issue reproduce from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy an RDR cluster with globalnet
2. Add spec.network.multiClusterService.Enabled: true to the StorageCluster after ODF deployment
3. Check ceph status via the toolbox
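
Step 2 can be expressed as a JSON merge patch. The following is a minimal sketch, not the exact procedure used in this report. It assumes the default ODF names (`ocs-storagecluster` in the `openshift-storage` namespace) and the lower-case key `enabled`, whereas the step above writes `Enabled`. `build_patch` and `patch_command` are hypothetical helpers that only print a command line; they do not contact a cluster.

```python
import json
import shlex

# Assumed names: the default ODF StorageCluster name and namespace.
STORAGECLUSTER = "ocs-storagecluster"
NAMESPACE = "openshift-storage"


def build_patch(enabled=True):
    """Build a JSON merge patch that sets spec.network.multiClusterService."""
    # Assumption: the serialized key is lower-case "enabled".
    return {"spec": {"network": {"multiClusterService": {"enabled": enabled}}}}


def patch_command(patch):
    """Render an `oc patch` command line for the merge patch."""
    args = ["oc", "patch", "storagecluster", STORAGECLUSTER,
            "-n", NAMESPACE, "--type", "merge", "-p", json.dumps(patch)]
    # Quote each argument so the JSON payload survives the shell.
    return " ".join(shlex.quote(a) for a in args)


print(patch_command(build_patch()))
```

Review the printed command before running it, and adjust the resource name and namespace to match the deployment.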


Actual results:
$ ceph status
2023-03-31T08:25:31.844+0000 7f8deaffd640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
2023-03-31T08:25:31.844+0000 7f8deb7fe640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
2023-03-31T08:25:37.844+0000 7f8deaffd640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
2023-03-31T08:25:40.843+0000 7f8deaffd640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
[errno 13] RADOS permission denied (error connecting to the cluster)
command terminated with exit code 13
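
The numeric IDs in the message are Ceph auth protocol numbers. A minimal sketch for decoding them, assuming the usual CEPH_AUTH_* numbering (1 = none, 2 = cephx, 4 = gss); `parse_bad_method` is a hypothetical helper and not part of any Ceph tool.

```python
import re

# Auth protocol IDs, assumed to follow Ceph's CEPH_AUTH_* constants.
AUTH_METHODS = {0: "unknown", 1: "none", 2: "cephx", 4: "gss"}

# Matches the client-side monclient message shown in the output above.
BAD_METHOD_RE = re.compile(
    r"handle_auth_bad_method server allowed_methods \[([\d,]*)\] "
    r"but i only support \[([\d,]*)\]"
)


def parse_bad_method(line):
    """Return (server_methods, client_methods) as name lists, or None."""
    match = BAD_METHOD_RE.search(line)
    if match is None:
        return None

    def names(group):
        return [AUTH_METHODS.get(int(n), f"id {n}")
                for n in group.split(",") if n]

    return names(match.group(1)), names(match.group(2))


line = ("2023-03-31T08:25:31.844+0000 7f8deaffd640 -1 monclient(hunting): "
        "handle_auth_bad_method server allowed_methods [2] "
        "but i only support [2,1]")
server, client = parse_bad_method(line)
print("server allows:", server)
print("client supports:", client)
print("common methods:", [m for m in client if m in server])
```

Under that numbering, both sides support cephx, so the message by itself does not indicate a real protocol mismatch. The mon logs in Comment 4 are more informative about the underlying failure.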


Expected results:


Additional info:
We have seen this on one of the managed clusters in the RDR setup, but not on the second managed cluster.

Comment 4 Santosh Pillai 2023-03-31 16:14:45 UTC
mon logs:

debug 2023-03-31T09:40:18.696+0000 7f33c486b640  1 mon.b@0(probing) e8 handle_auth_request failed to assign global_id
debug 2023-03-31T09:40:18.755+0000 7f33c486b640  1 mon.b@0(probing) e8 handle_auth_request failed to assign global_id
debug 2023-03-31T09:40:18.853+0000 7f33c486b640  1 mon.b@0(probing) e8 handle_auth_request failed to assign global_id
debug 2023-03-31T09:40:18.897+0000 7f33c486b640  1 mon.b@0(probing) e8 handle_auth_request failed to assign global_id
debug 2023-03-31T09:40:18.956+0000 7f33c486b640  1 mon.b@0(probing) e8 handle_auth_request failed to assign global_id
debug 2023-03-31T09:40:19.055+0000 7f33c486b640  1 mon.b@0(probing) e8 handle_auth_request failed to assign global_id
debug 2023-03-31T09:40:19.298+0000 7f33c486b640  1 mon.b@0(probing) e8 handle_auth_request failed to assign global_id
debug 2023-03-31T09:40:19.357+0000 7f33c486b640  1 mon.b@0(probing) e8 handle_auth_request failed to assign global_id
debug 2023-03-31T09:40:19.449+0000 7f33ca877640 -1 mon.b@0(probing) e8 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
debug 2023-03-31T09:40:19.456+0000 7f33c486b640  1 mon.b@0(probing) e8 handle_auth_request failed to assign global_id
debug 2023-03-31T09:40:20.100+0000 7f33c486b640  1 mon.b@0(probing) e8 handle_auth_request failed to assign global_id
debug 2023-03-31T09:40:20.159+0000 7f33c486b640  1 mon.b@0(probing) e8 handle_auth_request failed to assign global_id
debug 2023-03-31T09:40:20.258+0000 7f33c486b640  1 mon.b@0(probing) e8 handle_auth_request failed to assign global_id
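
Every line above carries the prefix `mon.b@0(probing)`, which gives the mon name, rank, and state. A mon in the `probing` state is still looking for its peers and has not joined quorum. The sketch below tallies name, state, and handler from such lines; `summarize` is a hypothetical helper, and the regular expression is an assumption based only on the lines shown here.

```python
import re
from collections import Counter

# Matches "mon.<name>@<rank>(<state>) e<epoch> <handler>" in mon log lines.
MON_RE = re.compile(r"mon\.(\w+)@(-?\d+)\((\w+)\) e(\d+) (\w+)")


def summarize(lines):
    """Count (mon name, state, handler) triples across mon log lines."""
    counts = Counter()
    for line in lines:
        m = MON_RE.search(line)
        if m:
            name, _rank, state, _epoch, handler = m.groups()
            counts[(name, state, handler)] += 1
    return counts


sample = [
    "debug 2023-03-31T09:40:19.357+0000 7f33c486b640  1 mon.b@0(probing) "
    "e8 handle_auth_request failed to assign global_id",
    "debug 2023-03-31T09:40:19.449+0000 7f33ca877640 -1 mon.b@0(probing) "
    "e8 handle_auth_bad_method hmm, they didn't like 2 result (13) "
    "Permission denied",
    "debug 2023-03-31T09:40:19.456+0000 7f33c486b640  1 mon.b@0(probing) "
    "e8 handle_auth_request failed to assign global_id",
]
for (name, state, handler), n in summarize(sample).items():
    print(f"mon.{name} state={state} {handler}: {n}")
```

Run against a full mon log, a tally like this shows quickly whether the mon ever leaves `probing`.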

Comment 5 Santosh Pillai 2023-04-04 08:06:19 UTC
Reinstalling the ODF cluster is a workaround for this. It can be used while I investigate what is happening to the mon quorum.

Comment 6 Santosh Pillai 2023-04-07 07:52:58 UTC
Still investigating.

Comment 7 Santosh Pillai 2023-04-07 10:18:02 UTC
The OSD pod is missing the admin socket file:

```
  Normal   Started                3h                   kubelet            Started container osd
  Normal   Pulled                 3h                   kubelet            Container image "quay.io/rhceph-dev/rhceph@sha256:f916da02f59b8f73ad18eb65310333d1e3cbd1a54678ff50bf27ed9618719b63" already present on machine
  Normal   Created                3h                   kubelet            Created container log-collector
  Normal   Started                3h                   kubelet            Started container log-collector
  Warning  Unhealthy              31s (x1076 over 3h)  kubelet            Startup probe failed: ceph daemon health check failed with the following output:
> admin_socket: exception getting command descriptions: [Errno 2] No such file or directory

```
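
The repeat counter in the `Unhealthy` event gives a rough idea of how often the startup probe has been failing. A minimal sketch of that arithmetic, assuming the `(x<count> over <duration>)` form seen above with a single-unit duration; `mean_interval` is a hypothetical helper, and the displayed `3h` is rounded, so the result is only approximate.

```python
import re

UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

# Matches the "(x<count> over <duration>)" part of an event line.
# Assumption: the duration uses a single unit, as in "3h".
REPEAT_RE = re.compile(r"\(x(\d+) over (\d+)([smhd])\)")


def mean_interval(event_line):
    """Return (count, mean seconds between repeats), or None if absent."""
    m = REPEAT_RE.search(event_line)
    if m is None:
        return None
    count = int(m.group(1))
    window = int(m.group(2)) * UNIT_SECONDS[m.group(3)]
    return count, window / count


event = ("Warning  Unhealthy  31s (x1076 over 3h)  kubelet  "
         "Startup probe failed: ceph daemon health check failed")
count, seconds = mean_interval(event)
print(f"{count} failures, roughly one every {seconds:.1f}s")
```

A steady failure rate over the whole window suggests the socket never appeared at any point, rather than disappearing intermittently.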

Comment 8 krishnaram Karthick 2023-04-11 04:48:36 UTC
Removing testblocker and adding automation blocker for the following reasons. 

1) With the workaround of reinstalling ODF on the affected cluster, QE should be able to proceed with the deployment. 
2) However, this could be a challenge for automated deployments + testing.

Comment 15 errata-xmlrpc 2023-06-21 15:25:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:3742
