Description of problem (please be as detailed as possible and provide log snippets):

[RDR] When running the `ceph status` command we see:

```
2023-03-31T08:25:31.844+0000 7f8deaffd640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
```

Version of all relevant components (if applicable):

OCP version: 4.13.0-0.nightly-2023-03-29-235439
ODF version: 4.13.0-121
Ceph version: ceph version 17.2.5-1342.el9cp (ed07851f2c5b8d3dccadf079402f86a67cb7d3e5) quincy (stable)
ACM version: v2.7.2

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Can this issue be reproduced?
Yes

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy an RDR cluster with Globalnet.
2. Add `spec.network.multiClusterService.enabled: true` to the StorageCluster after ODF deployment.
3. Check the Ceph status via the toolbox.

Actual results:

```
$ ceph status
2023-03-31T08:25:31.844+0000 7f8deaffd640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
2023-03-31T08:25:31.844+0000 7f8deb7fe640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
2023-03-31T08:25:37.844+0000 7f8deaffd640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
2023-03-31T08:25:40.843+0000 7f8deaffd640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
[errno 13] RADOS permission denied (error connecting to the cluster)
command terminated with exit code 13
```

Expected results:
`ceph status` connects to the cluster and reports its health without authentication errors.

Additional info:
We have seen this on one of the managed clusters in the RDR setup, but not on the second managed cluster.
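For reference, the change in step 2 of the reproducer can be sketched as a StorageCluster fragment like the one below. This is a sketch under default-install assumptions: the resource name `ocs-storagecluster` and the `openshift-storage` namespace are the usual ODF defaults and may differ in a given deployment.

```yaml
# Sketch of the StorageCluster change from step 2 of the reproducer.
# Name and namespace assume a default ODF install and may differ.
apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  name: ocs-storagecluster
  namespace: openshift-storage
spec:
  network:
    multiClusterService:
      # Exports the Ceph services across clusters via Submariner
      # (the RDR-with-Globalnet setup described in step 1).
      enabled: true
```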
mon logs:

```
debug 2023-03-31T09:40:18.696+0000 7f33c486b640  1 mon.b@0(probing) e8 handle_auth_request failed to assign global_id
debug 2023-03-31T09:40:18.755+0000 7f33c486b640  1 mon.b@0(probing) e8 handle_auth_request failed to assign global_id
debug 2023-03-31T09:40:18.853+0000 7f33c486b640  1 mon.b@0(probing) e8 handle_auth_request failed to assign global_id
debug 2023-03-31T09:40:18.897+0000 7f33c486b640  1 mon.b@0(probing) e8 handle_auth_request failed to assign global_id
debug 2023-03-31T09:40:18.956+0000 7f33c486b640  1 mon.b@0(probing) e8 handle_auth_request failed to assign global_id
debug 2023-03-31T09:40:19.055+0000 7f33c486b640  1 mon.b@0(probing) e8 handle_auth_request failed to assign global_id
debug 2023-03-31T09:40:19.298+0000 7f33c486b640  1 mon.b@0(probing) e8 handle_auth_request failed to assign global_id
debug 2023-03-31T09:40:19.357+0000 7f33c486b640  1 mon.b@0(probing) e8 handle_auth_request failed to assign global_id
debug 2023-03-31T09:40:19.449+0000 7f33ca877640 -1 mon.b@0(probing) e8 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
debug 2023-03-31T09:40:19.456+0000 7f33c486b640  1 mon.b@0(probing) e8 handle_auth_request failed to assign global_id
debug 2023-03-31T09:40:20.100+0000 7f33c486b640  1 mon.b@0(probing) e8 handle_auth_request failed to assign global_id
debug 2023-03-31T09:40:20.159+0000 7f33c486b640  1 mon.b@0(probing) e8 handle_auth_request failed to assign global_id
debug 2023-03-31T09:40:20.258+0000 7f33c486b640  1 mon.b@0(probing) e8 handle_auth_request failed to assign global_id
```
Reinstalling the ODF cluster is the workaround for this. While I investigate what is happening to the mon quorum, this workaround can be used.
Still investigating.
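One way to gather more data for the mon-quorum investigation is to raise mon debug logging. A minimal sketch, assuming the standard Rook override ConfigMap (`rook-config-override` with a `config` key, in the default `openshift-storage` namespace); Rook merges this into the daemons' ceph.conf on their next restart:

```yaml
# Sketch: raise mon and messenger debug levels via the Rook
# config override. Assumes a default ODF install; names may differ.
apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-config-override
  namespace: openshift-storage
data:
  config: |
    [mon]
    debug mon = 20
    debug ms = 1
```

The elevated `debug ms` level in particular should show which auth methods and messenger protocol each side negotiates during the failing `handle_auth_request` exchanges.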
The OSD pod is missing the admin socket file:

```
Normal   Started    3h                   kubelet  Started container osd
Normal   Pulled     3h                   kubelet  Container image "quay.io/rhceph-dev/rhceph@sha256:f916da02f59b8f73ad18eb65310333d1e3cbd1a54678ff50bf27ed9618719b63" already present on machine
Normal   Created    3h                   kubelet  Created container log-collector
Normal   Started    3h                   kubelet  Started container log-collector
Warning  Unhealthy  31s (x1076 over 3h)  kubelet  Startup probe failed: ceph daemon health check failed with the following output:
> admin_socket: exception getting command descriptions: [Errno 2] No such file or directory
```
Removing TestBlocker and adding AutomationBlocker for the following reasons:
1) With the workaround of reinstalling ODF on the affected cluster, QE should be able to proceed with the deployment.
2) However, this could be a challenge for automated deployments and testing.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:3742