Bug 2230254 - [cee/sd][RADOS] cephx server client.crash.XXX: handle_request failed to decode CephXAuthenticate: End of buffer [NEEDINFO]
Keywords:
Status: NEW
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 5.3
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 7.1
Assignee: Radoslaw Zarzynski
QA Contact: Pawan
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2023-08-09 07:18 UTC by Tridibesh Chakraborty
Modified: 2023-08-16 04:33 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:
trchakra: needinfo? (rzarzyns)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHCEPH-7180 0 None None None 2023-08-09 07:19:03 UTC

Description Tridibesh Chakraborty 2023-08-09 07:18:22 UTC
Description of problem:

After upgrading to RHCS 5.3z3, the customer started seeing errors from the crash daemon: it cannot connect to the cluster and logs `monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]`.

Version-Release number of selected component (if applicable):

RHCS 5.3z3 : 16.2.10-172.el8cp

How reproducible:

I could not reproduce this on my lab cluster. The issue is seen only in the customer environment.

Steps to Reproduce:

Nil

Actual results:

The following messages appear in syslog on the monitor node:

~~~
Aug 02 06:41:01 lxrgwp05arch.ux.dvint.de ceph-c0b13019-e62f-4fcb-80c4-5e9f69b94248-crash-lxrgwp05arch[15879]: WARNING:ceph-crash:post /var/lib/ceph/crash/2023-08-01T22:10:45.831791Z_0aeed091-9f6f-4c81-a616-bf7df2c080aa as client.crash.lxrgwp05arch failed: 2023-08-02T04:41:01.039+0000 7f432affd700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
Aug 02 06:41:00 lxrgwp05arch.ux.dvint.de ceph-c0b13019-e62f-4fcb-80c4-5e9f69b94248-crash-lxrgwp05arch[15879]:
Aug 02 06:41:00 lxrgwp05arch.ux.dvint.de ceph-c0b13019-e62f-4fcb-80c4-5e9f69b94248-crash-lxrgwp05arch[15879]: [errno 13] RADOS permission denied (error connecting to the cluster)
Aug 02 06:41:00 lxrgwp05arch.ux.dvint.de ceph-c0b13019-e62f-4fcb-80c4-5e9f69b94248-crash-lxrgwp05arch[15879]: 2023-08-02T04:41:00.903+0000 7f63a8b41700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
Aug 02 06:41:00 lxrgwp05arch.ux.dvint.de ceph-c0b13019-e62f-4fcb-80c4-5e9f69b94248-crash-lxrgwp05arch[15879]: 2023-08-02T04:41:00.902+0000 7f639b7fe700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
~~~

Expected results:

The crash daemons should be able to connect to the cluster. 

Additional info:

Based on the error message, I initially suspected that the crash daemon was running an older version or was missing its keyring. On checking, however, it is the same version as the cluster and has a valid keyring (filename: keyring) under the /var/lib/ceph/<fsid>/crash.<hostname> directory. The keyring's permissions are 600 and its owner is ceph.
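
The keyring check described above could be scripted roughly as follows. This is a sketch only: `check_keyring` is a hypothetical helper, the path and entity name must be filled in for the affected host, and it assumes the usual keyring layout of a `[entity]` header followed by a `key = ...` line.

```python
import os
import re
import stat


def check_keyring(path, entity):
    """Return (mode, key) for `entity` in a Ceph keyring file.

    mode is the file's permission bits (for example 0o600); key is the
    secret string from the entity's `key = ...` line, or None if the
    entity or its key line is not present.
    """
    mode = stat.S_IMODE(os.stat(path).st_mode)
    key = None
    current = None  # entity named by the most recent [section] header
    with open(path) as fh:
        for raw in fh:
            line = raw.strip()
            section = re.fullmatch(r"\[(.+)\]", line)
            if section:
                current = section.group(1)
            elif current == entity:
                kv = re.fullmatch(r"key\s*=\s*(\S+)", line)
                if kv:
                    key = kv.group(1)
    return mode, key
```

The returned key can then be compared with the output of `ceph auth get-key client.crash.<hostname>` run with admin credentials. A difference would indicate that the on-disk keyring no longer matches what the monitors hold, even though the file itself looks valid.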

