Bug 2401088
| Summary: | [GSS] ceph-crash not authenticating with cluster correctly | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | afanase |
| Component: | Cephadm | Assignee: | Adam King <adking> |
| Status: | CLOSED UPSTREAM | QA Contact: | Vinayak Papnoi <vpapnoi> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 7.1 | CC: | adking, akane, allee, cephqe-warriors, falim, kjosy, rsachere |
| Target Milestone: | --- | | |
| Target Release: | 9.1 | | |
| Hardware: | Unspecified | OS: | Unspecified |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Last Closed: | 2026-03-04 09:55:43 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Bug Blocks: | 2432067, 2432068, 2432069 | | |
|
Description
afanase
2025-10-02 16:54:45 UTC
*** Bug 2401087 has been marked as a duplicate of this bug. ***

I *think* the issue might be here in cephadm/services/cephadmservice.py:
```
def get_auth_entity(daemon_type: str, daemon_id: str, host: str = "") -> AuthEntity:
    """
    Map the daemon id to a cephx keyring entity name
    """
    # despite this mapping entity names to daemons, self.TYPE within
    # the CephService class refers to service types, not daemon types
    if daemon_type in ['rgw', 'rbd-mirror', 'cephfs-mirror', 'iscsi', 'nvmeof', 'ingress', 'ceph-exporter']:
        return AuthEntity(f'client.{daemon_type}.{daemon_id}')
    elif daemon_type in ['crash', 'agent', 'node-proxy']:
        if host == "":
            raise OrchestratorError(
                f'Host not provided to generate <{daemon_type}> auth entity name')
        return AuthEntity(f'client.{daemon_type}.{host}')
    elif daemon_type == 'mon':
        return AuthEntity('mon.')
    elif daemon_type in ['mgr', 'osd', 'mds']:
        return AuthEntity(f'{daemon_type}.{daemon_id}')
    else:
        raise OrchestratorError(f"unknown daemon type {daemon_type}")
```
Most likely `crash` should be in the first if clause (alongside 'rgw') and not in the second (alongside 'agent').
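As a standalone illustration (these function names are hypothetical, not part of cephadm), the two branches produce different entity names for the same crash daemon, which is exactly the mismatch seen in the keyring below:

```python
def crash_entity_via_host(daemon_type: str, host: str) -> str:
    # current behaviour: 'crash' grouped with 'agent', entity built from the
    # host name, which cephadm may pass as an FQDN
    return f"client.{daemon_type}.{host}"

def crash_entity_via_daemon_id(daemon_type: str, daemon_id: str) -> str:
    # proposed behaviour: 'crash' grouped with 'rgw', entity built from the
    # daemon id, which is the short host name
    return f"client.{daemon_type}.{daemon_id}"

fqdn = "osds-0.ceph7test2.lab.psi.pnq2.redhat.com"
print(crash_entity_via_host("crash", fqdn))
# -> client.crash.osds-0.ceph7test2.lab.psi.pnq2.redhat.com
print(crash_entity_via_daemon_id("crash", "osds-0"))
# -> client.crash.osds-0
```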
I patched the ceph-crash daemon so that it authenticates explicitly (passing `-n`), regardless of whether `-n` was set:
```
log.info("pinging cluster to exercise our key")
for auth_name in auth_names:
pr = subprocess.Popen(args=['timeout', '30', 'ceph', '-s', '-n', auth_name])
pr.wait()
```
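A more complete, self-contained version of that startup check might look like the sketch below. This is under my assumptions about the surrounding ceph-crash code; the function name and the `ceph_cmd` parameter are illustrative only (the parameter exists solely so the sketch can be exercised without a running cluster):

```python
import subprocess

def ping_cluster(auth_names, ceph_cmd="ceph"):
    """Try `ceph -s` under each candidate entity name so authentication
    problems surface in the journal at daemon startup."""
    for auth_name in auth_names:
        # bound each attempt to 30 seconds so a hung monitor connection
        # cannot block startup indefinitely
        pr = subprocess.Popen(args=["timeout", "30", ceph_cmd, "-s", "-n", auth_name])
        pr.wait()
        if pr.returncode == 0:
            return auth_name  # first entity name that authenticated
    return None
```

In the actual patch the exit status is not acted upon; the point is simply that the attempt, and any authentication failure, is logged once at startup.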
I also patched cephadmservice.py in the ceph-mgr cephadm module, adding 'crash' alongside 'rgw' (so keys are created with the crash daemon ID instead of the hostname) and removing it from alongside 'agent':
```
def get_auth_entity(daemon_type: str, daemon_id: str, host: str = "") -> AuthEntity:
    """
    Map the daemon id to a cephx keyring entity name
    """
    # despite this mapping entity names to daemons, self.TYPE within
    # the CephService class refers to service types, not daemon types
    if daemon_type in ['rgw', 'rbd-mirror', 'cephfs-mirror', 'nfs', 'iscsi', 'nvmeof', 'ingress', 'ceph-exporter', 'crash']:
        return AuthEntity(f'client.{daemon_type}.{daemon_id}')
    elif daemon_type in ['agent', 'node-proxy']:
        if host == "":
```
The currently deployed key uses the long hostname:
```
[root@osds-0 quickcluster]# cat /var/lib/ceph/71a3d9b0-a05a-11f0-9235-fa163e1de1c0/crash.osds-0/keyring
[client.crash.osds-0.ceph7test2.lab.psi.pnq2.redhat.com]
key = AQCgz99oV52EFxAAc7hm+7PfJ+wvgLfce5g6sw==
```
I sent `kill -11` to an OSD to simulate a crash. The journal shows that the crash cannot be posted:
```
Nov 12 13:13:36 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-mon[488845]: cephx server client.crash.osds-0: handle_request failed to decode CephXAuthenticate: End of buffer [buffer:2]
Nov 12 13:13:36 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-71a3d9b0-a05a-11f0-9235-fa163e1de1c0-crash-osds-0[2664821]: WARNING:ceph-crash:post /var/lib/ceph/crash/2025-11-12T18:13:28.007964Z_d1f14a2c-1432-4c9d-b876-f4af8b2ebb2d as client.crash.osds-0 failed: 2025-11-12T18:13:36.848+0000 7f0f52575640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
Nov 12 13:13:36 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-71a3d9b0-a05a-11f0-9235-fa163e1de1c0-crash-osds-0[2664821]: 2025-11-12T18:13:36.849+0000 7f0f52d76640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
Nov 12 13:13:36 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-71a3d9b0-a05a-11f0-9235-fa163e1de1c0-crash-osds-0[2664821]: 2025-11-12T18:13:36.850+0000 7f0f51d74640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
Nov 12 13:13:36 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-71a3d9b0-a05a-11f0-9235-fa163e1de1c0-crash-osds-0[2664821]: [errno 13] RADOS permission denied (error connecting to the cluster)
```
First we need to create a new key with the short name. If we do not, the rotate-key request will fail, because get-or-create-pending for the key fails. So let's create one:
```
[quickcluster@mgmt-0 ~]$ sudo ceph auth get-or-create client.crash.osds-0
[client.crash.osds-0]
key = AQBYzxRpjJR1MxAA60UyZhVgCC0mtHJYyrz8dg==
```
The actual key value does not matter. Now we ask the orchestrator to rotate the key:
```
[quickcluster@mgmt-0 ~]$ sudo ceph orch daemon rotate-key crash.osds-0
Scheduled to rotate-key crash.osds-0 on host 'osds-0.ceph7test2.lab.psi.pnq2.redhat.com'
```
And shortly after that, a new key is created on osds-0, now with the short entity name:
```
[root@osds-0 ~]# cat /var/lib/ceph/71a3d9b0-a05a-11f0-9235-fa163e1de1c0/crash.osds-0/keyring
[client.crash.osds-0]
key = AQCQzxRpDBzdGBAAFADIHLYdp+Fd2yOHaQmBRg==
```
After restarting with the new key configured, the journal now shows the ceph status correctly (with the ceph-crash patch that adds `-n <name>` to the `ceph -s` check):
```
Nov 12 13:18:58 osds-0.ceph7test2.lab.psi.pnq2.redhat.com systemd[1]: Started Ceph crash.osds-0 for 71a3d9b0-a05a-11f0-9235-fa163e1de1c0.
Nov 12 13:18:58 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-71a3d9b0-a05a-11f0-9235-fa163e1de1c0-crash-osds-0[2667069]: INFO:ceph-crash:pinging cluster to exercise our key
Nov 12 13:18:58 osds-0.ceph7test2.lab.psi.pnq2.redhat.com systemd[1]: Started dbus-:1.1-org.fedoraproject.SetroubleshootPrivileged.
Nov 12 13:18:58 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-71a3d9b0-a05a-11f0-9235-fa163e1de1c0-crash-osds-0[2667069]: cluster:
Nov 12 13:18:58 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-71a3d9b0-a05a-11f0-9235-fa163e1de1c0-crash-osds-0[2667069]: id: 71a3d9b0-a05a-11f0-9235-fa163e1de1c0
Nov 12 13:18:58 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-71a3d9b0-a05a-11f0-9235-fa163e1de1c0-crash-osds-0[2667069]: health: HEALTH_ERR
Nov 12 13:18:58 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-71a3d9b0-a05a-11f0-9235-fa163e1de1c0-crash-osds-0[2667069]: 9 osds(s) are not reachable
Nov 12 13:18:58 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-71a3d9b0-a05a-11f0-9235-fa163e1de1c0-crash-osds-0[2667069]: 1 daemons have recently crashed
Nov 12 13:18:58 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-71a3d9b0-a05a-11f0-9235-fa163e1de1c0-crash-osds-0[2667069]:
Nov 12 13:18:58 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-71a3d9b0-a05a-11f0-9235-fa163e1de1c0-crash-osds-0[2667069]: services:
Nov 12 13:18:58 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-71a3d9b0-a05a-11f0-9235-fa163e1de1c0-crash-osds-0[2667069]: mon: 5 daemons, quorum mgmt-0,mons-1,osds-2,osds-1,osds-0 (age 2w)
Nov 12 13:18:58 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-71a3d9b0-a05a-11f0-9235-fa163e1de1c0-crash-osds-0[2667069]: mgr: mgmt-0.jpwdwi(active, since 5h), standbys: mons-2.shcund
Nov 12 13:18:58 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-71a3d9b0-a05a-11f0-9235-fa163e1de1c0-crash-osds-0[2667069]: osd: 9 osds: 9 up (since 5m), 9 in (since 5w)
Nov 12 13:18:58 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-71a3d9b0-a05a-11f0-9235-fa163e1de1c0-crash-osds-0[2667069]:
Nov 12 13:18:58 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-71a3d9b0-a05a-11f0-9235-fa163e1de1c0-crash-osds-0[2667069]: data:
Nov 12 13:18:58 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-71a3d9b0-a05a-11f0-9235-fa163e1de1c0-crash-osds-0[2667069]: pools: 1 pools, 1 pgs
Nov 12 13:18:58 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-71a3d9b0-a05a-11f0-9235-fa163e1de1c0-crash-osds-0[2667069]: objects: 2 objects, 449 KiB
Nov 12 13:18:58 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-71a3d9b0-a05a-11f0-9235-fa163e1de1c0-crash-osds-0[2667069]: usage: 991 MiB used, 89 GiB / 90 GiB avail
Nov 12 13:18:58 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-71a3d9b0-a05a-11f0-9235-fa163e1de1c0-crash-osds-0[2667069]: pgs: 1 active+clean
Nov 12 13:18:58 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-71a3d9b0-a05a-11f0-9235-fa163e1de1c0-crash-osds-0[2667069]:
Nov 12 13:18:58 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-71a3d9b0-a05a-11f0-9235-fa163e1de1c0-crash-osds-0[2667069]: INFO:ceph-crash:monitoring path /var/lib/ceph/crash, delay 15s
```
And now we also see the crash being posted:
```
Nov 12 13:19:00 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-mon[488845]: from='client.156484 -' entity='client.crash.osds-0' cmd=[{"prefix": "crash post", "target": ["mon-mgr", ""]}]: dispatch
```
```
[quickcluster@mgmt-0 ~]$ sudo ceph crash ls
ID ENTITY NEW
2025-11-12T13:15:11.593715Z_7ada8146-357e-4d9f-9421-d5f5d1f2e314 osd.2 *
2025-11-12T18:13:28.007964Z_d1f14a2c-1432-4c9d-b876-f4af8b2ebb2d osd.6 *
```
With osd.6 being the OSD I killed to test crash posting.
So I think that with those two patches, plus creating the correct short-name key, the crash daemon issue should be fixed. I also tested this with just a `ceph orch daemon redeploy` of the crash daemon (with the MGR cephadm fix in place) and it worked nicely as well, so redeploying on upgrade should fix the keys; I guess that will all happen automatically.
Adam, if that looks fine, maybe I should create an upstream MR for that?
BR
Raimund
Hi afanase, that's OK. This:
```
[ceph: root@mgmt-0 /]# ceph-crash --name client.crash.osds-0
INFO:ceph-crash:pinging cluster to exercise our key
2025-11-13T04:16:17.156+0000 7f9d93408640 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
2025-11-13T04:16:17.156+0000 7f9d93408640 -1 AuthRegistry(0x7f9d84006420) no keyring found at /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx
2025-11-13T04:16:17.157+0000 7f9d93408640 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
2025-11-13T04:16:17.157+0000 7f9d93408640 -1 AuthRegistry(0x7f9d93406fe0) no keyring found at /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx
2025-11-13T04:16:17.158+0000 7f9d91c05640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]
2025-11-13T04:16:17.158+0000 7f9d92406640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]
2025-11-13T04:16:17.159+0000 7f9d92c07640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]
2025-11-13T04:16:17.159+0000 7f9d93408640 -1 monclient: authenticate NOTE: no keyring found; disabled cephx authentication
[errno 13] RADOS permission denied (error connecting to the cluster)
INFO:ceph-crash:monitoring path /var/lib/ceph/crash, delay 600s
```
is an issue with the *test* when `pinging cluster to exercise our key`. That happens only once; with the proper short key in place, the crash daemon should work even if you experience the above. We are dealing with two separate issues here, and the status output at startup is not critical for the daemon to work.
The best thing to test is:

- With the deployed key using the long hostname, do a `kill -11 <pid of osd>` (make sure you get the right PID: the one for the `ceph-osd` process, not the container init one). You should see a segfault crash in the journal, but `ceph crash ls` should not show anything, and the journal should again show an error that the crash daemon cannot connect to the cluster.
- Create a short key and configure it in the keyring, then restart the crash daemon. You will still see the status output with an error, but now do `kill -11 <pid of osd>` again: you should see the crash in the journal, you should also see the crash daemon posting the crash (visible in the mon logs as well), and `ceph crash ls` should now show the crash.

Keep in mind that by default the crash daemon only looks for crashes every 10 minutes (the delay of 600s). To make that faster, I edit `/var/lib/ceph/<fsid>/crash.<id>/unit.run`, add a `--delay 0.25` at the end, and then restart the crash daemon with systemctl. That makes it check for crashes every 15 seconds, which makes this part easier to catch. Modifications to the unit.run files (or any files in that directory, actually) will be overwritten by orch reconfig or orch redeploy, so keep that in mind.

BR
Raimund

Hi Adam, I took the liberty of creating a merge request: https://github.com/ceph/ceph/pull/66239

It also fixes an infinite loop that occurs when a crash path contains a `meta` file but no `done` file. This happened to me while testing, and it took me quite a while to figure out. I am not sure *how* I managed to produce such a crash dump directory, but with my fix we log a message that we are not processing this directory because of the missing `done` file and move on to processing other potential directories.

- Fixed keyring generation.
- Fixed checking the keyring with ceph status at daemon start.

Could you look the MR over and see if all is good?
Thanks,
Raimund

This product has been discontinued or is no longer tracked in Red Hat Bugzilla.