Bug 2401088
| Summary: | [GSS] ceph-crash not authenticating with cluster correctly | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | afanase |
| Component: | Cephadm | Assignee: | Adam King <adking> |
| Status: | CLOSED UPSTREAM | QA Contact: | Vinayak Papnoi <vpapnoi> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 7.1 | CC: | adking, akane, allee, cephqe-warriors, falim, kjosy, rsachere |
| Target Milestone: | --- | | |
| Target Release: | 9.1 | | |
| Hardware: | Unspecified | OS: | Unspecified |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Last Closed: | 2026-03-04 09:55:43 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Bug Blocks: | 2432067, 2432068, 2432069 | | |
|
Description
afanase
2025-10-02 16:54:45 UTC
*** Bug 2401087 has been marked as a duplicate of this bug. ***

I *think* the issue might be here in cephadm/services/cephadmservice.py:
```
def get_auth_entity(daemon_type: str, daemon_id: str, host: str = "") -> AuthEntity:
    """
    Map the daemon id to a cephx keyring entity name
    """
    # despite this mapping entity names to daemons, self.TYPE within
    # the CephService class refers to service types, not daemon types
    if daemon_type in ['rgw', 'rbd-mirror', 'cephfs-mirror', 'iscsi', 'nvmeof', 'ingress', 'ceph-exporter']:
        return AuthEntity(f'client.{daemon_type}.{daemon_id}')
    elif daemon_type in ['crash', 'agent', 'node-proxy']:
        if host == "":
            raise OrchestratorError(
                f'Host not provided to generate <{daemon_type}> auth entity name')
        return AuthEntity(f'client.{daemon_type}.{host}')
    elif daemon_type == 'mon':
        return AuthEntity('mon.')
    elif daemon_type in ['mgr', 'osd', 'mds']:
        return AuthEntity(f'{daemon_type}.{daemon_id}')
    else:
        raise OrchestratorError(f"unknown daemon type {daemon_type}")
```
Most likely `crash` should be in the first if clause (alongside 'rgw') and not in the second (alongside 'agent').
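As a standalone illustration (these function names are hypothetical, not part of cephadm), the two branches produce different entity names for the same crash daemon, which is exactly the mismatch seen in the keyring below:

```python
def crash_entity_via_host(daemon_type: str, host: str) -> str:
    # current behaviour: 'crash' grouped with 'agent', entity built from the
    # host name, which cephadm may pass as an FQDN
    return f"client.{daemon_type}.{host}"

def crash_entity_via_daemon_id(daemon_type: str, daemon_id: str) -> str:
    # proposed behaviour: 'crash' grouped with 'rgw', entity built from the
    # daemon id, which is the short host name
    return f"client.{daemon_type}.{daemon_id}"

fqdn = "osds-0.ceph7test2.lab.psi.pnq2.redhat.com"
print(crash_entity_via_host("crash", fqdn))
# -> client.crash.osds-0.ceph7test2.lab.psi.pnq2.redhat.com
print(crash_entity_via_daemon_id("crash", "osds-0"))
# -> client.crash.osds-0
```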
I patched the ceph-crash daemon so that it authenticates explicitly (passing `-n`), regardless of whether `-n` was set:
```
log.info("pinging cluster to exercise our key")
for auth_name in auth_names:
pr = subprocess.Popen(args=['timeout', '30', 'ceph', '-s', '-n', auth_name])
pr.wait()
```
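A more complete, self-contained version of that startup check might look like the sketch below. This is under my assumptions about the surrounding ceph-crash code; the function name and the `ceph_cmd` parameter are illustrative only (the parameter exists solely so the sketch can be exercised without a running cluster):

```python
import subprocess

def ping_cluster(auth_names, ceph_cmd="ceph"):
    """Try `ceph -s` under each candidate entity name so authentication
    problems surface in the journal at daemon startup."""
    for auth_name in auth_names:
        # bound each attempt to 30 seconds so a hung monitor connection
        # cannot block startup indefinitely
        pr = subprocess.Popen(args=["timeout", "30", ceph_cmd, "-s", "-n", auth_name])
        pr.wait()
        if pr.returncode == 0:
            return auth_name  # first entity name that authenticated
    return None
```

In the actual patch the exit status is not acted upon; the point is simply that the attempt, and any authentication failure, is logged once at startup.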
I also patched cephadmservice.py in the ceph-mgr cephadm module, adding 'crash' alongside 'rgw' (so keys are created with the crash daemon ID instead of the hostname) and removing it from alongside 'agent':
```
def get_auth_entity(daemon_type: str, daemon_id: str, host: str = "") -> AuthEntity:
    """
    Map the daemon id to a cephx keyring entity name
    """
    # despite this mapping entity names to daemons, self.TYPE within
    # the CephService class refers to service types, not daemon types
    if daemon_type in ['rgw', 'rbd-mirror', 'cephfs-mirror', 'nfs', 'iscsi', 'nvmeof', 'ingress', 'ceph-exporter', 'crash']:
        return AuthEntity(f'client.{daemon_type}.{daemon_id}')
    elif daemon_type in ['agent', 'node-proxy']:
        if host == "":
```
The currently deployed key uses the long hostname:
```
[root@osds-0 quickcluster]# cat /var/lib/ceph/71a3d9b0-a05a-11f0-9235-fa163e1de1c0/crash.osds-0/keyring
[client.crash.osds-0.ceph7test2.lab.psi.pnq2.redhat.com]
key = AQCgz99oV52EFxAAc7hm+7PfJ+wvgLfce5g6sw==
```
I sent `kill -11` to an OSD to simulate a crash. The journal shows that the crash cannot be posted:
```
Nov 12 13:13:36 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-mon[488845]: cephx server client.crash.osds-0: handle_request failed to decode CephXAuthenticate: End of buffer [buffer:2]
Nov 12 13:13:36 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-71a3d9b0-a05a-11f0-9235-fa163e1de1c0-crash-osds-0[2664821]: WARNING:ceph-crash:post /var/lib/ceph/crash/2025-11-12T18:13:28.007964Z_d1f14a2c-1432-4c9d-b876-f4af8b2ebb2d as client.crash.osds-0 failed: 2025-11-12T18:13:36.848+0000 7f0f52575640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
Nov 12 13:13:36 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-71a3d9b0-a05a-11f0-9235-fa163e1de1c0-crash-osds-0[2664821]: 2025-11-12T18:13:36.849+0000 7f0f52d76640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
Nov 12 13:13:36 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-71a3d9b0-a05a-11f0-9235-fa163e1de1c0-crash-osds-0[2664821]: 2025-11-12T18:13:36.850+0000 7f0f51d74640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
Nov 12 13:13:36 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-71a3d9b0-a05a-11f0-9235-fa163e1de1c0-crash-osds-0[2664821]: [errno 13] RADOS permission denied (error connecting to the cluster)
```
First we need to create a new key with the short name. If we do not, the rotate-key request will fail, because get-or-create-pending for the key fails. So let's create one:
```
[quickcluster@mgmt-0 ~]$ sudo ceph auth get-or-create client.crash.osds-0
[client.crash.osds-0]
key = AQBYzxRpjJR1MxAA60UyZhVgCC0mtHJYyrz8dg==
```
The actual key value does not matter. Now we ask the orchestrator to rotate the key:
```
[quickcluster@mgmt-0 ~]$ sudo ceph orch daemon rotate-key crash.osds-0
Scheduled to rotate-key crash.osds-0 on host 'osds-0.ceph7test2.lab.psi.pnq2.redhat.com'
```
And shortly after that, a new key is created on osds-0, now with the short entity name:
```
[root@osds-0 ~]# cat /var/lib/ceph/71a3d9b0-a05a-11f0-9235-fa163e1de1c0/crash.osds-0/keyring
[client.crash.osds-0]
key = AQCQzxRpDBzdGBAAFADIHLYdp+Fd2yOHaQmBRg==
```
After restarting with the new key configured, the journal now shows the ceph status correctly (with the ceph-crash patch that adds `-n <name>` to the `ceph -s` check):
```
Nov 12 13:18:58 osds-0.ceph7test2.lab.psi.pnq2.redhat.com systemd[1]: Started Ceph crash.osds-0 for 71a3d9b0-a05a-11f0-9235-fa163e1de1c0.
Nov 12 13:18:58 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-71a3d9b0-a05a-11f0-9235-fa163e1de1c0-crash-osds-0[2667069]: INFO:ceph-crash:pinging cluster to exercise our key
Nov 12 13:18:58 osds-0.ceph7test2.lab.psi.pnq2.redhat.com systemd[1]: Started dbus-:1.1-org.fedoraproject.SetroubleshootPrivileged.
Nov 12 13:18:58 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-71a3d9b0-a05a-11f0-9235-fa163e1de1c0-crash-osds-0[2667069]: cluster:
Nov 12 13:18:58 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-71a3d9b0-a05a-11f0-9235-fa163e1de1c0-crash-osds-0[2667069]: id: 71a3d9b0-a05a-11f0-9235-fa163e1de1c0
Nov 12 13:18:58 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-71a3d9b0-a05a-11f0-9235-fa163e1de1c0-crash-osds-0[2667069]: health: HEALTH_ERR
Nov 12 13:18:58 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-71a3d9b0-a05a-11f0-9235-fa163e1de1c0-crash-osds-0[2667069]: 9 osds(s) are not reachable
Nov 12 13:18:58 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-71a3d9b0-a05a-11f0-9235-fa163e1de1c0-crash-osds-0[2667069]: 1 daemons have recently crashed
Nov 12 13:18:58 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-71a3d9b0-a05a-11f0-9235-fa163e1de1c0-crash-osds-0[2667069]:
Nov 12 13:18:58 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-71a3d9b0-a05a-11f0-9235-fa163e1de1c0-crash-osds-0[2667069]: services:
Nov 12 13:18:58 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-71a3d9b0-a05a-11f0-9235-fa163e1de1c0-crash-osds-0[2667069]: mon: 5 daemons, quorum mgmt-0,mons-1,osds-2,osds-1,osds-0 (age 2w)
Nov 12 13:18:58 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-71a3d9b0-a05a-11f0-9235-fa163e1de1c0-crash-osds-0[2667069]: mgr: mgmt-0.jpwdwi(active, since 5h), standbys: mons-2.shcund
Nov 12 13:18:58 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-71a3d9b0-a05a-11f0-9235-fa163e1de1c0-crash-osds-0[2667069]: osd: 9 osds: 9 up (since 5m), 9 in (since 5w)
Nov 12 13:18:58 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-71a3d9b0-a05a-11f0-9235-fa163e1de1c0-crash-osds-0[2667069]:
Nov 12 13:18:58 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-71a3d9b0-a05a-11f0-9235-fa163e1de1c0-crash-osds-0[2667069]: data:
Nov 12 13:18:58 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-71a3d9b0-a05a-11f0-9235-fa163e1de1c0-crash-osds-0[2667069]: pools: 1 pools, 1 pgs
Nov 12 13:18:58 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-71a3d9b0-a05a-11f0-9235-fa163e1de1c0-crash-osds-0[2667069]: objects: 2 objects, 449 KiB
Nov 12 13:18:58 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-71a3d9b0-a05a-11f0-9235-fa163e1de1c0-crash-osds-0[2667069]: usage: 991 MiB used, 89 GiB / 90 GiB avail
Nov 12 13:18:58 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-71a3d9b0-a05a-11f0-9235-fa163e1de1c0-crash-osds-0[2667069]: pgs: 1 active+clean
Nov 12 13:18:58 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-71a3d9b0-a05a-11f0-9235-fa163e1de1c0-crash-osds-0[2667069]:
Nov 12 13:18:58 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-71a3d9b0-a05a-11f0-9235-fa163e1de1c0-crash-osds-0[2667069]: INFO:ceph-crash:monitoring path /var/lib/ceph/crash, delay 15s
```
And now we also see the crash being posted:
```
Nov 12 13:19:00 osds-0.ceph7test2.lab.psi.pnq2.redhat.com ceph-mon[488845]: from='client.156484 -' entity='client.crash.osds-0' cmd=[{"prefix": "crash post", "target": ["mon-mgr", ""]}]: dispatch
```
```
[quickcluster@mgmt-0 ~]$ sudo ceph crash ls
ID ENTITY NEW
2025-11-12T13:15:11.593715Z_7ada8146-357e-4d9f-9421-d5f5d1f2e314 osd.2 *
2025-11-12T18:13:28.007964Z_d1f14a2c-1432-4c9d-b876-f4af8b2ebb2d osd.6 *
```
With osd.6 being the OSD I killed to test crash posting.
So I think that with those two patches, plus creating the correct short-name key, the crash daemon issue should be fixed. I also tested this with just a `ceph orch daemon redeploy` of the crash daemon (with the MGR cephadm fix in place) and it worked nicely as well, so redeploying on upgrade should fix the keys; I guess that will all happen automatically.
Adam, if that looks fine, maybe I should create an upstream MR for that?
BR
Raimund
Hi afanase, that's OK. This:
```
[ceph: root@mgmt-0 /]# ceph-crash --name client.crash.osds-0
INFO:ceph-crash:pinging cluster to exercise our key
2025-11-13T04:16:17.156+0000 7f9d93408640 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
2025-11-13T04:16:17.156+0000 7f9d93408640 -1 AuthRegistry(0x7f9d84006420) no keyring found at /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx
2025-11-13T04:16:17.157+0000 7f9d93408640 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
2025-11-13T04:16:17.157+0000 7f9d93408640 -1 AuthRegistry(0x7f9d93406fe0) no keyring found at /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx
2025-11-13T04:16:17.158+0000 7f9d91c05640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]
2025-11-13T04:16:17.158+0000 7f9d92406640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]
2025-11-13T04:16:17.159+0000 7f9d92c07640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1]
2025-11-13T04:16:17.159+0000 7f9d93408640 -1 monclient: authenticate NOTE: no keyring found; disabled cephx authentication
[errno 13] RADOS permission denied (error connecting to the cluster)
INFO:ceph-crash:monitoring path /var/lib/ceph/crash, delay 600s
```
is an issue with the *test* when `pinging cluster to exercise our key`. That happens only once; with the proper short key in place, the crash daemon should work even if you experience the above. We are dealing with two separate issues here, and the status output at startup is not critical for the daemon to work.
The best thing to test is:

- With the deployed key using the long hostname, do a `kill -11 <pid of osd>` (make sure you get the right PID: the one for the `ceph-osd` process, not the container init one). You should see a segfault crash in the journal, but `ceph crash ls` should not show anything, and the journal should again show an error that the crash daemon cannot connect to the cluster.
- Create a short key and configure it in the keyring, then restart the crash daemon. You will still see the status output with an error, but now do `kill -11 <pid of osd>` again: you should see the crash in the journal, you should also see the crash daemon posting the crash (visible in the mon logs as well), and `ceph crash ls` should now show the crash.

Keep in mind that by default the crash daemon only looks for crashes every 10 minutes (the delay of 600s). To make that faster, I edit `/var/lib/ceph/<fsid>/crash.<id>/unit.run`, add a `--delay 0.25` at the end, and then restart the crash daemon with systemctl. That makes it check for crashes every 15 seconds, which makes this part easier to catch. Modifications to the unit.run files (or any files in that directory, actually) will be overwritten by orch reconfig or orch redeploy, so keep that in mind.

BR
Raimund

Hi Adam, I took the liberty of creating a merge request: https://github.com/ceph/ceph/pull/66239

It also fixes an infinite loop that occurs when a crash path contains a `meta` file but no `done` file. This happened to me while testing, and it took me quite a while to figure out. I am not sure *how* I managed to produce such a crash dump directory, but with my fix we log a message that we are not processing this directory because of the missing `done` file and move on to processing other potential directories.

- Fixed keyring generation.
- Fixed checking the keyring with ceph status at daemon start.

Could you look the MR over and see if all is good?
Thanks,
Raimund

This product has been discontinued or is no longer tracked in Red Hat Bugzilla.