Description of problem:

[cephadm] 5.0 - A new OSD device added to the cluster is not getting an OSD ID - unable to allocate a new ID to the new OSD device.

Version-Release number of selected component (if applicable):

[root@ceph-adm7 ~]# sudo cephadm version
Using recent ceph image registry.redhat.io/rhceph-alpha/rhceph-5-rhel8:latest
ceph version 16.0.0-7953.el8cp (aac7c5c7d5f82d2973c366730f65255afd66e515) pacific (dev)

How reproducible:

Steps to Reproduce:
1. Install a 5.0 cluster with the dashboard enabled.
2. Enter the cephadm shell.
3. Check ceph status and make sure all OSDs are up and in.
4. Run the following steps (consolidated into a single command sequence below):
   a) ceph osd tree                                    --> check all OSDs
   b) ceph orch osd rm 11                              --> remove osd.11
   c) ceph osd tree                                    --> osd.11 should have been removed
   d) ceph orch device zap ceph-adm7 /dev/sdc --force  --> clear the data on the device
   e) ceph orch device ls                              --> device should be available for reuse
   f) ceph orch daemon add osd ceph-adm7:/dev/sdc      --> add the disk back to get a new OSD ID
   g) ceph osd tree                                    --> observe the behaviour

Actual results:

No new OSD ID is allocated to the new OSD device.

Expected results:

A new OSD ID should be allocated for the newly added OSD device.

Additional info:

Cluster: 10.74.253.36 (root/redhat)
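For reference, steps (a) through (g) as one command sequence, exactly as exercised on this cluster (host ceph-adm7, device /dev/sdc, run from inside the cephadm shell):

ceph osd tree                                    # check all OSDs
ceph orch osd rm 11                              # schedule osd.11 for removal
ceph osd tree                                    # confirm osd.11 has been removed
ceph orch device zap ceph-adm7 /dev/sdc --force  # clear the data on the device
ceph orch device ls                              # device should show Available = Yes
ceph orch daemon add osd ceph-adm7:/dev/sdc      # re-add the disk; a new OSD id should be allocated
ceph osd tree                                    # observe the behaviour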
Output:

[ceph: root@ceph-adm7 /]# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME           STATUS     REWEIGHT  PRI-AFF
-1         0.37105  root default
-7         0.10738      host ceph-adm7
 4    hdd  0.02930          osd.4       up          1.00000  1.00000
11    hdd  0.02930          osd.11      up          1.00000  1.00000
12    hdd  0.02930          osd.12      up          1.00000  1.00000
13    hdd  0.01949          osd.13      up          1.00000  1.00000
-3         0.14648      host ceph-adm8
 1    hdd  0.05859          osd.1       up          1.00000  1.00000
 2    hdd  0.02930          osd.2       up          1.00000  1.00000
 6    hdd  0.02930          osd.6       up          1.00000  1.00000
 8    hdd  0.02930          osd.8       up          1.00000  1.00000
-5         0.11719      host ceph-adm9
 3    hdd  0.02930          osd.3       destroyed         0  1.00000
 5    hdd  0.02930          osd.5       destroyed   1.00000  1.00000
 7    hdd  0.02930          osd.7       up          1.00000  1.00000
 9    hdd  0.02930          osd.9       destroyed         0  1.00000
 0         0                osd.0       down              0  1.00000

[ceph: root@ceph-adm7 /]# ceph orch osd rm 11
Scheduled OSD(s) for removal

[ceph: root@ceph-adm7 /]# ceph device ls
DEVICE                                                   HOST:DEV       DAEMONS        LIFE EXPECTANCY
QEMU_QEMU_HARDDISK_073ed4af-4752-4956-9e09-6504da882a79  ceph-adm9:sdf  osd.0
QEMU_QEMU_HARDDISK_3bc5e11c-28f2-419e-b076-8b6032e49de5  ceph-adm8:sdc  osd.2
QEMU_QEMU_HARDDISK_46e30862-254f-4327-bb00-99ea29f8e237  ceph-adm8:sdf  osd.1
QEMU_QEMU_HARDDISK_572813c0-bce4-46f0-a388-bb9ba92a4c9c  ceph-adm8:sde  osd.8
QEMU_QEMU_HARDDISK_5df4866b-18c5-4de5-8ce1-f44084b67e74  ceph-adm7:sdf  mon.ceph-adm7
QEMU_QEMU_HARDDISK_5eed3652-9334-408b-b0e7-3a6d125a7acc  ceph-adm7:sdb  osd.4
QEMU_QEMU_HARDDISK_6a660612-aa36-4e56-a80f-01839475e55d  ceph-adm7:sde  osd.13
QEMU_QEMU_HARDDISK_7c92121d-7ee6-4545-9820-14449e78892c  ceph-adm9:sdb  osd.3
QEMU_QEMU_HARDDISK_7e0b094b-662c-4320-82af-353c993e46bb  ceph-adm9:sda  mon.ceph-adm9
QEMU_QEMU_HARDDISK_bb888a81-55a6-4418-a9e5-c79043d1bbf7  ceph-adm7:sdd  osd.12
QEMU_QEMU_HARDDISK_d81b73c5-ab55-4a41-9f77-81533496ac16  ceph-adm9:sdd  osd.7
QEMU_QEMU_HARDDISK_ee309705-a09e-4e31-83e7-3b380398f255  ceph-adm8:sda  mon.ceph-adm8
QEMU_QEMU_HARDDISK_f65c4443-18fb-4d02-917d-6a6761541dab  ceph-adm8:sdd  osd.6
QEMU_QEMU_HARDDISK_ff170b5d-c13f-4514-9685-532e3b3c798e  ceph-adm9:sdc  osd.5
QEMU_QEMU_HARDDISK_ff2f239d-9870-4ee9-b7a1-20d01ad318cc  ceph-adm9:sde  osd.9

[ceph: root@ceph-adm7 /]# ceph orch device zap ceph-adm8 /dev/sdf --force
^CInterrupted
[ceph: root@ceph-adm7 /]# ^C
[ceph: root@ceph-adm7 /]# ^C
[ceph: root@ceph-adm7 /]# ceph orch device zap ceph-adm7 /dev/sdc --force
/bin/podman:stderr WARNING: The same type, major and minor should not be used for multiple devices.
/bin/podman:stderr --> Zapping: /dev/sdc
/bin/podman:stderr --> Zapping lvm member /dev/sdc. lv_path is /dev/ceph-9178f657-943e-409e-be2c-20c3207b4016/osd-block-61d6b56d-b982-492f-8dd2-c76e5a5384cb
/bin/podman:stderr Running command: /usr/bin/dd if=/dev/zero of=/dev/ceph-9178f657-943e-409e-be2c-20c3207b4016/osd-block-61d6b56d-b982-492f-8dd2-c76e5a5384cb bs=1M count=10 conv=fsync
/bin/podman:stderr  stderr: 10+0 records in
/bin/podman:stderr 10+0 records out
/bin/podman:stderr 10485760 bytes (10 MB, 10 MiB) copied, 0.041614 s, 252 MB/s
/bin/podman:stderr --> Only 1 LV left in VG, will proceed to destroy volume group ceph-9178f657-943e-409e-be2c-20c3207b4016
/bin/podman:stderr Running command: /usr/sbin/vgremove -v -f ceph-9178f657-943e-409e-be2c-20c3207b4016
/bin/podman:stderr  stderr: Removing ceph--9178f657--943e--409e--be2c--20c3207b4016-osd--block--61d6b56d--b982--492f--8dd2--c76e5a5384cb (253:8)
/bin/podman:stderr  stderr: Archiving volume group "ceph-9178f657-943e-409e-be2c-20c3207b4016" metadata (seqno 5).
/bin/podman:stderr  stderr: Releasing logical volume "osd-block-61d6b56d-b982-492f-8dd2-c76e5a5384cb"
/bin/podman:stderr  stderr: Creating volume group backup "/etc/lvm/backup/ceph-9178f657-943e-409e-be2c-20c3207b4016" (seqno 6).
/bin/podman:stderr  stdout: Logical volume "osd-block-61d6b56d-b982-492f-8dd2-c76e5a5384cb" successfully removed
/bin/podman:stderr  stderr: Removing physical volume "/dev/sdc" from volume group "ceph-9178f657-943e-409e-be2c-20c3207b4016"
/bin/podman:stderr  stdout: Volume group "ceph-9178f657-943e-409e-be2c-20c3207b4016" successfully removed
/bin/podman:stderr Running command: /usr/bin/dd if=/dev/zero of=/dev/sdc bs=1M count=10 conv=fsync
/bin/podman:stderr  stderr: 10+0 records in
/bin/podman:stderr 10+0 records out
/bin/podman:stderr 10485760 bytes (10 MB, 10 MiB) copied, 0.0194739 s, 538 MB/s
/bin/podman:stderr  stderr:
/bin/podman:stderr --> Zapping successful for: <Raw Device: /dev/sdc>

[ceph: root@ceph-adm7 /]# ceph orch device ls
Hostname   Path      Type  Serial                                Size   Health   Ident  Fault  Available
ceph-adm7  /dev/sdc  hdd   6d0fcb0e-e6cc-47e9-b5a5-710a38470199  32.2G  Unknown  N/A    N/A    Yes
ceph-adm7  /dev/sdb  hdd   5eed3652-9334-408b-b0e7-3a6d125a7acc  32.2G  Unknown  N/A    N/A    No
ceph-adm7  /dev/sdd  hdd   bb888a81-55a6-4418-a9e5-c79043d1bbf7  32.2G  Unknown  N/A    N/A    No
ceph-adm7  /dev/sde  hdd   6a660612-aa36-4e56-a80f-01839475e55d  21.4G  Unknown  N/A    N/A    No
ceph-adm8  /dev/sdb  hdd   8829185e-42e9-4df9-9a59-ffab4697b7aa  32.2G  Unknown  N/A    N/A    No
ceph-adm8  /dev/sdc  hdd   3bc5e11c-28f2-419e-b076-8b6032e49de5  32.2G  Unknown  N/A    N/A    No
ceph-adm8  /dev/sdd  hdd   f65c4443-18fb-4d02-917d-6a6761541dab  32.2G  Unknown  N/A    N/A    No
ceph-adm8  /dev/sde  hdd   572813c0-bce4-46f0-a388-bb9ba92a4c9c  32.2G  Unknown  N/A    N/A    No
ceph-adm8  /dev/sdf  hdd   46e30862-254f-4327-bb00-99ea29f8e237  64.4G  Unknown  N/A    N/A    No
ceph-adm9  /dev/sdb  hdd   7c92121d-7ee6-4545-9820-14449e78892c  32.2G  Unknown  N/A    N/A    No
ceph-adm9  /dev/sdc  hdd   ff170b5d-c13f-4514-9685-532e3b3c798e  32.2G  Unknown  N/A    N/A    No
ceph-adm9  /dev/sdd  hdd   d81b73c5-ab55-4a41-9f77-81533496ac16  32.2G  Unknown  N/A    N/A    No
ceph-adm9  /dev/sde  hdd   ff2f239d-9870-4ee9-b7a1-20d01ad318cc  32.2G  Unknown  N/A    N/A    No
ceph-adm9  /dev/sdf  hdd   073ed4af-4752-4956-9e09-6504da882a79  85.8G  Unknown  N/A    N/A    No
[ceph: root@ceph-adm7 /]# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME           STATUS     REWEIGHT  PRI-AFF
-1         0.34175  root default
-7         0.07808      host ceph-adm7
 4    hdd  0.02930          osd.4       up          1.00000  1.00000
12    hdd  0.02930          osd.12      up          1.00000  1.00000
13    hdd  0.01949          osd.13      up          1.00000  1.00000
-3         0.14648      host ceph-adm8
 1    hdd  0.05859          osd.1       up          1.00000  1.00000
 2    hdd  0.02930          osd.2       up          1.00000  1.00000
 6    hdd  0.02930          osd.6       up          1.00000  1.00000
 8    hdd  0.02930          osd.8       up          1.00000  1.00000
-5         0.11719      host ceph-adm9
 3    hdd  0.02930          osd.3       destroyed         0  1.00000
 5    hdd  0.02930          osd.5       destroyed   1.00000  1.00000
 7    hdd  0.02930          osd.7       up          1.00000  1.00000
 9    hdd  0.02930          osd.9       destroyed         0  1.00000
 0         0                osd.0       up                0  1.00000

[ceph: root@ceph-adm7 /]# ceph orch daemon add osd ceph-adm7:/dev/sdc
Error EINVAL: Traceback (most recent call last):
  File "/usr/share/ceph/mgr/mgr_module.py", line 1196, in _handle_command
    return self.handle_command(inbuf, cmd)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 141, in handle_command
    return dispatch[cmd['prefix']].call(self, cmd, inbuf)
  File "/usr/share/ceph/mgr/mgr_module.py", line 332, in call
    return self.func(mgr, **kwargs)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 103, in <lambda>
    wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args, **l_kwargs)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 92, in wrapper
    return func(*args, **kwargs)
  File "/usr/share/ceph/mgr/orchestrator/module.py", line 753, in _daemon_add_osd
    raise_if_exception(completion)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 643, in raise_if_exception
    raise e
RuntimeError: cephadm exited with an error code: 1, stderr:
/bin/podman:stderr WARNING: The same type, major and minor should not be used for multiple devices.
/bin/podman:stderr --> passed data devices: 1 physical, 0 LVM
/bin/podman:stderr --> relative data size: 1.0
/bin/podman:stderr Running command: /usr/bin/ceph-authtool --gen-print-key
/bin/podman:stderr Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new 553c853b-18d9-4490-9f55-6473fc2cc5b1
/bin/podman:stderr  stderr: Error EEXIST: entity osd.10 exists but key does not match
/bin/podman:stderr --> RuntimeError: Unable to create a new OSD id
Traceback (most recent call last):
  File "<stdin>", line 6129, in <module>
  File "<stdin>", line 1300, in _infer_fsid
  File "<stdin>", line 1383, in _infer_image
  File "<stdin>", line 3613, in command_ceph_volume
  File "<stdin>", line 1062, in call_throws
RuntimeError: Failed command: /bin/podman run --rm --ipc=host --net=host --entrypoint /usr/sbin/ceph-volume --privileged --group-add=disk -e CONTAINER_IMAGE=registry.redhat.io/rhceph-alpha/rhceph-5-rhel8:latest -e NODE_NAME=ceph-adm7 -e CEPH_VOLUME_OSDSPEC_AFFINITY=None -v /var/run/ceph/58149bf2-66ac-11eb-84bf-001a4a000262:/var/run/ceph:z -v /var/log/ceph/58149bf2-66ac-11eb-84bf-001a4a000262:/var/log/ceph:z -v /var/lib/ceph/58149bf2-66ac-11eb-84bf-001a4a000262/crash:/var/lib/ceph/crash:z -v /dev:/dev -v /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v /run/lock/lvm:/run/lock/lvm -v /tmp/ceph-tmp5bedoklo:/etc/ceph/ceph.conf:z -v /tmp/ceph-tmp84xn3s03:/var/lib/ceph/bootstrap-osd/ceph.keyring:z registry.redhat.io/rhceph-alpha/rhceph-5-rhel8:latest lvm batch --no-auto /dev/sdc --yes --no-systemd
It seems that osd.10 was somehow not deleted properly. Do you remember whether you got any error when you deleted osd.10? What procedure was used to remove it?

The error produced is:

/bin/podman:stderr  stderr: Error EEXIST: entity osd.10 exists but key does not match
/bin/podman:stderr --> RuntimeError: Unable to create a new OSD id

returned directly from ceph-volume.

So the id "10" was selected for the new OSD, but surprisingly osd.10 is also still present in the auth list. Please verify whether this is the situation. In order to be able to create new OSDs, you will need to do the following:

# ceph auth list

If you see:

osd.10
        key: ...................
        caps: [mgr] allow profile osd
        caps: [mon] allow profile osd
        caps: [osd] allow *

then execute:

# ceph auth del osd.10

You will need to do exactly the same for every osd entry in the auth list that has no real OSD created in the cluster (a sketch automating this check follows below).

The interesting thing in this bug is how you reached the point where osd.10 was deleted but was still present in the auth database. Do you remember how you removed osd.10?
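A minimal shell sketch of that cleanup, assuming "ceph auth list" prints each osd.<id> entity name at the start of a line; review each candidate before deleting, since "ceph auth del" is irreversible:

# Collect the OSD ids that actually exist in the OSD map.
ceph osd ls > /tmp/osd_ids

# Delete any osd.<id> auth entity whose id is absent from the OSD map,
# i.e. a stale entry left behind by an incomplete removal.
for entity in $(ceph auth list | grep -o '^osd\.[0-9]\+'); do
    id=${entity#osd.}
    if ! grep -qx "$id" /tmp/osd_ids; then
        echo "Deleting stale auth entry: $entity"
        ceph auth del "$entity"
    fi
done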
@Adam, I used the "ceph orch osd rm" command to remove osd.10. I cannot reproduce the issue on another cluster. Below is the command output where the OSD add was successful.

[ceph: root@magna021 /]# ceph osd tree
ID   CLASS  WEIGHT    TYPE NAME          STATUS  REWEIGHT  PRI-AFF
 -1         33.75156  root default
 -3          4.94633      host plena001
  0    hdd   0.87279          osd.0      up       1.00000  1.00000
  1    hdd   0.87279          osd.1      up       1.00000  1.00000
  2    hdd   0.87279          osd.2      up       1.00000  1.00000
  3    hdd   0.87279          osd.3      up       1.00000  1.00000
  4    ssd   1.45518          osd.4      up       1.00000  1.00000
 -7          4.94633      host plena002
  5    hdd   0.87279          osd.5      up       1.00000  1.00000
  6    hdd   0.87279          osd.6      up       1.00000  1.00000
  7    hdd   0.87279          osd.7      up       1.00000  1.00000
  8    hdd   0.87279          osd.8      up       1.00000  1.00000
  9    ssd   1.45518          osd.9      up       1.00000  1.00000
-10          4.94633      host plena003
 10    hdd   0.87279          osd.10     up       1.00000  1.00000
 11    hdd   0.87279          osd.11     up       1.00000  1.00000
 12    hdd   0.87279          osd.12     up       1.00000  1.00000
 13    hdd   0.87279          osd.13     up       1.00000  1.00000
 14    ssd   1.45518          osd.14     up       1.00000  1.00000
-13          4.94633      host plena004
 15    hdd   0.87279          osd.15     up       1.00000  1.00000
 16    hdd   0.87279          osd.16     up       1.00000  1.00000
 17    hdd   0.87279          osd.17     up       1.00000  1.00000
 18    hdd   0.87279          osd.18     up       1.00000  1.00000
 19    ssd   1.45518          osd.19     up       1.00000  1.00000
-16          4.94633      host plena005
 20    hdd   0.87279          osd.20     up       1.00000  1.00000
 21    hdd   0.87279          osd.21     up       1.00000  1.00000
 22    hdd   0.87279          osd.22     up       1.00000  1.00000
 23    hdd   0.87279          osd.23     up       1.00000  1.00000
 24    ssd   1.45518          osd.24     up       1.00000  1.00000
-19          4.07355      host plena006
 25    hdd   0.87279          osd.25     up       1.00000  1.00000
 27    hdd   0.87279          osd.27     up       1.00000  1.00000
 28    hdd   0.87279          osd.28     up       1.00000  1.00000
 29    ssd   1.45518          osd.29     up       1.00000  1.00000
-22          4.94633      host plena007
 30    hdd   0.87279          osd.30     up       1.00000  1.00000
 31    hdd   0.87279          osd.31     up       1.00000  1.00000
 32    hdd   0.87279          osd.32     up       1.00000  1.00000
 33    hdd   0.87279          osd.33     up       1.00000  1.00000
 34    ssd   1.45518          osd.34     up       1.00000  1.00000

[ceph: root@magna021 /]# ceph orch daemon add osd plena006.ceph.redhat.com:/dev/sdc
Created osd(s) 26 on host 'plena006.ceph.redhat.com'
[ceph: root@magna021 /]#
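For reference, a minimal check that an orch removal also cleaned up the auth entry (standard commands; "ceph auth get" returns ENOENT once the entry is gone):

ceph orch osd rm 10          # schedule osd.10 for draining and removal
ceph orch osd rm status      # poll until osd.10 no longer appears here
ceph osd tree                # confirm osd.10 is gone from the OSD map
ceph auth get osd.10         # should now fail with ENOENT; if it still
                             # prints a key, the auth entry is stale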
@Juan, I do not see the issue in another cluster with the latest alpha image.
mmanjuna: It seems that your cluster is using a previous alpha image, and you are experiencing this issue: https://bugzilla.redhat.com/show_bug.cgi?id=1923719 Please double-check that you are using the latest alpha image in your cluster (see the commands below for verifying the running image).
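A quick way to verify which image the cluster is actually running (standard commands; the exact output columns vary by release):

# On the host: the image cephadm resolves for new containers.
sudo cephadm version

# Inside the cephadm shell: the version of every running daemon.
ceph versions

# Per-daemon view, including the image id each daemon was started from.
ceph orch ps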
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat Ceph Storage 5.0 bug fix and enhancement), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:3294