Bug 1932490 - [cephadm] 5.0 - New OSD device to the cluster is not getting the OSD IDs - Unable to allocate new IDs to the new OSD device
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Cephadm
Version: 5.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 5.0
Assignee: Juan Miguel Olmo
QA Contact: Vasishta
Docs Contact: Karen Norteman
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-02-24 17:52 UTC by Preethi
Modified: 2021-08-30 08:29 UTC
CC List: 5 users

Fixed In Version: ceph-16.1.0-486.el8cp
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-30 08:28:49 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHCEPH-1067 0 None None None 2021-08-27 05:19:17 UTC
Red Hat Product Errata RHBA-2021:3294 0 None None None 2021-08-30 08:29:01 UTC

Description Preethi 2021-02-24 17:52:50 UTC
Description of problem:
[cephadm] 5.0 - New OSD device added to the cluster is not getting an OSD ID - Unable to allocate a new ID to the new OSD device

Version-Release number of selected component (if applicable):
[root@ceph-adm7 ~]# sudo cephadm version
Using recent ceph image registry.redhat.io/rhceph-alpha/rhceph-5-rhel8:latest
ceph version 16.0.0-7953.el8cp (aac7c5c7d5f82d2973c366730f65255afd66e515) pacific (dev)


How reproducible:


Steps to Reproduce:
1. Install a 5.0 cluster with the dashboard enabled
2. Enter the cephadm shell
3. Check the ceph status and make sure all OSDs are up and in
4. Follow the steps below (a consolidated sketch of this sequence is included after the list):

a) ceph osd tree --> check all OSDs
b) ceph orch osd rm 11 --> remove osd.11
c) ceph osd tree --> osd.11 should have been removed
d) ceph orch device zap ceph-adm7 /dev/sdc --force --> clear the data
e) ceph orch device ls --> the device should be available for reuse as a new OSD
f) ceph orch daemon add osd ceph-adm7:/dev/sdc --> add the device back so it gets a new OSD ID
g) ceph osd tree --> observe the behaviour
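
For reference, a consolidated shell sketch of the same replacement flow (assuming host ceph-adm7 and device /dev/sdc as above; ceph orch osd rm status is used here to wait for the removal to finish before zapping):

ceph orch osd rm 11
ceph orch osd rm status                          --> repeat until osd.11 no longer appears
ceph orch device zap ceph-adm7 /dev/sdc --force
ceph orch device ls                              --> /dev/sdc should show Available = Yes
ceph orch daemon add osd ceph-adm7:/dev/sdc
ceph osd tree                                    --> the new OSD should appear with a fresh ID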



Actual results: A new OSD ID is not allocated to the new OSD device


Expected results: A new OSD ID should be allocated to the newly added OSD device


Additional info:
10.74.253.36 root/redhat

output:


**********************************************************************************************
[ceph: root@ceph-adm7 /]# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME           STATUS     REWEIGHT  PRI-AFF
-1         0.37105  root default                                    
-7         0.10738      host ceph-adm7                              
 4    hdd  0.02930          osd.4              up   1.00000  1.00000
11    hdd  0.02930          osd.11             up   1.00000  1.00000
12    hdd  0.02930          osd.12             up   1.00000  1.00000
13    hdd  0.01949          osd.13             up   1.00000  1.00000
-3         0.14648      host ceph-adm8                              
 1    hdd  0.05859          osd.1              up   1.00000  1.00000
 2    hdd  0.02930          osd.2              up   1.00000  1.00000
 6    hdd  0.02930          osd.6              up   1.00000  1.00000
 8    hdd  0.02930          osd.8              up   1.00000  1.00000
-5         0.11719      host ceph-adm9                              
 3    hdd  0.02930          osd.3       destroyed         0  1.00000
 5    hdd  0.02930          osd.5       destroyed   1.00000  1.00000
 7    hdd  0.02930          osd.7              up   1.00000  1.00000
 9    hdd  0.02930          osd.9       destroyed         0  1.00000
 0               0  osd.0                    down         0  1.00000
[ceph: root@ceph-adm7 /]# ceph orch osd rm 11
Scheduled OSD(s) for removal

[ceph: root@ceph-adm7 /]# ceph device ls
DEVICE                                                   HOST:DEV       DAEMONS        LIFE EXPECTANCY
QEMU_QEMU_HARDDISK_073ed4af-4752-4956-9e09-6504da882a79  ceph-adm9:sdf  osd.0                         
QEMU_QEMU_HARDDISK_3bc5e11c-28f2-419e-b076-8b6032e49de5  ceph-adm8:sdc  osd.2                         
QEMU_QEMU_HARDDISK_46e30862-254f-4327-bb00-99ea29f8e237  ceph-adm8:sdf  osd.1                         
QEMU_QEMU_HARDDISK_572813c0-bce4-46f0-a388-bb9ba92a4c9c  ceph-adm8:sde  osd.8                         
QEMU_QEMU_HARDDISK_5df4866b-18c5-4de5-8ce1-f44084b67e74  ceph-adm7:sdf  mon.ceph-adm7                 
QEMU_QEMU_HARDDISK_5eed3652-9334-408b-b0e7-3a6d125a7acc  ceph-adm7:sdb  osd.4                         
QEMU_QEMU_HARDDISK_6a660612-aa36-4e56-a80f-01839475e55d  ceph-adm7:sde  osd.13                        
QEMU_QEMU_HARDDISK_7c92121d-7ee6-4545-9820-14449e78892c  ceph-adm9:sdb  osd.3                         
QEMU_QEMU_HARDDISK_7e0b094b-662c-4320-82af-353c993e46bb  ceph-adm9:sda  mon.ceph-adm9                 
QEMU_QEMU_HARDDISK_bb888a81-55a6-4418-a9e5-c79043d1bbf7  ceph-adm7:sdd  osd.12                        
QEMU_QEMU_HARDDISK_d81b73c5-ab55-4a41-9f77-81533496ac16  ceph-adm9:sdd  osd.7                         
QEMU_QEMU_HARDDISK_ee309705-a09e-4e31-83e7-3b380398f255  ceph-adm8:sda  mon.ceph-adm8                 
QEMU_QEMU_HARDDISK_f65c4443-18fb-4d02-917d-6a6761541dab  ceph-adm8:sdd  osd.6                         
QEMU_QEMU_HARDDISK_ff170b5d-c13f-4514-9685-532e3b3c798e  ceph-adm9:sdc  osd.5                         
QEMU_QEMU_HARDDISK_ff2f239d-9870-4ee9-b7a1-20d01ad318cc  ceph-adm9:sde  osd.9                         
[ceph: root@ceph-adm7 /]# ceph orch device zap ceph-adm8 /dev/sdf --force
^CInterrupted
[ceph: root@ceph-adm7 /]# ^C
[ceph: root@ceph-adm7 /]# ^C
[ceph: root@ceph-adm7 /]# ceph orch device zap ceph-adm7 /dev/sdc --force
/bin/podman:stderr WARNING: The same type, major and minor should not be used for multiple devices.
/bin/podman:stderr --> Zapping: /dev/sdc
/bin/podman:stderr --> Zapping lvm member /dev/sdc. lv_path is /dev/ceph-9178f657-943e-409e-be2c-20c3207b4016/osd-block-61d6b56d-b982-492f-8dd2-c76e5a5384cb
/bin/podman:stderr Running command: /usr/bin/dd if=/dev/zero of=/dev/ceph-9178f657-943e-409e-be2c-20c3207b4016/osd-block-61d6b56d-b982-492f-8dd2-c76e5a5384cb bs=1M count=10 conv=fsync
/bin/podman:stderr  stderr: 10+0 records in
/bin/podman:stderr 10+0 records out
/bin/podman:stderr 10485760 bytes (10 MB, 10 MiB) copied, 0.041614 s, 252 MB/s
/bin/podman:stderr --> Only 1 LV left in VG, will proceed to destroy volume group ceph-9178f657-943e-409e-be2c-20c3207b4016
/bin/podman:stderr Running command: /usr/sbin/vgremove -v -f ceph-9178f657-943e-409e-be2c-20c3207b4016
/bin/podman:stderr  stderr: Removing ceph--9178f657--943e--409e--be2c--20c3207b4016-osd--block--61d6b56d--b982--492f--8dd2--c76e5a5384cb (253:8)
/bin/podman:stderr  stderr: Archiving volume group "ceph-9178f657-943e-409e-be2c-20c3207b4016" metadata (seqno 5).
/bin/podman:stderr  stderr: Releasing logical volume "osd-block-61d6b56d-b982-492f-8dd2-c76e5a5384cb"
/bin/podman:stderr  stderr: Creating volume group backup "/etc/lvm/backup/ceph-9178f657-943e-409e-be2c-20c3207b4016" (seqno 6).
/bin/podman:stderr  stdout: Logical volume "osd-block-61d6b56d-b982-492f-8dd2-c76e5a5384cb" successfully removed
/bin/podman:stderr  stderr: Removing physical volume "/dev/sdc" from volume group "ceph-9178f657-943e-409e-be2c-20c3207b4016"
/bin/podman:stderr  stdout: Volume group "ceph-9178f657-943e-409e-be2c-20c3207b4016" successfully removed
/bin/podman:stderr Running command: /usr/bin/dd if=/dev/zero of=/dev/sdc bs=1M count=10 conv=fsync
/bin/podman:stderr  stderr: 10+0 records in
/bin/podman:stderr 10+0 records out
/bin/podman:stderr 10485760 bytes (10 MB, 10 MiB) copied, 0.0194739 s, 538 MB/s
/bin/podman:stderr  stderr: 
/bin/podman:stderr --> Zapping successful for: <Raw Device: /dev/sdc>
[ceph: root@ceph-adm7 /]# ceph orch device ls
Hostname   Path      Type  Serial                                Size   Health   Ident  Fault  Available  
ceph-adm7  /dev/sdc  hdd   6d0fcb0e-e6cc-47e9-b5a5-710a38470199  32.2G  Unknown  N/A    N/A    Yes        
ceph-adm7  /dev/sdb  hdd   5eed3652-9334-408b-b0e7-3a6d125a7acc  32.2G  Unknown  N/A    N/A    No         
ceph-adm7  /dev/sdd  hdd   bb888a81-55a6-4418-a9e5-c79043d1bbf7  32.2G  Unknown  N/A    N/A    No         
ceph-adm7  /dev/sde  hdd   6a660612-aa36-4e56-a80f-01839475e55d  21.4G  Unknown  N/A    N/A    No         
ceph-adm8  /dev/sdb  hdd   8829185e-42e9-4df9-9a59-ffab4697b7aa  32.2G  Unknown  N/A    N/A    No         
ceph-adm8  /dev/sdc  hdd   3bc5e11c-28f2-419e-b076-8b6032e49de5  32.2G  Unknown  N/A    N/A    No         
ceph-adm8  /dev/sdd  hdd   f65c4443-18fb-4d02-917d-6a6761541dab  32.2G  Unknown  N/A    N/A    No         
ceph-adm8  /dev/sde  hdd   572813c0-bce4-46f0-a388-bb9ba92a4c9c  32.2G  Unknown  N/A    N/A    No         
ceph-adm8  /dev/sdf  hdd   46e30862-254f-4327-bb00-99ea29f8e237  64.4G  Unknown  N/A    N/A    No         
ceph-adm9  /dev/sdb  hdd   7c92121d-7ee6-4545-9820-14449e78892c  32.2G  Unknown  N/A    N/A    No         
ceph-adm9  /dev/sdc  hdd   ff170b5d-c13f-4514-9685-532e3b3c798e  32.2G  Unknown  N/A    N/A    No         
ceph-adm9  /dev/sdd  hdd   d81b73c5-ab55-4a41-9f77-81533496ac16  32.2G  Unknown  N/A    N/A    No         
ceph-adm9  /dev/sde  hdd   ff2f239d-9870-4ee9-b7a1-20d01ad318cc  32.2G  Unknown  N/A    N/A    No         
ceph-adm9  /dev/sdf  hdd   073ed4af-4752-4956-9e09-6504da882a79  85.8G  Unknown  N/A    N/A    No         
[ceph: root@ceph-adm7 /]# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME           STATUS     REWEIGHT  PRI-AFF
-1         0.34175  root default                                    
-7         0.07808      host ceph-adm7                              
 4    hdd  0.02930          osd.4              up   1.00000  1.00000
12    hdd  0.02930          osd.12             up   1.00000  1.00000
13    hdd  0.01949          osd.13             up   1.00000  1.00000
-3         0.14648      host ceph-adm8                              
 1    hdd  0.05859          osd.1              up   1.00000  1.00000
 2    hdd  0.02930          osd.2              up   1.00000  1.00000
 6    hdd  0.02930          osd.6              up   1.00000  1.00000
 8    hdd  0.02930          osd.8              up   1.00000  1.00000
-5         0.11719      host ceph-adm9                              
 3    hdd  0.02930          osd.3       destroyed         0  1.00000
 5    hdd  0.02930          osd.5       destroyed   1.00000  1.00000
 7    hdd  0.02930          osd.7              up   1.00000  1.00000
 9    hdd  0.02930          osd.9       destroyed         0  1.00000
 0               0  osd.0                      up         0  1.00000

[ceph: root@ceph-adm7 /]# ceph orch daemon add osd ceph-adm7:/dev/sdc
Error EINVAL: Traceback (most recent call last):
  File "/usr/share/ceph/mgr/mgr_module.py", line 1196, in _handle_command
    return self.handle_command(inbuf, cmd)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 141, in handle_command
    return dispatch[cmd['prefix']].call(self, cmd, inbuf)
  File "/usr/share/ceph/mgr/mgr_module.py", line 332, in call
    return self.func(mgr, **kwargs)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 103, in <lambda>
    wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args, **l_kwargs)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 92, in wrapper
    return func(*args, **kwargs)
  File "/usr/share/ceph/mgr/orchestrator/module.py", line 753, in _daemon_add_osd
    raise_if_exception(completion)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 643, in raise_if_exception
    raise e
RuntimeError: cephadm exited with an error code: 1, stderr:/bin/podman:stderr WARNING: The same type, major and minor should not be used for multiple devices.
/bin/podman:stderr --> passed data devices: 1 physical, 0 LVM
/bin/podman:stderr --> relative data size: 1.0
/bin/podman:stderr Running command: /usr/bin/ceph-authtool --gen-print-key
/bin/podman:stderr Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new 553c853b-18d9-4490-9f55-6473fc2cc5b1
/bin/podman:stderr  stderr: Error EEXIST: entity osd.10 exists but key does not match
/bin/podman:stderr -->  RuntimeError: Unable to create a new OSD id
Traceback (most recent call last):
  File "<stdin>", line 6129, in <module>
  File "<stdin>", line 1300, in _infer_fsid
  File "<stdin>", line 1383, in _infer_image
  File "<stdin>", line 3613, in command_ceph_volume
  File "<stdin>", line 1062, in call_throws
RuntimeError: Failed command: /bin/podman run --rm --ipc=host --net=host --entrypoint /usr/sbin/ceph-volume --privileged --group-add=disk -e CONTAINER_IMAGE=registry.redhat.io/rhceph-alpha/rhceph-5-rhel8:latest -e NODE_NAME=ceph-adm7 -e CEPH_VOLUME_OSDSPEC_AFFINITY=None -v /var/run/ceph/58149bf2-66ac-11eb-84bf-001a4a000262:/var/run/ceph:z -v /var/log/ceph/58149bf2-66ac-11eb-84bf-001a4a000262:/var/log/ceph:z -v /var/lib/ceph/58149bf2-66ac-11eb-84bf-001a4a000262/crash:/var/lib/ceph/crash:z -v /dev:/dev -v /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v /run/lock/lvm:/run/lock/lvm -v /tmp/ceph-tmp5bedoklo:/etc/ceph/ceph.conf:z -v /tmp/ceph-tmp84xn3s03:/var/lib/ceph/bootstrap-osd/ceph.keyring:z registry.redhat.io/rhceph-alpha/rhceph-5-rhel8:latest lvm batch --no-auto /dev/sdc --yes --no-systemd

Comment 1 Juan Miguel Olmo 2021-03-03 11:58:26 UTC
It seems that, in some way, osd.10 was not deleted properly. Do you remember whether you got any error when you deleted osd.10? What was the procedure used to remove osd.10?

The error produced is:
/bin/podman:stderr  stderr: Error EEXIST: entity osd.10 exists but key does not match
/bin/podman:stderr -->  RuntimeError: Unable to create a new OSD id

Returned directly from ceph-volume.

So it seems that the id "10" has been selected for the new OSD, but surprisingly osd.10 also appears to still be present in the auth list. Please verify whether this is the situation; a quick check is sketched below.
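
A minimal way to verify, using standard ceph CLI commands (osd.10 is the id taken from the error above):

# ceph osd ls | grep -x 10
(checks whether id 10 is still present in the OSD map)

# ceph auth get osd.10
(checks whether an auth entry still exists for osd.10)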

In order to be able to create new OSDs, you will need to do the following:

# ceph auth list

If you see an entry like:

osd.10
	key: ...................
	caps: [mgr] allow profile osd
	caps: [mon] allow profile osd
	caps: [osd] allow *

Then execute:

# ceph auth del osd.10

You will need to do exactly the same for every osd entry in the auth list that does not correspond to a real OSD in the cluster; see the sketch below.
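
A minimal sketch of that cleanup (an illustrative loop, not a tested procedure: it compares the osd.<id> entries reported by ceph auth list against the ids in ceph osd ls and deletes auth entries that have no matching OSD; review both outputs before deleting anything):

for id in $(ceph auth list 2>/dev/null | grep -oP '^osd\.\K[0-9]+'); do
    if ! ceph osd ls | grep -qx "$id"; then
        echo "removing stale auth entry osd.$id"
        ceph auth del "osd.$id"
    fi
done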


The interesting thing in this bug is how you reached the point where osd.10 is deleted but still present in the auth list. Do you remember how you removed osd.10?

Comment 2 Preethi 2021-03-05 18:29:15 UTC
@Adam, I used the ceph orch osd rm command to remove osd.10. I cannot reproduce the issue on another cluster. Below is the command output where the OSD add was successful.

[ceph: root@magna021 /]# ceph osd tree
ID   CLASS  WEIGHT    TYPE NAME          STATUS  REWEIGHT  PRI-AFF
 -1         33.75156  root default                                
 -3          4.94633      host plena001                           
  0    hdd   0.87279          osd.0          up   1.00000  1.00000
  1    hdd   0.87279          osd.1          up   1.00000  1.00000
  2    hdd   0.87279          osd.2          up   1.00000  1.00000
  3    hdd   0.87279          osd.3          up   1.00000  1.00000
  4    ssd   1.45518          osd.4          up   1.00000  1.00000
 -7          4.94633      host plena002                           
  5    hdd   0.87279          osd.5          up   1.00000  1.00000
  6    hdd   0.87279          osd.6          up   1.00000  1.00000
  7    hdd   0.87279          osd.7          up   1.00000  1.00000
  8    hdd   0.87279          osd.8          up   1.00000  1.00000
  9    ssd   1.45518          osd.9          up   1.00000  1.00000
-10          4.94633      host plena003                           
 10    hdd   0.87279          osd.10         up   1.00000  1.00000
 11    hdd   0.87279          osd.11         up   1.00000  1.00000
 12    hdd   0.87279          osd.12         up   1.00000  1.00000
 13    hdd   0.87279          osd.13         up   1.00000  1.00000
 14    ssd   1.45518          osd.14         up   1.00000  1.00000
-13          4.94633      host plena004                           
 15    hdd   0.87279          osd.15         up   1.00000  1.00000
 16    hdd   0.87279          osd.16         up   1.00000  1.00000
 17    hdd   0.87279          osd.17         up   1.00000  1.00000
 18    hdd   0.87279          osd.18         up   1.00000  1.00000
 19    ssd   1.45518          osd.19         up   1.00000  1.00000
-16          4.94633      host plena005                           
 20    hdd   0.87279          osd.20         up   1.00000  1.00000
 21    hdd   0.87279          osd.21         up   1.00000  1.00000
 22    hdd   0.87279          osd.22         up   1.00000  1.00000
 23    hdd   0.87279          osd.23         up   1.00000  1.00000
 24    ssd   1.45518          osd.24         up   1.00000  1.00000
-19          4.07355      host plena006                           
 25    hdd   0.87279          osd.25         up   1.00000  1.00000
 27    hdd   0.87279          osd.27         up   1.00000  1.00000
 28    hdd   0.87279          osd.28         up   1.00000  1.00000
 29    ssd   1.45518          osd.29         up   1.00000  1.00000
-22          4.94633      host plena007                           
 30    hdd   0.87279          osd.30         up   1.00000  1.00000
 31    hdd   0.87279          osd.31         up   1.00000  1.00000
 32    hdd   0.87279          osd.32         up   1.00000  1.00000
 33    hdd   0.87279          osd.33         up   1.00000  1.00000
 34    ssd   1.45518          osd.34         up   1.00000  1.00000
[ceph: root@magna021 /]# ceph orch daemon add osd plena006.ceph.redhat.com:/dev/sdc
Created osd(s) 26 on host 'plena006.ceph.redhat.com'
[ceph: root@magna021 /]#

Comment 3 Preethi 2021-03-08 13:19:31 UTC
@Juan, I do not see the issue in another cluster with the latest alpha image.

Comment 5 Juan Miguel Olmo 2021-03-15 09:14:20 UTC
mmanjuna:

It seems that your cluster is using a previous alpha image, and you are experiencing this issue:
https://bugzilla.redhat.com/show_bug.cgi?id=1923719

Please double-check that you are using the latest alpha image in your cluster.

Comment 8 errata-xmlrpc 2021-08-30 08:28:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 5.0 bug fix and enhancement), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3294

