Bug 1954503

Summary: Replacing failed OSDs fails on NVMe devices
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: skanta
Component: Cephadm
Assignee: Daniel Pivonka <dpivonka>
Status: CLOSED ERRATA
QA Contact: skanta
Severity: medium
Docs Contact: Mary Frances Hull <mhull>
Priority: urgent
Version: 5.0
CC: agunn, akupczyk, bhubbard, ceph-eng-bugs, dpivonka, knortema, nojha, pasik, pdhiran, pnataraj, rzarzyns, sangadi, sewagner, sseshasa, tserlin, vereddy, vumrao
Target Milestone: ---
Keywords: Reopened
Target Release: 5.0z1
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: ceph-16.2.0-127.el8cp
Doc Type: Bug Fix
Doc Text:
.Searching Ceph OSD id claim matches a host's fully-qualified domain name to a host name
Previously, when replacing a failed Ceph OSD, the name in the CRUSH map appeared only as a host name, while the search for the Ceph OSD id claim used the fully-qualified domain name (FQDN) instead. As a result, the Ceph OSD id claim was not found. With this release, the Ceph OSD id claim search correctly matches an FQDN to a host name, and replacing the Ceph OSD works as expected.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-11-02 16:38:26 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1936099, 1959686
Attachments:
  cephadm, ceph-volume and ceph-osd logs (flags: none)
  Error snap shot (flags: none)
  HDD Working scenario (flags: none)
  Error snap shot (flags: none)

Description skanta 2021-04-28 09:41:46 UTC
Description of problem:
   Trying to replace the failed OSDs with the zap command; the OSDs stay in the "out" and "destroyed" state.


Version-Release number of selected component (if applicable):
[ceph: root@magna045 ceph]# ceph -v
ceph version 16.2.0-8.el8cp (f869f8bf2b6e9c3886e94267d378de5d9d57bb61) pacific (stable)
[ceph: root@magna045 ceph]# 



How reproducible:
Steps to Reproduce:
1. Create a cluster by using cephadm.
2. To simulate the OSD-down scenario, log in to the OSD node.
3. To list the OSD services, execute "systemctl list-units --type=service | grep ceph".
4. To bring the OSD down, execute "systemctl stop ceph-1bafd812-a7e2-11eb-9865-002590fbc342.service". In the dashboard the OSD status should be "in down".

5. Log in to the Dashboard at https://<nodeIP>:8443.
6. Navigate to Clusters->OSDs.
7. Select the OSD which is down.
8. From the Edit drop-down menu, select Flags, select "No Up", and click Update. The status should be "in down" and the Flags should be "noup".

9. From the Edit drop-down menu, select Delete.
10. In the Delete OSD dialog box, select the "Preserve OSD ID(s) for replacement" and "Yes, I am sure" check boxes.
11. Click Delete OSD.
12. Wait until the status of the OSD changes to "out" and "destroyed".
13. ceph orch device zap depressa008.ceph.redhat.com /dev/nvme0n1 --force
    (A consolidated CLI-only sketch of these steps is included below.)
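
For reference, the dashboard-driven steps above can also be performed entirely from the CLI. The following is a rough sketch that mirrors the commands used later in comment 6; the systemd unit name, OSD id, host, and device are examples taken from this report and will differ in other clusters:

   # Simulate the failure: stop the OSD's systemd unit on the OSD node.
   systemctl list-units --type=service | grep ceph
   systemctl stop ceph-1bafd812-a7e2-11eb-9865-002590fbc342.service

   # Keep the OSD from coming back up while it is being replaced.
   ceph osd add-noup osd.2

   # Destroy the OSD so that its id is preserved for replacement.
   ceph osd destroy 2 --yes-i-really-mean-it

   # Zap the underlying NVMe device so it can be reused.
   ceph orch device zap depressa008.ceph.redhat.com /dev/nvme0n1 --force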

Actual results:
                The OSD remains in the "out destroyed" state.

Expected results:
                The OSD should move from the "out destroyed" status to the "out down" status.


Additional info: The same scenario works on HDDs.

Comment 1 skanta 2021-04-28 10:28:03 UTC
Created attachment 1776495 [details]
cephadm,ceph-volume and ceph-osd logs

Comment 2 skanta 2021-04-28 10:31:18 UTC
Created attachment 1776497 [details]
Error snap shot

Comment 3 skanta 2021-04-28 10:32:45 UTC
Created attachment 1776499 [details]
HDD Working scenario

Comment 4 Vikhyat Umrao 2021-04-28 10:54:23 UTC
I do not think this is a RADOS bug.

Comment 6 skanta 2021-04-30 12:56:55 UTC
Reproduced the issue with CLI commands-

1.[ceph: root@depressa009 /]# ceph -s
  cluster:
    id:     478513fc-a996-11eb-b00c-ac1f6b5635ee
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum depressa009,depressa011,depressa010 (age 29m)
    mgr: depressa009.qrithr(active, since 97m), standbys: magna048.dzxcnc
    osd: 23 osds: 23 up (since 22m), 23 in (since 22m)
 
  data:
    pools:   1 pools, 1 pgs
    objects: 0 objects, 0 B
    usage:   128 MiB used, 66 TiB / 66 TiB avail
    pgs:     1 active+clean
 
[ceph: root@depressa009 /]# 

2. systemctl stop ceph-478513fc-a996-11eb-b00c-ac1f6b5635ee.service ("out down" status in Dashboard)

3. ceph osd add-noup osd.2 ("out down + noup" status in Dashboard)

4. ceph osd destroy 2 --yes-i-really-mean-it ("out destroyed + noup" status in Dashboard)
   Output: destroyed osd.2

5. ceph orch device zap depressa010.ceph.redhat.com /dev/nvme0n1 --force

   Output :
      [ceph: root@depressa009 /]# ceph orch device zap depressa011.ceph.redhat.com /dev/nvme0n1 --force
		/bin/podman: WARNING: The same type, major and minor should not be used for multiple devices.
		/bin/podman: WARNING: The same type, major and minor should not be used for multiple devices.
		/bin/podman: WARNING: The same type, major and minor should not be used for multiple devices.
		/bin/podman: WARNING: The same type, major and minor should not be used for multiple devices.
		/bin/podman: WARNING: The same type, major and minor should not be used for multiple devices.
		/bin/podman: --> Zapping: /dev/nvme0n1
		/bin/podman: --> Zapping lvm member /dev/nvme0n1. lv_path is /dev/ceph-48982378-126e-4e95-847c-bfd6d6d1692d/osd-block-8e6d7857-42df-4eb4-8426-fb552c1eb97f
		/bin/podman: Running command: /usr/bin/dd if=/dev/zero of=/dev/ceph-48982378-126e-4e95-847c-bfd6d6d1692d/osd-block-8e6d7857-42df-4eb4-8426-fb552c1eb97f bs=1M count=10 conv=fsync
		/bin/podman:  stderr: 10+0 records in
		/bin/podman: 10+0 records out
		/bin/podman:  stderr: 10485760 bytes (10 MB, 10 MiB) copied, 0.00877897 s, 1.2 GB/s
		/bin/podman: --> Only 1 LV left in VG, will proceed to destroy volume group ceph-48982378-126e-4e95-847c-bfd6d6d1692d
		/bin/podman: Running command: /usr/sbin/vgremove -v -f ceph-48982378-126e-4e95-847c-bfd6d6d1692d
		/bin/podman:  stderr: Removing ceph--48982378--126e--4e95--847c--bfd6d6d1692d-osd--block--8e6d7857--42df--4eb4--8426--fb552c1eb97f (253:0)
		/bin/podman:  stderr: Archiving volume group "ceph-48982378-126e-4e95-847c-bfd6d6d1692d" metadata (seqno 5).
		/bin/podman:  stderr: Releasing logical volume "osd-block-8e6d7857-42df-4eb4-8426-fb552c1eb97f"
		/bin/podman:  stderr: Creating volume group backup "/etc/lvm/backup/ceph-48982378-126e-4e95-847c-bfd6d6d1692d" (seqno 6).
		/bin/podman:  stdout: Logical volume "osd-block-8e6d7857-42df-4eb4-8426-fb552c1eb97f" successfully removed
		/bin/podman:  stderr: Removing physical volume "/dev/nvme0n1" from volume group "ceph-48982378-126e-4e95-847c-bfd6d6d1692d"
		/bin/podman:  stdout: Volume group "ceph-48982378-126e-4e95-847c-bfd6d6d1692d" successfully removed
		/bin/podman: Running command: /usr/bin/dd if=/dev/zero of=/dev/nvme0n1 bs=1M count=10 conv=fsync
		/bin/podman:  stderr: 10+0 records in
		/bin/podman: 10+0 records out
		/bin/podman: 10485760 bytes (10 MB, 10 MiB) copied, 0.0081311 s, 1.3 GB/s
		/bin/podman: --> Zapping successful for: <Raw Device: /dev/nvme0n1>

Comment 8 Juan Miguel Olmo 2021-05-06 09:29:17 UTC
@skanta :

In comment 6 you are not doing anything to "replace" the OSD id... you are just deleting an OSD in an imaginative way.
What we see in your comment is that all the commands you used are working properly, as expected.

If you want to replace an OSD using CLI commands, you must follow:

https://docs.ceph.com/en/latest/cephadm/osd/#replacing-an-osd

There is no need to stop any service beforehand. The only thing you need is to have "devices" available on the same host where you are trying to replace the osd.


Note 1: 
Take into account that if you have used a "managed OSD" service to create your OSDs, like "all-available-devices", executing the zap command as you pointed out in comment 6:
"ceph orch device zap depressa010.ceph.redhat.com /dev/nvme0n1 --force"
makes one device available again on depressa010, and that device will immediately be used to create a new osd.
see: https://docs.ceph.com/en/latest/cephadm/osd/#erasing-devices-zapping-devices

Note 2: @pnataraj has executed this procedure several times... please discuss your doubts with her.
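
For reference, the documented replacement flow linked above boils down to something like the following sketch; the OSD id, host, and device are examples only, and the commands are the same ones exercised later in comment 9:

   # Mark the OSD for replacement: it is drained, stopped, left "destroyed",
   # and its id is kept so it can be reused.
   ceph orch osd rm 1 --replace

   # Watch progress until the removal/replacement operation has finished.
   ceph orch osd rm status

   # If OSDs are deployed by a managed service spec (e.g. all-available-devices),
   # cephadm recreates the OSD automatically once a device on the same host
   # becomes available again; otherwise, add one explicitly on that same host:
   ceph orch daemon add osd depressa008.ceph.redhat.com:/dev/nvme0n1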

Comment 9 skanta 2021-05-10 00:53:19 UTC
Thanks for your suggestions. I performed the tests as mentioned at https://bugzilla.redhat.com/show_bug.cgi?id=1932489#c1, and here are the outputs:

HDD scenario (OSD 0 removed with --replace and successfully replaced as OSD 0):

1. [ceph: root@magna045 /]# ceph orch osd rm 0 --replace
   Scheduled OSD(s) for removal
   (Status: OSD 0 removed for replacement)

2. [ceph: root@magna045 /]# ceph orch osd rm status
   No OSD remove/replace operations reported

3. [ceph: root@magna045 /]# ceph orch daemon add osd magna046:/dev/sdd
   Created osd(s) 0 on host 'magna046'
   (Status: OSD 0 added, i.e. replaced)

4. [ceph: root@magna045 /]# ceph osd tree
   ID   CLASS  WEIGHT    TYPE NAME             STATUS  REWEIGHT  PRI-AFF
    -1         10.05649  root default
    -7          7.32739      host depressa008
     1    ssd   0.34109          osd.1             up   1.00000  1.00000
     2    ssd   6.98630          osd.2             up   1.00000  1.00000
   -10          0.90970      host magna045
     3    hdd   0.90970          osd.3             up   1.00000  1.00000
    -3          0.90970      host magna046
     0    hdd   0.90970          osd.0             up   1.00000  1.00000
   -13          0.90970      host magna047
     4    hdd   0.90970          osd.4             up   1.00000  1.00000
   [ceph: root@magna045 /]#

NVME scenario (OSD 1 removed with --replace, but a new OSD 5 is created instead of reusing id 1):

1. [ceph: root@magna045 /]# ceph orch osd rm 1 --replace
   Scheduled OSD(s) for removal
   (Status: OSD 1 removed for replacement)

2. [ceph: root@magna045 /]# ceph orch osd rm status
   No OSD remove/replace operations reported

3. [ceph: root@magna045 /]# ceph orch daemon add osd depressa008.ceph.redhat.com:/dev/nvme1n1
   Created osd(s) 5 on host 'depressa008.ceph.redhat.com'
   (Status: OSD 5 added, new and not replaced)

4. ceph osd tree (OSD 1 is left in destroyed status and osd.5 is newly added, not replaced)
   [ceph: root@magna045 /]# ceph osd tree
   ID   CLASS  WEIGHT    TYPE NAME             STATUS     REWEIGHT  PRI-AFF
    -1         10.39758  root default
    -7          7.66849      host depressa008
     1    ssd   0.34109          osd.1         destroyed         0  1.00000
     2    ssd   6.98630          osd.2                up   1.00000  1.00000
     5    ssd   0.34109          osd.5                up   1.00000  1.00000
   -10          0.90970      host magna045
     3    hdd   0.90970          osd.3                up   1.00000  1.00000
    -3          0.90970      host magna046
     0    hdd   0.90970          osd.0                up   1.00000  1.00000
   -13          0.90970      host magna047
     4    hdd   0.90970          osd.4                up   1.00000  1.00000
   [ceph: root@magna045 /]#


Initially the cluster was created with 5 OSDs (0, 1, 2, 3 and 4). For HDDs the OSD is replaced successfully, so after the replace we still see only 5 OSDs (step 4). For NVMe the replace does not work: a new OSD (5) is added instead, and the OSD count becomes 6 (0, 1, 2, 3, 4 and the newly added 5).

Comment 10 skanta 2021-05-10 02:55:13 UTC
Created attachment 1781487 [details]
Error snap shot

Comment 11 Juan Miguel Olmo 2021-05-11 10:13:50 UTC
The problem is related to the use of FQDN names in the cluster.

The OSD ids to reuse when creating new osds come from the output of the command "ceph osd tree destroyed", and in the output of this command the hostname is represented as a bare hostname.
Ex:
[ceph: root@magna045 /]# ceph osd tree destroyed
ID  CLASS  WEIGHT    TYPE NAME             STATUS     REWEIGHT  PRI-AFF
-1         17.04279  root default                                      
-5          7.32739      host depressa008                              
 5    ssd   0.34109          osd.5         destroyed         0  1.00000


However, the new osd where we want to reuse the "old id" (in this case 5) is created using the FQDN hostname:

ceph orch daemon add osd depressa008.ceph.redhat.com:/dev/nvme0n1


And cephadm cannot "link" the bare hostname with the FQDN hostname, so it is not possible to know whether we have "osd ids" to reuse on the host.


The composition of the cluster is odd, because it mixes hosts with bare and FQDN names:

[ceph: root@magna045 /]# ceph orch host ls
HOST                         ADDR                         LABELS  STATUS  
depressa008.ceph.redhat.com  depressa008.ceph.redhat.com                  
depressa009.ceph.redhat.com  depressa009.ceph.redhat.com                  
magna045                     magna045                                     
magna046                     magna046                                     
magna047                     magna047    


I recommend the exclusive use of BARE NAMES for ALL the hosts in the cluster:

See https://docs.ceph.com/en/latest/cephadm/host-management/#fully-qualified-domain-names-vs-bare-host-names

Having a mix of naming schemes for the hosts can only cause more problems in the future.


Please use the bare name for "depressa008" and "depressa009" when you are adding these hosts to the cluster. Once these hosts use the bare name, the problem won't appear again.

The easiest and fastest solution for the issue described in this bug is to rebuild the cluster using BARE hostnames.
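
To make the mismatch concrete, here is a rough shell illustration using the names from the "ceph orch host ls" output above. The bare-name comparison in the last line reflects what the eventual fix does for the OSD id claim search; the real logic lives inside cephadm, so this is only an illustration:

   # The two forms of the same host name that cephadm has to reconcile.
   crush_host="depressa008"                    # as printed by 'ceph osd tree destroyed'
   cephadm_host="depressa008.ceph.redhat.com"  # as registered with 'ceph orch host add'

   # Exact string comparison fails, so no destroyed OSD id is found to reclaim
   # and a brand new id is allocated instead:
   [ "$cephadm_host" = "$crush_host" ] || echo "no match: new OSD id allocated"

   # Comparing only the short (bare) part of the FQDN succeeds, which is the
   # behaviour the fix introduces (assuming the bare name is the first label
   # of the FQDN, as in this cluster):
   [ "${cephadm_host%%.*}" = "$crush_host" ] && echo "match: destroyed id can be reused"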

Comment 12 Juan Miguel Olmo 2021-05-13 17:03:15 UTC
Modification ongoing. Please use "bare names" for the cluster hosts as a workaround for this issue until we have the fix downstream.
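
A minimal sketch of what the workaround looks like when (re)building the cluster with bare host names; the host names are examples from this report, and the optional address argument to "ceph orch host add" is omitted for brevity:

   # Register the hosts by their bare names so that CRUSH and cephadm agree
   # on the host name:
   ceph orch host add depressa008
   ceph orch host add depressa009

   # Verify that only bare names are listed:
   ceph orch host ls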

Comment 13 skanta 2021-05-13 17:24:56 UTC
As mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1954503#c11, I performed the steps with bare names in the cluster and successfully replaced the OSDs.

[ceph: root@magna045 /]# ceph osd tree
ID   CLASS  WEIGHT    TYPE NAME             STATUS  REWEIGHT  PRI-AFF
 -1         10.05649  root default                                   
 -7          7.32739      host depressa008                           
  1    ssd   0.34109          osd.1             up   1.00000  1.00000
  2    ssd   6.98630          osd.2             up   1.00000  1.00000
-10          0.90970      host magna045                              
  3    hdd   0.90970          osd.3             up   1.00000  1.00000
 -3          0.90970      host magna046                              
  0    hdd   0.90970          osd.0             up   1.00000  1.00000
-13          0.90970      host magna047                              
  4    hdd   0.90970          osd.4             up   1.00000  1.00000
[ceph: root@magna045 /]#

Comment 14 Sebastian Wagner 2021-05-26 10:33:19 UTC
For customers with FQDN host names, the workaround is to replace OSDs manually (a rough sketch of these steps follows the list):

* https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/#removing-osds-manual 
* set destroyed flag
* Add the OSD via ceph-volume
* ceph cephadm osd activate <host>
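
A rough CLI sketch of that manual workaround; the OSD id, host, and device are examples, and the exact ceph-volume invocation may differ depending on how the original OSD was laid out:

   # 1. Mark the failed OSD out and destroyed so its id can be reused
   #    (see the "Removing OSDs (manual)" documentation linked above).
   ceph osd out 1
   ceph osd destroy 1 --yes-i-really-mean-it

   # 2. On the OSD host, wipe the old device and prepare it again with
   #    ceph-volume, reusing the destroyed id.
   ceph-volume lvm zap /dev/nvme0n1
   ceph-volume lvm prepare --osd-id 1 --data /dev/nvme0n1

   # 3. Let cephadm detect and activate the newly prepared OSD on that host.
   ceph cephadm osd activate depressa008.ceph.redhat.com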


Updating Severity accordingly.

Comment 15 Sebastian Wagner 2021-05-26 10:38:47 UTC
5.1 as discussed

Comment 16 Sebastian Wagner 2021-06-18 11:52:19 UTC
would be great to have this in z1

Comment 20 Daniel Pivonka 2021-09-16 18:30:05 UTC
Fix PR: https://github.com/ceph/ceph/pull/41328
Pacific backport: https://github.com/ceph/ceph/pull/41463
Cherry-picked to: ceph-5.0-rhel-patches

Comment 31 Daniel Pivonka 2021-10-21 16:52:48 UTC
I think there is a misunderstanding of how replacing an OSD works. When an OSD is removed and marked for replacement with 'ceph orch osd rm <osd_id> --replace', the OSD that will replace it must be created on the same host as the OSD that was removed.

In your testing in comments #c25 and #c26 you are removing an OSD on one host and then adding another OSD on a different host.

For example, in #c25 you removed and marked osd.0 for replacement. That OSD was on host depressa008. You then created a new OSD on host depressa011, and that became osd.15. That is the expected behavior. To replace osd.0 you would have had to create another OSD on depressa008.

When I tested OSD replacement, it worked as expected when removing and marking an OSD for replacement on one host and adding another OSD on the same host.

Moving this back to on_qa to be tested correctly.
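
In other words, the destroyed id is only reclaimed when the new OSD is created on the same host. A minimal sketch of the correct sequence, using the osd.0/depressa008 example from this comment (the host name form and device path are placeholders):

   # osd.0 lives on depressa008, so mark it for replacement there.
   ceph orch osd rm 0 --replace

   # The destroyed id stays attached to that host in the CRUSH tree.
   ceph osd tree destroyed

   # Create the new OSD on the SAME host (depressa008); cephadm can then
   # reclaim id 0 instead of allocating a new one (e.g. osd.15 on depressa011).
   ceph orch daemon add osd depressa008:/dev/nvme0n1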

Comment 40 errata-xmlrpc 2021-11-02 16:38:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 5.0 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4105