Bug 1954503
Summary: Replacing failed OSDs fails on NVMe devices

Product: [Red Hat Storage] Red Hat Ceph Storage
Component: Cephadm
Version: 5.0
Status: CLOSED ERRATA
Priority: urgent
Severity: medium
Reporter: skanta
Assignee: Daniel Pivonka <dpivonka>
QA Contact: skanta
Docs Contact: Mary Frances Hull <mhull>
CC: agunn, akupczyk, bhubbard, ceph-eng-bugs, dpivonka, knortema, nojha, pasik, pdhiran, pnataraj, rzarzyns, sangadi, sewagner, sseshasa, tserlin, vereddy, vumrao
Keywords: Reopened
Target Milestone: ---
Target Release: 5.0z1
Hardware: Unspecified
OS: Unspecified
Fixed In Version: ceph-16.2.0-127.el8cp
Doc Type: Bug Fix
Doc Text:
.Searching Ceph OSD id claim matches a host's fully-qualified domain name to a host name
Previously, when replacing a failed Ceph OSD, the name in the CRUSH map appeared only as a bare host name, while the search for the Ceph OSD id claim used the fully-qualified domain name (FQDN) instead. As a result, the Ceph OSD id claim was not found.
With this release, the Ceph OSD id claim search correctly matches an FQDN to a host name, and replacing the Ceph OSD works as expected.
Last Closed: 2021-11-02 16:38:26 UTC
Type: Bug
Bug Blocks: 1936099, 1959686
Description (skanta, 2021-04-28 09:41:46 UTC)
Created attachment 1776495 [details]
cephadm,ceph-volume and ceph-osd logs
Created attachment 1776497 [details]
Error snapshot
Created attachment 1776499 [details]
HDD Working scenario
I do not think this is a RADOS bug.

Followed the steps mentioned at https://cee-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/CCS/job/ccs-mr-preview/35102/artifact/preview/index.html#replacing-the-failed-osds-on-the-ceph-dashboard_dash

Reproduced the issue with CLI commands:

1. [ceph: root@depressa009 /]# ceph -s
     cluster:
       id:     478513fc-a996-11eb-b00c-ac1f6b5635ee
       health: HEALTH_OK
     services:
       mon: 3 daemons, quorum depressa009,depressa011,depressa010 (age 29m)
       mgr: depressa009.qrithr(active, since 97m), standbys: magna048.dzxcnc
       osd: 23 osds: 23 up (since 22m), 23 in (since 22m)
     data:
       pools:   1 pools, 1 pgs
       objects: 0 objects, 0 B
       usage:   128 MiB used, 66 TiB / 66 TiB avail
       pgs:     1 active+clean
   [ceph: root@depressa009 /]#

2. systemctl stop ceph-478513fc-a996-11eb-b00c-ac1f6b5635ee.service ("out down" status in Dashboard)

3. ceph osd add-noup osd.2 ("out down + noup" status in Dashboard)

4. ceph osd destroy 2 --yes-i-really-mean-it ("out destroyed + noup" status in Dashboard)
   Output: destroyed osd.2

5. ceph orch device zap depressa010.ceph.redhat.com /dev/nvme0n1 --force
   Output:
   [ceph: root@depressa009 /]# ceph orch device zap depressa011.ceph.redhat.com /dev/nvme0n1 --force
   /bin/podman: WARNING: The same type, major and minor should not be used for multiple devices. (repeated 5 times)
   /bin/podman: --> Zapping: /dev/nvme0n1
   /bin/podman: --> Zapping lvm member /dev/nvme0n1. lv_path is /dev/ceph-48982378-126e-4e95-847c-bfd6d6d1692d/osd-block-8e6d7857-42df-4eb4-8426-fb552c1eb97f
   /bin/podman: Running command: /usr/bin/dd if=/dev/zero of=/dev/ceph-48982378-126e-4e95-847c-bfd6d6d1692d/osd-block-8e6d7857-42df-4eb4-8426-fb552c1eb97f bs=1M count=10 conv=fsync
   /bin/podman:  stderr: 10+0 records in
   /bin/podman: 10+0 records out
   /bin/podman:  stderr: 10485760 bytes (10 MB, 10 MiB) copied, 0.00877897 s, 1.2 GB/s
   /bin/podman: --> Only 1 LV left in VG, will proceed to destroy volume group ceph-48982378-126e-4e95-847c-bfd6d6d1692d
   /bin/podman: Running command: /usr/sbin/vgremove -v -f ceph-48982378-126e-4e95-847c-bfd6d6d1692d
   /bin/podman:  stderr: Removing ceph--48982378--126e--4e95--847c--bfd6d6d1692d-osd--block--8e6d7857--42df--4eb4--8426--fb552c1eb97f (253:0)
   /bin/podman:  stderr: Archiving volume group "ceph-48982378-126e-4e95-847c-bfd6d6d1692d" metadata (seqno 5).
   /bin/podman:  stderr: Releasing logical volume "osd-block-8e6d7857-42df-4eb4-8426-fb552c1eb97f"
   /bin/podman:  stderr: Creating volume group backup "/etc/lvm/backup/ceph-48982378-126e-4e95-847c-bfd6d6d1692d" (seqno 6).
   /bin/podman:  stdout: Logical volume "osd-block-8e6d7857-42df-4eb4-8426-fb552c1eb97f" successfully removed
   /bin/podman:  stderr: Removing physical volume "/dev/nvme0n1" from volume group "ceph-48982378-126e-4e95-847c-bfd6d6d1692d"
   /bin/podman:  stdout: Volume group "ceph-48982378-126e-4e95-847c-bfd6d6d1692d" successfully removed
   /bin/podman: Running command: /usr/bin/dd if=/dev/zero of=/dev/nvme0n1 bs=1M count=10 conv=fsync
   /bin/podman:  stderr: 10+0 records in
   /bin/podman: 10+0 records out
   /bin/podman: 10485760 bytes (10 MB, 10 MiB) copied, 0.0081311 s, 1.3 GB/s
   /bin/podman: --> Zapping successful for: <Raw Device: /dev/nvme0n1>

@skanta: In comment 6 you are not doing anything to "replace" the OSD id; you are just deleting an OSD in an imaginative way. What we see in your comment is that all the commands you used are working properly, as expected.

If you want to replace an OSD using CLI commands, you must follow https://docs.ceph.com/en/latest/cephadm/osd/#replacing-an-osd (a sketch of that flow follows this comment). There is no need to stop any service beforehand. The only thing you need is to have devices available on the same host where you are trying to replace the OSD.

Note 1: Take into account that if you used a "managed OSD" service to create your OSDs, such as "all-available-devices", executing the zap command as you did in comment 6 ("ceph orch device zap depressa010.ceph.redhat.com /dev/nvme0n1 --force") makes one device available on depressa010, which will immediately be used to create a new OSD. See: https://docs.ceph.com/en/latest/cephadm/osd/#erasing-devices-zapping-devices

Note 2: @pnataraj has executed this procedure several times; please discuss any doubts with her.
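[Editor's note: a minimal sketch of the documented replacement flow referenced above; the host and device names here are illustrative, not taken from the logs in this report.]

    # Drain the OSD and mark it 'destroyed' so its id is kept for reuse
    ceph orch osd rm 1 --replace

    # Watch the drain/removal progress
    ceph orch osd rm status

    # With a managed OSD spec the replacement is created automatically once a
    # device becomes free on the SAME host; otherwise add one there explicitly:
    ceph orch daemon add osd depressa010:/dev/nvme0n1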
Thanks for your suggestions. I performed the tests as mentioned at https://bugzilla.redhat.com/show_bug.cgi?id=1932489#c1; here are the outputs.

HDD (magna046, /dev/sdd):

    1. [ceph: root@magna045 /]# ceph orch osd rm 0 --replace
       Scheduled OSD(s) for removal                               --> OSD 0 removed (replace)
    2. [ceph: root@magna045 /]# ceph orch osd rm status
       No OSD remove/replace operations reported
    3. [ceph: root@magna045 /]# ceph orch daemon add osd magna046:/dev/sdd
       Created osd(s) 0 on host 'magna046'                        --> OSD 0 added (replaced)
    4. [ceph: root@magna045 /]# ceph osd tree
       ID   CLASS  WEIGHT    TYPE NAME             STATUS  REWEIGHT  PRI-AFF
       -1          10.05649  root default
       -7           7.32739      host depressa008
        1   ssd     0.34109          osd.1         up      1.00000   1.00000
        2   ssd     6.98630          osd.2         up      1.00000   1.00000
       -10          0.90970      host magna045
        3   hdd     0.90970          osd.3         up      1.00000   1.00000
       -3           0.90970      host magna046
        0   hdd     0.90970          osd.0         up      1.00000   1.00000
       -13          0.90970      host magna047
        4   hdd     0.90970          osd.4         up      1.00000   1.00000

NVMe (depressa008.ceph.redhat.com, /dev/nvme1n1):

    1. [ceph: root@magna045 /]# ceph orch osd rm 1 --replace
       Scheduled OSD(s) for removal                               --> OSD 1 removed (replace)
    2. [ceph: root@magna045 /]# ceph orch osd rm status
       No OSD remove/replace operations reported
    3. [ceph: root@magna045 /]# ceph orch daemon add osd depressa008.ceph.redhat.com:/dev/nvme1n1
       Created osd(s) 5 on host 'depressa008.ceph.redhat.com'     --> OSD 5 added (new, not replaced)
    4. [ceph: root@magna045 /]# ceph osd tree
       ID   CLASS  WEIGHT    TYPE NAME             STATUS     REWEIGHT  PRI-AFF
       -1          10.39758  root default
       -7           7.66849      host depressa008
        1   ssd     0.34109          osd.1         destroyed        0  1.00000
        2   ssd     6.98630          osd.2         up         1.00000  1.00000
        5   ssd     0.34109          osd.5         up         1.00000  1.00000
       -10          0.90970      host magna045
        3   hdd     0.90970          osd.3         up         1.00000  1.00000
       -3           0.90970      host magna046
        0   hdd     0.90970          osd.0         up         1.00000  1.00000
       -13          0.90970      host magna047
        4   hdd     0.90970          osd.4         up         1.00000  1.00000
       --> osd.1 remains in destroyed status and osd.5 is newly added, not a replacement.

Initially created a cluster with 5 OSDs (0, 1, 2, 3 and 4). For HDDs the OSD is replaced successfully, so after the replacement (step 4) there are still only 5 OSDs. For NVMe the replacement does not work: a new OSD (5) is added instead, and the OSD count becomes 6 (0, 1, 2, 3, 4 and the newly added 5).
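[Editor's note: a minimal pair of checks, assuming the cluster above, that make this failure visible; both commands also appear in the analysis below.]

    # Which OSD ids are held for reuse, and under which host names?
    ceph osd tree destroyed
    # How were the hosts registered with cephadm (bare name vs FQDN)?
    ceph orch host ls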
Created attachment 1781487 [details]
Error snapshot
The problem is related to the use of FQDN names in the cluster. The OSD ids to reuse when creating new OSDs come from the output of the command "ceph osd tree destroyed", and in that output the host is represented by its bare host name. Example:

    [ceph: root@magna045 /]# ceph osd tree destroyed
    ID  CLASS  WEIGHT    TYPE NAME             STATUS     REWEIGHT  PRI-AFF
    -1         17.04279  root default
    -5          7.32739      host depressa008
     5  ssd     0.34109          osd.5         destroyed         0  1.00000

However, the new OSD that should reuse the old id (in this case 5) is created using the FQDN host name:

    ceph orch daemon add osd depressa008.ceph.redhat.com:/dev/nvme0n1

cephadm cannot link the bare host name with the FQDN host name, so it is not possible to know whether there are OSD ids to reuse on the host.

The composition of the cluster is odd, because it mixes hosts with bare and FQDN names:

    [ceph: root@magna045 /]# ceph orch host ls
    HOST                         ADDR                         LABELS  STATUS
    depressa008.ceph.redhat.com  depressa008.ceph.redhat.com
    depressa009.ceph.redhat.com  depressa009.ceph.redhat.com
    magna045                     magna045
    magna046                     magna046
    magna047                     magna047

I recommend using BARE NAMES exclusively for ALL the hosts in the cluster; see https://docs.ceph.com/en/latest/cephadm/host-management/#fully-qualified-domain-names-vs-bare-host-names. Mixing naming schemes for the hosts can only cause more problems in the future. Please use the bare names for "depressa008" and "depressa009" when adding these hosts to the cluster. Once these hosts use bare names the problem will not appear again. The easiest and fastest solution for the issue reported in this bug is to rebuild the cluster using bare host names.

Modification ongoing. Please use bare names for the cluster hosts as a workaround for this issue until we have the fix downstream.

As mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1954503#c11, performed the steps with bare names in the cluster and successfully replaced the OSDs:

    [ceph: root@magna045 /]# ceph osd tree
    ID   CLASS  WEIGHT    TYPE NAME             STATUS  REWEIGHT  PRI-AFF
    -1          10.05649  root default
    -7           7.32739      host depressa008
     1   ssd     0.34109          osd.1         up      1.00000   1.00000
     2   ssd     6.98630          osd.2         up      1.00000   1.00000
    -10          0.90970      host magna045
     3   hdd     0.90970          osd.3         up      1.00000   1.00000
    -3           0.90970      host magna046
     0   hdd     0.90970          osd.0         up      1.00000   1.00000
    -13          0.90970      host magna047
     4   hdd     0.90970          osd.4         up      1.00000   1.00000

For customers with FQDN host names, the workaround is to manually replace OSDs (see the sketch below):
* https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/#removing-osds-manual
* set the destroyed flag
* add the OSD via ceph-volume
* ceph cephadm osd activate <host>
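[Editor's note: a minimal sketch of that manual workaround, assuming the failed OSD is osd.5 on host depressa008 and /dev/nvme0n1 is the replacement device; all names are illustrative.]

    # 1. Keep the OSD id for reuse by marking the failed OSD destroyed
    ceph osd destroy 5 --yes-i-really-mean-it

    # 2. On the OSD host, re-create the OSD with the same id via ceph-volume
    ceph-volume lvm prepare --osd-id 5 --data /dev/nvme0n1

    # 3. Have cephadm pick up and start the newly prepared OSD
    ceph cephadm osd activate depressa008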
Updating Severity accordingly.

5.1; as discussed, it would be great to have this in z1.

fix pr: https://github.com/ceph/ceph/pull/41328
pacific backport: https://github.com/ceph/ceph/pull/41463
cherry picked to: ceph-5.0-rhel-patches

I think there is a misunderstanding of how replacing an OSD works. When an OSD is removed and marked to be replaced with 'ceph orch osd rm <osd_id> --replace', the OSD that replaces it must be on the same host as the OSD that was removed. In your testing in comments #c25 and #c26 you are removing an OSD on one host and then adding another OSD on a different host. For example, in #c25 you removed and marked osd.0 to be replaced; that OSD was on host depressa008. You then created a new OSD on host depressa011, and that became osd.15. That is the expected behavior. To replace osd.0 you would have had to create another OSD on depressa008.

When I tested OSD replacement, it worked as expected when you remove and mark an OSD to be replaced on one host and then add another OSD on that same host. Moving this back to ON_QA to be tested correctly.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat Ceph Storage 5.0 Bug Fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4105