Bug 1724428
| Summary: | The "host" signature in "ceph osd status" remains unchanged on moving an OSD disk from failed node to a new node (workaround: mgr restart) | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Neha Berry <nberry> |
| Component: | RADOS | Assignee: | Neha Ojha <nojha> |
| Status: | CLOSED ERRATA | QA Contact: | Manohar Murthy <mmurthy> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.0 | CC: | bniver, ceph-eng-bugs, dzafman, ebenahar, etamir, hyelloji, jdurgin, kchai, madam, mkasturi, nojha, ocs-bugs, owasserm, prsurve, ratamir, sostapov, suprasad, tserlin |
| Target Milestone: | rc | Keywords: | AutomationBackLog |
| Target Release: | 4.1 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | ceph-14.2.8-7.el8, ceph-14.2.8-6.el7 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-05-19 17:30:41 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Neha Berry
2019-06-27 05:19:36 UTC
As I understand it, the OSD does not automatically change where it sits in the CRUSH map, since doing so can cause data movement. Rook does not control the OSD's CRUSH location beyond making sure the host is set correctly in the OSD context. @Josh, what is your expectation for OSDs moving to a different node? How should the tree be updated?

Neha, I was able to reproduce this and found that the inconsistency is resolved after I restart the ceph manager pod. This means it is likely an issue in which the ceph-mgr cache is not being invalidated. It looks very similar to http://tracker.ceph.com/issues/40011 / https://bugzilla.redhat.com/show_bug.cgi?id=1705464.

Before ceph-mgr restart:

```
[nwatkins@smash rook]$ kubectl -n rook-ceph exec -it rook-ceph-tools-7cf4cc7568-kz4q6 ceph osd status
+----+---------+-------+-------+--------+---------+--------+---------+-----------+
| id |   host  |  used | avail | wr ops | wr data | rd ops | rd data |   state   |
+----+---------+-------+-------+--------+---------+--------+---------+-----------+
| 0  | worker1 | 1027M | 8188M |    0   |     0   |    0   |     0   | exists,up |
| 1  | worker0 | 1027M | 8188M |    0   |     0   |    0   |     0   | exists,up |
+----+---------+-------+-------+--------+---------+--------+---------+-----------+

[nwatkins@smash rook]$ kubectl -n rook-ceph exec -it rook-ceph-tools-7cf4cc7568-kz4q6 ceph osd tree
ID CLASS WEIGHT  TYPE NAME        STATUS REWEIGHT PRI-AFF
-1       0.01758 root default
-5       0.01758     host worker0
 0   hdd 0.00879         osd.0        up  1.00000 1.00000
 1   hdd 0.00879         osd.1        up  1.00000 1.00000
```

After ceph-mgr restart:

```
[nwatkins@smash rook]$ kubectl -n rook-ceph exec -it rook-ceph-tools-7cf4cc7568-kz4q6 ceph osd status
+----+---------+-------+-------+--------+---------+--------+---------+-----------+
| id |   host  |  used | avail | wr ops | wr data | rd ops | rd data |   state   |
+----+---------+-------+-------+--------+---------+--------+---------+-----------+
| 0  | worker0 | 1027M | 8188M |    0   |     0   |    0   |     0   | exists,up |
| 1  | worker0 | 1027M | 8188M |    0   |     0   |    0   |     0   | exists,up |
+----+---------+-------+-------+--------+---------+--------+---------+-----------+
```

Here is an upstream tracker issue for this: http://tracker.ceph.com/issues/40871

Component changed to ceph per @Noah's analysis.

Upstream backport fix posted at https://github.com/ceph/ceph/pull/30624.

@Neha

> # ceph version
> ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus (stable)

I just realized we were testing a community release. Unlike downstream releases, we cannot cherry-pick into our own branch at will; upstream has its own release schedule. How can you test a release that has not shipped yet? Could you shed some light on this?

Docs bug for adding the MGR restart to the procedure for moving an OSD disk between nodes: bug 1789436.

*** Bug 1776750 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:2231

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.
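For reference, the stale-cache symptom and the mgr-restart workaround discussed above can be checked from the command line. The snippet below is a minimal sketch, not a prescribed procedure: it assumes a Rook-deployed cluster using Rook's usual defaults (namespace `rook-ceph`, mgr pod label `app=rook-ceph-mgr`), that `jq` is available, and it uses OSD IDs 0 and 1 purely as examples; adjust names to your environment.

```sh
# 1. Spot the stale entry: the hostname recorded in the OSD metadata
#    (kept by the monitors) should match the "host" column printed by
#    "ceph osd status" from the mgr's cache.
for id in 0 1; do
    echo "osd.${id} metadata hostname: $(ceph osd metadata ${id} -f json | jq -r .hostname)"
done
ceph osd status

# 2. Workaround: restart the ceph-mgr so it rebuilds its cache.
#    In a Rook cluster, deleting the mgr pod lets Kubernetes recreate it
#    (namespace and label are Rook's defaults and may differ).
kubectl -n rook-ceph delete pod -l app=rook-ceph-mgr

# 3. Re-check: the "host" column should now reflect the OSD's new node.
ceph osd status
```

On a non-containerized deployment, restarting the active mgr daemon with systemd (for example `systemctl restart ceph-mgr.target` on the node running the active mgr) should have the same effect of refreshing the cache.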