Description of problem:
-----------------------
Rook CSI Ceph setup = latest Rook + CSI setup (after PR https://github.com/rook/rook/pull/3324 was merged).

When a node with an OSD is removed permanently from the OCP cluster and the corresponding OSD disk is attached to a new OCP node (through the AWS console), the OSD is indeed recovered, but the "host" name under "ceph osd status" still lists the hostname of the failed node instead of the new node. Details of the steps performed are added in the next comment.

Snip of output which shows the inconsistency between the host name from "oc get pods" and "ceph osd status":

sh-4.2# ceph osd status
+----+--------------------------------------------+-------+-------+--------+---------+--------+---------+-----------+
| id |                    host                    |  used | avail | wr ops | wr data | rd ops | rd data |   state   |
+----+--------------------------------------------+-------+-------+--------+---------+--------+---------+-----------+
| 0  | ip-10-0-149-134.us-east-2.compute.internal | 1025M | 97.9G |    0   |     0   |    1   |    16   | exists,up |
| 1  | ip-10-0-129-70.us-east-2.compute.internal  | 1025M | 97.9G |    0   |     0   |    0   |     0   | exists,up |
| 2  | ip-10-0-161-198.us-east-2.compute.internal | 1025M | 97.9G |    0   |     0   |    1   |    90   | exists,up |
+----+--------------------------------------------+-------+-------+--------+---------+--------+---------+-----------+

$ oc get pods -n openshift-storage -o wide | grep osd-
rook-ceph-osd-0-cd8c5475d-pmzw5    1/1  Running  0  19h  10.129.2.17  ip-10-0-149-134.us-east-2.compute.internal  <none>  <none>
rook-ceph-osd-1-dd6754887-gssh2    1/1  Running  0  14h  10.130.2.9   ip-10-0-132-79.us-east-2.compute.internal   <none>  <none>
rook-ceph-osd-2-64c99cc4c6-hdmph   1/1  Running  0  19h  10.128.2.13  ip-10-0-161-198.us-east-2.compute.internal  <none>  <none>

(Note: the osd-1 pod is running on ip-10-0-132-79, but "ceph osd status" still shows ip-10-0-129-70 for OSD 1.)

How reproducible:
-----------------
100%

Steps to Reproduce:
-------------------
1. Create a Ceph-based Rook CSI cluster using the latest files from the Rook repo (https://github.com/rook/rook), master branch.
2. Remove one of the worker machines which hosts an OSD:
   $ date; time oc delete machine/nberry-jun26-1-mwzjg-worker-us-east-2a-4ppjb -n openshift-machine-api; date
3. After some time, confirm that a new OCP node is spun up and a new machine is added.
4. Once Ceph marks the failed OSD as OUT, attach the "Available" disk from the failed node to the new node in the AWS console.
5. Check the following:
   # oc get pods -o wide -n openshift-storage   ----> osd-x comes up again without any issue
   Within the toolbox:
   # ceph status
   # ceph osd status

Actual results:
===============
When a failed OSD is recovered by re-adding its disk to a new node, its "host" signature still shows the old hostname instead of the new one. This can be misleading.

Expected results:
=================
"ceph osd status" should be updated with the new host name for the re-added OSD.

Version-Release number of selected component (if applicable):
=============================================================

Ceph and Rook versions
----------------------
sh-4.2# ceph version
ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus (stable)
sh-4.2# rook version
rook: v1.0.0-154.g004f795

CSI version
-----------
$ oc logs csi-rbdplugin-provisioner-0 -n openshift-storage -c csi-rbdplugin
I0626 09:40:26.830842  1 cephcsi.go:108] Starting driver type: rbd with name: rbd.csi.ceph.com
I0626 09:40:26.830922  1 rbd.go:104] Driver: rbd.csi.ceph.com version: 1.0.0

$ oc describe pod csi-rbdplugin-provisioner-0 -n openshift-storage | grep -i image
Image:     quay.io/k8scsi/csi-provisioner:v1.2.0
Image ID:  quay.io/k8scsi/csi-provisioner@sha256:0dffe9a8d39c4fdd49c5dd98ca5611a3f9726c012b082946f630e36988ba9f37
Image:     quay.io/k8scsi/csi-attacher:v1.1.1
Image ID:  quay.io/k8scsi/csi-attacher@sha256:e4db94969e1d463807162a1115192ed70d632a61fbeb3bdc97b40fe9ce78c831
Image:     quay.io/k8scsi/csi-snapshotter:v1.1.0
Image ID:  quay.io/k8scsi/csi-snapshotter@sha256:a49e0da1af6f2bf717e41ba1eee8b5e6a1cbd66a709dd92cc43fe475fe2589eb
Image:     quay.io/cephcsi/cephcsi:canary
Image ID:  quay.io/cephcsi/cephcsi@sha256:e832be9790bf12ab74c87217cce1dbd0a2416500e1d9a39af53b3fece414feac

Rook image
----------
$ oc describe pod rook-ceph-operator-5c6fd4b7db-rldcf -n openshift-storage | grep -i image
Image:                      rook/ceph:master
Image ID:                   docker.io/rook/ceph@sha256:4d0057e90c28a7bd8d3c3e9b13df40d0df1847567ef50a2a1e41dcea7ddb1d18
ROOK_CSI_CEPH_IMAGE:        quay.io/cephcsi/cephcsi:canary
ROOK_CSI_REGISTRAR_IMAGE:   quay.io/k8scsi/csi-node-driver-registrar:v1.1.0
ROOK_CSI_PROVISIONER_IMAGE: quay.io/k8scsi/csi-provisioner:v1.2.0
ROOK_CSI_SNAPSHOTTER_IMAGE: quay.io/k8scsi/csi-snapshotter:v1.1.0
ROOK_CSI_ATTACHER_IMAGE:    quay.io/k8scsi/csi-attacher:v1.1.1
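The mismatch above can be spotted mechanically by diffing the host column of "ceph osd status" against the node column of "oc get pods -o wide". A minimal sketch of the parsing half, assuming the Nautilus-era ASCII-table output shown above (the osd_hosts helper name is hypothetical, not part of Rook or Ceph):

```shell
# Hypothetical helper: pull "<osd-id> <host>" pairs out of the ASCII table
# printed by `ceph osd status` on Nautilus. Fields split on '|'; data rows
# are the ones whose first cell is a bare OSD id.
osd_hosts() {
  awk -F'|' '$2 ~ /^ *[0-9]+ *$/ {
    gsub(/ /, "", $2)   # strip padding around the id
    gsub(/ /, "", $3)   # strip padding around the host
    print $2, $3
  }'
}
```

Usage (inside the toolbox pod) would look roughly like `ceph osd status | osd_hosts`, compared against `oc get pods -n openshift-storage -o wide | awk '/rook-ceph-osd-/ {print $1, $7}'`.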
As I understand it, the OSD does not automatically change its location in the CRUSH map, since that could cause data movement. Rook isn't controlling the setting of the OSD crush location other than making sure the host is set correctly in the OSD context. @Josh, what is your expectation for OSDs moving to a different node? How should the tree be updated?
Neha, I was able to reproduce this and found that the inconsistency is resolved after I restart the ceph-mgr pod. This means it is likely an issue in which the ceph-mgr cache is not being invalidated. It looks very similar to http://tracker.ceph.com/issues/40011 / https://bugzilla.redhat.com/show_bug.cgi?id=1705464.

Before ceph-mgr restart:

[nwatkins@smash rook]$ kubectl -n rook-ceph exec -it rook-ceph-tools-7cf4cc7568-kz4q6 ceph osd status
+----+---------+-------+-------+--------+---------+--------+---------+-----------+
| id |   host  |  used | avail | wr ops | wr data | rd ops | rd data |   state   |
+----+---------+-------+-------+--------+---------+--------+---------+-----------+
| 0  | worker1 | 1027M | 8188M |    0   |     0   |    0   |     0   | exists,up |
| 1  | worker0 | 1027M | 8188M |    0   |     0   |    0   |     0   | exists,up |
+----+---------+-------+-------+--------+---------+--------+---------+-----------+

[nwatkins@smash rook]$ kubectl -n rook-ceph exec -it rook-ceph-tools-7cf4cc7568-kz4q6 ceph osd tree
ID CLASS WEIGHT  TYPE NAME        STATUS REWEIGHT PRI-AFF
-1       0.01758 root default
-5       0.01758     host worker0
 0   hdd 0.00879         osd.0        up  1.00000 1.00000
 1   hdd 0.00879         osd.1        up  1.00000 1.00000

After ceph-mgr restart:

[nwatkins@smash rook]$ kubectl -n rook-ceph exec -it rook-ceph-tools-7cf4cc7568-kz4q6 ceph osd status
+----+---------+-------+-------+--------+---------+--------+---------+-----------+
| id |   host  |  used | avail | wr ops | wr data | rd ops | rd data |   state   |
+----+---------+-------+-------+--------+---------+--------+---------+-----------+
| 0  | worker0 | 1027M | 8188M |    0   |     0   |    0   |     0   | exists,up |
| 1  | worker0 | 1027M | 8188M |    0   |     0   |    0   |     0   | exists,up |
+----+---------+-------+-------+--------+---------+--------+---------+-----------+

Here is an upstream tracker issue for this: http://tracker.ceph.com/issues/40871
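For anyone hitting this before the fix lands, the workaround described above (restarting ceph-mgr so its cached host mapping is rebuilt) can be done with a pod delete; this sketch assumes the usual Rook mgr pod label app=rook-ceph-mgr and the default rook-ceph namespace, and relies on the mgr deployment recreating the pod:

```shell
# Workaround sketch: force a ceph-mgr restart so the stale host cache is
# rebuilt. The mgr deployment recreates the pod automatically.
kubectl -n rook-ceph delete pod -l app=rook-ceph-mgr

# Wait for the replacement mgr pod to come up, then re-check from the toolbox:
kubectl -n rook-ceph wait --for=condition=Ready pod -l app=rook-ceph-mgr --timeout=120s
#   ceph osd status   # host column should now show the current node
```

Note this only refreshes the reported hostname; it does not move the OSD in the CRUSH tree.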
Component changed to ceph per @Noah's analysis.
The upstream backport fix is posted at https://github.com/ceph/ceph/pull/30624.

@Neha
> # ceph version
> ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus (stable)

I just realized we were testing a community release. Unlike downstream releases, we cannot cherry-pick into our own branch at will; upstream has its own release schedule. How can you test a fix that has not yet been released? Could you shed some light on this?
Docs bug for adding the restart of MGR to the procedure of moving the OSD disk between nodes - bug 1789436
*** Bug 1776750 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:2231
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days