Description of problem:
After replacing the disk, `ceph osd tree` still shows the old disk entry.

Reference document: https://gitlab.cee.redhat.com/jowilkin/red-hat-ceph-storage-administration-guide/blob/v1.3/replace-osds.adoc

Version-Release number of selected component (if applicable):
1.3.1
ceph-0.94.3-2.el7cp.x86_64

How reproducible:
NA

Steps to Reproduce:
1. Followed the document to replace the failed drive with the new drive.
2. Output of `ceph osd tree` after completing steps 1-6:

ID WEIGHT   TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 29.42995 root default
-2  7.62997     host cephqe5
 0  1.09000         osd.0        up  1.00000          1.00000
 1  1.09000         osd.1        up  1.00000          1.00000
 2  1.09000         osd.2        up  1.00000          1.00000
 3  1.09000         osd.3        up  1.00000          1.00000
 4  1.09000         osd.4        up  1.00000          1.00000
 5  1.09000         osd.5        up  1.00000          1.00000
 6  1.09000         osd.6        up  1.00000          1.00000
-3  8.71999     host cephqe6
 8  1.09000         osd.8        up  1.00000          1.00000
 9  1.09000         osd.9        up  1.00000          1.00000
10  1.09000         osd.10       up  1.00000          1.00000
11  1.09000         osd.11       up  1.00000          1.00000
12  1.09000         osd.12       up  1.00000          1.00000
13  1.09000         osd.13       up  1.00000          1.00000
14  1.09000         osd.14       up  1.00000          1.00000
15  1.09000         osd.15       up  1.00000          1.00000
-5  6.53999     host cephqe8
24  1.09000         osd.24       up  1.00000          1.00000
25  1.09000         osd.25       up  1.00000          1.00000
26  1.09000         osd.26       up  1.00000          1.00000
27  1.09000         osd.27       up  1.00000          1.00000
28  1.09000         osd.28       up  1.00000          1.00000
29  1.09000         osd.29       up  1.00000          1.00000
-4  6.53999     host NEW
16  1.09000         osd.16       up  1.00000          1.00000
17  1.09000         osd.17       up  1.00000          1.00000
18  1.09000         osd.18       up  1.00000          1.00000
19  1.09000         osd.19       up  1.00000          1.00000
20  1.09000         osd.20       up  1.00000          1.00000
21  1.09000         osd.21       up  1.00000          1.00000

3. Then executed step 8, i.e. recreate the OSD: `ceph osd create` returned 7 (the proper OSD number, as osd.7 was removed as part of replacing the failed disk).
4. Then added the new OSD following the document. The OSD gets added properly, but the newly added OSD gets a new OSD number (osd.22), and the older OSD (osd.7) shows as down. Output of `ceph osd tree`:

ID WEIGHT   TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 30.51994 root default
-2  8.71997     host cephqe5
 0  1.09000         osd.0        up  1.00000          1.00000
 1  1.09000         osd.1        up  1.00000          1.00000
 2  1.09000         osd.2        up  1.00000          1.00000
 3  1.09000         osd.3        up  1.00000          1.00000
 4  1.09000         osd.4        up  1.00000          1.00000
 5  1.09000         osd.5        up  1.00000          1.00000
 6  1.09000         osd.6        up  1.00000          1.00000
22  1.09000         osd.22       up  1.00000          1.00000
-3  8.71999     host cephqe6
 8  1.09000         osd.8        up  1.00000          1.00000
 9  1.09000         osd.9        up  1.00000          1.00000
10  1.09000         osd.10       up  1.00000          1.00000
11  1.09000         osd.11       up  1.00000          1.00000
12  1.09000         osd.12       up  1.00000          1.00000
13  1.09000         osd.13       up  1.00000          1.00000
14  1.09000         osd.14       up  1.00000          1.00000
15  1.09000         osd.15       up  1.00000          1.00000
-5  6.53999     host cephqe8
24  1.09000         osd.24       up  1.00000          1.00000
25  1.09000         osd.25       up  1.00000          1.00000
26  1.09000         osd.26       up  1.00000          1.00000
27  1.09000         osd.27       up  1.00000          1.00000
28  1.09000         osd.28       up  1.00000          1.00000
29  1.09000         osd.29       up  1.00000          1.00000
-4  6.53999     host NEW
16  1.09000         osd.16       up  1.00000          1.00000
17  1.09000         osd.17       up  1.00000          1.00000
18  1.09000         osd.18       up  1.00000          1.00000
19  1.09000         osd.19       up  1.00000          1.00000
20  1.09000         osd.20       up  1.00000          1.00000
21  1.09000         osd.21       up  1.00000          1.00000
 7        0         osd.7      down        0          1.00000

Actual results:
osd.7 is showing as down, and the newly added OSD is assigned a new id (osd.22) instead of osd.7.

Expected results:
The newly added OSD should have been osd.7.

Additional info:
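For context, `ceph osd create` is expected to hand out the lowest OSD id not already present in the osdmap. A minimal sketch of that allocation rule (a simplified illustration, not Ceph's actual allocator):

```python
def next_osd_id(existing_ids):
    """Return the lowest non-negative id not already in the osdmap,
    mirroring how `ceph osd create` picks the next free slot."""
    i = 0
    while i in existing_ids:
        i += 1
    return i

# After osd.7 is fully removed, 7 is the lowest free slot
# (this cluster has no osd.22 or osd.23):
ids_after_removal = set(range(30)) - {7, 22, 23}
print(next_osd_id(ids_after_removal))  # 7

# But if an entry for osd.7 still exists in the osdmap when the new
# disk is prepared, allocation skips ahead to 22:
ids_with_stale_7 = set(range(22)) | {24, 25, 26, 27, 28, 29}
print(next_osd_id(ids_with_stale_7))  # 22
```

Under this rule, getting osd.22 instead of osd.7 implies that an osd.7 entry was already back in the osdmap by the time the new OSD was prepared.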
Sam, mind looking into this one (or re-assigning as appropriate)? Is this a bug in the docs (replace-osds.adoc), or something else?
A few more pieces of information: `ceph -s` output shows 29 OSDs, but only 28 up and in. Snippet:

osdmap e695: 29 osds: 28 up, 28 in

The Calamari GUI also shows wrong information: OSD 28/29 In & Up, 1 down. PFA the screenshots (Dashboard and OSD Workbench).
Created attachment 1087142 [details] Calamari_GUI
Created attachment 1087143 [details] Calamari_Dashboard
After restarting the newly added OSD, I am seeing an I/O error, as it is still pointing to the ceph-7 directory. osd.22 does get started, though. PFA the log of osd.22:

[root@cephqe5 ~]# /etc/init.d/ceph stop osd.22
find: ‘/var/lib/ceph/osd/ceph-7’: Input/output error
=== osd.22 ===
Stopping Ceph osd.22 on cephqe5...kill 66039...kill 66039...done
[root@cephqe5 ~]# /etc/init.d/ceph start osd.22
find: ‘/var/lib/ceph/osd/ceph-7’: Input/output error
=== osd.22 ===
create-or-move updated item name 'osd.22' weight 1.09 at location {host=cephqe5,root=default} to crush map
Starting Ceph osd.22 on cephqe5...
Running as unit run-71865.service.
Created attachment 1087157 [details] Log
I don't know why osd.7 wasn't found as the next open slot. I have not been able to reproduce this on v0.94.3; I even did rm/create multiple times in a tight loop to check for a race condition. If the customer created a new OSD before removing the old one, that would explain this. So would the customer creating two new OSDs (osd.7 and osd.22) and removing osd.7 a second time.
Previous comment not important.
Based on the bug description, the instructions are wrong because ceph-deploy must also be doing a "ceph osd create". New instructions are pending, and I've noted a concern that the new instructions might lead to the same problem, along with how to fix it. I've added a comment to https://bugzilla.redhat.com/show_bug.cgi?id=1210539 and am marking this as a duplicate.

*** This bug has been marked as a duplicate of bug 1210539 ***
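The failure mode described here, the documented manual `ceph osd create` plus a second create issued by ceph-deploy, can be sketched as follows (a simplified model of lowest-free-id allocation, not Ceph's actual code):

```python
def osd_create(osdmap):
    """Allocate the lowest free OSD id and record it in the map
    (simplified model of `ceph osd create`)."""
    i = 0
    while i in osdmap:
        i += 1
    osdmap.add(i)
    return i

# osdmap of this cluster after osd.7 was removed per replace-osds.adoc:
osdmap = (set(range(22)) - {7}) | {24, 25, 26, 27, 28, 29}

manual_id = osd_create(osdmap)  # step 8 of the doc: returns 7
deploy_id = osd_create(osdmap)  # ceph-deploy's own create: returns 22

print(manual_id, deploy_id)  # 7 22
```

The manually created osd.7 never gets a daemon or a data directory, so it lingers as a down entry with weight 0, matching the `ceph osd tree` output in the bug description.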