Bug 1210539
Summary: | Replacing failed disks on CEPH nodes | |
---|---|---|---
Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Vasu Kulkarni <vakulkar>
Component: | Documentation | Assignee: | John Wilkins <jowilkin>
Status: | CLOSED WONTFIX | QA Contact: | ceph-qe-bugs <ceph-qe-bugs>
Severity: | unspecified | Docs Contact: |
Priority: | unspecified | |
Version: | 1.3.0 | CC: | anharris, dzafman, ealcaniz, flucifre, hnallurv, jowilkin, kdreyer, khartsoe, shmohan, vashastr, vikumar, vumrao
Target Milestone: | rc | Keywords: | Reopened
Target Release: | 1.3.4 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2018-02-20 20:59:06 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Vasu Kulkarni
2015-04-10 02:27:17 UTC
I'm targeting this to 1.3.0. John, please feel free to re-target if that's not appropriate.

Will address after 1.3. I will need physical hardware for this procedure. Also, we should ask whether we want this in generally available documentation or as a kb article.

Hi John,

IMO, this should go into the GA documents. You may want to have a section "replacing Ceph hardware components" to describe the replacement procedures for all Ceph components. Please note that QE is not going to test this defect for the 1.3.0 RHEL release.

Regards,
Harish

(In reply to John Wilkins from comment #3)
> Will address after 1.3. I will need physical hardware for this procedure.
> Also, we should ask whether we want this in generally available
> documentation or as a kb article.

Reading replace-osds.adoc, I don't see anything that specifically requires a physical host. Can this be tested with a VM?

Re-opening this bug as I see a minor documentation issue in point number 9:

9. From your admin node, find the OSD drive and zap it.

    ceph-deploy drive list <node-name>
    ceph-deploy drive zap <node-name>:</path/to/drive>

It should be disk, not drive. Correct commands:

    ceph-deploy disk list cephqe5.lab.eng.blr.redhat.com
    ceph-deploy disk zap cephqe5.lab.eng.blr.redhat.com:/dev/sdj

Marking it Verified.

Why no "ceph-deploy --overwrite-conf osd activate <node-name>:</path/to/drive>" as in the previous documentation? Using the older documentation we've seen that the "ceph osd create" is redundant and causes the OSD number to be wrong. I think step 10 should be removed unless we know that ceph-deploy as given here will NOT do the create again.

*** Bug 1275631 has been marked as a duplicate of this bug. ***

David,

What kind of documentation changes will be required? Can you let John Wilkins know, so that he can update the document?

This bug is assigned to John, so I assumed my comment above indicates what needs to be changed and re-tested. Tanay, can you test the steps above skipping the "ceph osd create" (step 10)?

Referring to the document below, "ceph osd create" is in step number 8:

8. Recreate the OSD.

    ceph osd create

https://gitlab.cee.redhat.com/jowilkin/red-hat-ceph-storage-administration-guide/blob/v1.3/replace-osds.adoc

I hope you meant this.

I've removed the ceph osd create step, and placed the disk zap command before activate. Please retest. See

https://gitlab.cee.redhat.com/jowilkin/red-hat-ceph-storage-administration-guide/commit/6e5f3704f1842bb38d0ccb56fc1d9d422581efe7

I am still unable to make it run successfully.

The problem, I feel, is that after I delete the OSD entries from CRUSH (running steps 1-6), I still see the OSD being mounted:

    /dev/sda1 on /var/lib/ceph/osd/ceph-6 type xfs (rw,noatime,seclabel,attr2,inode64,noquota)

Now, once I replace the drive with a new drive, activating fails because the new drive which I put in the server is not getting /dev/sda; rather, it's getting some new drive letter. (Is this the real problem?)

If yes: the question is how we can control the newly added drive getting the same drive letter.

Another correction in the document: step 5 should be swapped with step 6 and vice versa.

Hi John,

Did the mentioned steps work for you?

Thanks,
Tanay

(In reply to John Wilkins from comment #16)
> I've removed the ceph osd create step, and placed the disk zap command
> before activate. Please retest. See
>
> https://gitlab.cee.redhat.com/jowilkin/red-hat-ceph-storage-administration-
> guide/commit/6e5f3704f1842bb38d0ccb56fc1d9d422581efe7

Why did you remove the prepare step? I don't think that is right. I'll have Tanay test it out.
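For context, a minimal sketch of the removal phase that the comments refer to as steps 1-6. This is not quoted from replace-osds.adoc; the OSD id (osd.0) and the service command are illustrative assumptions, and the exact ordering in the guide is what the thread debates (auth del before osd rm, plus the umount step added later):

    # On the admin/monitor node: mark the failed OSD out so data rebalances off it
    ceph osd out osd.0

    # On the OSD node: stop the OSD daemon (the exact service command depends on the init system)
    sudo service ceph stop osd.0

    # Remove the OSD from the CRUSH map
    ceph osd crush remove osd.0

    # Delete the OSD's authentication key, then remove the OSD itself
    ceph auth del osd.0
    ceph osd rm osd.0

    # Unmount the failed drive's data directory (the umount step later added to the guide)
    sudo umount /var/lib/ceph/osd/ceph-0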
Tanay, I don't have a running cluster today, so I'm not able to test it.

I've added a umount step, because the failed drive is mounted. That was a problem in https://bugzilla.redhat.com/show_bug.cgi?id=1278558 too. So we do need to umount.

https://gitlab.cee.redhat.com/jowilkin/red-hat-ceph-storage-administration-guide/commit/24945382a649c830f43dda644c92f9cd75a302f2

(In reply to David Zafman from comment #18)
> (In reply to John Wilkins from comment #16)
> > I've removed the ceph osd create step, and placed the disk zap command
> > before activate. Please retest. See
> >
> > https://gitlab.cee.redhat.com/jowilkin/red-hat-ceph-storage-administration-
> > guide/commit/6e5f3704f1842bb38d0ccb56fc1d9d422581efe7
>
> Why did you remove the prepare step? I don't think that is right. I'll
> have Tanay test it out.

I don't have a running cluster to test it right now. I'll restore prepare. At one point, prepare was activating an OSD erroneously, and that's why it got omitted.

Prepare restored.

https://gitlab.cee.redhat.com/jowilkin/red-hat-ceph-storage-administration-guide/commit/ab31d98ff947ef92be325ee6b59add990b9f4e4d

(In reply to Tanay Ganguly from comment #17)
> I am still unable to make it run successfully.
>
> The problem, I feel, is that after I delete the OSD entries from CRUSH
> (running steps 1-6), I still see the OSD being mounted:
>
> /dev/sda1 on /var/lib/ceph/osd/ceph-6 type xfs
> (rw,noatime,seclabel,attr2,inode64,noquota)

I would have thought that if you put a replacement drive into the same physical location in a machine, it would get the same device name. Linux may be moving away from using device names in favor of mounting by UUID, which is the better approach going forward. However, Ceph may not use fstab on some distributions.

> Now, once I replace the drive with a new drive, activating fails because
> the new drive which I put in the server is not getting /dev/sda; rather,
> it's getting some new drive letter. (Is this the real problem?)
>
> If yes: the question is how we can control the newly added drive getting
> the same drive letter.

If you aren't physically changing drives in the system and rebooting, then manually unmount the old partition. Also, remove an fstab entry if one is present on your system. We might need instructions about that.

What if I have an unused drive on a node and want to replace the bad drive (but leave it physically installed)? That would require a drive letter change. So there might be extra steps for that case. If so, we need to figure out what those are. For now, can you work around it by leaving the old drive in place as I've described? Then substitute the new drive letter in the command steps.

> Another correction in the document: step 5 should be swapped with step 6
> and vice versa.

Yes, John should change the order to "ceph auth del" followed by "ceph osd rm".

Tanay,

If the instructions said to "ceph-deploy osd create" the OSD, it should handle the prepare and activate in one step. According to the ceph-deploy documentation, an osd create is the same as prepare then activate. Can you try an extra test using ceph-deploy osd create instead of the ceph-deploy osd activate from step 9, "Recreate the OSD"?

    ceph-deploy --overwrite-conf osd create <node-name>:</path/to/drive>

Please re-test using https://gitlab.cee.redhat.com/jowilkin/red-hat-ceph-storage-administration-guide/blob/ab31d98ff947ef92be325ee6b59add990b9f4e4d/replace-osds.adoc and try again with my suggestion of replacing the prepare/activate with create in current step #10.

David
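For context, a minimal sketch of the recreate phase being discussed, using a host and drive that appear later in the thread (cephqe9, /dev/sdd) purely as examples. The prepare-then-activate sequence reflects the restored guide as described in the comments, and the single create command is the extra test suggested above; the partition passed to activate is an assumption (prepare typically leaves the data partition as the first partition on the drive):

    # From the admin node: list the disks ceph-deploy can see, then zap the replacement drive
    ceph-deploy disk list cephqe9
    ceph-deploy disk zap cephqe9:/dev/sdd

    # Guide's sequence as discussed: prepare the drive, then activate the resulting data partition
    ceph-deploy --overwrite-conf osd prepare cephqe9:/dev/sdd
    ceph-deploy --overwrite-conf osd activate cephqe9:/dev/sdd1

    # Extra test suggested above: create performs prepare + activate in one step
    ceph-deploy --overwrite-conf osd create cephqe9:/dev/sdd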
Thanks, John. I see that the order of steps 5 and 6 is correct now.

It's not working. I followed the new steps mentioned in the document:

https://gitlab.cee.redhat.com/jowilkin/red-hat-ceph-storage-administration-guide/blob/v1.3/replace-osds.adoc

I used osd.0 to be removed as part of this process. I followed steps 1-7 in sequence. After that I followed step 9:

    ceph-deploy disk list <node-name>
    ceph-deploy disk zap <node-name>:</path/to/drive>

I took a new drive rather than replacing the old drive (as that was not feasible now). Then I didn't perform step 10 as written, but, as David suggested, used:

    ceph-deploy --overwrite-conf osd create <node-name>:</path/to/drive>

But ceph osd tree shows:

    # ceph osd tree
    ID WEIGHT   TYPE NAME          UP/DOWN REWEIGHT PRIMARY-AFFINITY
    -1 13.43994 root default
    -2  0.35999     host cephqe11
     1  0.35999         osd.1           up  1.00000          1.00000
    -3  4.35999     host cephqe8
     2  1.09000         osd.2           up  1.00000          1.00000
     3  1.09000         osd.3           up  1.00000          1.00000
     4  1.09000         osd.4           up  1.00000          1.00000
     5  1.09000         osd.5           up  1.00000          1.00000
    -4  4.35999     host cephqe9
     6  1.09000         osd.6           up  1.00000          1.00000
     7  1.09000         osd.7           up  1.00000          1.00000
     8  1.09000         osd.8           up  1.00000          1.00000
     9  1.09000         osd.9           up  1.00000          1.00000
    -5  4.35999     host cephqe10
    10  1.09000         osd.10          up  1.00000          1.00000
    11  1.09000         osd.11          up  1.00000          1.00000
    12  1.09000         osd.12          up  1.00000          1.00000
    13  1.09000         osd.13          up  1.00000          1.00000
     0        0 osd.0                 down        0          1.00000

Some other information: after I added the new drive, I still see osd.0 as down. And, strangely, I see /dev/sdb1 mounted on /var/lib/ceph/osd/ceph-0 again. But I used /dev/sdd as the replacement, so I would expect /dev/sdd to have been mounted.

Tanay,

I wanted you to try the "ceph-deploy ... osd create ..." as an extra test. According to Alfredo there is a RHEL bug with "create." So please retest using the procedure as described.

Thanks,
David

Tanay,

I've asked two other developers to look at the procedure and they think it looks good. Please send me a log of the entire shell session in which you are following these steps. As per my previous comment, use the exact instructions, not my suggestion.

David

Removed the section from the TOC. It will remain in the repo and we can continue to develop and test; then republish when it's ready.

https://gitlab.cee.redhat.com/jowilkin/red-hat-ceph-storage-administration-guide/commit/03c6f66246cfb60aad8b61e3b775baed4bd3eb0e

Targeted for 1.3.2.

Will re-test this in 1.3.3.

@docs team,

Vasu has a request to add a section on replacing even the "system" disks. Please check these comments:

https://bugzilla.redhat.com/show_bug.cgi?id=1210543#c20
https://bugzilla.redhat.com/show_bug.cgi?id=1210543#c22

The comment https://bugzilla.redhat.com/show_bug.cgi?id=1210543#c26 has rough steps to replace a faulty "system" disk.

As this bug is already tracking the replacement of failed drives, I feel Vasu's additional request of replacing "system" disks can be accommodated here. Changing the state of this defect to Assigned to address the above request.

Thanks,
Harish

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.
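For reference, a minimal sketch of the post-replacement checks that the failed test above leans on; the OSD id and mount path are examples from this thread, not steps quoted from the guide:

    # Confirm the replacement OSD was recreated and reports up
    # (in the failed run above, osd.0 stays down with weight 0)
    ceph osd tree
    ceph -s

    # Confirm which device actually got mounted for the OSD's data directory
    mount | grep /var/lib/ceph/osd/ceph-0

    # If device naming after a swap is in doubt, identify partitions by UUID rather than drive letter
    sudo blkid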