Bug 1947482
| Summary: | The device replacement process when deleting the volume metadata needs to be fixed or modified | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Itzhak <ikave> |
| Component: | rook | Assignee: | Travis Nielsen <tnielsen> |
| Status: | CLOSED ERRATA | QA Contact: | Itzhak <ikave> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.7 | CC: | bniver, ebenahar, kramdoss, madam, muagarwa, nberry, ocs-bugs, odf-bz-bot, rcyriac, tnielsen |
| Target Milestone: | --- | | |
| Target Release: | ODF 4.11.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | 4.10.0-175 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-08-24 13:48:17 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Itzhak
2021-04-08 14:54:47 UTC
Additional info: here are some of the outputs after I finished with the device replacement process:

```
$ oc get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM                                               STORAGECLASS                  REASON   AGE
local-pv-16b477a8                          8Gi        RWO            Delete           Bound       openshift-storage/ocs-deviceset-2-metadata-0djkkm   localblock-metadata                    21h
local-pv-27e19a3a                          200Gi      RWO            Delete           Bound       openshift-storage/ocs-deviceset-2-data-0n4g49       localblock                             21h
local-pv-71e9b59c                          200Gi      RWO            Delete           Bound       openshift-storage/ocs-deviceset-0-data-04lgrj       localblock                             120m
local-pv-8b3c930                           200Gi      RWO            Delete           Bound       openshift-storage/ocs-deviceset-1-data-0ctw6g       localblock                             8d
local-pv-ad86817c                          8Gi        RWO            Delete           Bound       openshift-storage/ocs-deviceset-1-metadata-0hkrsl   localblock-metadata                    8d
local-pv-c5440a6f                          8Gi        RWO            Delete           Available                                                       localblock-metadata                    111m
local-pv-d1771d53                          8Gi        RWO            Delete           Bound       openshift-storage/ocs-deviceset-0-metadata-0n4wbl   localblock-metadata                    8d
pvc-ba427b93-4d0b-40b3-90ec-44692abfb7cb   50Gi       RWO            Delete           Bound       openshift-storage/db-noobaa-db-pg-0                 ocs-storagecluster-ceph-rbd            8d

$ oc get pods | grep osd
rook-ceph-osd-1-65876f5857-7kqc6                          2/2   Running     0   8d
rook-ceph-osd-2-98cdf8584-5d9bj                           2/2   Running     0   17h
rook-ceph-osd-prepare-ocs-deviceset-0-data-04lgrj-2f7bq   0/1   Init:0/3    0   32m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0ctw6g-gxm4x   0/1   Completed   0   8d
rook-ceph-osd-prepare-ocs-deviceset-2-data-0n4g49-lfw5z   0/1   Completed   0   17h

$ oc get jobs
NAME                                                COMPLETIONS   DURATION   AGE
rook-ceph-osd-prepare-ocs-deviceset-0-data-04lgrj   0/1           34m        34m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0ctw6g   1/1           13s        8d
rook-ceph-osd-prepare-ocs-deviceset-2-data-0n4g49   1/1           13s        17h
```

As you can see, Ceph did not recognize the newly Available metadata PV after the process finished.

---

Doesn't look like a blocker for 4.8, moving it out.

---

Kesavan:

Hi Mudit,
Currently, I don't have a solution for this and we require someone from the Ceph team.

@Travis: Need help on the device replacement process in the case of a multiple-device deployment (OSD + metadata + wal device). Does Rook already support it?

---

Thanks Kesavan, will request the Ceph team to take a look as well.

---

Travis Nielsen:

(In reply to Kesavan from comment #6)
> Hi Mudit,
> Currently, I don't have a solution for this and we require someone from the
> Ceph team.
>
> @Travis: Need help on the device replacement process in the case of a
> multiple-device deployment (OSD + metadata + wal device). Does Rook already
> support it?

When using a metadata/wal device, you would have to replace all OSDs using the same metadata/wal device, which usually means:
- Replace all OSDs on a node if the metadata device dies
- If a single OSD dies, leave it dead until all OSDs on the node can be wiped and replaced

---

Travis, are you suggesting that the proper process was not followed?

---

Travis Nielsen:

Taking another look, I see that the OSD purge job does not delete the metadata PVC associated with the OSD that is being purged. Rook is only deleting the data PVC, as observed in the bug. In my previous response I must have missed that the OSD purge job was being used.

---

Moving back to 4.10 since we have a fix before dev freeze.

---

Mudit Agarwal:

Should we add doc text?

---

(In reply to Mudit Agarwal from comment #15)
> Should we add doc text?

Do we need doc text for 4.10 issues? Removing an OSD on a cluster with metadata devices is such an uncommon case that I'm not sure it's worth adding doc text.

---

Can we move it to 4.11? I think this BZ is not so important at the moment.

---

Can it be verified with 4.10.z? While the scenario isn't critical, the risk of regression remains until the BZ is verified.

---

Travis already answered. Okay.
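---

For reference, the manual cleanup implied by the diagnosis above would look roughly like the shell sketch below. This is not a documented procedure and not the shipped fix (the fix is the Rook change in 4.10.0-175 that makes the purge job remove the metadata PVC itself); the PVC, PV, and job names are taken from the outputs in the description, and the operator pod selector `app=rook-ceph-operator` is an assumption about how the operator deployment is labeled.

```
# Hedged sketch only: remove the stale metadata PVC that the OSD purge job
# left behind so the operator can consume the new Available metadata PV.
# Names come from the outputs above; adjust them for the affected device set.

# 1. Remove the stuck prepare job for the replaced device set.
oc delete job rook-ceph-osd-prepare-ocs-deviceset-0-data-04lgrj -n openshift-storage

# 2. Delete the orphaned metadata PVC (its matching data PVC was already
#    purged); this releases the old local PV (local-pv-d1771d53 above).
oc delete pvc ocs-deviceset-0-metadata-0n4wbl -n openshift-storage

# 3. Restart the Rook operator so it reconciles and recreates the data and
#    metadata PVCs; the new metadata PVC should bind the Available PV
#    (local-pv-c5440a6f above). The label selector is assumed.
oc delete pod -n openshift-storage -l app=rook-ceph-operator

# 4. Verify that the prepare job completes and a new OSD pod comes up.
oc get jobs -n openshift-storage | grep osd-prepare
oc get pods -n openshift-storage | grep osd
```

With the fixed build, the purge job is expected to remove the associated metadata PVC along with the data PVC, so no manual cleanup of this kind should be needed.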
---

I am moving it to Verified.

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.11.0 security, enhancement, & bugfix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6156