Bug 1973603
| Summary: | OCS doesn't delete the PVC when a node is deleted from the UI, and the PV is stuck in "Terminating" | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Udi Kalifon <ukalifon> |
| Component: | unclassified | Assignee: | Mudit Agarwal <muagarwa> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Raz Tamir <ratamir> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | unspecified | CC: | aos-bugs, bniver, gmeno, jrivera, jsafrane, muagarwa, ocs-bugs, odf-bz-bot, prsurve, sostapov, tmuthami |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-10-12 05:26:55 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Udi Kalifon
2021-06-18 09:53:58 UTC

What PV is Terminating and why? Is it because there is still a PVC bound to it? It should be deleted by OCS. In addition, deleting PVs and PVCs is quite dangerous: what if the deleted node comes back (e.g. it was only down for maintenance)? In the assisted installer it is probably OK to delete them, since the user deliberately deleted the node, but it should not be done in the generic scenario where a node simply disappears from the cluster.

---

More information is required before anything meaningful can be assessed. Please collect full must-gather output for both OCP and OCS.

To Jan's point: yes, PVs typically block if they are bound to a PVC that is currently in use by a Pod. Given what I understand of the situation, you decommissioned a node that had a Ceph OSD Pod running on it, using an LSO volume. If for some reason you did not follow the full OSD removal process, Kubernetes would still think the Pod might be around and would therefore not delete the PVC; the PVC is never freed, which blocks deletion of the PV. Another possibility is that, if the node was not removed gracefully from the cluster, the CSI driver may think it cannot unmount the PV and therefore reports it as Terminating until it eventually succeeds (which it won't).

This raises a few immediate questions:

* Does step 10 as you described it mean that you were able to add a new OSD using the local volume on the new node, and that Ceph is reporting HEALTH_OK?
* Does "ceph osd tree" still show the old OSD? If so, is it up or down?
* Is the old OSD Pod still present when you run "oc get pods"? If not, do the CSI provisioner logs have anything useful to say?

Finally, the following steps are vague and seem to omit a lot of detail:

> 8. Go back to the cloud and find the day2 cluster for your cluster
> 9. Add a new worker to the cluster from the Add hosts tab. The new worker should also have a suitable disk for OCS.
> 10. After the new worker is added, wait for LSO to create a PV for its disk, and see that OCS is claiming this PV

Please describe the full process you used, referencing any specific documentation you followed if needed.

---

Not a 4.8 blocker, please re-target if required.

---

Output of "ceph osd tree":

```
sh-4.4# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME            STATUS  REWEIGHT  PRI-AFF
-1         0.58589  root default
-5         0.19530      host worker-0-0
 1    hdd  0.19530          osd.1          down         0  1.00000
-3         0.19530      host worker-0-1
 0    hdd  0.19530          osd.0            up   1.00000  1.00000
-7         0.19530      host worker-0-2
 2    hdd  0.19530          osd.2            up   1.00000  1.00000
```

In "ceph status" I see HEALTH_WARN. This is because, this time, OCS didn't claim the new LSO block as it did when I previously reported the bug. I will try to find out why and will update the bug.

---

After editing the StorageCluster CR and changing the count to 4 and then to 3, I see this:

```
sh-4.4# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME            STATUS  REWEIGHT  PRI-AFF
-1         0.78119  root default
-5         0.19530      host worker-0-0
 1    hdd  0.19530          osd.1          down         0  1.00000
-3         0.19530      host worker-0-1
 0    hdd  0.19530          osd.0            up   1.00000  1.00000
-7         0.19530      host worker-0-2
 2    hdd  0.19530          osd.2            up   1.00000  1.00000
-9         0.19530      host worker-0-3
 3    hdd  0.19530          osd.3            up   1.00000  1.00000
```

Ceph status still shows HEALTH_WARN (I waited just a few minutes) and the PV of the old node's disk is still stuck in "Terminating". How do I properly release it permanently?

---

No update for a long time, please reopen if the problem still exists.
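
For anyone hitting the same state, the sketch below shows how the "stuck in Terminating" condition can be diagnosed. It is a minimal sketch only: the PV name `local-pv-example` is a placeholder (the actual PV name was not given in this report), and the default `openshift-storage` namespace is assumed. A PV stays in Terminating while its `kubernetes.io/pv-protection` finalizer is present, which in practice means a PVC (and possibly a VolumeAttachment) still references it.

```
# Diagnostic sketch. "local-pv-example" is a placeholder PV name and
# "openshift-storage" is assumed to be the OCS namespace.

# 1. Which PVC (if any) is still bound to the stuck PV?
oc get pv local-pv-example -o jsonpath='{.spec.claimRef.namespace}/{.spec.claimRef.name}{"\n"}'

# 2. Are finalizers still set on the PV? (pv-protection is what keeps it in Terminating.)
oc get pv local-pv-example -o jsonpath='{.metadata.finalizers}{"\n"}'

# 3. Is the old OSD Pod still known to the API server?
oc get pods -n openshift-storage -o wide | grep osd

# 4. Does a VolumeAttachment still reference the PV (i.e. the node was never cleanly detached)?
oc get volumeattachment | grep local-pv-example
```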
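
As for releasing the disk permanently: the documented path is to remove the failed OSD from Ceph first and only then delete the leftover PVC and PV. The sketch below is not verified against this cluster; it assumes osd.1 (the OSD shown as down above) is the one to remove, the default `openshift-storage` namespace, and the `ocs-osd-removal` template shipped with OCS 4.x, whose parameter name has changed between releases (FAILED_OSD_ID in older releases, FAILED_OSD_IDS in newer ones).

```
# Sketch of the failed-OSD removal flow; details vary by OCS release.

# 1. Scale down the deployment of the dead OSD so the operator does not recreate it.
oc scale -n openshift-storage deployment rook-ceph-osd-1 --replicas=0

# 2. Run the OSD removal job, which purges osd.1 from the Ceph cluster.
oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=1 | oc create -f -

# 3. Once "ceph osd tree" no longer lists osd.1, delete the leftover PVC so the
#    local PV is released, then delete (or let LSO reuse) the PV itself.
oc delete pvc -n openshift-storage <pvc-of-the-old-disk>   # placeholder name
oc delete pv <stuck-local-pv>                              # placeholder name
```

If the PV still shows Terminating after its PVC is gone, re-check its finalizers with the previous sketch before forcing anything.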
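
The "changing the count to 4 and then to 3" step corresponds to the `count` field of the StorageCluster's `storageDeviceSets`. A hedged example of that edit is shown below; it assumes the default resource name `ocs-storagecluster` and that the relevant device set is the first entry in the list.

```
# Sketch only: assumes the default StorageCluster name and the first device set.
oc patch storagecluster ocs-storagecluster -n openshift-storage \
  --type json \
  -p '[{"op": "replace", "path": "/spec/storageDeviceSets/0/count", "value": 4}]'
```

Note that raising the count only asks the operator to create additional OSDs; it does not remove an OSD that is already down, which is why osd.1 keeps appearing in "ceph osd tree" until it is explicitly purged.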
What PV is Terminating and why? Is it because there is still PVC bound to it? It should be deleted by OCS. In addition, deleting PVs and PVCs is quite dangerous, what if the deleted node comes back (as it was on maintenance)? In assisted installer it's probably OK to delete them, as user deleted the node, but it should not be done in generic scenario when a node disappears from the cluster. More information is required before anything meaningful can be assessed. Please collect full must-gather output for both OCP and OCS. To Jan's point: Yes, PVs typically block if they are bound to a PVC that's currently in use by a Pod. Given what I understand of the situation, you decommissioned a node that had a Ceph OSD Pod running on it, using an LSO volume. If for some reason you did not follow the full OSD removal process, then Kubernetes would still think the Pod might be around, thus not deleting the PVC, thus not freeing the PVC, thus blocking the deletion of the PV. Another possibility is that if the node was not removed gracefully from the cluster, the CSI drive may think it can't unmount the PV and thus reporting it as Terminating until it eventually succeeds (which it won't!). This raises a few immediate questions: * Does your step 10 you described mean that you were able to add a new OSD using the local volume on the new node, and Ceph is reporting HEALTH_OK? * Does "ceph osd tree" still show the old OSD? If so, is it up or down? * Is the old OSD Pod still present when you do "oc get pods"? If not, do the CSI provisioner logs have anything useful to say? Finally, the following steps are vague and seem to omit a lot of detail: 8. Go back to the cloud and find the day2 cluster for your cluster 9. Add a new worker to the cluster from the Add hosts tab. The new worker should also have a suitable disk for OCS. 10. After the new worker is added, wait for LSO to create a PV for its disk, and see that OCS is claiming this PV Please describe the full process you used, referencing any specific documentation you followed if needed. Not a 4.8 blocker, please re-target if required. Output of "ceph osd tree": sh-4.4# ceph osd tree ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF -1 0.58589 root default -5 0.19530 host worker-0-0 1 hdd 0.19530 osd.1 down 0 1.00000 -3 0.19530 host worker-0-1 0 hdd 0.19530 osd.0 up 1.00000 1.00000 -7 0.19530 host worker-0-2 2 hdd 0.19530 osd.2 up 1.00000 1.00000 In "ceph status" I see HEALTH_WARN. This is because this time, OCS didn't take the new LSO block as it did in the previous time when I reported the bug. I will try to find out why and will update the bug. After editing the StorageCluster CR and changing the count to 4 and then to 3, I see this: sh-4.4# ceph osd tree ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF -1 0.78119 root default -5 0.19530 host worker-0-0 1 hdd 0.19530 osd.1 down 0 1.00000 -3 0.19530 host worker-0-1 0 hdd 0.19530 osd.0 up 1.00000 1.00000 -7 0.19530 host worker-0-2 2 hdd 0.19530 osd.2 up 1.00000 1.00000 -9 0.19530 host worker-0-3 3 hdd 0.19530 osd.3 up 1.00000 1.00000 Ceph status still shows HEALTH_WARN (waited just a few minutes) and the PV of the old node's disk is still stuck in "Terminating". How do I properly release it permanently? No update for a long time, please reopen if the problem still exists. |