@Pratik Can you share kubeconfig?
@Pratik, may I know the use case for deleting the host? New OSDs can be added on the same host, right? Once OSDs are added to that host, Ceph health may return to HEALTH_OK. Also, the commands added in the job template take care of purging the OSD and removing it from the CRUSH map, but they don't deal with managing the host.
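For context, the purge the job template performs usually boils down to something like the following (an illustrative sketch only, not the exact template contents; <OSD_ID> is a placeholder):

  ceph osd out osd.<OSD_ID>                        # stop placing data on the OSD
  ceph osd purge <OSD_ID> --yes-i-really-mean-it   # removes the OSD from the CRUSH map, its auth key, and the OSD map

Note that neither command touches the host bucket itself, which is why it lingers in the CRUSH tree.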
Pratik, just to be clear, the issue is only in the Ceph CRUSH tree, right? I'm not sure I understand why this is a blocker; functionality-wise everything works normally, we "just" have a leftover host in the CRUSH map. In the BZ description you answered YES to "Does this issue impact your ability to continue to work with the product"; could you please elaborate on that? Thanks.
In my opinion the leftover host in the ceph osd tree is not the reason Ceph is in HEALTH_WARN. I deleted the host entry manually in the cluster, but Ceph still remains in HEALTH_WARN. Ideally it should rebalance back to HEALTH_OK.
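To confirm whether the warning is related to the stale host at all, a generic check like this can be used (a diagnostic sketch, not specific to this cluster):

  ceph health detail   # lists the specific warnings behind HEALTH_WARN
  ceph osd tree        # shows whether the stale host bucket is still present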
I think we should go ahead with a new job to remove the host from the CRUSH map once the node has been replaced. Servesha, please go ahead and send a PR on ocs-op with this content: ceph osd crush rm $HOST_TO_REMOVE. Just set -xe when executing the script. That's all we need; Ceph will fail with ENOTEMPTY if the host still has OSDs, so it's safe. Ceph being in a WARN state is a different issue.
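A minimal sketch of what that job script could look like, assuming HOST_TO_REMOVE is passed in as an environment variable (the actual PR may structure this differently):

  #!/bin/bash
  # Trace every command and exit on the first failure.
  set -xe
  # Remove the replaced host bucket from the CRUSH map.
  # Ceph returns ENOTEMPTY if the host still has OSDs, so this cannot remove a live host.
  ceph osd crush rm "$HOST_TO_REMOVE"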
@leseb ack. I will send a PR
This is a corner case for 4.4. Re-adding the node to a different rack doesn't seem like something we should support, or at least it should not be a blocker for 4.4. Let's focus on improving the solution in a future release with this upstream design instead of continuing to hack on the script; the new design has a proposal for fixing this scenario: https://github.com/rook/rook/issues/5258 I do not agree with this being a blocker for 4.4.
This was devel-acked for 4.5 and then moved back to 4.4.z (without explanation), keeping the ack. This was actually NOT a devel_ack for 4.4.1. The patch is yet to be written, so it cannot be in 4.4.1. We need to move this back to OCS 4.5.0 and clone it to 4.4.z if we intend to backport (but that won't be 4.4.1). @Elad, moving to 4.5. Please clone to 4.4.z if you want to propose it for further backport.
(In reply to Michael Adam from comment #23) > > @Elad, moving to 4.5. Please clone to 4.4.z if you want to propose for > further backport. Done - bug 1848455
Has any work been done on this? Can we get it done this week?
For 4.5 this was already being tracked with https://bugzilla.redhat.com/show_bug.cgi?id=1841461, which is in POST. Since a clone has already been created for 4.4.z, let's close this.
@Jose Yes. We can get it in this week. It's almost done. The PR is tracked here: https://github.com/openshift/ocs-operator/pull/589
The scenarios are different for this one and https://bugzilla.redhat.com/show_bug.cgi?id=1841461, but the fix would be the same, so we can either close this or dup it to that BZ.
(In reply to Mudit Agarwal from comment #29) > The scenarios are different for this one and > https://bugzilla.redhat.com/show_bug.cgi?id=1841461, but the fix would be the same, so > we can either close this or dup it to that BZ. Even if the root cause or the fix is the same, we should not close BZs if the scenarios for reproducing them / the phenomena observed are different: QE should verify the fix with all the ways in which the BZ was reproduced.
@Servesha, we only move a BZ to MODIFIED when the patch is merged in the downstream code branch. A merge to the master branch is only POST. (NEEDINFO on you just to raise your attention.)
https://github.com/openshift/ocs-operator/pull/608 backport PR
@Michael ack!
Hello Daniel, while reading through the case comments [1], I realized the customer tried other Ceph commands manually but may have missed removing the host from the CRUSH map, which is most likely the primary cause of the problem. Here's the command to do that: ceph osd crush rm $HOST_TO_REMOVE [1] https://access.redhat.com/support/cases/#/case/02679050?commentId=a0a2K00000VLQvSQAX
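For the support case, a sequence along these lines can be used to confirm the host bucket is empty before removing it (a sketch only; $HOST_TO_REMOVE stands for the stale hostname shown in the tree):

  ceph osd tree                        # confirm the stale host bucket has no OSDs under it
  ceph osd crush rm $HOST_TO_REMOVE    # remove the empty host bucket from the CRUSH map
  ceph status                          # re-check cluster health afterwards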
What is the latest status on this one?
https://github.com/openshift/ocs-operator/pull/671 is merged
Thanks, moving it to POST. Needs a downstream PR too.
Backport PR https://github.com/openshift/ocs-operator/pull/672
Backport PR merged, moving it to MODIFIED
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.5.0 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3754