Description of problem (please be detailed as possible and provide log snippets):
If a user mistakenly provides the ID of a running osd to the parameter FAILED_OSD_ID, the running osd can be marked as `out`. Only `down` osds should be taken out of the cluster. The `ceph osd out osd.ID` command marks osds `out` regardless of whether they are up or down. We need a way to fix this so that healthy osds are prevented from being taken out of the cluster.

Version of all relevant components (if applicable):

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?
A workaround is to mark the osd `in` manually (ceph osd in osd.ID). But the toolbox would be needed to run the ceph command.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible? yes

Can this issue reproduce from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Install OCS
2. Note down the osd ID of a running osd
3. Provide that osd ID when running the job template

Actual results:
A healthy osd gets marked as `out` if its ID is passed to the job template

Expected results:
A healthy osd should not be marked as `out`

Additional info:
This seems to be a high-severity issue, as the chances of human error are quite high. It would have been less error-prone if there were a way to get the failed osd IDs automatically.

>> oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_ID=1 | oc create -f -

Shouldn't this be considered a blocker for this release, so that the fix lands in OCS 4.4 itself?
Agree. Implications are very severe from this simple human error. Marking as a blocker
(In reply to Elad from comment #3)
> Agree. Implications are very severe from this simple human error. Marking as
> a blocker

Agreed. Acked.
To be clear, the only consequence of running the template on a healthy OSD is that it's marked `out`. It won't be purged. If this happened, the admin could mark the osd back `in` to fix things. However, this will not be obvious to the user: they will have no idea what an `out` osd even is, and we would also need to document how to mark an osd back `in` if they picked the wrong one. If this were the day before release I wouldn't consider it a blocker, but marking it as a blocker sounds good if it doesn't push the release date out.
Since the probability of someone entering the osd ID of a running osd seemed low, and there is no possible data loss, I didn't see it as a blocker. But I agree, marking it as a blocker certainly sounds about right.
Servesha, what about using "ceph osd info osd.$ID" in your script instead of the tree?
I wouldn't consider it a blocker, as it's reversible, according to https://bugzilla.redhat.com/show_bug.cgi?id=1833317#c5
Hello @leseb,

~~~
NAMESPACE=openshift-storage
FAILED_OSD_ID=2

echo "finding operator pod"
ROOK_OPERATOR_POD=$(oc -n ${NAMESPACE} get pod -l app=rook-ceph-operator -o jsonpath='{.items[0].metadata.name}')

# Find the status (up/down) of the failed osd
osd_status=$(oc exec -it ${ROOK_OPERATOR_POD} -n ${NAMESPACE} -- ceph osd tree --cluster=${NAMESPACE} --conf=/var/lib/rook/${NAMESPACE}/${NAMESPACE}.config --keyring=/var/lib/rook/${NAMESPACE}/client.admin.keyring | grep osd.${FAILED_OSD_ID} | awk '{print $5}')

if [ "$osd_status" == "up" ]
then
  echo "osd is up and running. Please check if you entered the correct ID of the failed osd!"
else
  oc exec -it ${ROOK_OPERATOR_POD} -n ${NAMESPACE} -- ceph osd out osd.${FAILED_OSD_ID} --cluster=${NAMESPACE} --conf=/var/lib/rook/${NAMESPACE}/${NAMESPACE}.config --keyring=/var/lib/rook/${NAMESPACE}/client.admin.keyring
  oc exec -it ${ROOK_OPERATOR_POD} -n ${NAMESPACE} -- ceph osd purge osd.${FAILED_OSD_ID} --force --yes-i-really-mean-it --cluster=${NAMESPACE} --conf=/var/lib/rook/${NAMESPACE}/${NAMESPACE}.config --keyring=/var/lib/rook/${NAMESPACE}/client.admin.keyring
fi
~~~

I tried running the `ceph osd info osd.ID` command in the rook operator pod and then in the toolbox pod, but in both cases it says the command is not valid. The above script is working fine, though. Should I move forward with this script?
That's fine, just go with the tree command then. Thanks.
@leseb ack!
@Servesha, I experimented with your script and ended up with a few modifications. This script was running inside my toolbox, which is in the same environment as will be used in the job template. I would expect all you need for your PR is to change the template command string to a multi-line command with this content: https://gist.github.com/travisn/782cf2ef9f4956d7799d658fca12942c

A few points:
- There are no "oc" commands. We're already in the ceph context, so there is no need for oc.
- With grep it is important to see that multiple OSDs could be selected. For example, if you're purging osd.1, we want to make sure we don't purge osd.11 or osd.12. The best thing I found to prevent that was to add quotes with a space in the grep.
- The extra arguments for the keyring and ceph config aren't necessary since we're already in the ceph context.
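To illustrate the grep pitfall from the second point, here is a minimal, self-contained sketch. The `tree_output` sample below is fabricated to resemble `ceph osd tree` output, not taken from a real cluster:

```shell
#!/bin/sh
# Fabricated sample resembling `ceph osd tree` output (illustrative only).
tree_output=' 1   hdd  0.5  osd.1   up  1.0  1.0
11   hdd  0.5  osd.11  up  1.0  1.0
12   hdd  0.5  osd.12  up  1.0  1.0'

FAILED_OSD_ID=1

# Naive match: the pattern "osd.1" also hits osd.11 and osd.12.
naive=$(printf '%s\n' "$tree_output" | grep -c "osd.${FAILED_OSD_ID}")
echo "naive matches: $naive"   # 3

# Quoted pattern with a trailing space anchors the match to osd.1 only.
exact=$(printf '%s\n' "$tree_output" | grep -c "osd.${FAILED_OSD_ID} ")
echo "exact matches: $exact"   # 1
```

The trailing space inside the quoted pattern is what keeps `osd.1` from matching `osd.11` or `osd.12`, which is exactly why quoting matters in the job template's grep.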
According to the decision in the PgM meeting, moving this one out of OCS 4.4. It is not deemed critical enough to spin another RC for it.
Opened a PR- https://github.com/openshift/ocs-operator/pull/510
Hi @Michael, As discussed in the leads meeting, we agreed to have a fix in 4.4. Moving back to 4.4 unless there are some limitations I'm not aware of
Hi @Servesha and @Travis, Could you add more information about the impact of having one OSD marked `out`? Some questions I have:
- How will it impact the cluster when the OSD is marked `in` again, once the customer understands the mistake?
- What are the consequences of continuing to work with one OSD marked `out`?
- How much time will it take to introduce a fix for this issue?
Thanks!
https://github.com/openshift/ocs-operator/pull/510 - master PR
Hi @Raz,
- In both cases, `osd out` and `osd in`, there will be data movement.
- If an osd is marked `out`, the data inside that osd will be copied to the other osds.
- If the same osd is marked `in`, the data will be backfilled to that osd.
- IMO, even if we keep working with one osd `out`, the consequences will not be severe, since the data is safe. As soon as we bring the `out` osd back `in`, data will start rebalancing and will be copied to the osd, and the PGs will become active+clean.
- I updated PR [1] with the latest changes some time ago.

[1] https://github.com/openshift/ocs-operator/pull/510
Servesha already answered it, thanks
Update: I was testing the template by installing the private OCS build, but for some reason the CSV wasn't getting created. I tested the job manually and it's working fine. I will update the status asap regarding testing of the OCS-created template. Thanks
Instead of blocking the removal of an OSD when a healthy OSD ID is provided, I would prefer to have an extra check where we give a warning message that a healthy OSD is being purged, and then ask the user to confirm they really mean it one more time. There can be use cases where a healthy OSD needs to be removed, so please do not block that completely.
(In reply to Bipin Kunal from comment #27)
> Instead of blocking the removal of OSD when healthy OSD is provided, I would
> prefer to have extra check where we give a warning message that a healthy
> node is being purged and then ask the user to really mean it one more time.
>
> There can be use case where healthy OSD needs to be removed. So please do
> not block that completely.

@Bipin The template doesn't have a way to interact. All it can do is kick off the job to clean up the OSD. If they really want to take down a running OSD, they will need to stop the OSD first before running the template, for example by scaling down the OSD deployment to 0.
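A minimal sketch of that scale-down step, assuming the default Rook deployment naming (`rook-ceph-osd-<ID>`) and the `openshift-storage` namespace; the osd ID `1` here is just an example:

~~~
# Scale the OSD deployment to 0 so the OSD reports down before removal
# (deployment name and namespace are assumptions based on defaults):
oc -n openshift-storage scale deployment rook-ceph-osd-1 --replicas=0

# Then run the removal job template against the now-down OSD:
oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_ID=1 | oc create -f -
~~~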
Hi Travis, I'm not following. The issue is still there if I choose the wrong (healthy) OSD. The main purpose of the fix is to cover the user in case they choose the wrong OSD, so we can't rely on that (the 2nd item in your comment). I don't see anything changed since our last discussion on this, except for removing a healthy OSD (with the additional command in the docs).
@Servesha, there was discussion in the OCS huddle today that the private install of OCS is working again now with the latest master and the new deployment instructions from Umanga. Can you rebase and try the test again?

@Neha It would still be ideal to get this in; really it's just a question of timing now, whether we can validate the fix without continued delay of the release.
@Travis, Okay I can try and test again!
(In reply to Servesha from comment #35)
> @Travis, Okay I can try and test again!

Servesha, it would be great if you could test this first thing tomorrow morning. If it works, we can merge and backport the patch. We are waiting for this to merge in order to build the next and hopefully last RC for OCS 4.4. Thanks!
Hello Michael, I tried to test the code with the new steps. Now the ocs-operator pod is getting created, but it goes into an error state as soon as it is created. Checking on it.
https://github.com/openshift/ocs-operator/pull/531 - backport PR to 4.4 merged.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2393