Bug 1833317 - If healthy osd ID is provided to a job template, it marks running osd out
Summary: If healthy osd ID is provided to a job template, it marks running osd out
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: OCS 4.4.0
Assignee: Servesha
QA Contact: Rachael
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-05-08 12:22 UTC by Servesha
Modified: 2020-09-23 09:05 UTC
CC: 15 users

Fixed In Version: 4.4.0-rc7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-06-04 12:54:40 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift ocs-operator pull 510 0 None closed Avoid running osd being marked out 2020-06-09 14:36:00 UTC
Github openshift ocs-operator pull 531 0 None closed Bug 1833317: Added script to avoid healthy osd from marking out 2020-06-09 14:36:00 UTC
Red Hat Product Errata RHBA-2020:2393 0 None None None 2020-06-04 12:54:46 UTC

Description Servesha 2020-05-08 12:22:04 UTC
Description of problem (please be as detailed as possible and provide log snippets):


If a user mistakenly provides the ID of a running OSD to the parameter FAILED_OSD_ID, the running OSD can be marked as `out`. Only `down` OSDs should be taken out of the cluster.

The `ceph osd out osd.ID` command marks OSDs out regardless of whether they are up or down. We need a way to fix this so that healthy OSDs are not taken out of the cluster.



Version of all relevant components (if applicable):


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?
A workaround is to mark the OSD `in` manually (`ceph osd in osd.ID`), but the toolbox pod would be needed to run the ceph command.
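
For example, a hedged sketch of that workaround (assuming the rook-ceph toolbox pod is deployed in the openshift-storage namespace; the label selector and OSD ID are illustrative):

~~~
# find the toolbox pod (the usual rook-ceph-tools label selector; adjust if different)
TOOLS_POD=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}')
# mark the wrongly marked-out OSD back "in" (replace <ID> with the OSD ID)
oc -n openshift-storage exec "${TOOLS_POD}" -- ceph osd in osd.<ID>
~~~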


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3


Is this issue reproducible?
yes


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install OCS
2. Note down the OSD ID of a running (healthy) OSD
3. Provide that OSD ID as FAILED_OSD_ID when running the job template (see the example below)
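
For example, a hedged sketch of step 3 (assuming the ocs-osd-removal job template in the openshift-storage namespace; the OSD ID 1 here is a placeholder for a healthy OSD's ID):

~~~
# process the removal template with the (mistakenly healthy) OSD ID and create the job
oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_ID=1 | oc create -f -
~~~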


Actual results:

A healthy OSD gets marked as `out` if its ID is passed to the job template.

Expected results:

A healthy OSD should not be marked as `out`.


Additional info:

Comment 2 Neha Berry 2020-05-08 15:39:02 UTC
This seems to be a high-severity issue, as the chances of human error are quite high. It would have been less error-prone if there were a way to get the failed OSD IDs automatically.
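
For illustration only (a hedged sketch, run from the toolbox): recent Ceph releases can filter the OSD tree by state, which would list the candidates directly:

~~~
# show only the OSDs that are currently down, i.e. valid candidates for FAILED_OSD_ID
ceph osd tree down
~~~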


>> oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_ID=1 | oc create -f -  

Shouldn't this be considered a blocker for this release, so the fix lands in OCS 4.4 itself?

Comment 3 Elad 2020-05-08 15:50:00 UTC
Agree. Implications are very severe from this simple human error. Marking as a blocker

Comment 4 Michael Adam 2020-05-08 16:59:08 UTC
(In reply to Elad from comment #3)
> Agree. Implications are very severe from this simple human error. Marking as
> a blocker

Agreed. Acked.

Comment 5 Travis Nielsen 2020-05-08 17:08:45 UTC
To be clear, the consequence of running the template on a healthy OSD is only that it's marked out. It won't be purged. If this happened, the admin could mark the OSD back "in" to fix things. However, this will not be obvious to the user; they will have no idea what an "out" OSD even is, and we would also need to document how to mark an OSD back in if they picked the wrong one.

If this were the day before the release I wouldn't consider it a blocker, but it sounds good as a blocker if it doesn't push the release date out.

Comment 8 Servesha 2020-05-08 20:27:09 UTC
Since the probability of someone entering the OSD ID of a running OSD seemed low, and there is no possible data loss, I didn't see it as a blocker. But I agree! Marking it as a blocker certainly sounds right.

Comment 9 Sébastien Han 2020-05-11 17:13:06 UTC
Servesha, what about using "ceph osd info osd.$ID" in your script instead of the tree?

Comment 10 Eran Tamir 2020-05-12 12:04:03 UTC
I wouldn't consider it a blocker, as it's reversible, according to https://bugzilla.redhat.com/show_bug.cgi?id=1833317#c5

Comment 11 Servesha 2020-05-12 12:10:48 UTC
Hello @leseb,

~~~
NAMESPACE=openshift-storage
FAILED_OSD_ID=2
echo "finding operator pod"
ROOK_OPERATOR_POD=$(oc -n ${NAMESPACE} get pod -l app=rook-ceph-operator -o jsonpath='{.items[0].metadata.name}')

#Find the status(up/down) of failed osd
osd_status=$(oc exec -it ${ROOK_OPERATOR_POD} -n ${NAMESPACE} --   ceph osd tree  --cluster=${NAMESPACE} --conf=/var/lib/rook/${NAMESPACE}/${NAMESPACE}.config --keyring=/var/lib/rook/${NAMESPACE}/client.admin.keyring |grep osd.${FAILED_OSD_ID}|awk '{print $5}')

if [ "$osd_status" == "up"  ]
then
	echo "osd is up and running. Please check if you entered correct ID of failed osd!"
else
	oc exec -it ${ROOK_OPERATOR_POD} -n ${NAMESPACE} --   ceph osd out osd.${FAILED_OSD_ID}  --cluster=${NAMESPACE} --conf=/var/lib/rook/${NAMESPACE}/${NAMESPACE}.config --keyring=/var/lib/rook/${NAMESPACE}/client.admin.keyring
	oc exec -it ${ROOK_OPERATOR_POD} -n ${NAMESPACE} --   ceph osd purge osd.${FAILED_OSD_ID} --force --yes-i-really-mean-it  --cluster=${NAMESPACE} --conf=/var/lib/rook/${NAMESPACE}/${NAMESPACE}.config --keyring=/var/lib/rook/${NAMESPACE}/client.admin.keyring

fi
~~~

I tried to run the `ceph osd info osd.ID` command in the rook operator pod and then in the toolbox pod, but in both cases it says the command is not valid. The above script is working fine, though. Should I move forward with this script?

Comment 12 Sébastien Han 2020-05-12 13:19:36 UTC
That's fine, just go with the tree command then. Thanks.

Comment 13 Servesha 2020-05-12 13:27:58 UTC
@leseb 

ack!

Comment 14 Travis Nielsen 2020-05-13 20:30:50 UTC
@Servesha, I experimented with your script and ended up with a few modifications. This script was running inside my toolbox, which is the same environment that will be used in the job template. I would expect all you need for your PR is to change the template command string to a multi-line command with this content: https://gist.github.com/travisn/782cf2ef9f4956d7799d658fca12942c

A few points:
- There are no "oc" commands. We're already in the ceph context so there is no need for oc.
- With grep it is important to note that multiple OSDs could be matched. For example, if you're purging osd.1, we want to make sure we don't purge osd.11 or osd.12. The best way I found to prevent that was to add quotes with a trailing space in the grep (see the sketch below).
- The extra arguments for keyring and ceph config aren't necessary since we're already in the ceph context.
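
A minimal sketch of that in-toolbox check (illustrative only; the actual command string is in the gist linked above, so treat the variable names and messages here as placeholders):

~~~
FAILED_OSD_ID=1
# quote the pattern and keep the trailing space so "osd.1 " does not also match osd.11, osd.12, ...
osd_status=$(ceph osd tree | grep "osd.${FAILED_OSD_ID} " | awk '{print $5}')

if [ "${osd_status}" = "up" ]; then
    echo "osd.${FAILED_OSD_ID} is up and running. Please check whether you entered the correct ID of the failed osd!"
else
    # only down OSDs are marked out and purged
    ceph osd out "osd.${FAILED_OSD_ID}"
    ceph osd purge "osd.${FAILED_OSD_ID}" --force --yes-i-really-mean-it
fi
~~~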

Comment 15 Michael Adam 2020-05-14 10:02:47 UTC
according to the decision in the PgM meeting, moving this one out of OCS 4.4.
It is not deemed critical enough to spin another RC for it.

Comment 18 Servesha 2020-05-14 12:09:00 UTC
Opened a PR- https://github.com/openshift/ocs-operator/pull/510

Comment 19 Raz Tamir 2020-05-17 11:26:12 UTC
Hi @Michael,

As discussed in the leads meeting, we agreed to have a fix in 4.4.
Moving back to 4.4 unless there are some limitations I'm not aware of

Comment 22 Raz Tamir 2020-05-18 07:55:54 UTC
Hi @Servesha and @Travis,

Could you add more information about the impact of having 1 OSD marked out?
Some questions I have:
- How will it impact the cluster when the OSD is marked back in, once the customer realizes the mistake?
- What are the consequences of continuing to work with 1 OSD marked out?
- How much time will it take to introduce a fix for this issue?


Thanks!

Comment 23 Michael Adam 2020-05-18 16:07:37 UTC
https://github.com/openshift/ocs-operator/pull/510 - master PR

Comment 24 Servesha 2020-05-18 16:44:37 UTC
Hi @Raz,

- In both cases, `osd out` and `osd in`, there will be data movement.
- If an OSD is marked `out`, the data on that OSD will be copied to the other OSDs.
- If the same OSD is marked `in`, the data will be backfilled onto that OSD.
- IMO, even if we keep working in the `osd out` situation, the consequences will not be severe, since the data is safe. As soon as we bring the `out` OSD back `in`, the data will start rebalancing and will be copied to that OSD, and the PGs will become active+clean (see the commands below).
- I updated the PR [1] with the latest changes some time ago.

[1] https://github.com/openshift/ocs-operator/pull/510
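
For example (a hedged sketch from the toolbox), the data movement and PG states described above can be watched with:

~~~
# overall cluster status; watch the PGs return to active+clean after marking the OSD back in
ceph -s
# per-OSD utilization, to see data moving off of / back onto the affected OSD
ceph osd df
~~~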

Comment 25 Travis Nielsen 2020-05-18 16:56:31 UTC
Servesha already answered it, thanks

Comment 26 Servesha 2020-05-19 20:35:44 UTC
Update: 

I was testing the template by installing the private OCS build, but for some reason the CSV wasn't getting created. I tested the job manually and it's working fine. I will update the status ASAP regarding the testing of the OCS-created template.

Thanks

Comment 27 Bipin Kunal 2020-05-20 17:08:39 UTC
Instead of blocking the removal of the OSD when a healthy OSD ID is provided, I would prefer to have an extra check where we give a warning message that a healthy OSD is being purged and then ask the user to confirm that they really mean it one more time.

There can be use cases where a healthy OSD needs to be removed, so please do not block that completely.

Comment 28 Travis Nielsen 2020-05-20 18:45:30 UTC
(In reply to Bipin Kunal from comment #27)
> Instead of blocking the removal of the OSD when a healthy OSD ID is provided,
> I would prefer to have an extra check where we give a warning message that a
> healthy OSD is being purged and then ask the user to confirm that they really
> mean it one more time.
> 
> There can be use cases where a healthy OSD needs to be removed, so please do
> not block that completely.

@Bipin The template doesn't have a way to interact. All it can do is kick off the job to clean up the OSD. If they really want to take down a running OSD, they will need to first stop the OSD before running the template, for example by scaling down the OSD deployment to 0.
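
For instance, a hedged sketch of that sequence (the rook-ceph-osd-<ID> deployment name follows the usual Rook naming convention, and the ID 1 is a placeholder):

~~~
# first stop the running OSD by scaling its deployment to zero ...
oc -n openshift-storage scale deployment rook-ceph-osd-1 --replicas=0
# ... then kick off the removal job for that same ID
oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_ID=1 | oc create -f -
~~~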

Comment 31 Raz Tamir 2020-05-22 10:16:35 UTC
Hi Travis,

I'm not following.
The issue is still there if I choose the wrong (healthy) OSD.
The main purpose of the fix is to cover the user in case they choose the wrong OSD, so we can't rely on that (2nd item in your comment).

I don't see anything changed since our last discussion on this, except for removing a healthy OSD (with an additional command in the docs).

Comment 34 Travis Nielsen 2020-05-22 14:11:07 UTC
@Servesha, there was discussion in the OCS huddle today that the private install of OCS is working again now with the latest master and the new deployment instructions from Umanga. Can you rebase and try the test again?

@Neha It would still be ideal to get this in; it's really just a question of whether we can validate the fix without further delaying the release.

Comment 35 Servesha 2020-05-22 17:01:22 UTC
@Travis, Okay I can try and test again!

Comment 36 Michael Adam 2020-05-25 22:40:04 UTC
(In reply to Servesha from comment #35)
> @Travis, Okay I can try and test again!

Servesha, it would be great if you could test this first thing tomorrow morning.
If it works, we can merge and backport the patch.
We are waiting for this to merge in order to build the next and hopefully last RC for OCS 4.4.

Thanks!

Comment 37 Servesha 2020-05-26 11:25:35 UTC
Hello Michael, 

Tried to test the code with the new steps. Now the ocs-operator pod is getting created, but it goes into an error state as soon as it is created. Checking on it.

Comment 39 Michael Adam 2020-05-26 21:11:50 UTC
https://github.com/openshift/ocs-operator/pull/531 - backport PR to 4.4 merged.

Comment 46 errata-xmlrpc 2020-06-04 12:54:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2393

