Red Hat Bugzilla – Bug 1464958
[RFE] Don't mark down OSDs out by default on small clusters
Last modified: 2018-05-04 18:14:49 EDT
For example, on a three-node cluster it is rarely useful or sensible to mark an OSD out if there's any chance at all that it could come back online.
On a cluster with a replica count of 2, it would make sense to mark OSDs out; otherwise you'd be left with only one copy, perhaps until an admin manually intervenes. There are likely other nuances here...
We currently have two options that govern this:
1. mon_osd_min_in_ratio = .75
We stop marking OSDs out once only 75% of the existing OSDs are still in.
2. mon_osd_down_out_subtree_limit = rack
If an entire $subtree (rack by default) is down, we won't mark any of its devices out.
I think the question is what size of cluster we are talking about, and whether we can just tweak these values to get the behavior we want for it. If we rely on the above options, knowing that it is a 3-node cluster doesn't tell us much; we also need to know how many OSDs there are per node. For a "normal" 3-node cluster with 12 disks per node, that's 36 OSDs, which would require a mon_osd_min_in_ratio of .98 to prevent any OSDs from being marked out. I'm not sure that value makes sense for larger clusters.
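To make the arithmetic above easier to check, here is a rough Python sketch. It assumes a simplified model in which an OSD may only be marked out if the fraction of OSDs still in stays at or above mon_osd_min_in_ratio afterwards; this is an illustration of the ratio math, not the monitor's actual code path, and the function name max_markable_out is hypothetical.

```python
def max_markable_out(total_osds: int, min_in_ratio: float) -> int:
    """Return the largest k such that (total_osds - k) / total_osds >= min_in_ratio.

    Simplified model (an assumption, not Ceph's implementation): marking one
    more OSD out is allowed only if the resulting "in" fraction does not fall
    below min_in_ratio.
    """
    k = 0
    # Keep marking OSDs out while the in-fraction after the next one is still acceptable.
    while k < total_osds and (total_osds - (k + 1)) / total_osds >= min_in_ratio:
        k += 1
    return k


if __name__ == "__main__":
    total = 3 * 12  # 3 nodes with 12 OSDs each, as in the example above
    for ratio in (0.75, 0.9, 0.98):
        print(f"min_in_ratio={ratio}: up to {max_markable_out(total, ratio)} OSDs out")
```

Under this model, 36 OSDs allow 9 OSDs out at .75, 3 at .9, and none at .98, since 35/36 is roughly .972.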
On the other hand, even on a 3-node cluster, I think we *do* want to allow a few drives to be marked out (such that the data will rebuild on other drives in the same host).
My suggestion is to set mon_osd_min_in_ratio = .9 by default and be done with it. I think that's a reasonable value for *large* clusters too... as soon as 10% of the infrastructure has failed we need operator help...
What do you think?
Also, set mon_osd_down_out_subtree_limit = host for small clusters.