For example, on a three-node cluster it is rarely useful or sensible to mark an OSD out if there is any chance at all that it could come back online.
If it's a replica-count-2 cluster it would make sense to mark OSDs out, though; otherwise you'd be left with only 1 copy, perhaps until an admin manually intervenes. There are likely other nuances here...
We currently have two options that govern this:

1. mon_osd_min_in_ratio = .75 -- we stop automatically marking OSDs out once we drop to only 75% of existing OSDs IN.
2. mon_osd_down_out_subtree_limit = rack -- if an entire subtree of the configured type (rack by default) is down, we won't mark any of its devices out.

I think the question is what size cluster we are talking about, and whether we can just tweak these values to get the behavior we want. "Three nodes" doesn't tell us much without knowing how many OSDs per node, if we rely on the options above. For a "normal" 3-node cluster with 12 disks per node, that's 36 OSDs, which would require a mon_osd_min_in_ratio of .98 to prevent any OSDs from being marked out, and I'm not sure that value makes sense for larger clusters. On the other hand, even on a 3-node cluster I think we *do* want to allow a few drives to be marked out (so that the data rebuilds on other drives in the same host).

My suggestion is to set mon_osd_min_in_ratio = .9 by default and be done with it. I think that's a reasonable value for *large* clusters too: as soon as 10% of the infrastructure has failed we need operator help anyway. What do you think?
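To make the arithmetic concrete, here is roughly how those ratios play out on a 36-OSD cluster (assuming the mon stops auto-marking OSDs out once the IN fraction would drop below the ratio; the exact comparison/rounding in the mon may differ slightly):

    36 * 0.75 = 27     -> up to ~9 OSDs can be auto-marked out before we stop
    36 * 0.90 = 32.4   -> only ~3 OSDs can be auto-marked out before we stop
    36 * 0.98 = 35.28  -> marking out even a single OSD would cross the threshold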
also, mon_osd_down_out_subtree_limit = host for small clusters
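For reference, a minimal ceph.conf sketch of the settings proposed above for a small cluster (these are the values suggested in this thread, not the current shipped defaults):

    [mon]
        # proposed conservative values for small (e.g. 3-node) clusters
        mon osd min in ratio = 0.9
        mon osd down out subtree limit = host

They should also be changeable at runtime (e.g. via ceph tell mon.* injectargs) if someone wants to try the behavior before we touch the defaults.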
Any work done on this BZ upstream?
The current mon_osd_min_in_ratio and the improved recovery throttling defaults handle this well enough in Luminous.