Bug 1464958 - [RFE] Don't mark down OSDs out by default on small clusters
Status: NEW
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: RADOS
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: 3.*
Assigned To: Josh Durgin
Keywords: FutureFeature
Reported: 2017-06-26 05:56 EDT by John Spray
Modified: 2018-05-04 18:14 EDT
Type: Bug
Description John Spray 2017-06-26 05:56:43 EDT
For example, on a three-node cluster it is rarely useful or sensible to mark an OSD out if there's any chance at all that it could come back online.
Comment 4 Josh Durgin 2017-07-18 20:03:32 EDT
If it's a cluster with a replica count of 2, it would make sense to mark OSDs out; otherwise you'd be left with only 1 copy, perhaps until an admin manually intervenes. There are likely other nuances here...
Comment 5 Sage Weil 2017-07-25 16:47:27 EDT
We currently have two options that govern this:

1. mon_osd_min_in_ratio = .75

We will stop marking OSDs out once we drop to only 75% of existing OSDs IN.

2. mon_osd_down_out_subtree_limit = rack

If an entire $subtree (rack by default) is down, we won't mark any of the devices as out.
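How the two options interact can be sketched roughly as below. This is a simplified model for illustration only, not the monitor's actual code: the function name `may_mark_out`, its parameters, and the exact boundary behaviour (marking out is allowed as long as the resulting in-ratio does not fall below the threshold) are assumptions.

```python
from fractions import Fraction


def may_mark_out(total_osds, in_osds, min_in_ratio, whole_subtree_down=False):
    """Hypothetical model: may one more down OSD be marked out?

    total_osds         -- number of OSDs that exist in the cluster
    in_osds            -- number of OSDs currently IN
    min_in_ratio       -- value of mon_osd_min_in_ratio, as a string such as "0.75"
    whole_subtree_down -- True if an entire subtree at the level named by
                          mon_osd_down_out_subtree_limit (e.g. a rack) is down
    """
    # Option 2: an entire subtree is down, so none of its devices are marked out.
    if whole_subtree_down:
        return False
    # Option 1: refuse if marking this OSD out would drop the in-ratio
    # below the configured minimum (assumed boundary: equal is still allowed).
    ratio_after = Fraction(in_osds - 1, total_osds)
    return ratio_after >= Fraction(min_in_ratio)


if __name__ == "__main__":
    print(may_mark_out(36, 36, "0.75"))                           # healthy cluster
    print(may_mark_out(36, 27, "0.75"))                           # already at 75% IN
    print(may_mark_out(36, 36, "0.75", whole_subtree_down=True))  # whole rack down
```

Exact fractions are used instead of floats so that the boundary case (exactly 75% IN) is not affected by rounding.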


I think the question is what size cluster we are talking about, and whether we can just tweak these values to get the behavior we want for it.  Being a 3-node cluster doesn't tell us much without knowing how many OSDs there are per node, if we rely on the above options.  For a "normal" 3-node cluster with 12 disks per node, that's 36 OSDs, which would require a mon_osd_min_in_ratio of .98 to prevent any OSDs from being marked out (a single OSD out leaves 35/36, or about .97, IN).  I'm not sure that value makes sense for larger clusters.
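The arithmetic behind the .98 figure can be checked with a short sketch. The helper name `ratio_after_one_out` is hypothetical, and the sketch assumes that an OSD is not marked out when doing so would push the in-ratio below mon_osd_min_in_ratio.

```python
from fractions import Fraction


def ratio_after_one_out(total_osds):
    """Fraction of OSDs still IN after a single OSD is marked out."""
    return Fraction(total_osds - 1, total_osds)


nodes, disks_per_node = 3, 12
total = nodes * disks_per_node            # 36 OSDs
after = ratio_after_one_out(total)        # 35/36, roughly 0.972

# A threshold above 35/36 blocks even the first mark-out;
# a threshold at or below it lets at least one OSD be marked out.
for threshold in ("0.75", "0.9", "0.97", "0.98"):
    blocked = Fraction(threshold) > after
    print(threshold, "blocks every mark-out" if blocked else "allows at least one")
```

Under that assumption, any value strictly between 35/36 and 1 would do for 36 OSDs, and .98 is simply a round number in that range.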

On the other hand, even on a 3-node cluster, I think we *do* want to allow a few drives to be marked out (such that the data will rebuild on other drives in the same host).

My suggestion is to set mon_osd_min_in_ratio = .9 by default and be done with it.  I think that's a reasonable value for *large* clusters too... as soon as 10% of the infrastructure has failed we need operator help...
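To get a feel for what a given mon_osd_min_in_ratio means at different cluster sizes, the sketch below computes how many OSDs could be marked out before the limit stops further mark-outs. The function `max_osds_out` is hypothetical and assumes the in-ratio may reach, but not fall below, the configured value; the cluster sizes are arbitrary examples.

```python
import math
from fractions import Fraction


def max_osds_out(total_osds, min_in_ratio):
    """Largest number of OSDs that can be OUT while the in-ratio stays
    at or above min_in_ratio (given as a string such as "0.9")."""
    min_in = math.ceil(Fraction(min_in_ratio) * total_osds)
    return total_osds - min_in


for total in (12, 36, 1000):
    for ratio in ("0.75", "0.9"):
        print(f"{total} OSDs, ratio {ratio}: up to {max_osds_out(total, ratio)} out")
```

With 36 OSDs, for instance, this model allows up to 9 OSDs out at .75 and up to 3 at .9, so a few drives can still be marked out on a small cluster while a whole 12-disk host cannot.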

What do you think?
Comment 6 Sage Weil 2017-07-28 10:25:20 EDT
Also, mon_osd_down_out_subtree_limit = host for small clusters.
