Bug 1464958 - [RFE] Don't mark down OSDs out by default on small clusters
Summary: [RFE] Don't mark down OSDs out by default on small clusters
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: 3.*
Assignee: Josh Durgin
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-06-26 09:56 UTC by John Spray
Modified: 2022-02-21 18:06 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-01-28 16:16:01 UTC
Embargoed:



Description John Spray 2017-06-26 09:56:43 UTC
For example, on a three-node cluster it is rarely useful or sensible to mark an OSD out if there's any chance at all that it could come back online.

Comment 4 Josh Durgin 2017-07-19 00:03:32 UTC
If it's a replica-count-2 cluster it would make sense to mark OSDs out; otherwise you'd be left with only one copy, perhaps until an admin manually intervenes. There are likely other nuances here...

Comment 5 Sage Weil 2017-07-25 20:47:27 UTC
We currently have two options that govern this:

1. mon_osd_min_in_ratio = .75

We will stop marking OSDs out once we drop to only 75% of existing OSDs IN.

2. mon_osd_down_out_subtree_limit = rack

If an entire $subtree (rack by default) is down, we won't mark any of the devices as out.
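
For reference, a minimal ceph.conf sketch with these two options at their current defaults (the [mon] section placement is my assumption; option names and values are as listed above):

    [mon]
    # stop marking OSDs out once only 75% of existing OSDs remain IN
    mon osd min in ratio = .75
    # don't mark devices out when a whole subtree of this type is down
    mon osd down out subtree limit = rack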

--

I think the question is what size cluster we are talking about, and whether we can just tweak these values to get the behavior we want. Knowing it's a 3-node cluster doesn't tell us much without knowing how many OSDs per node... if we rely on the above options. For a "normal" 3-node cluster with 12 disks per node, that's 36 OSDs, which would require a mon_osd_min_in_ratio of .98 to prevent any OSDs from being marked out. I'm not sure that value makes sense for larger clusters.
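
To spell out the arithmetic behind the .98 figure (my working, not in the original comment): with all 36 OSDs IN, marking even one out leaves 35/36 ≈ 0.972 of the OSDs IN, which is below .98, so a ratio of .98 blocks every mark-out on a cluster of that size.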

On the other hand, even on a 3-node cluster, I think we *do* want to allow a few drives to be marked out (such that the data will rebuild on other drives in the same host).

My suggestion is to set mon_osd_min_in_ratio = .9 by default and be done with it.  I think that's a reasonable value for *large* clusters too... as soon as 10% of the infrastructure has failed, we need operator help...

What do you think?

Comment 6 Sage Weil 2017-07-28 14:25:20 UTC
Also, set mon_osd_down_out_subtree_limit = host for small clusters.
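
Putting comments 5 and 6 together, the suggested small-cluster settings would look something like this in ceph.conf (a sketch of the proposal, not shipped defaults):

    [mon]
    mon osd min in ratio = .9
    mon osd down out subtree limit = host

The same values could also be applied to running monitors, e.g. with ceph tell mon.* injectargs '--mon_osd_min_in_ratio 0.9'.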

Comment 7 Yaniv Kaul 2019-01-27 12:08:37 UTC
Any work done on this BZ upstream?

Comment 8 Josh Durgin 2019-01-28 16:16:01 UTC
The current mon_osd_min_in_ratio and the improved recovery throttling defaults handle this well enough in Luminous.

